possible to improve paging performance on the windows version of monetdb?

hi monetdb team, could disk paging performance be improved on windows? a query run without versus with any disk paging is kinda like the hulk versus plain old bruce banner. ;)

==background==

i'm working on a longer tutorial for http://asdfree.com/ about how to work with census microdata from 74 different countries (yes, 74!) with R and monetdb. i'm planning to release a blog post specifically for the brazilian microdata and then expand that code to work on the other 73 countries https://international.ipums.org/international-action/countries. both brazilian census bureau employees and outside users have taken a serious interest in this work, with official links from one http://www.ibge.gov.br/home/estatistica/populacao/trabalhoerendimento/pnad20..., two http://www.ibge.gov.br/home/estatistica/indicadores/trabalhoerendimento/pme_..., and three http://www.ibge.gov.br/home/xml/pof_2008_2009.shtm of their microdata pages.

r and monetdb are obviously great choices for analysts who don't have the resources for expensive computers and/or software. a former employee at insee (the french statistical bureau) who currently consults on international demography projects outlined this problem for me pretty clearly: lower-resource statistical agencies in the developing world need a good, free option like (R + monetdb) for analyzing larger data sets on the (generally cheaper) hardware they already have.

international census bureau employees and other demographers are much more likely to have statistical skills than computer programming skills, so as i work on these instructional documents, i'll aim to make the computing side of things as easy as possible. and, most central to this e-mail, much of this new potential audience uses windows. (on march 10th, 2014 http://cran-logs.rstudio.com/2014/2014-03-10.csv.gz, 67% of MonetDB.R package downloads were by windows users.)

==problem==

certain sql commands - it's hard for me to pin down which ones - tend to hang. i will run a command, look at my process manager, and see RAM start getting gobbled up (not by mserver.exe but by some invisible process -- disk paging, i think?), and then wait. mserver.exe will use between 0% and 1% of the total CPU for a few minutes, then something will "wake up" and suddenly the CPU lights up with 70%-90% usage across multiple cores. the CPU can bounce between 1% and 75% usage a few times before the full command finally finishes.

all of the time spent at 0% or 1% CPU usage - waiting for commands to execute - seems like an opportunity for improvement in the windows build?

here's an example that took almost twelve minutes for a query run on 20,000,000 records..
system.time( z <- svymean( ~v0601 , pes.design , byvar = ~v0001 , se = TRUE ) )
   user  system elapsed
  18.47    1.13  710.86
..totally awesome! but can it be improved?

==comparison between two machines==

the attached .sql code has been run on two windows machines.

on the windows server with 64 gb of ram:

# Serving database 'test', using 4 threads
# Compiled for x86_64-pc-winnt/64bit with 64bit OIDs dynamically linked
# Found 64.000 GiB available main-memory.

on my personal windows laptop:

# Serving database 'test', using 8 threads
# Compiled for x86_64-pc-winnt/64bit with 64bit OIDs dynamically linked
# Found 7.860 GiB available main-memory.

==processing time comparison==

processing time on the 64gb server: 27 tuples (5.2s)
processing time on my 8gb laptop: 27 tuples (7m 46s)

so on a windows machine, if disk paging can be avoided, this attached sql code runs about 90x faster (466 seconds versus 5.2 seconds)?

==how to load this reproducible example==

** data file with four columns **

https://drive.google.com/file/d/0B3LuuP081KDYNERWQWx5MEpZU2M/edit?usp=sharin...

** commands to load the data **

CREATE TABLE "sys"."d10" (
    "idkey"   INTEGER,
    "v0601"   VARCHAR(255),
    "pes_wgt" DOUBLE,
    "v0001"   VARCHAR(255)
);

COPY 20635472 RECORDS INTO d10 FROM 's:/temp/example.txt';

[[see .sql attachment for post-import commands]]

(an R version of this load is sketched after the diagnostics below.)

==general diagnostics==

# MonetDB 5 server v11.17.11 "Jan2014-SP1"
# Serving database 'test', using 4 threads
# Compiled for x86_64-pc-winnt/64bit with 64bit OIDs dynamically linked
# Found 64.000 GiB available main-memory.
# Copyright (c) 1993-July 2008 CWI.
# Copyright (c) August 2008-2014 MonetDB B.V., all rights reserved
# Visit http://www.monetdb.org/ for further information
# Listening for connection requests on mapi:monetdb://127.0.0.1:50000/
# MonetDB/JAQL module loaded
# MonetDB/SQL module loaded
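since my audience works from R rather than the mclient console, here's roughly how i load and query this example with MonetDB.R (a sketch: the dbname and port come from the startup banner above, and i'm using dbSendUpdate as the MonetDB.R helper for statements that return no result set):

library(MonetDB.R)

# connect to the running server (dbname/port taken from the banner above)
con <- dbConnect( MonetDB.R() , dbname = "test" , port = 50000 )

# create the table, then bulk-load the four-column text file
dbSendUpdate( con , 'CREATE TABLE "sys"."d10" (
    "idkey"   INTEGER,
    "v0601"   VARCHAR(255),
    "pes_wgt" DOUBLE,
    "v0001"   VARCHAR(255)
)' )

dbSendUpdate( con , "COPY 20635472 RECORDS INTO d10 FROM 's:/temp/example.txt'" )

# quick sanity check on the row count
dbGetQuery( con , "SELECT COUNT(*) FROM d10" )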
==conclusion question==

the windows version of monetdb is intensely slow at rather unexpected times. can and will anything be done about this in the near future?

thanks for monetdb!

if anyone has ideas on #3-5 below, i'd appreciate it. first, just following up with a few additional snags..
1) this screenshot i've uploaded
https://drive.google.com/file/d/0B3LuuP081KDYcFdOUkZkQjRvaHM/edit?usp=sharin...
shows one more example from some demonstrations i gave to a census consultant last week.. that server has 64gb of ram and still sometimes behaves like it's paging. :/ note that R (in the background) is frozen because it's waiting for monetdb to do something (there's plenty of RAM available but 0% of the cpus are being used). this issue is simply resolved by waiting.. but it's not ideal, and i don't think it can be blamed on overloaded memory.
2) i made this commit
https://github.com/ajdamico/usgsd/commit/79eb7142e876270e86e1c40ae89e588beba...
to my code to import the brazilian census. on my 8gb laptop, the code previously crashed mserver; now everything works. as you can see from the additions, the only thing that's different is a few extra server shutdowns and restarts. this was a pretty painless workaround, but it's weird to me that something as silly as a server restart would prevent a crash. the sketch below shows roughly what the additions do.
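(the function names here are the server start/stop helpers i use elsewhere in my usgsd scripts; treat the exact arguments as approximations of what the commit does:)

# ..run one chunk of the import, then bounce the server..
dbDisconnect( con )
monetdb.server.stop( pid )

# a fresh mserver process hands all of its memory back to windows
# ( 'batfile' is the startup script created during the initial setup )
pid <- monetdb.server.start( batfile )
con <- dbConnect( MonetDB.R() , dbname = "test" , port = 50000 )

# ..then run the next chunk of the import..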
..i hit issues like these all the time, and when i do, i generally feel like the windows build could use more scrutiny from people who know what should happen. if you could incorporate more windows performance reviews on machines with 16GB of RAM or less, i'd be grateful. sometimes monetdb just gets finicky, but that's hard for me to reproduce for your team.
===========
i also think it might be wise to update this guide
http://www.asdfree.com/2013/03/column-store-r-or-how-i-learned-to-stop.html
with the answers to a few additional questions:
3) at times, the monetdb-backed data sets i have written about already
https://github.com/ajdamico/usgsd/search?q=monetdb.r&ref=cmdform
do end up with disk paging in certain circumstances. my instructions to users have generally been "leave your computer overnight and let it run", but that only works for download/import/cleaning steps, not during an interactive data analysis session. can any of you give me some straightforward guidelines about "how to avoid disk paging" that would be understandable to people who may not have much experience with SQL? the memory footprint description https://www.monetdb.org/Documentation/MemoryClaim is helpful for more advanced users, but for people who have never used a database before, are there any good rules of thumb like "never go over ten columns x one million records per gigabyte of RAM on your machine"? more concrete pointers might help new users understand why some commands run very quickly and others clog the computer. i am looking for something like "how to avoid disk paging for dummies" :) the sketch below shows the kind of arithmetic i mean.
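(a back-of-the-envelope calculation i'd want to be able to teach -- i'm assuming 8 bytes per VARCHAR value for the offsets into monetdb's string heap, so treat this as a rough guess rather than the real accounting:)

rows <- 20635472    # the d10 table from my previous e-mail

# bytes per row for the columns the example query touches:
# pes_wgt (8-byte DOUBLE) plus v0601 and v0001 (VARCHARs stored as
# offsets into a string heap -- assumed 8 bytes apiece)
bytes.per.row <- 8 + 8 + 8

( rows * bytes.per.row ) / 2^30
# ..roughly 0.46 GiB for the base columns alone, before any
# intermediate results -- so an 8gb laptop loses its headroom as
# soon as a query materializes a few intermediates of similar size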
4) in addition to sheer data table size, are there specific commands that tend to hog memory and should be avoided or re-written gracefully to conserve computing resources? as a starting example (but hopefully there are others as well), could the SQL code from my previous e-mail (the code that was very slow on my 8gb laptop) be re-written in a more efficient manner? commands like the ones attached to my previous e-mail are very, very common in the sqlsurvey package's R code
http://sqlsurvey.r-forge.r-project.org/
so if the monetdb team can give me some re-write advice, i would be able to comb through dr. lumley's code and make some recommended revisions for him to improve the whole system. the sketch below shows the kind of re-write i'm imagining.
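(this is my guess -- ignoring the standard errors -- at the grouped weighted proportion that svymean( ~v0601 , pes.design , byvar = ~v0001 ) works out to, pushed into a single grouped query so only a few dozen rows ever leave the database:)

library(MonetDB.R)
con <- dbConnect( MonetDB.R() , dbname = "test" , port = 50000 )

# weighted count per (v0001, v0601) cell; only a handful of rows come
# back to R, so finishing the arithmetic there costs nothing
cells <- dbGetQuery( con ,
    "SELECT v0001 , v0601 , SUM( pes_wgt ) AS wsum
       FROM d10
      GROUP BY v0001 , v0601" )

# weighted share of each v0601 category within each v0001 group
cells$share <- cells$wsum / ave( cells$wsum , cells$v0001 , FUN = sum )

would a single pass like this dodge the big intermediates, or does monetdb already plan my original code this way?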
5) can someone just confirm that "how to detect when your disk is paging" is pretty straightforward on microsoft windows: (a) open task manager, (b) click the "performance" tab, (c) look for zero CPU usage alongside overloaded RAM usage, and optionally (d) click on "resource monitor" and note the heavy movement under "disk activity". ..i think that's it, but maybe there are other things to look for? one non-GUI check is sketched below.
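(a numeric check that can be run from inside an R session -- typeperf ships with windows, and i believe "\Memory\Pages/sec" is the standard counter for pages moving to and from disk:)

# sample the paging counter ten times, one second apart; sustained
# readings in the hundreds or thousands while mserver sits at 0-1%
# CPU would confirm the disk-paging theory
system( 'typeperf "\\Memory\\Pages/sec" -sc 10' )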
my apologies for being a pita :/ and thanks again for your work!! i still get excited when i see a giant, complicated query finish in a few seconds.