if anyone has ideas on #3-5 below, i'd appreciate it.


first, just following up on a few additional snags:


1) the screenshot i've uploaded shows one more example, from some demonstrations i gave to a census consultant last week.  that server has 64gb of ram and still sometimes behaves like it's paging :/  note that R (in the background) is frozen because it's waiting for monetdb to do something (there's plenty of RAM available but 0% of the cpus are being used).  the issue resolves itself if you just wait.. but that's not ideal, and i don't think it can be blamed on overloaded memory.


2) i made this commit to my code that imports the brazilian census.  on my 8gb laptop, the code previously crashed mserver; now everything works.  as you can see from the additions, the only difference is a few extra server shutdowns and restarts.  it was a pretty painless workaround, but it's strange to me that something as simple as a server restart would prevent a crash.


..i hit issues like these all the time.  when i do, i generally feel like the windows build could use more scrutiny from people who know what should happen.  if you could incorporate more windows performance reviews on machines with 16gb of ram or less, i'd be grateful.  sometimes monetdb just gets finicky, and that's hard for me to reproduce for your team.


===========


i also think it might be wise to update this guide with the answers to a few additional questions:


3) the monetdb-backed data sets i have already written about still end up disk paging in certain circumstances.  my instruction to users has generally been "leave your computer on overnight and let it run," but that only works for download/import/cleaning steps, not during an interactive data analysis session.  can any of you give me some straightforward guidelines on how to avoid disk paging that would be understandable to people without much SQL experience?  the memory footprint description is helpful for more advanced users, but for people who have never used a database before, are there good rules of thumb like "never go over ten columns x one million records per gigabyte of RAM on your machine"?  more concrete pointers might help new users understand why some commands run very quickly while others clog the computer.  i am looking for something like "how to avoid disk paging for dummies"  :)
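for what it's worth, here's the kind of back-of-the-envelope arithmetic i have in mind.  the per-value byte counts are my own guesses (not official monetdb figures), so please correct them:

```python
# rough sketch of estimating a table's in-memory footprint.  the bytes-per-value
# numbers are guesses, not official monetdb figures: INTEGER ~4 bytes,
# DOUBLE ~8 bytes, a short categorical VARCHAR ~8 bytes once dictionary-encoded.

BYTES_PER_VALUE = {"integer": 4, "double": 8, "varchar": 8}

def estimated_footprint_gb(n_rows, column_types):
    """estimate the RAM needed to hold every column of a table at once."""
    row_bytes = sum(BYTES_PER_VALUE[t] for t in column_types)
    return (row_bytes * n_rows) / 1024**3

# the four-column, 20.6-million-record example table from the quoted e-mail below:
cols = ["integer", "varchar", "double", "varchar"]
print(round(estimated_footprint_gb(20_635_472, cols), 2))  # about 0.54 GiB
```

if an estimate like that says the working set is well under available RAM, is paging coming from intermediate results instead?  that's exactly the part i can't explain to beginners.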



4) beyond sheer data table size, are there specific commands that tend to hog memory and should be avoided or rewritten to conserve computing resources?  as a starting example (though hopefully there are others), could the SQL code from my previous e-mail (the code that was very slow on my 8gb laptop) be rewritten more efficiently?  commands like the ones attached to my previous e-mail are very common in the sqlsurvey package's R code, so if the monetdb team can give me some rewrite advice, i can comb through dr. lumley's code and recommend revisions to improve the whole system.
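to show the general shape of rewrite i'm asking about, here's a tiny illustration using sqlite3 from python's standard library (just because it runs anywhere -- this is not monetdb, and the table/column names mimic my example table rather than the attached .sql file):

```python
# sketch of one classic memory/time-saving rewrite: replace a correlated
# subquery (re-scans the table once per group) with a single grouped
# aggregation (one pass).  uses sqlite3 purely for illustration; whether
# the attached .sql file contains this exact pattern is an open question.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE d10 (idkey INTEGER, v0601 TEXT, pes_wgt REAL, v0001 TEXT);
    INSERT INTO d10 VALUES
        (1,'a',1.5,'x'), (2,'b',2.0,'x'), (3,'a',0.5,'y'), (4,'b',1.0,'y');
""")

# version 1: correlated subquery -- the inner SELECT runs once per group
slow = con.execute("""
    SELECT v0001,
           (SELECT SUM(pes_wgt) FROM d10 AS i WHERE i.v0001 = o.v0001)
    FROM d10 AS o GROUP BY v0001 ORDER BY v0001
""").fetchall()

# version 2: plain grouped aggregation -- same answer, one scan
fast = con.execute("""
    SELECT v0001, SUM(pes_wgt) FROM d10 GROUP BY v0001 ORDER BY v0001
""").fetchall()

print(slow == fast)  # True: identical results, very different work
```

if there's a catalog of rewrites like this that matter specifically for monetdb's execution model, that's the document i'd love to pass along to dr. lumley.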



5) can someone confirm that detecting disk paging is this straightforward on microsoft windows: (a) open task manager, (b) click the "performance" tab, (c) look for near-zero CPU usage alongside maxed-out RAM usage, and optionally (d) click "resource monitor" and note the heavy movement under "disk activity"?  ..i think that's it, but maybe there are other things to look for?



my apologies for being a pita   :/  and thanks again for your work!!  i still get excited when i see a giant, complicated query finish in a few seconds.





On Tue, Mar 11, 2014 at 1:58 PM, Anthony Damico <ajdamico@gmail.com> wrote:
hi monetdb team, could disk paging performance be improved on windows?  a query run without versus with any disk paging is kinda like the hulk versus plain old bruce banner.  ;)



==background==

i'm working on a longer tutorial for http://asdfree.com/ about how to work with census microdata from 74 different countries (yes, 74!) using R and monetdb.  i'm planning to release a blog post specifically for the brazilian microdata, then expand that code to cover the other 73 countries.  both brazilian census bureau employees and outside users have taken a serious interest in this work, with official links from three of their microdata pages.

r and monetdb are obviously great choices for analysts who don't have the resources for expensive computers and/or software.  a former insee (the french statistical bureau) employee who now consults on international demography projects outlined the problem for me pretty clearly: lower-resource statistical agencies in the developing world need a good, free option like R + monetdb for analyzing larger data sets on the (generally cheaper) hardware they already have.

international census bureau employees and other demographers are much more likely to have statistical skills than computer programming skills, so as i work on these instructional documents i'll aim to make the computing side of things as easy as possible.  and, most central to this e-mail, much of this new potential audience uses windows (on march 10th, 2014, 67% of MonetDB.R package downloads were by windows users).



==problem==

certain sql commands - it's hard for me to pin down which ones - tend to hang.  i will run a command, watch task manager as RAM starts getting gobbled up (not by mserver.exe but by some invisible process.. disk paging, i think?), and then wait.  mserver.exe will use between 0% and 1% of the total CPU for a few minutes, then something "wakes up" and the CPU suddenly lights up with 70%-90% usage across multiple cores.  it can bounce between 1% and 75% CPU usage a few times before the full command finally finishes.

all of the time spent with 0% or 1% CPU usage - waiting for commands to execute - seems like an opportunity for improvement of the windows build?

here's an example where a query on 20,000,000 records took nearly 12 minutes..

> system.time( z <- svymean( ~v0601 , pes.design , byvar = ~v0001 , se = TRUE ) )
   user  system elapsed
  18.47    1.13  710.86

..totally awesome!  but can it be improved?



==comparison between two machines==

the attached .sql code has been run on two windows machines:


on the windows server with 64 gb of ram--

# Serving database 'test', using 4 threads
# Compiled for x86_64-pc-winnt/64bit with 64bit OIDs dynamically linked
# Found 64.000 GiB available main-memory.

on my personal windows laptop--

# Serving database 'test', using 8 threads
# Compiled for x86_64-pc-winnt/64bit with 64bit OIDs dynamically linked
# Found 7.860 GiB available main-memory.



==processing time comparison==

processing time on the 64gb server: 27 tuples (5.2s)

processing time on my 8gb laptop: 27 tuples (7m 46s)


so on a windows machine, if disk paging can be avoided, the processing time is 90x faster for this attached sql code?
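(the "90x" figure is just the ratio of the two elapsed times:)

```python
# the speedup claim above is simply elapsed-time-on-laptop / elapsed-time-on-server
laptop_seconds = 7 * 60 + 46   # 7m 46s on the 8gb laptop
server_seconds = 5.2           # 5.2s on the 64gb server
print(round(laptop_seconds / server_seconds))  # prints 90
```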



==how to load this reproducible example==

** data file with four columns **
https://drive.google.com/file/d/0B3LuuP081KDYNERWQWx5MEpZU2M/edit?usp=sharing


** commands to load the data **
CREATE TABLE "sys"."d10" (
        "idkey"   INTEGER,
        "v0601"   VARCHAR(255),
        "pes_wgt" DOUBLE,
        "v0001"   VARCHAR(255)
);

COPY 20635472 RECORDS INTO d10 FROM 's:/temp/example.txt';

[[see .sql attachment for post-import commands]]



==general diagnostics==

# MonetDB 5 server v11.17.11 "Jan2014-SP1"
# Serving database 'test', using 4 threads
# Compiled for x86_64-pc-winnt/64bit with 64bit OIDs dynamically linked
# Found 64.000 GiB available main-memory.
# Copyright (c) 1993-July 2008 CWI.
# Copyright (c) August 2008-2014 MonetDB B.V., all rights reserved
# Visit http://www.monetdb.org/ for further information
# Listening for connection requests on mapi:monetdb://127.0.0.1:50000/
# MonetDB/JAQL module loaded
# MonetDB/SQL module loaded
>



==conclusion question==

the windows version of monetdb is intensely slow at rather unexpected times.  can and will anything be done about this in the near future?


thanks for monetdb!