Hello all,

For the purpose of evaluating MonetDB, I am trying to do the following:

- load two CSV files from disk A into two tables,
- perform a JOIN on two fields, using disk B for the database,
- write the result to a CSV file on disk A.

This results in very high usage of disk B and a subsequent MT_mremap failure, as seen in merovingian.log:

2015-04-08 12:50:14 ERR google-trace[2737]: = gdk_posix.c:428: MT_mremap(./bat/06/675.tail,7e8738e90000,91992817664,110391328768): GDKextendf() failed

Does somebody have ideas or explanations? Details follow.

I am loading CSV data from hard drive A. The files have the following sizes:

-rw-r--r-- 1 fre fre 2.6G Apr 5 16:25 task_events_cut.csv
-rw-r--r-- 1 fre fre 34G Apr 5 16:04 task_usage_cut.csv

I create a database on a separate hard drive B (2 TB):

rm -rf /mnt/diskB/mdb
mkdir /mnt/diskB/mdb/
monetdbd create /mnt/diskB/mdb
monetdbd start /mnt/diskB/mdb
monetdb create google-trace
monetdb release google-trace

I then load the data into the database:

mclient ct_trace_events_reduced.sql -d google-trace
pv /mnt/diskA/task_events_cut.csv | mclient -d google-trace -s "COPY INTO task_events_reduced FROM STDIN USING DELIMITERS ',','\\n'" -
mclient ct_trace_usage_reduced.sql -d google-trace
pv /mnt/diskA/task_usage_cut.csv | mclient -d google-trace -s "COPY INTO task_usage_reduced FROM STDIN USING DELIMITERS ',','\\n'" -

using very standard scripts:

$ cat ct_trace_events_reduced.sql ct_trace_usage_reduced.sql
DROP TABLE task_events_reduced;
CREATE TABLE task_events_reduced (
  "job_id"   BIGINT,
  "task_id"  BIGINT,
  "class"    SMALLINT,
  "priority" SMALLINT
);
DROP TABLE task_usage_reduced;
CREATE TABLE task_usage_reduced (
  "job_id"     BIGINT,
  "task_id"    BIGINT,
  "cpu_mean"   FLOAT,
  "cpu_sample" FLOAT
);

These two operations take about 50 minutes, which is very reasonable.

I then use mclient to perform the join:

mclient join.sql -d google-trace

using the script:

$ cat join.sql
COPY (
  SELECT te.job_id, te.task_id, te.class, te.priority,
         tu.cpu_mean, tu.cpu_sample
  FROM (SELECT * FROM task_events_reduced) AS te
  RIGHT JOIN (SELECT * FROM task_usage_reduced) AS tu
    ON (te.job_id = tu.job_id AND te.task_id = tu.task_id)
) INTO '/diskA/join.csv' USING DELIMITERS ',','\n';

This results in more than three hours of data crunching on a Google Compute Engine machine (16 processors, 100 GB RAM), during which disk B fills up steadily until it is completely full (2 TB HDD). At that point the aforementioned error occurs.

I am not expecting MonetDB to perform streaming I/O on the right-hand table of the join, but the disk usage still seems quite high. Is there a way to force MonetDB to do a hash join?

Thanks a lot,
Valentin Reis
University of Warsaw
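
P.S. If it is useful, I can also post the query plan; I assume it can be obtained by prefixing the inner SELECT with EXPLAIN, roughly as sketched below (this is only a sketch using the table names from above; I have not verified that this is the right way to see which join algorithm is chosen):

# sketch, under the assumption that prefixing the SELECT with EXPLAIN prints the generated plan
mclient -d google-trace -s "EXPLAIN SELECT te.job_id, te.task_id, te.class, te.priority, tu.cpu_mean, tu.cpu_sample FROM task_events_reduced AS te RIGHT JOIN task_usage_reduced AS tu ON te.job_id = tu.job_id AND te.task_id = tu.task_id"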