Hello,

@Niels: I ran the same query using an INNER JOIN. MonetDB shows the same behavior: disk I/O until the 2 TB are used, then a crash in the same function (observed in merovingian.log).

@Martin: Is the --algorithms option supposed to be used with monetdb or monetdbd? I can't seem to find it in the manpage.

@Stefan:
> task_usage_reduced (task_usage_cut.csv) is your larger table?
> task_events_reduced (task_events_cut.csv) is your smaller table?

Correct.
> How many rows does each table have?

select count(*) from (select job_id, task_id from task_events_reduced group by job_id, task_id) as x;
25424731

select count(*) from task_events_reduced;
144648288

select count(*) from (select job_id, task_id from task_usage_reduced group by job_id, task_id) as x;
24281242

select count(*) from task_usage_reduced;
1232799308

So the distinct (job_id, task_id) counts are smaller than the row counts: the pair is not unique in either table.
> Could you share the "PLAN" and "EXPLAIN" of your query? (Well, of the inner part without "COPY ... INTO ...")

I ran PLAN and EXPLAIN on the following query:

(SELECT count(*)
 FROM (SELECT job_id, task_id FROM task_events_reduced) AS te
 RIGHT JOIN (SELECT job_id, task_id FROM task_usage_reduced) AS tu
 ON (te.job_id = tu.job_id AND te.task_id = tu.task_id)
);

The results are attached.
> Could you try to run
> (SELECT count(*)
>  FROM (SELECT job_id, task_id FROM task_events_reduced) AS te
>  RIGHT JOIN (SELECT job_id, task_id FROM task_usage_reduced) AS tu
>  ON (te.job_id = tu.job_id AND te.task_id = tu.task_id)
> );

The query does not finish. Same behavior as always: the disk containing the mdb database fills up until it is full; then the aforementioned error happens.
To my understanding of RIGHT JOIN, it should have exactly the same result as:

SELECT count(*) FROM task_usage_reduced;

Shouldn't it? Please do correct me if I am wrong on this one :) It is surprising to us: shouldn't the optimizer reduce this to the much simpler query?
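(A toy Python sketch of my understanding, with made-up keys: the two counts coincide only when each key on the left side is unique, which may be the catch here.)

    from collections import Counter

    left  = [(1, 1), (1, 1), (2, 1)]   # (job_id, task_id); key (1, 1) duplicated
    right = [(1, 1), (2, 1), (3, 9)]

    matches = Counter(left)
    # RIGHT JOIN keeps every right-side row; each one contributes
    # max(1, number of matching left rows) output rows
    out_rows = sum(max(1, matches[r]) for r in right)
    print(out_rows, len(right))        # 4 3 -> equal only without duplicate left keys

Thanks all for the very responsive support,
Valentin & Krzysztof.

Quoting Stefan Manegold (2015-04-08 22:08:08)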
Valentin,
Some more questions for clarification:
task_usage_reduced (task_usage_cut.csv) is your larger table?
task_events_reduced (task_events_cut.csv) is your smaller table?
How many rows does each table have?
Is (job_id, task_id) unique in both tables (or at least in the smaller table)? I.e., is

select count(*) from (select job_id, task_id from task_events_reduced group by job_id, task_id) as x;

equal to

select count(*) from task_events_reduced;

and

select count(*) from (select job_id, task_id from task_usage_reduced group by job_id, task_id) as x;

equal to

select count(*) from task_usage_reduced;

?
Could you share the "PLAN" and "EXPLAIN" of your query? (Well, of the inner part without "COPY ... INTO ...")
Could you try to run
(SELECT count(*)
 FROM (SELECT job_id, task_id FROM task_events_reduced) AS te
 RIGHT JOIN (SELECT job_id, task_id FROM task_usage_reduced) AS tu
 ON (te.job_id = tu.job_id AND te.task_id = tu.task_id)
);
and report behaviour and result?
Thanks! Stefan
----- Original Message -----
Quoting Niels Nes (2015-04-08 19:16:09)
On Wed, Apr 08, 2015 at 04:01:29PM +0200, Valentin Reis wrote:
Hello all,

For the purpose of evaluating MonetDB, I am trying to perform the following:
- load two CSV files into two tables, from disk A;
- perform a JOIN operation on two fields, using disk B for the database;
- pipe the result to a CSV file on disk A.
This results in very high usage of disk B and a subsequent MT_mremap failure, as seen in merovingian.log:

2015-04-08 12:50:14 ERR google-trace[2737]: = gdk_posix.c:428: MT_mremap(./bat/06/675.tail,7e8738e90000,91992817664,110391328768): GDKextendf() failed
Does somebody have ideas or explanations? Details follow:
I am loading CSV data from hard drive A. The files have the following sizes:

-rw-r--r-- 1 fre fre 2.6G Apr 5 16:25 task_events_cut.csv
-rw-r--r-- 1 fre fre 34G Apr 5 16:04 task_usage_cut.csv
I create a database in a separate hard drive B (2TB):
rm -rf /mnt/diskB/mdb
mkdir /mnt/diskB/mdb/
monetdbd create /mnt/diskB/mdb
monetdbd start /mnt/diskB/mdb
monetdb create google-trace
monetdb release google-trace
I then load the data into the database:
mclient ct_trace_events_reduced.sql -d google-trace
pv /mnt/diskA/task_events_cut.csv | mclient -d google-trace -s "COPY INTO task_events_reduced FROM STDIN USING DELIMITERS ',','\\n'" -
mclient ct_trace_usage_reduced.sql -d google-trace
pv /mnt/diskA/task_usage_cut.csv | mclient -d google-trace -s "COPY INTO task_usage_reduced FROM STDIN USING DELIMITERS ',','\\n'" -
using very standard scripts:
$ cat ct_trace_events_reduced.sql ct_trace_usage_reduced.sql

DROP TABLE task_events_reduced;
CREATE TABLE task_events_reduced (
    "job_id"   BIGINT,
    "task_id"  BIGINT,
    "class"    SMALLINT,
    "priority" SMALLINT
);

DROP TABLE task_usage_reduced;
CREATE TABLE task_usage_reduced (
    "job_id"     BIGINT,
    "task_id"    BIGINT,
    "cpu_mean"   FLOAT,
    "cpu_sample" FLOAT
);
These two operations take about 50 minutes, which is very reasonable.
I then use mclient to do my join:
mclient join.sql -d google-trace
using the script:
$ cat join.sql

COPY (
  SELECT te.job_id, te.task_id, te.class, te.priority, tu.cpu_mean, tu.cpu_sample
  FROM (SELECT * FROM task_events_reduced) AS te
  RIGHT JOIN (SELECT * FROM task_usage_reduced) AS tu
  ON (te.job_id = tu.job_id AND te.task_id = tu.task_id)
) INTO '/diskA/join.csv' USING DELIMITERS ',','\n';

Why are you using a RIGHT 'OUTER' join? Outer joins are much more expensive than inner joins.
Niels

Hello,

Denote: A = task_usage_reduced, B = task_events_reduced.
Essentially, the goal is to augment the rows of A (the big table) with some information (class and priority) from B (the smaller table).
I am not sure whether B contains information for all (job_id, task_id) entries of A. In cases where a row of A has no entry in B, I would like that row to be present in the output, with NULLs instead of the information from B. Therefore: RIGHT JOIN. Please do correct me again if I am mistaken; the SQL class was a long time ago :)
What I hoped from MonetDB (and what I still hope to get :) ) is the following behavior, denoting by n the length of A and by m the length of B:

- build a hash table of B indexed by (job_id, task_id): O(m)
- augment each row of A by an O(1) lookup in that hash table: O(n)

Overall: O(m+n) processing time with O(m+n) memory usage (hash table + result).
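For concreteness, a minimal Python sketch of the build/probe strategy I have in mind (toy row layouts and made-up names; this is not MonetDB's actual implementation):

    from collections import defaultdict

    def hash_right_join(a_rows, b_rows):
        # build: hash the smaller table B on (job_id, task_id) -- O(m)
        index = defaultdict(list)
        for job_id, task_id, cls, priority in b_rows:
            index[(job_id, task_id)].append((cls, priority))
        # probe: one expected-O(1) lookup per row of the larger table A -- O(n)
        for job_id, task_id, cpu_mean, cpu_sample in a_rows:
            # unmatched rows of A are kept, padded with NULLs (RIGHT JOIN semantics)
            for cls, priority in index.get((job_id, task_id), [(None, None)]):
                yield (job_id, task_id, cls, priority, cpu_mean, cpu_sample)

(Note that if a key occurs k times in B, every matching row of A is emitted k times, so the output can be much larger than n when keys are duplicated.)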
I am nonetheless looking at the performance of the inner join now.
Cheers, Valentin
This results in more than three hours of data crunching on a Google Compute Engine machine (16 processors and 100 GB RAM), during which disk B fills up until it is full (2 TB HDD). Then, the aforementioned error happens.
I am not expecting MonetDB to perform streaming I/O on the right input of the join. However, the disk usage seems quite high. Is there a way to force MonetDB to do a hash join?
Thanks a lot,
Valentin Reis
University of Warsaw
--
Niels Nes, Manager ITF, Centrum Wiskunde & Informatica (CWI)
Science Park 123, 1098 XG Amsterdam, The Netherlands
room L3.14, phone ++31 20 592-4098, sip:4098@sip.cwi.nl
url: https://www.cwi.nl/people/niels  e-mail: Niels.Nes@cwi.nl
--
| Stefan.Manegold@CWI.nl | DB Architectures (DA)   |
| www.CWI.nl/~manegold/  | Science Park 123 (L321) |
| +31 (0)20 592-4212     | 1098 XG Amsterdam (NL)  |
_______________________________________________
users-list mailing list
users-list@monetdb.org
https://www.monetdb.org/mailman/listinfo/users-list