Hi Viktor, unfortunately, we did not manage to dive deeper into these problems, yet. I hope to find some time by end of next week --- too much travelling and other obligations in between. Wrt. to the I/O, I did locate and fix some issues in the Dec2011 branch; cf., http://dev.monetdb.org/hg/MonetDB/rev/dc8066d67faa http://dev.monetdb.org/hg/MonetDB/rev/bb56535aa6df http://dev.monetdb.org/hg/MonetDB/rev/d30146b4d610 http://dev.monetdb.org/hg/MonetDB/rev/cf2ff66d08fa It would be great, if you could test your scenario(s) with the latest Dec2011 branch, either from HG or from http://monetdb.cwi.nl/testweb/web/42751:3ab84718f7e1/ or http://monetdb.cwi.nl/testweb/web/MonetDB-Dec2011-latest.tar.bz2 and report. (Dec2011 has not yet been released, yet, but we hope to do so still this month.) Please note that efficient handling of "BETWEEN SYMMETRIC" has not been implemented, yet; cf., http://bugs.monetdb.org/show_bug.cgi?id=2945 regards, Stefan On Wed, Jan 11, 2012 at 11:19:25AM +0100, Viktor Rosenfeld wrote:
Hi Stephan,
has there been any progress on this problem? Or is there something I can do to move it along?
Also, I still don't fully understand why the data is written to disk at all especially since it would easily fit into main memory. Is it because BATs are memory-mapped to files?
Cheers, Viktor
Stefan Manegold wrote:
Hi Viktor,
thanks for the detailed report. I'll read it later once I find the time; might be only end of next week, though ...
With "point predicates" I refer to predicates the check for equality with a scalar value: column-expression = scalar-value, i.e., predicates that can benefit from a hash index.
And my first guess with your Q8 and the inequality (self-?)join(?) is that this creates almost a Cartesian product of the one/two inputs, with large inputs, this can indeed become hugh ...
More hopefully soon, Stefan
On Fri, Dec 16, 2011 at 04:06:58PM +0100, Viktor Rosenfeld wrote:
Hi,
Stefan Manegold wrote:
Ah, I now releaize that you have point predicates in you query/ies, right? In that case, mitosis does not trigger since we want to (build and) exploit hash indices in the base column(s).
What do you mean by point predicates?
Briefly looking at your plans, I notice that they contain BAT iterators in MAL; that's why the traces blow up. I don't have the time right now to explain what that means (the "experts will know" ;-)). It appears that they are triggered by the "between SYMMETRIC" construct. I filed a bug report / feature request abut this: http://bugs.monetdb.org/show_bug.cgi?id=2945 For now, if you would be able to omit the "SYMMETRIC", I expect at least the traces not to blow up. I cannot tell, whether also the memory usage / IO explosion would be avoided by this ...
I have done a more detailed analysis with the 3 problematic queries. (They are called 7 through 9 in my benchmark set and I kept that numbering for simplicity.)
I've uploaded the original SQL queries:
http://www.informatik.hu-berlin.de/~rosenfel/monet/q7-count.sql http://www.informatik.hu-berlin.de/~rosenfel/monet/q8-count.sql http://www.informatik.hu-berlin.de/~rosenfel/monet/q9-count.sql
All three queries fill up the hard disk on the large dataset. Only queries 7 and 9 use a BETWEEN SYMMETRIC predicate. Query 8 uses an inequality predicate on node.id (see below).
The table below shows their evaluation time and disk I/O behavior on the small dataset.
| Query | Time (ms) | Data written | |-------+-----------+--------------| | 7 | 962 | 6M | | 8 | 203 | 92k | | 9 | 9646 | 100M |
As per Stefan's suggestion, I removed the SYMMETRIC predicate. I also tried 2 other strategies:
- rewriting the BETWEEN predicate with an equivalent >= and <= comparison and - substituting the BETWEEN predicate with a single < comparison.
(The last strategy is not semantically equivalent, but nevertheless correct for my application for these queries. However, I use the BETWEEN predicate in other places where this option is not available.)
The next table shows the evaluation times (in ms) on the small dataset for query 7 and 9 using these 4 strategies:
| Query | BETWEEN SYMMETRIC | BETWEEN | >= AND <= | < | |-------+-------------------+---------+-----------+-----| | 7 | 1002 | 342 | 335 | 66 | | 9 | 9908 | 6040 | 5990 | 185 |
The next table shows the disk I/O behavior of query 9 using these 4 strategies on the small dataset.
| Strategy | Read (MB) | Write (MB) | |-------------------+-----------+------------| | BETWEEN SYMMETRIC | 0 | 100 | | BETWEEN | 0 | 84 | | >= AND <= | 0 | 84 | | < | 0 | 0 |
Using BETWEEN and using >= and <= has the same runtime behavior. Indeed, the MAL plans are identical. I have uploaded the MAL plans for query 9 for all 4 strategies.
http://www.informatik.hu-berlin.de/~rosenfel/monet/q9-with-mitosis.plan (BETWEEN SYMMETRIC) http://www.informatik.hu-berlin.de/~rosenfel/monet/q9-no-symmetric.plan (BETWEEN) http://www.informatik.hu-berlin.de/~rosenfel/monet/q9-workaround.plan (>= AND <=) http://www.informatik.hu-berlin.de/~rosenfel/monet/q9-no-precedence-restrict... (<)
Without the SYMMETRIC predicate the trace do not explode any longer. I have uploaded these for query 9.
http://www.informatik.hu-berlin.de/~rosenfel/monet/q9-no-symmetric.trace (BETWEEN) http://www.informatik.hu-berlin.de/~rosenfel/monet/q9-workaround.trace (>= AND <=) http://www.informatik.hu-berlin.de/~rosenfel/monet/q9-no-precedence-restrict... (<)
Out of curiosity, I also generated plans for a much simpler query. Starting from the query below, I generated MAL plans for all 4 strategies:
SELECT count(*) FROM ( SELECT node1.id AS id1, node2.id AS id2, node1.toplevel_corpus FROM node AS node1, node AS node2 WHERE node1.right_token BETWEEN SYMMETRIC node2.left_token - 1 AND node2.left_token - 50 AND node1.text_ref = node2.text_ref ) AS solutions;
http://www.informatik.hu-berlin.de/~rosenfel/monet/between-symmetric.plan (BETWEEN SYMMETRIC) http://www.informatik.hu-berlin.de/~rosenfel/monet/between-no-symmetric.plan (BETWEEN) http://www.informatik.hu-berlin.de/~rosenfel/monet/between-workaround.plan (>= AND <=) http://www.informatik.hu-berlin.de/~rosenfel/monet/no-precedence-restriction... (<)
Interestingly, the BETWEEN variant and the >= AND <= variant do not produce the same MAL plan for the simple query as they do for query 9.
Without the SYMMETRIC predicate, queries 7 and 9 still fill up my disk when I evaluate them on the large dataset. However, using only the < comparison they finish. The next table shows their runtime and disk I/O behavior for the < strategy.
| Query | Time (ms) | Read (MB) | Written (MB) | |-------+-----------+-----------+--------------| | 7 | 990 | 0 | 3 | | 9 | 56903 | 266 | 1621 |
There was considerable variation in the evaluation times, which I usually do not observe for other queries.
- Query 7 variation: 685 - 1833 - Query 9 variation: 48253 - 64511
There was also some variation in the amount of disk I/O for query 9:
- Query 9 reads: 257, 296, 254, 237, 285 - Query 9 writes: 1736, 1545, 1545, 1736, 1545
(However, due to the long runtime, there might have been competing I/O from other processes.)
Query 8 does not use a BETWEEN SYMMETRIC predicate, but does a few comparisons of the type nodeX.id <> nodeY.id where nodeX and nodeY are aliases for the node table. This seems to be the reason for filling up the disk. If I remove these predicates, query 8 finishes on the large dataset, although it still generates quite a lot of I/O:
| Query | Time (ms) | Read (MB) | Written (MB) | |-------+-----------+-----------+--------------| | 8 | 6443 | 0 | 107 |
I've uploaded the traces and plans for query 8 with and without the <> comparisons (the traces are from the small dataset):
http://www.informatik.hu-berlin.de/~rosenfel/monet/q8-no-identical-sibling.p... (with <>) http://www.informatik.hu-berlin.de/~rosenfel/monet/q8-no-identical-sibling.t... (with <>) http://www.informatik.hu-berlin.de/~rosenfel/monet/q8-identical-sibling.plan (without <>) http://www.informatik.hu-berlin.de/~rosenfel/monet/q8-identical-sibling.trac... (without <>)
I think the next step would be to run stethoscope on the query variations (except BETWEEN SYMMETRIC) that still fill up the disk and report the last few statements before the crash. Would this be helpful?
I'm still puzzled though why data is being written to disk in the first place. The query runtime appears to be roughly proportional to the amount written (which isn't surprising).
Thanks, Viktor
------------------------------------------------------------------------------ Learn Windows Azure Live! Tuesday, Dec 13, 2011 Microsoft is holding a special Learn Windows Azure training event for developers. It will provide a great way to learn Windows Azure and what it provides. You can attend the event by watching it streamed LIVE online. Learn more at http://p.sf.net/sfu/ms-windowsazure _______________________________________________ MonetDB-users mailing list MonetDB-users@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/monetdb-users
-- | Stefan.Manegold @ CWI.nl | DB Architectures (INS1) | | http://CWI.nl/~manegold/ | Science Park 123 (L321) | | Tel.: +31 (0)20 592-4212 | 1098 XG Amsterdam (NL) |
------------------------------------------------------------------------------ Learn Windows Azure Live! Tuesday, Dec 13, 2011 Microsoft is holding a special Learn Windows Azure training event for developers. It will provide a great way to learn Windows Azure and what it provides. You can attend the event by watching it streamed LIVE online. Learn more at http://p.sf.net/sfu/ms-windowsazure _______________________________________________ MonetDB-users mailing list MonetDB-users@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/monetdb-users
-- | Stefan.Manegold @ CWI.nl | DB Architectures (INS1) | | http://CWI.nl/~manegold/ | Science Park 123 (L321) | | Tel.: +31 (0)20 592-4212 | 1098 XG Amsterdam (NL) |