Hi, Stefan Manegold wrote:
Ah, I now releaize that you have point predicates in you query/ies, right? In that case, mitosis does not trigger since we want to (build and) exploit hash indices in the base column(s).
What do you mean by point predicates?
Briefly looking at your plans, I notice that they contain BAT iterators in MAL; that's why the traces blow up. I don't have the time right now to explain what that means (the "experts will know" ;-)). It appears that they are triggered by the "between SYMMETRIC" construct. I filed a bug report / feature request abut this: http://bugs.monetdb.org/show_bug.cgi?id=2945 For now, if you would be able to omit the "SYMMETRIC", I expect at least the traces not to blow up. I cannot tell, whether also the memory usage / IO explosion would be avoided by this ...
I have done a more detailed analysis with the 3 problematic queries. (They are called 7 through 9 in my benchmark set and I kept that numbering for simplicity.) I've uploaded the original SQL queries: http://www.informatik.hu-berlin.de/~rosenfel/monet/q7-count.sql http://www.informatik.hu-berlin.de/~rosenfel/monet/q8-count.sql http://www.informatik.hu-berlin.de/~rosenfel/monet/q9-count.sql All three queries fill up the hard disk on the large dataset. Only queries 7 and 9 use a BETWEEN SYMMETRIC predicate. Query 8 uses an inequality predicate on node.id (see below). The table below shows their evaluation time and disk I/O behavior on the small dataset. | Query | Time (ms) | Data written | |-------+-----------+--------------| | 7 | 962 | 6M | | 8 | 203 | 92k | | 9 | 9646 | 100M | As per Stefan's suggestion, I removed the SYMMETRIC predicate. I also tried 2 other strategies: - rewriting the BETWEEN predicate with an equivalent >= and <= comparison and - substituting the BETWEEN predicate with a single < comparison. (The last strategy is not semantically equivalent, but nevertheless correct for my application for these queries. However, I use the BETWEEN predicate in other places where this option is not available.) The next table shows the evaluation times (in ms) on the small dataset for query 7 and 9 using these 4 strategies: | Query | BETWEEN SYMMETRIC | BETWEEN | >= AND <= | < | |-------+-------------------+---------+-----------+-----| | 7 | 1002 | 342 | 335 | 66 | | 9 | 9908 | 6040 | 5990 | 185 | The next table shows the disk I/O behavior of query 9 using these 4 strategies on the small dataset. | Strategy | Read (MB) | Write (MB) | |-------------------+-----------+------------| | BETWEEN SYMMETRIC | 0 | 100 | | BETWEEN | 0 | 84 | | >= AND <= | 0 | 84 | | < | 0 | 0 | Using BETWEEN and using >= and <= has the same runtime behavior. Indeed, the MAL plans are identical. I have uploaded the MAL plans for query 9 for all 4 strategies. http://www.informatik.hu-berlin.de/~rosenfel/monet/q9-with-mitosis.plan (BETWEEN SYMMETRIC) http://www.informatik.hu-berlin.de/~rosenfel/monet/q9-no-symmetric.plan (BETWEEN) http://www.informatik.hu-berlin.de/~rosenfel/monet/q9-workaround.plan (>= AND <=) http://www.informatik.hu-berlin.de/~rosenfel/monet/q9-no-precedence-restrict... (<) Without the SYMMETRIC predicate the trace do not explode any longer. I have uploaded these for query 9. http://www.informatik.hu-berlin.de/~rosenfel/monet/q9-no-symmetric.trace (BETWEEN) http://www.informatik.hu-berlin.de/~rosenfel/monet/q9-workaround.trace (>= AND <=) http://www.informatik.hu-berlin.de/~rosenfel/monet/q9-no-precedence-restrict... (<) Out of curiosity, I also generated plans for a much simpler query. Starting from the query below, I generated MAL plans for all 4 strategies: SELECT count(*) FROM ( SELECT node1.id AS id1, node2.id AS id2, node1.toplevel_corpus FROM node AS node1, node AS node2 WHERE node1.right_token BETWEEN SYMMETRIC node2.left_token - 1 AND node2.left_token - 50 AND node1.text_ref = node2.text_ref ) AS solutions; http://www.informatik.hu-berlin.de/~rosenfel/monet/between-symmetric.plan (BETWEEN SYMMETRIC) http://www.informatik.hu-berlin.de/~rosenfel/monet/between-no-symmetric.plan (BETWEEN) http://www.informatik.hu-berlin.de/~rosenfel/monet/between-workaround.plan (>= AND <=) http://www.informatik.hu-berlin.de/~rosenfel/monet/no-precedence-restriction... (<) Interestingly, the BETWEEN variant and the >= AND <= variant do not produce the same MAL plan for the simple query as they do for query 9. Without the SYMMETRIC predicate, queries 7 and 9 still fill up my disk when I evaluate them on the large dataset. However, using only the < comparison they finish. The next table shows their runtime and disk I/O behavior for the < strategy. | Query | Time (ms) | Read (MB) | Written (MB) | |-------+-----------+-----------+--------------| | 7 | 990 | 0 | 3 | | 9 | 56903 | 266 | 1621 | There was considerable variation in the evaluation times, which I usually do not observe for other queries. - Query 7 variation: 685 - 1833 - Query 9 variation: 48253 - 64511 There was also some variation in the amount of disk I/O for query 9: - Query 9 reads: 257, 296, 254, 237, 285 - Query 9 writes: 1736, 1545, 1545, 1736, 1545 (However, due to the long runtime, there might have been competing I/O from other processes.) Query 8 does not use a BETWEEN SYMMETRIC predicate, but does a few comparisons of the type nodeX.id <> nodeY.id where nodeX and nodeY are aliases for the node table. This seems to be the reason for filling up the disk. If I remove these predicates, query 8 finishes on the large dataset, although it still generates quite a lot of I/O: | Query | Time (ms) | Read (MB) | Written (MB) | |-------+-----------+-----------+--------------| | 8 | 6443 | 0 | 107 | I've uploaded the traces and plans for query 8 with and without the <> comparisons (the traces are from the small dataset): http://www.informatik.hu-berlin.de/~rosenfel/monet/q8-no-identical-sibling.p... (with <>) http://www.informatik.hu-berlin.de/~rosenfel/monet/q8-no-identical-sibling.t... (with <>) http://www.informatik.hu-berlin.de/~rosenfel/monet/q8-identical-sibling.plan (without <>) http://www.informatik.hu-berlin.de/~rosenfel/monet/q8-identical-sibling.trac... (without <>) I think the next step would be to run stethoscope on the query variations (except BETWEEN SYMMETRIC) that still fill up the disk and report the last few statements before the crash. Would this be helpful? I'm still puzzled though why data is being written to disk in the first place. The query runtime appears to be roughly proportional to the amount written (which isn't surprising). Thanks, Viktor