Some performance oddities in numerical comparisons on merge tables
Hi, I have a larger *merge table*, of which the relevant columns in this query should be: "timestamp" TIMESTAMP "abort_flag" TINYINT "timestamp" is roughly ascending, but far from monotonous. "abort_flag" has a low cardinality with the following possible values and number of occurrences: | 0 | 735660 | | 1 | 30119 | | 2 | 1486 | | 4 | 15 | Both columns are completely populated. Total row numbers are: merge table: 3,773,559,881 current partition: 30,754,353 I get the the following performance numbers: sql>select count(*) from mytable where timestamp between timestamp '2016-04-15 13:05:05' and timestamp '2016-04-15 13:20:05' and abort_flag in (0); +--------+ | L1 | +========+ | 735660 | +--------+ 1 tuple (1.0s) sql>select count(*) from mytable where timestamp between timestamp '2016-04-15 13:05:05' and timestamp '2016-04-15 13:20:05' and abort_flag = 0; +--------+ | L1 | +========+ | 735660 | +--------+ 1 tuple (34.0s) So there is a very significant, and IMHO surprising, performance difference between the "equals" and the "in" comparison. Doing the same query directly against the sub table has a much more consistent performance: sql>select count(*) from mytable_20160415 where timestamp between timestamp '2016-04-15 13:05:05' and timestamp '2016-04-15 13:20:05' and abort_flag IN (0); +--------+ | L1 | +========+ | 735660 | +--------+ 1 tuple (618.914ms) sql>select count(*) from mytable_20160415 where timestamp between timestamp '2016-04-15 13:05:05' and timestamp '2016-04-15 13:20:05' and abort_flag = 0; +--------+ | L1 | +========+ | 735660 | +--------+ 1 tuple (1.1s) The same can be observed with another column, where ">= 4" is much slower than the (in my case) equivalent "IN (4,5,6)". I guess, for my uses I can always switch to the "in" conditions, because all relevant columns should have very low cardinality, but in general, this seems odd. Is there an explanation for this behaviour, like optimizer problems with merge tables? Or maybe a way to avoid it? Regards, Marc
Hi Marc, profiling your queries by prefixing them with keyword TRACE and sharing the resulting profiles would be a prerequisite for us to analyze /understand what happens in these cases. Thanks! Stefan ----- On Apr 15, 2016, at 5:18 PM, Marc Ballarin marc.ballarin@1und1.de wrote:
Hi,
I have a larger *merge table*, of which the relevant columns in this query should be: "timestamp" TIMESTAMP "abort_flag" TINYINT
"timestamp" is roughly ascending, but far from monotonous. "abort_flag" has a low cardinality with the following possible values and number of occurrences: | 0 | 735660 | | 1 | 30119 | | 2 | 1486 | | 4 | 15 |
Both columns are completely populated.
Total row numbers are: merge table: 3,773,559,881 current partition: 30,754,353
I get the the following performance numbers: sql>select count(*) from mytable where timestamp between timestamp '2016-04-15 13:05:05' and timestamp '2016-04-15 13:20:05' and abort_flag in (0); +--------+ | L1 | +========+ | 735660 | +--------+ 1 tuple (1.0s)
sql>select count(*) from mytable where timestamp between timestamp '2016-04-15 13:05:05' and timestamp '2016-04-15 13:20:05' and abort_flag = 0; +--------+ | L1 | +========+ | 735660 | +--------+ 1 tuple (34.0s)
So there is a very significant, and IMHO surprising, performance difference between the "equals" and the "in" comparison.
Doing the same query directly against the sub table has a much more consistent performance: sql>select count(*) from mytable_20160415 where timestamp between timestamp '2016-04-15 13:05:05' and timestamp '2016-04-15 13:20:05' and abort_flag IN (0); +--------+ | L1 | +========+ | 735660 | +--------+ 1 tuple (618.914ms)
sql>select count(*) from mytable_20160415 where timestamp between timestamp '2016-04-15 13:05:05' and timestamp '2016-04-15 13:20:05' and abort_flag = 0; +--------+ | L1 | +========+ | 735660 | +--------+ 1 tuple (1.1s)
The same can be observed with another column, where ">= 4" is much slower than the (in my case) equivalent "IN (4,5,6)".
I guess, for my uses I can always switch to the "in" conditions, because all relevant columns should have very low cardinality, but in general, this seems odd.
Is there an explanation for this behaviour, like optimizer problems with merge tables? Or maybe a way to avoid it?
Regards, Marc _______________________________________________ users-list mailing list users-list@monetdb.org https://www.monetdb.org/mailman/listinfo/users-list
-- | Stefan.Manegold@CWI.nl | DB Architectures (DA) | | www.CWI.nl/~manegold/ | Science Park 123 (L321) | | +31 (0)20 592-4212 | 1098 XG Amsterdam (NL) |
Hi,
I have a larger *merge table*, of which the relevant columns in this query should be: "timestamp" TIMESTAMP "abort_flag" TINYINT
"timestamp" is roughly ascending, but far from monotonous. "abort_flag" has a low cardinality with the following possible values and number of occurrences: | 0 | 735660 | | 1 | 30119 | | 2 | 1486 | | 4 | 15 |
Both columns are completely populated.
Total row numbers are: merge table: 3,773,559,881 current partition: 30,754,353
I get the the following performance numbers: sql>select count(*) from mytable where timestamp between timestamp '2016-04-15 13:05:05' and timestamp '2016-04-15 13:20:05' and abort_flag in (0); +--------+ | L1 | +========+ | 735660 | +--------+ 1 tuple (1.0s)
sql>select count(*) from mytable where timestamp between timestamp '2016-04-15 13:05:05' and timestamp '2016-04-15 13:20:05' and abort_flag = 0; +--------+ | L1 | +========+ | 735660 | +--------+ 1 tuple (34.0s)
So there is a very significant, and IMHO surprising, performance difference between the "equals" and the "in" comparison. Well not that surprising. The choice on which operator in MonetDB isn't
On Fri, Apr 15, 2016 at 05:18:27PM +0200, Marc Ballarin wrote: based on the content of the table but using relatively simpel rules. A equality is done before a range, and a range before an in. So in case of the in, the very selective range query is done first. In case of the equality, the equality is done first. Niels
Doing the same query directly against the sub table has a much more consistent performance: sql>select count(*) from mytable_20160415 where timestamp between timestamp '2016-04-15 13:05:05' and timestamp '2016-04-15 13:20:05' and abort_flag IN (0); +--------+ | L1 | +========+ | 735660 | +--------+ 1 tuple (618.914ms)
sql>select count(*) from mytable_20160415 where timestamp between timestamp '2016-04-15 13:05:05' and timestamp '2016-04-15 13:20:05' and abort_flag = 0; +--------+ | L1 | +========+ | 735660 | +--------+ 1 tuple (1.1s)
The same can be observed with another column, where ">= 4" is much slower than the (in my case) equivalent "IN (4,5,6)".
I guess, for my uses I can always switch to the "in" conditions, because all relevant columns should have very low cardinality, but in general, this seems odd.
Is there an explanation for this behaviour, like optimizer problems with merge tables? Or maybe a way to avoid it?
Regards, Marc _______________________________________________ users-list mailing list users-list@monetdb.org https://www.monetdb.org/mailman/listinfo/users-list
-- Niels Nes, Manager ITF, Centrum Wiskunde & Informatica (CWI) Science Park 123, 1098 XG Amsterdam, The Netherlands room L3.14, phone ++31 20 592-4098 sip:4098@sip.cwi.nl url: https://www.cwi.nl/people/niels e-mail: Niels.Nes@cwi.nl
Hi, Am 17.04.2016 um 12:31 schrieb Niels Nes:
Well not that surprising. The choice on which operator in MonetDB isn't based on the content of the table but using relatively simpel rules. A equality is done before a range, and a range before an in. So in case of the in, the very selective range query is done first. In case of the equality, the equality is done first. Niels Ok, this matches the trace Stefan Manegold suggested:
Doing "select sum(reads), sum(minflt) from sys.tracelog()"... ...for the query using "in": 268896, 1234 ...for the query using "equals": 16707488, 16736 PLAN SELECT also shows the different ordering of conditions. Is there any way to give hints to the query optimizer? In my case, I have a pretty good idea how most columns look, at least those usually used in conditions, and I know which conditions will reduce the result set the most. Also, is there any list of precedences? So far, I have seen: equality > range > like > in > is null > user defined functions. Regards, Marc
participants (3)
-
Marc Ballarin
-
Niels Nes
-
Stefan Manegold