Performance issue with count distinct
Hello all, Here a simple issue I have with performance : 1/ I create a table A and populate it with a bulk load 2/I create a table B, I populate the table with bulk load 3/I create table C with a CTAS statement. I analyze the statistics and set the primary key 4/I try a "count distinct" type query (on all the table) on the PK Field. It is pretty slow, with a high sollicitation of my hard drive for several dozens of second. the "select count" is however fast. So I have these following questions : 1/why despite statistics the count distinct query is slow? I thought the idea of statistics was to get some metrics about the table? 2/if the field I make a count distinct on is the PK (or the set of field is the PK), it's exactly the same than a count non null. So why the time are not the same? 3/if these issues are design issue or bugs, how can I ask to make it solved? For information : Monetdb 11.31.13 on Windows 10 Best regards, Simon AUBERT [cid:691B9F4089B9F34B8E0C22B39B7117EC@eurprd07.prod.outlook.com] Bourse Maritime - 1, place Lainé - 33 000 Bordeaux Mob : +33(0)6 66 28 52 04
On 2 Feb 2019, at 15:57, AUBERT Simon
wrote: Hello all,
Here a simple issue I have with performance : 1/ I create a table A and populate it with a bulk load 2/I create a table B, I populate the table with bulk load
3/I create table C with a CTAS statement. I analyze the statistics and set the primary key
Hai, How many tuples do you have in tables A,B,C? Did you run analyse on all three tables? What are the types of the primary keys?
4/I try a “count distinct” type query (on all the table) on the PK Field. It is pretty slow, with a high sollicitation of my hard drive for several dozens of second. the “select count” is however fast.
COUNT is a property maintained for each column, so the query can be very fast, especially for the subsequent query executions. Normally, COUNT DISTINCT reads in all involved data, does GROUP BY and then COUNT for each group. It sounds like that’s what you’ve observed. Depending the data size of your column and the available RAM on your machine, I’d expect that if you run this query immediately a second time, it will be faster. However, since you’re doing COUNT DISTINCT on a PK column, it actually equals to a COUNT. You can check the EXPLAIN output of your query to confirm, but I suspect that we have missed a straightforward optimisation opportunity here.
So I have these following questions : 1/why despite statistics the count distinct query is slow? I thought the idea of statistics was to get some metrics about the table?
Most probably that statistics is not used in these queries, most probably because the statistics is not maintained under updates.
2/if the field I make a count distinct on is the PK (or the set of field is the PK), it’s exactly the same than a count non null. So why the time are not the same? 3/if these issues are design issue or bugs, how can I ask to make it solved?
You can fire an enhancement request to the MDB open-source team. Please be aware that there is no guarantee when your request will be processed. It’s on best effort. If you have a time limit, you can get commercial support from MonetDB Solutions. Jennie
For information : Monetdb 11.31.13 on Windows 10
Best regards,
Simon AUBERT
Bourse Maritime – 1, place Lainé – 33 000 Bordeaux Mob : +33(0)6 66 28 52 04 _______________________________________________ users-list mailing list users-list@monetdb.org https://www.monetdb.org/mailman/listinfo/users-list
participants (2)
-
AUBERT Simon
-
Ying Zhang