[MonetDB-users] Long running query with count(distinct ...)
Hi, I have table with 81 887 010 rows. Select count from this table works fine, but I can't wait for the results of select count(distinct ...) from this table. Why can it be? sql>explain select count(src) from transfer_ip; +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | mal | +==================================================================================================================================================================================+ | function user.s1_1{autoCommit=true}():void; | | barrier _55 := language.dataflow(); | | _2 := sql.mvc(); | | _39:bat[:oid,:inet] := sql.bind(_2,"sys","transfer_ip","src",0,0,2); | | _43 := algebra.selectNotNil(_39); | | _35:bat[:oid,:inet] := sql.bind(_2,"sys","transfer_ip","src",2,0,2); | | _45 := algebra.kdifference(_43,_35); | | _41 := algebra.selectNotNil(_35); | | _47 := algebra.kunion(_45,_41); | | _6:bat[:oid,:oid] := sql.bind_dbat(_2,"sys","transfer_ip",1); | | _9 := bat.reverse(_6); | | _49 := algebra.kdifference(_47,_9); | | _51 := aggr.count(_49); | | _40:bat[:oid,:inet] := sql.bind(_2,"sys","transfer_ip","src",0,1,2); | | _44 := algebra.selectNotNil(_40); | | _38:bat[:oid,:inet] := sql.bind(_2,"sys","transfer_ip","src",2,1,2); | | _46 := algebra.kdifference(_44,_38); | | _42 := algebra.selectNotNil(_38); | | _48 := algebra.kunion(_46,_42); | | _52 := aggr.count(_48); | | _10:bat[:oid,:inet] := sql.bind(_2,"sys","transfer_ip","src",1); | | _11 := algebra.selectNotNil(_10); | | _53 := aggr.count(_11); | | _50 := mat.pack(_51,_52,_53); | | _54 := algebra.selectNotNil(_50); | | _14 := aggr.sum(_54); | | exit _55; | | sql.exportValue(1,"sys.transfer_ip","L4","wrd",64,0,6,_14,""); | | end s1_1; | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ 29 tuples (12.925ms) sql>explain select count(distinct src) from transfer_ip; +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | mal | +==================================================================================================================================================================================+ | function user.s2_1{autoCommit=true}():void; | | barrier _56 := language.dataflow(); | | _2 := sql.mvc(); | | _42:bat[:oid,:inet] := sql.bind(_2,"sys","transfer_ip","src",0,0,2); | | _46 := algebra.selectNotNil(_42); | | _38:bat[:oid,:inet] := sql.bind(_2,"sys","transfer_ip","src",2,0,2); | | _48 := algebra.kdifference(_46,_38); | | _44 := algebra.selectNotNil(_38); | | _50 := algebra.kunion(_48,_44); | | _6:bat[:oid,:oid] := sql.bind_dbat(_2,"sys","transfer_ip",1); | | _9 := bat.reverse(_6); | | _52 := algebra.kdifference(_50,_9); | | _53 := bat.reverse(_52); | | _43:bat[:oid,:inet] := sql.bind(_2,"sys","transfer_ip","src",0,1,2); | | _47 := algebra.selectNotNil(_43); | | _41:bat[:oid,:inet] := sql.bind(_2,"sys","transfer_ip","src",2,1,2); | | _49 := algebra.kdifference(_47,_41); | | _45 := algebra.selectNotNil(_41); | | _51 := algebra.kunion(_49,_45); | | _54 := bat.reverse(_51); | | _10:bat[:oid,:inet] := sql.bind(_2,"sys","transfer_ip","src",1); | | _11 := algebra.selectNotNil(_10); | | _55 := bat.reverse(_11); | | _14 := mat.pack(_53,_54,_55); | | _15 := algebra.kunique(_14); | | _16 := bat.reverse(_15); | | _17 := aggr.count(_16); | | exit _56; | | sql.exportValue(1,"sys.transfer_ip","L5","wrd",64,0,6,_17,""); | | end s2_1; | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ 30 tuples (1.238ms) -- Thanks, Eugene Prokopiev
On 8/2/11 8:34 AM, Eugene Prokopiev wrote:
Hi,
I have table with 81 887 010 rows. Select count from this table works fine, but I can't wait for the results of select count(distinct ...) from this table. Why can it be? Distinct requires the table to be sorted to find them.
sql>explain select count(src) from transfer_ip; +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | mal
| +==================================================================================================================================================================================+ | function user.s1_1{autoCommit=true}():void;
| | barrier _55 := language.dataflow();
| | _2 := sql.mvc();
| | _39:bat[:oid,:inet] := sql.bind(_2,"sys","transfer_ip","src",0,0,2);
| | _43 := algebra.selectNotNil(_39);
| | _35:bat[:oid,:inet] := sql.bind(_2,"sys","transfer_ip","src",2,0,2);
| | _45 := algebra.kdifference(_43,_35);
| | _41 := algebra.selectNotNil(_35);
| | _47 := algebra.kunion(_45,_41);
| | _6:bat[:oid,:oid] := sql.bind_dbat(_2,"sys","transfer_ip",1);
| | _9 := bat.reverse(_6);
| | _49 := algebra.kdifference(_47,_9);
| | _51 := aggr.count(_49);
| | _40:bat[:oid,:inet] := sql.bind(_2,"sys","transfer_ip","src",0,1,2);
| | _44 := algebra.selectNotNil(_40);
| | _38:bat[:oid,:inet] := sql.bind(_2,"sys","transfer_ip","src",2,1,2);
| | _46 := algebra.kdifference(_44,_38);
| | _42 := algebra.selectNotNil(_38);
| | _48 := algebra.kunion(_46,_42);
| | _52 := aggr.count(_48);
| | _10:bat[:oid,:inet] := sql.bind(_2,"sys","transfer_ip","src",1);
| | _11 := algebra.selectNotNil(_10);
| | _53 := aggr.count(_11);
| | _50 := mat.pack(_51,_52,_53);
| | _54 := algebra.selectNotNil(_50);
| | _14 := aggr.sum(_54);
| | exit _55;
| | sql.exportValue(1,"sys.transfer_ip","L4","wrd",64,0,6,_14,"");
| | end s1_1;
| +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ 29 tuples (12.925ms) sql>explain select count(distinct src) from transfer_ip; +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | mal
| +==================================================================================================================================================================================+ | function user.s2_1{autoCommit=true}():void;
| | barrier _56 := language.dataflow();
| | _2 := sql.mvc();
| | _42:bat[:oid,:inet] := sql.bind(_2,"sys","transfer_ip","src",0,0,2);
| | _46 := algebra.selectNotNil(_42);
| | _38:bat[:oid,:inet] := sql.bind(_2,"sys","transfer_ip","src",2,0,2);
| | _48 := algebra.kdifference(_46,_38);
| | _44 := algebra.selectNotNil(_38);
| | _50 := algebra.kunion(_48,_44);
| | _6:bat[:oid,:oid] := sql.bind_dbat(_2,"sys","transfer_ip",1);
| | _9 := bat.reverse(_6);
| | _52 := algebra.kdifference(_50,_9);
| | _53 := bat.reverse(_52);
| | _43:bat[:oid,:inet] := sql.bind(_2,"sys","transfer_ip","src",0,1,2);
| | _47 := algebra.selectNotNil(_43);
| | _41:bat[:oid,:inet] := sql.bind(_2,"sys","transfer_ip","src",2,1,2);
| | _49 := algebra.kdifference(_47,_41);
| | _45 := algebra.selectNotNil(_41);
| | _51 := algebra.kunion(_49,_45);
| | _54 := bat.reverse(_51);
| | _10:bat[:oid,:inet] := sql.bind(_2,"sys","transfer_ip","src",1);
| | _11 := algebra.selectNotNil(_10);
| | _55 := bat.reverse(_11);
| | _14 := mat.pack(_53,_54,_55);
| | _15 := algebra.kunique(_14);
| | _16 := bat.reverse(_15);
| | _17 := aggr.count(_16);
| | exit _56;
| | sql.exportValue(1,"sys.transfer_ip","L5","wrd",64,0,6,_17,"");
| | end s2_1;
| +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ 30 tuples (1.238ms)
I have table with 81 887 010 rows. Select count from this table works fine, but I can't wait for the results of select count(distinct ...) from this table. Why can it be?
Distinct requires the table to be sorted to find them.
Can clustered index helps? Is it possible to create two or more clustered indexes on one table?
On Tue, Aug 02, 2011 at 11:00:28AM +0400, Eugene Prokopiev wrote:
I have table with 81 887 010 rows. Select count from this table works fine, but I can't wait for the results of select count(distinct ...) from this table. Why can it be?
What data type is the column that you run the count(distinct) on? Which version of MonetDB are you using? What OS are you running on? What hardware (in particular CPU type & speed and amount of main memory) are you using? `mserver5 --version` shoudl asnwer the latter three questions sufficiently.
Distinct requires the table to be sorted to find them.
Not quite. On (knownly) sorted columns, the count(distinct <column>) will be (considerably) faster than on non-sorted columns, as determining the distinct values can be done in a simple sequential scan over sorted data, while it requires a more complicated (and due to its inherent very random access pattern) significantly slower hash-based algorithm on non-sorted data. If you run your query with prefixed TRACE to complition, you'll see that most of the time goes into the "algebra.kunique()" statement close to the end.
Can clustered index helps? Is it possible to create two or more clustered indexes on one table?
No, there is no such thing as (user controlable) clustered indexes in MonetDB. Stefan
------------------------------------------------------------------------------ BlackBerry® DevCon Americas, Oct. 18-20, San Francisco, CA The must-attend event for mobile developers. Connect with experts. Get tools for creating Super Apps. See the latest technologies. Sessions, hands-on labs, demos & much more. Register early & save! http://p.sf.net/sfu/rim-blackberry-1 _______________________________________________ MonetDB-users mailing list MonetDB-users@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/monetdb-users
-- | Stefan.Manegold @ CWI.nl | DB Architectures (INS1) | | http://CWI.nl/~manegold/ | Science Park 123 (L321) | | Tel.: +31 (0)20 592-4212 | 1098 XG Amsterdam (NL) |
2011/8/2 Stefan Manegold
On Tue, Aug 02, 2011 at 11:00:28AM +0400, Eugene Prokopiev wrote:
I have table with 81 887 010 rows. Select count from this table works fine, but I can't wait for the results of select count(distinct ...) from this table. Why can it be?
What data type is the column that you run the count(distinct) on?
INET
Which version of MonetDB are you using? What OS are you running on? What hardware (in particular CPU type & speed and amount of main memory) are you using?
`mserver5 --version` shoudl asnwer the latter three questions sufficiently.
$ mserver5 --version MonetDB 5 server v11.3.7 "Apr2011-SP2" (64-bit, 64-bit oids) Copyright (c) 1993-July 2008 CWI Copyright (c) August 2008-2011 MonetDB B.V., all rights reserved Visit http://monetdb.cwi.nl/ for further information Found 2.0GiB available memory, 4 available cpu cores Libraries: libpcre: 8.12 2011-01-15 (compiled with 8.12) openssl: OpenSSL 1.0.0d 8 Feb 2011 (compiled with ) libxml2: 2.7.8 (compiled with 2.7.8) Compiled by: builder@dev64 (x86_64-alt-linux-gnu) Compilation: x86_64-alt-linux-gcc -pipe -Wall -g -O2 Linking : /usr/bin/ld -m elf_x86_64 So, I have only 2.0GiB main memory, but table size is only 6.0GiB
Distinct requires the table to be sorted to find them.
Not quite. On (knownly) sorted columns, the count(distinct <column>) will be (considerably) faster than on non-sorted columns, as determining the distinct values can be done in a simple sequential scan over sorted data, while it requires a more complicated (and due to its inherent very random access pattern) significantly slower hash-based algorithm on non-sorted data.
If you run your query with prefixed TRACE to complition, you'll see that most of the time goes into the "algebra.kunique()" statement close to the end.
Can clustered index helps? Is it possible to create two or more clustered indexes on one table?
No, there is no such thing as (user controlable) clustered indexes in MonetDB.
So, there is no way to sort column?
participants (3)
-
Eugene Prokopiev
-
Martin Kersten
-
Stefan Manegold