Hi Arjen, while Niels should be the one who can give an educated comment, here's just a simple preliminary idea to potentially help(?) locating the origin of the unexpectedly slow performance: Your "combined" queries differs in two way from the original two queries that run fast: (1) there are two aggregations in one query: count(distinct clientip) & count(*) (2) count(distinct clientip) is now combined with a group-by instead of being global. Doing only one modification at a time gives much more info, which of the two changes might trigger the performance degradation, repectively whether it in only the combination of the two. Hence, could you please also run the following two queries and report their performance? select channelid, count(distinct clientip), count(*) from pageviews where timestamp between '2007-09-21 00:00:00' and '2007-09-22 00:00:00'; select channelid, count(distinct clientip) from pageviews where timestamp between '2007-09-21 00:00:00' and '2007-09-22 00:00:00' group by channelid; Thanks in advance! Stefan ps: and the "usual" questions are: which version of MonetDB5/SQL are you using and on what kind of platform (HW, OS, etc.) are you running it? On Sun, Oct 21, 2007 at 02:33:58PM +0200, Arjen van der Meijden wrote:
Hi list,
I was just doing some basic queries on MonetDB5/SQL to see if it is suitable for my application, I'm doing lots of aggregates on some logfile-abstractions. Basically they all boil down to 'how many unique visitors and total pageviews where there in period X-Y in section Z'.
I have this table: pageviews ( timestamp timestamp not null, clientip varchar(15) not null, sectionid smallint not null, itemid integer not null, channelid smallint default 0 )
Currently it only contains data for last september, with about 2M records/day, and 5.6M in total.
There are no additional indexes in this case.
When doing a query like this, monetdb very fast. Once the data is in the memory cache, it returns (according to trace) in about half a second.
select count(*) from pageviews where timestamp between '2007-09-21 00:00:00' and '2007-09-22 00:00:00';
The result is 1916813
This one is also pretty fast, taking about 1.7 second
select count(distinct clientip) from pageviews where timestamp between '2007-09-21 00:00:00' and '2007-09-22 00:00:00';
The result is 165700
And the third which is pretty fast: select channelid, count(*) from pageviews where timestamp between '2007-09-21 00:00:00' and '2007-09-22 00:00:00' group by channelid;
Here's the distribution, and its returned in about 0.6 second [ 0, 538187 ] [ 1, 1108478 ] [ 4, 42867 ] [ 3, 145565 ] [ 2, 81716 ]
But when I combine those last two queries, the result isn't returned in a reasonable amount of time, I waited for more than half an hour and it still hadn't returned the results.
select channelid, count(distinct clientip), count(*) from pageviews where timestamp between '2007-09-21 00:00:00' and '2007-09-22 00:00:00' group by channelid;
I'm not very good at reading your explain output yet, so I've attached the resulting explain for that query.
Is there a way to speed up this type of query? It seems a bit odd that it's taking more than half an hour (postgresql does it in about 20 seconds) while the other queries return much faster (postgresql does them in about 14 seconds).
Best regards,
Arjen
function user.s5_1():void; _1:bat[:oid,:timestamp]{rows=55914459:lng,bid=1968} := sql.bind("sys","pageviews","timestamp",0); _10:bat[:oid,:timestamp]{rows=0:lng,bid=2323} := sql.bind("sys","pageviews","timestamp",1); constraints.emptySet(_10); _10:bat[:oid,:timestamp]{rows=0:lng,bid=2323} := nil; _12:bat[:oid,:timestamp]{rows=0:lng,bid=2324} := sql.bind("sys","pageviews","timestamp",2); constraints.emptySet(_12); _12:bat[:oid,:timestamp]{rows=0:lng,bid=2324} := nil; _17:bat[:oid,:sht]{notnil=true,rows=55914459:lng,bid=1969} := sql.bind("sys","pageviews","channelid",0); _24:bat[:oid,:str]{rows=55914459:lng,bid=1961} := sql.bind("sys","pageviews","clientip",0); _8{rows=27957230:lng} := algebra.uselect(_1,2007-09-21 00:00:00.000,2007-09-22 00:00:00.000,true,true); _1:bat[:oid,:timestamp]{rows=55914459:lng,bid=1968} := nil; _15{rows=27957230:lng} := algebra.markT(_8,0@0); _8{rows=27957230:lng} := nil; _16{rows=27957230:lng} := bat.reverse(_15); _15{rows=27957230:lng} := nil; _19{rows=27957230:lng} := algebra.join(_16,_17); (ext40,grp38):= group.new(_19); _19{rows=27957230:lng} := nil; _22 := bat.mirror(ext40); ext40 := nil; _23 := algebra.joinPath(_22,_16,_17); _17:bat[:oid,:sht]{notnil=true,rows=55914459:lng,bid=1969} := nil; _26{rows=27957230:lng} := algebra.join(_16,_24); _16{rows=27957230:lng} := nil; _24:bat[:oid,:str]{rows=55914459:lng,bid=1961} := nil; (_27,grp55):= group.new(grp38); (_29,grp57):= group.derive(_27,grp55,_26); _27 := nil; grp55 := nil; _31 := bat.mirror(_29); _29 := nil; _32 := algebra.semijoin(_26,_31); _26{rows=27957230:lng} := nil; _31 := nil; _33:bat[:oid,:int] := aggr.count_no_nil(_32,grp38,_22); _32 := nil; _34:bat[:oid,:int] := aggr.count(grp38,grp38,_22); grp38 := nil; _22 := nil; _35 := sql.resultSet(3,1,_23); sql.rsColumn(_35,"sys.pageviews","channelid","smallint",16,0,_23); _23 := nil; sql.rsColumn(_35,"sys.pageviews","count_no_nil_clientip","int",15,0,_33); _33:bat[:oid,:int] := nil; sql.rsColumn(_35,"sys.pageviews","count_channelid","int",32,0,_34); _34:bat[:oid,:int] := nil; sql.exportResult(_35,""); end s5_1;
------------------------------------------------------------------------- This SF.net email is sponsored by: Splunk Inc. Still grepping through log files to find problems? Stop. Now Search log events and configuration files using AJAX and a browser. Download your FREE copy of Splunk now >> http://get.splunk.com/ _______________________________________________ MonetDB-users mailing list MonetDB-users@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/monetdb-users
-- | Dr. Stefan Manegold | mailto:Stefan.Manegold@cwi.nl | | CWI, P.O.Box 94079 | http://www.cwi.nl/~manegold/ | | 1090 GB Amsterdam | Tel.: +31 (20) 592-4212 | | The Netherlands | Fax : +31 (20) 592-4312 |