[MonetDB-users] Profiling queries
Hello all-

First of all thanks to you guys for your great help these past few weeks. I really appreciate it. I've learned a bit, and hopefully your product is improving from my bug reports.

I've managed to pull a bit of my data from my proprietary db into monetdb. (Have some 60 million rows in the fact table, pulled in 5 million)! I even ran a query against it and got the same numbers (though in a different order)!

So I did a very basic poor man's benchmark. My python code can hit the old db or monetdb by changing one line. So I took timestamps around the query execution portion.

The old db takes 0.21 seconds when cold. Subsequent queries take 0.027-0.072 seconds.

Monetdb takes 0.67 seconds when cold and 0.12-0.23 otherwise.

It appears that the old db is faster when cold than monetdb is after being primed.

A few more caveats. (Yes, this is a poor benchmark, but after reading pdfs saying that monetdb was 7 times faster than a 64 way oracle box, I thought I'd get quite a bit of speedup.) The old db is running under vmware on a Core2 1.86Ghz box. Monetdb is running on my laptop, a Pentium M 1.86Ghz machine.

Since the query is confidential I can't really put it here, but I can describe it. I'm pulling stuff out of my fact table and two other dimension tables. The select clause sums a few columns, and on one sums the product of one column and the sum of a few others. The coalesce function is used to convert NULLs to 0. The where statement joins the tables using the = operator and then has two "IN" clauses. Then there is a group by (since I'm aggregating in the select) and an order by.

Any hints on how to increase performance? Do I need to tune my query or tweak other things? (Yes, I know the laptop machine is not apples to apples to the vmware machine.)

Am open to suggestions.

thanks much,

-matt
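A minimal sketch of the kind of timing harness described above (the actual benchmark code is not shown in the thread; conn stands in for whichever Python DB-API connection is in use, and only the execute() call is timed):

    import time

    def time_query(conn, sql, runs=5):
        # Time only the cursor.execute() call, as described above: the first
        # run gives the "cold" number, the remaining runs the warm numbers.
        # Fetching and printing the rows is deliberately left out of the timing.
        cur = conn.cursor()
        timings = []
        for _ in range(runs):
            start = time.time()
            cur.execute(sql)
            timings.append(time.time() - start)
        cur.close()
        return timings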
On Dec 4, 2007 12:56 PM, m h
Since the query is confidential I can't really put it here, but I can describe it. I'm pulling stuff out of my fact table and two other dimension tables. The select clause sums a few columns, and on one sums the product of one column and the sum of a few others. The coalesce function is used to convert NULLs to 0. The where statement joins the tables using the = operator and then has two "IN" clauses. Then there is a group by (since I'm aggregating in the select) and an order by.
Any hints on how to increase performance? Do I need to tune my query or tweak other things? (Yes I know the laptop machine is not apples to apples to the vmware machine)
Am open to suggestions.
In the meantime I've done the following in an attempt to make the query faster, all to no avail (ie the speed stays the same):

* Remove the calculation doing summing and coerce'ing
* Change IN statement to =
* Remove primary KEY for date table and join date table with between clause (actually surprised this didn't make it slower)
* Added indexes for non-PK columns I'm joining on. No difference!

Does this have something to do with the tables in the "tmp" schema? Does it make queries slower?

cheers,

-matt
On Tue, Dec 04, 2007 at 11:58:55PM -0700, m h wrote:
On Dec 4, 2007 12:56 PM, m h
wrote: Since the query is confidential I can't really put it here, but I can describe it. I'm pulling stuff out of my fact table and two other dimension tables. The select clause sums a few columns, and on one sums the product of one column and the sum of a few others. The coalesce function is used to convert NULLs to 0. The where statement joins the tables using the = operator and then has two "IN" clauses. Then there is a group by (since I'm aggregating in the select) and an order by.
Any hints on how to increase performance? Do I need to tune my query or tweak other things? (Yes I know the laptop machine is not apples to apples to the vmware machine)
Am open to suggestions.
In the meantime I've done the following in an attempt to make the query faster all to no avail (ie the speed stays the same):
* Remove the calculation doing summing and coerce'ing (this was in the selection part where I guess you do not have much data left)
* Change IN statement to =
* Remove primary KEY for date table and join date table with between clause (actually surprised this didn't make it slower)

Only if the other side has a foreign key would it slow down.
Single-column primary keys don't help, as these are handled by in-memory hash tables which are created when needed (ie without influence of the user).
* Added indexes for non-PK columns I'm joining on. No difference!

So your query involves many join expressions?
Does this have something to do with the tables in the "tmp" schema? Does it make queries slower?
We never use the 'tmp' schema for anything other than real temporaries ourselves, so I have no real data on this, but knowing how the queries are translated I would not expect any impact.

Niels
cheers,
-matt
-- Niels Nes, Centre for Mathematics and Computer Science (CWI) Kruislaan 413, 1098 SJ Amsterdam, The Netherlands room C0.02, phone ++31 20 592-4098, fax ++31 20 592-4312 url: http://www.cwi.nl/~niels e-mail: Niels.Nes@cwi.nl
On Dec 5, 2007 12:15 AM, Niels Nes
On Tue, Dec 04, 2007 at 11:58:55PM -0700, m h wrote:
On Dec 4, 2007 12:56 PM, m h
wrote: In the meantime I've done the following in an attempt to make the query faster all to no avail (ie the speed stays the same):
* Remove the calculation doing summing and coerce'ing (this was in the selection part where I guess you do not have much data left)
* Change IN statement to =
* Remove primary KEY for date table and join date table with between clause (actually surprised this didn't make it slower)

Only if the other side has a foreign key would it slow down.
Single-column primary keys don't help, as these are handled by in-memory hash tables which are created when needed (ie without influence of the user).
* Added indexes for non-PK columns I'm joining on. No difference!

So your query involves many join expressions?
I don't think it's that many, 3-4...
Does this have something to do with the tables in the "tmp" schema? Does it make queries slower?
We never use the 'tmp' schema for anything other than real temporaries ourselves, so I have no real data on this, but knowing how the queries are translated I would not expect any impact.
I'll work on using non tmp schema. I thought I read something about indexes not working there, but I'm probably just imagining it....

ok, names have been changed to protect the innocent, here's an anonymized query (the column types are in the comments to the right):

SELECT
    item_dim.deptnm AS "DEPARTMENT",        --varchar(30)
    item_dim.upcdesc AS "DESCRIPTION",      --varchar(30)
    item_dim.freeformsz AS "SIZE",          --varchar(30)
    loc_dim.orglevel1 AS "STORE CHAIN",     --varchar(30)
    loc_dim.store AS "STORE NAME",          --varchar(30)
    item_dim.upc AS "UPC",                  --bigint
    sum(sale_fact.units * (coalesce(sale_fact.dealamt, 0) +
        coalesce(sale_fact.purbillback, 0) +
        coalesce(sale_fact.scanbillback, 0))) AS "current allowances",
    sum(sale_fact.sales - coalesce(sale_fact.cost, 0)) AS margin,
    sum(sale_fact.sales) AS sales,
    sum(sale_fact.units) AS units
FROM item_dim, loc_dim, sale_fact
WHERE sale_fact.adweekid = 16 AND           --int
    item_dim.upc = sale_fact.upc AND        --bigint
    loc_dim.stornum = sale_fact.stornum AND --int
    loc_dim.store IN ('FOO STORE') AND      --varchar
    item_dim.groupnumnm IN ('BAR ITEMS')    --varchar
GROUP BY item_dim.deptnm, item_dim.upcdesc, item_dim.freeformsz,
    loc_dim.orglevel1, loc_dim.store, item_dim.upc
ORDER BY item_dim.deptnm ASC

thanks,

-matt
On Wed, Dec 05, 2007 at 12:36:36AM -0700, m h wrote:
On Dec 5, 2007 12:15 AM, Niels Nes
wrote: On Tue, Dec 04, 2007 at 11:58:55PM -0700, m h wrote:
On Dec 4, 2007 12:56 PM, m h
wrote: In the meantime I've done the following in an attempt to make the query faster all to no avail (ie the speed stays the same):
* Remove the calculation doing summing and coerce'ing (this was in the selection part where I guess you do not have much data left)
* Change IN statement to =
* Remove primary KEY for date table and join date table with between clause (actually surprised this didn't make it slower)

Only if the other side has a foreign key would it slow down.
Single-column primary keys don't help, as these are handled by in-memory hash tables which are created when needed (ie without influence of the user).
* Added indexes for non-PK columns I'm joining on. No difference!

So your query involves many join expressions?
I don't think it's that many, 3-4...
Does this have something to do with the tables in the "tmp" schema? Does it make queries slower?
We never use the 'tmp' schema for anything other than real temporaries ourselves, so I have no real data on this, but knowing how the queries are translated I would not expect any impact.
I'll work on using non tmp schema. I thought I read something about indexes not working there, but I'm probably just imagining it....
ok, names have been changed to protect the innocent, here's an anonymized query (the column types are in the comments to the right):
SELECT
    item_dim.deptnm AS "DEPARTMENT",        --varchar(30)
    item_dim.upcdesc AS "DESCRIPTION",      --varchar(30)
    item_dim.freeformsz AS "SIZE",          --varchar(30)
    loc_dim.orglevel1 AS "STORE CHAIN",     --varchar(30)
    loc_dim.store AS "STORE NAME",          --varchar(30)
    item_dim.upc AS "UPC",                  --bigint
    sum(sale_fact.units * (coalesce(sale_fact.dealamt, 0) +
        coalesce(sale_fact.purbillback, 0) +
        coalesce(sale_fact.scanbillback, 0))) AS "current allowances",
    sum(sale_fact.sales - coalesce(sale_fact.cost, 0)) AS margin,
    sum(sale_fact.sales) AS sales,
    sum(sale_fact.units) AS units
FROM item_dim, loc_dim, sale_fact
WHERE sale_fact.adweekid = 16 AND           --int
    item_dim.upc = sale_fact.upc AND        --bigint
    loc_dim.stornum = sale_fact.stornum AND --int
    loc_dim.store IN ('FOO STORE') AND      --varchar
    item_dim.groupnumnm IN ('BAR ITEMS')    --varchar
GROUP BY item_dim.deptnm, item_dim.upcdesc, item_dim.freeformsz,
    loc_dim.orglevel1, loc_dim.store, item_dim.upc
ORDER BY item_dim.deptnm ASC

thanks,

-matt

All of the groupby columns are varchar (except the last bigint). I'm just guessing but do you see high numbers in your trace for the group.* calls? (The numbers are listed before the statement being executed.)

Niels
On Dec 5, 2007 12:46 AM, Niels Nes
All of the groupby columns are varchar (except the last bigint). I'm just guessing but do you see high numbers in your trace for the group.* calls? (The numbers are listed before the statement being executed.)
Again, I'll look at this after rest, but how do you get around group by when you're aggregating data?
On Wed, Dec 05, 2007 at 01:09:48AM -0700, m h wrote:
On Dec 5, 2007 12:46 AM, Niels Nes
wrote: All of the groupby columns are varchar (except the last bigint). I'm just guessing but do you see high numbers in your trace for the group.* calls? (The numbers are listed before the statement being executed.)
Again, I'll look at this after rest, but how do you get around group by when you're aggregating data?

I'm just looking at where the cost comes from. Maybe our groupby on strings isn't strong enough. Usually grouping on group by attributes with low ordinality (low number of different values in the type) is better.
Niels
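A sketch of what that hint could look like against the anonymized query shown earlier in the thread: group only on the integer key columns instead of the six varchar columns, and pull the labels back out with min(). This assumes each upc maps to exactly one deptnm/upcdesc/freeformsz and each stornum to one orglevel1/store, which the thread does not confirm, so treat it as an experiment to compare in the trace rather than a drop-in replacement:

    # Hypothetical rewrite: group on the integer keys instead of the strings,
    # then recover the varchar labels with min() (safe only if each key maps
    # to a single label value).
    low_cardinality_sql = """
    SELECT min(item_dim.deptnm)     AS "DEPARTMENT",
           min(item_dim.upcdesc)    AS "DESCRIPTION",
           min(item_dim.freeformsz) AS "SIZE",
           min(loc_dim.orglevel1)   AS "STORE CHAIN",
           min(loc_dim.store)       AS "STORE NAME",
           item_dim.upc             AS "UPC",
           sum(sale_fact.units * (coalesce(sale_fact.dealamt, 0) +
               coalesce(sale_fact.purbillback, 0) +
               coalesce(sale_fact.scanbillback, 0))) AS "current allowances",
           sum(sale_fact.sales - coalesce(sale_fact.cost, 0)) AS margin,
           sum(sale_fact.sales) AS sales,
           sum(sale_fact.units) AS units
    FROM item_dim, loc_dim, sale_fact
    WHERE sale_fact.adweekid = 16
      AND item_dim.upc = sale_fact.upc
      AND loc_dim.stornum = sale_fact.stornum
      AND loc_dim.store = 'FOO STORE'
      AND item_dim.groupnumnm = 'BAR ITEMS'
    GROUP BY item_dim.upc, loc_dim.stornum
    ORDER BY "DEPARTMENT" ASC
    """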
Hello Matt

This is of course a very difficult question to answer ;)

To understand if it is an intrinsic problem with the MonetDB solution or another (optimizer) bug, the best step is to use the profiling tools of MonetDB. See http://monetdb.cwi.nl/projects/monetdb/SQL/Documentation/TRACE-Statement.htm... and this trace would be helpful for us to make better suggestions.

m h wrote:
Hello all-
First of all thanks to you guys for your great help these past few weeks. I really appreciate it. I've learned a bit, and hopefully your product is improving from my bug reports.
I've managed to pull a bit of my data from my proprietary db into monetdb. (Have some 60 million rows in the fact table, pulled in 5 million)! I even ran a query against it and got the same numbers (though in a different order)!
So I did a very basic poor mans benchmark. My python code can hit the old db or monetdb by changing one line. So I took timestamps around the query execution portion.
The old db takes 0.21 seconds when cold. Subsequent queries take 0.027-0.072 seconds
Monetdb takes 0.67 seconds when cold and 0.12-0.23 otherwise.
It appears that the old db is faster when cold than monetdb is after being primed.
A few more caveats. (Yes, this is a poor benchmark but after reading pdfs saying that monetdb was 7 times faster than a 64 way oracle box, I thought I'd get quite a bit of speedup.) The old db is running under vmware on a Core2 1.86Ghz box. Monetdb is running on my laptop, a Pentium M 1.86 Ghz machine.
Since the query is confidential I can't really put it here, but I can describe it. I'm pulling stuff out of my fact table and two other dimension tables. The select clause sums a few columns, and on one sums the product of one column and the sum of a few others. The coalesce function is used to convert NULLs to 0. The where statement joins the tables using the = operator and then has two "IN" clauses. Then there is a group by (since I'm aggregating in the select) and an order by.
Any hints on how to increase performance? Do I need to tune my query or tweak other things? (Yes I know the laptop machine is not apples to apples to the vmware machine)
Am open to suggestions.
thanks much,
-matt
On Dec 5, 2007 12:07 AM, Martin Kersten
Hello Matt
This is of course a very difficult question to answer ;)
to understand if it is an intrinsic problem with the MonetDB solution or another (optimizer) bug, the best step is to use the profiling tools of MonetDB. see http://monetdb.cwi.nl/projects/monetdb/SQL/Documentation/TRACE-Statement.htm...
and this trace would be helpful for us to make better suggestions.
OK, thanks. I'll see what I can do about getting a trace and anonymizing it....
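For what it's worth, a sketch of one way to capture such a trace without it scrolling off the screen, assuming the query is issued through a Python DB-API style MonetDB connection and that the driver hands back the trace as an ordinary result set (an assumption; redirecting mclient's output to a file works just as well):

    def save_trace(conn, sql, path="trace.out"):
        # Prefix the query with TRACE so the server reports per-statement
        # timings for the underlying plan, then dump them to a file.
        cur = conn.cursor()
        cur.execute("TRACE " + sql)
        with open(path, "w") as out:
            for row in cur.fetchall():
                out.write(" ".join(str(col) for col in row) + "\n")
        cur.close()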
On Tue, Dec 04, 2007 at 12:56:27PM -0700, m h wrote:
Hello all-
First of all thanks to you guys for your great help these past few weeks. I really appreciate it. I've learned a bit, and hopefully your product is improving from my bug reports.
I've managed to pull a bit of my data from my proprietary db into monetdb. (Have some 60 million rows in the fact table, pulled in 5 million)! I even ran a query against it and got the same numbers (though in a different order)!
So I did a very basic poor mans benchmark. My python code can hit the old db or monetdb by changing one line. So I took timestamps around the query execution portion.
The old db takes 0.21 seconds when cold. Subsequent queries take 0.027-0.072 seconds
Monetdb takes 0.67 seconds when cold and 0.12-0.23 otherwise.
It appears that the old db is faster when cold than monetdb is after being primed.
A few more caveats. (Yes, this is a poor benchmark but after reading pdfs saying that monetdb was 7 times faster than a 64 way oracle box, I thought I'd get quite a bit of speedup.) The old db is running under vmware on a Core2 1.86Ghz box. Monetdb is running on my laptop, a Pentium M 1.86 Ghz machine.

CPU-wise this would already give you almost on-par results, as a Core2 is quite a bit faster than a Pentium M. An important thing would be to know the memory size of your laptop and the memory requirements of your query. What kind of columns (type of the columns) are involved?
But let's try to get to the problem here. First thing is to make sure there is no overhead in the python code, ie is it MonetDB/SQL server side or client side? Could you therefore:

1) give the number of rows returned (doesn't have to be exact)
2) try with mclient -t
Since the query is confidential I can't really put it here, but I can describe it. I'm pulling stuff out of my fact table and two other dimension tables. The select clause sums a few columns, and on one sums the product of one column and the sum of a few others. The coalesce function is used to convert NULLs to 0. The where statement joins the tables using the = operator and then has two "IN" clauses.
You may have hit a problem with our "IN" clause handling. Are the sizes of the sets you're doing the IN against large?
Then there is a group by (since I'm aggregating in the select) and an order by.

Are there any (foreign/primary/unique) keys involved?
We would like to see your query/data. If it's all too confidential maybe you could anonymize it.
Any hints on how to increase performance? Do I need to tune my query or tweak other things? (Yes I know the laptop machine is not apples to apples to the vmware machine)
What you could do is run your query with 'trace' in front and see which lower-level monetdb statement takes a lot of time. Maybe we can fix the performance problem without you disclosing any extra data.

Niels
Am open to suggestions.
thanks much,
-matt
On Dec 5, 2007 12:07 AM, Niels Nes
On Tue, Dec 04, 2007 at 12:56:27PM -0700, m h wrote:
A few more caveats. (Yes, this is a poor benchmark but after reading pdfs saying that monetdb was 7 times faster than a 64 way oracle box, I thought I'd get quite a bit of speedup.) The old db is running under vmware on a Core2 1.86Ghz box. Monetdb is running on my laptop, a Pentium M 1.86 Ghz machine.
CPU-wise this would already give you almost on-par results, as a Core2 is quite a bit faster than a Pentium M. An important thing would be to know the memory size of your laptop and the memory requirements of your query. What kind of columns (type of the columns) are involved?
Like I said, it's not apples to apples. Tomorrow I'll deploy monetdb on the same machine running the vm'd db. (But not under vmware, so again not exactly the same, but the hardware now should be faster than the old db hardware since no virtualization.)

My laptop has 1.5Gig and Monetdb was using between 200M to 1+ Gigs. (It eats memory during loading the data. I've got 67 Million rows to load and after 4 Million I was swapping like crazy, but I'll save that for another email/ticket.) After a restart memory goes down to 200M again and I can query without stopping.

Machine running the old db (under VM) only has 400M allocated to it (and also has 67 Million rows)!

You know what, I'm willing to give you anonymous schemas and queries. I don't know what to do with 67 Million rows in the fact table though....
But lets try to get to the problem here.
First thing is to make sure there is no overhead in the python code, ie is it MonetDB/SQL server side or client side?
Good point. Yes, I've been thinking that said python code for the other db might be a bit faster than monetdb's python code. One thing to check out is profiling the python side.
Could you therefore 1) give the number of rows returned (doesn't have to be exact)
Current query is returning 90 rows
2) try with mclient -t
I tried that before blowing away the data and got like 3 seconds (guess) but I also have 90 rows of output scrolling... Is there a way to disable output, so you only measure the impact of the query and not displaying the results? I'll have to try it again tomorrow, I re-installed the latest build to test the order by and blew away my data.
Since the query is confidential I can't really put it here, but I can describe it. I'm pulling stuff out of my fact table and two other dimension tables. The select clause sums a few columns, and on one sums the product of one column and the sum of a few others. The coalesce function is used to convert NULLs to 0. The where statement joins the tables using the = operator and then has two "IN" clauses.
You may have hit a problem with our "IN" clause handling. Are the sizes of the sets you're doing the IN against large?
Then there is a group by (since I'm aggregating in the select) and an order by.

Are there any (foreign/primary/unique) keys involved?
Primary and unique yes, no foreign.
We would like to see your query/data. If it's all too confidential maybe you could anonymize it.
Am willing to work with you guys. I'd like to come up with something but the 67 Million rows of data is around 3Gigs compressed.... Perhaps I can find a way to get you some data...
Any hints on how to increase performance? Do I need to tune my query or tweak other things? (Yes I know the laptop machine is not apples to apples to the vmware machine)
What you could do is run your query with 'trace' in front and see which lower-level monetdb statement takes a lot of time. Maybe we can fix the performance problem without you disclosing any extra data.
Yeah, I ran with TRACE and tons of data scrolled by. I need to learn what it actually means (ran over my xterm buffer....). The intermediate language looks like perl to me... (sorry perl people ;)) -matt
On Wed, Dec 05, 2007 at 12:25:36AM -0700, m h wrote:
On Dec 5, 2007 12:07 AM, Niels Nes
wrote: On Tue, Dec 04, 2007 at 12:56:27PM -0700, m h wrote:
A few more caveats. (Yes, this is a poor benchmark but after reading pdfs saying that monetdb was 7 times faster than a 64 way oracle box, I thought I'd get quite a bit of speedup.) The old db is running under vmware on a Core2 1.86Ghz box. Monetdb is running on my laptop, a Pentium M 1.86 Ghz machine.
CPU-wise this would already give you almost on-par results, as a Core2 is quite a bit faster than a Pentium M. An important thing would be to know the memory size of your laptop and the memory requirements of your query. What kind of columns (type of the columns) are involved?
Like I said, it's not apples to apples. Tomorrow I'll deploy monetdb on the same machine running the vm'd db. (But not under vmware, so again not exactly the same, but the hardware now should be faster than the old db hardware since no virtualization.)
My laptop has 1.5Gig and Monetdb was using between 200M to 1+ Gigs. (It eats memory during loading the data. I've got 67 Million rows to load and after 4 Million I was swapping like crazy, but I'll save that for another email/ticket.) After a restart memory goes down to 200M again and I can query without stopping.

As I don't know your schema/data types I don't know if 1+ GB is much. Monetdb likes memory ;-).
Machine running the old db (under VM) only has 400M allocated to it (and also has 67 Million rows)!

But your old db is block oriented, and your data is somehow organised so that it doesn't hurt much either (any sorting or cluster indices?).
You know what I'm willing to give you anonymous schemas and queries.
That would help a lot.
I don't know what to do with 67 Million rows in the fact table though....

Well, uploading shouldn't be a problem, but de-personalizing your 67 M fact table may be a problem.
But lets try to get to the problem here.
First thing is to make sure there is no overhead in the python code, ie is it MonetDB/SQL server side or client side?
Good point. Yes, I've been thinking that said python code for the other db might be a bit faster than monetdb's python code. One thing to check out is profiling the python side.
Could you therefore 1) give the number of rows returned (doesn't have to be exact)
Current query is returning 90 rows
2) try with mclient -t
I tried that before blowing away the data and got like 3 seconds (guess) but I also have 90 rows of output scrolling...

3 seconds is very high, -t gives times in milliseconds.

Is there a way to disable output, so you only measure the impact of the query and not displaying the results?

Simply redirecting output to a file always helps.
I'll have to try it again tomorrow, I re-installed the latest build to test the order by and blew away my data.
Since the query is confidential I can't really put it here, but I can describe it. I'm pulling stuff out of my fact table and two other dimension tables. The select clause sums a few columns, and on one sums the product of one column and the sum of a few others. The coalesce function is used to convert NULLs to 0. The where statement joins the tables using the = operator and then has two "IN" clauses.
You may have hit a problem with our "IN" clause handling. Are the sizes of the sets you're doing the IN against large?
Then there is a group by (since I'm aggregating in the select) and an order by.

Are there any (foreign/primary/unique) keys involved?
Primary and unique yes, no foreign.

No foreign keys makes the joins expensive of course. Are these single-column joins or joins over multiple columns?
We would like to see your query/data. If it's all too confidential maybe you could anonymize it.
Am willing to work with you guys. I'd like to come up with something but the 67 Million rows of data is around 3Gigs compressed.... Perhaps I can find a way to get you some data...

We have an ftp service at CWI. Anyway, we could/should discuss this off the list.
Any hints on how to increase performance? Do I need to tune my query or tweak other things? (Yes I know the laptop machine is not apples to apples to the vmware machine)
What you could do is run your query with 'trace' in front and see which lower-level monetdb statement takes a lot of time. Maybe we can fix the performance problem without you disclosing any extra data.
Yeah, I ran with TRACE and tons of data scrolled by. I need to learn what it actually means (ran over my xterm buffer....). The intermediate language looks like perl to me... (sorry perl people ;))

So now I'm offended. Perl is not really a language I appreciate. Same here: redirect to a file and view with your favourite editor.
Niels
-matt
On Dec 5, 2007 12:40 AM, Niels Nes
On Wed, Dec 05, 2007 at 12:25:36AM -0700, m h wrote:
On Tue, Dec 04, 2007 at 12:56:27PM -0700, m h wrote:
On Dec 5, 2007 12:07 AM, Niels Nes wrote:

Like I said, it's not apples to apples. Tomorrow I'll deploy monetdb on the same machine running the vm'd db. (But not under vmware, so again not exactly the same, but the hardware now should be faster than the old db hardware since no virtualization.) My laptop has 1.5Gig and Monetdb was using between 200M to 1+ Gigs. (It eats memory during loading the data. I've got 67 Million rows to load and after 4 Million I was swapping like crazy, but I'll save that for another email/ticket.) After a restart memory goes down to 200M again and I can query without stopping.

As I don't know your schema/data types I don't know if 1+ GB is much. Monetdb likes memory ;-).
Yes, it does ;)
Machine running the old db (under VM) only has 400M allocated to it (and also has 67 Million rows)!
But your old db is block oriented, and your data is somehow organised so that it doesn't hurt much either (any sorting or cluster indices?).
There are indexes on all the columns in the WHERE clause
You know what I'm willing to give you anonymous schemas and queries.
That would help a lot.
I don't know what to do with 67 Million rows in the fact table though....

Well, uploading shouldn't be a problem, but de-personalizing your 67 M fact table may be a problem.
Seems like the easiest solution is to change the dimension tables....
But lets try to get to the problem here.
First thing is to make sure there is no overhead in the python code, ie is it MonetDB/SQL server side or client side?
Good point. Yes, I've been thinking that said python code for the other db might be a bit faster than monetdb's python code. One thing to check out is profiling the python side.
Could you therefore 1) give the number of rows returned (doesn't have to be exact)
Current query is returning 90 rows
2) try with mclient -t
I tried that before blowing away the data and got like 3 seconds
3 seconds is very high, -t gives times in milliseconds.
(guess) but I also have 90 rows of output scrolling... Is there a way to disable output, so you only measure the impact of the query and not displaying the results?

Simply redirecting output to a file always helps.

Must have done that wrong. Will try again after sleep.
I'll have to try it again tomorrow, I re-installed the latest build to test the order by and blew away my data.
Since the query is confidential I can't really put it here, but I can describe it. I'm pulling stuff out of my fact table and two other dimension tables. The select clause sums a few columns, and on one sums the product of one column and the sum of a few others. The coalesce function is used to convert NULLs to 0. The where statement joins the tables using the = operator and then has two "IN" clauses.
You may have hit a problem with our "IN" clause handling. Are the sizes of the sets you're doing the IN against large?
Then there is a group by (since I'm aggregating in the select) and an order by.

Are there any (foreign/primary/unique) keys involved?
Primary and unique yes, no foreign.

No foreign keys makes the joins expensive of course. Are these single-column joins or joins over multiple columns?
So an optimization is to remove all foreign keys? I'll need to double check if there are FK's after sleep.... Obviously most entries in the FACT table could be FKs.
We would like to see your query/data. If it's all too confidential maybe you could anonymize it.
Am willing to work with you guys. I'd like to come up with something but the 67 Million rows of data is around 3Gigs compressed.... Perhaps I can find a way to get you some data...
We have an ftp service at CWI. Anyway, we could/should discuss this off the list.
Yes, that sounds like the best way to handle.
Any hints on how to increase performance? Do I need to tune my query or tweak other things? (Yes I know the laptop machine is not apples to apples to the vmware machine)
What you could do is run your query with 'trace' in front and see which lower-level monetdb statement takes a lot of time. Maybe we can fix the performance problem without you disclosing any extra data.
Yeah, I ran with TRACE and tons of data scrolled by. I need to learn what it actually means (ran over my xterm buffer....). The intermediate language looks like perl to me... (sorry perl people ;))
So now I'm offended. Perl is not really a language I appreciate. Same here: redirect to a file and view with your favourite editor.
Sorry. I don't like perl either. I'll get used to the other language ;) -matt
On Wed, Dec 05, 2007 at 01:06:07AM -0700, m h wrote:
But your old db is block oriented, and your data is somehow organised so that it doesn't hurt much either (any sorting or cluster indices?).
There are indexes on all the columns in the WHERE clause
Primary and unique yes, no foreign.

No foreign keys makes the joins expensive of course. Are these single-column joins or joins over multiple columns?
So an optimization is to remove all foreign keys? I'll need to double check if there are FK's after sleep.... Obviously most entries in the FACT table could be FKs.

No, the other way around. Having foreign keys will give you 'join indices', ie the joins should be (almost) for free.
Niels
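A sketch of what declaring those relationships could look like for the anonymized schema above, so the fact-to-dimension joins can use the join indices Niels mentions. It assumes upc and stornum are declared PRIMARY KEY or UNIQUE on item_dim and loc_dim respectively, which the thread does not state explicitly:

    # Hypothetical DDL, issued through the same Python connection used for
    # the queries; the referenced dimension columns must already be PRIMARY
    # KEY or UNIQUE for the constraints to be accepted.
    fk_statements = [
        "ALTER TABLE sale_fact ADD FOREIGN KEY (upc) REFERENCES item_dim (upc)",
        "ALTER TABLE sale_fact ADD FOREIGN KEY (stornum) REFERENCES loc_dim (stornum)",
    ]

    cur = conn.cursor()
    for stmt in fk_statements:
        cur.execute(stmt)
    conn.commit()
    cur.close()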
I ran the query from the command line and piped the results. At the bottom are these two numbers:
Timer 197.414 msec
Timer 1.690 msec
I'm assuming one represents the runtime, not sure about the other.
I also ran the query with trace. At the bottom it says:
Timer 245.435 msec
Timer 1.380 msec
In the trace file here are the 13 biggest lines:
[ 243 usec # _16 := nil; ]
[ 251 usec # _31 := nil; ]
[ 1538 usec # _30 := algebra.join(_27=<~tmp_21501>bat[:oid,:oid]{176871}, _28=
On Wed, Dec 05, 2007 at 03:33:20PM -0700, m h wrote:
I ran the query from the command line and piped the results. At the bottom are these two numbers:
Timer 197.414 msec Timer 1.690 msec
I'm assuming one represents the runtime, not sure about the other.
I also ran the query with trace. At the bottom it says:
Timer 245.435 msec Timer 1.380 msec
In the trace file here are the 13 biggest lines:

[ 243 usec # _16 := nil; ]
[ 251 usec # _31 := nil; ]
[ 1538 usec # _30 := algebra.join(_27=<~tmp_21501>bat[:oid,:oid]{176871}, _28= bat[:oid,:oid]{4000000}) ]
[ 1877 usec # _104 := algebra.joinPath(_60=<~tmp_21352>bat[:oid,:oid]{89}, _43=<~tmp_21452>bat[:oid,:oid]{1304}, _88=<~tmp_21522>bat[:oid,:oid]{176871}, _102= bat[:oid,:str]{32}) ]
[ 1883 usec # _121 := algebra.joinPath(_112= bat[:oid,:oid]{89}, _60=<~tmp_21352>bat[:oid,:oid]{89}, _43=<~tmp_21452>bat[:oid,:oid]{1304}, _88=<~tmp_21522>bat[:oid,:oid]{176871}, _97= bat[:oid,:str]{32}) ]
[ 1885 usec # _122 := algebra.joinPath(_112= bat[:oid,:oid]{89}, _60=<~tmp_21352>bat[:oid,:oid]{89}, _43=<~tmp_21452>bat[:oid,:oid]{1304}, _88=<~tmp_21522>bat[:oid,:oid]{176871}, _102= bat[:oid,:str]{32}) ]
[ 1900 usec # _120 := algebra.joinPath(_112= bat[:oid,:oid]{89}, _60=<~tmp_21352>bat[:oid,:oid]{89}, _43=<~tmp_21452>bat[:oid,:oid]{1304}, _88=<~tmp_21522>bat[:oid,:oid]{176871}, _17= bat[:oid,:str]{32}) ]
[ 1906 usec # _119 := algebra.joinPath(_112= bat[:oid,:oid]{89}, _60=<~tmp_21352>bat[:oid,:oid]{89}, _43=<~tmp_21452>bat[:oid,:oid]{1304}, _88=<~tmp_21522>bat[:oid,:oid]{176871}, _89= bat[:oid,:str]{32}) ]
[ 1910 usec # _94 := algebra.joinPath(_60=<~tmp_21352>bat[:oid,:oid]{89}, _43=<~tmp_21452>bat[:oid,:oid]{1304}, _88=<~tmp_21522>bat[:oid,:oid]{176871}, _17= bat[:oid,:str]{32}) ]
[ 1969 usec # _91 := algebra.joinPath(_60=<~tmp_21352>bat[:oid,:oid]{89}, _43=<~tmp_21452>bat[:oid,:oid]{1304}, _88=<~tmp_21522>bat[:oid,:oid]{176871}, _89= bat[:oid,:str]{32}) ]
[ 1979 usec # _99 := algebra.joinPath(_60=<~tmp_21352>bat[:oid,:oid]{89}, _43=<~tmp_21452>bat[:oid,:oid]{1304}, _88=<~tmp_21522>bat[:oid,:oid]{176871}, _97= bat[:oid,:str]{32}) ]
[ 3256 usec # sql.exportResult(_197=0, _242="") ]
[ 4921 usec # _39 := algebra.join(_38= bat[:oid,:oid]{210}, _31=<~tmp_21452>bat[:oid,:oid]{176871}) ]
[ 73885 usec # _23 := algebra.semijoin(_16=<~tmp_13761>bat[:oid,:oid]{4000000}, _20= bat[:oid,:oid]{1}) ]

(The above is piped through sort; note the last one is an order of magnitude bigger than anything else.) How do I go about starting to understand/debug this?

Could you send the same output but without running it through sort? I need to see the related commands of this semijoin statement. I'm also missing quite a few of the statements, it's not a full plan.
Niels
On Dec 5, 2007 3:33 PM, m h
I ran the query from the command line and piped the results. At the bottom are these two numbers:
Timer 197.414 msec Timer 1.690 msec
I'm assuming one represents the runtime, not sure about the other.
I also ran the query with trace. At the bottom it says:
Timer 245.435 msec Timer 1.380 msec
In the trace file here are the 13 biggest lines:

[ 243 usec # _16 := nil; ]
[ 251 usec # _31 := nil; ]
[ 1538 usec # _30 := algebra.join(_27=<~tmp_21501>bat[:oid,:oid]{176871}, _28= bat[:oid,:oid]{4000000}) ]
[ 1877 usec # _104 := algebra.joinPath(_60=<~tmp_21352>bat[:oid,:oid]{89}, _43=<~tmp_21452>bat[:oid,:oid]{1304}, _88=<~tmp_21522>bat[:oid,:oid]{176871}, _102= bat[:oid,:str]{32}) ]
[ 1883 usec # _121 := algebra.joinPath(_112= bat[:oid,:oid]{89}, _60=<~tmp_21352>bat[:oid,:oid]{89}, _43=<~tmp_21452>bat[:oid,:oid]{1304}, _88=<~tmp_21522>bat[:oid,:oid]{176871}, _97= bat[:oid,:str]{32}) ]
[ 1885 usec # _122 := algebra.joinPath(_112= bat[:oid,:oid]{89}, _60=<~tmp_21352>bat[:oid,:oid]{89}, _43=<~tmp_21452>bat[:oid,:oid]{1304}, _88=<~tmp_21522>bat[:oid,:oid]{176871}, _102= bat[:oid,:str]{32}) ]
[ 1900 usec # _120 := algebra.joinPath(_112= bat[:oid,:oid]{89}, _60=<~tmp_21352>bat[:oid,:oid]{89}, _43=<~tmp_21452>bat[:oid,:oid]{1304}, _88=<~tmp_21522>bat[:oid,:oid]{176871}, _17= bat[:oid,:str]{32}) ]
[ 1906 usec # _119 := algebra.joinPath(_112= bat[:oid,:oid]{89}, _60=<~tmp_21352>bat[:oid,:oid]{89}, _43=<~tmp_21452>bat[:oid,:oid]{1304}, _88=<~tmp_21522>bat[:oid,:oid]{176871}, _89= bat[:oid,:str]{32}) ]
[ 1910 usec # _94 := algebra.joinPath(_60=<~tmp_21352>bat[:oid,:oid]{89}, _43=<~tmp_21452>bat[:oid,:oid]{1304}, _88=<~tmp_21522>bat[:oid,:oid]{176871}, _17= bat[:oid,:str]{32}) ]
[ 1969 usec # _91 := algebra.joinPath(_60=<~tmp_21352>bat[:oid,:oid]{89}, _43=<~tmp_21452>bat[:oid,:oid]{1304}, _88=<~tmp_21522>bat[:oid,:oid]{176871}, _89= bat[:oid,:str]{32}) ]
[ 1979 usec # _99 := algebra.joinPath(_60=<~tmp_21352>bat[:oid,:oid]{89}, _43=<~tmp_21452>bat[:oid,:oid]{1304}, _88=<~tmp_21522>bat[:oid,:oid]{176871}, _97= bat[:oid,:str]{32}) ]
[ 3256 usec # sql.exportResult(_197=0, _242="") ]
[ 4921 usec # _39 := algebra.join(_38= bat[:oid,:oid]{210}, _31=<~tmp_21452>bat[:oid,:oid]{176871}) ]
[ 73885 usec # _23 := algebra.semijoin(_16=<~tmp_13761>bat[:oid,:oid]{4000000}, _20= bat[:oid,:oid]{1}) ]

(The above is piped through sort; note the last one is an order of magnitude bigger than anything else.) How do I go about starting to understand/debug this?
I've gotten as far as understanding that the 73885 and 4921 statements have to do with two statements in the where clause that showed up as IN. I've changed the query a bit so they are now doing = comparisons against varchars.

These two lines:

loc_dim.store IN ('FOO STORE') AND      --varchar
item_dim.groupnumnm IN ('BAR ITEMS')    --varchar

Any suggestions for optimizing there?
On Thu, Dec 06, 2007 at 12:45:27AM -0700, m h wrote:
On Dec 5, 2007 3:33 PM, m h
wrote: I ran the query from the command line and piped the results. At the bottom are these two numbers:
Timer 197.414 msec Timer 1.690 msec
I'm assuming one represents the runtime, not sure about the other.
I also ran the query with trace. At the bottom it says:
Timer 245.435 msec Timer 1.380 msec
In the trace file here are the 13 biggest lines:

[ 243 usec # _16 := nil; ]
[ 251 usec # _31 := nil; ]
[ 1538 usec # _30 := algebra.join(_27=<~tmp_21501>bat[:oid,:oid]{176871}, _28= bat[:oid,:oid]{4000000}) ]
[ 1877 usec # _104 := algebra.joinPath(_60=<~tmp_21352>bat[:oid,:oid]{89}, _43=<~tmp_21452>bat[:oid,:oid]{1304}, _88=<~tmp_21522>bat[:oid,:oid]{176871}, _102= bat[:oid,:str]{32}) ]
[ 1883 usec # _121 := algebra.joinPath(_112= bat[:oid,:oid]{89}, _60=<~tmp_21352>bat[:oid,:oid]{89}, _43=<~tmp_21452>bat[:oid,:oid]{1304}, _88=<~tmp_21522>bat[:oid,:oid]{176871}, _97= bat[:oid,:str]{32}) ]
[ 1885 usec # _122 := algebra.joinPath(_112= bat[:oid,:oid]{89}, _60=<~tmp_21352>bat[:oid,:oid]{89}, _43=<~tmp_21452>bat[:oid,:oid]{1304}, _88=<~tmp_21522>bat[:oid,:oid]{176871}, _102= bat[:oid,:str]{32}) ]
[ 1900 usec # _120 := algebra.joinPath(_112= bat[:oid,:oid]{89}, _60=<~tmp_21352>bat[:oid,:oid]{89}, _43=<~tmp_21452>bat[:oid,:oid]{1304}, _88=<~tmp_21522>bat[:oid,:oid]{176871}, _17= bat[:oid,:str]{32}) ]
[ 1906 usec # _119 := algebra.joinPath(_112= bat[:oid,:oid]{89}, _60=<~tmp_21352>bat[:oid,:oid]{89}, _43=<~tmp_21452>bat[:oid,:oid]{1304}, _88=<~tmp_21522>bat[:oid,:oid]{176871}, _89= bat[:oid,:str]{32}) ]
[ 1910 usec # _94 := algebra.joinPath(_60=<~tmp_21352>bat[:oid,:oid]{89}, _43=<~tmp_21452>bat[:oid,:oid]{1304}, _88=<~tmp_21522>bat[:oid,:oid]{176871}, _17= bat[:oid,:str]{32}) ]
[ 1969 usec # _91 := algebra.joinPath(_60=<~tmp_21352>bat[:oid,:oid]{89}, _43=<~tmp_21452>bat[:oid,:oid]{1304}, _88=<~tmp_21522>bat[:oid,:oid]{176871}, _89= bat[:oid,:str]{32}) ]
[ 1979 usec # _99 := algebra.joinPath(_60=<~tmp_21352>bat[:oid,:oid]{89}, _43=<~tmp_21452>bat[:oid,:oid]{1304}, _88=<~tmp_21522>bat[:oid,:oid]{176871}, _97= bat[:oid,:str]{32}) ]
[ 3256 usec # sql.exportResult(_197=0, _242="") ]
[ 4921 usec # _39 := algebra.join(_38= bat[:oid,:oid]{210}, _31=<~tmp_21452>bat[:oid,:oid]{176871}) ]
[ 73885 usec # _23 := algebra.semijoin(_16=<~tmp_13761>bat[:oid,:oid]{4000000}, _20= bat[:oid,:oid]{1}) ]

(The above is piped through sort; note the last one is an order of magnitude bigger than anything else.) How do I go about starting to understand/debug this?
I've gotten as far as understanding that the 73885 and 4921 statements have to do with two statements in the where clause that showed up as IN. I've changed the query a bit so they are now doing = comparisons against varchars.
These two lines:
loc_dim.store IN ('FOO STORE') AND      --varchar
item_dim.groupnumnm IN ('BAR ITEMS')    --varchar
Any suggestions for optimizing there?

Could you send me the full trace output (unsorted)?
Niels
participants (3)
- m h
- Martin Kersten
- Niels Nes