On Dec 5, 2007 12:40 AM, Niels Nes wrote:
On Wed, Dec 05, 2007 at 12:25:36AM -0700, m h wrote:
On Tue, Dec 04, 2007 at 12:56:27PM -0700, m h wrote:
On Dec 5, 2007 12:07 AM, Niels Nes wrote:
Like I said, it's not apples to apples. Tomorrow I'll deploy MonetDB on the same machine running the VM'd db. (But not under VMware, so again not exactly the same, but the hardware should now be faster than the old db hardware since there's no virtualization.) My laptop has 1.5 GB and MonetDB was using between 200 MB and 1+ GB. (It eats memory while loading the data. I've got 67 million rows to load, and after 4 million I was swapping like crazy, but I'll save that for another email/ticket.) After a restart, memory goes down to 200 MB again and I can query without stopping.
As I don't know your schema/data types I don't know if 1+ GB is much. MonetDB likes memory ;-).
Yes, it does ;)
The machine running the old db (under VM) only has 400 MB allocated to it (and it also holds 67 million rows)!
But your old db is block oriented, and your data is apparently organised in a way that keeps that from hurting much (any sorting or clustered indices?).
There are indexes on all the columns in the WHERE clause
You know what, I'm willing to give you anonymized schemas and queries.
That would help a lot.
I don't know what to do with the 67 million rows in the fact table though....
Well, uploading shouldn't be a problem, but de-personalizing your 67 M row fact table may be a problem.
Seems like the easiest solution is to change the dimension tables....
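Something like this is roughly what I have in mind for the dimension tables: a deterministic find-and-replace on the sensitive columns so the joins still line up. The file and column names below are made up, just to show the idea:

import csv, hashlib

def surrogate(value, prefix):
    # Same input always maps to the same token, so keys and joins still match
    # after anonymization, while the original text stays hidden.
    return prefix + hashlib.sha1(value.encode("utf-8")).hexdigest()[:10]

# Hypothetical dimension extract with columns: id, name, city (no header row)
with open("dim_customer.csv") as src, open("dim_customer_anon.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for row_id, name, city in csv.reader(src):
        writer.writerow([row_id, surrogate(name, "name_"), surrogate(city, "city_")])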
But let's try to get to the problem here.
First thing is to make sure there is no overhead in the Python code, i.e. is it MonetDB/SQL server side or client side?
Good point. Yes, I've been wondering whether the Python code for the other db might be a bit faster than MonetDB's Python code. One thing to check out is profiling the Python side.
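Something along these lines should roughly separate the two (the split is approximate, since some result transfer can already happen during execute; I'm using the modern pymonetdb module here as a stand-in for whatever client is installed, and the connection details and query are placeholders):

import time
import pymonetdb  # stand-in for the MonetDB Python client module

conn = pymonetdb.connect(database="demo", hostname="localhost",
                         username="monetdb", password="monetdb")
cur = conn.cursor()

t0 = time.perf_counter()
cur.execute("SELECT 1")           # placeholder query; mostly server-side work
t1 = time.perf_counter()
rows = cur.fetchall()             # pulling the result set over the wire: client side
t2 = time.perf_counter()
text = "\n".join(map(str, rows))  # formatting for display: pure client side
t3 = time.perf_counter()

print("execute: %.3fs  fetch: %.3fs  format: %.3fs" % (t1 - t0, t2 - t1, t3 - t2))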
Could you therefore 1) give the number of rows returned (doesn't have to be exact)
Current query is returning 90 rows
2) try with mclient -t
I tried that before blowing away the data and got like 3 seconds
3 seconds is very high; -t gives times in milliseconds.
(guess) but I also have 90 rows of output scrolling... Is there a way to disable output, so you only measure the impact of the query and not displaying the results?
Simply redirecting the output to a file always helps.
Must have done that wrong. Will try again after sleep.
I'll have to try it again tomorrow, I re-installed the latest build to test the order by and blew away my data.
Since the query is confidential I can't really put it here, but I can describe it. I'm pulling stuff out of my fact table and two other dimension tables. The select clause sums a few columns, and in one case sums the product of one column and the sum of a few others. The COALESCE function is used to convert NULLs to 0. The WHERE clause joins the tables using the = operator and then has two "IN" clauses.
You may have hit a problem with our "IN" clause handling. Are the sets you're doing the IN against large?
Then there is a group by (since I'm aggregating in the select) and an order by.
Are there any (foreign/primary/unique) keys involved?
Primary and unique yes, no foreign.
No foreign keys makes the joins expensive of course. Are these single-column joins, or joins over multiple columns?
So an optimization would be to add foreign keys? I'll need to double check whether there are FKs after sleep.... Obviously most entries in the FACT table could be FKs.
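To make the shape clearer without giving anything away, the query looks roughly like this (written as the Python string I pass to the client; every table and column name below is invented):

# Made-up skeleton only: fact table joined to two dimension tables, SUMs with
# COALESCE, equality joins in the WHERE clause, two IN lists, GROUP BY, ORDER BY.
QUERY = """
SELECT d1.label,
       d2.label,
       SUM(COALESCE(f.a, 0))                               AS sum_a,
       SUM(COALESCE(f.b, 0))                               AS sum_b,
       SUM(f.qty * (COALESCE(f.a, 0) + COALESCE(f.b, 0)))  AS weighted
FROM   fact f, dim_one d1, dim_two d2
WHERE  f.dim_one_id = d1.id
  AND  f.dim_two_id = d2.id
  AND  d1.label IN ('x', 'y', 'z')
  AND  d2.label IN ('p', 'q')
GROUP BY d1.label, d2.label
ORDER BY d1.label, d2.label
"""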
We would like to see your query/data. If it's all too confidential, maybe you could anonymize it.
Am willing to work with you guys. I'd like to come up with something, but the 67 million rows of data are around 3 GB compressed.... Perhaps I can find a way to get you some data...
We have an ftp service at CWI. Anyway, we could/should discuss this off the list.
Yes, that sounds like the best way to handle it.
Any hints on how to increase performance? Do I need to tune my query or tweak other things? (Yes I know the laptop machine is not apples to apples to the vmware machine)
What you could do is run your query with 'trace' in front and see which lower-level MonetDB statement takes a lot of time. Maybe we can fix the performance problem without you having to disclose extra data.
Yeah, I ran it with TRACE and tons of data scrolled by. I need to learn what it actually means (it ran over my xterm buffer....). The intermediate language looks like Perl to me... (sorry, Perl people ;))
So now I'm offended; Perl is not really a language I appreciate. ;-) Same as before: redirect the output to a file and view it with your favourite editor.
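If it helps, something along these lines captures the whole trace in one go (a sketch only: the database name and SQL file are placeholders, and mclient's option handling can differ a bit between versions):

import subprocess

# Run a SQL file whose query is prefixed with TRACE through mclient and save
# everything it prints to a file for later study in an editor.
with open("trace_output.txt", "w") as out:
    subprocess.run(["mclient", "-d", "demo", "traced_query.sql"], stdout=out, check=True)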
Sorry. I don't like perl either. I'll get used to the other language ;) -matt