MonetDB performance on large table with 20K columns

Hi,

I posted a question on StackOverflow (https://stackoverflow.com/questions/55951563/monetdb-performance-on-a-large-...) regarding a specific performance issue. I'm not sure whether the right people are monitoring S/O, so I thought I'd post it here too. Here it is in case you do not want to follow the link:

I'm testing MonetDB as a solution for a data-science project. I have a table of 21K columns, all but three of which are features stored as float (32-bit), and 6.5M rows (which may grow, perhaps up to 20M rows). My aim is to use the integrated Python in MonetDB so that I can train without exporting the data from the DB every time. In addition, queries on specific columns are necessary, so the columnar storage can be a significant advantage.

I compiled MonetDB 11.31.13 to get the embedded Python support. The OS is CentOS 7. Storage is not SSD. It is a 48-core server with ~300GB of memory. I created a (unique) index on the table (without ANALYZE).

I noticed that SELECT * FROM [TABLE_NAME] SAMPLE 50; takes a very long time to complete. I then tried:

SELECT f1, f2, ..., f501 FROM [TABLE_NAME] SAMPLE 50;
SELECT f1, f2, ..., f1001 FROM [TABLE_NAME] SAMPLE 50;
SELECT f1, f2, ..., f2001 FROM [TABLE_NAME] SAMPLE 50;
...
SELECT * FROM [TABLE_NAME] SAMPLE 50;

I ran the queries locally with mclient and used time to measure how long each took, and I noticed two things:

1. There is a period during which a single core runs at 100% CPU; the more columns, the longer it lasts. Only when this phase finishes do I see all cores working, data being consumed, etc. In addition, during that time the query does not appear in the result of SELECT * FROM sys.queue(); Eventually, getting 50 rows from the full table took almost 4 hours.

2. Between each step in the test the number of columns doubles, but the time it takes to get a result triples.

So my question is: Is this behaviour expected, or does it reflect something I did wrong? The data requested from the table should be around 4MB (50 * 21000 * 4 bytes), so this is a very long wait for such a small amount of data.

Help is very much appreciated!

Hai,
On 7 May 2019, at 10:45, Malware Research wrote:
Hi,
I posted a question on StackOverflow regarding a specific performance issue. I'm not sure whether the right people are monitoring S/O, so I thought I'd post it here too.
Sorry, we should do a better job of monitoring that.
Here it is in case you do not want to follow the link:
Thanks for reposting here.
I'm testing MonetDB as a solution for a data-science project. I have a table of 21K columns
That’s a lot. It would be a good idea to reduce it as much as possible.
- all but three are features described as float (32bit)
What data type are those three columns?
and 6.5M rows (which may or may not become larger, perhaps up to 20M rows).
My aim is to use the integrated Python on MonetDB to achieve the ability to train without exporting the data from the DB every time. In addition, queries on specific columns are necessary so the columnar storage can be a significant advantage. I have compiled MonetDB 11.31.13
This is very old; the current version is 11.33, and 11.31 is no longer supported. Please upgrade.
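By the way, since the goal is in-database training: a MonetDB/Python UDF along these lines avoids the export round-trip. A minimal sketch, where the function name, the two feature columns, and the toy computation are all hypothetical:

  CREATE FUNCTION score(f1 REAL, f2 REAL)
  RETURNS REAL
  LANGUAGE PYTHON {
      # f1 and f2 arrive as NumPy arrays; return an array of the same length
      return 0.5 * f1 + 0.5 * f2
  };

  SELECT score(f1, f2) FROM [TABLE_NAME] SAMPLE 50;

Because only the columns named in the query are read from disk, this also sidesteps the SELECT * cost discussed below.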
to gain the embedded Python support. OS is CentOS 7. Storage is not SSD.
Do you really mean “not SSD”? If yes, that is one main explanation for the lack of speed you have observed. So it’s an HDD? What’s its RPM?
48-core server with ~300GB of memory. I created a (unique) index on the table (without ANALYZE).
Did you declare some columns to be unique?
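If you want one, a uniqueness constraint is declared with plain SQL; a sketch, assuming a hypothetical id column:

  ALTER TABLE [TABLE_NAME] ADD CONSTRAINT tbl_id_unique UNIQUE (id);

Note that MonetDB takes CREATE INDEX statements only as advice and decides for itself which indexes to build, so an explicitly created index is unlikely to be the issue here.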
I noticed that when I
SELECT * FROM [TABLE_NAME] SAMPLE 50; it takes a long long time to complete.
That’s to be expected. This requires MonetDB to read 21K files, glue 21K columns together and serialise roughly a million floats (50 × 21K). Each of these can be a disaster:
i) if you’re indeed using an HDD, reading 21K files takes a long time;
ii) MonetDB is a columnar database, so SELECT * on that many columns is a bad idea;
iii) I have seen recently that serialising DOUBLEs is considerably more expensive than serialising e.g. INTs. Serialising FLOATs may be cheaper than DOUBLEs, but still...
I then tried: SELECT f1, f2, ..., f501 from [TABLE_NAME] SAMPLE 50;
SELECT f1, f2, ..., f1001 from [TABLE_NAME] SAMPLE 50;
SELECT f1, f2, ..., f2001 from [TABLE_NAME] SAMPLE 50;
These should run much faster, I think...
...
SELECT * from [TABLE_NAME] SAMPLE 50;
I ran the queries locally with mclient and used time to measure how long each took, and I noticed two things:
1. There is a period where a single core is taking 100% CPU.
This calls for query execution TRACE.
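In mclient, you get a trace by prefixing the statement with TRACE, for instance on one of the smaller variants above:

  TRACE SELECT f1, f2, ..., f501 FROM [TABLE_NAME] SAMPLE 50;

The per-instruction timings should show where that single-threaded phase spends its time; the last trace can also be inspected afterwards with SELECT * FROM sys.tracelog();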
The more columns, the longer it takes to complete. Only when it finishes can I see all cores working, data being consumed, etc. In addition, during that time the query does not appear in the result of select * from sys.queue(); Eventually, the time needed to get 50 rows from the table was almost 4 hours.
2. Between each step in the test the number of columns doubles, but the time it takes to get a result triples.
So my question is: Is this behaviour expected, or does it reflect something I did wrong?
Again, query TRACEs might give us some useful information.
The data requested from the table should be around 4MB (50 * 21000 * 4 bytes), so this is a very long wait for such a small amount of data.
The amount of data eventually returned is not the problem here. It’s the number of columns, and possibly the disc.

Regards,
Jennie
Help is very much appreciated!
participants (2)
- Malware Research
- Ying Zhang