MonetDB performance on large table with 20K columns

Hi,

I posted a question on StackOverflow (https://stackoverflow.com/questions/55951563/monetdb-performance-on-a-large-...) regarding a specific performance issue. I'm not sure whether the right people are monitoring S/O, so I thought I'd post it here too. Here it is in case you do not want to follow the link:

I'm testing MonetDB as a solution for a data-science project. I have a table of 21K columns, all but three of which are features stored as float (32-bit), and 6.5M rows (which may grow, perhaps up to 20M rows). My aim is to use the integrated Python in MonetDB so that I can train without exporting the data from the DB every time. In addition, queries on specific columns are necessary, so the columnar storage can be a significant advantage.

I compiled MonetDB 11.31.13 to get the embedded Python support. The OS is CentOS 7. Storage is not SSD. It is a 48-core server with ~300GB of memory. I created a (unique) index on the table (without ANALYZE).

I noticed that SELECT * FROM [TABLE_NAME] SAMPLE 50; takes a very long time to complete. I then tried:

SELECT f1, f2, ..., f501 FROM [TABLE_NAME] SAMPLE 50;
SELECT f1, f2, ..., f1001 FROM [TABLE_NAME] SAMPLE 50;
SELECT f1, f2, ..., f2001 FROM [TABLE_NAME] SAMPLE 50;
...
SELECT * FROM [TABLE_NAME] SAMPLE 50;

I ran the queries locally with mclient and used time to measure how long each took, and I noticed two things:

1. There is a period during which a single core runs at 100% CPU; the more columns, the longer it lasts. Only when this phase finishes do I see all cores working, data being consumed, etc. In addition, during that time the query does not appear in the result of SELECT * FROM sys.queue(); Eventually, getting 50 rows from the full table took almost 4 hours.

2. Between each step in the test the number of columns doubles, but the time it takes to get a result triples.

So my question is: Is this behaviour expected, or does it reflect something I did wrong? The data requested from the table should be around 4MB (50 * 21000 * 4 bytes), so this is a very long wait for such a small amount of data.

Help is very much appreciated!

Hai,
On 7 May 2019, at 10:45, Malware Research wrote:
Hi,
I posted a question on StackOverflow regarding a specific performance issue. I'm not sure whether the right people are monitoring S/O, so I thought I'd post it here too.
Sorry, we should do a better job of monitoring that.
Here it is in case you do not want to follow the link:
Thanks for reposting here.
I'm testing MonetDB as a solution for a data-science project. I have a table of 21K columns
That’s a lot. It would be a good idea to reduce it as much as possible.
- all but three are features described as float (32bit)
What data type are those three columns?
and 6.5M rows (which may or may not become larger, perhaps up to 20M rows).
My aim is to use the integrated Python on MonetDB to achieve the ability to train without exporting the data from the DB every time. In addition, queries on specific columns are necessary so the columnar storage can be a significant advantage. I have compiled MonetDB 11.31.13
This is very old; the current version is 11.33, and 11.31 is no longer supported. Please upgrade.
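By the way, since the goal is in-database training: a MonetDB/Python UDF along these lines avoids the export round-trip. A minimal sketch, where the function name, the two feature columns, and the toy computation are all hypothetical:

  CREATE FUNCTION score(f1 REAL, f2 REAL)
  RETURNS REAL
  LANGUAGE PYTHON {
      # f1 and f2 arrive as NumPy arrays; return an array of the same length
      return 0.5 * f1 + 0.5 * f2
  };

  SELECT score(f1, f2) FROM [TABLE_NAME] SAMPLE 50;

Because only the columns named in the query are read from disk, this also sidesteps the SELECT * cost discussed below.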
to gain the embedded Python support. OS is CentOS 7. Storage is not SSD.
Do you really mean “not SSD”? If yes, that is one main explanation for the lack of speed you have observed. So it’s an HDD? What’s its RPM?
48-core server with ~300GB of memory. I created a (unique) index on the table (without ANALYZE).
Did you declare some columns to be unique?
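If you want one, a uniqueness constraint is declared with plain SQL; a sketch, assuming a hypothetical id column:

  ALTER TABLE [TABLE_NAME] ADD CONSTRAINT tbl_id_unique UNIQUE (id);

Note that MonetDB takes CREATE INDEX statements only as advice and decides for itself which indexes to build, so an explicitly created index is unlikely to be the issue here.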
I noticed that when I
SELECT * FROM [TABLE_NAME] SAMPLE 50; it takes a long long time to complete.
That’s to be expected. This requires MonetDB to read 21K files, glue 21K columns together and serialise roughly a million floats (50 × 21K). Each of these can be a disaster:
i) if you’re indeed using an HDD, reading 21K files takes a long time;
ii) MonetDB is a columnar database, so SELECT * on that many columns is a bad idea;
iii) I have seen recently that serialising DOUBLEs is considerably more expensive than serialising e.g. INTs. Serialising FLOATs may be cheaper than DOUBLEs, but still...
I then tried: SELECT f1, f2, ..., f501 from [TABLE_NAME] SAMPLE 50;
SELECT f1, f2, ..., f1001 from [TABLE_NAME] SAMPLE 50;
SELECT f1, f2, ..., f2001 from [TABLE_NAME] SAMPLE 50;
These should run much faster, I think...
...
SELECT * from [TABLE_NAME] SAMPLE 50;
I ran the queries locally with mclient and used time to measure how long each took, and I noticed two things:
1. There is a period where a single core is taking 100% CPU.
This calls for query execution TRACE.
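In mclient, you get a trace by prefixing the statement with TRACE, for instance on one of the smaller variants above:

  TRACE SELECT f1, f2, ..., f501 FROM [TABLE_NAME] SAMPLE 50;

The per-instruction timings should show where that single-threaded phase spends its time; the last trace can also be inspected afterwards with SELECT * FROM sys.tracelog();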
The more columns, the longer it takes to complete. Only when it finishes can I see all cores working, data being consumed, etc. In addition, during that time the query does not appear in the result of select * from sys.queue(); Eventually, the time needed to get 50 rows from the table was almost 4 hours.
2. Between each step in the test the number of columns doubles, but the time it takes to get a result triples.
So my question is: Is this behaviour expected, or does it reflect something I did wrong?
Again, query TRACEs might give us some useful information.
The data requested from the table should be around 4MB (50 * 21000 * 4 bytes), so this is a very long wait for such a small amount of data.
The amount of data eventually returned is not the problem here. It’s the number of columns, and possibly the disc.

Regards,
Jennie
Help is very much appreciated!
participants (2)
- Malware Research
- Ying Zhang