Hi all,

I'm trying to figure out how much storage MonetDB (Feb2013-SP2) uses when populated with my data on a fresh database.

Here is my table definition: create table genomics.grch36 (site bigint, base char(1));
I'm populating the table via COPY INTO.
The source file size is 40GB and contains more than 3 billion rows.

Following the suggestions on http://www.monetdb.org/Documentation/Userguide/diskspace, I get following result for the SQL query:

sql>select * from storage() where "table" = 'grch36';
+----------+--------+--------+--------+----------+------------+-----------+-------------+----------+---------+--------+
| schema   | table  | column | type   | location | count      | typewidth | columnsize  | heapsize | indices | sorted |
+==========+========+========+========+==========+============+===========+=============+==========+=========+========+
| genomics | grch36            | site                 | bigint    | 12/1252  | 3137459420 |         8 | 25099675360 |        0 |       0 | false  |
| genomics | grch36            | base                 | char      | 12/1247  | 3137459420 |         1 |  6274918840 |   131072 |       0 | false  |
+----------+--------+--------+--------+----------+------------+-----------+-------------+----------+---------+--------+

I assume that 'columnsize' refers to the storage demand. I'm I right? Based on the numbers given, I assume a storage need of round about 31 GB (without any compression).

Calling 'du -h' on my database directory confirms this assumption.
4,0K    ./bat/14
44K    ./bat/10
108K    ./bat/07
476K    ./bat/01
30G    ./bat/12
256K    ./bat/02
4,0K    ./bat/03
4,0K    ./bat/HC
4,0K    ./bat/16
4,0K    ./bat/17
12K    ./bat/13
4,0K    ./bat/LEFTOVERS
28K    ./bat/BACKUP
4,0K    ./bat/15
4,0K    ./bat/11
12K    ./bat/04
30G    ./bat
8,0K    ./sql_logs/sql
12K    ./sql_logs
4,0K    ./box
30G    .

But why is the 'columnsize' of the 'base' column something about 6GB whereas 3 billion times 1 Byte (for a 1 char) is 3GB?
For the 'site' column this calculation works!

Another question that comes into my mind:
The number of distinct values in the 'base' column is 5.
Thus, assuming a ideal dictionary encoding 3 Bit would be sufficient to encode all values.
Since MonetDB uses dictionary compression, can it utilize this fact?


Regards,
Sebastian