Hi,

I am running the latest MonetDB 11.13.3. I loaded a fairly tiny RDF dataset of 13 million triples and created some "vertical partitions" on the predicate values. After loading the dataset, I checked the size of the dataset folder and it was _258 MB_. Then I ran 6000+ bulk queries, shutting down the server after roughly every 1000 queries. The queries finished successfully and there was no crash. After the final query I stopped the database ("monetdb stop dbname") and shut down the main daemon as well ("monetdbd stop /path/to/dbfarm").

After that I checked the size of the database folder again and it had grown to _5.8 GB_; the "HC" folder under "bat" alone was about 3 GB, and some other files were between 300 and 700 MB.

It seems mysterious to me why the size of the database would keep growing, to as much as 25 times its original size.

I intend to load much larger data (> 1 billion triples) and want to avoid a growing on-disk dataset size, since my disk space is limited.

Can someone please enlighten me?

Medha
Just wondering if anyone could shed some light on this issue?
Hi Medha,
Can you tell us a bit more about how you load the data, which SQL statements you used, and how you did the vertical partitioning?
What you are describing should not be happening.
thank you
For bulk loading the RDF data in triple form (each triple is represented simply as integers, e.g. 1:2:3 on a line of the data file, where 1 is the ID to which the subject value is mapped, 2 the ID to which the predicate string is mapped, and 3 the ID to which the object value is mapped), I use commands and SQL statements like --
$ monetdbd create /path/to/dbfarm/rdf
$ monetdbd start /path/to/dbfarm/rdf
$ monetdb create lubm
$ monetdb release lubm
$ mclient -d lubm -lsql < /path/to/bulk_load_file
bulk_load_file contains --
--------------------------
create table lubm100u(sub int, pred int, obj int);
copy into lubm100u from '/path/to/datafile' using delimiters ':','\n';
create table lubm100u_p1(sub int, obj int);
create table lubm100u_p2(sub int, obj int);
insert into lubm100u_p1 select sub, obj from lubm100u where pred=1 order by sub, obj;
insert into lubm100u_p2 select sub, obj from lubm100u where pred=2 order by sub, obj;
-------------------
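As a side note for the much larger loads mentioned in this thread (> 1 billion triples): COPY INTO also accepts an expected record count, which lets MonetDB allocate the columns up front instead of growing them while loading. A minimal sketch, reusing the hypothetical file and table from above and assuming a rough row count of 13 million --

mclient -d lubm -lsql <<'EOF'
-- the count acts as a load limit / allocation hint; an over-estimate is fine
copy 13000000 records into lubm100u from '/path/to/datafile' using delimiters ':','\n';
EOF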
Then I run bulk queries like --
for (( i=1; i<=6000; i++ )); do
    echo $i
    sed -n "${i}p" /path/to/bulk_q_file | mclient -d lubm -lsql | grep -P "^\|\s+\d+\s+\|$" >> results_file
    if (( i % 1000 == 0 )); then
        echo "Shutting down MonetDB"
        monetdb stop lubm
        monetdbd stop /path/to/dbfarm/rdf
        sleep 10
        echo "Starting MonetDB"
        monetdbd start /path/to/dbfarm/rdf
        monetdb start lubm
        sleep 2
    fi
done
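To narrow down when the growth happens, one option (just a sketch, reusing the hypothetical paths from the loop above and the directory names mentioned earlier in the thread) is to log the size of the database directory and of the "HC" directory after each batch, while everything is shut down --

# run right after "monetdbd stop /path/to/dbfarm/rdf" in the loop above,
# before the daemon is started again
du -s /path/to/dbfarm/rdf/lubm        >> dbsize.log   # whole database directory
du -s /path/to/dbfarm/rdf/lubm/bat/HC >> dbsize.log   # heap-cache ("HC") directory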
My queries were simple single-join queries on a pair of vertically partitioned tables --
select count(*) from lubm100u_p1 as t1, lubm100u_p1 as t2 where t1.obj=t2.sub;
select count(*) from lubm100u_p1 as t1, lubm100u_p2 as t2 where t1.obj=t2.sub;
select count(*) from lubm100u_p1 as t1, lubm100u_p3 as t2 where t1.obj=t2.sub;
select count(*) from lubm100u_p1 as t1, lubm100u_p4 as t2 where t1.obj=t2.sub;
select count(*) from lubm100u_p1 as t1, lubm100u_p5 as t2 where t1.obj=t2.sub;
................... etc
AND
select count(*) from lubm100u_p1 as t1, lubm100u_p2 as t2 where t1.sub=t2.sub;
select count(*) from lubm100u_p1 as t1, lubm100u_p3 as t2 where t1.sub=t2.sub;
select count(*) from lubm100u_p1 as t1, lubm100u_p4 as t2 where t1.sub=t2.sub;
................. etc
AND
select count(*) from lubm100u_p1 as t1, lubm100u_p2 as t2 where t1.obj=t2.obj;
select count(*) from lubm100u_p1 as t1, lubm100u_p3 as t2 where t1.obj=t2.obj;
select count(*) from lubm100u_p1 as t1, lubm100u_p4 as t2 where t1.obj=t2.obj;
.................. etc.
I noticed that for very large datasets (~800 million rows in the original sub, pred, obj table), the simple merge-joins on the vertically partitioned tables kept crashing for some long-running queries, and at the time of a crash the disk sometimes filled up because the data size had grown to several times the original. I thought this might have happened because of the crash and the backup/recovery, but when I then ran the queries on a really tiny RDF dataset (~13 million rows), the server didn't crash, yet the size of the dbfarm still grew by as much as 25 times!
Dear Medha, dear all,

Just a few ideas and questions that jump to my mind:

- What OS are you running on?
- How did you measure the storage size, i.e., did you measure actual used disk space or allocated file size (the latter possibly including "unused holes")?
- You're talking about "merge-joins"; did you verify that MonetDB indeed uses merge-joins? A prerequisite would be that the data is sorted and that MonetDB "knows" about that ...
- During query processing, the dbfarm size can (and will) grow due to intermediate results; given that we (MonetDB) use memory-mapped files for virtual memory, it depends on the intermediate result sizes, your machine's physical memory size, and your OS whether or not the data actually gets written to disk ...
- Do I understand your loop correctly, that you bulk-load your data 6000 times???
- "HC" stands for "heap cache"; we might want to investigate why it grows so large.
- We might want to double-check that we are not leaking temporary BATs, although I have no other indication that we do ...

Best,
Stefan
On Wed, Nov 14, 2012 at 10:55 AM, Stefan Manegold wrote:
Dear Medha, dear all,
just a few ideas and questions that jump to my mind:
- What OS are you running on?
I am running the Ubuntu 12.04 distro with the 3.2.0-31-generic #50-Ubuntu SMP kernel on x86_64. Other than this, I don't see any disk-related problems on my machine.
- How did you measure the storage size, i.e., did you measure actual used disk space or allocated file size (the latter possibly including "unused holes")?
I went into the folder named "lubm" under the dbfarm path and ran "du -sh" while the server was shut down completely (no mserver/merovingian or any other MonetDB-related process was running on the machine, according to "ps -aef").
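For what it's worth, GNU du can report both the blocks actually in use and the apparent (allocated) file size; a large gap between the two would point at sparse files ("unused holes") rather than real data. A small sketch with the same hypothetical path --

du -sh /path/to/dbfarm/rdf/lubm                  # disk blocks actually used (du's default)
du -sh --apparent-size /path/to/dbfarm/rdf/lubm  # sum of file sizes, holes included

Since plain du already counts only blocks in use, sparse files alone would not explain the 5.8 GB reported above.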
- You're talking about "merge-joins"; did you verify that MonetDB indeed uses merge-joins? Prerequisite would be that the data is sorted and MonetDB "knows" about that ...
That was just my assumption, based on previously published papers. But yes, I do not know the internal details of MonetDB, nor have I touched the MonetDB code while running my data/queries on it.
- During query processing, the dbfarm size can (and will) grow due to intermediate results; given that we (MonetDB) use memory-mapped files for virtual memory, it depends on the intermediate result sizes, your machine's physical memory size, and your OS whether or not the data actually gets written to disk ...
Yes, that is what I understood as well, which is why I mentioned that the size I measured was after the server had been shut down completely. I presumed that, like any other server, MonetDB would not keep caches or intermediate results "across" multiple restarts of the server (though my knowledge of the internals of large-scale servers is limited).
- Do I understand your loop correctly, that you bulk-load your data 6000 times???
No. The data is loaded only once, and 6000 distinct queries are run on it one after another, with MonetDB shut down every 1000 queries (to release memory leaks, if there are any).
- "HC" stands for "heap cache"; we might want to investigate, why it grows so large.
- We might want to double-check that we are not leaking temporary BAT --- so I have no other indication that we do ...
Best, Stefan
----- Original Message -----
For bulk loading the RDF data in triples form (triples represented simply as integers 1:2:3 on each line of datafile to load, where 1 is an ID to which subject value is mapped, 2 to which predicate string is mapped, and 3 is what object is mapped to), I use commands and SQL statements like --
$ monetdbd create /path/to/dbfarm/rdf $ monetdbd start /path/to/dbfarm $ monetdb create lubm $ monetdb release lubm $ mclient -d lubm -lsql < /path/to/bulk_load_file
bulk_load_file contains -- -------------------------- create table lubm100u(sub int, pred int, obj int); copy into lubm100u from '/path/to/datafile' using delimiters ':','\n';
create table lubm100u_p1(sub int, obj int); create table lubm100u_p2(sub int, obj int);
insert into lubm100u_p1 select sub, obj from lubm100u where pred=1 order by sub, obj; insert into lubm100u_p2 select sub, obj from lubm100u where pred=2 order by sub, obj; ------------------- Then I run bulk queries like --
for (( i=1; i <=6000; i++ )); do echo $i sed -n "${i}p" /path/to/bulk_q_file | mclient -d lubm -lsql | grep -P "^\|\s+\d+\s+\|$" >> results_file if (( i % 1000 == 0 )); then echo "Shutting down MonetDB" monetdb stop lubm monetdbd stop /path/to/dbfarm/rdf sleep 10 echo "Starting MonetDB" monetdbd start /path/to/dbfarm/rdf monetdb start lubm sleep 2 fi done
My queries were simple 1 join on a pair of vertically partitioned tables --
select count(*) from lubm100u_p1 as t1, lubm100u_p1 as t2 where t1.obj=t2.sub; select count(*) from lubm100u_p1 as t1, lubm100u_p2 as t2 where t1.obj=t2.sub; select count(*) from lubm100u_p1 as t1, lubm100u_p3 as t2 where t1.obj=t2.sub; select count(*) from lubm100u_p1 as t1, lubm100u_p4 as t2 where t1.obj=t2.sub; select count(*) from lubm100u_p1 as t1, lubm100u_p5 as t2 where t1.obj=t2.sub; ................... etc AND
select count(*) from lubm100u_p1 as t1, lubm100u_p2 as t2 where t1.sub=t2.sub; select count(*) from lubm100u_p1 as t1, lubm100u_p3 as t2 where t1.sub=t2.sub; select count(*) from lubm100u_p1 as t1, lubm100u_p4 as t2 where t1.sub=t2.sub; ................. etc
AND
select count(*) from lubm100u_p1 as t1, lubm100u_p2 as t2 where t1.obj=t2.obj; select count(*) from lubm100u_p1 as t1, lubm100u_p3 as t2 where t1.obj=t2.obj; select count(*) from lubm100u_p1 as t1, lubm100u_p4 as t2 where t1.obj=t2.obj; .................. etc.
I noticed that for very large datasets (~800 million rows in original sub, pred, obj table), the simple merge-joins on vertically partitioned tables kept on crashing for some long running queries. And at the time of crash sometimes the disk got full because datasize increased several times more than original one. I thought that due to crash and backup/recovery this may have happened, but then when I tried running queries on really tiny RDF dataset (~13 million rows), the server didn't crash but the size of db farm increased as much as 25 times!
On Wed, Nov 14, 2012 at 8:43 AM, Lefteris
wrote: Hi Medha,
can you tell us a bit more how you load the data, which SQL statements you used and how you did the vertical partitioning?
What you are describing should not be happening.
thank you
On Wed, Nov 14, 2012 at 2:25 PM, Medha Atre
wrote: Just wondering if anyone could throw a light on this issue?
---------- Forwarded message ---------- From: Medha Atre
Date: Mon, Nov 12, 2012 at 5:11 PM Subject: growing size of database To: Communication channel for MonetDB users Hi,
I am running the latest MonetDB 11.13.3. I loaded a pretty tiny RDF dataset of 13 million triples and created some "vertical partitions" on predicate values. After having done dataset loading, I checked the size of the dataset folder and it showed to be _258 MB_. Then I ran about 6000+ bulk queries by periodically shutting down the server after about 1000 queries each time. The queries finished successfully and there was no crash. After the final query I stopped the database "monetdb stop dbname" and shutdown the main server too "monetdbd stop /path/to/dbfarm".
After that I again checked the size of the database folder and then it showed it to be _5.8GB_ and the "HC" folder under "bat" was shown to have size of about 3 GB with some other files having sizes between 300-700 MBs.
It seems mysterious to me why would the size of database go on increasing as much as 25 times the original size.
I intend to load much larger data (> 1 billion triples), and want to avoid growing disk size of the dataset due to limited disk-space.
Can someone please enlighten?
Medha _______________________________________________ users-list mailing list users-list@monetdb.org http://mail.monetdb.org/mailman/listinfo/users-list
_______________________________________________ users-list mailing list users-list@monetdb.org http://mail.monetdb.org/mailman/listinfo/users-list
_______________________________________________ users-list mailing list users-list@monetdb.org http://mail.monetdb.org/mailman/listinfo/users-list
_______________________________________________ users-list mailing list users-list@monetdb.org http://mail.monetdb.org/mailman/listinfo/users-list
_______________________________________________ users-list mailing list users-list@monetdb.org http://mail.monetdb.org/mailman/listinfo/users-list
On Wed, Nov 14, 2012 at 11:04:06AM -0500, Medha Atre wrote:
On Wed, Nov 14, 2012 at 10:55 AM, Stefan Manegold wrote:
[...]
- You're talking about "merge-joins"; did you verify that MonetDB indeed uses merge-joins? A prerequisite would be that the data is sorted and that MonetDB "knows" about that ...

That was just my assumption, based on previously published papers. But yes, I do not know the internal details of MonetDB, nor have I touched the MonetDB code while running my data/queries on it.
In case your join attributes are both sorted, MonetDB will use a merge-join; otherwise, it will use a hash-join, unless your join columns exceed the physical memory size.
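One way to check which join MonetDB actually picks (a sketch, reusing one of the example queries from earlier in the thread; the exact operator names in the plan differ between MonetDB versions) is to look at the MAL plan printed by EXPLAIN --

mclient -d lubm -lsql <<'EOF' | grep -i join
EXPLAIN SELECT count(*)
FROM lubm100u_p1 AS t1, lubm100u_p2 AS t2
WHERE t1.obj = t2.sub;
EOF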
- During query processing, the dbfarm size can (and will) grow due to intermediate results; given that we (MonetDB) use memory-mapped files for virtual memory, it depends on the intermediate result sizes, your machine's physical memory size, and your OS whether or not the data actually gets written to disk ...
Yes, that is what I understood as well, which is why I mentioned that the size I measured was after the server had been shut down completely. I presumed that, like any other server, MonetDB would not keep caches or intermediate results "across" multiple restarts of the server (though my knowledge of the internals of large-scale servers is limited).
Clean-up of "left-overs" is done at server (re-)start, not at shutdown. Of course, no longer used intermediates are clean-up instantly (while server is running), unless deemed useful for reuse via the heap-cache.
- Do I understand your loop correctly, that you bulk-load your data 6000 times???

No. The data is loaded only once, and 6000 distinct queries are run on it one after another, with MonetDB shut down every 1000 queries (to release memory leaks, if there are any).
Thanks --- I misread "bulk_q_file" as "bulk_load_file" ...

Stefan