High memory usage while loading data with many columns
Hi,

I'm loading data into a table with >20,000 columns and several hundred thousand rows using "COPY INTO". It works, but I can't seem to make it take less than about 30GB of memory. I've tried specifying the number of records as well. Is there any way to reduce memory usage when loading? Is this much memory usage expected? It's a shame, because our queries take very little memory; we have to size our machine simply to handle the load.

It's impressive (and very convenient) that I can actually have a table with so many columns, but are there any other limitations I might come across? I've just been trying to keep the column count in the queries low. A "select *" seems to crash MonetDB.

Thanks,
Andrew

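A minimal sketch of the kind of statement being described, with the optional record count in front of RECORDS; the database name, table name, file path, and count below are placeholders chosen purely for illustration, not values from this thread:

# Illustrative only: all names, paths, and counts are placeholders.
mclient -d mydb -s "COPY 100000 RECORDS INTO wide_table FROM '/data/parts/part-001.csv';"
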
Hi Andrew,

There is no option to reduce the (virtual) memory usage of MonetDB during bulk load. MonetDB needs to have all columns active in its address space concurrently during bulk load, and hence needs as much address space as required to hold the entire table. See also http://www.monetdb.org/Documentation/UserGuide/resources

However, address space does not necessarily mean physical memory. We easily load 100 GB TPC-H (including a 70+ GB lineitem table) on an 8 GB machine.

If you're loading in isolation (i.e., without concurrent queries), you might want to consider using the "LOCKED" option of COPY INTO to avoid one extra copy; cf. http://www.monetdb.org/Documentation/Manuals/SQLreference/CopyInto

Also, in case your table has constraints like primary (or foreign, if you had more than one table) keys or "NOT NULL", we advise to initially create the table without any constraints, then bulk-load the data via COPY INTO, and only then add the constraints one by one using ALTER TABLE ... ADD CONSTRAINT.

We'd be curious to learn more about the crash with the "select *" query. Is there any error message on the mserver5 console or in the monetdbd / merovingian log file? Could you possibly build a debugging version of MonetDB (configured with --disable-optimize --enable-debug --enable-assert), run it in a debugger (e.g., gdb), and check where (and why) it crashes?

Thanks!

Best,
Stefan

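A minimal sketch of the suggested order of operations, combining the LOCKED option with deferred constraints; the table, column, and constraint names and the record count are placeholders, and the statements are issued through mclient just as the scripts later in this thread do:

# Sketch only: create without constraints, bulk-load with LOCKED, add constraints last.
# All names, paths, and counts are placeholders, not values from this thread.
mclient -d mydb -s "CREATE TABLE wide_table (c1 BIGINT, c2 BIGINT, c3 BIGINT);"
mclient -d mydb -s "COPY 200000 RECORDS INTO wide_table FROM '/data/all_rows.csv' LOCKED;"
mclient -d mydb -s "ALTER TABLE wide_table ADD CONSTRAINT wide_table_pk PRIMARY KEY (c1);"

As described above, LOCKED lets the server skip an extra copy during the load, which is why it is only suggested when loading in isolation, without concurrent queries.
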
Hi,

It seems to be using quite a bit of physical memory. We just tried to load on a machine with 24GB of memory and eventually the Linux OOM killer was invoked. We see resident set sizes > 20GB, and virtual memory usage is only a bit higher.

Here are some more details. We have a very wide table, >20,000 columns (all bigint), that we are trying to load. We have about as many rows. Our input files are gzipped and split into 200 parts (so there are only about 100 lines per file). Here is our load script:

#!/bin/bash
# USAGE: ./load [db] [table] /full/path/to/files/*

set -e
set -u

MDB_DATABASE=$1
MDB_TABLE=$2
LAST=$BASH_ARGV

FILE=''
while [[ "$FILE" != "$LAST" ]]; do
    shift || break
    FILE=$2
    NOW=$(date +"%c")
    echo "Loading file ${FILE} at $NOW"
    COUNT=$(zcat $FILE | wc -l)
    STMT="COPY $COUNT RECORDS INTO $MDB_TABLE FROM '$FILE';"
    echo $STMT | mclient $MDB_DATABASE
done

What we see is that memory usage starts off low, but rises as we load more files. Sometimes it decreases a bit (like something is being flushed to disk), but it subsequently rises higher than before, until we run out of memory. We got through about 70 files on our 24GB machine. Previously, we've loaded similar data on a 34GB machine, 100 files at a time, restarting in between.

We're planning on trying the following:

1. Load one large file.
2. Load a few files, stop and start the db and dbfarm, load a few more, etc.

Thanks,
Andrew Duffey

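For the "stop and start the db and dbfarm" step in item 2 above, a hedged sketch of the commands that would typically be involved, assuming the standard monetdb/monetdbd tools; the database name and dbfarm path are placeholders:

# Hypothetical restart between load batches; "mydb" and the dbfarm path are placeholders.
monetdb stop mydb
monetdbd stop /path/to/dbfarm
monetdbd start /path/to/dbfarm
monetdb start mydb
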
Hi Andrew,

You might want to consider disabling the system's OOM killer on dedicated database servers. MonetDB owes most of its speed advantages to aggressively using all available physical memory.

Also, you might want to give the total number of records to be loaded with the first COPY INTO statement, to avoid the need for successive storage extensions, e.g., as follows:

#!/bin/bash
# USAGE: ./load [db] [table] /full/path/to/files/*

set -e
set -u

MDB_DATABASE=$1
shift
MDB_TABLE=$1
shift

COUNT=0
for FILE in "$@" ; do
    NOW=$(date +"%c")
    echo "Counting file ${FILE} at $NOW"
    COUNT=$(( COUNT + $(zcat "$FILE" | wc -l) ))
done

FILE="$1"
NOW=$(date +"%c")
echo "Loading file ${FILE} at $NOW"
mclient -d "$MDB_DATABASE" -s "COPY $COUNT RECORDS INTO $MDB_TABLE FROM '$FILE';"
shift

for FILE in "$@" ; do
    NOW=$(date +"%c")
    echo "Loading file ${FILE} at $NOW" >&2
    mclient -d "$MDB_DATABASE" -s "COPY INTO $MDB_TABLE FROM '$FILE';"
done

Loading one large file basically does the same trick ...

Best,
Stefan

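On the OOM-killer point, rather than disabling it system-wide one could exempt just the server process. The following is a generic Linux sketch using /proc/<pid>/oom_score_adj, not something prescribed in this thread, and it assumes a single running mserver5 process:

# Hypothetical: exempt the mserver5 process from the OOM killer (run as root);
# -1000 disables OOM killing for that PID.
echo -1000 > /proc/"$(pgrep -x mserver5)"/oom_score_adj

The revised load script above would then be invoked along the lines of "./load mydb wide_table /data/parts/*.gz", with all names again being placeholders.
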
Hi,

I tried loading one large file (about 60% of our data), uncompressed, given the row count, and with "LOCKED". I got the same problem. On systems with swap, we see it start to use up the swap space as well until the system is unresponsive.

We can try to turn off the OOM killer, but that concerns me a bit. There are only a few other processes running on this server for administration, but I'm not sure if they will be able to deal with failed memory allocations if MonetDB uses up all available physical memory.

Thanks,
Andrew Duffey

Hi,

Which version of MonetDB are you using? (Maybe you can share the output of `mserver5 --version`?)

Which flavour of Linux are you running on?

Do I understand correctly that your data consists of about 20,000 x 20,000 BIGINT values, i.e., should not take more than about 3.2 GB (20,000 columns x 20,000 rows x 8 bytes per BIGINT)?

Stefan

How large are your compressed input files?

Could you share that data with us, so that we could try to reproduce the problem?

Stefan

Hi,

I'm sorry, my count was off by an order of magnitude. We have about 200,000 rows. So that's ~32GB if everything is packed perfectly (20,000 columns x 200,000 rows x 8 bytes per BIGINT).

The compressed input files are about 600MB. We have a lot of zeros.

We've tried two versions:

$ monetdb --version
MonetDB Database Server Toolkit v1.0 (Oct2012-SP3)

$ monetdb --version
MonetDB Database Server Toolkit v1.0 (Oct2012-SP2)

Both were actually built from source (we have some UDFs). I checked out the repo and updated to the tags for those releases before adding our UDF code and building MonetDB.

I can't share the actual data, but I may be able to reproduce this with a synthetic data set. I probably won't have time to do so until at least tomorrow.

Thanks,
Andrew Duffey

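A hedged sketch of how such a synthetic data set could be generated for reproduction; the column count, rows per part, value distribution, and the use of '|' as the field separator (MonetDB's COPY INTO default) are all assumptions for illustration, not details taken from this thread:

#!/bin/bash
# Hypothetical generator: 20,000 bigint columns, 200 gzipped parts of 1,000 rows each
# (200,000 rows in total), with ~90% zeros to mimic sparse data. All sizes and the
# value distribution are assumptions.
COLS=20000
ROWS_PER_PART=1000
PARTS=200
for ((p = 1; p <= PARTS; p++)); do
    awk -v cols="$COLS" -v rows="$ROWS_PER_PART" 'BEGIN {
        srand();
        for (r = 0; r < rows; r++) {
            line = "";
            for (c = 0; c < cols; c++) {
                v = (rand() < 0.9) ? 0 : int(rand() * 1000);
                line = (c == 0) ? v : line "|" v;
            }
            print line;
        }
    }' | gzip > "part-$(printf '%03d' "$p").csv.gz"
done
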
FYI, here's what top says on a 4-core hyper-threading Intel i7 2600K with 16 GB RAM running Fedora 16 (AFAICT no OOM killer active by default) while loading TPC-H SF-100:

Mem:  16344600k total, 16173872k used,   170728k free,    41536k buffers
Swap: 33554428k total,    82248k used, 33472180k free, 14848284k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
27273 manegold  20   0  117g  9.8g 9.5g S 82.7 63.0  26:41.52 mserver5

Best,
Stefan

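To track the same VIRT/RES numbers over the course of a long load, a minimal sketch along these lines could be used; the 30-second interval is arbitrary and a single mserver5 process is assumed:

# Hypothetical: log virtual size and resident set size (both in KB, as reported
# by ps) of the mserver5 process every 30 seconds.
while sleep 30; do
    echo "$(date '+%F %T') vsz/rss(KB): $(ps -o vsz=,rss= -p "$(pgrep -x mserver5)")"
done
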
--
| Stefan.Manegold @ CWI.nl | DB Architectures (INS1) |
| http://CWI.nl/~manegold/ | Science Park 123 (L321) |
| Tel.: +31 (0)20 592-4212 | 1098 XG Amsterdam (NL)  |

participants (2): Andrew Duffey, Stefan Manegold