Hi,

I'm sorry, my count was off by an order of magnitude. We have about 200,000 rows, so that's ~32 GB if everything is packed perfectly (200,000 rows x ~20,000 BIGINT columns x 8 bytes). The compressed input files are about 600 MB. We have a lot of zeros.

We've tried two versions:

$ monetdb --version
MonetDB Database Server Toolkit v1.0 (Oct2012-SP3)
$ monetdb --version
MonetDB Database Server Toolkit v1.0 (Oct2012-SP2)

Both were actually built from source (we have some UDFs). I checked out the repo and updated to the tags for those releases before adding our UDF code and building MonetDB.

I can't share the actual data, but I may be able to reproduce this with a synthetic data set. I probably won't have time to do so until at least tomorrow.

Thanks,
Andrew Duffey

________________________________________
From: users-list-bounces+aduffey=merkleinc.com@monetdb.org [users-list-bounces+aduffey=merkleinc.com@monetdb.org] on behalf of Stefan Manegold [Stefan.Manegold@cwi.nl]
Sent: Thursday, February 07, 2013 5:17 PM
To: Communication channel for MonetDB users
Subject: Re: High memory usage while loading data with many columns

How large are your compressed input files? Could you share that data with us so that we could try to reproduce the problem?

Stefan

----- Original Message -----
Hi,
Which version of MonetDB are you using? (Maybe you can share the output of `mserver5 --version`?)
Which flavour of Linux are you running on?
Do I understand correctly that your data consists of about 20,000 x 20,000 BIGINT values, i.e., should not take more than about 3.2 GB?
Stefan
----- Original Message -----
Hi,
I tried loading one large file (about 60% of our data), uncompressed, with the record count specified, and with "LOCKED". I got the same problem. On systems with swap, we see it start to use up swap space as well until the system becomes unresponsive.
We can try to turn off the OOM killer, but that concerns me a bit. There are only a few other processes running on this server for administration, but I'm not sure if they will be able to deal with failed memory allocations if MonetDB uses up all available physical memory.
Thanks,
Andrew Duffey

________________________________________
From: users-list-bounces+aduffey=merkleinc.com@monetdb.org [users-list-bounces+aduffey=merkleinc.com@monetdb.org] on behalf of Stefan Manegold [Stefan.Manegold@cwi.nl]
Sent: Thursday, February 07, 2013 3:11 PM
To: Communication channel for MonetDB users
Subject: Re: High memory usage while loading data with many columns
Hi Andrew,
you might want to consider disabling the system's OOM killer on dedicated database servers. MonetDB owes most of its speed advantages to aggressively using all available physical memory.
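For reference, a minimal sketch of how one could exempt a running mserver5 process from the Linux OOM killer; the exact interface depends on the kernel version (oom_score_adj on newer kernels, oom_adj on older ones), and looking the process up via pidof is an assumption about how the server was started:

#!/bin/bash
# Sketch: lower the OOM-killer priority of a running mserver5 process.
# Requires root; assumes a single mserver5 process is running.
PID=$(pidof mserver5)

if [ -e "/proc/$PID/oom_score_adj" ]; then
    # newer kernels: -1000 exempts the process from OOM killing
    echo -1000 > "/proc/$PID/oom_score_adj"
else
    # older kernels: -17 (OOM_DISABLE) has the same effect
    echo -17 > "/proc/$PID/oom_adj"
fi

Settings like this are inherited by child processes, so adjusting monetdbd before it starts mserver5 should also work.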
Also, you might want to give the total number of records to be loaded with the first COPY INTO statement to avoid the need for successive storage extensions, e.g., as follows:
#!/bin/bash
# USAGE: ./load [db] [table] /full/path/to/files/*

set -e
set -u

MDB_DATABASE=$1
shift
MDB_TABLE=$1
shift

# First pass: count the total number of records across all input files.
COUNT=0
for FILE in "$@" ; do
    NOW=$(date +"%c")
    echo "Counting file ${FILE} at $NOW"
    COUNT=$(( COUNT + $(zcat "$FILE" | wc -l) ))
done

# Load the first file, announcing the total record count so MonetDB can
# size the storage for the entire table up front.
FILE="$1"
NOW=$(date +"%c")
echo "Loading file ${FILE} at $NOW"
mclient -d "$MDB_DATABASE" -s "COPY $COUNT RECORDS INTO $MDB_TABLE FROM '$FILE';"
shift

# Load the remaining files into the already-sized table.
for FILE in "$@" ; do
    NOW=$(date +"%c")
    echo "Loading file ${FILE} at $NOW"
    mclient -d "$MDB_DATABASE" -s "COPY INTO $MDB_TABLE FROM '$FILE';"
done
Loading one large file basically does the same trick ...
Best, Stefan
----- Original Message -----
Hi,
It seems to be using quite a bit of physical memory. We just tried to load on a machine with 24 GB of memory, and eventually the Linux OOM killer was invoked. We see resident set sizes > 20 GB, and virtual memory usage is only a bit higher.
Here are some more details. We have a very wide table, >20,000 columns (all bigint), that we are trying to load. We have about as many rows. Our input files are gzipped and split into 200 parts (so there are only about 100 lines per file). Here is our load script:
#!/bin/bash
# USAGE: ./load [db] [table] /full/path/to/files/*

set -e
set -u

MDB_DATABASE=$1
MDB_TABLE=$2
LAST=$BASH_ARGV

FILE=''
while [[ "$FILE" != "$LAST" ]]; do
    shift || break
    FILE=$2
    NOW=$(date +"%c")
    echo "Loading file ${FILE} at $NOW"
    COUNT=$(zcat $FILE | wc -l)
    STMT="COPY $COUNT RECORDS INTO $MDB_TABLE FROM '$FILE';"
    echo $STMT | mclient $MDB_DATABASE
done
What we see is that memory usage starts off low, but rises as we load more files. Sometimes it decreases a bit (like something is being flushed to disk) but it subsequently rises higher than before, until we run out of memory. We got through about 70 files on our 24GB machine. Previously, we've loaded similar data on a 34GB machine, 100 files at a time, restarting in between.
We're planning on trying the following:
1. Load one large file.
2. Load a few files, stop and start the db and dbfarm, load a few more, etc.
Thanks,
Andrew Duffey

________________________________________
From: users-list-bounces+aduffey=merkleinc.com@monetdb.org [users-list-bounces+aduffey=merkleinc.com@monetdb.org] on behalf of Stefan Manegold [Stefan.Manegold@cwi.nl]
Sent: Monday, February 04, 2013 1:26 PM
To: Communication channel for MonetDB users
Subject: Re: High memory usage while loading data with many columns
Hi Andrew,
There is no option to reduce the (virtual) memory usage of MonetDB during bulk load. MonetDB needs to have all columns active in its address space concurrently during bulk load, and hence needs as much address space as required to hold the entire table. See also http://www.monetdb.org/Documentation/UserGuide/resources
However, address space does not necessarily mean physical memory. We easily load 100 GB TPCH (including a 70+GB lineitem table) on an 8 GB machine.
If you're loading in isolation (i.e., without concurrent queries), you might want to consider using the "LOCKED" option of COPY INTO to avoid one extra copy; cf. http://www.monetdb.org/Documentation/Manuals/SQLreference/CopyInto .
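For illustration, a sketch of what such a statement could look like (the table name, file path, and record count below are placeholders, not your actual ones):

COPY 200000 RECORDS INTO mytable FROM '/full/path/to/data.csv' LOCKED;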
Also, in case your table has constraints such as primary (or foreign, if you had more than one table) keys or "NOT NULL", we advise you to initially create the table without any constraints, then bulk load the data via COPY INTO, and only then add the constraints one by one using ALTER TABLE ... ADD CONSTRAINT.
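A small sketch of that order of operations, with hypothetical table and column names:

-- create the table without constraints
CREATE TABLE mytable (id BIGINT, val1 BIGINT, val2 BIGINT);

-- bulk load first (record count and file path are placeholders)
COPY 200000 RECORDS INTO mytable FROM '/full/path/to/data.csv';

-- only then add the constraints, one by one
ALTER TABLE mytable ALTER COLUMN id SET NOT NULL;
ALTER TABLE mytable ADD CONSTRAINT mytable_pk PRIMARY KEY (id);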
We'd be curious to learn more about the crash with the "select *" query. Is there any error message on the mserver5 console or in the monetdbd / merovingian log file? Could you possibly build a debugging version of MonetDB (configured with --disable-optimize --enable-debug --enable-assert), run it in a debugger (e.g., gdb), and check where (and why) it crashes?
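For what it's worth, a rough sketch of what such a debug build and gdb session could look like; the install prefix, dbfarm path, and database name are placeholders, and the bootstrap step is only needed when building from an hg checkout rather than a release tarball:

# Sketch: build a debugging version of MonetDB and run mserver5 under gdb.
hg update Oct2012-SP3                       # the release tag you build from
./bootstrap                                 # only for hg checkouts, not tarballs
./configure --disable-optimize --enable-debug --enable-assert --prefix="$HOME/monetdb-debug"
make && make install

# run the server under gdb and reproduce the crash
gdb --args "$HOME/monetdb-debug/bin/mserver5" --dbpath=/path/to/dbfarm/mydb
# inside gdb: type "run", issue the "select *" query from mclient,
# and after the crash use "bt" to get a backtrace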
Thanks!
Best, Stefan
----- Original Message -----
Hi,
I'm loading data into a table with >20,000 columns and several hundred thousand rows using "COPY INTO". It works, but I can't seem to make it take less than about 30GB of memory. I've tried specifying the number of records as well. Is there any way to reduce memory usage when loading? Is this much memory usage expected? It's a shame because our queries take very little memory. We have to size our machine simply to handle the load.
It's impressive (and very convenient) that I can actually have a table with so many columns, but are there any other limitations I might come across? I've just been trying to keep the column count in the queries low. A "select *" seems to crash MonetDB.
Thanks, Andrew
_______________________________________________ users-list mailing list users-list@monetdb.org http://mail.monetdb.org/mailman/listinfo/users-list