Re: [MonetDB-users] Loading problems since FEB2010+SP1 release

26 Apr 2010

      On Tue, Apr 20, 2010 at 11:14:54PM +0200, Martin Kersten wrote:
...
Nozhup wrote:
...
Since the FEB2010+SP1 release we're encountering loading issues.
With the FEB2010 release we were able to load records until the HDD was full
(500+ million records) on PCs with 2, 3 or 24 GB of RAM without the system
starting to swap. This was done by loading CSV files with 1 million records
each through the COPY INTO command (in mclient and through a C script).
Since the FEB2010+SP1 release the system starts to swap after loading the
first couple of million records. This results in increasing load times per
loaded CSV file:
1st million: 14 sec
2nd million: 26 sec
3rd million: 44 sec
4th million: 50 sec
5th million: 69 sec
etc.
We have tested this on several PCs and all PCs encounter the same problem.
Test files used are the same for both releases as is the OS (Ubuntu 9.10).
The only variable that has changed is MonetDB.
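
For illustration, the slowdown in the timings reported above can be quantified
with a few lines of Python. This is only a sketch: the list holds the five load
times quoted above, and the function name is made up for this example.

```python
# Load times (seconds) per 1M-record CSV file, as reported above.
load_times = [14, 26, 44, 50, 69]


def slowdown_factors(times):
    """Return each load time relative to the first one."""
    first = times[0]
    return [t / first for t in times]


factors = slowdown_factors(load_times)
total = sum(load_times)

for n, (t, f) in enumerate(zip(load_times, factors), start=1):
    print(f"file {n}: {t} sec ({f:.2f}x the first file)")
print(f"total: {total} sec")
```

On these numbers the fifth file takes almost five times as long as the first
(69 / 14), and the five files together take 203 sec.
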
To analyse the situation, the schema should be known, in particular if there
are any constraints defined over it.
This problem could not be reproduced on the ontime benchmark database (120M records
in batches of 0.5M).
Investigation is still ongoing...
I also looked at loading the AirTraffic OnTime benchmark (single table with
90+ columns, loaded from 260 .csv files, each consisting of ~0.5M records,
i.e., almost 130M records in total). This is the closest we have / can do to
simulate the originally reported problem.

I tested the loading with Feb2010, Feb2010-SP1, Feb2010-SP2 preview;
all compiled from the respective super source tarball, on my 64-bit Fedora
12 desktop with 8 GB RAM.

With Feb2010, loading times vary between 10 and 15 sec per .csv file for the
first files and then slowly increase towards 15 to 25 sec per .csv file at
the end of the series. Both the variation and the slight increase can be
explained by the fact that the files vary in size and number of records,
with the first files starting at just over 0.4M records, while the last ones
hold almost 0.6M records.
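
To separate the effect of file size from a genuine slowdown, each load time can
be normalized by the number of records in the file. The sketch below assumes
the record count per file is known; the function name and the two sample
measurements are hypothetical and are not taken from the test runs.

```python
def seconds_per_million(load_seconds, num_records):
    """Normalize a load time to seconds per one million records."""
    if num_records <= 0:
        raise ValueError("num_records must be positive")
    return load_seconds * 1_000_000 / num_records


# Hypothetical values, chosen only to illustrate the point:
# a small early file and a larger late file.
early = seconds_per_million(10.0, 400_000)
late = seconds_per_million(15.0, 600_000)
print(early, late)
```

If the normalized values stay flat across the series, the increase in raw load
time is explained by file size alone.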

With Feb2010-SP1 & Feb2010-SP2, I see roughly the same pattern as the baseline.
However, "once in a while" loading one .csv file takes much longer, namely
between 100 & 400(!) sec. The frequency of such outliers is rather low in
the beginning (at most one in 10), but becomes every other file at the end of
the series.
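
One simple way to flag such outliers in a list of per-file load times is to
compare each time against a multiple of the median. This is a sketch only: the
threshold factor of 4 is an arbitrary assumption, and the sample times are
hypothetical, not measured data.

```python
from statistics import median


def find_outliers(times, factor=4.0):
    """Return (index, time) pairs whose time exceeds factor * median."""
    typical = median(times)
    return [(i, t) for i, t in enumerate(times) if t > factor * typical]


# Hypothetical per-file load times in seconds.
sample = [12, 14, 13, 250, 15, 18, 120, 20]
outliers = find_outliers(sample)
print(outliers)
```

The median is used rather than the mean because a few very slow loads would
otherwise pull the reference value up and hide the outliers.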

Moreover, I noticed that as of loading the second file into the same table,
extra *.new files are created / left behind for most of the *.tail & *.theap
files in the dbfarm, making the dbfarm about twice as large as with Feb2010.
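
A quick way to check for this on another installation is to walk the dbfarm
directory and total the size of the *.new files. The sketch below uses only
the Python standard library; the function name is made up for this example,
and the dbfarm path has to be supplied by the caller since it depends on the
installation.

```python
import os


def summarize_new_files(root):
    """Walk root; return (count of *.new files, their bytes, bytes of all other files)."""
    new_count = 0
    new_bytes = 0
    other_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            size = os.path.getsize(os.path.join(dirpath, name))
            if name.endswith(".new"):
                new_count += 1
                new_bytes += size
            else:
                other_bytes += size
    return new_count, new_bytes, other_bytes
```

If the *.new files add up to roughly the same size as everything else, that
matches the doubling of the dbfarm described above.
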
...
From the differences between the Feb2010 codebase and the Feb2010-SP1
codebase (and the log messages of the changes that happened in
between) I could not yet conclude which change(s) trigger(s) this
misbehaviour of Feb2010-SP1 & Feb2010-SP2 with multiple "COPY INTO" into the
same table. More investigation is pending ...
Stefan
...
Martin
-- 
| Dr. Stefan Manegold | mailto:Stefan.Manegold@cwi.nl |
| CWI,  P.O.Box 94079 | http://www.cwi.nl/~manegold/  |
| 1090 GB Amsterdam   | Tel.: +31 (20) 592-4212       |
| The Netherlands     | Fax : +31 (20) 592-4199       |