On Tue, Apr 20, 2010 at 11:14:54PM +0200, Martin Kersten wrote:
Nozhup wrote:
Since the FEB2010+SP1 release we're encountering loading issues.
With the FEB2010 release we were able to load records until the HDD was full (500+ million records) on PCs with 2, 3 or 24 GB of RAM without going to swap. This was done by loading CSV files with 1 million records each through the COPY INTO command (in mclient and through a C script).
Since the FEB2010+SP1 release the system starts to swap after loading the first couple of million records. This results in increased load times per loaded CSV file:
1st million: 14 sec
2nd million: 26 sec
3rd million: 44 sec
4th million: 50 sec
5th million: 69 sec
etc.
We have tested this on several PCs and all PCs encounter the same problem. Test files used are the same for both releases as is the OS (Ubuntu 9.10). The only variable that has changed is MonetDB.
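For anyone who wants to reproduce the per-file timings above, here is a minimal sketch of a load-timing harness. It is not the script used in the original report. The names copy_into_statement, time_loads and the execute callback are hypothetical; execute stands for whatever sends one SQL statement to the server (for example a wrapper around mclient or a client library). The comma field separator and newline record separator are assumptions about the CSV layout, and the server generally needs an absolute file path.

```python
import time
from typing import Callable, Iterable, List, Tuple


def copy_into_statement(table: str, csv_path: str, records: int) -> str:
    """Build one COPY INTO statement for a single CSV file.

    Assumes comma-separated fields and newline-terminated records.
    The '\\n' below ends up as the two characters backslash + n in the SQL text.
    """
    return (
        f"COPY {records} RECORDS INTO {table} "
        f"FROM '{csv_path}' USING DELIMITERS ',','\\n';"
    )


def time_loads(
    files: Iterable[str],
    table: str,
    records: int,
    execute: Callable[[str], None],
    clock: Callable[[], float] = time.perf_counter,
) -> List[Tuple[str, float]]:
    """Load each file in order and return (path, elapsed seconds) pairs.

    `execute` is a hypothetical callback that runs one SQL statement
    against the server and returns when the statement has finished.
    """
    timings = []
    for path in files:
        stmt = copy_into_statement(table, path, records)
        start = clock()
        execute(stmt)
        timings.append((path, clock() - start))
    return timings
```

Running the same file list against both releases and comparing the two timing lists should make a steady growth in load time (as in the numbers above) easy to see.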
To analyse the situation, the schema should be known, in particular if there are any constraints defined over it.
This problem could not be reproduced on the ontime benchmark database (120M records in batches of 0.5M).
Investigation is still ongoing...
I also looked at loading the AirTraffic OnTime benchmark (single table with 90+ columns, loaded from 260 .csv files, each consisting of ~0.5M records, i.e., almost 130M records in total). This is the closest we have / can do to simulate the originally reported problem.

I tested the loading with Feb2010, Feb2010-SP1 and the Feb2010-SP2 preview; all compiled from the respective super source tarball, on my 64-bit Fedora 12 desktop with 8 GB RAM.

With Feb2010, loading times vary between 10 and 15 sec per .csv file for the first files and then slowly increase towards 15 to 25 sec per .csv file at the end of the series. Both the variation and the slight increase can be explained by the fact that the files vary in size and number of records, with the first files starting at just over 0.4M records, while the last ones hold almost 0.6M records.

With Feb2010-SP1 & Feb2010-SP2, I see roughly the same pattern as the baseline. However, "once in a while" loading one .csv file takes much longer, namely between 100 & 400(!) sec. The frequency of such outliers is rather low in the beginning (at most one in 10), but rises to every other file at the end of the series.

Moreover, I noticed that as of loading the second file into the same table, extra *.new files are created / left behind for most of the *.tail & *.theap files in the dbfarm, making the dbfarm about twice as large as with Feb2010.
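To check the dbfarm growth described above on another installation, something like the following sketch could be used. It walks a dbfarm directory and sums the sizes of the *.tail / *.theap files and of their *.new companions separately. The function name dbfarm_usage is hypothetical, and the sketch assumes the leftover files are named by appending ".new" to the original file name (e.g. "100.tail.new"); adjust the suffixes if your dbfarm looks different.

```python
from pathlib import Path
from typing import Dict

# Suffixes of the column files mentioned above.
BASE_SUFFIXES = (".tail", ".theap")
# Assumed naming of the leftover files: original name plus ".new".
NEW_SUFFIXES = tuple(s + ".new" for s in BASE_SUFFIXES)


def dbfarm_usage(dbfarm: str) -> Dict[str, int]:
    """Return total bytes of base files and of '*.new' files under dbfarm."""
    totals = {"base": 0, "new": 0}
    for path in Path(dbfarm).rglob("*"):
        if not path.is_file():
            continue
        if path.name.endswith(BASE_SUFFIXES):
            totals["base"] += path.stat().st_size
        elif path.name.endswith(NEW_SUFFIXES):
            totals["new"] += path.stat().st_size
    return totals
```

If the "new" total is close to the "base" total after a few COPY INTO runs, that would match the "about twice as large" observation.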
From the differences between the Feb2010 codebase and the Feb2010-SP1 codebase (or rather, the log messages of the changes that happened in between) I could not yet conclude which change(s) trigger(s) this misbehaviour of Feb2010-SP1 & Feb2010-SP2 with multiple "COPY INTO" into the same table.

More investigation is pending ...
Stefan
Martin
--
| Dr. Stefan Manegold | mailto:Stefan.Manegold@cwi.nl |
| CWI, P.O.Box 94079 | http://www.cwi.nl/~manegold/ |
| 1090 GB Amsterdam | Tel.: +31 (20) 592-4212 |
| The Netherlands | Fax : +31 (20) 592-4199 |