[MonetDB-users] COPY TO corrupting data
I am trying to load about 500M rows into a table with about 160 columns (column types: 7 columns of type "int", 2 columns of type "varchar(20)", 6 columns of type "varchar(10)", 1 column of type "varchar(15)", 144 columns of type "real") on a system with 16 GB RAM, 64-bit Windows Server 2003 SP2.

It takes some 20 hours to load the first 100M rows, but when I reach that boundary the server seems to crash and the database becomes corrupt, so that about 95M rows have 0 in numeric columns and blanks in varchars. The remaining 5M rows have real data.

Does anyone have any suggestions on what can be done to make it possible to load this data?
Dear Mobi,

Thank you for your interest. The load you present is a real challenge to any DBMS; one normally performs it in smaller increments. The issue of 0 appearing in many columns has been reported internally as well, and we are investigating the situation. It is a hard one, because it only seems to appear in very large loads, and it is not easily reproducible.

A first step I would take is to use an incremental approach, with smaller increments. How much is optimal is probably a matter of trial and error. Your experiences, however, are helpful for many others that follow.

regards, Martin

mobigital1 wrote:
I am trying to load about 500M rows into a table with about 160 columns:
column types: 7 columns of type "int", 2 columns of type "varchar(20)", 6 columns of type "varchar(10)", 1 column of type "varchar(15)", 144 columns of type "real"
on a system with 16 GB RAM, 64-bit Windows Server 2003 SP2.
It takes some 20 hours to load the first 100M rows, but when I reach that boundary the server seems to crash and the database becomes corrupt, so that about 95M rows have 0 in numeric columns and blanks in varchars. The remaining 5M rows have real data.
Does anyone have any suggestions on what can be done to make it possible to load this data?
FYI, I am using 500K rows at a time. I've seen the mserver5.exe process memory utilization go up to 13 GB (free memory was around 1.5 GB), but then it winds down pretty quickly.
mobigital1 wrote:
FYI, I am using 500K rows at a time.
I've seen the mserver5.exe process memory utilization go up to 13 GB (free memory was around 1.5 GB), but then it winds down pretty quickly.
Thanks. This means that the expected database chunk size tops out at 500K * (7*8 + 2*20 + 6*10 + 15 + 144*8) bytes, i.e. 500K * 1323 ≈ 0.67 GB. Chunk-size-wise there does not seem to be a problem. However, the virtual memory footprint quickly increases, and I am wondering if we hit a limit in the OS regarding memory-mapped files.

I would check a single batch for consistency. Then build a merge table over the smaller batches. In a later stage you can then glue together the pieces until we hit another limit.

Thanks for the info and help in improving MonetDB.
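A minimal sketch of the merge-table route, assuming a MonetDB version where CREATE MERGE TABLE and ALTER TABLE ... ADD TABLE are available; the table names, the three-column layout, and the file paths are illustrative placeholders, not the 160-column schema from this thread:

  -- Load each slice into its own table, check it, then register it under a merge table.
  CREATE TABLE batch_1 (id INT, name VARCHAR(20), v1 REAL);
  CREATE TABLE batch_2 (id INT, name VARCHAR(20), v1 REAL);

  COPY INTO batch_1 FROM '/path/to/part_1.csv' USING DELIMITERS '|','\n';
  COPY INTO batch_2 FROM '/path/to/part_2.csv' USING DELIMITERS '|','\n';

  -- Verify each batch for consistency before gluing the pieces together.
  SELECT COUNT(*) FROM batch_1;
  SELECT COUNT(*) FROM batch_2;

  -- The merge table presents all batches as one logical table for querying.
  CREATE MERGE TABLE all_rows (id INT, name VARCHAR(20), v1 REAL);
  ALTER TABLE all_rows ADD TABLE batch_1;
  ALTER TABLE all_rows ADD TABLE batch_2;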
Thanks. This means that the expected database chunk size tops out at 500K * (7*8 + 2*20 + 6*10 + 15 + 144*8) bytes, i.e. 500K * 1323 ≈ 0.67 GB.
Chunk-size-wise there does not seem to be a problem. However, the virtual memory footprint quickly increases, and I am wondering if we hit a limit in the OS regarding memory-mapped files.
Is there something I can monitor to see if this limit is reached? System load (on the Task Manager's left bar) usually reflects cumulative system load, including virtual memory utilization. That didn't seem to be showing substantially more than current physical memory use.
I would check a single batch for consistency. Then build a merge table over the smaller batches. In a later stage you can then glue together the pieces until we hit another limit.
That's a good idea to try. It would also be great to add a batch-size capability to COPY TO, so that I can have it load a multi-million-row table without having to slice it myself.
On Fri, Feb 22, 2008 at 10:52:07AM -0800, mobigital1 wrote:
Thanks. This means that the expected database chunk size tops out at 500K * (7*8 + 2*20 + 6*10 + 15 + 144*8) bytes, i.e. 500K * 1323 ≈ 0.67 GB.
Chunk-size-wise there does not seem to be a problem. However, the virtual memory footprint quickly increases, and I am wondering if we hit a limit in the OS regarding memory-mapped files.
Is there something I can monitor to see if this limit is reached? System load (on the Task Manager's left bar) usually reflects cumulative system load, including virtual memory utilization. That didn't seem to be showing substantially more than current physical memory use.
I would check a single batch for consistency. Then build a merge table over the smaller batches. In a later stage you can then glue together the pieces until we hit another limit.
That's a good idea to try.
It would also be great to add a batch-size capability to COPY TO, so that I can have it load a multi-million-row table without having to slice it myself.
That feature exists ;-) For example:

COPY 5 OFFSET 5 RECORDS INTO my_test FROM stdin USING DELIMITERS '|','\n';

This starts from row number 5 and reads 4 more.

Niels
--
Niels Nes, Centre for Mathematics and Computer Science (CWI)
Kruislaan 413, 1098 SJ Amsterdam, The Netherlands
room C0.02, phone ++31 20 592-4098, fax ++31 20 592-4312
url: http://www.cwi.nl/~niels   e-mail: Niels.Nes@cwi.nl
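To sketch how the COPY n OFFSET m RECORDS form Niels shows could drive a whole load in slices, a hypothetical batched load is given below; the table name, file path, and the 500K batch size are placeholders, and on Windows the path syntax may need adjusting:

  -- Repeat with an increasing OFFSET until the source file is exhausted,
  -- reading 500000 records per statement.
  COPY 500000 OFFSET 1 RECORDS INTO my_table FROM '/path/to/load.csv' USING DELIMITERS '|','\n';
  COPY 500000 OFFSET 500001 RECORDS INTO my_table FROM '/path/to/load.csv' USING DELIMITERS '|','\n';
  COPY 500000 OFFSET 1000001 RECORDS INTO my_table FROM '/path/to/load.csv' USING DELIMITERS '|','\n';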
I would check a single batch for consistency. Then build a merge table over the smaller batches. In a later stage you can then glue together the pieces until we hit another limit.
That's a good idea to try.
It would also be great to add a batch-size capability to COPY TO, so that I can have it load a multi-million-row table without having to slice it myself.
I've been having the same issue in my other environment: 32-bit Windows 2003 with 8 GB RAM. I am trying to load a table with the following column types: 33 columns of type "int", 27 columns of type "real", 1 column of type "numeric(10,0)", 1 column of type "varchar(6)", 19 columns of type "varchar(1)", 5 columns of type "varchar(3)", 2 columns of type "varchar(12)", 6 columns of type "varchar(2)", 1 column of type "varchar(18)", 5 columns of type "varchar(5)", 1 column of type "varchar(15)", 2 columns of type "varchar(30)". At around 9M+ rows the server dies and some data gets corrupted.

I tried the method above and loaded 250K rows at a time into a staging table. In each batch I did a COPY TO the staging table (250K rows at a time), then executed an insert from staging to the final table, and a delete from the staging table. Either through the staging table or directly, the problem exhibited the same symptoms. A few times I tried, mserver5.exe did not crash, but it did not let go of the COPY TO source data file; I knew that because I could not delete it to load a new batch of data. The database had to be discarded and created from scratch again.
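For reference, a minimal sketch of the staging-table flow described above, with placeholder table names, file path, and batch size:

  -- Load one 250K-row slice into the staging table, move it to the final table,
  -- then empty the staging table before the next slice.
  COPY 250000 OFFSET 1 RECORDS INTO staging FROM '/path/to/batch.csv' USING DELIMITERS '|','\n';
  INSERT INTO final_table SELECT * FROM staging;
  DELETE FROM staging;
  -- Repeat, adjusting the OFFSET or pointing at the next source file.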
On Sun, Feb 24, 2008 at 08:10:51PM -0800, mobigital1 wrote:
I would check a single batch for consistency. Then build a merge table over the smaller batches. In a later stage you can then glue together the pieces until we hit another limit.
That's a good idea to try.
It would also be great to add a batch-size capability to COPY TO, so that I can have it load a multi-million-row table without having to slice it myself.
I've been having the same issue in my other environment: 32-bit Windows 2003 with 8 GB RAM.
A 32-bit system on an 8 GB machine does not make much sense, does it? A 32-bit system can (at least per process) address at most 4 GB, usually only 2 GB in practice...
I am trying to load a table with the following column types: 33 columns of type "int", 27 columns of type "real", 1 column of type "numeric(10,0)", 1 column of type "varchar(6)", 19 columns of type "varchar(1)", 5 columns of type "varchar(3)", 2 columns of type "varchar(12)", 6 columns of type "varchar(2)", 1 column of type "varchar(18)", 5 columns of type "varchar(5)", 1 column of type "varchar(15)", 2 columns of type "varchar(30)".
At around 9M+ rows the server dies and some data gets corrupted.
With 9M+ records of >= 438 bytes each, your data is >= 3.9 GB in size. I'm not completely sure whether MonetDB temporarily needs to hold all data concurrently in its address space during bulk loading, but if so, your data size might simply exceed the 32-bit address space. (Obviously, MonetDB should give an error rather than crash...) Did you try on a 64-bit system using 64-bit MonetDB?
Stefan
I tried the method above and loaded 250K rows at a time into a staging table.
In each batch I did a COPY TO the staging table (250K rows at a time), then executed an insert from staging to the final table, and a delete from the staging table.
Either through the staging table or directly, the problem exhibited the same symptoms. A few times I tried, mserver5.exe did not crash, but it did not let go of the COPY TO source data file; I knew that because I could not delete it to load a new batch of data.
The database had to be discarded and created from scratch again.
--
| Dr. Stefan Manegold | mailto:Stefan.Manegold@cwi.nl |
| CWI, P.O.Box 94079  | http://www.cwi.nl/~manegold/  |
| 1090 GB Amsterdam   | Tel.: +31 (20) 592-4212       |
| The Netherlands     | Fax : +31 (20) 592-4312       |
participants (4)
- Martin Kersten
- mobigital1
- Niels Nes
- Stefan Manegold