Binary Bulk Loads

Binary Bulk Loads mk Sun, 03/28/2010 - 23:29

A database is often populated with the SQL COPY statement using an ASCII representation of a table with column and record separators, e.g. a CSV file. This scheme is preferred for almost all situations. Its performance is mostly determined by parsing cost of turning ASCII value representation into the native binary one. The MonetDB copy process is highly optimized for multi-core systems, where several threads in parallel handle variable length records efficiently.

When large tables are migrated between MonetDB instances, or when large tables are produced by a trusted external program, a slightly faster loading track can be used instead. This saves rendering of data into ASCII and subsequent parsing of the data being exchanged. For this to work, the user creates the binary version of the BAT storage and 'attach' it to the SQL catalog using a variation of the COPY command.

To illustrate, consider the following snippet:

create table Tmp( i integer, f real, s string); 
copy binary into Tmp from ('path_to_file_i', 'path_to_file_f', 'path_to_file_s');

Each attachment file is produced by a program that writes the binary image of the BAT directly, i.e. a binary dump of an C-array. For the SQL types tinyint (8-bits), smallint (16-bits),  int/integer (32 bits), and bigint (64 bits) correspond with the C types directly; real and double are mapped onto type float and double in C, respectively. For variable length strings, the file must have one C-based string value per line, terminated by a newline, and it is processed without escape character conversion. Fixed length strings are handled the same way. MonetDB assumes that all files are aligned, i.e. the i-th value in each file corresponds to the i-th record in the table.

All other types (including UTF8 and escaped characters) should be handled using the default COPY INTO language construct.

The file names must be absolute paths to the column input files.

In the past, some of the files would be "moved", but nowadays, all data is copied, so the original files remain unchanged.