This section applies to the COPY BINARY INTO
command as found MonetDB version
Jul2021 and onward. Earlier versions also had a COPY BINARY INTO
but it worked
differently. The most notable differences are that text was newline terminated
rather than NUL terminated, that the list of supported datatypes was
significantly shorter and that there was no control over byte order.
Usually the preferred way of populating a database is by importing a CSV file
using the COPY INTO
command. This command also supports
variant file formats such as tab separated files and fixed width records. The
primary advantage of this way of data interchange is that almost every
application has a way of producing and/or consuming these types of file. A
disadvantage is that even though MonetDB's CSV parser is highly tuned,
converting from a textual format to the internal binary format can still be time
consuming.
If very large amounts of data need to be inserted quickly, COPY BINARY INTO
can be a good alternative. COPY BINARY INTO
allows you to skip the parsing
overhead by passing the data in a binary form that MonetDB can ingest very
efficiently. The main down side is that it is entirely MonetDB specific and
probably needs custom application code to generate the data in the appropriate
form.
COPY [ BIG ENDIAN | LITTLE ENDIAN | NATIVE ENDIAN ] BINARY
INTO [ schema_name . ] table_name
[ '(' column_name [ ',' column_name [ ... ] ] ')' ]
FROM file_name [ ',' file_name [ ... ] ]
[ ON CLIENT | ON SERVER ]
With COPY BINARY INTO
, every column is read from a a separate file. The number
and order of file names given must match the number of columns listed, or the
number of columns of the table if no column list is present. As with the regular
COPY INTO
, the data can be read on the server or on the client.
Because every column has its own file, no delimiters can be configured and
there also is no BEST EFFORT
mode.
Every column is read from a separate file. For fixed width data such as numbers, the data is simply the concatenation of the binary representation of each value. Both Big Endian and Little Endian byte orders are supported, see below.
Text is represented as NUL terminated UTF-8 encoded strings.
Temporal types are represented as structured values containing a small integer for each field.
For each data type there is a special binary pattern to represent NULL values.
Assume we have the following table definition:
CREATE TABLE foo(i INT, t TEXT);
We will use Python's struct
module
to generate some binary data to fill it. Note that for large amounts of data
this might not be very efficient as it first materializes the data as a large
in-memory Python array.
#!/usr/bin/env python3
import struct
open('col_i.bin', 'wb').write(struct.pack('<4i', 42, 43, 44, 45))
open('col_t.bin', 'wb').write(b'foo\x00bar\x00baz\x00quux\x00')
This program creates two files: col_i.bin is a 16-byte file containing four 4-byte/32-bits integers:
┌─────────────────────────┬─────────────────────────┐
│ 2a 00 00 00 2b 00 00 00 ┊ 2c 00 00 00 2d 00 00 00 │
└─────────────────────────┴─────────────────────────┘
The other file is a 17-byte file containing the four strings, each terminated with a NUL byte:
┌─────────────────────────┬─────────────────────────┬────┐
│ 66 6f 6f 00 62 61 72 00 ┊ 62 61 7a 00 71 75 75 78 ┊ 00 │
│ f o o \0 b a r \0 ┊ b a z \0 q u u x ┊ \0 │
└─────────────────────────┴─────────────────────────┴────┘
To copy this data into MonetDB, issue the command
COPY LITTLE ENDIAN BINARY
INTO foo
FROM 'col_i.bin', 'col_t.bin' ON CLIENT;
First a word about byte order. On modern systems, bytes are 8 bits wide. That means that to store a 16-bit number we need 2 bytes. Take for example the number 2021 = 7 x 256 + 229. It is stored as a combination as a high byte 7 and a low byte 229. There are two obvious ways to store the number 2021 in a byte file:
on Little Endian systems, the low byte comes first: [ 229 | 7 ]
.
Intel/AMD-based computers systems and most ARM-based computer systems use
a little endian byte order.
on Big Endian systems, the high byte comes first: [ 7 | 229 ]
.
Many older computer systems such as IBM mainframes and SPARC/MIPS systems
use a big endian byte order. The Java virtual machine also prefers big endian
numbers, see for example Java's DataOutputStream.
In the first few words of the COPY BINARY INTO
statement you can indicate the
byte order of the incoming data: BIG ENDIAN
, LITTLE ENDIAN
or NATIVE ENDIAN
. Native endian means the preferred byte order of the system
MonetDB is running on. When omitted, native endian is assumed.
The byte width of the file for each numeric column depends on the data type:
SQL data type | byte width |
---|---|
tinyint | 1 |
smallint | 2 |
int | 4 |
bigint | 8 |
hugeint | 16, if supported |
decimal(1..2,X) | 1 |
decimal(3..4,X) | 2 |
decimal(5..9,X) | 4 |
decimal(10..18,X) | 8 |
decimal(19..38,X) | 16, if supported |
float(24) | 4 |
float(53) | 8 |
The NULL representation for the integers and decimals has 0x80 in the highest
byte and 0 in the other bytes. When interpreted as a signed integer, this
byte sequence is just outside the range for the corresponding SQL integer.
For example, when reading little endian smallint
values, the bytes [ 00 80 ]
are interpreted as the signed integer -32768 which falls outside smallint
's supported range
of -32767..+32767.
The NULL representation for floating point values is any not-a-number (NaN) value.
For text columns (VARCHAR(*)
, STRING
, TEXT
) and text-like datatypes such
as URL, the file should contain the concatenated
values, each followed by a NUL byte. That is, the values are NUL-terminated, not
NUL-separated. The text encoding MUST be UTF-8.
The NULL representation for textual columns consists of the single
byte \x80
, followed by the usual column terminator \x00
.
This is not a valid UTF-8 sequence so it cannot be confused with a value.
Temporal types are represented as small structured values consisting of
several small integers holding the day, the hour, the month, etc.
The exact layout is defined in the C header file monetdb/copybinary.h
provided
with MonetDB and reproduced below:
typedef struct {
uint8_t day;
uint8_t month;
int16_t year;
} copy_binary_date; // natural size: 32, natural alignment: 16
typedef struct {
uint32_t ms;
uint8_t seconds;
uint8_t minutes;
uint8_t hours;
uint8_t padding; // implied in C, explicit elsewhere
} copy_binary_time; // natural size: 64, natural alignment: 32
typedef struct {
copy_binary_time time;
copy_binary_date date;
} copy_binary_timestamp; // natural size: 96, natural alignment: 32
The day and month fields are 1-based.
Any invalid value, such as month=32, is treated as a NULL.