This section applies to the COPY BINARY INTO
command as found in MonetDB version
Jul2021 and onward. Earlier versions also had a COPY BINARY INTO
but it worked
differently. The most notable differences are that text was newline terminated
rather than NUL terminated, that the list of supported datatypes was
significantly shorter and that there was no control over byte order.
Usually the preferred way of populating a database is by importing a CSV file
using the COPY INTO
command. This command also supports
variant file formats such as tab separated files and fixed width records. The
primary advantage of this way of data interchange is that almost every
application has a way of producing and/or consuming these types of file. A
disadvantage is that even though MonetDB's CSV parser is highly tuned,
converting from a textual format to the internal binary format can still be time
consuming.
If very large amounts of data need to be inserted quickly, COPY BINARY INTO
can be a good alternative. COPY BINARY INTO
allows you to skip the parsing
overhead by passing the data in a binary form that MonetDB can ingest very
efficiently. The main down side is that it is entirely MonetDB specific and
probably needs custom application code to generate the data in the appropriate
form, unless you use
COPY select query INTO BINARY files
to export data from an existing MonetDB instance.
COPY [ BIG ENDIAN | LITTLE ENDIAN | NATIVE ENDIAN ] BINARY
INTO [ schema_name . ] table_name
[ '(' column_name [ ',' column_name [ ... ] ] ')' ]
FROM file_name [ ',' file_name [ ... ] ]
[ ON CLIENT | ON SERVER ]
With COPY BINARY INTO
, every column is read from a a separate file. The number
and order of file names given must match the number of columns listed, or the
number of columns of the table if no column list is present. As with the regular
COPY INTO
, the data can be read on the server or on the client.
Because every column has its own file, no delimiters can be configured and
there also is no BEST EFFORT
mode.
Every column is read from a separate file. For fixed width data such as numbers, the data is simply the concatenation of the binary representation of each value. Both Big Endian and Little Endian byte orders are supported, see below.
Text is represented as NUL terminated UTF-8 encoded strings.
Temporal types are represented as structured values containing a small integer for each field.
For each data type there is a special binary pattern to represent NULL values.
When the optional endianess is not specified, NATIVE ENDIAN is used.
When the optional ON CLIENT or ON SERVER is not specified, ON SERVER is used.
Assume we have the following table definition:
CREATE TABLE foo(i INT, t TEXT);
We will use Python's struct
module
to generate some binary data to fill it. Note that for large amounts of data
this might not be very efficient as it first materializes the data as a large
in-memory Python array.
#!/usr/bin/env python3
import struct
open('col_i.bin', 'wb').write(struct.pack('<4i', 42, 43, 44, 45))
open('col_t.bin', 'wb').write(b'foo\x00bar\x00baz\x00quux\x00')
This program creates two files: col_i.bin is a 16-byte file containing four 4-byte/32-bits integers:
┌─────────────────────────┬─────────────────────────┐
│ 2a 00 00 00 2b 00 00 00 ┊ 2c 00 00 00 2d 00 00 00 │
└─────────────────────────┴─────────────────────────┘
The other file is a 17-byte file containing the four strings, each terminated with a NUL byte:
┌─────────────────────────┬─────────────────────────┬────┐
│ 66 6f 6f 00 62 61 72 00 ┊ 62 61 7a 00 71 75 75 78 ┊ 00 │
│ f o o \0 b a r \0 ┊ b a z \0 q u u x ┊ \0 │
└─────────────────────────┴─────────────────────────┴────┘
To copy this data into MonetDB, issue the command
COPY LITTLE ENDIAN BINARY
INTO foo
FROM 'col_i.bin', 'col_t.bin' ON CLIENT;
First a word about byte order. On modern systems, bytes are 8 bits wide. That means that to store a 16-bit number we need 2 bytes. Take for example the number 2021 = 7 x 256 + 229. It is stored as a combination as a high byte 7 and a low byte 229. There are two obvious ways to store the number 2021 in a byte file:
on Little Endian systems, the low byte comes first: [ 229 | 7 ]
.
Intel/AMD-based computers systems and most ARM-based computer systems use
a little endian byte order.
on Big Endian systems, the high byte comes first: [ 7 | 229 ]
.
Many older computer systems such as IBM mainframes and SPARC/MIPS systems
use a big endian byte order. The Java virtual machine also prefers big endian
numbers, see for example Java's DataOutputStream.
In the first few words of the COPY BINARY INTO
statement you can indicate the
byte order of the incoming data: BIG ENDIAN
, LITTLE ENDIAN
or NATIVE ENDIAN
. Native endian means the preferred byte order of the system
MonetDB is running on. When omitted, native endian is assumed.
The byte width of the file for each numeric column depends on the data type:
SQL data type | byte width |
---|---|
tinyint | 1 |
smallint | 2 |
int | 4 |
bigint | 8 |
hugeint | 16, if supported |
decimal(1..2,X) | 1 |
decimal(3..4,X) | 2 |
decimal(5..9,X) | 4 |
decimal(10..18,X) | 8 |
decimal(19..38,X) | 16, if supported |
float(24) | 4 |
float(53) | 8 |
The NULL representation for the integers and decimals has 0x80 in the highest
byte and 0 in the other bytes. When interpreted as a signed integer, this
byte sequence is just outside the range for the corresponding SQL integer.
For example, when reading little endian smallint
values, the bytes [ 00 80 ]
are interpreted as the signed integer -32768 which falls outside smallint
's supported range
of -32767..+32767.
The NULL representation for floating point values is any not-a-number (NaN) value.
For text columns (VARCHAR(*)
, CHAR(*)
, CLOB
, STRING
, TEXT
) and text-like datatypes
such as URL and JSON,
the file should contain the concatenated values, each followed by a NUL byte.
That is, the values are NUL-terminated, not NUL-separated. The text encoding MUST be UTF-8.
The NULL representation for textual columns consists of the single
byte \x80
, followed by the usual column terminator \x00
.
This is not a valid UTF-8 sequence so it cannot be confused with a value.
For BLOB
columns, the file is a concatenation of blob records. Each record
consists of an 8 byte (64 bit) integer indicating the length of the blob,
followed by the bytes of the blob. There is no padding. The byte order of the
length header is the byte order indicated in the COPY BINARY
statement.
Null blobs have the length header set to all-ones (0xFFFF_FFFF_FFFF_FFFF)
without any following bytes.
For example, the statement
COPY VALUES
(BLOB 'aabbcc'),
(BLOB ''),
(NULL)
INTO BIG ENDIAN BINARY '/tmp/x' ON CLIENT;
results in the bytes
00 00 00 00 00 00 00 03 aa bb cc
00 00 00 00 00 00 00 00
ff ff ff ff ff ff ff ff
Conversely, with INTO LITTLE ENDIAN BINARY
it results in
03 00 00 00 00 00 00 00 aa bb cc
00 00 00 00 00 00 00 00
ff ff ff ff ff ff ff ff
Temporal types are represented as small structured values consisting of
several small integers holding the day, the hour, the month, etc.
The exact layout is defined in the C header file monetdb/copybinary.h
provided
with MonetDB and reproduced below:
typedef struct {
uint8_t day;
uint8_t month;
int16_t year;
} copy_binary_date; // natural size: 32, natural alignment: 16
typedef struct {
uint32_t ms;
uint8_t seconds;
uint8_t minutes;
uint8_t hours;
uint8_t padding; // implied in C, explicit elsewhere
} copy_binary_time; // natural size: 64, natural alignment: 32
typedef struct {
copy_binary_time time;
copy_binary_date date;
} copy_binary_timestamp; // natural size: 96, natural alignment: 32
The day and month fields are 1-based.
When reading temporal types in binary, any out-of-bounds value is treated as a NULL. For example, month=32. When writing temporal types as binary, MonetDB sets all bits to one, so a NULL date and time read as
{
.day=255,
.month=255,
.year=-1
},
and
{
.ms=4294967295, // 0xFFFFFFFF
.seconds=255,
.minutes=255,
.hours=255,
.padding=255
}
Interval types represent the difference between two dates. There are many varieties, for example,
INTERVAL YEAR
,INTERVAL SECOND
,INTERVAL DAY TO MINUTE
,etc. For a full description, see section Temporal Types.
The binary representation of interval types depends on the granularity. Types
with granularity MONTH or YEAR are represented as a 32 bits integer indicating
the number of months. For example, INTERVAL '1-3' YEAR TO MONTH
is represented
as 15. The other, more fine grained interval types are all represented as 64
bits integers indicating the number of milliseconds. For example, INTERVAL '1' HOUR
is represented as 3600000, or equivalently, as the DECIMAL 3600.000.