Loading from Binary Files

This section applies to the COPY BINARY INTO command as found in MonetDB version Jul2021 and onward. Earlier versions also had a COPY BINARY INTO but it worked differently. The most notable differences are that text was newline terminated rather than NUL terminated, that the list of supported datatypes was significantly shorter and that there was no control over byte order.

Usually the preferred way of populating a database is by importing a CSV file using the COPY INTO command. This command also supports variant file formats such as tab separated files and fixed width records. The primary advantage of this way of data interchange is that almost every application has a way of producing and/or consuming these types of file. A disadvantage is that even though MonetDB's CSV parser is highly tuned, converting from a textual format to the internal binary format can still be time consuming.

If very large amounts of data need to be inserted quickly, COPY BINARY INTO can be a good alternative. COPY BINARY INTO allows you to skip the parsing overhead by passing the data in a binary form that MonetDB can ingest very efficiently. The main down side is that it is entirely MonetDB specific and probably needs custom application code to generate the data in the appropriate form, unless you use COPY select query INTO BINARY files to export data from an existing MonetDB instance.

Syntax

COPY [ BIG ENDIAN | LITTLE ENDIAN | NATIVE ENDIAN ] BINARY
INTO [ schema_name . ] table_name
     [ '(' column_name [ ',' column_name [ ... ] ] ')' ]
FROM file_name [ ',' file_name [ ... ] ]
[ ON CLIENT | ON SERVER ]

With COPY BINARY INTO, every column is read from a a separate file. The number and order of file names given must match the number of columns listed, or the number of columns of the table if no column list is present. As with the regular COPY INTO, the data can be read on the server or on the client. Because every column has its own file, no delimiters can be configured and there also is no BEST EFFORT mode.

Every column is read from a separate file. For fixed width data such as numbers, the data is simply the concatenation of the binary representation of each value. Both Big Endian and Little Endian byte orders are supported, see below.

Text is represented as NUL terminated UTF-8 encoded strings.

Temporal types are represented as structured values containing a small integer for each field.

For each data type there is a special binary pattern to represent NULL values.

When the optional endianess is not specified, NATIVE ENDIAN is used.

When the optional ON CLIENT or ON SERVER is not specified, ON SERVER is used.

Example

Assume we have the following table definition:

CREATE TABLE foo(i INT, t TEXT);

We will use Python's struct module to generate some binary data to fill it. Note that for large amounts of data this might not be very efficient as it first materializes the data as a large in-memory Python array.

#!/usr/bin/env python3
import struct
open('col_i.bin', 'wb').write(struct.pack('<4i', 42, 43, 44, 45))
open('col_t.bin', 'wb').write(b'foo\x00bar\x00baz\x00quux\x00')

This program creates two files: col_i.bin is a 16-byte file containing four 4-byte/32-bits integers:

┌─────────────────────────┬─────────────────────────┐
│ 2a 00 00 00 2b 00 00 00 ┊ 2c 00 00 00 2d 00 00 00 │
└─────────────────────────┴─────────────────────────┘

The other file is a 17-byte file containing the four strings, each terminated with a NUL byte:

┌─────────────────────────┬─────────────────────────┬────┐
│ 66 6f 6f 00 62 61 72 00 ┊ 62 61 7a 00 71 75 75 78 ┊ 00 │
│ f  o  o  \0 b  a  r  \0 ┊ b  a  z  \0 q  u  u  x  ┊ \0 │
└─────────────────────────┴─────────────────────────┴────┘

To copy this data into MonetDB, issue the command

COPY LITTLE ENDIAN BINARY
INTO foo
FROM 'col_i.bin', 'col_t.bin' ON CLIENT;

Numeric data layout

First a word about byte order. On modern systems, bytes are 8 bits wide. That means that to store a 16-bit number we need 2 bytes. Take for example the number 2021 = 7 x 256 + 229. It is stored as a combination as a high byte 7 and a low byte 229. There are two obvious ways to store the number 2021 in a byte file:

  1. on Little Endian systems, the low byte comes first: [ 229 | 7 ]. Intel/AMD-based computers systems and most ARM-based computer systems use a little endian byte order.

  2. on Big Endian systems, the high byte comes first: [ 7 | 229 ]. Many older computer systems such as IBM mainframes and SPARC/MIPS systems use a big endian byte order. The Java virtual machine also prefers big endian numbers, see for example Java's DataOutputStream.

In the first few words of the COPY BINARY INTO statement you can indicate the byte order of the incoming data: BIG ENDIAN, LITTLE ENDIAN or NATIVE ENDIAN. Native endian means the preferred byte order of the system MonetDB is running on. When omitted, native endian is assumed.

The byte width of the file for each numeric column depends on the data type:

SQL data typebyte width
tinyint1
smallint2
int4
bigint8
hugeint16, if supported
decimal(1..2,X)1
decimal(3..4,X)2
decimal(5..9,X)4
decimal(10..18,X)8
decimal(19..38,X)16, if supported
float(24)4
float(53)8

The NULL representation for the integers and decimals has 0x80 in the highest byte and 0 in the other bytes. When interpreted as a signed integer, this byte sequence is just outside the range for the corresponding SQL integer. For example, when reading little endian smallint values, the bytes [ 00 80 ] are interpreted as the signed integer -32768 which falls outside smallint's supported range of -32767..+32767.

The NULL representation for floating point values is any not-a-number (NaN) value.

Text data layout

For text columns (VARCHAR(*), CHAR(*), CLOB, STRING, TEXT) and text-like datatypes such as URL and JSON, the file should contain the concatenated values, each followed by a NUL byte. That is, the values are NUL-terminated, not NUL-separated. The text encoding MUST be UTF-8.

The NULL representation for textual columns consists of the single byte \x80, followed by the usual column terminator \x00. This is not a valid UTF-8 sequence so it cannot be confused with a value.

Blob layout

For BLOB columns, the file is a concatenation of blob records. Each record consists of an 8 byte (64 bit) integer indicating the length of the blob, followed by the bytes of the blob. There is no padding. The byte order of the length header is the byte order indicated in the COPY BINARY statement. Null blobs have the length header set to all-ones (0xFFFF_FFFF_FFFF_FFFF) without any following bytes.

For example, the statement

COPY VALUES
	(BLOB 'aabbcc'),
	(BLOB ''),
	(NULL)
INTO BIG ENDIAN BINARY '/tmp/x' ON CLIENT;

results in the bytes

00 00 00 00  00 00 00 03   aa bb cc
00 00 00 00  00 00 00 00
ff ff ff ff  ff ff ff ff

Conversely, with INTO LITTLE ENDIAN BINARY it results in

03 00 00 00  00 00 00 00   aa bb cc
00 00 00 00  00 00 00 00
ff ff ff ff  ff ff ff ff

Temporal types

Temporal types are represented as small structured values consisting of several small integers holding the day, the hour, the month, etc. The exact layout is defined in the C header file monetdb/copybinary.h provided with MonetDB and reproduced below:

typedef struct {
	uint8_t day;
	uint8_t month;
	int16_t year;
} copy_binary_date; // natural size: 32, natural alignment: 16

typedef struct {
	uint32_t ms;
	uint8_t seconds;
	uint8_t minutes;
	uint8_t hours;
	uint8_t padding; // implied in C, explicit elsewhere
} copy_binary_time;		 // natural size: 64, natural alignment: 32

typedef struct {
	copy_binary_time time;
	copy_binary_date date;
} copy_binary_timestamp; // natural size: 96, natural alignment: 32

The day and month fields are 1-based.

When reading temporal types in binary, any out-of-bounds value is treated as a NULL. For example, month=32. When writing temporal types as binary, MonetDB sets all bits to one, so a NULL date and time read as

{
	.day=255,
	.month=255,
	.year=-1
},

and

{
	.ms=4294967295, // 0xFFFFFFFF
	.seconds=255,
	.minutes=255,
	.hours=255,
	.padding=255
}

Interval types

Interval types represent the difference between two dates. There are many varieties, for example,

  • INTERVAL YEAR,
  • INTERVAL SECOND,
  • INTERVAL DAY TO MINUTE,

etc. For a full description, see section Temporal Types.

The binary representation of interval types depends on the granularity. Types with granularity MONTH or YEAR are represented as a 32 bits integer indicating the number of months. For example, INTERVAL '1-3' YEAR TO MONTH is represented as 15. The other, more fine grained interval types are all represented as 64 bits integers indicating the number of milliseconds. For example, INTERVAL '1' HOUR is represented as 3600000, or equivalently, as the DECIMAL 3600.000.