Loading from CSV files

Data extracted from external applications is often in CSV form (Comma Separated Value), or a variant that uses another delimiter such as a semicolon ;, a vertical bar | or a TAB character. MonetDB provides the COPY INTO statement to load data from this kind of file.

Loading data using COPY INTO is much faster than inserting it with a series of INSERT statements. This is because loading a line of CSV data is much more efficient than parsing and executing an INSERT statement, and also because execution of a SQL script is inherently sequential while the work of COPY INTO can be spread over multiple CPU cores.

Example:

COPY INTO mytable
FROM '/path/to/my/data.csv' ON CLIENT
USING DELIMITERS ',', E'\n', '"';

The details of the supported file formats are described in section Supported file formats. If an error occurs during loading, details of the error are stored in a special system table and loading stops unless the BEST EFFORT clause is set. For details, see section Error Handling.

Syntax

COPY  [ n RECORDS ] [ OFFSET k ]
INTO  [ schema_name '.' ] table_name [ '(' column_name [ ',' ... ] ')' ]
FROM {
    file_name [ ',' ... ] [ '(' header [ ',' ... ] ')' ] [ ON SERVER ]
  | file_name [ ',' ... ] [ '(' header [ ',' ... ] ')' ] ON CLIENT
  | STDIN [ '(' header [ ',' ... ] ')' ]
}
[ [ USING ] DELIMITERS field_sep [ ',' record_sep [ ',' string_quote ] ] ]
[ [ NO ] ESCAPE ]
[  NULL [ AS ] null_string ]
[ BEST EFFORT ]
[ FWF '(' width [ ',' ... ] ')' ]

COPY clause

The start of the statement has 5 possible forms. In each case, n is the maximum number of records to read, and k is the line number of the first record to read.

COPY INTO
COPY n RECORDS INTO
COPY OFFSET k INTO
COPY n OFFSET k RECORDS INTO
COPY n RECORDS OFFSET k INTO

If n is given, MonetDB will not load more than n records. It also uses the number to preallocate storage, so if you know in advance there will be (at most) n records, letting MonetDB know may improve performance.
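
For example, a sketch assuming a hypothetical file known to hold at most one million records:

COPY 1000000 RECORDS INTO mytable
FROM '/path/to/big.csv' ON CLIENT
USING DELIMITERS ',', E'\n', '"';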

The line number k starts counting at 1. This means that values 0 and 1 both mean "start reading at the top of the file", 2 means "skip one line", 3 means "skip two lines", etc.

Many CSV files start with a header line giving the names of the columns. The OFFSET clause can be used to skip this header line.
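
For example, a sketch that skips a single header line (file name hypothetical):

COPY OFFSET 2 INTO mytable
FROM '/path/to/data-with-header.csv' ON CLIENT
USING DELIMITERS ',', E'\n', '"';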

INTO clause

INTO  [ schema_name '.' ] table_name [ '(' column_name [ ',' ... ] ')' ]

The INTO table_name clause takes an optional list of column names. If given, the columns listed are loaded from the input file while the other columns are filled with default values.
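
A minimal sketch, assuming a hypothetical single-column file names.csv; the explicit header list (see the FROM clause below) tells MonetDB the file has only one field:

CREATE TABLE people(name VARCHAR(20), added DATE DEFAULT CURRENT_DATE);
COPY INTO people(name) FROM 'names.csv' (name) ON CLIENT;

Here only the name column is loaded from the file; added is filled with its default value.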

FROM clause

FROM {
    file_name [ ',' ... ] [ '(' header [ ',' ... ] ')' ] [ ON SERVER ]
  | file_name [ ',' ... ] [ '(' header [ ',' ... ] ')' ] ON CLIENT
  | STDIN [ '(' header [ ',' ... ] ')' ]
}

The FROM clause has three forms, each with an optional header list. Though not indicated in the grammar above, it is allowed to enclose the file names in parentheses. If multiple file names are given, the data from those files is concatenated while reading. The n RECORDS limit applies to the total number of records read from all files together, but if an OFFSET clause is present, the offset is applied to each file in turn.
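
For example, a sketch loading two hypothetical monthly extracts in one statement:

COPY INTO mytable
FROM ('/data/2024-01.csv', '/data/2024-02.csv') ON CLIENT
USING DELIMITERS ',', E'\n', '"';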

Especially on Windows it is recommended to write the file names using raw strings as this allows you to write a Windows path without doubling the backslashes. Alternatively, you can simply write forward slashes instead of backslashes. This means the following are all equivalent:

COPY INTO mytable FROM R'c:\dumps\data.csv';
COPY INTO mytable FROM E'c:\\dumps\\data.csv';
COPY INTO mytable FROM 'c:\\dumps\\data.csv';
COPY INTO mytable FROM 'c:/dumps/data.csv';

FROM clause header list

The header list can be used to give names to the fields in the CSV file. In combination with the column list given in the INTO clause, this can be used to reorder the data from the file, or to use only a subset of the fields. For example,

COPY INTO foo(x, y) FROM 'file' (y, dummy, x)

reads a three-column CSV file, storing the third CSV column in the first table column, the first CSV column in the second table column, and ignoring the middle CSV column.

It is important to note that if no header list is given, the header names are taken from the columns of the destination table even if a column list was specified in the INTO clause. This means for example that

CREATE TABLE foo(i INT, j INT, k INT);
COPY INTO foo(i, j) FROM 'file.csv';

will assume that file.csv has three columns i, j and k, not two as you might expect.
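
One way to read only two fields per record is to give an explicit header list as well:

COPY INTO foo(i, j) FROM 'file.csv' (i, j);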

FROM files ON SERVER

With ON SERVER, which is the default, the file name must be an absolute path on the system on which the database server (mserver5) is running. The server must have read privileges on the file. Note that if, for example, MonetDB is running on a Linux server with SELinux enabled, it may not be easy to find a location on the server where you as a user have write access and where MonetDB has read access.

FROM files ON CLIENT

When ON CLIENT is specified, the server will not itself open the file, but will ask the client to open the file for it. This is more flexible than ON SERVER, but the client application needs to be specifically written to support this. Currently this is possible for C/C++-based applications written using the Mapi library, Java applications using the JDBC driver, and Python applications using pymonetdb.

With ON CLIENT, the file name need not be an absolute path. In fact, because the client application can interpret the file name however it pleases, it does not need to be a real file at all. With the implementation built into jdbcclient, the file name is interpreted as an absolute path or relative to the current working directory of the client. With the implementation built into mclient, the file is first tried relative to the current working directory of the client, and if that fails and a file is known from which the query is being read, relative to that file.
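
For example, with mclient the following sketch loads data.csv relative to the client's current working directory (file name hypothetical):

COPY INTO mytable FROM 'data.csv' ON CLIENT;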

FROM stdin

With FROM STDIN, the data follows directly under the COPY INTO statement. This form is only really usable from mclient and is used by msqldump.

sql>COPY 3 RECORDS INTO foo FROM STDIN;
more>1|one
more>2|two
more>3|three
3 affected rows
sql>

If no n RECORDS clause is specified, loading stops at the first empty line. This can cause confusion if the table has only a single column and the input may contain empty fields.

USING DELIMITERS clause

[ [ USING ] DELIMITERS field_sep [ ',' record_sep [ ',' string_quote ] ] ]

This clause allows you to set the field separator, record terminator and quote character as described in section Supported file formats. It is recommended to write the record separator as an E-string to make the meaning of the backslash explicit:

COPY INTO mytable FROM 'file.csv' ON CLIENT
USING DELIMITERS ',', E'\n', '"';

Note that the record separator \n also matches the \r\n sequence commonly used on Windows systems. Explicitly specifying \r\n yields a warning and is treated as equivalent to \n.

ESCAPE clause

ESCAPE
NO ESCAPE

The ESCAPE and NO ESCAPE clauses enable or disable backslash escapes as described in section Supported file formats.
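
For example, a sketch that loads a hypothetical file whose fields contain literal backslashes that must not be interpreted as escapes:

COPY INTO mytable FROM '/path/to/raw.csv' ON CLIENT NO ESCAPE;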

NULL AS clause

This clause indicates how NULL is represented in the CSV file. By default, NULL values in the CSV are indicated by the word "NULL", but another frequently used convention is to use the empty string. This can be configured as follows:

COPY INTO mytable FROM 'file'
USING DELIMITERS ',', E'\n', '"'
NULL AS '';

BEST EFFORT clause

This clause enables BEST EFFORT mode as described in section Error Handling.
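
For example, a sketch that keeps loading past malformed rows and inspects the rejects afterwards (file name hypothetical):

COPY INTO mytable FROM '/path/to/messy.csv' ON CLIENT BEST EFFORT;
SELECT * FROM sys.rejects;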

FWF clause

FWF ( width1, width2, ... )

This clause can be used instead of the USING DELIMITERS clause to indicate that the fields in the file have fixed widths and are not separated by a separator character. It lists the widths of the fields. Leading and trailing whitespace in the fields is stripped.
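
For example, a sketch assuming a hypothetical file with three fixed-width fields of 4, 10 and 2 characters:

COPY INTO mytable FROM '/path/to/fixed.txt' ON CLIENT FWF(4, 10, 2);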

Supported file formats

The details of the file format can be configured using the USING DELIMITERS clause. As an alternative to column separators, fixed-width columns can be specified using the FWF clause.

The input file must use the UTF-8 character encoding. On Windows, character encodings are often referred to as code pages. Note that legacy encodings such as latin1 and UTF-16 are not supported; the file must be UTF-8.

To convert a file to the UTF-8 character encoding you can use the iconv command-line program. Example:

iconv -f ISO-8859-1 -t UTF-8 "my-data-file.csv" -o "my-data-file.utf8.csv"

If you do not know the encoding of a source data file you can use the Linux file command, which shows the detected file type and character set encoding:

file -i "my-data-file.csv"

The input consists of records, each terminated by a record terminator. By default, every line is a record. Both Windows-style and Unix/Linux/macOS-style line endings are allowed.

The fields in a record are separated or terminated by a column separator character. By default this is the vertical bar |. By "separated or terminated" we mean that it is allowed to have an additional column separator character at the end of the line. For example, when loading two columns, both of the following lines are valid:

1|one
2|two|

If the column separator can occur within the fields, a quote character can be used to indicate the start and end of the field. The most common choice of quote character is the double quote ". By default, no quote character is used. If the quote character itself occurs in the value it can either be doubled or escaped with a backslash:

1,"Text containing a doubled "" quote character"
2,"Text containing an escaped \" quote character"

Backslash escapes are enabled by default but can be disabled using the NO ESCAPE clause. The backslash escape mechanism also supports octal, hexadecimal and Unicode escapes, but using those is not recommended:

"a doubled "" quote"
"a backslash-escaped \" quote"
"a hex-escaped \x22 quote"
"an oct-escaped \042 quote"
"a unicode-escaped \u0022 quote and \u2665 heart ♥"
"a long unicode-escaped \U00000022 quote and \U0001F308 rainbow 🌈"

Error handling

If an error occurs while loading, for example because of malformed data, the row, column and cause of the error are retained in a special system table which can be accessed as sys.rejects. The contents of this table persist until the end of the session, or until sys.clearrejects() is called.

For example, suppose file data.csv contains the following three lines:

Einstein|1879
Newton|1643
Eratosthenes|276 BC

The SQL script

CREATE TABLE foo(name VARCHAR(20), birth INT);
COPY INTO foo FROM 'data.csv' ON CLIENT;

will fail:

Failed to import table 'foo', line 3 field birth 'int' expected in '276 BC'
SELECT * FROM sys.rejects;
+-------+-------+-----------------------------------------------+----------------------+
| rowid | fldid | message                                       | input                |
+=======+=======+===============================================+======================+
|     3 |     2 | line 3 field birth 'int' expected in '276 BC' | Eratosthenes|276 BC  |
+-------+-------+-----------------------------------------------+----------------------+

Normally, loading stops at the first error, so every COPY INTO attempt adds at most one row to sys.rejects. However, if you use BEST EFFORT the number of entries added to sys.rejects is essentially unlimited. Do not forget to call sys.clearrejects() to release these system resources.
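
For example, a sketch continuing the example above, assuming sys.clearrejects() is invoked as a procedure:

COPY INTO foo FROM 'data.csv' ON CLIENT BEST EFFORT;
SELECT * FROM sys.rejects;
CALL sys.clearrejects();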