Loading using SQL

Every SQL-based application understands INSERT INTO. For small amounts of data its performance is sufficient, but for a large number of rows you should be aware of several ways to significantly increase performance. In a nutshell,

  1. Run all INSERTs in a single transaction, either by wrapping them between START TRANSACTION and COMMIT or by disabling autocommit mode

  2. Use prepared insert statements

  3. When inserting programmatically, use batch processing

  4. Execute SET sys.optimizer = 'minimal_pipe'

  5. Temporarily remove table constraints (fkey, pkey, unique, check) from the table and restore them after the INSERTs.

INSERT statement

The standard way to insert data into a table is to use the SQL INSERT INTO statement.

INSERT INTO my_table(c1, c2, c3) VALUES (1, 'ab', 3);

If table my_table has three columns, this statement inserts a new row containing the given values. If the table has more columns, the additional columns are set to their default value, or to NULL if no default is defined.
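
For the examples in this section, assume a table definition along the following lines (the column types and the extra column c4 are illustrative assumptions, chosen to match the later examples):

CREATE TABLE my_table (
    c1 INT,
    c2 VARCHAR(10),
    c3 BIGINT,
    c4 INT DEFAULT 0  -- not listed in the INSERT above, so it receives its default value
);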

In MonetDB it is also possible to insert multiple rows using a single insert statement:

INSERT INTO my_table(c1, c2, c3) VALUES (1, 'ab', 3), (4, 'xy', 6);

This insert statement adds two rows. It is not recommended to put more than a few dozen rows in a single INSERT statement, as the parsing overhead will quickly become a bottleneck.

Run all INSERTs in a single transaction

By default, MonetDB runs in autocommit mode, which means that every statement executes as a separate transaction. This adds considerable overhead if the amount of work per statement is very small, as is the case with a single INSERT statement. Wrapping the sequence of INSERTs between a START TRANSACTION statement and a COMMIT makes them a single transaction, so the overhead is reduced.
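
For example:

START TRANSACTION;
INSERT INTO my_table(c1, c2, c3) VALUES (1, 'ab', 3);
INSERT INTO my_table(c1, c2, c3) VALUES (4, 'xy', 6);
-- ... many more INSERT statements ...
COMMIT;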

As an alternative to START TRANSACTION you can also disable autocommit mode at the connection level.
With mclient, this is done using the startup flag -a.
When using JDBC (Java), use method connection.setAutoCommit(false);.
With ODBC, use function SQLSetConnectAttr(hdbc, SQL_ATTR_AUTOCOMMIT, SQL_AUTOCOMMIT_OFF, SQL_IS_UINTEGER).
With pymonetdb, autocommit is already disabled by default.

Use prepared statements

If your language binding supports it, you should use prepared statements to reduce the per-insert parsing and verification overhead. A prepared statement lets you provide the INSERT statement once, with question marks (?) as placeholders for the values to insert. After preparing, you repeatedly bind the placeholders to new values and execute the prepared statement.
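
In MonetDB you can also use PREPARE and EXECUTE directly at the SQL level. A minimal sketch, assuming the server assigns id 2 to the prepared statement (the actual id is reported in the result of PREPARE):

PREPARE INSERT INTO my_table(c1, c2, c3) VALUES (?, ?, ?);
-- suppose the server reports prepared-statement id 2
EXECUTE 2(1, 'ab', 3);
EXECUTE 2(4, 'xy', 6);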

Some language bindings also provide support for prepared statements through their API. For example, in Java:

connection.setAutoCommit(false);
String sql = "INSERT INTO my_table(c1, c2, c3) VALUES (?, ?, ?)";
PreparedStatement pstmt = connection.prepareStatement(sql);
int maxrows = 100000;
for (int n = 1; n <= maxrows; n++) {
   // for each row:
   pstmt.setInt(1, /* an int value */);
   pstmt.setString(2, /* a String value */);
   pstmt.setLong(3, /* a bigint value */);
   pstmt.executeUpdate();
}
connection.commit();

Or in C, using the ODBC API:

SQLSetConnectAttr(hdbc, SQL_ATTR_AUTOCOMMIT, (SQLPOINTER) SQL_AUTOCOMMIT_OFF, SQL_IS_UINTEGER);
SQLCHAR * sql = (SQLCHAR *) "INSERT INTO my_table(c1, c2, c3) VALUES (?, ?, ?)";
SQLPrepare(hstmt, sql, SQL_NTS);
int maxrows = 100000;
for (int n = 1; n <= maxrows; n++) {
   // for each row:
   SQLBindParameter(hstmt, 1, SQL_PARAM_INPUT, ...);
   SQLBindParameter(hstmt, 2, SQL_PARAM_INPUT, ...);
   SQLBindParameter(hstmt, 3, SQL_PARAM_INPUT, ...);
   SQLExecute(hstmt);
}
SQLEndTran(SQL_HANDLE_DBC, hdbc, SQL_COMMIT);

Use batch processing

Some language bindings provide support for batch processing. This combines multiple client statements, such as inserts, into a batch and sends the complete batch to the server in one go, saving time by reducing the number of communication round trips between client and server. For example, in Java:

connection.setAutoCommit(false);
String sql = "INSERT INTO my_table(c1, c2, c3) VALUES (?, ?, ?)";
PreparedStatement pstmt = connection.prepareStatement(sql);
int maxrows = 100000;
int batchsize = 2500;
int batches = maxrows / batchsize;
for (int b = 1; b <= batches; b++) {
  for (int n = 1; n <= batchsize; n++) {
    // for each row:
    pstmt.setInt(1, /* an int value */);
    pstmt.setString(2, /* a String value */);
    pstmt.setLong(3, /* a bigint value */);
    pstmt.addBatch();
  }
  pstmt.executeBatch();
}
connection.commit();

Change the SQL optimizer

Every SQL statement is first translated into an internal language called MAL. After translation and before execution, the system analyzes the MAL plan and tries to improve it. This is very effective for SELECT queries, especially complex ones, but there is not much to optimize for INSERT statements.

If you're executing many INSERTs, changing the MAL optimizer pipeline can make a big difference. To do so, simply execute the SQL statement:

SET sys.optimizer = 'minimal_pipe';

This setting applies to the current connection/session only, so after completing the INSERTs you should restore it using the SQL statement:

SET sys.optimizer = 'default_pipe';
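
Combined with the transaction advice above, a typical bulk-insert session thus looks like:

SET sys.optimizer = 'minimal_pipe';
START TRANSACTION;
-- ... many INSERT statements ...
COMMIT;
SET sys.optimizer = 'default_pipe';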

Temporarily remove table constraints

Table constraints such as PRIMARY KEY, FOREIGN KEY, UNIQUE and CHECK are verified for every inserted row. In addition, the internal hash/join indices associated with those constraints need to be updated for every inserted row. This introduces considerable overhead and becomes increasingly time consuming when inserting many rows.

Because checking table constraints in bulk is much more efficient than checking them row by row, it often pays off to remove the table constraints before the insertions using ALTER TABLE ... DROP CONSTRAINT ... and to recreate them afterwards using ALTER TABLE ... ADD CONSTRAINT ...; when the constraints are re-added, they are checked in bulk.
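
For instance, assuming my_table has a primary key on c1 with the (hypothetical) constraint name my_table_c1_pkey, the pattern is:

ALTER TABLE my_table DROP CONSTRAINT my_table_c1_pkey;  -- hypothetical constraint name
-- ... bulk INSERTs ...
ALTER TABLE my_table ADD CONSTRAINT my_table_c1_pkey PRIMARY KEY (c1);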

To find out which table constraints exist on a table, query information_schema.table_constraints, information_schema.referential_constraints and/or information_schema.check_constraints.
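
For example, to list the constraints on my_table (assuming the standard information_schema column names constraint_name, table_name and constraint_type):

SELECT constraint_name, constraint_type
  FROM information_schema.table_constraints
 WHERE table_name = 'my_table';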

Note: NOT NULL column constraints do not need to be removed, as they can be checked very quickly.

Performance comparison

A small unscientific benchmark loading a table with 45529 rows and 109 columns, mostly varchar and decimal:

EXPERIMENT                            AVERAGE   SPEEDUP
INSERT without transaction            413.00s     0.10×
INSERT default_pipe                    43.16s     1.00×
PREPARE INSERT default_pipe            24.35s     1.77×
INSERT minimal_pipe                    22.27s     1.94×
PREPARE INSERT minimal_pipe            14.86s     2.90×
PREPARE INSERT minimal_pipe, batch      9.44s     4.57×
COPY INTO                               0.73s     59.4×

The final row is for comparison with the next data loading method we'll describe, loading CSV data using COPY INTO.