Every SQL-based application understands INSERT INTO. For small amounts of data its performance is sufficient, but for large numbers of rows you should be aware of several ways to significantly increase performance. In a nutshell:
Disable autocommit mode or wrap the INSERT statements in a single transaction
Use prepared insert statements
When inserting programmatically, use batch processing
Execute SET sys.optimizer = 'minimal_pipe'
Temporarily remove table constraints (PRIMARY KEY, UNIQUE, FOREIGN KEY) and restore them after the INSERTs
The standard way to insert data into a table is to use the SQL INSERT INTO statement.
INSERT INTO my_table(c1, c2, c3) VALUES (1, 'ab', 3);
If table my_table has three columns, this statement inserts a new row containing the given values. If the table has more columns, the additional columns are set to NULL or their default value.
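For example, a minimal sketch of this behavior, assuming columns c2 and c3 of my_table are nullable or have a declared DEFAULT (an assumption, since the table definition is not shown):
INSERT INTO my_table(c1) VALUES (7);
-- c2 and c3 are not listed in the column list, so they are set to NULL or their declared DEFAULT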
In MonetDB it is also possible to insert multiple rows using a single insert statement:
INSERT INTO my_table(c1, c2, c3) VALUES (1, 'ab', 3), (4, 'xy', 6);
This insert statement adds two rows. It is not recommended to have more than a few dozen rows per INSERT statement, as the parsing overhead quickly becomes a bottleneck.
By default, MonetDB runs in autocommit mode which means that every statement
executes as a separate transaction. This adds considerable overhead if the
amount of work per statement is very small, such as with an INSERT statement.
Wrapping the sequence of inserts between a START TRANSACTION statement and a COMMIT makes it a single transaction, so the overhead is reduced.
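For example, a minimal sketch using the my_table example from above (the actual INSERT statements are whatever your application generates):
START TRANSACTION;
INSERT INTO my_table(c1, c2, c3) VALUES (1, 'ab', 3);
INSERT INTO my_table(c1, c2, c3) VALUES (4, 'xy', 6);
-- ... many more INSERT statements ...
COMMIT;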
As an alternative to START TRANSACTION you can also disable autocommit mode at the connection level. With mclient, this is done using the startup flag -a. With JDBC (Java), call connection.setAutoCommit(false). With ODBC, call SQLSetConnectAttr(hdbc, SQL_ATTR_AUTOCOMMIT, FALSE, SQL_IS_UINTEGER). With pymonetdb, autocommit is already disabled by default.
If your language binding supports it, you should use prepared statements to reduce the per-insert parsing and verification overhead. A prepared statement allows you to provide the INSERT statement once, with question marks (?) used as placeholders for the values to insert. After preparing, you repeatedly bind the placeholders to new values and execute the prepared statement.
Some language bindings provide support for prepared statements. For example, in Java:
connection.setAutoCommit(false);
String sql = "INSERT INTO my_table(c1, c2, c3) VALUES (?, ?, ?)";
PreparedStatement pstmt = connection.prepareStatement(sql);
int maxrows = 100000;
for (int n = 1; n <= maxrows; n++) {
// for each row:
pstmt.setInt(1, /* an int value */);
pstmt.setString(2, /* a String value */);
pstmt.setLong(3, /* a bigint value */);
pstmt.executeUpdate();
}
connection.commit();
Or in C using the ODBC API:
SQLSetConnectAttr(hdbc, SQL_ATTR_AUTOCOMMIT, FALSE, SQL_IS_UINTEGER);
SQLCHAR *sql = (SQLCHAR *) "INSERT INTO my_table(c1, c2, c3) VALUES (?, ?, ?)";
SQLPrepare(hstmt, sql, SQL_NTS);
int maxrows = 100000;
for (int n = 1; n <= maxrows; n++) {
// for each row:
SQLBindParameter(hstmt, 1, SQL_PARAM_INPUT, ...);
SQLBindParameter(hstmt, 2, SQL_PARAM_INPUT, ...);
SQLBindParameter(hstmt, 3, SQL_PARAM_INPUT, ...);
SQLExecute(hstmt);
}
SQLEndTran(SQL_HANDLE_DBC, hdbc, SQL_COMMIT);
Some language bindings provide support for batch processing. This combines multiple client statements, such as inserts, into a batch and sends the complete batch to the server in one go. This saves time by reducing the number of communication round trips between client and server. For example, in Java:
connection.setAutoCommit(false);
String sql = "INSERT INTO my_table(c1, c2, c3) VALUES (?, ?, ?)";
PreparedStatement pstmt = connection.prepareStatement(sql);
int maxrows = 100000;
int batchsize = 2500;
int batches = maxrows / batchsize;
for (int b = 1; b <= batches; b++) {
for (int n = 1; n <= batchsize; n++) {
// for each row:
pstmt.setInt(1, /* an int value */);
pstmt.setString(2, /* a String value */);
pstmt.setLong(3, /* a bigint value */);
pstmt.addBatch();
}
pstmt.executeBatch();
}
connection.commit();
Before execution, every SQL statement is translated into an internal language called MAL. After translation and before execution, the system analyzes the MAL plan and tries to improve it. This is very effective for SELECT queries, especially complex ones, but there is not much to optimize for INSERT statements.
If you're executing many INSERTs, switching to the minimal MAL optimizer pipeline can make a big difference. To do so, execute the SQL statement:
SET sys.optimizer = 'minimal_pipe';
This setting applies to the current connection/session only, so after completing the INSERTs you should restore the default using the SQL statement:
SET sys.optimizer = 'default_pipe';
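Putting the two previous techniques together, a session might look like the following minimal sketch (the INSERT statements are placeholders for your own):
SET sys.optimizer = 'minimal_pipe';
START TRANSACTION;
INSERT INTO my_table(c1, c2, c3) VALUES (1, 'ab', 3);
-- ... many more INSERT statements ...
COMMIT;
SET sys.optimizer = 'default_pipe';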
Table constraints such as PRIMARY KEY, UNIQUE and FOREIGN KEY are verified for every inserted row, and the internal hash/join indices that back them must be updated for every inserted row as well. This introduces considerable overhead and becomes increasingly time consuming when inserting many rows.
Because checking table constraints in bulk is much more efficient than checking them row by row, it is often faster to drop the table constraints before the insertions using ALTER TABLE ... DROP CONSTRAINT ... and recreate them afterwards using ALTER TABLE ... ADD CONSTRAINT .... When the constraints are re-added, they are checked in bulk.
Note: NOT NULL column constraints do not need to be removed.
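For example, a minimal sketch, assuming my_table has a primary key on c1 with the (hypothetical) constraint name my_table_c1_pkey:
ALTER TABLE my_table DROP CONSTRAINT my_table_c1_pkey;
-- perform the bulk INSERTs here
ALTER TABLE my_table ADD CONSTRAINT my_table_c1_pkey PRIMARY KEY (c1);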
A small, unscientific benchmark loading a table with 45529 rows and 109 columns (mostly varchar and decimal) gives the following results; the speedup is relative to the plain INSERT with default_pipe:
EXPERIMENT | AVERAGE TIME | SPEEDUP |
---|---|---|
INSERT without transaction | 413.00s | 0.10× |
INSERT default_pipe | 43.16s | 1.00× |
PREPARE INSERT default_pipe | 24.35s | 1.77× |
INSERT minimal_pipe | 22.27s | 1.94× |
PREPARE INSERT minimal_pipe | 14.86s | 2.90× |
PREPARE INSERT minimal_pipe, batch | 9.44s | 4.57× |
COPY INTO | 0.73s | 59.4× |
The final row is for comparison with the next data loading method we'll describe, loading CSV data using COPY INTO.