The primary purpose of a database is to store and manage data. Without data, a database is not very useful. As such,
the first thing you will do when you launch a database is to load your data into the database. In MonetDB, the primary
way of loading large amounts of data into your database is using
the COPY INTO
statement.
Using the COPY INTO statement, you can quickly load large CSV files into your database.
However, the COPY INTO statement has a number of caveats. It only supports CSV and FWF files, and it requires you to specify the columns of the table prior to loading the data. If you want to load data that is stored in a different format, you first have to convert the data to a CSV file. This can take long time, requires a large amount of disk space and is simply inconvenient.
To solve these issues, we are introducing custom loader functions into MonetDB. Rather than having to rely on the built-in functionality of MonetDB, you can now easily write your own function to efficiently load your data into the database.
Loader functions can currently only be written in Python, although we might provide extensions for loaders written in different languages in the future. They are created like user-defined functions. However, unlike user-defined functions, you do not need to specify the return types of loaders. The return types are determined automatically based on the output provided by the function.
Another difference between user-defined functions and loader functions is that you do not need to load all the data at once. Instead, the data can be passed to MonetDB incrementally. This is done using the emit.emit() function. This function can be called to hand over a set of rows to MonetDB. This way, you can load your data into MonetDB even if it does not fit into main memory.
The emit function expects a dictionary as parameter. This dictionary should contain the column names as keys, and the values of the columns as values.
Let’s start by creating a simple loader function.
LOADER myloader() LANGUAGE PYTHON {
_emit.emit( { 'a' : 42, 'd' : 1})
};
In this function, we emit a single row into a table with two columns, called ‘a’ and ’d’. We can now create a table using this loader function with the following syntax.
TABLE mytable FROM LOADER myloader();
SELECT * FROM mytable;
a | d |
---|---|
42 | 1 |
We can also use the loader function to append to existing tables by using a syntax similar to COPY INTO . We can use the same loader function to append to the table we just created ( mytable )
COPY LOADER INTO mytable FROM myloader();
SELECT * FROM mytable;
a | d |
---|---|
42 | 1 |
42 | 1 |
We can also emit multiple rows at once using the emit()
function by using lists or NumPy arrays instead of scalar values.
Note that each of the lists must have the same amount of entries. If you attempt to emit lists with a different amount
of entries then the function will throw an error.
LOADER array_loader() LANGUAGE PYTHON {
_emit.emit( { 'a' : [1,2,3], 'b' : numpy.arange(3) * 2})
};
TABLE array_table FROM LOADER array_loader();
SELECT * FROM array_table;
a | b |
---|---|
1 | 0 |
2 | 2 |
3 | 4 |
Of course, static loader functions are not very useful. A more useful example would be loading data from a file into the database. For this example, we will load a JSON file into MonetDB. First, we create a simple loader function that takes as parameter the location of the JSON file. We then load the file using the ‘json’ library, and return the parsed dictionary.
LOADER json_loader(filename STRING ) LANGUAGE PYTHON {
import json
f = open(filename)
_emit.emit( json .load(f))
f.close()
};
We can then use the newly created json_loader function to load json files into the database.
TABLE tbl FROM LOADER json_loader('/local/raasveld/dataset.json' );
SELECT * FROM tbl;
a | c | b |
---|---|---|
1 | 100 | 10 |
2 | 200 | 20 |
3 | 300 | 30 |
4 | 400 | 40 |
5 | 500 | 50 |
When appending to a table, you do not need to append to all columns. Instead, you can append to only a subset of the columns. The missing columns will be added as missing (NULL) values. In the following example, we will use a loader function to append a single row to the table we just created. However, we will only emit two columns (‘a’ and ‘b’), skipping the third column (‘c’).
LOADER partial_loader() LANGUAGE PYTHON {
_emit.emit( { 'a' : 6, 'b' : 60})
};
COPY LOADER INTO tbl FROM partial_loader();
SELECT * FROM tbl;
a | c | b |
---|---|---|
1 | 100 | 10 |
2 | 200 | 20 |
3 | 300 | 30 |
4 | 400 | 40 |
5 | 500 | 50 |
6 | null | 60 |
The same applies when constructing tables. The amount of return columns is automatically determined by the emitted values, if multiple emit statements occur with different column values then NULL values will be added when missing values occur. In the following example, we emit four rows, but only one emit statement supplies all the columns. The rest of the emit statements will result in rows with missing values.
LOADER table_loader() LANGUAGE PYTHON {
_emit.emit( { 'a' : 1 })
_emit.emit( { 'b' : 10 })
_emit.emit( { 'c' : 100 })
_emit.emit( { 'a' : 2, 'b' : 20, 'c' : 200 })
};
TABLE nulltbl FROM LOADER table_loader();
SELECT * FROM nulltbl;
a | b | c |
---|---|---|
1 | null | null |
null | 10 | null |
null | null | 100 |
2 | 20 | 200 |
Currently, loader functions are available in the default branch of the MonetDB repository. You can find instructions for installing MonetDB from source here (under the Installation section). Loader functions will be included with the next feature release of MonetDB as well.
Mark Raasveldt is a PhD student at the CWI. If you have any questions regarding this blogpost, you can contact him at m.raasveldt@cwi.nl.