As of the latest MonetDB release (March 2018), MonetDB has support for JIT C and C++ UDFs. MonetDB has supported regular C/C++ UDFs for a long time. However, these are quite complicated to write and compile, as they require a lot of knowledge about the internal structure of MonetDB and require a lot of steps just to write a simple function. These problems are solved by the JIT UDFs, which can be created using a single SQL statement and don’t require any internal knowledge of MonetDB data structures while still being very fast.
JIT UDFs can be created using the regular CREATE FUNCTION syntax, but with the language set to either C or C++. Below is an example of a JIT UDF that multiplies the numbers of an input column by two.
CREATE FUNCTION multiply(input INTEGER)
RETURNS INTEGER
LANGUAGE C {
result->initialize(result, input.count);
// loop over the input values
for(size_t i = 0; i < input.count; i++) {
if (input.is_null(input.data[i])) {
// handle NULL values
result->data[i] = result->null_value;
} else {
// handle regular values
result->data[i] = input.data[i] * 2;
}
}
};
CREATE TABLE integers(i INTEGER);
INSERT INTO integers VALUES (1), (2), (NULL), (3), (4);
SELECT i, multiply(i) FROM integers;
i | L3 |
---|---|
1 | 2 |
2 | 4 |
null | null |
3 | 6 |
4 | 8 |
This function shows the general shape of a JIT UDF. First is the header, that specifies the input types and return types as well as the function name and the language to use (this can be either C or C++ for JIT UDFs, but Python and R UDFs are also supported). Afterwards, the main function body is provided.
Both the input columns and the result columns are stored as a simple struct. The exact types within the struct depend on the types passed to the function, however, the basic structure is always identical. For the INTEGER type, the structure looks as follows.
struct cudf_data_struct_int {
int *data;
size_t count;
int null_value;
int (*is_null)(int value);
void (*initialize)(void *self, size_t count);
double scale;
};
Understanding this structure reveals how we should build our JIT UDFs. First, we call the initialize function on the result object to initialize our result array. Then, we loop over the input array using its count property. For every element in the input array, we then either multiply it by two (in case it is not a NULL), or set the result to NULL (in case it is one). That is the basis of creating a JIT UDF.
It is also possible to include header files in these UDFs. Simply add an #include statement to the top of your program, and the file will be included in your UDF. For example, suppose we want to make use of string.h to operate on character strings. In the following UDF, we will create a function that reverses the input strings.
CREATE FUNCTION string_reverse(input STRING)
RETURNS STRING
LANGUAGE C {
#include <string.h>
size_t i, j;
result->initialize(result, input.count);
for(i = 0; i < input.count; i++) {
if (input.is_null(input.data[i])) {
// handle NULL values
result->data[i] = result->null_value;
} else {
// reverse the input string
char* input_string = input.data[i];
size_t len = strlen(input_string);
result->data[i] = malloc(len + 1);
for(j = 0; j < len; j++) {
result->data[i][j] = input_string[len - j - 1];
}
result->data[i][len] = '\0';
}
}
};
CREATE TABLE strings(i STRING);
INSERT INTO strings VALUES ('Hello'), ('World'), (NULL), ('olleH'), ('dlroW');
SELECT i, string_reverse(i) FROM strings;
i | L3 |
---|---|
Hello | olleH |
World | dlroW |
null | null |
olleH | Hello |
dlroW | World |
Note that we call the malloc function to allocate space for the output strings, and that we never call the free function. Any memory allocated using the malloc function is automatically deallocated by the server. Calls to free are not necessary!
It is possible you might want to use an external library as well. In this case, you might need to set certain compilation flags. This can be done by including two special PRAGMA commands: #PRAGMA CFLAGS [CFLAGS] passes the specified CFLAGS to the compiler. #PRAGMA LDFLAGS [LDFLAGS] passes the specified LDFLAGS to the linker. Note that the library must reside on the server machine, not on the client machine.
As an example, suppose we want to use the Snappy library to compress a column of blobs. We can add the include directory to our program using the #PRAGMA CFLAGS. Then, we can add the library path and the snappy library to our program using the #PRAGMA LDFLAGS command. After this is done, we can include the snappy library and use its functons inside of our UDF.
CREATE FUNCTION snappy_compress_blob(input BLOB)
RETURNS BLOB
LANGUAGE C {
#pragma CFLAGS -I/opt/local/include
#pragma LDFLAGS -L/opt/local/lib -lsnappy
#include <snappy-c.h>
result->initialize(result, input.count);
for(size_t i = 0; i < input.count; i++) {
if (input.is_null(input.data[i])) {
result->data[i] = result->null_value;
} else {
size_t result_size =
snappy_max_compressed_length(input.data[i].size);
char* result_blob = malloc(result_size);
if (snappy_compress(
input.data[i].data,
input.data[i].size,
result_blob,
&result_size) !=
SNAPPY_OK) {
return "color:firebrick;">"Snappy failed to compress!";
}
result->data[i].data = result_blob;
result->data[i].size = result_size;
}
}
};
CREATE TABLE blobs(b BLOB);
INSERT INTO blobs VALUES (blob '00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000'), (NULL);
SELECT b, snappy_compress_blob(b) AS compressed FROM blobs;
b | compressed |
---|---|
0000000000000000000000000000000000000000000000 | 340000CA0100 |
null | null |
In this function, we also introduce two new concepts. First, a string can be returned from the function to indicate an error message. If the function returns a non-NULL string, the function will return that character string as an error message to the client. Second, we can see that the C type used for the BLOB type is a special C structure. Similar special structures exist for DATE, TIME and TIMESTAMP types. These special types are shown below.
typedef struct {
unsigned char day;
unsigned char month;
int year;
} data_date;
typedef struct {
unsigned int ms;
unsigned char seconds;
unsigned char minutes;
unsigned char hours;
} data_time;
typedef struct {
cudf_data_date date;
cudf_data_time time;
} data_timestamp;
typedef struct {
size_t size;
void* data;
} data_blob;
Below is a conversion table of SQL types to C types.
SQL Type | C Type |
---|---|
BOOLEAN | int8_t |
TINYINT | int8_t |
SMALLINT | int16_t |
INTEGER | int32_t |
BIGINT | int64_t |
REAL | float |
DOUBLE | double |
STRING | char* |
DATE | data_date |
TIME | data_time |
TIMESTAMP | data_timestamp |
BLOB | data_blob |
OID | size_t |
JIT C UDFs can be used as table-producing functions as well. This can be done by providing a TABLE as return type instead of a single SQL type. Below is an example of a table-producing function. Note that instead of the output column being called result, there is now an output column for each of the columns in the resulting table.
CREATE FUNCTION create_table(inp INTEGER)
RETURNS TABLE (i INTEGER, d DOUBLE)
LANGUAGE C
{
#include <math.h>
size_t j;
// we use a constant as input that signifies the amount of rows in the table
size_t count = inp.data[0];
// initialize the two output columns of the table
i->initialize(i, count);
d->initialize(d, count);
// now set the output
for(j = 0; j < count; j++) {
i->data[j] = j;
d->data[j] = round(j > 0 ? 42.0 / j : 42.0);
}
};
SELECT * FROM create_table(5) AS R;
i | d |
---|---|
0 | 42 |
1 | 42 |
2 | 21 |
3 | 14 |
4 | 11 |
Likewise, JIT functions can be used to implement (grouped) aggregate functions as well. For this, the CREATE AGGREGATE function syntax should be used. Aggregate functions take an additional input column called aggr_group. This is a column of type size_t where each entry indicates which group a specific row belongs to. Below is an example of an aggregate function that computes the sum for each of the input groups.
CREATE AGGREGATE jit_sum(input INTEGER)
RETURNS BIGINT
LANGUAGE C {
// initialize one aggregate per group
result->initialize(result, aggr_group.count);
// zero initialize the sums
memset(result->data, 0, result->count * sizeof(result->null_value));
// gather the sums for each of the groups
for(size_t i = 0; i < input.count; i++) {
result->data[aggr_group.data[i]] += input.data[i];
}
};
CREATE TABLE vals(grp INTEGER, value INTEGER);
INSERT INTO vals VALUES (1, 100), (2, 200), (1, 50), (2, 300);
SELECT grp, jit_sum(value) FROM vals GROUP BY grp;
grp | L3 |
---|---|
1 | 150 |
2 | 500 |
JIT UDFs also support the creation of C++ functions. Instead of using LANGUAGE C to create the function, LANGUAGE CPP should be used instead.
CREATE FUNCTION cpp_sort(input INTEGER)
RETURNS INTEGER
LANGUAGE CPP {
#include <vector>
#include <algorithm>
std::vector<int> elements;
for(size_t i = 0; i < input.count; i++) {
elements.push_back(input.data[i]);
}
std::sort(elements.begin(), elements.end());
result->initialize(result, input.count);
for(size_t i = 0; i < input.count; i++) {
result->data[i] = elements[i];
}
};
CREATE TABLE integers(i INTEGER);
INSERT INTO integers VALUES (3), (4), (1), (2), (5);
SELECT i, cpp_sort(i) FROM integers;
i | L3 |
---|---|
3 | 1 |
4 | 2 |
1 | 3 |
2 | 4 |
5 | 5 |
Debugging UDFs is challenging because they run inside the server process. However, JIT UDFs can be easily debugging using a normal debugger if you can stop the server process and attach a debugger to it. By passing the command line parameter –set capi_use_debug=true to mserver5, JIT UDFs will be compiled in debug mode. A standard C/C++ debugger can then be attached to the server process. Set a breakpoint on the name of the UDF as you have created it (e.g. cpp_sort for the previously mentioned function). Then the debugger will stop when it reaches your UDF. There will be some wrapper code visible, like below, but you can then use the debugger to step through the code and debug it.
65 extern "C"
66 const char* cpp_sort(void** __inputs, void** __outputs, malloc_function_ptr malloc) {
67 struct cudf_data_struct_int inp = *((struct cudf_data_struct_int*)__inputs[0]);
> 68 struct cudf_data_struct_int* result = ((struct cudf_data_struct_int*)__outputs[0]);
To use JIT C/C++ UDFs, the mserver5 must be started with the –set embedded_c=true flag. These functions can only be used on OSX and Linux as of now. They are not supported on Windows. In addition, there must be a valid C or C++ compiler on your system, pointed to by either the aliases cc, c++ or specified using the additional options –set capi_cc=[C_COMPILER], –set capi_cpp=[CPP_COMPILER].
Mark Raasveldt is a PhD student at the Database Architectures group at the CWI. You can reach him at the email address m.raasveldt@cwi.nl.