MonetDB has supported regular C/C++ UDFs for a long time. However, these are quite complicated to write and compile, as they require a lot of knowledge about the internal structure of MonetDB and require a lot of steps just to write a simple function. These problems are solved by the JIT UDFs, which can be created using a single SQL statement and don't require any internal knowledge of MonetDB data structures while still being very fast.
JIT UDFs can be created using the regular CREATE FUNCTION
syntax, but with the language set to
either C or C++. Below is an example of a JIT UDF that multiplies the numbers of an input column by two.
CREATE FUNCTION multiply(input INTEGER)
RETURNS INTEGER
LANGUAGE C {
result->initialize(result, input.count);
// loop over the input values
for(size_t i = 0; i < input.count; i++) {
if (input.is_null(input.data[i])) {
// handle NULL values
result->data[i] = result->null_value;
} else {
// handle regular values
result->data[i] = input.data[i] * 2;
}
}
};
CREATE TABLE integers(i INTEGER);
INSERT INTO integers VALUES (1), (2), (NULL), (3), (4);
SELECT i, multiply(i) FROM integers;
i | L3 |
---|---|
1 | 2 |
2 | 4 |
null | null |
3 | 6 |
4 | 8 |
This function shows the general shape of a JIT UDF. First is the header, that specifies the input types and return types as well as the function name and the language to use (this can be either C or C++ for JIT UDFs, but Python and R UDFs are also supported). Afterwards, the main function body is provided.
Both the input columns and the result columns are stored as a simple struct. The exact types within the struct depend on the types passed to the function, however, the basic structure is always identical. For the INTEGER type, the structure looks as follows.
struct cudf_data_struct_int {
int *data;
size_t count;
int null_value;
int (*is_null)(int value);
void (*initialize)(void *self, size_t count);
double scale;
};
Understanding this structure reveals how we should build our JIT UDFs. First, we call the initialize function on the result object to initialize our result array. Then, we loop over the input array using its count property. For every element in the input array, we then either multiply it by two (in case it is not a NULL), or set the result to NULL (in case it is one). That is the basis of creating a JIT UDF.
It is also possible to include header files in these UDFs. Simply add an #include statement to the top of your program, and the file will be included in your UDF. For example, suppose we want to make use of string.h to operate on character strings. In the following UDF, we will create a function that reverses the input strings.
CREATE FUNCTION string_reverse(input STRING)
RETURNS STRING
LANGUAGE C {
#include <string.h>
size_t i, j;
result->initialize(result, input.count);
for(i = 0; i < input.count; i++) {
if (input.is_null(input.data[i])) {
// handle NULL values
result->data[i] = result->null_value;
} else {
// reverse the input string
char* input_string = input.data[i];
size_t len = strlen(input_string);
result->data[i] = malloc(len + 1);
for(j = 0; j < len; j++) {
result->data[i][j] = input_string[len - j - 1];
}
result->data[i][len] = '\0';
}
}
};
CREATE TABLE strings(i STRING);
INSERT INTO strings VALUES ('Hello'),('World'),
(NULL), ('olleH'), ('dlroW');
SELECT i, string_reverse(i) FROM strings;
i | L3 |
---|---|
Hello | olleH |
World | dlroW |
null | null |
olleH | Hello |
dlroW | World |
Note that we call the malloc
to allocate space for the output strings, and that we never call
the free function. Any memory allocated using malloc
is automatically deallocated by the server.
Calls to free are not necessary!
It is possible you might want to use an external library as well. In this case, you might need to set certain
compilation flags. This can be done by including two special PRAGMA commands: #pragma CFLAGS [CFLAGS]
passes
the specified CFLAGS to the compiler, #pragma LDFLAGS [LDFLAGS]
passes the specified LDFLAGS to the linker.
Note that the library must reside on the server machine, not on the client machine.
CREATE FUNCTION snappy_compress_blob(input BLOB)
RETURNS BLOB
LANGUAGE C {
#pragma CFLAGS -I/opt/local/include
#pragma LDFLAGS -L/opt/local/lib -lsnappy
#include <snappy-c.h>
result->initialize(result, input.count);
for(size_t i = 0; i < input.count; i++) {
if (input.is_null(input.data[i])) {
result->data[i] = result->null_value;
} else {
size_t result_size =
snappy_max_compressed_length(input.data[i].size);
char* result_blob = malloc(result_size);
if (snappy_compress(
input.data[i].data,
input.data[i].size,
result_blob,
&result_size) !=
SNAPPY_OK) {
return "Snappy failed to compress!";
}
result->data[i].data = result_blob;
result->data[i].size = result_size;
}
}
};
CREATE TABLE blobs(b BLOB);
INSERT INTO blobs VALUES (blob '0000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000'), (NULL);
SELECT b, snappy_compress_blob(b) AS compressed FROM blobs;
b | compressed |
---|---|
0000000000000000000000000000000000000000000000 | 340000CA0100 |
null | null |
In this function, we also introduce two new concepts. First, a string can be returned from the function to indicate an error message. If the function returns a non-NULL string, the function will return that character string as an error message to the client. Second, we can see that the C type used for the BLOB type is a special C structure. Similar special structures exist for DATE, TIME and TIMESTAMP types. These special types are shown below.
typedef struct {
unsigned char day;
unsigned char month;
int year;
} data_date;
typedef struct {
unsigned int ms;
unsigned char seconds;
unsigned char minutes;
unsigned char hours;
} data_time;
typedef struct {
cudf_data_date date;
cudf_data_time time;
} data_timestamp;
typedef struct {
size_t size;
void* data;
} data_blob;
SQL Type | C Type |
---|---|
BOOLEAN | int8_t |
TINYINT | int8_t |
SMALLINT | int16_t |
INTEGER | int32_t |
BIGINT | int64_t |
REAL | float |
DOUBLE | double |
STRING | char * |
DATE | data_date |
TIME | data_time |
TIMESTAMP | data_timestamp |
BLOB | data_blob |
OID | size_t |
JIT C UDFs can be used as table-producing functions as well. This can be done by providing a TABLE as return type instead of a single SQL type. Below is an example of a table-producing function. Note that instead of the output column being called result, there is now an output column for each of the columns in the resulting table.
CREATE FUNCTION create_table(inp INTEGER)
RETURNS TABLE (i INTEGER, d DOUBLE)
LANGUAGE C
{
#include <math.h>
size_t j;
// we use a constant as input that signifies the amount of rows in the table
size_t count = inp.data[0];
// initialize the two output columns of the table
i->initialize(i, count);
d->initialize(d, count);
// now set the output
for(j = 0; j < count; j++) {
i->data[j] = j;
d->data[j] = round(j > 0 ? 42.0 / j : 42.0);
}
};
SELECT * FROM create_table(5) AS R;
i | d |
---|---|
0 | 42 |
1 | 42 |
2 | 21 |
3 | 14 |
4 | 11 |
Likewise, JIT functions can be used to implement (grouped) aggregate functions as well.
For this, the CREATE AGGREGATE
function syntax should be used.
Aggregate functions take an additional input called aggr_group. This holds an array of type size_t where
each entry indicates which group a specific row belongs to. Below is an example of an aggregate function
that computes the sum for each of the input groups.
CREATE AGGREGATE jit_sum(input INTEGER)t
RETURNS BIGINT
LANGUAGE C {
// initialize one aggregate per group
result->initialize(result, aggr_group.count);
// zero initialize the sums
memset(result->data, 0, result->count * sizeof(result->null_value));
// gather the sums for each of the groups
for(size_t i = 0; i < input.count; i++) {
result->data[aggr_group.data[i]] += input.data[i];
}
};
CREATE TABLE vals(grp INTEGER, value INTEGER);
INSERT INTO vals VALUES (1, 100), (2, 200), (1, 50), (2, 300);
SELECT grp, jit_sum(value) FROM vals GROUP BY grp;
grp | L3 |
---|---|
1 | 150 |
2 | 500 |
Note that in this example we have four rows and two groups. In the for-loop,
aggr_group.data
is indexed using i
which runs from 0 to input.count
which
is 4. However as you can see from the initialization of result
,
aggr_group.count
counts not the number of elements of aggr_group.data
, but
the number of groups which is 2.
JIT UDFs also support the creation of C++ functions. Instead of using LANGUAGE C to create the function, LANGUAGE CPP should be used instead.
CREATE FUNCTION cpp_sort(input INTEGER)
RETURNS INTEGER
LANGUAGE CPP {
#include <vector>
#include <algorithm>
std::vector<int> elements;
for(size_t i = 0; i < input.count; i++) {
elements.push_back(input.data[i]);
}
std::sort(elements.begin(), elements.end());
result->initialize(result, input.count);
for(size_t i = 0; i < input.count; i++) {
result->data[i] = elements[i];
}
};
CREATE TABLE integers(i INTEGER);
INSERT INTO integers VALUES (3), (4), (1), (2), (5);
SELECT i, cpp_sort(i) FROM integers;
i | L3 |
---|---|
3 | 1 |
4 | 2 |
1 | 3 |
2 | 4 |
5 | 5 |
Debugging UDFs is challenging because they run inside the server process.
However, JIT UDFs can be easily debugging using a normal debugger if you can stop the server process
and attach a debugger to it. By passing the command line parameter --set capi_use_debug=true
to mserver5, JIT UDFs
will be compiled in debug mode. A standard C/C++ debugger can then be attached to the server process.
Set a breakpoint on the name of the UDF as you have created it (e.g. cpp_sort for the previously mentioned function).
Then the debugger will stop when it reaches your UDF. There will be some wrapper code visible, like below, but you can
then use the debugger to step through the code and debug it.