JIT C/C++ UDFs in MonetDB

As of the latest MonetDB release (March 2018), MonetDB has support for JIT C and C++ UDFs. MonetDB has supported regular C/C++ UDFs for a long time. However, these are quite complicated to write and compile, as they require a lot of knowledge about the internal structure of MonetDB and require a lot of steps just to write a simple function. These problems are solved by the JIT UDFs, which can be created using a single SQL statement and don’t require any internal knowledge of MonetDB data structures while still being very fast.

Standard Scalar Functions

JIT UDFs can be created using the regular CREATE FUNCTION syntax, but with the language set to either C or C++. Below is an example of a JIT UDF that multiplies the numbers of an input column by two.

CREATE FUNCTION multiply(input INTEGER)
RETURNS INTEGER
LANGUAGE C {
    result->initialize(result, input.count);
    // loop over the input values
    for(size_t i = 0; i < input.count; i++) {
        if (input.is_null(input.data[i])) {
           // handle NULL values
            result->data[i] = result->null_value;
        } else {
            // handle regular values
            result->data[i] = input.data[i] * 2;
        }
    }
};

CREATE TABLE integers(i INTEGER);
INSERT INTO integers VALUES (1), (2), (NULL), (3), (4);

SELECT i, multiply(i) FROM integers;
iL3
12
24
nullnull
36
48

This function shows the general shape of a JIT UDF. First is the header, that specifies the input types and return types as well as the function name and the language to use (this can be either C or C++ for JIT UDFs, but Python and R UDFs are also supported). Afterwards, the main function body is provided.

Both the input columns and the result columns are stored as a simple struct. The exact types within the struct depend on the types passed to the function, however, the basic structure is always identical. For the INTEGER type, the structure looks as follows.

struct cudf_data_struct_int {
    int *data;
    size_t count;
    int null_value;
    int (*is_null)(int value);
    void (*initialize)(void *self, size_t count);
    double scale;
};
  • data: This field provides a pointer to the underlying data, which is stored as an array of integers in this case.
  • count: This field provides the amount of elements in the array.
  • null_value: This field is used to assign NULL values in the result. It should not be compared against directly to check for null values, because this might give incorrect results for certain types! Instead, the is_null function should be used for this.
  • is_null: This function should be used to check for NULL values in the input columns.
  • initialize: This function is used to initialize the result structures with the desired amount of elements.
  • scale: This field is only relevant for columns of type DECIMAL, it specifies the position of the comma in the decimal value.

Understanding this structure reveals how we should build our JIT UDFs. First, we call the initialize function on the result object to initialize our result array. Then, we loop over the input array using its count property. For every element in the input array, we then either multiply it by two (in case it is not a NULL), or set the result to NULL (in case it is one). That is the basis of creating a JIT UDF.

External Libraries

It is also possible to include header files in these UDFs. Simply add an #include statement to the top of your program, and the file will be included in your UDF. For example, suppose we want to make use of string.h to operate on character strings. In the following UDF, we will create a function that reverses the input strings.

CREATE FUNCTION string_reverse(input STRING)
RETURNS STRING
LANGUAGE C {
#include <string.h>
    size_t i, j;
    result->initialize(result, input.count);
    for(i = 0; i < input.count; i++) {
        if (input.is_null(input.data[i])) {
            // handle NULL values
            result->data[i] = result->null_value;
        } else  {
            // reverse the input string
            char* input_string = input.data[i];
            size_t len = strlen(input_string);
            result->data[i] = malloc(len + 1);
            for(j = 0; j < len; j++) {
                result->data[i][j] = input_string[len - j - 1];
            }
            result->data[i][len] = '\0';
        }
    }
};

CREATE TABLE strings(i STRING);
INSERT INTO strings VALUES ('Hello'), ('World'), (NULL), ('olleH'), ('dlroW');

SELECT i, string_reverse(i) FROM strings;
iL3
HelloolleH
WorlddlroW
nullnull
olleHHello
dlroWWorld

Note that we call the malloc function to allocate space for the output strings, and that we never call the free function. Any memory allocated using the malloc function is automatically deallocated by the server. Calls to free are not necessary!

It is possible you might want to use an external library as well. In this case, you might need to set certain compilation flags. This can be done by including two special PRAGMA commands: #PRAGMA CFLAGS [CFLAGS] passes the specified CFLAGS to the compiler. #PRAGMA LDFLAGS [LDFLAGS] passes the specified LDFLAGS to the linker. Note that the library must reside on the server machine, not on the client machine.

As an example, suppose we want to use the Snappy library to compress a column of blobs. We can add the include directory to our program using the #PRAGMA CFLAGS. Then, we can add the library path and the snappy library to our program using the #PRAGMA LDFLAGS command. After this is done, we can include the snappy library and use its functons inside of our UDF.

CREATE FUNCTION snappy_compress_blob(input BLOB)
RETURNS BLOB
LANGUAGE C {
#pragma CFLAGS -I/opt/local/include
#pragma LDFLAGS -L/opt/local/lib -lsnappy
#include <snappy-c.h>
    result->initialize(result, input.count);
    for(size_t i = 0; i < input.count; i++) {
        if (input.is_null(input.data[i])) {
            result->data[i] = result->null_value;
        } else {
            size_t result_size =
                snappy_max_compressed_length(input.data[i].size);
            char* result_blob = malloc(result_size);
            if (snappy_compress(
                      input.data[i].data,
                      input.data[i].size,
                      result_blob,
                      &result_size) !=
                      SNAPPY_OK) {
                return "color:firebrick;">"Snappy failed to compress!";
            }
            result->data[i].data = result_blob;
            result->data[i].size = result_size;
        }
    }
};

CREATE TABLE blobs(b BLOB);
INSERT INTO blobs VALUES (blob '00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000'), (NULL);

SELECT b, snappy_compress_blob(b) AS compressed FROM blobs;
bcompressed
0000000000000000000000000000000000000000000000340000CA0100
nullnull

Data Types

In this function, we also introduce two new concepts. First, a string can be returned from the function to indicate an error message. If the function returns a non-NULL string, the function will return that character string as an error message to the client. Second, we can see that the C type used for the BLOB type is a special C structure. Similar special structures exist for DATE, TIME and TIMESTAMP types. These special types are shown below.

typedef struct {
    unsigned char day;
    unsigned char month;
    int year;
} data_date;

typedef struct {
    unsigned int ms;
    unsigned char seconds;
    unsigned char minutes;
    unsigned char hours;
} data_time;

typedef struct {
    cudf_data_date date;
    cudf_data_time time;
} data_timestamp;

typedef struct {
    size_t size;
    void* data;
} data_blob;

Below is a conversion table of SQL types to C types.

SQL TypeC Type
BOOLEANint8_t
TINYINTint8_t
SMALLINTint16_t
INTEGERint32_t
BIGINTint64_t
REALfloat
DOUBLEdouble
STRINGchar*
DATEdata_date
TIMEdata_time
TIMESTAMPdata_timestamp
BLOBdata_blob
OIDsize_t

Table Producing Functions

JIT C UDFs can be used as table-producing functions as well. This can be done by providing a TABLE as return type instead of a single SQL type. Below is an example of a table-producing function. Note that instead of the output column being called result, there is now an output column for each of the columns in the resulting table.

CREATE FUNCTION create_table(inp INTEGER)
RETURNS TABLE (i INTEGER, d DOUBLE)
LANGUAGE C
{
#include <math.h>
    size_t j;
    // we use a constant as input that signifies the amount of rows in the table
    size_t count = inp.data[0];
    // initialize the two output columns of the table
    i->initialize(i, count);
    d->initialize(d, count);
    // now set the output
    for(j = 0; j < count; j++) {
        i->data[j] = j;
        d->data[j] = round(j > 0 ? 42.0 / j : 42.0);
    }
};

SELECT * FROM create_table(5) AS R;
id
042
142
221
314
411

Aggregate Functions

Likewise, JIT functions can be used to implement (grouped) aggregate functions as well. For this, the CREATE AGGREGATE function syntax should be used. Aggregate functions take an additional input column called aggr_group. This is a column of type size_t where each entry indicates which group a specific row belongs to. Below is an example of an aggregate function that computes the sum for each of the input groups.

CREATE AGGREGATE jit_sum(input INTEGER)
RETURNS BIGINT
LANGUAGE C {
    // initialize one aggregate per group
    result->initialize(result, aggr_group.count);
    // zero initialize the sums
    memset(result->data, 0, result->count * sizeof(result->null_value));
    // gather the sums for each of the groups
    for(size_t i = 0; i < input.count; i++) {
        result->data[aggr_group.data[i]] += input.data[i];
    }
};

CREATE TABLE vals(grp INTEGER, value INTEGER);
INSERT INTO vals VALUES (1, 100), (2, 200), (1, 50), (2, 300);

SELECT grp, jit_sum(value) FROM vals GROUP BY grp;
grpL3
1150
2500

C++ Functions

JIT UDFs also support the creation of C++ functions. Instead of using LANGUAGE C to create the function, LANGUAGE CPP should be used instead.

CREATE FUNCTION cpp_sort(input INTEGER)
RETURNS INTEGER
LANGUAGE CPP {
#include <vector>
#include <algorithm>
    std::vector<int> elements;
    for(size_t i = 0; i < input.count; i++) {
        elements.push_back(input.data[i]);
    }
    std::sort(elements.begin(), elements.end());
    result->initialize(result, input.count);
    for(size_t i = 0; i < input.count; i++) {
        result->data[i] = elements[i];
    }
};

CREATE TABLE integers(i INTEGER);
INSERT INTO integers VALUES (3), (4), (1), (2), (5);

SELECT i, cpp_sort(i) FROM integers;
iL3
31
42
13
24
55

Debugging

Debugging UDFs is challenging because they run inside the server process. However, JIT UDFs can be easily debugging using a normal debugger if you can stop the server process and attach a debugger to it. By passing the command line parameter –set capi_use_debug=true to mserver5, JIT UDFs will be compiled in debug mode. A standard C/C++ debugger can then be attached to the server process. Set a breakpoint on the name of the UDF as you have created it (e.g. cpp_sort for the previously mentioned function). Then the debugger will stop when it reaches your UDF. There will be some wrapper code visible, like below, but you can then use the debugger to step through the code and debug it.

   65   extern "C"
   66   const char* cpp_sort(void** __inputs, void** __outputs, malloc_function_ptr malloc) {
   67       struct cudf_data_struct_int inp = *((struct cudf_data_struct_int*)__inputs[0]);
 > 68       struct cudf_data_struct_int* result = ((struct cudf_data_struct_int*)__outputs[0]);

Usage and Extra Options

To use JIT C/C++ UDFs, the mserver5 must be started with the –set embedded_c=true flag. These functions can only be used on OSX and Linux as of now. They are not supported on Windows. In addition, there must be a valid C or C++ compiler on your system, pointed to by either the aliases cc, c++ or specified using the additional options –set capi_cc=[C_COMPILER], –set capi_cpp=[CPP_COMPILER].

About the Author

Mark Raasveldt is a PhD student at the Database Architectures group at the CWI. You can reach him at the email address m.raasveldt@cwi.nl.