Embedded R in MonetDB

For some time now, we have had the MonetDB.R package on CRAN, which allows you to connect to a MonetDB server from an R session. This package uses the MAPI protocol to talk to MonetDB over a standard socket. In addition to running queries, the package also supports the more high-level dplyr API.

While we worked hard on making this integration as painless as possible, there was one issue that we could not solve: Processing data from the database with R required transferring the relevant data over the socket first. This is fine and “fast enough” for up to – say – several thousand rows, but not for much more. We have had a lot of demand for transferring larger amounts of data from users. Hence, we chose to go in a different direction.

Starting with the Oct2014 release, MonetDB will ship with a feature we call R-Integration. R has some support for running embedded in another application, a fact we leverage heavily. What it does is make R scripts a first class citizen of the SQL world in MonetDB. Users can create ad-hoc functions much in the style of SQL functions. In the remainder of this blog post, we will describe the installation, the usage and some initial performance observations of RIntegration.

Usage

R-Integration works by wrapping R code in an SQL function definition, so that the SQL compiler knows the input and output schema of the function. R functions can be used in various parts of an SQL query, for instance, as table-producing functions (used in the FROM clause), as projection functions (used in the SELECT clause), as filtering functions (used in the WHERE clause) and as aggregation functions (also in the SELECT clause, but together with a GROUP BY clause).

We begin with a table-producing function

CREATE FUNCTION rapi00() RETURNS TABLE (d INTEGER) LANGUAGE R {
    seq(1,10);
};

In this example, there are no parameters going into the function, but it returns a table with a single integer-typed column. Inside the curly braces we see the function body, which is plain R code. This function generates the sequence of numbers from 1 to 10. If we want to call it, we can do so from the SQL prompt:

SELECT d FROM rapi00() AS r WHERE d > 5;

Here, the R function takes the FROM position in the query. Note how we filter the output by only selecting values greater than 5. The output of this function is

sql>SELECT d FROM rapi00() AS r WHERE d > 5;
+------+
| d    |
+======+
|    6 |
|    7 |
|    8 |
|    9 |
|   10 |
+------+

Here is a slightly more complicated example with a constant input parameter and two output columns:

CREATE FUNCTION rapi01(i integer) RETURNS TABLE (i INTEGER, d DOUBLE) LANGUAGE R {
    data.frame(i=seq(1, i), d=42.0);
};

Note how we can use the SQL input parameter inside the R code. It will be automatically translated into the R world. Also note how we construct a to return the two-column result SQL expects. An example invocation and result would be

sql>SELECT i, d FROM rapi01(42) AS R WHERE i > 40;
+------+--------------------------+
| i    | d                        |
+======+==========================+
|   41 |                       42 |
|   42 |                       42 |
+------+--------------------------+

For the remaining examples, let’s create some sample data:

CREATE TABLE rval(i INTEGER);
INSERT INTO rval VALUES (42),(43),(44),(45);
CREATE TABLE rvalg(groupcol INTEGER, datacol INTEGER);
INSERT INTO rvalg VALUES (1, 42), (1, 84), (2, 42), (2, 21);

The following R function will multiply the passed values. We can use it to do some computations on the projected columns in an SQL query:

CREATE FUNCTION rapi02(i INTEGER, j INTEGER) RETURNS INTEGER LANGUAGE R { i * j };

In this example, we will call this function with a constant as the second parameter:

sql>SELECT rapi02(i, 2) FROM rval;
+----------+
| rapi02_i |
+==========+
|       84 |
|       86 |
|       88 |
|       90 |
+----------+

We can also use the R functions in the WHERE clause of the query, where they can decide whether a row is included in the result or not:

CREATE FUNCTION rapi03(i INTEGER, j INTEGER) RETURNS BOOLEAN LANGUAGE R { i > j };

SELECT * FROM rval WHERE rapi03(i, 44);
+------+
| i    |
+======+
|   45 |
+------+

Finally, we can use an R function to calculate an aggregation within the projection of a GROUP BY query:

CREATE AGGREGATE rapi04(val INTEGER) RETURNS DOUBLE LANGUAGE R {
    aggregate(val, by=list(aggr_group), FUN=median)$x
};

Note the keyword AGGREGATE and the magical R variable aggr_group, which contains an integer number denoting which group a row belongs to. If we run this function, we get the following result

SELECT groupcol, rapi04(datacol) FROM rvalg GROUP BY groupcol;

+----------+--------------------------+
| groupcol | L1                       |
+==========+==========================+
|        1 |                       63 |
|        2 |                     31.5 |
+----------+--------------------------+

As a final note on usage, you can use any R package from CRAN within your function without explicitly installing it. We overloaded the library function in R to automatically install missing packages.

Performance

Since we no longer need to serialize data onto a socket with our new solution, we expect great performance benefits. Also, since both MonetDB and R use a columnar data layout, little effort has to be spent on converting the values between these two systems. In fact, we have shown how we can gain a zero-copy integration between MonetDB and R in prototypical work. In the released version however, we invest one in-memory copy of the data for increased peace of mind. For this experiment, we have used the lineitem table of the well-known TPC-H benchmark at scale factor 100. This table contains 100 million rows. The test task was to calculate the .05 and .95 quantiles of the l_extendedprice column. The contenders in the benchmark were

  • MonetDB itself, we have added quantile support some time ago (MonetDB)
  • R using the MonetDB.R socket connector, retrieving all columns, then calculate quantiles (R-full)
  • R using the MonetDB.R socket connector, retrieving a single column, then calculate quantiles (R-col)
  • PL/R, the R integration into PostgreSQL, initial approach (PL/R-naive)
  • PL/R, the R integration into PostgreSQL, tuned after feedback from Joe Conway (PL/R-tuned)
  • MonetDB with embedded R (RInt)

All experiments were run on a desktop-class machine with 16 GB of main memory and a 8-core Intel i7 processor. We tuned PostgreSQL using the pgtune utility for analytical workloads. We ran the aggregation over an increasing number of rows, from 1.000 to 100.000.000. All queries were repeated five times for each system and then averaged. The results are given below:

Performance Comparison

In this plot, we can see that R-Integration delivers superior performance to all other solutions, curiously also that of MonetDB’s quantile implementation. The reason for this is that R uses partial sorting for its quantile implementation, whereas MonetDB does a full sorting pass over the data to calculate quantiles. Missing data points are due to timeouts (60 seconds limit) or crashes.

For reference, we used the following R UDF for the winning system in this experiment:

CREATE AGGREGATE rquantile(arg1 double, arg2 double) RETURNS double LANGUAGE R { quantile(arg1, arg2) };

This was then run using the SQL query

SELECT rquantile(CAST(l_extendedprice AS DOUBLE), 0.05) As q5, rquantile(CAST(l_extendedprice AS DOUBLE), 0.95) AS q95 FROM lineitem;

Installation

For now, R-Integration is not shipped with the binary distributions of MonetDB due to compatibility and security concerns, Hence, you need to compile MonetDB from source with (just a few) custom parameters. In addition, you need to install an R version (or compile from source, too) that includes the R shared library. On Linux systems, you can check whether you have a file named libR.so. If you have installed R using your OS’s package manager, it will most likely be the case. Compiling MonetDB from source is not hard. To enable R-Integration, you need to run the ./configure script as follows:

./configure --enable-rintegration=yes

If all goes well, the output will contain the following line in the summary at the end

rintegration is enabled

If not, check that the R binary is in your $PATH, and that R was configured with the --enable-R-shlib=yes flag. You may also be required to set the $LD_LIBRARY_PATH variable to include the path to libr.so before starting MonetDB.

Startup

After R-Integration has been compiled into the MonetDB installation, you need to explicitly enable it during server startup. If you are using the monetdbd daemon, you can enable R-Integration as follows

monetdb stop rtest
monetdb set embedr=true rtest
monetdb start rtest

Where rtest is the name of your database.

If you are running mserver5 directly, the parameter name is different:

mserver5 --set embedded_r=true

In the latter case, you will see MonetDB’s startup message. If you see no error messages and

# MonetDB/R   module loaded

all is well.