For some time now, we have had the MonetDB.R package on CRAN, which allows you to connect to a MonetDB server from an R session. This package uses the MAPI protocol to talk to MonetDB over a standard socket. In addition to running queries, the package also supports the more high-level dplyr API.
While we worked hard on making this integration as painless as possible, there was one issue that we could not solve: Processing data from the database with R required transferring the relevant data over the socket first. This is fine and “fast enough” for up to – say – several thousand rows, but not for much more. We have had a lot of demand for transferring larger amounts of data from users. Hence, we chose to go in a different direction.
Starting with the Oct2014 release, MonetDB will ship with a feature we call R-Integration. R has some support for running embedded in another application, a fact we leverage heavily. What it does is make R scripts a first class citizen of the SQL world in MonetDB. Users can create ad-hoc functions much in the style of SQL functions. In the remainder of this blog post, we will describe the installation, the usage and some initial performance observations of RIntegration.
R-Integration works by wrapping R code in an SQL function definition, so that the SQL compiler knows the
input and output schema of the function. R functions can be used in various parts of an SQL query, for instance,
as table-producing functions (used in the FROM
clause), as projection functions
(used in the SELECT
clause), as filtering functions (used in the WHERE
clause) and as aggregation functions (also in the SELECT
clause, but together with a GROUP BY
clause).
We begin with a table-producing function
CREATE FUNCTION rapi00() RETURNS TABLE (d INTEGER) LANGUAGE R {
seq(1,10);
};
In this example, there are no parameters going into the function, but it returns a table with a single integer-typed column. Inside the curly braces we see the function body, which is plain R code. This function generates the sequence of numbers from 1 to 10. If we want to call it, we can do so from the SQL prompt:
SELECT d FROM rapi00() AS r WHERE d > 5;
Here, the R function takes the FROM
position in the query. Note how we filter the output by only selecting
values greater than 5. The output of this function is
sql>SELECT d FROM rapi00() AS r WHERE d > 5;
+------+
| d |
+======+
| 6 |
| 7 |
| 8 |
| 9 |
| 10 |
+------+
Here is a slightly more complicated example with a constant input parameter and two output columns:
CREATE FUNCTION rapi01(i integer) RETURNS TABLE (i INTEGER, d DOUBLE) LANGUAGE R {
data.frame(i=seq(1, i), d=42.0);
};
Note how we can use the SQL input parameter inside the R code. It will be automatically translated into the R world. Also note how we construct a to return the two-column result SQL expects. An example invocation and result would be
sql>SELECT i, d FROM rapi01(42) AS R WHERE i > 40;
+------+--------------------------+
| i | d |
+======+==========================+
| 41 | 42 |
| 42 | 42 |
+------+--------------------------+
For the remaining examples, let’s create some sample data:
CREATE TABLE rval(i INTEGER);
INSERT INTO rval VALUES (42),(43),(44),(45);
CREATE TABLE rvalg(groupcol INTEGER, datacol INTEGER);
INSERT INTO rvalg VALUES (1, 42), (1, 84), (2, 42), (2, 21);
The following R function will multiply the passed values. We can use it to do some computations on the projected columns in an SQL query:
CREATE FUNCTION rapi02(i INTEGER, j INTEGER) RETURNS INTEGER LANGUAGE R { i * j };
In this example, we will call this function with a constant as the second parameter:
sql>SELECT rapi02(i, 2) FROM rval;
+----------+
| rapi02_i |
+==========+
| 84 |
| 86 |
| 88 |
| 90 |
+----------+
We can also use the R functions in the WHERE
clause of the query, where they can decide whether a row is included in the result or not:
CREATE FUNCTION rapi03(i INTEGER, j INTEGER) RETURNS BOOLEAN LANGUAGE R { i > j };
SELECT * FROM rval WHERE rapi03(i, 44);
+------+
| i |
+======+
| 45 |
+------+
Finally, we can use an R function to calculate an aggregation within the projection of a GROUP BY
query:
CREATE AGGREGATE rapi04(val INTEGER) RETURNS DOUBLE LANGUAGE R {
aggregate(val, by=list(aggr_group), FUN=median)$x
};
Note the keyword AGGREGATE
and the magical R variable aggr_group
, which contains an integer number denoting
which group a row belongs to. If we run this function, we get the following result
SELECT groupcol, rapi04(datacol) FROM rvalg GROUP BY groupcol;
+----------+--------------------------+
| groupcol | L1 |
+==========+==========================+
| 1 | 63 |
| 2 | 31.5 |
+----------+--------------------------+
As a final note on usage, you can use any R package from CRAN within your function without explicitly installing it.
We overloaded the library
function in R to automatically install missing packages.
Since we no longer need to serialize data onto a socket with our new solution, we expect great performance benefits.
Also, since both MonetDB and R use a columnar data layout, little effort has to be spent on converting the values
between these two systems. In fact, we have shown how we can gain a
zero-copy integration between MonetDB and R
in prototypical work. In the released version however, we invest one in-memory copy of the data for increased
peace of mind. For this experiment, we have used the lineitem
table of the well-known
TPC-H benchmark at scale factor 100. This table contains 100 million rows.
The test task was to calculate the .05 and .95 quantiles of the l_extendedprice
column.
The contenders in the benchmark were
All experiments were run on a desktop-class machine with 16 GB of main memory and a 8-core Intel i7 processor. We tuned PostgreSQL using the pgtune utility for analytical workloads. We ran the aggregation over an increasing number of rows, from 1.000 to 100.000.000. All queries were repeated five times for each system and then averaged. The results are given below:
In this plot, we can see that R-Integration delivers superior performance to all other solutions, curiously also that of MonetDB’s quantile implementation. The reason for this is that R uses partial sorting for its quantile implementation, whereas MonetDB does a full sorting pass over the data to calculate quantiles. Missing data points are due to timeouts (60 seconds limit) or crashes.
For reference, we used the following R UDF for the winning system in this experiment:
CREATE AGGREGATE rquantile(arg1 double, arg2 double) RETURNS double LANGUAGE R { quantile(arg1, arg2) };
This was then run using the SQL query
SELECT rquantile(CAST(l_extendedprice AS DOUBLE), 0.05) As q5, rquantile(CAST(l_extendedprice AS DOUBLE), 0.95) AS q95 FROM lineitem;
For now, R-Integration is not shipped with the binary distributions of MonetDB due to compatibility and security concerns, Hence, you need to compile MonetDB from source with (just a few) custom parameters. In addition, you need to install an R version (or compile from source, too) that includes the R shared library. On Linux systems, you can check whether you have a file named libR.so
. If you have installed R using your OS’s package manager, it will most likely be the case.
Compiling MonetDB from source is not hard. To enable R-Integration, you need to run the ./configure
script as follows:
./configure --enable-rintegration=yes
If all goes well, the output will contain the following line in the summary at the end
rintegration is enabled
If not, check that the R binary is in your $PATH, and that R was configured with the --enable-R-shlib=yes
flag.
You may also be required to set the $LD_LIBRARY_PATH
variable to include the path to libr.so
before starting MonetDB.
After R-Integration has been compiled into the MonetDB installation, you need to explicitly enable it during server startup. If you are using the monetdbd daemon, you can enable R-Integration as follows
monetdb stop rtest
monetdb set embedr=true rtest
monetdb start rtest
Where rtest is the name of your database.
If you are running mserver5 directly, the parameter name is different:
mserver5 --set embedded_r=true
In the latter case, you will see MonetDB’s startup message. If you see no error messages and
# MonetDB/R module loaded
all is well.