Hi,

The R interpreter is not thread safe. This means that the system cannot simply fork multiple instances, and your R function will become the bottleneck. The SQL predicates that select the portions to be handled by your R script are run in parallel.

regards, Martin

On 25/06/15 18:44, George, Glover E ERDC-RDE-ITL-MS wrote:
Hi all,
I’ve recently been profiling several techniques for a workflow that we’ve been trying to improve here at USACE. Originally we used Python scripts with SQLite, but we ran into scalability problems on large data sets. This led us to MonetDB, with the promise of columnar analysis and the hope of both parallel queries “under the hood” and possibly a distributed workflow across an HPC system. My profiling results have led me to a number of questions that hopefully you all can help us with.
I have a single table from the TPC-H benchmark – lineitem – populated with 360 million entries:
"l_orderkey" INTEGER NOT NULL,
"l_partkey" INTEGER NOT NULL,
"l_suppkey" INTEGER NOT NULL,
"l_linenumber" INTEGER NOT NULL,
"l_quantity" DECIMAL(15,2) NOT NULL,
"l_extendedprice" DECIMAL(15,2) NOT NULL,
"l_discount" DECIMAL(15,2) NOT NULL,
"l_tax" DECIMAL(15,2) NOT NULL,
"l_returnflag" CHAR(1) NOT NULL,
"l_linestatus" CHAR(1) NOT NULL,
"l_shipdate" DATE NOT NULL,
"l_commitdate" DATE NOT NULL,
"l_receiptdate" DATE NOT NULL,
"l_shipinstruct" CHAR(25) NOT NULL,
"l_shipmode" CHAR(10) NOT NULL,
"l_comment" VARCHAR(44) NOT NULL,
System Setup: Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz - 16 cores (HT off) / 256G RAM
MonetDB version 11.19.15
R version 3.2.0
1. I have followed the instructions here: https://www.monetdb.org/content/embedded-r-monetdb and tried to reproduce those results. I only attempt to reproduce the results from R (reading the data from CSV, since – as noted in a previous email to this list – R fails to read in the 360 million rows using MonetDB.R but works fine reading the same amount from CSV), from MonetDB, and from MonetDB with embedded R.
- I compile both R and MonetDB from source. When building R, I include the option to build the R shared library (libR.so), and when compiling MonetDB I include the option for embedded R.
- When creating the MonetDB database, I set the option embedr=true.
- The R function from the above URL is used, as well as the same SQL query – and it works. However, the performance is much worse than using R alone, so I know I must be overlooking something. Please see the image at https://goo.gl/wUvB2J (PNG on Google Drive). The X-axis is the number of rows and the Y-axis is the time in seconds. As you can see, embedded R is unexpectedly much worse than the other two.
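For reference, the embedded-R setup being benchmarked is roughly the following (a sketch reconstructed from the linked page; the function name rq, the argument types and the 0.1 quantile are illustrative, not necessarily verbatim):

    -- register an R UDF that computes a quantile over a column
    CREATE FUNCTION rq(v DOUBLE, q DOUBLE) RETURNS DOUBLE LANGUAGE R {
        quantile(v, q)
    };

    -- the comparison query, run against the 360-million-row table
    SELECT rq(CAST(l_extendedprice AS DOUBLE), 0.1) FROM lineitem;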
2. While running the quantile function, only one core is active – whether it is the embedded R version or MonetDB’s built-in quantile.
3. When running the following query (after altering the table to include a new column z, of course):
sql>update lineitem set z=l_extendedprice / l_quantity;
360011594 affected rows (3m 49s)
… multiple cores are active for the first 10 seconds or so, then it drops back to a single core.
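(One way to check whether the planner produces a parallel plan at all is to prefix the query with EXPLAIN in mclient and look for the mitosis optimizer splitting the work into slices – a diagnostic sketch, using MonetDB’s built-in quantile aggregate:)

    sql>EXPLAIN SELECT quantile(l_extendedprice, 0.1) FROM lineitem;
    -- a parallelised plan shows many similar instruction groups, each
    -- operating on a slice of the column; a single sequential pipeline
    -- means the query is effectively running on one core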
A. What am I not understanding about MonetDB’s ability to use multiple cores? The only time I really see it use multiple cores seems to be during “copy into”.
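(It may also be worth checking how many worker threads the server is configured to use – a sketch, assuming I am remembering the env() settings table and the per-database nthreads property correctly:)

    SELECT * FROM env() WHERE name LIKE '%threads%';
    -- if nothing matches here, the server is presumably using its default;
    -- the count can be pinned per database with something like:
    --     monetdb set nthreads=16 <dbname>   (dbname is a placeholder)
    -- followed by a stop/start of the database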
B. Can someone provide assistance in trying to replicate the embedded R results? Should it be running in parallel? Or might something in my setup be configured incorrectly?
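(Since the R interpreter itself is single-threaded, one workaround is to partition the input with SQL predicates and run the partitions over separate connections – a rough sketch, reusing the illustrative rq() UDF from the earlier sketch and an arbitrary l_orderkey split point; note that per-partition quantiles cannot simply be combined into an exact global quantile, so this pattern suits per-row or associative computations better:)

    -- connection 1
    SELECT rq(CAST(l_extendedprice AS DOUBLE), 0.1)
    FROM lineitem WHERE l_orderkey < 100000000;

    -- connection 2
    SELECT rq(CAST(l_extendedprice AS DOUBLE), 0.1)
    FROM lineitem WHERE l_orderkey >= 100000000;

    -- the range selections themselves run in parallel inside MonetDB;
    -- only the R call on each selected portion is single-threaded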
Cheers!