Hi,

The R interpreter is not thread safe. This means that the system cannot simply fork multiple instances, and your R function will become the bottleneck. The SQL predicates that select the portions to be handled by your R script are run in parallel.

regards, Martin

On 25/06/15 18:44, George, Glover E ERDC-RDE-ITL-MS wrote:
Hi all,
I’ve recently been profiling several techniques for a workflow that we’ve been trying to improve here at USACE. Originally we used Python scripts with SQLite, but we ran into scalability problems on large data sets. This led us to MonetDB, with the promise of columnar analysis and the hope of both parallel queries “under the hood” and possibly a distributed workflow across an HPC system. My profiling results have led me to a number of questions that hopefully you all can help us with.
I have a single table from the TPC-H benchmark – lineitem – populated with 360 million entries:
"l_orderkey" INTEGER NOT NULL,
"l_partkey" INTEGER NOT NULL,
"l_suppkey" INTEGER NOT NULL,
"l_linenumber" INTEGER NOT NULL,
"l_quantity" DECIMAL(15,2) NOT NULL,
"l_extendedprice" DECIMAL(15,2) NOT NULL,
"l_discount" DECIMAL(15,2) NOT NULL,
"l_tax" DECIMAL(15,2) NOT NULL,
"l_returnflag" CHAR(1) NOT NULL,
"l_linestatus" CHAR(1) NOT NULL,
"l_shipdate" DATE NOT NULL,
"l_commitdate" DATE NOT NULL,
"l_receiptdate" DATE NOT NULL,
"l_shipinstruct" CHAR(25) NOT NULL,
"l_shipmode" CHAR(10) NOT NULL,
"l_comment" VARCHAR(44) NOT NULL,
System Setup: Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz - 16 cores (HT off) / 256G RAM
MonetDB version 11.19.15
R version 3.2.0
1. I have followed the instructions here: https://www.monetdb.org/content/embedded-r-monetdb and tried to reproduce those results. I only attempt to reproduce the results from R (reading the data from CSV, since – as noted in a previous email to this list – R fails to read in the 360 million rows using MonetDB.R but works fine reading the same amount from CSV), from MonetDB, and from MonetDB with embedded R.
- I compile both R and MonetDB from source. When building R, I include the option to build the R shared library (libR.so), and when compiling MonetDB I include the option for embedded R.
- When creating the MonetDB database, I set the option embedr=true.
- The R function from the above URL is used, as well as the same SQL query – and it works. However, the performance is much worse than using R alone, so I know I must be overlooking something. Please see the image at https://goo.gl/wUvB2J (PNG on Google Drive). The X-axis is the number of rows and the Y-axis is the time in seconds. As you can see, embedded R is unexpectedly much worse than the other two.
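For reference, the embedded-R setup being benchmarked is roughly the following (a sketch reconstructed from the linked page; the function name rq, the argument types and the 0.1 quantile are illustrative, not necessarily verbatim):

    -- register an R UDF that computes a quantile over a column
    CREATE FUNCTION rq(v DOUBLE, q DOUBLE) RETURNS DOUBLE LANGUAGE R {
        quantile(v, q)
    };

    -- the comparison query, run against the 360-million-row table
    SELECT rq(CAST(l_extendedprice AS DOUBLE), 0.1) FROM lineitem;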
2. While running the quantile function, only one core is active – whether it is the embedded R version or MonetDB’s built-in quantile.
3. When running the following query (after altering the table to include a new column z, of course):
sql>update lineitem set z=l_extendedprice / l_quantity;
360011594 affected rows (3m 49s)
… multiple cores are active for the first 10 seconds or so, then it drops back to a single core.
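(One way to check whether the planner produces a parallel plan at all is to prefix the query with EXPLAIN in mclient and look for the mitosis optimizer splitting the work into slices – a diagnostic sketch, using MonetDB’s built-in quantile aggregate:)

    sql>EXPLAIN SELECT quantile(l_extendedprice, 0.1) FROM lineitem;
    -- a parallelised plan shows many similar instruction groups, each
    -- operating on a slice of the column; a single sequential pipeline
    -- means the query is effectively running on one core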
A. What am I not understanding about MonetDB’s ability to use multiple cores? The only time I really see it use multiple cores seems to be during “copy into”.
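(It may also be worth checking how many worker threads the server is configured to use – a sketch, assuming I am remembering the env() settings table and the per-database nthreads property correctly:)

    SELECT * FROM env() WHERE name LIKE '%threads%';
    -- if nothing matches here, the server is presumably using its default;
    -- the count can be pinned per database with something like:
    --     monetdb set nthreads=16 <dbname>   (dbname is a placeholder)
    -- followed by a stop/start of the database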
B. Can someone provide assistance in trying to replicate the embedded R results? Should it be running in parallel? Or might something in my setup be configured incorrectly?
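(Since the R interpreter itself is single-threaded, one workaround is to partition the input with SQL predicates and run the partitions over separate connections – a rough sketch, reusing the illustrative rq() UDF from the earlier sketch and an arbitrary l_orderkey split point; note that per-partition quantiles cannot simply be combined into an exact global quantile, so this pattern suits per-row or associative computations better:)

    -- connection 1
    SELECT rq(CAST(l_extendedprice AS DOUBLE), 0.1)
    FROM lineitem WHERE l_orderkey < 100000000;

    -- connection 2
    SELECT rq(CAST(l_extendedprice AS DOUBLE), 0.1)
    FROM lineitem WHERE l_orderkey >= 100000000;

    -- the range selections themselves run in parallel inside MonetDB;
    -- only the R call on each selected portion is single-threaded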
Cheers!