Reproducing EmbeddedR results from MonetDB.org site
Hi all,

I’ve recently been profiling several techniques for a workflow that we’ve been trying to improve here at USACE. Originally we used Python scripts with SQLite, but we ran into scalability problems on large data sets. This led us to MonetDB, with the promise of columnar analysis and the hope of both parallel queries “under the hood” and possibly a distributed workflow across an HPC system. My profiling results have raised a number of questions that hopefully you all can help us with.

I have a single table from the TPC-H benchmark, lineitem, populated with 360 million rows:

"l_orderkey" INTEGER NOT NULL,
"l_partkey" INTEGER NOT NULL,
"l_suppkey" INTEGER NOT NULL,
"l_linenumber" INTEGER NOT NULL,
"l_quantity" DECIMAL(15,2) NOT NULL,
"l_extendedprice" DECIMAL(15,2) NOT NULL,
"l_discount" DECIMAL(15,2) NOT NULL,
"l_tax" DECIMAL(15,2) NOT NULL,
"l_returnflag" CHAR(1) NOT NULL,
"l_linestatus" CHAR(1) NOT NULL,
"l_shipdate" DATE NOT NULL,
"l_commitdate" DATE NOT NULL,
"l_receiptdate" DATE NOT NULL,
"l_shipinstruct" CHAR(25) NOT NULL,
"l_shipmode" CHAR(10) NOT NULL,
"l_comment" VARCHAR(44) NOT NULL,

System setup: Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz, 16 cores (HT off), 256 GB RAM
MonetDB version 11.19.15
R version 3.2.0

1. I have followed the instructions at https://www.monetdb.org/content/embedded-r-monetdb and tried to reproduce those results. I compare plain R (reading the data from CSV, since, as noted in a previous email to this list, R fails to read the 360 million rows through MonetDB.R but reads the same volume from CSV without trouble), MonetDB alone, and MonetDB with embedded R.
- I compile both R and MonetDB from source. When building R I enable the shared library (libR.so), and when building MonetDB I enable the embedded R option.
- When creating the MonetDB database, I set embedr=true.
- I use the R function from the URL above, as well as the same SQL query, and it works. However, the performance is much worse than using R alone, so I must be overlooking something. Please see the chart at https://goo.gl/wUvB2J (PNG on Google Drive): the x-axis is the number of rows, the y-axis is time in seconds. As you can see, embedded R is unexpectedly much slower than the other two.

2. While the quantile function runs, only one core is active, whether through embedded R or MonetDB’s own quantile.

3. When running the following query (after altering the table to add a new column z, of course):

sql>update lineitem set z=l_extendedprice / l_quantity;
360011594 affected rows (3m 49s)

Multiple cores are active for the first ten seconds or so, then it drops back to a single core.

A. What am I not understanding about MonetDB’s ability to use multiple cores? The only time I really see it use multiple cores is during “copy into”.

B. Can someone help me replicate the embedded R results? Should they be running in parallel, or might something in my setup be configured incorrectly?

Cheers!
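For concreteness, the per-column computation I’m benchmarking boils down to finding the bottom and top 5% cutoffs of a single column. A NumPy sketch (synthetic stand-in data, not my actual workload):

```python
import numpy as np

# Synthetic stand-in for l_extendedprice; the real test runs on 360M rows.
prices = np.random.default_rng(0).gamma(2.0, 1000.0, size=1_000_000)

# Bottom and top 5% cutoffs -- this needs the whole column at once.
lo, hi = np.quantile(prices, [0.05, 0.95])
```

R’s quantile() over the same vector is the operation the embedded R function performs on the column MonetDB hands it.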
Hi,

The R interpreter is not thread-safe. This means that the system cannot simply fork multiple instances, and your R function will become the bottleneck. SQL predicates that select the portions to be handled by your R script are run in parallel.

regards, Martin
_______________________________________________ users-list mailing list users-list@monetdb.org https://www.monetdb.org/mailman/listinfo/users-list
On 6/25/15, 3:28 PM, "users-list on behalf of Martin Kersten" wrote:

The R interpreter is not thread-safe. This means that the system cannot simply fork multiple instances, and your R function will become the bottleneck.
By system, do you mean MonetDB cannot fork multiple instances of R to perform my query?
SQL predicates that select the portions to be handled by your R script are run in parallel.
For the quantile function, I need the entire column to determine the top and bottom 5%, so limiting the query with a predicate defeats the purpose.

I may have confused the issue by combining multiple questions in the same email. To be clear:

1. What parallel query/calculation capabilities, if any, are available in MonetDB (regardless of R)? “copy into”, as detailed in the recipe “Bulk Loading of Data”, seems to utilize multiple cores. Specifically, what I’m looking for is the ability to compute z = a + b + c on a single table, where a, b, c, and z are columns of that table. I would hope that MonetDB could compute this in parallel, since each element’s computation is independent of the others.

2. Regarding replicating the embedded R test at https://www.monetdb.org/content/embedded-r-monetdb: I’m not expecting it to run in parallel. I had just assumed that was the reason I see much slower performance with embedded R calling R’s quantile function than with MonetDB’s own quantile function (which also uses only one core).
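To sketch what I mean about independence: here is a toy NumPy version of z = a + b + c evaluated slice by slice. Each slice touches only its own rows, so the slices could run on separate cores (illustrative data, not my real table):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(42)
n = 1_000_000
a, b, c = rng.random(n), rng.random(n), rng.random(n)

# Whole-column evaluation.
z_full = a + b + c

# Slice-wise evaluation: every chunk is independent of the others.
def add_chunk(bounds):
    lo, hi = bounds
    return a[lo:hi] + b[lo:hi] + c[lo:hi]

chunks = [(i, min(i + 250_000, n)) for i in range(0, n, 250_000)]
with ThreadPoolExecutor(max_workers=4) as pool:
    z_sliced = np.concatenate(list(pool.map(add_chunk, chunks)))

assert np.array_equal(z_full, z_sliced)  # identical result either way
```

This is exactly the kind of per-slice evaluation I’d hope the database performs across cores.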
On 25 Jun 2015, at 23:24, George, Glover E ERDC-RDE-ITL-MS wrote:
1. What, if any, parallel query/calculations are available in MonetDB (regardless of R)? “copy into” as detailed in the recipe “Bulk Loading of Data” seems to utilize multiple cores. Specifically, what I’m looking for is the ability to compute z = a + b + c on a single table, where a, b, c and z are columns in this table. I would hope that MonetDB could compute this in parallel as each operation is independent of the other.
Hai,

Most relational operations and math computations are executed in parallel in MonetDB. Basic computations like z = a + b + c will already be done in parallel. Functions like QUANTILE can’t (easily) be parallelized, so only a single core is used. There are probably also operations that could be executed in parallel but aren’t yet, mostly because their bulk versions have not been implemented. If you notice such cases, please file a feature request.

Regards, Jennie
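A toy NumPy example of why QUANTILE resists the slice-wise treatment that works for z = a + b + c: per-slice quantiles do not, in general, recombine into the global quantile (illustrative numbers only):

```python
import numpy as np

chunk1 = np.array([0.0, 0.0, 0.0, 10.0])
chunk2 = np.array([0.0, 0.0, 0.0, 0.0])
data = np.concatenate([chunk1, chunk2])

q = 0.9
global_q = np.quantile(data, q)  # quantile over the whole column
merged_q = (np.quantile(chunk1, q) + np.quantile(chunk2, q)) / 2

# 3.0 vs 3.5: averaging per-slice quantiles gives the wrong answer,
# so the whole column must be processed by a single operator.
assert abs(global_q - 3.0) < 1e-9
assert abs(merged_q - 3.5) < 1e-9
```

A sum, by contrast, merges trivially (sum of partial sums), which is why z = a + b + c parallelizes and QUANTILE does not.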
participants (3)
- George, Glover E ERDC-RDE-ITL-MS
- Martin Kersten
- Ying Zhang