R aggregation fails when input is exactly 200K tuples
Hi,

I have a suspicious failure at exactly 200K input tuples. I declared a simple aggregate on two columns (here it doesn't actually aggregate; for simplicity, it just returns 42).

START TRANSACTION;
CREATE TABLE test (customer int, d string, n int);
INSERT INTO test VALUES(1, '2015-01-01', 100);
INSERT INTO test VALUES(1, '2015-01-02', 100);
INSERT INTO test VALUES(2, '2015-01-03', 100);
INSERT INTO test VALUES(2, '2015-01-01', 100);
INSERT INTO test VALUES(2, '2015-01-02', 100);

CREATE AGGREGATE sow(d string, n int) RETURNS DOUBLE LANGUAGE R {
# the aggregation function is constant; its logic is not important here
sow_aggr <- function(df) { 42.0 }
df <- cbind(d,n)
as.vector(by(df, aggr_group, sow_aggr))
};

select customer, sow(d,n) from test group by customer;
ROLLBACK;

+----------+--------------------------+
| customer | L1                       |
+==========+==========================+
|        1 |                       42 |
|        2 |                       42 |
+----------+--------------------------+

The result is what I expected. That holds as long as table test is 199,999 tuples long. When it's exactly 200,000 tuples, I get:

Error running R expression. Error message: Error in
tapply(seq_len(200000L), list(INDICES = c(0, 1, 2, 3, 4, 5, 6, :
  arguments must have same length
Calls: as.data.frame ... by.data.frame -> structure -> eval -> eval -> tapply

I checked the vector aggr_group, and indeed it is not 200000 elements long, as it should be. Instead, it is just one longer than the number of distinct values of customer (the grouping column).

Any thoughts?
Roberto
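[Editor's note, for context: inside a MonetDB LANGUAGE R aggregate, each SQL parameter arrives as a plain R vector, and the server injects an additional vector named aggr_group that should hold one group id per input row. Below is a minimal plain-R simulation of the body above, with aggr_group built by hand; the 0-based group ids are an assumption read off the error message, not verified against the MonetDB source.]

# Plain-R simulation of what the aggregate body should see for the
# 5-row table above. Assumption: one 0-based group id per input row,
# so length(aggr_group) == length(d) == length(n).
d <- c('2015-01-01', '2015-01-02', '2015-01-03', '2015-01-01', '2015-01-02')
n <- rep(100L, 5)
aggr_group <- c(0, 0, 1, 1, 1)  # customer 1 -> group 0, customer 2 -> group 1

sow_aggr <- function(df) { 42.0 }
df <- cbind(d, n)
as.vector(by(df, aggr_group, sow_aggr))
# [1] 42 42   (one value per group, matching the result table)

[With a correctly sized aggr_group, by() splits the five rows into two cells; an aggr_group of the wrong length, as described above, is exactly what makes tapply() complain that its arguments must have the same length.]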
The previous example is a simplification of a real aggregate I'm working
on, hence the two columns.
I just tried with a single column, and it still fails at exactly 200,000
input tuples.
Here, however, I get a SIGSEGV.
Here is a reproducible example:
START TRANSACTION;
-- the (fake) aggregate function
CREATE AGGREGATE sow(n int) RETURNS DOUBLE LANGUAGE R {
sow_aggr <- function(df) { 42.0 }
aggregate(n, list(aggr_group), sow_aggr)$x
};
-- function to generate input data
CREATE FUNCTION tt() RETURNS TABLE (g int, n int) LANGUAGE R {
g <- rep(1:500, rep(400,500))
data.frame(g,as.integer(10))
};
CREATE TABLE good as select * from tt() limit 199999 with data;
CREATE TABLE bad as select * from tt() limit 200000 with data;
select count(distinct g) from good;
select count(distinct g) from bad;
select g, sow(n) from good group by g;
select g, sow(n) from bad group by g;
ROLLBACK;
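[Editor's note: as a cross-check, the same aggregate body runs cleanly in a stand-alone R session when aggr_group has the right length, even at 200,000 rows, which suggests the problem lies in the MonetDB-to-R value transfer rather than in the R code itself. A sketch, using the same data tt() generates and a hand-built aggr_group:]

# Stand-alone reproduction of the aggregate body with 200000 rows
# and a hand-built aggr_group: 500 groups of 400 rows each, 0-based ids.
n <- rep(10L, 200000)
aggr_group <- rep(0:499, each = 400)
stopifnot(length(aggr_group) == length(n))  # the invariant that breaks in MonetDB

sow_aggr <- function(df) { 42.0 }
res <- aggregate(n, list(aggr_group), sow_aggr)$x
length(res)  # [1] 500, one value per group; no failure at 200000 rows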
Hi Roberto,

thanks for the report and the test case; we have added it to our suite:
http://dev.monetdb.org/hg/MonetDB/file/3b913e66ba5d/sql/backends/monet5/Test...

Regarding the cause: we are not getting a crash, but we will investigate.

Hannes
Thanks Hannes,

Do you mean that you don't get a SIGSEGV but a failure like in my first
example (the wrong aggr_group), or that you get a correct result?

I am using Jul2015, updated as of yesterday, 30 Sept 2015.

This conversation should probably move to a bug report. I first posted it
here because the 200000 looked too strange to be a coincidence; I thought
it might be due to some hard-coded debug limitation.

Below is the SIGSEGV backtrace.
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f5d31e00700 (LWP 29057)]
0x00007f5d3499cf10 in RAPIeval (cntxt=0x7f5d35a12328, mb=0x7f5d2431cbe0, stk=0x7f5d24270ec0, pci=0x7f5d24270230, grouped=1 '\001')
    at /opt/spinque/MonetDBServer/MonetDB.Spinque_Jul2015/src/monetdb5/extras/rapi/rapi.c:518
518         BAT_TO_REALSXP(b, lng, varvalue);
(gdb) bt
#0  0x00007f5d3499cf10 in RAPIeval (cntxt=0x7f5d35a12328, mb=0x7f5d2431cbe0, stk=0x7f5d24270ec0, pci=0x7f5d24270230, grouped=1 '\001')
    at /opt/spinque/MonetDBServer/MonetDB.Spinque_Jul2015/src/monetdb5/extras/rapi/rapi.c:518
#1  0x00007f5d3499b276 in RAPIevalAggr (cntxt=0x7f5d35a12328, mb=0x7f5d2431cbe0, stk=0x7f5d24270ec0, pci=0x7f5d24270230)
    at /opt/spinque/MonetDBServer/MonetDB.Spinque_Jul2015/src/monetdb5/extras/rapi/rapi.c:387
#2  0x00007f5d3d21766e in runMALsequence (cntxt=0x7f5d35a12328, mb=0x7f5d2431cbe0, startpc=39, stoppc=40, stk=0x7f5d24270ec0, env=0x0, pcicaller=0x0)
    at /opt/spinque/MonetDBServer/MonetDB.Spinque_Jul2015/src/monetdb5/mal/mal_interpreter.c:631
#3  0x00007f5d3d21c9be in DFLOWworker (T=0x7f5d3d696900
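[Editor's note: frame #0 places the crash inside BAT_TO_REALSXP at rapi.c:518, i.e. while converting a column to an R vector during aggregate evaluation, so a check in the R body cannot prevent this SIGSEGV. In the first failure mode, however, where the R code does run but aggr_group has the wrong length, a guard at the top of the UDF turns the bad input into a clear error instead of a wrong result. A sketch, in the document's own style; the name sow_checked is only illustrative:]

CREATE AGGREGATE sow_checked(n int) RETURNS DOUBLE LANGUAGE R {
# fail fast if the injected aggr_group does not match the input length
if (length(aggr_group) != length(n)) {
    stop(sprintf("aggr_group has length %d, expected %d",
                 length(aggr_group), length(n)))
}
sow_aggr <- function(df) { 42.0 }
aggregate(n, list(aggr_group), sow_aggr)$x
};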
participants (2):
- Hannes Mühleisen
- Roberto Cornacchia