R aggregation fails when input is exactly 200K tuples
Hi,

I have a suspicious failure at exactly 200K input tuples. I declared a simple aggregate on two columns (here it doesn't actually aggregate; for simplicity, it just returns 42).

START TRANSACTION;
CREATE TABLE test (customer int, d string, n int);
INSERT INTO test VALUES(1, '2015-01-01', 100);
INSERT INTO test VALUES(1, '2015-01-02', 100);
INSERT INTO test VALUES(2, '2015-01-03', 100);
INSERT INTO test VALUES(2, '2015-01-01', 100);
INSERT INTO test VALUES(2, '2015-01-02', 100);

CREATE AGGREGATE sow(d string, n int) RETURNS DOUBLE LANGUAGE R {
# the aggregation function is constant; its logic is not important here
sow_aggr <- function(df) { 42.0 }
df <- cbind(d,n)
as.vector(by(df, aggr_group, sow_aggr))
};

select customer, sow(d,n) from test group by customer;
ROLLBACK;

+----------+--------------------------+
| customer | L1                       |
+==========+==========================+
|        1 |                       42 |
|        2 |                       42 |
+----------+--------------------------+

The result is what I expected. That holds as long as table test is 199,999 tuples long. When it's exactly 200,000 tuples, I get:

Error running R expression. Error message: Error in
tapply(seq_len(200000L), list(INDICES = c(0, 1, 2, 3, 4, 5, 6, :
  arguments must have same length
Calls: as.data.frame ... by.data.frame -> structure -> eval -> eval -> tapply

I checked the vector aggr_group, and indeed it is not 200000 elements long, as it should be. Instead, it is just one longer than the number of distinct values of customer (the grouping column).

Any thoughts?
Roberto
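[Editor's note, for context: inside a MonetDB LANGUAGE R aggregate, each SQL parameter arrives as a plain R vector, and the server injects an additional vector named aggr_group that should hold one group id per input row. Below is a minimal plain-R simulation of the body above, with aggr_group built by hand; the 0-based group ids are an assumption read off the error message, not verified against the MonetDB source.]

# Plain-R simulation of what the aggregate body should see for the
# 5-row table above. Assumption: one 0-based group id per input row,
# so length(aggr_group) == length(d) == length(n).
d <- c('2015-01-01', '2015-01-02', '2015-01-03', '2015-01-01', '2015-01-02')
n <- rep(100L, 5)
aggr_group <- c(0, 0, 1, 1, 1)  # customer 1 -> group 0, customer 2 -> group 1

sow_aggr <- function(df) { 42.0 }
df <- cbind(d, n)
as.vector(by(df, aggr_group, sow_aggr))
# [1] 42 42   (one value per group, matching the result table)

[With a correctly sized aggr_group, by() splits the five rows into two cells; an aggr_group of the wrong length, as described above, is exactly what makes tapply() complain that its arguments must have the same length.]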
The previous example is a simplification of a real aggregate I'm working
on, hence the two columns.
I just tried with a single column, and it still fails at exactly 200,000
input tuples.
Here, however, I get a SIGSEGV.
Here is a reproducible example:
START TRANSACTION;
-- the (fake) aggregate function
CREATE AGGREGATE sow(n int) RETURNS DOUBLE LANGUAGE R {
sow_aggr <- function(df) { 42.0 }
aggregate(n, list(aggr_group), sow_aggr)$x
};
-- function to generate input data
CREATE FUNCTION tt() RETURNS TABLE (g int, n int) LANGUAGE R {
g <- rep(1:500, rep(400,500))
data.frame(g,as.integer(10))
};
CREATE TABLE good as select * from tt() limit 199999 with data;
CREATE TABLE bad as select * from tt() limit 200000 with data;
select count(distinct g) from good;
select count(distinct g) from bad;
select g, sow(n) from good group by g;
select g, sow(n) from bad group by g;
ROLLBACK;
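[Editor's note: as a cross-check, the same aggregate body runs cleanly in a stand-alone R session when aggr_group has the right length, even at 200,000 rows, which suggests the problem lies in the MonetDB-to-R value transfer rather than in the R code itself. A sketch, using the same data tt() generates and a hand-built aggr_group:]

# Stand-alone reproduction of the aggregate body with 200000 rows
# and a hand-built aggr_group: 500 groups of 400 rows each, 0-based ids.
n <- rep(10L, 200000)
aggr_group <- rep(0:499, each = 400)
stopifnot(length(aggr_group) == length(n))  # the invariant that breaks in MonetDB

sow_aggr <- function(df) { 42.0 }
res <- aggregate(n, list(aggr_group), sow_aggr)$x
length(res)  # [1] 500, one value per group; no failure at 200000 rows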
Hi Roberto,

thanks for the report and the test case; we have added it to our suite:
http://dev.monetdb.org/hg/MonetDB/file/3b913e66ba5d/sql/backends/monet5/Test...

Regarding the cause: we are not getting a crash, but we will investigate.

Hannes
Thanks Hannes,

Do you mean that you don't get a SIGSEGV but a failure like in my first
example (the wrong aggr_group), or that you get a correct result?

I am using Jul2015, updated as of yesterday, 30 Sept 2015.

This conversation should probably move to a bug report. I first posted it
here because the 200000 looked too strange to be a coincidence; I thought
it might be due to some hard-coded debug limitation.

Below is the SIGSEGV backtrace.
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f5d31e00700 (LWP 29057)]
0x00007f5d3499cf10 in RAPIeval (cntxt=0x7f5d35a12328, mb=0x7f5d2431cbe0, stk=0x7f5d24270ec0, pci=0x7f5d24270230, grouped=1 '\001')
    at /opt/spinque/MonetDBServer/MonetDB.Spinque_Jul2015/src/monetdb5/extras/rapi/rapi.c:518
518         BAT_TO_REALSXP(b, lng, varvalue);
(gdb) bt
#0  0x00007f5d3499cf10 in RAPIeval (cntxt=0x7f5d35a12328, mb=0x7f5d2431cbe0, stk=0x7f5d24270ec0, pci=0x7f5d24270230, grouped=1 '\001')
    at /opt/spinque/MonetDBServer/MonetDB.Spinque_Jul2015/src/monetdb5/extras/rapi/rapi.c:518
#1  0x00007f5d3499b276 in RAPIevalAggr (cntxt=0x7f5d35a12328, mb=0x7f5d2431cbe0, stk=0x7f5d24270ec0, pci=0x7f5d24270230)
    at /opt/spinque/MonetDBServer/MonetDB.Spinque_Jul2015/src/monetdb5/extras/rapi/rapi.c:387
#2  0x00007f5d3d21766e in runMALsequence (cntxt=0x7f5d35a12328, mb=0x7f5d2431cbe0, startpc=39, stoppc=40, stk=0x7f5d24270ec0, env=0x0, pcicaller=0x0)
    at /opt/spinque/MonetDBServer/MonetDB.Spinque_Jul2015/src/monetdb5/mal/mal_interpreter.c:631
#3  0x00007f5d3d21c9be in DFLOWworker (T=0x7f5d3d696900
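[Editor's note: frame #0 places the crash inside BAT_TO_REALSXP at rapi.c:518, i.e. while converting a column to an R vector during aggregate evaluation, so a check in the R body cannot prevent this SIGSEGV. In the first failure mode, however, where the R code does run but aggr_group has the wrong length, a guard at the top of the UDF turns the bad input into a clear error instead of a wrong result. A sketch, in the document's own style; the name sow_checked is only illustrative:]

CREATE AGGREGATE sow_checked(n int) RETURNS DOUBLE LANGUAGE R {
# fail fast if the injected aggr_group does not match the input length
if (length(aggr_group) != length(n)) {
    stop(sprintf("aggr_group has length %d, expected %d",
                 length(aggr_group), length(n)))
}
sow_aggr <- function(df) { 42.0 }
aggregate(n, list(aggr_group), sow_aggr)$x
};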
participants (2):
- Hannes Mühleisen
- Roberto Cornacchia