Hi, I am trying to create a simple aggregate function entirely in MAL and am now struggling at the final hurdle. I have created the aggregate function such that it aggregates across an entire table but in the sub aggregate function I can't work out how to link the group to data BAT (g) to the data. The aggregate is designed to simply collect a sample of the values from the named column and return them as a pipe delimited string value. Can anyone tell me how to use the 'g' bat to join to the 'b' bat in the sub... aggregate function to get only the data rows for a specific group - currently I only get 2 rows of data per group even though I know that there should be 100k+ rows in each group? Here's the code: function bin_sample(b:bat[:oid, :any]):str; value := ""; bsamp := sample.uniform(b, 1000:wrd); barrier (h, t) := iterator.new(bsamp); tmp := "" + t; type := bat.getTailType(bsamp); test := str.startsWith(type, "str"); # Force the string type to be cast to a string or you just end up with numeric values barrier xxxx := test; tmp := str.str(t); exit xxxx; value := value + tmp; value := value + "||"; redo (h, t) := iterator.next(bsamp); exit (h, t); return value; end bin_sample; function subbin_sample(b:bat[:oid,:any_1],g:bat[:oid,:oid],e:bat[:oid,:any_2],skip_nils:bit) :bat[:oid,:str]; bn:=bat.new(:oid, :str); # check 'e' has some values - this is the list of group IDs in the head # 'g' is the group to the data BAT head ID mapping # 'b' is the data column BAT that we're aggregating over barrier (grpid, t) := iterator.new(e); # create a single row BAT of the group id we're processing bn_gid := bat.new(:oid, :oid); bat.insert(bn_gid, grpid, grpid); # join this to the GID->DATA map BAT to get the Group DATA oids b_g2dr := bat.reverse(g); b_gid2data := algebra.leftjoin(bn_gid, b_g2dr); grpval := ""; # Get the BAT of the actual data to be processed # # !!! HOW TO GET THIS TO WORK???????? # b_data := algebra.join(b_gid2data, b); grpval := bin_sample(b_data); # Store the result for this group bat.insert(bn, grpid, grpval); redo (grpid, t) := iterator.next(e); exit (grpid, t); return bn; end subbin_sample; Scott Mathieson, ________________________________
On Fri, Jul 05, 2013 at 10:08:31AM +0000, Scott Mathieson wrote:
Hi,
I am trying to create a simple aggregate function entirely in MAL and am now struggling at the final hurdle. I have created the aggregate function such that it aggregates across an entire table but in the sub aggregate function I can't work out how to link the group to data BAT (g) to the data. The aggregate is designed to simply collect a sample of the values from the named column and return them as a pipe delimited string value. Can anyone tell me how to use the 'g' bat to join to the 'b' bat in the sub... aggregate function to get only the data rows for a specific group - currently I only get 2 rows of data per group even though I know that there should be 100k+ rows in each group?
Here's the code:
function bin_sample(b:bat[:oid, :any]):str; value := ""; bsamp := sample.uniform(b, 1000:wrd); barrier (h, t) := iterator.new(bsamp); tmp := "" + t; type := bat.getTailType(bsamp); test := str.startsWith(type, "str"); # Force the string type to be cast to a string or you just end up with numeric values barrier xxxx := test; tmp := str.str(t); exit xxxx;
value := value + tmp; value := value + "||"; redo (h, t) := iterator.next(bsamp); exit (h, t); return value; end bin_sample;
why not simply cast any type to string? i.e., function bin_sample(b:bat[:oid, :any]):str; value := ""; bsamp := sample.uniform(b, 1000:wrd); barrier (h, t) := iterator.new(bsamp); tmp := calc.str(t); value := value + tmp; value := value + "||"; redo (h, t) := iterator.next(bsamp); exit (h, t); return value; end bin_sample; or even better, do the cast "in bulk": function bin_sample(b:bat[:oid, :any]):str; value := ""; bsamp := sample.uniform(b, 1000:wrd); bsampstr := batcalc.str(bsamp); barrier (h, t) := iterator.new(bsampstr); value := value + t; value := value + "||"; redo (h, t) := iterator.next(bsampstr); exit (h, t); return value; end bin_sample;
function subbin_sample(b:bat[:oid,:any_1],g:bat[:oid,:oid],e:bat[:oid,:any_2],skip_nils:bit) :bat[:oid,:str]; bn:=bat.new(:oid, :str);
# check 'e' has some values - this is the list of group IDs in the head # 'g' is the group to the data BAT head ID mapping # 'b' is the data column BAT that we're aggregating over barrier (grpid, t) := iterator.new(e);
# create a single row BAT of the group id we're processing bn_gid := bat.new(:oid, :oid); bat.insert(bn_gid, grpid, grpid);
# join this to the GID->DATA map BAT to get the Group DATA oids b_g2dr := bat.reverse(g); b_gid2data := algebra.leftjoin(bn_gid, b_g2dr);
grpval := "";
# Get the BAT of the actual data to be processed # # !!! HOW TO GET THIS TO WORK???????? # b_data := algebra.join(b_gid2data, b); grpval := bin_sample(b_data);
# Store the result for this group bat.insert(bn, grpid, grpval); redo (grpid, t) := iterator.next(e); exit (grpid, t); return bn;
end subbin_sample;
what about: function subbin_sample(b:bat[:oid,:any_1],g:bat[:oid,:oid],e:bat[:oid,:any_2],skip_nils:bit) :bat[:oid,:str]; nw := aggr.count(e); nl := calc.lng(nw); bn := bat.new(:oid, :str, nl); # check 'e' has some values - this is the list of group IDs in the head # 'g' is the group to the data BAT head ID mapping # 'b' is the data column BAT that we're aggregating over barrier (grpid, t) := iterator.new(e); # select GID from TID->GID map to get matching TIDs TIDs := algebra.uselect(g,grpid); # get DATA for matching TIDs b_data := algebra.kintersect(b,TIDs); # aggregate grpval := bin_sample(b_data); # Store the result for this group bat.insert(bn, grpid, grpval); redo (grpid, t) := iterator.next(e); exit (grpid, t); return bn; end subbin_sample; Stefan
Scott Mathieson,
________________________________
_______________________________________________ users-list mailing list users-list@monetdb.org http://mail.monetdb.org/mailman/listinfo/users-list
-- | Stefan.Manegold@CWI.nl | DB Architectures (DA) | | www.CWI.nl/~manegold | Science Park 123 (L321) | | +31 (0)20 592-4212 | 1098 XG Amsterdam (NL) |
Hi Stefan, Thanks for your help, however I'm still seeing the same issue with your suggested changes - the 'g' BAT only contains 4 rows (2 groups x 2 threads) and the 'b' BAT contains the 249k rows which is the size of the data table - I was assuming that either the 'g' and 'b' BATs are the same size or that there is some form of indexing of 'g' to 'b'. The aggregation I'm testing is this: select "Gender", bin_samp("CustomerAge"), count(*) from "Customers" group by "Gender"; and the result I get is: +--------+----------+--------+ | Gender | L1 | L2 | +========+==========+========+ | Female | 12||12|| | 106024 | | Male | 12||50|| | 143703 | +--------+----------+--------+ I would have expected the bin_samp aggregate to have returned strings of 1000 values pipe delimited per Gender value rather than only 2 values - I'm assuming something is wrong in the mapping of 'g' to 'b' entries The 'g' bat look like this: #-----------------# # h t # name # void oid # type #-----------------# [ 0@0, 0@0 ] [ 1@0, 1@0 ] [ 2@0, 0@0 ] [ 3@0, 1@0 ] And the 'b' bat is: #-------------------------# # h t # name # void int # type #-------------------------# [ 0@0, 12 ] [ 1@0, 12 ] [ 2@0, 12 ] [ 3@0, 50 ] [ 4@0, 56 ] [ 5@0, 12 ] [ 6@0, 12 ] [ 7@0, 34 ] [ 8@0, 12 ] [ 9@0, 12 ] ... continues to @249727@0 Kind reagards, Scott Mathieson ________________________________
Further investigation shows that the aggregate works as expected when running on a single threaded server. Seems to me that I'm missing something in a multi-threaded server that I can use to link the 'g' entries to their actual data OIDs i.e. the links should be 'e' -> 'g' -> ? -> 'b' Where '?' is a BAT with the subgroups OIDs to data OIDs Is there any other MAL functions that I can use to get access to this BAT? Regards, Scott From: users-list [mailto:users-list-bounces+scott.mathieson=pb.com@monetdb.org] On Behalf Of Scott Mathieson Sent: 05 July 2013 14:00 To: Communication channel for MonetDB users Subject: RE: aggregate function in MAL Hi Stefan, Thanks for your help, however I'm still seeing the same issue with your suggested changes - the 'g' BAT only contains 4 rows (2 groups x 2 threads) and the 'b' BAT contains the 249k rows which is the size of the data table - I was assuming that either the 'g' and 'b' BATs are the same size or that there is some form of indexing of 'g' to 'b'. The aggregation I'm testing is this: select "Gender", bin_samp("CustomerAge"), count(*) from "Customers" group by "Gender"; and the result I get is: +--------+----------+--------+ | Gender | L1 | L2 | +========+==========+========+ | Female | 12||12|| | 106024 | | Male | 12||50|| | 143703 | +--------+----------+--------+ I would have expected the bin_samp aggregate to have returned strings of 1000 values pipe delimited per Gender value rather than only 2 values - I'm assuming something is wrong in the mapping of 'g' to 'b' entries The 'g' bat look like this: #-----------------# # h t # name # void oid # type #-----------------# [ 0@0, 0@0 ] [ 1@0, 1@0 ] [ 2@0, 0@0 ] [ 3@0, 1@0 ] And the 'b' bat is: #-------------------------# # h t # name # void int # type #-------------------------# [ 0@0, 12 ] [ 1@0, 12 ] [ 2@0, 12 ] [ 3@0, 50 ] [ 4@0, 56 ] [ 5@0, 12 ] [ 6@0, 12 ] [ 7@0, 34 ] [ 8@0, 12 ] [ 9@0, 12 ] ... continues to @249727@0 Kind reagards, Scott Mathieson ________________________________ ________________________________
participants (2)
-
Scott Mathieson
-
Stefan Manegold