
I don't seem to get the expected result, so let's see if I'm doing
something silly.
- SQL signature:
create function tokenize(id integer, s string, prob double)
returns table (id integer, token string, prob double)
external name batstr."UTF8tokenize";
- MAL signature:
command
batstr.UTF8tokenize(id:bat[:oid,:int],s:bat[:oid,:str],prob:bat[:oid,:dbl])
(:bat[:oid,:int],:bat[:oid,:str],:bat[:oid,:dbl])
address STRbat_utf8_tokenize_id_prob;
- C signature:
batstr_export str STRbat_utf8_tokenize_id_prob(bat *r1, bat *r2, bat *r3,
const bat *idx, const bat *s, const bat *prob);
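For reference, the result the bulk function is expected to produce can be
sketched in plain C, independent of MonetDB. This is only an illustration
under assumptions: plain arrays stand in for BATs, tokens are split on ASCII
whitespace (the real UTF8tokenize rules may differ), and the names token_row,
token_table, tokenize_bulk and token_table_free are hypothetical, not part of
the GDK or MAL API.

```c
#include <stdlib.h>
#include <string.h>

/* One output row. In MonetDB these fields would be aligned positions
 * in three output BATs; a row struct keeps the sketch short. */
typedef struct {
    int id;
    char *token;   /* heap-allocated, NUL-terminated */
    double prob;
} token_row;

typedef struct {
    token_row *rows;
    size_t count;
} token_table;

/* Assumed separators: ASCII whitespace only. */
static const char SEP[] = " \t\n";

/* Count the whitespace-separated tokens in s. */
static size_t count_tokens(const char *s)
{
    size_t n = 0;
    while (*s) {
        s += strspn(s, SEP);            /* skip separators */
        size_t len = strcspn(s, SEP);   /* length of the next token */
        if (len > 0)
            n++;
        s += len;
    }
    return n;
}

/* Release everything owned by the table. */
void token_table_free(token_table *t)
{
    for (size_t i = 0; i < t->count; i++)
        free(t->rows[i].token);
    free(t->rows);
    t->rows = NULL;
    t->count = 0;
}

/* Bulk tokenize: a single call over whole input columns.
 * Emits one (id, token, prob) row per token of every input string.
 * Returns 0 on success, -1 on allocation failure. */
int tokenize_bulk(token_table *out, const int *id, const char *const *s,
                  const double *prob, size_t nrows)
{
    size_t total = 0;
    for (size_t i = 0; i < nrows; i++)
        total += count_tokens(s[i]);

    /* Allocate at least one slot so an empty result is not mistaken
     * for an allocation failure. */
    out->rows = calloc(total ? total : 1, sizeof *out->rows);
    out->count = 0;
    if (out->rows == NULL)
        return -1;

    for (size_t i = 0; i < nrows; i++) {
        const char *p = s[i];
        while (*p) {
            p += strspn(p, SEP);
            size_t len = strcspn(p, SEP);
            if (len == 0)               /* only trailing separators left */
                break;
            char *tok = malloc(len + 1);
            if (tok == NULL) {
                token_table_free(out);
                return -1;
            }
            memcpy(tok, p, len);
            tok[len] = '\0';
            out->rows[out->count].id = id[i];
            out->rows[out->count].token = tok;
            out->rows[out->count].prob = prob[i];
            out->count++;
            p += len;
        }
    }
    return 0;
}
```

Called once on the three input columns, tokenize_bulk yields one
(id, token, prob) row per token, which is the same result as the union of
all per-row calls. The caller releases the result with token_table_free.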
Inspecting the MAL plan for a query like
SELECT *
FROM tokenize (select id, s, prob from x);
I see the bat version of the function being used inside the same
tuple-oriented loop.
| X_64 := bat.new(nil:oid,nil:int);
| X_67 := bat.new(nil:oid,nil:str);
| X_69 := bat.new(nil:oid,nil:dbl);
| barrier (X_72,X_73) := iterator.new(X_8);
| X_75 := algebra.fetch(X_11,X_72);
| X_77 := algebra.fetch(X_14,X_72);
| (X_79,X_80,X_81) := batstr.UTF8tokenize(X_73,X_75,X_77);
| bat.append(X_64,X_79);
| bat.append(X_67,X_80);
| bat.append(X_69,X_81);
| redo (X_72,X_73) := iterator.next(X_8);
| exit (X_72,X_73);
Executing this fails, obviously.
Can you spot where the problem is?
Roberto
On 6 June 2015 at 10:29, Niels Nes wrote:
On Thu, Jun 04, 2015 at 11:46:08AM +0200, Martin Kersten wrote:
On 04/06/15 11:36, Roberto Cornacchia wrote:
Hi Niels,
I have tried this in default and indeed it works like a charm (my
UTF8tokenize UDF takes two values and outputs a 3-column table).
I noticed, though, that it results in a MAL loop:
| barrier (X_72,X_73) := iterator.new(X_8);
| X_75 := algebra.fetch(X_11,X_72);
| X_77 := algebra.fetch(X_14,X_72);
| (X_79,X_80,X_81) := str.UTF8tokenize(X_73,X_75,X_77);
| bat.append(X_64,X_79);
| bat.append(X_67,X_80);
| bat.append(X_69,X_81);
| redo (X_72,X_73) := iterator.next(X_8);
| exit (X_72,X_73);
This of course is not going to be efficient. What if I write the bulk
version of this function? Would that work?

In general, yes. If a bulk version exists, this code would not be
generated.

str.UTF8tokenize(X_73:bat[:oid,:str],X_75:bat[:oid,:str],X_77:bat[:oid,:str]):bat[:oid,:str]
batstr.UTF8tokenize(X_73:bat[:oid,:str],X_75:bat[:oid,:str],X_77:bat[:oid,:str]):bat[:oid,:str]
Niels
And if it does, would it then also work in Oct2014, as it would no
longer need the "union" trick?
Roberto
On 11 April 2015 at 14:06, Niels Nes <Niels.Nes@cwi.nl> wrote:
On Sat, Apr 11, 2015 at 11:03:22AM +0200, Roberto Cornacchia wrote:
> Hi there,
>
> I need a string tokenizer in MonetDB.
> The problem I have is not with the function itself, but with the fact
> that this is a 1 to N rows function.
>
> Implementing this for a single string value is easy enough, using a
> table function that takes a string and returns a table:
>
> create function tokenize(s string)
> returns table (token string)
> external name tokenize;
>
> select *
> from tokenize("one two three");
>
> That's fine.
> The issue I'm having is with extending this to a column of strings.
>
> Ideally, given a string column
>
> one two three
> four five six
> seven eight
>
> I'd like to get an output along these lines (simplistic representation
> here):
>
> one two three | one
> one two three | two
> one two three | three
> four five six | four
> four five six | five
> four five six | six
> seven eight | seven
> seven eight | eight
>
> I can sure code the c function and the mal wrapper to implement this,
> but I can't see how to map it to SQL, given that table functions don't
> accept identifiers as parameters.
>
> Any idea? Any possible workaround?

In default you should be able to call tokenize on a column. It will
output the 'union' of all per row calls. If you would like the 2 column
output, you should take care of this in your tokenize function, ie
return both input and token.

Niels

> Thanks, Roberto
>
> _______________________________________________
> users-list mailing list
> users-list@monetdb.org
> https://www.monetdb.org/mailman/listinfo/users-list

--
Niels Nes, Manager ITF, Centrum Wiskunde & Informatica (CWI)
Science Park 123, 1098 XG Amsterdam, The Netherlands
room L3.14, phone ++31 20 592-4098 sip:4098@sip.cwi.nl
url: https://www.cwi.nl/people/niels e-mail: Niels.Nes@cwi.nl