Hi Niels,

I have tried this in the default branch and it does indeed work like a charm.
(my UTF8tokenize UDF takes two values and outputs a 3-column table)

I noticed, though, that it results in a MAL loop:

| barrier (X_72,X_73) := iterator.new(X_8);
|     X_75 := algebra.fetch(X_11,X_72);
|     X_77 := algebra.fetch(X_14,X_72);
|     (X_79,X_80,X_81) := str.UTF8tokenize(X_73,X_75,X_77);
|     bat.append(X_64,X_79);
|     bat.append(X_67,X_80);
|     bat.append(X_69,X_81);
|     redo (X_72,X_73) := iterator.next(X_8);
| exit (X_72,X_73);

This is of course not going to be efficient: the function is called once per row, followed by three appends.
What if I write the bulk version of this function? Would that work?
And if it does, would it then also work in Oct2014, as it would no longer need the "union" trick?
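
To make the question concrete, here is a rough sketch of the two shapes, in Python purely for illustration (this is not MonetDB code). The names tokenize_row, tokenize_loop and tokenize_bulk are hypothetical, whitespace splitting stands in for the real tokenizer, and the three output columns (row id, input string, token) are only an assumption about the layout. The first variant mirrors the MAL loop above; the second takes the whole column in a single call.

```python
def tokenize_row(row_id, s):
    """Per-row variant: returns three short columns for one input string."""
    tokens = s.split()  # assumption: whitespace-separated tokens
    return [row_id] * len(tokens), [s] * len(tokens), tokens


def tokenize_loop(column):
    """Mirrors the MAL plan: one function call and three appends per row."""
    ids, inputs, tokens = [], [], []
    for row_id, s in enumerate(column):
        a, b, c = tokenize_row(row_id, s)
        ids.extend(a)
        inputs.extend(b)
        tokens.extend(c)
    return ids, inputs, tokens


def tokenize_bulk(column):
    """Bulk variant: the whole column goes in, three full columns come out."""
    ids, inputs, tokens = [], [], []
    for row_id, s in enumerate(column):
        for tok in s.split():
            ids.append(row_id)
            inputs.append(s)
            tokens.append(tok)
    return ids, inputs, tokens


column = ["one two three", "four five six", "seven eight"]
# Both shapes must produce identical output columns.
assert tokenize_loop(column) == tokenize_bulk(column)
```

Both variants produce the same three columns; the bulk one is simply invoked once for the whole column rather than once per row.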

Roberto


On 11 April 2015 at 14:06, Niels Nes <Niels.Nes@cwi.nl> wrote:
On Sat, Apr 11, 2015 at 11:03:22AM +0200, Roberto Cornacchia wrote:
> Hi there,
>
> I need a string tokenizer in MonetDB.
> The problem I have is not with the function itself, but with the fact
> that this is a 1 to N rows function.
>
> Implementing this for a single string value is easy enough, using a
> table function that takes a string and returns a table:
>
> create function tokenize(s string)
> returns table (token string)
> external name tokenize;
>
> select *
> from tokenize('one two three');
>
> That's fine.
> The issue I'm having is with extending this to a column of strings.
>
> Ideally, given a string column
>
> one two three
> four five six
> seven eight
>
> I'd like to get an output along these lines (simplistic representation
> here):
>
> one two three | one
> one two three | two
> one two three | three
> four five six | four
> four five six | five
> four five six | six
> seven eight   | seven
> seven eight   | eight
>
>
> I can certainly write the C function and the MAL wrapper to implement
> this, but I can't see how to map it to SQL, given that table functions
> don't accept identifiers as parameters.
>
> Any idea? Any possible workaround?
In the default branch you should be able to call tokenize on a column.
It will output the 'union' of all per-row calls.
If you would like the two-column output, you should take care of
this in your tokenize function, i.e. return both the input and the token.
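
A small sketch of what this means, in Python for illustration only (not MonetDB code). The names tokenize and tokenize_column are hypothetical, and whitespace splitting is an assumption standing in for the real tokenizer. The per-row function returns both the input and the token, and the column-level result is simply the per-row results concatenated in input order.

```python
def tokenize(s):
    """Hypothetical per-row table function: yields (input, token) rows."""
    for token in s.split():  # assumption: whitespace-separated tokens
        yield (s, token)


def tokenize_column(column):
    """'Union' of the per-row results, kept in input order."""
    rows = []
    for s in column:
        rows.extend(tokenize(s))
    return rows


# Print the rows in the "input | token" layout used in the question.
for inp, tok in tokenize_column(["one two three", "four five six", "seven eight"]):
    print(f"{inp:<13} | {tok}")
```

For the three example strings this yields eight rows, one per token, each carrying its original input string.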

Niels
> Thanks, Roberto
>

> _______________________________________________
> users-list mailing list
> users-list@monetdb.org
> https://www.monetdb.org/mailman/listinfo/users-list


--
Niels Nes, Manager ITF, Centrum Wiskunde & Informatica (CWI)
Science Park 123, 1098 XG Amsterdam, The Netherlands
room L3.14,  phone ++31 20 592-4098     sip:4098@sip.cwi.nl
url: https://www.cwi.nl/people/niels    e-mail: Niels.Nes@cwi.nl
