Clone the following (small) repository and read the README.rst file in
there:

http://dev.monetdb.org/hg/MonetDB-extend/

In short, you need to define the SQL scalar function, which should
point to the MAL function str.UTF8tokenize, and you need to have the
MAL bulk function batstr.UTF8tokenize. If you have those, SQL should
figure it all out. (In particular, you should *not* have the SQL bulk
function.)

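
To make the scalar/bulk distinction a bit more concrete, here is a rough
sketch in plain C. It is not the GDK API and not code from the
repository above: plain arrays stand in for BATs, and the names
result_table, result_append, tokenize_scalar, tokenize_bulk and
result_free are made up for illustration. It also assumes tokens are
separated by whitespace. The scalar flavour handles one (id, s, prob)
row per call, which is what the MAL iterator loop ends up doing; the
bulk flavour takes whole columns and keeps the loop inside C.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable result table with the three output columns
 * (id, token, prob). Plain arrays stand in for MonetDB BATs here. */
typedef struct {
    int    *id;
    char  **token;
    double *prob;
    size_t  count;
    size_t  cap;
} result_table;

/* Append one output row; returns 0 on success, -1 on allocation failure. */
int result_append(result_table *out, int id, const char *tok,
                  size_t len, double prob)
{
    if (out->count == out->cap) {
        size_t ncap = out->cap ? out->cap * 2 : 8;
        int *nid = realloc(out->id, ncap * sizeof *nid);
        if (nid == NULL) return -1;
        out->id = nid;
        char **ntok = realloc(out->token, ncap * sizeof *ntok);
        if (ntok == NULL) return -1;
        out->token = ntok;
        double *nprob = realloc(out->prob, ncap * sizeof *nprob);
        if (nprob == NULL) return -1;
        out->prob = nprob;
        out->cap = ncap;
    }
    char *copy = malloc(len + 1);          /* private copy of the token */
    if (copy == NULL) return -1;
    memcpy(copy, tok, len);
    copy[len] = '\0';
    out->id[out->count] = id;
    out->token[out->count] = copy;
    out->prob[out->count] = prob;
    out->count++;
    return 0;
}

/* Scalar flavour: one input row in, zero or more output rows appended. */
int tokenize_scalar(result_table *out, int id, const char *s, double prob)
{
    assert(out != NULL && s != NULL);
    const char *p = s;
    while (*p != '\0') {
        p += strspn(p, " \t\n");           /* skip leading whitespace */
        size_t len = strcspn(p, " \t\n");  /* length of the next token */
        if (len == 0) break;               /* only trailing whitespace left */
        if (result_append(out, id, p, len, prob) != 0) return -1;
        p += len;
    }
    return 0;
}

/* Bulk flavour: whole input columns in, a single call, loop stays in C. */
int tokenize_bulk(result_table *out, const int *ids,
                  const char *const *strs, const double *probs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (tokenize_scalar(out, ids[i], strs[i], probs[i]) != 0)
            return -1;
    return 0;
}

/* Release everything owned by the result table. */
void result_free(result_table *out)
{
    for (size_t i = 0; i < out->count; i++)
        free(out->token[i]);
    free(out->id);
    free(out->token);
    free(out->prob);
    memset(out, 0, sizeof *out);
}
```

A caller would zero-initialise a result_table, call tokenize_bulk once
with the three input columns, read out->count rows, and then call
result_free. A real batstr implementation would do the same thing with
BAT descriptors instead of arrays.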
On 08/06/15 11:36, Roberto Cornacchia wrote:
> I don't seem to get the expected result, let's see if I'm doing
> something silly.
>
> - SQL signature: create function tokenize(id integer, s string,
> prob double) returns table (id integer, token string, prob double)
> external name batstr."UTF8tokenize";
>
> - MAL signature: command
> batstr.UTF8tokenize(id:bat[:oid,:int],s:bat[:oid,:str],prob:bat[:oid,:dbl])
> (:bat[:oid,:int],:bat[:oid,:str],:bat[:oid,:dbl])
> address STRbat_utf8_tokenize_id_prob;
>
> - C signature: batstr_export str STRbat_utf8_tokenize_id_prob(bat
> *r1, bat *r2, bat *r3, const bat *idx, const bat *s, const bat
> *prob);
>
>
> Inspecting a mal plan for a query like
>
> SELECT * FROM tokenize (select id, s, prob from x);
>
> I see that the bat version of the function is being used inside the
> same tuple-oriented loop.
>
> | X_64 := bat.new(nil:oid,nil:int);
> | X_67 := bat.new(nil:oid,nil:str);
> | X_69 := bat.new(nil:oid,nil:dbl);
> | barrier (X_72,X_73) := iterator.new(X_8);
> |     X_75 := algebra.fetch(X_11,X_72);
> |     X_77 := algebra.fetch(X_14,X_72);
> |     (X_79,X_80,X_81) := batstr.UTF8tokenize(X_73,X_75,X_77);
> |     bat.append(X_64,X_79);
> |     bat.append(X_67,X_80);
> |     bat.append(X_69,X_81);
> |     redo (X_72,X_73) := iterator.next(X_8);
> | exit (X_72,X_73);
>
> Executing this fails, obviously.
>
> Can you spot where the problem is? Roberto
>
>
> On 6 June 2015 at 10:29, Niels Nes <Niels.Nes@cwi.nl> wrote:
>
> On Thu, Jun 04, 2015 at 11:46:08AM +0200, Martin Kersten wrote:
>> On 04/06/15 11:36, Roberto Cornacchia wrote:
>>> Hi Niels,
>>>
>>> I have tried this in default and indeed it does work like a
>>> charm. (my UTF8tokenize UDF takes two values and outputs a
>>> 3-column table)
>>>
>>> I noticed, though, that it results in a MAL loop:
>>>
>>> | barrier (X_72,X_73) := iterator.new(X_8);
>>> |     X_75 := algebra.fetch(X_11,X_72);
>>> |     X_77 := algebra.fetch(X_14,X_72);
>>> |     (X_79,X_80,X_81) := str.UTF8tokenize(X_73,X_75,X_77);
>>> |     bat.append(X_64,X_79);
>>> |     bat.append(X_67,X_80);
>>> |     bat.append(X_69,X_81);
>>> |     redo (X_72,X_73) := iterator.next(X_8);
>>> | exit (X_72,X_73);
>>>
>>> This of course is not going to be efficient. What if I write
>>> the bulk version of this function? Would that work?
>> In general, yes. If a bulk version exists, this code would not be
>> generated.
>>
>> str.UTF8tokenize(X_73:bat[:oid,:str],X_75:bat[:oid,:str],X_77:bat[:oid,:str]):bat[:oid,:str]
>
> batstr.UTF8tokenize(X_73:bat[:oid,:str],X_75:bat[:oid,:str],X_77:bat[:oid,:str]):bat[:oid,:str]
>
> Niels
>>
>>> And if it does, would it then also work in Oct2014, as it would
>>> no longer need the "union" trick?
>>>
>>> Roberto
>>>
>>>
>>> On 11 April 2015 at 14:06, Niels Nes <Niels.Nes@cwi.nl> wrote:
>>>
>>> On Sat, Apr 11, 2015 at 11:03:22AM +0200, Roberto Cornacchia wrote:
>>>> Hi there,
>>>>
>>>> I need a string tokenizer in MonetDB. The problem I have is not
>>>> with the function itself, but with the fact that this is a 1 to N
>>>> rows function.
>>>>
>>>> Implementing this for a single string value is easy enough, using a
>>>> table function that takes a string and returns a table:
>>>>
>>>> create function tokenize(s string) returns table (token
>>>> string) external name tokenize;
>>>>
>>>> select * from tokenize('one two three');
>>>>
>>>> That's fine. The issue I'm having is with extending this to a
>>>> column of strings.
>>>>
>>>> Ideally, given a string column
>>>>
>>>> one two three
>>>> four five six
>>>> seven eight
>>>>
>>>> I'd like to get an output along these lines (simplistic
>>>> representation here):
>>>>
>>>> one two three | one
>>>> one two three | two
>>>> one two three | three
>>>> four five six | four
>>>> four five six | five
>>>> four five six | six
>>>> seven eight   | seven
>>>> seven eight   | eight
>>>>
>>>>
>>>> I can certainly code the C function and the MAL wrapper to
>>>> implement this, but I can't see how to map it to SQL, given that
>>>> table functions don't accept identifiers as parameters.
>>>>
>>>> Any idea? Any possible workaround?
>>> In default you should be able to call tokenize on a column. It
>>> will output the 'union' of all per-row calls. If you would like
>>> the 2-column output, you should take care of this in your
>>> tokenize function, i.e. return both input and token.
>>>
>>> Niels
>>>> Thanks, Roberto
>>>>
>>>
>>>> _______________________________________________ users-list
>>>> mailing list users-list@monetdb.org
>>>> <mailto:users-list@monetdb.org>
>>>> https://www.monetdb.org/mailman/listinfo/users-list
>>>
>>>
>>> -- Niels Nes, Manager ITF, Centrum Wiskunde & Informatica
>>> (CWI) Science Park 123, 1098 XG Amsterdam, The Netherlands room
>>> L3.14, phone ++31 20 592-4098 sip:4098@sip.cwi.nl
>>> url: https://www.cwi.nl/people/niels e-mail: Niels.Nes@cwi.nl
>>>
>
-- 
Sjoerd Mullender