Mercurial > hg > MonetDB-extend
view regexp/README.rst @ 29:e44cffee8312
Implemented bulk variant of match function.
author | Sjoerd Mullender <sjoerd@acm.org> |
---|---|
date | Mon, 06 Aug 2018 20:59:23 +0200 (2018-08-06) |
parents | e925d55b369b |
children | e70b12c15507 |
line wrap: on
line source
.. This Source Code Form is subject to the terms of the Mozilla Public .. License, v. 2.0. If a copy of the MPL was not distributed with this .. file, You can obtain one at http://mozilla.org/MPL/2.0/. .. .. Copyright 2018 MonetDB B.V. .. This document is written in reStructuredText (see http://docutils.sourceforge.net/ for more information). Use ``rst2html.py`` to convert this file to HTML. ============================== Implementing a Filter Function ============================== :Author: Sjoerd Mullender :Organization: CWI, MonetDB B.V. :Abstract: In this short tutorial we describe how to create a filter function for MonetDB/SQL. Introduction ------------ We want to create a function or set of functions that allows us to filter a column based on whether a regular expression matches. When we're done, we will be able to issue a query such as this:: SELECT name FROM t WHERE [name] rematch ['^mue?ller$']; What this query does is select all rows from table ``t`` where the column ``name`` matches the regular expression ``^mue?ller$``. This tutorial is not about how to create regular expressions or what they mean. This tutorial is about how to extend MonetDB/SQL with functions so that a query like this is possible. Note that this query is similar to queries using the LIKE predicate, although regular expressions are more powerful than the patterns allowed by LIKE. The following three queries are equivalent:: SELECT name FROM t WHERE name LIKE '%foo%'; SELECT name FROM t WHERE [name] sys."like" ['%foo%']; SELECT name FROM t WHERE [name] sys.rematch ['.*foo.*']; The first two of these queries work out of the box (given a table ``t`` witch a ``STRING`` column ``name``). We need to write ``"like"`` instead of ``LIKE`` or ``like`` since ``LIKE`` is an SQL keyword and in this case we want to use the SQL function called ``sys.like``. It is also possible to have a reference to a column in some table on both sides of the filter function. Given two tables with STRING columns where one table contains values and the other regular expressions, we can issue queries such as the following (these two queries are equivalent):: SELECT t1.value FROM t1 JOIN t2 ON [t1.value] sys.rematch [t2.pattern]; SELECT t1.value FROM t1, t2 WHERE [t1.value] sys.rematch [t2.pattern]; The function ``rematch`` that we will create, and also the function ``like`` that already exists are FILTER functions. We can also use the filter function for simple (scalar) queries, such as:: SELECT ['some string'] sys."like" ['%foo%']; The value on the left is matched with the pattern on the right and the result is a Boolean value indicating whether the value matches the pattern (``false`` in this case). In addition to this simple filter function that matches a value with a pattern, we will create a variant that also has a ``flags`` argument. It can be useful to pass some extra flags to the matching operator to tell it to do case insensitive matching, or always match the whole string vs. finding a matching substring. We will be able to call this variant is as follows:: SELECT ['some string'] sys.rematch ['.*foo.*', 'i']; SELECT name FROM t WHERE [name] sys.rematch ['.*foo.*', 'i']; SELECT t1.value FROM t1 JOIN t2 ON [t1.value] sys.rematch [t2.pattern, 'i']; SELECT t1.value FROM t1, t2 WHERE [t1.value] sys.rematch [t2.pattern, 'i']; Implementation -------------- Three things need to be done. We need to add a function definition to the SQL catalog, we need to describe the MAL interface for the MAL interpreter, and we need to create the C implementation of the function. We will describe the three steps here in this order. SQL ... The SQL interface is simple:: CREATE FILTER FUNCTION rematch(val STRING, pat STRING) EXTERNAL NAME regexp.rematch; CREATE FILTER FUNCTION rematch(val STRING, pat STRING, flags STRING) EXTERNAL NAME regexp.rematch; When we create a ``FILTER FUNCTION`` in SQL, the SQL system expects three MAL functions whose names are based on the external name given. The functions are named ``regexp.rematch``, ``regexp.rematchselect``, and ``regexp.rematchjoin``. In other words, the name as given, and the name with the string ``select`` and ``join`` appended. These functions are used to implement a simple scalar version of the filter, a version that is used in simple select queries, and one that is used in joins. Note that we use the same MAL names for both variants. The interpreter will be able to distinguish the variants based on the arguments given. This statement will normally be executed once when the database is created, after which it is part of the SQL catalog. This is accomplished by having the statement in a file in the ``$libdir/monetdb/createdb`` directory. Since files in that directory are executed in order, the convention is to add a two digit number at the front of the file name to force the order. So we have a file ``81_regexp.sql`` where we put this statement. At the SQL side we don't have to do anything more. MAL ... The MAL interface consists of three or four functions whose names are based on the names specified in the SQL interface. The three functions that need to be implemented have names that are equal to the name given as the ``EXTERNAL NAME`` in the SQL interface, plus that same name with ``join`` and ``select`` appended. A fourth function can be optionally implemented. It is the *bulk* variant of the first function. This bulk function has a name that is equal to the name given in SQL with ``bat`` prepended. The bulk variant returns a BAT with a single value for each input value. See the *reverse* tutorial. The interface looks like this. First the variant without the ``flags`` argument:: module regexp; command rematch(val:str, pat:str) :bit address regexpmatch comment "Return true when the value 'val' matches the regular expression 'pat'"; command rematchselect(val:bat[:str], s:bat[:oid], pat:str, anti:bit):bat[:oid] address regexpmatchselect comment "Return the list of matches in 'val' that match the regular expression 'pat'"; command rematchjoin(val:bat[:str], pat:bat[:str], sl:bat[:oid], sr:bat[:oid], nil_matches:bit, estimate:lng)(lr:bat[:oid],rr:bat[:oid]) address regexpmatchjoin comment "Return the matching pairs from the 'val' and 'pat' columns"; module batregexp; command rematch(val:bat[:str], pat:str) :bat[:bit] address regexpmatchbulk comment "Return a BAT with true for match and false for no match"; The variant with the ``flags`` argument looks like this:: module regexp; command rematch(val:str, pat:str, flags:str) :bit address regexpmatchf comment "Return true when the value 'val' matches the regular expression 'pat'"; command rematchselect(val:bat[:str], s:bat[:oid], pat:str, flags:str, anti:bit):bat[:oid] address regexpmatchfselect comment "Return the list of matches in 'val' that match the regular expression 'pat'"; command rematchjoin(val:bat[:str], pat:bat[:str], flags:str, sl:bat[:oid], sr:bat[:oid], nil_matches:bit, estimate:lng)(lr:bat[:oid],rr:bat[:oid]) address regexpmatchfjoin comment "Return the matching pairs from the 'val' and 'pat' columns"; module batregexp; command rematch(val:bat[:str], pat:str, flags:str) :bat[:bit] address regexpmatchfbulk comment "Return a BAT with true for match and false for no match"; We put these MAL commands in the file ``$libdir/monetdb5/regexp.mal``. In addition we create a file ``$libdir/monetdb5/autoload/81_regexp.mal`` that just contains:: include regexp; The files in the ``autoload`` directory are executed in order every time the server is started so that by putting the ``81_regexp.mal`` file there, we make sure that the system knows about these functions. C Implementation ................ First we'll describe the core of the matching process. This has nothing to do with how MonetDB works and everything with how the `PCRE`__ library works. However, for a complete example, we need this part. Here we ignore all errors. Consult the PCRE documentation for more information. __ http://pcre.org/ :: pcre *re; /* compiled regular expression */ pcre_extra *sd; /* studied regular expression */ const char *err; /* if non-NULL, error message */ int pos; /* error position or match position */ re = pcre_compile(patter, options, &err, &pos, NULL); sd = pcre_study(re, 0, &err); /* optional: else use NULL for sd */ pos = pcre_exec(re, sd, value, (int) strlen(value), 0, 0, NULL, 0); /* pos >= 0 if there is a match, pos == PCRE_ERROR_NOMATCH if not */ pcre_free_study(sd); pcre_free(re); Match ````` With this information, we can implement the first function easily. The first function is a simple scalar version, i.e. a version that works on a single value. We also give the version with flags argument. Since they are very similar, they share all code:: char * regexpmatch(bit *ret, const char **val, const char **pat) { return do_match(ret, *val, *pat, ""); } char * regexpmatchf(bit *ret, const char **val, const char **pat, const char **flags) { return do_match(ret, *val, *pat, *flags); } The function ``do_match`` does all the work and is in essence the above code (in real code we need to add error handling):: static char * do_match(bit *ret, const char *val, const char *pat, const char *flags) { int options; /* calculate value from flags */ const char *err = NULL; int errpos = 0; pcre *re; if (GDK_STRNIL(val) || GDK_STRNIL(pat) || GDK_STRNIL(flags)) { /* special case for NIL inputs: NILs don't match anything */ *ret = 0; return MAL_SUCCEED; } re = pcre_compile(pat, options, &err, &pos, NULL); pos = pcre_exec(re, NULL, val, (int) strlen(val), 0, 0, NULL, 0); *ret = pos >= 0; return MAL_SUCCEED; } We will not here describe the bulk variant. The code is in the source file, though. Select `````` The C interface of the two select functions (with and without flags) is as follows:: char *regexpmatchselect(bat *ret, const bat *bid, const bat *sid, const char **pat, const bit *anti); char *regexpmatchfselect(bat *ret, const bat *bid, const bat *sid, const char **pat, const char **flags, const bit *anti); The select function is essentially a bulk version of the match function. There is a single pattern that is matched against a complete column of values. There are a few complicating issues, though. Select returns a list of OID values of matching rows. This list *must* be sorted and all values *must* be distinct. This must also be indicated in the BAT properties. Because of the way we go through the values, we know that the result we produce is already sorted and distinct. The select function has a parameter ``const bit *anti``. If ``*anti`` is set, the match is reversed, except that ``nil`` never matches. The select function also has a parameter ``const bat *sid``. This value specifies a candidate list. The candidate list is optional. Not having a candidate list is indicated by ``*sid`` being ``0`` or ``bat_nil``. To be sure, we also check whether ``sid`` itself is equal to the ``NULL`` pointer. If there is no candidate list, all values of the main input BAT referred to by ``*bid`` are considered for matching. If there is a candidate list, it is a sorted list of OID values of values from ``*bid`` that are to be matched. In our implementation we use the same code for a *dense* candidate list and for no candidate list. We just iterate over all values of ``*bid`` between a start and end index. If there is a non dense candidate list, we iterate over it and use an indirection to get to the value in ``*bid``. When creating the result BAT, we allocate enough space in case all input values match. This means that inside the matching loop we don't need to check whether there is enough space. This is, of course, a trade off between speed and simplicity (whether or not we check) and memory consumption. An alternative would be to allocate much less and use the function ``BUNappend`` to add values or check for space inside the loop. When using ``BUNappend`` we also don't need to set any properties afterwards, since they are maintained by ``BUNappend``. Speaking of BAT properties, since we're not using ``BUNappend`` we need to maintain the properties ourselves. Properties must never lie. That is to say, if a property is set, whatever the property indicates must be true. An unset property indicates we don't know whether or not the property holds. The two C functions referenced above are so similar that they share all code:: char * regexpmatchselect(bat *ret, const bat *bid, const bat *sid, const char **pat, const bit *anti) { return do_select(ret, *bid, sid ? *sid : 0, *pat, "", *anti); } char * regexpmatchfselect(bat *ret, const bat *bid, const bat *sid, const char **pat, const char **flags, const bit *anti) { return do_select(ret, *bid, sid ? *sid : 0, *pat, *flags, *anti); } The function ``do_select`` does all the work:: static char * do_select(bat *ret, bat bid, bat sid, const char *pat, const char *flags, bit anti) { ... } First we check whether any of the string input arguments is NIL. If they are, there are no matches and we're done quickly:: if (GDK_STRNIL(pat) || GDK_STRNIL(flags)) { /* no matches when the pattern or the flags is NIL * we return an empty BAT of the correct type */ if ((bn = BATdense(0, 0, 0)) == NULL) throw(MAL, "regexp.rematchselect", GDK_EXCEPTION); *ret = bn->batCacheid; BBPkeepref(*ret); return MAL_SUCCEED; } After this we convert the ``flags`` string to the ``options`` value that the PCRE library wants:: options = parseflags(flags); Then we load the two input BATs. The parameters ``bid`` and ``sid`` are BAT IDs, but we need BAT descriptors. To get from one to the other we use the function ``BATdescriptor``. Once we have a BAT descriptor, we need to free it again using ``BBPunfix``:: s = NULL; b = BATdescriptor(bid); if (!is_bat_nil(sid)) s = BATdescriptor(sid); We create the output BAT. The third arguments of ``COLnew`` is the desired capacity of the BAT. If the call succeeds, it is guaranteed that there is enough space for this many rows, so if we ask for enough capacity to store the maximum potential result, we don't need to check during insertion. We also set up a pointer to the start of the data area of the BAT:: bn = COLnew(0, TYPE_oid, s ? BATcount(s) : BATcount(b), TRANSIENT); outp = (oid *) Tloc(bn, 0); Since we're going to use the search pattern many times (well, depending on the inputs, of course), we invest extra time to study the pattern:: re = pcre_compile(pat, options, &err, &pos, NULL); sd = pcre_study(re, 0, &err); We then set up some auxiliary variables to help us iterate over the input:: CANDINIT(b, s, start, end, cnt, cand, candend); bi = bat_iterator(b); The macro ``CANDINIT``, defined in the file ``gdk_cand.h``, looks at the first two arguments and initializes the remaining arguments. The function ``bat_iterator`` returns a ``BATiter`` value that is needed for the ``BUNtail``, ``BUNtvar``, ``BUNtloc`` and ``BUNtpos`` macros. The macro ``BUNtail`` is the generic version, the other three can be used if we know more about the BAT that is used. In our case, we know that the BAT contains string values which are variabled sized values. This means that we can use ``BUNtvar``. Using ``BUNtvar`` instead of ``BUNtail`` should be slightly faster since it omits a few tests. Now we get to the core of the algorithm. There are two cases: with and without candidate list. In either case, we iterate and in each iteration we check first whether the value is nil, and if not whether the value matches the regular expression. If there is a match we add the ID of the value to the output. Here, match takes the ``anti`` variable into account. The version with candidate list is as follows:: while (cand < candend) { const char *val = BUNtvar(bi, *cand - b->hseqbase); if (!GDK_STRNIL(val)) { pos = pcre_exec(re, sd, val, (int) strlen(val), 0, 0, NULL, 0); if (pos >= 0) { if (!anti) *outp++ = *cand; } else if (pos == PCRE_ERROR_NOMATCH) { if (anti) *outp++ = *cand; } } cand++; } Now we can release all resources:: if (s) BBPunfix(s->batCacheid); BBPunfix(b->batCacheid); pcre_free_study(sd); pcre_free(re); Finally, we need to fix up the result BAT and return it. As is, the BAT is still empty, so we set the size:: BATsetcount(bn, (BUN) (outp - (oid *) Tloc(bn, 0))); We also need to set the properties. If we had used ``BUNappend`` that would have been done for us, but using ``BUNappend`` is a lot more overhead. The properties we need to set are: ``tsorted`` The column is sorted in ascending order. ``trevsorted`` The column is sorted in descending order. ``tkey`` All values are distinct ``tseqbase`` The starting ``oid`` value of a dense column. If a BAT has type ``oid`` or ``void`` (virtual oid), this value gives the start of the dense sequence of values in the BAT. If the BAT is not dense, ``tseqbase`` should be set to ``oid_nil``. ``tdense`` This property has been removed as of the Aug2018 (11.31.X) release. Use tseqbases instead. Only valid for ``TYPE_oid`` columns: each value in the column is exactly one larger than the one before. ``tnil`` There is at least one NIL value in the column. ``tnonil`` There are no NIL values in the column. ``tnosorted`` If ``tsorted == false``, the index in the column that proofs not sorted. Must be ``0`` if ``tsorted == true``. ``tnorevsorted`` If ``trevsorted == false``, the index in the column that proofs not reverse sorted. Must be ``0`` if ``trevsorted == true``. ``tnokey[0]`` and ``tnokey[1]`` If ``tkey == false``, a pair of indexes that together proof not all values are distinct. Must both be ``0`` if ``tkey == true``. Because of the way we created the output we know a number of these properties:: bn->tsorted = true; bn->tkey = true; bn->trevsorted = BATcount(bn) <= 1; bn->tnil = false; bn->tnonil = true; bn->tnosorted = 0; if (!bn->trevsorted) bn->tnorevsorted = 1; We also know that the column is dense if there is at most a single row in it, but we can easily check whether the column is dense even if there are more rows: we just need to look at the difference between the first and last rows and compare that with the number of rows:: bn->tseqbase = oid_nil; if (BATcount(bn) > 1) { outp = (oid *) Tloc(bn, 0); /* pointer to start */ if (outp[BATcount(bn) - 1] - outp[0] == BATcount(bn) - 1) bn->tseqbase = outp[0]; } Now we can return:: *ret = bn->batCacheid; BBPkeepref(*ret); return MAL_SUCCEED; Join ```` The C interface of the two join functions (with and without flags) is as follows:: char *regexpmatchjoin(bat *lres, bat *rres, const bat *lid, const bat *rid, const bat *sl, const bat *sr, const bit *nil_matches, const lng *estimate); char *regexpmatchfjoin(bat *lres, bat *rres, const bat *lid, const bat *rid, const char **flags, const bat *sl, const bat *sr, const bit *nil_matches, const lng *estimate); In MonetDB, a join returns two aligned (same length, same ``hseqbase``) BATs where each corresponding set of values indicates the OIDs of the left and right inputs that matched with each other. The left and right inputs are here given by ``lid`` and ``rid`` in combination with their respective, optional, candidate lists ``sl`` and ``sr``. The outputs are returned through the ``lres`` and ``rres`` pointers. The additional parameters are: ``flags``, the extra argument for the filter function; ``nil_matches``, if set, NILs are considered "normal" values that can match; ``estimate``, an estimate of how large the result is (often this is unknown which is indicated by the value ``lng_nil``). We ignore the value of ``nil_matches`` since it is only relevant for an equality join. The outputs of the join do not have to be in any particular order. In the general case, neither output is a candidate list (there may be duplicates if a value matches multiple times), so there is also no real point in demanding that one or the other output is sorted. Although it helps to set the ``tsorted`` property if we know that the output is sorted. There is no real alternative to implementing the join using a nested loop implementation. For reasons of efficiency, in our implementation of this join function, we will iterate over the pattern (right input) in the outer loop and over the to-be-matched strings (left input) in the inner loop. In this way, we only need to compile (and study) each pattern once and we can use it multiple times. Since the inner loop is executed many more times than the outer loop, we try to be a bit more efficient in the inner loop. That's why we check for ``lcand`` only once and check for ``rcand`` in every iteration.