view regexp/README.rst @ 15:59bbfa0096b3

Added a tutorial for creating a FILTER FUNCTION.
author Sjoerd Mullender <sjoerd@acm.org>
date Mon, 22 Jan 2018 17:56:06 +0100 (2018-01-22)
parents
children 17e6cc835d78
line wrap: on
line source
.. This Source Code Form is subject to the terms of the Mozilla Public
.. License, v. 2.0.  If a copy of the MPL was not distributed with this
.. file, You can obtain one at http://mozilla.org/MPL/2.0/.
..
.. Copyright 2018 MonetDB B.V.

.. This document is written in reStructuredText (see
   http://docutils.sourceforge.net/ for more information).
   Use ``rst2html.py`` to convert this file to HTML.

==============================
Implementing a Filter Function
==============================

:Author: Sjoerd Mullender
:Organization: CWI, MonetDB B.V.
:Abstract: In this short tutorial we describe how to create a filter
   function for MonetDB/SQL.

Introduction
------------

We want to create a function or set of functions that allows us to
filter a column based on whether a regular expression matches.  When
we're done, we will be able to issue a query such as this::

  SELECT * FROM t WHERE [name] rematch ['^mue?ller$'];

What this query does is select all rows from table ``t`` where the
column ``name`` matches the regular expression ``^mue?ller$``.  This
tutorial is not about how to create regular expressions or what they
mean.  This tutorial is about how to extend MonetDB/SQL with functions
so that a query like this is possible.

Note that this query is similar to queries using the LIKE predicate,
although regular expressions are more powerful than the patterns
allowed by LIKE.  The following three queries are equivalent::

  SELECT * FROM t WHERE name LIKE '%foo%';
  SELECT * FROM t WHERE [name] sys."like" ['%foo%'];
  SELECT * FROM t WHERE [name] sys.rematch ['.*foo.*'];

The first two of these queries work out of the box (given a table
``t`` witch a ``STRING`` column ``name``).

We need to write ``"like"`` instead of ``LIKE`` or ``like`` since
``LIKE`` is an SQL keyword and in this case we want to use the SQL
function called ``sys.like``.

The function ``rematch`` that we will create, and also the function
``like`` that already exists are FILTER functions.  We can also use
the filter function for simple (scalar) queries, such as::

  SELECT ['some string'] sys."like" ['%foo%'];

The value on the left is matched with the pattern on the right and the
result is a Boolean value indicating whether the value matches the
pattern (``false`` in this case).

In addition to this simple filter function that matches a value with a
pattern, we will create a variant that also has a ``flags`` argument.
It can be useful to pass some extra flags to the matching operator to
tell it to do case insensitive matching, or always match the whole
string vs. finding a matching substring.  The way we will be able to
call this variant is::

  SELECT * FROM t WHERE [name] sys.rematch ['.*foo.*', 'i'];

This particular query will be equivalent to both::

  SELECT * FROM t WHERE name ILIKE '%foo%';
  SELECT * FROM t WHERE [name] sys."ilike" ['%foo%'];

Implementation
--------------

Three things need to be done.  We need to add a function definition to
the SQL catalog, we need to describe the MAL interface for the MAL
interpreter, and we need to create the C implementation of the
function.  We will describe the three steps here in this order.

SQL
...

The SQL interface is simple::

  CREATE FILTER FUNCTION rematch(val STRING, pat STRING)
         EXTERNAL NAME regexp.rematch;
  CREATE FILTER FUNCTION rematch(val STRING, pat STRING, flags STRING)
         EXTERNAL NAME regexp.rematch;

When we create a ``FILTER FUNCTION`` in SQL, the SQL system expects
three MAL functions whose names are based on the external name given.
The functions are named ``regexp.rematch``, ``regexp.rematchselect``,
and ``regexp.rematchjoin``.  In other words, the name as given, and
the name with the string ``select`` and ``join`` appended.  These
functions are used to implement a simple scalar version of the filter,
a version that is used in simple select queries, and one that is used
in joins.

Note that we use the same MAL names for both variants.  The
interpreter will be able to distinguish the variants based on the
arguments given.

This statement will normally be executed once when the database is
created, after which it is part of the SQL catalog.  This is
accomplished by having the statement in a file in the
``$libdir/monetdb/createdb`` directory.  Since files in that directory
are executed in order, the convention is to add a two digit number at
the front of the file name to force the order.  So we have a file
``81_regexp.sql`` where we put this statement.

At the SQL side we don't have to do anything more.

MAL
...

The MAL interface consists of three functions whose names are based on
the names specified in the SQL interface.  The interface looks like
this.  First the variant without the ``flags`` argument::

  module regexp;

  command rematch(val:str, pat:str) :bit
  address regexpmatch
  comment "Return true when the value 'val' matches the regular expression 'pat'";

  command rematchselect(val:bat[:str], s:bat[:oid], pat:str, anti:bit):bat[:oid]
  address regexpmatchselect
  comment "Return the list of matches in 'val' that match the regular expression 'pat'";

  command rematchjoin(val:bat[:str], pat:bat[:str], sl:bat[:oid], sr:bat[:oid], nil_matches:bit, estimate:lng)(lr:bat[:oid],rr:bat[:oid])
  address regexpmatchjoin
  comment "Return the matching pairs from the 'val' and 'pat'
  columns";

The variant with the ``flags`` argument looks like this::

  module regexp;

  command rematch(val:str, pat:str, flags:str) :bit
  address regexpmatchf
  comment "Return true when the value 'val' matches the regular expression 'pat'";

  command rematchselect(val:bat[:str], s:bat[:oid], pat:str, flags:str, anti:bit):bat[:oid]
  address regexpmatchfselect
  comment "Return the list of matches in 'val' that match the regular expression 'pat'";

  command rematchjoin(val:bat[:str], pat:bat[:str], flags:str, sl:bat[:oid], sr:bat[:oid], nil_matches:bit, estimate:lng)(lr:bat[:oid],rr:bat[:oid])
  address regexpmatchfjoin
  comment "Return the matching pairs from the 'val' and 'pat'
  columns";

We put these MAL commands in the file
``$libdir/monetdb5/regexp.mal``.  In addition we create a file
``$libdir/monetdb5/autoload/81_regexp.mal`` that just contains::

 include regexp;

The files in the ``autoload`` directory are executed in order every
time the server is started so that by putting the ``81_regexp.mal``
file there, we make sure that the system knows about these functions.

C Implementation
................

First we'll describe the core of the matching process.  This has
nothing to do with how MonetDB works and everything with how the PCRE
library works.  However, for a complete example, we need this part.

We have a regular expression in the variable ``pat``, and a set of
option letters in the variable ``flags``.  We first need to convert
the option letters to an options value that is understood by the
library, then we need to compile the pattern to an internal
representation.  Finally we can use this internal representation to
check whether the string value in ``val`` matches the pattern.

::
   /* first the three input variables */
   const char *pat = "the regular expression";
   const char *flags = "i"; /* the flags, here "i", meaning case insensitive */
   const char *val = "the value being matched";

   int options = PCRE_UTF8; /* MonetDB use the UTF-8 encoding exclusively */
   /* the following two variables are for error handling */
   const char *err; /* here we will collect the error message */
   int pos; /* here we collect the position of the error and the result of the match */

   while (*flags) {
       switch (*flags) {
       case 'i': options |= PCRE_CASELESS; break;
       case 'x': options |= PCRE_EXTENDED; break;
       /* more cases are possible: there are more potential options */
       default: /* unrecognized option; here we should generate an error */
       }
       flags++;
   }
   re = pcre_compile(pat, options, &err, &pos, NULL);
   if (re == NULL) {
       /* an error occurred during compilation, the message is in err,
        * the position of the error in pat is in pos */
       return; /* return an error */
   }
   pos = pcre_exec(re, NULL, val, (int) strlen(val), 0, 0, NULL, 0);
   if (pos >= 0) {
       /* there was a match */
   } else if (pos == PCRE_ERROR_NOMATCH) {
       /* there was no match and no other error */
   } else {
       /* there was some error as specified in the value of pos */
       return; /* return an error */
   }

Match
`````

With this information, we can implement the first function easily.
The first function is a simple scalar version, i.e. a version that
works on a single value.  We also give the version with flags
argument.  Since they are very similar, they share all code::

  char *
  regexpmatch(bit *ret, const char **val, const char **pat)
  {
      return do_match(ret, *val, *pat, "");
  }

  char *
  regexpmatchf(bit *ret, const char **val, const char **pat, const char **flags)
  {
      return do_match(ret, *val, *pat, *flags);
  }

The function ``do_match`` does all the work and is in essence the
above code (in real code we need to add error handling)::

  static char *
  do_match(bit *ret, const char *val, const char *pat, const char *flags)
  {
      int options; /* calculate value from flags */
      const char *err = NULL;
      int errpos = 0;
      pcre *re;

      if (strcmp(val, str_nil) == 0 ||
          strcmp(pat, str_nil) == 0 ||
          strcmp(flags, str_nil) == 0) {
          /* special case for NIL inputs: NILs don't match anything */
          *ret = 0;
          return MAL_SUCCEED;
      }

      re = pcre_compile(pat, options, &err, &pos, NULL);
      pos = pcre_exec(re, NULL, val, (int) strlen(val), 0, 0, NULL, 0);
      *ret = pos >= 0;
      return MAL_SUCCEED;
  }

Select
``````

The C interface of the two select functions (with and without flags)
is as follows::

  char *regexpmatchselect(bat *ret, const bat *bid, const bat *sid,
          const char **pat, const bit *anti);
  char *regexpmatchfselect(bat *ret, const bat *bid, const bat *sid,
          const char **pat, const char **flags, const bit *anti);

The select function is essentially a bulk version of the match
function.  There is a single pattern that is matched against a
complete column of values.  There are a few complicating issues,
though.

Select returns a list of OID values of matching rows.  This list
*must* be sorted and all values *must* be distinct.  This must also be
indicated in the BAT properties.  Because of the way we go through the
values, we know that the result we produce is already sorted and
distinct.

The select function has a parameter ``const bit *anti``.  If ``*anti``
is set, the match is reversed, except that ``nil`` never matches.

The select function also has a parameter ``const bat *sid``.  This
value specifies a candidate list.  The candidate list is optional.
Not having a candidate list is indicated by ``*sid`` being ``0`` or
``bat_nil``.  To be sure, we also check whether ``sid`` itself is
equal to the ``NULL`` pointer.  If there is no candidate list, all
values of the main input BAT ``b`` are considered for matching.  If
there is a candidate list, it is a sorted list of OID values of values
from ``b`` that are to be matched.

In our implementation we use the same code for a *dense* candidate
list and for no candidate list.  We just iterate over all values of
``b`` between a start and end index.  If there is a non dense
candidate list, we iterate over it and use an indirection to get to
the value in ``b``.

When creating the result BAT, we allocate enough space in case all
input values match.  This means that inside the matching loop we don't
need to check whether there is enough space.  This is, of course, a
trade off between speed and simplicity (whether or not we check) and
memory consumption.  An alternative would be to allocate much less and
use the function ``BUNappend`` to add values.  When using
``BUNappend`` we also don't need to set any properties afterwards,
since they are maintained by ``BUNappend``.

Speaking of BAT properties, since we're not using ``BUNappend`` we
need to maintain the properties ourselves.  Properties must never
lie.  That is to say, if a property is set, whatever the property
indicates must be true.  An unset property indicates we don't know
whether or not the property holds.

The two C functions referenced above are so similar that they share
all code::

  char *
  regexpmatchselect(bat *ret, const bat *bid, const bat *sid,
          const char **pat, const bit *anti)
  {
      return do_select(ret, *bid, sid ? *sid : 0, *pat, "", *anti);
  }


  char *
  regexpmatchfselect(bat *ret, const bat *bid, const bat *sid,
          const char **pat, const char **flags, const bit *anti)
  {
      return do_select(ret, *bid, sid ? *sid : 0, *pat, *flags, *anti);
  }

The function ``do_select`` does all the work::

  static char *
  do_select(bat *ret, bat bid, bat sid, const char *pat,
          const char *flags, bit anti)
  {
      ...
  }

First we check whether any of the string input arguments is NIL.  If
they are, there are no matches and we're done quickly::

  if (GDK_STRNIL(pat) || GDK_STRNIL(flags)) {
      /* no matches when the pattern or the flags is NIL
       * we return an empty BAT of the correct type */
      if ((bn = BATdense(0, 0, 0)) == NULL)
          throw(MAL, "regexp.rematchselect", GDK_EXCEPTION);
      *ret = bn->batCacheid;
      BBPkeepref(*ret);
      return MAL_SUCCEED;
  }

After this we convert the ``flags`` string to the ``options`` value
that the PCRE library wants::

  options = parseflags(flags);

Then we load the two input BATs.  The parameters ``bid`` and ``sid``
are BAT IDs, but we need BAT descriptors.  To get from one to the
other we use the function ``BATdescriptor``.  Once we have a BAT
descriptor, we need to free it again using ``BBPunfix``::

  s = NULL;
  b = BATdescriptor(bid);
  if (sid != 0 && sid != bat_nil)
      s = BATdescriptor(sid);

We create the output BAT.  The third arguments of ``COLnew`` is the
desired capacity of the BAT.  If the call succeeds, it is guaranteed
that there is enough space for this many rows, so if we ask for enough
capacity to store the maximum potential result, we don't need to check
during insertion.  We also set up a pointer to the start of the data
area of the BAT::

  bn = COLnew(0, TYPE_oid, s ? BATcount(s) : BATcount(b), TRANSIENT);
  outp = (oid *) Tloc(bn, 0);

Since we're going to use the search pattern many times (well,
depending on the inputs, of course), we invest extra time to study the
pattern::

  re = pcre_compile(pat, options, &err, &pos, NULL);
  sd = pcre_study(re, 0, &err);

We then set up some auxiliary variables to help us iterate over the
input::

  CANDINIT(b, s, start, end, cnt, cand, candend);
  bi = bat_iterator(b);

The macro ``CANDINIT``, defined in the file ``gdk_cand.h``, looks at
the first two arguments and initializes the remaining arguments.  The
function ``bat_iterator`` returns a ``BATiter`` value that is needed
for the ``BUNtail``, ``BUNtvar``, ``BUNtloc`` and ``BUNtpos`` macros.
The macro ``BUNtail`` is the generic version, the other three can be
used if we know more about the BAT that is used.  In our case, we know
that the BAT contains string values which are variabled sized values.
This means that we can use ``BUNtvar``.  Using ``BUNtvar`` instead of
``BUNtail`` should be slightly faster since it omits a few tests.

Now we get to the core of the algorithm.  There are two cases: with
and without candidate list.  In either case, we iterate and in each
iteration we check first whether the value is nil, and if not whether
the value matches the regular expression.  If there is a match we add
the ID of the value to the output.  Here, match takes the ``anti``
variable into account.  The version with candidate list is as
follows::

  while (cand < candend) {
      const char *val = BUNtvar(bi, *cand - b->hseqbase);
      if (!GDK_STRNIL(val)) {
          pos = pcre_exec(re, sd, val, (int) strlen(val), 0, 0, NULL, 0);
          if (pos >= 0) {
              if (!anti)
                  *outp++ = *cand;
          } else if (pos == PCRE_ERROR_NOMATCH) {
              if (anti)
                  *outp++ = *cand;
          }
      }
      cand++;
  }

Now we can release all resources::

  if (s) BBPunfix(s->batCacheid);
  BBPunfix(b->batCacheid);
  pcre_free_study(sd);
  pcre_free(re);

Finally, we need to fix up the result BAT and return it.  As is, the
BAT is still empty, so we set the size::

  BATsetcount(bn, (BUN) (outp - (oid *) Tloc(bn, 0)));

We also need to set the properties.  If we had used ``BUNappend`` that
would have been done for us, but using ``BUNappend`` is a lot more
overhead.  The properties we need to set are:

``tsorted``
  The column is sorted in ascending order.

``trevsorted``
  The column is sorted in descending order.

``tkey``
  All values are distinct

``tdense``
  Only valid for ``TYPE_oid`` columns: each value in the column is
  exactly one larger than the previous value.

``tnil``
  There is at least one NIL value in the column.

``tnonil``
  There are no NIL values in the column.

``tnosorted``
  If ``tsorted == 0``, the index in the column that proofs not sorted.

``tnorevsorted``
  If ``trevsorted == 0``, the index in the column that proofs not
  reverse sorted.

``tnokey``
  If ``tkey == 0``, a pair of indexes that together proof not all
  values are distinct.

Because of the way we created the output we know a number of these
properties::

  bn->tsorted = 1;
  bn->tkey = 1;
  bn->trevsorted = BATcount(bn) <= 1;
  bn->tnil = 0;
  bn->tnonil = 1;
  bn->tnosorted = 0;
  if (!bn->trevsorted) bn->tnorevsorted = 1;

We also know that the column is dense if there is at most a single row
in it, but we can easily check whether the column is dense even if
there are more rows: we just need to look at the difference between
the first and last rows and compare that with the number of rows::

  if (BATcount(bn) > 1) {
      outp = (oid *) Tloc(bn, 0); /* pointer to start */
      bn->tdense = outp[BATcount(bn) - 1] - outp[0] == BATcount(bn) - 1;
  }

If the column is dense, we need to set ``tseqbase`` to the first value
(or ``0`` if there are no values)::

  bn->tseqbase = BATcount(bn) > 0 ? * (oid *) Tloc(bn, 0) : 0;

Now we can return::

  *ret = bn->batCacheid;
  BBPkeepref(*ret);
  return MAL_SUCCEED;

Join
````

The C interface of the two join functions (with and without flags) is
as follows::

  char *regexpmatchjoin(bat *lres, bat *rres, const bat *lid, const bat *rid,
          const bat *sl, const bat *sr,
	  const bit *nil_matches, const lng *estimate);
  char *regexpmatchfjoin(bat *lres, bat *rres, const bat *lid, const bat *rid,
          const char **flags, const bat *sl, const bat *sr,
	  const bit *nil_matches, const lng *estimate);

In MonetDB, a join returns two aligned (same length, same
``hseqbase``) BATs where each corresponding set of values indicates
the OIDs of the left and right inputs that matched with each other.
The left and right inputs are here given by ``lid`` and ``rid`` in
combination with their respective, optional, candidate lists ``sl``
and ``sr``.  The outputs are returned through the ``lres`` and
``rres`` pointers.  The additional parameters are: ``flags``, the
extra argument for the filter function; ``nil_matches``, if set, NILs
are considered "normal" values that can match; ``estimate``, an
estimate of how large the result is (often this is unknown which is
indicated by the value ``lng_nil``).  We ignore the value of
``nil_matches`` since it is only relevant for an equality join.

The outputs of the join do not have to be in any particular order.  In
the general case, neither output is a candidate list (there may be
duplicates if a value matches multiple times), so there is also no
real point in demanding that one or the other output is sorted.
Although it helps to set the ``tsorted`` property if we know that the
output is sorted.

There is no real alternative to implementing the join using a nested
loop implementation.  For reasons of efficiency, in our implementation
of this join function, we will iterate over the pattern (right input)
in the outer loop and over the to-be-matched strings (left input) in
the inner loop.  In this way, we only need to compile (and study) each
pattern once and we can use it multiple times.

Since the inner loop is executed many more times than the outer loop,
we try to be a bit more efficient in the inner loop.  That's why we
check for ``lcand`` only once and check for ``rcand`` in every
iteration.