view regexp/README.rst @ 29:e44cffee8312

Implemented bulk variant of match function.
author Sjoerd Mullender <sjoerd@acm.org>
date Mon, 06 Aug 2018 20:59:23 +0200 (2018-08-06)
parents e925d55b369b
children e70b12c15507
line wrap: on
line source
.. This Source Code Form is subject to the terms of the Mozilla Public
.. License, v. 2.0.  If a copy of the MPL was not distributed with this
.. file, You can obtain one at http://mozilla.org/MPL/2.0/.
..
.. Copyright 2018 MonetDB B.V.

.. This document is written in reStructuredText (see
   http://docutils.sourceforge.net/ for more information).
   Use ``rst2html.py`` to convert this file to HTML.

==============================
Implementing a Filter Function
==============================

:Author: Sjoerd Mullender
:Organization: CWI, MonetDB B.V.
:Abstract: In this short tutorial we describe how to create a filter
   function for MonetDB/SQL.

Introduction
------------

We want to create a function or set of functions that allows us to
filter a column based on whether a regular expression matches.  When
we're done, we will be able to issue a query such as this::

  SELECT name FROM t WHERE [name] rematch ['^mue?ller$'];

What this query does is select all rows from table ``t`` where the
column ``name`` matches the regular expression ``^mue?ller$``.  This
tutorial is not about how to create regular expressions or what they
mean.  This tutorial is about how to extend MonetDB/SQL with functions
so that a query like this is possible.

Note that this query is similar to queries using the LIKE predicate,
although regular expressions are more powerful than the patterns
allowed by LIKE.  The following three queries are equivalent::

  SELECT name FROM t WHERE name LIKE '%foo%';
  SELECT name FROM t WHERE [name] sys."like" ['%foo%'];
  SELECT name FROM t WHERE [name] sys.rematch ['.*foo.*'];

The first two of these queries work out of the box (given a table
``t`` witch a ``STRING`` column ``name``).

We need to write ``"like"`` instead of ``LIKE`` or ``like`` since
``LIKE`` is an SQL keyword and in this case we want to use the SQL
function called ``sys.like``.

It is also possible to have a reference to a column in some table on
both sides of the filter function.  Given two tables with STRING
columns where one table contains values and the other regular
expressions, we can issue queries such as the following (these two
queries are equivalent)::

  SELECT t1.value FROM t1 JOIN t2 ON [t1.value] sys.rematch [t2.pattern];
  SELECT t1.value FROM t1, t2 WHERE [t1.value] sys.rematch [t2.pattern];

The function ``rematch`` that we will create, and also the function
``like`` that already exists are FILTER functions.  We can also use
the filter function for simple (scalar) queries, such as::

  SELECT ['some string'] sys."like" ['%foo%'];

The value on the left is matched with the pattern on the right and the
result is a Boolean value indicating whether the value matches the
pattern (``false`` in this case).

In addition to this simple filter function that matches a value with a
pattern, we will create a variant that also has a ``flags`` argument.
It can be useful to pass some extra flags to the matching operator to
tell it to do case insensitive matching, or always match the whole
string vs. finding a matching substring.  We will be able to call this
variant is as follows::

  SELECT ['some string'] sys.rematch ['.*foo.*', 'i'];
  SELECT name FROM t WHERE [name] sys.rematch ['.*foo.*', 'i'];
  SELECT t1.value FROM t1 JOIN t2 ON [t1.value] sys.rematch [t2.pattern, 'i'];
  SELECT t1.value FROM t1, t2 WHERE [t1.value] sys.rematch [t2.pattern, 'i'];

Implementation
--------------

Three things need to be done.  We need to add a function definition to
the SQL catalog, we need to describe the MAL interface for the MAL
interpreter, and we need to create the C implementation of the
function.  We will describe the three steps here in this order.

SQL
...

The SQL interface is simple::

  CREATE FILTER FUNCTION rematch(val STRING, pat STRING)
         EXTERNAL NAME regexp.rematch;
  CREATE FILTER FUNCTION rematch(val STRING, pat STRING, flags STRING)
         EXTERNAL NAME regexp.rematch;

When we create a ``FILTER FUNCTION`` in SQL, the SQL system expects
three MAL functions whose names are based on the external name given.
The functions are named ``regexp.rematch``, ``regexp.rematchselect``,
and ``regexp.rematchjoin``.  In other words, the name as given, and
the name with the string ``select`` and ``join`` appended.  These
functions are used to implement a simple scalar version of the filter,
a version that is used in simple select queries, and one that is used
in joins.

Note that we use the same MAL names for both variants.  The
interpreter will be able to distinguish the variants based on the
arguments given.

This statement will normally be executed once when the database is
created, after which it is part of the SQL catalog.  This is
accomplished by having the statement in a file in the
``$libdir/monetdb/createdb`` directory.  Since files in that directory
are executed in order, the convention is to add a two digit number at
the front of the file name to force the order.  So we have a file
``81_regexp.sql`` where we put this statement.

At the SQL side we don't have to do anything more.

MAL
...

The MAL interface consists of three or four functions whose names are
based on the names specified in the SQL interface.  The three
functions that need to be implemented have names that are equal to the
name given as the ``EXTERNAL NAME`` in the SQL interface, plus that
same name with ``join`` and ``select`` appended.  A fourth function
can be optionally implemented.  It is the *bulk* variant of the first
function.  This bulk function has a name that is equal to the name
given in SQL with ``bat`` prepended.  The bulk variant returns a BAT
with a single value for each input value.  See the *reverse* tutorial.

The interface looks like this.  First the variant without the
``flags`` argument::

  module regexp;

  command rematch(val:str, pat:str) :bit
  address regexpmatch
  comment "Return true when the value 'val' matches the regular expression 'pat'";

  command rematchselect(val:bat[:str], s:bat[:oid], pat:str, anti:bit):bat[:oid]
  address regexpmatchselect
  comment "Return the list of matches in 'val' that match the regular expression 'pat'";

  command rematchjoin(val:bat[:str], pat:bat[:str], sl:bat[:oid], sr:bat[:oid], nil_matches:bit, estimate:lng)(lr:bat[:oid],rr:bat[:oid])
  address regexpmatchjoin
  comment "Return the matching pairs from the 'val' and 'pat'
  columns";

  module batregexp;

  command rematch(val:bat[:str], pat:str) :bat[:bit]
  address regexpmatchbulk
  comment "Return a BAT with true for match and false for no match";

The variant with the ``flags`` argument looks like this::

  module regexp;

  command rematch(val:str, pat:str, flags:str) :bit
  address regexpmatchf
  comment "Return true when the value 'val' matches the regular expression 'pat'";

  command rematchselect(val:bat[:str], s:bat[:oid], pat:str, flags:str, anti:bit):bat[:oid]
  address regexpmatchfselect
  comment "Return the list of matches in 'val' that match the regular expression 'pat'";

  command rematchjoin(val:bat[:str], pat:bat[:str], flags:str, sl:bat[:oid], sr:bat[:oid], nil_matches:bit, estimate:lng)(lr:bat[:oid],rr:bat[:oid])
  address regexpmatchfjoin
  comment "Return the matching pairs from the 'val' and 'pat'
  columns";

  module batregexp;

  command rematch(val:bat[:str], pat:str, flags:str) :bat[:bit]
  address regexpmatchfbulk
  comment "Return a BAT with true for match and false for no match";

We put these MAL commands in the file
``$libdir/monetdb5/regexp.mal``.  In addition we create a file
``$libdir/monetdb5/autoload/81_regexp.mal`` that just contains::

 include regexp;

The files in the ``autoload`` directory are executed in order every
time the server is started so that by putting the ``81_regexp.mal``
file there, we make sure that the system knows about these functions.

C Implementation
................

First we'll describe the core of the matching process.  This has
nothing to do with how MonetDB works and everything with how the
`PCRE`__ library works.  However, for a complete example, we need this
part.  Here we ignore all errors.  Consult the PCRE documentation for
more information.

__ http://pcre.org/

::

   pcre *re;        /* compiled regular expression */
   pcre_extra *sd;  /* studied regular expression */
   const char *err; /* if non-NULL, error message */
   int pos;         /* error position or match position */

   re = pcre_compile(patter, options, &err, &pos, NULL);
   sd = pcre_study(re, 0, &err); /* optional: else use NULL for sd */
   pos = pcre_exec(re, sd, value, (int) strlen(value), 0, 0, NULL, 0);
   /* pos >= 0 if there is a match, pos == PCRE_ERROR_NOMATCH if not */
   pcre_free_study(sd);
   pcre_free(re);

Match
`````

With this information, we can implement the first function easily.
The first function is a simple scalar version, i.e. a version that
works on a single value.  We also give the version with flags
argument.  Since they are very similar, they share all code::

  char *
  regexpmatch(bit *ret, const char **val, const char **pat)
  {
      return do_match(ret, *val, *pat, "");
  }

  char *
  regexpmatchf(bit *ret, const char **val, const char **pat, const char **flags)
  {
      return do_match(ret, *val, *pat, *flags);
  }

The function ``do_match`` does all the work and is in essence the
above code (in real code we need to add error handling)::

  static char *
  do_match(bit *ret, const char *val, const char *pat, const char *flags)
  {
      int options; /* calculate value from flags */
      const char *err = NULL;
      int errpos = 0;
      pcre *re;

      if (GDK_STRNIL(val) || GDK_STRNIL(pat) || GDK_STRNIL(flags)) {
          /* special case for NIL inputs: NILs don't match anything */
          *ret = 0;
          return MAL_SUCCEED;
      }

      re = pcre_compile(pat, options, &err, &pos, NULL);
      pos = pcre_exec(re, NULL, val, (int) strlen(val), 0, 0, NULL, 0);
      *ret = pos >= 0;
      return MAL_SUCCEED;
  }

We will not here describe the bulk variant.  The code is in the source
file, though.

Select
``````

The C interface of the two select functions (with and without flags)
is as follows::

  char *regexpmatchselect(bat *ret, const bat *bid, const bat *sid,
          const char **pat, const bit *anti);
  char *regexpmatchfselect(bat *ret, const bat *bid, const bat *sid,
          const char **pat, const char **flags, const bit *anti);

The select function is essentially a bulk version of the match
function.  There is a single pattern that is matched against a
complete column of values.  There are a few complicating issues,
though.

Select returns a list of OID values of matching rows.  This list
*must* be sorted and all values *must* be distinct.  This must also be
indicated in the BAT properties.  Because of the way we go through the
values, we know that the result we produce is already sorted and
distinct.

The select function has a parameter ``const bit *anti``.  If ``*anti``
is set, the match is reversed, except that ``nil`` never matches.

The select function also has a parameter ``const bat *sid``.  This
value specifies a candidate list.  The candidate list is optional.
Not having a candidate list is indicated by ``*sid`` being ``0`` or
``bat_nil``.  To be sure, we also check whether ``sid`` itself is
equal to the ``NULL`` pointer.  If there is no candidate list, all
values of the main input BAT referred to by ``*bid`` are considered
for matching.  If there is a candidate list, it is a sorted list of
OID values of values from ``*bid`` that are to be matched.

In our implementation we use the same code for a *dense* candidate
list and for no candidate list.  We just iterate over all values of
``*bid`` between a start and end index.  If there is a non dense
candidate list, we iterate over it and use an indirection to get to
the value in ``*bid``.

When creating the result BAT, we allocate enough space in case all
input values match.  This means that inside the matching loop we don't
need to check whether there is enough space.  This is, of course, a
trade off between speed and simplicity (whether or not we check) and
memory consumption.  An alternative would be to allocate much less and
use the function ``BUNappend`` to add values or check for space inside
the loop.  When using ``BUNappend`` we also don't need to set any
properties afterwards, since they are maintained by ``BUNappend``.

Speaking of BAT properties, since we're not using ``BUNappend`` we
need to maintain the properties ourselves.  Properties must never
lie.  That is to say, if a property is set, whatever the property
indicates must be true.  An unset property indicates we don't know
whether or not the property holds.

The two C functions referenced above are so similar that they share
all code::

  char *
  regexpmatchselect(bat *ret, const bat *bid, const bat *sid,
          const char **pat, const bit *anti)
  {
      return do_select(ret, *bid, sid ? *sid : 0, *pat, "", *anti);
  }


  char *
  regexpmatchfselect(bat *ret, const bat *bid, const bat *sid,
          const char **pat, const char **flags, const bit *anti)
  {
      return do_select(ret, *bid, sid ? *sid : 0, *pat, *flags, *anti);
  }

The function ``do_select`` does all the work::

  static char *
  do_select(bat *ret, bat bid, bat sid, const char *pat,
          const char *flags, bit anti)
  {
      ...
  }

First we check whether any of the string input arguments is NIL.  If
they are, there are no matches and we're done quickly::

  if (GDK_STRNIL(pat) || GDK_STRNIL(flags)) {
      /* no matches when the pattern or the flags is NIL
       * we return an empty BAT of the correct type */
      if ((bn = BATdense(0, 0, 0)) == NULL)
          throw(MAL, "regexp.rematchselect", GDK_EXCEPTION);
      *ret = bn->batCacheid;
      BBPkeepref(*ret);
      return MAL_SUCCEED;
  }

After this we convert the ``flags`` string to the ``options`` value
that the PCRE library wants::

  options = parseflags(flags);

Then we load the two input BATs.  The parameters ``bid`` and ``sid``
are BAT IDs, but we need BAT descriptors.  To get from one to the
other we use the function ``BATdescriptor``.  Once we have a BAT
descriptor, we need to free it again using ``BBPunfix``::

  s = NULL;
  b = BATdescriptor(bid);
  if (!is_bat_nil(sid))
      s = BATdescriptor(sid);

We create the output BAT.  The third arguments of ``COLnew`` is the
desired capacity of the BAT.  If the call succeeds, it is guaranteed
that there is enough space for this many rows, so if we ask for enough
capacity to store the maximum potential result, we don't need to check
during insertion.  We also set up a pointer to the start of the data
area of the BAT::

  bn = COLnew(0, TYPE_oid, s ? BATcount(s) : BATcount(b), TRANSIENT);
  outp = (oid *) Tloc(bn, 0);

Since we're going to use the search pattern many times (well,
depending on the inputs, of course), we invest extra time to study the
pattern::

  re = pcre_compile(pat, options, &err, &pos, NULL);
  sd = pcre_study(re, 0, &err);

We then set up some auxiliary variables to help us iterate over the
input::

  CANDINIT(b, s, start, end, cnt, cand, candend);
  bi = bat_iterator(b);

The macro ``CANDINIT``, defined in the file ``gdk_cand.h``, looks at
the first two arguments and initializes the remaining arguments.  The
function ``bat_iterator`` returns a ``BATiter`` value that is needed
for the ``BUNtail``, ``BUNtvar``, ``BUNtloc`` and ``BUNtpos`` macros.
The macro ``BUNtail`` is the generic version, the other three can be
used if we know more about the BAT that is used.  In our case, we know
that the BAT contains string values which are variabled sized values.
This means that we can use ``BUNtvar``.  Using ``BUNtvar`` instead of
``BUNtail`` should be slightly faster since it omits a few tests.

Now we get to the core of the algorithm.  There are two cases: with
and without candidate list.  In either case, we iterate and in each
iteration we check first whether the value is nil, and if not whether
the value matches the regular expression.  If there is a match we add
the ID of the value to the output.  Here, match takes the ``anti``
variable into account.  The version with candidate list is as
follows::

  while (cand < candend) {
      const char *val = BUNtvar(bi, *cand - b->hseqbase);
      if (!GDK_STRNIL(val)) {
          pos = pcre_exec(re, sd, val, (int) strlen(val), 0, 0, NULL, 0);
          if (pos >= 0) {
              if (!anti)
                  *outp++ = *cand;
          } else if (pos == PCRE_ERROR_NOMATCH) {
              if (anti)
                  *outp++ = *cand;
          }
      }
      cand++;
  }

Now we can release all resources::

  if (s) BBPunfix(s->batCacheid);
  BBPunfix(b->batCacheid);
  pcre_free_study(sd);
  pcre_free(re);

Finally, we need to fix up the result BAT and return it.  As is, the
BAT is still empty, so we set the size::

  BATsetcount(bn, (BUN) (outp - (oid *) Tloc(bn, 0)));

We also need to set the properties.  If we had used ``BUNappend`` that
would have been done for us, but using ``BUNappend`` is a lot more
overhead.  The properties we need to set are:

``tsorted``
  The column is sorted in ascending order.

``trevsorted``
  The column is sorted in descending order.

``tkey``
  All values are distinct

``tseqbase``
  The starting ``oid`` value of a dense column.  If a BAT has type
  ``oid`` or ``void`` (virtual oid), this value gives the start of the
  dense sequence of values in the BAT.  If the BAT is not dense,
  ``tseqbase`` should be set to ``oid_nil``.

``tdense``
  This property has been removed as of the Aug2018 (11.31.X) release.
  Use tseqbases instead.
  Only valid for ``TYPE_oid`` columns: each value in the column is
  exactly one larger than the one before.

``tnil``
  There is at least one NIL value in the column.

``tnonil``
  There are no NIL values in the column.

``tnosorted``
  If ``tsorted == false``, the index in the column that proofs not
  sorted.  Must be ``0`` if ``tsorted == true``.

``tnorevsorted``
  If ``trevsorted == false``, the index in the column that proofs not
  reverse sorted.  Must be ``0`` if ``trevsorted == true``.

``tnokey[0]`` and ``tnokey[1]``
  If ``tkey == false``, a pair of indexes that together proof not all
  values are distinct.  Must both be ``0`` if ``tkey == true``.

Because of the way we created the output we know a number of these
properties::

  bn->tsorted = true;
  bn->tkey = true;
  bn->trevsorted = BATcount(bn) <= 1;
  bn->tnil = false;
  bn->tnonil = true;
  bn->tnosorted = 0;
  if (!bn->trevsorted) bn->tnorevsorted = 1;

We also know that the column is dense if there is at most a single row
in it, but we can easily check whether the column is dense even if
there are more rows: we just need to look at the difference between
the first and last rows and compare that with the number of rows::

  bn->tseqbase = oid_nil;
  if (BATcount(bn) > 1) {
      outp = (oid *) Tloc(bn, 0); /* pointer to start */
      if (outp[BATcount(bn) - 1] - outp[0] == BATcount(bn) - 1)
          bn->tseqbase = outp[0];
  }

Now we can return::

  *ret = bn->batCacheid;
  BBPkeepref(*ret);
  return MAL_SUCCEED;

Join
````

The C interface of the two join functions (with and without flags) is
as follows::

  char *regexpmatchjoin(bat *lres, bat *rres, const bat *lid, const bat *rid,
          const bat *sl, const bat *sr,
	  const bit *nil_matches, const lng *estimate);
  char *regexpmatchfjoin(bat *lres, bat *rres, const bat *lid, const bat *rid,
          const char **flags, const bat *sl, const bat *sr,
	  const bit *nil_matches, const lng *estimate);

In MonetDB, a join returns two aligned (same length, same
``hseqbase``) BATs where each corresponding set of values indicates
the OIDs of the left and right inputs that matched with each other.
The left and right inputs are here given by ``lid`` and ``rid`` in
combination with their respective, optional, candidate lists ``sl``
and ``sr``.  The outputs are returned through the ``lres`` and
``rres`` pointers.  The additional parameters are: ``flags``, the
extra argument for the filter function; ``nil_matches``, if set, NILs
are considered "normal" values that can match; ``estimate``, an
estimate of how large the result is (often this is unknown which is
indicated by the value ``lng_nil``).  We ignore the value of
``nil_matches`` since it is only relevant for an equality join.

The outputs of the join do not have to be in any particular order.  In
the general case, neither output is a candidate list (there may be
duplicates if a value matches multiple times), so there is also no
real point in demanding that one or the other output is sorted.
Although it helps to set the ``tsorted`` property if we know that the
output is sorted.

There is no real alternative to implementing the join using a nested
loop implementation.  For reasons of efficiency, in our implementation
of this join function, we will iterate over the pattern (right input)
in the outer loop and over the to-be-matched strings (left input) in
the inner loop.  In this way, we only need to compile (and study) each
pattern once and we can use it multiple times.

Since the inner loop is executed many more times than the outer loop,
we try to be a bit more efficient in the inner loop.  That's why we
check for ``lcand`` only once and check for ``rcand`` in every
iteration.