changeset 19:499debebda5d

Various updates. Thanks Hannes for the comments.
author Sjoerd Mullender <sjoerd@acm.org>
date Tue, 23 Jan 2018 14:03:10 +0100 (2018-01-23)
parents 17e6cc835d78
children b8ec6f36facb
files README.rst regexp/Makefile regexp/README.rst regexp/regexp.c reverse/Makefile reverse/README.rst
diffstat 6 files changed, 164 insertions(+), 137 deletions(-) [+]
line wrap: on
line diff
--- a/README.rst	Tue Jan 23 10:54:53 2018 +0100
+++ b/README.rst	Tue Jan 23 14:03:10 2018 +0100
@@ -25,6 +25,14 @@
 tutorials in the order presented here, since they will go from simple
 to more complicated.
 
+These tutorials are about extending MonetDB/SQL using one or more
+functions written in C.  There are other ways in which to extend the
+functionality of MonetDB/SQL that may be easier to implement.  See
+`Embedded Python/NumPy in MonetDB`__ and `Embedded R in MonetDB`__.
+
+__ https://www.monetdb.org/blog/embedded-pythonnumpy-monetdb
+__ https://www.monetdb.org/content/embedded-r-monetdb
+
 Setting Up
 ----------
 
@@ -37,7 +45,10 @@
 ``MonetDB5-server-devel``, ``MonetDB-devel``, and
 ``MonetDB-stream-devel``.  On Debian and Ubuntu you need to install
 the packages ``monetdb5-server-dev``, ``libmonetdb-dev``, and
-``libmonetdb-stream-dev``.
+``libmonetdb-stream-dev``.  These packages are available from the
+`MonetDB download site`__.
+
+__ https://dev.monetdb.org/downloads/
 
 Tutorials
 ---------
@@ -58,7 +69,7 @@
 column.  The simplest example of a filter is (given a table ``t`` that
 has an integer column ``c``)::
 
-  SELECT c FROM t WHERE c = 0;
+  SELECT * FROM t WHERE c = 0;
 
-This query filter the table t and select all rows where the value of
+This query filters the table t and selects all rows where the value of
 column ``c`` is equal to ``0``.
--- a/regexp/Makefile	Tue Jan 23 10:54:53 2018 +0100
+++ b/regexp/Makefile	Tue Jan 23 14:03:10 2018 +0100
@@ -4,10 +4,10 @@
 #
 # Copyright 2013-2018 MonetDB B.V.
 
-LIBDIR = $(shell pkg-config --variable=libdir monetdb5)
+LIBDIR = $(shell pkgconf --variable=libdir monetdb5)
 
-CFLAGS += $(shell pkg-config --cflags monetdb5 libpcre)
-LDFLAGS += $(shell pkg-config --libs monetdb5 libpcre)
+CFLAGS += $(shell pkgconf --cflags monetdb5 libpcre)
+LDFLAGS += $(shell pkgconf --libs monetdb5 libpcre)
 
 all: lib_regexp.so
 
--- a/regexp/README.rst	Tue Jan 23 10:54:53 2018 +0100
+++ b/regexp/README.rst	Tue Jan 23 14:03:10 2018 +0100
@@ -24,7 +24,7 @@
 filter a column based on whether a regular expression matches.  When
 we're done, we will be able to issue a query such as this::
 
-  SELECT * FROM t WHERE [name] rematch ['^mue?ller$'];
+  SELECT name FROM t WHERE [name] rematch ['^mue?ller$'];
 
 What this query does is select all rows from table ``t`` where the
 column ``name`` matches the regular expression ``^mue?ller$``.  This
@@ -36,9 +36,9 @@
 although regular expressions are more powerful than the patterns
 allowed by LIKE.  The following three queries are equivalent::
 
-  SELECT * FROM t WHERE name LIKE '%foo%';
-  SELECT * FROM t WHERE [name] sys."like" ['%foo%'];
-  SELECT * FROM t WHERE [name] sys.rematch ['.*foo.*'];
+  SELECT name FROM t WHERE name LIKE '%foo%';
+  SELECT name FROM t WHERE [name] sys."like" ['%foo%'];
+  SELECT name FROM t WHERE [name] sys.rematch ['.*foo.*'];
 
 The first two of these queries work out of the box (given a table
 ``t`` witch a ``STRING`` column ``name``).
@@ -47,6 +47,15 @@
 ``LIKE`` is an SQL keyword and in this case we want to use the SQL
 function called ``sys.like``.
 
+It is also possible to have a reference to a column in some table on
+both sides of the filter function.  Given two tables with STRING
+columns where one table contains values and the other regular
+expressions, we can issue a queries such as the following (these two
+queries are equivalent)::
+
+  SELECT t1.value FROM t1 JOIN t2 ON [t1.value] sys.rematch [t2.pattern];
+  SELECT t1.value FROM t1, t2 WHERE [t1.value] sys.rematch [t2.pattern];
+
 The function ``rematch`` that we will create, and also the function
 ``like`` that already exists are FILTER functions.  We can also use
 the filter function for simple (scalar) queries, such as::
@@ -64,12 +73,12 @@
 string vs. finding a matching substring.  The way we will be able to
 call this variant is::
 
-  SELECT * FROM t WHERE [name] sys.rematch ['.*foo.*', 'i'];
+  SELECT name FROM t WHERE [name] sys.rematch ['.*foo.*', 'i'];
 
 This particular query will be equivalent to both::
 
-  SELECT * FROM t WHERE name ILIKE '%foo%';
-  SELECT * FROM t WHERE [name] sys."ilike" ['%foo%'];
+  SELECT name FROM t WHERE name ILIKE '%foo%';
+  SELECT name FROM t WHERE [name] sys."ilike" ['%foo%'];
 
 Implementation
 --------------
@@ -165,52 +174,26 @@
 ................
 
 First we'll describe the core of the matching process.  This has
-nothing to do with how MonetDB works and everything with how the PCRE
-library works.  However, for a complete example, we need this part.
+nothing to do with how MonetDB works and everything with how the
+`PCRE`__ library works.  However, for a complete example, we need this
+part.  Here we ignore all errors.  Consult the PCRE documentation for
+more information.
 
-We have a regular expression in the variable ``pat``, and a set of
-option letters in the variable ``flags``.  We first need to convert
-the option letters to an options value that is understood by the
-library, then we need to compile the pattern to an internal
-representation.  Finally we can use this internal representation to
-check whether the string value in ``val`` matches the pattern.
+__ http://pcre.org/
 
 ::
 
-   /* first the three input variables */
-   const char *pat = "the regular expression";
-   const char *flags = "i"; /* the flags, here "i", meaning case insensitive */
-   const char *val = "the value being matched";
-
-   int options = PCRE_UTF8; /* MonetDB use the UTF-8 encoding exclusively */
-   /* the following two variables are for error handling */
-   const char *err; /* here we will collect the error message */
-   int pos; /* here we collect the position of the error and the result of the match */
+   pcre *re;        /* compiled regular expression */
+   pcre_extra *sd;  /* studied regular expression */
+   const char *err; /* if non-NULL, error message */
+   int pos;         /* error position or match position */
 
-   while (*flags) {
-       switch (*flags) {
-       case 'i': options |= PCRE_CASELESS; break;
-       case 'x': options |= PCRE_EXTENDED; break;
-       /* more cases are possible: there are more potential options */
-       default: /* unrecognized option; here we should generate an error */
-       }
-       flags++;
-   }
-   re = pcre_compile(pat, options, &err, &pos, NULL);
-   if (re == NULL) {
-       /* an error occurred during compilation, the message is in err,
-        * the position of the error in pat is in pos */
-       return; /* return an error */
-   }
-   pos = pcre_exec(re, NULL, val, (int) strlen(val), 0, 0, NULL, 0);
-   if (pos >= 0) {
-       /* there was a match */
-   } else if (pos == PCRE_ERROR_NOMATCH) {
-       /* there was no match and no other error */
-   } else {
-       /* there was some error as specified in the value of pos */
-       return; /* return an error */
-   }
+   re = pcre_compile(patter, options, &err, &pos, NULL);
+   sd = pcre_study(re, 0, &err); /* optional: else use NULL for sd */
+   pos = pcre_exec(re, sd, value, (int) strlen(value), 0, 0, NULL, 0);
+   /* pos >= 0 if there is a match, pos == PCRE_ERROR_NOMATCH if not */
+   pcre_free_study(sd);
+   pcre_free(re);
 
 Match
 `````
@@ -287,24 +270,24 @@
 Not having a candidate list is indicated by ``*sid`` being ``0`` or
 ``bat_nil``.  To be sure, we also check whether ``sid`` itself is
 equal to the ``NULL`` pointer.  If there is no candidate list, all
-values of the main input BAT ``b`` are considered for matching.  If
-there is a candidate list, it is a sorted list of OID values of values
-from ``b`` that are to be matched.
+values of the main input BAT referred to by ``*bid`` are considered
+for matching.  If there is a candidate list, it is a sorted list of
+OID values of values from ``*bid`` that are to be matched.
 
 In our implementation we use the same code for a *dense* candidate
 list and for no candidate list.  We just iterate over all values of
-``b`` between a start and end index.  If there is a non dense
+``*bid`` between a start and end index.  If there is a non dense
 candidate list, we iterate over it and use an indirection to get to
-the value in ``b``.
+the value in ``*bid``.
 
 When creating the result BAT, we allocate enough space in case all
 input values match.  This means that inside the matching loop we don't
 need to check whether there is enough space.  This is, of course, a
 trade off between speed and simplicity (whether or not we check) and
 memory consumption.  An alternative would be to allocate much less and
-use the function ``BUNappend`` to add values.  When using
-``BUNappend`` we also don't need to set any properties afterwards,
-since they are maintained by ``BUNappend``.
+use the function ``BUNappend`` to add values or check for space inside
+the loop.  When using ``BUNappend`` we also don't need to set any
+properties afterwards, since they are maintained by ``BUNappend``.
 
 Speaking of BAT properties, since we're not using ``BUNappend`` we
 need to maintain the properties ourselves.  Properties must never
@@ -450,7 +433,7 @@
 
 ``tdense``
   Only valid for ``TYPE_oid`` columns: each value in the column is
-  exactly one larger than the previous value.
+  exactly one larger than the one before.
 
 ``tnil``
   There is at least one NIL value in the column.
@@ -465,7 +448,7 @@
   If ``trevsorted == 0``, the index in the column that proofs not
   reverse sorted.
 
-``tnokey``
+``tnokey[0]`` and ``tnokey[1]``
   If ``tkey == 0``, a pair of indexes that together proof not all
   values are distinct.
 
--- a/regexp/regexp.c	Tue Jan 23 10:54:53 2018 +0100
+++ b/regexp/regexp.c	Tue Jan 23 14:03:10 2018 +0100
@@ -12,7 +12,7 @@
 #include <mal_exception.h>
 
 /* for the CANDINIT macro */
-#include <gdk_cand.h>
+#include <gdk_cand.h> /* pre Jul2017-SP5, replace with CANDINIT definition */
 
 /* system include files */
 #include <string.h>
@@ -26,8 +26,15 @@
 #define __declspec(x)	/* nothing */
 #endif
 
+/* these six functions are the only externally visible functions; on
+ * Windows they must be exported, on other systems, declaring them as
+ * extern is enough */
 extern __declspec(dllexport) char *regexpmatch(bit *ret, const char **val, const char **pat);
 extern __declspec(dllexport) char *regexpmatchf(bit *ret, const char **val, const char **pat, const char **flags);
+extern __declspec(dllexport) char *regexpmatchselect(bat *ret, const bat *bid, const bat *sid, const char **pat, const bit *anti);
+extern __declspec(dllexport) char *regexpmatchfselect(bat *ret, const bat *bid, const bat *sid, const char **pat, const char **flags, const bit *anti);
+extern __declspec(dllexport) char *regexpmatchjoin(bat *lres, bat *rres, const bat *lid, const bat *rid, const bat *sl, const bat *sr, const bit *nil_matches, const lng *estimate);
+extern __declspec(dllexport) char *regexpmatchfjoin(bat *lres, bat *rres, const bat *lid, const bat *rid, const char **flags, const bat *sl, const bat *sr, const bit *nil_matches, const lng *estimate);
 
 static int
 parseflags(const char *flags)
--- a/reverse/Makefile	Tue Jan 23 10:54:53 2018 +0100
+++ b/reverse/Makefile	Tue Jan 23 14:03:10 2018 +0100
@@ -4,10 +4,10 @@
 #
 # Copyright 2013-2018 MonetDB B.V.
 
-LIBDIR = $(shell pkg-config --variable=libdir monetdb5)
+LIBDIR = $(shell pkgconf --variable=libdir monetdb5)
 
-CFLAGS += $(shell pkg-config --cflags monetdb5)
-LDFLAGS += $(shell pkg-config --libs monetdb5)
+CFLAGS += $(shell pkgconf --cflags monetdb5)
+LDFLAGS += $(shell pkgconf --libs monetdb5)
 
 all: lib_reverse.so
 
--- a/reverse/README.rst	Tue Jan 23 10:54:53 2018 +0100
+++ b/reverse/README.rst	Tue Jan 23 14:03:10 2018 +0100
@@ -35,31 +35,6 @@
 where ``table`` is an SQL table with a column called ``strcol`` which
 is of type ``VARCHAR`` (or any other string type).
 
-A Note About Names
-------------------
-
-The name *BAT* originally stood for *Binary Association Table*.  This
-suggests, correctly, that there were actually two values per row in a
-BAT that were associated.  The two values where the *head* and *tail*
-values of, what was called, a *Binary UNit* or *BUN*.  The head and
-tail columns of a BAT had independent types, but it turned out that
-most of the time we used the type *oid* (*Object IDentifier*) for the
-head column, and the values were, most of the time, consecutive.
-Since then we have made sure that the head values were always of type
-oid and consecutive, which meant that we could do away with the
-complete head column.  The only thing that we still needed (and still
-need) is the first value of the old head column.  This value is called
-*hseqbase* (head sequence base).  There are still many vestiges of the
-old head and tail columns, especially when accessing values in what
-used to be the tail column and now is the only column.  So we have
-functions such as ``BUNtail`` that have nothing to do anymore with a
-tail.  Also, the term BUN was repurposed to being the name we have
-given to the index of the C array that holds the values of what used
-to be the tail column.  For more information see the blog post
-`MonetDB goes headless`__.
-
-__ https://www.monetdb.org/blog/monetdb-goes-headless
-
 Implementation
 --------------
 
@@ -90,8 +65,10 @@
 
 At the SQL side we don't have to do anything more.
 
-Now that we have the SQL interface, we need to create the MAL
-interface.
+The SQL layer produces an intermediate code called MAL (which stands
+for MonetDB Assembly Language).  The external name above refers to a
+command that is defined in MAL, so now that we have the SQL interface,
+we need to create the MAL interface.
 
 The MAL interface of the function looks like this::
 
@@ -121,21 +98,27 @@
  address UDFBATreverse
  comment "Reverse a column of strings";
 
+Note that implementing a bulk version is optional.  If it does not
+exist, the scalar version will be called multiple times.  However,
+calling the scalar version multiple (sometimes very many) times incurs
+significant overhead, hence it is usually a good idea to implement the
+bulk version.
+
 We put these MAL commands in the file
 ``$libdir/monetdb5/reverse.mal``.  In addition we create a file
 ``$libdir/monetdb5/autoload/80_reverse.mal`` that just contains::
 
  include reverse;
 
+We need the extra file because the ``include`` statement actually
+includes both the MAL file ``reverse.mal`` and the dynamically
+loadable C module that we will install in the file ``lib_reverse.so``
+(or whatever extension is needed on the system you're on).
+
 The files in the ``autoload`` directory are executed in order every
 time the server is started so that by putting the ``80_reverse.mal``
 file there, we make sure that the system knows about these functions.
 
-Note that instead of using ``command``, we could have used
-``pattern``.  The ``pattern`` interface is more complicated to work
-with at the C level, but also more powerful.  We will not use the
-``pattern`` interface here.
-
 Now we come to the actual implementation of the feature.
 
 The MAL interfaces of the scalar and bulk versions of the ``reverse``
@@ -153,7 +136,17 @@
 the time this can be done by calling the macro ``throw`` which is
 defined as ``return createException``, so calling ``throw`` with
 arguments will actually call ``createException`` with those same
-arguments and return the result.
+arguments and return the result.  The function ``createException`` has
+three or more arguments, the first is ``MAL``, the second is the name
+of the MAL function, the third is a ``printf`` format string, and
+remaining arguments are values that are used by the format string.  A
+minimal example is::
+
+  char *UDFreverse(char **retval, const char **arg)
+  {
+      (void) retval; (void) arg; /* we're not using these */
+      throw(MAL, "reverse.reverse", "Not yet implemented");
+  }
 
 MAL commands can return any number of values.  These values are
 returned by the C function in the first arguments of the C function,
@@ -310,9 +303,7 @@
 
  BUNappend(bn, dst, 0);
 
-The third argument to ``BUNappend`` is called ``force``.  Only use
-non-zero there if you know what you're doing (and if you need to read
-this document, you don't).
+The third argument to ``BUNappend`` must always be ``0``.
 
 Note that the return value of ``BUNappend`` was changed starting with
 the Jul2015 feature release.  Before, ``BUNappend`` returned its first
@@ -320,20 +311,6 @@
 release it returns ``GDK_SUCCEED`` or ``GDK_FAIL`` for success or
 failure.
 
-Makefile
---------
-
-To bring all of this together, we have a ``Makefile``.  It uses the
-``pkg-config`` command to find the location of the MonetDB
-installation and the arguments needed to compile the module.  This
-Makefile works on Fedora Linux if you have the package
-``MonetDB5-server-devel`` with all its dependencies installed
-(available starting in the Jan2014 feature release), and on
-Debian/Ubuntu if you have the packages ``libmonetdb-dev`` and
-``monetdb5-server-dev`` with all their dependencies installed
-(available starting in the Oct2014-SP3 bugfix release).  The file may
-need to be changed for other systems.
-
 BAT Properties
 --------------
 
@@ -351,57 +328,106 @@
 
 The property flags are
 
-``sorted``
+``tsorted``
   the column is sorted in ascending order
 
-``revsorted``
+``trevsorted``
   the column is sorted in descending order
 
-``key``
+``tkey``
   all values in the column are distinct
 
-``nonil``
+``tnonil``
   there are no nil values in the column
 
-``nil``
+``tnil``
   there are nil values in the column (this property isn't used
   internally)
 
 In addition to these flags there are a few other properties:
 
-``nosorted``
+``tnosorted``
   the location that proofs the column is *not* sorted
 
-``norevsorted``
+``tnorevsorted``
   the location that proofs the column is *not* reverse sorted
 
-``nokey[0]`` and ``nokey[1]``
+``tnokey[0]`` and ``tnokey[1]``
   two locations that together proof *not* all values are distinct
 
-The property flags ``sorted`` and ``revsorted`` may both be set at the
-same time.  When they are, it implies that all values are equal to
+The property flags ``tsorted`` and ``trevsorted`` may both be set at
+the same time.  When they are, it implies that all values are equal to
 each other.
 
-Next to the ``key`` property there is also a ``unique`` property.
-The ``unique`` property, when set, indicates that all values in the
+Next to the ``tkey`` property there is also a ``tunique`` property.
+The ``tunique`` property, when set, indicates that all values in the
 BAT *must* be distinct (as in the UNIQUE constraint in SQL).  We're
-not really concerned with this, since it is not used by the SQL
-layer.  When ``unique`` is set, then so must ``key``.
+not really concerned with this, since it is not used by the SQL layer.
+When ``tunique`` is set, then so must ``tkey``.
 
-When the ``sorted`` property is unset, the ``nosorted`` property is
+When the ``tsorted`` property is unset, the ``tnosorted`` property is
 a position in the BAT where the previous value is not less than or
-equal to the position itself.  If the ``nosorted`` value is 0, we
+equal to the position itself.  If the ``tnosorted`` value is 0, we
 truly don't know whether the BAT is sorted.  Similarly for the
-``revsorted`` and ``norevsorted`` properties.  The two ``nokey``
+``trevsorted`` and ``tnorevsorted`` properties.  The two ``tnokey``
 values are either both 0 (we really don't know), or two distinct
-locations whose values are equal.
+locations whose values are equal.  The ``tno``... values *must* be
+zero if the corresponding property is set, they may be zero if the
+corresponding property in unset.
 
 Note that most of the properties are true for an empty column, hence
 when ``COLnew`` returns, all property flags except for ``nil`` are set
 (there are no nils in an empty column).  This means that as soon as
 you start adding data to a column, you must deal with the property
-flags.
+flags.  The simplest solution is to just clear all properties::
+
+  bn->tsorted = bn->trevsorted = 0;
+  bn->tkey = 0;
+  bn->tnil = 0;
+  bn->tnonil = 0;
+  bn->tnosorted = bn->tnorevsorted = 0;
+  bn->tnokey[0] = bn->tnokey[1] = 0;
 
 Note also that the function ``BUNappend`` maintains the property flags
 as best it can.  That is why in the example we didn't need to do
 anything with the property flags.
+
+Makefile
+--------
+
+To bring all of this together, we have a ``Makefile``.  It uses the
+``pkgconf`` command to find the location of the MonetDB installation
+and the arguments needed to compile the module.  (If you don't have
+``pkgconf``, you may be able to replace it with ``pkg-config``.)  This
+Makefile works on Fedora Linux if you have the package
+``MonetDB5-server-devel`` with all its dependencies installed
+(available starting in the Jan2014 feature release), and on
+Debian/Ubuntu if you have the packages ``libmonetdb-dev`` and
+``monetdb5-server-dev`` with all their dependencies installed
+(available starting in the Oct2014-SP3 bugfix release).  The file may
+need to be changed for other systems.
+
+A Note About Names
+------------------
+
+The name *BAT* originally stood for *Binary Association Table*.  This
+suggests, correctly, that there were actually two values per row in a
+BAT that were associated.  The two values where the *head* and *tail*
+values of, what was called, a *Binary UNit* or *BUN*.  The head and
+tail columns of a BAT had independent types, but it turned out that
+most of the time we used the type *oid* (*Object IDentifier*) for the
+head column, and the values were, most of the time, consecutive.
+Since then we have made sure that the head values were always of type
+oid and consecutive, which meant that we could do away with the
+complete head column.  The only thing that we still needed (and still
+need) is the first value of the old head column.  This value is called
+*hseqbase* (head sequence base).  There are still many vestiges of the
+old head and tail columns, especially when accessing values in what
+used to be the tail column and now is the only column.  So we have
+functions such as ``BUNtail`` that have nothing to do anymore with a
+tail.  Also, the term BUN was repurposed to being the name we have
+given to the index of the C array that holds the values of what used
+to be the tail column.  For more information see the blog post
+`MonetDB goes headless`__.
+
+__ https://www.monetdb.org/blog/monetdb-goes-headless