Peter, the purpose and/or reason of/for my question is the fact that we cannot release the binary packages that Sjoerd built yesterday, because the binary database format has been modified without increasing the GDK version number. We simply forgot the latter. Hence, a new Mserver cannot notice in case it is started on an pld database, and instead of complaining properly, it will just behave strangely, crash and/or corrup the database. This is not good. Consequently, Sjoerd needs to re-do the building once again, and we have to think of how to solve the above problem. Obviously, one solution is th "simply" increase the GDK version number. But then, users have to abandon or dump/retore their databases now and with the next release (GDK-2). An other alternative would be to undo the string hash function changes in the release branch, keeping the same binary format for now, and include both binary format changes in one release. In particular if the string hash function changes are "only" for performance reasons, both Sjoerd, Niels, and I would favor the second solution. After all, we and our users have been able to live with the "old" string hash function for years. Those that really require the new one can switch to the development version. I'm busy with testing --- on a 8GB dual-dual-core 2GHz Opteron running 64-bit FedoraCore 6 and and optimized 64-bit Mserver with 64-bit OIDs, shredding the 19 GB file read-only took 11 hours, both with the old and the new hash functions (Mserver grew to 10 GB res, >50 GB virt). Shredding updateable without your two fixes took only 2 hours, but the resulting DB is currupt as reported in [ 1811229 ] [ADT] Adding large document, with update support http://sourceforge.net/tracker/index.php?func=detail&aid=1811229&group_id=56 . Shredding updateable with your two fixes is still busy, running for more than 6 hours, now... Stefan On Tue, Oct 16, 2007 at 01:46:47PM +0200, Peter Boncz wrote:
Hi,
Hm, I cannot really understand the purpose of the question. And what is wrong with performance fixes?
both fixes are related to the same bug: - the remap failing is addressed by the gdk_posix fix - the shredding in the bug report taking excessively long is addressed by the gdk_atoms fix
indeed, any hash function can have collisions.. it all depends on the distribution.
Peter
PS most probably, this mail (sent from my home account) will be rejected by the sourceforge mailing list -- and I cannot sent through CWI from home as the secure mail sending is not supported by CWI staff for microsoft emailers.
-----Original Message----- From: Stefan Manegold [mailto:Stefan.Manegold@cwi.nl] Sent: dinsdag 16 oktober 2007 12:01 To: Peter Boncz Cc: monetdb-developers@lists.sourceforge.net Subject: Re: [Monetdb-checkins] MonetDB/src/gdk gdk_atoms.mx, MonetDB_1-20, 1.134, 1.134.6.1 gdk_posix.mx, MonetDB_1-20, 1.143, 1.143.2.1
Peter,
which part of your changes do fix the problem with updatedable shredding of large XML documents as reporten in [ 1811229 ] [ADT] Adding large document, with update support http://sourceforge.net/tracker/index.php?func=detail&aid=1811229&group_id=56 967&atid=482468 ?
The new has string function in gdk_atoms.mx or the file descriptor fixes in gdk_posix.mx?
The former looks for like a performance fix to me --- too many collisions should only slows the system down, but not copromize its fucntionallity/correctness, right? Also with the new string has functions ("too") many collisions can still occur with certain datasets ...
Stefan
On Sun, Oct 14, 2007 at 08:31:36PM +0000, Stefan Manegold wrote:
Update of /cvsroot/monetdb/MonetDB/src/gdk In directory sc8-pr-cvs16.sourceforge.net:/tmp/cvs-serv15103
Modified Files: Tag: MonetDB_1-20 gdk_atoms.mx gdk_posix.mx Log Message:
[checkin on behalf of Peter]
fixing XQuery bug [ 1811229 ] [ADT] Adding large document, with update support
http://sourceforge.net/tracker/index.php?func=detail&aid=1811229&group_id=56 967&atid=482468
gdk_atoms.mx: - hash collisions in strings that consists of digits only (a common case!) we now use a fast derivative of the Bob Jenkins function from now on
Really bad collisions, in case of the 20GB document of the bug report, shredding took 8 hours before, 1 hour after this change.
NOTE: this change affects the binary format (string heaps) and all
product
families, as the hash function is a compiled-in macro! In particular, lookup operations and joins on SQL (Monet4/5)
columns
consisting of digits only, but stored in a VARCHAR, should be
faster
after this check-in.
gdk_posix.mx - we lost track of the file descriptor for large heaps (the file desc is
to the mmap-monitoring-thread to close later), such that the remap function could fail (when it was given the illegal file descriptor 0)
NOTE: this change only affects xquery it only uses remap()
Index: gdk_posix.mx =================================================================== RCS file: /cvsroot/monetdb/MonetDB/src/gdk/gdk_posix.mx,v retrieving revision 1.143 retrieving revision 1.143.2.1 diff -u -d -r1.143 -r1.143.2.1 --- gdk_posix.mx 4 Sep 2007 17:55:20 -0000 1.143 +++ gdk_posix.mx 14 Oct 2007 20:31:33 -0000 1.143.2.1 @@ -615,7 +615,7 @@ MT_mmap_tab[i].writable = writable; MT_mmap_tab[i].fd = fd; MT_mmap_tab[i].pincnt = 0; - fd = -1; + fd = -fd; } (void) pthread_mutex_unlock(&MT_mmap_lock); return fd; @@ -1051,9 +1051,7 @@ } if (ret != (void *) -1L) { hdl->fixed = ret; - fd = MT_mmap_new(path, ret, len, fd, (mode & MMAP_WRITABLE)); - if (fd <= 0) - hdl->hdl = (void *) 0; /* MT_mmap_new keeps the fd */ + hdl->hdl = (void*) (ssize_t) MT_mmap_new(path, ret, len, fd, (mode & MMAP_WRITABLE)); } return ret; } @@ -1061,13 +1059,12 @@ void * MT_mmap_remap(MT_mmap_hdl *hdl, off_t off, size_t len) { - void *ret; - - ret = mmap(hdl->fixed, + int fd = (int) (ssize_t) hdl->hdl; + void *ret = mmap(hdl->fixed, len, ((hdl->mode & MMAP_WRITABLE) ? PROT_WRITE : 0) | PROT_READ, ((hdl->mode & MMAP_COPY) ? (MAP_PRIVATE | MAP_NORESERVE) : MAP_SHARED) | (hdl->fixed ? MAP_FIXED : 0), - (int) (ssize_t) hdl->hdl, + (fd < 0)?-fd:fd, off);
if (ret != (void *) -1L) { @@ -1083,9 +1080,7 @@ MT_mmap_close(MT_mmap_hdl *hdl) { int fd = (int) (ssize_t) hdl->hdl; - - if (fd) - close(fd); + if (fd > 0) close(fd); hdl->hdl = NULL; }
Index: gdk_atoms.mx =================================================================== RCS file: /cvsroot/monetdb/MonetDB/src/gdk/gdk_atoms.mx,v retrieving revision 1.134 retrieving revision 1.134.6.1 diff -u -d -r1.134 -r1.134.6.1 --- gdk_atoms.mx 2 May 2007 16:16:58 -0000 1.134 +++ gdk_atoms.mx 14 Oct 2007 20:31:32 -0000 1.134.6.1 @@ -1878,13 +1878,19 @@ rotates all characters together. It is optimized to process 2 characters at a time (adding 16-bits to the hash value each iteration). @h -#define GDK_STRHASH(x,y) { \ - str _c = (str) (x); \ - for((y)=0; _c[0] && _c[1]; _c+=2) { \ - (y) = ((y) << 3) ^ ((y) >> 11) ^ ((y) >> 17) ^ (_c[1] <<
given 8) ^ _c[0];\
- } \ - (y) ^= _c[0]; \ +#define GDK_STRHASH(x,y) {\ + str _key = (str) (x);\ + int _i;\ + for (_i = y = 0; _key[_i]; _i++) {\ + y += _key[_i];\ + y += (y << 10);\ + y ^= (y >> 6);\ + }\ + y += (y << 3);\ + y ^= (y >> 11);\ + y += (y << 15);\ } + @c hash_t strHash(str s)
------------------------------------------------------------------------- This SF.net email is sponsored by: Splunk Inc. Still grepping through log files to find problems? Stop. Now Search log events and configuration files using AJAX and a browser. Download your FREE copy of Splunk now >> http://get.splunk.com/ _______________________________________________ Monetdb-checkins mailing list Monetdb-checkins@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/monetdb-checkins
-- | Dr. Stefan Manegold | mailto:Stefan.Manegold@cwi.nl | | CWI, P.O.Box 94079 | http://www.cwi.nl/~manegold/ | | 1090 GB Amsterdam | Tel.: +31 (20) 592-4212 | | The Netherlands | Fax : +31 (20) 592-4312 |
-- | Dr. Stefan Manegold | mailto:Stefan.Manegold@cwi.nl | | CWI, P.O.Box 94079 | http://www.cwi.nl/~manegold/ | | 1090 GB Amsterdam | Tel.: +31 (20) 592-4212 | | The Netherlands | Fax : +31 (20) 592-4312 |