Just for the record: I finally managed to finish my experiments regarding
[ 1811229 ] [ADT] Adding large document, with update support
http://sourceforge.net/tracker/index.php?func=detail&aid=1811229&group_id=56967&atid=482468
and the related code changes.

For those interested, here's the detailed story:

"S08-64" System (beo-24):
- 2x 64-bit Dual-Core Opteron270 @ 2 GHz
- 8 GB memory
- MonetDB/XQuery 0.20, 64-bit, 64-bit OIDs, --enable-optimize (gcc 4.1.2)

"S16-32" System (core-1):
- 4x 64-bit Dual-Core Opteron870 @ 2 GHz
- 16 GB memory
- MonetDB/XQuery 0.20, 64-bit, 32-bit OIDs, --enable-optimize (gcc 4.1.2)

Document:
http://mirror.openstreetmap.nl/planet/planet-071003.osm.bz2
(extracted: 19 GB XML file)

"SR" Shredding read-only:
  pf:add-doc(".../planet-071003.osm","planet-071003.osm")
"SU" Shredding updateable:
  pf:add-doc(".../planet-071003.osm","planet-071003.osm","planet-071003.osm",5)
"QR"/"QU" Count query:
  count(doc("planet-071003.osm")//*)

Configurations:
m: without Peter's mmap fix in gdk_posix.mx
   (i.e., using rev. 1.143 of gdk_posix.mx)
M: with Peter's mmap fix in gdk_posix.mx
   (i.e., using rev. 1.143.2.1 of gdk_posix.mx)
h: without Peter's new string hash function in gdk_atoms.mx
   (i.e., using rev. 1.134 of gdk_atoms.mx)
H: with Peter's new string hash function in gdk_atoms.mx
   (i.e., using rev. 1.134.6.1 of gdk_atoms.mx)

Results (wall-clock times):

S08-64:
        SR         QR       SU         QU[1]      QU[2]
mh      659m34s    1m09s     81m06s    ERROR      -
mH      -          -        383m17s    ERROR      -
Mh      -          -         77m03s    344m32s    1m34s
MH      644m01s    0m49s    390m42s    342m47s    1m36s

S16-32:
        SR         QR       SU         QU[1]      QU[2]
mh      127m59s    0m17s     43m14s    ERROR      -
mH      110m33s    0m16s     26m26s    ERROR      -
Mh      128m11s    0m18s     44m00s    100m50s    1m15s
MH      191m42s    0m17s     25m43s    101m37s    1m21s

(NB: "SR" includes building of indices, while "SU" does not;
 consequently, "QR" can exploit the indices built during "SR",
 while "QU[1]" has to build the indices first, and only "QU[2]" can
 exploit them.)
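For anyone who wants to compare the wall-clock times above numerically, here is a minimal Python sketch. The helper names (to_seconds, ratio) are made up for illustration and are not part of any MonetDB tooling; the sketch only assumes the "<minutes>m<seconds>s" notation used in the results.

```python
import re


def to_seconds(wall_clock: str) -> int:
    """Convert a time written as '<minutes>m<seconds>s' (e.g. '43m14s') to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)s", wall_clock)
    if match is None:
        # Entries such as 'ERROR' or '-' carry no timing information.
        raise ValueError(f"not a wall-clock time: {wall_clock!r}")
    minutes, seconds = int(match.group(1)), int(match.group(2))
    return minutes * 60 + seconds


def ratio(before: str, after: str) -> float:
    """Return how many times longer 'before' took than 'after'."""
    return to_seconds(before) / to_seconds(after)


if __name__ == "__main__":
    # S16-32, updateable shredding (SU): configuration mh versus mH.
    print(f"SU on S16-32, mh/mH: {ratio('43m14s', '26m26s'):.2f}x")
```

Each number is a single run, so such ratios are only a rough guide.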
The mmap fix in gdk_posix.mx appears to be sufficient to prevent the remap ERROR reported (for a system & configuration similar to "S08-64") in
[ 1811229 ] [ADT] Adding large document, with update support
http://sourceforge.net/tracker/index.php?func=detail&aid=1811229&group_id=56967&atid=482468

I'll leave the further interpretation of the above results to the interested recipient / reader.

Stefan

On Tue, Oct 16, 2007 at 01:46:47PM +0200, Peter Boncz wrote:
Hi,
Hm, I cannot really understand the purpose of the question. And what is wrong with performance fixes?
both fixes are related to the same bug:
- the remap failing is addressed by the gdk_posix fix
- the shredding in the bug report taking excessively long is addressed by the gdk_atoms fix
indeed, any hash function can have collisions.. it all depends on the distribution.
Peter
PS most probably, this mail (sent from my home account) will be rejected by the sourceforge mailing list -- and I cannot send through CWI from home, as secure mail sending is not supported by CWI staff for Microsoft mail clients.
-----Original Message-----
From: Stefan Manegold [mailto:Stefan.Manegold@cwi.nl]
Sent: Tuesday, 16 October 2007 12:01
To: Peter Boncz
Cc: monetdb-developers@lists.sourceforge.net
Subject: Re: [Monetdb-checkins] MonetDB/src/gdk gdk_atoms.mx, MonetDB_1-20, 1.134, 1.134.6.1 gdk_posix.mx, MonetDB_1-20, 1.143, 1.143.2.1
Peter,
which part of your changes fixes the problem with updateable shredding of large XML documents as reported in [ 1811229 ] [ADT] Adding large document, with update support http://sourceforge.net/tracker/index.php?func=detail&aid=1811229&group_id=56967&atid=482468 ?
The new string hash function in gdk_atoms.mx or the file descriptor fixes in gdk_posix.mx?
The former looks more like a performance fix to me --- too many collisions should only slow the system down, but not compromise its functionality/correctness, right? Also, with the new string hash function, ("too") many collisions can still occur with certain datasets ...
Stefan
On Sun, Oct 14, 2007 at 08:31:36PM +0000, Stefan Manegold wrote:
Update of /cvsroot/monetdb/MonetDB/src/gdk
In directory sc8-pr-cvs16.sourceforge.net:/tmp/cvs-serv15103

Modified Files:
      Tag: MonetDB_1-20
        gdk_atoms.mx gdk_posix.mx
Log Message:
[checkin on behalf of Peter]
fixing XQuery bug [ 1811229 ] [ADT] Adding large document, with update support
http://sourceforge.net/tracker/index.php?func=detail&aid=1811229&group_id=56967&atid=482468
gdk_atoms.mx:
- hash collisions in strings that consist of digits only (a common case!);
  we now use a fast derivative of the Bob Jenkins function.

  Really bad collisions: in case of the 20GB document of the bug report,
  shredding took 8 hours before, 1 hour after this change.

  NOTE: this change affects the binary format (string heaps) and all product
        families, as the hash function is a compiled-in macro!
        In particular, lookup operations and joins on SQL (Monet4/5) columns
        consisting of digits only, but stored in a VARCHAR, should be faster
        after this check-in.
gdk_posix.mx:
- we lost track of the file descriptor for large heaps (the file descriptor
  is given to the mmap-monitoring thread to close later), such that the remap
  function could fail (when it was given the illegal file descriptor 0)

  NOTE: this change only affects XQuery, as only XQuery uses remap()
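
To illustrate the digit-only collision problem from the gdk_atoms.mx note, here is a small Python sketch. It is not the MonetDB code: the log only says "a fast derivative of the Bob Jenkins function", so the sketch uses Bob Jenkins' published one-at-a-time hash as a stand-in, and a deliberately weak additive hash (an assumption, not the function that was replaced) as the bad example. The names additive_hash, one_at_a_time and distinct_hashes are chosen for illustration.

```python
MASK32 = 0xFFFFFFFF  # emulate 32-bit unsigned arithmetic


def additive_hash(data: bytes) -> int:
    """Deliberately weak stand-in: the sum of all bytes (NOT the old MonetDB hash)."""
    return sum(data) & MASK32


def one_at_a_time(data: bytes) -> int:
    """Bob Jenkins' one-at-a-time hash (not MonetDB's derivative of it)."""
    h = 0
    for byte in data:
        h = (h + byte) & MASK32
        h = (h + (h << 10)) & MASK32
        h ^= h >> 6
    h = (h + (h << 3)) & MASK32
    h ^= h >> 11
    h = (h + (h << 15)) & MASK32
    return h


def distinct_hashes(hash_fn, keys) -> int:
    """Count how many different hash values a set of keys produces."""
    return len({hash_fn(key) for key in keys})


if __name__ == "__main__":
    # Digit-only strings, like numeric IDs stored as text.
    keys = [str(i).encode("ascii") for i in range(100_000)]
    print("keys:                ", len(keys))
    print("additive, distinct:  ", distinct_hashes(additive_hash, keys))
    print("one-at-a-time, dist.:", distinct_hashes(one_at_a_time, keys))
```

With digits only, the additive hash can produce just a few hundred different values for 100,000 keys, so nearly every lookup walks a long collision chain. That is slow but still correct, which matches the 8 hours versus 1 hour observation in the log.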
--
| Dr. Stefan Manegold | mailto:Stefan.Manegold@cwi.nl |
| CWI, P.O.Box 94079  | http://www.cwi.nl/~manegold/  |
| 1090 GB Amsterdam   | Tel.: +31 (20) 592-4212       |
| The Netherlands     | Fax : +31 (20) 592-4312       |