Hoi Bastiaan,

thanks for using MonetDB/XQuery, and sorry for the delayed reply...

On Thu, Aug 25, 2005 at 04:10:47PM +0200, Bastiaan Naber wrote:
> We have been experimenting with MonetDB/XQuery at our institute and we have some issues.
> We have loaded 727 XML documents into MonetDB with the shred_doc() command:
shred_doc("{xml file 1}, "1") shred_doc("{xml file 2}, "2") etc...
> These files are small (4 to 5 kilobytes).
> When we query these files with the following query, it takes a long time to complete (2 minutes):
for $i in ("0", "1", .... "727") return $i
> Can anyone explain why looping over 727 documents is so slow?
MonetDB/XQuery stores each document in a separate set of tables (respectively BATs); see e.g. http://www.cwi.nl/htbin/ins1/publications?request=abstract&key=BoGrMaRiTe:TR-CWI:05 for details. Hence, looping over 727 documents means looping over 727 sets of tables/BATs, executing the same query in each iteration. The current version of MonetDB/XQuery is not yet optimized for this kind of workload.

In your case, the relative overhead of handling many documents is especially prominent, as the documents themselves are rather small; with larger documents, the relative overhead shrinks, because the time spent actually querying each document grows. This first version of MonetDB/XQuery is optimized for handling (a single) large document(s).

In the future, we will investigate how to improve the handling of large collections of documents. Unfortunately, I cannot give any schedule or roadmap yet.
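To illustrate the two access patterns, here is a minimal sketch. The document names follow your shred_doc() calls above, but "all" and the path //record are made up for the example, and I am assuming fn:doc() accepts the names you passed to shred_doc():

  (: slow: every iteration opens another shredded document,
     i.e., another separate set of tables/BATs :)
  for $i in (1 to 727)
  return doc(string($i))//record

  (: fast: one shredded document, hence one set of tables/BATs :)
  doc("all")//record

The second form pays the per-document overhead only once, which is exactly why the single concatenated file you describe below behaves so much better.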
> We have collected all 727 XML documents into one XML file and loaded that into MonetDB, and this is a lot faster.
> The test machine is a 2.8 GHz P4 with 1 GB of memory.
> Another issue is related to the size of the database on the hard disk. When we first load the 727 XML documents into the database, the database directory contains about 47,000 files and is 50 megabytes in size. After we have executed a number of queries, the size of this directory increases to 1.7 gigabytes!! Can anyone explain this behaviour? Is MonetDB generating some kind of dynamic indices?
All (persistent) indices are generated during shredding; hence, they cannot be the reason for the storage growth you experienced. Actually, it took us some time to find out what happens (we hadn't experimented with so many small documents yet)... But we found the reason.

MonetDB uses mmap to access (large) files (BATs). In MonetDB/XQuery, we currently use mmap to access all files/BATs, independent of their size. For efficiency reasons, mmap does not handle arbitrary file sizes, but requires file sizes to be multiples of (e.g.) 64 KB. During shredding, the BATs are still malloc'ed; only when shredding is finished and the BATs are made persistent are they marked to be accessed via mmap from then on. Once you start querying, and hence access the respective BATs, mmap adjusts their size to a multiple of 64 KB. That's why your database grows.

We will take care of this issue in the coming days. Most probably, we will change the current behaviour of MonetDB/XQuery to mmap BATs only once they exceed a certain size (e.g., 64 KB). We will keep you posted about the progress.

Thank you very much for reporting these issues. Please don't hesitate to contact us again once you have more questions about MonetDB.

Regards,
Stefan
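P.S.: A quick back-of-the-envelope check of your numbers, written as a small XQuery; the file sizes here are assumptions based on your 4-5 KB documents, not measured values:

  (: mmap pads each BAT file to the next multiple of 64 KB :)
  let $block := 65536
  for $size in (4096, 5120)   (: assumed BAT file sizes, in bytes :)
  return ceiling($size div $block) * $block
  (: yields 65536 twice: every 4-5 KB file grows to 64 KB :)

With about 47,000 such files, the worst case is roughly 47,000 x 64 KB, i.e., around 2.9 GB. That you observe 1.7 GB rather than the full amount presumably means that only the BATs your queries have actually touched were resized so far.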
> Thanks for your replies,
> Bastiaan Naber
--
| Dr. Stefan Manegold | mailto:Stefan.Manegold@cwi.nl |
| CWI, P.O.Box 94079  | http://www.cwi.nl/~manegold/  |
| 1090 GB Amsterdam   | Tel.: +31 (20) 592-4212       |
| The Netherlands     | Fax : +31 (20) 592-4312       |