[MonetDB-users] "Large" document collection and MonetDB/XQuery
We have been experimenting with MonetDB/XQuery at our institute and we have some issues.

We have loaded 727 XML documents into MonetDB with the shred_doc() command:

shred_doc("{xml file 1}", "1")
shred_doc("{xml file 2}", "2")
etc...

These files are small (4 to 5 kilobytes). When we query these files with the following query, the result takes a long time to complete (2 minutes):

for $i in ("0", "1", .... "727") return $i

Can anyone explain why looping over 727 documents is so slow? We have collected all 727 XML documents into one XML file and loaded this into MonetDB, and this is a lot faster. The test machine is a 2.8 GHz P4 with 1 GB of memory.

Another issue is related to the size of the database on the hard disk. When we first load the 727 XML documents into the database, the database directory contains about 47,000 files and is 50 megabytes in size. After we have executed a number of queries, the size of this directory increases to 1.7 gigabytes!! Can anyone explain this behaviour? Is MonetDB generating some kind of dynamic indices?

Thanks for your replies,
Bastiaan Naber
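(Note: as quoted, the query only returns the literal strings "0" through "727" and never opens a document; the workload described presumably dereferences each one. A hypothetical sketch, assuming each document is registered under the name given as the second argument to shred_doc():

for $i in ("0", "1", "2")   (: ... through "727" :)
return doc($i)

so that every iteration opens, and hence queries, a different shredded document.)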
Hi Bastiaan,

thanks for using MonetDB/XQuery, and sorry for the delayed reply...

On Thu, Aug 25, 2005 at 04:10:47PM +0200, Bastiaan Naber wrote:
We have been experimenting with MonetDB/XQuery at our institute and we have some issues.
We have loaded 727 XML documents into MonetDB with the shred_doc() command:
shred_doc("{xml file 1}", "1") shred_doc("{xml file 2}", "2") etc...
These files are small (4 to 5 kilobytes).
When we query these files with the following query, the result takes a long time to complete (2 minutes):
for $i in ("0", "1", .... "727") return $i
Can anyone explain why looping over 727 documents is so slow?
MonetDB/XQuery stores each document in a separate set of tables (BATs); see e.g. http://www.cwi.nl/htbin/ins1/publications?request=abstract&key=BoGrMaRiTe:TR-CWI:05 for details. Hence, looping over 727 documents means looping over 727 sets of tables/BATs, executing the same query in each iteration. The current version of MonetDB/XQuery is not yet optimized for this kind of workload.

In your case, the (relative) overhead of handling many documents is especially prominent, as the actual documents are rather small; with larger documents, the relative overhead becomes smaller, as the time to query each document gets larger.

In this first version, MonetDB/XQuery is optimized for handling (a single) large document(s). In the future, we will investigate how we can improve the handling of large collections of documents. Unfortunately, I cannot give any schedule or roadmap yet.
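(To make this concrete, a hypothetical sketch — the document names "0".."727" and "all", and the element name item, are assumptions. The first query executes essentially the same plan once per document, i.e. 727 times over 727 sets of tables/BATs, while the second executes it once over a single large document:

(: per-document: one plan execution per shredded document :)
for $i in ("0", "1", "2")   (: ... through "727" :)
return doc($i)//item

(: combined: a single execution over one large document :)
doc("all")//item

With 4-5 KB documents, the fixed per-execution overhead dominates the actual work done per document.)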
We have collected all 727 XML documents into one XML file and loaded this into MonetDB, and this is a lot faster.
The test machine is a 2.8 GHz P4 with 1 GB of memory.
Another issue is related to the size of the database on the hard disk. When we first load the 727 XML documents into the database, the database directory contains about 47,000 files and is 50 megabytes in size. After we have executed a number of queries, the size of this directory increases to 1.7 gigabytes!! Can anyone explain this behaviour? Is MonetDB generating some kind of dynamic indices?
All (persistent) indices are generated during shredding; hence, they cannot be the reason for the storage growth you experienced.

Actually, it took us some time to find out what happens (we hadn't experimented with so many small documents yet)... But we found the reason: MonetDB uses mmap to access (large) files (BATs). In MonetDB/XQuery, we currently use mmap to access all files/BATs, independent of their size. Apparently, (for efficiency reasons) mmap does not handle arbitrary file sizes, but requires file sizes to be multiples of (e.g.) 64 KB. During shredding, the BATs are still malloced. Only when shredding is finished and the BATs are made persistent are they marked to be accessed via mmap from then on. Once you start querying, and hence access the respective BATs, mmap adjusts their size to a multiple of 64 KB. That's why your database grows.

We will take care of this issue in the coming days. Most probably, we will change the current behaviour of MonetDB/XQuery to mmap BATs only once they exceed a certain size (e.g., 64 KB). We will keep you posted about the progress.

Thank you very much for reporting these issues. Please don't hesitate to contact us again once you have more questions about MonetDB.

Regards,
Stefan
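(A back-of-the-envelope check, assuming each of the roughly 47,000 BAT files is padded up to the next 64 KB boundary when first accessed via mmap:

47,000 files x 64 KB/file ≈ 3,000,000 KB ≈ 2.9 GB

That is the upper bound if every file is touched; the observed growth from 50 MB to 1.7 GB is consistent with a large fraction, but not all, of the BATs having been accessed by the queries.)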
Thanks for your replies, Bastiaan Naber
--
| Dr. Stefan Manegold | mailto:Stefan.Manegold@cwi.nl |
| CWI, P.O.Box 94079  | http://www.cwi.nl/~manegold/  |
| 1090 GB Amsterdam   | Tel.: +31 (20) 592-4212       |
| The Netherlands     | Fax : +31 (20) 592-4312       |
Stefan Manegold wrote: [snip]
In this first version, MonetDB/XQuery is optimized for handling (a single) large document(s).
In the future, we will investigate how we can improve the handling of large collections of documents. Unfortunately, I cannot give any schedule or roadmap yet.
Not to put pressure on the roadmap or schedule at all, just to express my interest in this kind of feature as well: I'm watching MonetDB/XQuery with much interest and am hoping to eventually build a Python binding for it that conveniently exposes the XML database's power to a Python programmer.

Querying large to very large collections of smaller (often quite small) documents is essential for my use cases, though (which come from the CMS world). That MonetDB/XQuery can't do this yet is one reason I'm not playing with MonetDB more at the moment.

Being able to do queries over semi-structured data spread out over multiple documents is a very interesting feature:

http://weblog.infoworld.com/udell/2005/02/15.html
http://weblog.infoworld.com/udell/2005/02/18.html

Regards,
Martijn
participants (3)
- Bastiaan Naber
- Martijn Faassen
- Stefan Manegold