[MonetDB-users] Large Database / Is there hope?

18 Nov 2010

Hello, I am still having problems compiling from source; however, I would like to determine whether there is any hope in continuing.

We have a corpus of about 18,000 documents averaging about 5000 words each. We additionally have stand-off annotations that we would like to add and query against. This totals about 300 million XML elements.

We are running 64-bit Linux with 6GiB RAM.

Database A: Loading the documents (in batches of 100) as separate collections takes about 30 min and consumes 30GiB disk space in var/MonetDB4/dbfarm.

Simple queries against Database A take a long time, consume only 5% CPU, and work kswapd heavily:

   count( for $d in pf:documents-unsafe() return doc($d)/tCorpus )   # killed after 1 hour

Database B: Loading the documents (in batches of 100) into a single huge collection takes about 6 hours and consumes 120GiB disk space in var/MonetDB4/dbfarm.

Simple queries against Database B take a long time, consume only 5% CPU, and work pdflush heavily:

   count( pf:collection("tcorpus")//tCorpus )   # killed after 3 hours
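
For scale, here is a rough back-of-envelope calculation using only the figures quoted above. It is my own arithmetic, not a measurement, and it assumes that the 300 million element count applies to what was actually loaded into each database. The function name `footprint` is just an illustrative helper.

```python
GIB = 2**30  # bytes per GiB

ram_gib = 6               # physical RAM on the machine
elements = 300_000_000    # approximate XML element count (assumed to match what was loaded)
farm_gib = {"A (separate collections)": 30, "B (single collection)": 120}

def footprint(disk_gib, ram_gib, elements):
    """Return (disk-to-RAM ratio, approximate bytes on disk per XML element)."""
    ratio = disk_gib / ram_gib
    bytes_per_element = disk_gib * GIB / elements
    return ratio, bytes_per_element

for name, disk in farm_gib.items():
    ratio, bpe = footprint(disk, ram_gib, elements)
    print(f"Database {name}: {ratio:.0f}x RAM, ~{bpe:.0f} bytes per element")
```

Under those assumptions, Database A is about 5 times the size of RAM (roughly 107 bytes per element) and Database B about 20 times (roughly 429 bytes per element). That may be why the queries appear to be I/O bound rather than CPU bound, though I have not confirmed this.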

My questions:

1. Is there any hope of successfully performing non-trivial queries on either of these databases using MonetDB?

2. If so, is loading into a single collection or separate collections likely to be preferable?

3. The above work was done using the Jun2010-SP2 SuperBall, which is the only version I have been able to compile. Has any relevant work been done since then on MonetDB4 or any of the XQuery code that might improve performance?

Thank you,
   Dean

Dean Serenevy