[MonetDB-users] Large Database / Is there hope?
Hello,

I am still having problems compiling from sources; however, I would like to determine whether there is hope in continuing.

We have a corpus of about 18,000 documents averaging about 5,000 words each. We additionally have stand-off annotations that we would like to add and query against. This totals about 300 million XML elements. We are running 64-bit Linux with 6 GiB RAM.

Database A: Loading the documents (in batches of 100) as separate collections takes about 30 min and consumes 30 GiB of disk space in var/MonetDB4/dbfarm. Simple queries against Database A take a long time, consume only 5% CPU, and heavily work kswapd:

    count( for $d in pf:documents-unsafe() return doc($d)/tCorpus )   # killed after 1 hour

Database B: Loading the documents (in batches of 100) into a single huge collection takes about 6 hours and consumes 120 GiB of disk space in var/MonetDB4/dbfarm. Simple queries against Database B take a long time, consume only 5% CPU, and heavily work pdflush:

    count( pf:collection("tcorpus")//tCorpus )   # killed after 3 hours

My questions:

1. Is there any hope of successfully performing non-trivial queries on either of these databases using MonetDB?

2. If so, is loading into a single collection or separate collections likely to be preferable?

3. The above work was done using the Jun2010-SP2 SuperBall, which is the only version I have been able to compile. Has any relevant work been done since then on MonetDB4 or any of the XQuery code that might improve performance?

Thank you,
Dean
participants (1)
Dean Serenevy