Hi Roy,

I'm not sure how familiar you are with XQuery...

The problem for MonetDB/XQuery (without PF/Tijah) with your query could be that underneath the p elements there may be a large number of nodes (even if they are not used in a nested variant). Your query asks for the atomization of every p node, which leads to a concatenation of all of its descendant text nodes. (The reason is that a node p in '<p>wind <foo/>farm</p>' should be a match as well.) Only then does the function contains do its work.
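To illustrate: fn:string produces the same string value that atomization does, so a small sketch of the effect is

let $p := <p>wind <foo/>farm</p>
return fn:string($p)  (: "wind farm" -- the text nodes "wind " and "farm" are concatenated, so contains($p, "wind farm") is true :)

That concatenation has to be computed for every p element before contains can be evaluated.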

I guess a slightly modified variant, in which no concatenation takes place, might give you better performance:

for $p in collection("papers")//p
where some $t in $p//text() satisfies contains($t, "wind farm")
return $p

It does, however, not catch text snippets that span multiple text nodes (such as the "wind farm" in the example above).

Answering your question about the size of documents vs. collections: currently all documents in a collection are stored in a single big relation. So if you store many big documents in one collection, you will get quite a large relation. (If you split your data into multiple collections, you will get smaller relations and, on a machine with little RAM, perhaps less swapping.)
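Splitting could look roughly like this; I am writing the pf:add-doc calls from memory, so please check the exact signature against the documentation of your MonetDB/XQuery version, and the file paths and collection names are of course just placeholders:

(: load documents into two separate, smaller collections :)
pf:add-doc("file:///data/paper1.xml", "paper1.xml", "papers-a"),
pf:add-doc("file:///data/paper2.xml", "paper2.xml", "papers-b")

Queries then address collection("papers-a") or collection("papers-b") individually.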

As to Lefteris' comment: PF/Tijah uses its indexes only for the additional text-retrieval operations. It does not speed up standard XQuery functions such as contains.
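For completeness, such a text-retrieval operation in PF/Tijah is expressed in NEXI syntax, roughly like this (again a sketch from memory; the exact function name may differ between versions):

(: ranked retrieval via the PF/Tijah text index, not via contains :)
tijah:queryall("//p[about(., wind farm)]")

Unlike contains, about() performs ranked (and index-supported) retrieval, so results are approximate matches rather than exact substring hits.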

Jan

On Jul 20, 2009, at 18:28, Roy Walter wrote:

Running MonetDB/XQuery on a 2.6GHz 32-bit Windows XP box with 1GB of RAM.

What is the best way to organise XML in MonetDB for rapid text searching? A run-down of my recent experience might help.

I created a collection of around 450 documents (153MB approx.). I ran the following query from the command line:

collection("papers")//p[contains(., 'wind farm')]

The query time is at best 19 seconds. That's bad. (It's worse than querying a Postgres database with documents stored in the XML field type.)

So to get a reference point I loaded up the 114MB XMark document and ran this query:

doc("standard")//text[contains(., "yoke")]

The query time varies from 2 to 4 seconds. Better, but still not great.

Now, adding more RAM (and moving to 64-bit) would speed things up I hope! But hardware aside:

1. Is it better to have big documents rather than big collections?

2. Is having small collections (<10 docs) of big documents also inefficient?

Ideally I need to query collections comprising several thousand documents using 'text search' predicates. Are there other, better ways to run this type of query against a MonetDB XML database? Or should I really be using some other platform for this task?

Thanks in advance for any pointers.

-- Roy
------------------------------------------------------------------------------
MonetDB-users mailing list
MonetDB-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/monetdb-users

-- 
Jan Rittinger
Lehrstuhl Datenbanken und Informationssysteme
Wilhelm-Schickard-Institut für Informatik
Eberhard-Karls-Universität Tübingen