Hi Roy,
I'm not sure how familiar you are with XQuery...
The problem for MonetDB/XQuery (without PF/Tijah) in your query could
be that underneath the p elements (even if they are not used in a
nested variant) there might be a large number of nodes. Your query
asks for the atomization of all p nodes, which leads to a
concatenation of all descendant text nodes. (The reason is that the p
node in '<p>wind farm</p>' should be a match as well.) Only then does
the contains function do its work. I guess a slightly modified
variant,

  for $p in collection("papers")//p
  where some $t in $p//text() satisfies contains($t, "wind farm")
  return $p

where no concatenation takes place, might give you better performance.
It does, however, not catch text snippets that span multiple text
nodes.
Answering your question about the size of the documents vs.
collections: Currently all documents in a collection are stored in a
single big relation. So if you store many big documents in a
collection you will get quite a large relation. (If you split your
data into multiple collections you will get smaller relations and on a
machine with small RAM perhaps less swapping.)
As to Lefteris comment: PF/Tijah uses its indexes only for the
additional text retrieval operations. It does not speed up standard
XQuery functions such as contains.
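For completeness, the text-retrieval route through PF/Tijah looks roughly like the sketch below. This is from memory and the exact function name and signature are assumptions; please check the PF/Tijah documentation before relying on it:

  tijah:query("//p[about(., wind farm)]")

Here about() is a NEXI ranking predicate rather than a substring test, so the result is a ranked list of p elements, and this is the kind of query for which the PF/Tijah indexes are actually used.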
Jan
On Jul 20, 2009, at 18:28, Roy Walter wrote:

Running MonetDB/XQuery on a 2.6GHz 32-bit Windows XP box with 1GB of
RAM. What is the best way to organise XML in MonetDB for rapid text
searching? A run-down of my recent experience might help.

I created a collection of around 450 documents (153MB approx.) and ran
the following query from the command line:

  collection("papers")//p[contains(., 'wind farm')]

The query time is at best 19 seconds. That's bad. (It's worse than
querying a Postgres database with documents stored in the XML field
type.)

So to get a reference point I loaded up the 114MB XMark document and
ran this query:

  doc("standard")//text[contains(., "yoke")]

The query time varies from 2 to 4 seconds. Better, but still not
great.

Now, adding more RAM (and moving to 64-bit) would speed things up I
hope! But hardware aside:

1. Is it better to have big documents rather than big collections?
2. Is having small collections (<10 docs) of big documents also
   inefficient?

Ideally I need to query collections comprising several thousand
documents using 'text search' predicates. Are there other, better ways
to run this type of query against a MonetDB XML database? Or should I
really be using some other platform for this task?

Thanks in advance for any pointers.

-- Roy
MonetDB-users mailing list
MonetDB-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/monetdb-users

--
Jan Rittinger
Lehrstuhl Datenbanken und Informationssysteme
Wilhelm-Schickard-Institut für Informatik
Eberhard-Karls-Universität Tübingen
http://www-db.informatik.uni-tuebingen.de/team/rittinger