On Mon, Jul 20, 2009 at 8:45 PM, Jan Rittinger wrote:
Hi Roy,

I'm not sure how familiar you are with XQuery...

The problem for MonetDB/XQuery (without PF/Tijah) with your query could be that there might be a large number of nodes underneath the p elements (even if not used in a nested variant). Your query asks for the atomization of every p node, which leads to a concatenation of all its descendant text nodes. (The reason is that the node p in '<p>wind <foo/>farm</p>' should be a match as well.) Only then does the function contains do its work.

I guess a slightly modified variant

for $p in collection("papers")//p where some $t in $p//text() satisfies contains($t, "wind farm") return $p

where no concatenation takes place, might give you better performance. However, it does not catch text snippets that span several text nodes.

Answering your question about the size of documents vs. collections: Currently all documents in a collection are stored in a single big relation. So if you store many big documents in a collection, you will get quite a large relation. (If you split your data into multiple collections, you will get smaller relations and, on a machine with little RAM, perhaps less swapping.)

As to Lefteris' comment: PF/Tijah uses its indexes only for the additional text retrieval operations. It does not speed up standard XQuery functions such as contains.

Jan
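To make the difference between the two query variants concrete, here is a minimal Python sketch (standard library only, not MonetDB code). It assumes that ElementTree's itertext() is a fair stand-in for the descendant text nodes of an element; the helper names atomized_contains and any_text_node_contains are made up for this illustration. The first helper mimics contains(., "wind farm") on the atomized p element, the second mimics the "some $t in $p//text()" variant.

```python
import xml.etree.ElementTree as ET


def atomized_contains(elem, needle):
    """Mimic contains(., needle): join all descendant text first, then search."""
    return needle in "".join(elem.itertext())


def any_text_node_contains(elem, needle):
    """Mimic 'some $t in $p//text() satisfies contains($t, needle)':
    search each text chunk on its own, without concatenation."""
    return any(needle in chunk for chunk in elem.itertext())


# The phrase is split across two text nodes by the empty <foo/> element.
split = ET.fromstring("<p>wind <foo/>farm</p>")
# The phrase sits inside a single text node.
plain = ET.fromstring("<p>a wind farm in Denmark</p>")

results = {
    "split_atomized": atomized_contains(split, "wind farm"),
    "split_per_node": any_text_node_contains(split, "wind farm"),
    "plain_atomized": atomized_contains(plain, "wind farm"),
    "plain_per_node": any_text_node_contains(plain, "wind farm"),
}
print(results)
```

Only the split example behaves differently: the atomized check finds the phrase, the per-text-node check does not. That is the trade-off of the modified query: it avoids the concatenation but misses matches that cross text-node boundaries.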
Yes, I was not clear on that. You have to use the PF/Tijah-specific functions; fn:contains and the other string functions will still be generic as before. I just assumed that your application is text retrieval of some kind and that the functionality of PF/Tijah would be more helpful. Thank you, Jan, for pointing this out.
On Jul 20, 2009, at 18:28, Roy Walter wrote:
Running MonetDB/XQuery on a 2.6GHz 32-bit Windows XP box with 1GB of RAM.
What is the best way to organise XML in MonetDB for rapid text searching? A rundown of my recent experience might help.
I created a collection of around 450 documents (153MB approx.). I ran the following query from the command line:
collection("papers")//p[contains(., 'wind farm')]
The query time is at best 19 seconds. That's bad. (It's worse than querying a Postgres database with documents stored in the XML field type.)
So to get a reference point I loaded up the 114MB XMark document and ran this query:
doc("standard")//text[contains(., "yoke")]
The query time varies from 2 to 4 seconds. Better, but still not great.
Now, adding more RAM (and moving to 64-bit) would speed things up I hope! But hardware aside:
1. Is it better to have big documents rather than big collections?
2. Is having small collections (<10 docs) of big documents also inefficient?
Ideally I need to query collections comprising several thousand documents using 'text search' predicates. Are there other, better ways to run this type of query against a MonetDB XML database? Or should I really be using some other platform for this task?
Thanks in advance for any pointers.
-- Roy
_______________________________________________
MonetDB-users mailing list
MonetDB-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/monetdb-users
-- Jan Rittinger Lehrstuhl Datenbanken und Informationssysteme Wilhelm-Schickard-Institut für Informatik Eberhard-Karls-Universität Tübingen
http://www-db.informatik.uni-tuebingen.de/team/rittinger