Hi Roy,
I'm not sure how familiar you are with XQuery...
The problem for MonetDB/XQuery (without PF/Tijah) in your query could
be that underneath the p elements (even if they are not used in a
nested variant) there might be a large number of nodes. Your query
asks for the atomization of all p nodes, which leads to a
concatenation of all descendant text nodes. (The reason is that the p
node in '<p>wind farm</p>' should be a match as well.) Only then does
the contains function do its work. I guess a slightly modified
variant,

  for $p in collection("papers")//p
  where some $t in $p//text() satisfies contains($t, "wind farm")
  return $p

where no concatenation takes place, might give you better performance.
It does, however, not catch text snippets that span multiple text
nodes.
Answering your question about the size of the documents vs.
collections: Currently all documents in a collection are stored in a
single big relation. So if you store many big documents in a
collection you will get quite a large relation. (If you split your
data into multiple collections you will get smaller relations and on a
machine with small RAM perhaps less swapping.)
As to Lefteris comment: PF/Tijah uses its indexes only for the
additional text retrieval operations. It does not speed up standard
XQuery functions such as contains.
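For completeness, the text-retrieval route through PF/Tijah looks roughly like the sketch below. This is from memory and the exact function name and signature are assumptions; please check the PF/Tijah documentation before relying on it:

  tijah:query("//p[about(., wind farm)]")

Here about() is a NEXI ranking predicate rather than a substring test, so the result is a ranked list of p elements, and this is the kind of query for which the PF/Tijah indexes are actually used.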
Jan
On Jul 20, 2009, at 18:28, Roy Walter wrote:

Running MonetDB/XQuery on a 2.6GHz 32-bit Windows XP box with 1GB of
RAM. What is the best way to organise XML in MonetDB for rapid text
searching? A run-down of my recent experience might help.

I created a collection of around 450 documents (153MB approx.) and ran
the following query from the command line:

  collection("papers")//p[contains(., 'wind farm')]

The query time is at best 19 seconds. That's bad. (It's worse than
querying a Postgres database with documents stored in the XML field
type.)

So to get a reference point I loaded up the 114MB XMark document and
ran this query:

  doc("standard")//text[contains(., "yoke")]

The query time varies from 2 to 4 seconds. Better, but still not
great.

Now, adding more RAM (and moving to 64-bit) would speed things up I
hope! But hardware aside:

1. Is it better to have big documents rather than big collections?
2. Is having small collections (<10 docs) of big documents also
   inefficient?

Ideally I need to query collections comprising several thousand
documents using 'text search' predicates. Are there other, better ways
to run this type of query against a MonetDB XML database? Or should I
really be using some other platform for this task?

Thanks in advance for any pointers.

-- Roy
MonetDB-users mailing list
MonetDB-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/monetdb-users

--
Jan Rittinger
Lehrstuhl Datenbanken und Informationssysteme
Wilhelm-Schickard-Institut für Informatik
Eberhard-Karls-Universität Tübingen
http://www-db.informatik.uni-tuebingen.de/team/rittinger