[MonetDB-users] Performance
Running MonetDB/XQuery on a 2.6GHz 32-bit Windows XP box with 1GB of RAM.
What is the best way to organise XML in MonetDB for rapid text searching? A run down of my recent experience might help.
I created a collection of around 450 documents (153MB approx.). I ran the following query from the command line:
collection("papers")//p[contains(., 'wind farm')]
The query time is at best 19 seconds. That's bad. (It's worse than querying a Postgres database with documents stored in the XML field type.)
So to get a reference point I loaded up the 114MB XMark document and ran this query:
doc("standard")//text[contains(., "yoke")]
The query time varies from 2 to 4 seconds. Better, but still not great.
Now, adding more RAM (and moving to 64-bit) would speed things up I hope! But hardware aside:
1. Is it better to have big documents rather than big collections?
2. Is having small collections (<10 docs) of big documents also inefficient?
Ideally I need to query collections comprising several thousand documents using 'text search' predicates. Are there other, better ways to run this type of query against a MonetDB XML database? Or should I really be using some other platform for this task?
Thanks in advance for any pointers.
-- Roy
Hi Roy,
I suggest that you try the pf/tijah module for MonetDB/XQuery.
http://dbappl.cs.utwente.nl/pftijah/
This will create specific indices for your queries to facilitate text search.
Hope this helps for now. We will also investigate where the time is
spent in your case (without pf/tijah) and come back to you. How many p
elements do your documents have? The problem might be that, because MonetDB
does not build inverted indices on text by itself, it has to visit
each p element and search it with the help of the PCRE library. Pf/tijah
was built for that purpose and should help a lot.
Please feel free to contact us for further clarification and with new
findings from your tests :)
cheers,
lefteris
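The difference between visiting every p element and consulting an inverted index can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the paragraph texts, the word-level tokenizer and all function names are hypothetical, and it does not show how MonetDB or pf/tijah work internally.

```python
import re
from collections import defaultdict

# Hypothetical paragraph texts standing in for the string values of <p> elements.
paragraphs = {
    1: "The wind farm was built offshore.",
    2: "Solar panels cover the farm roof.",
    3: "Strong wind delayed the survey.",
}

def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def scan(texts, phrase):
    """Linear scan: test every paragraph, as a query without an index must."""
    return sorted(pid for pid, text in texts.items() if phrase in text)

def build_index(texts):
    """Map each word to the set of paragraph ids that contain it."""
    index = defaultdict(set)
    for pid, text in texts.items():
        for word in tokenize(text):
            index[word].add(pid)
    return index

def indexed_search(index, texts, phrase):
    """Use the index to narrow the candidates, then verify the phrase on those only."""
    words = tokenize(phrase)
    if not words:
        return []
    candidates = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(pid for pid in candidates if phrase in texts[pid])

index = build_index(paragraphs)
scan_hits = scan(paragraphs, "wind farm")
index_hits = indexed_search(index, paragraphs, "wind farm")
print(scan_hits, index_hits)
```

The scan touches all three paragraphs, while the indexed search only verifies the paragraphs that contain both words. Note that a word-level index like this one cannot answer arbitrary substring searches (for example a phrase ending in a partial word), so it is not a drop-in replacement for contains.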
MonetDB-users mailing list
MonetDB-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/monetdb-users
Hi lefteris
Well that seems to tick all the boxes.
I tried the global index creation:
tijah:create-ft-index()
and it crashed the server with:
!WARNING: readClient: unexpected end of file; discarding partial input
Hmm...
R.
This is not expected.
Did you try restarting the server and retrying?
You might also have a corrupted dbfarm, or the documents didn't shred
correctly to begin with. Which version of MonetDB are you using? How did
you install it?
lefteris
Hi Roy,
I'm not sure how familiar you are with XQuery...
The problem for MonetDB/XQuery (without PF/Tijah) in your query could
be that underneath the p elements (even if they are not nested) there
might be a large number of nodes. Your query asks for the atomization
of every p node, which leads to a concatenation of all its descendant
text nodes. (The reason is that node p in '<p>wind <foo/>farm</p>'
should be a match as well.) Only then does the function contains do
its work. I guess a slightly modified variant
for $p in collection("papers")//p
where some $t in $p//text() satisfies contains($t, "wind farm")
return $p
in which no concatenation takes place might give you better
performance. However, it does not catch text snippets that span
text nodes.
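The semantic difference between the two query variants can be reproduced outside the database. The following Python sketch uses the standard library's ElementTree on a small hypothetical document; the helper names are made up for illustration, and it only mimics the matching semantics, not how MonetDB/XQuery evaluates the queries.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the phrase is split by an empty element in the first <p>.
doc = ET.fromstring(
    "<doc>"
    "<p>wind <foo/>farm</p>"
    "<p>a <b>wind farm</b> site</p>"
    "<p>no match here</p>"
    "</doc>"
)

def string_value_match(p, phrase):
    """Mimic contains(., phrase): concatenate all descendant text, then search."""
    return phrase in "".join(p.itertext())

def any_text_node_match(p, phrase):
    """Mimic 'some $t in $p//text() satisfies contains($t, phrase)'."""
    return any(phrase in t for t in p.itertext())

paras = doc.findall(".//p")
by_string_value = [string_value_match(p, "wind farm") for p in paras]
by_text_node = [any_text_node_match(p, "wind farm") for p in paras]
print(by_string_value)  # [True, True, False]
print(by_text_node)     # [False, True, False]
```

The first paragraph matches only when the text nodes are concatenated, which is exactly the case the rewritten query gives up in exchange for avoiding the concatenation.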
Answering your question about the size of the documents vs.
collections: Currently all documents in a collection are stored in a
single big relation. So if you store many big documents in a
collection you will get quite a large relation. (If you split your
data into multiple collections you will get smaller relations and on a
machine with small RAM perhaps less swapping.)
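One way to act on this is to group the documents into several collections with a size budget per collection. The sketch below is a plain Python illustration of such a grouping; the file names, sizes and the budget are hypothetical, the greedy strategy is just one possible choice, and loading the groups into MonetDB is not shown. Whether smaller relations actually reduce swapping depends on the machine and the workload.

```python
def split_into_collections(doc_sizes, max_bytes):
    """Greedily assign documents to collections so each stays under max_bytes.

    doc_sizes maps a document name to its size in bytes. A document larger
    than max_bytes ends up in a collection of its own.
    """
    collections = []          # list of lists of document names
    current, used = [], 0
    for name, size in sorted(doc_sizes.items()):
        # Close the current collection if adding this document would exceed the budget.
        if current and used + size > max_bytes:
            collections.append(current)
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        collections.append(current)
    return collections

# Hypothetical documents and sizes (not taken from the thread).
sizes = {"a.xml": 40, "b.xml": 30, "c.xml": 50, "d.xml": 120, "e.xml": 10}
groups = split_into_collections(sizes, max_bytes=100)
print(groups)
```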
As to Lefteris' comment: PF/Tijah uses its indexes only for the
additional text retrieval operations. It does not speed up standard
XQuery functions such as contains.
Jan
--
Jan Rittinger
Lehrstuhl Datenbanken und Informationssysteme
Wilhelm-Schickard-Institut für Informatik
Eberhard-Karls-Universität Tübingen
http://www-db.informatik.uni-tuebingen.de/team/rittinger
Yes, I was not clear on that. You have to use the pf/tijah-specific functions; fn:contains and the other string functions will still be generic, as before. I just assumed that your application is text retrieval of some kind and that the functionality of pf/tijah would be more helpful. Thank you, Jan, for pointing this out.
participants (3)
- Jan Rittinger
- Lefteris
- Roy Walter