Hello Andy,
From personal experience:
Regarding your first query, if you would like to do 'bulk' processing
such as retrieving _all_ sections meeting specific (and possibly
complex) criteria in your corpus, then MonetDB/XQuery should be your
system of choice when going for fast retrieval, and you should
definitely give it a try.
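To make the 'bulk' case concrete, here is a minimal Python sketch of what "retrieve the Summary section of all webpages" means. It is not MonetDB/XQuery code and says nothing about its performance. The page layout (`<page>` with `<section title="...">` children), the file names, and the helper `sections_by_title` are all assumptions for illustration, since the actual schema of your pages is not known.

```python
import xml.etree.ElementTree as ET

# Hypothetical pages; the real schema of the corpus is not given in this thread.
PAGES = {
    "page1.xml": '<page><section title="Introduction">About Shakespeare.</section>'
                 '<section title="Summary">A short summary.</section></page>',
    "page2.xml": '<page><section title="Summary">Another summary.</section></page>',
    "page3.xml": '<page><section title="Introduction">No summary here.</section></page>',
}


def sections_by_title(pages, title):
    """Return (page name, section text) for every section with the given title."""
    results = []
    for name, xml_text in pages.items():
        root = ET.fromstring(xml_text)
        # Visit every <section> element in the page, at any depth.
        for section in root.iter("section"):
            if section.get("title") == title:
                results.append((name, "".join(section.itertext())))
    return results


summaries = sections_by_title(PAGES, "Summary")
for name, text in summaries:
    print(name, "->", text)
```

The sketch scans every page, which is exactly the kind of whole-corpus work a bulk-oriented engine is built for.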
However, if your query needs to retrieve only 1 XML element (a section
in your case) out of millions of elements, then MonetDB/XQuery might
not be the ideal solution: the compilation overhead and the
bulk-processing approach that MonetDB uses probably dominate
retrieval time in this case and might be too high for retrieving
a single element. It might still be suitable, though, so you could
give it a try.
For your second query scenario you might want to have a look at the
PF/TIJAH module in MonetDB/XQuery. It provides full-text search and
integrates the NEXI language with XQuery.
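As a plain illustration of the second scenario (pages with a given word in a given section), here is a small Python sketch. It does not use PF/TIJAH or NEXI and does no ranking; it only shows the exact-word filter the query describes. The page layout, file names, and the helper `pages_with_word_in_section` are assumptions for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical pages; the real schema of the corpus is not given in this thread.
PAGES = {
    "page1.xml": '<page><section title="Introduction">Shakespeare wrote Hamlet.</section></page>',
    "page2.xml": '<page><section title="Introduction">About databases.</section>'
                 '<section title="Summary">Mentions Shakespeare.</section></page>',
    "page3.xml": '<page><section title="Introduction">Works of <b>shakespeare</b>.</section></page>',
}


def pages_with_word_in_section(pages, word, title):
    """Return names of pages whose section `title` contains `word` as a whole word."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    hits = []
    for name, xml_text in pages.items():
        root = ET.fromstring(xml_text)
        for section in root.iter("section"):
            # itertext() also collects text nested in child elements such as <b>.
            text = "".join(section.itertext())
            if section.get("title") == title and pattern.search(text):
                hits.append(name)
                break  # one matching section is enough for this page
    return hits


matches = pages_with_word_in_section(PAGES, "Shakespeare", "Introduction")
print(matches)
```

Note that page2.xml is not returned: the word appears there only in the "Summary" section.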
1M webpages, is that about 10GB of data? Ideally MonetDB/XQuery would
like to keep its indices in main memory. It depends on your corpus,
but I would guess that the MonetDB indices for such data would fit in
10 to 20GB of memory. (Index size usually scales almost linearly with
the size of the data.) You will need a 64-bit operating system for such
amounts of data.
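The estimate above can be written out as back-of-the-envelope arithmetic. In this sketch, the 10 KB average page size is what "1M webpages, about 10GB" implies, and the 1x to 2x index factor is simply read off the 10 to 20GB guess; neither is a measured MonetDB figure. The function name `estimate_index_memory_gb` is made up for illustration.

```python
def estimate_index_memory_gb(num_pages, avg_page_kb, low_factor=1.0, high_factor=2.0):
    """Return (data size, low index estimate, high index estimate) in decimal GB.

    The factors are assumptions taken from the rough guess in this thread,
    with index size taken to grow linearly with the data size.
    """
    data_gb = num_pages * avg_page_kb / 1_000_000  # KB -> GB
    return data_gb, data_gb * low_factor, data_gb * high_factor


# A 32-bit process can address at most 2**32 bytes, roughly 4.3 GB.
ADDRESS_LIMIT_32BIT_GB = 2**32 / 1e9

data_gb, low_gb, high_gb = estimate_index_memory_gb(1_000_000, 10)
needs_64bit = low_gb > ADDRESS_LIMIT_32BIT_GB

print(f"data: {data_gb:.0f} GB, indices: {low_gb:.0f} to {high_gb:.0f} GB")
print("64-bit required:", needs_64bit)
```

Even the low end of the range is more than twice what a 32-bit address space can hold, which is why a 64-bit system is needed.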
Hope this helps.
Greetings,
Wouter
2009/8/9 listanand@gmail.com:
Dear Wouter,
Thanks for your response.
Well, what I am doing is somewhat open-ended. I am still in the early stages and have not yet defined specific tasks. But what I intend to do is query different parts of webpages and combine results from different webpages effectively. At the least, I want queries of the following form to run efficiently:
(1) Retrieve a specific section of webpages in full (for example, retrieve the "Summary" section of all webpages).
(2) Retrieve all webpages that have specific words in specific sections (for example, the word "Shakespeare" in the "Introduction" section).
Again, I have about 1 million webpages. I also have resources for parallelizing things if needed.
Thank you again for your help.
Andy
-------------------------------------------------------------
From: Wouter Alink - 2009-08-08 18:04
Hello (?),
It completely depends on your application whether MonetDB/XQuery is the right solution. MonetDB/XQuery has been and is being used for very large XML collections, but this also depends on how you would like to query the data. Could you perhaps give some more information about the application you have in mind?
Greetings,
Wouter
2009/8/8 listanand@gm...:
Dear all,
I am new to MonetDB/XQuery and am considering using it for storing and processing a large number of webpages (~1 million) in XML format. I would like to know whether this is indeed the right tool, and how well it will scale to handle tasks of this magnitude.
Thanks in advance
------------------------------------------------------------------------------ Let Crystal Reports handle the reporting - Free Crystal Reports 2008 30-Day trial. Simplify your report design, integration and deployment - and focus on what you do best, core application coding. Discover what's new with Crystal Reports now. http://p.sf.net/sfu/bobj-july _______________________________________________ MonetDB-users mailing list MonetDB-users@li... https://lists.sourceforge.net/lists/listinfo/monetdb-users