[MonetDB-users] MonetDB/XQuery - how to ignore DTDs?
Hi all,

I wonder if there is an option to make the MonetDB/XQuery shredder ignore DTD files. What triggered this question is the need to shred about 20M files, each of them including *the same* online DTD. What happens is that, for each xml file, the DTD is downloaded and shredded/used(?). It is not even cached (I tried using xquery_cacherules, but apparently it has no effect on DTDs).

This slows shredding down to more than 3 seconds per document, which is not an option of course, as it would take me roughly 700 days to shred the entire collection :)

Removing the DTD link from the original xml files is not an option either; the collection is simply too big to be replicated.

How can the DTD be ignored?

Roberto

--
| M.Sc. Roberto Cornacchia
| CWI (Centrum voor Wiskunde en Informatica)
| Science Park 123, 1098XG Amsterdam, The Netherlands
| tel: +31 20 592 4322 , http://www.cwi.nl/~roberto
Hi Roberto,

the shredding of multiple documents is solved by a batloop on the MIL level calling the xml shredder for each document separately. The DTD lookup is performed by libxml2 (not by Pathfinder).

In my eyes there currently exists no real solution to your problem. But you might think about the following two work-arounds:

- use a proxy or an entry in /etc/hosts to reference a local copy of the DTD, or
- disable the DTD loading in the shredder (pathfinder/runtime/shredder.mx):

  > , .externalSubset = shred_external_subset
  < , .externalSubset = 0

(A small libxml2 sketch of what this change amounts to is appended at the end of this message, after the quoted text below.)

Jan

On Apr 14, 2010, at 17:58, Roberto Cornacchia wrote:
Hi all,
I wonder if there is an option to make the MonetDB/XQuery shredder ignore DTD files. What triggered this question is the need to shred about 20M files, each of them including *the same* online DTD. What happens is that, for each xml file, the DTD is downloaded and shredded/used(?). It is not even cached (I tried using xquery_cacherules, but apparently it has no effect on DTDs).
This slows shredding down to more than 3 seconds per document, which is not an option of course, as it would take me roughly 700 days to shred the entire collection :)
Removing the DTD link from the original xml files is not an option either; the collection is simply too big to be replicated.
How can the DTD be ignored?
Roberto

--
| M.Sc. Roberto Cornacchia
| CWI (Centrum voor Wiskunde en Informatica)
| Science Park 123, 1098XG Amsterdam, The Netherlands
| tel: +31 20 592 4322 , http://www.cwi.nl/~roberto
--
Jan Rittinger
Lehrstuhl Datenbanken und Informationssysteme
Wilhelm-Schickard-Institut für Informatik
Eberhard-Karls-Universität Tübingen
http://www-db.informatik.uni-tuebingen.de/team/rittinger
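A minimal stand-alone sketch of what the second work-around amounts to in plain libxml2 terms (this is not the actual shredder.mx code; the function name and the use of xmlSAXParseFile are only illustrative):

    /* Parse an XML file with the default SAX2 callbacks, except that the
     * externalSubset callback is left NULL, so the external DTD named in
     * the DOCTYPE is never downloaded or parsed. */
    #include <string.h>
    #include <libxml/parser.h>
    #include <libxml/SAX2.h>

    static xmlDocPtr parse_without_external_dtd(const char *filename)
    {
        xmlSAXHandler sax;

        memset(&sax, 0, sizeof(sax));
        xmlSAXVersion(&sax, 2);      /* fill in the default SAX2 callbacks */
        sax.externalSubset = NULL;   /* ...but never load the external subset */

        return xmlSAXParseFile(&sax, filename, 0);
    }

The one-line change in shredder.mx achieves the same effect inside the shredder's own SAX handler table.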
Hi all,
the shredding of multiple documents is solved by a batloop on the MIL level calling the xml shredder for each document separately. The DTD lookup is performed by libxml2 (not by Pathfinder).
libxml2 provides means to redirect remote DTDs to local copies, using catalogs (see [1]). So the intended way to cope with that situation would be to add the required DTD to your local XML catalog; a minimal example entry is sketched below.

Best regards,
Isidor

[1] http://www.xmlsoft.org/catalog.html
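A minimal sketch of such a catalog entry, with a hypothetical DTD URL and local path (substitute the real ones from the collection):

    <?xml version="1.0"?>
    <catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
      <!-- redirect the remote DTD to a local copy; both values are examples -->
      <system systemId="http://example.org/dtds/collection.dtd"
              uri="file:///usr/local/share/dtds/collection.dtd"/>
    </catalog>

libxml2 picks such a catalog up from /etc/xml/catalog by default, or from the file(s) listed in the XML_CATALOG_FILES environment variable.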
Isidor Zeuner wrote:
the shredding of multiple documents is solved by a batloop on the MIL level calling the xml shredder for each document separately. The DTD lookup is performed by libxml2 (not by Pathfinder).
libxml2 provides means to redirect remote DTDs to local copies, using catalogs (see [1]). So the intended way to cope with that situation would be to add the required DTD to your local XML catalog.
This is true, but it requires manual intervention, and earlier the OP wrote:

Roberto Cornacchia wrote:
What happens is that, for each xml file, the DTD is downloaded and shredded/used(?). It is not even cached (I tried using xquery_cacherules, but apparently it has no effect on DTDs).
Surely the appropriate component in the stack should be taking notice of HTTP Cache-Control [1] directives? Or is the DTD served without cache control, in which case the originators should be told? Then the system would behave sensibly, without the need for manual intervention in every application to create a catalog.

Cheers,
Dave

[1] http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic
Hi Jan,

Thanks for the info. Actually, I see the two solutions you propose as not mutually exclusive: one thing is being able to cache remote DTDs, another is whether I want them to be processed at all, regardless of where they are located. I think it would be very useful to add an option at the XQuery level to implement your second suggestion dynamically.

Cheers,
Roberto

On Wed, 2010-04-14 at 22:02 +0200, Jan Rittinger wrote:
Hi Roberto,
the shredding of multiple documents is solved by a batloop on the MIL level calling the xml shredder for each document separately. The DTD lookup is performed by libxml2 (not by Pathfinder).
In my eyes there currently exists no real solution to your problem. But you might think about the following two work-arounds:
- use a proxy or an entry in /etc/hosts to reference a local copy of the DTD, or
- disable the DTD loading in the shredder (pathfinder/runtime/shredder.mx).
> , .externalSubset = shred_external_subset
< , .externalSubset = 0
Jan
On Apr 14, 2010, at 17:58, Roberto Cornacchia wrote:
Hi all,
I wonder if there is an option to make the MonetDB/XQuery shredder ignore DTD files. What triggered this question is the need to shred about 20M files, each of them including *the same* online DTD. What happens is that, for each xml file, the DTD is downloaded and shredded/used(?). It is not even cached (I tried using xquery_cacherules, but apparently it has no effect on DTDs).
This slows shredding down to more than 3 seconds per document, which is not an option of course, as it would take me roughly 700 days to shred the entire collection :)
Removing the DTD link from the original xml files is not an option either; the collection is simply too big to be replicated.
How can the DTD be ignored?
Roberto

--
| M.Sc. Roberto Cornacchia
| CWI (Centrum voor Wiskunde en Informatica)
| Science Park 123, 1098XG Amsterdam, The Netherlands
| tel: +31 20 592 4322 , http://www.cwi.nl/~roberto
--
| M.Sc. Roberto Cornacchia
| CWI (Centrum voor Wiskunde en Informatica)
| Science Park 123, 1098XG Amsterdam, The Netherlands
| tel: +31 20 592 4322 , http://www.cwi.nl/~roberto
participants (4)
- Dave Howorth
- Isidor Zeuner
- Jan Rittinger
- Roberto Cornacchia