Re: [Monetdb-developers] [Monetdb-pf-checkins] pathfinder/runtime shredder.mx, XQuery_0-18, 1.126, 1.126.2.1
Hi Jan F,

thanks for looking into our shredder's entity handling. This has been bugging us for quite a while now.

It feels a bit strange, though, that we really have to implement getEntity() ourselves. After all, this is exactly what I would expect to be handled automatically by an XML parsing library.

I have just glanced briefly over the libxml2 documentation, and I saw that there is a replaceEntities field in libxml2's xmlParserCtxt struct. As usual, the documentation is very poor here, but this sounds to me like exactly what we need. Have you tried whether simply enabling this flag would handle entities automatically? (Sorry, I don't have the time to test this myself right now.) You can find the documentation for xmlParserCtxt at

http://xmlsoft.org/html/libxml-tree.html#xmlParserCtxt .

Just my 2¢ (still Euro-Cents)

Jens

On Tue, Jun 05, 2007 at 07:30:50AM +0000, Jan Flokstra wrote:
Update of /cvsroot/monetdb/pathfinder/runtime
In directory sc8-pr-cvs16.sourceforge.net:/tmp/cvs-serv2405

Modified Files:
      Tag: XQuery_0-18
	shredder.mx
Log Message:
- A first attempt at making ENTITIES work. After long searching and trying
  the final solution was very simple. The libxml2 package already maintains
  a hashtable with defined entities. The only thing missing was the lookup
  function which it mysteriously does not use. I implemented the
  "getEntity()" function in the xmlSAXHandler structure and now simple
  ENTITIES defined in the internal subset work.
  This solves bug#1185932 XML: Entities by Wouter Alink

- Change error handling. I noticed for some time that there were no error
  messages generated anymore on my system. The xmlSetGenericErrorFunc() does
  not work on my machine. I added the "error()" function in the
  xmlSAXHandler structure and now I get descriptive error messages again.
  This may screw up some testset output but I think this is a big system
  improvement.
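The first log entry describes a parser that already keeps an entity table but only consults it through a handler callback. The sketch below is a toy model of that arrangement in plain C, not libxml2 code: every name in it (toy_doc, toy_get_entity, toy_expand) is hypothetical, and the real commit simply returns xmlGetDocEntity(ctx->myDoc, name) from the getEntity slot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Toy model only: none of these names are libxml2 API. */
typedef struct { const char *name; const char *value; } toy_entity;

typedef struct {
    const toy_entity *entities;   /* table the "parser" already maintains */
    size_t count;
} toy_doc;

/* Handler slot, analogous in spirit to the getEntity field. */
typedef const char *(*toy_get_entity_fn)(void *ctx, const char *name);

/* Lookup callback: consult the table that was already built. */
static const char *
toy_get_entity(void *ctx, const char *name)
{
    const toy_doc *doc = ctx;
    for (size_t i = 0; i < doc->count; i++)
        if (strcmp(doc->entities[i].name, name) == 0)
            return doc->entities[i].value;
    return NULL;
}

/* Copy in to out, replacing "&name;" via the callback. With a NULL
 * callback (or an unknown name) the reference is copied unchanged. */
static void
toy_expand(const char *in, char *out, size_t outlen,
           toy_get_entity_fn get_entity, void *ctx)
{
    size_t o = 0;
    assert(outlen > 0);
    while (*in != '\0' && o + 1 < outlen) {
        const char *semi = (*in == '&') ? strchr(in, ';') : NULL;
        const char *value = NULL;
        char name[32];
        if (semi != NULL && get_entity != NULL) {
            size_t n = (size_t)(semi - in - 1);
            if (n > 0 && n < sizeof name) {
                memcpy(name, in + 1, n);
                name[n] = '\0';
                value = get_entity(ctx, name);
            }
        }
        if (value != NULL) {
            /* Append the replacement text, clamping on truncation. */
            int w = snprintf(out + o, outlen - o, "%s", value);
            o += (size_t)w < outlen - o ? (size_t)w : outlen - o - 1;
            in = semi + 1;
        } else {
            out[o++] = *in++;
        }
    }
    out[o] = '\0';
}
```

Calling toy_expand() with toy_get_entity as the callback replaces known references, while passing NULL leaves them in the output untouched, which mirrors the ".getEntity = 0" versus ".getEntity = shred_getEntity" change in the diff only loosely.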
Index: shredder.mx
===================================================================
RCS file: /cvsroot/monetdb/pathfinder/runtime/shredder.mx,v
retrieving revision 1.126
retrieving revision 1.126.2.1
diff -u -d -r1.126 -r1.126.2.1
--- shredder.mx	24 Apr 2007 15:04:42 -0000	1.126
+++ shredder.mx	5 Jun 2007 07:30:46 -0000	1.126.2.1
@@ -1103,6 +1103,35 @@
     stream_printf(GDKerr, "%s", buf);
 }
 
+static void
+shred_warning(void *ctx,
+              const char *msg, ...)
+{
+    /* IMPORTANT this function may be called multiple times for one error
+     * message so it is not possible to use GDKerror() here.
+     * Instead, we "mis-use" the ctx pointer to remember whether a newline
+     * has occured in the error message, and thus be able to prefix each
+     *(line of an) error message with GDKERROR("!Error: "), which is
+     * required to properly get the error message through the MAPI
+     * protocol...
+     */
+    va_list args;
+    int *print_error_newline =(int*)ctx;
+    int len = 0;
+    char buf[PFSHRED_BUFLEN];
+
+    if (*print_error_newline) {
+        len += snprintf(buf+len, PFSHRED_BUFLEN-len-1, GDKERROR);
+    }
+
+    va_start(args, msg);
+    len += vsnprintf(buf+len, PFSHRED_BUFLEN-len-1, msg, args);
+    va_end(args);
+
+    *print_error_newline =(strchr(buf,(int)'\n') != NULL);
+    stream_printf(GDKerr, "!WARNING: %s", buf);
+}
+
 /**
  * The shred_attribute_defAttributeDef() handles the DTD attribute definition callbacks
  * in the from the header of the XML file. This is used for the ID/IDREF
@@ -1163,6 +1192,44 @@
     }
 }
 
+static xmlEntityPtr
+shred_getEntity(void *xmlCtx, const xmlChar *name)
+{
+    /* the shredder is now able to handle ENTITY's from the internal
+     * subset. I do not really understand yet why this had to be done
+     * here and why it was not handled automagically
+     * The functions used are defined in $LIBXML2INCLUDES/entities.h
+     */
+#if 0
+    stream_printf(GDKerr,"shred_getEntity(ctx,\"%s\") CALLED\n",name);
+#endif
+    xmlParserCtxtPtr ctx = ((shredCtxStruct*) xmlCtx)->xmlCtx;
+    /* lookup the entity in the document entity hash table */
+    return xmlGetDocEntity(ctx->myDoc,name);
+    /* QUESTION: xmlGetDtdEntity() and xmlGetParameterEntity() were also
+     * possible, whats the diff between the doc/dtd versions, they both
+     * seem to work. */
+}
+
+#if 0
+/* My first try at building an entity table but this one was not necessary
+ * because the internal subset table was already build.
+ */
+static void
+shred_entityDecl(void *xmlCtx,
+                 const xmlChar *name,
+                 int type,
+                 const xmlChar *publicId,
+                 const xmlChar *systemId,
+                 xmlChar *content)
+{
+    xmlParserCtxtPtr ctx = ((shredCtxStruct*) xmlCtx)->xmlCtx;
+    if ( ! xmlAddDtdEntity(ctx->myDoc,name,type,publicId,systemId,content) )
+        stream_printf(GDKerr,"shred_entityDecl(ctx,\"%s\") FAIL\n",name);
+}
+#endif
+
+
 /* ====================================================================================
  * the shredder and its data structures
  * - shredder_create() create all data structures
@@ -1183,14 +1250,14 @@
   , .characters = shred_characters
   , .processingInstruction = shred_pi
   , .comment = shred_comment
-  , .error = 0
+  , .error = shred_error
   , .cdataBlock = shred_cdata
   , .internalSubset = 0
   , .isStandalone = 0
   , .hasInternalSubset = 0
   , .hasExternalSubset = 0
   , .resolveEntity = 0
-  , .getEntity = 0
+  , .getEntity = shred_getEntity
   , .entityDecl = 0
   , .notationDecl = 0
   , .attributeDecl = shred_attribute_def
@@ -1199,7 +1266,7 @@
   , .setDocumentLocator = 0
   , .reference = 0
   , .ignorableWhitespace = 0
-  , .warning = 0
+  , .warning = shred_warning
   , .fatalError = 0
   , .getParameterEntity = 0
   , .externalSubset = shred_external_subset
@@ -1229,6 +1296,10 @@
     char buf[XMLCHUNK+1];
 
     /* reset libxml2 error handling */
+    /* note JF: this does not have any effect on SuSe9.3. No error messages
+     * are printed. I assigned the 'error' field in the xmlSAXHandler and
+     * this works fine.
+     */
     xmlSetGenericErrorFunc((void*)&print_error_newline, shred_error);
 
     /* parse XML input(receive SAX events) */
@@ -1237,7 +1308,30 @@
     } else if (buffer) {
         xmlCtx = xmlCreateMemoryParserCtxt(buffer, shredCtx->fileSize);
     } else {
-        xmlCtx = xmlCreateURLParserCtxt(location, XML_PARSE_XINCLUDE|XML_PARSE_NOXINCNODE);
+        /* Possible options for the second arg are:
+         * XML_PARSE_RECOVER = recover on errors
+         * XML_PARSE_NOENT = substitute entities
+         * XML_PARSE_DTDLOAD = load the external subset
+         * XML_PARSE_DTDATTR = default DTD attributes
+         * XML_PARSE_DTDVALID = validate with the DTD
+         * XML_PARSE_NOERROR = suppress error reports
+         * XML_PARSE_NOWARNING = suppress warning reports
+         * XML_PARSE_PEDANTIC = pedantic error reporting
+         * XML_PARSE_NOBLANKS = remove blank nodes
+         * XML_PARSE_SAX1 = use the SAX1 interface internally
+         * XML_PARSE_XINCLUDE = Implement XInclude substitition
+         * XML_PARSE_NONET = Forbid network access
+         * XML_PARSE_NODICT = Do not reuse the context dictionnary
+         * XML_PARSE_NSCLEAN = rm redundant namespaces declarations
+         * XML_PARSE_NOCDATA = merge CDATA as text nodes
+         * XML_PARSE_NOXINCNODE= do not generate XINCLUDE START/END nodes
+         */
+        /*
+         * TODO: how to prevent expansion of entities?
+         */
+        xmlCtx = xmlCreateURLParserCtxt(location,
+                                        XML_PARSE_XINCLUDE|
+                                        XML_PARSE_NOXINCNODE);
     }
     if (!xmlCtx) {
         GDKerror("shredder_parse: libxml2 could not initialize a parser.\n");
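The comment in shred_warning() describes a small state machine: the callback may fire several times per message, so a flag behind the ctx pointer records whether the next fragment starts a new line and needs the prefix. A minimal standalone sketch of that idea follows. The names format_fragment, BUFLEN and PREFIX are hypothetical stand-ins (for the callback, PFSHRED_BUFLEN and GDKERROR), and the sketch departs from the diff in two ways: it clamps the length after a truncated write, and it tests only the final character for a newline instead of using strchr() over the whole buffer.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

#define BUFLEN 256          /* hypothetical stand-in for PFSHRED_BUFLEN */
#define PREFIX "!ERROR: "   /* hypothetical stand-in for GDKERROR */

/* Format one message fragment into out, prepending PREFIX only when the
 * previous fragment ended a line. *at_line_start carries that state
 * between calls, playing the role of the ctx pointer in the diff. */
static int
format_fragment(int *at_line_start, char *out, size_t outlen,
                const char *msg, ...)
{
    va_list args;
    int len = 0;

    assert(at_line_start != NULL && out != NULL && outlen > 0);
    out[0] = '\0';

    if (*at_line_start) {
        len = snprintf(out, outlen, "%s", PREFIX);
        if (len < 0 || (size_t)len >= outlen)
            len = (int)outlen - 1;   /* prefix itself was truncated */
    }

    va_start(args, msg);
    vsnprintf(out + len, outlen - (size_t)len, msg, args);
    va_end(args);

    len = (int)strlen(out);
    /* The next fragment starts a new line only if this one ended with '\n'. */
    *at_line_start = (len > 0 && out[len - 1] == '\n');
    return len;
}
```

With the flag initialised to 1, a first fragment such as "parse error in %s" receives the prefix, and a following fragment " at line %d\n" does not; the flag is set again once a fragment ends with a newline.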
-- Jens Teubner Technische Universitaet Muenchen, Department of Informatics D-85748 Garching, Germany Tel: +49 89 289-17259 Fax: +49 89 289-17263
From /usr/src/linux/include/linux/kernel.h: #define STACK_MAGIC 0xdeadbeef
Hi Jens,

Thanks for the attention. I think this replaceEntities flag indicates whether user-defined entities should be replaced by their entity value. This is always false for the default XML entities. When I look at the variable at runtime it is 0, but the parser does call getEntity() to replace the value.

Of course I also tried setting this value to 1 at parser context creation time and running without my own getEntity() callback function. But then the 'eacute' test fails.

This entity handling is very weird in libxml2. On the same 'xmlsoft' website there was a very encouraging warning about using a combination of SAX (which we use) and ENTITIES (http://xmlsoft.org/entities.html):

WARNING: handling entities on top of the libxml2 SAX interface is difficult!!! If you plan to use non-predefined entities in your documents, then the learning curve to handle then using the SAX API may be long. If you plan to use complex documents, I strongly suggest you consider using the DOM interface instead and let libxml deal with the complexity rather than trying to do it yourself.

So it is supposed to be hard :-)

I just checked the testweb and it seems both solutions for internal and external subsets work on all tested architectures, so I'm quite happy.

JanF.

On Thursday 07 June 2007 09:55, Jens Teubner wrote: