[Monetdb-developers] New serialization routine
In the current version, serialization takes about 10 seconds for a 10 MB file (http://www-db.in.tum.de/~rittinge/files/mixed.xml) which contains only slightly more nodes (517448) than the auction XML file of the same size (472681). The difference, however, is the number of attribute nodes, which is much higher in mixed.xml (517447 vs. 38266). While the serialization of doc("auction10MB.xml") takes 2.72 seconds, doc("mixed.xml") requires 10.34 seconds! Removing the attributes in mixed.xml (by deleting all entries from the attr_own bat) speeds up the serialization to 4.6 seconds.

Some more calculations reveal that for the auction file we are able to serialize about 187 nodes/msec (nodes = elements + text nodes + attributes, i.e. 510947 nodes in 2.72 seconds), while for mixed.xml the serialization function (with and without attributes) was only able to generate about 100 nodes/msec. In all three cases the serialization seems way too slow (at least to me). As serialization is in principle only a dump of the tables (even in table order), and other approaches are considerably faster, we should be able to come up with an appropriately fast serialization routine.

I see at least three places where performance could be gained:

* serialize.mx makes about 50 function calls before a single node without attributes is serialized
* serialize.mx collects the nodes using random access with a lot of indirections
* serialize.mx uses stream_printf, which is probably slower than a fixed-size print routine like stream_write

In the next two weeks (before the feature freeze) I will try to come up with a new prototype print routine that handles the serialization more efficiently. This prototype will be a proof of concept and will probably fall back to the current routine whenever some conditions are not fulfilled (e.g. wrong serialization type or multiple namespaces). Let's see whether we can get a faster serialization routine for the average query result!
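To make the third point (formatted printing vs. fixed-size writes) a bit more concrete, here is a minimal sketch in plain C. It is not taken from serialize.mx and does not call stream_printf or stream_write; the names out_buf, buf_put and emit_open_tag are hypothetical and only illustrate the idea of copying bytes of known length into a buffer that could later be flushed with one fixed-size write. Escaping of attribute values is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical output buffer: bytes are collected here and would be
 * flushed with a single fixed-size write instead of many formatted prints. */
typedef struct {
    char   data[4096];
    size_t len;          /* number of bytes currently stored in data */
} out_buf;

/* Append n raw bytes to the buffer.
 * Returns 0 on success, -1 if the bytes do not fit. */
static int buf_put(out_buf *b, const char *s, size_t n)
{
    assert(b != NULL && s != NULL);
    if (n > sizeof b->data - b->len)
        return -1;
    memcpy(b->data + b->len, s, n);
    b->len += n;
    return 0;
}

/* Emit an opening tag, optionally with one attribute, without any
 * format-string parsing. Pass attr == NULL for an element without
 * attributes; value must be non-NULL whenever attr is given.
 * Assumption: name, attr and value need no XML escaping.
 * On failure the buffer may hold a partially written tag. */
static int emit_open_tag(out_buf *b, const char *name,
                         const char *attr, const char *value)
{
    if (buf_put(b, "<", 1) || buf_put(b, name, strlen(name)))
        return -1;
    if (attr != NULL) {
        if (buf_put(b, " ", 1) || buf_put(b, attr, strlen(attr)) ||
            buf_put(b, "=\"", 2) || buf_put(b, value, strlen(value)) ||
            buf_put(b, "\"", 1))
            return -1;
    }
    return buf_put(b, ">", 1);
}
```

A caller would set len to 0, call emit_open_tag(&b, "item", "id", "7"), and then hand b.data and b.len to whatever fixed-size write routine is available (fwrite in plain C). Whether this is actually faster than the formatted variant would still have to be measured.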
I will probably need help with some gdk internals, with changes introduced by the update facility (e.g. indirections), and with understanding parts of the current serialize.mx. I hope for your support :-)

More information on the new serialization routine will be available at the pathfinder wiki.

Cheers,
Jan

--
Jan Rittinger
Database Systems
Technische Universität München (Germany)
http://www-db.in.tum.de/~rittinge/
On 12-05-2006 11:23:31 +0200, Jan Rittinger wrote:
> More information on the new serialization routine will be available at the pathfinder wiki.
I probably lost it, but where is this wiki located? Google doesn't help me much here. Could you also post a notification (or better, the contents) on this list when you have made an update? Thanks.

Looking forward to your design! I hope it will easily allow multiple representations, which was possible in the (other) Jan's version.
participants (2)
- Fabian Groffen
- Jan Rittinger