[Monetdb-developers] New serialization routine

12 May 2006

In the current version, serialization takes 10 seconds for a 10 MB file 
(http://www-db.in.tum.de/~rittinge/files/mixed.xml) that contains only 
slightly more nodes (517448) than the auction XML file of the same size 
(472681). The difference, however, is the number of attribute nodes, which 
is much higher in mixed.xml (517447 vs. 38266).

While the serialization of doc("auction10MB.xml") takes 2.72 seconds, 
doc("mixed.xml") requires 10.34 seconds! Removing the attributes in 
mixed.xml (by deleting all entries from the attr_own bat) cuts the 
serialization time to 4.6 seconds.

Some more calculations reveal that for the auction file we are able to 
serialize about 187,000 nodes/sec (nodes = elements + text nodes + 
attributes, i.e. 510947 nodes in 2.72 seconds), while for mixed.xml the 
serialization function (with and without attributes) was only able to 
generate about 100,000 nodes/sec.
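
A quick recheck of these rates from the node counts and timings quoted 
above, as a minimal sketch in C. The function name nodes_per_sec is my own 
and not part of serialize.mx; the result is in nodes per second.

```c
#include <stdio.h>

/* Serialization throughput in nodes per second.
 * "nodes" is the total of elements, text nodes and attributes.
 * Hypothetical helper for the arithmetic only, not MonetDB code. */
double nodes_per_sec(long nodes, double seconds)
{
    if (seconds <= 0.0)
        return 0.0;             /* avoid dividing by zero */
    return (double)nodes / seconds;
}

/* Print the three cases measured in this mail. */
void print_rates(void)
{
    printf("auction:            %.0f\n", nodes_per_sec(472681 + 38266, 2.72));
    printf("mixed, with attrs:  %.0f\n", nodes_per_sec(517448 + 517447, 10.34));
    printf("mixed, no attrs:    %.0f\n", nodes_per_sec(517448, 4.6));
}
```

Calling print_rates() gives roughly 188 thousand nodes/sec for the auction 
file and between 100 and 113 thousand nodes/sec for mixed.xml.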

In all 3 cases the serialization seems way too slow... (at least to me). 
As the serialization is in principle only a dump of the tables (even in 
table order) and other approaches are considerably faster, we should be 
able to come up with an appropriately fast serialization routine.

I see at least three starting points for increasing the performance:
* serialize.mx goes through about 50 function calls before one node 
without attributes is serialized
* serialize.mx collects the nodes using random access with a lot of 
indirections
* serialize.mx uses stream_printf, which is probably slower than a 
fixed-size print routine like stream_write.
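
To illustrate the last point, here is a rough sketch of the two styles of 
output. It uses plain C (snprintf and memcpy) on a hypothetical buffer type 
outbuf as a stand-in; it does not use the real stream_printf or 
stream_write signatures, and all names in it are made up for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical output buffer, standing in for the output stream. */
typedef struct {
    char   data[4096];
    size_t len;
} outbuf;

/* Formatted variant: the format string is interpreted on every call. */
void emit_tag_printf(outbuf *out, const char *name)
{
    size_t room = sizeof out->data - out->len;
    int n = snprintf(out->data + out->len, room, "<%s>", name);
    if (n > 0 && (size_t)n < room)
        out->len += (size_t)n;  /* output that does not fit is dropped */
}

/* Fixed-size variant: copy a piece whose length is already known. */
void emit_bytes(outbuf *out, const char *src, size_t n)
{
    if (n <= sizeof out->data - out->len) {
        memcpy(out->data + out->len, src, n);
        out->len += n;
    }
}

/* Same start tag as above, built from three fixed-size copies. */
void emit_tag_write(outbuf *out, const char *name, size_t name_len)
{
    emit_bytes(out, "<", 1);
    emit_bytes(out, name, name_len);
    emit_bytes(out, ">", 1);
}
```

Both variants produce the same bytes for a tag name such as "item"; the 
second one only needs the name length, which would have to be stored or 
computed once. Whether this really pays off in serialize.mx has to be 
measured.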

Over the next two weeks (before the feature freeze) I will try to 
come up with a new prototype print routine that should handle the 
serialization more efficiently. This prototype will be a proof of 
concept and will probably fall back to the current routine whenever some 
conditions are not fulfilled (e.g. wrong serialization type or multiple 
namespaces).

Let's see whether we can get a faster serialization routine for the 
average query result!

I will probably need help with some gdk internals, with changes introduced 
by the update facility (e.g. indirections), and with understanding parts of 
the current serialize.mx. I hope for your support :-)

More information on the new serialization routine will be available on 
the Pathfinder wiki.

Cheers,
Jan

-- 
Jan Rittinger
Database Systems
Technische Universität München (Germany)
http://www-db.in.tum.de/~rittinge/
