Dear Martin,
Can you specify the system you are working on, e.g. hardware/OS?
The hardware is a dual-core 2.2GHz AMD64 with 4 GB RAM and an 8-disk RAID 0 (256 KB chunk size) with an XFS filesystem for the database. The OS is 64-bit Debian Etch.
You are really pushing the system. I don't know of a system that can handle TPC-H SF100 out of the box. This size usually requires a multi-stage process and careful tuning of the system parameters. But let's see what you have done and learn from your experience.
I didn't expect sub-millisecond response times for the TPC-H queries out of the box, so I am open to any system or database setting modifications. I have noticed that MonetDB's initial memory usage is slightly lower when I change the kernel parameter vm.overcommit_memory from 2 to 0.
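For reference: on Linux, 0 selects the heuristic overcommit policy, while 2 enforces strict accounting capped by vm.overcommit_ratio, so a lower apparent footprint with 0 is plausible. A minimal read-only check of the current setting (assumes a Linux /proc filesystem; changing the value needs root, e.g. via sysctl):

    # Print the current Linux overcommit policy:
    # 0 = heuristic overcommit, 1 = always overcommit,
    # 2 = strict accounting limited by vm.overcommit_ratio.
    with open("/proc/sys/vm/overcommit_memory") as f:
        print("vm.overcommit_memory =", f.read().strip())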
Did you stop and restart the server between the loads? If not, then from a recovery point of view all 420M rows are stored in a single log file and become the target of a single reload. It behaves as if you had loaded the 7x60M rows as a single batch.
I didn't try to restart the server after every slice load, so that was the problem. Is there any way to control the checkpoint process? Stopping and restarting the server is, well... not the most elegant way to trigger a checkpoint.
Killing a database process is of course a harsh measure. In that case, the recovery process has to reload the data and enters a really expensive part of TPC-H: ensuring the correctness of the integrity relationships. Protection against this is hard, because it requires that integrity rule enforcement be disabled (the approach pursued in MySQL).
Is there any formula that can predict the recovery time after killing the database process, based on the log sizes or other parameters? It would be good to have a bulk data load method without integrity rule enforcement, similar to pg_bulkload [1] in PostgreSQL. Using the same database schema as in MonetDB, I loaded the whole TPC-H dataset into PostgreSQL within 90 minutes (~19 MB/s, which is still pretty slow IMHO).
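In case it helps for comparison, here is a minimal sketch of the kind of constraint-free load I mean: plain COPY into tables that have no keys or indexes yet, with the keys added only afterwards. The connection string, file paths, and the assumption that the trailing '|' dbgen writes per line has already been stripped are all illustrative, not a description of my actual run:

    # Illustrative loader: COPY pipe-delimited dbgen .tbl files into PostgreSQL
    # tables that have no primary/foreign keys yet, then add the keys once at
    # the end so the load itself runs without integrity checks.
    import psycopg2

    DSN = "dbname=tpch"                      # hypothetical connection string
    TABLES = ["region", "nation", "supplier", "customer",
              "part", "partsupp", "orders", "lineitem"]

    conn = psycopg2.connect(DSN)
    cur = conn.cursor()

    for table in TABLES:
        # Assumes the trailing '|' that dbgen emits has been stripped already.
        with open("/data/tpch/%s.tbl" % table) as f:   # hypothetical path
            cur.copy_expert("COPY %s FROM STDIN WITH DELIMITER '|'" % table, f)
        conn.commit()

    # Pay for integrity only once, after the bulk load is done.
    cur.execute("ALTER TABLE lineitem ADD PRIMARY KEY (l_orderkey, l_linenumber)")
    cur.execute("ALTER TABLE lineitem "
                "ADD FOREIGN KEY (l_orderkey) REFERENCES orders (o_orderkey)")
    conn.commit()
    cur.close()
    conn.close()

As far as I know, pg_bulkload goes further by also bypassing the WAL and shared buffers, but even plain COPY with deferred keys keeps the load itself free of integrity checks.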
I then tried to restart the server process, but after three hours of intensive processing the SQL module had still not started. Please note that the initialization process allocated nearly all memory and swap.
This is what we expect. Your tables require a lot of space, because MonetDB does not partition them automatically. (That's scheduled for an upcoming release ;-))
Hmm, looks pretty interesting. Any plans for reasonably sized (e.g. 64M) table slice partitioning across several machines? I am trying to build a scalable log analysis framework (append-only, read-mostly data flow), so this feature would help a lot.
Regards, J.
[1] http://pgbulkload.projects.postgresql.org/