Thanks for the answers, Stefan! The MonetDB system definitely looks quite interesting. On Thu, 10 Feb 2005, Stefan Manegold wrote:
- I tried a load where I committed after every 100 rows. Noticed _huge_ I/O surges. Looking at the subdirectory, it looked like the X.theap file was being rewritten over and over from scratch. If this is a memory-mapped file, why would this occur? When you start playing with 55MB heap files, this kills performance.
For now, we cannot do much about that, and I suppose there will never be any change: basically, we have to write the whole file to make sure that all changes are properly committed. But why do you consider committing after every 100 rows if you load 250k rows in one go? Isn't a single commit at the end enough?
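You're right; for a one-shot load, a single commit at the end is the way to go. Something like this minimal MIL sketch is what I'll use (read_next_row() is a made-up placeholder for my ingest code, and I'm assuming bat() and commit() behave as described in the MIL documentation):

    # look up the persistent BAT by name (created beforehand)
    var b := bat("daily_data");
    var i := 0;
    while (i < 250000) {
        # read_next_row() is hypothetical; it stands in for the real ingest
        b.insert(i, read_next_row());
        i := i + 1;
    }
    # one commit at the end, instead of one per 100 rows
    commit();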
The catch is that I'm looking at potentially using this on a live system, where data will be arriving at a steady rate (approximately 10-50 items per second). The current data set has some tables with over 500M rows at the moment (and it is growing by >5M rows per day). After looking around a bit, I can see the following solutions:

(1) Only ever do bulk loads (say, at the end of the day). Disadvantage: no analysis of data from the current day. Advantage: I can use a bulk load mechanism rather than inserting one row at a time.

(2) Leave the system "dirty" and commit only at the end of the day. I'm still trying to determine the exact concurrency model. From what I've read so far, there is only global locking and no multi-version view like most databases have (such as Oracle or PostgreSQL), so this should work fine, since any uncommitted data could be seen by other processes. (NOTE: I have not tested this theory yet, so please excuse me if I have it wrong; only so much time to play with the system!)

(3) Use persistent sub-BATs (a BAT within a BAT) to partition the data into sets (say, one sub-BAT per day). But I've seen comments on the mailing lists suggesting that persistent BATs within BATs are no longer supported? I have had no time to dig into or experiment with this option, and I would be worried about the performance of operations running over a partitioned set; a rough sketch of what I was picturing follows below.

Either way, I will definitely have to do something due to the sheer size of the data. Any suggestions from the community on how to deal with a live data stream?
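For concreteness, here is roughly what I was picturing for option (3), flattened to one persistent BAT per day instead of nested BATs, since the nesting appears unsupported. The naming scheme is my own invention, and I'm assuming rename(), persists(), and union() behave as the MIL manual describes:

    # create today's partition and make it survive server restarts
    var today := new(int, flt);
    today.rename("data_20050210");   # hypothetical one-BAT-per-day naming
    today.persists(true);

    # a cross-partition query glues the days back together
    var window := bat("data_20050208").union(bat("data_20050209"));
    window := window.union(bat("data_20050210"));
    window.count();

My performance worry is exactly what those last lines hint at: every query over a window of days pays for the unions up front.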
- Is it better to use "BAT.reverse().find(X)" than "BAT.search(X)"?
Well, these are two "different pairs of shoes":
Doh! That will teach me not to type from memory. I did mean uselect(), and thanks for the pointers on the difference between "definite existence" and a set of matches. It was the tail vs. head operation that confused me, but given that reverse() is essentially free, I won't worry about it. Thanks for the hints! Ed
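P.S. For the archives, here is the distinction as I now understand it, as a toy example. This is only a sketch from memory, assuming the standard new()/insert()/find()/uselect() signatures:

    var b := new(int, str);
    b.insert(1, "a");
    b.insert(2, "b");
    b.insert(3, "b");

    b.find(2);              # head lookup: returns the tail value "b"
    b.reverse().find("b");  # tail lookup via the (free) reversed view; one matching head
    b.uselect("b");         # the full set of heads whose tail is "b": {2, 3}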