Hi Martin,
Thanks for the detailed response.
... This has been solved in the upcoming Feb2010 release. The effect on database size and performance is shown in http://www.cwi.nl/~mk/ontimeReport
That's great news. I had actually been encoding my dimension PK columns as INTs/LONGs to save on (perceived) string costs. Thanks!!
Furthermore, there is a dictionary compression optimizer in Feb2010, which works for any type. It is available at the SQL level on a per-table basis.
Even better. But I wonder, outside of strings, does dictionary compression help much? (e.g. I assume for a BAT of INTs the dictionary offsets would be as large as the original data.)
No. The dictionary optimizer would e.g. replace a :bat[:oid,:lng] with a :bat[:oid,:bte] plus :bat[:bte,:lng] pair and adjust the query plans accordingly. In a DW there are only a limited number of dates, which can be compressed nicely this way.
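To make the space arithmetic concrete, here is a small C sketch of that split (a toy illustration with my own naming, not MonetDB's optimizer code): a 64-bit column with at most 256 distinct values becomes a one-byte code per row plus a small code-to-value dictionary, so roughly 1 byte per row instead of 8.

/* Toy illustration (not MonetDB code) of the split described above:
 * a 64-bit column with few distinct values becomes a 1-byte code column
 * (the :bat[:oid,:bte] part) plus a small code->value dictionary
 * (the :bat[:bte,:lng] part). */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    uint8_t *codes;     /* per-row byte codes */
    int64_t  dict[256]; /* code -> original value */
    size_t   nrows, ndict;
} dict_col;

/* Returns 0 on success, -1 if the column has more than 256 distinct values. */
static int dict_encode(const int64_t *vals, size_t n, dict_col *out)
{
    out->codes = malloc(n);
    out->nrows = n;
    out->ndict = 0;
    for (size_t i = 0; i < n; i++) {
        size_t c = 0;
        while (c < out->ndict && out->dict[c] != vals[i])
            c++;                          /* linear probe: fine for a toy */
        if (c == out->ndict) {
            if (out->ndict == 256) { free(out->codes); return -1; }
            out->dict[out->ndict++] = vals[i];
        }
        out->codes[i] = (uint8_t)c;
    }
    return 0;
}

int main(void)
{
    /* e.g. a DW date column: many rows, only a handful of distinct dates */
    int64_t dates[] = {20100201, 20100202, 20100201, 20100203, 20100202, 20100201};
    size_t n = sizeof dates / sizeof dates[0];
    dict_col c;
    if (dict_encode(dates, n, &c) == 0) {
        printf("%zu rows, %zu dictionary entries: %zu bytes instead of %zu\n",
               c.nrows, c.ndict,
               c.nrows * sizeof(uint8_t) + c.ndict * sizeof(int64_t),
               c.nrows * sizeof(int64_t));
        free(c.codes);
    }
    return 0;
}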
Or do you mean it is not necessarily a static dictionary, but e.g. works like LZ77, encoding groups of recurring data?
Currently it is tested on read-only tables, but there is no reason to stop there.
However, compression does not necessarily lead to performance gains; you have to decompress at some point in most plans. Furthermore, this code is alpha-stage. A driving (test) user would help to improve it.
Granted. Actually, I am more concerned with memory use than raw performance; MonetDB is generally plenty fast for my needs already :-) (BTW, as far as testing it out, it is available (only) in SVN trunk right now, correct?)
It is available in different ways; see the download section.
There are more options to exploit compression. Early versions of MonetDB used gzipped BATs (already 10 years ago).
Yes, I read some archaic-looking comments about storing BATs using compress and then decompressing them all at once. (I guess that doesn't help main-memory use, but it is still interesting.) Does version 5 still support that?
Not now. It would require a minor MAL transformer to play that game again. We focused on memory-mapped files, but it is certainly a valid route for long sessions with many tables.
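For what that route might look like, here is a rough C/zlib sketch of the decompress-the-whole-BAT-on-first-use idea (the load_compressed_column helper is my own hypothetical name, not anything in MonetDB): inflate a gzip-compressed column file into a heap buffer once, and from then on the plan only touches the plain in-memory array. Link with -lz.

/* Sketch of "decompress the whole column on first use" (not MonetDB code). */
#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

/* Hypothetical helper: load a gzipped column file fully into memory. */
static void *load_compressed_column(const char *path, size_t *out_len)
{
    gzFile gz = gzopen(path, "rb");
    if (gz == NULL)
        return NULL;

    size_t cap = 1 << 20, len = 0;
    char *buf = malloc(cap);
    int n;
    while ((n = gzread(gz, buf + len, (unsigned)(cap - len))) > 0) {
        len += (size_t)n;
        if (len == cap)
            buf = realloc(buf, cap *= 2);   /* grow as the data inflates */
    }
    gzclose(gz);
    *out_len = len;
    return buf;   /* caller frees; from here on it is a plain in-memory column */
}

int main(int argc, char **argv)
{
    size_t len;
    void *col = argc > 1 ? load_compressed_column(argv[1], &len) : NULL;
    if (col) {
        printf("inflated %zu bytes from %s\n", len, argv[1]);
        free(col);
    }
    return 0;
}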
The current software stack would even make it possible to massage a plan to use e.g. bitvectors.
My dimensions (and hence PK columns) tend to be low-cardinality, so if I correctly understand what bitvectors offer, that would save a ton of space in the corresponding fact-table FK columns, no?
Are these projects easy enough for someone outside the core team to approach? (albeit very slowly ;-)
I would not press for bitvectors, but the MAL transformer to decompress BATs before they are used is certainly feasible for someone with C experience.
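For a back-of-the-envelope feel for what bit-packing a low-cardinality FK column could buy (plain C, my own sketch rather than an existing MonetDB feature): with k distinct dimension keys, each row needs only ceil(log2 k) bits instead of a full 32- or 64-bit integer.

/* Toy bit-packing of a low-cardinality column (illustration only). */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Pack n small codes (each < 2^bits) into a dense bit array. */
static void pack_bits(const uint32_t *codes, size_t n, unsigned bits, uint8_t *out)
{
    memset(out, 0, (n * bits + 7) / 8);
    for (size_t i = 0; i < n; i++)
        for (unsigned b = 0; b < bits; b++)
            if (codes[i] & (1u << b)) {
                size_t pos = i * bits + b;
                out[pos / 8] |= (uint8_t)(1u << (pos % 8));
            }
}

/* Unpack code i again when the plan needs the original value. */
static uint32_t unpack_bits(const uint8_t *packed, unsigned bits, size_t i)
{
    uint32_t v = 0;
    for (unsigned b = 0; b < bits; b++) {
        size_t pos = i * bits + b;
        if (packed[pos / 8] & (1u << (pos % 8)))
            v |= 1u << b;
    }
    return v;
}

int main(void)
{
    /* e.g. a fact-table FK referencing a 10-row dimension: 4 bits per row */
    uint32_t fk[] = {3, 7, 7, 1, 9, 0, 3, 3};
    uint8_t packed[4];
    pack_bits(fk, 8, 4, packed);
    printf("row 4 decodes back to %u (4 packed bytes vs 32 bytes uncompressed)\n",
           unpack_bits(packed, 4, 4));
    return 0;
}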
If there are no plans, is this because it's completely antithetical to the MonetDB architecture (from the papers it seems like X100 was, to some degree at least, 'integrated' into it), or more due to a lack of resources?
We can always use resources to make the code base better: involved developers/users, but also funding in the form of companies that use MonetDB in their applications and want a (continual) performance/functional quality-assessment and testing agreement.
Understand completely.
My motivating example here is OLAP: I frequently have one relatively large fact table and many much smaller dimension tables. If optional compression were available, it would be nice to compress all or some of the BATs for the fact-table columns and have the others work as usual.
So, there is a lot already, and more coming as the need arises.
Well, OLAP and memory-constrained environments represent my need, but perhaps I'm an outlier. Whatever you choose to address, however, thanks for all your hard work.
(Well, at least this sounds good; maybe it makes no sense.) Another motivation is that there seems to be a lot of anecdotal evidence of companies moving from larger big-iron servers to more numerous, smaller machines, so it would be really nice to have this capability for more memory-constrained settings.
Indeed, I expect this year will bring some MonetDB surprises (again). Some of them are already in the distribution. To pick one, the code base contains a 'recycler' optimizer that, for many BI applications, can provide a significant performance boost. A throughput improvement of >40% has been reported already. (It received an award at SIGMOD 2009 for its innovation.) Such optimizers are gradually integrated into the default optimizer pipeline, but they are already available for adventurous users.
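For readers unfamiliar with the idea, here is a toy C illustration of the intermediate-recycling principle (the real Recycler works on MAL plans and is far more sophisticated; all names here are mine): cache the result of an operation keyed by its instruction signature, so a later query issuing the same operation reuses it instead of recomputing.

/* Toy memoization of intermediates, illustrating the recycling principle only. */
#include <stdio.h>
#include <string.h>

#define CACHE_SLOTS 16

typedef struct {
    char key[64];   /* e.g. a textual instruction signature */
    long result;    /* stand-in for a cached intermediate BAT */
    int  used;
} recycle_entry;

static recycle_entry cache[CACHE_SLOTS];

/* Look up an instruction signature; recompute and remember it on a miss. */
static long recycled_eval(const char *key, long (*compute)(void))
{
    for (int i = 0; i < CACHE_SLOTS; i++)
        if (cache[i].used && strcmp(cache[i].key, key) == 0) {
            printf("recycled: %s\n", key);
            return cache[i].result;       /* hit: no recomputation */
        }
    long r = compute();                    /* miss: run the operator */
    for (int i = 0; i < CACHE_SLOTS; i++)
        if (!cache[i].used) {
            cache[i].used = 1;
            cache[i].result = r;
            snprintf(cache[i].key, sizeof cache[i].key, "%s", key);
            break;
        }
    return r;
}

static long expensive_scan(void) { puts("computing..."); return 42; }

int main(void)
{
    recycled_eval("select(orders.odate, 2010-02)", expensive_scan); /* computes */
    recycled_eval("select(orders.odate, 2010-02)", expensive_scan); /* recycled */
    return 0;
}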
Thanks for pointing it out; I'll read up on it. (And congratulations on the award added to your collection :-)
Best Regards.
Jason