[MonetDB-users] compression
Hi,

Are there any plans on the horizon to roll compression into Monet? The X-100 project looked really interesting in this regard, but as I understand it, that work has been transferred into VectorWise. If there are no plans, is this because it's completely antithetical to the Monet architecture (from the papers it seems like X-100 was, to some degree at least, 'integrated' in), or more due to lack of resources?

My motivating example here is OLAP: I frequently have 1 relatively large fact table and then many much smaller dimensional tables. If optional compression were available, it would be nice to compress all or some of the BATs for the fact table columns and then have the others work as usual. (Well, at least this sounds good, maybe it makes no sense.) Another motivation is that there seems to be a lot of anecdotal evidence of companies moving from larger big-iron servers to more numerous, smaller machines, so it would be really nice to have this capability in more memory-constrained settings.

I understand on a basic level how compression conflicts with the relatively simple approach Monet uses to load BATs (e.g. memory mapping), but, dwelling in ignorance, I blithely assume there could be some solution not as complex as X-100 if one were to accept a significant performance cost. For example: decompressing BAT data on the fly as part of a BATiterator. I probably don't have the skills to implement even a basic on-the-fly decompression approach like this, but just wondering aloud: how hard a problem is this?

Thanks,
Jason
Jason Kinzer wrote:
Hi Jason,
Thanks for your thoughts.
Are there any plans on the horizon to roll compression into Monet? The X-100 project looked really interesting in this regard, but as I understand it, that work has been transferred into VectorWise.
Correct, but ... For many years MonetDB has already used dictionary compression over string columns, provided the string table is relatively small (128MB). Furthermore, most OID columns used for intermediates do not take storage at all, and wherever possible BAT heaps are shared.

However, in a recent DW experiment http://www.mysqlperformanceblog.com/2009/10/02/analyzing-air-traffic-perform... we noticed that our references to string values were the cause of excessive space consumption (8 bytes referring to 1 byte). This has been solved in the upcoming Feb2010 release. The effect on database size and performance is shown in http://www.cwi.nl/~mk/ontimeReport

Furthermore, there is a dictionary compression optimizer in Feb2010, which works for any type. It is available at the SQL level on a per-table basis. However, compression does not necessarily lead to performance gains; you have to decompress at some point in most plans. Furthermore, this code is alpha-stage. A driving (test) user would help to improve it.

There are more options to exploit compression. Early versions of MonetDB used gzipped BATs (10 years ago already). The current software stack would even make it possible to massage a plan to use e.g. bitvectors.
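For readers unfamiliar with the technique, here is a minimal self-contained C sketch of the general idea (illustration only, not MonetDB source; the toy column and all names are made up): a column of 8-byte values with few distinct values is stored as 1-byte codes plus a small dictionary, and decompression is just an array lookup.

/* Illustration only -- not MonetDB source.  A minimal sketch of
   dictionary encoding: a column of 8-byte values with few distinct
   values is stored as 1-byte codes plus a small dictionary. */
#include <stdio.h>
#include <stdint.h>

#define MAXDICT 256                      /* codes must fit in one byte */

typedef struct {
    int64_t values[MAXDICT];             /* distinct values seen so far */
    int     count;
} dict_t;

/* return the 1-byte code for v, adding it to the dictionary if new;
   -1 when the dictionary overflows (fall back to plain storage) */
static int dict_code(dict_t *d, int64_t v)
{
    for (int i = 0; i < d->count; i++)
        if (d->values[i] == v)
            return i;
    if (d->count == MAXDICT)
        return -1;
    d->values[d->count] = v;
    return d->count++;
}

int main(void)
{
    int64_t column[] = { 7001, 7002, 7001, 7003, 7002, 7001 };  /* toy column */
    size_t  n = sizeof(column) / sizeof(column[0]);
    uint8_t codes[sizeof(column) / sizeof(column[0])];
    dict_t  dict = { .count = 0 };

    for (size_t i = 0; i < n; i++)
        codes[i] = (uint8_t) dict_code(&dict, column[i]);

    /* decompression is a plain lookup: dict.values[codes[i]] */
    printf("%zu rows: %zu bytes raw, %zu bytes coded + %d dictionary entries\n",
           n, n * sizeof(int64_t), n * sizeof(uint8_t), dict.count);
    return 0;
}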
If there are no plans, is this because it's completely antithetical to the monet architecture (from the papers it seems like X-100 was, to some degree at least, 'integrated' in), or more due to lack of resources?
We can always use resources to make the code base better: involved developers/users, but also dollar-wise, in the form of companies that use MonetDB in their applications and want a (continual) performance/functional quality-assessment and testing agreement.
My motivating example here is OLAP: I frequently have 1 relatively large fact table and then many much smaller dimensional tables. If optional compression were available, it would be nice to compress all or some of the BATs for the fact table columns and then have the others work as usual.
(Well, at least this sounds good, maybe it makes no sense). Another motivation is there seems to be a lot of anecdotal evidence for companies moving from larger big iron servers to more numerous, smaller machines - so it would be really nice to have this capability for more memory constrained settings.

Indeed, I expect this year will bring some MonetDB surprises (again). Some of them are already in the distribution. To pick one, the code base contains a 'recycler' optimizer that can provide a significant performance boost for many BI applications; a throughput improvement of >40% has been reported already. (It received an award at SIGMOD 2009 for its innovation.) Such optimizers are gradually integrated into the default optimizer pipe, but are available for adventurous users already.

So, there is a lot and more coming upon need.
I understand on a basic level how compression conflicts with the relatively simple approach monet uses to load BATs (e.g. memory map), but, dwelling in ignorance, I blithely assume there could be some solution not as complex as X-100 if one were to accept a significant performance cost. For example: decompressing BAT data on the fly as part of a BATiterator. I probably don't have the skills to implement even a basic on-the-fly decompression approach like this, but just wondering aloud: how hard a problem is this?
Thanks, Jason

regards, Martin
Hi Martin, Thanks for the detailed response.
... This has been solved in the upcoming Feb2010 release. The effect on database size and performance is shown in http://www.cwi.nl/~mk/ontimeReport
That's great news. I had actually been encoding my dimension PK columns as INTs/LONGs to save on (perceived) string costs. Thanks!!
Furthermore, there is a dictionary compression optimizer in Feb2010, which works for any type. It is available at the SQL level on a per table basis.
Even better. But I wonder, outside of strings, does dictionary compression help much? (e.g. I assume for a BAT of INTs the dictionary offset would be as large as the original data) Or do you mean it is not necessarily a static dictionary, but e.g. works like LZ77, encoding groups of recurring data?
However, compression does not necessarily lead to performance gains; you have to decompress at some point in most plans. Furthermore, this code is alpha-stage. A driving (test) user would help to improve it.
Granted. Actually I am more concerned with memory use than raw performance; MonetDB is generally plenty fast for my needs already :-) (BTW, as far as testing it out, it is available (only) in SVN trunk right now, correct?)
There are more options to exploit compression. Early versions of MonetDB used gzipped BATs (10yrs ago already).
Yes, I read some archaic-looking comments about storing BATs using compress, and then decompressing them all at once. (I guess that doesn't help main memory use, but still interesting). Does version 5 still support that?
The current software stack would even make it possible to massage a plan to use e.g. bitvectors.
My dimensions (and hence PK columns) tend to be low cardinality - so if I correctly understand what bit vectors offer, that would save a ton of space in my corresponding Fact table FK columns, no?
Are these projects easy enough for someone outside the core team to approach? (albeit very slowly ;-)
If there are no plans, is this because it's completely antithetical to the monet architecture (from the papers it seems like X-100 was, to some degree at least, 'integrated' in), or more due to lack of resources?
We can always use resources to make the code base better: involved developers/users, but also dollar-wise, in the form of companies that use MonetDB in their applications and want a (continual) performance/functional quality-assessment and testing agreement.
Understand completely.
My motivating example here is OLAP: I frequently have 1 relatively large fact table and then many much smaller dimensional tables. If optional compression were available, it would be nice to compress all or some of the BATs for the fact table columns and then have the others work as usual.
So, there is a lot and more coming upon need.
Well, OLAP & memory-constrained environments represent my need, but perhaps I'm an outlier. Whatever you choose to address, however, thanks for all your hard work.
(Well, at least this sounds good, maybe it makes no sense). Another
motivation is there seems to be a lot of anecdotal evidence for companies moving from larger big iron servers to more numerous, smaller machines - so it would be really nice to have this capability for more memory constrained settings.
Indeed, I expect this year will bring some MonetDB surprises (again). Some of them are already in the distribution. To pick one, the code base contains a 'recycler' optimizer that can provide a significant performance boost for many BI applications; a throughput improvement of >40% has been reported already. (It received an award at SIGMOD 2009 for its innovation.) Such optimizers are gradually integrated into the default optimizer pipe, but are available for adventurous users already.
Thanks for pointing it out, I'll read up on it. (And congratulations on the award added to your collection :-)

Best Regards,
Jason
Jason Kinzer wrote:
Hi Martin,
Thanks for the detailed response.
... This has been solved in the upcoming Feb2010 release. The effect on database size and performance is shown in http://www.cwi.nl/~mk/ontimeReport
That's great news. I had actually been encoding my dimension PK columns as INTs/LONGs to save on (perceived) string costs. Thanks!!
Furthermore, there is a dictionary compression optimizer in Feb2010, which works for any type. It is available at the SQL level on a per table basis.
Even better. But I wonder, outside of strings, does dictionary compression help much? (e.g. I assume for a BAT of INTs the dictionary offset would be as large as the original data) Or do you mean it is not necessarily a static dictionary, but e.g. works like LZ77, encoding groups of recurring data?

No. The dictionary optimizer would e.g. replace a :bat[:oid,:lng] with a :bat[:oid,:bte], :bat[:bte,:lng] pair and adjust the query plans accordingly. In a DW there are a limited number of dates that can be compressed nicely this way. It is currently tested on read-only tables, but there is no reason to stop there.
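To make the space arithmetic concrete, a back-of-the-envelope C sketch follows (illustration only, with assumed numbers, not MonetDB code or measured results): for a fact table of 100 million rows and at most 256 distinct dates, the lng tail shrinks from about 800 MB to about 100 MB of byte codes plus a 2 KB dictionary, at the cost of one extra lookup per access.

/* Illustration only -- toy numbers, not MonetDB code.  Space arithmetic
   for replacing an 8-byte date column by 1-byte dictionary codes. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t rows  = 100000000ULL;                 /* assumed fact-table size    */
    uint64_t raw   = rows * sizeof(int64_t);       /* :bat[:oid,:lng] tail       */
    uint64_t coded = rows * sizeof(uint8_t);       /* :bat[:oid,:bte] tail       */
    uint64_t dict  = 256 * sizeof(int64_t);        /* :bat[:bte,:lng] dictionary */

    printf("raw: %llu MB, coded: %llu MB + %llu bytes of dictionary\n",
           (unsigned long long)(raw / 1000000),
           (unsigned long long)(coded / 1000000),
           (unsigned long long)dict);

    /* reading row i back costs one extra lookup, conceptually:
       int64_t value = dictionary[codes[i]];  */
    return 0;
}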
However, compression does not necessarily lead to performance gains; you have to decompress at some point in most plans. Furthermore, this code is alpha-stage. A driving (test) user would help to improve it.
Granted. Actually I am more concerned with memory use than raw performance; MonetDB is generally plenty fast for my needs already :-) (BTW, as far as testing it out, it is available (only) in SVN trunk right now, correct?)

It is available in different ways; see the download section.
There are more options to exploit compression. Early versions of MonetDB used gzipped BATs (10yrs ago already).
Yes, I read some archaic-looking comments about storing BATs using compress, and then decompressing them all at once. (I guess that doesn't help main memory use, but still interesting). Does version 5 still support that?
Not now. It would require a minor MAL transformer to play that game again. We focussed on memory-mapped files, but it is certainly a valid route for long sessions with many tables.
The current software stack would even make it possible to massage a plan to use e.g. bitvectors.
My dimensions (and hence PK columns) tend to be low cardinality - so if I correctly understand what bit vectors offer, that would save a ton of space in my corresponding Fact table FK columns, no?
Are these projects easy enough for someone outside the core team to approach? (albeit very slowly ;-)
I would not press for bitvectors, but the MAL transformer to decompress BATs before they are used is very well possible with C experience.
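For what a bitvector-style encoding could offer in the low-cardinality FK case, a hedged C sketch (illustration only, not anything MonetDB currently ships): with, say, at most 16 distinct dimension keys, each foreign key fits in 4 bits, so two rows pack into one byte, halving even an already byte-coded column.

/* Illustration only -- a minimal bit-packing sketch, not MonetDB code.
   With <= 16 distinct foreign keys, each code needs only 4 bits,
   so two rows share one byte. */
#include <stdio.h>
#include <stdint.h>

static void pack4(const uint8_t *codes, size_t n, uint8_t *out)
{
    for (size_t i = 0; i < n; i += 2)
        out[i / 2] = (uint8_t)(((codes[i] & 0x0F) << 4) |
                               (i + 1 < n ? (codes[i + 1] & 0x0F) : 0));
}

static uint8_t unpack4(const uint8_t *packed, size_t i)
{
    return (i % 2 == 0) ? (uint8_t)(packed[i / 2] >> 4)
                        : (uint8_t)(packed[i / 2] & 0x0F);
}

int main(void)
{
    uint8_t fks[] = { 3, 7, 7, 1, 0, 3 };      /* toy FK codes, all < 16 */
    uint8_t packed[(sizeof(fks) + 1) / 2];

    pack4(fks, sizeof(fks), packed);
    for (size_t i = 0; i < sizeof(fks); i++)
        printf("row %zu -> fk %u\n", i, unpack4(packed, i));
    return 0;
}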
If there are no plans, is this because it's completely antithetical to the monet architecture (from the papers it seems like X-100 was, to some degree at least, 'integrated' in), or more due to lack of resources?
We can always use resources to make the code base better: involved developers/users, but also dollar-wise, in the form of companies that use MonetDB in their applications and want a (continual) performance/functional quality-assessment and testing agreement.
Understand completely.
My motivating example here is OLAP: I frequently have 1 relatively large fact table and then many much smaller dimensional tables. If optional compression were available, it would be nice to compress all or some of the BATs for the fact table columns and then have the others work as usual.
So, there is a lot and more coming upon need.
Well, OLAP & memory-constrained environments represent my need, but perhaps I'm an outlier. Whatever you choose to address, however, thanks for all your hard work.
(Well, at least this sounds good, maybe it makes no sense). Another motivation is there seems to be a lot of anecdotal evidence for companies moving from larger big iron servers to more numerous, smaller machines - so it would be really nice to have this capability for more memory constrained settings.
Indeed, I expect this year will bring some MonetDB surprises (again). Some of them are already in the distribution. To pick one, the code base contains a 'recycler' optimizer that can provide a significant performance boost for many BI applications; a throughput improvement of >40% has been reported already. (It received an award at SIGMOD 2009 for its innovation.) Such optimizers are gradually integrated into the default optimizer pipe, but are available for adventurous users already.
Thanks for pointing it out, I'll read up on it. (And congratulations on the award added to your collection :-)
Best Regards.
Jason
participants (2)
- Jason Kinzer
- Martin Kersten