Re: [Monetdb-developers] strings

Martin (et al.),

Thanks for your comments (Martin's are repeated below). For the moment, I settled for a function that ensures no reading beyond the zero byte; the previous proposal did that. It still processes two bytes per iteration, though:

@- Hash Function
The string hash function is a very simple hash function that XORs and rotates all characters together. It is optimized to process two characters at a time (adding 16 bits to the hash value each iteration).
@h
#define GDK_STRHASH(x,y)					\
	{							\
		str _c = (str) (x);				\
		for ((y) = 0; _c[0] && _c[1]; _c += 2) {	\
			(y) = ((y) << 5) ^ ((y) >> 7) ^		\
			      ((_c[0] << 8) | _c[1]);		\
		}						\
		(y) ^= _c[0]; /* trailing odd byte, or 0 */	\
	}

Shared BAT heaps using views could be done: we could modify the view mechanism to allow a view to have its own BUN heap while still sharing a heap from a parent. This would still allow only a single parent, though, and touching the view code is quite a delicate matter.

The suggestion to *always* eliminate all doubles I find undesirable. That policy is what caused the crashes I had to go fix in the data center of ABN AMRO. Think of unique string keys: even if we could dynamically tune the hash-table size as the BAT grows (we don't), so as to avoid long collision lists, we would waste a lot of memory and gain nothing. The current adaptive approach is extremely robust against any distribution thrown at it:
- on small BATs, the overhead of the (mostly useless) hash table is limited (4KB);
- on BATs with a 'small' (L2-cacheable) heap, the current "working" (= double-eliminating) hash table has just about the right size (256KB), so things work nicely when they can; out-of-cache hash tables are never nice;
- on BATs with too many unique values, we still profit from local correlations thanks to the repeated flushing. Fully eliminating doubles in such distributions would require a (robust!) dynamic allocation scheme, plus a *lot* of memory to keep all the unique values. Not worth it.

Thus we have a data structure that requires *no* a-priori knowledge of the distribution, introduces a limited maximal amount of storage overhead, and reduces storage maximally (given the low resource budget), in the best case achieving full double elimination that can be exploited with ELIMDOUBLES(h).

I agree it could be cleaner to remove the hash table entirely and rely on explicit, application-driven double elimination. But, as described, this mechanism never hurts us (much), and I suspect that for many, many use cases, some of which we are not even aware of yet, it actually delivers a lot of gain, simply because it is applied pervasively.

I do not fully grasp your suggestion for hash functions that are distribution-dependent, especially how they would work incrementally, as values are added to a BAT over time. In that case the distribution would change continually, and thus so would the hash function.

Below my signature I append three rough sketches: a standalone test of the hash macro above, a hypothetical layout for a shared-heap view, and a paraphrase of the adaptive flushing scheme.

Peter
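First, a minimal standalone test of the hash macro. Only the macro itself comes from the snippet above; the str typedef stands in for GDK's char* alias, and main() is illustrative scaffolding, not GDK code:

/* minimal standalone test of GDK_STRHASH; apart from the macro
 * itself, nothing here is actual GDK code */
#include <stdio.h>

typedef char *str;                 /* stand-in for GDK's typedef */

#define GDK_STRHASH(x,y)					\
	{							\
		str _c = (str) (x);				\
		for ((y) = 0; _c[0] && _c[1]; _c += 2) {	\
			(y) = ((y) << 5) ^ ((y) >> 7) ^		\
			      ((_c[0] << 8) | _c[1]);		\
		}						\
		(y) ^= _c[0];					\
	}

int
main(void)
{
	str samples[] = { "", "a", "ab", "MonetDB", "MonetDB!" };
	unsigned int h;
	size_t i;

	for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
		GDK_STRHASH(samples[i], h);
		printf("%-10s -> %08x\n", samples[i], h);
	}
	return 0;
}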
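Second, to make the shared-heap idea concrete, a hypothetical sketch of a view descriptor that owns its BUN heap but borrows the string heap of a (single) parent. None of these names or types is the real GDK descriptor:

/* hypothetical layout, illustrative only; not the real GDK structs */
struct heap {
	char  *base;               /* start of the mapped region */
	size_t free;               /* bytes in use */
	size_t size;               /* bytes allocated */
};

struct bat_view {
	struct heap bunheap;       /* private: the view's own BUNs */
	struct heap *shared_vheap; /* borrowed: the parent's string heap */
	int parent_id;             /* a single parent only, as noted */
};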
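Third, my paraphrase of the adaptive double-elimination scheme described above. The mechanism (fixed-budget table, repeated flushing) is as described, but all names, sizes, and the exact flush policy here are illustrative, not the actual GDK insert code:

/* sketch of adaptive double elimination with repeated flushing;
 * illustrative only -- not the actual GDK code */
#include <string.h>

#define TABSIZE    65536           /* fixed bucket budget */
#define FLUSHLIMIT (256 * 1024)    /* 'small' (L2-cacheable) chunk */

struct strheap {
	char  *base;               /* pre-allocated storage (bounds checks omitted) */
	size_t free;               /* bytes in use */
	size_t since_flush;        /* bytes inserted since the last flush */
	size_t tab[TABSIZE];       /* offset+1 of a candidate, 0 = empty */
};

/* insert s, returning its offset; reuse an existing copy if found */
static size_t
strheap_insert(struct strheap *h, const char *s, unsigned int hash)
{
	size_t *slot = &h->tab[hash % TABSIZE];
	size_t len = strlen(s) + 1;

	/* double elimination: reuse the stored copy on a true match */
	if (*slot && strcmp(h->base + (*slot - 1), s) == 0)
		return *slot - 1;

	/* repeated flushing: once a chunk of new strings outgrows the
	 * cache-friendly budget, wipe the table; doubles are then only
	 * eliminated against the current chunk (local correlations) */
	if (h->since_flush + len > FLUSHLIMIT) {
		memset(h->tab, 0, sizeof(h->tab));
		h->since_flush = 0;
	}

	memcpy(h->base + h->free, s, len);
	*slot = h->free + 1;       /* remember this copy (may overwrite) */
	h->free += len;
	h->since_flush += len;
	return *slot - 1;
}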
-----Original Message-----
From: Martin Kersten [mailto:Martin.Kersten@cwi.nl]
Sent: Friday, October 20, 2006 5:15 PM
To: Peter Boncz
Cc: Niels Nes; Martin Kersten; Stefan Manegold; Sjoerd Mullender
Subject: strings

Peter,

Niels and I were discussing the snippet shown in the attachment two days ago. The snippet is interesting for several reasons. The most important one is that the cost of this string join comes from inserting the 800K strings, *not* from the str hash join itself (observed with callgrind).

Among other things, it triggered (again) the wish to have more control at the BAT level over how string values are dealt with. For example, it would be nice to know that *all* doubles are eliminated (not only per chunk). In combination with a fast 'uniqueCount' this could make some operations a lot faster. Alternatively, it would be nice if you could refer into the heap of another BAT, e.g. by marking a column as 'shared heap' and re-using the BATview scheme in the target to protect it. (In the 19.sql case, we actually should have dropped the strings in the semijoin operation, which calls for either a new GDK operation or a better optimization step.)

As for your proposal, I would search in the following direction. The hash is determined to a large extent by the statistical probability of running into the same pattern. We also know that in practice the use of characters (and character combinations) is Zipf-distributed. You can exploit this by putting an encoding table in the header of the string heap that makes each encoding have more or less effect on the hash function. For example: leave out all frequently occurring characters and always take N less frequent ones. Or use alternative bit patterns that [affect] the hash. You could even apply this encoding to the BAT as a separate optimization step, with the default based on a simple text analysis. Furthermore, the hash table (currently 1K) could well be somewhat bigger.
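Roughly the direction I have in mind, as an illustrative sketch only: a per-heap table, filled by a simple frequency analysis, marks the most common characters, and the hash folds in only the rarer ones. The names, the analysis, and the thresholds are all made up; nothing here exists yet:

/* sketch of distribution-aware string hashing; illustrative only,
 * none of this is existing MonetDB code */
#include <stddef.h>

#define NSELECT 8                    /* take at most N rarer characters */

struct str_encoding {
	unsigned char frequent[256]; /* 1 = common, skipped when hashing */
};

/* the default: derived from a simple text analysis of the heap */
static void
build_encoding(struct str_encoding *e, const char *heap, size_t n)
{
	size_t cnt[256] = { 0 }, i;

	for (i = 0; i < n; i++)
		cnt[(unsigned char) heap[i]]++;
	for (i = 0; i < 256; i++)
		e->frequent[i] = cnt[i] > n / 64;    /* crude cut-off */
}

static unsigned int
strhash_selective(const struct str_encoding *e, const char *s)
{
	unsigned int y = 0;
	int taken = 0;

	for (; *s && taken < NSELECT; s++) {
		if (e->frequent[(unsigned char) *s])
			continue;  /* common character: little information */
		y = (y << 5) ^ (y >> 7) ^ (unsigned char) *s;
		taken++;
	}
	return y;
}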