Re: [Monetdb-developers] strings

Martin (et al.),

Thanks for your comments (Martin's are repeated below). For the moment, I settled for a function that ensures no reading beyond the zero byte; the previous proposal did that. It still processes two bytes per iteration, though:

@- Hash Function
The string hash function is a very simple hash function that XORs and rotates all characters together. It is optimized to process two characters at a time (adding 16 bits to the hash value each iteration).
@h
#define GDK_STRHASH(x,y)					\
	{							\
		str _c = (str) (x);				\
		for ((y) = 0; _c[0] && _c[1]; _c += 2) {	\
			(y) = ((y) << 5) ^ ((y) >> 7) ^		\
			      ((_c[0] << 8) | _c[1]);		\
		}						\
		(y) ^= _c[0]; /* trailing odd byte, or 0 */	\
	}

Shared BAT heaps using views could be done: we could modify the view mechanism to allow a view to have its own BUN heap while still sharing a heap from a parent. This would still allow only a single parent, though, and touching the view code is quite a delicate matter.

The suggestion to *always* eliminate all doubles I find undesirable. That policy is what caused the crashes I had to go fix in the data center of ABN AMRO. Think of unique string keys: even if we could dynamically tune the hash-table size as the BAT grows (we don't), so as to avoid long collision lists, we would waste a lot of memory and gain nothing. The current adaptive approach is extremely robust against any distribution thrown at it:
- on small BATs, the overhead of the (mostly useless) hash table is limited (4KB);
- on BATs with a 'small' (L2-cacheable) heap, the current "working" (= double-eliminating) hash table has just about the right size (256KB), so things work nicely when they can; out-of-cache hash tables are never nice;
- on BATs with too many unique values, we still profit from local correlations thanks to the repeated flushing. Fully eliminating doubles in such distributions would require a (robust!) dynamic allocation scheme, plus a *lot* of memory to keep all the unique values. Not worth it.

Thus we have a data structure that requires *no* a-priori knowledge of the distribution, introduces a limited maximal amount of storage overhead, and reduces storage maximally (given the low resource budget), in the best case achieving full double elimination that can be exploited with ELIMDOUBLES(h).

I agree it could be cleaner to remove the hash table entirely and rely on explicit, application-driven double elimination. But, as described, this mechanism never hurts us (much), and I suspect that for many, many use cases, some of which we are not even aware of yet, it actually delivers a lot of gain, simply because it is applied pervasively.

I do not fully grasp your suggestion for hash functions that are distribution-dependent, especially how they would work incrementally, as values are added to a BAT over time. In that case the distribution would change continually, and thus so would the hash function.

Below my signature I append three rough sketches: a standalone test of the hash macro above, a hypothetical layout for a shared-heap view, and a paraphrase of the adaptive flushing scheme.

Peter
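First, a minimal standalone test of the hash macro. Only the macro itself comes from the snippet above; the str typedef stands in for GDK's char* alias, and main() is illustrative scaffolding, not GDK code:

/* minimal standalone test of GDK_STRHASH; apart from the macro
 * itself, nothing here is actual GDK code */
#include <stdio.h>

typedef char *str;                 /* stand-in for GDK's typedef */

#define GDK_STRHASH(x,y)					\
	{							\
		str _c = (str) (x);				\
		for ((y) = 0; _c[0] && _c[1]; _c += 2) {	\
			(y) = ((y) << 5) ^ ((y) >> 7) ^		\
			      ((_c[0] << 8) | _c[1]);		\
		}						\
		(y) ^= _c[0];					\
	}

int
main(void)
{
	str samples[] = { "", "a", "ab", "MonetDB", "MonetDB!" };
	unsigned int h;
	size_t i;

	for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
		GDK_STRHASH(samples[i], h);
		printf("%-10s -> %08x\n", samples[i], h);
	}
	return 0;
}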
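Second, to make the shared-heap idea concrete, a hypothetical sketch of a view descriptor that owns its BUN heap but borrows the string heap of a (single) parent. None of these names or types is the real GDK descriptor:

/* hypothetical layout, illustrative only; not the real GDK structs */
struct heap {
	char  *base;               /* start of the mapped region */
	size_t free;               /* bytes in use */
	size_t size;               /* bytes allocated */
};

struct bat_view {
	struct heap bunheap;       /* private: the view's own BUNs */
	struct heap *shared_vheap; /* borrowed: the parent's string heap */
	int parent_id;             /* a single parent only, as noted */
};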
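Third, my paraphrase of the adaptive double-elimination scheme described above. The mechanism (fixed-budget table, repeated flushing) is as described, but all names, sizes, and the exact flush policy here are illustrative, not the actual GDK insert code:

/* sketch of adaptive double elimination with repeated flushing;
 * illustrative only -- not the actual GDK code */
#include <string.h>

#define TABSIZE    65536           /* fixed bucket budget */
#define FLUSHLIMIT (256 * 1024)    /* 'small' (L2-cacheable) chunk */

struct strheap {
	char  *base;               /* pre-allocated storage (bounds checks omitted) */
	size_t free;               /* bytes in use */
	size_t since_flush;        /* bytes inserted since the last flush */
	size_t tab[TABSIZE];       /* offset+1 of a candidate, 0 = empty */
};

/* insert s, returning its offset; reuse an existing copy if found */
static size_t
strheap_insert(struct strheap *h, const char *s, unsigned int hash)
{
	size_t *slot = &h->tab[hash % TABSIZE];
	size_t len = strlen(s) + 1;

	/* double elimination: reuse the stored copy on a true match */
	if (*slot && strcmp(h->base + (*slot - 1), s) == 0)
		return *slot - 1;

	/* repeated flushing: once a chunk of new strings outgrows the
	 * cache-friendly budget, wipe the table; doubles are then only
	 * eliminated against the current chunk (local correlations) */
	if (h->since_flush + len > FLUSHLIMIT) {
		memset(h->tab, 0, sizeof(h->tab));
		h->since_flush = 0;
	}

	memcpy(h->base + h->free, s, len);
	*slot = h->free + 1;       /* remember this copy (may overwrite) */
	h->free += len;
	h->since_flush += len;
	return *slot - 1;
}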
-----Original Message-----
From: Martin Kersten [mailto:Martin.Kersten@cwi.nl]
Sent: Friday, October 20, 2006 5:15 PM
To: Peter Boncz
Cc: Niels Nes; Martin Kersten; Stefan Manegold; Sjoerd Mullender
Subject: strings

Peter,

Niels and I were discussing the snippet shown in the attachment two days ago. The snippet is interesting for several reasons. The most important one is that the cost of this string join comes from inserting the 800K strings, *not* from the str hash join itself (observed with callgrind).

Among other things, it triggered (again) the wish to have more control at the BAT level over how string values are dealt with. For example, it would be nice to know that *all* doubles are eliminated (not only per chunk). In combination with a fast 'uniqueCount' this could make some operations a lot faster. Alternatively, it would be nice if you could refer into the heap of another BAT, e.g. by marking a column as 'shared heap' and re-using the BATview scheme in the target to protect it. (In the 19.sql case, we actually should have dropped the strings in the semijoin operation, which calls for either a new GDK operation or a better optimization step.)

As for your proposal, I would search in the following direction. The hash is determined to a large extent by the statistical probability of running into the same pattern. We also know that in practice the use of characters (and character combinations) is Zipf-distributed. You can exploit this by putting an encoding table in the header of the string heap that makes each encoding have more or less effect on the hash function. For example: leave out all frequently occurring characters and always take N less frequent ones. Or use alternative bit patterns that [affect] the hash. You could even apply this encoding to the BAT as a separate optimization step, with the default based on a simple text analysis. Furthermore, the hash table (currently 1K) could well be somewhat bigger.
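Roughly the direction I have in mind, as an illustrative sketch only: a per-heap table, filled by a simple frequency analysis, marks the most common characters, and the hash folds in only the rarer ones. The names, the analysis, and the thresholds are all made up; nothing here exists yet:

/* sketch of distribution-aware string hashing; illustrative only,
 * none of this is existing MonetDB code */
#include <stddef.h>

#define NSELECT 8                    /* take at most N rarer characters */

struct str_encoding {
	unsigned char frequent[256]; /* 1 = common, skipped when hashing */
};

/* the default: derived from a simple text analysis of the heap */
static void
build_encoding(struct str_encoding *e, const char *heap, size_t n)
{
	size_t cnt[256] = { 0 }, i;

	for (i = 0; i < n; i++)
		cnt[(unsigned char) heap[i]]++;
	for (i = 0; i < 256; i++)
		e->frequent[i] = cnt[i] > n / 64;    /* crude cut-off */
}

static unsigned int
strhash_selective(const struct str_encoding *e, const char *s)
{
	unsigned int y = 0;
	int taken = 0;

	for (; *s && taken < NSELECT; s++) {
		if (e->frequent[(unsigned char) *s])
			continue;  /* common character: little information */
		y = (y << 5) ^ (y >> 7) ^ (unsigned char) *s;
		taken++;
	}
	return y;
}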