Related to my previous question about persisting hashes, I would like to ask another one.
BATsubjoin has a series of heuristics to decide which join implementation to use. For hash-join, the last rule says: if nothing else applied, build a hash on the smaller bat.
Could you tell me what the rationale for this is?
From what I could verify:
- when sizes are comparable: it doesn't really make much difference which side is hashed
- when sizes differ a lot: sure, building the hash table on the smaller bat is much cheaper, but the join as a whole becomes 4-5 times slower than when hashing on the larger bat.
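To make the comparison concrete, here is a minimal sketch of the two choices in Python. It is not the BATsubjoin code: `hash_join`, `join` and the `build_on` parameter are hypothetical names made up for illustration, and plain lists stand in for bats. It only shows that both build sides give the same result, so the difference is purely in cost.

```python
from collections import defaultdict


def hash_join(build, probe):
    """Hash `build`, scan `probe`; return (build_idx, probe_idx) matches."""
    table = defaultdict(list)
    for i, v in enumerate(build):
        table[v].append(i)
    out = []
    for j, v in enumerate(probe):
        # .get() avoids inserting empty buckets for values with no match
        for i in table.get(v, ()):
            out.append((i, j))
    return out


def join(left, right, build_on="smaller"):
    """Equi-join returning sorted (left_idx, right_idx) pairs.

    build_on is a hypothetical switch: "smaller" hashes the smaller
    input, anything else hashes the larger one.
    """
    left_is_smaller = len(left) <= len(right)
    build_left = left_is_smaller if build_on == "smaller" else not left_is_smaller
    if build_left:
        pairs = hash_join(left, right)
    else:
        # hash_join returns (right_idx, left_idx) here, so swap them back
        pairs = [(l, r) for r, l in hash_join(right, left)]
    return sorted(pairs)
```

Timing `join(a, b, "smaller")` against `join(a, b, "larger")` with `timeit` on inputs of very different sizes would be one way to reproduce the comparison outside of MonetDB, although the numbers need not match what happens inside the kernel.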
In which cases is hashing on the larger bat a good option?
Cheers,
Roberto