Hi Roberto, thanks for reporting! This need detailed profiling to understand what's happening. Why are 16M look-ups to a 1-tuple hash table 500x more expensive than building a 16M hash table (assuming that is it indeed built on the fly, i.e., does not already exist), and then performing a single look-up ... ? Stefan ----- On May 11, 2015, at 6:30 PM, Roberto Cornacchia roberto.cornacchia@gmail.com wrote:
Hi there,
Triggered by a suspiciously slow join, I had a look at BATsubjoin, in particular at how it is decided which algorithm to use.
I looked at a couple of cases with a join on strings. This is one (pseudo syntax):
l.count ~16M l.T.sorted = 0 l.T.revsorted = 0
r.count = 1 r.T.sorted = 1 r.T.revsorted = 1
In this case, a hash table on r is created (because it's smaller). This join takes 430ms . I forced swapping l and r, thus built the hash table on the larger bat, and then it takes 0.8ms .
Notice that though r.count = 1 is a bit of an extreme case, it does happen often in practice. And the case with a few tuples would not be very different.
The question I wanted to ask is:
is it always right to build the hash table on the small table? Perhaps some more heuristics can help? Like: if the smaller is very small, and the larger is not extremely large, then hash on the larger one?
Roberto
_______________________________________________ developers-list mailing list developers-list@monetdb.org https://www.monetdb.org/mailman/listinfo/developers-list
-- | Stefan.Manegold@CWI.nl | DB Architectures (DA) | | www.CWI.nl/~manegold/ | Science Park 123 (L321) | | +31 (0)20 592-4212 | 1098 XG Amsterdam (NL) |