Re: Hashjoin performance with large vs small tables

11 May 2015

      ----- On May 11, 2015, at 6:36 PM, Roberto Cornacchia roberto.cornacchia@gmail.com wrote:
...
Correction:
This join takes 430ms.
I forced swapping l and r, so the hash table was built on the larger BAT, and then
it takes 0.8ms.
It takes 0.8ms the second time.
The first time, it needs to create the hash table, and then it takes about 30ms.
Still, much better than 430ms.
OK, but the question remains: where does this difference come from?
...
Also, those 430ms are not invested. The second time will still take 430ms. So
hashing on a very small bat is never a good investment. On the contrary,
hashing on a larger (but not too much larger) table is a good investment. The next
time a similar query comes in, it will be sub-millisecond.
Well, this is a trade-off that is in general hard to judge.
If the bigger table / BAT is a base table/BAT, the hash table will (nowadays)
be made persistent and *could* be reused --- whether it indeed will be reused,
we cannot predict. If the bigger table is a transient intermediate result,
re-use is unlikely ...

Having said that, is your smaller table a base table or an intermediate result
that is (might be) a tiny slice of a large (huge) base table?
In that case, the current code might build the hash on the entire parent BAT rather than on
the tiny slice ...
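
The cost difference hinted at here can be sketched in Python as follows. This
is an assumption-laden illustration, not MonetDB code: SliceView,
hash_on_parent, hash_on_slice and lookup are hypothetical names, and a list
stands in for the parent BAT. It only shows that hashing the parent does work
proportional to the parent's size, even though both variants answer lookups
within the slice identically.

```python
# Illustrative sketch only; not MonetDB code. All names are hypothetical.
from dataclasses import dataclass


@dataclass
class SliceView:
    """A contiguous window [start, start + count) over a parent column."""
    parent: list
    start: int
    count: int

    def values(self):
        return self.parent[self.start:self.start + self.count]


def build_hash(values, base=0):
    """Map each value to the positions (offset by `base`) where it occurs."""
    table = {}
    for pos, v in enumerate(values, start=base):
        table.setdefault(v, []).append(pos)
    return table


def hash_on_parent(view):
    """Hash the whole parent; cost grows with len(view.parent)."""
    return build_hash(view.parent)


def hash_on_slice(view):
    """Hash only the slice; cost grows with view.count."""
    return build_hash(view.values(), base=view.start)


def lookup(table, view, value):
    """Return matching parent positions that fall inside the view."""
    end = view.start + view.count
    return [p for p in table.get(value, []) if view.start <= p < end]


parent = [i % 1000 for i in range(100_000)]
view = SliceView(parent, start=500, count=3)  # holds the values 500, 501, 502

parent_table = hash_on_parent(view)  # 100,000 insertions
slice_table = hash_on_slice(view)    # 3 insertions
```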

Also: Which version of MonetDB are we talking about?

Stefan

-- 
| Stefan.Manegold@CWI.nl | DB Architectures   (DA) |
| www.CWI.nl/~manegold/  | Science Park 123 (L321) |
| +31 (0)20 592-4212     | 1098 XG Amsterdam  (NL) |

Stefan Manegold