Thanks Lefteris,
I am aware that what I encounter with my data / queries can differ from typical TPC-H queries.
Perhaps simplistic, but what seems to work rock-solid for us, at the moment, is this additional rule before the last one:
    } else if (rcount < 128 && lpcount > rpcount) {
        /* Spinque-specific: prefer to hash on the larger bat when the other one is very small */
        swap = 1;
        reason = "left is very small";
    } else if (lpcount < rpcount) {
        /* no hashes, not sorted, create hash on smallest BAT */
        swap = 1;
        reason = "left is smaller";
    }
The constant 128 was admittedly chosen quite arbitrarily; it could be lower. The joins where we most often see this pattern have sizes like 100M-500M tuples against 1-5 tuples. All of these joins perform consistently better (~5x) with this rule in place.
That is also why being able to reuse the investment made in hashing the larger BAT would be so important (the other question).
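For anyone who wants to try the rule outside the join code, here is a minimal standalone sketch of the swap decision. The function name should_swap and the constant SMALL_SIDE_LIMIT are made up for illustration; the three counts are passed in as plain parameters mirroring the variables in the snippet above. It is not the actual MonetDB code, only the two branches in isolation.

```c
#include <stddef.h>

/* Hypothetical threshold; 128 is the arbitrary value from the rule above. */
#define SMALL_SIDE_LIMIT 128

/*
 * Hypothetical helper, not part of MonetDB.
 * Returns 1 if the join inputs should be swapped, 0 otherwise.
 * The parameters mirror rcount, lpcount and rpcount in the snippet.
 */
int
should_swap(size_t rcount, size_t lpcount, size_t rpcount)
{
    if (rcount < SMALL_SIDE_LIMIT && lpcount > rpcount) {
        /* one side is very small: prefer to hash on the larger one */
        return 1;
    }
    if (lpcount < rpcount) {
        /* fallback rule: create the hash on the smallest input */
        return 1;
    }
    return 0;
}
```

With our typical sizes, should_swap(3, 100000000, 3) takes the first branch and returns 1, while two inputs of comparable size only swap when the left one is smaller.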
NB: I am not suggesting that our heuristics should become standard in MonetDB - only that perhaps it is not such an uncommon pattern and it could be worth some more thought.
Roberto