This is a question about join order. In general, about how it changed from
Nov2019 to Jun2020 releases.
In particular, with respect to custom joins (filter functions).
With a schema:
CREATE TABLE t1(s string);
CREATE TABLE t2(s string);
Consider the following 2 queries, which only differ for having swapped
conditions:
Q1:
SELECT t1.s, t2.s
FROM t1, t2
WHERE t1.s <> t2.s
AND [t1.s] maxlevenshtein [t2.s, 1];
Q2:
SELECT t1.s, t2.s
FROM t1, t2
WHERE [t1.s] maxlevenshtein [t2.s, 1]
AND t1.s <> t2.s;
[t1.s] maxlevenshtein [t2.s, 1] is equivalent to levenshtein(t1.s, t2.s) <=1
(i.e. the two strings have a levenshtein distance at most 1)
This is a relatively expensive and selective function.
In Nov2019, both Q1 and Q2 are translated to:
- "maxlevenshtein" custom join
- "!=" selection on the result
In Jun2020, the two queries happen to be evaluated in the same order as
they are written. Which means that Q2 is evaluated as:
- "!=" join
- "maxlevenshtein" selection
This last evaluation plan is unfortunately not viable at all. The first
join is not very selective, and the "maxlevenshtein" selection is run on
way too many pairs, without the optimizations that can be exploited in a
join (in a join, it is possible to skip the actual levenshtein computation
for most combinations). Q2 in Jun2020 is 2 orders of magnitude slower than
Q1, which quickly leads to unreasonably long query times.
Of course, this is just one specific case. A very unfortunate one, due to
the combination of a couple of factors:
- The "!=" join is not selective enough
- The custom function is an expensive one
I guess my real questions are:
- Is it by chance (or, as a by-product of more generic join ordering rules)
that Nov2019 executes custom joins first in both Q1 and Q2, or was it an
intentional choice to first execute custom joins?
- What would a reasonable approach be?
Is it reasonable to assume that if one writes a custom join, this is
expected to use an expensive comparison function and that the join
implementation can be much more efficient than the selection implementation
(by skipping unnecessary comparisons)?
If no assumptions can be made, can there be a way to annotate custom
implementations with information on selectivity and cost?
Thanks for you input,
Roberto