Roberto Cornacchia writes:
> [..] PS. It's actually not very difficult to write it yourself if you feel like it. There is a template in the code.
Interesting. I think we will go for this.
We're working on it. Fortunately, it seems rather easy to plug in a custom optimizer, which is rather nice.

To give a recap, we would like to optimize away the following case:

    big := <big BAT>
    a := bat.new(..)
    a := bat.append(a, big)
    .. from this point use `a', and no longer use `big' ..

However, it is not as trivial as we thought, because to eliminate the (apparently) useless bat.new/bat.append pair, we need to know that the BAT being processed, for which we want to avoid a costly copy, is not used elsewhere. So we decided to whitelist some MAL functions which are known to produce a new BAT rather than return one of their arguments. This is obviously not nice, and we might make mistakes when deciding whether such functions are really functional or not (i.e. whether they mutate or return one of the input BATs). We need to review the code of each function to ensure that.

A better solution, but much more complicated for us, and one that would require major changes, would be to postpone the operation until really needed by using a copy-on-write mechanism. That is, in the example above, when the bat.append function is called, it sees that it is trying to append to an empty BAT, in which case it would return a BAT *proxy* that could act as a regular BAT. Then whenever an attempt is made to change either of these BATs (`a` or `big`), a copy would be triggered. This way the copy would occur only when needed. Easier said than done, I know. (Especially when not knowing all the intricacies of the internal processing of BATs.)

Likewise, we see another optimization opportunity with a query like this:

    SELECT our_computation(some_col) FROM some_table LIMIT 10;

which produces the following MAL code:

    t := ...
    u := udf.complex_computation(t)   # process 10M rows
    v := algebra.subslice(u, 0, 9)
    w := algebra.leftfetchjoin(v, u)  # get only the first ten rows!

This code is not optimal because we could have processed 10 rows instead of millions if our custom operator was run after the slice.
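To make the difference in work concrete, here is a small Python model of the two evaluation orders. It is only an illustration, not MonetDB code: plain lists stand in for BATs, and `CountingUdf`, `compute_then_slice` and `slice_then_compute` are hypothetical names. It also assumes the custom operator is row-wise, meaning each output row depends on exactly one input row.

```python
class CountingUdf:
    """Hypothetical row-wise operator that counts how many rows it processes."""

    def __init__(self):
        self.calls = 0

    def __call__(self, value):
        self.calls += 1
        return value * value + 1


def compute_then_slice(udf, column, limit):
    # Mirrors the generated plan: run the operator on every row, then keep `limit` rows.
    return [udf(v) for v in column][:limit]


def slice_then_compute(udf, column, limit):
    # Mirrors the desired plan: keep `limit` rows first, then run the operator on them.
    return [udf(v) for v in column[:limit]]


if __name__ == "__main__":
    column = list(range(1000))
    before, after = CountingUdf(), CountingUdf()
    same = compute_then_slice(before, column, 10) == slice_then_compute(after, column, 10)
    print(same, before.calls, after.calls)
```

Both orders return the same rows only because the operator is row-wise. An operator that sorts, aggregates or filters would not commute with the slice.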
But to figure that out, we would need a way to put attributes on functions to tell whether the order of the rows or the number of rows matters for the operation. Knowing that, the code could be rewritten as:

    t := ...
    v := algebra.subslice(t, 0, 9)
    w := algebra.leftfetchjoin(v, t)  # get the first ten rows
    u := udf.complex_computation(w)   # process 10 rows only

As a workaround we tried to write the query as:

    SELECT our_computation(some_col) FROM (SELECT some_col FROM some_table LIMIT 10) as q;

But it seems that LIMIT is not supported in a subselect.

We could also write an optimizer for that (and I think we will), again by whitelisting functions. But overall, what is missing is a way to flag functions to let the optimizer know more about them.

-- 
Frédéric Jolliton
SecurActive
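For illustration, the whitelist idea applied to the bat.new/bat.append case could look roughly like the Python sketch below. It is a toy model under stated assumptions, not the MonetDB optimizer API: `Instr`, `FRESH_RESULT` and `eliminate_append_copy` are hypothetical names, a plan is just a list of `result := func(args)` records, and flagging `udf.complex_computation` as returning a fresh BAT is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """One toy MAL-like instruction: result := func(args...)."""
    result: str
    func: str
    args: tuple


# Assumption: functions flagged (whitelisted) as always returning a newly
# allocated BAT, never one of their arguments.
FRESH_RESULT = {"udf.complex_computation"}


def _is_new_append_pair(ins, nxt):
    # Matches `a := bat.new(..)` immediately followed by `a := bat.append(a, big)`.
    return (
        nxt is not None
        and ins.func == "bat.new"
        and nxt.func == "bat.append"
        and nxt.result == ins.result
        and len(nxt.args) == 2
        and nxt.args[0] == ins.result
    )


def eliminate_append_copy(plan):
    """Drop new/append pairs whose source BAT is fresh and used nowhere else."""
    producer = {}  # variable -> function that defined it (in the output plan)
    rename = {}    # eliminated variable -> variable that replaces it
    out = []
    i = 0
    while i < len(plan):
        ins = plan[i]
        nxt = plan[i + 1] if i + 1 < len(plan) else None
        if _is_new_append_pair(ins, nxt):
            big = nxt.args[1]
            # Conservative checks: `big` is read only by this append and is
            # assigned exactly once in the whole plan.
            uses = sum(p.args.count(big) for p in plan)
            defs = sum(1 for p in plan if p.result == big)
            if producer.get(big) in FRESH_RESULT and uses == 1 and defs == 1:
                rename[ins.result] = big
                i += 2  # skip both bat.new and bat.append
                continue
        args = tuple(rename.get(a, a) for a in ins.args)
        out.append(Instr(ins.result, ins.func, args))
        producer[ins.result] = ins.func
        rename.pop(ins.result, None)  # variable redefined: stop renaming it
        i += 1
    return out
```

On a plan shaped like the first example in this message, the pair disappears and later readers of `a` read `big` directly. If `big` is read anywhere else, or comes from a function that is not flagged, the plan is left unchanged.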