Re: MonetDB: Oct2014 - Reimplemented rangejoin, now using imprints o...
Hi Sjoerd,
This is very welcome!
If I read well, this will use, in order of preference:
1. binary search (if l is sorted)
2. imprints if available
3. nested loop otherwise
We use rangejoin extensively within Spinque and the previous one (just
nested loop) has never an option.
So far we have been using our own version which is perhaps naive but proved
to be effective:
- sort all inputs if not sorted already
- perform a mergejon-like fast scan
Though simple, I have not found so far a case where this strategy would not
outperform nested loops by far. The cost of sorting was always far less
than the cost of a full nested loop.
I was wondering what your thoughts about this would be. Could this strategy
replace the number 3 above? Does it make sense to keep the vanilla nested
loop?
Best, Roberto
On 27 January 2015 at 14:09, Sjoerd Mullender
Changeset: 5147add3bb38 for MonetDB URL: http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=5147add3bb38 Modified Files: gdk/ChangeLog.Oct2014 gdk/gdk_join.c gdk/gdk_private.h gdk/gdk_select.c Branch: Oct2014 Log Message:
Reimplemented rangejoin, now using imprints or binary search if possible.
On 28/01/15 14:49, Roberto Cornacchia wrote:
Hi Sjoerd,
This is very welcome!
If I read well, this will use, in order of preference: 1. binary search (if l is sorted) 2. imprints if available
No, if *possible*. I.e. we make then if needed.
3. nested loop otherwise
Correct.
We use rangejoin extensively within Spinque and the previous one (just nested loop) has never an option. So far we have been using our own version which is perhaps naive but proved to be effective: - sort all inputs if not sorted already - perform a mergejon-like fast scan
Though simple, I have not found so far a case where this strategy would not outperform nested loops by far. The cost of sorting was always far less than the cost of a full nested loop.
I was wondering what your thoughts about this would be. Could this strategy replace the number 3 above? Does it make sense to keep the vanilla nested loop? Best, Roberto
Our next step will likely be that we do an on-the-fly sort of the left input and then use the binary search version. A back-of-the-envelope (*) calculation tells us that this will be faster than doing either an imprints or nested-loop version quite soon. The sort is O(n log n) and the binary search that follows is O(m log n) where n is the size of the left input and m the size of the right input. Both imprints and nested loop are O(n m) (with different constant factors). So the condition has to do with comparing m and log n. But until this is implemented we have no idea about the constant factor that is involved here. But it's likely that we'll initially just always sort the left side unless the right is very short (max 2 or 3 entries?). (*) on a whiteboard ;-)
On 27 January 2015 at 14:09, Sjoerd Mullender
wrote: Changeset: 5147add3bb38 for MonetDB URL: http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=5147add3bb38 Modified Files: gdk/ChangeLog.Oct2014 gdk/gdk_join.c gdk/gdk_private.h gdk/gdk_select.c Branch: Oct2014 Log Message:
Reimplemented rangejoin, now using imprints or binary search if possible.
_______________________________________________ developers-list mailing list developers-list@monetdb.org https://www.monetdb.org/mailman/listinfo/developers-list
-- Sjoerd Mullender
On 28 January 2015 at 15:12, Sjoerd Mullender
Our next step will likely be that we do an on-the-fly sort of the left input and then use the binary search version. A back-of-the-envelope (*) calculation tells us that this will be faster than doing either an imprints or nested-loop version quite soon. The sort is O(n log n) and the binary search that follows is O(m log n) where n is the size of the left input and m the size of the right input.
That makes sense. The approach I currently use is n logn + m logm + m + n So a bit less efficient than the one you mention. Both imprints and nested
loop are O(n m) (with different constant factors). So the condition has to do with comparing m and log n. But until this is implemented we have no idea about the constant factor that is involved here. But it's likely that we'll initially just always sort the left side unless the right is very short (max 2 or 3 entries?).
Yep. And when their size are comparable, it won't matter much which one you sort. Thanks for the preview, looking forward to the next developments of the rangejoin. Roberto
On 27 January 2015 at 14:09, Sjoerd Mullender
wrote: Changeset: 5147add3bb38 for MonetDB URL: http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=5147add3bb38 Modified Files: gdk/ChangeLog.Oct2014 gdk/gdk_join.c gdk/gdk_private.h gdk/gdk_select.c Branch: Oct2014 Log Message:
Reimplemented rangejoin, now using imprints or binary search if possible.
_______________________________________________ developers-list mailing list developers-list@monetdb.org https://www.monetdb.org/mailman/listinfo/developers-list
-- Sjoerd Mullender
_______________________________________________ developers-list mailing list developers-list@monetdb.org https://www.monetdb.org/mailman/listinfo/developers-list
participants (2)
-
Roberto Cornacchia
-
Sjoerd Mullender