Possible bug in pcre.c
Hi, I was looking at this piece of code (end of pcrejoin loop), and I have the feeling something isn't right. Inside the body of if (nl == 0) a condition over lastl is checked. However, it seems to me that lastl gets a useful value only when nl > 0. If (nl == 0) then lastl must be 0. <loop> { /* actual body */ .... /* track result properties */ if (BATcount(r1) > 0) { if (lastl + 1 != lo) r1->tseqbase = oid_nil; if (nl == 0) { r2->trevsorted = 0; if (lastl > lo) { r1->tsorted = 0; r1->tkey = 0; } else if (lastl < lo) { r1->trevsorted = 0; } else { r1->tkey = 0; } } } APPEND(r1, lo); APPEND(r2, ro); lastl = lo; nl++; }
It's complicated stuff, but I think the code is right. The implementation uses two nested for loops. The outer loop iterates over the right input (since the right input is column of regular expressions, the expressions only get compiled once and matched multiple times). The inner loop iterates over the left input. (This is different from most other implementation of join that iterate over the left input first.) Because of this, we know the right output is sorted. lastl is de last inserted vale in the left output. It is used to maintain the dense and sorted flags of the left output (tseqbase, tsorted, and trevsorted). BATcount(r1) > 0, so this means we've already seen a match, which in turn means that we've been past the assignment "lastl = lo;" just past the if statement. Note that there is no other assignment to lastl (just the initialization to 0 at the top of the function). nl == 0 means that for the current value of the right input (outermost for loop) we haven't yet inserted any matches, but since we're executing where we are, there are matches for that current value and one or more values in the left input. Note that nl is set to 0 inside the outer for loop, and gets incremented inside the inner for loop for each match that was inserted. So, we are now looking at a new value in the right column that has matched a value in the left column. There are three possibilities: the left input is smaller than, equal to, or larger than the last value for the left input that was inserted into the left output. If smaller, the left output is not sorted (smaller value follows larger). Also, we don't know about uniqueness anymore (we only maintain that if the left output is strictly ascending). If equal, the left output is definitely not unique. If larger, the left output is not reverse sorted (larger value follows smaller). Since we're inserting an output for a new iteration of the right input, we know that the right output is not reverse sorted. If nl were greater than 0, we're inserting a subsequent match of the right input, so the right input could still be reverse sorted, though not unique anymore. I hope this clears things up. On 08/04/2019 15.27, Roberto Cornacchia wrote:
Hi,
I was looking at this piece of code (end of pcrejoin loop), and I have the feeling something isn't right. Inside the body of if (nl == 0) a condition over lastlis checked.
However, it seems to me that lastl gets a useful value only when nl > 0. If (nl == 0) then lastl must be 0.
<loop> { /* actual body */ ....
/* track result properties */ if (BATcount(r1) > 0) { if (lastl + 1 != lo) r1->tseqbase = oid_nil; if (nl == 0) { r2->trevsorted = 0; if (lastl > lo) { r1->tsorted = 0; r1->tkey = 0; } else if (lastl < lo) { r1->trevsorted = 0; } else { r1->tkey = 0; } } } APPEND(r1, lo); APPEND(r2, ro); lastl = lo; nl++; }
_______________________________________________ developers-list mailing list developers-list@monetdb.org https://www.monetdb.org/mailman/listinfo/developers-list
-- Sjoerd Mullender
My bad, I rushed into my conclusions.
On Mon, 8 Apr 2019 at 16:47, Sjoerd Mullender
It's complicated stuff, but I think the code is right.
The implementation uses two nested for loops. The outer loop iterates over the right input (since the right input is column of regular expressions, the expressions only get compiled once and matched multiple times). The inner loop iterates over the left input. (This is different from most other implementation of join that iterate over the left input first.) Because of this, we know the right output is sorted.
lastl is de last inserted vale in the left output. It is used to maintain the dense and sorted flags of the left output (tseqbase, tsorted, and trevsorted).
BATcount(r1) > 0, so this means we've already seen a match, which in turn means that we've been past the assignment "lastl = lo;" just past the if statement. Note that there is no other assignment to lastl (just the initialization to 0 at the top of the function).
nl == 0 means that for the current value of the right input (outermost for loop) we haven't yet inserted any matches, but since we're executing where we are, there are matches for that current value and one or more values in the left input. Note that nl is set to 0 inside the outer for loop, and gets incremented inside the inner for loop for each match that was inserted.
So, we are now looking at a new value in the right column that has matched a value in the left column. There are three possibilities: the left input is smaller than, equal to, or larger than the last value for the left input that was inserted into the left output.
If smaller, the left output is not sorted (smaller value follows larger). Also, we don't know about uniqueness anymore (we only maintain that if the left output is strictly ascending). If equal, the left output is definitely not unique. If larger, the left output is not reverse sorted (larger value follows smaller).
Since we're inserting an output for a new iteration of the right input, we know that the right output is not reverse sorted. If nl were greater than 0, we're inserting a subsequent match of the right input, so the right input could still be reverse sorted, though not unique anymore.
I hope this clears things up.
On 08/04/2019 15.27, Roberto Cornacchia wrote:
Hi,
I was looking at this piece of code (end of pcrejoin loop), and I have the feeling something isn't right. Inside the body of if (nl == 0) a condition over lastlis checked.
However, it seems to me that lastl gets a useful value only when nl > 0. If (nl == 0) then lastl must be 0.
<loop> { /* actual body */ ....
/* track result properties */ if (BATcount(r1) > 0) { if (lastl + 1 != lo) r1->tseqbase = oid_nil; if (nl == 0) { r2->trevsorted = 0; if (lastl > lo) { r1->tsorted = 0; r1->tkey = 0; } else if (lastl < lo) { r1->trevsorted = 0; } else { r1->tkey = 0; } } } APPEND(r1, lo); APPEND(r2, ro); lastl = lo; nl++; }
_______________________________________________ developers-list mailing list developers-list@monetdb.org https://www.monetdb.org/mailman/listinfo/developers-list
-- Sjoerd Mullender
_______________________________________________ developers-list mailing list developers-list@monetdb.org https://www.monetdb.org/mailman/listinfo/developers-list
participants (2)
-
Roberto Cornacchia
-
Sjoerd Mullender