Mark,

I think there is a bug in the implementation of the PYTHON_MAP aggregate functions. The return value of a PYTHON_MAP aggregate function can change if there are other functions in the query. The return value of my function weighted_percentile_0 (definition below) changes if I include the function median() in my query. The results are incorrect if median() isn't in the query, but are correct if median() is in the query.

In this test case, the weights (v2) are all 1 and the values (v1) are uniformly sampled integers from [1,10]. The number of rows for each fctr is ~1000, and my function should reduce to min(v1) in this case.

Is some information bleeding between threads? Should I submit a bug report? This is on the default branch.

Thanks,
Dave

sql>select fctr,weighted_percentile_0(v1,v2),min(v1) from mini where fctr < 10 group by fctr order by fctr;
+------+--------------------------+------+
| fctr | L1                       | L2   |
+======+==========================+======+
|    1 |                        3 |    1 |
|    2 |                        2 |    1 |
|    3 |                        2 |    1 |
|    4 |                        1 |    1 |
|    5 |                        3 |    1 |
|    6 |                        3 |    1 |
|    7 |                        2 |    1 |
|    8 |                        1 |    1 |
|    9 |                        2 |    1 |
+------+--------------------------+------+
9 tuples (126.928ms)

sql>select fctr,weighted_percentile_0(v1,v2),min(v1),median(v1) from mini where fctr < 10 group by fctr order by fctr;
+------+--------------------------+------+------+
| fctr | L1                       | L2   | L3   |
+======+==========================+======+======+
|    1 |                        1 |    1 |    5 |
|    2 |                        1 |    1 |    5 |
|    3 |                        1 |    1 |    5 |
|    4 |                        1 |    1 |    5 |
|    5 |                        1 |    1 |    5 |
|    6 |                        1 |    1 |    5 |
|    7 |                        1 |    1 |    5 |
|    8 |                        1 |    1 |    5 |
|    9 |                        1 |    1 |    5 |
+------+--------------------------+------+------+
9 tuples (519.195ms)

CREATE AGGREGATE weighted_percentile_0(a DOUBLE, w DOUBLE) RETURNS DOUBLE LANGUAGE PYTHON_MAP {
    import numpy as np
    # Standardize and sort based on values in a
    q = np.array([0]) / 100.0
    idx = np.argsort(a)
    a_sort = a[idx]
    w_sort = w[idx]
    # Get the cumulative sum of weights
    ecdf = np.cumsum(w_sort)
    # Find the percentile index positions associated with the percentiles
    p = q * (w_sort.sum() - 1)
    # Find the bounding indices (both low and high)
    idx_low = np.searchsorted(ecdf, p, side='right')
    idx_high = np.searchsorted(ecdf, p + 1, side='right')
    idx_high[idx_high > ecdf.size - 1] = ecdf.size - 1
    # Calculate the weights
    weights_high = p - np.floor(p)
    weights_low = 1.0 - weights_high
    # Extract the low/high indexes and multiply by the corresponding weights
    x1 = np.take(a_sort, idx_low) * weights_low
    x2 = np.take(a_sort, idx_high) * weights_high
    wp = np.add(x1, x2)
    return wp[0]
};
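For reference, the same logic in plain NumPy (an illustrative standalone sketch mirroring the aggregate body above, on synthetic data like mini's) does return the minimum when all weights are 1:

import numpy as np

def weighted_percentile_0(a, w):
    # q = 0: with unit weights, p = 0, so idx_low = 0 and weights_low = 1,
    # and the result is simply the smallest value of a.
    q = np.array([0]) / 100.0
    idx = np.argsort(a)
    a_sort = a[idx]
    w_sort = w[idx]
    ecdf = np.cumsum(w_sort)
    p = q * (w_sort.sum() - 1)
    idx_low = np.searchsorted(ecdf, p, side='right')
    idx_high = np.searchsorted(ecdf, p + 1, side='right')
    idx_high[idx_high > ecdf.size - 1] = ecdf.size - 1
    weights_high = p - np.floor(p)
    weights_low = 1.0 - weights_high
    x1 = np.take(a_sort, idx_low) * weights_low
    x2 = np.take(a_sort, idx_high) * weights_high
    return np.add(x1, x2)[0]

a = np.random.randint(1, 10, size=1000).astype(float)  # values like v1
w = np.ones_like(a)                                    # weights like v2
assert weighted_percentile_0(a, w) == a.min()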
The data:

CREATE FUNCTION random_integers(low INTEGER, high INTEGER, amount INTEGER)
RETURNS TABLE(fctr INTEGER, v1 INTEGER, v2 INTEGER, v3 INTEGER)
LANGUAGE PYTHON {
    result = dict()
    result['fctr'] = numpy.random.randint(1, 100, size=(amount,))
    result['v1'] = numpy.random.randint(low, high, size=(amount,))
    result['v2'] = numpy.random.randint(low, high, size=(amount,))
    result['v3'] = numpy.random.randint(low, high, size=(amount,))
    return result
};

create table mini(fctr integer, v1 integer, v2 integer, v3 integer);
insert into mini select * from random_integers(1,10,100000);
update mini set v2 = 1;
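A quick illustrative check of the bucket sizes (fctr is uniform on [1,99], so each group should get roughly 1000 of the 100000 rows):

sql>select fctr, count(*) from mini group by fctr order by fctr;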
On 6 Jul 2016, at 23:17, Mark Raasveldt wrote:

Hey Anderson,
Indeed, it looks like a bug. It is not related to information bleeding between threads, though; it is related to mitosis, the optimizer pass that splits large columns into chunks so they can be processed in parallel. The reason adding the median call fixes the issue is that the median call disables mitosis. I will look into it some more later.
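In the meantime, a possible workaround (an untested sketch) is to switch your session to an optimizer pipeline that leaves mitosis out:

set optimizer = 'no_mitosis_pipe';
select fctr, weighted_percentile_0(v1,v2), min(v1)
from mini where fctr < 10 group by fctr order by fctr;
-- switch back afterwards
set optimizer = 'default_pipe';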
Regards,
Mark
On 6 Feb 2017, at 05:12, Ying Zhang wrote:

I just tried the queries, and the problem (with mitosis) seems to have been resolved in the meantime. With both Dec2016 and default, I have:

sql>select fctr,weighted_percentile_0(v1,v2),min(v1) from mini where fctr < 10 group by fctr order by fctr;
+------+--------------------------+------+
| fctr | L3                       | L6   |
+======+==========================+======+
|    1 |                        1 |    1 |
|    2 |                        1 |    1 |
|    3 |                        1 |    1 |
|    4 |                        1 |    1 |
|    5 |                        1 |    1 |
|    6 |                        1 |    1 |
|    7 |                        1 |    1 |
|    8 |                        1 |    1 |
|    9 |                        1 |    1 |
+------+--------------------------+------+

sql>select fctr,weighted_percentile_0(v1,v2),min(v1),median(v1) from mini where fctr < 10 group by fctr order by fctr;
+------+--------------------------+------+------+
| fctr | L3                       | L6   | L11  |
+======+==========================+======+======+
|    1 |                        1 |    1 |    5 |
|    2 |                        1 |    1 |    5 |
|    3 |                        1 |    1 |    5 |
|    4 |                        1 |    1 |    5 |
|    5 |                        1 |    1 |    5 |
|    6 |                        1 |    1 |    5 |
|    7 |                        1 |    1 |    5 |
|    8 |                        1 |    1 |    5 |
|    9 |                        1 |    1 |    5 |
+------+--------------------------+------+------+

When I forget to start mserver5 with ‘--set embedded_py=true’, the error message given by Dec2016 could be improved though, because it says:

sql>TypeException:user.s8_1[9]:'pyapi.eval' undefined in: (X_10:bat[:int],X_13:bat[:int],X_14:bat[:int],X_15:bat[:int]) := pyapi.eval(0x7fc01b928630:ptr,"{\n result = dict()\n result['fctr'] = numpy.random.randint(1, 100, size=(amount,))\n result['v1'] = numpy.random.randint(low, high, size=(amount,))\n result['v2'] = numpy.random.randint(low, high, size=(amount,))\n result['v3'] = numpy.random.randint(low, high, size=(amount,))\n return result\n};":str,X_5:int,X_7:int,X_9:int);

while the error message given by ‘default’ is much more informative:

"Embedded Python has not been enabled. Start server with --set embedded_py=true"

Regards,
Jennie
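P.S. For anyone following along, enabling embedded Python just means passing that option when starting the server, e.g. (the dbpath here is only an example):

mserver5 --dbpath=/path/to/dbfarm/demo --set embedded_py=true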
Ok, the story continues. I was curious about the performance of weighted_percentile_0 for larger data, but then I stumbled upon another error (Dec2016):

$ mclient -s 'delete from mini; insert into mini select * from random_integers(1,10,10000000); update mini set v2 = 1;'
10000000 affected rows
10000000 affected rows
10000000 affected rows

$ mclient -s 'select fctr,weighted_percentile_0(v1,v2),min(v1) from mini where fctr < 10 group by fctr order by fctr;'
b and g must be aligned

It looks mitosis-related again, because adding a median() at least makes it possible to get results:

$ mclient -s 'select fctr,weighted_percentile_0(v1,v2),min(v1), median(v1) from mini where fctr < 10 group by fctr order by fctr;'
+------+--------------------------+------+------+
| fctr | L3                       | L6   | L11  |
+======+==========================+======+======+
|    1 |                        1 |    1 |    5 |
|    2 |                        1 |    1 |    5 |
|    3 |                        1 |    1 |    5 |
|    4 |                        1 |    1 |    5 |
|    5 |                        1 |    1 |    5 |
|    6 |                        1 |    1 |    5 |
|    7 |                        1 |    1 |    5 |
|    8 |                        1 |    1 |    5 |
|    9 |                        1 |    1 |    5 |
+------+--------------------------+------+------+
9 tuples

Regards,
Jennie
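P.S. If the parallel execution of PYTHON_MAP is indeed the culprit here, a possible interim workaround (an untested sketch; the _seq name is just an example) is to declare the same aggregate with plain LANGUAGE PYTHON, which, as far as I understand, is not split up by mitosis, at the cost of single-threaded execution:

CREATE AGGREGATE weighted_percentile_0_seq(a DOUBLE, w DOUBLE) RETURNS DOUBLE LANGUAGE PYTHON {
    import numpy as np
    # Identical logic to weighted_percentile_0 above, fixed at q = 0.
    idx = np.argsort(a)
    a_sort = a[idx]
    w_sort = w[idx]
    ecdf = np.cumsum(w_sort)
    p = np.array([0.0]) * (w_sort.sum() - 1)
    idx_low = np.searchsorted(ecdf, p, side='right')
    idx_high = np.searchsorted(ecdf, p + 1, side='right')
    idx_high[idx_high > ecdf.size - 1] = ecdf.size - 1
    weights_high = p - np.floor(p)
    x1 = np.take(a_sort, idx_low) * (1.0 - weights_high)
    x2 = np.take(a_sort, idx_high) * weights_high
    return np.add(x1, x2)[0]
};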
participants (3)
- Anderson, David B
- Mark Raasveldt
- Ying Zhang