Python Aggregate functions

6 Jul 2016

      Mark,

I think there is a bug in the implementation of the PYTHON_MAP aggregate functions. 

The return value of a PYTHON_MAP aggregate function can change if there are other functions in the query.

The return value of my function weighted_percentile_0 (definition below) changes if I include the function median() in my query. The results are incorrect if median() isn't in the query, but are correct if median() is in the query. In this test case, the weights (v2) are all 1 and the values (v1) are uniformly sampled integers from [1,10]. Number of rows for each fctr is ~1000, and my function should reduce to min(v1) in this case. 

Is some information bleeding between threads? Should I submit a bug report? This is on the default branch.

Thanks,
Dave

sql>select fctr,weighted_percentile_0(v1,v2),min(v1) from mini where fctr < 10 group by fctr order by fctr;
+------+--------------------------+------+
| fctr | L1                       | L2   |
+======+==========================+======+
|    1 |                        3 |    1 |
|    2 |                        2 |    1 |
|    3 |                        2 |    1 |
|    4 |                        1 |    1 |
|    5 |                        3 |    1 |
|    6 |                        3 |    1 |
|    7 |                        2 |    1 |
|    8 |                        1 |    1 |
|    9 |                        2 |    1 |
+------+--------------------------+------+
9 tuples (126.928ms)
sql>select fctr,weighted_percentile_0(v1,v2),min(v1),median(v1) from mini where fctr < 10 group by fctr order by fctr;
+------+--------------------------+------+------+
| fctr | L1                       | L2   | L3   |
+======+==========================+======+======+
|    1 |                        1 |    1 |    5 |
|    2 |                        1 |    1 |    5 |
|    3 |                        1 |    1 |    5 |
|    4 |                        1 |    1 |    5 |
|    5 |                        1 |    1 |    5 |
|    6 |                        1 |    1 |    5 |
|    7 |                        1 |    1 |    5 |
|    8 |                        1 |    1 |    5 |
|    9 |                        1 |    1 |    5 |
+------+--------------------------+------+------+
9 tuples (519.195ms)
sql>

CREATE AGGREGATE weighted_percentile_0(a DOUBLE, w DOUBLE)
RETURNS DOUBLE 
LANGUAGE PYTHON_MAP {

    import numpy as np      

    # Standardize and sort based on values in a
    q = np.array([0]) / 100.0

    idx = np.argsort(a)

    a_sort = a[idx]
    w_sort = w[idx]

    # Get the cumulative sum of weights
    ecdf = np.cumsum(w_sort)

    # Find the percentile index positions associated with the percentiles
    p = q * (w_sort.sum() - 1)

    # Find the bounding indices (both low and high)
    idx_low = np.searchsorted(ecdf, p, side='right')
    idx_high = np.searchsorted(ecdf, p + 1, side='right')
    idx_high[idx_high > ecdf.size - 1] = ecdf.size - 1

    # Calculate the weights 
    weights_high = p - np.floor(p)
    weights_low = 1.0 - weights_high

    # Extract the low/high indexes and multiply by the corresponding weights
    x1 = np.take(a_sort, idx_low) * weights_low
    x2 = np.take(a_sort, idx_high) * weights_high

    wp = np.add(x1,x2)

    return(wp[0])
};

Anderson, David B

Anderson, David B

Mark Raasveldt

Ying Zhang

Ying Zhang

tags

participants (3)