As of the Apr2019 release of MonetDB, we have extended MonetDB’s support for the SQL window functions to cover the majority as specified by the 2011 revision of the SQL standard. In this blogpost we describe the new functionality together with some example SQL queries to show their usage.
Besides the existing RANK(), DENSE_RANK() and ROW_NUMBER() functions, we have now implemented all the remaining analytic functions listed in the SQL standard:
PERCENT_RANK() : DOUBLE - Calculates the relative rank of the current row: (rank() - 1) / (rows in partition - 1)
CUME_DIST() : DOUBLE - Calculates the cumulative distribution of a value within a set of values: (number of rows with values less than or equal to this row’s value) / (total number of rows in this partition).
NTILE(nbuckets BIGINT) : BIGINT - Enumerates rows from 1 in each partition, dividing it in the most equal way possible.
LAG(input A [, offset BIGINT [, default_value A ] ]) : A - Returns input value at row “offset” before the current row in the partition. If the offset row does not exist, then the “default_value” is output. By default “offset” is 1 and “default_value” is NULL.
LEAD(input A [, offset BIGINT [, default_value A ] ]) : A - Returns input value at row “offset” after the current row in the partition. If the offset row does not exist, then the “default_value” is output. By default “offset” is 1 and “default_value” is NULL.
FIRST_VALUE(input A) : A - Returns input value at first row of the window frame.
LAST_VALUE(input A) : A - Returns input value at last row of the window frame.
NTH_VALUE(input A, nth BIGINT) : A - Returns input value at “nth” row of the window frame. If there is no “nth” row in the window frame, then NULL is returned.
We have extended our existing aggregate functions to support aggregate window functions:
MIN(input A) : A,
MAX(input A) : A,
COUNT(*) : BIGINT,
COUNT(input A) : BIGINT,
SUM(input A) : A,
PROD(input A) : A, and
AVG(input A) : DOUBLE.
Window functions now support frame specifications from the SQL standard. The fully implemented SQL grammar is listed below:
window_function_call:
{ window_aggregate_function | window_rank_function } OVER
{ ident | '(' window_specification ')' }
window_aggregate_function:
AVG '(' query_expression ')'
| COUNT '(' { '*' | query_expression } ')'
| MAX '(' query_expression ')'
| MIN '(' query_expression ')'
| PROD '(' query_expression ')'
| SUM '(' query_expression ')'
window_rank_function:
CUME_DIST '(' ')'
| DENSE_RANK '(' ')'
| FIRST_VALUE '(' query_expression ')'
| LAG '(' query_expression [ ',' query_expression [ ',' query_expression ] ] ')'
| LAST_VALUE '(' query_expression ')'
| LEAD '(' query_expression [ ',' query_expression [ ',' query_expression ] ] ')'
| NTH_VALUE '(' query_expression ',' query_expression ')'
| NTILE '(' query_expression ')'
| PERCENT_RANK '(' ')'
| RANK '(' ')'
| ROW_NUMBER '(' ')'
window_specification:
[ ident ] [ PARTITION BY column_ref [ ',' ... ] ] [ ORDER BY sort_spec ]
[ { ROWS | RANGE | GROUPS }
{ window_frame_start | BETWEEN window_bound AND window_bound }
[ EXCLUDE { CURRENT ROW | GROUP | TIES | NO OTHERS } ] ]
window_frame_start:
UNBOUNDED PRECEDING
| query_expression PRECEDING
| CURRENT ROW
window_bound:
UNBOUNDED FOLLOWING
| query_expression FOLLOWING
| UNBOUNDED PRECEDING
| query_expression PRECEDING
| CURRENT ROW
The supported frames are: ROWS, RANGE and GROUPS.
ROWS - Frames are calculated on physical offsets of input rows.
RANGE - Result frames are calculated on value differences from input rows (used with a custom PRECEDING or FOLLOWING bound requires an ORDER BY clause).
GROUPS - Groups of equal row values are used to calculate result frames (requires an ORDER BY clause).
After a window frame declaration, the window bounds must be specified (the window function will be applied to each frame derived from each row in the input). If “window_frame_start” bound is provided, then the frame’s end will be set to CURRENT ROW. An UNBOUNDED PRECEDING bound means the first row of a partition, while an UNBOUNDED FOLLOWING means the last row of a partition. In “query_expression PRECEDING” (i.e. frame rows before the current row) and “query_expression FOLLOWING” (i.e. frame rows after the current row) bounds, the “query_expression” can evaluate to a single atom (use the same bound for every input row), or a column (use a different bound for each input row). In either case, every “query_expression” value must be non-negative and non-NULL, as negative and NULL bounds are not defined for SQL window functions. CURRENT ROW is equivalent to 0 PRECEDING and 0 FOLLOWING on either side of the bound.
The SQL standard allows an EXCLUDE clause after the bounds definition. At the moment only EXCLUDE NO OTHERS (i.e. default one) is implemented, which means all rows in the window frame are used for computation of the analytic function.
The frame specification has been implemented for aggregation functions, as well as the functions FIRST_VALUE, LAST_VALUE and NTH_VALUE . The default frame specification is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW when there is an ORDER BY clause, and RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING when no ORDER BY clause is present.
CREATE TABLE analytics (col1 int, col2 int);
INSERT INTO analytics VALUES
(15, 3), (3, 1), (2, 1), (5, 3), (NULL, 2),
(3, 2), (4, 1), (6, 3), (8, 2), (NULL, 4);
SELECT PERCENT_RANK() OVER (ORDER BY col1) FROM analytics;
L4 |
---|
0 |
0 |
0.22222222222222222 |
0.33333333333333333 |
0.33333333333333333 |
0.55555555555555556 |
0.66666666666666666 |
0.77777777777777778 |
0.88888888888888888 |
1 |
SELECT FIRST_VALUE(col1) OVER (PARTITION BY col2) FROM analytics;
L4 |
---|
3 |
3 |
3 |
null |
null |
null |
15 |
15 |
15 |
null |
SELECT COUNT(col1) OVER (ORDER BY col2 DESC RANGE UNBOUNDED PRECEDING)
FROM analytics
L4 |
---|
0 |
3 |
3 |
3 |
5 |
5 |
5 |
8 |
8 |
8 |
SELECT AVG(col1) OVER
(ORDER BY col2 GROUPS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
FROM analytics;
L4 |
---|
3 |
3 |
3 |
4 |
4 |
4 |
5.75 |
5.75 |
5.75 |
5.75 |
We have also implemented interval boundaries for date and time(stamp) columns on RANGE frames:
CREATE TABLE timetable (col1 timestamp, col2 int);
INSERT INTO timetable VALUES
('2017-01-01', 3), ('2017-02-02', 1), ('2017-03-03', 1),
('2017-04-04', 3), (NULL, 2), ('2017-06-06', 2), ('2017-07-07', 1),
('2017-08-08', 3), ('2017-09-09', 2), (NULL, 4);
SELECT SUM(col2) OVER
(ORDER BY col1 RANGE BETWEEN INTERVAL '1' MONTH PRECEDING
AND INTERVAL '3' MONTH FOLLOWING)
FROM timetable;
L4 |
---|
6 |
6 |
5 |
5 |
4 |
5 |
6 |
6 |
5 |
2 |
We have also added support for the WINDOW keyword. If a same window specification is used multiple times in a SELECT clause, one can define an alias for this window specification to avoid repeating the same window definition. Such aliases can be defined using the new WINDOW keyword in a FROM clause. In the query below, the definitions of the aliases “w1” and “w2” show how different aliases can be defined for different window specifications on one table. The definition of the alias “w3” shows that different aliases can be defined for the same window specification. Finally, all aliases can be subsequently used in the SELECT clause.
SELECT COUNT(*) OVER w1, PROD(col1) OVER w2,
SUM(col1) OVER w1, AVG(col2) OVER w2, MAX(col2) OVER w3
FROM analytics WINDOW
w1 AS (ROWS BETWEEN 5 PRECEDING AND 0 FOLLOWING),
w2 AS (RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING),
w3 AS (w2);
Several things about the implementation are worth being pointed out:
CREATE TABLE ranktest (id INT, k STRING);
INSERT INTO ranktest VALUES (1061,'a'), (1062,'b'), (1062,'c'), (1061,'d');
SELECT ROW_NUMBER() OVER (PARTITION BY id), id FROM ranktest;
--Output before
L4 | id |
---|---|
1 | 1062 |
2 | 1062 |
1 | 1061 |
2 | 1061 |
--Output now
L4 | id |
---|---|
1 | 1061 |
2 | 1061 |
1 | 1062 |
2 | 1062 |
The DISTINCT clause on aggregate windowing functions is not implemented yet.
The SQL standard has defined the options RESPECT NULLS and IGNORE NULLS for the functions LEAD, LAG, FIRST_VALUE, LT_VALUE, and NTH_VALUE. These options have not been implemented yet, thus RESPECT NULLS (default behaviour) is always applied.
The SQL standard has also defined the options FROM FIRST and FROM LAST for the function NTH_VALUE. These options are not implemented yet, thus FROM FIRST (default behaviour) is always implied. However the FROM LAST option can be easily mimicked by reversing the order of the input.
The new extended SQL Window Functions is developed by Pedro Ferreira, a software developer at MonetDB Solutions.