To keep up with growing data sizes and workload, applications must be scaled either vertically or horizontally. For the first case, the machine can be equipped with more hardware resources, while for the second case, more instances of the same application can be used simultaneously to serve more users, and/or process more data. The latter scenario brings liability to coordinate all the instances of an application together.
Many out-of-box solutions have been developed for horizontal scaling in the application layer, such as Apache Hadoop. At the same time, relational databases have developed new strategies to handle horizontal scaling by splitting existing SQL tables across multiple database servers.
Previously, we have added the “merge table” concept into MonetDB. It is an abstraction of multiple physical tables in a single logical table. We have also introduced the “remote table” concept, with which one can create a logical table at the local MonetDB server to refer to a table resides in a remote MonetDB server. By including remote tables in a merge table, it is possible to scale a SQL table horizontally in MonetDB. And queries on the merge table will be run in a distributed fashion transparently to the users.
In previous versions, the child tables of a merge table cannot be updated through the merge table. Instead, all updates must be directly performed on each individual child table. For remote tables, this means that one has to connect to the remote MonetDB server to update the physical table there. Clearly, this largely limits the usability of merge tables and remote tables for large scaling-out applications.
As of the Apr2019 release of MonetDB, we have extended the definition of merge tables with an optional PARTITION BY clause. With this clause, one can specify the distribution of data among the child tables of a merge table. Then we have extended MERGE TABLE to allow updating the child tables of a merge table through the merge table. Newly inserted tuples will be automatically distributed over the child tables according to the partitioning scheme defined for this merge table. Updates and deletes will be automatically propagated to child tables.
In this section we describe the syntax extensions. In the next section we demonstrate how these extensions work using example queries.
To accommodate data insertion, we must know into which child tables the to-be-inserted tuples should go. Therefore, we have added a predicate to a merge table to specify how its data will be partitioned. Below is the extended syntax of MERGE TABLE with the new PARTITION BY clause denoted in bold font:
CREATE MERGE TABLE [ IF NOT EXISTS ] table_name (... columns ...)
[ PARTITION BY { RANGE | VALUES }
{ ON '(' column_name ')' | USING '(' expression ')' } ]
In the partition scheme definition, two properties must be specified:
How the tuples will be distributed among the child tables: according to a value RANGE or a list of explicitly identified VALUES. We refer to those values as the “partitioning values”.
How the partition scheme is computed: use the value of a single column
(i.e. ON '(' column_name ')'
) or an expression (i.e. USING '(' expression ')'
). We refer to the column as the “partitioning column”, and
the expression as the “partitioning expression”.
Note that in the partition-by-column scenario we allow exactly one column in the partition scheme.
After a MERGE TABLE has been created, one can add child tables into it with the ALTER TABLE statements as shown below. The syntax of ALTER TABLE was extended with a new clause AS PARTITION with which one can specify for a child table which tuples should be stored in this table. A child table can be removed from its parent merge table using an ALTER TABLE… DROP TABLE… statement, whose syntax is the same as in the previous MonetDB versions.
ALTER TABLE [ IF EXISTS ] merge_table_name { ADD | SET } TABLE child_table_name
AS PARTITION IN '(' expression [ ',' ... ] ')' [ WITH NULL VALUES ]
ALTER TABLE [ IF EXISTS ] merge_table_name { ADD | SET } TABLE child_table_name
AS PARTITION FROM { RANGE MINVALUE | expression }
TO { RANGE MAXVALUE | expression } [ WITH NULL VALUES ]
ALTER TABLE [ IF EXISTS ] merge_table_name { ADD | SET } TABLE child_table_name
AS PARTITION FOR NULL VALUES
ALTER TABLE [ IF EXISTS ] merge_table_name
DROP TABLE child_table_name [ RESTRICT | CASCADE ]
One can use the ADD keyword to add a new child table to a merge table. This is the same as in the previous MonetDB versions. However, if the merge table is updatable, the AS PARTITION clause is required; otherwise, this clause is not allowed.
One can use the SET keyword to update the range or the list of values of an existing child table. If this child table already contains some tuples, those tuples will be verified according to the partition definition. If the validation fails, this ALTER TABLE statement will fail.
For a merge table partitioned by values, one can use AS PARTITION IN… to specify a set of values that will be assigned to this child table. The values must be within the domain of the partitioning values of this merge table.
For a merge table partitioned by range, one can use AS PARTITION FROM…TO… to specify a range of values within the domain of the partitioning values of this merge table.
The lower limit is inclusive, while the higher limit exclusive.
Both the value sets and the ranges must be non-overlapping.
If a merge table is partitioned using a column, the distribution of the tuples to the child tables is performed by comparing the values of the partitioning column of the to-be-inserted tuples against the partitioning values assigned to each child table. If a merge table is partitioned using an expression, the expression is evaluated for every insertion and partitioning will be performed based on the output of the evaluation.
In SQL, NULL values have a distinct meaning, hence one has to use a WITH NULL VALUES clause to indicate that this child table will host tuples whose partitioning criterium evaluates NULL.
As partitions must never overlap (ambiguous inserts), an ALTER TABLE statement will fail if its domain conflicts with an existing partition. In addition, the contents of the child table must be within the domain of the parent merge table, otherwise the ALTER TABLE statement will also fail.
Finally, after having added at least one child table, one can run insert, update, delete and truncate queries on a merge table.
Some implementation notes are left here:
A MERGE TABLE created without a PARTITION BY clause is not updatable. To make this MERGE TABLE table updatable, one must delete and recreate it with the desired PARTITION BY clause, and add its child tables.
At the moment, update statements on the partitioning column or columns used by a partitioning expression is not possible. When processing an insert statement, if any of the to-be-inserted tuples does not satisfy any partition predicates, the insertion will fail.
Primary keys are allowed in updatable merge tables only if their corresponding columns match exactly the partitioning column or the columns used in the partitioning expression.
Updating the schema of a merge table, e.g. add or remove keys, and add or remove columns, which already contains one or more child tables is not supported, because such changes are not automatically propagated to the child tables yet, thus doing this will violate the basic constraint that the schema of a merge table and all of its child tables must be exactly the same. To update the schema of a merge table, one has to i) detach the child tables first, ii) modify the schema of both the merge table and its child tables, and iii) rebuild the merge table.
Similar to the previous note, updating the schema of a physical table which is also a child table of a merge table is not supported. To do this, the same detach, modify and rebuild strategy should be used.
All the tables in a merge table’s hierarchy must belong to the same database schema. Also, any user defined functions used in the partitioning expressions of this merge table and all its child tables must belong to this same database.
A merge table may have an unlimited number of child tables as long as the partition domains do not overlap, however a physical table can only be used as a child table of a single merge table.
We allow nested partitioning, i.e. a merge table of merge tables, nonetheless a single table can appear only once in the whole hierarchy, i.e. a cyclic definition is not allowed.
Both COPY INTO and BINARY COPY INTO statements are supported. However, COPY INTO statements from Python loaders to partitioned tables are not supported at the moment.
Floating-point columns and floating-point number returning expressions, including FLOAT, REAL and DOUBLE cannot be used as partition keys at the moment due to their internal representation. Decimal types should be used instead.
Updating a remote table through a merge table is not supported yet. This is not an easy feature to support, because it will require a proper 2-phase commit protocol to guarantee that the updates are correctly processed.
The following queries demonstrate the main features of updatable merge tables.
In the first demo we show a merge table partitioned by range on a column:
CREATE MERGE TABLE splitted (stamp TIMESTAMP, val INT)
PARTITION BY RANGE ON (stamp);
CREATE TABLE first_decade (stamp TIMESTAMP, val INT);
CREATE TABLE second_decade (stamp TIMESTAMP, val INT);
CREATE TABLE third_decade (stamp TIMESTAMP, val INT);
ALTER TABLE splitted ADD TABLE first_decade AS PARTITION
FROM TIMESTAMP '2000-01-01 00:00:00' TO TIMESTAMP '2010-01-01 00:00:00';
ALTER TABLE splitted ADD TABLE second_decade AS PARTITION
FROM TIMESTAMP '2010-01-01 00:00:00' TO TIMESTAMP '2020-01-01 00:00:00';
-- The following query will trigger an error:
-- this range is enclosed by first_decade table.
ALTER TABLE splitted ADD TABLE third_decade AS PARTITION
FROM TIMESTAMP '2005-02-13 01:08:10' TO TIMESTAMP '2007-12-13 00:00:00';
INSERT INTO splitted VALUES
(TIMESTAMP '2000-01-01 00:00:00', 1), (TIMESTAMP '2002-12-03 20:00:00', 2),
(TIMESTAMP '2012-05-12 21:01:00', 3), TIMESTAMP '2019-12-12 23:59:59', 4);
-- The following query will trigger an error:
-- '1972-02-13 01:00:00' is prior to the partition’s minimum value.
INSERT INTO first_decade VALUES
(TIMESTAMP '1972-02-13 01:00:00', 1000),
(TIMESTAMP '2005-02-13 01:00:00', 2000);
-- The following query will trigger an error:
-- both tuples have no partition to go into.
INSERT INTO splitted VALUES (TIMESTAMP '2020-01-01 00:00:00', 5), (NULL, 6);
-- Now let us add the table third_decade using a non-overlapping range:
ALTER TABLE splitted ADD TABLE third_decade AS PARTITION
FROM TIMESTAMP '2020-01-01 00:00:00' TO RANGE MAXVALUE WITH NULL VALUES;
-- Now the insert will succeed:
INSERT INTO splitted VALUES (TIMESTAMP '2020-01-01 00:00:00', 5), (NULL, 6);
-- Sanity check:
-- the tuples have been inserted into the corresponding physical tables,
-- while the merge table provides a total view of all tuples.
SELECT stamp, val FROM first_decade;
stamp | val |
---|---|
2000-01-01 00:00:00.000000 | 1 |
2002-12-03 20:00:00.000000 | 2 |
SELECT stamp, val FROM second_decade;
stamp | val |
---|---|
2012-05-12 21:01:00.000000 | 3 |
2019-12-12 23:59:59.000000 | 4 |
SELECT stamp, val FROM third_decade;
stamp | val |
---|---|
2020-01-01 00:00:00.000000 | 5 |
null | 6 |
SELECT stamp, val FROM splitted;
stamp | val |
---|---|
2000-01-01 00:00:00.000000 | 1 |
2002-12-03 20:00:00.000000 | 2 |
2012-05-12 21:01:00.000000 | 3 |
2019-12-12 23:59:59.000000 | 4 |
2020-01-01 00:00:00.000000 | 5 |
null | 6 |
DELETE FROM splitted; --6 rows affected
SELECT stamp, val FROM splitted;
stamp | val |
---|
SELECT stamp, val FROM first_decade;
stamp | val |
---|
SELECT stamp, val FROM second_decade;
stamp | val |
---|
SELECT stamp, val FROM third_decade;
stamp | val |
---|
In the second demo we show a merge table partitioned by values using an expression:
CREATE FUNCTION calculate_max(a INT, b INT) RETURNS INT BEGIN
IF a > b THEN RETURN a; ELSE RETURN b; END IF;
END;
-- Note the modulo in the partitioning expression
CREATE MERGE TABLE scorers (name clob, first_score int, second_score int)
PARTITION BY VALUES USING (calculate_max(first_score, second_score) % 10);
CREATE TABLE lower_scorers (name clob, first_score int, second_score int);
CREATE TABLE higher_scorers (name clob, first_score int, second_score int);
ALTER TABLE scorers ADD TABLE lower_scorers AS PARTITION IN (0, 1, 2, 3, 4);
ALTER TABLE scorers ADD TABLE higher_scorers AS PARTITION IN (5, 6, 7, 8, 9);
INSERT INTO scorers VALUES
('John', 1, 3), ('Ana', 5, 18), ('Peter', 19, 19), ('Jessica', 6, 4),
('Olivia', 10, 10), ('Charles', 12, 16), ('Sam', 11, 10);
-- Sanity check:
-- the tuples have been inserted into the corresponding physical tables,
-- while the merge table provides a total view of all tuples.
SELECT * FROM lower_scorers;
name | first_score | second_score |
---|---|---|
John | 1 | 3 |
Olivia | 10 | 10 |
Sam | 11 | 10 |
SELECT * FROM higher_scorers;
name | first_score | second_score |
---|---|---|
Ana | 5 | 18 |
Peter | 19 | 19 |
Jessica | 6 | 4 |
Charles | 12 | 16 |
SELECT * FROM scorers;
name | first_score | second_score |
---|---|---|
John | 1 | 3 |
Olivia | 10 | 10 |
Sam | 11 | 10 |
Ana | 5 | 18 |
Peter | 19 | 19 |
Jessica | 6 | 4 |
Charles | 12 | 16 |
-- Updating a column used by the partitioning expression is not allowed:
UPDATE scorers SET first_score = 10 WHERE name = ‘Ana’;
-- But we can update other columns
-- With the following query, the scores of Ana and Olivia are swapped!
UPDATE scorers SET name =
CASE name WHEN 'Ana' THEN 'Olivia' WHEN 'Olivia' THEN 'Ana' END
WHERE name IN ('Ana', 'Olivia');
SELECT * FROM lower_scorers;
name | first_score | second_score |
---|---|---|
John | 1 | 3 |
Ana | 10 | 10 |
Sam | 11 | 10 |
SELECT * FROM higher_scorers;
name | first_score | second_score |
---|---|---|
Olivia | 5 | 18 |
Peter | 19 | 19 |
Jessica | 6 | 4 |
Charles | 12 | 16 |
SELECT * FROM scorers;
name | first_score | second_score |
---|---|---|
John | 1 | 3 |
Ana | 10 | 10 |
Sam | 11 | 10 |
Olivia | 5 | 18 |
Peter | 19 | 19 |
Jessica | 6 | 4 |
Charles | 12 | 16 |
In the final demo we show the contents of the new system tables added to the SQL catalog to accommodate the metadata information of the partition predicates of the updatable merge tables. In the “sys”.“table_partitions”
table the ids of the partitioning columns or the partitioning expressions are stored. In “sys”.“range_partitions”
and “sys”.”value_partitions”
the information of child tables and their respective domains are listed:
SELECT * FROM table_partitions;
id | table_id | column_id | expression | type |
---|---|---|---|---|
8795 | 8794 | 8792 | null | 5 |
8818 | 8817 | null | mod(calculate_max(first_score,second_score),10) | 10 |
SELECT * FROM range_partitions;
table_id | partition_id | minimum | maximum | with_nulls |
---|---|---|---|---|
8799 | 8795 | 2000-01-01 00:00:00.000 | 2010-01-01 00:00:00.000 | false |
8803 | 8795 | 2010-01-01 00:00:00.000 | 2020-01-01 00:00:00.000 | false |
8807 | 8795 | 2020-01-01 00:00:00.000 | 5867411-12-31 23:59:59.999 | true |
SELECT * FROM value_partitions;
table_id | partition_id | value |
---|---|---|
8631 | 8626 | 0 |
8631 | 8626 | 1 |
8631 | 8626 | 2 |
8631 | 8626 | 3 |
8631 | 8626 | 4 |
8636 | 8626 | 5 |
8636 | 8626 | 6 |
8636 | 8626 | 7 |
8636 | 8626 | 8 |
8636 | 8626 | 9 |
The updatable merge table extensions are developed and supported by Pedro Ferreira, a software developer at MonetDB Solutions.