Sampling a database is an essential step to improve the response time for database exploration. We have introduced a number of sampling techniques in the MonetDB software stack. Our goal is to provide methods for performing sampling over a) the result of a query, b) the base tables, and c) the entire database schema.
Sampling can be performed during query execution, as well as during data loading when predefined sampling indexes are used. Eventually, we want to support both uniform and weighted sampling, but in the current release, all sampling methods are uniform.
The sampling methods included in the distribution are described below.
A new SQL operator SAMPLE has been added to support sampling the result of a query.
SELECT ... FROM ... [ SAMPLE <expr> ] [ SEED intval ]
If <expr> is a non-negative integer literal, it defines the number of rows to be included in the sample. If <expr> is a real literal in the interval [0.0, 1.0], it specifies the fraction of the result set to be sampled. For example, if <expr> is 0.3, the sample will contain 30% of the rows in the query result.
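For example, the two forms can be written as follows (the table and column names are illustrative, and the optional SEED makes the sample reproducible):

```sql
-- Absolute form: draw 100 rows uniformly from the query result.
SELECT a, b FROM name_table SAMPLE 100;

-- Relative form: draw 10% of the result, reproducibly, via a fixed seed.
SELECT a, b FROM name_table SAMPLE 0.1 SEED 42;
```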
Because SAMPLE primarily operates on query results, it is treated as the same type of operator as the LIMIT clause, which, according to the SQL:2003 standard, may only be used in the outermost SELECT clause. Hence, before the Jul2017 release, SAMPLE was not allowed in a subquery; in addition, the SAMPLE operator does not operate on query input data. However, both restrictions can be circumvented using a table-producing function, for example:
CREATE FUNCTION mysample ()
RETURNS TABLE(col a,...)
BEGIN
   RETURN
     SELECT a,...
     FROM name_table
     SAMPLE 100;
END;
Then one can use the function mysample() to create a sampled table, for instance:
INSERT INTO sample_table (SELECT * FROM mysample());
In this way, we can apply SAMPLE to base tables before running the actual query.
The current sampling methods all use uniform sampling, based on Algorithm A as described in the paper "Faster Methods for Random Sampling" by Jeffrey Scott Vitter [1]. Algorithm A is not the fastest algorithm in that paper, but it makes only one call to random() per sampled row (i.e., <expr> calls in total) and it is simpler than the more complex and CPU-intensive alternatives in the literature. Instead of performing one random experiment for each row to decide whether it should be included in the sample, Algorithm A draws a random skip distance, jumps over that many rows, and includes the row that follows. The algorithm scans the input relation sequentially and preserves the uniqueness and sort properties of the input. The sample is taken without replacement.
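To illustrate the skipping idea, here is a small Python sketch of Algorithm A (not MonetDB's actual implementation): each selected row costs one call to random(), which determines a skip distance S; the row following the S skipped rows is emitted, so the output preserves the input order and contains no duplicates.

```python
import random

def algorithm_a(records, n):
    """Sketch of Vitter's Algorithm A: uniform sample without
    replacement of n items from `records`, scanning sequentially
    and preserving input order."""
    N = len(records)
    if not 0 <= n <= N:
        raise ValueError("sample size out of range")
    sample = []
    pos = 0  # index of the next unread record
    while n >= 2:
        # One random() call per selected record: draw the skip
        # distance S, i.e. how many records to jump over before
        # selecting the next one.
        V = random.random()
        S = 0
        top = N - n           # records we are still allowed to skip
        quot = top / N        # P(S > 0) = (N - n) / N
        while quot > V:
            S += 1
            top -= 1
            N -= 1
            quot *= top / N   # extend to P(S > S_current)
        pos += S              # skip S records ...
        sample.append(records[pos])  # ... and select the next one
        pos += 1
        N -= 1
        n -= 1
    if n == 1:
        # Last record: pick uniformly among the N remaining ones.
        S = int(N * random.random())
        sample.append(records[pos + S])
    return sample
```

Note that because quot drops to zero once top reaches zero, the skip distance can never overshoot the end of the input, and selecting all N records degenerates to a plain sequential scan.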