On Mon, Sep 26, 2011 at 10:41 AM, Roberto Cornacchia <roberto.cornacchia@gmail.com> wrote:

Lefteris, I would like to add something about the syntax (and semantics) of this feature.

See below

+/*
+ * @- Uniform Sampling.
+ *
+ * A new SQL operator has been added to support sampling the result of a query.
+ * The syntax for sampling is:
+ * SELECT ... FROM ... WHERE ... SAMPLE s
+ *
+ * where s if is an integer greater than 1, it defines the number of rows to be
+ * in the sample. If s is a double between [0.0,1.0] the it refers to the
+ * percentage of the result to be sampled. That is if s=0.3 then the sample
+ * will be 30% the size of the query result.
+ *

It has been discussed already whether this choice or an explicit PERCENT is more intuitive.
Personally, I don't care much about that.

+ * SAMPLE is been treated as LIMIT, ORDER BY, etc., that means that it can only
+ * be in the outer most SELECT clause, i.e., SAMPLE cannot appear in a
+ * subquery.

This is much more interesting to me.
I see that SAMPLE fits in the same semantics as LIMIT. In fact, you can think of it as a special LIMIT.
What I don't like about it, is that it then shares the same limitations (which are absurd, as the whole SQL often is, from my point of view), which is what you mention above about the subquery.

You suggest a workaround here:

However, if this is needed, then one may define a function, for
+ * example
+ *
+ * CREATE FUNCTION mysample ()
+ * RETURNS TABLE(col a,...)
+ * BEGIN
+ * RETURN
+ * SELECT a,...
+ * FROM name_table
+ * SAMPLE 100;
+ * end;
+ *
+ * and then use function mysample() for example to populate a new table with
+ * the sample. E.g.,
+ *
+ * INSERT INTO sample_table (SELECT * FROM mysample());
+ *
+ *

To me, writing a function with an hard-coded table name in it is... weird! Like writing a function that can only do 2+2.
(This could be much better if table identifiers were allowed as function parameters, a feature that I'd love to see implemented!).

To this, I much prefer a solution based on the window functions RANK() OVER() and ROW_NUMBER() OVER() which luckily are implemented in MonetDB, and the function rand().

Your example above becomes a regular query that has no nesting-issue:

SELECT a,b,c
FROM (
SELECT a,b,c, ROW_NUMBER() OVER(ORDER BY rand()) as my_rank
FROM name_table
) as x
where x.my_rank <= 100;

Replace ROW_NUMBER() with RANK() to allow duplicates in the random sample.

I'm not comparing the efficiency of the two solutions here, your solution could well be more efficient.
I'm only comparing them from a usability point of view.

I noticed that my own usage of the LIMIT clause is becoming less and less frequent (in favour of the syntax I report here) because of the nesting issue.

What I am wondering is, how much useful is to introduce a new syntax for the sampling that has this (important, in my view) limitation? How often would you need a sample in the outermost select clause only?

Cheers,
Roberto

_______________________________________________
Checkin-list mailing list
Checkin-list@monetdb.org
http://mail.monetdb.org/mailman/listinfo/checkin-list