Re: [Monetdb-developers] MonetDB: default - The sample module for the monetdb5 server.
Roberto, we were also puzzled about which syntax to adopt. The first issue is that the syntax also dictates, in a way, the implementation (although that does not sound quite right :)). A syntax built on ORDER BY rand() implies that you attach random numbers to the rows and then sort on them. This sort is very costly, and it is the worst case since you start from random order; something like that costs N log N, where N is the number of rows. Moreover, that syntax implies that you sample the base table (or some intermediate result anyway) and then "continue" the query evaluation. However, sampling is not a very kind operator and cannot easily be pushed down: a sample of a join is not equal to the join of two samples. These are, in my view, the implications (or benefits!) of such a syntax. So MonetDB could as well support the syntax you propose, and it should!

On the other hand, the SAMPLE syntax we just introduced is applied solely to the final result set of the query, after query evaluation is done. The algorithm has a complexity of S, where S is the size of the sample (as opposed to the N log N of the previous approach). Also, the sample operator has the same semantics as LIMIT: it is an operator that shows a limited (random) portion of the entire (usually computed) result. That means that all joins, predicates, etc. have been evaluated before the sample is taken. Of course we would love to have that functionality in subqueries too, hence the workaround. What is really missing now is a function where the table definition is a parameter, indeed!

With these two views in mind, I would suggest that both syntaxes need to exist, each of them serving a different purpose (and, eventually, a different underlying implementation). The other solution would be to allow SAMPLE in subqueries and "depart" from the SQL spirit.

What do you think?

lefteris
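A minimal sketch of the push-down point, assuming two hypothetical tables orders(oid, cid) and customers(cid, name). The first query is what the SAMPLE clause does: it samples the fully evaluated join result. The second joins two per-table samples (built here with the ROW_NUMBER() workaround Roberto describes below, since SAMPLE itself is not accepted in a subquery) and in general gives a different, biased answer:

-- Sample of the join: 100 rows drawn from the fully evaluated join result.
SELECT o.oid, c.name
FROM orders o, customers c
WHERE o.cid = c.cid
SAMPLE 100;

-- Join of two per-table samples: not equivalent. A matching pair survives
-- only if both of its rows happen to be drawn on their respective sides.
SELECT o.oid, c.name
FROM (SELECT oid, cid, ROW_NUMBER() OVER (ORDER BY rand()) AS rnk
      FROM orders) AS o,
     (SELECT cid, name, ROW_NUMBER() OVER (ORDER BY rand()) AS rnk
      FROM customers) AS c
WHERE o.rnk <= 100 AND c.rnk <= 100 AND o.cid = c.cid;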
On Mon, Sep 26, 2011 at 10:41 AM, Roberto Cornacchia <roberto.cornacchia@gmail.com> wrote:

Lefteris, I would like to add something about the syntax (and semantics) of this feature. See below.
+/*
+ * @- Uniform Sampling.
+ *
+ * A new SQL operator has been added to support sampling the result of a query.
+ * The syntax for sampling is:
+ * SELECT ... FROM ... WHERE ... SAMPLE s
+ *
+ * where s, if it is an integer greater than 1, defines the number of rows in
+ * the sample. If s is a double in [0.0,1.0], it refers to the fraction of the
+ * result to be sampled. That is, if s=0.3 then the sample will be 30% of the
+ * size of the query result.
+ *
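Spelled out, and assuming a hypothetical table name_table(a, b, c), the two forms of s would be used like this:

-- Absolute size: a sample of 100 rows of the result.
SELECT a, b, c FROM name_table SAMPLE 100;

-- Relative size: a sample of 30% of the query result.
SELECT a, b, c FROM name_table SAMPLE 0.3;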
It has been discussed already whether this choice or an explicit PERCENT is more intuitive. Personally, I don't care much about that.
+ * SAMPLE is treated like LIMIT, ORDER BY, etc.; that means that it can only
+ * appear in the outermost SELECT clause, i.e., SAMPLE cannot appear in a
+ * subquery.
This is much more interesting to me. I see that SAMPLE fits the same semantics as LIMIT; in fact, you can think of it as a special LIMIT. What I don't like about it is that it then shares the same limitations (which are absurd, as the whole of SQL often is, from my point of view), namely the restriction on subqueries that you mention above. You suggest a workaround here:
+ * However, if this is needed, then one may define a function, for
+ * example
+ *
+ * CREATE FUNCTION mysample ()
+ * RETURNS TABLE(col a,...)
+ * BEGIN
+ *   RETURN
+ *     SELECT a,...
+ *     FROM name_table
+ *     SAMPLE 100;
+ * end;
+ *
+ * and then use function mysample() for example to populate a new table with
+ * the sample. E.g.,
+ *
+ * INSERT INTO sample_table (SELECT * FROM mysample());
+ *
+ *
To me, writing a function with a hard-coded table name in it is... weird! Like writing a function that can only do 2+2. (This could be much better if table identifiers were allowed as function parameters, a feature that I'd love to see implemented!)
To this, I much prefer a solution based on the window functions RANK() OVER() and ROW_NUMBER() OVER(), which luckily are implemented in MonetDB, and on the function rand().
Your example above becomes a regular query with no nesting issue:
SELECT a, b, c
FROM ( SELECT a, b, c, ROW_NUMBER() OVER (ORDER BY rand()) AS my_rank
       FROM name_table ) AS x
WHERE x.my_rank <= 100;
Replace ROW_NUMBER() with RANK() to allow duplicates in the random sample.
I'm not comparing the efficiency of the two solutions here, your solution could well be more efficient. I'm only comparing them from a usability point of view.
I noticed that my own usage of the LIMIT clause is becoming less and less frequent (in favour of the syntax I report here) because of the nesting issue. What I am wondering is: how useful is it to introduce a new syntax for sampling that has this (important, in my view) limitation? How often would you need a sample in the outermost SELECT clause only?
Cheers, Roberto
Yes, I agree about the difference in efficiency.
My considerations were only about usability and in particular from the point
of view of *generated* SQL.
The syntax I suggested works well in MonetDB, so yes, both can be useful,
with different focus.
They actually achieve the same result, in two different ways.
The SAMPLE clause is a special top-k that draws k tuples in random order.
The syntax I'm using is a special ranking. It ranks all tuples randomly and
then takes the first k.
Allowing the SAMPLE syntax in subqueries would also imply allowing LIMIT in subqueries, because they really boil down to the same semantics. Although I do think SQL should allow this, I do not think that MonetDB/SQL should deviate from the standard.
Roberto
participants (2)
- Lefteris
- Roberto Cornacchia