Re: [Monetdb-developers] MonetDB: default - The sample module for the monetdb5 server.
Roberto, we were also puzzled about which syntax to adopt. The first issue is that the syntax also dictates, in a way, the implementation (although that does not sound quite right :)). A syntax built on ORDER BY rand() implies that you attach random numbers to the rows and then sort on them. This sort is very costly, and it is the worst case since you start from random order; something like that costs N log N, where N is the number of rows. Moreover, that syntax implies that you sample the base table (or some intermediate result anyway) and then "continue" the query evaluation. However, sampling is not a very kind operator and cannot easily be pushed down: a sample of a join is not equal to the join of two samples. These are, in my view, the implications (or benefits!) of such a syntax. So MonetDB could as well support the syntax you propose, and it should!

On the other hand, the SAMPLE syntax we just introduced is applied solely to the final result set of the query, after query evaluation is done. The algorithm has a complexity of S, where S is the size of the sample (as opposed to the N log N of the previous approach). Also, the sample operator has the same semantics as LIMIT: it is an operator that shows a limited (random) portion of the entire (usually computed) result. That means that all joins, predicates, etc. have been evaluated before the sample is taken. Of course we would love to have that functionality in subqueries too, hence the workaround. What is really missing now is a function where the table definition is a parameter, indeed!

With these two views in mind, I would suggest that both syntaxes need to exist, each of them serving a different purpose (and, eventually, a different underlying implementation). The other solution would be to allow SAMPLE in subqueries and "depart" from the SQL spirit.

What do you think?

lefteris
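A minimal sketch of the push-down point, assuming two hypothetical tables orders(oid, cid) and customers(cid, name). The first query is what the SAMPLE clause does: it samples the fully evaluated join result. The second joins two per-table samples (built here with the ROW_NUMBER() workaround Roberto describes below, since SAMPLE itself is not accepted in a subquery) and in general gives a different, biased answer:

-- Sample of the join: 100 rows drawn from the fully evaluated join result.
SELECT o.oid, c.name
FROM orders o, customers c
WHERE o.cid = c.cid
SAMPLE 100;

-- Join of two per-table samples: not equivalent. A matching pair survives
-- only if both of its rows happen to be drawn on their respective sides.
SELECT o.oid, c.name
FROM (SELECT oid, cid, ROW_NUMBER() OVER (ORDER BY rand()) AS rnk
      FROM orders) AS o,
     (SELECT cid, name, ROW_NUMBER() OVER (ORDER BY rand()) AS rnk
      FROM customers) AS c
WHERE o.rnk <= 100 AND c.rnk <= 100 AND o.cid = c.cid;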
On Mon, Sep 26, 2011 at 10:41 AM, Roberto Cornacchia <roberto.cornacchia@gmail.com> wrote:

Lefteris, I would like to add something about the syntax (and semantics) of this feature. See below.
+/*
+ * @- Uniform Sampling.
+ *
+ * A new SQL operator has been added to support sampling the result of a query.
+ * The syntax for sampling is:
+ * SELECT ... FROM ... WHERE ... SAMPLE s
+ *
+ * where s, if it is an integer greater than 1, defines the number of rows in
+ * the sample. If s is a double in [0.0,1.0], it refers to the fraction of the
+ * result to be sampled. That is, if s=0.3 then the sample will be 30% of the
+ * size of the query result.
+ *
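Spelled out, and assuming a hypothetical table name_table(a, b, c), the two forms of s would be used like this:

-- Absolute size: a sample of 100 rows of the result.
SELECT a, b, c FROM name_table SAMPLE 100;

-- Relative size: a sample of 30% of the query result.
SELECT a, b, c FROM name_table SAMPLE 0.3;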
It has been discussed already whether this choice or an explicit PERCENT is more intuitive. Personally, I don't care much about that.
+ * SAMPLE is treated like LIMIT, ORDER BY, etc.; that means that it can only
+ * appear in the outermost SELECT clause, i.e., SAMPLE cannot appear in a
+ * subquery.
This is much more interesting to me. I see that SAMPLE fits the same semantics as LIMIT; in fact, you can think of it as a special LIMIT. What I don't like about it is that it then shares the same limitations (which are absurd, as the whole of SQL often is, from my point of view), namely the restriction on subqueries that you mention above. You suggest a workaround here:
+ * However, if this is needed, then one may define a function, for
+ * example
+ *
+ * CREATE FUNCTION mysample ()
+ * RETURNS TABLE(col a,...)
+ * BEGIN
+ *   RETURN
+ *     SELECT a,...
+ *     FROM name_table
+ *     SAMPLE 100;
+ * end;
+ *
+ * and then use function mysample() for example to populate a new table with
+ * the sample. E.g.,
+ *
+ * INSERT INTO sample_table (SELECT * FROM mysample());
+ *
+ *
To me, writing a function with a hard-coded table name in it is... weird! Like writing a function that can only do 2+2. (This could be much better if table identifiers were allowed as function parameters, a feature that I'd love to see implemented!)
To this, I much prefer a solution based on the window functions RANK() OVER() and ROW_NUMBER() OVER(), which luckily are implemented in MonetDB, and on the function rand().
Your example above becomes a regular query with no nesting issue:
SELECT a, b, c
FROM ( SELECT a, b, c, ROW_NUMBER() OVER (ORDER BY rand()) AS my_rank
       FROM name_table ) AS x
WHERE x.my_rank <= 100;
Replace ROW_NUMBER() with RANK() to allow duplicates in the random sample.
I'm not comparing the efficiency of the two solutions here, your solution could well be more efficient. I'm only comparing them from a usability point of view.
I noticed that my own usage of the LIMIT clause is becoming less and less frequent (in favour of the syntax I report here) because of the nesting issue. What I am wondering is: how useful is it to introduce a new syntax for sampling that has this (important, in my view) limitation? How often would you need a sample in the outermost SELECT clause only?
Cheers, Roberto
Yes, I agree about the difference in efficiency.
My considerations were only about usability and in particular from the point
of view of *generated* SQL.
The syntax I suggested works well in MonetDB, so yes, both can be useful,
with different focus.
They actually achieve the same result, in two different ways.
The SAMPLE clause is a special top-k that draws k tuples in random order.
The syntax I'm using is a special ranking. It ranks all tuples randomly and
then takes the first k.
Allowing the SAMPLE syntax in subqueries would also imply allowing LIMIT in subqueries, because they really boil down to the same semantics. Although I do think SQL should allow this, I do not think that MonetDB/SQL should deviate from the standard.
Roberto
participants (2)
- Lefteris
- Roberto Cornacchia