[Monetdb-developers] Question about subquery to calculate median
Hello, I am trying to write a subquery that can serve as a poor man's median function. However I am running into issues every angle I try and cannot find any details about this in the docs or list archives. Question 1: Currently I can calculate the median manually as two steps: sql> select count(*)/2 from mytable where myIntVal is not null; +----------------+ | sql_div_count_ | +================+ | 10 | +----------------+ sql> select myIntVal as median from mytable order by myIntVal limit 1 offset 10; +--------+ | median | +========+ | 1172 | +--------+ However I cannot figure out how to combine these two queries into one query so that the count/2 can be done together with the query, so that I don't have to manually first get the count and then hard code it into the second query that does the median. How can these be combined? For instance I tried this which does not work: sql> select myIntVal as median from mytable order by myIntVal limit 1 offset (select count(*)/2 from mytable where myIntVal is not null); but this gives me: !syntax error, unexpected '(', expecting IDENT or sqlINT So I assume you can not dynamically substitute a subquery for an OFFSET value. But is there then another way to combine these statements into one statement? Question 2: Once I get the above resolved, I need to be able to get the median as an aggregate function, so that I can get medians in a group by query for each element in the group. For instance, if there really was a median() aggregate function I would do this: select median(myIntval), customerid from mytable group by customerid; However of course there is no median function. So I understand the work around for not having a median function is to get the total rows, sort the rows and then take the row in the middle. But how can I do that if it is in a group by statement like the above? What I am trying desperately to avoid is having to make a query to get all the group by results, and then for each result having to do a separate individual query to get its median separately. Please tell me this is possible! :) And if so, how? Thank you!!
On Thu, Jun 19, 2008 at 09:54:12AM -0700, Rt Ibmer wrote:
Hello,
I am trying to write a subquery that can serve as a poor man's median function. However I am running into issues every angle I try and cannot find any details about this in the docs or list archives.
Question 1:
Currently I can calculate the median manually as two steps:
sql> select count(*)/2 from mytable where myIntVal is not null; +----------------+ | sql_div_count_ | +================+ | 10 | +----------------+ sql> select myIntVal as median from mytable order by myIntVal limit 1 offset 10; +--------+ | median | +========+ | 1172 | +--------+
However I cannot figure out how to combine these two queries into one query so that the count/2 can be done together with the query, so that I don't have to manually first get the count and then hard code it into the second query that does the median. How can these be combined?
For instance I tried this which does not work:
sql> select myIntVal as median from mytable order by myIntVal limit 1 offset (select count(*)/2 from mytable where myIntVal is not null);
but this gives me: !syntax error, unexpected '(', expecting IDENT or sqlINT
So I assume you can not dynamically substitute a subquery for an OFFSET value. But is there then another way to combine these statements into one statement?
Question 2:
Once I get the above resolved, I need to be able to get the median as an aggregate function, so that I can get medians in a group by query for each element in the group. For instance, if there really was a median() aggregate function I would do this:
select median(myIntval), customerid from mytable group by customerid;
However of course there is no median function. So I understand the work around for not having a median function is to get the total rows, sort the rows and then take the row in the middle.
But how can I do that if it is in a group by statement like the above?
What I am trying desperately to avoid is having to make a query to get all the group by results, and then for each result having to do a separate individual query to get its median separately. Please tell me this is possible! :) And if so, how?
Thank you!!
I have a different solution (some googling), try the following CREATE TABLE sampleData ( groupID int, numValue int ); INSERT INTO sampleData VALUES ( 1, 1 ); INSERT INTO sampleData VALUES ( 1, 2 ); INSERT INTO sampleData VALUES ( 1, 6 ); INSERT INTO sampleData VALUES ( 1, 16 ); INSERT INTO sampleData VALUES ( 1, 7 ); INSERT INTO sampleData VALUES ( 2, 5 ); INSERT INTO sampleData VALUES ( 2, 5 ); INSERT INTO sampleData VALUES ( 2, 5 ); INSERT INTO sampleData VALUES ( 2, 11 ); INSERT INTO sampleData VALUES ( 3, 10 ); INSERT INTO sampleData VALUES ( 3, 17 ); INSERT INTO sampleData VALUES ( 3, 52 ); INSERT INTO sampleData VALUES ( 3, 66 ); INSERT INTO sampleData VALUES ( 4, 18 ); INSERT INTO sampleData VALUES ( 5, 0 ); INSERT INTO sampleData VALUES ( 5, 0 ); SELECT R.groupID, AVG(1.0*R.numValue) AS medianValue FROM ( SELECT GroupID, numValue, ROW_NUMBER() OVER(PARTITION BY groupID ORDER BY NumValue) AS rowno FROM sampleData ) R INNER JOIN ( SELECT GroupID, 1+count(*) as N FROM sampleData GROUP BY GroupID ) G ON R.GroupID = G.GroupID AND R.rowNo BETWEEN N/2 AND N/2+N%2 GROUP BY R.groupID; Niels
------------------------------------------------------------------------- Check out the new SourceForge.net Marketplace. It's the best place to buy or sell services for just about anything Open Source. http://sourceforge.net/services/buy/index.php _______________________________________________ Monetdb-developers mailing list Monetdb-developers@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/monetdb-developers
-- Niels Nes, Centre for Mathematics and Computer Science (CWI) Kruislaan 413, 1098 SJ Amsterdam, The Netherlands room C0.02, phone ++31 20 592-4098, fax ++31 20 592-4312 url: http://www.cwi.nl/~niels e-mail: Niels.Nes@cwi.nl
participants (2)
-
Niels Nes
-
Rt Ibmer