Re: Improving performance of "select * from table" ?
Hi Gordon,
indeed, your experiments confirm that most of the time is spent in serializing, sending, and re-parsing your query's huge result.
I am wondering what kind of application needs to see the entire database table?
Surely no user can seriously handle millions of result tuples with tens (or more) of columns.
What is the purpose of using a DBMS if all it needs to do is "select * from table"?
In my/our opinion, as much of the application logic as possible should be expressed in SQL (plus domain- or application-specific extensions, if required) and thus performed in the DBMS, so that only small results need to be sent back to the client/application. Then the efficiency and performance of the DBMS (server) itself dominates and matters more than the server-client communication.
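To illustrate the point with a minimal, purely hypothetical sketch (the table and column names are invented, not from this thread): the aggregation runs inside the DBMS, so only one small summary row per group crosses the wire instead of millions of raw tuples.

-- Hypothetical schema: measurements(sample_id, gene, expression).
-- The server does the heavy lifting; the client receives only
-- one summary row per gene instead of every raw tuple.
SELECT gene,
       COUNT(*)        AS n_samples,
       AVG(expression) AS avg_expression
  FROM measurements
 GROUP BY gene;

Compared with "SELECT * FROM measurements", the result shrinks from one row per stored tuple to one row per distinct gene, which keeps serialization and transfer costs negligible.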
Best,
Stefan
Assaf Gordon
Stefan de Konink wrote, On 10/16/2012 06:49 PM:
[...] What about the time taken to serialise the results? For MonetDB this is done in a very 'plain text' wire format. [...] Are you aware of the PostgreSQL \timing command? That way you can see output similar to that of MonetDB with respect to the time taken for a query.
I re-ran one test using psql's and mclient's internal timing:

===
## Postgres: Select all columns from table
$ psql test
test=# \o /dev/null
test=# \timing
Timing is on.
test=# select * from bc251;
Time: 5506.768 ms
test=# select * from bc251;
Time: 5446.689 ms
test=#

## MonetDB: Select all columns from table
$ mclient -d voc
sql>\>/dev/null
sql>\w-1
sql>select * from bc251;
1657032 tuples (3.8s)
sql>select * from bc251;
1657032 tuples (3.8s)
sql>
===

As you said, the reported durations are much shorter than in my previous timings (MonetDB: 3.8s vs 2m16s; Postgres: 5.5s vs 21s).

However, those numbers don't reveal the entire story: mclient reports 3.8 seconds, but it took more than 2 minutes (of wall time) for the result to finish printing. I understand that this quite possibly indicates a serialization/transfer bottleneck, and it may be of lesser interest to the developers, but as an application developer I worry about the end-to-end timing of the query - the time it takes for my application to get the data back.

If there are any recommendations for improving those queries, they would be much appreciated.

Thanks,
 -gordon
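One way to capture the end-to-end cost described above is to time the whole client invocation from the shell rather than relying on the server-reported duration. A minimal sketch, assuming mclient's -s and psql's -c options for passing a single statement (database and table names as in the transcript above):

===
# Wall-clock time for the full round trip: query execution,
# serialization, transfer, and client-side rendering (discarded).
$ time mclient -d voc -s "select * from bc251" > /dev/null
$ time psql test -c "select * from bc251" > /dev/null
===

The gap between this wall-clock time and the server-reported duration is roughly the serialization/transfer/rendering overhead under discussion.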
Hello Stefan,

Stefan Manegold wrote, On 10/17/2012 04:05 AM:
indeed, your experiments confirm that most of the time is spent in serializing, sending, and re-parsing your query's huge result.
I am wondering what kind of application needs to see the entire database table?
I'm working on a website that will provide some analysis on genomic data. Most of the time, users are indeed interested in grouping/counting/summing and similar aggregation of the data - which MonetDB does exceptionally well. However, every now and then, users will want to export the data (either the entire table, or a large chunk of the raw data) and carry on the analysis on a different platform.
Surely no user can seriously handle millions of result tuples with tens (or more) of columns. What is the purpose of using a DBMS if all it needs to do is "select * from table"?
I don't expect them to open the data in Excel, of course. Users can export it to other web-based platforms (e.g. Galaxy[1], GenePattern[2], GenomeSpace[3]), or save it to a file and run R scripts on the data - those are designed to handle huge data files without a problem.

[1] Galaxy - http://galaxy.psu.edu/
[2] GenePattern - http://www.broadinstitute.org/cancer/software/genepattern/
[3] GenomeSpace - http://www.genomespace.org/
In my/our opinion, as much of the application logic as possible should be expressed in SQL (plus domain- or application-specific extensions, if required) and thus performed in the DBMS, so that only small results need to be sent back to the client/application. Then the efficiency and performance of the DBMS (server) itself dominates and matters more than the server-client communication.
Generally I agree. But the reality is that my web site will provide a specific (limited) set of functions over the data, and to continue with downstream analysis the users will need to be able to take the data elsewhere - I want to make that as easy as possible for them.

I'll investigate other ways to extract the data.

Thanks,
 -gordon
Hi Gordon,

I see - thanks for the explanations!

Well, I assume that for the export to some other system you still need a serialized textual form of the data, right? Hence, the serialization costs themselves cannot be avoided.

In case the MAPI protocol and client-side data parsing and rendering are indeed too much of an overhead even for "occasional" exports, you might want to consider the bulk-export functionality described under http://www.monetdb.org/Documentation/Manuals/SQLreference/CopyInto i.e.,

COPY subquery INTO file_name
    [ [USING] DELIMITERS field_separator [',' record_separator [',' string_quote]] ]
    [ NULL AS null_string ]

This will generate a (possibly even compressed) "DSV" (delimiter-separated values) file - obviously in the file system of your server. But if you wrap this functionality behind an "export" button on your website (rather than letting users type SQL literally), you can easily place the file somewhere the user can download it via HTTP, FTP, or the like.

Hope this helps,
Stefan
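As a concrete instance of the syntax above - a sketch only, with an assumed output path and delimiter choices, using the table name from this thread:

-- Export the table as comma-separated values; NULLs become empty strings.
COPY SELECT * FROM bc251 INTO '/tmp/bc251.csv'
     USING DELIMITERS ',', '\n', '"'
     NULL AS '';

-- Per the "possibly even compressed" remark above, a compressed variant
-- (assumption: the format is inferred from the .gz suffix).
COPY SELECT * FROM bc251 INTO '/tmp/bc251.csv.gz'
     USING DELIMITERS ',', '\n', '"'
     NULL AS '';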
Hello Stefan,

Stefan Manegold wrote, On 10/17/2012 12:07 PM:
In case the MAPI protocol and client-side data parsing and rendering are indeed too much of an overhead even for "occasional" exports, you might want to consider the bulk-export functionality described under http://www.monetdb.org/Documentation/Manuals/SQLreference/CopyInto i.e.,

COPY subquery INTO file_name
    [ [USING] DELIMITERS field_separator [',' record_separator [',' string_quote]] ]
    [ NULL AS null_string ]

This will generate a (possibly even compressed) "DSV" (delimiter-separated values) file - obviously in the file system of your server. But if you wrap this functionality behind an "export" button on your website (rather than letting users type SQL literally), you can easily place the file somewhere the user can download it via HTTP, FTP, or the like.
Thanks for the tip.

I tried the "copy" command, and unfortunately it is not a lot faster:

===
sql>copy select * from bc251 into '/tmp/foo2' ;
1657032 affected rows (2m 14s)
===

Sadly, I also found a bug related to "COPY INTO": http://bugs.monetdb.org/show_bug.cgi?id=3165

So I'm still exploring other possibilities.

Thanks,
 -gordon
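One further possibility worth sketching here, under the assumption that the installed mclient supports the -f output-format option with a csv renderer: stream the result in a machine-readable format directly to a file, bypassing the interactive pretty-printer that tends to dominate client-side cost.

===
# Render the result as CSV straight into a client-side file,
# skipping the aligned-table renderer (assumes mclient -f csv exists
# in this MonetDB release; output path is hypothetical).
$ mclient -d voc -f csv -s "select * from bc251" > /tmp/bc251.csv
===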
On Wed, Oct 17, 2012 at 11:34:38AM -0400, Assaf Gordon wrote:
[...]

I'll investigate other ways to extract the data.

Did you look at the COPY INTO file options?
Niels
--
Niels Nes, Centrum Wiskunde & Informatica (CWI)
Science Park 123, 1098 XG Amsterdam, The Netherlands
room L3.14, phone ++31 20 592-4098, sip:4098@sip.cwi.nl
url: http://www.cwi.nl/~niels  e-mail: Niels.Nes@cwi.nl
Participants (3): Assaf Gordon, Niels Nes, Stefan Manegold