[MonetDB-users] Apples and oranges

Hi!
I have some questions as to how MonetDB/XQuery should be compared fairly to other systems.
If I re-run a query multiple times in a single call to `mclient`, is any calculation re-used? How about if I run multiple similar queries in a single call?
Example:
$ cat www.xq
count(doc("dblp")//www)
$ cat www_s10.xq
(count(doc("dblp")//www), count(doc("dblp")//www), count(doc("dblp")//www), count(doc("dblp")//www), count(doc("dblp")//www),
 count(doc("dblp")//www), count(doc("dblp")//www), count(doc("dblp")//www), count(doc("dblp")//www), count(doc("dblp")//www))
$ cat www_s100.xq
(count(doc("dblp")//www), ... count(doc("dblp")//www))
$ mclient --language=xquery --time < www.xq
11760
Timer 22.552 msec
(Assert we are running hot.)
$ mclient --language=xquery --time < www.xq
11760
Timer 21.661 msec
$ mclient --language=xquery --time < www_s10.xq
11760, [snip] 11760
Timer 33.063 msec
$ mclient --language=xquery --time < www_s100.xq
11760, [snip] 11760
Timer 252.414 msec
So the average execution times are 22, 3.3 and 2.5 milliseconds. Is the extra cost for the first query just starting up the client program, or is some calculation re-used?
If we now look at more expensive queries:
$ cat dblp_authors.xq
count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"])
Just repeating the same:
$ cat dblp_authors_s10.xq
(count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"]),
 count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"]),
 count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"]),
 count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"]),
 count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"]),
 count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"]),
 count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"]),
 count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"]),
 count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"]),
 count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"]),
 count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"]))
Different but related queries:
$ cat dblp_authors_x10.xq
(count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"]),
 count(doc("dblp")/dblp//author[text()="Wen Gao"]),
 count(doc("dblp")/dblp//author[text()="Irith Pomeranz"]),
 count(doc("dblp")/dblp//author[text()="Hector Garcia-Molina"]),
 count(doc("dblp")/dblp//author[text()="Moshe Y. Vardi"]),
 count(doc("dblp")/dblp//author[text()="Joseph Y. Halpern"]),
 count(doc("dblp")/dblp//author[text()="Noga Alon"]),
 count(doc("dblp")/dblp//author[text()="Wei Li"]),
 count(doc("dblp")/dblp//author[text()="Ming Li"]),
 count(doc("dblp")/dblp//author[text()="Donald F. Towsley"]))
$ mclient --language=xquery --time < dblp_authors.xq
351
Timer 1238.436 msec
$ mclient --language=xquery --time < dblp_authors.xq
351
Timer 1253.927 msec
$ mclient --language=xquery --time < dblp_authors_s10.xq
351, ... 351
Timer 1284.191 msec
$ mclient --language=xquery --time < dblp_authors_x10.xq
351, 347, 346, 341, 334, 334, 330, 320, 320, 317
Timer 2610.589 msec
Here the average times are 1238, 128 and 261 milliseconds. Here the difference is clearly not just startup of the client.
If this was not a client-server architecture, I would guess the difference came from opening files, getting stuff into cache, etc. Are there similar reasons here?
What parts of the calculations are actually done inside the client, if any? If the answer is none, why is this behavior seen?
In conclusion: When running multiple queries, what would be the most fair way to compare MonetDB/XQuery to other client/server architectures in your view? Concatenating the queries in a single call to `mclient`, or multiple calls?
When timing a single query, can it be repeated multiple times in a single call, and the average taken, without being unfair?
If I use for example MS SQL Server 2008, there is no gain from a single invocation of the client, whether I have multiple SQL statements
SELECT x.query('$q') FROM t; ..., SELECT x.query('$q') FROM t;
Or a single SQL statement with a list of XPath queries
SELECT x.query('($q, ..., $q)') FROM t;
Hug from Nils
--
http://www.idi.ntnu.no/~nilsgri/
Why is this thus? What is the reason of this thusness? - Artemus Ward

Hi,
I will try to answer some of your questions.
On Mon, Mar 16, 2009 at 2:56 PM, Nils Grimsmo wrote:
Hi!
I have some questions as to how MonetDB/XQuery should be compared fairly to other systems.
If I re-run a query multiple times in a single call to `mclient`, is any calculation re-used? How about if I run multiple similar queries in a single call?
MonetDB/XQuery uses the MonetDB 4 server, which does not have any "recycling" of results. MonetDB 5 does have these capabilities, but they are used only with the SQL front-end.
Example:
$ cat www.xq
count(doc("dblp")//www)
$ cat www_s10.xq
(count(doc("dblp")//www), count(doc("dblp")//www), count(doc("dblp")//www), count(doc("dblp")//www), count(doc("dblp")//www),
 count(doc("dblp")//www), count(doc("dblp")//www), count(doc("dblp")//www), count(doc("dblp")//www), count(doc("dblp")//www))
$ cat www_s100.xq
(count(doc("dblp")//www), ... count(doc("dblp")//www))
$ mclient --language=xquery --time < www.xq
11760
Timer 22.552 msec
(Assert we are running hot.)
$ mclient --language=xquery --time < www.xq
11760
Timer 21.661 msec
$ mclient --language=xquery --time < www_s10.xq
11760, [snip] 11760
Timer 33.063 msec
$ mclient --language=xquery --time < www_s100.xq
11760, [snip] 11760
Timer 252.414 msec
So the average execution times are 22, 3.3 and 2.5 milliseconds. Is the extra cost for the first query just starting up the client program, or is some calculation re-used?
The extra cost is just the start-up, since you have already ensured warm runs and your data is small enough to fit in memory. We ran some tests of our own in the past and verified that it is the start-up cost.
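As a rough check of the averages quoted in the question, the reported timings can be divided by the number of expressions, and a two-point linear model (a fixed cost plus a cost per expression) can be fitted to the 10- and 100-expression runs. The linear model is an assumption made here for illustration and is not stated anywhere in the thread; only the three timings come from the transcript above.

```python
# Timings reported in the thread: number of count() expressions -> total msec.
timings_msec = {1: 22.552, 10: 33.063, 100: 252.414}

# Average time per expression, as computed in the question.
averages = {n: total / n for n, total in timings_msec.items()}

# Assumed model (not from the thread): total = fixed + n * per_expr.
# Fit it through the n=10 and n=100 measurements only.
per_expr = (timings_msec[100] - timings_msec[10]) / (100 - 10)
fixed = timings_msec[10] - 10 * per_expr

for n, avg in averages.items():
    print(f"n={n:3d}  average {avg:.2f} msec")
print(f"fitted fixed cost {fixed:.2f} msec, per expression {per_expr:.2f} msec")
```

Under this assumed model the fixed part comes out at roughly 8.7 msec and the per-expression part at roughly 2.4 msec. The single-expression run (22.552 msec) does not lie on that line, so the fit is only a rough guide to how the total splits.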
If we now look at more expensive queries:
$ cat dblp_authors.xq
count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"])
Just repeating the same:
$ cat dblp_authors_s10.xq
(count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"]),
 count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"]),
 count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"]),
 count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"]),
 count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"]),
 count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"]),
 count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"]),
 count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"]),
 count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"]),
 count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"]),
 count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"]))
Different but related queries:
$ cat dblp_authors_x10.xq
(count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"]),
 count(doc("dblp")/dblp//author[text()="Wen Gao"]),
 count(doc("dblp")/dblp//author[text()="Irith Pomeranz"]),
 count(doc("dblp")/dblp//author[text()="Hector Garcia-Molina"]),
 count(doc("dblp")/dblp//author[text()="Moshe Y. Vardi"]),
 count(doc("dblp")/dblp//author[text()="Joseph Y. Halpern"]),
 count(doc("dblp")/dblp//author[text()="Noga Alon"]),
 count(doc("dblp")/dblp//author[text()="Wei Li"]),
 count(doc("dblp")/dblp//author[text()="Ming Li"]),
 count(doc("dblp")/dblp//author[text()="Donald F. Towsley"]))
$ mclient --language=xquery --time < dblp_authors.xq
351
Timer 1238.436 msec
$ mclient --language=xquery --time < dblp_authors.xq
351
Timer 1253.927 msec
$ mclient --language=xquery --time < dblp_authors_s10.xq
351, ... 351
Timer 1284.191 msec
$ mclient --language=xquery --time < dblp_authors_x10.xq
351, 347, 346, 341, 334, 334, 330, 320, 320, 317
Timer 2610.589 msec
Here the average times are 1238, 128 and 261 milliseconds. Here the difference is clearly not just startup of the client.
It is still the start-up cost. In both cases you are sending one query, which has to be compiled, optimized and run. Asking for 10 different things does not mean that it runs 10 separate plans (10 different XPath axis steps, for example). So if you divide the time by 10 in the second case, you are in principle just dividing by 10 the same amount of work as in your first query. Try running these queries:
$ cat dblp_authors_q10.xq
count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"])
<>
count(doc("dblp")/dblp//author[text()="Wen Gao"])
<>
count(doc("dblp")/dblp//author[text()="Irith Pomeranz"])
<>
count(doc("dblp")/dblp//author[text()="Hector Garcia-Molina"])
<>
count(doc("dblp")/dblp//author[text()="Moshe Y. Vardi"])
<>
count(doc("dblp")/dblp//author[text()="Joseph Y. Halpern"])
<>
count(doc("dblp")/dblp//author[text()="Noga Alon"])
<>
count(doc("dblp")/dblp//author[text()="Wei Li"])
<>
count(doc("dblp")/dblp//author[text()="Ming Li"])
<>
count(doc("dblp")/dblp//author[text()="Donald F. Towsley"])
These are 10 different queries.
If this was not a client-server architecture, I would guess the difference came from opening files, getting stuff into cache, etc. Are there similar reasons here?
What parts of the calculations are actually done inside the client, if any? If the answer is none, why is this behavior seen?
The client does not do any calculations.
In conclusion: When running multiple queries, what would be the most fair way to compare MonetDB/XQuery to other client/server architectures in your view? Concatenating the queries in a single call to `mclient`, or multiple calls?
If you are after multiple queries, then I would suggest writing all your queries in one file, separating each query with '<>', and then feeding that file to a single mclient.
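The two ways of packaging the same expressions can be generated from a list of author names, which makes it easy to keep the composed and the separated variants in sync. This is only a sketch: the helper names are hypothetical, the author names are the first three from the thread, and putting '<>' on a line of its own is an assumption about layout, since the thread only says that queries are separated by '<>'.

```python
def xpath_count(author):
    """Return the count() expression used in the thread for one author name.

    No escaping is done, so names containing a double quote are not handled.
    """
    return f'count(doc("dblp")/dblp//author[text()="{author}"])'


def composed_query(authors):
    """One XQuery returning a sequence of n counts: (Q1, ..., Qn)."""
    return "(" + ",\n ".join(xpath_count(a) for a in authors) + ")\n"


def separate_queries(authors):
    """n independent queries in one file, separated by '<>'."""
    return "\n<>\n".join(xpath_count(a) for a in authors) + "\n"


authors = ["Grzegorz Rozenberg", "Wen Gao", "Irith Pomeranz"]
composed = composed_query(authors)
separate = separate_queries(authors)
print(composed)
print(separate)
```

Writing `composed` and `separate` to two files and feeding each to `mclient --language=xquery --time`, as in the transcripts above, would then time the same expressions once as a single query and once as individual queries.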
When timing a single query, can it be repeated multiple times in a single call, and the average taken, without being unfair?
As long as you consider hot runs, I would say yes.
Hope I could help,
lefteris
If I use for example MS SQL Server 2008, there is no gain from a single invocation of the client, whether I have multiple SQL statements
SELECT x.query('$q') FROM t; ..., SELECT x.query('$q') FROM t;
Or a single SQL statement with a list of XPath queries
SELECT x.query('($q, ..., $q)') FROM t;
Hug from Nils
--
http://www.idi.ntnu.no/~nilsgri/
Why is this thus? What is the reason of this thusness? - Artemus Ward
_______________________________________________
MonetDB-users mailing list
MonetDB-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/monetdb-users

On Mon, Mar 16, 2009 at 03:45:27PM +0100, Lefteris wrote:
Hi,
I will try to answer some of your questions.
Thank you to both Lefteris and Stefan for very informative and helpful answers!
Hug from Nils
--
http://www.idi.ntnu.no/~nilsgri/
Why is this thus? What is the reason of this thusness? - Artemus Ward

On Mon, Mar 16, 2009 at 02:56:25PM +0100, Nils Grimsmo wrote:
Hi!
I have some questions as to how MonetDB/XQuery should be compared fairly to other systems.
If I re-run a query multiple times in a single call to `mclient`, is any calculation re-used? How about if I run multiple similar queries in a single call?
Example:
$ cat www.xq
count(doc("dblp")//www)
$ cat www_s10.xq
(count(doc("dblp")//www), count(doc("dblp")//www), count(doc("dblp")//www), count(doc("dblp")//www), count(doc("dblp")//www),
 count(doc("dblp")//www), count(doc("dblp")//www), count(doc("dblp")//www), count(doc("dblp")//www), count(doc("dblp")//www))
$ cat www_s100.xq
(count(doc("dblp")//www), ... count(doc("dblp")//www))
$ mclient --language=xquery --time < www.xq
11760
Timer 22.552 msec
(Assert we are running hot.)
$ mclient --language=xquery --time < www.xq
11760
Timer 21.661 msec
$ mclient --language=xquery --time < www_s10.xq
11760, [snip] 11760
Timer 33.063 msec
$ mclient --language=xquery --time < www_s100.xq
11760, [snip] 11760
Timer 252.414 msec
So the average execution times are 22, 3.3 and 2.5 milliseconds. Is the extra cost for the first query just starting up the client program, or is some calculation re-used?
No execution is reused between queries, but the Pathfinder compiler does common subexpression elimination per query; i.e., the 10 identical subqueries of query www_s10.xq (note: www_s10.xq is one query with ten identical subexpressions, not 10 single queries!) are executed only once.
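The effect of evaluating identical subexpressions only once can be illustrated with a toy evaluator. This is not Pathfinder's algorithm, which works on the compiled query plan; it is a memoised evaluation of a tiny expression tree, with a made-up four-element document standing in for dblp and hypothetical operator names.

```python
class Evaluator:
    """Evaluates tiny expression trees, computing each distinct subtree once."""

    def __init__(self, document):
        self.document = document  # list of (tag, text) pairs standing in for dblp
        self.cache = {}           # expression -> result already computed
        self.evaluations = 0      # number of subtrees actually evaluated

    def evaluate(self, expr):
        if expr in self.cache:    # identical subexpression seen before
            return self.cache[expr]
        self.evaluations += 1
        op = expr[0]
        if op == "step":          # ("step", tag): texts of all elements with that tag
            result = tuple(text for tag, text in self.document if tag == expr[1])
        elif op == "filter":      # ("filter", sub, value): keep texts equal to value
            result = tuple(t for t in self.evaluate(expr[1]) if t == expr[2])
        elif op == "count":       # ("count", sub): size of the sub-result
            result = len(self.evaluate(expr[1]))
        elif op == "seq":         # ("seq", e1, ..., en): the (Q1, ..., Qn) form
            result = tuple(self.evaluate(e) for e in expr[1:])
        else:
            raise ValueError(f"unknown operator: {op}")
        self.cache[expr] = result
        return result


document = [("author", "A"), ("author", "B"), ("author", "A"), ("www", "x")]
query = ("count", ("filter", ("step", "author"), "A"))
ten_copies = ("seq",) + (query,) * 10   # one query with ten identical subexpressions

ev = Evaluator(document)
result = ev.evaluate(ten_copies)
print(result, ev.evaluations)
```

The sequence yields ten results, but only four subtrees are evaluated (the sequence itself, one count, one filter, one step); the other nine copies are answered from the cache.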
If we now look at more expensive queries:
$ cat dblp_authors.xq
count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"])
Just repeating the same:
$ cat dblp_authors_s10.xq
(count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"]),
 count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"]),
 count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"]),
 count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"]),
 count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"]),
 count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"]),
 count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"]),
 count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"]),
 count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"]),
 count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"]),
 count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"]))
Different but related queries:
$ cat dblp_authors_x10.xq
(count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"]),
 count(doc("dblp")/dblp//author[text()="Wen Gao"]),
 count(doc("dblp")/dblp//author[text()="Irith Pomeranz"]),
 count(doc("dblp")/dblp//author[text()="Hector Garcia-Molina"]),
 count(doc("dblp")/dblp//author[text()="Moshe Y. Vardi"]),
 count(doc("dblp")/dblp//author[text()="Joseph Y. Halpern"]),
 count(doc("dblp")/dblp//author[text()="Noga Alon"]),
 count(doc("dblp")/dblp//author[text()="Wei Li"]),
 count(doc("dblp")/dblp//author[text()="Ming Li"]),
 count(doc("dblp")/dblp//author[text()="Donald F. Towsley"]))
$ mclient --language=xquery --time < dblp_authors.xq
351
Timer 1238.436 msec
$ mclient --language=xquery --time < dblp_authors.xq
351
Timer 1253.927 msec
$ mclient --language=xquery --time < dblp_authors_s10.xq
351, ... 351
Timer 1284.191 msec
$ mclient --language=xquery --time < dblp_authors_x10.xq
351, 347, 346, 341, 334, 334, 330, 320, 320, 317
Timer 2610.589 msec
Here the average times are 1238, 128 and 261 milliseconds. Here the difference is clearly not just startup of the client.
If this was not a client-server architecture, I would guess the difference came from opening files, getting stuff into cache, etc. Are there similar reasons here?
What parts of the calculations are actually done inside the client, if any? If the answer is none, why is this behavior seen?
Indeed, no calculation is done in the client; again, common subexpression elimination does its job:
dblp_authors_s10.xq: only one evaluation of 10 identical subexpressions is just as fast as evaluating a single identical expression (dblp_authors.xq);
dblp_authors_x10.xq: the common subexpression "doc("dblp")/dblp//author" of all 10 expressions in this one query is evaluated only once, obviously taking the major part of the query execution time; only the text value selections then differ per expression.
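A back-of-the-envelope split of the reported timings is consistent with the shared step dominating. The sketch below assumes, purely for illustration, that the total time is a shared part plus a fixed cost per text predicate; this model is not stated in the thread, and it lumps compile and start-up time into the shared part.

```python
single_msec = 1238.436  # dblp_authors.xq: one author predicate
ten_msec = 2610.589     # dblp_authors_x10.xq: ten different author predicates

# Assumed model (an illustration, not from the thread):
#   total = shared + n * per_predicate
per_predicate = (ten_msec - single_msec) / (10 - 1)
shared = single_msec - per_predicate
share_of_single = shared / single_msec

print(f"per predicate: {per_predicate:.2f} msec")
print(f"shared part:   {shared:.2f} msec ({share_of_single:.0%} of the single query)")
```

With these numbers the shared part comes out at about 1086 msec and each additional predicate at about 152 msec, so under this assumption close to 88% of the single-query time would be spent on work that the ten-expression query performs only once.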
In conclusion: When running multiple queries, what would be the most fair way to compare MonetDB/XQuery to other client/server architectures in your view? Concatenating the queries in a single call to `mclient`, or multiple calls?
Depends on whether you want to compare the cost per individual query or the cost of all queries as subexpressions in a single composed query (as you did). In the first case, run the queries in isolation, preferably still in the same mclient session to eliminate connection setup costs (unless these also occur with each query in the system(s) you plan to compare with), separating the queries by "<>"; e.g., instead of running one query with 10 subexpressions
"
(count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"]),
 count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"]),
 count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"]),
 count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"]),
 count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"]),
 count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"]),
 count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"]),
 count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"]),
 count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"]),
 count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"]),
 count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"]))
"
run 10 individual queries
"
count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"])
<>
count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"])
<>
count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"])
<>
count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"])
<>
count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"])
<>
count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"])
<>
count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"])
<>
count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"])
<>
count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"])
<>
count(doc("dblp")/dblp//author[text()="Grzegorz Rozenberg"])
"
When timing a single query, can it be repeated multiple times in a single call, and the average taken, without being unfair?
Sure. In fact, (working) common subexpression elimination and query result caching are FAIR techniques to speed up processing of multiple queries. The main point of being FAIR with performance experiments and comparisons is to document in detail how you run your experiments and measure your timings, and to report in detail all the behavior you observe. (see also: http://old-www.cwi.nl/htbin/ins1/publications?request=abstract&key=Ma:IS:08 http://old-www.cwi.nl/htbin/ins1/publications?request=abstract&key=MaMa:ICDE:08 ;-))
In the above case, I'd suggest running both versions, i.e., one query with multiple expressions "(Q1, Q2, ..., Qn)" as well as individual queries "Q1 <> Q2 <> ... <> Qn", and reporting and comparing the results, (now ;-)) knowing that Pathfinder performs per-query common subexpression elimination.
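Documenting how timings were taken is easier when the measurement procedure is itself a small script. The following is a minimal sketch of a hot-run timing loop; the function names, the number of warm-up and timed runs, and the placeholder workload are all assumptions for illustration and not something prescribed in the thread.

```python
import statistics
import time


def time_hot_runs(run, repeats=5, warmup=1):
    """Call run() `warmup` times untimed, then `repeats` times timed.

    Returns the list of wall-clock timings in milliseconds.
    """
    for _ in range(warmup):
        run()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def summarize(label, samples):
    """Format every number needed to report the experiment, not just a mean."""
    return (f"{label}: runs={len(samples)} "
            f"min={min(samples):.3f} median={statistics.median(samples):.3f} "
            f"max={max(samples):.3f} msec")


def placeholder_workload():
    """Stand-in for submitting one query; replace with a real client call."""
    sum(range(100_000))


samples = time_hot_runs(placeholder_workload, repeats=5, warmup=2)
print(summarize("placeholder", samples))
```

To time real queries, `run` could wrap a call that feeds a query file to `mclient --language=xquery` as in the transcripts above, once with the composed "(Q1, ..., Qn)" file and once with the '<>'-separated file. That wrapper is left out here because it depends on the local installation, and it would also include client start-up time in every sample.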
If I use for example MS SQL Server 2008, there is no gain from a single invocation of the client, whether I have multiple SQL statements
As pointed out above, in XQuery "(Q1,...,Qn)" is in fact ONE single (composed) query of n expressions that is evaluated as one query, NOT n individual queries.
Hope this helps --- don't hesitate to ask in case you have more questions or require more elaborate answers!
Stefan
SELECT x.query('$q') FROM t; ..., SELECT x.query('$q') FROM t;
Or a single SQL statement with a list of XPath queries
SELECT x.query('($q, ..., $q)') FROM t;
Hug from Nils
--
http://www.idi.ntnu.no/~nilsgri/
Why is this thus? What is the reason of this thusness? - Artemus Ward
--
| Dr. Stefan Manegold | mailto:Stefan.Manegold@cwi.nl |
| CWI, P.O.Box 94079  | http://www.cwi.nl/~manegold/  |
| 1090 GB Amsterdam   | Tel.: +31 (20) 592-4212       |
| The Netherlands     | Fax : +31 (20) 592-4312       |

On Mar 16, 2009, at 14:56, Nils Grimsmo wrote:
$ cat www.xq
count(doc("dblp")//www)
$ cat www_s10.xq
(count(doc("dblp")//www), count(doc("dblp")//www), count(doc("dblp")//www), count(doc("dblp")//www), count(doc("dblp")//www),
 count(doc("dblp")//www), count(doc("dblp")//www), count(doc("dblp")//www), count(doc("dblp")//www), count(doc("dblp")//www))
$ cat www_s100.xq
(count(doc("dblp")//www), ... count(doc("dblp")//www))
$ mclient --language=xquery --time < www.xq
11760
Timer 22.552 msec
(Assert we are running hot.)
$ mclient --language=xquery --time < www.xq
11760
Timer 21.661 msec
$ mclient --language=xquery --time < www_s10.xq
11760, [snip] 11760
Timer 33.063 msec
$ mclient --language=xquery --time < www_s100.xq
11760, [snip] 11760
Timer 252.414 msec
One last comment: In the above example you can observe the query compile time for detecting the common subexpressions.
Jan
--
Jan Rittinger
Lehrstuhl Datenbanken und Informationssysteme
Wilhelm-Schickard-Institut für Informatik
Eberhard-Karls-Universität Tübingen
http://www-db.informatik.uni-tuebingen.de/team/rittinger
participants (4)
- Jan Rittinger
- Lefteris
- Nils Grimsmo
- Stefan Manegold