Re: [MonetDB-users] Performance issues with queries in SQL

2 May 2008

      Niels Nes wrote:
...
On Wed, Apr 30, 2008 at 03:58:11PM +0200, Nils Grimsmo wrote:
...
I have tables with around 12 million rows and large primary keys.  I 
find the performance for various queries to be rather strange.
* Why is Q2 (see below) so much slower than Q1?  It is actually faster 
for me to use Q1 and do the rest of the matching myself.
* Why is Q3 so immensely much faster than Q1?  I understand that the 
first part of the primary key is selective in Q3.  Do I understand right 
if Q1 scans until "type" matches, then scans until "name" matches, then 
scans until "value" matches?  No indexes are put on disk, right?  Then 
the matching for Q3 should be scan until "value" matches, which 
shouldn't be more than at most three times faster, right (if three 
columns are read instead of one for Q1)?  I do Q1 twice to look for cash 
effects, but none are seen...  Could some columns from the table "value" 
be cached and kept, without giving room for those from "type_name_value"?
* Q4 has the same performance as Q1, as expected.
* Why isn't Q5 faster than Q6? 82% of the rows have type=3.  Fewer 
should be scanned to find that "!!foo" doesn't exist than for "~~foo", 
as "!" has a low ASCII value, while "~" is high. (See COUNT queries at 
the bottom.)
The performance differences are related to the order in
which the where conditions are applied. Please send us also
the table create statement, then we can use 'explain' to 
see which order is taken. You could also run your queries
prefixed with 'trace' to see where the time goes.
Been playing around a bit, and it seems to me like SELECTs are very slow 
unless the first part of the PRIMARY KEY is selective.

ALTER TABLE base \
         ADD PRIMARY KEY(doc, nenc, pos, complete)

CREATE TABLE base2 \
         AS SELECT nenc,doc,pos,type,name,pathID,complete,value \
         FROM base \
         ORDER BY nenc,doc,pos,complete WITH DATA

ALTER TABLE base2 \
         ADD PRIMARY KEY (nenc,doc,pos,complete)

SELECT b.value \
         FROM base AS b \
         WHERE doc = 1 AND b.nenc = '1.1.1.1';
Timer   27344.739 msec

SELECT b.value \
         FROM _tmp2 AS b \
         WHERE b.nenc = '1.1.1.1' AND doc = 1;
Timer       0.773 msec

The first query takes more than 35'000 times as long.  All rows have 
doc=1.  If no indexes are kept on disk, I do not understand why the 
first query should take more than twice as long as the second.

Here is a query on base2 using LIKE instead.

SELECT b.value \
         FROM _tmp2 AS b \
         WHERE b.nenc LIKE '1.1.1.1%' AND doc = 1;
Timer    1538.009 msec

As the condition is a proper prefix, I also don't understand why this is 
2000 times slower, if no indexes are involved.  (The number of results 
is the same.)

According to Martin.Kersten@cwi.nl (in Message-id 
<4806471A.4090509@cwi.nl>), "Hash indices are used when needed and 
maintained under update. However, they are never saved on disk."  Does 
this mean a hash index is never used in a SELECT?  Or could it be just 
kept in memory if the table is not evicted?  I could not find an answer 
in the Inner Core documentation.

Klem fra Nils