On Wed, Jul 14, 2010 at 04:13:54PM +0100, Sam Mason wrote:
On Wed, Jul 14, 2010 at 03:17:52PM +0200, Stefan de Konink wrote:
Since my first courses in SQL (that was PostgreSQL at the time), it was strong advice to always try to normalise duplicated data, because looking up numerical keys would always give better search performance than comparing strings.
I can't figure out whether you're actually asking about normalisation or, what sounds more likely to me, about natural vs. surrogate keys. You seem to mention aspects of both and that's confusing things for me. What to use for keys depends very much on the use case and there aren't any hard and fast rules here. I *personally* prefer to use natural keys wherever possible, but it depends a lot on the use case and sometimes surrogate keys are a better fit for the problem domain. I certainly used to use surrogate keys more often than I do now.
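To make the distinction concrete, a minimal sketch of the two options (table and column names are invented for illustration, not taken from the original question):

    -- Natural key: the real-world value itself identifies the row.
    CREATE TABLE country_natural (
        iso_code CHAR(2) PRIMARY KEY,    -- e.g. 'NL', 'GB'
        name     VARCHAR(64) NOT NULL
    );

    -- Surrogate key: an artificial integer identifies the row; the
    -- natural value is kept as an ordinary (still unique) attribute.
    CREATE TABLE country_surrogate (
        country_id INT PRIMARY KEY,
        iso_code   CHAR(2) UNIQUE NOT NULL,
        name       VARCHAR(64) NOT NULL
    );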
As far as I understand Stefan d.K.'s description and question, the point is neither normalization (at least not as it is defined in database theory --- with only one (string) column, there are no functional dependencies and hence no need / chance for normalization) nor keys, but rather (something like) dictionary encoding, and the question whether to rely on what the underlying DBMS (MonetDB) does automatically or rather to do it "by hand" in the schema design (and query formulation).

To be honest, I cannot answer the "ultimate" question "Thus, does normalisation for 'manual' deduplication, hurt or not?" for Stefan d.K.'s particular case --- if "hurt" refers to performance, there are too many unknown parameters that play a role; if "hurt" refers to the added complexity of the schema and hence the queries, it depends on the user's SQL skills and "pain threshold".

However, in general --- assuming "normalisation" indeed refers to "dictionary encoding" --- if MonetDB's "best-effort" duplicate elimination (and hence effectively dictionary encoding) works fine (this highly depends on your actual data, i.e., the actual values and value distribution of your string column), then MonetDB will store each string value exactly once, plus only the respective index (which can be as small as 1 byte if there are only very few distinct string values) per tuple. In that case, I would not be surprised if the automatic, transparent internal dictionary encoding works more efficiently (in terms of both query performance and schema/query complexity) than doing it in SQL "by hand". But to be completely sure for the very case at hand, only a proper realistic experiment will give the real answer.

Stefan M.

--
| Dr. Stefan Manegold | mailto:Stefan.Manegold@cwi.nl |
| CWI, P.O.Box 94079  | http://www.cwi.nl/~manegold/  |
| 1090 GB Amsterdam   | Tel.: +31 (20) 592-4212       |
| The Netherlands     | Fax : +31 (20) 592-4199       |
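For readers unfamiliar with the "by hand" variant that Stefan M. contrasts with MonetDB's automatic string deduplication, a minimal sketch (table and column names are invented for illustration; the relative performance of the two variants is exactly what his suggested experiment would have to measure):

    -- Variant 1: plain string column; MonetDB's best-effort duplicate
    -- elimination may store each distinct string once internally and
    -- keep only a small per-tuple offset.
    CREATE TABLE events_plain (
        tag VARCHAR(64)
    );

    -- Variant 2: "manual" dictionary encoding in the schema: a lookup
    -- table holds each distinct string once, the fact table stores only
    -- an integer key, and queries need an extra join.
    CREATE TABLE tag_dict (
        tag_id INT PRIMARY KEY,
        tag    VARCHAR(64) UNIQUE
    );
    CREATE TABLE events_encoded (
        tag_id INT REFERENCES tag_dict (tag_id)
    );

    -- The encoded variant pays a join for every string-level predicate.
    SELECT COUNT(*) FROM events_plain WHERE tag = 'error';
    SELECT COUNT(*)
      FROM events_encoded e
      JOIN tag_dict d ON d.tag_id = e.tag_id
     WHERE d.tag = 'error';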