On Wed, Jul 14, 2010 at 04:13:54PM +0100, Sam Mason wrote:
On Wed, Jul 14, 2010 at 03:17:52PM +0200, Stefan de Konink wrote:
Since my first courses in SQL (that was PostgreSQL at the time), it was strong advice to always try to normalise duplicated data, because looking up numerical keys would always give better search performance than comparing strings.
I can't figure out whether you're actually asking about normalisation or, what sounds more likely to me, about natural vs. surrogate keys. You seem to mention aspects of both and that's confusing things for me. What to use for keys depends very much on the use case and there aren't any hard and fast rules here. I *personally* prefer to use natural keys wherever possible, but it depends a lot on the use case and sometimes surrogate keys are a better fit for the problem domain. I certainly used to use surrogate keys more often than I do now.
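To make the distinction concrete, a minimal sketch of the two options (table and column names are invented for illustration, not taken from the original question):

    -- Natural key: the real-world value itself identifies the row.
    CREATE TABLE country_natural (
        iso_code CHAR(2) PRIMARY KEY,    -- e.g. 'NL', 'GB'
        name     VARCHAR(64) NOT NULL
    );

    -- Surrogate key: an artificial integer identifies the row; the
    -- natural value is kept as an ordinary (still unique) attribute.
    CREATE TABLE country_surrogate (
        country_id INT PRIMARY KEY,
        iso_code   CHAR(2) UNIQUE NOT NULL,
        name       VARCHAR(64) NOT NULL
    );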
As far as I understand Stefan d.K.'s description and question, the point is neither normalization (at least not as it is defined in database theory --- with only one (string) column, there are no functional dependencies and hence no need / chance for normalization) nor keys, but rather (something like) dictionary encoding, and the question whether to rely on what the underlying DBMS (MonetDB) does automatically or rather to do it "by hand" in the schema design (and query formulation).

To be honest, I cannot answer the "ultimate" question "Thus, does normalisation for 'manual' deduplication, hurt or not?" for Stefan d.K.'s particular case --- if "hurt" refers to performance, there are too many unknown parameters that play a role; if "hurt" refers to the added complexity of the schema and hence the queries, it depends on the user's SQL skills and "pain threshold".

However, in general --- assuming "normalisation" indeed refers to "dictionary encoding" --- if MonetDB's "best-effort" duplicate elimination (and hence effectively dictionary encoding) works fine (this highly depends on your actual data, i.e., the actual values and value distribution of your string column), then MonetDB will store each string value exactly once, plus only the respective index (which can be as small as 1 byte if there are only very few distinct string values) per tuple. In that case, I would not be surprised if the automatic, transparent internal dictionary encoding works more efficiently (in terms of both query performance and schema/query complexity) than doing it in SQL "by hand". But to be completely sure for the very case at hand, only a proper realistic experiment will give the real answer.

Stefan M.

--
| Dr. Stefan Manegold | mailto:Stefan.Manegold@cwi.nl |
| CWI, P.O.Box 94079  | http://www.cwi.nl/~manegold/  |
| 1090 GB Amsterdam   | Tel.: +31 (20) 592-4212       |
| The Netherlands     | Fax : +31 (20) 592-4199       |
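For readers unfamiliar with the "by hand" variant that Stefan M. contrasts with MonetDB's automatic string deduplication, a minimal sketch (table and column names are invented for illustration; the relative performance of the two variants is exactly what his suggested experiment would have to measure):

    -- Variant 1: plain string column; MonetDB's best-effort duplicate
    -- elimination may store each distinct string once internally and
    -- keep only a small per-tuple offset.
    CREATE TABLE events_plain (
        tag VARCHAR(64)
    );

    -- Variant 2: "manual" dictionary encoding in the schema: a lookup
    -- table holds each distinct string once, the fact table stores only
    -- an integer key, and queries need an extra join.
    CREATE TABLE tag_dict (
        tag_id INT PRIMARY KEY,
        tag    VARCHAR(64) UNIQUE
    );
    CREATE TABLE events_encoded (
        tag_id INT REFERENCES tag_dict (tag_id)
    );

    -- The encoded variant pays a join for every string-level predicate.
    SELECT COUNT(*) FROM events_plain WHERE tag = 'error';
    SELECT COUNT(*)
      FROM events_encoded e
      JOIN tag_dict d ON d.tag_id = e.tag_id
     WHERE d.tag = 'error';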