[MonetDB-users] SQL text normalisation principles in relation to MonetDB
Hi,

I have a rather general question about normalisation in MonetDB in relation to the underlying storage model.

Since my first courses in SQL (on PostgreSQL at the time), the strong advice was to always normalise duplicated data, because looking up numerical keys would always beat string comparisons in search performance. Given that in MonetDB text strings are deduplicated on a best-effort basis, and thus effectively become numbers, how does this compare to the additional cost of:
- enums
- foreign key relations

For example, some database schemas use enums (or worse: varchars) for two values, say 'accessible' and 'inaccessible'. Clearly a BOOLEAN NOT NULL field would suffice here, with a storage size of one bit in the best case. Since MonetDB already optimises the string data internally, would it cost additional time to create a secondary table to join against, versus simply using a string field?

Thus: does normalisation for 'manual' deduplication hurt, or not?

Stefan
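To make the alternatives concrete, a minimal SQL sketch of the schema variants in question (table and column names are purely illustrative, not from any real schema):

    -- Variant A: repeat the string value in every row
    CREATE TABLE resource_a (
        id     INTEGER PRIMARY KEY,
        status VARCHAR(16) NOT NULL   -- 'accessible' or 'inaccessible'
    );

    -- Variant B: normalise "by hand" into a lookup table plus foreign key
    CREATE TABLE status_dict (
        status_id SMALLINT PRIMARY KEY,
        status    VARCHAR(16) NOT NULL UNIQUE
    );
    CREATE TABLE resource_b (
        id        INTEGER PRIMARY KEY,
        status_id SMALLINT NOT NULL REFERENCES status_dict (status_id)
    );

    -- Variant C: for a field with exactly two values, a boolean suffices
    CREATE TABLE resource_c (
        id         INTEGER PRIMARY KEY,
        accessible BOOLEAN NOT NULL
    );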
On Wed, Jul 14, 2010 at 03:17:52PM +0200, Stefan de Konink wrote:
Since my first courses in SQL (on PostgreSQL at the time), the strong advice was to always normalise duplicated data, because looking up numerical keys would always beat string comparisons in search performance.
I can't figure out whether you're actually asking about normalisation or, what sounds more likely to me, about natural vs. surrogate keys. You seem to mention aspects of both, and that's confusing things for me. What to use for keys depends very much on the use case, and there aren't any hard and fast rules here. I *personally* prefer to use natural keys wherever possible, but it depends a lot on the use case, and sometimes surrogate keys are a better fit for the problem domain. I certainly used to use surrogate keys more often than I do now.

--
Sam  http://samason.me.uk/
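To illustrate the natural vs. surrogate key distinction, a hypothetical example (not from the thread):

    -- Natural key: a real-world attribute identifies the row
    CREATE TABLE country_nat (
        iso_code CHAR(2) PRIMARY KEY,      -- e.g. 'NL'
        name     VARCHAR(64) NOT NULL
    );

    -- Surrogate key: an artificial identifier stands in for it
    CREATE TABLE country_sur (
        country_id INTEGER PRIMARY KEY,    -- meaningless outside the database
        iso_code   CHAR(2) NOT NULL UNIQUE,
        name       VARCHAR(64) NOT NULL
    );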
On Wed, Jul 14, 2010 at 04:13:54PM +0100, Sam Mason wrote:
On Wed, Jul 14, 2010 at 03:17:52PM +0200, Stefan de Konink wrote:
Since my first courses in SQL (on PostgreSQL at the time), the strong advice was to always normalise duplicated data, because looking up numerical keys would always beat string comparisons in search performance.
I can't figure out whether you're actually asking about normalisation or, what sounds more likely to me, about natural vs. surrogate keys. You seem to mention aspects of both, and that's confusing things for me. What to use for keys depends very much on the use case, and there aren't any hard and fast rules here. I *personally* prefer to use natural keys wherever possible, but it depends a lot on the use case, and sometimes surrogate keys are a better fit for the problem domain. I certainly used to use surrogate keys more often than I do now.
As far as I understand Stefan d.K.'s description and question, the point is neither normalisation (at least not as it is defined in database theory: with only one (string) column, there are no functional dependencies and hence no need, nor chance, for normalisation) nor keys, but rather (something like) dictionary encoding, and the question whether to rely on what the underlying DBMS (MonetDB) does automatically, or rather to do it "by hand" in the schema design (and query formulation).

To be honest, I cannot answer the "ultimate" question "Thus, does normalisation for 'manual' deduplication hurt, or not?" for Stefan d.K.'s particular case. If "hurt" refers to performance, there are too many unknown parameters that play a role; if "hurt" refers to the added complexity of the schema and hence the queries, it depends on the user's SQL skills and "pain threshold".

However, in general (assuming "normalisation" indeed refers to "dictionary encoding"): if MonetDB's "best-effort" duplicate elimination, and hence effectively dictionary encoding, works well (this highly depends on your actual data, i.e., the actual values and value distribution of your string column), then MonetDB will store each string value exactly once, plus only the respective index per tuple (which can be as small as 1 byte if there are only very few distinct string values). In that case, I would not be surprised if the automatic, transparent, internal dictionary encoding works more efficiently (in terms of both query performance and schema/query complexity) than doing it in SQL "by hand". But to be completely sure for the very case at hand, only a proper, realistic experiment will give the real answer.

Stefan M.

--
| Dr. Stefan Manegold | mailto:Stefan.Manegold@cwi.nl |
| CWI, P.O.Box 94079  | http://www.cwi.nl/~manegold/  |
| 1090 GB Amsterdam   | Tel.: +31 (20) 592-4212       |
| The Netherlands     | Fax : +31 (20) 592-4199       |
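A sketch of such an experiment, assuming the illustrative resource_a / resource_b / status_dict tables from the first message, loaded with identical data:

    -- Query the plain string column; MonetDB's internal
    -- dictionary encoding deduplicates the stored strings.
    SELECT COUNT(*) FROM resource_a WHERE status = 'accessible';

    -- The same question against the hand-normalised schema
    -- costs one extra join.
    SELECT COUNT(*)
    FROM resource_b AS r
    JOIN status_dict AS d ON r.status_id = d.status_id
    WHERE d.status = 'accessible';

Timing both on data with the actual values and value distribution is the only way to know which variant wins for a given workload.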
participants (3)
- Sam Mason
- Stefan de Konink
- Stefan Manegold