
On 09/02/18 17:28, Roberto Cornacchia wrote:
Just wondering, before I do it myself, whether "normalizing" strings by removing diacritics is a feature that you guys would consider interesting to implement. (I am aware that results of this "normalization" can be unsatisfactory in some cases - I'm ok with most obvious cases covered)
Basically, something like what iconv does here:
$ echo "Ã é ç" | iconv -f UTF-8 -t ASCII//TRANSLIT
A e c
I can imagine that the cases above would be pretty easy to solve using the same hash lookup approach used for case conversion.
These, however, are examples that produce more than one symbol in the output, so they would be a little more complicated:
$ echo "æ ß" | iconv -f UTF-8 -t ASCII//TRANSLIT
ae ss
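To make the lookup idea concrete, here is a minimal sketch in Python rather than in MonetDB's C code. The table name TRANSLIT and the function strip_diacritics_lookup are hypothetical, and the table holds only the five characters from the examples above. It is meant to show that a table keyed on single code points can still emit more than one output character; it is not a proposal for the actual implementation.

```python
# Hypothetical lookup table: one code point in, one or more ASCII characters out.
# Only the five characters from the examples are covered.
TRANSLIT = {
    ord("Ã"): "A",
    ord("é"): "e",
    ord("ç"): "c",
    ord("æ"): "ae",   # one input code point, two output characters
    ord("ß"): "ss",
}


def strip_diacritics_lookup(text: str) -> str:
    """Replace each mapped code point; leave everything else untouched."""
    return text.translate(TRANSLIT)


print(strip_diacritics_lookup("Ã é ç æ ß"))
```

Since the output can be longer than the input, a C version would presumably need to size its result buffer for the worst-case expansion rather than reuse the input length.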
So I have two questions:
1) Is this something you guys (MonetDB developers) would be interested in implementing?
2) If not, I'll give it a go myself. In that case, would you advise an approach similar to the case conversion, or using the iconv library?
Thanks for your input.
Roberto
It's not something I've thought much about, but I think if you (or anybody) are going to implement this, use iconv. Unicode is a bit more complicated than you may realize. Many characters with diacritics can occur in at least two different forms: as a single Unicode code point (like the ones you used), or as two (or more) code points of which the second (and subsequent) are non-spacing (aka combining).
-- Sjoerd Mullender
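The two forms mentioned above can be illustrated with a short Python sketch using the standard unicodedata module. The function name strip_combining is hypothetical. This is an alternative to iconv shown only to make the point visible, and it is not what iconv does internally: it decomposes the text (NFD) and drops the non-spacing marks.

```python
import unicodedata

precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # e followed by U+0301 COMBINING ACUTE ACCENT


def strip_combining(text: str) -> str:
    """Decompose to NFD, then drop non-spacing (category Mn) code points."""
    nfd = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in nfd if unicodedata.category(ch) != "Mn")


# The two forms render the same but are different code point sequences.
print(precomposed == decomposed)
print(len(precomposed), len(decomposed))

# Both reduce to the same base letter once the combining mark is removed.
print(strip_combining(precomposed), strip_combining(decomposed))
```

A lookup table keyed only on precomposed code points would miss the second form. Conversely, this decomposition approach leaves characters such as æ and ß unchanged, because they have no canonical decomposition.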