Re: Diacritic removal

9 Feb 2018


On 09/02/18 17:28, Roberto Cornacchia wrote:
...
Just wondering, before I do it myself, whether "normalizing" strings by
removing diacritics is a feature that you guys would consider
interesting to implement.
(I am aware that results of this "normalization" can be unsatisfactory
in some cases - I'm ok with most obvious cases covered)
Basically, something like what iconv does here:
$ echo "Ã é ç" | iconv -f UTF-8 -t ASCII//TRANSLIT
A e c
I can imagine that the cases above would be pretty easy to handle using the
same hash lookup approach used for case conversion.
These, however, are examples that produce more than one symbol in the output,
so they would be a little more complicated.
$ echo "æ ß" | iconv -f UTF-8 -t ASCII//TRANSLIT
ae ss
So I have two questions: 
1) Is this something you guys (MonetDB developers) would be interested
in implementing?
2) If not, in which case I'll give it a go myself, would you advise an
approach similar to the case conversion, or using the iconv library?
Thanks for your input.
Roberto

It's not something I've thought much about, but I think if you (or
anybody) are going to implement this, use iconv.  Unicode is a bit more
complicated than you may realize.  Many characters with diacritics can
occur in at least two different forms: as a single Unicode code point
(like the ones you used), or as two or more code points of which the
second and subsequent ones are non-spacing (aka combining).
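
To illustrate the two forms, here is a minimal Python sketch. It is not
MonetDB code, and the helper name strip_diacritics is made up for this
example. It relies on the standard unicodedata module: NFD normalization
splits precomposed characters into a base character plus combining marks,
and the combining marks are then dropped.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Hypothetical helper: remove combining marks after NFD decomposition."""
    # NFD turns a precomposed character into base character + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


precomposed = "\u00e9"   # e with acute accent as a single code point
combined = "e\u0301"     # "e" followed by COMBINING ACUTE ACCENT

# The two strings render the same but are not equal code point by code point.
print(precomposed == combined)                    # False
# Both forms reduce to the same plain letter.
print(strip_diacritics(precomposed))              # e
print(strip_diacritics(combined))                 # e
# A-tilde, e-acute, c-cedilla (the first iconv example in the question).
print(strip_diacritics("\u00c3 \u00e9 \u00e7"))   # A e c
# ae ligature and sharp s have no canonical decomposition, so they pass
# through unchanged; "ae" / "ss" would need a lookup table or iconv.
print(strip_diacritics("\u00e6 \u00df") == "\u00e6 \u00df")  # True
```

Note that this only covers characters that decompose into a base letter
plus marks. The multi-symbol cases from the question are not handled by
decomposition alone, which is one more reason to lean on iconv.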

-- 
Sjoerd Mullender