On Fri, 9 Feb 2018 at 17:45 Sjoerd Mullender <sjoerd@acm.org> wrote:

On 09/02/18 17:28, Roberto Cornacchia wrote:
> Just wondering, before I do it myself, whether "normalizing" strings by
> removing diacritics is a feature that you guys would consider
> interesting to implement.
> (I am aware that results of this "normalization" can be unsatisfactory
> in some cases - I'm ok with most obvious cases covered)
>
> Basically, something like what iconv does here:
>
> $ echo "Ã é ç" | iconv -f UTF-8 -t ASCII//TRANSLIT
> A e c
>
> I can imagine that the cases above would be pretty easy to handle using the
> same hash lookup approach used for case conversion.
>
>
> These, however, are examples that produce more than one symbol in the
> output, so they would be a little more complicated.
>
> $ echo "æ ß" | iconv -f UTF-8 -t ASCII//TRANSLIT
> ae ss
>
>
> So I have two questions:
> 1) Is this something you guys (MonetDB developers) would be interested
> in implementing?
> 2) If not, I'll give it a go myself. In that case, would you advise an
> approach similar to the case conversion, or using the iconv lib?
>
> Thanks for your input.
> Roberto

It's not something I've thought much about, but I think if you (or
anybody) are going to implement this, use iconv. Unicode is a bit more
complicated than you may realize. For characters with diacritics, many
can occur in at least two different forms, as single Unicode code points
(like the ones you used), or as two (or more) code points of which the
second (and subsequent) are non-spacing (aka combining).
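
To illustrate the two forms, here is a minimal sketch in Python (not MonetDB code, and not what iconv does internally). It uses the standard unicodedata module to decompose each character (NFD) and then drops the combining code points, so both forms end up as the same base letter. The function name strip_diacritics and the MULTI_CHAR table are hypothetical, and the table is deliberately incomplete: characters such as "æ" and "ß" have no canonical decomposition, so they would need an explicit mapping like this one.

```python
import unicodedata

# Hypothetical, incomplete table for characters that have no canonical
# decomposition and therefore survive NFD unchanged.
MULTI_CHAR = {"æ": "ae", "Æ": "AE", "ß": "ss"}


def strip_diacritics(text: str) -> str:
    """Decompose to NFD, drop combining marks, apply the extra table."""
    decomposed = unicodedata.normalize("NFD", text)
    out = []
    for ch in decomposed:
        # combining() returns a non-zero class for non-spacing marks
        if unicodedata.combining(ch):
            continue
        out.append(MULTI_CHAR.get(ch, ch))
    return "".join(out)


precomposed = "\u00e9"   # "é" as a single code point
combining = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT

# The two strings look the same but are different code point sequences.
print(precomposed == combining)
# After stripping, both forms reduce to the same base letter.
print(strip_diacritics(precomposed) == strip_diacritics(combining))
print(strip_diacritics("Ã é ç"))
print(strip_diacritics("æ ß"))
```

A plain per-code-point lookup table, like the one used for case conversion, would only catch the precomposed form unless the input is normalized first, which is the complication described above.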

--
Sjoerd Mullender
