Grapheme clusters, a.k.a. real characters
On Fri, 21 Jul 2017 01:43 pm, Chris Angelico wrote:
> Strings with all code
> points on the BMP and no combining characters are still able to be
> represented as they are today, again with the empty secondary array.
I presume that since the problem we're trying to solve here is that certain
characters have two representations, this format will automatically decompose
strings. Otherwise, it doesn't really solve the problems with diacritics, where
a single human-readable character like ? or ? has two distinct, and non-equal,
But if it does, then every string with a diacritic (i.e. most Western European
text, if not Eastern European as well) will need combining characters.
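To make the two-representations problem concrete, here is a minimal sketch using the standard library's unicodedata module. The helper names same_text and combining_count are made up for illustration; nothing here is part of the proposed string format.

```python
import unicodedata

# The same human-readable character, written two ways.
composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT, two code points

def same_text(a, b):
    """Compare two strings after normalising both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

def combining_count(s):
    """Count combining code points once the string is fully decomposed (NFD)."""
    decomposed_form = unicodedata.normalize("NFD", s)
    return sum(1 for ch in decomposed_form if unicodedata.combining(ch))

print(composed == decomposed)           # plain comparison sees different code points
print(same_text(composed, decomposed))  # equal once both are normalised
print(combining_count("caf\u00e9"))     # decomposing introduces a combining character
```

If a format decomposed automatically, the last call shows the cost: a word that held no combining characters in its composed form gains one as soon as it is decomposed.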
If this *doesn't* decompose the strings, then what problem is it actually
> The presence of a single combining character in the string does force
> it to be stored 32 bits per character, so there can be a price to pay.
Right -- so it's really compact for Americans, and blows out for just about
> Similarly, the secondary array will only VERY rarely need to contain
> any pointers; most combined characters consist of a base and one
> combining, or a set of three characters at most.
I don't know if you can make that claim for non-West European languages. I don't
know enough about (for example) Slavic languages, or Thai, or Arabic, or
Chinese, to know whether (base + three combining characters) will be rare or
But emoji sequences will often require four code points, three of which will be
in the supplementary planes.
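As a rough check of that count, the sketch below takes one such sequence, U+1F469 U+1F3FD U+200D U+1F52C, and counts how many of its code points fall outside the BMP. The function name supplementary_count is hypothetical, and the names printed depend on the Unicode database shipped with your Python.

```python
import unicodedata

# Woman + skin tone modifier + ZERO WIDTH JOINER + microscope.
woman_scientist = "\U0001F469\U0001F3FD\u200D\U0001F52C"

def supplementary_count(s):
    """Count code points above U+FFFF, i.e. outside the BMP (hypothetical helper)."""
    return sum(1 for ch in s if ord(ch) > 0xFFFF)

# List each code point with its name, falling back if the database lacks one.
for ch in woman_scientist:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))

print(len(woman_scientist))                  # code points in the sequence
print(supplementary_count(woman_scientist))  # how many are in the supplementary planes
```

Only the joiner, U+200D, sits on the BMP, so a single user-perceived character here is four code points, three of them supplementary.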
> There'll be dramatic
> performance costs for strings where piles of combining characters get
> loaded on top of a single base, but at least they can be accurately
They can be accurately represented right now. E.g. there is nothing ambiguous or
inaccurate about U+1F469 U+1F3FD U+200D U+1F52C, "woman scientist with medium
"Cheer up," they said, "things could be worse." So I cheered up, and sure
enough, things got worse.