Grapheme clusters, a.k.a. real characters
On Wed, 19 Jul 2017 12:09 am, Random832 wrote:
> On Fri, Jul 14, 2017, at 08:33, Chris Angelico wrote:
>> What do you mean about regular expressions? You can use REs with
>> normalized strings. And if you have any valid definition of "real
>> character", you can use it equally on an NFC-normalized or
>> NFD-normalized string than any other. They're just strings, you know.
> I don't understand how normalization is supposed to help with this. It's
> not like there aren't valid combinations that do not have a
> corresponding single NFC codepoint (to say nothing of the situation with
> e.g. Indic languages).
Normalisation helps. Suppose you want to search for é, for example. A naive
regular expression engine will only find the exact representation you or your
editor happened to use, either
U+00E9 LATIN SMALL LETTER E WITH ACUTE
or
U+0065 LATIN SMALL LETTER E + U+0301 COMBINING ACUTE ACCENT
but not both. By normalising, you ensure that both the text you are searching
and the regex you are searching for are in the same state: either composed to a
single code point, U+00E9, or decomposed to two, U+0065 U+0301, but never one in
one state and the other in the other.
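A minimal sketch of that idea in Python, using the standard unicodedata and
re modules. The helper name normalised_search is made up for illustration,
and it assumes the pattern is a plain literal: normalising a pattern that
contains character classes or other regex syntax may change what it matches.

```python
import re
import unicodedata

composed = "caf\u00e9"      # 'é' as the single code point U+00E9
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

def normalised_search(pattern, text, form="NFC"):
    """Hypothetical helper: put pattern and text in the same form, then search."""
    pattern = unicodedata.normalize(form, pattern)
    text = unicodedata.normalize(form, text)
    return re.search(pattern, text)

# Without normalisation, the composed pattern misses the decomposed text.
naive_hit = re.search("\u00e9", decomposed)

# With both sides normalised, either representation is found, whichever form is used.
nfc_hits = [normalised_search("\u00e9", s) is not None
            for s in (composed, decomposed)]
nfd_hits = [normalised_search("\u00e9", s, "NFD") is not None
            for s in (composed, decomposed)]
```

Which form you pick matters less than using the same one on both sides.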
For characters that have no canonical composed form, there's no problem: you
will always be searching for a decomposed character, a base character followed
by combining characters, so there is no discrepancy and it will just work.
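As a rough illustration, take "q" with an acute accent, which as far as I know
has no precomposed code point. Both normalisation forms leave it as two code
points. Normalisation still earns its keep when there is more than one
combining mark, since it also puts the marks into a canonical order:

```python
import unicodedata

# 'q' + U+0301 COMBINING ACUTE ACCENT: no single composed code point exists,
# so NFC and NFD both leave the two-code-point sequence alone.
q_acute = "q\u0301"
nfc = unicodedata.normalize("NFC", q_acute)
nfd = unicodedata.normalize("NFD", q_acute)

# Two marks typed in different orders (U+0323 COMBINING DOT BELOW and U+0301)
# are canonically equivalent; normalising gives both the same order.
acute_then_dot = unicodedata.normalize("NFC", "q\u0301\u0323")
dot_then_acute = unicodedata.normalize("NFC", "q\u0323\u0301")
```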
> In principle probably a viable solution for regex would be to add
> character classes for base and combining characters, and then
> "[[:base:]][[:combining:]]*" can be used as a building block if
I don't know what that means.
Any code point (except for combining characters themselves) can be used as the
base, and the various kinds of combining characters have the Unicode category
Mn (Mark, nonspacing)
Mc (Mark, spacing combining)
Me (Mark, enclosing)
If we're talking about combining accents and diacritics, the one we want is Mn.
But generally, we're not after "any old diacritic", we're after a specific one,
on a specific base.
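For what it's worth, here is a small sketch of the "base followed by combining
characters" building block using unicodedata.category. The function names
is_mark and base_plus_marks are invented for this example, and this is only
the simple base-plus-marks idea, not the full grapheme cluster rules from the
Unicode standard:

```python
import unicodedata

def is_mark(ch):
    """True for code points in the Mn, Mc or Me general categories."""
    return unicodedata.category(ch) in ("Mn", "Mc", "Me")

def base_plus_marks(text):
    """Split text into chunks: one base code point plus any marks after it."""
    chunks = []
    for ch in text:
        if chunks and is_mark(ch):
            # A mark attaches to the chunk started by the preceding base.
            chunks[-1] += ch
        else:
            # Anything else (or a mark with nothing before it) starts a chunk.
            chunks.append(ch)
    return chunks

# U+0301 COMBINING ACUTE ACCENT is a nonspacing mark.
category_of_acute = unicodedata.category("\u0301")

chunks = base_plus_marks("cafe\u0301 q\u0301")
```

Here chunks comes out as six items, with the e and the q each keeping their
accent. To look for a specific diacritic on a specific base, you would compare
a chunk against the exact (normalised) sequence you want.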
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.