Grapheme clusters, a.k.a. real characters


On Thu, Jul 20, 2017 at 1:45 AM, Marko Rauhamaa <marko at pacujo.net> wrote:
> So let's assume we will expand str to accommodate the requirements of
> grapheme clusters.
>
> All existing code would still produce only traditional strings. The only
> way to introduce the new "super code points" is by invoking the
> str.canonical() method:
>
>     text = "hyv?? y?t?".canonical()
>
> In this case text would still be a fully traditional string because both
> ä and ö are represented by a single code point in NFC. However:
>
>     >>> q = unicodedata.normalize("NFC", "aq̈u")
>     >>> len(q)
>     4
>     >>> text = q.canonical()
>     >>> len(text)
>     3
>     >>> text[0]
>     "a"
>     >>> text[1]
>     "q̈"
>     >>> text[2]
>     "u"
>     >>> q2 = unicodedata.normalize("NFC", text)
>     >>> len(q2)
>     4
>     >>> text.encode()
>     b'aq\xcc\x88u'
>     >>> q.encode()
>     b'aq\xcc\x88u'

Ahh, I see what you're looking at. This is fundamentally very similar
to what was suggested a few hundred posts ago: a function in the
unicodedata module which yields a string's combined characters as
units. So you only see this when you actually want it, and the process
of creating it is a form of iterating over the string.
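
Roughly, a sketch like the following would do it (simplified: it just
folds combining marks, i.e. anything with a nonzero combining class,
into the preceding base character; real extended grapheme clusters per
UAX #29 also handle ZWJ emoji sequences, Hangul jamo, regional
indicators and so on; the name iter_graphemes is mine, for the sake of
argument):

    import unicodedata

    def iter_graphemes(s):
        # Attach each combining mark to the base character before it.
        # Only an approximation of the UAX #29 grapheme cluster rules.
        cluster = ""
        for ch in s:
            if cluster and unicodedata.combining(ch) == 0:
                yield cluster
                cluster = ch
            else:
                cluster += ch
        if cluster:
            yield cluster

    >>> list(iter_graphemes("aq\u0308u"))
    ['a', 'q̈', 'u']

That gives you three "characters" for Marko's "aq̈u" example while
len() of the underlying str stays 4, and indexing the resulting list
gets you the same text[1] == "q̈" behaviour, without a new string type.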

This could easily be done, as a class or function in unicodedata,
without any language-level support. It might even already exist on
PyPI.
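
(The third-party regex module, for one, already supports \X for
extended grapheme clusters, if memory serves:

    >>> import regex   # pip install regex
    >>> regex.findall(r"\X", "aq\u0308u")
    ['a', 'q̈', 'u']

so the building blocks are out there already.)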

ChrisA