
Grapheme clusters, a.k.a. real characters

Chris Angelico <rosuav at gmail.com>:

> Actually, the implementation I detailed was far SIMPLER than I thought
> it would be; I started writing that post trying to prove that it was
> impossible, but it turns out it isn't actually impossible. Just highly
> impractical.

The existing str implementation could be tweaked to accommodate the
"super code points" I proposed:

Add a pointer field to CPython's UCS-4 string variant. Behind the
pointer is an array of 64-bit pointers. If any code point stored in the
string is 1114112 (0x110000, one past the highest Unicode code point) or
greater, subtract 1114112 from it to get an index into the pointer
array.
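
A minimal sketch of that index calculation, in Python rather than C for
brevity. The names FIRST_SUPER and classify are made up for illustration
and are not part of CPython:

```python
# Hypothetical names; only the threshold 1114112 comes from the proposal.
FIRST_SUPER = 0x110000  # 1114112, one past the highest Unicode code point


def classify(stored):
    """Split a stored 32-bit string element into its two possible meanings.

    Returns ("code point", value) for an ordinary code point, or
    ("table index", index) for an element that refers to the side array.
    """
    if not 0 <= stored < 2**32:
        raise ValueError("a UCS-4 element must fit in 32 bits")
    if stored < FIRST_SUPER:
        return ("code point", stored)
    return ("table index", stored - FIRST_SUPER)


# How many side-array slots a 32-bit element could address in principle.
ADDRESSABLE_SLOTS = 2**32 - FIRST_SUPER
```

For example, classify(0x10FFFF) is still an ordinary code point, while
classify(0x110000) refers to slot 0 of the pointer array.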

If the pointer at the index is odd, cast it to uint64_t and shift it
right by one bit to get the super code point. Such a packed super code
point can hold 3 full code points (3 * 21 bits = 63 bits, which leaves
the lowest bit free for the odd/even tag).

If the pointer at the index is an even number, it is a reference to a
bigint value representing the super code point.
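
The odd/even tagging can be sketched as follows, again in Python rather
than C. Several details here are my assumptions and not part of the
proposal above: the first code point sits in the lowest 21 bits, unused
high fields are zero, and the "heap" dict stands in for dereferencing a
real (even, because aligned) pointer to a bigint. All names are
hypothetical.

```python
CP_BITS = 21                    # enough for any code point up to 0x10FFFF
CP_MASK = (1 << CP_BITS) - 1


def pack(code_points):
    """Pack 1 to 3 code points into an odd 64-bit table entry."""
    if not 1 <= len(code_points) <= 3:
        raise ValueError("a packed entry holds 1 to 3 code points")
    value = 0
    for i, cp in enumerate(code_points):
        if not 0 <= cp <= 0x10FFFF:
            raise ValueError("not a Unicode code point")
        # Assumed layout: first code point in the lowest 21 bits.
        value |= cp << (CP_BITS * i)
    return (value << 1) | 1     # low bit set marks "packed, not a pointer"


def super_code_point(entry, heap):
    """Return the integer value of the super code point for a table entry."""
    if entry & 1:
        return entry >> 1       # odd: the value is stored inline
    return heap[entry]          # even: follow the reference to a bigint


def split(value):
    """Split a super code point into 21-bit fields (assumed layout)."""
    fields = []
    while value:
        fields.append(value & CP_MASK)
        value >>= CP_BITS
    return fields


# "e" followed by COMBINING ACUTE ACCENT fits in one packed entry.
entry = pack([0x65, 0x301])
fields = split(super_code_point(entry, heap={}))
```

Even the largest packed entry, pack([0x10FFFF] * 3), stays below 2**64,
so it fits in the same 64-bit slot as a pointer.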