### osdir.com

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

# Grapheme clusters, a.k.a.real characters

On Thu, 20 Jul 2017 12:40:08 +1000, Chris Angelico wrote:

> On Thu, Jul 20, 2017 at 12:12 PM, Steve D'Aprano
> <steve+python at pearwood.info> wrote:
>> On Thu, 20 Jul 2017 08:12 am, Gregory Ewing wrote:
>>
>>> Chris Angelico wrote:
>> [snip overly complex and complicated string implementation]
>>
>>
> An accurate description, but in my own defense, I had misunderstood
> Marko's idea. Actually, the implementation I detailed was far SIMPLER
> than I thought it would be; I started writing that post trying to prove
> that it was impossible, but it turns out it isn't actually impossible.
> Just highly impractical.

I haven't really been paying attention to Marko's suggestion in detail,
but if we're talking about a whole new data type, how about a list of
nodes, where each node's data is a decomposed string object guaranteed to
be either:

* an array of single-character code points;

* or a single combining character sequence, emoji variation
sequence, or other grapheme

* with the appropriate length in "characters".

So a string like "Hello World!" would be a single node:

(12, "Hello World!")

Strings will always be in decomposed form, so a "caf? au lait" would
automatically use:

U+0065 LATIN SMALL LETTER E + U+0301 COMBINING ACUTE ACCENT

regardless of your editor, and represented by three nodes:

(3, "caf") (1, "e\N{COMBINING ACUTE ACCENT}") (8, " au lait")

Iterating the string would mean:

for node in nodes:
if node.count = 1:
yield node.string
else:
yield from node.string

Reversing would be:

for node in nodes[::-1]:
if node.count = 1:
yield node.string
else:
yield from node.string[::-1]

Getting the length in graphemes would mean:

sum(node.count for node in nodes)

Indexing and slicing I leave for an exercise.

We lose O(1) indexing and slicing, but the length could be cached.
Calculate the length on demand, then cache it for next time. (This
assumes the string is immutable.) Indexing and slicing would be
proportional to the number of nodes, not the length of the string. So not
as bad as a naive UTF-8 implementation.

Each substring could use the Flexible String Representation, to minimize
the total memory. E.g. in the "caf? au lait" example above, the first and
last node would use one byte per code point, and the middle node would
use two bytes per code point.

Of course, this doesn't *completely* solve the question of end user
expectations. For example, many people would want "ij" to count as a
single character, or "\r\n". And it would complicate the implementation
of the various string methods and the regex engine. It will almost
certainly be much slower than the str type, and use more memory, and it
would be lossy with regards to certain patterns of code points.

For example, it wouldn't distinguish between composed and decomposed
strings, since they're always normalised to decomposed form.

But it might be worth doing, for applications that care about giving a
better user experience when it comes to editing text.

--
Steve