Reversible malformed UTF-8 to malformed UTF-16 encoding
> On 2019-03-19 20:32, Florian Weimer wrote:
>> I've seen occasional proposals like this one coming up:
>> | I therefore suggested 1999-11-02 on the unicode at unicode.org mailing
>> | list the following approach. Instead of using U+FFFD, simply encode
>> | malformed UTF-8 sequences as malformed UTF-16 sequences. Malformed
>> | UTF-8 sequences consist excludively of the bytes 0x80 - 0xff, and
>> | each of these bytes can be represented using a 16-bit value from the
>> | UTF-16 low-half surrogate zone U+DC80 to U+DCFF. Thus, the overlong
>> | "K" (U+004B) 0xc1 0x8b from the above example would be represented
>> | in UTF-16 as U+DCC1 U+DC8B. If we simply make sure that every UTF-8
>> | encoded surrogate character is also treated like a malformed
>> | sequence, then there is no way that a single high-half surrogate
>> | could precede the encoded malformed sequence and cause a valid
>> | UTF-16 sequence to emerge.
>> Has this ever been implemented in any Python version? I seem to
>> remember something like that, but all I could find was me talking
>> about this in 2000.
> Python 3 has "surrogate escape". Have a read of PEP 383 -- Non-decodable
> Bytes in System Character Interfaces.
Thanks, this is the information I was looking for.