UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10442: character maps to <undefined>


On Wed, 23 May 2018 00:31:03 +0200, Peter J. Holzer wrote:

> On 2018-05-23 07:38:27 +1000, Chris Angelico wrote:
[...]
>> You can find an encoding which is capable of decoding a file. That's
>> not the same thing.
> 
> If the result is correct, it is the same thing.

But how do you know what is correct and what isn't? In the most general 
case, even if you know the language nominally being used, you might not 
be able to recognise good output from bad:

    Max Steele strained his mighty thews against his bonds, but
    the ?-rays had left him as weak as a kitten. The evil Galactic
    Emperor, Gi?x-??in The Terrible of the planet ?e??, laughed: "I 
    have you now, Steele, and by this time tomorrow my armies will
    have overrun your pitiful Earth defences!"

If this text is encoded using MacRoman, then decoded as Latin-1, it 
works, and looks barely any more stupid than the original:

    Max Steele strained his mighty thews against his bonds, but
    the ?-rays had left him as weak as a kitten. The evil Galactic
    Emperor, Gi?x-??in The Terrible of the planet ?e??, laughed: "I
    have you now, Steele, and by this time tomorrow my armies will
    have overrun your pitiful Earth defences!"

but it clearly isn't the original text.
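
A minimal sketch of that round trip in Python, using a made-up string (the 
accented letters below are stand-ins I picked for illustration, not the 
ones from the passage above). It only shows that the wrong decode raises 
no error:

```python
# Hypothetical sample; the non-ASCII letters are stand-ins chosen for illustration.
original = "Gi\u00e9x-\u00c7\u00e4in of the planet \u00d8e\u00df"

# Encode with one 8-bit codec, then decode with a different one.
data = original.encode("mac_roman")
mojibake = data.decode("latin-1")      # no exception: Latin-1 maps every byte

print(original == mojibake)            # False
print(len(original) == len(mojibake))  # True: one byte per character both ways

# The damage is only reversible if you already know both codecs.
repaired = mojibake.encode("latin-1").decode("mac_roman")
print(repaired == original)            # True
```

Nothing in the decode step signals a problem; the only check available is 
a human looking at the result and knowing what it should have been.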

Mojibake is especially difficult to handle when you are dealing with 
short text snippets like file names or user names, which can contain 
arbitrary characters and where there is rarely any way to recognise the 
"correct" string. If you think Gi?x-??in The Terrible is a ludicrous 
example of text, you ought to look at user names on web forums.
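
To make that concrete, here is a rough sketch that tries a few candidate 
codecs on one short byte string. The bytes and the candidate list are 
hypothetical, chosen only for illustration:

```python
# Hypothetical user name received as raw bytes, encoding unknown.
raw = b"Gi\x8ex-\x82\x8ain"

# Candidate codecs to try; this list is an arbitrary assumption.
candidates = ["utf-8", "mac_roman", "latin-1", "cp437", "cp1252"]

results = {}
for name in candidates:
    try:
        results[name] = raw.decode(name)
    except UnicodeDecodeError:
        results[name] = None   # this codec cannot decode the bytes at all

for name, text in results.items():
    print(name, repr(text))
```

Here UTF-8 is ruled out, but the four 8-bit codecs all succeed and each 
gives a different string. Without outside knowledge of what the name is 
supposed to be, there is no way to choose between them.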



-- 
Steve