
Python 3.2 has some deadly infection

On Sat, Jun 7, 2014 at 1:32 AM, Marko Rauhamaa <marko at pacujo.net> wrote:
> Michael Torrie <torriem at gmail.com>:
>> On 06/06/2014 08:10 AM, Marko Rauhamaa wrote:
>>> Ethan Furman <ethan at stoneleaf.us>:
>>>> ASCII is *not* the state of "this string has no encoding" -- that
>>>> would be Unicode; a Unicode string, as a data type, has no encoding.
>>> Huh?
>> [...]
>> What part of his statement are you saying "Huh?" about?
> Unicode, like ASCII, is a code. Representing text in unicode is
> encoding.

Yes and no. "ASCII" means two things. Firstly, it's a mapping from the
letter A to the number 65, from the exclamation mark to 33, from the
backslash to 92, and so on. Secondly, it's an encoding of those
numbers into the lowest seven bits of a byte, with the high bit left
clear. Between those two, you get a means of representing the letter
'A' as the byte 0x41, but only the second of them is an encoding.
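
To make the two halves concrete, here is a minimal Python 3 sketch. The
variable names are mine, purely for illustration: ord() shows the
character-to-number mapping, and str.encode("ascii") shows the step that
packs each number into a byte.

```python
# Part one: the mapping from characters to numbers.
sample = "A!\\"
mapping = {ch: ord(ch) for ch in sample}
print(mapping)           # {'A': 65, '!': 33, '\\': 92}

# Part two: the encoding of those numbers into bytes.
encoded = sample.encode("ascii")
print(encoded[0] == 0x41)    # True: 'A' is stored as the byte 0x41

# Every ASCII byte fits in the low seven bits, so the high bit is clear.
high_bit_clear = all(byte & 0x80 == 0 for byte in encoded)
print(high_bit_clear)        # True
```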

"Unicode", on the other hand, is only the first part. It maps all the
same characters to the same numbers that ASCII does, and then adds a
few more... a few followed by a few, followed by... okay, quite a lot
more. Unicode specifies that the character OK HAND SIGN, which looks
like 👌 if you have the right font, is number 1F44C in hex (128076
decimal). This is the "Universal Character Set" or UCS.
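
You can check that number from Python 3 without any bytes being
involved. This is just a sketch using the standard unicodedata module;
the character is written as an escape so it survives a mail client that
can't display it.

```python
import unicodedata

# OK HAND SIGN, written as an escape rather than as a literal character.
ok_hand = "\U0001F44C"

codepoint = ord(ok_hand)
name = unicodedata.name(ok_hand)

print(hex(codepoint))   # 0x1f44c
print(codepoint)        # 128076
print(name)             # OK HAND SIGN
```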

ASCII could specify a single encoding, because that encoding makes
sense for nearly all purposes. (There are times when you transmit
ASCII text and use the high bit to mean something else, like parity or
"this is the end of a word" or something, but even then, you follow
the same convention of packing a number into the low seven bits of a
byte.) Unicode can't pick just one, because the candidate encodings
each have different pros and cons, and so we have UCS Transformation
Formats like UTF-8 and UTF-32. Each one is an encoding that maps a
codepoint to a sequence of bytes.
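
As a rough illustration of the trade-off, here is the same codepoint
pushed through both encodings in Python 3. Picking the big-endian
"utf-32-be" codec is my own choice, made so that no byte order mark gets
prepended and the bytes are easy to read.

```python
# OK HAND SIGN, written as an escape so the source stays pure ASCII.
ok_hand = "\U0001F44C"

utf8 = ok_hand.encode("utf-8")        # variable width: 1 to 4 bytes per codepoint
utf32 = ok_hand.encode("utf-32-be")   # fixed width: always 4 bytes per codepoint

print(utf8.hex())    # f09f918c
print(utf32.hex())   # 0001f44c

# For plain 'A', UTF-8 matches ASCII while UTF-32 pads with zero bytes.
a_utf8 = "A".encode("utf-8")
a_utf32 = "A".encode("utf-32-be")
print(a_utf8)        # b'A'
print(a_utf32)       # b'\x00\x00\x00A'
```

Same codepoint, different bytes: UTF-8 is compact for ASCII-heavy text,
while UTF-32 makes every codepoint the same size.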

You can't represent text in "Unicode" in a computer. Somewhere along
the way, you have to figure out how to store those codepoints as
bytes, or as some other concrete structure (you could, for instance,
use a Python list of Python integers; I can't say that it would be in
any way more efficient than the alternatives, but it would be
plausible); and that's the encoding.
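
The list-of-integers idea is easy to try in Python 3. This is only a
sketch with made-up variable names, not a suggestion that anyone should
store text this way.

```python
text = "OK \U0001F44C"

# One concrete representation: a list of codepoint numbers.
codepoints = [ord(ch) for ch in text]
print(codepoints)    # [79, 75, 32, 128076]

# The list round-trips back to the same string...
restored = "".join(chr(n) for n in codepoints)
print(restored == text)   # True

# ...but to write it to a file or a socket you still need bytes.
as_bytes = restored.encode("utf-8")
print(as_bytes)      # b'OK \xf0\x9f\x91\x8c'
```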