Python 3.2 has some deadly infection
On Friday, June 6, 2014 9:27:51 PM UTC+5:30, Steven D'Aprano wrote:
> On Fri, 06 Jun 2014 18:32:39 +0300, Marko Rauhamaa wrote:
> > Michael Torrie:
> >> On 06/06/2014 08:10 AM, Marko Rauhamaa wrote:
> >>> Ethan Furman:
> >>>> ASCII is *not* the state of "this string has no encoding" -- that
> >>>> would be Unicode; a Unicode string, as a data type, has no encoding.
> >>> Huh?
> >> [...]
> >> What part of his statement are you saying "Huh?" about?
> > Unicode, like ASCII, is a code. Representing text in unicode is
> > encoding.
> A Unicode string as an abstract data type has no encoding. It is a
> Platonic ideal, a pure form like the real numbers. There are no bytes, no
> bits, just code points. That is what Ethan means. A Unicode string like
> s = u"NOBODY expects the Spanish Inquisition!"
> should not be thought of as a bunch of bytes in some encoding, but as an
> array of code points. Eventually the abstraction will leak, all
> abstractions do, but not for a very long time.
"Should not be thought of": yes, that's the Python 3 world view.
It is not even the Python 2 world view,
and it is very far from the classic Unix world view.
As Ned Batchelder says in Unipain (http://nedbatchelder.com/text/unipain.html),
programmers should use the 'unicode sandwich' to avoid 'unipain':
bytes on the outside, Unicode on the inside, encode/decode at the edges.
The discussion here is precisely about these edges.
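
For concreteness, here is a minimal sketch of the sandwich in Python 3. It assumes UTF-8 at both edges, and shout() is a made-up helper for illustration, not anything from Ned's talk:

```python
# Unicode sandwich sketch: bytes on the outside, str on the inside.
# Assumptions: Python 3, UTF-8 at both edges; shout() is a hypothetical helper.

def shout(raw: bytes, encoding: str = "utf-8") -> bytes:
    text = raw.decode(encoding)       # edge in: bytes -> code points
    text = text.upper()               # inside: work on str only
    return text.encode(encoding)      # edge out: code points -> bytes

# What a socket or file opened in binary mode would hand us.
incoming = "se\u00f1or".encode("utf-8")
outgoing = shout(incoming)
print(outgoing)                       # b'SE\xc3\x91OR'
```

All the encoding decisions live in the first and last lines of shout(); the middle never sees a byte.
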
Combine that with Chris':
> Yes and no. "ASCII" means two things: Firstly, it's a mapping from the
> letter A to the number 65, from the exclamation mark to 33, from the
> backslash to 92, and so on. And secondly, it's an encoding of those
> numbers into the lowest seven bits of a byte, with the high byte left
> clear. Between those two, you get a means of representing the letter
> 'A' as the byte 0x41, and one of them is an encoding.
and the situation appears quite the opposite of Ethan's description:
in the 'old world', ASCII was both the mapping and the encoding, so there was
never a need to distinguish the encoding from the code point.
It is Unicode that demands these distinctions.
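
A rough illustration of the point, assuming Python 3 (the variable names are mine): for 'A' under ASCII the code point and the byte are the same number, while a single Unicode code point comes out as different bytes under different encodings.

```python
# Mapping vs. encoding, sketched in Python 3.
ch = "A"
code_point = ord(ch)               # the mapping: 'A' -> 65
ascii_bytes = ch.encode("ascii")   # the encoding: 65 -> the single byte 0x41
print(code_point, ascii_bytes)     # 65 b'A'

# One code point (U+20AC, the euro sign), three different byte sequences.
euro = "\u20ac"
encodings = {codec: euro.encode(codec)
             for codec in ("utf-8", "utf-16-le", "utf-32-le")}
for codec, data in encodings.items():
    print(codec, data.hex())
```

With ASCII there is nothing to choose, so the two steps collapse into one; with Unicode you have to say which encoding you mean.
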
If we could magically go to a world where the number of bits in a byte was 32,
all this headache would go away. [Actually, just 21 is enough!]
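
A quick check of the 21-bit figure, assuming Python 3.3 or later (where sys.maxunicode is always 0x10FFFF), with UTF-32 standing in for the imaginary 32-bit byte:

```python
import sys

# The highest code point is U+10FFFF, which needs 21 bits.
highest = sys.maxunicode
bits_needed = highest.bit_length()
print(hex(highest), bits_needed)   # 0x10ffff 21

# UTF-32 is the closest thing to a "32-bit byte": every code point
# takes exactly one 4-byte unit, so encoded length is 4 * len(s).
s = "NOBODY expects the Spanish Inquisition!"
utf32 = s.encode("utf-32-le")      # -le variant: no byte-order mark is added
print(len(utf32) == 4 * len(s))    # True
```
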