Python 3.2 has some deadly infection
On Mon, 02 Jun 2014 08:54:33 +1000, Tim Delaney wrote:
> On 1 June 2014 12:26, Steven D'Aprano
> <steve+comp.lang.python at pearwood.info> wrote:
>> "with cross-platform behavior preferred over system-dependent one" --
>> It's not clear how cross-platform behaviour has anything to do with the
>> Internet age. Python has preferred cross-platform behaviour forever,
>> except for those features and modules which are explicitly intended to
>> be interfaces to system-dependent features. (E.g. a lot of functions in
>> the os module are thin wrappers around OS features. Hence the name of
>> the module.)
> There is the behaviour of defaulting input and output to the system
> encoding.
That's a tricky one, but I think on balance that is a case where
defaulting to the system encoding is the right thing to do. Input and output
occur on the local system you are running on, which by definition isn't
cross-platform. (Non-local I/O is possible, but requires work -- it
doesn't just happen.)
> I personally think we would all be better off if Python (and
> Java, and many other languages) defaulted to UTF-8. This hopefully would
> eventually have the effect of producers changing to output UTF-8 by
> default, and consumers learning to manually specify an encoding when
> it's not UTF-8 (due to invalid codepoints).
UTF-8 everywhere should be our ultimate aim. Then we can forget about
legacy encodings except when digging out ancient documents from archived
floppy disks :-)
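
To illustrate the "UTF-8 first, explicit encoding otherwise" workflow described above, here is a minimal sketch. The helper name decode_with_fallback and the sample string are hypothetical, chosen only for illustration. The fallback encoding is an assumption the caller has to supply, since it cannot be detected reliably from the bytes alone.

```python
def decode_with_fallback(data, fallback="latin-1"):
    """Try UTF-8 first; on failure, use a caller-specified legacy encoding.

    Returns the decoded text and the name of the encoding that was used.
    """
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # The bytes are not valid UTF-8, so the caller's explicit choice applies.
        return data.decode(fallback), fallback


# Hypothetical sample data: the same text produced by two different writers.
utf8_bytes = "naïve café".encode("utf-8")
legacy_bytes = "naïve café".encode("latin-1")

text_a, used_a = decode_with_fallback(utf8_bytes)
text_b, used_b = decode_with_fallback(legacy_bytes)
```

This relies on the fact that most non-ASCII legacy-encoded text is not valid UTF-8, so the strict decoder rejects it instead of silently producing garbage.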
> I'm currently working on a product that interacts with lots of other
> products. These other products can be using any encoding - but most of
> the functions that interact with I/O assume the system default encoding
> of the machine that is collecting the data. The product has been in
> production for nearly a decade, so there's a lot of pushback against
> changes deep in the code for fear that it will break working systems.
> The fact that they are working largely by accident appears to escape
> them ...
> FWIW, changing to use iso-latin-1 by default would be the most sensible
> option (effectively treating everything as bytes), with the option for
> another encoding if/when more information is known (e.g. there's often a
> call to return the encoding, and the output of that call is guaranteed
> to be ASCII).
Python 2 does what you suggest, and it is *broken*. Python 2.7 creates
moji-bake, while Python 3 gets it right:
[steve@ando ~]$ python2.7 -c "print u'???'"
[steve@ando ~]$ python3.3 -c "print(u'???')"
Latin-1 is one of those legacy encodings which needs to die, not to be
entrenched as the default. My terminal uses UTF-8 by default (as it
should), and if I use the terminal to input "???", Python ought to see
what I input, not Latin-1 moji-bake.
If I were to use Windows with a legacy code page, then I couldn't even
enter "???" on the command line since none of the legacy encodings
support that set of characters at the same time. I don't know exactly
what I would get if I tried (say, by copying and pasting text from a
Unicode-aware application), but I'd see that it was weird *in the shell*
before it even reaches Python.
On the other hand, if I were to input something supported by the legacy
encoding, let's say I entered "???" while using ISO-8859-7 (Greek), then
Python ought to see "???" and not moji-bake:
py> b = "???".encode('iso-8859-7') # what the shell generates
py> b.decode('latin-1') # what Python interprets those bytes as
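
As a self-contained sketch of the same effect: the characters in the example above were lost in the archive, so the string "αβγ" below is a stand-in chosen purely for illustration, not the original input.

```python
# Hypothetical sample text; the original characters did not survive archiving.
text = "αβγ"

# What a shell configured for ISO-8859-7 (Greek) would hand to the program.
raw = text.encode("iso-8859-7")       # b'\xe1\xe2\xe3'

# Decoding with the wrong encoding succeeds, but gives the wrong characters.
wrong = raw.decode("latin-1")         # 'áâã'

# Decoding with the encoding that produced the bytes recovers the text.
right = raw.decode("iso-8859-7")      # 'αβγ'
```

Note that the Latin-1 decode raises no error, because every byte value maps to some Latin-1 character. That is why this kind of corruption goes unnoticed until a human reads the result.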
Defaulting to the system encoding means that Python input and output just
works, to the degree that input and output on your system just works. If
your system is crippled by the use of a legacy encoding, then Python will
at least be *no worse* than your system.
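
For anyone wanting to check what Python 3 has picked up from their system, a quick sketch using only the standard library (the values printed depend entirely on the local configuration, so no output is shown):

```python
import locale
import sys

# Encoding Python 3 uses by default for text files opened without an
# explicit encoding argument, taken from the current locale settings.
preferred = locale.getpreferredencoding(False)

# Encoding attached to standard output; may differ when output is
# redirected or when the stream has been replaced.
stdout_encoding = sys.stdout.encoding

print("locale preferred encoding:", preferred)
print("stdout encoding:", stdout_encoding)
```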