
Python 3.2 has some deadly infection

Steven D'Aprano <steve+comp.lang.python at pearwood.info>:

> Nevertheless, there are important abstractions that are written on top
> of the bytes layer, and in the Unix and Linux world, the most
> important abstraction is *text*. In the Unix world, text formats and
> text processing is much more common in user-space apps than binary
> processing.

That Linux text is not the same thing as Python's text. Conceptually,
Python text is a sequence of 32-bit integers (Unicode code points). Linux
text is a sequence of 8-bit integers (bytes).
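
To make the distinction concrete, here is a minimal Python 3 sketch. It
only illustrates the point above: the variable names are made up, and the
phrase is written with escapes so the example survives any mail encoding.

```python
# Illustrative only: the same phrase as Python text and as UTF-8 bytes.
text = "Hyv\u00e4\u00e4 y\u00f6t\u00e4"   # "Good night" in Finnish
data = text.encode("utf-8")

code_points = [ord(ch) for ch in text]   # one integer per character
byte_values = list(data)                 # one integer (0-255) per byte

print(len(text), len(data))   # number of characters vs. number of bytes
print(code_points)
print(byte_values)
```

Each non-ASCII letter is a single integer in the first list and two
integers in the second, so the two sequences differ in length.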

It is great that lots of computer-to-computer formats are encoded in
ASCII (~ UTF-8). However, nowhere in Linux is there a real abstraction
layer that processes Python-esque text.

Case in point:

   $ env | grep UTF
   $ od -c <<<"Hyvää yötä"     # "Good night" in Finnish
   0000000   H   y   v 303 244 303 244       y 303 266   t 303 244  \n

The "od" utility is asked to display its input as characters. The locale
info gives a hint that all text data is in UTF-8. Yet what comes out is
bytes: each non-ASCII character is shown as two octal byte values.
How about:

   $ wc -c <<<"Hyvää yötä"
   $ tr 'ä' 'a' <<<"Hyvää yötä"
   Hyvaaaa ya?taa
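
The byte-level behaviour of these tools can be imitated in Python. This is
only a sketch, assuming a byte-oriented "tr" (such as GNU tr) that pads the
shorter replacement set with its last character; od_c is a made-up helper,
not a real API.

```python
# The here-string appends a newline, so include it in the input bytes.
data = "Hyv\u00e4\u00e4 y\u00f6t\u00e4\n".encode("utf-8")

def od_c(raw):
    """Rough imitation of `od -c`: ASCII as-is, other bytes in octal."""
    out = []
    for b in raw:
        if b == 0x0A:
            out.append("\\n")
        elif 0x20 <= b < 0x7F:
            out.append(chr(b))
        else:
            out.append(format(b, "03o"))
    return out

byte_count = len(data)   # what `wc -c` counts: bytes, not characters

# Byte-wise tr: the letter is the two bytes C3 A4, and both become "a".
table = bytes.maketrans(b"\xc3\xa4", b"aa")
translated = data.translate(table)

print(od_c(data))
print(byte_count)
print(translated)
```

The translated result still contains a stray B6 byte, the second half of
the third non-ASCII letter, which is what shows up as "?" in the output
above.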

Grep is smarter:

   $ grep v...y <<<"Hyvää yötä"
   Hyvää yötä

which is why you should always prefix "grep" with LC_ALL=C in your
scripts (makes it far faster, too).
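
Python's re module shows the same split, which may help explain the grep
result. This is an analogy rather than a description of grep's internals:
a str pattern plays the role of grep in a UTF-8 locale, a bytes pattern
the role of grep under LC_ALL=C.

```python
import re

text = "Hyv\u00e4\u00e4 y\u00f6t\u00e4"   # "Good night" in Finnish
data = text.encode("utf-8")

# On str, "." matches one character, so three dots span two letters and a space.
char_match = re.search(r"v...y", text)

# On bytes, "." matches one byte; five bytes sit between "v" and "y".
byte_match = re.search(rb"v...y", data)
wide_match = re.search(rb"v.....y", data)

print(char_match is not None)
print(byte_match is not None)
print(wide_match is not None)
```

Only the character-level pattern and the five-dot byte pattern match, so
the same regular expression means different things depending on which
kind of "text" it is applied to.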