Python2.7 unicode conundrum


Hi folks,
what seemingly started out as a weird database character-encoding mix-up
could be boiled down to a few lines of pure Python. The source code
below is real UTF-8 (as evidenced by the UTF-8 byte sequence 'c3 a4' in the
third line of the hexdump). When just printed, the string "s" is
displayed correctly as 'ä' (a-umlaut), but the string representation
shows that it seems to have been converted to latin-1 'e4' somewhere along
the way.
How can this be avoided?

dh@jenna:~/python$ cat unicode.py
# -*- encoding: utf8 -*-

s = u'ä'

print(s)
print((s, ))

dh@jenna:~/python$ hd unicode.py
00000000  23 20 2d 2a 2d 20 65 6e  63 6f 64 69 6e 67 3a 20  |# -*- encoding: |
00000010  75 74 66 38 20 2d 2a 2d  0a 0a 73 20 3d 20 75 27  |utf8 -*-..s = u'|
00000020  c3 a4 27 0a 0a 70 72 69  6e 74 28 73 29 0a 70 72  |..'..print(s).pr|
00000030  69 6e 74 28 28 73 2c 20  29 29 0a 0a              |int((s, ))..|
0000003c
dh@jenna:~/python$ python unicode.py
ä
(u'\xe4',)
dh@jenna:~/python$
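
A note on the output above, offered as a sketch rather than a definitive answer: the '\xe4' in (u'\xe4',) is most likely not a latin-1 byte. Printing a tuple shows the repr() of its elements, and Python 2's repr() of a unicode string escapes non-ASCII characters by code point. U+00E4 simply happens to share its number with the latin-1 byte for the same character. The snippet below checks this; the variable names are illustrative and not part of the original script, and it should run on Python 2.7 as well as 3.3+.

```python
# Illustrative names only; the escape avoids depending on the file's encoding.
s = u'\u00e4'                            # LATIN SMALL LETTER A WITH DIAERESIS

code_point = ord(s)                      # numeric code point of the character
utf8_bytes = s.encode('utf-8')           # the two bytes seen in the hexdump
round_trip = utf8_bytes.decode('utf-8')  # decode them back to text

print(len(s))            # 1: a single character, not two bytes
print(hex(code_point))   # 0xe4: a code point, not a latin-1 byte
print(round_trip == s)   # True: nothing was re-encoded or lost

# To show the character itself inside a container, format the elements
# explicitly instead of relying on the tuple's repr().
shown = u'(' + u', '.join([s]) + u')'
```

Under that reading nothing needs to be avoided: the string holds the right character, and only the debugging representation looks different. Printing `shown` (or each element on its own) displays the umlaut, provided the terminal encoding can represent it.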