What's wrong with this code? It fails to read the strings in Chinese. Is it because Chinese characters can't be read on a Mac? Many thanks

On 08Nov2018 19:30, Annie Lu <gabriella19930611 at gmail.com> wrote:
># -*- coding: UTF-8 -*-
>... f = open('/Users/annielu/Desktop/namelist1801.txt')
>>>> namelist1801txt = f.read()
>>>> f.close()
>>>> namelist1801txt
> x89\r\xe5\xbc\xa0\xe6\xb2\x81\xe7\x8e\xa5'

It should be fine, but how it works out is very dependent on:

- your Python version, particularly Python 2 versus Python 3

- the text encoding used in the file namelist1801.txt

If you're not using Python 3, I recommend that you do. I _suspect_ from 
the output you have shown, that you are using Python 2.

On a UNIX system (your Mac is a UNIX system, BTW), a text file is a 
stream of bytes.  Because it contains text, that text is encoded to 
bytes in some fashion.  On modern systems, the commonest encoding is 
'utf-8', a variable length encoding of Unicode code points.

In order to read text back from a file, it must be decoded.
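
As a small sketch of that encode/decode round trip (Python 3 syntax; the 
sample string is just an illustration, not taken from your file):

```python
# Python 3 sketch: text <-> bytes round trip with UTF-8.
text = "A\u5218"                 # 'A' followed by the CJK character U+5218

data = text.encode("utf-8")      # text -> bytes, as stored in a file
print(len(text), len(data))      # 2 characters, 4 bytes: 1 for 'A', 3 for U+5218

decoded = data.decode("utf-8")   # bytes -> text, as done when reading
print(decoded == text)
```

The ASCII letter takes one byte and the Chinese character takes three, 
which is what "variable length" means here.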

You've opened your file as text (which is good, because it contains text).

In Python 2 that is pretty simple minded: you get back _byte_ strings: 
Python 2 strings are just arrays of bytes, so no decoding really 
happens. For ASCII text, that gets by. For languages requiring glyphs 
beyond that, interpretation is needed. You need unicode strings, which 
are _not_ Python 2's default, so your text needs converting.

In Python 3, strings are unicode strings to start with. You must still 
indicate the file encoding, but there is a default inferred from your 
operating environment, and that is usually 'utf-8'.
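
If you want to check both points on your own machine, a sketch like this 
(Python 3) reports the interpreter's major version and the encoding that 
open() would normally fall back to in text mode when you do not pass 
encoding=. The variable names are mine:

```python
import locale
import sys

# Major version of the running interpreter (2 or 3).
major = sys.version_info[0]
print("Python major version:", major)

# Encoding inferred from the environment; in Python 3, open() in text mode
# normally uses this when no encoding= argument is given.
default_encoding = locale.getpreferredencoding(False)
print("default text encoding:", default_encoding)
```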

So here's an (untested) Python 2 example loop:

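  from __future__ import print_function  # make print() a function in Python 2

  with open('namelist.txt') as f:
    for line in f:
      line = line.strip()                # drop the trailing line ending
      print("line =", line)              # the raw byte string
      uline = unicode(line, 'utf-8')     # decode the bytes to a unicode string
      print("uline =", uline)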

Here's a Python 2 example of taking your text string and converting it:

  >>> s='\xe9\x99\x88\xe5\xb7\x8d\n\xe8\x83\xa1\xe6\x99\xba\xe5\x81\xa5\r\xe9\xbb\x84\xe5\x9d\xa4\xe6\xa6\x95\r\xe6\x9d\x8e\xe6\x98\x9f\xe7\x81\xbf\r\xe5\x88\x98\xe8\xb6\x85'
  >>> unicode(s,'utf-8')
  >>> print(unicode(s,'utf-8'))

I cannot read Chinese text, but the glyphs look like it to my eye.

I'm using a Mac, and did nothing special.
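
For comparison, here is a sketch of the same conversion in Python 3, 
where the data has to be written as a bytes literal (note the b prefix) 
and is decoded with the decode method. The variable names are mine:

```python
# Python 3 sketch: decode the same bytes and split them into lines.
s = (b'\xe9\x99\x88\xe5\xb7\x8d\n\xe8\x83\xa1\xe6\x99\xba\xe5\x81\xa5\r'
     b'\xe9\xbb\x84\xe5\x9d\xa4\xe6\xa6\x95\r\xe6\x9d\x8e\xe6\x98\x9f\xe7\x81\xbf\r'
     b'\xe5\x88\x98\xe8\xb6\x85')

text = s.decode('utf-8')     # bytes -> str
names = text.splitlines()    # splits on '\n' and also on a bare '\r'
for name in names:
    print(name)
```

Your data mixes '\n' and bare '\r' line endings; splitlines() treats 
both as line boundaries, so each name comes out on its own.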

Note that I had to take a portion of your text which ended on a complete 
unicode character, otherwise the decode fails. My first cut/paste 
stopped one byte beyond the \x85 that ends the string above, and failed.  
Your entire string should also decode cleanly.
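
A sketch of that failure mode in Python 3: the trailing b'\xe5' below is 
an assumed stand-in for the extra byte (the start of a 3-byte character 
with its continuation bytes missing), not the actual byte from the 
cut/paste.

```python
# Python 3 sketch: decoding bytes that end partway through a character.
complete = b'\xe5\x88\x98\xe8\xb6\x85'   # two complete 3-byte characters
partial = complete + b'\xe5'             # assumed extra lead byte, no continuation

try:
    partial.decode('utf-8')
    failed = False
except UnicodeDecodeError as e:
    failed = True
    print("decode failed:", e.reason)

# errors='replace' substitutes U+FFFD for the undecodable tail instead of raising.
lenient = partial.decode('utf-8', errors='replace')
print(repr(lenient))
```

The strict decode is usually what you want for real data, since a 
failure there tells you the bytes were cut short or are not utf-8 at all.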

In Python 3 the loop is much cleaner:

  with open('namelist.txt', encoding='utf-8') as f:
    for line in f:
      line = line.strip()
      print("line =", line)

because the file open understands the encoding. I have explicitly 
specified 'utf-8' there, but you may find that it is the default for 
your environment.
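
To try the Python 3 loop without touching your real file, here is a 
self-contained sketch that writes a few of the bytes from your output to 
a temporary file and reads them back. The temporary file and the 
variable names are hypothetical scaffolding:

```python
import os
import tempfile

# Sample data: two names from the output above, separated by a bare '\r'.
sample = b'\xe6\x9d\x8e\xe6\x98\x9f\xe7\x81\xbf\r\xe5\x88\x98\xe8\xb6\x85\n'

# Write the raw bytes to a temporary file standing in for namelist.txt.
with tempfile.NamedTemporaryFile(delete=False, suffix='.txt') as tmp:
    tmp.write(sample)
    path = tmp.name

names = []
# Text mode with the default newline=None accepts '\n', '\r' and '\r\n'
# as line endings, so the bare '\r' still separates the two names.
with open(path, encoding='utf-8') as f:
    for line in f:
        names.append(line.strip())

os.remove(path)
print(names)
```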

Cameron Simpson <cs at cskk.id.au>