Getting Unicode decode error using lxml.iterparse


digitig at gmail.com wrote:

> I'm trying to read my iTunes library in Python using iterparse. My current
> stub is:

>     parser.add_argument('infile', nargs='?',
>                         type=argparse.FileType('r'), default=sys.stdin)

> I'm getting an error on one part of the XML:
> 
> 
>  File "C:\Users\digit\Anaconda3\lib\encodings\cp1252.py", line 23, in
>  decode
>     return codecs.charmap_decode(input,self.errors,decoding_table)[0]
> 
> UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position
> 202: character maps to <undefined>

> I suspect the issue is that it's using cp1252.py, which I don't think is
> UTF-8 as specified in the XML prolog. Is this an iterparse problem, or am
> I using it wrongly?

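The wrong encoding is specified implicitly by argparse.FileType("r"): it
opens the file in text mode with the platform's default encoding (cp1252 on
your machine), so the data is decoded as cp1252 regardless of what the XML
prolog says. Try FileType("rb") or FileType("r", encoding="utf-8") instead
(my personal approach is to avoid FileType completely).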
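
A minimal sketch of the binary-mode variant, under some assumptions: it uses
the standard library's xml.etree.ElementTree.iterparse as a stand-in so it
runs without lxml installed (lxml.etree.iterparse should accept a binary file
object in the same way), and the helper names build_parser and count_elements
are made up for illustration. Note that with "rb" the stdin default has to be
sys.stdin.buffer, because sys.stdin itself is a text stream.

```python
import argparse
import sys
import xml.etree.ElementTree as ET  # stand-in for lxml.etree (assumption)


def build_parser():
    """Hypothetical helper: argument parser that opens infile as bytes."""
    parser = argparse.ArgumentParser()
    # 'rb' hands raw bytes to the XML parser, which then honours the
    # encoding declared in the XML prolog instead of the locale default.
    parser.add_argument('infile', nargs='?',
                        type=argparse.FileType('rb'),
                        default=sys.stdin.buffer)
    return parser


def count_elements(stream):
    """Hypothetical helper: count elements while parsing incrementally."""
    count = 0
    for _event, elem in ET.iterparse(stream, events=('end',)):
        count += 1
        elem.clear()  # release the element's children once it is processed
    return count
```

A script would call build_parser().parse_args() and pass args.infile to
count_elements. The point of the design is that no text decoding happens
before the XML parser reads the prolog.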