Getting Unicode decode error using lxml.iterparse


digitig at gmail.com writes:

> I'm trying to read my iTunes library in Python using iterparse. My current stub is:
> ...
> My input file (reduced to home in on the error) is:
>
> ---- snip -----
>
> <?xml version="1.0" encoding="UTF-8"?>
> <plist version="1.0">
> <dict>
> 	<dict>
> 		<key>15078</key>
> 		<dict>
> 			<key>Name</key><string>Part 2. The Death Of Enkidu. Skon P?itele M?ho Mne Zdeptal Te??e</string>
> 		</dict>
> 	</dict>
> </dict>
> </plist>
> ...
> I'm getting an error on one part of the XML:
>
>
>  File "C:\Users\digit\Anaconda3\lib\encodings\cp1252.py", line 23, in decode
>     return codecs.charmap_decode(input,self.errors,decoding_table)[0]
>
> UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 202: character maps to <undefined>
>
>
> I suspect the issue is that it's using cp1252.py, which I don't think is UTF-8 as specified in the XML prolog. Is this an iterparse problem, or am I using it wrongly?

You can tell "lxml" which encoding it should use. Maybe you did,
and it was the wrong one.

If the encoding is not specified, "lxml" will try to determine it
and, failing that, default to "utf-8" (which seems to be the correct
encoding for your case).

"lxml" sits on top of the C library "libxml2". Possibly "libxml2"
lets an environment variable specify the default encoding, and
that variable has an unfortunate value in your case.

As a workaround, you can tell "lxml" explicitly to use "utf-8"
for your parsing.
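
A minimal sketch of that workaround, assuming the "encoding" keyword
of "lxml.etree.iterparse" and a source that delivers bytes (a file name
or a file opened with "rb"). The sample data and the helper name
"read_strings" are made up for illustration; your stub was elided, so
this is not a correction of it. Note that the traceback points into
Python's own cp1252 codec, which is what a file opened in text mode on
Windows would typically use; whether that happened here is only a guess.

```python
import io

from lxml import etree

# Hypothetical sample standing in for the iTunes library file.
SAMPLE = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<plist version="1.0"><dict>\n'
    '<key>Name</key><string>Dvo\u0159\u00e1k</string>\n'
    '</dict></plist>\n'
).encode("utf-8")


def read_strings(source):
    """Yield the text of every <string> element in the document."""
    # Pass bytes to the parser and name the encoding explicitly, so no
    # locale-dependent decoding happens before lxml sees the data.
    for _event, elem in etree.iterparse(source, events=("end",),
                                        tag="string", encoding="utf-8"):
        yield elem.text
        elem.clear()  # free the element's content once it is consumed


names = list(read_strings(io.BytesIO(SAMPLE)))
print(names)
```

With a real file, the same function could be called as
read_strings(open(path, "rb")) or with the file name itself, where
"path" is whatever location your library file has.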