[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Getting Unicode decode error using lxml.iterparse


dieter schrieb am 23.05.2018 um 08:25:
> If the encoding is not specified, "lxml" will try to determine it
> and finally defaults to "utf-8" (which seems to be the correct encoding
> for your case).

Being an XML parser, it does not do that. XML parsers are designed to
reject non-wellformed content, and that includes anything that cannot be
decoded.

In short, if no encoding is specified, then it's UTF-8, but if there is an
XML declaration that specifies that encoding, then it uses that encoding.

Here, the encoding is specifed as UTF-8, so that's what the parser uses.

Note, however, that the library that the OP uses is not lxml but xml.etree,
i.e. the ElementTree XML support in the standard library.

Stefan