Email parsing and unicode/utf8


I just stumbled over some curious behaviour of the stdlib email parsing
APIs which accept strings rather than bytes. It appears that you can't
parse an 8-bit UTF-8 message you have as a str without first encoding it.

The docs
<https://docs.python.org/3/library/email.parser.html#feedparser-api> do
mention some problems (which I saw after the fact):

> class email.parser.FeedParser(_factory=None, *, policy=policy.compat32)
>     Works like BytesFeedParser except that the input to the feed() method must be a string. This is of limited utility, since the only way for such a message to be valid is for it to contain only ASCII text or, if utf8 is True, no binary attachments.
>     Changed in version 3.3: Added the policy keyword.

Okay, cool - let's try parsing a message with text only (no attachments,
no BINARYMIME), with a UTF-8 Content-Type, and a policy with utf8=True.

Python 3.7.1rc2 (default, Oct 14 2018, 15:27:05)
[GCC 8.2.1 20180831 [gcc-8-branch revision 264010]] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import email.parser, email.policy
>>> pol = email.policy.SMTPUTF8
>>> pol.utf8
True
>>> pol.cte_type
'8bit'
>>> msg = '''MIME-Version: 1.0
... Content-Type: text/plain; charset="utf-8"
... Content-Transfer-Encoding: 8bit
... Subject: ?Will it parse? ???.
...
... ?This message contains two (?) non-ASCII characters!
... '''
>>> fp = email.parser.FeedParser(policy=pol)
>>> fp.feed(msg)
>>> msg_obj = fp.close()
>>> msg_obj
<email.message.EmailMessage object at 0x7ff028012e10>
>>> print(msg_obj.get_content())
?This message contains two (\u0662) non-ASCII characters!

>>> print(msg_obj['Subject'])
?Will it parse? ???.

I don't know WHAT it's doing with the body there... It doesn't look like
utf8 mode actually did anything. Interesting that the subject header
survived! Maybe this is what the utf8=True does?
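
A possible explanation for the body, offered as a reading of
email/message.py in 3.7 rather than anything the docs promise: when a str
payload contains non-ASCII characters and no surrogate escapes,
get_payload(decode=True) appears to fall back to
payload.encode('raw-unicode-escape'), and get_content() then decodes those
bytes as UTF-8 with errors='replace'. The sketch below reproduces only that
codec round trip. The helper name and the sample character U+00E9 are made
up for illustration, since the first character of the original body is not
recoverable here.

```python
def roundtrip_like_get_content(text: str) -> str:
    """Mimic the suspected str-payload path: raw-unicode-escape, then UTF-8."""
    # Code points up to U+00FF become single Latin-1 bytes; anything above
    # becomes a literal backslash-u escape sequence written in ASCII.
    as_bytes = text.encode("raw-unicode-escape")
    # A stray Latin-1 byte is not valid UTF-8, so it is replaced by U+FFFD.
    return as_bytes.decode("utf-8", errors="replace")


# U+0662 is ARABIC-INDIC DIGIT TWO, matching the escape seen in the output.
sample = "\u00e9 two (\u0662)"
result = roundtrip_like_get_content(sample)
print(ascii(result))
```

If that reading is right, it would account for both symptoms at once: one
character degraded to a replacement character and the other printed as a
literal \u0662.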

>>> email.policy.default.utf8
False
>>> fp2 = email.parser.FeedParser(policy=email.policy.default)
>>> fp2.feed(msg)
>>> msg_obj2 = fp2.close()
>>> print(msg_obj2['Subject'])
?Will it parse? ???.

Nope. Apparently, contrary to what my reading of the docs suggests, the
utf8 flag does nothing at all when parsing.
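
For contrast, the place where the email.policy docs describe utf8 as having
an effect is serialization: with utf8=False non-ASCII header text is
written as RFC 2047 encoded words, with utf8=True it is written as raw
UTF-8 (RFC 6532). A small sketch of that difference, using a made-up
Subject value:

```python
import email.policy
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "caf\u00e9"  # made-up header value for illustration

# utf8=False (email.policy.SMTP): non-ASCII header text is RFC 2047 encoded.
encoded_words = msg.as_bytes(policy=email.policy.SMTP)
# utf8=True (email.policy.SMTPUTF8): the header is written as raw UTF-8.
raw_utf8 = msg.as_bytes(policy=email.policy.SMTPUTF8)

print(encoded_words)
print(raw_utf8)
```

This only shows the generator side; it says nothing about whether the flag
is meant to influence FeedParser.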

Just to check that this was in fact a perfectly valid email:

>>> bfp = email.parser.BytesFeedParser(policy=pol)
>>> bfp.feed(msg.encode('utf-8'))
>>> msg_objb = bfp.close()
>>> print(msg_objb.get_content())
?This message contains two (?) non-ASCII characters!

>>> print(msg_objb['Subject'])
?Will it parse? ???.

BytesFeedParser is happy.
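
As a stopgap, the encode step can be wrapped in a small helper. This is
only a sketch of the workaround shown above, not an official API:
parse_str_message is a hypothetical name, and it assumes the str really
does represent a message whose declared charset is UTF-8. The sample
message is made up.

```python
import email.parser
import email.policy


def parse_str_message(text, policy=email.policy.SMTPUTF8):
    """Parse a message held as str by round-tripping it through UTF-8 bytes."""
    parser = email.parser.BytesFeedParser(policy=policy)
    # Assumption: the message declares charset="utf-8", so encoding the str
    # as UTF-8 reproduces the bytes the message would have on the wire.
    parser.feed(text.encode("utf-8"))
    return parser.close()


raw = (
    "MIME-Version: 1.0\n"
    'Content-Type: text/plain; charset="utf-8"\n'
    "Content-Transfer-Encoding: 8bit\n"
    "Subject: caf\u00e9 \u0662\n"
    "\n"
    "body with \u0662\n"
)
parsed = parse_str_message(raw)
print(parsed["Subject"])
print(parsed.get_content())
```

If the str was produced by decoding some other charset, re-encoding it as
UTF-8 would no longer match the Content-Type header, so the helper would
need the original bytes instead.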

Question: Is this a bug? Am I missing something? Does the clause in the
docs about utf8 mean anything?