Putting Unicode characters in JSON
On 23 March 2018 at 00:27, Thomas Jollans <tjol at tjol.eu> wrote:
> On 22/03/18 20:46, Tobiah wrote:
>> I was reading though, that JSON files must be encoded with UTF-8. So
>> should I be doing string.decode('latin-1').encode('utf-8')? Or does
>> the json module do that for me when I give it a unicode object?
> Definitely not. In fact, that won't even work.
> >>> import json
> >>> s = 'déjà vu'.encode('latin1')
> >>> s
> b'd\xe9j\xe0 vu'
> >>> json.dumps(s)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/usr/lib/python3.6/json/__init__.py", line 231, in dumps
>     return _default_encoder.encode(obj)
>   File "/usr/lib/python3.6/json/encoder.py", line 199, in encode
>     chunks = self.iterencode(o, _one_shot=True)
>   File "/usr/lib/python3.6/json/encoder.py", line 257, in iterencode
>     return _iterencode(o, 0)
>   File "/usr/lib/python3.6/json/encoder.py", line 180, in default
> TypeError: Object of type 'bytes' is not JSON serializable
> You should make sure that either the file you're writing to is opened as
> UTF-8 text, or the ensure_ascii parameter of dumps() or dump() is set to
> True (the default) - and then write the data in ASCII or any
> ASCII-compatible encoding (e.g. UTF-8).
> Basically, the default behaviour of the json module means you don't
> really have to worry about encodings at all once your original data is
> in unicode strings.
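To make the two options concrete, here is a minimal sketch, assuming
Python 3 (the variable names are just for illustration). It shows the
default ensure_ascii=True output, which is pure ASCII, next to
ensure_ascii=False output that the caller encodes as UTF-8:

```python
import json

text = 'd\u00e9j\u00e0 vu'  # 'déjà vu' as a text (str) object

# Default (ensure_ascii=True): non-ASCII characters are written as
# \uXXXX escapes, so the result is safe in any ASCII-compatible encoding.
ascii_json = json.dumps(text)

# ensure_ascii=False keeps the characters as they are; the caller is
# then responsible for encoding the result as UTF-8.
utf8_json = json.dumps(text, ensure_ascii=False).encode('utf-8')

# Both forms load back to the same text.
assert json.loads(ascii_json) == text
assert json.loads(utf8_json.decode('utf-8')) == text
```

Either way the input to dumps() is text, never bytes, which is the
point Thomas is making.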
From my analysis of the OP's comments, I suspect he's using Python 2,
which muddles the distinction between bytes and (Unicode) text, and
that's why he is seeing such strange results.
Getting this right in Python 2 is going to involve having a clear
understanding of how text and bytes differ, and carefully tracking
which values are conceptually text and which are conceptually bytes.
In my view one of the easiest ways of doing this is to try writing the
code you want in Python 3, and watch how it breaks (as you've
demonstrated above, it will!) Then, if you need your code to work in
Python 2, apply the knowledge you've gained to the Python 2 codebase.
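As a rough sketch of what "tracking text vs bytes" means in practice:
decode the bytes to text once, at the boundary, and only hand text to
json. The names (raw, text, payload) are hypothetical, and I'm
assuming the source data is Latin-1 as in the OP's example. I'd expect
this to behave the same on Python 2.7 and Python 3, but I have only
reasoned it through for Python 3:

```python
import json

raw = b'd\xe9j\xe0 vu'         # bytes, e.g. read from a Latin-1 source
text = raw.decode('latin-1')   # from here on, conceptually text

# With the default ensure_ascii=True the JSON output is ASCII-only,
# so no further encoding decision is needed to store or send it.
payload = json.dumps({'phrase': text})

# Loading the payload gives back text, not bytes.
restored = json.loads(payload)['phrase']
assert restored == text
```

The one explicit decode() call is the only place where the source
encoding matters; nothing after it needs to know about Latin-1.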
Unfortunately, that may not be practical (people can be locked on
Python 2 for all sorts of reasons). If that's the case, then I can't
offer much help to the OP beyond "learn how Unicode works" - which
isn't much help, as that's basically what he asked in the first