Putting Unicode characters in JSON
On Fri, Mar 23, 2018 at 10:47 AM, Steven D'Aprano
<steve+comp.lang.python at pearwood.info> wrote:
> On Fri, 23 Mar 2018 07:09:50 +1100, Chris Angelico wrote:
>>> I was reading, though, that JSON files must be encoded with UTF-8. So
>>> should I be doing string.decode('latin-1').encode('utf-8')? Or does
>>> the json module do that for me when I give it a unicode object?
>> Reconfigure your MySQL database to use UTF-8. There is no reason to use
>> Latin-1 in the database.
> You don't know that. You don't know what technical, compatibility, policy
> or historical constraints are on the database.
Okay. Give me a good reason for the database itself to be locked to
Latin-1. Make sure you explain how potentially saving the occasional
byte of storage (compared to UTF-8) justifies limiting the available
character set to the ones that happen to be in Latin-1, while it's
somehow essential NOT to limit the character set to ASCII.
>> If that isn't an option, make sure your JSON files are pure ASCII, which
>> is the common subset of UTF-8 and Latin-1.
> And that's utterly unnecessary, since any character which can be stored
> in the Latin-1 MySQL database can be stored in the Unicode JSON.
Irrelevant; if you fetch eight-bit data out of the database, it isn't
going to be a valid JSON file unless (1) it's really ASCII, like I
suggest; (2) you re-encode it to UTF-8; or (3) it was actually UTF-8
all along, despite being declared as Latin-1.
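
To make option (2) concrete, here's a minimal sketch in Python 3
terms; the raw bytes are a made-up stand-in for what a Latin-1
connection might hand back:

    import json

    # Made-up example: bytes as a Latin-1 column might return them.
    # 0xE9 is e-acute in Latin-1, but it is not valid UTF-8 on its own.
    raw = b'{"name": "caf\xe9"}'

    # Option (2): decode as Latin-1, re-encode as UTF-8.
    utf8_blob = raw.decode('latin-1').encode('utf-8')
    print(utf8_blob)                   # b'{"name": "caf\xc3\xa9"}'
    print(json.loads(utf8_blob.decode('utf-8')))  # {'name': 'café'}
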
Restricting JSON to ASCII is a very easy and common thing to do. It
just means that every non-ASCII character gets represented as a \u
escape sequence. In Python's JSON encoder, that's the ensure_ascii
parameter. Utterly unnecessary? How about standards-compliant and
entirely effective, unlike the re-encoding approach, which leaves the
database-stored blob invalid as JSON and forces yet another re-encode
on the way out?
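
For concreteness, a quick demonstration of ensure_ascii (it defaults
to True in the standard library's json module):

    import json

    data = {'name': 'café'}

    # Default (ensure_ascii=True): non-ASCII characters come out as
    # \u escapes, so the result is pure ASCII -- storable in a
    # Latin-1 column and still valid JSON on the way back out.
    print(json.dumps(data))                      # {"name": "caf\u00e9"}

    # ensure_ascii=False keeps the raw character; the result must
    # then be encoded as UTF-8 to be valid JSON interchange.
    print(json.dumps(data, ensure_ascii=False))  # {"name": "café"}
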