Putting Unicode characters in JSON
On Sat, Mar 24, 2018 at 1:46 AM, Tobiah <toby at tobiah.org> wrote:
> On 03/22/2018 12:46 PM, Tobiah wrote:
>> I have some mailing information in a Mysql database that has
>> characters from various other countries. The table says that
>> it's using latin-1 encoding. I want to send this data out
>> as JSON.
>> So I'm just taking each datum and doing 'name'.decode('latin-1')
>> and adding the resulting Unicode value right into my JSON structure
>> before doing .dumps() on it. This seems to work, and I can consume
>> the JSON with another program and when I print values, they look nice
>> with the special characters and all.
>> I was reading though, that JSON files must be encoded with UTF-8. So
>> should I be doing string.decode('latin-1').encode('utf-8')? Or does
>> the json module do that for me when I give it a unicode object?
> Thanks for all the discussion. A little more about our setup:
> We have used a LAMP stack system for almost 20 years to deploy
> hundreds of websites. The database tables are latin-1 only because
> at the time we didn't know how or care to change it.
> More and more, 'special' characters caused a problem. They would
> not come out correctly in a .csv file or wouldn't print correctly.
> Lately, I noticed that a JSON file I was sending out was delivering
> unreadable characters. That's when I started to look into Unicode
> a bit more. From the discussion, and my own guesses, it looks
> as though all I have to do is string.decode('latin-1'), and stuff
> that Unicode object right into my structure that gets handed to
> json.dumps().
Yep, this is sounding more and more like you need to go UTF-8 everywhere.
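In Python 3 terms, what the original post describes comes down to: decode the latin-1 bytes once, then let json.dumps() do the rest. A minimal sketch (the column value is made up for illustration):

```python
import json

# Bytes as they might come out of a latin-1 table column
# (a made-up value for illustration).
raw = b"Jos\xe9"              # 'José' encoded as latin-1

name = raw.decode("latin-1")  # now a proper (unicode) str

# By default json.dumps() escapes non-ASCII (ensure_ascii=True),
# so the output is plain ASCII and safe in any encoding:
print(json.dumps({"name": name}))

# With ensure_ascii=False it emits the character itself; the result
# should then be written out as UTF-8:
print(json.dumps({"name": name}, ensure_ascii=False))
```

Either way, the key point stands: hand json.dumps() real (unicode) strings, not raw latin-1 bytes.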
> If I changed my database tables to all be UTF-8 would this
> work cleanly without any decoding? Whatever people are doing
> to get these characters in, whether it's foreign keyboards,
> or fancy escape sequences in the web forms, would their intended
> characters still go into the UTF-8 database as the proper characters?
> Or now do I have to do a conversion on the way in to the database?
The best way to do things is to let your Python-MySQL bridge do the
decoding for you; you'll simply store and get back Unicode strings.
That's how things happen by default in Python 3 (I believe; been a
while since I used MySQL, but it's like that with PostgreSQL). My
recommendation is to give it a try; most likely, things will just
work.
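For the record, here's roughly what "let the bridge decode for you" looks like with the PyMySQL driver (MySQLdb/mysqlclient is similar). This is only a sketch of the connection setup; the host, credentials, and table are placeholders, and it obviously needs a running server:

```python
import pymysql

# With an explicit charset, the driver handles all encoding and
# decoding at the connection boundary; every text value you fetch
# is already a Python 3 str. utf8mb4 is MySQL's name for full UTF-8.
conn = pymysql.connect(
    host="localhost",
    user="app",
    password="...",          # placeholder
    db="mailing",            # placeholder
    charset="utf8mb4",
)
with conn.cursor() as cur:
    cur.execute("SELECT name FROM contacts")   # hypothetical table
    (name,) = cur.fetchone()                   # a str, ready for json.dumps()
```

No manual .decode() calls anywhere; the encoding lives in exactly one place, the connection.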
> We also get import data that often comes in .xlsx format. What
> encoding do I get when I dump a .csv from that? Do I have to
> ask the sender? I already know that they don't know.
Ah, now, that's a potential problem. A CSV file can't tell you what
encoding it's in. Fortunately, UTF-8 is designed to be fairly
dependable: if you attempt to decode something as UTF-8 and it works,
you can be confident that it really is UTF-8. But ultimately, you have
to just ask the person who exports it: "please export it in UTF-8".
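That "try UTF-8 and see if it works" trick can be written down directly. A small sketch (the function name is mine); it leans on the fact that latin-1 accepts any byte sequence, so it makes a serviceable fallback even when it isn't the true encoding:

```python
def decode_csv_bytes(raw):
    """Try UTF-8 first; it almost never decodes non-UTF-8 bytes by
    accident. Fall back to latin-1, which maps every possible byte
    to some character, so it never raises."""
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return raw.decode("latin-1"), "latin-1"

print(decode_csv_bytes(b"Jos\xc3\xa9"))  # UTF-8 bytes for 'José'
print(decode_csv_bytes(b"Jos\xe9"))      # latin-1 bytes for 'José'
```

It's a heuristic, not a guarantee (the fallback could mislabel, say, cp1252 data), which is why asking the sender for UTF-8 is still the right answer.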
Generally, things should "just work" as long as you're consistent with
encodings, and the easiest way to be consistent is to use UTF-8
everywhere. It's a simple rule that everyone can follow. (Hopefully.)
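As a footnote on what inconsistency looks like: the "unreadable characters" from the earlier message are almost certainly mojibake, which two lines reproduce:

```python
# UTF-8 bytes decoded with the wrong codec (latin-1 here): the
# two-byte UTF-8 sequence for 'é' turns into two separate characters.
text = "José"
mojibake = text.encode("utf-8").decode("latin-1")
print(mojibake)   # JosÃ©

# It is reversible as long as no bytes were lost along the way:
assert mojibake.encode("latin-1").decode("utf-8") == text
```

Seeing Ã followed by another odd character is the telltale sign that UTF-8 data was read as latin-1 (or cp1252) somewhere in the pipeline.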