Putting Unicode characters in JSON

On 3/23/18 6:35 AM, Chris Angelico wrote:
> On Fri, Mar 23, 2018 at 9:29 PM, Steven D'Aprano
> <steve+comp.lang.python at pearwood.info> wrote:
>> On Fri, 23 Mar 2018 18:35:20 +1100, Chris Angelico wrote:
>>> That doesn't seem to be a strictly-correct Latin-1 decoder, then. There
>>> are a number of unassigned byte values in ISO-8859-1.
>> That's incorrect, but I don't blame you for getting it wrong. Who thought
>> that it was a good idea to distinguish between "ISO 8859-1" and
>> "ISO-8859-1" as two related but distinct encodings?
>> https://en.wikipedia.org/wiki/ISO/IEC_8859-1
>> The old ISO 8859-1 standard, the one with undefined values, is mostly of
>> historical interest. For the last twenty years or so, anyone talking
>> about either Latin-1 or ISO-8859-1 (with or without dashes) almost
>> certainly means the 1992 IANA superset version which defines all 256
>> characters:
>>      "In 1992, the IANA registered the character map ISO_8859-1:1987,
>>      more commonly known by its preferred MIME name of ISO-8859-1
>>      (note the extra hyphen over ISO 8859-1), a superset of ISO
>>      8859-1, for use on the Internet. This map assigns the C0 and C1
>>      control characters to the unassigned code values thus provides
>>      for 256 characters via every possible 8-bit value."
>> Either that, or they actually mean Windows-1252, but let's not go there.
> Wait, whaaa.......
> Though in my own defense, MySQL itself seems to have a bit of a
> problem with encoding names. Its "utf8" is actually "UTF-8 with a
> maximum of three bytes per character", in contrast to "utf8mb4" which
> is, well, UTF-8.
> In any case, abusing "Latin-1" to store binary data is still wrong.
> That's what BLOB is for.
> ChrisA

One comment on this whole argument: the original poster asked how to get
data from a database that WAS using Latin-1 encoding into JSON (which
wants UTF-8 encoding), and in particular whether anything needed to be
done beyond .decode('Latin-1'), such as a following .encode('UTF-8').
The answer should be a simple Yes or No.
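
For what it's worth, here is a minimal sketch of that short answer in
Python 3. It assumes the database driver hands back raw bytes; the
sample value and the variable names are made up for illustration and
are not from the original poster's code.

```python
import json

# Hypothetical raw bytes, as they might come back from a Latin-1 column.
raw = b"caf\xe9"

# bytes -> str. Python's latin-1 codec maps every byte value 0-255 to the
# code point with the same number, so this decode cannot fail.
text = raw.decode("latin-1")

# Default json.dumps escapes non-ASCII characters, so the result is a
# pure-ASCII str and no explicit UTF-8 step is needed to represent it.
escaped = json.dumps({"name": text})

# With ensure_ascii=False the character is kept as-is in the str; encode
# to UTF-8 only at the point where bytes are written or sent.
literal = json.dumps({"name": text}, ensure_ascii=False)
payload = literal.encode("utf-8")

print(escaped)   # {"name": "caf\u00e9"}
print(payload)   # b'{"name": "caf\xc3\xa9"}'
```

So the .decode('Latin-1') is the required step, and .encode('UTF-8')
only comes into play if the JSON text has to leave the program as bytes.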

Instead, someone took the opportunity to advocate that a wholesale 
change to the database was the only reasonable course of action.

First comment: when someone proposes a change, the burden of proof that
the change is warranted generally falls on them. This is especially true
when they are telling someone else to make the change.

Absolute statements are very hard to prove (though the difficulty of
proof doesn't remove the need to provide it), and are in fact fairly
easy to disprove: one counterexample is enough. Counterexamples to the
absolute statement have been provided.

When dealing with a code base, backwards compatibility IS important, and
casually changing something that fundamental isn't the first thing
someone should be thinking about. We weren't given any details about the
overall system this was part of, and there easily could be other code
using the database that such a change would break. One easy Python
example is the change from Python 2 to Python 3: how many years has that
gone on, and how many more will people continue to deal with it? That
change was over a similar issue: deciding that, at least for today,
Unicode is the best solution for storing arbitrary text, and forcing
that decision down to the fundamental level.

Richard Damon