
Convert a list with wrong encoding to utf8

On 2019-02-14 18:16, Calvin Spealman wrote:
> If you see something like this
> '\xce\x86\xce\xba\xce\xb7\xcf\x82
> \xce\xa4\xcf\x83\xce\xb9\xce\xac\xce\xbc\xce\xb7\xcf\x82'
> then you don't have a string, you have raw bytes. You don't "encode" bytes,
> you decode them. If you know this is already encoded as UTF-8 then you just
> need the decode('utf8') part and *not* the encode('latin1') step.
> encode() is something that turns text into bytes
> decode() is something that turns bytes into text
> So, if you already have bytes and you need text, you should only want to be
> doing a decode() and you just need to specify the correct encoding.
It doesn't have a 'b' prefix, so either it's Python 2 or it's a Unicode 
string that was decoded wrongly from the bytes.
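
The two cases can be told apart with a short Python 3 sketch. The byte
values are the ones quoted above; the variable names are only illustrative,
and the assumption that the wrong codec was Latin-1 comes from the
encode('latin1') workaround reported further down the thread.

```python
# UTF-8 bytes for the Greek name quoted in this thread.
raw = b'\xce\x86\xce\xba\xce\xb7\xcf\x82 \xce\xa4\xcf\x83\xce\xb9\xce\xac\xce\xbc\xce\xb7\xcf\x82'

# Case 1: you really hold bytes, so a single decode is all that is needed.
correct = raw.decode('utf8')

# Case 2: something upstream decoded the bytes with the wrong codec
# (assumed here to be Latin-1), giving a str with one character per byte.
mangled = raw.decode('latin1')

# Latin-1 maps code points 0-255 straight back to bytes 0-255, so encoding
# with it undoes the wrong decode; the bytes can then be decoded as UTF-8.
repaired = mangled.encode('latin1').decode('utf8')

print(type(raw).__name__, type(mangled).__name__)  # bytes vs. str
print(repaired == correct)
```

If the repair step is needed at all, the data was mis-decoded before it
reached this code, and that is the place worth fixing.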

> On Thu, Feb 14, 2019 at 12:15 PM <vergos.nikolas at gmail.com> wrote:
>> On Thursday, 14 February 2019 at 6:45:29 PM UTC+2, Calvin Spealman
>> wrote:
>> > You can only decode FROM the same encoding you've encoded TO. Any decoding
>> > must know the input it receives follows the rules of its encoding scheme.
>> > latin1 is not utf8.
>> >
>> > However, in your case, you aren't seeing a problem with the decoding. That
>> > step is never reached. It is failing to encode the string as latin1 because
>> > it is not compatible with the latin1 scheme. Your string contains
>> > characters which cannot be represented in latin1.
>> >
>> > It really is not clear what you're trying to accomplish here. The string
>> > encoding was already handled when you pulled this out of the database and
>> > you should not need to do anything like this at all. You already have a
>> > decoded string, because in python ALL strings are decoded already. Encoding
>> > is only a process of converting strings to raw bytes for storage or
>> > transmission, which you don't appear to be doing here.
>> Names in the database are stored in utf8.
>> When the script runs, it reads them and handles them as utf8, right?
>> If that is the case, then why do I see bytes in hexadecimal format when I
>> print the 'names' list?
>> '\xce\x86\xce\xba\xce\xb7\xcf\x82
>> \xce\xa4\xcf\x83\xce\xb9\xce\xac\xce\xbc\xce\xb7\xcf\x82'
>> And only if I do
>> for name in names:
>>     print( name.encode('latin1').decode('utf8') )
>> can I see the values of the 'names' list correctly in Greek.
>> But where did latin1 (ISO-8859-1) come into play? And apart from printing
>> the names like above, how can I store them as proper UTF-8?