
Convert a list with wrong encoding to utf8

On 02/14/2019 12:02 PM, vergos.nikolas at gmail.com wrote:
> On Thursday, 14 February 2019, 8:16:40 UTC+2, Calvin Spealman wrote:
>> If you see something like this
>> '\xce\x86\xce\xba\xce\xb7\xcf\x82 \xce\xa4\xcf\x83\xce\xb9\xce\xac\xce\xbc\xce\xb7\xcf\x82'
>> then you don't have a string, you have raw bytes. You don't "encode" bytes,
>> you decode them. If you know this is already encoded as UTF-8 then you just
>> need the decode('utf8') part and *not* the encode('latin1') step.
>> encode() is something that turns text into bytes
>> decode() is something that turns bytes into text
>> So, if you already have bytes and you need text, you should only be
>> doing a decode(), and you just need to specify the correct encoding.
> I agree, but I don't know what encoding the string is in.
> I just tried
> names = tuple( [s.decode('utf8') for s in names] )
> but I get this error:
> AttributeError("'str' object has no attribute 'decode'",)

Strictly speaking, that's correct.  A Python 3 string object is already
decoded unicode. It cannot be decoded again.
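
To make the distinction concrete, here is a minimal sketch using the first
word of the bytes quoted earlier in this thread. The variable names are only
illustrative; this is not the original poster's code.

```python
# UTF-8 encoded bytes, copied from the literal quoted above (first word only).
raw = b'\xce\x86\xce\xba\xce\xb7\xcf\x82'

# bytes -> str: decoding is the only step needed when you really have bytes.
text = raw.decode('utf8')

# A Python 3 str is already decoded text, so it has no decode() method;
# calling text.decode('utf8') raises the AttributeError shown in the post.
raw_has_decode = hasattr(raw, 'decode')    # True
text_has_decode = hasattr(text, 'decode')  # False
```

So the AttributeError by itself tells you that names holds str objects, not
bytes objects.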
> But why does it say s is a string object? Since we have the names as raw bytes, shouldn't it be a bytes object?

It's clearly not raw bytes.

> How can I turn names from raw bytes to utf-8 strings?

They apparently aren't raw bytes.  If they were, .decode() would work
instead of raising AttributeError.

> P.S. Who encoded them as raw bytes anyway? Since they are fetched directly from the database, shouldn't
> Python 3 have them stored in names as utf-8 strings? Why raw bytes instead?

Something very strange is going on with your database and/or your
queries.  The pymysql API should already be decoding the UTF-8 bytes for
you and returning a Unicode string.  I have no idea why you're getting a
Unicode string whose code points have the same values as the UTF-8
bytes.  You'll have to post a bit more of your code, such as a simple,
complete query example (a few lines of code) that shows absolutely
everything you're doing to the string.  You will also want to run your
queries with the mysql command-line utilities to see what kind of data
you're getting out.  If MySQL is told to use UTF-8 for varchar, and
you're inserting the data as correctly formed UTF-8 encoded byte
strings, it should come back out in Python as Unicode.
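
For illustration only, here is one way a str can end up with code points
equal to the UTF-8 byte values, and how that particular case can be undone.
The assumption (not confirmed anywhere in this thread) is that the UTF-8
bytes were decoded as Latin-1 somewhere between the database and Python.
The repair() helper is a hypothetical name, not part of pymysql.

```python
# UTF-8 bytes for the first name quoted earlier in the thread.
raw = b'\xce\x86\xce\xba\xce\xb7\xcf\x82'

# Assumed failure mode: the bytes were decoded with the wrong codec.
# Latin-1 maps every byte to the code point of the same value, so the
# result is a str of 8 characters that "looks like" the raw bytes.
mangled = raw.decode('latin1')

def repair(s):
    """Reverse a Latin-1 mis-decode of UTF-8 data (hypothetical helper).

    Returns s unchanged if it does not fit that pattern.
    """
    try:
        return s.encode('latin1').decode('utf8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return s

names = tuple(repair(s) for s in (mangled, 'plain ascii'))
```

This only papers over the symptom. If it does fix your strings, the real
problem is still the connection or column character set, and that is where
it should be corrected.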