UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10442: character maps to <undefined>
On 2018-06-03 16:36:12 -0700, bellcanadardp at gmail.com wrote:
> On Tuesday, 22 May 2018 17:23:55 UTC-4, Peter J. Holzer wrote:
> > On 2018-05-20 15:43:54 +0200, Karsten Hilbert wrote:
> > > On Sun, May 20, 2018 at 04:59:12AM -0700, bellcanadardp at gmail.com wrote:
> > > > thank you for the reply, but how exactly am i supposed to find
> > > > out what is the correct encoding??
> > >
> > > One CAN NOT.
> > >
> > > The best you can do is to go ask the canonical source of the
> > > file what encoding the file is _supposed_ to be in.
> > I disagree on both counts.
> > 1) For any given file it is almost always possible to find the correct
> > encoding (or *a* correct encoding, as there may be more than one).
> > This may require domain-specific knowledge (e.g. it may be necessary
> > to recognize the human language and know at least some distinctive
> > words, or to know some special symbols likely to be used in a data
> > file), and it almost always takes a bit of detective work and trial
> > and error. But I don't think I ever encountered a file where I
> > couldn't figure out the encoding.
> hello peter ...how exactly would i solve this issue .....
There is no "exactly" here. Determining the encoding of a file depends
on experience and trial and error. However, I can give you some general
guidelines.
Preparation:
Make sure you have a way to reliably display files:
1) As a stream of bytes. On Linux, hd works well for this purpose,
although a hex editor might be even better.
2) As Unicode text. On Linux, terminal emulators usually use UTF-8
encoding, so viewing a file with less should be sufficient.
Beware of programs which try to guess the encoding. They can
fool you. If you don't have anything which works reliably you
might want to have a look at my utf8dump script
(https://www.hjp.at/programs/utf8dump/ (Perl code, sorry ;-)).
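If neither hd nor a hex editor is at hand, a few lines of Python can serve as a byte viewer. This is only a minimal sketch: the function name hexdump and the 16-byte row width are illustrative choices, not part of any tool mentioned above.

```python
def hexdump(data: bytes, width: int = 16) -> str:
    """Return a hex + ASCII listing of data, similar in spirit to hd."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Show printable ASCII as-is, everything else as a dot.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{width * 3}} |{text_part}|")
    return "\n".join(lines)


# Read in binary mode so Python does not try to decode anything:
# with open("somefile.txt", "rb") as f:   # hypothetical file name
#     print(hexdump(f.read()))
```

The important part is the "rb" mode: it hands you the raw bytes, so no codec gets a chance to guess or to fail.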
As has already been mentioned, chardet usually does a good job.
So first let chardet guess the encoding. Then use iconv to convert
from this encoding to UTF-8 (or any other UTF you can reliably read)
and open it in your text reader (preparation step 2 above) to check
whether the result makes sense. If it does, you are done.
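The same two steps (guess, then convert to UTF-8) can also be done from Python. The sketch below assumes the third-party chardet package is installed and falls back to None if it is not; the helper names guess_encoding and to_utf8 are made up for illustration.

```python
def guess_encoding(data: bytes):
    """Return chardet's guess for data, or None if chardet is unavailable."""
    try:
        import chardet  # third-party: pip install chardet
    except ImportError:
        return None
    # detect() returns a dict; its "encoding" entry may itself be None.
    return chardet.detect(data)["encoding"]


def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Roughly what `iconv -f source_encoding -t UTF-8` does."""
    # Raises UnicodeDecodeError if data is not valid in source_encoding.
    return data.decode(source_encoding).encode("utf-8")
```

Treat the guess as a starting point only. The converted output still has to be read by a human to see whether it makes sense.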
Checking other encodings:
This is where it gets tedious. You could systematically try all
encodings supported by iconv, but there are a lot of them (over
1000!). So you should try to narrow it down: What language is the
file in? On what OS was the file (probably) created? If most of the
non-ascii characters are already correct, but a few are wrong, what
other encodings are there in the same family? But however you
determined the list of candidate encodings, the actual check is the
same as above: Use iconv to convert from the candidate encoding and
check the result for plausibility.
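The candidate check can be scripted so that the results are shown side by side. This is a sketch under assumptions: the function name try_encodings, the sample text and the candidate list are invented for the example, and Python's codec names are used instead of iconv's.

```python
def try_encodings(data: bytes, candidates):
    """Map each candidate encoding to decoded text, or None if decoding fails."""
    results = {}
    for name in candidates:
        try:
            results[name] = data.decode(name)
        except (UnicodeDecodeError, LookupError):
            # UnicodeDecodeError: the bytes are invalid in this encoding.
            # LookupError: Python does not know this encoding name.
            results[name] = None
    return results


# Sample bytes standing in for an excerpt of the problem file.
sample = "Grüße aus Österreich".encode("cp1252")
candidates = ["utf-8", "cp1252", "iso-8859-2", "cp437"]
for name, text in try_encodings(sample, candidates).items():
    print(f"{name:12} {text!r}")
```

A candidate that decodes without an error is not necessarily right. Single-byte encodings such as cp437 accept any byte sequence, so the plausibility check on the decoded text remains the deciding step.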
Use the encoding in your program:
When you are done, open the file in your program with open(...,
encoding='...'), using the encoding you determined above.
> i have a script that works in python 2 but not python3..i did 2to3.py
> ...but i still get the error...character undefined..unicode decode
> error cant decode byte 1x09 in line 7414 from cp 1252..like would you
> have a straight solution answer??..i cant get a straight answer..it
> was ported from ansi to python...so its utf-8 as far as i can see
If it is utf-8, just open the file with open(filename, encoding="utf-8")
(or open(filename, encoding="utf-8-sig"), if it starts with a BOM).
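One common way to get exactly the error in the subject line is to read UTF-8 data with the cp1252 codec, which is often the default on Windows. The sketch below illustrates this with invented content. I have not seen the original file, so the BOM and the quotation marks are assumptions chosen to reproduce the symptom.

```python
import os
import tempfile

# Assumption for illustration: the file is UTF-8, starts with a BOM and
# contains a right double quotation mark (U+201D, bytes e2 80 9d in UTF-8).
raw = b"\xef\xbb\xbf" + "He said \u201chello\u201d".encode("utf-8")

with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # cp1252 has no character at 0x9d, hence "character maps to <undefined>".
    try:
        with open(path, encoding="cp1252") as f:
            f.read()
        cp1252_error = None
    except UnicodeDecodeError as exc:
        cp1252_error = exc

    with open(path, encoding="utf-8") as f:
        with_bom = f.read()        # BOM survives as U+FEFF at the start

    with open(path, encoding="utf-8-sig") as f:
        without_bom = f.read()     # BOM is stripped
finally:
    os.remove(path)

print(cp1252_error)
print(repr(with_bom))
print(repr(without_bom))
```

If your file behaves like this, passing encoding="utf-8" (or "utf-8-sig") explicitly is the fix. If it does not, go back to the detective work described above.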
And follow Steven's advice and read all the stuff he mentioned. It is
important to have a firm understanding of what "character", "byte",
"encoding" etc. mean. If you understand that, the rest is easy (sometimes
tedious, but not difficult). If you don't understand that, you can only
resort to trial and error and will be continuously baffled by unexpected
results.
_ | Peter J. Holzer | we build much bigger, better disasters now
|_|_) | | because we have much more sophisticated
| | | hjp at hjp.at | management tools.
__/ | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>