Unicode filenames


On 7/12/19 7:17 AM, Bob van der Poel wrote:
> I have some files which came off the net with, I'm assuming, unicode
> characters in the names. I have a very short program which takes the
> filename and puts into an emacs buffer, and then lets me add information to
> that new file (it's a poor man's DB).
> 
> Next, I can look up text in the file and open the saved filename.
> Everything works great until I hit those darn unicode filenames.
> 
> Just to confuse me even more, the error seems to be coming from a bit of
> tkinter code:
>   if sresults.has_key(textAtCursor):
>          bookname = os.path.expanduser(sresults[textAtCursor].strip())
> 
> which generates
> 
>    UnicodeWarning: Unicode equal comparison failed to convert both arguments
> to Unicode - interpreting them as being unequal  if
> sresults.has_key(textAtCursor):
> 
> I really don't understand the business about "both arguments". Not sure how
> to proceed here. Hoping for a guideline!


I'm guessing that the "both arguments" relates to expanduser(), because 
that is the first point at which the filename is identified to Python as 
anything more than a string of characters.

[A filename will be a string of characters, but a string of characters is 
not necessarily a (legal) filename!]
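
Another possible reading, offered as an assumption rather than a diagnosis: the quoted code calls has_key(), which exists only in Python 2, and in Python 2 this particular UnicodeWarning is issued when a byte string and a unicode string are compared and the byte string cannot be converted. The "both arguments" would then be the dictionary key and the text being looked up. Python 3 no longer warns by default; the lookup just quietly misses, as this small sketch shows (the dictionary contents and the helper name are invented for illustration):

```python
def lookup(mapping, key, encoding="utf-8"):
    """Look up key in mapping, decoding it first if it arrived as bytes.

    Hypothetical helper: the encoding is an assumption the caller must supply.
    """
    if isinstance(key, bytes):
        key = key.decode(encoding)
    return mapping.get(key)


results = {"café.txt": "notes about the file"}   # keys are text (str)
typed = "café.txt".encode("utf-8")               # same name, but as bytes

print(typed in results)         # bytes never equal str, so the key is not found
print(lookup(results, typed))   # decoding first makes both sides the same type
```

The point is only that both sides of the comparison need to be the same kind of object (both text, or both bytes) before the lookup can succeed.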

I'd further suggest that, if you are using Python 3 (cf 2), your analysis 
may be the wrong way around. Python 3 treats strings as Unicode. However, 
there is (and certainly in the past there was) no requirement for the 
OpSys and IOCS to encode filenames in Unicode.
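
To illustrate that mismatch: on POSIX systems Python 3 copes with filename bytes that do not fit the expected encoding by carrying them through as lone surrogates (the "surrogateescape" error handler). A minimal sketch, assuming a name with a Latin-1 byte in it turns up where UTF-8 is expected; the sample bytes are invented:

```python
# Hypothetical filename: 'é' stored as the single Latin-1 byte 0xE9.
raw = b"caf\xe9.txt"

# 0xE9 followed by '.' is not valid UTF-8, so the byte is escaped
# as the lone surrogate U+DCE9 instead of raising an error.
name = raw.decode("utf-8", errors="surrogateescape")
print(ascii(name))   # ascii() avoids an encoding error when printing

# Encoding with the same handler gives back the original bytes exactly.
restored = name.encode("utf-8", errors="surrogateescape")
print(restored == raw)
```

Such a name can be passed back to open() and friends, but it will not compare equal to the 'same' name typed in as proper Unicode text.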

The problem (for me) came from MSFT's (for example) many variations of 
ISO-8859-n, and that there are no clues as to which of these was used in 
naming the file, and thus many possible 'translations' into Unicode.
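
A quick sketch of why the lack of clues hurts: the same byte decodes without complaint under several single-byte encodings, each giving a different character. The byte and the list of candidate codecs are chosen arbitrarily for illustration:

```python
# One byte above 0x7F; a hypothetical sample, not taken from any real file.
raw = b"\xf5"

# A few single-byte codecs that all define this byte.
CANDIDATES = ["latin-1", "iso8859_2", "iso8859_5", "cp1252"]

decoded = {codec: raw.decode(codec) for codec in CANDIDATES}
for codec, text in decoded.items():
    # Show the code point rather than the glyph, so the output is terminal-safe.
    print(f"{codec:10} -> U+{ord(text):04X}")
```

None of these decodes raises an error, so Python has no way of telling you which one was 'right'.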

You can start to address the issue by using Python's bytes (instead of 
strings); however, that cold reality still intrudes.
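
As a rough sketch of the bytes approach: passing a bytes path to os.listdir() makes it return the names as bytes, untouched by any decoding. The helper name and the temporary directory are only there to make the example self-contained:

```python
import os
import tempfile


def list_raw_names(directory):
    """Return the entries of directory as bytes, without decoding them."""
    return os.listdir(os.fsencode(directory))


with tempfile.TemporaryDirectory() as tmp:
    # Create one plainly named file so there is something to list.
    open(os.path.join(tmp, "plain.txt"), "w").close()
    raw_names = list_raw_names(tmp)
    print(raw_names)
```

The names stay usable for open(), os.rename(), etc., but anything shown to a human (or pasted into an emacs buffer) still has to be decoded with *some* encoding.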

Do you know the provenance of these files, e.g. that they are in French 
and from an MS-Win machine? If so, you may be able to use decode() and 
encode(), but...
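
A hedged sketch of that encode()/decode() round trip, assuming the name reached Python with escaped surrogates and that the files really were named under cp1252 (both are assumptions; the function and parameter names are made up):

```python
def redecode(name, assumed="cp1252", seen_as="utf-8"):
    """Re-interpret a filename under a guessed original encoding.

    name     -- the str Python produced, possibly containing escaped bytes
    assumed  -- the encoding we *believe* the file was named in (a guess)
    seen_as  -- the encoding Python used when it first decoded the name
    """
    raw = name.encode(seen_as, errors="surrogateescape")  # back to the raw bytes
    return raw.decode(assumed)                            # decode with the guess


fixed = redecode("caf\udce9.txt")
print(fixed)
```

The catch: if the guess is wrong, or the name was already decoded correctly, this happily produces a different kind of garbage, and decode() can also fail outright on bytes the assumed encoding does not define.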

Still looking for trouble? Knowing a filename was in Spanish/Portuguese, I 
was able to take the filename's individual Unicode characters/surrogates 
and subtract an applicable constant, so that accented letters fell 
'back' into the correct Unicode range. (This is extremely risky, and 
could quite easily make matters worse/more confusing.)
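
One way such a subtraction might look, as a sketch only: it assumes the stray characters are escaped bytes in the range U+DC80..U+DCFF and that the original bytes were Latin-1, so subtracting 0xDC00 lands on the intended accented letter. The constant and the function name are assumptions for illustration, not a record of what was actually done at the time:

```python
ESCAPE_BASE = 0xDC00   # assumed constant: offset used for escaped bytes


def unescape_surrogates(name):
    """Shift escaped bytes (U+DC80..U+DCFF) down into the Latin-1 range."""
    out = []
    for ch in name:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            out.append(chr(cp - ESCAPE_BASE))   # e.g. U+DCE9 -> U+00E9
        else:
            out.append(ch)                      # leave everything else alone
    return "".join(out)


print(unescape_surrogates("caf\udce9.txt"))
```

This only gives the right letters if the files truly were named in Latin-1; with any other original encoding the result looks plausible but is wrong.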

I warn you that pursuing this matter involves disappearing down into a 
very deep 'rabbit hole', but YMMV!

WebRefs:
https://docs.python.org/3/howto/unicode.html
https://www.dictionary.com/e/slang/rabbit-hole/
-- 
Regards =dn