Why exception from os.path.exists()?
On Tue, 05 Jun 2018 17:28:24 +0200, Peter J. Holzer wrote:
> If a disk with a file system which allows embedded NUL characters is
> mounted on Linux (let's for the sake of the argument assume it is HFS+,
> although I have to admit that I don't know anything about the internals
> of that filesystem), then the low level filesystem code has to map that
> character to something else. Even the generic filesystem code of the
> kernel will never see that NUL character,
Even if this were true, why is it even the tiniest bit relevant to what
os.path.exists() does when given a path containing a NUL byte?
> let alone the user space. As
> far as the OS is concerned, that file doesn't contain a NUL character.
I don't care about "as far as the OS". I care about users, people like
me. If I say "Here's a file called "sp\0am" then I don't care what the OS
does, or the FS driver, or the disk hardware. I couldn't care less what
the actual byte pattern on the disk is.
If you told me that the pattern of bytes representing that filename was
0x0102030405 then I'd be momentarily impressed by the curious pattern and
then do my best to immediately forget all about it.
As a Python programmer, *why do you care* about NULs? How does this
special treatment make your life as a Python programmer better?
> The whole system (except for some low-level FS-dependent code) will
> always only see the mapped name.
Yes. So what? That's *already the case*. Even the Python string you pass
to os.path.exists is already mapped, and errors from the kernel are
mapped to False. Why should NUL be treated differently?
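That mapping-to-False behaviour is easy to demonstrate. A minimal sketch
(behaviour as of CPython 3.8+, where os.path.exists also catches the
ValueError for an embedded NUL and returns False; on earlier 3.x versions
the NUL case raised instead):

```python
import os.path

# A name far longer than the kernel's limit: the kernel's
# ENAMETOOLONG error is swallowed and mapped to False.
print(os.path.exists("x" * 10000))   # False

# A NUL byte in the path: on Python 3.8+ this also returns False;
# before 3.8, os.path.exists() raised ValueError here instead.
try:
    print(os.path.exists("sp\0am"))
except ValueError:
    print("ValueError (pre-3.8 behaviour)")
```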
Typical Linux file systems (ext3, ext4, btrfs, ReiserFS etc.) don't
support Unicode, only raw bytes, but we can query "invalid" file names
containing non-ASCII characters without any problem. We don't get
ValueError just because of some irrelevant technical detail that the
file system only supports characters in the range of bytes 1...255
(excluding 47, the slash). We can do this because Python seamlessly maps
Unicode to bytes and back again.
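That seamless mapping is exposed by os.fsencode() and os.fsdecode(),
which on POSIX systems use the filesystem encoding with the
surrogateescape error handler (PEP 383), so even bytes that aren't valid
in the locale's encoding round-trip through str. A sketch (POSIX
behaviour assumed; Windows uses a different error handler):

```python
import os

# 0xFF is not valid UTF-8, yet the name still round-trips:
# surrogateescape smuggles it through str as a lone surrogate.
name_bytes = b"report-\xff.txt"
as_str = os.fsdecode(name_bytes)
assert os.fsencode(as_str) == name_bytes
```

This is exactly why a byte-oriented file system never forces ValueError
on ordinary lookups: Python translates at the boundary instead of
rejecting the name.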
You may have heard of a little-known operating system called "Windows",
which defaults to NTFS as its file system. I'm told that there are a few
people who use this file system. Even under Linux, you might have
(knowingly or unknowingly) used a network file system or storage device
that used NTFS under the hood.
If so, then every time you query a filename, even an ordinary looking one
like "foo", you could be dealing with multiple NUL bytes, as the NTFS
file system (even under Linux!) uses Unicode file names encoded with
UTF-16. There's a good chance that EVERY filename you've used on a NAS
device or network drive has included embedded NUL bytes.
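The NTFS point is easy to check for yourself: encode any ordinary ASCII
name as UTF-16 (little-endian, as stored on disk) and every other byte
is a NUL:

```python
# NTFS stores names as UTF-16; for ASCII characters the
# high byte of each 16-bit code unit is 0x00.
raw = "foo".encode("utf-16-le")
print(raw)                 # b'f\x00o\x00o\x00'
assert b"\x00" in raw
```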
You've painted a pretty picture of the supposed confusion and difficulty
such NUL bytes would cause, but it's all nonsense. We can already
seamlessly and transparently interact with file systems whose file names
include NUL bytes under Linux.
BUT even if what you said were true, that Linux cannot deal with NUL
bytes in file names even with driver support, even if passing a NUL byte
to the Linux kernel would cause the fall of human civilization, that
STILL wouldn't require us to raise ValueError from os.path.exists!
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson