Why exception from os.path.exists()?

On Tue, 05 Jun 2018 17:28:24 +0200, Peter J. Holzer wrote:
> If a disk with a file system which allows embedded NUL characters is
> mounted on Linux (let's for the sake of the argument assume it is HFS+,
> although I have to admit that I don't know anything about the internals
> of that filesystem), then the low level filesystem code has to map that
> character to something else. Even the generic filesystem code of the
> kernel will never see that NUL character,

Even if this were true, why is it even the tiniest bit relevant to what 
os.path.exists() does when given a path containing a NUL byte?

> let alone the user space. As
> far as the OS is concerned, that file doesn't contain a NUL character.

I don't care about "as far as the OS". I care about users, people like 
me. If I say "Here's a file called 'sp\0am'" then I don't care what the OS 
does, or the FS driver, or the disk hardware. I couldn't care less what 
the actual byte pattern on the disk is.

If you told me that the pattern of bytes representing that filename was 
0x0102030405 then I'd be momentarily impressed by the curious pattern and 
then do my best to immediately forget all about it.

As a Python programmer, *why do you care* about NULs? How does this 
special treatment make your life as a Python programmer better?

> The whole system (except for some low-level FS-dependent code) will
> always only see the mapped name.

Yes. So what? That's *already the case*. Even the Python string you pass to 
os.path.exists is already mapped, and errors from the kernel are mapped 
to False. Why should NUL be treated differently?
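
To make that concrete, here is a minimal sketch of the behaviour being 
argued for. The helper name exists_quiet is hypothetical, not part of the 
os module; it simply treats the ValueError discussed in this thread the 
same way the kernel's errors are already treated. Python releases newer 
than the ones current in this thread may already return False here, in 
which case the except branch is never taken.

```python
import os


def exists_quiet(path):
    """Hypothetical helper: like os.path.exists(), but a path that
    cannot be represented at the OS level counts as "does not exist"."""
    try:
        return os.path.exists(path)
    except ValueError:
        # Some Python versions raise ValueError for an embedded NUL.
        # Map it to False, just as OS-level errors are mapped to False.
        return False


print(exists_quiet("sp\0am"))   # a name with an embedded NUL
print(exists_quiet(os.curdir))  # the current directory
```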

Typical Linux file systems (ext3, ext4, btrfs, ReiserFS etc) don't 
support Unicode, only bytes, but we can query "invalid" file 
names containing characters like ? ? or ?, without any problem. We don't 
get ValueError just because of some irrelevant technical detail that the 
file system doesn't support characters outside of the range of bytes 
1...255 (excluding 47). We can do this because Python seamlessly maps 
Unicode to bytes and back again.
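
As a rough illustration of that mapping, the sketch below encodes a 
non-ASCII name by hand. It assumes the file system encoding is UTF-8, 
which is common on Linux but not guaranteed, and the name "café" is just 
an example chosen here. In real code Python does this step for you.

```python
name = "caf\u00e9"             # example name with one non-ASCII character
raw = name.encode("utf-8")     # assumption: UTF-8 file system encoding

print(raw)                     # b'caf\xc3\xa9'

# Every resulting byte is in the range the file system accepts:
# 1...255, excluding 47 (the "/" separator).
print(all(b != 0 and b != 47 for b in raw))   # True

# And the mapping is reversible.
print(raw.decode("utf-8") == name)            # True
```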

You may have heard of a little-known operating system called "Windows", 
which defaults to NTFS as its file system. I'm told that there are a few 
people who use this file system. Even under Linux, you might have 
(knowingly or unknowingly) used a network file system or storage device 
that used NTFS under the hood.

If so, then every time you query a filename, even an ordinary looking one 
like "foo", you could be dealing with multiple NUL bytes, as the NTFS 
file system (even under Linux!) uses Unicode file names encoded with 
UTF-16. There's a good chance that EVERY filename you've used on a NAS 
device or network drive has included embedded NUL bytes.
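
A quick way to see those NUL bytes is to encode an ordinary name as 
UTF-16 yourself. This sketch assumes the little-endian form without a 
byte order mark; it only shows the encoding and does not touch any file 
system.

```python
name = "foo"
utf16 = name.encode("utf-16-le")   # assumption: little-endian, no BOM

print(utf16)            # b'f\x00o\x00o\x00'
print(utf16.count(0))   # 3, one NUL byte per ASCII character
```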

You've painted a pretty picture of the supposed confusion and difficulty 
such NUL bytes would cause, but it's all nonsense. We already can 
seamlessly and transparently interact with file systems where file names 
include NUL bytes under Linux.

BUT even if what you said was true, that Linux cannot deal with NUL bytes 
in file names even with driver support, even if passing a NUL byte to the 
Linux kernel would cause the fall of human civilization, that STILL 
wouldn't require us to raise ValueError from os.path.exists!

Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson