Re: file-system specification
On Wed, 9 May 2018 11:28:15 -0400
Martin Durant <martin.durant@xxxxxxxxxxx> wrote:
> I have sketched out a possible start of a python-wide file-system specification
> This came about from my work in some other (remote) file-systems implementations for python, particularly in the context of Dask. Since arrow also cares about both local files and, for example, hdfs, I thought that people on this list may have comments and opinions about a possible standard that we ought to converge on. I do not think that my suggestions so far are necessarily right or even good in many cases, but I want to get the conversation going.
Here are some comments:
- API naming: you seem to favour re-using Unix command-line monikers in
some places, while using more regular verbs or names in other
places. I think it should be consistent. Since the Unix
command-line doesn't exactly cover the exposed functionality, and
since Unix tends to favour short cryptic names, I think it's better
to use Python-like naming (which is also more familiar to non-Unix
users). For example "move" or "rename" or "replace" instead of "mv".
- **kwargs parameters: a couple of APIs (`mkdir`, `put`, ...) allow passing
arbitrary parameters, which I assume are intended to be
backend-specific. It makes it difficult to add other optional
parameters to those APIs in the future. So I'd make the
backend-specific directives a single (optional) dict parameter rather
than a **kwargs.
- `invalidate_cache` doesn't state whether it invalidates recursively
or not (recursively sounds better intuitively?). Also, I think it
would be more flexible to take a list of paths rather than a single one.
- `du`: the effect of the `deep` parameter isn't obvious to me. I don't
know what it would mean *not* to recurse here: what is the size of a
directory if you don't recurse into it?
- `glob` may need a formal definition (are trailing slashes
significant for directory or symlink resolution? this kind of thing),
though you may want to keep edge cases backend-specific.
- are `head` and `tail` at all useful? They can be easily recreated
using a generic `open` facility.
- `read_block` tries to do too much in a single API IMHO, and
using `open` directly is more flexible anyway.
- if `touch` is intended to emulate the Unix API of the same name, the
docstring should state "Create empty file or update last modification
time".
- the information dicts returned by several APIs (`ls`, `info`, ...)
need standardizing, at least for non backend-specific fields.
- if the backend is a networked filesystem with non-trivial latency,
perhaps the operations would deserve being batched (operate on
several paths at once), though I will happily defer to your expertise
on the topic.
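
To make two of the points above concrete (the `**kwargs` one and the
`head`/`tail` one), here is a minimal sketch. It runs against the local
filesystem only, and the names `open_func` and `backend_options` are
hypothetical, not taken from the proposed specification.

```python
import os


def head(open_func, path, size=1024):
    """Return the first `size` bytes of `path` using only a generic open()."""
    with open_func(path, "rb") as f:
        return f.read(size)


def tail(open_func, path, size=1024):
    """Return the last `size` bytes of `path` using only a generic open()."""
    with open_func(path, "rb") as f:
        f.seek(0, os.SEEK_END)          # jump to the end to learn the length
        length = f.tell()
        f.seek(max(length - size, 0))   # clamp for files shorter than `size`
        return f.read(size)


def mkdir(path, exist_ok=False, backend_options=None):
    """Create a directory; backend directives travel in one optional dict.

    `backend_options` is a hypothetical parameter name. Because it is a
    single dict rather than **kwargs, generic keywords such as `exist_ok`
    can be added later without clashing with backend-specific keys.
    """
    options = dict(backend_options or {})
    mode = options.get("mode", 0o777)   # the only key this local sketch reads
    os.makedirs(path, mode=mode, exist_ok=exist_ok)
```

A caller would write `tail(open, "some/file", 100)` or
`mkdir("a/b", backend_options={"mode": 0o700})`; a remote backend would
pass its own `open` and read its own keys from the dict.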