[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: file-system specification

Hi Martin,

On Wed, 9 May 2018 11:28:15 -0400
Martin Durant <martin.durant@xxxxxxxxxxx> wrote:
> I have sketched out a possible start of a python-wide file-system specification
> https://github.com/martindurant/filesystem_spec
> This came about from my work in some other (remote) file-systems implementations for python, particularly in the context of Dask. Since arrow also cares about both local files and, for example, hdfs, I thought that people on this list may have comments and opinions about a possible standard that we ought to converge on. I do not think that my suggestions so far are necessarily right or even good in many cases, but I want to get the conversation going.

Here are some comments:

- API naming: you seem to favour re-using Unix command-line monickers in
  some places, while using more regular verbs or names in other
  places.  I think it should be consistent.  Since the Unix
  command-line doesn't exactly cover the exposed functionality, and
  since Unix tends to favour short cryptic names, I think it's better
  to use Python-like naming (which is also more familiar to non-Unix
  users). For example "move" or "rename" or "replace" instead of "mv",

- **kwargs parameters: a couple APIs (`mkdir`, `put`...) allow passing
  arbitrary parameters, which I assume are intended to be
  backend-specific.  It makes it difficult to add other optional
  parameters to those APIs in the future.  So I'd make the
  backend-specific directives a single (optional) dict parameter rather
  than a **kwargs.

- `invalidate_cache` doesn't state whether it invalidates recursively
  or not (recursively sounds better intuitively?).  Also, I think it
  would be more flexible to take a list of paths rather than a single

- `du`: the effect of the `deep` parameter isn't obvious to me. I don't
  know what it would mean *not* to recurse here: what is the size of a
  directory if you don't recurse into it?

- `glob` may need a formal definition (are trailing slashes
  significant for directory or symlink resolution? this kind of thing),
  though you may want to keep edge cases backend-specific.

- are `head` and `tail` at all useful? They can be easily recreated
  using a generic `open` facility.

- `read_block` tries to do too much in a single API IMHO, and
  using `open` directly is more flexible anyway.

- if `touch` is intended to emulate the Unix API of the same name, the
  docstring should state "Create empty file or update last modification

- the information dicts returned by several APIs (`ls`, `info`....)
  need standardizing, at least for non backend-specific fields.

- if the backend is a networked filesystem with non-trivial latency,
  perhaps the operations would deserve being batched (operate on
  several paths at once), though I will happily defer to your expertise
  on the topic.