Re: Petastorm: PyArrow based library for Tensorflow, PyTorch and others...

Hello Yevgeni,

This looks interesting. Can you make a PR to https://github.com/apache/arrow so that Petastorm is listed on https://arrow.apache.org/powered_by/?

I browsed a bit through your code. As far as I can see, your approach is to store a set of Parquet files in a directory with a schema that can be translated for Spark, Tensorflow, Torch, … Is this schema persisted in the Parquet file metadata or as a separate file alongside the dataset? Could we extend Arrow's type system a bit to better suit all the frameworks you are targeting? As you had to build a more general schema class, I guess there are things that could not be expressed in Arrow's schema definition. I am not sure whether we could extend pyarrow's schema classes to fully support your use case, but I would like to understand how to better support it.


