Re: Petastorm: PyArrow based library for Tensorflow, PyTorch and others...
This looks interesting. Can you make a PR to https://github.com/apache/arrow so that Petastorm is listed on https://arrow.apache.org/powered_by/ ?
I browsed a bit through your code. As far as I can see, your approach is to store a set of Parquet files in a directory with a schema that can be translated for Spark, Tensorflow, Torch, … Is this schema persisted in the Parquet file metadata or as a separate file alongside the dataset? Could we extend Arrow's type system a bit to better suit all the frameworks you are targeting? As you had to build a more general schema class, I guess there are things that could not be expressed in Arrow's schema definition. I am not sure whether we could extend pyarrow's schema classes to fully support your use case, but I would like to understand how to better support it.
On Wed, Sep 26, 2018, at 8:59 PM, Yevgeni Litvin wrote:
> My name is Yevgeni Litvin. I am working on ML infra with a small team
> within Uber ATG. Our team has recently open sourced the Petastorm library.
> It relies heavily on Apache Arrow, so I wanted to share it with the community.
> The goal of the project is to provide a convenient way for the deep learning
> community to use Apache Parquet stores with sensor data from Tensorflow,
> PyTorch or other Python based ML frameworks.
> I believe our use of Parquet is different from mainstream applications as
> our field sizes are asymmetric (some are huge, such as images, and others
> are small) and rowgroup sizes are relatively small (<100). That required
> some adaptations.
> We use PyArrow mostly for loading the data. We do see great potential for
> further optimizations and speedups by relying more heavily on Arrow as
> an in-memory store.
> You can find more information about our project here:
> Would be more than happy to hear comments, feedback and suggestions!
> Thank you,
> - Yevgeni Litvin