Re: Petastorm: PyArrow based library for Tensorflow, PyTorch and others...

Hello Uwe,

(I messed up my mailing list settings; sorry if this message shows up
outside the original thread.)

Created a PR for the "Powered by" - thanks for the suggestion!

We persist the schema in the Parquet metadata as a custom field. We are
currently working on a version that uses Arrow tables as the primary data
storage (previously we converted to py_dict very early in our data flow
and worked with Python/NumPy types). I am still catching up on Arrow's
data structures, and maybe you can shed some light on this: I did not find
a way to create an Array of pa.Tensor objects (imagine a rowgroup of
images). As a result, I end up keeping the data as an array of lists and
use side channels to transmit the shapes, which makes the code clunkier.

The next step would be to stream tensors directly from Arrow tables to
Tensorflow. I guess native support for Tensors there could help.


- Yevgeni

> ---------- Forwarded message ----------
> From: "Uwe L. Korn" <uwelk@xxxxxxxxxx>
> To: dev@xxxxxxxxxxxxxxxx
> Cc:
> Bcc:
> Date: Fri, 05 Oct 2018 17:38:13 +0200
> Subject: Re: Petastorm: PyArrow based library for Tensorflow, PyTorch and
> Hello Yevgeni,
>
> this looks interesting. Can you make a PR to
> https://github.com/apache/arrow so that Petastorm is listed on
> https://arrow.apache.org/powered_by/ ?
>
> I browsed a bit through your code. As far as I can see, your approach is
> to store a set of Parquet files in a directory with a schema that can be
> translated for Spark, Tensorflow, Torch, … Is this schema persisted in
> the Parquet file metadata or as a separate file alongside the dataset?
> Could we extend Arrow's type system a bit to better suit all the
> frameworks you are targeting? As you had to build a more general schema
> class, I guess there are definitely things that could not be expressed
> in Arrow's schema definition. I am not sure whether we could extend
> pyarrow's schema classes to fully support your use case, but I would
> like to understand how to better support it.
>
> Uwe
>
> On Wed, Sep 26, 2018, at 8:59 PM, Yevgeni Litvin wrote:
> > Hi,
> >
> > My name is Yevgeni Litvin. I am working on ML infra with a small team
> > within Uber ATG. Our team has recently open-sourced the Petastorm library.
> > It heavily relies on Apache Arrow so I wanted to share it with the
> >
> > The goal of the project is to provide a convenient way for the deep
> > learning community to use Apache Parquet stores with sensor data from
> > Tensorflow, PyTorch, or other Python-based ML frameworks.
> >
> > I believe our use of Parquet is different from mainstream applications:
> > our field sizes are asymmetric (some are huge, such as images, and others
> > are small) and rowgroup sizes are relatively small (<100). That required
> > some adaptations.
> >
> > We use PyArrow mostly for loading the data. We do see great potential
> > for further optimizations and speedups by relying more heavily on Arrow
> > as an in-memory store.
> >
> > You can find more information about our project here:
> >
> > http://eng.uber.com/petastorm/
> > https://github.com/uber/petastorm
> >
> > Would be more than happy to hear comments, feedback and suggestions!
> >
> > Thank you,
> >
> > - Yevgeni Litvin