Re: Encoding options (delta, rle, ...) in pyarrow bindings


Hello Sebastian,

there is no ETA on delta encoding, as no one is actively working on it. There is some basic code implementing the relevant encoders in [1]. This code is not used anywhere at the moment because it does not implement the required interface. The relevant JIRA tickets are [2], [3], and [4]; you can ask questions or discuss the implementation there, or use the mailing list. The existing code needs to be ported to the interface defined in [5]. Also note that we have moved the parquet-cpp code into the arrow repository, so all changes should go there.

Uwe

[1]: https://github.com/apache/parquet-cpp/blob/d15d2687e9f154e69e956e2a56c8d1fd6c3b7ac8/benchmarks/decode_benchmark.cc 
[2]: https://issues.apache.org/jira/browse/PARQUET-491
[3]: https://issues.apache.org/jira/browse/PARQUET-490
[4]: https://issues.apache.org/jira/browse/PARQUET-492
[5]: https://github.com/apache/arrow/blob/master/cpp/src/parquet/encoding.h
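
For readers unfamiliar with the term, the idea behind delta encoding can be sketched in a few lines of plain Python. This is only a conceptual illustration: the function names `delta_encode` and `delta_decode` are hypothetical, and the code is neither the Parquet delta format (which additionally bit-packs the deltas in blocks) nor the interface in [5].

```python
from typing import List


def delta_encode(values: List[int]) -> List[int]:
    """Keep the first value, then store each difference to its predecessor."""
    if not values:
        return []
    deltas = [values[0]]
    for prev, cur in zip(values, values[1:]):
        deltas.append(cur - prev)
    return deltas


def delta_decode(deltas: List[int]) -> List[int]:
    """Rebuild the original values with a running sum over the deltas."""
    values = []
    total = 0
    for d in deltas:
        total += d
        values.append(total)
    return values


# Slowly increasing values (e.g. timestamps) turn into small numbers,
# which a later bit-packing step could store in fewer bits.
timestamps = [1000, 1001, 1003, 1006]
encoded = delta_encode(timestamps)
assert delta_decode(encoded) == timestamps
```

The round trip at the end shows that the encoding is lossless; any space saving comes from how compactly the small deltas are stored afterwards.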

On Fri, Nov 2, 2018, at 2:33 PM, Sebastian Himberger wrote:
> Uwe, Wes,
> 
> thanks so much. I completely forgot to say that I was asking about parquet.
> It's good to know the current status though. I also didn't know that the
> dictionary encoding already has some form of RLE.
> 
> @Uwe: Any ETA on delta encoding? Is this being worked on or are other things
> more important ATM? I am not asking to generate pressure but out of
> curiosity. I appreciate that this is an open source project and if I need
> it I can just jump in and do it myself.
> 
> Thanks again and have a great day,
> Sebastian
> 
> 
> On Fri, Nov 2, 2018 at 2:27 PM, Wes McKinney <wesmckinn@xxxxxxxxx> wrote:
> 
> > Hi Sebastian -- Uwe is referring to Parquet files. We don't yet have
> > in-memory RLE or Delta encoding in the Arrow columnar format. I suspect
> > this will eventually be added as it can be quite important to improve
> > in-memory query execution performance.
> >
> > Wes
> >
> > On Fri, Nov 2, 2018, 2:18 PM Uwe L. Korn <uwelk@xxxxxxxxxx> wrote:
> >
> > > Hello Sebastian,
> > >
> > > currently you can only switch between plain encoding and dictionary
> > > encoding combined with run-length encoding, using the
> > > `use_dictionary` flag of pyarrow.parquet.write_table:
> > > https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html#pyarrow.parquet.write_table
> > > Other encodings are so far implemented only on the read path; we cannot
> > > write delta encodings yet.
> > >
> > > Uwe
> > >
> > > On Fri, Nov 2, 2018, at 12:53 PM, Sebastian Himberger wrote:
> > > > Hi,
> > > >
> > > > I hope this is the right list. I couldn't find a "users" list on the
> > > > website so please forgive me if I am interrupting here.
> > > >
> > > > I am developing an application using the pyarrow module. By reading through
> > > > the documents I couldn't find a way to specify an encoding like delta or
> > > > run length to a column. Is this not supported yet or am I missing something?
> > > >
> > > > Thanks so much,
> > > > Sebastian
> > >
> >