Re: PyArrow and Parquet DELTA_BINARY_PACKED
Given the very high compression ratio of your data, it's entirely possible
that the difference in size is coming from the larger V2 data page
headers. Compare DataPageHeader with DataPageHeaderV2 in parquet.thrift;
I'm not a Thrift expert, but the serialized V2 data page headers look like
they're going to be at least twice as large.
You may be able to increase the size of data pages to test this hypothesis.
On Fri, May 18, 2018 at 3:26 PM, Feras Salim <feribg@xxxxxxxxx> wrote:
> Hi Wes,
> The raw file in CSV is about 1 GB. Gzipped it's about 50 MB, and the most
> I could compress it with Parquet V1 was 21 MB; V2 (same settings) came to
> about 25 MB. It's quite surprising that how the data is encoded changes
> between versions, given that Uwe said "The only difference between the two
> versions is that with version 2, we encode uint32 columns".
> On Fri, May 18, 2018 at 2:24 PM, Wes McKinney <wesmckinn@xxxxxxxxx> wrote:
>> hi Feras,
>> How large are the files? For small files, differences in metadata
>> could impact the file size more significantly. I would be surprised if
>> this were the case with larger files, though (I'm not sure what
>> fraction of a column chunk consists of data page headers vs. actual
>> data in practice).
>> - Wes
>> On Tue, May 15, 2018 at 12:17 AM, Feras Salim <feribg@xxxxxxxxx> wrote:
>> > Hi Uwe,
>> > I'm quite confused by the findings; I'm attaching a bunch of files
>> > corresponding to the version and library that generated each one.
>> > On the first topic, DELTA_BINARY_PACKED: it seems it's not supported on
>> > the Java side either, or my implementation is off, but I copied over
>> > "CsvParquetWriter.java" and created a sample encoder based on
>> > parquet-mr, and it seems that if a dictionary is used, it falls back to
>> > PLAIN instead of DELTA when the dictionary gets too big. Regardless of
>> > what I do, I can't make it use DELTA for the attached schema.
>> > In terms of the size difference, I see the issue in the resulting
>> > metadata, but not the root cause. You will see the code is identical
>> > except for the addition of version="2.0". This changes the output file
>> > from "ENC:PLAIN_DICTIONARY,PLAIN,RLE" to "ENC:RLE,PLAIN" and hence the
>> > size quite substantially.
>> > Let me know if there's anything else I can provide to help debug this.
>> > The second part is not critical since I can just use V1 for now, but it
>> > would be good to figure out why the output changes. The first part is
>> > more pressing for me, since I really want to assess the difference
>> > between RLE and DELTA_BINARY_PACKED on monotonically increasing values
>> > like a timestamp ticking at a constant rate.
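For intuition on that last point: with a timestamp ticking at a constant rate, every delta is identical, so once DELTA_BINARY_PACKED subtracts the per-block minimum delta, the residuals are all zero and bit-pack to almost nothing. A rough illustration in pure Python (the values are made up; this only mimics the delta step, not the full encoding):

```python
# Constant 1000-unit tick, epoch-style values (made up for illustration).
ts = [1_526_000_000_000 + 1_000 * i for i in range(8)]

deltas = [b - a for a, b in zip(ts, ts[1:])]
min_delta = min(deltas)
residuals = [d - min_delta for d in deltas]  # what actually gets bit-packed

print(deltas)     # every delta is 1000
print(residuals)  # all zeros -> ~0 bits per value after bit-packing
```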
>> > On Sun, May 13, 2018 at 11:58 AM, Uwe L. Korn <uwelk@xxxxxxxxxx> wrote:
>> >> Hello Feras,
>> >> `DELTA_BINARY_PACKED` is at the moment only implemented in parquet-cpp
>> >> on the read path. The necessary encoder implementation is still
>> >> missing.
>> >> The change in file size is something I also don't understand. The only
>> >> difference between the two versions is the handling of uint32 columns:
>> >> in version 1 we encode them as INT64, whereas in version 2 we can
>> >> encode them as UINT32, a type that was not available in version 1. It
>> >> would be nice if you could narrow the issue down to e.g. the column
>> >> which causes the increase in size. You might also use the Java
>> >> parquet-tools or parquet-cli to inspect the size statistics of the
>> >> parts of the individual Parquet file.
>> >> Uwe
>> >> On Fri, May 11, 2018, at 3:07 AM, Feras Salim wrote:
>> >> > Hi, I was wondering if I'm missing something or whether
>> >> > `DELTA_BINARY_PACKED` is currently only available for reading when
>> >> > it comes to Parquet files; I can't find a way for the writer to
>> >> > encode timestamp data with `DELTA_BINARY_PACKED`. Furthermore, I
>> >> > seem to get about a 10% increase in final file size when I change
>> >> > from version 1 to version 2 without changing anything else about the
>> >> > schema or data.