
Re: PyArrow and Parquet DELTA_BINARY_PACKED

hi Feras,

How large are the files? For small files, differences in metadata
could impact the file size more significantly. I would be surprised if
this were the case with larger files, though (I'm not sure what
fraction of a column chunk consists of data page headers vs. actual
data in practice).

- Wes
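
The question quoted below, how DELTA_BINARY_PACKED behaves on timestamps
ticking at a constant rate, can be illustrated without any Parquet writer.
The following is only a rough sketch of the idea behind delta encoding, not
the actual on-disk format (it ignores blocks, miniblocks and varint
headers). The function name `delta_bit_width` and the sample data are made
up for illustration.

```python
def delta_bit_width(values):
    """Return (first_value, min_delta, bit_width) for a simplified delta scheme.

    bit_width is the number of bits needed per value once each delta is
    stored relative to the smallest delta in the sequence.
    """
    if len(values) < 2:
        return (values[0] if values else None, 0, 0)
    # Differences between consecutive values.
    deltas = [b - a for a, b in zip(values, values[1:])]
    min_delta = min(deltas)
    # Store each delta relative to the minimum, so all residuals are >= 0.
    residuals = [d - min_delta for d in deltas]
    bit_width = max(residuals).bit_length()
    return values[0], min_delta, bit_width


# Hypothetical data: 1000 timestamps ticking every 1000 microseconds.
start = 1_526_000_000_000_000
constant = [start + i * 1000 for i in range(1000)]
# Same timestamps with a small repeating jitter of 0, 1 or 2 microseconds.
jittered = [t + (i % 3) for i, t in enumerate(constant)]

print(delta_bit_width(constant))  # bit width 0: every delta equals min_delta
print(delta_bit_width(jittered))  # bit width 2: residuals range over 0..3
```

With a perfectly constant tick rate every residual is zero, so in this
simplified model the column costs almost nothing beyond the first value and
the minimum delta. Small jitter raises the cost to a few bits per value.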

On Tue, May 15, 2018 at 12:17 AM, Feras Salim <feribg@xxxxxxxxx> wrote:
> Hi Uwe,
> I'm quite confused by the findings. I'm attaching a bunch of files, each
> named for the version and library that generated it.
> On the first topic, DELTA_BINARY_PACKED: it seems that either it is not well
> supported on the Java side either, or my implementation is off, although I
> just copied over "CsvParquetWriter.java". I created a sample encoder based on
> parquet-mr, and it seems that if a dictionary is used, it falls back to PLAIN
> instead of DELTA when the dictionary gets too big. Regardless of what I do, I
> can't make it use DELTA for the attached schema.
> In terms of the size difference, I can see the issue in the resulting
> metadata, but not the root cause. You will see the code is identical apart
> from the addition of version="2.0". This changes the output file metadata
> from "ENC:PLAIN_DICTIONARY,PLAIN,RLE" to "ENC:RLE,PLAIN", hence increasing
> the size quite substantially.
> Let me know if there's anything else I can provide to help debug this one.
> The second part is not critical since I can just use v1 for now, but it
> would be good to figure out why the output changes. The first part is a bit
> more pressing for me, since I really want to assess the difference between
> RLE and DELTA_BINARY_PACKED on monotonically increasing values, such as a
> timestamp ticking at a constant rate.
> On Sun, May 13, 2018 at 11:58 AM, Uwe L. Korn <uwelk@xxxxxxxxxx> wrote:
>> Hello Feras,
>> `DELTA_BINARY_PACKED` is at the moment only implemented in parquet-cpp on
>> the read path. The necessary encoder implementation for this encoding is
>> still missing.
>> The change in file size is something I also don't understand. The only
>> difference between the two versions is that in version 1 we encode uint32
>> columns as INT64, whereas in version 2 we can encode them as UINT32. This
>> type was not available in version 1. It would be nice if you could narrow
>> down the issue to, e.g., the column which causes the increase in size. You
>> might also use the Java parquet-tools or parquet-cli to inspect the size
>> statistics of the individual parts of each Parquet file.
>> Uwe
>> On Fri, May 11, 2018, at 3:07 AM, Feras Salim wrote:
>> > Hi, I was wondering if I'm missing something, or whether
>> > `DELTA_BINARY_PACKED` is currently only available for reading when it
>> > comes to Parquet files. I can't find a way for the writer to encode
>> > timestamp data with `DELTA_BINARY_PACKED`. Furthermore, I seem to get
>> > about a 10% increase in final file size when I change from version 1 to
>> > version 2 without changing anything else about the schema or data.