
Re: [PROPOSAL] ParquetIO support for Python SDK


In the current PR, two parameters control the final row group size: row_group_buffer_size and record_batch_size. Records are first stored as a list of columns and then transformed into a record batch (a data structure defined in pyarrow) when the number of buffered records reaches record_batch_size. The record batches accumulate in another list, which is written out as a single row group once its total byte size exceeds row_group_buffer_size. Because row_group_buffer_size measures in-memory data, it is normally much bigger than the encoded row group size in the Parquet file, so it is not an exact estimate of the on-disk row group size, but I think this is the best we can do given the limitations of the Python Parquet libraries. A better estimate would require the library to support buffered writing of a row group plus a method returning the size of the encoded data in the write buffer; no currently available Python Parquet library implements these features.


On Tue, Nov 13, 2018 at 4:44 AM Robert Bradshaw <robertwb@xxxxxxxxxx> wrote:
Was there resolution on how to handle row group size, given that it's
hard to pick a decent default? IIRC, the ideal was to base this on
byte sizes; will this be in v1 or will there be other parameter(s)
that we'll have to support going forward?
On Tue, Oct 30, 2018 at 10:42 PM Heejong Lee <heejong@xxxxxxxxxx> wrote:
>
> Thanks all for the valuable feedback on the document. Here's the summary of planned features for ParquetIO Python SDK:
>
> Can read from Parquet file on any storage system supported by Beam
>
> Can write to Parquet file on any storage system supported by Beam
>
> Can configure the compression algorithm of output files
>
> Can adjust the size of the row group
>
> Can read multiple row groups in a single file in parallel (source splitting)
>
> Can read a subset of columns (column projection)
>
>
> It introduces a new dependency, pyarrow, for Parquet read and write operations.
>
> If you're interested, you can review and test the PR https://github.com/apache/beam/pull/6763
>
> Thanks,
>
> On Wed, Oct 24, 2018 at 5:37 PM Chamikara Jayalath <chamikara@xxxxxxxxxx> wrote:
>>
>> Thanks Heejong. Added some comments. +1 for summarizing the doc in the email thread.
>>
>> - Cham
>>
>> On Wed, Oct 24, 2018 at 4:45 PM Ahmet Altay <altay@xxxxxxxxxx> wrote:
>>>
>>> Thank you Heejong. Could you also share a summary of the design document (major points/decisions) in the mailing list?
>>>
>>> On Wed, Oct 24, 2018 at 4:08 PM, Heejong Lee <heejong@xxxxxxxxxx> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I'm working on BEAM-4444: Parquet IO for Python SDK.
>>>>
>>>> Issue: https://issues.apache.org/jira/browse/BEAM-4444
>>>> Design doc: https://docs.google.com/document/d/1-FT6zmjYhYFWXL8aDM5mNeiUnZdKnnB021zTo4S-0Wg
>>>> WIP PR: https://github.com/apache/beam/pull/6763
>>>>
>>>> Any feedback is appreciated. Thanks!
>>>>
>>>