Re: [PROPOSAL] ParquetIO support for Python SDK


Was there resolution on how to handle row group size, given that it's
hard to pick a decent default? IIRC, the ideal was to base this on
byte sizes; will this be in v1 or will there be other parameter(s)
that we'll have to support going forward?
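
To make the byte-size idea concrete, here is a minimal sketch, not taken from the PR. It assumes the writer ultimately needs a row count (as far as I know, pyarrow's `write_table` takes `row_group_size` as a number of rows), so a byte target has to be converted using an estimated average row size. The function name, its parameters, and the 64 MiB default are hypothetical, and the result is only approximate because in-memory size differs from the encoded and compressed size on disk.

```python
def rows_per_row_group(sample_bytes, sample_rows, target_bytes=64 * 1024 * 1024):
    """Estimate how many rows give a row group of roughly target_bytes.

    sample_bytes / sample_rows describe a buffered sample of records
    (hypothetical inputs; how they are measured is left to the caller).
    """
    if sample_bytes <= 0 or sample_rows <= 0:
        raise ValueError("need a non-empty sample to estimate row size")
    # Integer arithmetic: target_bytes / (sample_bytes / sample_rows), rounded down.
    rows = (target_bytes * sample_rows) // sample_bytes
    # Always write at least one row per row group.
    return max(1, rows)
```

A sink could recompute this estimate from each buffered batch and pass the result as the row count, which would keep a single byte-based parameter in the public API.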

On Tue, Oct 30, 2018 at 10:42 PM Heejong Lee <heejong@xxxxxxxxxx> wrote:
>
> Thanks all for the valuable feedback on the document. Here's a summary of the planned features for ParquetIO in the Python SDK:
>
> Can read Parquet files from any storage system supported by Beam
>
> Can write Parquet files to any storage system supported by Beam
>
> Can configure the compression algorithm of output files
>
> Can adjust the size of the row group
>
> Can read multiple row groups of a single file in parallel (source splitting)
>
> Can read a subset of columns
>
>
> It introduces a new dependency, pyarrow, for Parquet reading and writing operations.
>
> If you're interested, you can review and test the PR https://github.com/apache/beam/pull/6763
>
> Thanks,
>
> On Wed, Oct 24, 2018 at 5:37 PM Chamikara Jayalath <chamikara@xxxxxxxxxx> wrote:
>>
>> Thanks Heejong. Added some comments. +1 for summarizing the doc in the email thread.
>>
>> - Cham
>>
>> On Wed, Oct 24, 2018 at 4:45 PM Ahmet Altay <altay@xxxxxxxxxx> wrote:
>>>
>>> Thank you Heejong. Could you also share a summary of the design document (major points/decisions) in the mailing list?
>>>
>>> On Wed, Oct 24, 2018 at 4:08 PM, Heejong Lee <heejong@xxxxxxxxxx> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I'm working on BEAM-4444: Parquet IO for Python SDK.
>>>>
>>>> Issue: https://issues.apache.org/jira/browse/BEAM-4444
>>>> Design doc: https://docs.google.com/document/d/1-FT6zmjYhYFWXL8aDM5mNeiUnZdKnnB021zTo4S-0Wg
>>>> WIP PR: https://github.com/apache/beam/pull/6763
>>>>
>>>> Any feedback is appreciated. Thanks!
>>>>
>>>