Re: [PROPOSAL] ParquetIO support for Python SDK


Thanks all for the valuable feedback on the document. Here's a summary of the planned features for ParquetIO in the Python SDK:
  • Can read Parquet files from any storage system supported by Beam
  • Can write Parquet files to any storage system supported by Beam
  • Can configure the compression algorithm of output files
  • Can adjust the row group size
  • Can read multiple row groups of a single file in parallel (source splitting)
  • Can read a subset of columns only (partial read by column)

It introduces a new dependency, pyarrow, for Parquet reading and writing operations.

If you're interested, please review and test the PR: https://github.com/apache/beam/pull/6763
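
To make the row-group and column features above more concrete, here is a rough sketch of how they map onto pyarrow. It is not the code in the PR, and it does not show the ParquetIO API; the helper names (plan_row_group_splits, write_with_options, read_split) are hypothetical and exist only for illustration. The planning function is plain Python; the two I/O helpers assume pyarrow is installed and import it lazily.

```python
def plan_row_group_splits(num_row_groups, num_splits):
    """Divide row group indices 0..num_row_groups-1 into contiguous splits.

    Hypothetical helper: each returned list could be handed to a separate
    worker, which is the basic idea behind source splitting by row group.
    """
    if num_row_groups < 0 or num_splits < 1:
        raise ValueError("num_row_groups must be >= 0 and num_splits >= 1")
    if num_row_groups == 0:
        return []
    # Never create more splits than there are row groups.
    num_splits = min(num_splits, num_row_groups)
    base, extra = divmod(num_row_groups, num_splits)
    splits, start = [], 0
    for i in range(num_splits):
        # The first `extra` splits take one additional row group.
        size = base + (1 if i < extra else 0)
        splits.append(list(range(start, start + size)))
        start += size
    return splits


def write_with_options(table, path, rows_per_group, codec="snappy"):
    """Write a pyarrow Table with a chosen row group size and compression."""
    import pyarrow.parquet as pq  # lazy import; assumes pyarrow is installed
    pq.write_table(table, path, row_group_size=rows_per_group, compression=codec)


def read_split(path, row_groups, columns=None):
    """Read only the given row groups, optionally restricted to some columns."""
    import pyarrow.parquet as pq  # lazy import; assumes pyarrow is installed
    parquet_file = pq.ParquetFile(path)
    return [parquet_file.read_row_group(i, columns=columns) for i in row_groups]


if __name__ == "__main__":
    # Five row groups shared between two workers.
    print(plan_row_group_splits(5, 2))  # [[0, 1, 2], [3, 4]]
```

In this sketch each worker would call read_split with its own list of row group indices, so no worker has to scan the whole file.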

Thanks,

On Wed, Oct 24, 2018 at 5:37 PM Chamikara Jayalath <chamikara@xxxxxxxxxx> wrote:
Thanks Heejong. Added some comments. +1 for summarizing the doc in the email thread.

- Cham

On Wed, Oct 24, 2018 at 4:45 PM Ahmet Altay <altay@xxxxxxxxxx> wrote:
Thank you Heejong. Could you also share a summary of the design document (major points/decisions) in the mailing list?

On Wed, Oct 24, 2018 at 4:08 PM, Heejong Lee <heejong@xxxxxxxxxx> wrote:
Hi,

I'm working on BEAM-4444: Parquet IO for Python SDK.


Any feedback is appreciated. Thanks!