Re: Help organize Parquet-related C++, Python issues

I just spent some time combing through Arrow JIRA issues that mention "parquet".

We now have 60 Python-related issues appropriately labeled.


I noted that some of the reported bugs appear to be duplicates of each
other, but I will need to examine them more closely to confirm.

There are another 17 that are more C++-related.

On Mon, Nov 12, 2018 at 1:16 PM Wes McKinney <wesmckinn@xxxxxxxxx> wrote:
> hi folks,
> As some of you may have noticed, we are accumulating a mountain of
> Parquet-related JIRA issues, many of them resulting from people using
> Apache Arrow to do data engineering in Python and running into
> problems.
> To help with having better visibility into all the relevant Parquet
> issues, and with the monorepo merge behind us, I created a couple wiki
> pages linked to from the main
> https://cwiki.apache.org/confluence/display/ARROW page:
> * C++ issue dashboard: https://cwiki.apache.org/confluence/x/fpWzBQ
> * Python issue dashboard:
> https://cwiki.apache.org/confluence/display/ARROW/Python+Parquet+Development
> Many Parquet issues in the ARROW project are not found in these
> dashboards because they lack the "parquet" label. Please help with
> project organization by remembering to apply the "parquet" label to
> any Parquet-related issue.
> Since Ruby also supports Parquet now via GLib, and R support for
> Parquet is coming soon, we need to do what we can to grow the
> community of people working on the core Parquet libraries and the
> things they depend on, like the IO and memory management subsystems of
> the Arrow C++ libraries.
> In general, I think it is very important for us to have fast and
> reliable C++ support (and language bindings) for the 5 major file
> formats in use in data warehousing:
> * CSV
> * JSON
> * Parquet
> * Avro
> * ORC
> Antoine has been leading efforts on reading CSV files, and we will
> need to make a push into JSON and Avro at some point.
> Thanks
> Wes