Re: Intro to pandas + pyarrow integration?
One of the goals of Apache Arrow is to define an open standard for
in-memory columnar data (which may be called "tables" or "data frames"
in some domains). Among other things, the Arrow columnar format is
optimized for memory efficiency and analytical processing performance
on very large (even larger-than-RAM) data sets.
The way to think about it is that pandas has its own in-memory
representation for columnar data, but it is "proprietary" to pandas.
To make use of pandas's analytical facilities, you must convert data
to pandas's memory representation. As an example, pandas represents
strings as NumPy arrays of Python string objects, which is very
wasteful. Uwe Korn recently demonstrated an approach to using Arrow
inside pandas, but this would require a lot of work to port algorithms
to run against Arrow: https://github.com/xhochy/fletcher
We are working to develop the standard data frame type operations as
reusable libraries within this project, and these will run natively
against the Arrow columnar format. This is a big project; we would
love to have you involved with the effort. One of the reasons I have
spent so much of my time the last few years on this project is that I
believe it is the best path to build a faster, more efficient
pandas-like library for data scientists.
On Fri, Jul 6, 2018 at 1:05 PM, Alex Buchanan <buchanae@xxxxxxxx> wrote:
> Hello all.
> I'm confused about the current level of integration between pandas and pyarrow. Am I correct in understanding that currently I'll need to convert pyarrow Tables to pandas DataFrames in order to use most of the pandas features? By "pandas features" I mean every day slicing and dicing of data: merge, filtering, melt, spread, etc.
> I have a dataframe which starts out from small files (< 1GB) and quickly explodes into dozens of gigabytes of memory in a pandas DataFrame. I'm interested in whether arrow can provide a better, optimized dataframe.