osdir.com

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: PyArrow & Python Multiprocessing


Take a look at the Plasma object store
https://arrow.apache.org/docs/python/plasma.html.

Here's an example using it (along with multiprocessing to sort a pandas
dataframe)
https://github.com/apache/arrow/blob/master/python/examples/plasma/sorting/sort_df.py.
It's possible the example is a bit out of date.

You may be interested in taking a look at Ray
https://github.com/ray-project/ray. We use Plasma/Arrow under the hood to
do all of these things but hide a lot of the bookkeeping (like object ID
generation). For your setting, you can think of it as a replacement for
Python multiprocessing that automatically uses shared memory and Arrow for
serialization.

On Wed, May 16, 2018 at 10:02 AM Corey Nolet <cjnolet@xxxxxxxxx> wrote:

> I've been reading through the PyArrow documentation and trying to
> understand how to use the tool effectively for IPC (using zero-copy).
>
> I'm on a system with 586 cores & 1TB of ram. I'm using Panda's Dataframes
> to process several 10's of gigs of data in memory and the pickling that is
> done by Python's multiprocessing API is very wasteful.
>
> I'm running a little hand-built map-reduce where I chunk the dataframe into
> N_mappers number of chunks, run some processing on them, then run some
> number N_reducers to finalize the operation. What I'd like to be able to do
> is chunk up the dataframe into Arrow Buffer objects and just have each
> mapped task read their respective Buffer object with the guarantee of
> zero-copy.
>
> I see there's a couple Filesystem abstractions for doing memory-mapped
> files. Durability isn't something I need and I'm willing to forego the
> expense of putting the files on disk.
>
> Is it possible to write the data directly to memory and pass just the
> reference around to the different processes? What's the recommended way to
> accomplish my goal here?
>
>
> Thanks in advance!
>