[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

PyArrow & Python Multiprocessing

I've been reading through the PyArrow documentation and trying to
understand how to use the tool effectively for IPC (using zero-copy).

I'm on a system with 586 cores & 1TB of ram. I'm using Panda's Dataframes
to process several 10's of gigs of data in memory and the pickling that is
done by Python's multiprocessing API is very wasteful.

I'm running a little hand-built map-reduce where I chunk the dataframe into
N_mappers number of chunks, run some processing on them, then run some
number N_reducers to finalize the operation. What I'd like to be able to do
is chunk up the dataframe into Arrow Buffer objects and just have each
mapped task read their respective Buffer object with the guarantee of

I see there's a couple Filesystem abstractions for doing memory-mapped
files. Durability isn't something I need and I'm willing to forego the
expense of putting the files on disk.

Is it possible to write the data directly to memory and pass just the
reference around to the different processes? What's the recommended way to
accomplish my goal here?

Thanks in advance!