Re: Using CUDA enabled pyarrow


Also, a Google document would be fine to get started. We can transfer it to
Confluence and JIRA after some discussion.

On Thu, Oct 4, 2018, 11:45 AM Wes McKinney <wesmckinn@xxxxxxxxx> wrote:

> hi Pearu -- yes, I had thought of this while working on the arrow_gpu
> library. Some time ago I opened
>
> https://issues.apache.org/jira/browse/ARROW-1470
>
> thinking that it would be good to combine the MemoryPool* concept and
> the AllocateBuffer concept into a single abstract interface. Such an
> interface for CUDA could also optimize small allocations by allocating
> larger "pages" if desired. So before adding a CudaMemoryPool we should
> consider whether we want to define a BufferAllocator interface.
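
To make the "pages" idea above concrete, here is a minimal pure-Python sketch. It assumes a hypothetical BufferAllocator interface (no such class exists in Arrow at this point) and uses a bytearray as a stand-in for a device allocation, so backend_calls only approximates how often cudaMalloc would be hit. All names are illustrative.

```python
from abc import ABC, abstractmethod
from typing import List, Tuple


class BufferAllocator(ABC):
    """Hypothetical abstract interface; not an existing Arrow API."""

    @abstractmethod
    def allocate(self, size: int) -> Tuple[int, int]:
        """Return (page_index, offset) of a region holding `size` bytes."""


class PagedAllocator(BufferAllocator):
    """Serves small requests out of larger pages to limit backend calls."""

    def __init__(self, page_size: int = 1 << 20):
        self.page_size = page_size
        self.pages: List[bytearray] = []  # stand-ins for device allocations
        self._used = 0                    # bytes handed out from the last page
        self.backend_calls = 0            # stand-in for cudaMalloc call count

    def _new_page(self, size: int) -> None:
        self.backend_calls += 1
        self.pages.append(bytearray(size))
        self._used = 0

    def allocate(self, size: int) -> Tuple[int, int]:
        if size > self.page_size:
            # Oversized request: give it a dedicated page of exactly that size.
            self._new_page(size)
            self._used = size
            return len(self.pages) - 1, 0
        if not self.pages or self._used + size > len(self.pages[-1]):
            # The current page cannot hold the request: open a fresh page.
            self._new_page(self.page_size)
        offset = self._used
        self._used += size
        return len(self.pages) - 1, offset
```

With a 1024-byte page, ten 100-byte requests share one backend allocation; the sketch does no freeing or alignment, which a real pool would need.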
> On Thu, Oct 4, 2018 at 5:04 AM Pearu Peterson
> <pearu.peterson@xxxxxxxxxxxxx> wrote:
> >
> > Hi,
> > Currently, the arrow host memory management includes MemoryPool to
> > accelerate memory operations (new/free).
> > Would there be interest in supporting the same concept in CUDA memory
> > management to reduce the overhead of cudaMalloc/cudaFree?
> > Best regards,
> > Pearu
> >
> > On Wed, Oct 3, 2018 at 11:44 PM Pearu Peterson
> > <pearu.peterson@xxxxxxxxxxxxx> wrote:
> >
> > > Hi,
> > > I can write the initial design document from the existing comments.
> > > Do you have examples of earlier design documents used for a similar
> > > purpose? Would a shared Google doc be OK?
> > >
> > > Btw, I also figured out an answer to my original question. Here is a
> > > partial codelet for accessing the batch columns that I was missing:
> > >
> > > cbuf = <CudaBuffer instance>
> > > cbatch = pa.cuda.read_record_batch(cbuf, schema)
> > > for col in cbatch:
> > >     null_buf, data_buf = col.buffers()
> > >     cdata_buf = CudaBuffer.from_buffer(data_buf)
> > >     if null_buf is not None: ...
> > >     ...
> > >
> > > This is used in CudaNDArray, which allows accessing the items from host,
> > > very similar to DeviceNDArray of numba.cuda:
> > >
> > > https://github.com/Quansight/pygdf/blob/arrow-cuda/pygdf/cudaarray.py
> > > (excuse the coding, it's WIP and experimental)
> > >
> > > Best regards,
> > > Pearu
> > >
> > >
> > >
> > >
> > > On Wed, Oct 3, 2018 at 11:29 PM Wes McKinney <wesmckinn@xxxxxxxxx> wrote:
> > >
> > >> What are the action items on this? Sounds like we need to start a
> > >> design document. I'm afraid I don't have the bandwidth to champion GPU
> > >> functionality at the moment but I will participate in design
> > >> discussions and help break down complex tasks into more accessible
> > >> JIRA issues.
> > >>
> > >> Thanks
> > >> Wes
> > >> On Fri, Sep 28, 2018 at 9:44 AM Wes McKinney <wesmckinn@xxxxxxxxx> wrote:
> > >> >
> > >> > Seems like there is a fair bit of work to do to specify APIs and
> > >> > semantics. I suggest we create a Google document or something
> > >> > collaborative where we can enumerate and discuss the issues we want to
> > >> > resolve, and then make a list of the concrete development tasks.
> > >> >
> > >> > The underlying problem IMHO in ARROW-2446 is that we do not have the
> > >> > notion of device. An instance of CudaBuffer is only necessary so that
> > >> > the appropriate virtual dtor can be invoked to release the memory. As
> > >> > long as a buffer referencing it is aware of the underlying device,
> > >> > then our code can dispatch to the correct code paths. At the moment we
> > >> > can only really detect whether an arrow::Buffer* is a device buffer by
> > >> > dynamic_cast, and then that is not reliable because the buffer may be
> > >> > a slice.
> > >> > On Fri, Sep 28, 2018 at 7:17 AM Pearu Peterson
> > >> > <pearu.peterson@xxxxxxxxxxxxx> wrote:
> > >> > >
> > >> > > Hi Wes,
> > >> > >
> > >> > > Yes, it makes sense.
> > >> > >
> > >> > > If I understand you correctly, then defining a device abstraction would
> > >> > > also bring Buffer and CudaBuffer under the same umbrella (that would be
> > >> > > the opposite approach to ARROW-2446, btw).
> > >> > >
> > >> > > This issue is also related to
> > >> > >
> > >> > > https://github.com/dmlc/dlpack/blob/master/include/dlpack/dlpack.h
> > >> > > that defines a specification for data locality (for ndarrays but the
> > >> > > concept is the same for buffers).
> > >> > >
> > >> > > ARROW-2447 defines an API that uses Buffer::cpu_data(), which would imply
> > >> > > also having Buffer::cuda_data(), Buffer::disk_data(), etc.
> > >> > >
> > >> > > I would like to propose a more general model (no guarantees that it
> > >> > > would make sense implementation-wise :) ):
> > >> > > 0. CPU would be considered as any other device (this would be in line
> > >> > > with dlpack). To name a few devices: HOST, CUDA, DISK, FPGA, etc., and
> > >> > > why not remote databases defined by URL.
> > >> > > 1. A device is defined as a unit that has (i) a memory for holding data,
> > >> > > and (ii) optionally one or more processors for processing the data
> > >> > > (computations). For instance, a HOST device has RAM and CPU(s); a CUDA
> > >> > > device has device memory and GPU(s); a DISK device has memory but no
> > >> > > processing unit, etc.
> > >> > > 2. Different devices can access other devices' memory using the same API
> > >> > > methods (say, Buffer.data()). For processing the data by a device (in
> > >> > > case the device has a processor), the data is copied to device memory
> > >> > > on demand, unless the data is stored in the same device as the
> > >> > > processor. For instance, for processing the CUDA data with CPU, the HOST
> > >> > > device would need to copy CUDA device data to HOST memory (that works
> > >> > > currently) and vice versa (that works as well, e.g. using
> > >> > > CudaHostBuffer). In another setup, a CUDA device might need to use data
> > >> > > from DISK: according to this proposal, the DISK data would be copied
> > >> > > directly to the CUDA device (bypassing HOST memory if technically
> > >> > > possible).
> > >> > > So, in short, the implementation has to check whether the processor and
> > >> > > the memory are on the same device before processing the data; if not,
> > >> > > the data is copied using the on-demand approach. By on-demand approach,
> > >> > > I mean that the data references are passed around as a pair: (device id,
> > >> > > device pointer).
> > >> > > 3. All the above is controlled from a master device process. Usually, the
> > >> > > master device would be HOST, but it does not always have to be so.
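
The (device id, device pointer) model in points 0-3 above can be sketched in a few lines of pure Python. This is only an illustration under stated assumptions: the DataRef and Devices names are hypothetical, each device's memory is modelled as a dict, and the "copy" is an ordinary Python assignment rather than a real transfer.

```python
from dataclasses import dataclass

# Hypothetical device ids, named after the devices listed in the proposal.
HOST, CUDA, DISK = "HOST", "CUDA", "DISK"


@dataclass(frozen=True)
class DataRef:
    """A data reference passed around as (device id, device pointer)."""
    device: str
    pointer: int


class Devices:
    """Toy registry: each device's memory is a dict of pointer -> bytes."""

    def __init__(self):
        self.memory = {HOST: {}, CUDA: {}, DISK: {}}
        self.copies = []  # log of (source device, destination device)

    def store(self, device: str, data: bytes) -> DataRef:
        mem = self.memory[device]
        pointer = len(mem)  # next free slot in this toy address space
        mem[pointer] = data
        return DataRef(device, pointer)

    def load(self, ref: DataRef, processor: str) -> bytes:
        """Return data usable by `processor`, copying only when needed."""
        if ref.device == processor:
            # Processor and memory are on the same device: no copy.
            return self.memory[ref.device][ref.pointer]
        # On-demand copy straight from the source device to the processor's
        # device, without staging through a third device.
        self.copies.append((ref.device, processor))
        local = self.store(processor, self.memory[ref.device][ref.pointer])
        return self.memory[processor][local.pointer]
```

Loading a CUDA reference from CUDA records no copy, while loading a DISK reference from CUDA records a single (DISK, CUDA) transfer, matching the "bypass HOST" case in point 2.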
> > >> > >
> > >> > > PS: I realize that this discussion diverges from the original subject,
> > >> > > feel free to rename the subject if needed.
> > >> > >
> > >> > > Best regards,
> > >> > > Pearu
> > >> > >
> > >> > >
> > >> > >
> > >> > >
> > >> > >
> > >> > >
> > >> > >
> > >> > > On Fri, Sep 28, 2018 at 12:49 PM Wes McKinney <wesmckinn@xxxxxxxxx> wrote:
> > >> > >
> > >> > > > hi Pearu,
> > >> > > >
> > >> > > > Yes, I think it would be a good idea to develop some tools to make
> > >> > > > interacting with device memory using the existing data structures work
> > >> > > > seamlessly.
> > >> > > >
> > >> > > > This is all closely related to
> > >> > > >
> > >> > > > https://issues.apache.org/jira/browse/ARROW-2447
> > >> > > >
> > >> > > > I would say step 1 would be defining the device abstraction. Then we
> > >> > > > can add methods or properties to the data structures in pyarrow to
> > >> > > > show the location of the memory, whether CUDA or host RAM, etc. We
> > >> > > > could also have a memory-mapped device for memory maps to be able to
> > >> > > > communicate that data is on disk. We could then define virtual APIs
> > >> > > > for host-side data access to ensure that memory is copied to the host
> > >> > > > if needed (e.g. in the case of indexing into the values of an array).
> > >> > > >
> > >> > > > There are some small details around the handling of device in the case
> > >> > > > of hierarchical memory references. So if we say `buffer->GetDevice()`
> > >> > > > then even if it's a sliced buffer (which will be the case after using
> > >> > > > any IPC reader APIs), it needs to return the right device. This means
> > >> > > > that we probably need to define a SlicedBuffer type that delegates
> > >> > > > GetDevice() calls to the parent buffer.
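
The delegation described above could look roughly like the following. This is a Python sketch rather than the C++ that would actually be needed, and the Buffer, SlicedBuffer and get_device names are stand-ins invented for illustration, not the arrow::Buffer API.

```python
class Buffer:
    """Toy buffer that knows which device owns its memory."""

    def __init__(self, data: bytes, device: str = "HOST"):
        self._data = data
        self._device = device

    def get_device(self) -> str:
        return self._device

    def size(self) -> int:
        return len(self._data)

    def slice(self, offset: int, length: int) -> "SlicedBuffer":
        if offset < 0 or length < 0 or offset + length > self.size():
            raise ValueError("slice out of bounds")
        return SlicedBuffer(self, offset, length)


class SlicedBuffer(Buffer):
    """A view into a parent buffer; it owns no memory of its own."""

    def __init__(self, parent: Buffer, offset: int, length: int):
        self.parent = parent
        self.offset = offset
        self.length = length

    def get_device(self) -> str:
        # Delegate to the parent, so a slice of a slice still resolves
        # to the device of the buffer that owns the memory.
        return self.parent.get_device()

    def size(self) -> int:
        return self.length
```

Because the lookup walks the parent chain, code that dispatches on the device does not need to know whether it was handed the owning buffer or a slice produced by a reader.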
> > >> > > >
> > >> > > > Let me know if what I'm saying makes sense. Kou and Antoine probably
> > >> > > > have some thoughts about this also.
> > >> > > >
> > >> > > > - Wes
> > >> > > > On Fri, Sep 28, 2018 at 5:34 AM Pearu Peterson
> > >> > > > <pearu.peterson@xxxxxxxxxxxxx> wrote:
> > >> > > > >
> > >> > > > > Hi,
> > >> > > > >
> > >> > > > > Consider the following use case:
> > >> > > > >
> > >> > > > > schema = <pa.Schema instance>
> > >> > > > > cbuf = <pa.cuda.CudaBuffer instance>
> > >> > > > > cbatch = pa.cuda.read_record_batch(cbuf, schema)
> > >> > > > >
> > >> > > > > Note that cbatch is a pa.RecordBatch instance whose data pointers are
> > >> > > > > device pointers.
> > >> > > > >
> > >> > > > > for col in cbatch.columns:
> > >> > > > >     # here col is, say, a FloatArray whose data pointer is a device pointer;
> > >> > > > >     # as a result, accessing col data, say, taking a slice, leads to segfaults
> > >> > > > >     print(col[0])
> > >> > > > >
> > >> > > > > The aim of this message is to establish a user-friendly way to
> > >> > > > > access, say, a slice of the device data so that only the requested
> > >> > > > > data is copied to host.
> > >> > > > >
> > >> > > > > Or more generally, should there be a CUDA-specific RecordBatch that
> > >> > > > > implements the RecordBatch API and can be used from host?
> > >> > > > >
> > >> > > > > For instance, this would be similar to DeviceNDArray in numba, which
> > >> > > > > basically implements the ndarray API for device data while the API can
> > >> > > > > be used from host.
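
As a rough illustration of the behaviour asked for above (copy only the requested items to host), here is a pure-Python sketch. FakeDeviceMemory and DeviceFloatArray are hypothetical names; a bytes object stands in for device memory, and copy_to_host stands in for whatever device-to-host copy a real implementation would use.

```python
import struct


class FakeDeviceMemory:
    """Stand-in for device memory that counts bytes copied to host."""

    def __init__(self, data: bytes):
        self._data = data
        self.bytes_copied = 0

    def copy_to_host(self, offset: int, nbytes: int) -> bytes:
        self.bytes_copied += nbytes
        return self._data[offset:offset + nbytes]


class DeviceFloatArray:
    """Host-side view of float32 values that live in device memory."""

    ITEM = struct.calcsize("<f")  # 4 bytes per float32

    def __init__(self, mem: FakeDeviceMemory, length: int):
        self._mem = mem
        self._length = length

    def __len__(self) -> int:
        return self._length

    def __getitem__(self, key):
        if isinstance(key, slice):
            start, stop, step = key.indices(self._length)
            if step != 1:
                raise NotImplementedError("only contiguous slices in this sketch")
            count = max(stop - start, 0)
            # Copy just the bytes backing the requested range.
            raw = self._mem.copy_to_host(start * self.ITEM, count * self.ITEM)
            return list(struct.unpack("<%df" % count, raw))
        if key < 0:
            key += self._length
        if not 0 <= key < self._length:
            raise IndexError("index out of range")
        raw = self._mem.copy_to_host(key * self.ITEM, self.ITEM)
        return struct.unpack("<f", raw)[0]
```

Indexing one element copies 4 bytes and a two-element slice copies 8, so the cost of host access scales with what was requested rather than with the size of the column.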
> > >> > > > >
> > >> > > > > What do you think? What would be the proper approach? (I can do the
> > >> > > > > implementation).
> > >> > > > >
> > >> > > > > Best regards,
> > >> > > > > Pearu