Re: Some initial GPU questions
Hello Wes, Antoine,
Thanks for your very detailed responses!
It is really good to know that what is in arrow/gpu now is already set up to
integrate with various GPU producers / consumers.
The other responses made sense (assume in-memory and rely on orchestration,
explicit over implicit, roadmap discussions on Confluence, integrating CIs).
On Tue, Jun 26, 2018 at 1:04 PM Wes McKinney <wesmckinn@xxxxxxxxx> wrote:
> hi Anthony,
> Antoine is right that a Device abstraction is needed. I hadn't seen
> ARROW-2447 (I was on vacation in April) but I will comment there.
> It would be helpful to collect more requirements from GPU users -- one
> of the reasons that I set up the arrow/gpu project to begin with was
> to help catalyze collaborations with the GPU community. Unfortunately,
> that hasn't really happened yet after nearly a year, so hopefully we
> can get more folks involved in the near future.
> Some answers to your questions inline:
> On Tue, Jun 26, 2018 at 11:55 AM, Anthony Scopatz <scopatz@xxxxxxxxx> wrote:
> > Hello All,
> > As some of you may know, a few of us at Quansight have started (in
> > partnership with NVIDIA) looking at Arrow's GPU capabilities.
> > We are excited to help improve and expand Arrow's GPU support, but we did
> > have a few initial scoping questions.
> > Feel free to break these out into separate discussion threads if needed.
> > Hopefully, some of them will be easy enough to answer.
> > 1. What is the status of the GPU code in arrow now? E.g.
> > https://github.com/apache/arrow/tree/master/cpp/src/arrow/gpu Is anyone
> > actively working on this part of the code base? Are there other folks
> > working on GPU support? I'd love to chat, if so!
> The code there is basically waiting for one or more stakeholder users
> to get involved and help drive the roadmap. What is there now is
> pretty basic.
> To give you some context, I observed that some parts of this project
> (IPC / data structure reconstruction on GPUs) were being reimplemented
> in https://github.com/gpuopenanalytics/libgdf. So I started by setting
> up basic abstractions to plug the CUDA driver API into Arrow's various
> abstract interfaces for memory management and IO. I then implemented
> GPU-specialized IPC read and write functions so that these code paths
> in arrow/ipc can function without having the data be addressable in
> CPU memory. See the GPU IPC unit tests here:
> I contributed some patches to MapD and hoped to rework more of their
> Arrow interop to use these functions, but didn't get 100% of the way
> there last winter.
> With MapD, libgdf, BlazingDB, and other current and future GPU Arrow
> producers and consumers, I think there are plenty of components like
> these that it would make sense to develop here.
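
For anyone else following along, here is roughly how I picture exercising
those IPC pieces from Python once bindings land (untested sketch; the
``pyarrow.cuda`` module and the ``serialize_record_batch`` /
``read_record_batch`` names are assumed wrappers over arrow/gpu, not
something pyarrow ships today):

    import pyarrow as pa
    from pyarrow import cuda  # assumed bindings over arrow/gpu

    batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], ['x'])

    ctx = cuda.Context(0)                           # CUDA device 0
    # Write the IPC message into device memory...
    dbuf = cuda.serialize_record_batch(batch, ctx)
    # ...and reconstruct a RecordBatch whose buffers stay on the GPU.
    gpu_batch = cuda.read_record_batch(dbuf, batch.schema)
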
> > 2. Should arrow compute assume that everything fits in memory? Arrow
> > seems to handle data that is larger than memory via the Buffer API. Are
> > there restrictions implied by using Buffers that we should be aware of?
> This is a large question. Many database systems work on
> larger-than-memory datasets by splitting the problem into fragments
> that do fit into memory. I think it would be reasonable to start by
> developing computational functions that operate on in-memory data,
> then leaving it up to a task scheduler implementation to orchestrate
> an algorithm on larger-than-memory datasets. This is similar to how
> Dask has used pandas to work in an out-of-core and distributed setting.
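
That matches what we had in mind. For concreteness, the orchestration
pattern might look something like this (sketch; the file name and the
per-batch "kernel" are stand-ins):

    import pyarrow as pa

    total = 0
    with pa.OSFile('big_dataset.arrows', 'rb') as f:   # hypothetical stream
        reader = pa.ipc.open_stream(f)
        for batch in reader:         # each record batch fits in memory
            total += batch.num_rows  # stand-in for a real compute kernel
    print(total)

A scheduler like Dask would then distribute those per-batch tasks and
combine the partial results.
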
> > 3. What is the imagined interface between pyarrow and a GPU DataFrame?
> > One idea is to have the selection of main memory vs. the GPU be
> > transparent to the user. Another possible suggestion is to be explicit to
> > the user about where the data lives, for example:
> > >>> import pyarrow as pa
> > >>> a = pa.array(..., type=...) # create pyarrow array instance
> > >>> a_g = a.to_gpu(<device parameters>) # send `a` to GPU
> > >>> def foo(a): ... return ... # a function doing operations with `a`
> > >>> r = foo(a) # perform operations with `a`, runs on CPU
> > >>> r_g = foo(a_g) # perform operations with `a_g`, runs on GPU
> > >>> assert r == r_g.to_mem() # results are the same
> Data frames are kind of a semantic construct. As an example, pandas
> utilizes data structures and a mix of low-level algorithms that run
> against NumPy arrays to define the semantics for what is a "pandas
> DataFrame". But, since the Arrow columnar format was born from the
> needs of analytic database systems and in-memory analytics systems
> like pandas, we've captured more of the semantics of data frames than
> in a generic array computing library.
> In the case of Arrow, we have strived to be "front end agnostic", so
> if the objective is to develop front ends for Python programmers, then
> our goal would be to provide within pyarrow the data structures,
> metadata, IO / data access, and computational building blocks to do
> that. The pyarrow API is intended to give the developer explicit
> control over as much as possible, so they can decide what happens and
> when in their application or front-end API.
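
That explicit style is what we were leaning toward too, something like the
following, where the device transfer is always a visible step (sketch; the
``pyarrow.cuda`` names are assumed, as above):

    import pyarrow as pa
    from pyarrow import cuda  # assumed bindings over arrow/gpu

    a = pa.array([1, 2, 3], type=pa.int64())
    ctx = cuda.Context(0)                        # pick the device explicitly
    dbuf = ctx.buffer_from_data(a.buffers()[1])  # copy values buffer to GPU
    host = dbuf.copy_to_host()                   # explicit copy back to CPU
    assert host.to_pybytes() == a.buffers()[1].to_pybytes()
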
> > 4. Who has been working on arrow compute kernels, are there any design
> > docs or discussions we should look at? We've been following the
> > discussions and also the Ursa Labs Roadmap
> > <https://ursalabs.org/tech/#arrow-native-computation-engine>.
> On the C++ side, it's been mostly me, Uwe, Phillip Cloud, and Antoine.
> We built a few things to unblock some use cases we had (like type
> casting). I expect that longer term we'll have a mix of pre-compiled
> kernels (similar to TensorFlow's operator-kernel subsystem -- nearest
> analogue I can think of) and runtime-compiled kernels (i.e. LLVM / JIT-compiled code).
> I wrote up some of my thoughts on this in the Ursa Labs document you
> cited, but we don't have much in the way of roadmap documents for
> function kernels in the Arrow community. I started a separate thread
> about documentation organization in part to kickstart more roadmapping
> -- I would say that the ASF Confluence space for Arrow would be the
> best place for this work to happen.
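
For reference, the pre-compiled cast kernels Wes mentions are already
reachable from Python, e.g.:

    import pyarrow as pa

    a = pa.array([1.5, 2.5, 3.5])        # float64
    b = a.cast(pa.int64(), safe=False)   # dispatches a precompiled cast kernel
    print(b)                             # -> [1, 2, 3]
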
> > 5. Should the user be able to switch between compute
> > implementations at runtime, or only at compile time?
> It has been my hope to develop kernel dispatch machinery that can take
> into account the execution device in addition to the input types.
> Currently, we are only doing kernel selection based on input types and
> other kernel parameters. If, at dispatch time / runtime, the code
> indicated that the data was on the GPU, then a GPU kernel would be selected.
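
To make sure I follow, here is a toy version of that dispatch idea
(everything below is hypothetical; the real machinery would live in C++):

    import pyarrow as pa

    def cpu_sum_int64(arr):            # stand-in precompiled CPU kernel
        return sum(arr.to_pylist())

    def gpu_sum_int64(arr):            # stand-in; would launch a CUDA kernel
        raise NotImplementedError("no GPU build")

    # Registry keyed on (device, input type) instead of input type alone.
    KERNELS = {('cpu', 'int64'): cpu_sum_int64,
               ('gpu', 'int64'): gpu_sum_int64}

    def dispatch(arr, device):
        return KERNELS[(device, str(arr.type))](arr)

    print(dispatch(pa.array([1, 2, 3]), 'cpu'))   # -> 6
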
> > 6. Arrow's CI doesn't currently seem to support GPUs. If a free GPU CI
> > service were to come along, would Arrow be open to using it?
> Yes, I think so. Apache Spark has a Jenkins instance administered by
> UC Berkeley that's integrated with their GitHub. I can imagine a
> similar system where a bot will trigger builds in a GPU-enabled
> Jenkins when certain conditions are met (commit message flags) or if
> the developer requests.
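
We'd be happy to help stand that up. The gate on such a bot could be as
simple as (hypothetical sketch; the "[gpu]" flag and test name pattern are
illustrative, not anything Arrow defines today):

    import subprocess

    # Run the CUDA test suite only when the commit message opts in.
    msg = subprocess.check_output(['git', 'log', '-1', '--pretty=%B']).decode()
    if '[gpu]' in msg:
        subprocess.check_call(['ctest', '-R', 'cuda'])
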
> > Other than that, we'd love to know where and how we can plug in and help!
> Thanks! Glad to have more folks involved on this.
> - Wes
> > Be Well
> > Anthony
Asst. Prof. Anthony Scopatz
Nuclear Engineering Program
Mechanical Engineering Dept.
University of South Carolina
Cell: (512) 827-8239
Book a meeting with me at https://scopatz.youcanbook.me/
Open up an issue: https://github.com/scopatz/me/issues
Check my calendar
<https://www.google.com/calendar/embed?src=scopatz%40gmail.com>