osdir.com

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Arrow and oamap [was: Re: Gandiva Initiative]


Thanks Jim, that's helpful. In the long run, we'd like to make sure
that the Arrow libraries can serve as a robust dependency for
computational systems like OAMap. It doesn't strike me that OAMap is a
library that could be used within the Arrow project, though there are
some things which could be implemented as part of a function kernel
library or function code-generator. The streaming data / IPC machinery
we have developed could be useful for working with large on-disk
datasets or efficient data movement between nodes in a cluster.

Let us know as you run into issues or missing features so we can
incorporate into our development roadmap.

best
Wes

On Mon, Jun 25, 2018 at 5:25 PM, Jim Pivarski <jpivarski@xxxxxxxxx> wrote:
> What Martin said about OAMap and ROOT is true: there's no dependence, ROOT
> and Arrow are both backends.
>
> What Wes said about embeddability is also right: OAMap is pure Python+Numpy
> and would be hard to use within a C++ framework (or Java). This has already
> been an issue with a potential user in the ALICE experiment. They're
> considering Arrow in a C++ framework.
>
> I've been quiet for a few months because I've been trying to work OAMap
> into Dask and having trouble with the fact that OAMap data are "loose—" the
> schema is a separate thing from the data and different columns of the data
> can be in different files, in different file format, on different machines.
> I've found another reformulation of the problem in a bottom-up way, in
> which the data are in an extended set of array types— like Numpy arrays,
> but they can be chunked, jagged, AOS-SOA split, generated on demand, etc.
> The schema is implicit in the nesting of these objects, rather than being
> explicit and separate from the objects, and they reproduce all of the
> desirable effects of OAMap.
>
> So, my project being a research project, I'm following this promising line
> of attack. But it probably cuts this discussion short if it was about Gandiva
> vs OAMap as duplicating effort. Martin, I was planning on sending you a
> note about this when I had a working example, particularly if I could
> convince Dask that these arrays can be Dask lazy arrays. (I started working
> on this alternate approach exactly two weeks ago— it's very recent.)
>
> Meanwhile, I'm also going to look into Gandiva and possibly recommend it to
> the ALICE experiment.
>
> Cheers,
> Jim
>
>
>
>
> On Mon, Jun 25, 2018, 4:00 PM Martin Durant <martin.durant@xxxxxxxxxxx>
> wrote:
>
>> Let me just quickly correct a couple of point, for clarity.
>> The following module
>> https://github.com/diana-hep/oamap/blob/master/oamap/backend/arrow.py
>> is a proof-of-concept of oamap running directly on arrow memory - this is
>> the original reason I raised the topic here.
>>
>> There are also POCs showing operation on numpy records and parquet files.
>> oamap was written with ROOT in mind,
>> but is not necessarily tied to it; it just so happens that ROOT data tends
>> to have just the kind of deeply nested structures
>> that tend to perform terribly in a pandas apply situation or other python
>> object-based processing. oamap depends on
>> numba/llvm-lite
>>
>> So indeed maybe this all belongs as a conversation in pyarrow rather than
>> arrow, but insomuch as it enables - or may in
>> the future enable - machine-speed computation on in-memory nested arrow
>> data, I think oamap should be on
>> everyone’s radar as a interesting and useful project.
>>
>> > On 25 Jun 2018, at 16:45, Wes McKinney <wesmckinn@xxxxxxxxx> wrote:
>> >
>> > hi Martin,
>> >
>> > These projects are very different. Many analytic databases feature
>> > code generation (recently a lot of these use LLVM -- see Hyper, Apache
>> > Impala, and others) on the hot paths for function evaluation (e.g. for
>> > evaluating the expressions in the SELECT part or the WHERE part) --
>> > the reason people are excited about Gandiva is that it makes this type
>> > of functionality available as a library running atop an open standard
>> > memory format (Arrow columnar), so can be used in any programming
>> > language assuming suitable bindings can be developed. This is very
>> > much in line with our vision for creating a "deconstructed database"
>> > (see a talk that Julien gave on this topic:
>> >
>> https://www.slideshare.net/julienledem/from-flat-files-to-deconstructed-database
>> )
>> >
>> > I have not looked a great deal at oamap, but it does not use the Arrow
>> > columnar format AFAIK. It is written in Python and presumes some other
>> > technologies in use (like the ROOT format).
>> >
>> > So to summarize:
>> >
>> > Gandiva
>> > * Compiles analytical expressions to execute against Arrow columnar
>> format
>> > * Is written in C++ and can be embedded in other systems (Dremio is
>> > using it from Java)
>> >
>> > oamap
>> > * Does not use the Arrow columnar format
>> > * Presumes other technologies in use (ROOT)
>> > * Is written in Python, and would be challenging to use an embedded
>> > system component
>> >
>> > I'm certain these projects can learn from each other -- I have spoken
>> > with Jim (one of the developers of oamap) in the past, so welcome
>> > further discussion here on the mailing list.
>> >
>> > Thanks,
>> > Wes
>> >
>> > On Mon, Jun 25, 2018 at 1:27 PM, Martin Durant
>> > <martin.durant@xxxxxxxxxxx> wrote:
>> >> I am a little surprised by the very positive reception to Gandiva
>> (which doubtless is very useful - I know very little about it) versus when
>> I brought up the prospect of using oamap (
>> https://github.com/diana-hep/oamap ) on this mailing list.
>> >>
>> >> oamap uses numba to compile *python* functions at run-time and can walk
>> complex nested schema down to leaf nodes in native python syntax (for-loops
>> and attribute/item lookup) but at full machine speed, and without
>> materialising any objects along the way. It was written for the ROOT
>> format, but has implementations for simple types in parquet and arrow,
>> which each do the nested lists and dict things similarly but differently.
>> >>
>> >> Would someone care to explain the silence over oamap?
>> >>
>> >>> On 25 Jun 2018, at 02:06, Praveen Kumar <praveen@xxxxxxxxxx> wrote:
>> >>>
>> >>> Hi Everyone,
>> >>>
>> >>> I am Praveen, another engineer working on Gandiva. The interest and
>> speed of engagement around this is great !!Excited to engage with you folks
>> on this.
>> >>>
>> >>> Thx.
>> >>>
>> >>> On 2018/06/22 18:09:42, Julian Hyde <j...@xxxxxxxxxx> wrote:
>> >>>> This is exciting. We have wanted to build an Arrow adapter in Calcite
>> for some time and have a prototype (see
>> https://issues.apache.org/jira/browse/CALCITE-2173 <
>> https://issues.apache.org/jira/browse/CALCITE-2173>) but I hope that we
>> can use Gandiva. I know that Gandiva has Java bindings, but will these
>> allow queries to be compiled and executed from a pure Java process?>
>> >>>>
>> >>>> Can you describe Gandiva’s governance model? Without an open
>> governance model, companies that compete with Dremio may be wary about
>> contributing.>
>> >>>>
>> >>>> Can you compare and contrast your approach to Hyper[1]? Hyper is also
>> concerned with efficient use to the bus, and also uses LLVM, but it has a
>> different memory format and places much emphasis on lock-free data
>> structures.>
>> >>>>
>> >>>> I just attended SIGMOD and there were interesting industry papers
>> from MemSQL[2][3] and Oracle RAPID[4]. I was impressed with some of the
>> tricks MemSQL uses to achieve SIMD parallelism on queries such as “select
>> k4, sum(x) from t group by k4” (where k4 has 4 values).>
>> >>>>
>> >>>> I missed part of the RAPID talk, but I got the impression that they
>> are using disk-based algorithms (e.g. hybrid hash join) to handle data
>> spread between fast and slow memory.>
>> >>>>
>> >>>> MemSQL uses TPC-H query 1 as a motivating benchmark and I think this
>> would be good target for Gandiva also. It is a table scan with a range
>> filter (returning 98% of rows), a low-cardinality aggregate (grouping by
>> two fields with 3 values each), and several aggregate functions, the
>> arguments of which contain common sub-expressions.>
>> >>>>
>> >>>> SELECT>
>> >>>>   l_returnflag,>
>> >>>>   l_linestatus,>
>> >>>>   sum(l_quantity),>
>> >>>>   sum(l_extendedprice),>
>> >>>>   sum(l_extendedprice * (1 - l_discount)),>
>> >>>>   sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)),>
>> >>>>   avg(l_quantity),>
>> >>>>   avg(l_extendedprice),>
>> >>>>   avg(l_discount),>
>> >>>>   count(*)>
>> >>>> FROM lineitem>
>> >>>> WHERE l_shipdate <= date '1998-12-01' - interval '90’ day>
>> >>>> GROUP BY>
>> >>>>   l_returnflag,>
>> >>>>   l_linestatus>
>> >>>> ORDER BY>
>> >>>>   l_returnflag,>
>> >>>>   l_linestatus;>
>> >>>>
>> >>>> Julian>
>> >>>>
>> >>>> [1] http://www.vldb.org/pvldb/vol4/p539-neumann.pdf <
>> http://www.vldb.org/pvldb/vol4/p539-neumann.pdf>>
>> >>>>
>> >>>> [2]
>> http://blog.memsql.com/how-careful-engineering-lead-to-processing-over-a-trillion-rows-per-second/
>> <
>> http://blog.memsql.com/how-careful-engineering-lead-to-processing-over-a-trillion-rows-per-second/
>> >>
>> >>>>
>> >>>> [3] https://dl.acm.org/citation.cfm?id=3183713.3190658 <
>> https://dl.acm.org/citation.cfm?id=3183713.3190658>>
>> >>>>
>> >>>> [4] https://dl.acm.org/citation.cfm?id=3183713.3190655 <
>> https://dl.acm.org/citation.cfm?id=3183713.3190655>>
>> >>>>
>> >>>>> On Jun 22, 2018, at 7:22 AM, ravindrap@xxxxxxxxx wrote:>
>> >>>>>>
>> >>>>> Hi everyone,>
>> >>>>>>
>> >>>>> I'm Ravindra and I'm a developer on the Gandiva project. I do
>> believe that the combination of arrow and llvm for efficient expression
>> evaluation is powerful, and has a broad range of use-cases. We've just
>> started and hope to finesse and add a lot of functionality over the next
>> few months.>
>> >>>>>>
>> >>>>> Welcome your feedback and participation in gandiva !!>
>> >>>>>>
>> >>>>> thanks & regards,>
>> >>>>> ravindra.>
>> >>>>>>
>> >>>>> On 2018/06/21 19:15:20, Jacques Nadeau <ja...@xxxxxxxxxx> wrote: >
>> >>>>>> Hey Guys,>
>> >>>>>>>
>> >>>>>> Dremio just open sourced a new framework for processing data in
>> Arrow data>
>> >>>>>> structures [1], built on top of the Apache Arrow C++ APIs and
>> leveraging>
>> >>>>>> LLVM (Apache licensed). It also includes Java APIs that leverage
>> the Apache>
>> >>>>>> Arrow Java libraries. I expect the developers who have been working
>> on this>
>> >>>>>> will introduce themselves soon. To read more about it, take a look
>> at our>
>> >>>>>> Ravindra's blog post (he's the lead developer driving this work):
>> [2].>
>> >>>>>> Hopefully people will find this interesting/useful.>
>> >>>>>>>
>> >>>>>> Let us know what you all think!>
>> >>>>>>>
>> >>>>>> thanks,>
>> >>>>>> Jacques>
>> >>>>>>>
>> >>>>>>>
>> >>>>>> [1] https://github.com/dremio/gandiva>
>> >>>>>> [2]
>> https://www.dremio.com/announcing-gandiva-initiative-for-apache-arrow/>
>> >>>>>>>
>> >>>>
>> >>
>> >> —
>> >> Martin Durant
>> >> martin.durant@xxxxxxxxxxx
>> >>
>> >>
>> >>
>>
>> —
>> Martin Durant
>> martin.durant@xxxxxxxxxxx
>>
>>
>>
>>