
Re: Gandiva Initiative


Sorry for the delay, Julian. My replies inline.

On Fri, Jun 22, 2018 at 11:39 PM Julian Hyde <jhyde@xxxxxxxxxx> wrote:

> This is exciting. We have wanted to build an Arrow adapter in Calcite for
> some time and have a prototype (see
> https://issues.apache.org/jira/browse/CALCITE-2173) but I hope that we
> can use Gandiva. I know that Gandiva has Java bindings, but will these
> allow queries to be compiled and executed from a pure Java process?
>

Yes. Dremio is a Java process and uses the Java bindings for Gandiva. You
could take a look at the Maven unit tests for an example.


>
> Can you describe Gandiva’s governance model? Without an open governance
> model, companies that compete with Dremio may be wary about contributing.
>

Jacques has replied on this.


>
> Can you compare and contrast your approach to Hyper[1]? Hyper is also
> concerned with efficient use of the bus, and also uses LLVM, but it has a
> different memory format and places much emphasis on lock-free data
> structures.
>
> I just attended SIGMOD and there were interesting industry papers from
> MemSQL[2][3] and Oracle RAPID[4]. I was impressed with some of the tricks
> MemSQL uses to achieve SIMD parallelism on queries such as “select k4,
> sum(x) from t group by k4” (where k4 has 4 values).
>
> I missed part of the RAPID talk, but I got the impression that they are
> using disk-based algorithms (e.g. hybrid hash join) to handle data spread
> between fast and slow memory.
>
> MemSQL uses TPC-H query 1 as a motivating benchmark and I think this would
> be a good target for Gandiva also. It is a table scan with a range filter
> (returning 98% of rows), a low-cardinality aggregate (grouping by two
> fields with 3 values each), and several aggregate functions, the arguments
> of which contain common sub-expressions.
>


Thanks for the references - I'll look into them and get back.

Gandiva doesn't attempt to solve query optimization, efficient disk reads,
or work distribution across threads/VMs. We expect the higher layers (i.e.,
users of Gandiva) to handle these.

The expression builder returns a compiled, immutable LLVM module, which
can be shared across threads. Once an expression is built, both the inputs
and the outputs are Arrow vectors (more precisely, the input is a row batch).
There is no locking within Gandiva in the evaluation path.
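
To make the sharing model concrete, here is a minimal sketch in Python. It
is not the Gandiva API: `CompiledExpression`, `build_expression` and
`evaluate_batches` are hypothetical names, and plain dicts of lists stand in
for Arrow row batches. It only shows the idea of building an expression
once, keeping it immutable, and evaluating it from several threads without
a lock.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

# Column name -> values; a stand-in for an Arrow row batch (assumption).
Batch = Dict[str, Sequence[float]]


@dataclass(frozen=True)
class CompiledExpression:
    """Hypothetical stand-in for a compiled module; immutable once built."""
    kernel: Callable[[Batch], List[float]]

    def evaluate(self, batch: Batch) -> List[float]:
        # Reads only its argument and returns a fresh output list,
        # so concurrent calls share no mutable state and need no lock.
        return self.kernel(batch)


def build_expression() -> CompiledExpression:
    # "Compile" once. A real engine would generate machine code here;
    # this sketch just closes over a Python function computing a + b.
    def kernel(batch: Batch) -> List[float]:
        return [a + b for a, b in zip(batch["a"], batch["b"])]
    return CompiledExpression(kernel)


def evaluate_batches(expr: CompiledExpression,
                     batches: List[Batch]) -> List[List[float]]:
    # One shared expression, one batch per task.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(expr.evaluate, batches))


expr = build_expression()
batches = [{"a": [1.0, 2.0], "b": [10.0, 20.0]}, {"a": [3.0], "b": [30.0]}]
results = evaluate_batches(expr, batches)
```

Scheduling the batches across threads is left to the caller here, which
matches the split described above: the higher layer distributes the work.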

We are also targeting performance evaluation using TPC-H, but we plan to
first address projections and filters before moving to aggregations.
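
As a rough illustration of what "projections and filters" means for TPC-H
query 1, the sketch below evaluates the WHERE clause and the two derived
price expressions over a small in-memory batch. It is plain Python, not
Gandiva code: `filter_rows` and `project` are hypothetical helpers, the
sample rows are made up, and the aggregation step is deliberately left out.

```python
from datetime import date, timedelta
from typing import Dict, List

# Cutoff from the WHERE clause: date '1998-12-01' - interval '90' day.
CUTOFF = date(1998, 12, 1) - timedelta(days=90)


def filter_rows(shipdates: List[date]) -> List[int]:
    """Filter: indices of rows where l_shipdate <= CUTOFF."""
    return [i for i, d in enumerate(shipdates) if d <= CUTOFF]


def project(batch: Dict[str, list], selected: List[int]) -> Dict[str, List[float]]:
    """Projection: the two derived price columns for the selected rows."""
    disc_price: List[float] = []
    charge: List[float] = []
    for i in selected:
        # l_extendedprice * (1 - l_discount) appears in two of the query's
        # sum() arguments, so it is computed once and reused.
        dp = batch["l_extendedprice"][i] * (1 - batch["l_discount"][i])
        disc_price.append(dp)
        charge.append(dp * (1 + batch["l_tax"][i]))
    return {"disc_price": disc_price, "charge": charge}


# Made-up sample rows; the middle row falls after the cutoff.
batch = {
    "l_shipdate": [date(1998, 1, 15), date(1998, 9, 3), date(1998, 9, 2)],
    "l_extendedprice": [100.0, 80.0, 200.0],
    "l_discount": [0.5, 0.1, 0.25],
    "l_tax": [0.5, 0.2, 0.0],
}
selected = filter_rows(batch["l_shipdate"])
projected = project(batch, selected)
```

The filter returns row indices rather than copying rows, so the projection
only touches the rows that passed.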


>
>   SELECT
>     l_returnflag,
>     l_linestatus,
>     sum(l_quantity),
>     sum(l_extendedprice),
>     sum(l_extendedprice * (1 - l_discount)),
>     sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)),
>     avg(l_quantity),
>     avg(l_extendedprice),
>     avg(l_discount),
>     count(*)
>   FROM lineitem
>   WHERE l_shipdate <= date '1998-12-01' - interval '90' day
>   GROUP BY
>     l_returnflag,
>     l_linestatus
>   ORDER BY
>     l_returnflag,
>     l_linestatus;
>
> Julian
>
> [1] http://www.vldb.org/pvldb/vol4/p539-neumann.pdf
>
> [2] http://blog.memsql.com/how-careful-engineering-lead-to-processing-over-a-trillion-rows-per-second/
>
> [3] https://dl.acm.org/citation.cfm?id=3183713.3190658
>
> [4] https://dl.acm.org/citation.cfm?id=3183713.3190655
>
> > On Jun 22, 2018, at 7:22 AM, ravindrap@xxxxxxxxx wrote:
> >
> > Hi everyone,
> >
> > I'm Ravindra and I'm a developer on the Gandiva project. I do believe
> > that the combination of Arrow and LLVM for efficient expression evaluation
> > is powerful, and has a broad range of use-cases. We've just started and
> > hope to finesse and add a lot of functionality over the next few months.
> >
> > We welcome your feedback and participation in Gandiva!
> >
> > thanks & regards,
> > ravindra.
> >
> > On 2018/06/21 19:15:20, Jacques Nadeau <jacques@xxxxxxxxxx> wrote:
> >> Hey Guys,
> >>
> >> Dremio just open sourced a new framework for processing data in Arrow
> >> data structures [1], built on top of the Apache Arrow C++ APIs and
> >> leveraging LLVM (Apache licensed). It also includes Java APIs that
> >> leverage the Apache Arrow Java libraries. I expect the developers who
> >> have been working on this will introduce themselves soon. To read more
> >> about it, take a look at Ravindra's blog post (he's the lead developer
> >> driving this work): [2]. Hopefully people will find this
> >> interesting/useful.
> >>
> >> Let us know what you all think!
> >>
> >> thanks,
> >> Jacques
> >>
> >>
> >> [1] https://github.com/dremio/gandiva
> >> [2] https://www.dremio.com/announcing-gandiva-initiative-for-apache-arrow/
> >>
>
>