
[jira] [Created] (ARROW-4124) [C++] Abstract aggregation kernel API

Wes McKinney created ARROW-4124:

             Summary: [C++] Abstract aggregation kernel API
                 Key: ARROW-4124
                 URL: https://issues.apache.org/jira/browse/ARROW-4124
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Wes McKinney
             Fix For: 0.13.0

Related to the particular details of implementing various aggregation types: we should first put some energy into the abstract API for aggregating data in a multi-threaded setting.
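
To make the discussion concrete, here is a minimal sketch of what such an abstract API might look like. It is not an existing Arrow interface: the names {{Aggregator}}, {{Init}}, {{Consume}}, {{Merge}}, {{SumState}} and {{PartitionedSum}} are all hypothetical. The assumption is that each thread owns a private state, consumes its own chunks without locking, and the per-thread states are merged once at the end. The per-partition loop below runs serially to keep the example self-contained; each iteration is independent and could run on its own thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical abstract aggregation interface (illustration only).
template <typename State>
class Aggregator {
 public:
  virtual ~Aggregator() = default;
  // Create an empty state (the identity for this aggregation).
  virtual State Init() const = 0;
  // Fold one chunk of input into a thread-private state.
  virtual void Consume(const std::vector<int64_t>& chunk, State* state) const = 0;
  // Combine a state produced elsewhere into this one.
  virtual void Merge(const State& other, State* state) const = 0;
};

struct SumState {
  int64_t sum = 0;
};

class SumAggregator : public Aggregator<SumState> {
 public:
  SumState Init() const override { return SumState{}; }
  void Consume(const std::vector<int64_t>& chunk, SumState* state) const override {
    for (int64_t v : chunk) state->sum += v;
  }
  void Merge(const SumState& other, SumState* state) const override {
    state->sum += other.sum;
  }
};

// Assign chunks round-robin to `num_partitions` private states, then merge.
int64_t PartitionedSum(const std::vector<std::vector<int64_t>>& chunks,
                       std::size_t num_partitions) {
  assert(num_partitions > 0);
  SumAggregator agg;
  std::vector<SumState> states(num_partitions, agg.Init());
  for (std::size_t p = 0; p < num_partitions; ++p) {
    // Independent of every other partition: no shared mutable state.
    for (std::size_t i = p; i < chunks.size(); i += num_partitions) {
      agg.Consume(chunks[i], &states[p]);
    }
  }
  SumState result = agg.Init();
  for (const SumState& s : states) agg.Merge(s, &result);
  return result.sum;
}
```

Keeping {{Merge}} separate from {{Consume}} is what lets the hot loop avoid synchronization; only the final, cheap merge step needs to see more than one state.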

Aggregators must support both hash/group (e.g. "group by" in SQL or data frame libraries) modes and non-group modes. 
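
As a rough illustration of the two modes, the sketch below computes a sum once over the whole input and once per group key. The function names {{TotalSum}} and {{GroupedSum}} are hypothetical, and the use of {{std::unordered_map}} with string keys is only an assumption made for brevity; a real kernel would more likely hash keys to dense group ids and index into an array of states.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Non-group mode: a single aggregation state for the whole input.
int64_t TotalSum(const std::vector<int64_t>& values) {
  int64_t sum = 0;
  for (int64_t v : values) sum += v;
  return sum;
}

// Group mode: one aggregation state per distinct key.
std::unordered_map<std::string, int64_t> GroupedSum(
    const std::vector<std::string>& keys, const std::vector<int64_t>& values) {
  assert(keys.size() == values.size());
  std::unordered_map<std::string, int64_t> sums;
  for (std::size_t i = 0; i < values.size(); ++i) {
    // operator[] value-initializes a missing entry to 0 before adding.
    sums[keys[i]] += values[i];
  }
  return sums;
}
```

The per-element update is the same in both modes; only the lookup of which state to update differs, which suggests the two modes could share one kernel interface.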

Aggregations ideally should also support filter pushdown. For example:

select $AGG($EXPR)
from $TABLE
where $PREDICATE

Some systems materialize the filtered (post-predicate) version of {{$EXPR}} and then aggregate that; pandas does this, for example. Vectorized performance can be much improved by applying the filter inside the aggregation kernel instead. How the predicate's true/false values are handled may depend on the implementation details of the kernel (e.g. SUM or MEAN will be a bit different from PRODUCT).
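
The difference between the two strategies, and between SUM and PRODUCT, can be sketched as follows. All function names here are hypothetical, and the predicate result is assumed to arrive as a byte mask (nonzero means the row is selected). For SUM, a rejected row can be multiplied by 0 with no effect on the result; for PRODUCT, a rejected row has to be replaced by the identity 1 instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Strategy 1: materialize the filtered values, then aggregate them.
int64_t MaterializedSum(const std::vector<int64_t>& values,
                        const std::vector<uint8_t>& mask) {
  assert(values.size() == mask.size());
  std::vector<int64_t> filtered;  // extra allocation and extra pass
  for (std::size_t i = 0; i < values.size(); ++i) {
    if (mask[i] != 0) filtered.push_back(values[i]);
  }
  int64_t sum = 0;
  for (int64_t v : filtered) sum += v;
  return sum;
}

// Strategy 2: apply the filter inside the kernel. Rejected rows add 0.
int64_t FilteredSum(const std::vector<int64_t>& values,
                    const std::vector<uint8_t>& mask) {
  assert(values.size() == mask.size());
  int64_t sum = 0;
  for (std::size_t i = 0; i < values.size(); ++i) {
    sum += values[i] * static_cast<int64_t>(mask[i] != 0);
  }
  return sum;
}

// PRODUCT cannot multiply by the mask: rejected rows must contribute 1.
int64_t FilteredProduct(const std::vector<int64_t>& values,
                        const std::vector<uint8_t>& mask) {
  assert(values.size() == mask.size());
  int64_t product = 1;
  for (std::size_t i = 0; i < values.size(); ++i) {
    product *= (mask[i] != 0) ? values[i] : 1;
  }
  return product;
}
```

A MEAN kernel would additionally need to count the selected rows, since the divisor is the number of rows that passed the predicate rather than the input length.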
