[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: SQL Query Set Analyzer

You can find an overview of the work that has been done in Hive for
materialized view integration in the following link:
Materialized views can be stored in external tables such as Druid-backed
tables too. Druid rules that in Calcite are used to push computation
to Druid from Hive.

The rewriting algorithm itself is in Calcite. The algorithm can take advantage
of constraints (PK-FK relationship between tables) to produce additional
correct rewritings, can execute rollups, etc. However, it does not assume any
specific schema layout, which may make it useful for multiple ETL workloads.
The most recent addition is the support for partitioned materialized views,
including the extension in the cost model to take into account partition pruning
during the planning phase.

Incremental maintenance is supported. Most of that code lives in Hive, but it relies
on the rewriting algorithm too. It only works for materialized views that use Hive
transactional tables, either full ACID or insert-only. Basically Hive exposes explicitly
the data contained in the materialization via filter condition, e.g., mv1 contains data
for transactions (x, y, z), then let the rewriting algorithm trigger a partial rewriting
which reads new contents from the sources tables and processed contents from mv1.
Finally, an additional step transforms the rewritten expression into an INSERT or
MERGE statement depending on the materialized view expression (MERGE for
materialized views containing aggregations). Since not all tables in Hive support
UPDATE needed for MERGE, we were thinking about allowing some target materialized
views with definitions that include aggregates to use INSERT and then force the rollup
at runtime, e.g., for Druid.
bq. Maybe it depends on the aggregation functions that are used?
The result of some aggregate functions cannot be (always) incrementally maintained in
the presence of UPDATE/DELETE operations on source tables, e.g., min and max, though
some rewriting to minimize full rebuilds can be used if count is added as an additional
column to the materialized view. Incremental maintenance in presence of UPDATE/DELETE
operations in source tables is not supported in Hive yet, hence this is not implemented.

I would like to think that of the problems described below, we are getting to the
'more interesting stuff' in the Hive project, though there is some consolidation needed for
existing work too. That is why we are also interested in any effort related to materializations
recommendation. I believe the most powerful abstraction to use would be RelNode, which
can be useful for any system representing its queries internally using that representation,
instead of relying on SQL nodes which are more closely tight to the parser.

Concerning the ´feedback loop´, this recent paper by MSFT describes a system that does
something similar to what James was describing (for SCOPE):



On 8/6/18, 3:32 PM, "Julian Hyde" <jhyde@xxxxxxxxxx> wrote:

    It’s hard to automatically recommend a set of MVs from past queries. The design space is just too large. But if you are designing MVs for interactive BI, you can use the “lattice” model. This works because many queries will be filter-join-aggregate queries on a star schema (i.e. a central fact table and dimension tables joined over many-to-one relationships). (Or perhaps a join between two or more such queries.) 
    Do the queries you are trying to optimize have that pattern?
    If so, you might start by creating a lattice for each such star schema. Then the lattice can suggest MVs that are summary tables.
    (Lattice suggester is one step more meta - it recommends lattices - but given where you are, I would suggest hand-writing one or two lattices.)
    Calcite is a framework, and this unfortunately means that you have to write Java code to use these features. It might be easier if you use the new “server” module, which supports CREATE MATERIALIZED VIEW as a DDL statement. Then you can create some demos for your colleagues that are wholly or mostly SQL.
    The simplest way to populate a materialized view is the CREATE MATERIALIZED VIEW statement. It basically does the same as CREATE TABLE AS SELECT (executes a query, stores the results in a table) but it leaves behind the metadata about where that data came from.
    Materialized views can in principle be maintained incrementally, but how you do it depends upon what changes are allowed (append only? Replace rows and write the old rows to an audit table?). We’ve not done a lot of work on it. I believe the Hive folks have given this more thought than I have.
    > On Aug 3, 2018, at 11:11 PM, James Taylor <jamestaylor@xxxxxxxxxx> wrote:
    > Both the Lattice Suggestor and Quark sound like what I need for an
    > automated solution, but I have some more basic follow up questions first.
    > Here's our basic use case (very similar to Zheng Shao's, I believe):
    > - Our company has stood up Presto for data analysts
    > - Nightly ETL jobs populate Hive tables with lots of data
    > - Analysts run adhoc queries over data using Presto
    > - The top CPU using queries are pretty complex (2-3 pages of complex SQL,
    > lots of joins and aggregation)
    > There are some basic/obvious stuff that can be done manually first:
    > - Provide better visibility into which queries are expensive
    > - Ask query owners to produce their own materialized views and manually
    > change their queries to use them (I believe there's some amount of this
    > already)
    > Then there's kind of a middle ground:
    > - Ask query owners to identify what they think are the top few materialized
    > views to build
    > - Manually build these materialized views in the daily ETL job.
    > - Use Calcite to rewrite the query to use the materialized views. Can
    > Calcite do this and would it be a problem if the queries are Presto
    > queries? I'd need to make sure I provided Calcite with the cost information
    > it needs, right?
    > - Dark launch to test that the rewritten query returns the same results as
    > the original query (and measure the perf improvement)
    > But the more interesting stuff is:
    > - Automatically identifying the materialized views that should be built.
    > Sounds like both the Lattice Suggestor and Quark are potentially a good
    > fit. I'm not clear on what is output by the Suggestor. Would it spit out a
    > CREATE VIEW statement (or could what it outputs produce that)? How does the
    > Suggestor compare with Quark?
    > - Automatically build the materialized views. Would the Lattice framework
    > or Quark help me with that? Would it be possible to incrementally build the
    > materialized views or would it be necessary to build the materialized views
    > from the beginning of time again and again (clearly not feasible given the
    > size of the tables)? Maybe it depends on the aggregation functions that are
    > used?
    > - And the nirvana is a kind of feedback loop - based on the top N expensive
    > queries, identify and build the materialized views, use them transparently
    > during querying, and then retire them when they're infrequently used.
    > Would it be a better choice to build the materialized views as Druid
    > tables? That'd require a Druid connector to Presto, though. This reminds me
    > of the work you already did, Julian, with Hive+Druid (i.e. CALCITE-1731)
    > but for Presto instead of Hive. Do you think any of that would transfer
    > over in some way?
    > WDYT? Huge amount of work? Any advice is much appreciated.
    > Thanks,
    > James
    > On Thu, Jul 26, 2018 at 11:29 AM, Julian Hyde <jhyde@xxxxxxxxxx <mailto:jhyde@xxxxxxxxxx>> wrote:
    >> PS
    >> +1 for Babel.
    >> If you are analyzing a set of queries, it is very likely that these
    >> queries were written to be executed against another database. Babel aims to
    >> take such queries and convert them into Calcite relational algebra. The
    >> process might occasionally be lossy, if Calcite's algebra does not support
    >> a feature, but the algebra is still useful.
    >>> On Jul 26, 2018, at 11:25 AM, Julian Hyde <jhyde@xxxxxxxxxx> wrote:
    >>> There are many possible analyzers, but Lattice Suggester is one that I
    >> am working on and is relatively mature. It looks at lots of queries and
    >> builds lattices (star schemas with measures) from those queries. It finds
    >> commonality by “growing” lattices - adding measures, adding derived
    >> expressions, adding many-to-one joins. Lattice Suggester takes a set of SQL
    >> query strings, then it parses them (to SqlNode), validates, and converts to
    >> relational algebra (RelNode). Then it looks for patterns in the relational
    >> algebra. Working at the algebra level as opposed to the SQL parse tree is a
    >> net benefit, but some things (e.g. figuring out the original column alias
    >> for an expression) are a bit more difficult.
    >>> Lattice suggester is under development in my
    >> https://github.com/julianhyde/calcite/tree/1870-lattice-suggester <
    >> https://github.com/julianhyde/calcite/tree/1870-lattice-suggester <https://github.com/julianhyde/calcite/tree/1870-lattice-suggester>>
    >> branch. The tests pass, and I hope to have it merged into master in the
    >> next month or two.
    >>> I agree with Devjyoti that it’s difficult to gather together all
    >> possible analyses in one tool. Lattice Suggester is a good base for
    >> analyses that model queries as a filter/project/aggregate of a pre-joined
    >> star schema — a very common model in BI; for example, it could model which
    >> combinations of columns are commonly used as filters.
    >>> For analyses that are not tied to star schemas, feel free to create new
    >> tools. The tools would benefit from collaborative development, and I think
    >> that Calcite would be a good home for them.
    >>> Julian
    >>>> On Jul 25, 2018, at 10:28 PM, Devjyoti Patra <devjyotip@xxxxxxxxxx
    >> <mailto:devjyotip@xxxxxxxxxx <mailto:devjyotip@xxxxxxxxxx>>> wrote:
    >>>> Hi Zheng,
    >>>> At Qubole, we are building something very similar to what you are
    >> looking
    >>>> for. And from experience, I can tell you that it is a lot easy to build
    >> it
    >>>> than what one may think.
    >>>> We use Calcite parser to parse the SQL into Sqlnode and then use
    >> different
    >>>> tree visitors to extract query attributes like  tables, filter columns,
    >>>> joins, subqueries etc.,
    >>>> Our approach is very similar to Uber's QueryParser project (
    >>>> https://github.com/uber/queryparser <https://github.com/uber/queryparser> <https://github.com/uber/ <https://github.com/uber/>
    >> queryparser> ), but we go deeper in our analysis of
    >>>> finding queries that are semantically similar to some canonicalized
    >> form.
    >>>> If you intend to begin from scratch, I can give you some pointers to get
    >>>> started.
    >>>> Thanks,
    >>>> Devjyoti
    >>>> On Thu, Jul 26, 2018 at 9:37 AM, Zheng Shao <zshao9@xxxxxxxxx <mailto:zshao9@xxxxxxxxx> <mailto:
    >> zshao9@xxxxxxxxx>> wrote:
    >>>>> Hi,
    >>>>> We are thinking about starting a project to analyze huge number of SQL
    >>>>> queries (think millions) to identify common patterns:
    >>>>> * Common sub queries
    >>>>> * Common filtering conditions (columns) for a table
    >>>>> * Common join keys for table pairs
    >>>>> Are there any existing projects on that direction using Calcite?  Would
    >>>>> love to leverage instead of building from scratch.
    >>>>> Zheng