Re: SQL Query Set Analyzer
I am the creator of Quark (hopefully you meant this project:
https://github.com/qubole/quark). Quark was an experiment in federated
materialized views. We noticed that data teams store views of their data in
Redshift, Vertica, etc., but the relationships between them were not
captured. I don't think Quark will be a complete solution for your use case;
there are too many gaps that I won't get into in this thread.
To make sure I understand correctly, what you need is:
- A data engine that supports materialized views
- An analysis and recommendation engine for managing materialized views.
If you use Hive, then materialized views are available and what is
currently missing is the recommendation engine.
If you use Presto or Spark, then materialized views are not yet available.
The choices are (in decreasing order of effort):
- Add materialized views support to these engines.
- Add Calcite as an alternate optimizer and use materialized views.
We've tried both and decided on a third approach because of the effort
involved and the turnaround time for a live solution.
- Support a simplified version of materialized views, like Vertica
Projections or Dremio Reflections.
We realized through experiments and on customer data that we can get
substantial performance improvements by having materialized views on single
tables that support simple transformations such as:
- Sorting or bucketing
- Narrow tables (containing a subset of columns)
This dramatically simplifies the implementation of the optimizer rules and
the recommendation engine. The downside is that this approach cannot
optimize joins or aggregates. However, it seems like a good step in the
right direction.
We have code for this approach in Hive, Presto and Spark as well as a
proof-of-concept recommendation engine. Since this is not standard SQL and
we don't have an open source distribution of these engines, we don't have a
good vehicle to open-source these components. I can provide patches if
there is interest. The experience is similar to Vertica Projections.
If you take this approach, then the effort is not huge. From our experience,
it is about 2-3 man-months to have a working version.
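To make the single-table approach above concrete, here is a minimal sketch
of a projection recommender (illustrative Python; the names and the profile
format are invented for this example, and this is not our actual code). It
tallies which columns queries read and filter on for one wide table, then
suggests a narrow table sorted by the most-filtered column:

```python
from collections import Counter

def recommend_projection(query_profiles, max_width=4):
    """Suggest a narrow, sorted single-table materialized view:
    keep the most-referenced columns, sort by the most-filtered
    column. (Hypothetical sketch, not production code.)"""
    selected = Counter()   # how often each column is read
    filtered = Counter()   # how often each column is filtered on
    for profile in query_profiles:
        selected.update(profile["select"])
        filtered.update(profile["filter"])
    columns = sorted(col for col, _ in selected.most_common(max_width))
    sort_key = filtered.most_common(1)[0][0] if filtered else None
    return {"columns": columns, "sort_by": sort_key}

# Hypothetical per-query column usage for one wide table:
profiles = [
    {"select": ["user_id", "ts", "event"], "filter": ["ts"]},
    {"select": ["user_id", "event"],       "filter": ["ts"]},
    {"select": ["user_id", "country"],     "filter": ["country"]},
]
print(recommend_projection(profiles))
# {'columns': ['country', 'event', 'ts', 'user_id'], 'sort_by': 'ts'}
```

The real optimizer-rule side then only needs to match a scan of the base
table against these narrow/sorted variants, which is what keeps the
implementation small.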
On Tue, Aug 7, 2018 at 7:43 AM Julian Hyde <jhyde@xxxxxxxxxx> wrote:
> Regarding the SCOPE paper you reference. That was on my mind too (I went
> to the talk at SIGMOD). A materialized view is created only if the same
> query is used *textually identically* in different parts of the ETL
> process, so it is mainly for optimizing batch jobs that are largely the
> same night after night. Lattices are a better approach for optimizing
> interactive BI workloads.
> > On Aug 6, 2018, at 4:57 PM, Jesus Camacho Rodriguez <
> jcamachorodriguez@xxxxxxxxxxxxxxx> wrote:
> > You can find an overview of the work that has been done in Hive for
> > materialized view integration in the following link:
> > https://cwiki.apache.org/confluence/display/Hive/Materialized+views
> > Materialized views can be stored in external tables such as Druid-backed
> > tables too. The Druid rules in Calcite are used to push computation
> > to Druid from Hive.
> > The rewriting algorithm itself is in Calcite. The algorithm can take
> > advantage of constraints (e.g., PK-FK relationships between tables) to
> > produce additional correct rewritings, can execute rollups, etc. However,
> > it does not assume any specific schema layout, which may make it useful
> > for multiple ETL use cases.
> > The most recent addition is the support for partitioned materialized
> > views, including an extension of the cost model to take partition pruning
> > into account during the planning phase.
> > Incremental maintenance is supported. Most of that code lives in Hive,
> > but it relies on the rewriting algorithm too. It only works for
> > materialized views that use Hive transactional tables, either full ACID
> > or insert-only. Basically, Hive explicitly exposes the data contained in
> > the materialization via a filter condition, e.g., if mv1 contains data
> > for transactions (x, y, z), then the rewriting algorithm triggers a
> > partial rewriting which reads new contents from the source tables and
> > processed contents from mv1.
> > Finally, an additional step transforms the rewritten expression into an
> > INSERT or MERGE statement depending on the materialized view expression
> > (MERGE for materialized views containing aggregations). Since not all
> > tables in Hive support the UPDATE needed for MERGE, we were thinking
> > about allowing some target materialized views with definitions that
> > include aggregates to use INSERT and then force the rollup at runtime,
> > e.g., for Druid.
> > bq. Maybe it depends on the aggregation functions that are used?
> > The result of some aggregate functions cannot (always) be incrementally
> > maintained in the presence of UPDATE/DELETE operations on source tables,
> > e.g., min and max, though some rewriting to minimize full rebuilds can be
> > used if count is added as an additional column to the materialized view.
> > Incremental maintenance in the presence of UPDATE/DELETE operations on
> > source tables is not supported in Hive yet, hence this is not implemented.
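A minimal sketch of the MERGE-style maintenance described above
(illustrative Python, not Hive's implementation; the view shape is
invented for this example): each group carries a sum and a count, so
newly inserted rows can be folded in incrementally.

```python
def merge_delta(mv, delta):
    """MERGE-style incremental maintenance of an aggregate view:
    group_key -> (sum, count). Existing groups are updated, new
    groups are inserted. (Sketch only; min/max cannot be maintained
    this way once deletes are allowed, hence the full rebuild.)"""
    for key, (s, c) in delta.items():
        s0, c0 = mv.get(key, (0, 0))
        mv[key] = (s0 + s, c0 + c)
    return mv

mv = {"2018-08-06": (100, 4)}                            # materialized contents
delta = {"2018-08-06": (25, 1), "2018-08-07": (10, 2)}   # newly arrived rows
merge_delta(mv, delta)
print(mv)  # {'2018-08-06': (125, 5), '2018-08-07': (10, 2)}
```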
> > I would like to think that, of the problems described below, we are
> > getting to the 'more interesting stuff' in the Hive project, though there
> > is some consolidation needed for existing work too. That is why we are
> > also interested in any effort related to materialization recommendation.
> > I believe the most powerful abstraction to use would be RelNode, which
> > can be useful for any system representing its queries internally using
> > that representation, instead of relying on SQL nodes, which are more
> > closely tied to the SQL syntax.
> > Concerning the 'feedback loop', this recent paper by MSFT describes a
> > system that does something similar to what James was describing (for
> > SCOPE):
> > -Jesús
> > On 8/6/18, 3:32 PM, "Julian Hyde" <jhyde@xxxxxxxxxx> wrote:
> > It’s hard to automatically recommend a set of MVs from past queries.
> The design space is just too large. But if you are designing MVs for
> interactive BI, you can use the “lattice” model. This works because many
> queries will be filter-join-aggregate queries on a star schema (i.e. a
> central fact table and dimension tables joined over many-to-one
> relationships). (Or perhaps a join between two or more such queries.)
> > Do the queries you are trying to optimize have that pattern?
> > If so, you might start by creating a lattice for each such star
> schema. Then the lattice can suggest MVs that are summary tables.
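A minimal sketch of how a lattice yields candidate summary tables
(illustrative Python; the dimension and measure names are invented):
each non-empty subset of the lattice's dimension columns defines one
potential summary MV, which a real suggester would then rank.

```python
from itertools import combinations

# Hypothetical star schema: dimension columns and measures.
dimensions = ["country", "year", "product"]
measures = [("SUM", "sales"), ("COUNT", "*")]

# Each non-empty subset of dimensions defines one candidate summary
# table; a real suggester would rank candidates by the queries they
# can serve and by their estimated size.
candidates = [
    {"group_by": list(dims), "aggregates": measures}
    for r in range(1, len(dimensions) + 1)
    for dims in combinations(dimensions, r)
]
print(len(candidates))  # 7 (2^3 - 1 subsets)
```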
> > (Lattice suggester is one step more meta - it recommends lattices -
> but given where you are, I would suggest hand-writing one or two lattices.)
> > Calcite is a framework, and this unfortunately means that you have to
> write Java code to use these features. It might be easier if you use the
> new “server” module, which supports CREATE MATERIALIZED VIEW as a DDL
> statement. Then you can create some demos for your colleagues that are
> wholly or mostly SQL.
> > The simplest way to populate a materialized view is the CREATE
> MATERIALIZED VIEW statement. It basically does the same as CREATE TABLE AS
> SELECT (executes a query, stores the results in a table) but it leaves
> behind the metadata about where that data came from.
> > Materialized views can in principle be maintained incrementally, but
> how you do it depends upon what changes are allowed (append only? Replace
> rows and write the old rows to an audit table?). We’ve not done a lot of
> work on it. I believe the Hive folks have given this more thought than I
> have.
> > Julian
> >> On Aug 3, 2018, at 11:11 PM, James Taylor <jamestaylor@xxxxxxxxxx>
> >> Both the Lattice Suggester and Quark sound like what I need for an
> >> automated solution, but I have some more basic follow-up questions.
> >> Here's our basic use case (very similar to Zheng Shao's, I believe):
> >> - Our company has stood up Presto for data analysts
> >> - Nightly ETL jobs populate Hive tables with lots of data
> >> - Analysts run adhoc queries over data using Presto
> >> - The top CPU-consuming queries are pretty complex (2-3 pages of complex
> >> SQL, lots of joins and aggregation)
> >> There's some basic/obvious stuff that can be done manually first:
> >> - Provide better visibility into which queries are expensive
> >> - Ask query owners to produce their own materialized views and manually
> >> change their queries to use them (I believe there's some amount of this
> >> already)
> >> Then there's kind of a middle ground:
> >> - Ask query owners to identify what they think are the top few
> >> materialized views to build
> >> - Manually build these materialized views in the daily ETL job.
> >> - Use Calcite to rewrite the query to use the materialized views. Can
> >> Calcite do this, and would it be a problem if the queries are Presto
> >> queries? I'd need to make sure I provided Calcite with the cost
> >> information it needs, right?
> >> - Dark launch to test that the rewritten query returns the same results
> >> as the original query (and measure the perf improvement)
> >> But the more interesting stuff is:
> >> - Automatically identifying the materialized views that should be built.
> >> Sounds like both the Lattice Suggester and Quark are potentially a good
> >> fit. I'm not clear on what is output by the Suggester. Would it spit out
> >> a CREATE VIEW statement (or could what it outputs produce that)? How
> >> does the Suggester compare with Quark?
> >> - Automatically build the materialized views. Would the Lattice
> >> Suggester or Quark help me with that? Would it be possible to
> >> incrementally build the materialized views, or would it be necessary to
> >> build the materialized views from the beginning of time again and again
> >> (clearly not feasible given the size of the tables)? Maybe it depends on
> >> the aggregation functions that are used?
> >> - And the nirvana is a kind of feedback loop - based on the top N
> >> queries, identify and build the materialized views, use them
> >> during querying, and then retire them when they're infrequently used.
> >> Would it be a better choice to build the materialized views as Druid
> >> tables? That'd require a Druid connector to Presto, though. This
> reminds me
> >> of the work you already did, Julian, with Hive+Druid (i.e. CALCITE-1731)
> >> but for Presto instead of Hive. Do you think any of that would transfer
> >> over in some way?
> >> WDYT? Huge amount of work? Any advice is much appreciated.
> >> Thanks,
> >> James
> >> On Thu, Jul 26, 2018 at 11:29 AM, Julian Hyde <jhyde@xxxxxxxxxx> wrote:
> >>> PS
> >>> +1 for Babel.
> >>> If you are analyzing a set of queries, it is very likely that these
> >>> queries were written to be executed against another database. Babel
> aims to
> >>> take such queries and convert them into Calcite relational algebra. The
> >>> process might occasionally be lossy, if Calcite's algebra does not
> >>> support a feature, but the algebra is still useful.
> >>>> On Jul 26, 2018, at 11:25 AM, Julian Hyde <jhyde@xxxxxxxxxx> wrote:
> >>>> There are many possible analyzers, but Lattice Suggester is one that I
> >>> am working on and is relatively mature. It looks at lots of queries and
> >>> builds lattices (star schemas with measures) from those queries. It
> >>> finds commonality by “growing” lattices - adding measures, adding derived
> >>> expressions, adding many-to-one joins. Lattice Suggester takes a set
> of SQL
> >>> query strings, then it parses them (to SqlNode), validates, and
> converts to
> >>> relational algebra (RelNode). Then it looks for patterns in the
> >>> algebra. Working at the algebra level as opposed to the SQL parse tree
> is a
> >>> net benefit, but some things (e.g. figuring out the original column
> >>> name for an expression) are a bit more difficult.
> >>>> Lattice suggester is under development in my
> >>> https://github.com/julianhyde/calcite/tree/1870-lattice-suggester
> >>> branch. The tests pass, and I hope to have it merged into master in the
> >>> next month or two.
> >>>> I agree with Devjyoti that it’s difficult to gather together all
> >>> possible analyses in one tool. Lattice Suggester is a good base for
> >>> analyses that model queries as a filter/project/aggregate of a
> >>> star schema — a very common model in BI; for example, it could model
> >>> which combinations of columns are commonly used as filters.
> >>>> For analyses that are not tied to star schemas, feel free to create
> >>> new tools. The tools would benefit from collaborative development, and
> >>> I think that Calcite would be a good home for them.
> >>>> Julian
> >>>>> On Jul 25, 2018, at 10:28 PM, Devjyoti Patra <devjyotip@xxxxxxxxxx> wrote:
> >>>>> Hi Zheng,
> >>>>> At Qubole, we are building something very similar to what you are
> >>>>> looking for. And from experience, I can tell you that it is a lot
> >>>>> easier to do than one may think.
> >>>>> We use the Calcite parser to parse the SQL into SqlNode and then use
> >>>>> different tree visitors to extract query attributes like tables,
> >>>>> filter predicates, joins, subqueries, etc.
> >>>>> Our approach is very similar to Uber's QueryParser project
> >>>>> (https://github.com/uber/queryparser), but we go deeper in our
> >>>>> analysis of finding queries that are semantically similar to some
> >>>>> canonicalized form.
> >>>>> If you intend to begin from scratch, I can give you some pointers to
> >>>>> get started.
> >>>>> Thanks,
> >>>>> Devjyoti
> >>>>> On Thu, Jul 26, 2018 at 9:37 AM, Zheng Shao <zshao9@xxxxxxxxx> wrote:
> >>>>>> Hi,
> >>>>>> We are thinking about starting a project to analyze a huge number of
> >>>>>> queries (think millions) to identify common patterns:
> >>>>>> * Common sub queries
> >>>>>> * Common filtering conditions (columns) for a table
> >>>>>> * Common join keys for table pairs
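A minimal sketch of the tallying involved (illustrative Python; it assumes
the per-query attributes have already been extracted by a parser, and all
names are invented for the example):

```python
from collections import Counter

# Hypothetical per-query attributes, as extracted by tree visitors
# over the parse tree (table/column names are illustrative).
queries = [
    {"filters": [("orders", "order_date")],
     "joins": [(("orders", "cust_id"), ("customers", "id"))]},
    {"filters": [("orders", "order_date"), ("orders", "status")],
     "joins": [(("orders", "cust_id"), ("customers", "id"))]},
]

# Count how often each filter column and each join-key pair occurs.
filter_counts = Counter(f for q in queries for f in q["filters"])
join_counts = Counter(j for q in queries for j in q["joins"])

print(filter_counts.most_common(1))  # [(('orders', 'order_date'), 2)]
print(join_counts.most_common(1))    # [((('orders', 'cust_id'), ('customers', 'id')), 2)]
```

The hard part at scale is canonicalizing millions of queries so that
semantically equivalent patterns tally into the same bucket, which is what
the analyzers discussed in this thread address.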
> >>>>>> Are there any existing projects in that direction using Calcite?
> >>>>>> Would love to leverage instead of building from scratch.
> >>>>>> Zheng