Re: SQL Query Set Analyzer
Regarding the SCOPE paper you reference. That was on my mind too (I went to the talk at SIGMOD). A materialized view is created only if the same query is used *textually identically* in different parts of the ETL process, so it is mainly for optimizing batch jobs that are largely the same night after night. Lattices are a better approach for optimizing interactive BI work-loads.
> On Aug 6, 2018, at 4:57 PM, Jesus Camacho Rodriguez <jcamachorodriguez@xxxxxxxxxxxxxxx> wrote:
> You can find an overview of the work that has been done in Hive for
> materialized view integration in the following link:
> https://cwiki.apache.org/confluence/display/Hive/Materialized+views <https://cwiki.apache.org/confluence/display/Hive/Materialized+views>
> Materialized views can be stored in external tables such as Druid-backed
> tables too. Druid rules that in Calcite are used to push computation
> to Druid from Hive.
> The rewriting algorithm itself is in Calcite. The algorithm can take advantage
> of constraints (PK-FK relationship between tables) to produce additional
> correct rewritings, can execute rollups, etc. However, it does not assume any
> specific schema layout, which may make it useful for multiple ETL workloads.
> http://calcite.apache.org/docs/materialized_views#rewriting-using-plan-structural-information <http://calcite.apache.org/docs/materialized_views#rewriting-using-plan-structural-information>
> The most recent addition is the support for partitioned materialized views,
> including the extension in the cost model to take into account partition pruning
> during the planning phase.
> Incremental maintenance is supported. Most of that code lives in Hive, but it relies
> on the rewriting algorithm too. It only works for materialized views that use Hive
> transactional tables, either full ACID or insert-only. Basically Hive exposes explicitly
> the data contained in the materialization via filter condition, e.g., mv1 contains data
> for transactions (x, y, z), then let the rewriting algorithm trigger a partial rewriting
> which reads new contents from the sources tables and processed contents from mv1.
> Finally, an additional step transforms the rewritten expression into an INSERT or
> MERGE statement depending on the materialized view expression (MERGE for
> materialized views containing aggregations). Since not all tables in Hive support
> UPDATE needed for MERGE, we were thinking about allowing some target materialized
> views with definitions that include aggregates to use INSERT and then force the rollup
> at runtime, e.g., for Druid.
> bq. Maybe it depends on the aggregation functions that are used?
> The result of some aggregate functions cannot be (always) incrementally maintained in
> the presence of UPDATE/DELETE operations on source tables, e.g., min and max, though
> some rewriting to minimize full rebuilds can be used if count is added as an additional
> column to the materialized view. Incremental maintenance in presence of UPDATE/DELETE
> operations in source tables is not supported in Hive yet, hence this is not implemented.
> I would like to think that of the problems described below, we are getting to the
> 'more interesting stuff' in the Hive project, though there is some consolidation needed for
> existing work too. That is why we are also interested in any effort related to materializations
> recommendation. I believe the most powerful abstraction to use would be RelNode, which
> can be useful for any system representing its queries internally using that representation,
> instead of relying on SQL nodes which are more closely tight to the parser.
> Concerning the ´feedback loop´, this recent paper by MSFT describes a system that does
> something similar to what James was describing (for SCOPE):
> https://www.microsoft.com/en-us/research/uploads/prod/2018/03/cloudviews-sigmod2018.pdf <https://www.microsoft.com/en-us/research/uploads/prod/2018/03/cloudviews-sigmod2018.pdf>
> On 8/6/18, 3:32 PM, "Julian Hyde" <jhyde@xxxxxxxxxx <mailto:jhyde@xxxxxxxxxx>> wrote:
> It’s hard to automatically recommend a set of MVs from past queries. The design space is just too large. But if you are designing MVs for interactive BI, you can use the “lattice” model. This works because many queries will be filter-join-aggregate queries on a star schema (i.e. a central fact table and dimension tables joined over many-to-one relationships). (Or perhaps a join between two or more such queries.)
> Do the queries you are trying to optimize have that pattern?
> If so, you might start by creating a lattice for each such star schema. Then the lattice can suggest MVs that are summary tables.
> (Lattice suggester is one step more meta - it recommends lattices - but given where you are, I would suggest hand-writing one or two lattices.)
> Calcite is a framework, and this unfortunately means that you have to write Java code to use these features. It might be easier if you use the new “server” module, which supports CREATE MATERIALIZED VIEW as a DDL statement. Then you can create some demos for your colleagues that are wholly or mostly SQL.
> The simplest way to populate a materialized view is the CREATE MATERIALIZED VIEW statement. It basically does the same as CREATE TABLE AS SELECT (executes a query, stores the results in a table) but it leaves behind the metadata about where that data came from.
> Materialized views can in principle be maintained incrementally, but how you do it depends upon what changes are allowed (append only? Replace rows and write the old rows to an audit table?). We’ve not done a lot of work on it. I believe the Hive folks have given this more thought than I have.
>> On Aug 3, 2018, at 11:11 PM, James Taylor <jamestaylor@xxxxxxxxxx> wrote:
>> Both the Lattice Suggestor and Quark sound like what I need for an
>> automated solution, but I have some more basic follow up questions first.
>> Here's our basic use case (very similar to Zheng Shao's, I believe):
>> - Our company has stood up Presto for data analysts
>> - Nightly ETL jobs populate Hive tables with lots of data
>> - Analysts run adhoc queries over data using Presto
>> - The top CPU using queries are pretty complex (2-3 pages of complex SQL,
>> lots of joins and aggregation)
>> There are some basic/obvious stuff that can be done manually first:
>> - Provide better visibility into which queries are expensive
>> - Ask query owners to produce their own materialized views and manually
>> change their queries to use them (I believe there's some amount of this
>> Then there's kind of a middle ground:
>> - Ask query owners to identify what they think are the top few materialized
>> views to build
>> - Manually build these materialized views in the daily ETL job.
>> - Use Calcite to rewrite the query to use the materialized views. Can
>> Calcite do this and would it be a problem if the queries are Presto
>> queries? I'd need to make sure I provided Calcite with the cost information
>> it needs, right?
>> - Dark launch to test that the rewritten query returns the same results as
>> the original query (and measure the perf improvement)
>> But the more interesting stuff is:
>> - Automatically identifying the materialized views that should be built.
>> Sounds like both the Lattice Suggestor and Quark are potentially a good
>> fit. I'm not clear on what is output by the Suggestor. Would it spit out a
>> CREATE VIEW statement (or could what it outputs produce that)? How does the
>> Suggestor compare with Quark?
>> - Automatically build the materialized views. Would the Lattice framework
>> or Quark help me with that? Would it be possible to incrementally build the
>> materialized views or would it be necessary to build the materialized views
>> from the beginning of time again and again (clearly not feasible given the
>> size of the tables)? Maybe it depends on the aggregation functions that are
>> - And the nirvana is a kind of feedback loop - based on the top N expensive
>> queries, identify and build the materialized views, use them transparently
>> during querying, and then retire them when they're infrequently used.
>> Would it be a better choice to build the materialized views as Druid
>> tables? That'd require a Druid connector to Presto, though. This reminds me
>> of the work you already did, Julian, with Hive+Druid (i.e. CALCITE-1731)
>> but for Presto instead of Hive. Do you think any of that would transfer
>> over in some way?
>> WDYT? Huge amount of work? Any advice is much appreciated.
>> On Thu, Jul 26, 2018 at 11:29 AM, Julian Hyde <jhyde@xxxxxxxxxx <mailto:jhyde@xxxxxxxxxx><mailto:jhyde@xxxxxxxxxx <mailto:jhyde@xxxxxxxxxx>>> wrote:
>>> +1 for Babel.
>>> If you are analyzing a set of queries, it is very likely that these
>>> queries were written to be executed against another database. Babel aims to
>>> take such queries and convert them into Calcite relational algebra. The
>>> process might occasionally be lossy, if Calcite's algebra does not support
>>> a feature, but the algebra is still useful.
>>>> On Jul 26, 2018, at 11:25 AM, Julian Hyde <jhyde@xxxxxxxxxx <mailto:jhyde@xxxxxxxxxx>> wrote:
>>>> There are many possible analyzers, but Lattice Suggester is one that I
>>> am working on and is relatively mature. It looks at lots of queries and
>>> builds lattices (star schemas with measures) from those queries. It finds
>>> commonality by “growing” lattices - adding measures, adding derived
>>> expressions, adding many-to-one joins. Lattice Suggester takes a set of SQL
>>> query strings, then it parses them (to SqlNode), validates, and converts to
>>> relational algebra (RelNode). Then it looks for patterns in the relational
>>> algebra. Working at the algebra level as opposed to the SQL parse tree is a
>>> net benefit, but some things (e.g. figuring out the original column alias
>>> for an expression) are a bit more difficult.
>>>> Lattice suggester is under development in my
>>> https://github.com/julianhyde/calcite/tree/1870-lattice-suggester <https://github.com/julianhyde/calcite/tree/1870-lattice-suggester> <
>>> https://github.com/julianhyde/calcite/tree/1870-lattice-suggester <https://github.com/julianhyde/calcite/tree/1870-lattice-suggester><https://github.com/julianhyde/calcite/tree/1870-lattice-suggester <https://github.com/julianhyde/calcite/tree/1870-lattice-suggester>>>
>>> branch. The tests pass, and I hope to have it merged into master in the
>>> next month or two.
>>>> I agree with Devjyoti that it’s difficult to gather together all
>>> possible analyses in one tool. Lattice Suggester is a good base for
>>> analyses that model queries as a filter/project/aggregate of a pre-joined
>>> star schema — a very common model in BI; for example, it could model which
>>> combinations of columns are commonly used as filters.
>>>> For analyses that are not tied to star schemas, feel free to create new
>>> tools. The tools would benefit from collaborative development, and I think
>>> that Calcite would be a good home for them.
>>>>> On Jul 25, 2018, at 10:28 PM, Devjyoti Patra <devjyotip@xxxxxxxxxx <mailto:devjyotip@xxxxxxxxxx>
>>> <mailto:devjyotip@xxxxxxxxxx <mailto:devjyotip@xxxxxxxxxx> <mailto:devjyotip@xxxxxxxxxx <mailto:devjyotip@xxxxxxxxxx>>>> wrote:
>>>>> Hi Zheng,
>>>>> At Qubole, we are building something very similar to what you are
>>>>> for. And from experience, I can tell you that it is a lot easy to build
>>>>> than what one may think.
>>>>> We use Calcite parser to parse the SQL into Sqlnode and then use
>>>>> tree visitors to extract query attributes like tables, filter columns,
>>>>> joins, subqueries etc.,
>>>>> Our approach is very similar to Uber's QueryParser project (
>>>>> https://github.com/uber/queryparser <https://github.com/uber/queryparser> <https://github.com/uber/queryparser <https://github.com/uber/queryparser>> <https://github.com/uber/ <https://github.com/uber/> <https://github.com/uber/ <https://github.com/uber/>>
>>> queryparser> ), but we go deeper in our analysis of
>>>>> finding queries that are semantically similar to some canonicalized
>>>>> If you intend to begin from scratch, I can give you some pointers to get
>>>>> On Thu, Jul 26, 2018 at 9:37 AM, Zheng Shao <zshao9@xxxxxxxxx <mailto:zshao9@xxxxxxxxx><mailto:zshao9@xxxxxxxxx <mailto:zshao9@xxxxxxxxx>> <mailto:
>>> zshao9@xxxxxxxxx>> wrote:
>>>>>> We are thinking about starting a project to analyze huge number of SQL
>>>>>> queries (think millions) to identify common patterns:
>>>>>> * Common sub queries
>>>>>> * Common filtering conditions (columns) for a table
>>>>>> * Common join keys for table pairs
>>>>>> Are there any existing projects on that direction using Calcite? Would
>>>>>> love to leverage instead of building from scratch.