[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: SQL Query Set Analyzer

Both the Lattice Suggestor and Quark sound like what I need for an
automated solution, but I have some more basic follow up questions first.
Here's our basic use case (very similar to Zheng Shao's, I believe):
- Our company has stood up Presto for data analysts
- Nightly ETL jobs populate Hive tables with lots of data
- Analysts run adhoc queries over data using Presto
- The top CPU using queries are pretty complex (2-3 pages of complex SQL,
lots of joins and aggregation)

There are some basic/obvious stuff that can be done manually first:
- Provide better visibility into which queries are expensive
- Ask query owners to produce their own materialized views and manually
change their queries to use them (I believe there's some amount of this

Then there's kind of a middle ground:
- Ask query owners to identify what they think are the top few materialized
views to build
- Manually build these materialized views in the daily ETL job.
- Use Calcite to rewrite the query to use the materialized views. Can
Calcite do this and would it be a problem if the queries are Presto
queries? I'd need to make sure I provided Calcite with the cost information
it needs, right?
- Dark launch to test that the rewritten query returns the same results as
the original query (and measure the perf improvement)

But the more interesting stuff is:
- Automatically identifying the materialized views that should be built.
Sounds like both the Lattice Suggestor and Quark are potentially a good
fit. I'm not clear on what is output by the Suggestor. Would it spit out a
CREATE VIEW statement (or could what it outputs produce that)? How does the
Suggestor compare with Quark?
- Automatically build the materialized views. Would the Lattice framework
or Quark help me with that? Would it be possible to incrementally build the
materialized views or would it be necessary to build the materialized views
from the beginning of time again and again (clearly not feasible given the
size of the tables)? Maybe it depends on the aggregation functions that are
- And the nirvana is a kind of feedback loop - based on the top N expensive
queries, identify and build the materialized views, use them transparently
during querying, and then retire them when they're infrequently used.

Would it be a better choice to build the materialized views as Druid
tables? That'd require a Druid connector to Presto, though. This reminds me
of the work you already did, Julian, with Hive+Druid (i.e. CALCITE-1731)
but for Presto instead of Hive. Do you think any of that would transfer
over in some way?

WDYT? Huge amount of work? Any advice is much appreciated.


On Thu, Jul 26, 2018 at 11:29 AM, Julian Hyde <jhyde@xxxxxxxxxx> wrote:

> PS
> +1 for Babel.
> If you are analyzing a set of queries, it is very likely that these
> queries were written to be executed against another database. Babel aims to
> take such queries and convert them into Calcite relational algebra. The
> process might occasionally be lossy, if Calcite's algebra does not support
> a feature, but the algebra is still useful.
> > On Jul 26, 2018, at 11:25 AM, Julian Hyde <jhyde@xxxxxxxxxx> wrote:
> >
> > There are many possible analyzers, but Lattice Suggester is one that I
> am working on and is relatively mature. It looks at lots of queries and
> builds lattices (star schemas with measures) from those queries. It finds
> commonality by “growing” lattices - adding measures, adding derived
> expressions, adding many-to-one joins. Lattice Suggester takes a set of SQL
> query strings, then it parses them (to SqlNode), validates, and converts to
> relational algebra (RelNode). Then it looks for patterns in the relational
> algebra. Working at the algebra level as opposed to the SQL parse tree is a
> net benefit, but some things (e.g. figuring out the original column alias
> for an expression) are a bit more difficult.
> >
> > Lattice suggester is under development in my
> https://github.com/julianhyde/calcite/tree/1870-lattice-suggester <
> https://github.com/julianhyde/calcite/tree/1870-lattice-suggester>
> branch. The tests pass, and I hope to have it merged into master in the
> next month or two.
> >
> > I agree with Devjyoti that it’s difficult to gather together all
> possible analyses in one tool. Lattice Suggester is a good base for
> analyses that model queries as a filter/project/aggregate of a pre-joined
> star schema — a very common model in BI; for example, it could model which
> combinations of columns are commonly used as filters.
> >
> > For analyses that are not tied to star schemas, feel free to create new
> tools. The tools would benefit from collaborative development, and I think
> that Calcite would be a good home for them.
> >
> > Julian
> >
> >
> >
> >
> >> On Jul 25, 2018, at 10:28 PM, Devjyoti Patra <devjyotip@xxxxxxxxxx
> <mailto:devjyotip@xxxxxxxxxx>> wrote:
> >>
> >> Hi Zheng,
> >>
> >> At Qubole, we are building something very similar to what you are
> looking
> >> for. And from experience, I can tell you that it is a lot easy to build
> it
> >> than what one may think.
> >> We use Calcite parser to parse the SQL into Sqlnode and then use
> different
> >> tree visitors to extract query attributes like  tables, filter columns,
> >> joins, subqueries etc.,
> >>
> >> Our approach is very similar to Uber's QueryParser project (
> >> https://github.com/uber/queryparser <https://github.com/uber/
> queryparser> ), but we go deeper in our analysis of
> >> finding queries that are semantically similar to some canonicalized
> form.
> >> If you intend to begin from scratch, I can give you some pointers to get
> >> started.
> >>
> >> Thanks,
> >> Devjyoti
> >>
> >>
> >> On Thu, Jul 26, 2018 at 9:37 AM, Zheng Shao <zshao9@xxxxxxxxx <mailto:
> zshao9@xxxxxxxxx>> wrote:
> >>
> >>> Hi,
> >>>
> >>> We are thinking about starting a project to analyze huge number of SQL
> >>> queries (think millions) to identify common patterns:
> >>> * Common sub queries
> >>> * Common filtering conditions (columns) for a table
> >>> * Common join keys for table pairs
> >>>
> >>> Are there any existing projects on that direction using Calcite?  Would
> >>> love to leverage instead of building from scratch.
> >>>
> >>> Zheng
> >>>
> >