|
|
Sponsor |
Re: INTERSECT and EXCEPT Design: msg#00334apache.db.derby.devel
RPost wrote: >Jack Klebanoff wrote:Other relational databases do not all support the "DISTINCT" keyword, so applications that use it will not be portable. That being said, I think that adding the "DISTINCT" keyword would be a good project for a neophyte who wants to get his or her feet wet in the Derby parser. > The architecture of the Derby optimizer makes it difficult to do further optimizations.His mail has been helpful in the past. I hope that I can stimulate more. I don't think that we should spend a lot of effort on optimizing the INTERSECT and EXCEPT operators. It isn't used often and the performance of the current implementation is always at least decent. We should put our optimization efforts into optimizing the performance of other operations. (Please speak up if you think that INTERSECT or EXCEPT will be used often). >The UNION and EXCEPT operators have the same precedence. The INTERSECT operator has higher >precedence, soYes. I included the UNION/UNION ALL example because it is important to get the associativity right even in the UNION case. I have updated my paper (attached) to say so explicitly, and include an example with UNION and EXCEPT. >IntersectOrExceptNode uses the ORDER BY columns as the most significant part of the sort key for its inputs. Any >columns not in the ORDER BY list are tacked on to the least significant part of the sort keys of the inputs. This >ensures that the output of the INTERSECT or EXCEPT will be properly ordered without an additional sort stepNo. It isn't very easy for the set operators to find out the structure of the operands or to deduce whether there may be duplicate rows in either of the inputs. If the IntersectOrExceptNode class could know that one of the operands would use a unique index it could use that column in its sort key, avoiding a sort. However the structure of the optimizer requires that orderby lists be generated in the optimization preprocess phase, before the main body of the optimizer has started. So it is hard to use knowledge about unique keys if we could get it. Jack Intersect & Except DesignJack KlebanoffFeb. 22 2005 IntroductionThis paper describes the implementation of the INTERSECT and EXCEPT operators. This paper assumes basic familiarity with SQL and the language (compiler) portion of Derby. The INTERSECT and EXCEPT operators operate on table expressions producing the intersection and difference, respectively. The syntax is (roughly):
queryExpression INTERSECT [ALL] queryExpression By default these operators remove duplicates, which can occur if there are duplicates in the inputs. If ALL is specified then duplicates are not removed. If t1 has m copies of row R and t2 has n copies then t1 INTERSECT ALL t2 returns min(m,n) copies of R, and t1 EXCEPT ALL t2 returns max( 0, m-n) copies of R. The EXCEPT operator has the same precedence as UNION. INTERSECT has higher precedence. The implementation is spread across several classes, primarily
ExecutionThe INTERSECT and EXCEPT operations are executed in a similar fashion, by class SetOpResultSet. The two inputs are sorted separately. The sort key consists of all the columns. Then SetOpResultSet scans the two inputs simultaneously. The INTERSECT operation outputs approximately any row from its left input that is also found in its right output. The EXCEPT operation outputs approximately any row from its left input that is not found in its right output. Handling of duplicates complicates the picture a little, which is the reason for the "approximately" caveat. However the scans proceed in a strictly forward direction; there is no need for backtracking.If the left and right inputs have N and M rows respectively the sorts take time O(N*log(N) + M*log(M)). The final scan takes time O(N + M). So the time for the whole operation is O(N*log(N) + M*log(M)). Alternative Execution PlansOther implementations are possible.
The current implementation was chosen because it always provides at least decent speed and memory utilization, and in many, though certainly not all cases, it is the best implementation. We could have provided several implementations and let the optimizer choose the best, but this does not seem worthwhile for operations that are seldom used. Binding, Optimization, and Code GenerationThe IntersectOrExceptNode class handles compilation of both INTERSECT and EXCEPT operations, because they are implemented similarly. We do very little in the way of optimization because we have only implemented one execution plan; the optimizer has nothing to choose from.The INTERSECT and EXCEPT operators are bound much like the UNION operator. The bind methods are all found in super class SetOperatorNode, which is shared with UnionNode. IntersectOrExceptNode generates OrderByNodes for its inputs at the start of optimization, in the preprocess phase. Any column ordering can be used for the sorts, as long as the same one is used for both the left and right inputs. IntersectOrExceptNode tries to be a little clever about picking the column ordering for the sorts. If the INTERSECT or EXCEPT output must be ordered then IntersectOrExceptNode uses the ORDER BY columns as the most significant part of the sort key for its inputs. Any columns not in the ORDER BY list are tacked on to the least significant part of the sort keys of the inputs. This ensures that the output of the INTERSECT or EXCEPT will be properly ordered without an additional sort step. The architecture of the Derby optimizer makes it difficult to do further optimizations. SelectNode processing requires that order by lists be pushed down to them at the start of preprocessing. If an input to INTERSECT or EXCEPT is a SELECT (a common case) then IntersectOrExceptNode has to decide whether it needs its inputs ordered before it calls the preprocess method of its inputs. That means that it must chose its execution plan at the start of the optimization process, not as the result of the optimization process. Code generation is straighforward. It generates code that invokes the ResultSetFactory.getSetOpResultSet method. ParserThe parser, in java/engine/org/apache/derby/impl/sql/compile/sqlgrammar.jj, implements the EXCEPT, INTERSECT, and UNION operators in the queryExpression method.The UNION and EXCEPT operators have the same precedence. The INTERSECT operator has higher precedence, so t1 EXCEPT t2 UNION t3is equivalent to (t1 EXCEPT t2) UNION t3and t1 UNION ALL t2 UNION t3is equivalent to (t1 UNION ALL t2) UNION t3while t1 EXCEPT t2 INTERSECT t3is equivalent to t1 EXCEPT (t2 INTERSECT t3) Note that the EXCEPT operator is not associative, nor is UNION ALL associative with UNION, so the correct associativity is important, even when the query _expression_ uses the same operator. t1 EXCEPT (t2 EXCEPT t3)is not equivalant to (t1 EXCEPT t2) EXCEPT t3 This precedence and associativity is implemented in the structure of queryExpression. The higher precedence of the INTERSECT operator is implemented by building queryExpression out of nonJoinQueryTerm. A nonJoinQueryTerm consists of one or more nonJoinQueryPrimaries connected by INTERSECT operators. A queryExpression consists of one or more nonJoinQueryTerms connected by EXCEPT or UNION operators.
Our parser is a recursive descent parser. Recursive descent parsers want recursion on the
right. Therefore they do not handle SQL's operator associativity naturally. The natural grammar for
queryExpression is queryExpression ::= nonJoinQueryTerm | queryExpression UNION [ALL] nonJoinQueryTerm | queryExpression EXCEPT [ALL] nonJoinQueryTermThis captures the correct associativity of the UNION and EXCEPT operators, but we cannot use it because it has recursion on the left. Therefore our parser uses the following grammar:
We have right recursion, but each queryExpression has to know whether it is a simple _expression_ or
the right hand side of a UNION or INTERSECT operation. The nonJoinQueryTerm method uses a similar
trick to implement SQL standard associativity while using right recursion. INTERSECT associativity matters
when INTERSECT ALL is mixed with plain INTERSECT.
The creation of union, intersect, or except query tree nodes is done in the nonJoinQueryTerm method.
|
|
| <Prev in Thread] | Current Thread | [Next in Thread> |
|---|---|---|
| Previous by Date: | [Announce] Jeremy Boynes joins Derby PPMC, Jean T. Anderson |
|---|---|
| Next by Date: | Re: INTERSECT and EXCEPT Design, Jeremy Boynes |
| Previous by Thread: | Re: INTERSECT and EXCEPT Design, Jeffrey Lichtman |
| Next by Thread: | Re: INTERSECT and EXCEPT Design, Jeremy Boynes |
| Indexes: | [Date] [Thread] [Top] [All Lists] |
Free MagazinesCisco NewsReceive a free quarterly e-newsletter with exclusive articles on how Cisco IT uses its own products and solutions to enable the business. subscribe Systems Management News, the newspaper for IT systems administration and data center managers! Each issue of Systems Management News is chock-full of news and analysis to help you understand what's happening in your field. subscribe The Enterprise Newsweekly eWeek is the essential technology information source for builders of e-business. subscribe Oracle Magazine Oracle Magazine contains technology strategy articles, sample code, tips, Oracle and partner news, how to articles for developers and DBAs, and more. Oracle (NASDAQ: ORCL) is the world's largest enterprise software company. subscribe Total Telecom Total Telecom is "The Economist of the communications industry". subscribe |
Home | sitemap
| advertise | OSDir is
an inevitable website.
|