A Framework for Testing Query Transformation Rules Hicham Elmongui Purdue University Vivek Narasayya, Ravi Ramamurthy Microsoft Research 4/10/2019 ACM SIGMOD 2009
Query Optimizer
- Component of the database system responsible for producing a good execution plan for a given SQL query
- Crucial for decision support queries
Query Optimizer Components
- Search Strategy
- Rule Engine (applies rules)
- Cost Model
- Cardinality Estimation
[Figure: a query enters the Query Optimizer, which outputs an execution plan]
Query Transformation Rules
- Logical rule, e.g. Join Associativity
- Implementation rule, e.g. Join To Hash Join
- Search space is extensible by adding new rules: Group By, De-correlation, Star Join, etc.
- Modern optimizers have a large number of rules
[Figure: join trees over R and S illustrating the two rule applications]
Implementing Rule Engine is Non-Trivial
- Original query: SELECT D.Name FROM DEPT D WHERE D.Budget <= (SELECT COUNT(E.Eno)*10000 FROM EMP E WHERE E.Dno = D.Dno)
- Incorrect rewrite: SELECT D.Name FROM DEPT D, EMP E WHERE D.Dno = E.Dno GROUP BY D.Name HAVING D.Budget <= COUNT(E.Eno)*10000
- Count Bug in De-correlation: a department with no employees can qualify in the original query (COUNT = 0) but never appears in the join
- Rewrite rules can be subtle
- Implementation errors can lead to incorrect results
- RAGS paper (VLDB’98): 4 DBMSs disagreed on query results 16% of the time!
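The Count Bug can be reproduced on a tiny database. The sketch below uses Python's built-in sqlite3 module; the table contents are hypothetical, and the rewritten query also groups by D.Budget so that it is valid in stricter SQL dialects. It is meant only to illustrate the discrepancy, not to reflect how any particular optimizer performs de-correlation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DEPT (Dno INTEGER PRIMARY KEY, Name TEXT, Budget INTEGER);
    CREATE TABLE EMP  (Eno INTEGER PRIMARY KEY, Dno INTEGER);
    -- Hypothetical data: 'Empty' has no employees and a zero budget.
    INSERT INTO DEPT VALUES (1, 'Sales', 10000), (2, 'Empty', 0);
    INSERT INTO EMP  VALUES (100, 1), (101, 1);
""")

# Original query: the correlated subquery yields COUNT = 0 for 'Empty'.
correlated = """
    SELECT D.Name FROM DEPT D
    WHERE D.Budget <= (SELECT COUNT(E.Eno) * 10000
                       FROM EMP E WHERE E.Dno = D.Dno)
    ORDER BY D.Name
"""

# Naive de-correlation: the inner join drops departments without employees.
decorrelated = """
    SELECT D.Name FROM DEPT D, EMP E
    WHERE D.Dno = E.Dno
    GROUP BY D.Name, D.Budget
    HAVING D.Budget <= COUNT(E.Eno) * 10000
    ORDER BY D.Name
"""

original_rows = [row[0] for row in conn.execute(correlated)]
rewritten_rows = [row[0] for row in conn.execute(decorrelated)]
print(original_rows)   # includes 'Empty'
print(rewritten_rows)  # 'Empty' is missing
```

The two result lists differ only in the department that has no employees, which is exactly the case the rewrite loses.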
Testing Optimizer Rule Engine
- Coverage: is a given rule (or set of rules) exercised?
- Correctness: does exercising a rule (or set of rules) change the query results?
- Performance: how does a rule (or set of rules) affect query performance?
Rule Coverage
- API to track which rules are exercised for a given query
- Definitions of when a rule is exercised
  - Rule must generate at least one expression during optimization
  - At least one expression in the final plan must be generated by the rule
[Figure: for each query Q1 … Qm, the subset of transformation rules 1 … n that it exercises]
Testing Rule Coverage
- Generate a query such that each rule is exercised
  - Hard to precisely characterize when a rule will be exercised
  - Depends on rule semantics, optimizer heuristics, etc.
- Extend to a set of rules (e.g. rule pairs)
  - Large space of combinations
- Efficient query generation
  - Time required to generate a query that exercises a rule should be as small as possible
  - Need multiple queries per rule (or set of rules)
  - Random query generation can be inefficient
Rule Correctness
- Optimize query Q to obtain plan P and the set of transformation rules exercised; execute P to obtain results R
- Disable an exercised rule (e.g. r2), optimize Q again to obtain plan P΄, and execute P΄ to obtain results R΄
- R ≠ R΄ indicates a bug
Testing Rule Correctness
- Optimize query Q to obtain plan P and the transformation rules exercised
- Disable each exercised rule in turn (r2, r3, rn-1) to obtain plans P2, P3, Pn-1
- For each rule, repeat for multiple such queries (k)
- Need to execute only if P ≠ P΄
  - Queries are usually complex
  - Equivalence of plans P and P΄ cannot be inferred in most cases
- Time consuming
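The procedure for one query can be sketched as a loop over the exercised rules. The `optimize` and `execute` callables and the toy plan and result tables below are hypothetical stand-ins for a real optimizer interface; the sketch assumes that plans can be compared for equality and that results are compared as multisets.

```python
def find_suspect_rules(query, exercised_rules, optimize, execute):
    """Return the rules whose disabling changes the results of `query`."""
    baseline_plan = optimize(query, frozenset())
    baseline_result = None  # executed lazily, only if some plan differs
    suspects = []
    for rule in exercised_rules:
        plan = optimize(query, frozenset([rule]))
        if plan == baseline_plan:
            continue  # identical plan, so execution would tell us nothing
        if baseline_result is None:
            baseline_result = sorted(execute(baseline_plan))
        if sorted(execute(plan)) != baseline_result:
            suspects.append(rule)  # R != R' indicates a bug
    return suspects


# Toy stand-ins (hypothetical): plan chosen for each set of disabled rules.
PLANS = {
    frozenset(): "HashJoin(R,S)",
    frozenset({"r1"}): "HashJoin(R,S)",    # disabling r1 does not change the plan
    frozenset({"r2"}): "MergeJoin(R,S)",
    frozenset({"r3"}): "NestedLoop(R,S)",
}
RESULTS = {
    "HashJoin(R,S)": [(1,), (2,)],
    "MergeJoin(R,S)": [(1,)],              # differs from the baseline
    "NestedLoop(R,S)": [(2,), (1,)],       # same rows, different order
}
executed = []


def toy_optimize(query, disabled):
    return PLANS[disabled]


def toy_execute(plan):
    executed.append(plan)
    return RESULTS[plan]


suspects = find_suspect_rules("Q", ["r1", "r2", "r3"], toy_optimize, toy_execute)
print(suspects, len(executed))
```

Skipping execution when the plan is unchanged is the only saving available per query here, which is why the number of (query, rule) executions dominates the testing time.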
DBMS Testing
- Data Generation
  - Quickly generating Billion-Record databases (SIGMOD’94)
  - Flexible Database Generators (VLDB’05)
  - Reverse Query Processing (ICDE’07)
  - MUDD: A Multi-dimensional data generator (WOSP’04)
- Query Generation
  - RAGS (VLDB’98)
  - Generating Thousand Benchmark Queries in Seconds (VLDB’04)
  - Genetic approach (VLDB’07)
  - Unit testing query transformation rules (DBTest’08)
  - Generating queries with cardinality constraints (TKDE’06, SIGMOD’08)
Query Generation for Rule Testing
- RAGS (VLDB’98)
  - Stochastic SQL statement generation
  - Control SQL generated via configuration parameters: #Joins, #columns in Group-By, max sub-query depth, …
- Genetic approach (VLDB’07)
  - Queries are mutated, combined, etc. to generate new queries
  - Feedback function applied on each query to determine “fitness”, e.g. prefer queries with non-empty results
Our Contributions
- Query generation
  - Exploit “rule patterns” to identify necessary conditions for a rule to be exercised
  - Significantly reduces the number of trials compared to previous approaches
- Correctness validation
  - Novel problem of test suite compression
  - Significantly reduces time for correctness testing
  - Shown to be NP-Hard
  - Principled solution (factor 2 approximation)
QREL Framework
- QREL (DBTest’08): programming framework for generating queries
- Generate a logical query tree from a tree “pattern”
- Generate SQL from a given logical query tree
Architecture
[Figure: framework architecture]
Rule Patterns
- Rule = (Rule Name, Rule Pattern, Substitution)
- Given an input expression e: if e matches the Rule Pattern, generate a new expression by invoking the Substitution function on e
[Figure: applying a rule to an expression over R, S, and T; rule pattern for Join Commutativity]
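A rule as a (name, pattern, substitution) triple can be sketched as follows. The nested-tuple expression encoding, the "*" wildcard, and the `matches` helper are assumptions made for illustration; they are not the representation used by any particular optimizer.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Union

Expr = Union[str, tuple]  # leaf = table name; inner node = (operator, child, ...)
WILDCARD = "*"            # assumed notation: matches any sub-expression


def matches(pattern: Expr, expr: Expr) -> bool:
    """True if `expr` has the shape described by `pattern`."""
    if pattern == WILDCARD:
        return True
    if isinstance(pattern, str) or isinstance(expr, str):
        return pattern == expr
    return (
        len(pattern) == len(expr)
        and pattern[0] == expr[0]  # same operator
        and all(matches(p, e) for p, e in zip(pattern[1:], expr[1:]))
    )


@dataclass(frozen=True)
class Rule:
    name: str
    pattern: Expr
    substitution: Callable[[Expr], Expr]

    def apply(self, expr: Expr) -> Optional[Expr]:
        """Invoke the substitution only when the pattern matches."""
        return self.substitution(expr) if matches(self.pattern, expr) else None


join_commutativity = Rule(
    name="Join Commutativity",
    pattern=("Join", WILDCARD, WILDCARD),
    substitution=lambda e: ("Join", e[2], e[1]),  # swap the two inputs
)

swapped = join_commutativity.apply(("Join", "R", "S"))
not_applicable = join_commutativity.apply(("GroupBy", "R"))
print(swapped, not_applicable)
```

Because the pattern is only a necessary condition, a real rule may still decline to fire on a matching expression; the sketch ignores such extra checks.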
Exposing Rule Patterns
- Idea: the optimizer exposes a Rule Pattern for a given rule
- Returns (a subset of) the necessary conditions for the rule to be exercised
- Encoded using XML in our implementation
[Figure: the Query Generation Tool asks the Query Optimizer in the DBMS for the pattern of a rule, e.g. “Join Commutativity”]
Rule Interactions
- Bugs in the implementation of one rule may manifest only when another rule is also applied
- Example: the “Get to Index Scan” rule turns Get S into an Index Scan, and the “Join to Merge Join” rule turns the join of Get R and Get S on R.a = S.b into a Merge Join
[Figure: two plans, Merge Join of Get R with Index Scan I (a, d) and Merge Join of Get R with Index Scan I (d, a), both with join predicate R.a = S.b]
Rule Composition
- Combine rule patterns by replacing a wildcard node with the other rule pattern
- Other kinds of composition are possible as well
[Figure: rule pattern for pulling Group-By above Join and rule pattern for Join Commutativity, each with wildcard nodes, and the composed patterns]
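Composition by wildcard replacement can be sketched as a small tree rewrite. The nested-tuple encoding, the "*" wildcard, and the exact shape chosen for the Group-By pull-up pattern are assumptions for illustration; the figure's actual patterns are not recoverable from the slide text.

```python
WILDCARD = "*"  # assumed notation for a wildcard node


def compose(outer, inner):
    """Return every pattern obtained by replacing exactly one wildcard
    node of `outer` with the whole pattern `inner`."""
    if outer == WILDCARD:
        return [inner]
    if isinstance(outer, str):
        return []  # a concrete leaf has nothing to replace
    composed = []
    for i in range(1, len(outer)):  # index 0 holds the operator name
        for replaced_child in compose(outer[i], inner):
            composed.append(outer[:i] + (replaced_child,) + outer[i + 1:])
    return composed


# Assumed shapes of the two rule patterns.
gb_pull_up = ("Join", ("GroupBy", WILDCARD), WILDCARD)
join_commutativity = ("Join", WILDCARD, WILDCARD)

compositions = compose(gb_pull_up, join_commutativity)
for pattern in compositions:
    print(pattern)
```

Each wildcard in the outer pattern yields one candidate composition, so a pair of rules generally has several compositions to choose from.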
Query Generation Algorithm
- For each rule pair (r1, r2)
  - Select a composition of rule patterns
  - T = Generate logical query tree for the composed rule pattern
  - S = Generate SQL statement for T // use QREL
  - Repeat if r1 and r2 are not exercised when S is optimized
[Figure: logical query tree with Group-By nodes over tables T1, T2, T3 and the generated SQL: SELECT T3.a, … FROM T1, T2, T3 WHERE … GROUP BY T3.a, …]
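The generate-and-check loop for one rule pair can be sketched as below. All callables are hypothetical stand-ins: `fill_wildcards` plays the role of logical tree generation, `repr` stands in for QREL's SQL generation, and `toy_rules_exercised` fakes the optimizer's rule-tracking API. The trial limit is also an assumption, added so the loop always terminates.

```python
import random

WILDCARD = "*"  # assumed notation for a wildcard node


def generate_query_for_pair(r1, r2, compositions, make_tree, to_sql,
                            rules_exercised, rng, max_trials=100):
    """Retry until the optimizer reports that both r1 and r2 were exercised."""
    for trial in range(1, max_trials + 1):
        pattern = rng.choice(compositions)       # select a composition
        tree = make_tree(pattern, rng)           # T = logical query tree
        statement = to_sql(tree)                 # S = SQL statement for T
        if {r1, r2} <= rules_exercised(statement):
            return statement, trial
    return None, max_trials


def fill_wildcards(pattern, rng):
    """Toy tree generator: put a random table at every wildcard."""
    if pattern == WILDCARD:
        return rng.choice(["T1", "T2", "T3"])
    if isinstance(pattern, str):
        return pattern
    return (pattern[0],) + tuple(fill_wildcards(p, rng) for p in pattern[1:])


def toy_rules_exercised(statement):
    """Toy optimizer feedback: pretend the pull-up rule needs a Group-By."""
    rules = {"JoinCommutativity"}
    if "GroupBy" in statement:
        rules.add("GroupByPullUp")
    return rules


compositions = [
    ("Join", WILDCARD, WILDCARD),
    ("Join", ("GroupBy", ("Join", WILDCARD, WILDCARD)), WILDCARD),
]
statement, trials = generate_query_for_pair(
    "GroupByPullUp", "JoinCommutativity", compositions,
    fill_wildcards, repr, toy_rules_exercised, random.Random(0),
)
print(statement, trials)
```

The number of trials returned is the quantity that rule patterns are meant to reduce compared with purely random generation.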
Experiments
- Number of trials is significantly lower using Rule Patterns
- 12x reduction in number of trials for rule pairs
Test Suite Compression
- Bipartite graph of rules (r1, r2, r3) and queries (Q1, Q2, Q3)
  - Node costs: Q1 = 100, Q2 = 150, Q3 = 300
  - Edge costs shown in the figure: 110, 130, 160, 500, 400
- Baseline Cost = 100 + 150 + 300 + 110 + 160 + 400 = 1220
- Find a sub-graph of the bipartite graph such that
  - Each rule is selected
  - Degree of each rule node is equal to the test suite size (k)
  - Sum of the edge costs is minimized
- Problem is NP-Hard (reduction from the Set Cover problem)
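The baseline figure can be checked with a small cost function. The sketch assumes that a node cost is the cost of running a query with all rules enabled (paid once per selected query) and that an edge cost is the cost of running that query with one rule disabled. The pairing of edges to rules (r1 with Q1 at 110, r2 with Q2 at 160, r3 with Q3 at 400) is inferred from the baseline sum, not stated explicitly in the slide text.

```python
def suite_cost(node_cost, edge_cost, edges):
    """Cost of a test suite: each selected query once, plus every selected edge."""
    queries = {query for _, query in edges}
    return (sum(node_cost[q] for q in queries)
            + sum(edge_cost[e] for e in edges))


node_cost = {"Q1": 100, "Q2": 150, "Q3": 300}
# Assumed edge assignment, keyed by (rule, query).
edge_cost = {("r1", "Q1"): 110, ("r2", "Q2"): 160, ("r3", "Q3"): 400}

# Baseline: one dedicated query per rule (k = 1).
baseline_edges = [("r1", "Q1"), ("r2", "Q2"), ("r3", "Q3")]
baseline = suite_cost(node_cost, edge_cost, baseline_edges)
print(baseline)
```

Sharing one query across several rules saves node cost, which is the opportunity that compression exploits.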
Set Cover Heuristic
- Benefit(Q) = Number of new rules exercised / Cost(Q)
- Greedily add the query with the largest “Benefit”
- Add the edges corresponding to Q
- Example: Benefit(Q1) = 3/100, Benefit(Q2) = 1/150, Benefit(Q3) = 1/300
- Total Solution Cost = 100 + 110 + 130 + 500 = 840
- Key drawback: ignores edge costs
  - Turning off a rule can significantly increase plan cost
[Figure: bipartite graph of rules r1, r2, r3 and queries Q1, Q2, Q3 with node costs 100, 150, 300 and edge costs 110, 130, 160, 500, 400]
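A minimal sketch of the greedy heuristic for k = 1 follows. The assignment of edges to (rule, query) pairs is an assumption inferred from the stated benefits and the total: Q1 is taken to exercise all three rules, Q2 only r2, and Q3 only r3.

```python
def greedy_set_cover(node_cost, edge_cost):
    """Pick queries by (new rules covered) / (node cost); edge costs are ignored
    when choosing, but still paid in the total."""
    uncovered = {rule for rule, _ in edge_cost}
    chosen_queries, chosen_edges = [], []

    def newly_covered(query):
        return sorted(r for (r, q) in edge_cost if q == query and r in uncovered)

    while uncovered:
        best = max(node_cost, key=lambda q: len(newly_covered(q)) / node_cost[q])
        new_rules = newly_covered(best)
        if not new_rules:
            raise ValueError("some rule is not exercised by any query")
        chosen_queries.append(best)
        chosen_edges.extend((rule, best) for rule in new_rules)
        uncovered -= set(new_rules)

    total = (sum(node_cost[q] for q in chosen_queries)
             + sum(edge_cost[e] for e in chosen_edges))
    return chosen_queries, chosen_edges, total


node_cost = {"Q1": 100, "Q2": 150, "Q3": 300}
# Assumed edge assignment, keyed by (rule, query).
edge_cost = {
    ("r1", "Q1"): 110, ("r2", "Q1"): 130, ("r3", "Q1"): 500,
    ("r2", "Q2"): 160, ("r3", "Q3"): 400,
}

queries, edges, total = greedy_set_cover(node_cost, edge_cost)
print(queries, total)
```

Under this assumed graph the heuristic selects only Q1 and pays the expensive r3 edge (500), which illustrates the stated drawback of ignoring edge costs.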
Top K Independent Algorithm
- For each rule r, add the k edges with the lowest cost
- Factor 2 approximation of the optimal
- Ignores node cost
- Total solution cost = 100 + 150 + 110 + 130 + 160 = 650
- In practice much better than the alternatives
[Figure: bipartite graph of rules r1, r2, r3 and queries Q1, Q2, Q3 with node and edge costs]
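The per-rule selection can be sketched as follows. The instance is hypothetical (rules ra and rb, queries Qa, Qb, Qc, with k = 2) and is not the graph from the slide, whose edge assignment is not recoverable from the text. The total includes node costs even though the selection ignores them.

```python
import heapq


def top_k_independent(node_cost, edge_cost, k):
    """For each rule independently, keep its k cheapest edges."""
    edges_by_rule = {}
    for (rule, query), cost in edge_cost.items():
        edges_by_rule.setdefault(rule, []).append((cost, query))

    chosen = []
    for rule, candidates in sorted(edges_by_rule.items()):
        if len(candidates) < k:
            raise ValueError(f"rule {rule} has fewer than {k} queries")
        for _, query in heapq.nsmallest(k, candidates):
            chosen.append((rule, query))

    selected_queries = {query for _, query in chosen}
    total = (sum(node_cost[q] for q in selected_queries)
             + sum(edge_cost[e] for e in chosen))
    return chosen, total


# Hypothetical instance, keyed by (rule, query).
node_cost = {"Qa": 8, "Qb": 9, "Qc": 50}
edge_cost = {
    ("ra", "Qa"): 10, ("ra", "Qb"): 20, ("ra", "Qc"): 30,
    ("rb", "Qa"): 25, ("rb", "Qb"): 15, ("rb", "Qc"): 5,
}

chosen, total = top_k_independent(node_cost, edge_cost, k=2)
print(chosen, total)
```

In this instance rule rb pulls in Qc because its edge is cheap, even though Qc has the highest node cost; that is the sense in which the algorithm ignores node cost.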
Experiments
- Top K Independent is significantly better
- Even better for the case of rule pairs
- Further optimizations and experiments in the paper
Conclusion
- Testing the query optimizer rule engine is important
- Query generation for rule testing
  - Significant gains by exploiting rule patterns
- Correctness validation
  - Dramatic reductions possible using test suite compression
- Many open problems in rule testing
  - Other variants of “rule exercising”
  - Other kinds of rule interactions
  - Data generation to ensure other necessary conditions (e.g. the star join optimization rule requires an FK relationship)