Semantic Query Optimization Techniques November 16, 2005 By : Mladen Kovacevic
Background 1980's, semantic information stored in dbs as integrity constraints could be used for query optimization semantic: “of or relating to meaning or the study of meaning” ( integrity : preserve data consistency when changes made in db. no extensive implementation existing today (1999)
Introduction Key factor in relational database system’s improvement in query execution time, is query optimization. Query execution can be improved by: –Analyzing integrity information, and rewriting queries exploiting this information (JE & PI) –Avoid expensive sorting costs (Order Optimization) –Exploiting uniqueness by knowing rows will be unique, thus, avoiding extra sorts. (EU)
Presentation Overview Semantic Query Optimization techniques –Join Elimination (JE) –Predicate Introduction (PI) –Order Optimization (OO) –Exploiting Uniqueness (EU)
Some Motivation Describing two techniques in SQO, demonstrated in DB2 UDB. –Predicate Introduction –Join Elimination Reasons: –rewriting queries by hand showed that these two provided consistent optimization. –practical to implement –extendible to other DBMS’s. Data sets used : TPC-D and APB-1 OLAP benchmarks only REFERENTIAL INTEGRITY constraints and CHECK CONSTRAINTS used!
Semantic Query Optimization (SQO) Techniques Join Elimination: Some joins need NOT be evaluated since the result may be known apriori (more on this later) Join Introduction: Adding a join can help if relation is small compared to original relations and highly selective. Predicate Elimination : If predicate known to be always true, can be eliminated from query (DISTINCT clause on Primary Key – Uniqueness exploitation!) Predicate Introduction: New predicates on indexed attributes can result in a much faster access plan. Detecting the Empty Answer Set : If query predicates inconsistent with integrity constraints, the query does not have answer.
Why SQO implementations not used? Deductive Databases : Many cases SQO techniques were designed for deductive databases, thus not appearing to be useful in relational database context. CPU & I/O Speeds similar : When being developed, CPU & I/O speeds were not as dramatically different –(savings in I/O not worth the CPU time added) Lack of Integrity Constraints : Thought that many integrity constraints are needed for SQO to be useful
Two-stage Optimizer Examples of SQO techniques always designed for a two-stage optimizer –Stage 1 : logically equivalent queries created (DB2’s query rewrite optimization) –Stage 2 : generate plans of all these queries, choosing the one with lowest estimated cost. (DB2’s query plan optimization) Join order, join methods, join site in a distributed database, method for accessing input table, etc.
Join Elimination ·Simple : Eliminate relation where join is over tables related through referential integrity constraint, and primary key table referenced only in the join VIEW DEFINITION CREATE VIEW Supplier_Info (n, a, c) as SELECT s_name, s_address, n_name FROM tpcd.supplier, tpcd.nation WHERE s_nationkey = n_nationkey QUERY SELECT s_n, s_a FROM Supplier_Info
Join Elimination (con’t) ·Query can be rewritten internally as: SELECT s_n, s_a FROM tpcd.supplier Why do such a simple rewrite? User may not have access to the supplier table, and/or may only know about the view. Sometimes GUI managers create these “dumb” queries so need to optimize Non-programmers write queries often, and may not even think about this. Algorithm for generic redundant join removal provided in paper.
Example – Join Elimination SELECT p_name, p_retailprice, s_name, s_address FROM tpcd.lineitem, tpcd.partsupp, tpcd.part, tpcd.supplier WHERE p_partkey = ps_partkey and s_suppkey = ps_suppkey and ps_partkey = l_partkey and ps_suppkey = l_suppkey and l_shipdate between ' ' and ' ' and l_discount >= 0.1 GROUP BY p_name, p_retailprice, s_name, s_address ORDER BY p_name, s_name PART PARTKEY SUPPLIE R SUPPKEY PARTSUP P PARTKEY SUPPKEY LINEITEM PARTKEY SUPPKEY 1 – many relationship
Example : Join Elimination Any immediate improvements that can be seen here? p_partkey = ps_partkey and s_suppkey = ps_suppkey and ps_partkey = l_partkey and ps_suppkey = l_suppkey P_PARTKEYPS_PARTKEYL_PARTKEY S_SUPPKEYPS_SUPPKEYL_SUPPKEY P_PARTKEY = PS_PARTKEY PS_PARTKEY = L_PARTKEY S_SUPPKEY = PS_SUPPKEY PS_SUPPKEY = L_SUPPKEY S_SUPPKEY = L_SUPPKEY PS_PARTKEY = L_PARTKEY
Results 100 MB db size Execution Time : 58.5 sec -> sec (35 % improvement) I/O Cost: > 1498 page reads (67 % improvement)
Results – OLAP Environment In OLAP (online analytical processing) servers, using a star schema (one fact table, with several dimension tables) improvements ranged from 2% to 96 %. –In these cases, much improvement came from CPU cost instead of I/O, because dimension tables were small enough to fit into memory...
Predicate Introduction Techniques discussed : –Index Introduction : add new predicate on attribute if index exists on that attribute. –Assumption : index retrieval is better than table scan, is this always good? –Scan Reduction : reduce number of tuples that qualify for a join. –Problem : Not very common; unlikely that there will be any check constraints or predicates with inequalities about join columns Detecting empty query answer set (not shown as query execution time essentially 0)
Example - Predicate Introduction SELECT sum(l_extendedprice * l_discount) as revenue FROM tpcd.lineitem WHERE l_shipdate >= date(‘ ’) and l_shipdate < date(‘ ’)+ 1 year and l_discount between.06 – 0.01 and and l_quantity < 24; Check Constraint : l_shipdate <= l_receiptdate Index : l_receiptdate Maintaining semantics, we can add : –l_receiptdate >= date(‘ ’)
Example - Predicate Introduction SELECT sum(l_extendedprice * l_discount) as revenue FROM tpcd.lineitem WHERE l_shipdate >= date(‘ ’) and l_shipdate < date(‘ ’)+ 1 year and l_receiptdate >= date(‘ ’) and l_discount between.06 – 0.01 and and l_quantity < 24; Check Constraint : l_shipdate <= l_receiptdate Index : l_receiptdate Maintaining semantics, we can add : –l_receiptdate >= date(‘ ’) Why would we want to do this? In order to have optimizer choose a plan using the index. Is this always good? NO! What if most of the rows in the table need to be returned? We should use a tablescan instead.
Predicate Introduction - Algorithm Input : set of all check constraints defined for a database and the set of all predicates in query Output: set of all non-redundant formulas derivable from the input set. This answer set can then be added to the query, but only a few are potentially useful. The goal in the paper was to choose additions that would guarantee improvement. Conditions in paper: Conservative approach of introducing predicates that will have the plan optimizer use an index. Insist on only one index available with the query predicate.
Predicate Introduction - Results
Why?
Why Longer Execution for P3/P5? P2 and P3 are the same except for the following P2 : SELECT... FROM... WHERE l_shipdate >= date (' ') and l_shipdate < date (' ') + 1 month P3 : SELECT... FROM... WHERE l_shipdate >= date (' ') and l_shipdate < date (' ') + 1 month Difference in table shows that P2 has 2 % of the tuples falling in the range while P3 has 48 % of the tuples fall in the category : BOTH plans will choose index scan! P3 is so large that tablescan is better in this case. 1.Cost model underestimates cost of locking/unlocking index pages 2.Estimated number of tuples goes down because of the reduction factor problem (multiply in the new predicate added)
Adjustments for Reduction Factor Problem Add new predicate only when it contains a major column of an index and a scan of that index is sufficient to answer the query (thus, no table scan necessary) Original Index : New Index :
Order Optimization Techniques Access plan strategies exploit the physical orderings provided either by indexes or sorting GOAL: optimize the sorting strategy Techniques –Pushing down sorts in joins –Minimizing the number of sorting columns –Detecting when sorting can be avoided because of predicates, keys or indexes Order Optimization : detecting when indexes provide an interesting order, so that sorting can be either avoided, and used as sparingly as possible. Interesting Orders : when the side effect of a join produces rows in sorted order, which can be taken advantage of later (if another join needed, ORDER BY, GROUP BY, DISTINCT)
Fundamental Operators Order optimization requires the following operations –Reduce Order –Test Order –Cover Order –Homogenize Order
Order Optimization Results
Exploiting Uniqueness Checking to see if query contains unnecessary DISTINCT clauses –How does this make improvements? Removing duplicates is performed by SORTING, a costly operation. Example is removing DISTINCT keyword from query if it is applied onto the primary key itself (since primary keys are, by definition, distinct)
How to exploit uniqueness? Using knowledge about: –Keys –Table Constraints –Query Predicates Cannot always be tested efficiently, so we look for a sufficient solution.
Summary Important Outcome : experimental evidence showing SQO can provide effective enhancement to the traditional query optimization. –Join Elimination : geared towards OLAP environment (where very useful) –Independent on existence of complex integrity constraint – semantic reasoning used about referential integrity constraints –Easy to implement and execute –Predicate Introduction : guaranteeing improvements more difficult, needing rather severe restrictions imposed (limits the applicability of this approach) –Order Optimization : utilizing functional dependencies and table information, we use it in creating a “smart” access plan, avoiding or optimizing sort operations. –Exploiting Uniqueness : uniqueness is powerful when it reduces the number of expensive sorts. Discovering true ways of exploiting this technique are quite tricky and specific.
References Qi Cheng, Jarek Gryz, Fred Koo, et al: Implementation of Two Semantic Query Optimization Techniques in DB2 Universal Database. Proceedings of the 25 th VLDB Conference, Edinburg, Scotland,1999. David E. Simmen, Eugene J. Shekita, Timothy Malkemus: Fundamental Techniques for Order Optimization. SIGMOD Conference 1996: G. N. Paulley, Per-ke Larson: Exploiting Uniqueness in Query Optimization. ICDE 1994: 68-79
The End.