Proactive Re-optimization

Pedro Bizarro
Joint work with Shivnath Babu and David DeWitt

What is the Problem?
Sometimes database query optimizers choose execution plans that are sub-optimal by orders of magnitude.

How the Problem Arises (1)
Statistics may not be up-to-date.
Statistics may be missing.
Missing statistics are estimated based on:
  other (possibly estimated) statistics
  assumptions (independence, uniformity, etc.)
  default values
Result: errors in estimates.

How the Problem Arises (2)
Errors in the estimated sizes of intermediate tables grow exponentially [IC91].
Cost functions are not smooth.
[Figure: cost as a function of memory]
Small errors may become big errors.
[IC91] Ioannidis and Christodoulakis. On the Propagation of Errors in the Size of Join Results. SIGMOD’91.

Roadmap
Introduction
Re-Optimization (and its Limitations)
Proactive Re-Optimization
Bounding Boxes
Robust and Switchable Plans
Random Samples
Experimental Results
Conclusions

Re-optimization: Current Approach
E.g., [KD98, I+99, MR+04]
Use conventional optimizers.
Add check operators to the plan to:
  check for significant discrepancies between estimated and observed values
  check when the plan becomes sub-optimal
Trigger re-optimization if a check fails.
This is an execute-and-react approach.
[KD98] Kabra and DeWitt. Efficient mid-query re-optimization of sub-optimal query execution plans. SIGMOD’98.
[I+99] Ives, et al. An Adaptive Query Execution System for Data Integration. SIGMOD’99.
[MR+04] Markl, et al. Robust Query Processing through Progressive Optimization. SIGMOD’04.
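The check-operator idea can be illustrated with a minimal sketch. This is not the implementation from [KD98] or [MR+04]; the names `check`, `run_checked`, and `ReoptimizeSignal`, and the use of a simple row-count range, are assumptions made for illustration.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


class ReoptimizeSignal(Exception):
    """Raised when an observed cardinality leaves its expected range."""


def check(child: Iterable, low: float, high: float) -> Iterator:
    """Hypothetical CHECK operator: pass rows through while counting them."""
    count = 0
    for row in child:
        count += 1
        if count > high:
            # More rows than the plan was costed for: stop early.
            raise ReoptimizeSignal(f"saw {count} rows, expected at most {high}")
        yield row
    if count < low:
        # Too few rows can only be known once the input is exhausted.
        raise ReoptimizeSignal(f"saw {count} rows, expected at least {low}")


def run_checked(rows: Iterable, low: float, high: float) -> Tuple[Optional[List], bool]:
    """Return (rows, False) on success, or (None, True) if re-optimization is needed."""
    try:
        return list(check(rows, low, high)), False
    except ReoptimizeSignal:
        return None, True
```

In this sketch the check only reports that the estimate was wrong. It says nothing about which plan would be better, and the rows produced before the signal are thrown away.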

Re-optimization: Current Approach (contd.)
Traditional optimizer chooses plan P.
Re-optimizer chooses the same plan P and adds checks.
[Figure: plan P over R, S, and T with an INLJ, shown twice; the re-optimizer's copy adds a CHECK operator]

Re-optimization’s Main Limitation
Optimizer picks a plan unaware of possible future re-optimizations.
I.e., optimization assumes no re-optimization.
What can go wrong? Can we do better?

Re-optimization Limitations
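Re-optimizing is expensive (could avoid it by using robust plans).
May lose partial work: if execution starts on P1 and re-optimizes to P2, it will repeat the scan on R.
Example query: σ(R) ⋈ S
[Figure: plan trees for P1 and P2; the surviving labels are S, σ(R), and HHJ]
[Figure: cost of P1 and P2 as a function of the size of σ(R), with the estimated size of σ(R) marked]
P1 is risky! P2 is robust.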

Re-optimization Limitations (2)
Limited information is collected at run-time.
Check operators only detect when to re-optimize.
E.g.: it can take too long to find a good plan.
[Figure: four successive plans over σ(R), S, and T, each step labeled Re-optimize: an INLJ plan, then three HHJ plans with different input orders]

Proactive Re-Optimization in a Nutshell
If the DBMS knows re-optimization may happen:
  Try to avoid it! Pick robust (and switchable) plans.
  Plan for it! Collect statistics for future re-optimization.

Building Blocks of Proactive Re-optimization
Use of bounding boxes: intervals around estimates to represent uncertainty.
Use of robust plans and switchable plans:
  Robust plan: close to optimal in the bounding box.
  Switchable plan: a set of plans, each close to optimal in part of the bounding box.
Enhanced run-time statistics collection: to detect sub-optimality faster and to avoid re-optimization thrashing.

Proactive Re-optimization Architecture
Inputs: QUERY and CATALOG.
1. Compute bounding boxes for estimates.
2. Use bounding boxes to pick robust or switchable plans.
3. Execute query; collect accurate statistics estimates.
During execution, compare the run-time estimates with the bounding box. Is the estimate within the bounding box?
  Yes: use the robust or switchable plan.
  No: re-optimize.

Bounding Boxes: Representing Uncertainty
Interval around an estimate is:
  wide if the optimizer is uncertain about the estimate
  narrow if the optimizer is certain about the estimate
Uncertainty is measured from the way the statistic is estimated [KD98], e.g.:
  Histogram -> very certain
  Multiplication of selectivities -> uncertain
  Default guess -> very uncertain
  Etc.
[KD98] Kabra and DeWitt. Efficient mid-query re-optimization of sub-optimal query execution plans. SIGMOD’98.

Bounding Boxes: Representing Uncertainty (contd.)
Query: σ(R) ⋈ S
[Figure: two-dimensional bounding box around the estimate; axes are Estimated |σ(R)| and Estimated |S|, each marked low, estimated, and high]
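A minimal sketch of how an interval might be derived from an uncertainty level follows. The level assigned to each estimation method and the widening factors `down` and `up` are illustrative assumptions, not values stated in the talk.

```python
from dataclasses import dataclass

# Hypothetical uncertainty levels: higher means the optimizer trusts the estimate less.
UNCERTAINTY = {"histogram": 1, "selectivity_product": 3, "default_guess": 6}


@dataclass(frozen=True)
class Interval:
    low: float
    est: float
    high: float

    def contains(self, observed: float) -> bool:
        """True if an observed value falls inside the interval."""
        return self.low <= observed <= self.high


def bounding_interval(est: float, method: str,
                      down: float = 0.1, up: float = 0.2) -> Interval:
    """Widen the interval around `est` in proportion to the uncertainty level.

    `down` and `up` are illustrative widening factors, not values from the talk.
    """
    level = UNCERTAINTY[method]
    low = max(0.0, est * (1 - down * level))
    high = est * (1 + up * level)
    return Interval(low, est, high)
```

With these assumed factors, an estimate of 1000 rows backed by a histogram gets a narrow interval of roughly 900 to 1200, while the same estimate from a default guess gets roughly 400 to 2200. A two-dimensional bounding box is then one such interval per input.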

Bounding Boxes: Plan Costing and Pruning
Costing: computes three costs for each plan tree (2-dimensional bounding box using cardinality estimates from sub-plans):
  CLow: cost at the low point of the bounding box
  CEst: cost at the estimated point
  CHigh: cost at the high point of the bounding box
Pruning: for each join subset and interesting order, find 3 plans:
  BestLow: the plan with lowest CLow
  BestEst: the plan with lowest CEst
  BestHigh: the plan with lowest CHigh
[Figure: an HHJ of σ(R) and S costed at three points of the bounding box over |σ(R)| and |S|, each axis marked low, estimated, and high]
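The costing and pruning steps can be sketched as follows. The cost functions are toy stand-ins, and the names `Box`, `three_costs`, and `seed_plans` are hypothetical; a real optimizer would apply this per join subset and interesting order inside dynamic programming.

```python
from typing import Callable, Dict, NamedTuple, Tuple

Point = Tuple[float, float]           # (|σ(R)|, |S|)
CostFn = Callable[[float, float], float]


class Box(NamedTuple):
    low: Point
    est: Point
    high: Point


def three_costs(cost: CostFn, box: Box) -> Tuple[float, float, float]:
    """Return (CLow, CEst, CHigh) for one plan."""
    return cost(*box.low), cost(*box.est), cost(*box.high)


def seed_plans(plans: Dict[str, CostFn], box: Box) -> Dict[str, str]:
    """Keep the cheapest plan at each of the three costed points."""
    costs = {name: three_costs(fn, box) for name, fn in plans.items()}
    return {
        "BestLow": min(costs, key=lambda n: costs[n][0]),
        "BestEst": min(costs, key=lambda n: costs[n][1]),
        "BestHigh": min(costs, key=lambda n: costs[n][2]),
    }


# Toy cost models (assumptions): P1 is cheap for small inputs but grows fast,
# P2 is more expensive at the estimate but grows slowly.
PLANS = {
    "P1": lambda r, s: 20 * r,
    "P2": lambda r, s: 4 * (r + s),
}
BOX = Box(low=(100, 900), est=(200, 1000), high=(2000, 1100))
SEEDS = seed_plans(PLANS, BOX)
```

With these toy numbers, P1 wins at the low and estimated points and P2 wins at the high point, which is the situation where a single-point optimizer would pick the risky plan.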

Selecting Plans
Query: σ(R) ⋈ S
[Figure: plan trees for P1 and P2; the surviving labels are S, σ(R), and HHJ]
At the end of plan enumeration there are three seed plans. In this example: BestLow = P1, BestEst = P1, BestHigh = P2.
[Figure: cost of P1 and P2 as a function of the size of σ(R), with the bounding box for the σ(R) estimate marked low, estimated, and high]
Four cases:
  The seeds are the same plan.
  One of the seeds is robust (cost within 20% of best in all 3 points of the bounding box).
  A switchable plan can be created from them.
  No single plan, not robust, not switchable.
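The four cases can be written as a small decision procedure. This is a sketch under assumptions: the `can_switch` predicate is a hypothetical placeholder for the switchability test, and falling back to BestEst in the fourth case is an assumption, since the slide does not say what is run in that case.

```python
from typing import Callable, Dict, List, Tuple

Costs = Tuple[float, float, float]    # (CLow, CEst, CHigh)


def is_robust(plan: str, costs: Dict[str, Costs], tolerance: float = 0.20) -> bool:
    """A plan is robust if it is within `tolerance` of the best cost at all 3 points."""
    for point in range(3):
        best = min(c[point] for c in costs.values())
        if costs[plan][point] > best * (1 + tolerance):
            return False
    return True


def select_plan(seeds: Dict[str, str], costs: Dict[str, Costs],
                can_switch: Callable[[List[str]], bool]) -> Tuple[str, List[str]]:
    """Return (case, plans) following the four cases on the slide."""
    distinct = sorted(set(seeds.values()))
    if len(distinct) == 1:
        return "single", distinct             # case 1: all seeds agree
    for plan in distinct:
        if is_robust(plan, costs):
            return "robust", [plan]           # case 2: one seed is robust
    if can_switch(distinct):
        return "switchable", distinct         # case 3: defer the choice to run-time
    return "fallback", [seeds["BestEst"]]     # case 4: assumed fallback, not from the talk
```

For example, with costs `{"P1": (2000, 4000, 40000), "P2": (4000, 4800, 12400)}` neither plan is robust, because P1 is far too expensive at the high point and P2 is too expensive at the low point, so the choice falls through to the switchable case.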

Switchable Plans
Goal: avoid re-optimization but still run the best plan in the bounding box.
How: define switchable plans to allow a late binding decision.
Plans are switchable if they:
  have a different root operator
  have the same sub-plan as one of the inputs to the root
  have the same base table as the other input
[Figure: Hash1 over Scan R and Scan S as the common sub-plan, with alternative roots INLJ, Hash2, and Hash3 over T; e.g., the INLJ uses an Index Seek on T while the hash joins use Scan T]

Switchable Plans (contd.)
Execute (part of) the common sub-plan.
Collect run-time estimates.
Instantiate the best seed plan for those estimates (the late binding decision).
[Figure: the three candidate plans that can be instantiated on top of the common sub-plan Hash1 (Scan R, Scan S): INLJ with Index Seek on T, Hash2 with Scan T, and Hash3 with Scan T]

Implementation of a Switchable Plan
[Figure: a switch operator above a buffer operator; the buffer sits on top of Hash1 (Scan R, Scan S), and the switch chooses among INLJ with Index Seek on T, Hash2 with Scan T, and Hash3 with Scan T]
Buffer operator: buffers tuples until a random sample of tuples is obtained, then computes an estimate and passes it up to the switch operator.
Switch operator: instantiates the correct operator.
Minimal overhead.
Compare with the alternative: avoid re-optimization, just buffer and count.
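A minimal sketch of the buffer and switch operators follows. It assumes the child's output starts with a random sample followed by a punctuation that states the sample fraction; the `Eos` class, the function names, and the per-plan cost functions are hypothetical stand-ins, not the implementation described in the talk.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


@dataclass(frozen=True)
class Eos:
    """End-of-sample punctuation: the rows before it are `fraction` of the output."""
    fraction: float


def buffer_and_estimate(stream: Iterable) -> Tuple[List, float, Iterator]:
    """Buffer rows up to the punctuation and scale the count into an estimate."""
    it = iter(stream)
    buffered: List = []
    for item in it:
        if isinstance(item, Eos):
            return buffered, len(buffered) / item.fraction, it
        buffered.append(item)
    # No punctuation seen: the whole input was buffered, so the count is exact.
    return buffered, float(len(buffered)), it


def switch(stream: Iterable,
           candidates: Dict[str, Callable[[float], float]]) -> Tuple[str, float, List, Iterator]:
    """Pick the candidate with the lowest cost at the run-time estimate."""
    buffered, estimate, rest = buffer_and_estimate(stream)
    chosen = min(candidates, key=lambda name: candidates[name](estimate))
    return chosen, estimate, buffered, rest


# Toy cost models (assumptions) for two of the candidate root operators.
CANDIDATES = {
    "INLJ": lambda n: 50 * n,          # cheap for few outer rows
    "Hash2": lambda n: 1000 + 5 * n,   # fixed build cost, cheap per row
}
```

The chosen operator would then consume the buffered rows followed by the rest of the stream, so no work on the common sub-plan is repeated.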

Observing Statistics at Run-Time
Uses:
  Detect when to re-optimize
  Pick candidate switchable plan
Goals:
  Must be efficient
  Must be quick
  Must be accurate

The Idea: Random Sample Prefix
Prefix the output of operators with a random sample of their entire output.
Normal output without random sample prefix: a b c d e f g h i j
Output with random sample prefix: a e h, then eos(30%), then b c d f g i j
Emit the eos(30%) punctuation to the parent operator.
Propagate sample prefixes bottom up.
Implemented for file scan, indexed scan, nested-loops join, and hash join.

Implementing Random Sample Prefixes
Samples of tables are computed ahead of time: for each table R, there is another table R_sample.
Modified scan operator:
  scan R_sample
  emit eos
  scan R, skipping tuples in R_sample
Modified nested-loops join operator:
  Pass eos from the outer relation.
  True random sample of the join if the outer is the FK side.
See paper for hash join.
[Figure: an eos punctuation passing from the outer input through the NLJ to its output]
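The modified scan and nested-loops join can be sketched with Python generators. This is a sketch under assumptions: tables are modeled as dictionaries keyed by row id, R_sample is modeled as a list of row ids, and `Eos` is a hypothetical stand-in for the eos punctuation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List, Sequence


@dataclass(frozen=True)
class Eos:
    """End-of-sample punctuation carrying the sample fraction."""
    fraction: float


def scan_with_sample_prefix(table: Dict[int, str], sample_ids: List[int]) -> Iterator:
    """Scan R_sample, emit eos, then scan R skipping the rows already emitted."""
    for rid in sample_ids:
        yield table[rid]
    yield Eos(len(sample_ids) / len(table) if table else 1.0)
    sampled = set(sample_ids)
    for rid, row in table.items():
        if rid not in sampled:
            yield row


def nested_loops_join(outer: Iterable, inner: Sequence,
                      pred: Callable[[object, object], bool]) -> Iterator:
    """Join that passes the outer input's eos punctuation through to its output.

    The prefix is a true random sample of the join only if the outer is the
    FK side, since each outer row then matches at most one inner row.
    """
    for o in outer:
        if isinstance(o, Eos):
            yield o
            continue
        for i in inner:
            if pred(o, i):
                yield (o, i)


R = dict(enumerate("abcdefghij"))
R_SAMPLE_IDS = [0, 4, 7]   # rows a, e, h: a 30% sample chosen ahead of time
```

Scanning `R` with `R_SAMPLE_IDS` yields a, e, h, then `Eos(0.3)`, then the remaining seven rows, which matches the example output on the previous slide.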

Experimental Evaluation
Built within the Predator DBMS.
Implemented three optimizers:
  Rio, our proactive re-optimizer
  Reactive re-optimizer (our implementation of [MR+04])
  Traditional dynamic programming optimizer
Synthetic version of the DMV dataset from IBM:
  Correlated attributes
  More details in the paper
[MR+04] Markl, et al. Robust Query Processing through Progressive Optimization. SIGMOD’04.

Using Robust Plans
Query: σ(A) ⋈ C
The estimate for the selection on A is wrong because there are no histograms. The error was varied by having the optimizer use the default value while running selection predicates of different sizes.

Using Switchable Plans
Query: σ(A) ⋈ C

3-way Join
Query: σ(A) ⋈ C ⋈ σ(O)

Query Complexity
Errors due to correlated attributes.

Conclusions
Ever-increasing data, queries, and system:
  Statistics will be uncertain
  Optimizer mistakes will happen
Promising approach: proactive re-optimization
  Bounding boxes
  Robust and switchable plans
  Quick, efficient, accurate run-time stats collection
Future work: improve individual components

Acknowledgements
Jennifer Widom for discussion and feedback.
Guy Lohman and Volker Markl for providing DMV data and workload generator.

Thank you! Questions? Feedback? Check out our demo!

3-way Join: σ(A) ⋈ C ⋈ σ(O)
Assume an error in the estimate for σ(A)!
[Figure: plans over σ(A), C, and σ(O) for three optimizers. Traditional: one HHJ plan, sub-optimal. Re-Optimizer: a sub-optimal HHJ plan that re-optimizes into another sub-optimal HHJ plan. Proactive Re-Opt: a plan with a Switch operator that re-optimizes into the optimal HHJ plan.]
Notes:
  Uses the example from section 6.2 in the paper.
  The optimal plan is P16b, but there is error in the estimates.
  At the estimated point, P16a is chosen.
  The traditional optimizer runs the sub-optimal plan P16a.
  The reactive re-optimizer starts with P16a and detects sub-optimality, but collects insufficient information and picks another sub-optimal plan, P16d. It loses work, and it cannot detect the sub-optimality of P16d.
  Rio starts with a plan with a pair of switchable operators. It quickly detects that the estimate is outside the bounding box. The new, accurate estimate is used to re-optimize into the optimal plan P16b.
