EN 600.619: Adv. Storage and TP Systems Cost-Based Query Optimization.

The Optimization Process
Logical query plan
–As an expression tree
Rewrite query plan to improve performance
Create physical plan
–Select algorithms to implement logical plan

An Expression Tree
SELECT title, birthdate
FROM MovieStar, StarsIn
WHERE year = 1996 AND gender = 'F' AND starName = name;

An Alternate (Better) Logical Plan
SELECT title, birthdate
FROM MovieStar, StarsIn
WHERE year = 1996 AND gender = 'F' AND starName = name;

Query Optimization Heuristics
Push operators as far down the plan as possible
Do selections as soon as possible
–Reduce intermediate result sizes
Select then project
Perform joins as late as possible
–They are more costly
Group associative and commutative operators
–Let the physical plan reorder execution

Improving the Plan
Through query rewriting
Split the selection

Improving the Plan
Through query rewriting
Split the selection
Push the projection
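
The "split the selection" step can be sketched in a few lines of Python. This is only an illustration under assumptions: the `SCHEMA` mapping, the `Pred` class, and the function `split_and_push` are hypothetical names, and the attribute-to-relation assignment is a guess at the schema behind the example query, not something the slides state.

```python
from dataclasses import dataclass

# Assumed schema (hypothetical): which attributes each relation supplies.
SCHEMA = {
    "MovieStar": {"name", "birthdate", "gender"},
    "StarsIn": {"title", "year", "starName"},
}

@dataclass(frozen=True)
class Pred:
    text: str          # the conjunct as written, e.g. "year = 1996"
    attrs: frozenset   # attributes the conjunct mentions

def split_and_push(conjuncts, schema):
    """Split a conjunctive selection and push each conjunct down to the
    single relation that supplies all of its attributes. Conjuncts that
    no single relation can evaluate remain as join conditions."""
    pushed = {rel: [] for rel in schema}
    join_conds = []
    for p in conjuncts:
        owners = [rel for rel, cols in schema.items() if p.attrs <= cols]
        if len(owners) == 1:
            pushed[owners[0]].append(p.text)
        else:
            join_conds.append(p.text)
    return pushed, join_conds

# The WHERE clause of the example query, one Pred per conjunct.
where = [
    Pred("year = 1996", frozenset({"year"})),
    Pred("gender = 'F'", frozenset({"gender"})),
    Pred("starName = name", frozenset({"starName", "name"})),
]
pushed, join_conds = split_and_push(where, SCHEMA)
```

Under these assumptions, the two single-relation conjuncts end up below the join as selections on `StarsIn` and `MovieStar`, and only `starName = name` stays at the join.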

Grouping Operators
The physical (not logical) plan should pick the order

The Physical Plan
Choose algorithms and estimate result size to generate concrete costs of a plan
E.g. joins
–Discipline: Hash, Index, Sort
–Materialize, pipeline, ripple, parallel, etc.
Large literature on different disciplines for all operations
–Suitable for an entire (albeit detailed) course
Also, how to search for good plans
–Branch and bound, hill climbing, dynamic programming, etc.
Result size and choice of algorithm are independent
–For relational algebra operations
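
As one illustration of the plan-search bullet, here is a minimal dynamic-programming sketch over left-deep join orders. It is not taken from the slides: the function name `best_join_order`, the cost model (sum of estimated intermediate result sizes), and the cardinalities and selectivities in the example are all assumptions chosen for illustration.

```python
from itertools import combinations

def best_join_order(card, sel):
    """Dynamic programming over subsets of relations (left-deep plans only).
    card: {relation: row count}
    sel:  {frozenset({a, b}): join selectivity}; missing pairs are cross products.
    Assumed cost model: sum of estimated sizes of all intermediate results."""
    rels = sorted(card)

    def size(subset):
        # Estimated rows: product of cardinalities times applicable selectivities.
        rows = 1.0
        for r in sorted(subset):
            rows *= card[r]
        for a, b in combinations(sorted(subset), 2):
            rows *= sel.get(frozenset((a, b)), 1.0)
        return rows

    # best[subset] = (cost, join order); a single relation costs nothing.
    best = {frozenset([r]): (0.0, (r,)) for r in rels}
    for k in range(2, len(rels) + 1):
        for combo in combinations(rels, k):
            subset = frozenset(combo)
            candidates = []
            for last in combo:  # the relation joined last
                cost, order = best[subset - {last}]
                candidates.append((cost + size(subset), order + (last,)))
            best[subset] = min(candidates)
    return best[frozenset(rels)]

# Hypothetical statistics for three relations.
card = {"A": 1000, "B": 100, "C": 10}
sel = {frozenset(("A", "B")): 0.01, frozenset(("B", "C")): 0.1}
cost, order = best_join_order(card, sel)
```

With these made-up numbers the search joins `B` and `C` first, because that pair has the smallest estimated intermediate result. The outcome depends entirely on the size estimates fed in.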

Estimating Result Sizes
Most inaccurate and difficult part of query processing
–Cost of an operation is f(algorithm, size estimate)
–Given exact size, costing is very accurate
Sometimes sizing can be exact
–Equality queries for unique attributes are 0/1
–Joins on key (foreign key) fields
–Good schema design improves query execution
For many operations it is difficult
–Joins: expand (cross product) or reduce (more often)
–Range queries: produce multiple tuples
50% accuracy is considered good... ugh!
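
To make "cost is f(algorithm, size estimate)" concrete, the sketch below uses two common textbook I/O approximations. The formulas, the function name `join_io_cost`, and the block counts are assumptions for illustration and do not come from the slides.

```python
import math

def join_io_cost(algorithm, b_r, b_s, mem_blocks):
    """Estimated disk I/Os for joining R and S, with sizes given in blocks.
    Standard textbook approximations (assumed, not from the slides)."""
    if algorithm == "block_nested_loop":
        # Load the smaller input in chunks of (M - 1) blocks and scan the
        # larger input once per chunk.
        small, large = min(b_r, b_s), max(b_r, b_s)
        return small + math.ceil(small / (mem_blocks - 1)) * large
    if algorithm == "two_pass_hash":
        # Read both inputs, write the partitions, read the partitions back.
        return 3 * (b_r + b_s)
    raise ValueError(f"unknown algorithm: {algorithm}")

# Hypothetical case: S is estimated at 50 blocks but is actually 500.
estimated = {a: join_io_cost(a, 1000, 50, 101)
             for a in ("block_nested_loop", "two_pass_hash")}
actual = {a: join_io_cost(a, 1000, 500, 101)
          for a in ("block_nested_loop", "two_pass_hash")}
```

In this made-up case the estimate favors the nested-loop join, while the true sizes favor the hash join. That is how a size misestimate leads to a worse plan even when the cost formulas themselves are accurate.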

Problems w/ Estimating Size
Need to know result sizes a priori
–Know them exactly only after query execution
Techniques need to be lightweight
–Performing I/O as part of estimation reduces query performance
General approach
–Statistics on underlying tables for important queries
–Small, summary data structures (in-memory execution)
Techniques
–Histograms, sampling, wavelets

Histograms
SELECT Jan.day, July.day
FROM Jan, July
WHERE Jan.temp = July.temp
Join estimate = $T_1 T_2 / V$ (tuple product / bucket width)
Estimate: 5x20/10 + 10x5/10 = 15
Better than est. w/out histogram: 245x245/100 ≈ 600
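
The arithmetic on this slide can be checked with a short sketch. The per-bucket counts (5, 20, 10, 5), the bucket width of 10, and the totals of 245 tuples over 100 values come from the slide. The bucket labels, the assignment of counts to Jan versus July, and the function name `join_estimate` are assumptions.

```python
def join_estimate(hist_r, hist_s, width):
    """Equi-width histogram join estimate. Within each shared bucket, values
    are assumed uniformly spread, so matches are about T1 * T2 / V, where V
    is the bucket width."""
    shared = hist_r.keys() & hist_s.keys()
    return sum(hist_r[b] * hist_s[b] / width for b in shared)

# Only buckets populated in both relations contribute to the join.
# Labels and the Jan/July assignment are assumed for illustration.
jan = {"40-49": 10, "50-59": 5}
july = {"40-49": 5, "50-59": 20}

with_hist = join_estimate(jan, july, width=10)  # 10*5/10 + 5*20/10
without_hist = 245 * 245 / 100                  # totals only, 100 values
```

The histogram-based estimate is far smaller than the one from totals alone because it accounts for the two temperature distributions barely overlapping.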

On Histograms
Workload defined
–Keep for important fields. Similar concept to indexes.
Data defined
–Keep when they improve performance.
–Don't need a histogram for the uniform distribution
Complications
–Update queries invalidate statistics
–Need to be pre-computed, often prior to witnessing workload
–Composing histograms (for multiple attributes) leads to inaccuracies
What the world needs is fully incremental histograms that support multi-attribute queries

STHoles
Bruno, Chaudhuri, and Gravano. STHoles: A Multidimensional Workload-Aware Histogram, SIGMOD 2001.
Generate histograms from analyzing query results
–No examination of data sets
–Leverage workload information and query feedback
Supports overlapped and nested buckets
–Multi-resolution histogram
–Buckets allocated where they are most needed, e.g. if there are no queries to a region, no statistics are kept

Feedback-Based Optimization

Visualizing Histograms

Histogram Construction
Start with an empty histogram
New queries punch 'holes' in the histogram, creating regions of refinement
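
The idea of punching holes can be sketched with nested rectangular buckets. This is a heavily simplified illustration, not the algorithm from the STHoles paper. It assumes the query box lies entirely inside one bucket and does not touch existing holes, and the names `Bucket`, `estimate`, and `drill_hole` are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Bucket:
    lo: tuple       # lower corner of the bounding box
    hi: tuple       # upper corner of the bounding box
    freq: float     # tuples in this bucket, excluding its holes
    children: list = field(default_factory=list)

def box_volume(lo, hi):
    v = 1.0
    for a, b in zip(lo, hi):
        v *= max(0.0, b - a)
    return v

def overlap_volume(b, qlo, qhi):
    """Volume of the intersection between bucket b's box and the query box."""
    lo = tuple(max(x, y) for x, y in zip(b.lo, qlo))
    hi = tuple(min(x, y) for x, y in zip(b.hi, qhi))
    return box_volume(lo, hi)

def estimate(b, qlo, qhi):
    """Estimated tuples in the query box, assuming uniformity inside each
    bucket's own region (its box minus its holes)."""
    inside = overlap_volume(b, qlo, qhi)
    if inside == 0.0:
        return 0.0
    in_holes = sum(overlap_volume(c, qlo, qhi) for c in b.children)
    own = box_volume(b.lo, b.hi) - sum(box_volume(c.lo, c.hi) for c in b.children)
    own_part = b.freq * (inside - in_holes) / own if own > 0 else 0.0
    return own_part + sum(estimate(c, qlo, qhi) for c in b.children)

def drill_hole(parent, qlo, qhi, observed):
    """Record query feedback: the box [qlo, qhi] really held `observed` tuples.
    Simplifying assumption: the box is inside parent and clear of other holes."""
    hole = Bucket(qlo, qhi, float(observed))
    parent.freq = max(0.0, parent.freq - observed)
    parent.children.append(hole)
    return hole

# Made-up example: 100 tuples over a 10x10 domain, initially assumed uniform.
root = Bucket((0, 0), (10, 10), 100.0)
before = estimate(root, (0, 0), (5, 5))    # uniformity assumption
drill_hole(root, (0, 0), (5, 5), observed=60)
after = estimate(root, (0, 0), (5, 5))     # now reflects the feedback
elsewhere = estimate(root, (5, 5), (10, 10))
```

After the hole is drilled, the refined region reports the observed count, and the remaining 40 tuples are spread over the rest of the root bucket.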

Policies
Identify and drill candidate holes

Policies
Shrink regions to preserve rectangular spaces
–Ease of description and improved accuracy

Policies
Merge buckets (with similar densities) to improve histogram under a space budget

STHoles Redux
Quality histograms
Runtime overhead (<10%)
–Dynamic construction of histograms
–But, no pre-processing
Preferable in several situations
–Frequently updated data, where the distribution needs to change
–Shifting workloads: STHoles can redirect attention to new regions dynamically. (This is what's cool.)
