Published by Samantha Franklin. Modified over 9 years ago.
1
EN 600.619: Adv. Storage and TP Systems Cost-Based Query Optimization
2
The Optimization Process
Logical query plan
  – As an expression tree
Rewrite query plan to improve performance
Create physical plan
  – Select algorithms to implement logical plan
3
An Expression Tree
SELECT title, birthdate
FROM MovieStar, StarsIn
WHERE year = 1996 AND gender = 'F' AND starName = name;
4
An Alternate (Better) Logical Plan
SELECT title, birthdate
FROM MovieStar, StarsIn
WHERE year = 1996 AND gender = 'F' AND starName = name;
5
Query Optimization Heuristics
Push operators as far down the plan as possible
Do selections as soon as possible
  – Reduce intermediate result sizes
Select then project
Perform joins as late as possible
  – They are more costly
Group associative and commutative operators
  – Let the physical plan reorder execution
6
Improving the Plan
Through query rewriting
Split the selection
7
Improving the Plan
Through query rewriting
Split the selection
Push the projection
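The selection-splitting rewrite can be sketched in a few lines of Python. This is only an illustration, not the optimizer from the lecture: the relation schemas are assumed (the slides do not list them), and `Pred`, `push_selections`, and `render` are hypothetical helper names. Each conjunct of the WHERE clause is attached to the one relation that supplies all of its attributes; anything that spans both relations stays at the join. The projection is left at the top of the plan.

```python
from dataclasses import dataclass

# Assumed schemas; the slides do not spell these out.
SCHEMAS = {
    "MovieStar": {"name", "address", "gender", "birthdate"},
    "StarsIn": {"title", "year", "starName"},
}

@dataclass(frozen=True)
class Pred:
    text: str         # the conjunct as written, e.g. "year = 1996"
    attrs: frozenset  # attributes the conjunct mentions

def push_selections(relations, conjuncts):
    """Split a conjunctive selection and push each conjunct down.

    Conjuncts covered by exactly one relation are attached to it;
    the rest are kept as join predicates.
    """
    per_relation = {r: [] for r in relations}
    join_preds = []
    for p in conjuncts:
        owners = [r for r in relations if p.attrs <= SCHEMAS[r]]
        if len(owners) == 1:
            per_relation[owners[0]].append(p.text)
        else:
            join_preds.append(p.text)
    return per_relation, join_preds

def render(relations, per_relation, join_preds, projection):
    """Print the rewritten plan as a nested expression."""
    leaves = []
    for r in relations:
        sel = " AND ".join(per_relation[r])
        leaves.append(f"select[{sel}]({r})" if sel else r)
    join = f"join[{' AND '.join(join_preds)}]({', '.join(leaves)})"
    return f"project[{', '.join(projection)}]({join})"

conjuncts = [
    Pred("year = 1996", frozenset({"year"})),
    Pred("gender = 'F'", frozenset({"gender"})),
    Pred("starName = name", frozenset({"starName", "name"})),
]
relations = ["MovieStar", "StarsIn"]
per_relation, join_preds = push_selections(relations, conjuncts)
plan = render(relations, per_relation, join_preds, ["title", "birthdate"])
print(plan)
```

In this sketch both filters end up below the join, so the join only sees female stars and 1996 movies, which is the point of doing selections as early as possible.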
8
Grouping Operators
The physical (not logical) plan should pick the order
9
The Physical Plan
Choose algorithms and estimate result size to generate concrete costs of a plan
E.g. joins
  – Discipline: Hash, Index, Sort
  – Materialize, pipeline, ripple, parallel, etc.
Large literature on different disciplines for all operations
  – Suitable for an entire (albeit detailed) course
Also, how to search for good plans
  – Branch and bound, hill climbing, dynamic programming, etc.
Result size and choice of algorithm are independent
  – For relational algebra operations
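As one illustration of the dynamic-programming search mentioned above, here is a minimal sketch of join-order selection. It rests on several assumptions that are not in the slides: only left-deep plans are considered, the cost of a plan is the sum of the estimated sizes of its join outputs, and the cardinalities and selectivities in the example are made up. The function name `best_left_deep_order` is hypothetical.

```python
from itertools import combinations

def best_left_deep_order(card, sel):
    """Pick a left-deep join order by dynamic programming over subsets.

    card: {relation: tuple count}
    sel:  {frozenset({a, b}): join selectivity}; missing pairs are
          treated as cross products (selectivity 1.0).
    Returns (cost, order), where cost is the sum of join output sizes.
    """
    rels = sorted(card)

    def size(subset):
        # Estimated size of joining every relation in the subset.
        n = 1.0
        for r in subset:
            n *= card[r]
        for a, b in combinations(sorted(subset), 2):
            n *= sel.get(frozenset((a, b)), 1.0)
        return n

    # best[subset] = (cost, order) of the cheapest plan found for it.
    best = {frozenset([r]): (0.0, (r,)) for r in rels}
    for k in range(2, len(rels) + 1):
        for combo in combinations(rels, k):
            s = frozenset(combo)
            candidates = []
            for last in combo:
                # Join the best plan for the other k-1 relations with `last`.
                cost, order = best[s - {last}]
                candidates.append((cost + size(s), order + (last,)))
            best[s] = min(candidates)
    return best[frozenset(rels)]

# Hypothetical statistics, chosen only for illustration.
card = {"R": 1000, "S": 100, "T": 10}
sel = {frozenset(("R", "S")): 0.01, frozenset(("S", "T")): 0.1}
cost, order = best_left_deep_order(card, sel)
print(cost, order)
```

With these numbers the search joins S and T first because that intermediate result is the smallest. Note that the search depends entirely on `size`, which is why the accuracy of size estimation matters so much.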
10
Estimating Result Sizes
Most inaccurate and difficult part of query processing
  – Cost of an operation is a function f(algorithm, size estimate)
  – Given exact size, costing is very accurate
Sometimes sizing can be exact
  – Equality queries for unique attributes return 0 or 1 tuples
  – Joins on key (foreign key) fields
  – Good schema design improves query execution
For many operations it is difficult
  – Joins: expand (cross product) or reduce (more often)
  – Range queries: produce multiple tuples
50% accuracy is considered good... ugh!
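The easy and hard cases can be contrasted with three small estimators. These are the usual textbook formulas under a uniform-distribution assumption, not something the slides state; the function names and the idea of passing in a tuple count T and a distinct-value count V are assumptions made for this sketch.

```python
def est_equality_selection(tuples, distinct_values, unique=False):
    """Estimated rows returned by `attr = constant`."""
    if unique:
        # A unique attribute matches 0 or 1 rows; 1 is the upper bound.
        return 1
    # Uniform assumption: each distinct value is equally frequent.
    return tuples / distinct_values

def est_fk_join(fk_side_tuples):
    """Key / foreign-key join.

    Assumes every foreign key is non-null and references an existing
    key, so each referencing row matches exactly one row.
    """
    return fk_side_tuples

def est_equijoin(t_r, t_s, v_r, v_s):
    """General equijoin R.a = S.a, assuming the smaller set of
    distinct values is contained in the larger one."""
    return t_r * t_s / max(v_r, v_s)

print(est_equality_selection(1000, 50))  # 20.0
print(est_fk_join(5000))                 # 5000
print(est_equijoin(245, 245, 100, 100))  # 600.25
```

The first two cases are exact or nearly so given the schema; the third is only as good as the uniformity assumption behind it.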
11
Problems w/ Estimating Size
Need to know result sizes a priori
  – Know them exactly only after query execution
Techniques need to be lightweight
  – Performing I/O as part of estimation reduces query performance
General approach
  – Statistics on underlying tables for important queries
  – Small, summary data structures (in-memory execution)
Techniques
  – Histograms, sampling, wavelets
12
Histograms
SELECT Jan.day, July.day
FROM Jan, July
WHERE Jan.temp = July.temp
Join estimate per bucket = T1 × T2 / V (tuple product / bucket width)
Estimate: 5×20/10 + 10×5/10 = 15
Better than the estimate without a histogram: 245×245/100 ≈ 600
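The arithmetic on this slide can be checked with a short sketch. It assumes equal-width buckets of 10 temperature values and that only two buckets hold tuples from both relations, with the counts taken from the two terms of the estimate; `histogram_join_estimate` is a hypothetical helper name.

```python
def histogram_join_estimate(bucket_pairs, width):
    """Sum T1 * T2 / V over buckets where both relations have tuples.

    bucket_pairs: (tuples from one relation, tuples from the other)
                  for each shared bucket
    width:        number of distinct values a bucket covers (V)
    """
    return sum(t1 * t2 / width for t1, t2 in bucket_pairs)

# The two overlapping buckets from the slide's estimate.
overlapping = [(5, 20), (10, 5)]
with_histogram = histogram_join_estimate(overlapping, 10)

# Without a histogram: 245 tuples per relation spread over 100 values.
without_histogram = 245 * 245 / 100

print(with_histogram)     # 15.0
print(without_histogram)  # 600.25
```

The two bucket terms contribute 10 and 5, so the histogram-based estimate is 15 tuples, roughly forty times smaller than the estimate that assumes a uniform spread over all 100 values.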
13
On Histograms
Workload defined
  – Keep for important fields. Similar concept to indexes.
Data defined
  – Keep when they improve performance.
  – Don’t need a histogram for the uniform distribution
Complications
  – Update queries invalidate statistics
  – Need to be pre-computed, often prior to witnessing workload
  – Composing histograms (for multiple attributes) leads to inaccuracies
What the world needs is fully incremental histograms that support multi-attribute queries
14
STHoles
Bruno, Chaudhuri, and Gravano. STHoles: A Multidimensional Workload-Aware Histogram, SIGMOD 2001.
Generate histograms from analyzing query results
  – No examination of data sets
  – Leverage workload information and query feedback
Supports overlapped and nested buckets
  – Multi-resolution histogram
  – Buckets allocated where they are most needed, e.g. if there are no queries to a region, no statistics are kept
15
Feedback-Based Optimization
16
Visualizing Histograms
17
Histogram Construction
Start with an empty histogram
New queries punch ‘holes’ in the histogram, creating regions of refinement
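The idea of punching holes can be sketched with nested rectangular buckets. This is a heavily simplified illustration, not the STHoles algorithm itself: it assumes each hole lies fully inside its parent and does not overlap sibling holes (the real algorithm shrinks candidates to ensure this), it skips merging, and the root bucket starts with an assumed known total of 1000 tuples rather than empty so that the before and after estimates are visible. `Bucket`, `estimate`, and `drill_hole` are hypothetical names.

```python
from dataclasses import dataclass, field

@dataclass
class Bucket:
    lo: tuple    # lower corner of the bounding box
    hi: tuple    # upper corner of the bounding box
    freq: float  # tuples inside the box but outside its child holes
    children: list = field(default_factory=list)

def box_volume(lo, hi):
    v = 1.0
    for a, b in zip(lo, hi):
        v *= max(0.0, b - a)  # empty in any dimension gives volume 0
    return v

def overlap(lo1, hi1, lo2, hi2):
    """Volume of the intersection of two boxes."""
    lo = tuple(max(a, b) for a, b in zip(lo1, lo2))
    hi = tuple(min(a, b) for a, b in zip(hi1, hi2))
    return box_volume(lo, hi)

def estimate(bucket, q_lo, q_hi):
    """Estimated tuples in the query box, assuming tuples are spread
    uniformly over each bucket's own region (box minus holes)."""
    own_vol = box_volume(bucket.lo, bucket.hi)
    own_hit = overlap(bucket.lo, bucket.hi, q_lo, q_hi)
    total = 0.0
    for c in bucket.children:
        own_vol -= box_volume(c.lo, c.hi)
        own_hit -= overlap(c.lo, c.hi, q_lo, q_hi)
        total += estimate(c, q_lo, q_hi)
    if own_vol > 0:
        total += bucket.freq * own_hit / own_vol
    return total

def drill_hole(parent, lo, hi, observed):
    """Record query feedback: `observed` tuples were seen in [lo, hi].

    Simplification: the box is assumed to lie inside `parent` and to
    be disjoint from its existing holes.
    """
    parent.freq = max(0.0, parent.freq - observed)
    parent.children.append(Bucket(lo, hi, observed))

root = Bucket((0, 0), (100, 100), 1000)          # assumed starting state
before = estimate(root, (0, 0), (50, 50))        # uniform guess
drill_hole(root, (0, 0), (50, 50), observed=600)  # feedback from a query
after = estimate(root, (0, 0), (50, 50))
elsewhere = estimate(root, (50, 50), (100, 100))
print(before, after, elsewhere)
```

Before the feedback the quadrant is estimated at a quarter of the table; afterwards the hole reports the observed count exactly, and the remaining tuples are spread over the rest of the root bucket.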
18
Policies
Identify and drill candidate holes
19
Policies
Shrink regions to preserve rectangular spaces
  – Ease of description and improved accuracy
20
Policies
Merge buckets (with similar densities) to improve histogram under a space budget
21
STHoles Redux
Quality histograms
Runtime overhead (<10%)
  – Dynamic construction of histograms
  – But, no pre-processing
Preferable in several situations
  – Frequently updated data, where the histogram must follow a changing distribution
  – Shifting workloads: STHoles can redirect attention to new regions dynamically. (This is what’s cool.)