CS 440 Database Management Systems Query Optimization 1.



DBMS Architecture
[Figure: DBMS architecture diagram. Inputs: User/Web Forms/Applications/DBA issuing queries and transactions. Query path: Query Parser, Query Rewriter, Query Optimizer, Query Executor. Storage path: Files & Access Methods, Buffer Manager, Storage Manager, Storage. Transaction components: Transaction Manager, Logging & Recovery, Lock Manager. Main memory holds Buffers and Lock Tables. Legend: past lectures, today's lecture.]

Many query plans to execute a SQL query
Compute the join of R(A,B), S(B,C), T(C,D), U(D,E)
[Figure: alternative join trees over R, S, T, U]
Even more plans: multiple algorithms to execute each operation
[Figure: a join tree over R, S, T, U annotated with operator choices: sort-merge join, hash join, table scan, index scan]

Query optimization: picking the fastest plan
Optimal approach
– enumerate each possible plan
– measure its performance by running it
– pick the fastest one
– What's wrong with this?
Rule-based optimization
– use a set of predefined rules to generate a fast plan
– e.g., if there is an index over a table, use it for scan and join

Definitions
Statistics on table R:
– T(R): number of tuples in R
– B(R): number of blocks in R; B(R) = T(R) / block size, with block size measured in tuples per block
– V(R,A): number of distinct values of attribute A in R

Review: Clustered index
The relation is stored on disk in the order of the index.
[Figure: INDEX entries pointing to DATA blocks stored in the same key order]

Plans to select tuples from R: σ_{A=a}(R)
We have a clustered index on R
Plans:
– (clustered) index-based scan
– table scan (sequential access)
Statistics on R
– B(R) = 5000, T(R) = 200,000
– V(R,A) = 2, one value appears in 95% of tuples
Clustered index scan vs. table scan?
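A rough worked comparison of the two plans, as a sketch only. It assumes a simple cost model that is not stated on the slide: a table scan reads all B(R) blocks, and a clustered index scan reads about the matching fraction of B(R) blocks, ignoring the cost of traversing the index. The function names are illustrative.

```python
def table_scan_cost(blocks: int) -> int:
    """A sequential scan reads every block of the relation."""
    return blocks


def clustered_index_scan_cost(blocks: int, matching_percent: int) -> int:
    """Assumed model: matching tuples are contiguous on disk, so the scan
    reads about matching_percent % of the blocks (rounded up).
    Index traversal cost is ignored."""
    return (blocks * matching_percent + 99) // 100


B_R = 5000  # B(R) from the slide

full_scan = table_scan_cost(B_R)                     # 5000 blocks
frequent = clustered_index_scan_cost(B_R, 95)        # value held by 95% of tuples
rare = clustered_index_scan_cost(B_R, 5)             # value held by 5% of tuples
uniform_guess = clustered_index_scan_cost(B_R, 50)   # 1 / V(R,A) = 1/2

print(full_scan, frequent, rare, uniform_guess)
```

Under this model the index scan reads 4750 blocks for the frequent value (barely better than the 5000-block table scan) and 250 blocks for the rare one, while a uniform guess of 1/V(R,A) predicts 2500 in both cases. A fixed rule such as "always use the index" cannot tell these situations apart.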

Query optimization methods
Rule-based optimizer fails
– it uses static rules
– the rules do not consider the distribution of the data
Cost-based optimization
– predict the cost of each plan
– search the plan space to find the fastest one
– do it efficiently: optimization itself should be fast!

Cost-based optimization
Plan space
– which plans to consider?
– it is time-consuming to explore all alternatives
Cost estimator
– how to estimate the cost of each plan without executing it?
– we would like accurate estimates
Search algorithm
– how to search the plan space fast?
– we would like to avoid checking inefficient plans

Space of query plans
Selection
– algorithms: sequential, index-based
– ordering: why does it matter?
Join
– algorithms: nested loop, sort-merge, hash
– ordering
Ordering/Grouping
– can an "interesting order" be produced by join/selection?
– algorithms: sorting, hash-based

Reducing plan space
Multiple logical query plans for each SQL query
Star(name, birthdate), StarsIn(movie, name, year)
SELECT movie FROM Star, StarsIn WHERE Star.name = StarsIn.name AND year = 1950
Plan 1: π_{movie}(σ_{year=1950}(StarsIn ⋈_{StarsIn.name = Star.name} Star))
Plan 2: π_{movie}(σ_{year=1950}(StarsIn) ⋈_{StarsIn.name = Star.name} Star)
The plan that applies the selection before the join is generally faster.

Reducing plan space
Push selection down to reduce # of rows
Push projection down to reduce # of columns
– less effective than pushing down selection
SELECT movie, name FROM Star, StarsIn WHERE Star.name = StarsIn.name
[Figure: two plans for StarsIn ⋈_{StarsIn.name = Star.name} Star, one projecting on movie, name after the join and one with the projection pushed below the join]

Reducing plan space
The algorithm requires exponential computation!
System-R style considers only left-deep joins
[Figure: join trees over R, S, T, U, including a left-deep tree]
Left-deep trees allow us to generate all fully pipelined plans
– intermediate results are not written to temporary files
– not all left-deep trees are fully pipelined (e.g., sort-merge join)
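To get a feel for how much the left-deep restriction shrinks the space, here is a small counting sketch. It relies on two standard combinatorial facts rather than anything stated in the slides: there are n! left-deep join orders for n relations, and n! times the (n-1)-th Catalan number ordered binary join trees overall (left and right inputs distinguished). Choices of join algorithm and access path are ignored.

```python
from math import comb, factorial


def left_deep_orders(n: int) -> int:
    """Number of left-deep join orders: one per permutation of the relations."""
    return factorial(n)


def all_join_trees(n: int) -> int:
    """Number of ordered binary join trees (bushy trees included):
    Catalan(n - 1) tree shapes times n! ways to label the leaves."""
    catalan = comb(2 * (n - 1), n - 1) // n
    return factorial(n) * catalan


for n in (4, 6, 10):
    print(n, left_deep_orders(n), all_join_trees(n))
```

For the four relations R, S, T, U this gives 24 left-deep orders against 120 trees in total. Both counts still grow exponentially with n.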

Reducing plan space
System-R style avoids plans with Cartesian products
– a Cartesian product is generally larger than a (natural) join
Example: R(A,B), S(B,C), U(C,D)
– (R ⋈ U) ⋈ S has a Cartesian product
– pick (R ⋈ S) ⋈ U instead
If Cartesian products cannot be avoided, delay them.
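A minimal sketch of how such plans could be filtered out, assuming natural joins and left-deep orders: a step is a Cartesian product when the next relation shares no attribute with the relations joined so far. The helper name and the representation of schemas as strings of attribute letters are illustrative choices.

```python
from itertools import permutations


def has_cartesian_product(order, schemas):
    """Return True if some step of the left-deep join order joins a relation
    that shares no attribute with the relations already joined."""
    seen = set(schemas[order[0]])
    for name in order[1:]:
        attrs = set(schemas[name])
        if not (seen & attrs):
            return True  # no common attribute: this step is a Cartesian product
        seen |= attrs
    return False


# Schemas from the slide: R(A,B), S(B,C), U(C,D)
schemas = {"R": "AB", "S": "BC", "U": "CD"}

safe_orders = [o for o in permutations(schemas)
               if not has_cartesian_product(o, schemas)]
print(safe_orders)
```

Of the six left-deep orders over R, S, U, the two that start by combining R and U are rejected, which leaves four candidates.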

Cost estimation
Relative accuracy
– goal is to compare plans, not to predict exact cost
– more of an art than an exact science
Each operator: input size, cost, output size
– estimate cost based on input size
    example: the cost of a sort-merge join R ⋈ S is 3 B(R) + 3 B(S)
– estimate output size (for the next operator) or selectivity
    selectivity: ratio of output size to input size
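The two per-operator quantities can be written as tiny functions. This is only a sketch: the block counts and tuple counts used below are made-up inputs for illustration.

```python
def sort_merge_join_cost(blocks_r: int, blocks_s: int) -> int:
    """I/O cost of a sort-merge join as given on the slide: 3 B(R) + 3 B(S)."""
    return 3 * blocks_r + 3 * blocks_s


def selectivity(output_tuples: int, input_tuples: int) -> float:
    """Selectivity: ratio of output size to input size."""
    return output_tuples / input_tuples


# Hypothetical inputs: B(R) = 5000 blocks, B(S) = 1000 blocks
cost = sort_merge_join_cost(5000, 1000)

# Hypothetical operator that keeps 2,500 of 10,000 input tuples
sel = selectivity(2_500, 10_000)
print(cost, sel)
```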

Cost estimation: Selinger style
Input: statistics on each table
– T(R): number of tuples in R
– B(R): number of blocks in R; B(R) = T(R) / block size, with block size measured in tuples per block
– V(R,A): number of distinct values of attribute A in R
Assumptions on attribute and predicate independence
When no estimate is available, use magic numbers.
New alternative approach
– histograms of the data

Selectivity factors: selection
Point selection: S = σ_{A=a}(R)
– T(S) ranges from 0 to T(R) – V(R,A) + 1
– consider its mean: F = 1 / V(R,A)
Range selection: S = σ_{A<a}(R)
– F = (a – min(A)) / (max(A) – min(A))
– non-arithmetic inequality: use magic number F = 1 / 3
Range selection: S = σ_{b<A<a}(R)
– F = (a – b) / (max(A) – min(A))
– if not arithmetic, use magic number F = 1 / 4

Selectivity factors: selection
Set selection: column IN (set of values)
– F: union of point selections
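The selection estimates can be sketched as plain functions. The assumptions here are a uniform distribution of values between min(A) and max(A), the magic numbers 1/3 and 1/4 when the column is not arithmetic, and, for an IN list, adding up the point selectivities of the listed values (capped at 1). For A < a the range fraction used is (a – min(A)) / (max(A) – min(A)). All function names are illustrative.

```python
def clamp(f: float) -> float:
    """Keep a selectivity factor inside [0, 1]."""
    return max(0.0, min(1.0, f))


def point_selectivity(distinct_values: int) -> float:
    """A = a: on average one of the V(R,A) distinct values matches."""
    return 1 / distinct_values


def less_than_selectivity(a, lo, hi, arithmetic=True) -> float:
    """A < a: fraction of the value range [lo, hi] that lies below a."""
    if not arithmetic or hi == lo:
        return 1 / 3  # magic number
    return clamp((a - lo) / (hi - lo))


def between_selectivity(b, a, lo, hi, arithmetic=True) -> float:
    """b < A < a: fraction of the value range covered by (b, a)."""
    if not arithmetic or hi == lo:
        return 1 / 4  # magic number
    return clamp((a - b) / (hi - lo))


def in_list_selectivity(num_values: int, distinct_values: int) -> float:
    """A IN (v1, ..., vk): union of k disjoint point selections."""
    return clamp(num_values * point_selectivity(distinct_values))


# Hypothetical column with values between 0 and 100 and 50 distinct values
print(point_selectivity(50))
print(less_than_selectivity(25, 0, 100))
print(between_selectivity(20, 70, 0, 100))
print(in_list_selectivity(3, 50))
```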

Selectivity factors: selection
S = σ_{A=1 AND B<10}(R)
– multiply 1/V(R,A) for the equality and 1/3 for the inequality
– T(R) = 10,000, V(R,A) = 50
– T(S) = 10,000 / (50 * 3) ≈ 66
S = σ_{A=1 OR B<10}(R)
– sum of the estimates of the predicates minus their product
– T(R) = 10,000, V(R,A) = 50
– T(S) = 200 + 3333 – 66 = 3467
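The arithmetic on this slide can be checked with a few lines, assuming independent predicates with F = 1/V(R,A) = 1/50 for the equality and the magic number 1/3 for the inequality. The function names are illustrative.

```python
def and_selectivity(f1: float, f2: float) -> float:
    """Conjunction of independent predicates: multiply the factors."""
    return f1 * f2


def or_selectivity(f1: float, f2: float) -> float:
    """Disjunction: sum of the factors minus their product."""
    return f1 + f2 - f1 * f2


T_R = 10_000
f_eq = 1 / 50   # A = 1, with V(R,A) = 50
f_lt = 1 / 3    # B < 10, magic number

and_estimate = T_R * and_selectivity(f_eq, f_lt)  # about 66.7 tuples
or_estimate = T_R * or_selectivity(f_eq, f_lt)    # about 3466.7 tuples
print(round(and_estimate, 1), round(or_estimate, 1))
```

The slide's figures of 66 and 3467 are these values with the individual terms rounded to whole tuples.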

Selectivity factors: join predicates
Containment of values assumption
– if V(S,A) <= V(R,A), the A values in S are a subset of the A values in R
Let's assume V(S,A) <= V(R,A)
– each tuple t in S joins x tuple(s) in R
– consider its mean: x = T(R) / V(R,A)
– T(R ⋈_A S) = T(S) * T(R) / V(R,A)
In general: T(R ⋈_A S) = T(R) * T(S) / max(V(R,A), V(S,A))
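As a sketch, the general join-size formula is a one-liner. The statistics in the example call are invented for illustration.

```python
def join_size(t_r: int, t_s: int, v_r: int, v_s: int) -> float:
    """Estimated T(R join_A S) = T(R) * T(S) / max(V(R,A), V(S,A)),
    under the containment-of-values assumption."""
    return t_r * t_s / max(v_r, v_s)


# Hypothetical statistics: T(R) = 10,000, T(S) = 2,000, V(R,A) = 50, V(S,A) = 20
estimate = join_size(10_000, 2_000, 50, 20)
print(estimate)
```

With these inputs each of the 2,000 tuples of S is expected to match 10,000 / 50 = 200 tuples of R, which gives 400,000 result tuples.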

Search the plan space
Baseline: exhaustive search
– enumerate all combinations and compare their costs
– enormous space!
[Figure: alternative join trees over R, S, T, U]
Search method parameters
– plan tree development
    construction: bottom-up, top-down
    modification: improve a somehow-connected tree
– algorithms
    heuristic selection: make choices based on heuristics
    hill climbing: find "nearby" plans with the lowest cost
    dynamic programming: construction by greedy selection

Plan search: System-R style
A.k.a. Selinger-style optimization
Bottom-up
– start from the base relations (in FROM)
– work up the tree to form a plan
– compute the cost of larger plans based on their sub-trees
Dynamic programming
– greedily discard sub-trees that are costly (useless), keeping the cheapest plan for each set of relations

Dynamic programming
Step 1: For each {Ri}:
– size({Ri}) = TCARD(Ri)
– plan({Ri}) = Ri
– cost({Ri}) = cost of access to Ri, e.g., TCARD(Ri) if there is no index on Ri
Step 2: For each {Ri, Rj}:
– size({Ri,Rj}) = estimate of the size of the join
– plan({Ri,Rj}) = join algorithm
– cost = cost function of the sizes of Ri and Rj: the number of I/O accesses of the chosen join algorithm
– plan({Ri,Rj}): the join algorithm with the smallest cost

Dynamic programming
Step i: For each S ⊆ {R1, …, Rn} of cardinality i do:
– compute size(S)
– for every S1, S2 s.t. S = S1 ∪ S2: c = cost(S1) + cost(S2) + cost(S1 ⋈ S2)
– cost(S) = the smallest c
– plan(S) = the plan for cost(S)
Return plan({R1, …, Rn})

Dynamic programming: example
Let's assume that the cost of each join is the size of its intermediate results.
– to simplify the example
– other cost measures, such as the number of I/O accesses, are possible
cost(R) = 0 (no intermediate results)
cost(R ⋈ S) = 0 (no intermediate results)
cost((R ⋈ S) ⋈ T) = cost(R ⋈ S) + cost(T) + size(R ⋈ S) = size(R ⋈ S)

Dynamic programming: example
Relations: R, S, T, U
Number of tuples: 2000, 5000, 3000, 1000
We use a toy size estimation method:
– size(A ⋈ B) = 0.01 * T(A) * T(B)

Query   Size    Cost    Plan
RS
RT
RU
ST
SU
TU
RST
RSU
RTU
STU
RSTU

Query   Size    Cost    Plan
RS      100k    0       RS
RT      60k     0       RT
RU      20k     0       UR
ST      150k    0       TS
SU      50k     0       US
TU      30k     0       UT
RST
RSU
RTU
STU
RSTU

Query   Size    Cost    Plan
RS      100k    0       RS
RT      60k     0       RT
RU      20k     0       UR
ST      150k    0       TS
SU      50k     0       US
TU      30k     0       UT
RST     3M      60k     S(RT)
RSU     1M      20k     S(UR)
RTU     0.6M    20k     T(UR)
STU     1.5M    30k     S(UT)
RSTU

Query   Size    Cost    Plan
RS      100k    0       RS
RT      60k     0       RT
RU      20k     0       UR
ST      150k    0       TS
SU      50k     0       US
TU      30k     0       UT
RST     3M      60k     S(RT)
RSU     1M      20k     S(UR)
RTU     0.6M    20k     T(UR)
STU     1.5M    30k     S(UT)
RSTU    30M     110k    (US)(RT)
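The table can be reproduced with a short dynamic-programming sketch. It makes the same simplifications as the example: sizes follow the toy rule size(A ⋈ B) = 0.01 * T(A) * T(B), base relations cost nothing, and a join is charged the sizes of its non-base inputs. Like the final row of the table, it considers every split of a set into two parts, so bushy plans are allowed. The names `CARD`, `size`, `intermediate` and `optimize` are illustrative.

```python
from itertools import combinations

CARD = {"R": 2000, "S": 5000, "T": 3000, "U": 1000}


def size(rels: frozenset) -> int:
    """Toy size estimate: product of the cardinalities, scaled by 0.01 per join."""
    product = 1
    for r in rels:
        product *= CARD[r]
    return product // 100 ** (len(rels) - 1)


def intermediate(rels: frozenset) -> int:
    """Size charged for a join input: base relations are free."""
    return size(rels) if len(rels) > 1 else 0


def optimize(names):
    """Return {set of relations: (cheapest cost, plan)} for every subset."""
    best = {frozenset([n]): (0, n) for n in names}
    for k in range(2, len(names) + 1):
        for combo in combinations(names, k):
            s = frozenset(combo)
            first, rest = combo[0], combo[1:]
            candidates = []
            # Keep `first` on the left so each split {S1, S2} is tried once;
            # taking fewer than len(rest) extras keeps the right side non-empty.
            for j in range(len(rest)):
                for extra in combinations(rest, j):
                    left = frozenset((first,) + extra)
                    right = s - left
                    cost = (best[left][0] + best[right][0]
                            + intermediate(left) + intermediate(right))
                    plan = f"({best[left][1]} {best[right][1]})"
                    candidates.append((cost, plan))
            best[s] = min(candidates)  # keep only the cheapest plan for s
    return best


best = optimize("RSTU")
for rels, (cost, plan) in best.items():
    print("".join(sorted(rels)), size(rels), cost, plan)
```

The cheapest plan found for all four relations joins R with T and S with U first, at cost 110,000, matching the last row of the table.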

Plan search: all operations
Base relation access
– find all plans for accessing each base relation
– push down selections and projections
– choose good plans, discard bad ones: keep the cheapest plan for unordered output and for each interesting order
Join ordering
– use bottom-up dynamic programming
– consider only left-deep join trees: n! orderings for n tables
– postpone Cartesian products
Finally: grouping/ordering
– use interesting orders
– additional sorting

Nested subqueries
Subqueries are optimized separately
Correlation: order of evaluation
– uncorrelated queries: nested subqueries do not reference outer subqueries
    evaluate the most deeply nested subquery first
– correlated queries: nested subqueries reference the outer subqueries
SELECT name FROM employee X WHERE salary > (SELECT salary FROM employee WHERE employee_num = X.manager)

Nested subqueries – cont.
Correlated queries: nested subqueries reference the outer subqueries
SELECT name FROM employee X WHERE salary > (SELECT salary FROM employee WHERE employee_num = X.manager)
The nested subquery is evaluated once for each tuple in the outer query.
If there is a small number of distinct values of the referenced attribute (here X.manager) in the outer relation, it is worth sorting the outer tuples on it.
– reduces the number of evaluations of the nested query
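A minimal sketch of why sorting helps, written in plain Python rather than inside a database engine. The outer tuples are sorted on the referenced attribute, and the subquery is re-evaluated only when that value changes. The sample rows, the dictionary standing in for the inner lookup, and the function name are all hypothetical.

```python
def earns_more_than_manager(employees):
    """Evaluate the correlated query, re-running the subquery only when
    the manager value changes between consecutive (sorted) outer tuples."""
    # Stand-in for: SELECT salary FROM employee WHERE employee_num = X.manager
    salary_of = {e["employee_num"]: e["salary"] for e in employees}

    result, evaluations = [], 0
    last_manager, last_salary = object(), None  # sentinel: nothing evaluated yet
    for emp in sorted(employees, key=lambda e: e["manager"]):
        if emp["manager"] != last_manager:
            last_manager = emp["manager"]
            last_salary = salary_of.get(last_manager)  # one subquery evaluation
            evaluations += 1
        # A missing manager behaves like NULL: the comparison is not satisfied
        if last_salary is not None and emp["salary"] > last_salary:
            result.append(emp["name"])
    return result, evaluations


# Hypothetical data; manager 0 means "no manager"
employees = [
    {"name": "Ann", "employee_num": 1, "salary": 100, "manager": 0},
    {"name": "Bob", "employee_num": 2, "salary": 120, "manager": 1},
    {"name": "Cy", "employee_num": 3, "salary": 90, "manager": 1},
    {"name": "Dee", "employee_num": 4, "salary": 130, "manager": 2},
    {"name": "Eve", "employee_num": 5, "salary": 95, "manager": 2},
]

names, evaluations = earns_more_than_manager(employees)
print(names, evaluations)
```

Here five outer tuples trigger only three subquery evaluations, one per distinct manager value.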

Summary: optimization
Plan space
– huge number of semantically equivalent alternatives
Why important
– the difference between good and bad plans can be orders of magnitude
Ideal goal
– map a declarative query to the most efficient plan
Conventional wisdom: at least avoid bad plans

State of the art
Academic: always a core database research topic
– optimizing for interactive querying
– optimizing for novel parallel frameworks
Industry: most optimizers use System-R style
– they started with rule-based optimization
    Oracle 7 and its prior versions used rule-based
    Oracle 7 – 10: rule-based and cost-based
    Oracle 10g (2003): cost-based

What you should know
The importance of query optimization
– difference between fast and slow plans
Query optimization problem
– find the fast plans efficiently
The components of a cost-based (System-R style) query optimizer:
– plan space definition
– cost estimation
– search algorithm