Presentation transcript:

1 Query Optimization Vishy Poosala Bell Labs

2 Outline Introduction Necessary Details –Cost Estimation –Result Size Estimation Standard approach for query optimization Other ways Related Concepts

3 Given a query Q, there are several ways (access plans, plans, strategies) to execute Q and find the answer –select z from R,S where R.x=10 and R.y = S.y –selection before join or after or during? What indices to use? –represented as a tree [Figure: alternative plan trees over R and S with a join (⋈)]

4 Query optimization is the process of identifying the access plan with the minimum cost –Cost = Time taken to get all the answers Starting with System-R, most DBMSs use the same algorithm –generate most of the access plans and select the cheapest one First, how do we determine the cost of a plan? Then, how long is this process going to take and how do we make it faster?

5 Query execution cost is usually a weighted sum of the I/O cost (# disk accesses) and CPU cost (msec) –w * IO_COST + CPU_COST Basic Idea: –Cost of an operator depends on input data size, data distribution, physical layout –The optimizer uses statistics about the relations to estimate the cost –Need statistics on base relations and intermediate results [Figure: plan tree with a join (⋈) over R and S]

6 Key parameters
–p(R): relation R’s size in pages
–t(R): R’s size in tuples
–v(A,R): number of unique values in attribute A of R
–min(A,R), max(A,R): smallest/largest values in attribute A of R
–s(f): selectivity factor of a relational operator f = ratio of the result size to the size of the cross product of the input relations; for example, s(R ⋈ S) = t(R ⋈ S)/(t(R)t(S)), which is used to estimate t(R ⋈ S)

7 Without Indices
Assume that data is uniformly distributed
Scan
–Cost = p(R)
Selection
–Cost = p(R)
–Result size, equality selection (A = c): t(R) / v(A,R)
–Result size, range selection (A > c): t(R) * (max(A,R) - c) / (max(A,R) - min(A,R))
Projection (on A, with duplicate elimination)
–Cost = p(R)
–Size = v(A,R)
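The result-size estimates on this slide can be tried out with a short Python sketch. The `ColumnStats` class, its field names, and the sample numbers are illustrative assumptions, not part of the slides; the clamping for constants outside [min, max] is also an addition.

```python
from dataclasses import dataclass

@dataclass
class ColumnStats:
    """Hypothetical container for the per-attribute statistics of slide 6."""
    t: int      # t(R): number of tuples in R
    v: int      # v(A,R): number of unique values of A
    lo: float   # min(A,R)
    hi: float   # max(A,R)

def equality_size(s: ColumnStats) -> float:
    """Estimated size of 'A = c' under the uniform assumption."""
    return s.t / s.v

def greater_than_size(s: ColumnStats, c: float) -> float:
    """Estimated size of 'A > c' under the uniform assumption."""
    if c >= s.hi:          # nothing can exceed the maximum
        return 0.0
    if c < s.lo:           # every tuple qualifies
        return float(s.t)
    return s.t * (s.hi - c) / (s.hi - s.lo)

stats = ColumnStats(t=10_000, v=50, lo=0.0, hi=100.0)
print(equality_size(stats))          # 200.0
print(greater_than_size(stats, 75))  # 2500.0
```

With 10,000 tuples and 50 unique values, an equality predicate is estimated at 200 tuples, and `A > 75` on a [0, 100] domain at a quarter of the relation.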

8 Join (R ⋈ S)
Tuple Nested Loops (R is outer)
–p(R) + t(R) * p(S)
Page Nested Loops
–p(R) + p(R) * p(S)
Merge Scan
–sort(R) + sort(S) + p(R) + p(S)
–sort(R) = 2p(R) * (log(p(R)/w) + 1), w = # buffer pages
Size: t(R) * t(S) / v(A,R) (assuming the join attribute A has the same set of unique values in both relations)
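As a rough illustration, the join cost formulas above can be evaluated side by side. The function names and sample sizes are assumptions made for this sketch, and the slide does not state the base of the logarithm; base 2 (two-way merging) is assumed here.

```python
import math

def tuple_nested_loops(p_r: int, t_r: int, p_s: int) -> float:
    # Scan R once, and scan all of S for every tuple of R.
    return p_r + t_r * p_s

def page_nested_loops(p_r: int, p_s: int) -> float:
    # Scan R once, and scan all of S for every page of R.
    return p_r + p_r * p_s

def sort_cost(p: int, w: int) -> float:
    # sort(R) = 2p(R) * (log(p(R)/w) + 1); log base 2 is an assumption.
    return 2 * p * (math.log2(p / w) + 1)

def merge_scan(p_r: int, p_s: int, w: int) -> float:
    # Sort both inputs, then read each once during the merge.
    return sort_cost(p_r, w) + sort_cost(p_s, w) + p_r + p_s

def join_size(t_r: int, t_s: int, v_a: int) -> float:
    # Assumes the join attribute has the same unique values in both inputs.
    return t_r * t_s / v_a

# Hypothetical relation sizes: R has 1024 pages, S has 512 pages, 16 buffers.
print(tuple_nested_loops(1024, 102_400, 512))  # 52429824
print(page_nested_loops(1024, 512))            # 525312
print(merge_scan(1024, 512, 16))               # 22016.0
```

For these sizes the merge scan is far cheaper than either nested-loops variant, which is the kind of comparison the optimizer makes for every candidate join.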

9 With Indices Cost = Cost(index) + Cost(data) Parameters for index I –p(I) = size in pages –d(I) = depth of the tree –lp(I) = number of leaf pages –b(I) = # Buckets in hash index

10 Selection
–B-tree(primary) = d(I) + p(R) * s
–B-tree(secondary) = d(I) + lp(I) * s + s * t(R)/v(A,R)
–Hash(primary) = p(R)/b(I) for equality; p(R) for range
–Hash(secondary) = p(I)/b(I) + t(R)/v(A,R); p(R) for range
Nested loops (index on inner relation S)
–B-tree(primary) = p(R) + (d(I) + p(S)/v(A,S)) * t(R)
–B-tree(secondary) = p(R) + (d(I) + lp(I)/v(A,S) + t(S)/v(A,S)) * t(R)
–Hash(primary) = p(R) + t(R) * p(S) / b(I)
–Hash(secondary) = p(R) + t(R) * (p(I)/b(I) + t(S)/v(A,S))
Merge Join
–B-tree(primary) on both = p(R) + p(S)
–B-tree(secondary) on R = p(S) + d(I_R) + lp(I_R) + t(R)
–Hash: same as no index (sort + store + merge)
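To get a feel for how much the index type matters, here is a small sketch of three of the index nested-loops formulas above. Function names, parameter names, and the sample statistics are hypothetical.

```python
def inl_btree_primary(p_r, t_r, p_s, v_s, d):
    # p(R) + (d(I) + p(S)/v(A,S)) * t(R)
    return p_r + (d + p_s / v_s) * t_r

def inl_btree_secondary(p_r, t_r, t_s, v_s, d, lp):
    # p(R) + (d(I) + lp(I)/v(A,S) + t(S)/v(A,S)) * t(R)
    return p_r + (d + lp / v_s + t_s / v_s) * t_r

def inl_hash_primary(p_r, t_r, p_s, b):
    # p(R) + t(R) * p(S) / b(I)
    return p_r + t_r * p_s / b

# Hypothetical statistics: R is the outer relation, the index is on S.
p_r, t_r = 100, 1000
p_s, t_s, v_s = 500, 5000, 50
d, lp, b = 3, 40, 100

print(inl_btree_primary(p_r, t_r, p_s, v_s, d))        # 13100.0
print(inl_btree_secondary(p_r, t_r, t_s, v_s, d, lp))  # about 103900
print(inl_hash_primary(p_r, t_r, p_s, b))              # 5100.0
```

The secondary B-tree is the most expensive here because each matching S tuple may cost a separate page access.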

11 select z from R,S where R.x=10 and R.y = S.y
–selection before join or after or during?
Nested loop join with S as inner
P1: join into temp then select
–p(R) + t(R)*p(S) + 2p(R)p(S)/v(y,R)
P2: select into temp then join
–p(R) + 2p(R)/v(x,R) + t(R)p(S)/v(x,R)
P3: select while join
–p(R) + t(R)p(S)/v(x,R)
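Plugging numbers into the three plan costs makes the difference concrete. The statistics below are invented for illustration; only the formulas come from the slide.

```python
def plan_costs(p_r, t_r, p_s, v_x, v_y):
    """Return the estimated cost of plans P1, P2, P3 from slide 11."""
    p1 = p_r + t_r * p_s + 2 * p_r * p_s / v_y        # join into temp, then select
    p2 = p_r + 2 * p_r / v_x + t_r * p_s / v_x        # select into temp, then join
    p3 = p_r + t_r * p_s / v_x                        # select while joining
    return {"P1": p1, "P2": p2, "P3": p3}

# Hypothetical statistics for R and S.
costs = plan_costs(p_r=100, t_r=1000, p_s=50, v_x=10, v_y=20)
print(costs)                       # {'P1': 50600.0, 'P2': 5120.0, 'P3': 5100.0}
print(min(costs, key=costs.get))   # P3
```

P3 avoids both the full join of P1 and the temporary relation of P2, so under this cost model it can never cost more than P2.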

12 Many more options when multiple joins are present –Join is associative: (R ⋈ S) ⋈ T = R ⋈ (S ⋈ T) –In left-deep join trees, the right-hand-side input for each join is a relation, not the result of an intermediate join

13 Consider finding the best join order for r1 ⋈ r2 ⋈ ... ⋈ rn. There are (2(n – 1))!/(n – 1)! different join orders for this expression. With n = 7, the number is 665280; with n = 10, the number is greater than 17.6 billion! No need to generate all the join orders. Using dynamic programming, the least-cost join order for any subset of {r1, r2, ..., rn} is computed only once and stored for future use. –O(n*2^n) (left-deep trees)
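The growth of the search space is easy to check directly. This short sketch evaluates the (2(n – 1))!/(n – 1)! formula next to the n*2^n bound for left-deep dynamic programming; the function names are made up for the example.

```python
import math

def num_join_orders(n: int) -> int:
    """Number of join orders for n relations: (2(n-1))! / (n-1)!."""
    return math.factorial(2 * (n - 1)) // math.factorial(n - 1)

def dp_left_deep_bound(n: int) -> int:
    """The n * 2^n bound quoted for dynamic programming over left-deep trees."""
    return n * 2 ** n

for n in (3, 7, 10):
    print(n, num_join_orders(n), dp_left_deep_bound(n))
# 3 12 24
# 7 665280 896
# 10 17643225600 10240
```

Already at n = 10 exhaustive enumeration visits over 17 billion orders, while the dynamic programming bound stays near ten thousand.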

14 System-R, Selinger-style, Dynamic Programming (or THE query optimization technique) A solution consists of an ordered list of the relations to be joined, the join method for each join, and an access plan for the relations (indices) If the tuples are produced in sorted order on A, the order is interesting if –A participates in another join –or A is an ordering attribute in the SQL query (e.g., ORDER BY) Idea –Find the best plan for each subset of the relations for each interesting order Join Heuristics –avoid Cartesian products –use left-deep trees

15 Find the best way to scan each relation for each interesting order and for the unordered case. Find the best way to join each possible pair of relations, based on the previous step and the join heuristics. Then joins of triples, and so on. Finally, find the best way to join all N relations; this is the result. At any point, sub-optimal solutions are removed –of two plans for the same relations with the same interesting order, the more expensive one is dropped
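The bottom-up procedure can be sketched in a few lines of Python. This is a deliberately simplified version, not System-R's actual algorithm: it ignores interesting orders, access paths, and join methods, considers only left-deep trees without Cartesian products, and uses the sum of intermediate result sizes as a stand-in cost. The relation names, cardinalities, and selectivities are invented.

```python
from itertools import combinations

def best_left_deep_order(card, sel):
    """card: {relation: tuples}; sel: {frozenset({r, q}): join selectivity}.
    Returns (cost, result_size, join_order) for joining all relations."""
    rels = sorted(card)
    # Step 1: single relations (cost of scanning is ignored in this sketch).
    best = {frozenset([r]): (0.0, float(card[r]), (r,)) for r in rels}
    # Steps 2..N: best plan for each subset of size k, built from size k-1.
    for k in range(2, len(rels) + 1):
        for subset in combinations(rels, k):
            s = frozenset(subset)
            for r in subset:                      # r is the right (inner) input
                rest = s - {r}
                if rest not in best:
                    continue
                preds = [sel[frozenset((r, q))] for q in rest
                         if frozenset((r, q)) in sel]
                if not preds:                     # heuristic: no Cartesian products
                    continue
                cost, size, order = best[rest]
                new_size = size * card[r]
                for f in preds:
                    new_size *= f
                new_cost = cost + new_size
                # Keep only the cheapest plan per subset (prune the rest).
                if s not in best or new_cost < best[s][0]:
                    best[s] = (new_cost, new_size, order + (r,))
    return best.get(frozenset(rels))

card = {"A": 1000, "B": 100, "C": 10}
sel = {frozenset(("A", "B")): 0.01, frozenset(("B", "C")): 0.1}
print(best_left_deep_order(card, sel))  # cost ~1100, size ~1000, order ('C', 'B', 'A')
```

Supporting interesting orders would mean keying the `best` table on (subset, order) instead of the subset alone, so that a slightly more expensive but usefully sorted plan survives pruning.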

16 EMP(name, sal, dpt); DEPT(dpt, floor, mgr) select name, mgr from emp, dept where sal >= 30k and floor = 2 and dept.dpt = emp.dpt EMP has B-trees on sal and on dpt; DEPT has a hash index on floor

17 Other Ways Heuristic-based, Rule-based –Perform most restrictive selection early –Perform all selections before joins –Pick the most “promising” relation to join next (Oracle) –Not guaranteed to give optimal or good solutions Randomized Algorithms –Iterative Improvement –Simulated Annealing –Not guaranteed, but known to work well
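A randomized search in the spirit of iterative improvement might look like the sketch below. It is a simplified single-start local search over join orders, not the full algorithm from the literature; the cost function, cardinalities, and uniform selectivity are toy assumptions.

```python
import random

# Hypothetical cardinalities and one uniform join selectivity.
CARD = {"A": 1000, "B": 100, "C": 10, "D": 10000}
SELECTIVITY = 0.01

def toy_cost(order):
    """Sum of intermediate result sizes for a left-deep join in this order."""
    size = float(CARD[order[0]])
    total = 0.0
    for rel in order[1:]:
        size = size * CARD[rel] * SELECTIVITY
        total += size
    return total

def local_search(start, cost, steps=200, seed=0):
    """Repeatedly swap two relations; keep the swap only if cost improves."""
    rng = random.Random(seed)
    best, best_cost = list(start), cost(start)
    for _ in range(steps):
        i, j = rng.sample(range(len(best)), 2)
        candidate = best[:]
        candidate[i], candidate[j] = candidate[j], candidate[i]
        c = cost(candidate)
        if c < best_cost:
            best, best_cost = candidate, c
    return best, best_cost

start = ["D", "A", "B", "C"]
order, final_cost = local_search(start, toy_cost)
print(order, final_cost)
```

The search never accepts a worse plan, so it can get stuck in a local minimum; simulated annealing differs by occasionally accepting uphill moves.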

18 Statistics for Size Estimation Uniform distribution assumption is often invalid –data tends to be skewed Leads to errors in estimation, sub-optimal plans Use more realistic statistics –Samples –Histograms: uniform assumption over subsets of data; what subsets do we use? Equi-depth, equi-width, v-optimal, ... –Polynomials, Wavelets, ... Have to maintain the statistics as data changes
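The difference between equi-width and equi-depth bucketing shows up clearly on skewed data. The sketch below builds both for a small made-up sample; the function names and bucket representation are assumptions for illustration, and `equi_width` assumes the values are not all identical.

```python
def equi_width(values, k):
    """k buckets of equal value range; returns (boundaries, counts)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k              # assumes hi > lo
    counts = [0] * k
    for x in values:
        i = min(int((x - lo) / width), k - 1)   # the maximum goes in the last bucket
        counts[i] += 1
    bounds = [lo + i * width for i in range(k + 1)]
    return bounds, counts

def equi_depth(values, k):
    """k buckets of (nearly) equal tuple count; returns (low, high, count) triples."""
    data = sorted(values)
    n = len(data)
    buckets = [data[i * n // k:(i + 1) * n // k] for i in range(k)]
    return [(b[0], b[-1], len(b)) for b in buckets if b]

skewed = [1, 1, 1, 1, 1, 1, 2, 2, 5, 10]
print(equi_width(skewed, 3))   # ([1.0, 4.0, 7.0, 10.0], [8, 1, 1])
print(equi_depth(skewed, 5))   # [(1, 1, 2), (1, 1, 2), (1, 1, 2), (2, 2, 2), (5, 10, 2)]
```

The equi-width histogram lumps eight of the ten values into one bucket, while the equi-depth histogram spends most of its buckets where the data actually is.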

19 Related Areas Approximate Query Answering: Answer queries using data summaries, approximately but quickly –E.g., compute avg(salary) within 5% error with 99% confidence in 1/10th of the time Optimize for fastest first answers –Interactive applications Parametric Optimization –Keep multiple access plans for different runtime parameters Optimize for other domains –Image databases, streaming sources, distributed databases Materialized Views –Store partial results –Maintenance