Lecture 26: Query Optimizations and Cost Estimation

Lecture 26: Query Optimizations and Cost Estimation Monday, December 1st, 2003

Outline Cost-based Query Optimization Completing the physical query plan: 16.7 Cost estimation: 16.4

Reducing the Search Space Left-linear trees vs. bushy trees. Trees without Cartesian products. Example: R(A,B) ⋈ S(B,C) ⋈ T(C,D). The plan (R(A,B) ⋈ T(C,D)) ⋈ S(B,C) has a Cartesian product; most query optimizers will not consider it.

Analyzing Dynamic Programming Some math we need to know: given x1, x2, ..., xn, how many permutations are there? Answer: n!. Example: the permutations of x1x2x3x4 include x1x2x3x4, x2x4x3x1, ... There are 4! = 4·3·2·1 = 24 permutations.

Analyzing Dynamic Programming More math we need to know: given a product x0x1x2...xn, in how many ways can we place parentheses? Answer: the Catalan number 1/(n+1) · C(2n, n) = (2n)! / ((n+1)·(n!)^2). Example, for n = 3: ((x0x1)(x2x3)), ((x0(x1x2))x3), (x0((x1x2)x3)), (((x0x1)x2)x3), (x0(x1(x2x3))). There are 6!/(4·3!·3!) = 5 ways.
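
As a quick check of the closed form above, here is a small sketch (Python, my addition, not from the slides) that counts parenthesizations of a product of n+1 factors by brute-force recursion and compares the result with (2n)!/((n+1)·(n!)^2):

from math import comb, factorial

def count_parenthesizations(m):
    """Number of ways to fully parenthesize a product of m factors x0 x1 ... x(m-1)."""
    if m <= 2:
        return 1
    # split into a left part of i factors and a right part of m - i factors
    return sum(count_parenthesizations(i) * count_parenthesizations(m - i)
               for i in range(1, m))

for n in range(1, 8):
    catalan = factorial(2 * n) // ((n + 1) * factorial(n) ** 2)
    assert count_parenthesizations(n + 1) == catalan == comb(2 * n, n) // (n + 1)

print(count_parenthesizations(4))   # 5, as in the n = 3 example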

Analyzing Dynamic Programming Query: R0(A0,A1) ⋈ R1(A1,A2) ⋈ ... ⋈ Rn(An,An+1). Number of bushy trees: (n+1)! · 1/(n+1) · C(2n, n) = (2n)!/n!. Number of steps taken by dynamic programming: Σ_{k=1}^{n} (2^k − 2) · C(n, k) = 3^n − 2^(n+1) + 1.
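
A minimal check (Python, my addition) of the step-count identity above: for every non-empty subset of k relations there are 2^k − 2 ordered ways to split it into two non-empty parts, and the total matches 3^n − 2^(n+1) + 1:

from math import comb

def dp_steps_bushy(n):
    """Sum over subset sizes k of (2^k - 2) * C(n, k)."""
    return sum((2**k - 2) * comb(n, k) for k in range(1, n + 1))

for n in range(1, 12):
    assert dp_steps_bushy(n) == 3**n - 2**(n + 1) + 1

print(dp_steps_bushy(5))   # 180 = 3^5 - 2^6 + 1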

Analyzing Dynamic Programming R0(A0,A1) ⋈ R1(A1,A2) ⋈ ... ⋈ Rn(An,An+1). Number of bushy trees without Cartesian products: 2^n · 1/(n+1) · C(2n, n) = 2^n · (2n)! / ((n+1)·(n!)^2) (the factor 2^n arises because we may commute each join). Number of steps taken by dynamic programming: Σ_{k=1}^{n} (k−1)(n−k+1) = O(n^3).

Analyzing Dynamic Programming R0(A0,A1) ⋈ R1(A1,A2) ⋈ ... ⋈ Rn(An,An+1). The number of left-linear join trees is (n+1)!. Number of steps taken by dynamic programming: Σ_{k=1}^{n} k · C(n, k) = n · 2^(n−1).

Analyzing Dynamic Programming R0(A0,A1) ⋈ R1(A1,A2) ⋈ ... ⋈ Rn(An,An+1). Number of left-linear trees without Cartesian products: 2^n. Example, for R0 ⋈ R1 ⋈ R2 ⋈ R3 ⋈ R4, such plans include (((R2 ⋈ R1) ⋈ R3) ⋈ R4) ⋈ R0, (((R2 ⋈ R3) ⋈ R1) ⋈ R0) ⋈ R4, ... (hence 2^n of them; see the sketch below). Number of steps taken by dynamic programming: Σ_{k=1}^{n} 2(n−k+1) = n(n+1) = O(n^2).
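
To make the 2^n count concrete, the sketch below (Python, my addition) enumerates the left-deep join orders over a chain of n+1 relations in which every prefix of the order covers a contiguous sub-chain (so no Cartesian product ever arises), and checks both counts from this slide:

def count_left_deep_no_cartesian(m):
    """Left-deep join orders over a chain of m relations where every prefix
    of the order covers a contiguous interval of the chain."""
    def extend(lo, hi):
        if hi - lo + 1 == m:
            return 1
        ways = 0
        if lo > 0:
            ways += extend(lo - 1, hi)   # join the left neighbor next
        if hi < m - 1:
            ways += extend(lo, hi + 1)   # join the right neighbor next
        return ways
    return sum(extend(i, i) for i in range(m))   # any relation may come first

for n in range(1, 10):
    assert count_left_deep_no_cartesian(n + 1) == 2**n
    assert sum(2 * (n - k + 1) for k in range(1, n + 1)) == n * (n + 1)

print(count_left_deep_no_cartesian(5))   # 2^4 = 16 plans for R0 ⋈ R1 ⋈ R2 ⋈ R3 ⋈ R4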

Dynamic Programming: Summary Handles only join queries: selections are pushed down (i.e., early), projections are pulled up (i.e., late). Takes exponential time in general, BUT: restricting to left-linear joins may reduce the time, and excluding Cartesian products may reduce it further! (Cartesian-product joins do occur, e.g., in data warehouses!) Today's optimizers are rule-based (e.g., SQL Server and DB2) and handle arbitrary queries.

Completing the Physical Query Plan Choose an algorithm to implement each operator. Need to account for more than cost: how much memory do we have? Are the input operand(s) sorted? Decide for each intermediate result: materialize it, or pipeline it.

Materialize Intermediate Results Between Operators Plan: ((R ⋈ S) ⋈ T) ⋈ U, with intermediate results V1 = R ⋈ S and V2 = V1 ⋈ T written to disk.

HashTable ← S
repeat  read(R, x)
        y ← join(HashTable, x)
        write(V1, y)

HashTable ← T
repeat  read(V1, y)
        z ← join(HashTable, y)
        write(V2, z)

HashTable ← U
repeat  read(V2, z)
        u ← join(HashTable, z)
        write(Answer, u)

Materialize Intermediate Results Between Operators Question in class: given B(R), B(S), B(T), B(U), what is the total cost of the plan, and how much main memory do we need? Cost = B(R) + B(S) + B(T) + B(U) + 2B(R ⋈ S) + 2B(R ⋈ S ⋈ T). M = max(B(S), B(T), B(U)).
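
The two formulas as a small helper (Python, my addition), assuming the intermediate sizes B(R ⋈ S) and B(R ⋈ S ⋈ T) are known, e.g. from size estimation:

def materialized_plan(b_r, b_s, b_t, b_u, b_rs, b_rst):
    """Cost/memory of ((R ⋈ S) ⋈ T) ⋈ U when V1 = R ⋈ S and V2 = V1 ⋈ T are materialized.
    Each intermediate result is written to disk and read back, hence the factor 2."""
    cost = b_r + b_s + b_t + b_u + 2 * b_rs + 2 * b_rst
    memory = max(b_s, b_t, b_u)   # only one hash table is resident at a time
    return cost, memory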

Pipeline Between Operators Plan: ((R ⋈ S) ⋈ T) ⋈ U, with all joins pipelined.

HashTable1 ← S
HashTable2 ← T
HashTable3 ← U
repeat  read(R, x)
        y ← join(HashTable1, x)
        z ← join(HashTable2, y)
        u ← join(HashTable3, z)
        write(Answer, u)

Pipeline Between Operators Question in class: given B(R), B(S), B(T), B(U), what is the total cost of the plan, and how much main memory do we need? Cost = B(R) + B(S) + B(T) + B(U). M = B(S) + B(T) + B(U).
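
And the pipelined counterpart (same assumptions), which makes the trade-off explicit: lower I/O cost, but all three hash tables must fit in memory at once:

def pipelined_plan(b_r, b_s, b_t, b_u):
    """Cost/memory of ((R ⋈ S) ⋈ T) ⋈ U when all three joins are pipelined."""
    cost = b_r + b_s + b_t + b_u     # every base relation is read exactly once
    memory = b_s + b_t + b_u         # HashTable1, HashTable2, HashTable3 are all resident
    return cost, memory

# With hypothetical sizes B(R)=B(S)=B(T)=B(U)=1000, B(R⋈S)=2000, B(R⋈S⋈T)=3000:
#   materialized_plan(1000, 1000, 1000, 1000, 2000, 3000)  ->  (14000, 1000)
#   pipelined_plan(1000, 1000, 1000, 1000)                 ->  (4000, 3000)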

Pipeline in Bushy Trees [Figure: a bushy join tree over relations R, S, T, X, Y, Z, I, and V.]

Example 16.36 Logical plan: (R(w,x) ⋈ S(x,y)) ⋈ U(y,z). Main memory: M = 101 buffers. Sizes: R(w,x) is 5,000 blocks, S(x,y) is 10,000 blocks, U(y,z) is 10,000 blocks; k denotes the size in blocks of the intermediate result R ⋈ S.

Example 16.36 Naïve evaluation: 2 partitioned (two-pass) hash-joins. Cost: 3B(R) + 3B(S) + 4k + 3B(U) = 75,000 + 4k.

Example 16.36 Smarter: Step 1: hash R on x into 100 buckets, each of 50 blocks; write them to disk. Step 2: hash S on x into 100 buckets; write them to disk. Step 3: read each Ri into memory (50 buffers) and join it with Si (1 buffer); hash the result on y into 50 buckets (50 buffers). Here we pipeline. Cost so far: 3B(R) + 3B(S).

Example 16.36 Continuing: how large are the 50 buckets on y? Answer: k/50 blocks each. If k <= 50, keep all 50 buckets from Step 3 in memory, then: Step 4: read U from disk, hash it on y, and join with the in-memory buckets. Total cost: 3B(R) + 3B(S) + B(U) = 55,000.

Example 16.36 Continuing: if 50 < k <= 5000, send the 50 buckets from Step 3 to disk. Each bucket has size k/50 <= 100. Step 4: partition U into 50 buckets. Step 5: read each pair of corresponding buckets and join them in memory. Total cost: 3B(R) + 3B(S) + 2k + 3B(U) = 75,000 + 2k.

Example 16.36 Continuing: if k > 5000, materialize instead of pipeline: 2 partitioned (two-pass) hash-joins. Cost: 3B(R) + 3B(S) + 4k + 3B(U) = 75,000 + 4k.

Example 16.36 Summary: If k <= 50, cost = 55,000. If 50 < k <= 5000, cost = 75,000 + 2k. If k > 5000, cost = 75,000 + 4k.
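
The whole case analysis, restated as a function of k = B(R ⋈ S) (Python, my addition), with the block counts from the example, B(R) = 5,000 and B(S) = B(U) = 10,000:

def example_16_36_cost(k, b_r=5000, b_s=10000, b_u=10000):
    """Best cost of (R ⋈ S) ⋈ U with M = 101 buffers, following the slides."""
    if k <= 50:
        # the 50 buckets of R ⋈ S fit in memory: read U only once
        return 3 * b_r + 3 * b_s + b_u                   # 55,000
    elif k <= 5000:
        # write and re-read the buckets of R ⋈ S (2k), two-pass hash-join with U
        return 3 * b_r + 3 * b_s + 2 * k + 3 * b_u       # 75,000 + 2k
    else:
        # materialize: two ordinary two-pass hash-joins
        return 3 * b_r + 3 * b_s + 4 * k + 3 * b_u       # 75,000 + 4k

print(example_16_36_cost(40), example_16_36_cost(1000), example_16_36_cost(6000))
# 55000 77000 99000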

Size Estimation We need T(E) and V(E, A) in order to estimate cost and choose a good plan. This is hard without computing E, so we will 'estimate' them instead. This results in large errors when E is complex; however, the errors are fairly uniform, and the optimizer will usually prefer a good plan over a poor one.

Size Estimation Estimating the size of a projection is easy: T(Π_L(R)) = T(R). This is because a projection does not eliminate duplicates.

Size Estimation Estimating the size of a selection. S = σ_{A=c}(R): T(S) can be anything from 0 to T(R) − V(R,A) + 1; mean-value estimate: T(S) = T(R)/V(R,A). S = σ_{A<c}(R): T(S) can be anything from 0 to T(R); heuristic: T(S) = T(R)/3.
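
The two rules on this slide as a tiny estimator (Python, my addition); the factor 1/3 for range predicates is just the heuristic named above:

def estimate_selection(t_r, v_r_a, predicate):
    """Estimated size of a selection on R, given T(R) and V(R, A)."""
    if predicate == "equality":       # sigma_{A = c}(R)
        return t_r / v_r_a
    if predicate == "range":          # sigma_{A < c}(R)
        return t_r / 3                # textbook heuristic
    raise ValueError("unknown predicate type")

# estimate_selection(10000, 50, "equality")  ->  200.0
# estimate_selection(10000, 50, "range")     ->  3333.33...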

Size Estimation Estimating the size of a natural join, R ⋈_A S. When the sets of A values are disjoint, T(R ⋈_A S) = 0. When A is a key in S and a foreign key in R, T(R ⋈_A S) = T(R). When A has a unique value, the same in R and S, T(R ⋈_A S) = T(R) · T(S).

Size Estimation Assumptions. Containment of values: if V(R,A) <= V(S,A), then the set of A values of R is included in the set of A values of S. Note: this indeed holds when A is a foreign key in R and a key in S. Preservation of values: for any other attribute B, V(R ⋈_A S, B) = V(R, B) (or V(S, B)).

Size Estimation Assume V(R,A) <= V(S,A). Then each tuple t in R joins with some tuple(s) in S. How many? On average, T(S)/V(S,A), so t contributes T(S)/V(S,A) tuples to R ⋈_A S. Hence T(R ⋈_A S) = T(R) · T(S) / V(S,A). In general: T(R ⋈_A S) = T(R) · T(S) / max(V(R,A), V(S,A)).

Size Estimation Example: T(R) = 10,000, T(S) = 20,000, V(R,A) = 100, V(S,A) = 200. How large is R ⋈_A S? Answer: T(R ⋈_A S) = 10,000 · 20,000 / 200 = 1M.

Size Estimation Joins on more than one attribute: T(R ⋈_{A,B} S) = T(R) · T(S) / (max(V(R,A), V(S,A)) · max(V(R,B), V(S,B))).
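
Both join-size rules in one sketch (Python, my addition), checked against the numeric example two slides back:

def estimate_join(t_r, t_s, v_r, v_s):
    """T(R ⋈ S) over join attributes A1..Ak; v_r[i] = V(R, Ai), v_s[i] = V(S, Ai).
    Uses T(R) * T(S) / prod_i max(V(R, Ai), V(S, Ai))."""
    size = t_r * t_s
    for vr, vs in zip(v_r, v_s):
        size /= max(vr, vs)
    return size

assert estimate_join(10000, 20000, [100], [200]) == 1_000_000    # the 1M example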

Histograms Statistics on the data, maintained by the RDBMS. They make size estimation much more accurate (and hence cost estimates more accurate).

Histograms Employee(ssn, name, salary, phone). Maintain a histogram on salary: T(Employee) = 25,000, but now we know the distribution.

Salary:  0..20k   20k..40k   40k..60k   60k..80k   80k..100k   >100k
Tuples:  200      800        5000       12000      6500        500

Histograms Ranks(rankName, salary). Estimate the size of Employee ⋈_Salary Ranks.

Salary:    0..20k   20k..40k   40k..60k   60k..80k   80k..100k   >100k
Employee:  200      800        5000       12000      6500        500
Ranks:     8        20         40         80         100         2

Histograms Assume V(Employee, Salary) = 200 and V(Ranks, Salary) = 250. Main property: Employee ⋈_Salary Ranks = (Employee1 ⋈_Salary Ranks1') ∪ ... ∪ (Employee6 ⋈_Salary Ranks6'). A tuple t in Employee1 joins with how many tuples in Ranks1'? Answer: about T(Ranks1') / max(V(Employee,Salary), V(Ranks,Salary)) = T1' / 250. Then T(Employee ⋈_Salary Ranks) = Σ_{i=1..6} Ti · Ti' / 250 = (200·8 + 800·20 + 5000·40 + 12000·80 + 6500·100 + 500·2) / 250 = ...
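
Finally, the bucket-by-bucket sum from this slide (Python, my addition), which fills in the arithmetic left open at the end:

employee = [200, 800, 5000, 12000, 6500, 500]   # tuples per salary bucket
ranks    = [8, 20, 40, 80, 100, 2]
v_max = max(200, 250)    # max(V(Employee, Salary), V(Ranks, Salary)) = 250

estimate = sum(t_e * t_r for t_e, t_r in zip(employee, ranks)) / v_max
print(estimate)          # 1,828,600 / 250 = 7,314.4 tuples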