Lecture 27 Wednesday, December 5, 2001.

Slides:



Advertisements
Similar presentations
CS4432: Database Systems II
Advertisements

1 Lecture 23: Query Execution Friday, March 4, 2005.
Completing the Physical-Query-Plan. Query compiler so far Parsed the query. Converted it to an initial logical query plan. Improved that logical query.
Lecture 24: Query Execution Monday, November 20, 2000.
16.4 Estimating the Cost of Operations Project GuidePrepared By Dr. T. Y. LinAkshay Shenoy Computer Science Dept San Jose State University.
1 Lecture 22: Query Execution Wednesday, March 2, 2005.
1 Anna Östlin Pagh and Rasmus Pagh IT University of Copenhagen Advanced Database Technology March 25, 2004 QUERY COMPILATION II Lecture based on [GUW,
1 Relational Operators. 2 Outline Logical/physical operators Cost parameters and sorting One-pass algorithms Nested-loop joins Two-pass algorithms.
CS411 Database Systems Kazuhiro Minami 12: Query Optimization.
CSCE Database Systems Chapter 15: Query Execution 1.
Query Processing. Steps in Query Processing Validate and translate the query –Good syntax. –All referenced relations exist. –Translate the SQL to relational.
Query Optimization March 10 th, Very Big Picture A query execution plan is a program. There are many of them. The optimizer is trying to chose a.
1 Lecture 25 Friday, November 30, Outline Query execution –Two pass algorithms based on indexes (6.7) Query optimization –From SQL to logical.
CS4432: Database Systems II Query Processing- Part 3 1.
CS411 Database Systems Kazuhiro Minami 11: Query Execution.
CS411 Database Systems Kazuhiro Minami 12: Query Optimization.
CS 4432estimation - lecture 161 CS4432: Database Systems II Lecture #16 Query Processing : Estimating Sizes of Results Professor Elke A. Rundensteiner.
Lecture 24 Query Execution Monday, November 28, 2005.
CS4432: Database Systems II Query Processing- Part 2.
1 Lecture 25: Query Optimization Wednesday, November 26, 2003.
Lecture 17: Query Execution Tuesday, February 28, 2001.
CS 440 Database Management Systems Lecture 5: Query Processing 1.
CS 440 Database Management Systems Query Optimization 1.
Tallahassee, Florida, 2016 COP5725 Advanced Database Systems Query Processing Spring 2016.
1 Lecture 15 Monday, May 20, 2002 Size Estimation, XML Processing.
1 Lecture 23: Query Execution Monday, November 26, 2001.
CSE 544: Lecture 14 Wednesday, 5/15/2002 Optimization, Size Estimation.
CS4432: Database Systems II Query Processing- Part 1 1.
Tallahassee, Florida, 2016 COP5725 Advanced Database Systems Query Optimization Spring 2016.
16.4 Estimating the Cost of Operations Project GuidePrepared By Dr. T. Y. LinAkshay Shenoy Computer Science Dept San Jose State University.
CS 440 Database Management Systems
CS 540 Database Management Systems
CS 440 Database Management Systems
Query Processing Exercise Session 4.
Database Management System
Lecture 26: Query Optimizations and Cost Estimation
Lecture 27: Size/Cost Estimation
15.5 Two-Pass Algorithms Based on Hashing
Cse 344 April 25th – Disk i/o.
File Processing : Query Processing
Introduction to Database Systems CSE 444 Lecture 22: Query Optimization November 26-30, 2007.
Lecture 26: Query Optimization
Implementation of Relational Operations (Part 2)
Query Processing B.Ramamurthy Chapter 12 11/27/2018 B.Ramamurthy.
Query processing and optimization
CS143:Evaluation and Optimization
Query Execution Two-pass Algorithms based on Hashing
(Two-Pass Algorithms)
Query Optimization and Perspectives
Lecture 28 Friday, December 7, 2001.
Lecture 27: Optimizations
Chapter 12 Query Processing (1)
Lecture 24: Query Execution
Lecture 13: Query Execution
Lecture 23: Query Execution
Lecture 28: Size/Cost Estimation, Recovery
CPSC-608 Database Systems
Lecture 22: Query Execution
Lecture 22: Query Execution
Lecture 11: B+ Trees and Query Execution
CSE 544: Query Execution Wednesday, 5/12/2004.
CSE 544: Optimizations Wednesday, 5/10/2006.
Lecture 26 Monday, December 3, 2001.
CPSC-608 Database Systems
Lecture 26: Wednesday, December 4, 2002.
Lecture 22: Friday, November 22, 2002.
Lecture 24: Query Execution
Lecture 20: Query Execution
Lecture 24: Wednesday, November 27, 2002.
Presentation transcript:

Lecture 27 Wednesday, December 5, 2001

Outline Finish dynamic programming (7.6) Estimating the Size of Operations (7.4) Histograms (7.5.1) Completing the physical-query-plan selection (7.7)

Dynamic Programming To illustrate, we will make the following simplifications: Cost(P1 P2) = Cost(P1) + Cost(P2) + size(intermediate result(s)) Intermediate results: If P1 = a join, then the size of the intermediate result is size(P1), otherwise the size is 0 Similarly for P2 Cost of a scan = 0

Dynamic Programming Example: Cost(R5 R7) = 0 (no intermediate results) Cost((R2 R1) R7) = Cost(R2 R1) + Cost(R7) + size(R2 R1) = size(R2 R1)

Dynamic Programming Relations: R, S, T, U Number of tuples: 2000, 5000, 3000, 1000 Size estimation: T(A B) = 0.01*T(A)*T(B)

Subquery Size Cost Plan RS RT RU ST SU TU RST RSU RTU STU RSTU

Subquery Size Cost Plan RS 100k RT 60k RU 20k ST 150k SU 50k TU 30k RST 3M (RT)S RSU 1M (RU)S RTU 0.6M (RU)T STU 1.5M (TU)S RSTU 30M 60k+50k=110k (RT)(SU)

Dynamic Programming Summary: computes optimal plans for subqueries: Step 1: {R1}, {R2}, …, {Rn} Step 2: {R1, R2}, {R1, R3}, …, {Rn-1, Rn} … Step n: {R1, …, Rn} We used naïve size/cost estimations In practice: more realistic size/cost estimations (next) heuristics for Reducing the Search Space Restrict to left linear trees Restrict to trees “without cartesian product” need more than just one plan for each subquery: “interesting orders”

Estimating Sizes Need size in order to estimate cost Example: Cost of partitioned hash-join E1 E2 is 3B(E1) + 3B(E2) B(E1) = T(E1)/ block size B(E2) = T(E2)/ block size So, we need to estimate T(E1), T(E2)

Estimating Sizes Estimating the size of a projection Easy: T(PL(R)) = T(R) This is because a projection doesn’t eliminate duplicates

Estimating Sizes Estimating the size of a selection S = sA=c(R) T(S) san be anything from 0 to T(R) – V(R,A) + 1 Mean value: T(S) = T(R)/V(R,A) S = sA<c(R) T(S) can be anything from 0 to T(R) Heuristics: T(S) = T(R)/3

Estimating Sizes Estimating the size of a natural join, R S When the set of A values are disjoint, then T(R S) = 0 When A is a key in S and a foreign key in R, then T(R S) = T(R) When A has a unique value, the same in R and S, then T(R S) = T(R) T(S) A A A A

Estimating Sizes Assumptions: Containment of values: if V(R,A) <= V(S,A), then the set of A values of R is included in the set of A values of S Note: this indeed holds when A is a foreign key in R, and a key in S Preservation of values: for any other attribute B, V(R S, B) = V(R, B) (or V(S, B)) A

Estimating Sizes Assume V(R,A) <= V(S,A) Then each tuple t in R joins some tuple(s) in S How many ? On average S/V(S,A) t will contribute S/V(S,A) tuples in R S Hence T(R S) = T(R) T(S) / V(S,A) In general: T(R S) = T(R) T(S) / max(V(R,A),V(S,A)) A A A

Estimating Sizes Example: T(R) = 10000, T(S) = 20000 V(R,A) = 100, V(S,A) = 200 How large is R S ? Answer: T(R S) = 10000 20000/200 = 1M A A

Estimating Sizes Joins on more than one attribute: T(R S) = T(R) T(S)/max(V(R,A),V(S,A))max(V(R,B),V(S,B)) A,B

Histograms Statistics on data maintained by the RDBMS Makes size estimation much more accurate (hence, cost estimations are more accurate)

Histograms Employee(ssn, name, salary, phone) Maintain a histogram on salary: T(Employee) = 25000, but now we know the distribution Salary: 0..20k 20k..40k 40k..60k 60k..80k 80k..100k > 100k Tuples 200 800 5000 12000 6500 500

Histograms Ranks(rankName, salary) Estimate the size of Employee Ranks 20k..40k 40k..60k 60k..80k 80k..100k > 100k 200 800 5000 12000 6500 500 Ranks 0..20k 20k..40k 40k..60k 60k..80k 80k..100k > 100k 8 20 40 80 100 2

Histograms Assume: V(Employee, Salary) = 200 V(Ranks, Salary) = 250 Then T(Employee Ranks) = = Si=1,6 Ti Ti’ / 250 = (200x8 + 800x20 + 5000x40 + 12000x80 + 6500x100 + 500x2)/250 = …. Salary

Completing the Physical Query Plan Choose algorithm to implement each operator Need to account for more than cost: How much memory do we have ? Are the input operand(s) sorted ? Decide for each intermediate result: To materialize To pipeline

Example 7.38 Logical plan is: Main memory M = 101 buffers k blocks U(y,z) 10,000 blocks R(w,x) 5,000 blocks S(x,y) 10,000 blocks

Example 7.38 Naïve evaluation: 2 partitioned hash-joins Cost 3B(R) + 3B(S) + 4k + 3B(U) = 75000 + 4k R(w,x) 5,000 blocks S(x,y) 10,000 blocks U(y,z) k blocks

Example 7.38 R(w,x) 5,000 blocks S(x,y) 10,000 blocks U(y,z) k blocks Smarter: Step 1: hash R on x into 100 buckets, each of 50 blocks; to disk Step 2: hash S on x into 100 buckets; to disk Step 3: read each Ri in memory (50 buffer) join with Si (1 buffer); hash result on y into 50 buckets (50 buffers) -- here we pipeline Cost so far: 3B(R) + 3B(S)

Example 7.38 R(w,x) 5,000 blocks S(x,y) 10,000 blocks U(y,z) k blocks Continuing: How large are the 50 buckets on y ? Answer: k/50. If k <= 50 then keep all 50 buckets in Step 3 in memory, then: Step 4: read U from disk, hash on y and join with memory Total cost: 3B(R) + 3B(S) + B(U) = 55,000

Example 7.38 R(w,x) 5,000 blocks S(x,y) 10,000 blocks U(y,z) k blocks Continuing: If 50 < k <= 5000 then send the 50 buckets in Step 3 to disk Each bucket has size k/50 <= 100 Step 4: partition U into 50 buckets Step 5: read each partition and join in memory Total cost: 3B(R) + 3B(S) + 2k + 3B(U) = 75,000 + 2k

Example 7.38 Continuing: If k > 5000 then materialize instead of pipeline 2 partitioned hash-joins Cost 3B(R) + 3B(S) + 4k + 3B(U) = 75000 + 4k R(w,x) 5,000 blocks S(x,y) 10,000 blocks U(y,z) k blocks

Example 7.38 Summary: If k <= 50, cost = 55,000 If 50 < k <=5000, cost = 75,000 + 2k If k > 5000, cost = 75,000 + 4k