Download presentation
Presentation is loading. Please wait.
Published byBrianne White Modified over 9 years ago
1
Chapter 6 Query Execution
2
Query Query Compilation (Chapter 7 ) query plan Query execution metadata ( Chapter 6 ) data the major parts Of the query processor
3
Query Compilation 1.Parse Query parse tree 2.Rewrite Query logical query plan 3.Implement Query physical query plan 2+3 => Query Optimization
4
Metadata for Query Optimization The size of each relation Statistics such as the approximate number and frequency of different values for an attribute. The existence of certain indexes The layout of data on disk.
5
Overview of Query Optimization
6
parse convert apply laws estimate result sizes consider physical plans estimate costs pick best execute {P1,P2,…..} {(P1,C1),(P2,C2)...} Pi answer SQL query parse tree logical query plan “improved” l.q.p l.q.p. +sizes statistics
7
Example: SQL query SELECT title FROM StarsIn WHERE starName IN ( SELECT name FROM MovieStar WHERE birthdate LIKE ‘ %1960 ’ ); (Find the movies with stars born in 1960)
8
Example: Parse Tree SELECT FROM WHERE IN title StarsIn ( ) starName SELECT FROM WHERE LIKE name MovieStar birthDate ‘%1960’
9
Example: Generating Logical Query Plan title StarsIn IN name birthdate LIKE ‘%1960’ starName MovieStar
10
Example: Logical Query Plan title starName=name StarsIn name birthdate LIKE ‘%1960’ MovieStar
11
Example: Improved Logical Query Plan title starName=name StarsIn name birthdate LIKE ‘%1960’ MovieStar Question: Push project to StarsIn?
12
Example: Estimate Result Sizes Need expected size StarsIn MovieStar
13
Physical Query Plan Selecting algorithms to implement each of the operators of the logical query plan. Selecting an order of execution for these operators.
14
Example: One Physical Plan Parameters: join order, memory size, project attributes,... Hash join SEQ scanindex scan Parameters: Select Condition,... StarsInMovieStar
15
Example: Estimate costs L.Q.P P1 P2 …. Pn C1 C2 …. Cn Pick best!
16
Query Execution To execute the operations of physical query plan Major approaches: scanning, hashing, sorting, and indexing Different costs and structures due to the amount of available main memory
17
Textbook outline Chapter 6 6.1Algebra for queries [bags vs sets] - Select, project, join, ….[project list a,a+b->x, … ] - Duplicate elimination, grouping, sorting 6.2Physical operators - Scan,sort, … 6.3-6.10 Implementing operators + estimating their cost
18
An Algebra for Queries Relational Algebra SQL Operators Union ∪ UNION Intersection ∩ INTERSECTION Difference - EXCEPT Selection σ WHERE Projection Π SELECT Product × FROM Join ∞ JOIN Duplicate elimination δ DISTINCT Grouping γ GROUP BY Sorting τ ORDER BY
19
Differences Between SQL Relations and Set Relations are bags Relations have schemas
20
Union, Intersection, and Difference 1.For R U S, a tuple t is in the result as many times as the number of times it is in R plus the number of times it is in S ; 2.For R ∩ S, a tuple t is in the result the minimum of the number times it is in R and S ; 3.For R – S, a tuple t is in the result the number of times it is in R minus it is in S, but not fewer than zero times 。 For example : Suppose R={A, B, B} S={C, A, B, C} R U S={A, A, B, B, B,C, C} R ∩ S={A, B} R – S={B}
21
The Selector Operator Selection σ c (R), produces the bag of those tuples in R that satisfy the condition C 。 Condition C may involve : 1. Arithmetic or string operators on constants and/or attributes ; 2. Comparison between terms constructed in 1 ; 3. Boolean connectives AND, OR, and Not applied to terms constructed in 2. For example : R a b σ a≥1 (R): a b σ b≥3 AND a+b≥6 (R) 0 1 2 3 a b 2 3 4 5 4 5 4 5 2 3 2 3
22
The Projection Operator ΠL(R) is the projection of R onto the list L 。 L can have the following kinds of elements : 1. A single attribute of R ; 2. An expression x y, where x and y are names for attributes 。 In the result, the attribute x of R is renamed with y; 3. An expression E z, where E is an expression, z is a new name for the attribute that results from calculation implied by E 。 For example : R a b c Π a,b+c→x (R) a x 0 1 2 0 3 3 4 5 3 9
23
The Product of Relations The product RXS is a relation whose schema consists of attributes of R and the attributes of S. For example : R: a b S: b c 0 1 1 4 2 3 1 4 2 3 2 5 aR.b S.b c 000222222000222222 1 3 1 2 1 2 1 2 4 5 4 5 4 5 R ×S:
24
Joins Natural Join: R ∞ S = ΠL( σ c(RXS) ), where : 1.C is a condition that equates all pairs of attributes of R and S that have the same name. 2.L is a list of all the attributes of R and S, except that one copy of each pair of equated attributes is omitted. For example : R: a b S: b c 0 1 ∞ 1 4 2 3 1 4 2 3 2 5 a b c 0 1 4
25
The theta-join: R ∞ S = σ c(RXS), c If C is x = y, x from R and y from S, we call it equijoin For example : R: a b S: b c 0 1 1 4 2 3 1 4 2 3 2 5 ∞ a+R.b<c+S.b a R.b S.b c 0 1 1 4 0 1 2 5 2 3 2 5 For example : R: a b S: b c 0 1 1 4 2 3 1 4 2 3 2 5 ∞ b=b a R.b S.b c 0 1 1 4
26
Duplicate Elimination For example : R: a b 0 1 2 3 For relation R, δ ( R) returns the set consisting of one copy of every tuple that appears one or more times in relation R. δ ( R ): a b 0 1 2 3
27
Grouping and Aggregation Grouping: According to Attribute (s) Having: Grouping Condition Aggregation operators: AVG, SUM, COUNT,MIN and MAX
28
For StarsIn (title, year, starName), to find, for each star who has appeared in at least three movies, the earliest year in which they appeared. SQL: SELECT starNme, MIN(year) AS minYear FROM StarsIn GROUP BY starName HAVING COUNT(title)>=3; Π starName,minYear σ ctTitle>=3 γ starName, MIN(year) → minYear, COUNT(title) → ctTitle StarsIn
29
The Sorting Operator τ L (R) denotes relation R is sorted in the order indicated by L 。 For example, for R(a, b, c), τ c,b (R) orders the tuples of R by their value of c, and tuples with the same c-value are ordered by their b value. Tuples that agree on both b and c may be ordered arbitrarily.
30
Logical Query Plan For example, there are two tables : MovieStar(name,addr,gender,birthdate) StarsIn(title,year,starName) Convert SQL statements into logical query plan : SELECT title,birthdate FROM MovieStar,StarIn WHERE year = 1996 AND gender=‘F’ AND starName=name; Π title,birthdate Πtitle,birthdate ∞ σ year=1996 AND gender=‘F’AND starName=name starName=name × σ gender= ‘ F ’ σ year=1996 MovieStar StarIn MovieStar StarsIn
31
Physical-Query-Plan Scanning a table Executing algebraic operators Passing data among operators
32
Scanning Tables Table–scan Index–scan
33
Sorting While Scanning Tables (1) scanning index (2) main-memory sorting algorithm (3) multiway merge sort
34
The Model of Computation for Physical Operators Use the number of disk I/O as the measure of cost for an operation Assume the arguments of any operator on disk and the result left in memory Pipeline the result of one operator to another operator
35
Parameters for Measuring Costs M: the number of main-memory buffers available to an execution of a particular operator B(R): the number of blocks that are needed to hold all the tuples of R T(R): the number of tuples in R V(R,a): the number of distinct values of the column for a in R
36
I/O Cost for Sort-Scan Operator Fits in main memory Does not fit in main memory ClusteredB3B Not ClusteredTT+2B
37
Iterators for Implementation of Physical Operators Open (R) { b:= the first block of R; t:= the first tuple of block b; Found:= TRUE; } GetNext(R) { IF ( t is past the last tuple on block b ) { increment b to the next block; IF ( there is no next block) { FOUND:=FALSE;RETURN;} ELSE t:= first tuple on block b; oldt:= t; increment t to the next tuple of b; RETURN oldt } Close(R) { }
38
Building a union iterator from its components Open (R,S) { R.Open(); CurRel:=R; } GetNext (R,S) { IF (CurRel=R) { t:=R.GetNext(); IF (FOUND) RETURN t; ELSE S.Open(); CurRel:=S; } } RETURN S.GetNext() } Close(R,S){ R.Close() ; S.Close(); }
39
Implementation Algorithms for Algebraic Operators 1.Sorting-based methods 2.Hash-based methods 3.Index-based methods Three degrees of difficulty and cost: one-pass, two-passes, three or more passes
40
One-Pass Algorithms for Database Operators 1.Tuple-at-a-time, unary operations 2.Full-relation, unary operations 3.Full-relation, binary operations
41
R Input bufferOutput buffer Unary operate COST : If R is clustered , disk I/O is B; If R is not clustered , disk I/O is T. One-Pass Algorithms for Tuple-at-a-Time Operations
42
One-Pass Algorithms for Unary, Full-Relation Operations Duplication Eliminationδ(R) Read a block from disk , for each tuple to make a decision : 1. If it has not been seen before, copy it to the output ; 2. Otherwise, ignore it 。 In order to support this decision , keep one copy in memory 。 B(δ(R)) ≤ M-1 → B(δ(R)) ≤ M R Input buffer Seen before? M-1 buffers Output buffer Main Memory: Hash Table or Binary Search Tree
43
One-Pass Algorithm for Binary Operations Set Union Set Intersection Set Difference Bag Intersection Bag Difference Product Natural Join S R ∩ Output B(S) <= M-1
44
Nested-Loop Joins Tuple-Based Nested-Loop Join Block-Based Nested-Loop Join They can be used for relations of any size.
45
Tuple-Based Nested-Loop Join R(X,Y)∞S(Y,Z) FOR each tuple s in S DO FOR each tuple r in R DO IF r and s join to make a tuple t THEN output t ; Cost may be T(R)T(S) Disk I/O.
46
An Iterator for Tuple-Based Nested-Loop Join Open (R,S) { R.Open(); S.Open(); S:=S.GetNext(): } GetNext (R,S) { REPEAT { r:=R.GetNext (); IF (NOT Found) { R.Close(); S:=S.GetNext(); IF (NOT Found) RETURN; R.Open(); r:=R.GetNext (); } } UNTILL (r and s join); RETURN the join of r and s; } Close (R,S) { R.Close(); S.Close() }
47
Block-Based Nested-Loop Join Organizing access to both argument relations by blocks Using as much main memory as we can to store tuples belonging to the relation S, the relation of the outer loop
48
FOR each chunk of M-1 blocks of S DO BEGIN read these blocks into main-memory buffers; organize their tuples into a search structure whose search key is the common attributes of R and S; FOR each block b of R DO BEGIN read b into main memory; FOR each tuple t of b DO BEGIN find the tuples of S in main memory that can join with t; output the join of t with each these tuples; END; The nested-loop join algorithm
49
Assume B(R) = 1000, B(S) = 500 and M = 101 Using S in the outer loop, we need 5X(100+1000) = 5500 Disk I/O Using R in the outer loop, we need 10X(100+500) = 6000 Disk I/O There is a slight advantage to using the smaller relation in the outer loop.
50
Summary of Algorithms Operators Approximate M required Disk I/O σ Π 1 B γ δ B B ∪ ∩ - × ∞ min (B(R),B(S)) B(R)+B(S) ∞ M≥2 B(R)B(S)/M
51
Two-Pass Algorithms Based on Sorting Two-pass algorithm for sorting B(R) (B(R)>M): Read M blocks of R into main memory Sort M blocks, using an efficient sorting algorithm Write the sorted list into M blocks of disk. Main memory disk 1.Read data, process in some way 2.Write out to disk, again 3.Reread from disk to complete the operation First pass
52
Duplicate Elimination Using Sorting R M buffersSame M buffers Read Write Sorting Sorted sublists of R Reread Choose the first unconsidered tuple t in sorted order Output the first copy of t and remove the other copies of t. All of sorted sublists are exhausted.
53
Example 6.15: assume M=3, and only two tuples fit on a block. The relation R consists of 17 tuples: ( 2,5,2,1,2,2,) ( 4,5,4,3,4,2,) (1,5,2,1,3) R1 R2 R3 sublistIn memoryWaiting on disk R1 1 2 2 2, 2 5 R2 2 3 4 4, 4 5 R3 1 1 2 3, 5 sublistIn memoryWaiting on disk R1 2 2 2, 2 5 R2 2 3 4 4, 4 5 R3 2 3 5 sublistIn memoryWaiting on disk R1 5 R2 3 4 4, 4 5 R3 3 5 sublistIn memoryWaiting on disk R1 5 R2 4 4 4 5 R3 5
54
Analysis: The total cost of this algorithm is 3B(R). to compute δ(R) with the two-pass algorithm requires only √B(R) blocks of memory since B≤M(M-1)≈M² 1.B(R) : to read each block of R when creating the sorted sublists. 2.B(R) : to write each of the sorted sublists to disk. 3.B(R): to read each block from the sublists at the appropriate time
55
Grouping and Aggregation Using Sorting This algorithm for γ takes 3B(R) disk I/O’s, as long as B(R) ≤M ² One pass to generate the sorted sublist of R γL(R) Find the least value v of the sort key ( grouping attributes) All tuples with sort key v becomes the next group Compute all the aggregates on the group Output the group as tuple of the result Until all sublists are exhausted.
56
Sort-Based Algorithms for Union, Intersection and Difference Each of the algorithms takes 3(B(R)+B(S)) disk I/O’s, as long as B(R)+B(S)≤M*M Generate the sorted sublists for R, S respectively Use one buffer for each sublist of R and S. Repeatedly find the first remaining tuple t among all the buffers. Union: copy t to the output, and remove all copies of t from the buffers. If a buffer becomes empty, reload it with the next block from its sublist. Intersection : Set: output t if it appears in both R and S. Bag: output t the minimum of the number of times it appears in R and S. Difference : R ― s S output t if and only if it appears in R but not in S R ― B S: output t the number of times it appears in R minus the number of times it appears in S.
57
Example 6.16: make the same assumptions as Ex6.15. And R has 12 tuples and S has 5 tuples. Ask R ― B S:? (2,5,2,1,2,2,) ( 4,5,4,3,4,2,) (1,5,2,1,3) R1 R2 S1 sublistIn memoryWaiting on disk R1 1 2 2 2, 2 5 R2 2 3 4 4, 4 5 S1 1 1 2 3, 5 sublistIn memoryWaiting on disk R1 2 2 2, 2 5 R2 2 3 4 4, 4 5 S1 2 3 5 sublistIn memoryWaiting on disk R1 5 R2 3 4 4, 4 5 S1 3 5 sublistIn memoryWaiting on disk R1 5 R2 4 4 4 5 S1 5 The output is 2,2,2,2,4,4,4,5.
58
A Simple Sort-Based Join Algorithm R(X,Y) join S(Y,Z) Sort RSort S Buffer for R Buffer for S The least value y Appears in both of R and S Does not appear in the other Remove the tuples with sort key yIdentify all the tuples from both R and S Output all of the joined tuples
59
Example 6.17 Assume: B(R)=1000, B(S)=500, M=101. Then we use 4(B(R)+B(S))=6000 disk I/O’s to sort with two-phase multiway merge sort. When we merge the sorted R and S to find the joined tuples, we use another B(R)+B(S)=1500 disk I/O’s. So,the total number of disk I/O’s is 5(B(R)+B(S))=7500. Analysis : The sort-join has linear I/O cost, taking time proportional to B(R)+B(S), while the nested-loop join is a quadratic algorithm, taking time proportional to B(R)B(S). It is only the constant factors and the small size of the example that make nested-loop join preferable. And the sort-join requires B(R)≤M*M, B(S) ≤M*M to work.
60
If the tuples from one of the relations fit in M-1 buffers, perform one-pass join on the tuples with the sort key y from the two relations. If neither of them can fit in M-1 buffers, perform a nested-loop join on the tuples with the sort key y from the two relations. The number of tuples with Y-value y does not fit in M buffers
61
A More Efficient Sort-Based Join Sort RSort S Sorted Sublists of RSorted Sublists of S Buffer for each sublists Find the least Y-value y Identify all the tuples of both relations that have Y-value y Output all the joined tuples Repeatedly
62
Analysis Example 6.18 :B(R)=1000, B(S)=500, M=101.Divide R into 10 sublists, S into 5 sublists, each of length 100. Use 15 buffers to hold the current blocks of each sublist. The total number of disk I/O’s is 3(B(R)+B(S))=4500. It requires B(R)+B(S)≤M*M.
63
Summary of Sort-Based Algorithm Operators Approximate M Required Disk I/O γ δ √B3B ∪ ∩ - √(B(R)+B(S))3(B(R)+B(S)) ∞√max(B(R),B(S))5(B(R)+B(S)) ∞√(B(R)+B(S))3(B(R)+B(S))
64
Two-Pass Algorithms Based on Hashing Pass One Hash all the tuples of the argument or arguments using an appropriate hash function. Pass Two Perform the operation by working on one bucket at a time (or on a pair of buckets with the same hash value)
65
Partitioning Relations by Hashing Initialize M-1 buckets using M-1 empty buffers; FOR each block b of relation R DO BEGIN Read block b into the Mth buffer; FOR each tuple t in b DO BEGIN IF the buffer for bucket h(t) has no room for t THEN BEGIN Copy the buffer to disk; Initialize a new empty block in that buffer; END; Copy t to the buffer for bucket h(t); END; FOR each bucket DO IF the buffer for this bucket is not empty THEN write the buffer to disk ;
66
Hash-Based Algorithms for Union, Intersection, and Difference RSRelations R1RiS1 Si …………Hash R and S by the same hash functions. One-pass algorithm to each pair of corresponding buckets. For example, R ∩ S = (R1 ∩ S1) U … (Ri ∩ Si) U … Cost: 3(B( R ) +B(S)) disk I/O Requirement: min(B( R ), B(S)) <= M(M-1)
67
The Hash-Join Algorithm RSRelations R1RiS1 Si ………… Hash R and S by the same hash functions with the join attributes as the hash key. One-pass algorithm to each pair of corresponding buckets. R ∞ S = (R1 ∞ S1) U … (Ri ∞ Si) U … Cost: 3(B( R ) +B(S)) disk I/O Requirement: min(B( R ), B(S)) <= M(M-1)
68
Saving Some Disk I/O’s by Hybrid Hash-Join k-m buckets m buckets Requirements : mB(S)/k+k-m≤M 。 Keep m of the k buckets entirely in main memory. Keep only one block for each of the other k – m buckets. Cost: (3- 2M/B(S))(B(R)+B(S))
69
Operators M I/O’s δ,γ ∩ ∪ ― ∞ √B √B(S) 3B 3(B(R)+B(S)) Summary of Hash-Based Algorithms (B(S)<B(R) )
70
Index-Based Algorithm The existence of an index on one or more attributes of a relation makes available some algorithms that would not be feasible without index.
71
Clustering and Nonclustering Indexes A clustering index has all tuples with a fixed value packed into (close to) the minimum possible number of blocks. a1 a1 a1 All the a1-tuples
72
Index-Based Selection For σ a=v (R), there is an index on a, 1.Search the index with value v; 2.Get the pointers to exactly those tuples of R that have a-value v; 3.Retrieve tuples following these pointers. Cost: B(R)/V(R,a) (Index is clustered); T(R) /V(R,a) (Index is non-clustered) T(R) /V(R,a) (Index is non-clustered)
73
For example: B(R)=1000, T(R)=20000, let a be one of the attributes of R, suppose there is an index on a, and consider the operation σ a=0 (R), : –If R is clustered , without index, cost is 1000 –If R is not clustered , without index, cost is 20000 –If V(R,a)=100, with the clustering index, cost is B(R)/V(R,a) =10 –If V(R,a)=100, with the nonclustering index , cost is T(R) /V(R,a) =200. –If V(R,a)=20000, a is the key of R , cost is 1.
74
Joining by Using an Index Suppose R(X,Y)∞S(Y,Z) with an index on the attribute(s) Y of S. 1.For each tuple t of R, use the index to find all those tuples of S having the same Y-value. 2.Output the join of each of these tuples with t. Disk I/O : –Read all the tuples of R B(R) ( R is clustered) ; T(R) (R is nonclustered) –Retrieve tuples of S Index is clustered, T(R)B(S)/V(S,Y) Index is not clustered, T(R)T(S)/V(S,Y)
75
Joins Using a Sorted Index Suppose that we have relations R(X,Y) and S(Y,Z) with indexes on Y for both relations. A zig-zag join using two indexes are performed as follows: R 1, 3, 4 4, 4, 5, 6 S 2 2 4 4 6 7
76
Buffer Management Buffer manager Read / Write Buffers Requests 1.The buffer manager controls main memory directly. 2.The buffer manager allocates buffers in virtual memory.
77
Buffer Management Strategies Least-Recently Used (LRU) First-In-First-Out (FIFO) The “Clock” Algorithm System Control 0 01 01 0 1 1
78
The Relationship Between Physical Operator Selection and Buffer Management The selection of algorithms depends on the number of the available main-memory buffers. Some algorithms can adapt to changes in the number of the available main-memory buffers. The buffer-replacement strategy has impact on the number of disk I/Os for execution of the physical operators.
79
Algorithms Using More Than Two Passes Multipass Sort-Based Algorithm Multipass Hash-Based Algorithm
80
Multipass Sort-Based Algorithms BASIS: If R fits in M blocks, then read R into main memory, sort it using your favorite main-memory sorting algorithm, and write the sorted relation to disk. INDUCTION: If R does not fit into main memory, partition the blocks holding R into M groups, which we shall call R 1, R 2, …,R M. Recursively sort Ri for each i = 1,2,…,M. Then, merge the M sorted sublists.
81
Performance of Multipass, Sort-Based Algorithm Let s(M,k) be the maximum size of a relation that we can sort using M buffers and k passes. Basis: If k=1, s(M,1)=M. Induction : If k>1, partition R into M Pieces, each of which must be sortable in k-1 passes. if B(R)=s(M,k), B(R)/M≤s(M,k-1), then s(M,k)= Ms(M,k-1). s(M,k)= Ms(M,k-1)=M²s(M,k-2)=...=M^(k-1) s(M,1)=M^k.
82
Multipass Hash-Based Algorithm BASIS: For a unary operation, if the relation fits in M buffers, read it into memory and perform the operation. For a binary operation, if either relation fits in M-1 buffers, perform one pass algorithm. INDUCTION: If no relation fits in main memory, then hash each relation into M-1 buffers. Recursively perform the operation on each bucket or corresponding pair of buckets, and accumulate the output from each bucket or pair.
83
Performance of Multipass Hash-Based Algorithms Unary Operation: Let u(M,k) be the number of blocks in the largest relation that a k-pass hashing algorithm can handle. –Basis : u(M,1)=M. –Induction : Divide the relation R into M-1 buckets of equal size, u(M,k)=(M-1)u(M,k-1). u(M,k)=M(M-1)^(k- 1)≈M^k Binary Operation, R(X,Y)∞S(Y,Z) : Let j(M,k) be an upper bound on the size of the smaller of R and S 。 –Basis : j(M,1)=M-1. –Induction : j(M,k)=(M-1)j(M,k-1), So j(M,k)=(M-1)^k.
84
Parallel Algorithms for Relational Operations Models of Parallelism Tuple-at-a-Time Operations in Parallel Parallel Algorithms for Full-Relation Operations Performance of Parallel Algorithms
85
Models of Parallelism MMM PPP A shared-memory machine
86
PPP MMM A shared-disk machine
87
MM M PPP A shared-nothing machine
88
Tuple-at-a-Time Operations in Parallel R R1 R2Ri Rn… … P1P2 PiPn σc(R) = σc (R1) U σc(R2) U …U σc(Ri) U … U σc(Rn) Partition R to n Processors evenly.
89
Parallel Algorithms for Full-Relation Operations R S … …R1, S1Ri, SiRn, Sn P1 P2 P3 Partition R and S by the same hash functions, perform the binary operation at each processor.
90
Performance of Parallel Algorithms Unary Operation: B(R)/p Binary Operation: 5(B(R) + B(S))/p p is the number of the parallel processors.
91
Exercise Ex 6.1.6 (d), EX 6.3.1 (e) Ex 6.4.3 (a), Ex 6.5.3 (a) Ex 6.6.2, Ex 6.7.2 (b) Ex 6.8.1 (c), Ex 6.10.1 (b)
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.