1
Query Processing Exercise Session 4
2
Question 3
You have to compute R ⋈_{R.A<S.A} S.
Each of R and S has 10,000 blocks.
There are 100 records in a block.
V(R,A)=500 and V(S,A)=5,000.
The buffer size is 1,117 blocks.
Which method would you use to compute this join most efficiently, and what would be the I/O cost?
3
Answer to Question 3
Our methods for size estimation do not apply in this case.
Hash join does not apply at all, because the join condition is not an equality.
Sort-merge join cannot be applied in just one pass after the lists are sorted.
Also, it is not clear how to generalize sort-merge join efficiently to a non-equality join.
So, block nested-loop join is the only option.
With 1,117 buffer blocks, the outer relation is read in 9 chunks, so the I/O cost is 10,000 + 9 × 10,000 = 100,000.
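The arithmetic behind the 100,000 figure can be checked with a minimal sketch. The function name `block_nested_loop_cost` and the `reserved` parameter are illustrative assumptions, not part of the slides; the sketch assumes one buffer block is kept for scanning the inner relation and one for the output.

```python
import math


def block_nested_loop_cost(b_outer, b_inner, buffer_blocks, reserved=2):
    """Estimate block nested-loop join I/O, not counting the final output.

    `reserved` is an assumption: one buffer block for scanning the inner
    relation and one for the output.
    """
    chunk = buffer_blocks - reserved         # outer blocks held per iteration
    iterations = math.ceil(b_outer / chunk)  # scans of the inner relation
    # The outer relation is read once; the inner one once per iteration.
    return b_outer + iterations * b_inner


print(block_nested_loop_cost(10_000, 10_000, 1_117))  # 100000
```

Reserving only one block instead of two gives the same result here, because 10,000 blocks still need 9 chunks.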
4
Question 4
Each of the relations R and S has 1,000,000 blocks.
Describe a generalization of hash join that can compute R ⋈ S even when the buffer size is 102 blocks.
You may assume that you can always find hash functions that suit your needs.
What is the I/O cost of your method?
5
Answer to Question 4
We apply hash join recursively.
First, we use a hash function h1 to divide each relation into 100 buckets, each having a size of 10,000 blocks.
Second, we use another hash function h2 to divide each bucket into 100 new buckets, each having a size of 100 blocks.
101 buffer blocks are sufficient for the above two steps.
6
Answer to Question 4 (cont’d)
Now, we join each pair of corresponding buckets.
A bucket of R and a bucket of S are corresponding if both are for the same value of h1(A) and the same value of h2(A), where A is the join attribute.
Each block of each relation is read or written a total of 5 times: it is read and written in each of the two partitioning steps, and read once more in the final join.
The total I/O cost is 5 × 2,000,000 = 10,000,000.
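The pass count and total cost can be reproduced with a short sketch. The function `recursive_hash_join_cost` is hypothetical; it assumes both relations have the same size, that the hash functions split buckets perfectly evenly, and that a bucket is small enough for the final join once it fits in the buffer minus two blocks (one for reading the other relation, one for output).

```python
import math


def recursive_hash_join_cost(blocks, buffer_blocks, fanout):
    """Return (partitioning passes, final bucket size, total I/O).

    Assumptions: two relations of `blocks` blocks each, perfectly uniform
    hashing, and two buffer blocks reserved during the final join.
    """
    passes = 0
    bucket = blocks
    while bucket > buffer_blocks - 2:
        bucket = math.ceil(bucket / fanout)  # split every bucket again
        passes += 1
    # Each partitioning pass reads and writes every block; the join reads it once.
    ios_per_block = 2 * passes + 1
    return passes, bucket, 2 * blocks * ios_per_block


print(recursive_hash_join_cost(1_000_000, 102, 100))  # (2, 100, 10000000)
```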
7
Question 5
You have to compute R(A,B) ⋈ S(B,C).
Each relation has 1,000,000 blocks.
V(R,B)=1,000 and V(S,B)=2,000.
In each relation, the maximum number of records with the same value for attribute B is twice the average number.
The relations are not sorted.
What is the minimal buffer size that is needed to guarantee that sort-merge join will work optimally?
8
Answer to Question 5
The maximal number of blocks of R with the same value for B is 2(1,000,000)/1,000 = 2,000.
For S, the number is 2(1,000,000)/2,000 = 1,000.
So, P = min(2,000, 1,000) = 1,000.
B(R) + B(S) = 2,000,000, and its square root is approximately 1,414.
We need 1,414 + 1,000 + 1 = 2,415 buffer blocks.
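The same calculation as a small sketch. The helper `max_blocks_per_value` is an illustrative name, and rounding the square root down with `math.isqrt` is an assumption made to match the 1,414 used in the answer.

```python
import math


def max_blocks_per_value(blocks, distinct_values, skew=2):
    """Largest number of blocks sharing one join value (skew x the average)."""
    return skew * blocks // distinct_values


b_r = b_s = 1_000_000
p = min(max_blocks_per_value(b_r, 1_000),   # R: 2,000 blocks
        max_blocks_per_value(b_s, 2_000))   # S: 1,000 blocks
runs = math.isqrt(b_r + b_s)                # floor of the square root: 1,414
buffer_needed = runs + p + 1                # plus one block for the output

print(p, runs, buffer_needed)  # 1000 1414 2415
```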
9
Question 6
Compute R(A,B) ⋈ S(A,C) ⋈ U(A,D).
The relation R has 7,000,000 blocks.
Each of the other two has 1,000,000 blocks.
There are 100 records in a block.
V(R,A)=500,000, V(S,A)=5,000, V(U,A)=5,000.
The buffer size is 2,048 blocks.
Describe the most efficient way of computing the join and give its I/O cost.
10
Answer to Question 6
We take advantage of the fact that the two joins are on the same attribute A and use a modification of hash join.
We start by dividing each relation into 1,000 buckets using the same hash function h on attribute A.
Note that 1,000 is the square root of the size of each of the two smaller relations.
11
Answer to Question 6 (cont’d)
To divide into 1,000 buckets, we need a buffer size of 1,001 blocks, regardless of the size of the relation.
And we have to read each relation once and write it once.
So, for the three relations, the total is 2B(R) + 2B(S) + 2B(U).
12
Answer to Question 6 (cont’d)
Note that three records (of R, S and U) can be joined only if they belong to buckets for the same value of h(A).
To join the three relations, we do the following for each hash value v of h(A), such that all three relations have buckets for v.
First, we read the bucket of S and the bucket of U for the value v into main memory.
Each one of those two buckets has a size of 1,000 blocks, and so the buffer is large enough to have both of them completely in main memory.
13
Answer to Question 6 (cont’d)
Now, we read block by block the bucket of R for the hash value v.
Actually, we can read it 47 blocks at a time, because there is a total of 2,048 blocks in the buffer (we need 1 block for the output, and the two buckets of S and U already take 2,000 blocks).
For each block of R, we join the records of R from that block with the records of S and U that are in main memory.
Each bucket is read just once, so the total I/O cost of this stage is B(R) + B(S) + B(U).
14
Answer to Question 6 (cont’d)
The total I/O cost is 3B(R) + 3B(S) + 3B(U).
As always, we do not count the cost of writing the final result to the disk.
Substituting the sizes of the three relations, we get 3 × 9,000,000 = 27,000,000.
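The figures used across the Question 6 slides (1,000-block buckets, 47 free blocks for R, 27,000,000 I/Os) can be verified with a sketch like the following. The function `three_way_hash_join` is a hypothetical name, and the sketch assumes perfectly even buckets.

```python
def three_way_hash_join(b_r, b_s, b_u, buffer_blocks, buckets):
    """Return (blocks of R readable at a time, total I/O).

    Assumes uniform hashing, so every bucket of S and U has the same size,
    and one buffer block reserved for the output.
    """
    if buckets + 1 > buffer_blocks:
        raise ValueError("not enough buffer blocks to partition")
    partition_cost = 2 * (b_r + b_s + b_u)   # read once, write once
    s_bucket = b_s // buckets
    u_bucket = b_u // buckets
    free_for_r = buffer_blocks - s_bucket - u_bucket - 1
    if free_for_r < 1:
        raise ValueError("buckets of S and U do not fit in the buffer")
    join_cost = b_r + b_s + b_u              # every bucket is read once
    return free_for_r, partition_cost + join_cost


print(three_way_hash_join(7_000_000, 1_000_000, 1_000_000, 2_048, 1_000))
# (47, 27000000)
```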
15
Question 7
Compute π_{A,D}(R(A,B) ⋈ S(B,C) ⋈ U(C,D)).
Each relation has 10,000 blocks.
A record requires 20 bytes for each attribute, and a block has 4,000 bytes for storing records (the rest is used for the header).
V(R,A)=2,000,000, V(R,B)=500, V(S,B)=100, V(S,C)=1,000,000 and V(U,C)=2,000.
The buffer size is 5,219 blocks.
What is the most efficient way of computing the query, and what is the I/O cost?
Note that duplicates are not removed unless explicitly stated otherwise.
16
Answer to Question 7
A block has 100 records if each record has 2 attributes.
When either join is done first, we need only 2 attributes for the next join.
R(A,B) ⋈ S(B,C) has 2·10^9 records.
The size of R(A,B) ⋈ U(C,D) is even bigger, because R and U have no common attribute and the join is a Cartesian product.
S(B,C) ⋈ U(C,D) has 10^6 records and occupies 10,000 blocks after projecting on B and D.
Computing this join first will give the lowest I/O cost.
In both joins, each of the two relations has 10,000 blocks.
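The record counts above follow from the usual estimate T(X)·T(Y)/max(V(X,a), V(Y,a)) for an equality join on attribute a. A minimal sketch, where `join_size` is an illustrative helper name:

```python
RECORDS_PER_BLOCK = 4_000 // (2 * 20)   # two 20-byte attributes per record -> 100


def join_size(t_left, t_right, v_left, v_right):
    """Estimated record count of an equality join on one shared attribute."""
    return t_left * t_right // max(v_left, v_right)


t = 10_000 * RECORDS_PER_BLOCK              # 10^6 records in each relation

rs = join_size(t, t, 500, 100)              # R join S on B
su = join_size(t, t, 1_000_000, 2_000)      # S join U on C
ru = t * t                                  # R and U share no attribute

print(rs, su, ru)                # 2000000000 1000000 1000000000000
print(su // RECORDS_PER_BLOCK)   # 10000 blocks after projecting on B and D
```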
17
Answer to Question 7 (cont’d)
We use block nested-loop join with writing of the intermediate result to a disk.
Due to the buffer size, the outer loop of each join requires only 2 iterations.
Thus, the total I/O cost is 70,000: 30,000 for each of the two joins, plus 10,000 for writing the intermediate result.
This is better than either hash join or sort-merge join.
It is also better than pipelined block nested-loop join, where the optimal policy is to allocate 2,500 blocks to each of the two relations that require more than 1 block, because they have the same size.
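A sketch of how the 70,000 total breaks down, assuming two buffer blocks are reserved in each join (one for the inner relation, one for the output). The helper `bnl` and the variable names are illustrative, not from the slides.

```python
import math


def bnl(b_outer, b_inner, buffer_blocks):
    """Return (outer-loop iterations, I/O cost) of a block nested-loop join."""
    iterations = math.ceil(b_outer / (buffer_blocks - 2))
    return iterations, b_outer + iterations * b_inner


BUFFER = 5_219
iters_1, first_join = bnl(10_000, 10_000, BUFFER)    # S join U
write_intermediate = 10_000                          # projected on B and D
iters_2, second_join = bnl(10_000, 10_000, BUFFER)   # intermediate result join R

total = first_join + write_intermediate + second_join
print(iters_1, iters_2, total)  # 2 2 70000
```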
18
Question 9
Compute σ_{C=5}(R(A,B) ⋈ S(B,C)).
Each relation has 10,000 blocks.
A block has 100 records.
V(R,B)=500, V(S,B)=2,000 and V(S,C)=1,000.
S has a non-clustering index on C.
The buffer size is 512 blocks.
Describe the most efficient way of computing the query and give the I/O cost.
19
Answer to Question 9
We use the index on S to read all the records where C=5.
The index is non-clustering, so the I/O cost is T(S)/V(S,C) = 100·10,000/1,000 = 1,000.
But the result has only 1,000 records and requires only 10 blocks.
So, there are enough buffer blocks to hold the intermediate result.
It remains to read R once.
The total I/O cost is 1,000 + 10,000 = 11,000.
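The cost of this plan can be sketched as follows. The function `indexed_selection_then_join` is a hypothetical name; the sketch assumes one block read per matching record for the non-clustering index and ignores the cost of reading the index itself, as the answer does.

```python
import math

RECORDS_PER_BLOCK = 100


def indexed_selection_then_join(b_r, b_s, v_s_c, buffer_blocks):
    """Return (index lookup I/O, blocks of selected S records, total I/O)."""
    t_s = b_s * RECORDS_PER_BLOCK
    matching = t_s // v_s_c      # expected number of records with C=5
    index_cost = matching        # non-clustering: one block read per record
    result_blocks = math.ceil(matching / RECORDS_PER_BLOCK)
    # Keep one block for scanning R and one for the output.
    if result_blocks + 2 > buffer_blocks:
        raise ValueError("selected records do not fit in the buffer")
    return index_cost, result_blocks, index_cost + b_r


print(indexed_selection_then_join(10_000, 10_000, 1_000, 512))
# (1000, 10, 11000)
```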