Query Processing Exercise Session 4.


Question 3
You have to compute R ⋈_{R.A<S.A} S. Each of R and S has 10,000 blocks. There are 100 records in a block. V(R,A)=500 and V(S,A)=5,000. The buffer size is 1,117 blocks. Which method would you use to compute this join most efficiently, and what would be the I/O cost?

Answer to Question 3
Our methods for size estimation do not apply in this case. Hash join does not apply at all, because the join condition is not an equality. Sort-merge join cannot be applied in just one pass after the lists are sorted, and it is not clear how to generalize it efficiently to a non-equality join. So, block nested-loop join is the only option. With 1,117 buffer blocks, the outer relation is processed in 9 chunks, so the I/O cost is 10,000 + 9 × 10,000 = 100,000.
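The arithmetic behind this answer can be checked with a small sketch. The function name `block_nested_loop_cost` and the choice to reserve exactly one buffer block for scanning the inner relation are assumptions made for illustration; reserving a second block for output gives the same number of chunks here.

```python
import math

def block_nested_loop_cost(b_outer, b_inner, buffer_blocks):
    """Estimated I/O cost (in blocks) of a block nested-loop join.

    Assumption: one buffer block is reserved for scanning the inner
    relation, and the rest holds a chunk of the outer relation.
    The cost of writing the final result is not counted.
    """
    chunk = buffer_blocks - 1
    iterations = math.ceil(b_outer / chunk)
    # The outer relation is read once; the inner one once per chunk.
    return b_outer + iterations * b_inner

q3_cost = block_nested_loop_cost(10_000, 10_000, 1_117)
print(q3_cost)
```

For the numbers of Question 3, the outer relation is split into 9 chunks and the sketch yields 100,000.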

Question 4
Each of the relations R and S has 1,000,000 blocks. Describe a generalization of hash join that can compute R ⋈ S even when the buffer size is 102 blocks. You may assume that you can always find hash functions that suit your needs. What is the I/O cost of your method?

Answer to Question 4
We apply hash join recursively. First, we use a hash function h1 to divide each relation into 100 buckets, each having a size of 10,000 blocks. Second, we use another hash function h2 to divide each bucket into 100 new buckets, each having a size of 100 blocks. 101 buffer blocks are sufficient for these two steps (one input block and 100 output blocks).

Answer to Question 4 (cont’d)
Now, we join each pair of corresponding buckets. A bucket of R and a bucket of S are corresponding if both are for the same value of h1(A) and the same value of h2(A), where A is the join attribute. Each relation is read three times and written twice, for a total of 5 passes. The total I/O cost is 5 × 2,000,000 = 10,000,000.
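The pass count can be reproduced with a short sketch. It assumes perfectly uniform hash functions (as the question allows) and that the final join needs the smaller bucket of each pair to fit in the buffer minus two blocks (one for input, one for output). The function name and the `fanout` parameter are illustrative.

```python
import math

def recursive_hash_join_cost(b_r, b_s, buffer_blocks, fanout):
    """Estimated I/O cost of hash join with recursive partitioning.

    Assumptions: hashing is perfectly uniform, each partitioning level
    splits every bucket into `fanout` buckets, and a bucket is small
    enough once it fits in buffer_blocks - 2 blocks.
    Returns (total cost, partitioning levels, final bucket size).
    """
    bucket = min(b_r, b_s)
    levels = 0
    while bucket > buffer_blocks - 2:
        bucket = math.ceil(bucket / fanout)
        levels += 1
    # Each level reads and writes both relations; the join reads them once.
    passes = 2 * levels + 1
    return passes * (b_r + b_s), levels, bucket

q4_cost, q4_levels, q4_bucket = recursive_hash_join_cost(
    1_000_000, 1_000_000, buffer_blocks=102, fanout=100
)
print(q4_cost, q4_levels, q4_bucket)
```

With the numbers of Question 4, two partitioning levels are needed, the final buckets have 100 blocks, and the cost is 10,000,000.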

Question 5
You have to compute R(A,B) ⋈ S(B,C). Each relation has 1,000,000 blocks. V(R,B)=1,000 and V(S,B)=2,000. In each relation, the maximum number of records with the same value for attribute B is twice the average number. The relations are not sorted. What is the minimal buffer size that is needed to guarantee that sort-merge join will work optimally?

Answer to Question 5
The maximal number of blocks of R with the same value for B is 2 × 1,000,000 / 1,000 = 2,000. For S, the number is 2 × 1,000,000 / 2,000 = 1,000. So, P = min(2,000, 1,000) = 1,000. B(R) + B(S) = 2,000,000, and its square root is approximately 1,414. We need 1,414 + 1,000 + 1 = 2,415 buffer blocks.
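The following sketch only replays the arithmetic of this answer; it does not simulate the join. The helper name `max_blocks_per_value` and the `skew` parameter are illustrative, and the square root is rounded down to match the figure on the slide.

```python
import math

def max_blocks_per_value(blocks, distinct_values, skew=2):
    """Largest number of blocks sharing one join value, assuming the
    maximum is `skew` times the average (as stated in the question)."""
    return skew * blocks // distinct_values

b_r = b_s = 1_000_000
p_r = max_blocks_per_value(b_r, 1_000)   # blocks of R for one value of B
p_s = max_blocks_per_value(b_s, 2_000)   # blocks of S for one value of B
p = min(p_r, p_s)                        # only the smaller group is kept in memory

sublists = math.isqrt(b_r + b_s)         # integer square root, rounded down
buffer_needed = sublists + p + 1
print(p_r, p_s, sublists, buffer_needed)
```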

Question 6
Compute R(A,B) ⋈ S(A,C) ⋈ U(A,D). The relation R has 7,000,000 blocks. Each of the other two has 1,000,000 blocks. There are 100 records in a block. V(R,A)=500,000, V(S,A)=5,000, V(U,A)=5,000. The buffer size is 2,048 blocks. Describe the most efficient way of computing the join and give its I/O cost.

Answer to Question 6
We take advantage of the fact that the two joins are on the same attribute A and use a modification of hash join. We start by dividing each relation into 1,000 buckets using the same hash function h on attribute A. Note that 1,000 is the square root of the size of each of the two smaller relations.

Answer to Question 6 (cont’d)
To divide into 1,000 buckets, we need a buffer size of 1,001 blocks, regardless of the size of the relation. We have to read each relation once and write it once. So, for the three relations, the total is 2B(R) + 2B(S) + 2B(U).

Answer to Question 6 (cont’d)
Note that three records (of R, S and U) can be joined only if they belong to buckets for the same value of h(A). To join the three relations, we do the following for each hash value v of h(A) such that all three relations have buckets for v. First, we read the bucket of S and the bucket of U for the value v into main memory. Each of those two buckets has a size of 1,000 blocks, so the buffer is large enough to hold both of them completely in main memory.

Answer to Question 6 (cont’d)
Now, we read the bucket of R for the hash value v block by block. Actually, we can read it 47 blocks at a time, because there is a total of 2,048 blocks in the buffer (we need 1 block for the output, and the two buckets of S and U already take 2,000 blocks). For each block of R, we join the records of R from that block with the records of S and U that are in main memory. Each bucket is read just once, so the total I/O cost of this stage is B(R) + B(S) + B(U).

Answer to Question 6 (cont’d)
The total I/O cost is 3B(R) + 3B(S) + 3B(U). As always, we do not count the cost of writing the final result to the disk. Substituting the sizes of the three relations, we get 27,000,000.
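The cost and the 47-block read size can be checked with a sketch of the plan described above. It assumes uniform hashing, so that every bucket of S and U has the same size; the function name is illustrative.

```python
import math

def three_way_hash_join_cost(b_r, b_s, b_u, buffer_blocks, buckets):
    """Estimated I/O cost of the modified hash join for R, S and U.

    Assumptions: uniform hashing on the shared join attribute, the
    buckets of S and U for one hash value are held in memory together,
    and one buffer block is reserved for output.
    Returns (total cost, blocks of R that can be read at a time).
    """
    if buckets + 1 > buffer_blocks:
        raise ValueError("not enough buffer blocks to partition")
    s_bucket = math.ceil(b_s / buckets)
    u_bucket = math.ceil(b_u / buckets)
    free_for_r = buffer_blocks - s_bucket - u_bucket - 1
    if free_for_r < 1:
        raise ValueError("buckets of S and U do not fit in memory")
    partition_cost = 2 * (b_r + b_s + b_u)  # read once, write once
    join_cost = b_r + b_s + b_u             # every bucket is read once
    return partition_cost + join_cost, free_for_r

q6_cost, q6_r_chunk = three_way_hash_join_cost(
    7_000_000, 1_000_000, 1_000_000, buffer_blocks=2_048, buckets=1_000
)
print(q6_cost, q6_r_chunk)
```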

Question 7
Compute π_{A,D}(R(A,B) ⋈ S(B,C) ⋈ U(C,D)). Each relation has 10,000 blocks. A record requires 20 bytes for each attribute, and a block has 4,000 bytes for storing records (the rest is used for the header). V(R,A)=2,000,000, V(R,B)=500, V(S,B)=100, V(S,C)=1,000,000 and V(U,C)=2,000. The buffer size is 5,219 blocks. What is the most efficient way of computing the query, and what is the I/O cost? Note that duplicates are not removed unless explicitly stated otherwise.

Answer to Question 7
A block has 100 records if each record has 2 attributes. Whichever join is done first, we need only 2 attributes of its result for the next join. R(A,B) ⋈ S(B,C) has 2·10^9 records. The size of R(A,B) ⋈ U(C,D) is even bigger, since the two relations have no common attribute. S(B,C) ⋈ U(C,D) has 10^6 records and occupies 10,000 blocks after projecting on B and D. Computing this join first will give the lowest I/O cost. In both joins, each of the two relations has 10,000 blocks.
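The size estimates above follow from the standard formula T(R)·T(S) / max(V(R,B), V(S,B)) for an equality join. A minimal sketch, with an illustrative function name and integer division chosen for simplicity:

```python
def equijoin_size(t_left, t_right, v_left, v_right):
    """Estimated number of records in an equality join on one attribute,
    where v_left and v_right are the distinct-value counts of that
    attribute in the two relations."""
    return t_left * t_right // max(v_left, v_right)

records_per_block = 4_000 // (2 * 20)   # two attributes of 20 bytes each
t = 10_000 * records_per_block          # records in each relation

size_rs = equijoin_size(t, t, 500, 100)           # R join S on B
size_su = equijoin_size(t, t, 1_000_000, 2_000)   # S join U on C
size_ru = t * t                                   # no common attribute: Cartesian product
blocks_su = size_su // records_per_block          # after projecting on B and D

print(size_rs, size_su, size_ru, blocks_su)
```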

Answer to Question 7 (cont’d)
We use block nested-loop join and write the intermediate result to disk. Due to the buffer size, the outer loop of each join requires only 2 iterations. Thus, each join costs 10,000 + 2 × 10,000 = 30,000, writing the intermediate result costs 10,000, and the total I/O cost is 70,000. This is better than either hash join or sort-merge join. It is also better than pipelined block nested-loop join, where the optimal policy is to allocate 2,500 blocks to each of the two relations that require more than 1 block, because they have the same size.
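The total for the materialized plan can be replayed as follows. This is a sketch under the assumption that one buffer block is reserved for scanning the inner relation of each join; the function names are illustrative, and the alternative plans mentioned above are not modelled.

```python
import math

def bnl_cost(b_outer, b_inner, buffer_blocks):
    """Block nested-loop join cost, assuming one buffer block is
    reserved for the inner relation; output cost is not counted."""
    iterations = math.ceil(b_outer / (buffer_blocks - 1))
    return b_outer + iterations * b_inner

def materialized_plan_cost(b_s, b_u, b_r, b_intermediate, buffer_blocks):
    """Join S with U, write the projected result to disk, then join it with R."""
    first_join = bnl_cost(b_s, b_u, buffer_blocks)
    write_intermediate = b_intermediate
    second_join = bnl_cost(b_intermediate, b_r, buffer_blocks)
    return first_join + write_intermediate + second_join

q7_cost = materialized_plan_cost(10_000, 10_000, 10_000, 10_000, 5_219)
print(q7_cost)
```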

Question 9
Compute σ_{C=5}(R(A,B) ⋈ S(B,C)). Each relation has 10,000 blocks. A block has 100 records. V(R,B)=500, V(S,B)=2,000 and V(S,C)=1,000. S has a non-clustering index on C. The buffer size is 512 blocks. Describe the most efficient way of computing the query and give the I/O cost.

Answer to Question 9
We use the index on S to read all the records where C=5. The index is non-clustering, so the I/O cost is T(S)/V(S,C) = (100 × 10,000)/1,000 = 1,000. But the result requires only 10 blocks, so there are enough buffer blocks to hold the intermediate result. It remains to read R once. The total I/O cost is 1,000 + 10,000 = 11,000.
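A minimal sketch of this calculation follows. As in the answer, it ignores the cost of traversing the index itself and assumes that each matching record costs one block read; the function name and the two-block reserve for input and output are illustrative assumptions.

```python
import math

def index_selection_then_join_cost(t_s, v_s_c, records_per_block, b_r, buffer_blocks):
    """Select from S through a non-clustering index, keep the result in
    memory, then scan R once.

    Assumptions: one block read per matching record, index traversal
    cost ignored, and two buffer blocks reserved for input and output.
    Returns (total cost, blocks occupied by the selected records).
    """
    index_reads = math.ceil(t_s / v_s_c)
    result_blocks = math.ceil(index_reads / records_per_block)
    if result_blocks > buffer_blocks - 2:
        raise ValueError("selected records do not fit in the buffer")
    return index_reads + b_r, result_blocks

q9_cost, q9_result_blocks = index_selection_then_join_cost(
    t_s=100 * 10_000, v_s_c=1_000, records_per_block=100,
    b_r=10_000, buffer_blocks=512,
)
print(q9_cost, q9_result_blocks)
```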