1 Query Processing Two-Pass Algorithms Source: our textbook.

2 Nested-Loop Joins
Another helpful primitive. Main idea to join relations R and S:
    for each tuple s in S do
        for each tuple r in R do
            if r and s match then
                output the join of r and s
How to optimize this?

3 Block-Based Nested Loop Join
    for each chunk of M-1 blocks of S do
        read the blocks into main memory
        create search structure D for the tuples, with the common attributes of R and S as the key
        for each block b of R do
            read b into the M-th block of main memory
            for each tuple t of b do
                use D to find each tuple s of S that joins with t
                output the result of joining t with s
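
A minimal Python sketch of the block-based loop above, assuming each relation is held as a list of blocks and each block is a list of tuples. The function name, the `key_r`/`key_s` accessors, and the dict used as the search structure D are illustrative choices, not part of the slides.

```python
from collections import defaultdict

def block_nested_loop_join(r_blocks, s_blocks, m, key_r, key_s):
    """Join R and S, where S is read in chunks of M-1 blocks (hypothetical helper)."""
    if m < 2:
        raise ValueError("need at least 2 buffers")
    out = []
    chunk = m - 1
    for start in range(0, len(s_blocks), chunk):
        # Build the search structure D over the current chunk of S.
        d = defaultdict(list)
        for block in s_blocks[start:start + chunk]:
            for s in block:
                d[key_s(s)].append(s)
        # Scan all of R once per chunk, one block at a time.
        for b in r_blocks:
            for t in b:
                for s in d.get(key_r(t), []):
                    out.append(t + s)
    return out
```

With M = 2 the chunk size is one block, so R is scanned once for every block of S; larger M reduces the number of scans of R.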

4 Analysis of Nested-Loop Join
- Assume S is the smaller relation
- Number of disk I/O's:
  - number of disk I/O's per iteration of the outer loop, times the number of iterations of the outer loop
  - each iteration reads M-1 blocks of S and all B(R) blocks of R, so this is: (M-1 + B(R)) * B(S)/(M-1)
- If B(S) ≤ M-1, then this is the same as the one-pass join algorithm.
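
The estimate can be checked with a few lines of Python. This sketch assumes the number of outer iterations is rounded up to a whole number, so it agrees with the formula above exactly when M-1 divides B(S); the function name is hypothetical.

```python
import math

def nested_loop_join_io(b_r, b_s, m):
    """Disk I/Os for block-based nested-loop join with S as the outer relation."""
    if m < 2:
        raise ValueError("need at least 2 buffers")
    iterations = math.ceil(b_s / (m - 1))   # chunks of M-1 blocks of S
    # Every block of S is read once; R is read once per chunk of S.
    return b_s + iterations * b_r

print(nested_loop_join_io(1000, 500, 101))  # 5500
```

When B(S) ≤ M-1 there is a single iteration and the cost collapses to B(R) + B(S), the one-pass join cost.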

5 Two-Pass Algorithms
- Used when relations are too large to fit in memory
- Pieces of the relation are read into memory, processed in some way, and then written back to disk
- The pieces are then reread from disk to complete the operation
- Why only two passes? Usually this is enough, but the ideas can be extended to more passes.

6 Using Sorting as a Tool
- Suppose we have M blocks of main memory available, but B(R) > M
- Repeatedly
  - read M blocks of R into main memory
  - sort these blocks (shouldn't take more time than one disk I/O)
  - write the sorted blocks to disk; these are called sorted sublists
- Do a second pass to process the sorted sublists in some way
- Usually we require that there are at most M sublists, i.e., B(R) ≤ M^2. Gaining a factor of M using preprocessing!
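
The first pass can be sketched as follows, assuming the relation is a list of blocks and that appending to a Python list stands in for writing a sorted sublist to disk. The function name is hypothetical.

```python
def make_sorted_sublists(blocks, m, key=None):
    """Pass 1: sort each run of M blocks in memory and return the sorted sublists."""
    sublists = []
    for start in range(0, len(blocks), m):
        # Read up to M blocks and flatten them into one in-memory run.
        run = [t for block in blocks[start:start + m] for t in block]
        run.sort(key=key)
        sublists.append(run)          # stands in for "write to disk"
    if len(sublists) > m:
        # The second pass needs one buffer per sublist.
        raise ValueError("more than M sublists: B(R) exceeds M^2")
    return sublists
```

The final check mirrors the requirement of at most M sublists: with B(R) blocks and runs of M blocks there are about B(R)/M sublists, which is at most M exactly when B(R) ≤ M^2.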

7 Duplicate Elimination Using Sorting
- Create the sorted sublists
- Read the first block of each sublist into memory
- Look at the first unconsidered tuple in each block and let t be the first one in sorted order
- Copy t to the output and delete all other copies of t in the blocks in memory
- If a block is emptied, then bring into its buffer the next block of that sublist and delete any copies of t in it
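
A small sketch of the second pass, assuming the sorted sublists already exist as Python lists. Here `heapq.merge` stands in for the per-sublist block buffers; it is a simplification, not the buffer-by-buffer procedure itself.

```python
import heapq

def distinct_from_sublists(sublists):
    """Pass 2: merge the sorted sublists and keep one copy of each tuple."""
    out = []
    for t in heapq.merge(*sublists):
        # Copies of the same tuple arrive consecutively in the merged stream.
        if not out or out[-1] != t:
            out.append(t)
    return out

print(distinct_from_sublists([[1, 1, 3], [1, 2, 3], [2, 5]]))  # [1, 2, 3, 5]
```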

8 Analysis of Duplicate Elimination
- B(R) disk I/Os to read R while creating the sorted sublists
- B(R) disk I/Os to write the sorted sublists to disk
- B(R) disk I/Os to reread each block from the sublists
- Grand total is 3*B(R) disk I/Os.
- Remember we need about sqrt(B(R)) blocks of main memory, since B(R) ≤ M^2.

9 Grouping Using Sorting
- Create the sorted sublists on disk, using the grouping attributes as the sort key
- Read the first block of each sublist into memory
- Repeat until all blocks have been processed:
  - start a new group for the smallest sort key among the next available tuples in the buffers
  - compute the aggregates using all tuples in this group; they are either in memory or will be loaded into memory next
  - output the tuple for this group
- 3*B(R) disk I/Os, about sqrt(B(R)) blocks of main memory
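
As an illustration, the second pass of sort-based grouping might look like this. The sketch assumes tuples of the form (group key, value), already sorted sublists, and SUM as the aggregate; all three are assumptions made for the example.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def group_sum_from_sublists(sublists):
    """Merge sublists sorted on the grouping key and emit (key, SUM(value)) per group."""
    merged = heapq.merge(*sublists, key=itemgetter(0))
    # After merging, all tuples of a group are adjacent, so one pass suffices.
    return [(k, sum(v for _, v in grp)) for k, grp in groupby(merged, key=itemgetter(0))]

print(group_sum_from_sublists([[('a', 1), ('c', 2)], [('a', 3), ('b', 4)]]))
# [('a', 4), ('b', 4), ('c', 2)]
```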

10 Union Using Sorting
- Create the sorted sublists for R on disk, using the entire tuple as the sort key
- Create the sorted sublists for S on disk, using the entire tuple as the sort key
- Use one main memory buffer for each sublist of R and S
- Repeatedly find the next tuple t among all buffers, copy it to the output, and remove from the buffers all copies of t (reloading any buffer that is emptied)
- 3*(B(R) + B(S)) disk I/Os; requires B(R) + B(S) ≤ M^2

11 Intersection and Difference Using Sorting
- Very similar to Union.
- Different rules for deciding whether, and how many times, a tuple is output
- Set Intersection: output t if it appears in both R and S
- Bag Intersection: output t the minimum number of times it appears in R and S
- Difference: see text.
- 3*(B(R) + B(S)) disk I/Os; requires B(R) + B(S) ≤ M^2
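
The output rules for set and bag intersection can be sketched on already sorted sublists. The tagging of tuples with a side (0 for R, 1 for S) and the function name are illustrative assumptions; `heapq.merge` again stands in for the block buffers.

```python
import heapq
from itertools import groupby

def intersect_sorted(r_sublists, s_sublists, bag=False):
    """Set or bag intersection of R and S from their sorted sublists."""
    tagged_r = ((t, 0) for t in heapq.merge(*r_sublists))
    tagged_s = ((t, 1) for t in heapq.merge(*s_sublists))
    out = []
    for t, grp in groupby(heapq.merge(tagged_r, tagged_s), key=lambda p: p[0]):
        counts = [0, 0]               # occurrences of t in R and in S
        for _, side in grp:
            counts[side] += 1
        n = min(counts)
        if n > 0:
            # Bag: min(count in R, count in S) copies; set: a single copy.
            out.extend([t] * (n if bag else 1))
    return out
```

For R = {1, 1, 1, 2, 4} and S = {1, 1, 2, 2, 3} the set intersection is [1, 2] and the bag intersection is [1, 1, 2].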

12 A Sort-Based Join
- Goal: make as many main memory buffers as possible available for joining tuples with a common value
- To join R(X,Y) and S(Y,Z):
  - Totally sort R using Y as the sort key
  - Totally sort S using Y as the sort key
  - The next pass reads in blocks of R and S, primarily one at a time, i.e., using one buffer for R and one for S. The familiar strategy is used to reload the buffer for a relation when all of its current tuples have been processed.

13 Simple Sort-Based Join (cont'd)
- Let y be the smaller sort key at the front of the buffers for R and S
- If y appears in both relations then
  - load into memory all tuples from R and S with sort key y; up to M buffers are available for this step (*)
  - output all tuples formed by combining tuples from R and tuples from S with sort key y

14 More on Sort-Based Join
- (*) Suppose not all tuples with sort key y fit in main memory.
- If all the tuples with sort key y from one of the relations do fit, then do a one-pass join
- If the tuples with sort key y fit for neither relation, then do a basic nested-loop join
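
A compact sketch of the merge phase, assuming both relations fit in Python lists and that all tuples sharing a sort key fit in memory. The in-memory `sorted` calls stand in for the external sort, and `key_r`/`key_s` are hypothetical accessors for the join attribute Y.

```python
def sort_merge_join(r, s, key_r, key_s):
    """Join R and S on Y by sorting both and merging the sorted streams."""
    r = sorted(r, key=key_r)
    s = sorted(s, key=key_s)
    out = []
    i = j = 0
    while i < len(r) and j < len(s):
        yr, ys = key_r(r[i]), key_s(s[j])
        if yr < ys:
            i += 1                    # y appears only in R: skip it
        elif yr > ys:
            j += 1                    # y appears only in S: skip it
        else:
            # Collect all tuples with sort key y from both relations.
            i_end = i
            while i_end < len(r) and key_r(r[i_end]) == yr:
                i_end += 1
            j_end = j
            while j_end < len(s) and key_s(s[j_end]) == yr:
                j_end += 1
            for t in r[i:i_end]:
                for u in s[j:j_end]:
                    out.append(t + u)
            i, j = i_end, j_end
    return out
```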

15 Analysis of Sort-Based Join
- Suppose all tuples with a given sort key fit in main memory
- The dominant expense of the algorithm is the secondary storage sorting algorithm used to sort the relations:
  - 5*(B(R) + B(S)) disk I/Os
  - requires B(R) ≤ M^2 and B(S) ≤ M^2

16 Two-Pass Algorithms Using Hashing
- General idea:
  - Hash the tuples using an appropriate hash key
  - For the common operations, there is a way to choose the hash key so that all tuples that need to be considered together have the same hash value
  - Do the operation working on one bucket at a time

17 Partitioning by Hashing
    initialize M-1 buckets with M-1 empty buffers
    for each block b of relation R do
        read block b into the M-th buffer
        for each tuple t in b do
            if the buffer for bucket h(t) is full then
                copy the buffer to disk
                initialize a new empty block in that buffer
            copy t to the buffer for bucket h(t)
    for each bucket do
        if the buffer for this bucket is not empty then
            write the buffer to disk
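
The partitioning loop translates almost line by line into Python. In this sketch `block_size` (tuples per block), the `key` accessor, and the use of Python's built-in `hash` as h are assumptions; the returned nested lists stand in for the bucket files on disk.

```python
def hash_partition(blocks, m, key, block_size):
    """Partition a relation into M-1 buckets, returning the blocks written per bucket."""
    n_buckets = m - 1
    buffers = [[] for _ in range(n_buckets)]   # one in-memory buffer per bucket
    disk = [[] for _ in range(n_buckets)]      # blocks flushed to "disk" per bucket
    for b in blocks:                           # b occupies the M-th buffer
        for t in b:
            i = hash(key(t)) % n_buckets
            if len(buffers[i]) == block_size:
                disk[i].append(buffers[i])     # buffer full: copy it to disk
                buffers[i] = []
            buffers[i].append(t)
    for i in range(n_buckets):
        if buffers[i]:                         # flush the partially filled buffers
            disk[i].append(buffers[i])
    return disk
```

Note that Python randomizes `hash` for strings per process, so bucket assignment for string keys varies between runs; integer keys hash deterministically.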

18 Duplicate Elimination Using Hashing
- Hash the relation R to M-1 buckets, R_1, R_2, …, R_{M-1}
- Note: all copies of the same tuple will hash to the same bucket!
- Do duplicate elimination on each bucket R_i independently, using the one-pass algorithm
- Return the union of the individual bucket results

19 Analysis of Duplicate Elimination Using Hashing
- Number of disk I/O's: 3*B(R)
- In order for this to work, we need:
  - hash function h evenly distributes the tuples among the buckets
  - each bucket R_i fits in main memory (to allow the one-pass algorithm)
  - i.e., B(R) ≤ M^2
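
A minimal sketch of hash-based duplicate elimination, assuming the tuples are hashable Python values and that each bucket fits in memory. The per-bucket `set` plays the role of the one-pass algorithm; names are hypothetical.

```python
def distinct_by_hashing(tuples, m):
    """Partition into M-1 buckets, then remove duplicates bucket by bucket."""
    n_buckets = m - 1
    buckets = [[] for _ in range(n_buckets)]
    for t in tuples:
        # All copies of the same tuple land in the same bucket.
        buckets[hash(t) % n_buckets].append(t)
    out = []
    for bucket in buckets:
        seen = set()                  # one-pass duplicate elimination on R_i
        for t in bucket:
            if t not in seen:
                seen.add(t)
                out.append(t)
    return out
```

Unlike the sort-based version, the output is grouped by bucket rather than sorted.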

20 Grouping Using Hashing
- Hash all the tuples of relation R to M-1 buckets, using a hash function that depends only on the grouping attributes
- Note: all tuples in the same group end up in the same bucket!
- Use the one-pass algorithm to process each bucket independently
- Uses 3*B(R) disk I/O's; requires B(R) ≤ M^2

21 Union, Intersection and Difference Using Hashing
- Use the same hash function for both relations!
- Hash R to M-1 buckets R_1, R_2, …, R_{M-1}
- Hash S to M-1 buckets S_1, S_2, …, S_{M-1}
- Do the one-pass {set union, set intersection, bag intersection, set difference, bag difference} algorithm on R_i and S_i, for all i
- 3*(B(R) + B(S)) disk I/O's; requires min(B(R),B(S)) ≤ M^2

22 Join Using Hashing
- Use the same hash function for both relations; the hash function should depend only on the join attributes
- Hash R to M-1 buckets R_1, R_2, …, R_{M-1}
- Hash S to M-1 buckets S_1, S_2, …, S_{M-1}
- Do a one-pass join of R_i and S_i, for all i
- 3*(B(R) + B(S)) disk I/O's; requires min(B(R),B(S)) ≤ M^2
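
The two phases of the hash join can be sketched in memory as follows. The sketch assumes hashable join keys, builds the in-memory table on S_i for every bucket pair, and uses hypothetical accessors `key_r`/`key_s` for the join attributes.

```python
from collections import defaultdict

def partitioned_hash_join(r, s, m, key_r, key_s):
    """Partition R and S with the same hash function, then join bucket pairs."""
    n = m - 1
    r_parts = [[] for _ in range(n)]
    s_parts = [[] for _ in range(n)]
    for t in r:
        r_parts[hash(key_r(t)) % n].append(t)
    for u in s:
        s_parts[hash(key_s(u)) % n].append(u)
    out = []
    for r_i, s_i in zip(r_parts, s_parts):
        # One-pass join of R_i and S_i: build on S_i, probe with R_i.
        table = defaultdict(list)
        for u in s_i:
            table[key_s(u)].append(u)
        for t in r_i:
            for u in table.get(key_r(t), []):
                out.append(t + u)
    return out
```

Only the build side of each bucket pair has to fit in memory, which is why the memory requirement is stated in terms of the smaller relation.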

23 Comparison of Sort-Based and Hash-Based
- For binary operations, hash-based only limits the size of the smaller relation, not the sum
- Sort-based can produce output in sorted order, which can be helpful
- Hash-based depends on the buckets being of roughly equal size
- Sort-based algorithms can experience reduced rotational latency or seek time

24 Index-Based Algorithms
- The existence of an index is especially helpful for selection, and helps other operations too
- Clustered relation: tuples are packed into the minimum number of blocks
- Clustering index: all tuples with the same value for the index's search key are packed into the minimum number of blocks

25 Index-Based Selection
- Without an index, selection takes B(R), or even T(R), disk I/O's.
- To select all tuples with attribute a equal to value v, when there is an index on a:
  - search the index for value v and get pointers to exactly the blocks containing the desired tuples
- If the index is clustering, then the number of disk I/O's is about B(R)/V(R,a)

26 Examples
- Suppose B(R) = 1000, T(R) = 20,000, there is an index on a, and we want to select all tuples with a = 0.
  - If R is clustered and we don't use the index: 1000 disk I/O's
  - If R is not clustered and we don't use the index: 20,000 disk I/O's
  - If V(R,a) = 100, the index is clustering, and we use the index: 1000/100 = 10 disk I/O's (on average)
  - If V(R,a) = 10, the index is non-clustering, and we use the index: 20,000/10 = 2000 disk I/O's (on average)
  - If V(R,a) = 20,000 (a is a key) and we use the index: 1 disk I/O
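
The figures above follow from two small formulas, reproduced here as a sketch. The function names are hypothetical, and the cost of reading the index itself is ignored, as in the examples.

```python
import math

def scan_io(b, t, clustered):
    """Selection without an index: B(R) if R is clustered, else up to T(R)."""
    return b if clustered else t

def index_selection_io(b, t, v, clustering):
    """Selection with an index on a: B(R)/V(R,a) if clustering, else T(R)/V(R,a)."""
    return math.ceil((b if clustering else t) / v)

B, T = 1000, 20_000
print(scan_io(B, T, clustered=True))                      # 1000
print(scan_io(B, T, clustered=False))                     # 20000
print(index_selection_io(B, T, 100, clustering=True))     # 10
print(index_selection_io(B, T, 10, clustering=False))     # 2000
print(index_selection_io(B, T, 20_000, clustering=False)) # 1
```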

27 Using Indexes in Other Operations
1. If the index is a B-tree, we can efficiently select tuples with the indexed attribute in a range
2. If the selection is on a complex condition such as "a = v AND …", first do the index-based algorithm to get the tuples satisfying "a = v".
- Such splitting is part of the job of the query optimizer

28 Index-Based Join Algorithm
- Consider the natural join of R(X,Y) and S(Y,Z).
- Suppose S has an index on Y.
    for each block of R
        for each tuple t in the current block
            use the index on S to find tuples of S that match t in the attribute(s) Y
            output the join of these tuples
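
A small sketch of the loop above, assuming R holds (x, y) pairs, S holds (y, z) pairs, and a Python dict keyed on Y stands in for the index on S. Both function names are illustrative.

```python
from collections import defaultdict

def build_index(s):
    """Stand-in for an existing index on S.Y: maps each y to its S tuples."""
    index = defaultdict(list)
    for y, z in s:
        index[y].append((y, z))
    return index

def index_join(r, s_index):
    """Natural join of R(X,Y) and S(Y,Z), probing the index once per R tuple."""
    out = []
    for x, y in r:
        for _, z in s_index.get(y, []):
            out.append((x, y, z))
    return out

s_index = build_index([(2, 'p'), (3, 'q'), (2, 'r')])
print(index_join([('a', 1), ('b', 2), ('c', 2)], s_index))
# [('b', 2, 'p'), ('b', 2, 'r'), ('c', 2, 'p'), ('c', 2, 'r')]
```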

29 Analysis of Index-Based Join
- To get all the blocks of R, either B(R) or T(R) disk I/O's are needed
- For each tuple of R, there are on average T(S)/V(S,Y) matching tuples of S
  - T(R)*T(S)/V(S,Y) disk I/O's if the index is not clustering
  - T(R)*B(S)/V(S,Y) disk I/O's if the index is clustering
- This method is efficient if R is much smaller than S and V(S,Y) is large (i.e., not many tuples of S match)

30 Join Using a Sorted Index
- Suppose we want to join R(X,Y) and S(Y,Z).
- Suppose we have a sorted index (e.g., B-tree) on Y for either R or S (or both):
  - do sort-join, but
  - no need to sort the indexed relation(s) first

31 Buffer Management
- The availability of blocks (buffers) of main memory is controlled by the buffer manager.
- When a new buffer is needed, a replacement policy is used to decide which existing buffer should be returned to disk.
- If the number of buffers available for an operation cannot be predicted in advance, then the algorithm chosen must degrade gracefully as the number of buffers shrinks.
- If the number of buffers available is not large enough for a two-pass algorithm, then there are generalizations to algorithms that use three or more passes.
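
To make the idea of a replacement policy concrete, here is a toy buffer pool. The slides do not name a policy; least-recently-used (LRU) is assumed here purely for illustration, and the class name and the `read_block` callback are hypothetical. The sketch is read-only, so it never writes a modified buffer back to disk.

```python
from collections import OrderedDict

class LRUBufferPool:
    """Toy buffer manager with a fixed number of frames and LRU replacement."""

    def __init__(self, capacity, read_block):
        self.capacity = capacity
        self.read_block = read_block      # callback that fetches a block from "disk"
        self.frames = OrderedDict()       # block_id -> data, least recently used first
        self.misses = 0                   # number of disk reads performed

    def get(self, block_id):
        if block_id in self.frames:
            self.frames.move_to_end(block_id)   # mark as most recently used
            return self.frames[block_id]
        if len(self.frames) >= self.capacity:
            self.frames.popitem(last=False)     # evict the least recently used block
        self.misses += 1
        data = self.read_block(block_id)
        self.frames[block_id] = data
        return data
```

With two frames and the access sequence 1, 2, 1, 3, 2, block 2 is evicted when 3 arrives and must be read again, giving four disk reads in total.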