15.5 Two-Pass Algorithms Based on Hashing


15.5 Two-Pass Algorithms Based on Hashing By Derek Lee

General
The essential idea behind all the hash-based algorithms: if the data is too big to store in main-memory buffers, hash all the tuples of the argument or arguments using an appropriate hash key. For the common operations, there is a way to choose the hash key so that all tuples that need to be considered together when we perform the operation fall into the same bucket. This reduces the size of the operand(s) by a factor equal to the number of buckets.

15.5.1 Partitioning Relations by Hashing
To partition R into M-1 buckets of roughly equal size, associate one buffer with each bucket; the Mth buffer holds the block of R currently being read. Each tuple t in the block is hashed to bucket h(t) and copied to the appropriate buffer. If that buffer is full, write it out to disk and initialize another block for the same bucket. At the end, write out the last block of each bucket if it is not empty.

15.5.1 Contd
initialize M-1 buckets using M-1 empty buffers;
FOR each block b of relation R DO BEGIN
    read block b into the Mth buffer;
    FOR each tuple t in b DO BEGIN
        IF the buffer for bucket h(t) has no room for t THEN BEGIN
            copy the buffer to disk;
            initialize a new empty block in that buffer;
        END;
        copy t to the buffer for bucket h(t);
    END;
END;
FOR each bucket DO
    IF the buffer for this bucket is not empty THEN
        write the buffer to disk;
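The partitioning loop above can be sketched in Python. This is only an illustration under stated assumptions: the function name partition_by_hash, the block_capacity parameter, and the use of an in-memory dictionary as a stand-in for disk are all hypothetical, and Python's built-in hash() plays the role of h.

```python
from collections import defaultdict


def partition_by_hash(blocks, num_buckets, block_capacity, key=lambda t: t):
    """Partition the tuples of a relation into num_buckets buckets.

    blocks: iterable of lists of tuples (the blocks of R).
    Returns (disk, writes), where disk[i] is the list of blocks written
    for bucket i and writes is the number of block writes performed.
    """
    buffers = [[] for _ in range(num_buckets)]  # one buffer per bucket (M-1 of them)
    disk = defaultdict(list)                    # hypothetical stand-in for bucket files
    writes = 0
    for block in blocks:                        # each block is "read into the Mth buffer"
        for t in block:
            i = hash(key(t)) % num_buckets      # h(t)
            if len(buffers[i]) == block_capacity:
                disk[i].append(buffers[i])      # buffer has no room: copy it to disk
                writes += 1
                buffers[i] = []                 # start a new empty block
            buffers[i].append(t)
    for i, buf in enumerate(buffers):           # final flush of non-empty buffers
        if buf:
            disk[i].append(buf)
            writes += 1
    return disk, writes
```

For example, partition_by_hash([[1, 2, 3], [4, 5, 6], [7, 8]], num_buckets=3, block_capacity=2) sends 1, 4, 7 to one bucket, 2, 5, 8 to another, and 3, 6 to the third, flushing a buffer each time it fills.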

15.5.2 A Hash-Based Algorithm for Duplicate Elimination
Two copies of the same tuple t will hash to the same bucket. We can therefore examine one bucket at a time, perform δ on that bucket in isolation using the one-pass algorithm, and take as the answer the union of δ(Ri), where Ri is the portion of R that hashes to the ith bucket.

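15.5.2 Contd
Number of disk I/O's: 3*B(R) (read R to partition it, write the buckets, then read each bucket once for the one-pass algorithm).
For the two-pass, hash-based algorithm to work, the hash function h must distribute the tuples evenly among the buckets, and each bucket Ri must fit in main memory (to allow the one-pass algorithm). Each bucket holds about B(R)/(M-1) blocks, so we need B(R) <= M(M-1), i.e., approximately B(R) <= M^2.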
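A minimal sketch of the two passes, together with the size condition and I/O count from this slide. The function names are hypothetical, buckets are kept in Python lists rather than on disk, and tuples are assumed to be hashable.

```python
def hash_distinct(tuples, num_buckets):
    """Two-pass duplicate elimination: partition, then dedupe each bucket."""
    # Pass 1: partition. Copies of the same tuple land in the same bucket.
    buckets = [[] for _ in range(num_buckets)]
    for t in tuples:
        buckets[hash(t) % num_buckets].append(t)
    # Pass 2: one-pass duplicate elimination on each bucket in isolation.
    result = []
    for bucket in buckets:
        seen = set()
        for t in bucket:
            if t not in seen:
                seen.add(t)
                result.append(t)
    return result


def two_pass_hash_feasible(b_r, m):
    """Size condition from the slide: B(R) <= M(M-1)."""
    return b_r <= m * (m - 1)


def two_pass_hash_io(b_r):
    """Disk I/O's from the slide: 3 * B(R)."""
    return 3 * b_r
```

With M = 100 buffers, for instance, two_pass_hash_feasible(9900, 100) holds while two_pass_hash_feasible(9901, 100) does not.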

15.5.3 Hash-Based Grouping and Aggregation
To make sure that all tuples of the same group wind up in the same bucket, we must choose a hash function that depends only on the grouping attributes of the list L. On the second pass we need only one record per group in memory, not every tuple of the bucket, so if groups are large we may actually be able to handle much larger relations R than is indicated by the B(R) <= M^2 rule.
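As an illustration, here is a sketch of hash-based grouping with a SUM aggregate. The names hash_group_sum, group_key, and value are hypothetical; the point is that the hash depends only on the grouping attributes and that the second pass keeps one running entry per group.

```python
def hash_group_sum(rows, group_key, value, num_buckets):
    """Group rows by group_key(row) and sum value(row) within each group."""
    # Pass 1: partition on the grouping attributes only.
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[hash(group_key(row)) % num_buckets].append(row)
    # Pass 2: aggregate each bucket, keeping one entry per group.
    result = {}
    for bucket in buckets:
        sums = {}
        for row in bucket:
            g = group_key(row)
            sums[g] = sums.get(g, 0) + value(row)
        result.update(sums)  # a group never spans two buckets, so no key clashes
    return result
```

Called on rows such as ("a", 1), ("b", 2), ("a", 3) with group_key taking the first component and value the second, the sketch returns one sum per distinct first component.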

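15.5.4 Hash-Based Union, Intersection, and Difference
When the operation is binary, we must use the same hash function to hash the tuples of both arguments.
R U S: hash both R and S to M-1 buckets each, then take the union of Ri and Si for each i.
R ∩ S or R - S: hash R and S in the same way, giving 2(M-1) buckets in all, then apply the appropriate one-pass algorithm to each pair of corresponding buckets Ri and Si.
Each of these two-pass algorithms requires 3(B(R)+B(S)) disk I/O's.
They work provided that approximately min(B(R), B(S)) <= M^2, because the one-pass algorithm needs only the smaller bucket of each pair to fit in memory.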
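The set versions of the three operations can be sketched as follows. The helper names are hypothetical, and Python sets stand in for the one-pass algorithms applied to each pair of corresponding buckets.

```python
def partition(tuples, num_buckets):
    """Hash tuples into num_buckets lists using the built-in hash()."""
    buckets = [[] for _ in range(num_buckets)]
    for t in tuples:
        buckets[hash(t) % num_buckets].append(t)
    return buckets


def hash_set_op(r, s, op, num_buckets):
    """Set union, intersection, or difference (R - S) via hashing."""
    # Both arguments must be partitioned with the same hash function.
    r_buckets = partition(r, num_buckets)
    s_buckets = partition(s, num_buckets)
    out = set()
    for r_i, s_i in zip(r_buckets, s_buckets):
        a, b = set(r_i), set(s_i)   # one-pass step on corresponding buckets
        if op == "union":
            out |= a | b
        elif op == "intersection":
            out |= a & b
        elif op == "difference":
            out |= a - b
        else:
            raise ValueError(f"unknown operation: {op}")
    return out
```

Because equal tuples always hash to the same bucket index, combining Ri only with Si loses nothing.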

15.5.5 The Hash-Join Algorithm
The join differs from the other binary operations only in that we must use just the join attributes as the hash key. Then we can be sure that if tuples of R and S join, they will wind up in corresponding buckets Ri and Si for some i. A one-pass join of each pair of corresponding buckets completes the algorithm, which we call hash-join.
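A rough sketch of the hash-join idea, assuming an equijoin. The function name hash_join and the r_key and s_key accessors are hypothetical; each bucket pair is joined with an in-memory lookup table built on the S bucket.

```python
from collections import defaultdict


def hash_join(r, s, r_key, s_key, num_buckets):
    """Equijoin of r and s on r_key(r_t) == s_key(s_t)."""
    # Pass 1: partition both relations on the join attributes only.
    r_buckets = [[] for _ in range(num_buckets)]
    s_buckets = [[] for _ in range(num_buckets)]
    for r_t in r:
        r_buckets[hash(r_key(r_t)) % num_buckets].append(r_t)
    for s_t in s:
        s_buckets[hash(s_key(s_t)) % num_buckets].append(s_t)
    # Pass 2: one-pass join of each pair of corresponding buckets.
    out = []
    for r_i, s_i in zip(r_buckets, s_buckets):
        table = defaultdict(list)          # in-memory table on the S bucket
        for s_t in s_i:
            table[s_key(s_t)].append(s_t)
        for r_t in r_i:                    # probe with each tuple of the R bucket
            for s_t in table.get(r_key(r_t), []):
                out.append((r_t, s_t))
    return out
```

Tuples with different join values may share a bucket, which is why the second pass still compares keys through the lookup table.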

15.5.6 Saving Some Disk I/O's
If there is more memory available on the first pass than we need to hold one block per bucket, then we have some opportunities to save disk I/O.
Hybrid hash-join: avoid writing some of the buckets to disk and then reading them again. When we hash S, we can choose to keep m of the k buckets entirely in main memory, while keeping only one block for each of the other k-m buckets. Since each bucket is expected to occupy B(S)/k blocks, this requires:
m * B(S) / k + k - m <= M

15.5.6 Contd When we read the tuples of the other relation, R, to hash that relation into buckets, we keep in memory: 1. The m buckets of S that were never written to disk, and 2. One block for each of the k-m buckets of R whose corresponding buckets of S were written to disk.
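To make the memory constraint m * B(S) / k + k - m <= M concrete, the sketch below searches for the largest number of buckets m that can stay resident for given B(S), k, and M. The function name is hypothetical, and the inequality is multiplied through by k so that the check uses only integer arithmetic.

```python
def max_resident_buckets(b_s, k, m_buffers):
    """Largest m in 0..k with m*B(S)/k + (k - m) <= M, or None if none fits."""
    best = None
    for m in range(k + 1):
        # Same constraint as on the slide, multiplied through by k.
        if m * b_s + k * (k - m) <= m_buffers * k:
            best = m
    return best
```

For instance, with B(S) = 1000, k = 20, and M = 100, each bucket is expected to take 50 blocks, so only one bucket can stay in memory alongside the 19 single-block buffers.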

15.5.7 Summary
The size requirements for sort-based and hash-based algorithms are almost the same. The differences between them:
1. Hash-based algorithms for binary operations have a size requirement that depends only on the smaller of the two arguments rather than on the sum of the argument sizes.
2. Sort-based algorithms sometimes allow us to produce a result in sorted order and take advantage of that sort later.

15.5.7 Contd
3. Hash-based algorithms depend on the buckets being of equal size.
4. In sort-based algorithms, sorted sublists may be written to consecutive blocks of the disk if we organize the disk properly, which saves rotational latency and seek time on that write.
5. If M is much larger than the number of sorted sublists, then we may read in several consecutive blocks at a time from a sorted sublist, again saving some latency and seek time.

15.5.7 Contd
6. If we can choose the number of buckets to be less than M in a hash-based algorithm, we can write out several blocks of a bucket at once.