Lecture 8 Join Algorithms. Intro Until now, we have used nested loops for joining data – This is slow, n^2 comparisons How can we do better? – Sorting.

Slides:



Advertisements
Similar presentations
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Evaluation of Relational Operations Chapter 12, Part A.
Advertisements

Equality Join R X R.A=S.B S : : Relation R M PagesN Pages Relation S Pr records per page Ps records per page.
CS 245Notes 71 CS 245: Database System Principles Notes 7: Query Optimization Hector Garcia-Molina.
1 40T1 60T2 30T3 10T4 20T5 10T6 60T7 40T8 20T9 R S C C R JOIN S?
6.830 Lecture 9 10/1/2014 Join Algorithms. Database Internals Outline Front End Admission Control Connection Management (sql) Parser (parse tree) Rewriter.
Query Execution, Concluded Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems November 18, 2003 Some slide content may.
Join Processing in Databases Systems with Large Main Memories
Database Management Systems 3ed, R. Ramakrishnan and Johannes Gehrke1 Evaluation of Relational Operations: Other Techniques Chapter 14, Part B.
1 Implementation of Relational Operations Module 5, Lecture 1.
1  Simple Nested Loops Join:  Block Nested Loops Join  Index Nested Loops Join  Sort Merge Join  Hash Join  Hybrid Hash Join Evaluation of Relational.
SPRING 2004CENG 3521 Join Algorithms Chapter 14. SPRING 2004CENG 3522 Schema for Examples Similar to old schema; rname added for variations. Reserves:
1 Optimization Recap and examples. 2 Optimization introduction For every SQL expression, there are many possible ways of implementation. The different.
Query Optimization 3 Cost Estimation R&G, Chapters 12, 13, 14 Lecture 15.
Introduction to Database Systems 1 Join Algorithms Query Processing: Lecture 1.
1 Evaluation of Relational Operations: Other Techniques Chapter 12, Part B.
CS 4432query processing - lecture 171 CS4432: Database Systems II Lecture #17 Join Processing Algorithms (cont). Professor Elke A. Rundensteiner.
Evaluation of Relational Operations. Relational Operations v We will consider how to implement: – Selection ( ) Selects a subset of rows from relation.
CSE 444: Lecture 24 Query Execution Monday, March 7, 2005.
1 Implementation of Relational Operations: Joins.
Lecture 11 Main Memory Databases Midterm Review. Time breakdown for Shore DBMS Source: “OLTP Under the Looking Glass”, SIGMOD 2008 Systematically removed.
ICOM 6005 – Database Management Systems Design Dr. Manuel Rodríguez-Martínez Electrical and Computer Engineering Department Lecture 14 – Join Processing.
Status “Lifetime of a Query” –Query Rewrite –Query Optimization –Query Execution Optimization –Use cost-estimation to iterate over all possible plans,
RELATIONAL JOIN Advanced Data Structures. Equality Joins With One Join Column External Sorting 2 SELECT * FROM Reserves R1, Sailors S1 WHERE R1.sid=S1.sid.
Lecture 9 Query Optimization.
Implementing Natural Joins, R. Ramakrishnan and J. Gehrke with corrections by Christoph F. Eick 1 Implementing Natural Joins.
1 Database Systems ( 資料庫系統 ) December 7, 2011 Lecture #11.
Lecture 24 Query Execution Monday, November 28, 2005.
CPSC 404, Laks V.S. Lakshmanan1 Evaluation of Relational Operations – Join Chapter 14 Ramakrishnan and Gehrke (Section 14.4)
Query Processing CS 405G Introduction to Database Systems.
CS 440 Database Management Systems Lecture 5: Query Processing 1.
Computing & Information Sciences Kansas State University Wednesday, 08 Nov 2006CIS 560: Database System Concepts Lecture 32 of 42 Monday, 06 November 2006.
More Optimization Exercises. Block Nested Loops Join Suppose there are B buffer pages Cost: M + ceil (M/(B-2))*N where –M is the number of pages of R.
Relational Operator Evaluation. overview Projection Two steps –Remove unwanted attributes –Eliminate any duplicate tuples The expensive part is removing.
Implementation of Database Systems, Jarek Gryz1 Evaluation of Relational Operations Chapter 12, Part A.
1 Choosing an Order for Joins. 2 What is the best way to join n relations? SELECT … FROM A, B, C, D WHERE A.x = B.y AND C.z = D.z Hash-Join Sort-JoinIndex-Join.
Alon Levy 1 Relational Operations v We will consider how to implement: – Selection ( ) Selects a subset of rows from relation. – Projection ( ) Deletes.
External Sorting. Why Sort? A classic problem in computer science! Data requested in sorted order –e.g., find students in increasing gpa order Sorting.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Evaluation of Relational Operations Chapter 14, Part A (Joins)
Fan Qi Database Lab 1, com1 #01-08 CS3223 Tutorial 6.
CS 540 Database Management Systems
CS 440 Database Management Systems
Database Systems (資料庫系統)
Database Applications (15-415) DBMS Internals- Part VII Lecture 16, October 25, 2016 Mohammad Hammoud.
Evaluation of Relational Operations
Column Stores Join Algorithms 10/2/2017
Database Management Systems (CS 564)
Evaluation of Relational Operations: Other Operations
Relational Operations
09_Queries_LECTURE2 CODE GENERATION implements the operator, JOIN (equi-join, we will use  for it.) J1. NESTED LOOP JOIN S on S.S#=E.S# For each record.
Database Applications (15-415) DBMS Internals- Part VII Lecture 19, March 27, 2018 Mohammad Hammoud.
CS222P: Principles of Data Management Notes #12 Joins!
Yan Huang - CSCI5330 Database Implementation – Access Methods
External Joins Query Optimization 10/4/2017
Example R1 R2 over common attribute C
CS222: Principles of Data Management Notes #12 Joins!
Selected Topics: External Sorting, Join Algorithms, …
Lecture 2- Query Processing (continued)
CS222: Principles of Data Management Lecture #10 External Sorting
Implementation of Relational Operations
Slides adapted from Donghui Zhang, UC Riverside
CS505: Intermediate Topics in Database Systems
Evaluation of Relational Operations: Other Techniques
Overview of Query Evaluation: JOINS
External Sorting.
Sorting We may build an index on the relation, and then use the index to read the relation in sorted order. May lead to one disk block access for each.
CSE 444: Lecture 25 Query Execution
Database Systems (資料庫系統)
Evaluation of Relational Operations: Other Techniques
CS222/CS122C: Principles of Data Management UCI, Fall Notes #11 Join!
CS222P: Principles of Data Management UCI, Fall 2018 Notes #11 Join!
Presentation transcript:

Lecture 8 Join Algorithms

Intro Until now, we have used nested loops for joining data – This is slow, n^2 comparisons How can we do better? – Sorting – Divide & conquer Trade-off in I/O and CPU time for each algo

Sort Merge Join Equi-join of two tables S & R |S| = Pages in S; {S} = Tuples in S |S| ≥ |R| M pages of memory; M > sqrt(|S|) Algorithm: – Partition S and R into memory sized sorted runs, write out to disk – Merge all runs simultaneously Total I/O cost: Read |R| and |S| twice, write once 3(|R| + |S|) I/Os

Example R=1,4,3,6,9,14,1,7,11 S=2,3,7,12,9,8,4,15,6 R1 = 1,3,4 S1 = 2,3,7S2 = 8,9,12S3 = 4,6,15 Need enough memory to keep 1 page of each run in memory at a time R2 = 6,9,14R3 = 1,7,11 If each run is M pages and M > sqrt(|S|), then there are at most |S|/sqrt(|S|) = sqrt(|S|) runs of S So if |R| = |S|, we actually need M to be 2 x sqrt(|S|) [handwavy argument in paper for why it’s only sqrt(|S|)]

Example R=1,4,3,6,9,14,1,7,11 S=2,3,7,12,9,8,4,15,6 R1 = 1,3,4R2 = 6,9,14R3 = 1,7,11 S1 = 2,3,7S2 = 8,9,12S3 = 4,6,15

Example R=1,4,3,6,9,14,1,7,11 S=2,3,7,12,9,8,4,15,6 R1 = 1,3,4R2 = 6,9,14R3 = 1,7,11 S1 = 2,3,7S2 = 8,9,12S3 = 4,6,15

Example R=1,4,3,6,9,14,1,7,11 S=2,3,7,12,9,8,4,15,6 R1 = 1,3,4R2 = 6,9,14R3 = 1,7,11 S1 = 2,3,7S2 = 8,9,12S3 = 4,6,15

Example R=1,4,3,6,9,14,1,7,11 S=2,3,7,12,9,8,4,15,6 R1 = 1,3,4R2 = 6,9,14R3 = 1,7,11 S1 = 2,3,7S2 = 8,9,12S3 = 4,6,15

Example R=1,4,3,6,9,14,1,7,11 S=2,3,7,12,9,8,4,15,6 R1 = 1,3,4R2 = 6,9,14R3 = 1,7,11 S1 = 2,3,7S2 = 8,9,12S3 = 4,6,15

Example R=1,4,3,6,9,14,1,7,11 S=2,3,7,12,9,8,4,15,6 R1 = 1,3,4R2 = 6,9,14R3 = 1,7,11 S1 = 2,3,7S2 = 8,9,12S3 = 4,6,15

Example R=1,4,3,6,9,14,1,7,11 S=2,3,7,12,9,8,4,15,6 R1 = 1,3,4R2 = 6,9,14R3 = 1,7,11 S1 = 2,3,7S2 = 8,9,12S3 = 4,6,15 … Output in sorted order!

Study Break: Sort-Merge Join Say we are joining tables: A=8,20,19,20,3,13,20,18,6,5,4,5 B=15,1,3,13,13,10,19,6,8,15,16,2 If our in-memory runs are of length 4, what will the sorted runs be for A and B? What is the join output? Walk through the steps of the merge.

Simple Hash Algorithm: Given hash function H(x)  [0,…,P-1] where P is number of partitions for i in [0,…,P-1]: for each r in R: if H(r)=i, add r to in-memory hash otherwise, write r back to disk in R’ for each s in S: if H(s)=i, lookup s in hash, output matches otherwise, write s back to disk in S’ replace R with R’, S with S’

Simple Hash I/O Analysis Suppose P=2, and hash uniformly maps tuples to partitions Read |R| + |S| Write 1/2 (|R| + |S|) Read 1/2 (|R| + |S|) = 2 (|R| + |S|) P=3 Read |R| + |S| Write 2/3 (|R| + |S|) Read 2/3 (|R| + |S|) Write 1/3 (|R| + |S|) Read 1/3 (|R| + |S|) = 3 (|R| + |S|) P=4 |R| + |S| + 2 * (3/4 (|R| + |S|)) + 2 * (2/4 (|R| + |S|)) + 2 * (1/4 (|R| + |S|)) = 4 (|R| + |S|)  P = n ; n * (|R| + |S|) I/Os

Grace Hash Algorithm: Partition: Suppose we have P partitions, and H(x)  [0…P-1] Choose P = |S| / M  P ≤ sqrt(|S|) //may need to leave a little slop for imperfect hashing Allocate P 1-page output buffers, and P output files for R For each r in R: Write r into buffer H(r) If buffer full, append to file H(r) Allocate P output files for S For each s in S: Write s into buffer H(s) if buffer full, append to file H(s) Join: For i in [0,…,P-1] Read file i of R, build hash table Scan file i of S, probing into hash table and outputting matches Need one page of RAM for each of P partitions Since M > sqrt(|S|) and P ≤ sqrt(|S|), all is well Total I/O cost: Read |R| and |S| twice, write once 3(|R| + |S|) I/Os

Example P = 3; H(x) = x mod P R=5,4,3,6,9,14,1,7,11 S=2,3,7,12,9,8,4,15,6 R0R1R2F0F1F2 P output buffers P output files

Example P = 3; H(x) = x mod P R=5,4,3,6,9,14,1,7,11 S=2,3,7,12,9,8,4,15,6 R0R1R2 5 F0F1F2

Example P = 3; H(x) = x mod P R=5,4,3,6,9,14,1,7,11 S=2,3,7,12,9,8,4,15,6 R0R1R2 45 F0F1F2

Example P = 3; H(x) = x mod P R=5,4,3,6,9,14,1,7,11 S=2,3,7,12,9,8,4,15,6 R0R1R2 345 F0F1F2

Example P = 3; H(x) = x mod P R=5,4,3,6,9,14,1,7,11 S=2,3,7,12,9,8,4,15,6 R0R1R F0F1F2

Example P = 3; H(x) = x mod P R=5,4,3,6,9,14,1,7,11 S=2,3,7,12,9,8,4,15,6 R0R1R F0F1F2 Need to flush R0 to F0!

Example P = 3; H(x) = x mod P R=5,4,3,6,9,14,1,7,11 S=2,3,7,12,9,8,4,15,6 R0R1R2 45 F0F1F2 3 6

Example P = 3; H(x) = x mod P R=5,4,3,6,9,14,1,7,11 S=2,3,7,12,9,8,4,15,6 R0R1R2 945 F0F1F2 3 6

Example P = 3; H(x) = x mod P R=5,4,3,6,9,14,1,7,11 S=2,3,7,12,9,8,4,15,6 R0R1R F0F1F2 3 6

Example P = 3; H(x) = x mod P R=5,4,3,6,9,14,1,7,11 S=2,3,7,12,9,8,4,15,6 R0R1R F0F1F2 3 6

Example P = 3; H(x) = x mod P R=5,4,3,6,9,14,1,7,11 S=2,3,7,12,9,8,4,15,6 R0R1R F0F1F2 3 6

Example P = 3; H(x) = x mod P R=5,4,3,6,9,14,1,7,11 S=2,3,7,12,9,8,4,15,6 R0R1R F0F1F

Example P = 3; H(x) = x mod P R=5,4,3,6,9,14,1,7,11 S=2,3,7,12,9,8,4,15,6 R0R1R F0F1F

Example P = 3; H(x) = x mod P R=5,4,3,6,9,14,1,7,11 S=2,3,7,12,9,8,4,15,6 R0R1R F0F1F

Example P = 3; H(x) = x mod P R=5,4,3,6,9,14,1,7,11 S=2,3,7,12,9,8,4,15,6 R0R1R2 97 F0F1F

Example P = 3; H(x) = x mod P R=5,4,3,6,9,14,1,7,11 S=2,3,7,12,9,8,4,15,6 R0R1R F0F1F

Example P = 3; H(x) = x mod P R=5,4,3,6,9,14,1,7,11 S=2,3,7,12,9,8,4,15,6 R0R1R2F0F1F

Example P = 3; H(x) = x mod P R=5,4,3,6,9,14,1,7,11 S=2,3,7,12,9,8,4,15,6 F0F1F F0F1F R Files S Files

Example P = 3; H(x) = x mod P R=5,4,3,6,9,14,1,7,11 S=2,3,7,12,9,8,4,15,6 F0F1F F0F1F R Files S Files Matches: Load F0 from R into memory

Example P = 3; H(x) = x mod P R=5,4,3,6,9,14,1,7,11 S=2,3,7,12,9,8,4,15,6 F0F1F F0F1F R Files S Files Matches: Load F0 from R into memory Scan F0 from S

Example P = 3; H(x) = x mod P R=5,4,3,6,9,14,1,7,11 S=2,3,7,12,9,8,4,15,6 F0F1F F0F1F R Files S Files Matches: 3,3 Load F0 from R into memory Scan F0 from S

Example P = 3; H(x) = x mod P R=5,4,3,6,9,14,1,7,11 S=2,3,7,12,9,8,4,15,6 F0F1F F0F1F R Files S Files Matches: 3,3 Load F0 from R into memory Scan F0 from S

Example P = 3; H(x) = x mod P R=5,4,3,6,9,14,1,7,11 S=2,3,7,12,9,8,4,15,6 F0F1F F0F1F R Files S Files Matches: 3,3 9,9 Load F0 from R into memory Scan F0 from S

Example P = 3; H(x) = x mod P R=5,4,3,6,9,14,1,7,11 S=2,3,7,12,9,8,4,15,6 F0F1F F0F1F R Files S Files Matches: 3,3 9,9 Load F0 from R into memory Scan F0 from S

Example P = 3; H(x) = x mod P R=5,4,3,6,9,14,1,7,11 S=2,3,7,12,9,8,4,15,6 F0F1F F0F1F R Files S Files Matches: 3,3 9,9 6,6 Load F0 from R into memory Scan F0 from S

Example P = 3; H(x) = x mod P R=5,4,3,6,9,14,1,7,11 S=2,3,7,12,9,8,4,15,6 F0F1F F0F1F R Files S Files Matches: 3,3 9,9 6,6

Example P = 3; H(x) = x mod P R=5,4,3,6,9,14,1,7,11 S=2,3,7,12,9,8,4,15,6 F0F1F F0F1F R Files S Files Matches: 3,3 9,9 6,6 7,7 4,4

Example P = 3; H(x) = x mod P R=5,4,3,6,9,14,1,7,11 S=2,3,7,12,9,8,4,15,6 F0F1F F0F1F R Files S Files Matches: 3,3 9,9 6,6 7,7 4,4

Summary Sort-MergeSimple HashGrace Hash I/O: 3 (|R| + |S|) CPU: O(P x {S}/P log {S}/P) I/O: P (|R| + |S|) CPU: O({R} + {S}) I/O: 3 (|R| + |S|) CPU: O({R} + {S}) Notation: P partitions / passes over data; assuming hash is O(1) Grace hash is generally a safe bet, unless memory is close to size of tables, in which case simple can be preferable Extra cost of sorting makes sort merge unattractive unless there is a way to access tables in sorted order (e.g., a clustered index), or a need to output data in sorted order (e.g., for a subsequent ORDER BY)

Study Break: Grace Hash Join Say we are joining tables: A=8,20,19,20,3,13,20,18,6,5,4,5 B=15,1,3,13,13,10,19,6,8,15,16,2 If we have four partitions, what will the hash partitions be for A and B? Walk through the steps for producing this join’s output.