Published by Kathryn Watts. Modified over 9 years ago.
1
ICOM 6005 – Database Management Systems Design Dr. Manuel Rodríguez-Martínez Electrical and Computer Engineering Department Lecture 14 – Join Processing
2
Query Evaluation Techniques
Read:
  – Chapter 12, sec. 12.1-12.3
  – Chapter 13
Purpose:
  – Study different algorithms to execute (evaluate) SQL relational operators
      Selection
      Projection
      Joins
      Aggregates
      Etc.
3
Join Processing
The DBMS assumes that all projections and selections on single tables are applied first
  – Project the attributes needed for the join + the attributes to be projected in the actual results
Then the joins are computed
We shall study 5 types of join algorithms
  – Nested Loops Join
  – Block Nested Loops Join
  – Index Nested Loops Join
  – Sort-merge Join
  – Hash Join
4
Nested Loops Join
Input:
  – Tables R and S
  – Equijoin condition: r[i] = s[j]
      Compares the ith attribute of r with the jth attribute of s to join tuples
Algorithm:
    for each r ∈ R do
      for each s ∈ S do
        if r[i] = s[j] then
          add t = <r, s> to result
Notation
  – R is called the outer relation (scanned only once)
  – S is called the inner relation (scanned multiple times)
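The nested loops algorithm above can be sketched in Python as follows. This is a minimal in-memory illustration, not a DBMS implementation: tables are modeled as lists of tuples, and the function name and sample rows are hypothetical.

```python
def nested_loops_join(R, S, i, j):
    """Tuple-at-a-time nested loops equijoin on r[i] == s[j]."""
    result = []
    for r in R:              # outer relation: scanned once
        for s in S:          # inner relation: scanned once per outer tuple
            if r[i] == s[j]:
                result.append(r + s)   # concatenate the matching tuples
    return result


# Hypothetical sample rows: Driver(did, dname) and Car(cid, owner)
drivers = [("d1", "Ana"), ("d2", "Luis")]
cars = [("c1", "d2"), ("c2", "d1"), ("c3", "d2")]
joined = nested_loops_join(drivers, cars, 0, 1)
```

Replacing the equality test with any other predicate gives a theta-join, which is why this algorithm works for every join type.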
5
Nested Loops Join (2)
[Figure: outer table R (tuples r1 to r8) and inner table S (tuples s1 to s6). S must be fully scanned for each tuple in R.]
6
Nested Loops Join (3)
Cost of the join R ⋈ S:
  – Cost = NPages(R) + |R|*NPages(S)
Usually you want the outer table to be the smallest table
  – But the cost difference is marginal
Works for any type of join (natural, equi-join, theta-join)
Example:
  – Driver(did: char(10), dname: char(20), dage: integer);
      Cardinality = 100,000
      NPages = 2,000
  – Car(cid: char(6), owner: char(10), make: char(10), year: integer);
      Cardinality = 40,000
      NPages = 800
7
Nested Loops Join (4)
Example: join of Driver and Car, assuming 1 I/O takes 10 ms
  – Option 1: Driver is outer and Car is inner
      Cost = 2,000 + 100,000*800 = 80,002,000 I/Os (9.3 days!)
  – Option 2: Car is outer and Driver is inner
      Cost = 800 + 40,000*2,000 = 80,000,800 I/Os (9.3 days!)
  – Option 2 saves 1,200 I/Os
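The arithmetic in this example can be checked with a short script. It applies the cost formula Cost = NPages(outer) + |outer|*NPages(inner) and the 10 ms per I/O figure from the slides; the function names are hypothetical helpers.

```python
def nlj_cost(npages_outer, ntuples_outer, npages_inner):
    """I/O cost of tuple-at-a-time nested loops join."""
    return npages_outer + ntuples_outer * npages_inner


def io_days(ios, ms_per_io=10):
    """Convert an I/O count to elapsed days at ms_per_io milliseconds each."""
    return ios * ms_per_io / 1000 / 86400


option1 = nlj_cost(2000, 100_000, 800)   # Driver outer, Car inner
option2 = nlj_cost(800, 40_000, 2000)    # Car outer, Driver inner
savings = option1 - option2
```

Both options come out at roughly 9.3 days, which shows why the choice of outer table barely matters for this algorithm.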
8
Block Nested Loops Join
Idea
  – Join a block from the outer table with blocks of the inner table
  – Two schemes
      Page-at-a-Time
      Block-oriented
  – Page-at-a-Time Algorithm:
    for each block C in R do
      for each block D in S do
        for each r ∈ C do
          for each s ∈ D do
            if r[i] = s[j] then
              add t = <r, s> to result
9
Block Nested Loops Join (2)
[Figure: pages of R (tuples r1 to r8) and pages of S (tuples s1 to s6). 1 page of R is joined with 1 page of S; S must be fully scanned for each page in R.]
10
Block Nested Loops Join (3)
Cost of the join R ⋈ S:
  – Cost = NPages(R) + NPages(R)*NPages(S)
Works for any type of join (natural, equi-join, theta-join)
Example: join of Driver and Car
  – Option 1: Driver is outer and Car is inner
      Cost = 2,000 + 2,000*800 = 1,602,000 I/Os (4.45 hours!)
  – Option 2: Car is outer and Driver is inner
      Cost = 800 + 800*2,000 = 1,600,800 I/Os (4.45 hours!)
  – Option 2 saves 1,200 I/Os
11
Block Nested Loops Join (4)
We can do better by leveraging the buffers
Load a run of pages from R into memory, call it T
  – T is a run of pages
Join this set of pages T with a page from S
Need B buffers for this
  – B - 2 for the run T, 1 for a page of S, 1 for the output page
Algorithm for Block-Oriented NLJ:
    for each run T of size B - 2 in R do
      build in-memory hash table H for T using B - 2 buffers
      for each block D in S do
        for each s ∈ D do
          Iterator I = H.get(s[j])   // probe the hash table
          for each r ∈ I do          // iterate over matching tuples
            if r[i] = s[j] then
              add t = <r, s> to result
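A minimal Python sketch of the block-oriented algorithm follows. It assumes each table is a list of pages and each page is a list of tuples; a dictionary stands in for the in-memory hash table H, and the function name and sample data are hypothetical.

```python
from collections import defaultdict


def block_nested_loops_join(R_pages, S_pages, i, j, B):
    """Block-oriented nested loops equijoin using B buffers."""
    if B < 3:
        raise ValueError("need at least 3 buffers: run, S page, output page")
    run_size = B - 2                      # pages of R held in memory at once
    result = []
    for start in range(0, len(R_pages), run_size):
        run = R_pages[start:start + run_size]
        H = defaultdict(list)             # hash table on the join attribute
        for page in run:
            for r in page:
                H[r[i]].append(r)
        for page in S_pages:              # one full scan of S per run of R
            for s in page:
                for r in H.get(s[j], []): # probe; empty list if no match
                    result.append(r + s)
    return result


# Hypothetical data: 3 pages of Driver(did, dname), 2 pages of Car(cid, owner)
R_pages = [[("d1", "Ana")], [("d2", "Luis")], [("d3", "Eva")]]
S_pages = [[("c1", "d2"), ("c2", "d1")], [("c3", "d2")]]
joined = block_nested_loops_join(R_pages, S_pages, 0, 1, B=4)
```

With B = 4 the run holds 2 pages, so S is scanned twice here rather than once per page of R.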
12
Block Nested Loops Join (6)
[Figure: pages of R (tuples r1 to r12) and pages of S (tuples s1 to s6) on disk. A run T of pages of R is held in a hash table in the buffer pool and joined with 1 page of S at a time.]
13
Block Nested Loops Join (7)
Cost of the join R ⋈ S:
  – Cost = NPages(R) + ⌈NPages(R)/(B - 2)⌉*NPages(S)
  – When B - 2 = 1, we get the page-at-a-time join
Works on natural joins and equi-joins
Example: join of Driver and Car, B = 22
  – Option 1: Driver is outer and Car is inner
      Cost = 2,000 + (2,000/20)*800 = 82,000 I/Os (13.7 min)
  – Option 2: Car is outer and Driver is inner
      Cost = 800 + (800/20)*2,000 = 80,800 I/Os (13.5 min)
  – Option 2 saves 1,200 I/Os
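The numbers in this example are consistent with the formula Cost = NPages(outer) + ceil(NPages(outer)/(B - 2))*NPages(inner). That formula is inferred here from the arithmetic on the slide, so treat the sketch below as a check under that assumption; the function name is hypothetical.

```python
import math


def bnlj_cost(npages_outer, npages_inner, B):
    """I/O cost of block-oriented nested loops join with B buffers."""
    runs = math.ceil(npages_outer / (B - 2))   # number of runs of the outer table
    return npages_outer + runs * npages_inner  # read outer once, inner once per run


option1 = bnlj_cost(2000, 800, 22)     # Driver outer, Car inner
option2 = bnlj_cost(800, 2000, 22)     # Car outer, Driver inner
page_at_a_time = bnlj_cost(2000, 800, 3)   # B - 2 = 1 degenerates to page-at-a-time
```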
14
Index Nested Loops Join
Idea:
  – If a table has an index whose search key K matches the join predicate, then the index can be used to find the matching tuples in this table
  – The table that has the index becomes the inner table
Algorithm:
    Index I = S.getIndex()   // get handle for index on S
    for each r ∈ R do
      Iterator T = I.search(r[i])
      for each s ∈ T do
        add t = <r, s> to result
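The index nested loops idea can be sketched in Python with a dictionary standing in for the index on S. A real DBMS would use a disk-resident hash index or B+ tree; `build_index` and the sample rows are hypothetical and only illustrate the lookup pattern.

```python
from collections import defaultdict


def build_index(S, j):
    """Stand-in for an existing index on attribute j of S."""
    index = defaultdict(list)
    for s in S:
        index[s[j]].append(s)
    return index


def index_nested_loops_join(R, index, i):
    """Equijoin: for each outer tuple, look up matches through the index."""
    result = []
    for r in R:                          # outer relation: scanned once
        for s in index.get(r[i], []):    # index search instead of scanning S
            result.append(r + s)
    return result


# Hypothetical sample rows: Driver(did, dname) and Car(cid, owner)
drivers = [("d1", "Ana"), ("d2", "Luis"), ("d3", "Eva")]
cars = [("c1", "d2"), ("c2", "d1"), ("c3", "d2")]
owner_index = build_index(cars, 1)
joined = index_nested_loops_join(drivers, owner_index, 0)
```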
15
Index Nested Loops Join (2)
[Figure: outer table R (tuples r1 to r8) and indexed inner table S (tuples s1 to s6).]
16
Index Nested Loops Join (3)
Cost of the join R ⋈ S:
  – Clustered
      Hash index: Cost = NPages(R) + |R|*2
      B+ tree: Cost = NPages(R) + |R|*4
  – Un-clustered
      Hash index: Cost = NPages(R) + |R|*3
      B+ tree: Cost = NPages(R) + |R|*4*NTuplesPerPage(S)
17
Index Nested Loops Join (4)
Example: join of Driver and Car
  – Scenario 1: Clustered B+ tree on Car
      Cost = 2,000 + (100,000*4) = 402,000 I/Os (1.12 hr)
  – Scenario 2: Clustered B+ tree on Driver
      Cost = 800 + (40,000*4) = 160,800 I/Os (26.8 min)
  – Consider Scenario 1. Suppose Driver is sorted on the join attribute. What happens?
18
Sort-Merge Join
Idea:
  – If tables are sorted on the join attribute, we can traverse them and join the matching tuples
  – In fact, it might be worth sorting the tables if they are not already sorted
Algorithm has two stages:
  – Sorting phase
      Both tables are sorted on the join attribute
      Use external sorting for this
  – Merging phase
      Both tables are scanned and matching tuples are joined
19
Sort-Merge Join (2)
[Figure: tables R (tuples r1 to r8) and S (tuples s1 to s6), with tuples sorted by the join column. Both tables are scanned concurrently, and runs of matches are joined.]
20
Sort-Merge Join (2)
Algorithm: assume R is the smallest relation
    R2 = sort table R; S2 = sort table S
    I1 = R2.scanIterator(); r = I1.next()
    I2 = S2.scanIterator(); s = I2.next()
    while there are tuples in R2 and S2 do
      while r[i] < s[j] do r = I1.next()
      while s[j] < r[i] do s = I2.next()
      while s[j] == r[i] do
        sOld = s                   // remember the start of the run of matches in S2
        while s[j] == r[i] do
          add t = <r, s> to result
          s = I2.next()
        s = sOld                   // rewind S2 to the start of the run
        r = I1.next()              // the next r may match the same run
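The merge logic, including the rewind over a run of matching S tuples, can be sketched in Python as shown below. This is an in-memory illustration only: `sorted` stands in for external sorting, list positions stand in for scan iterators, and the function name and sample rows are hypothetical.

```python
def sort_merge_join(R, S, i, j):
    """Equijoin on r[i] == s[j] by sorting both inputs and merging."""
    R2 = sorted(R, key=lambda r: r[i])   # sorting phase (in memory here)
    S2 = sorted(S, key=lambda s: s[j])
    result = []
    a = b = 0                            # scan positions in R2 and S2
    while a < len(R2) and b < len(S2):
        if R2[a][i] < S2[b][j]:
            a += 1
        elif S2[b][j] < R2[a][i]:
            b += 1
        else:
            key = R2[a][i]
            run_start = b                # start of the run of matches in S2
            while a < len(R2) and R2[a][i] == key:
                b = run_start            # rewind S2 for each matching r
                while b < len(S2) and S2[b][j] == key:
                    result.append(R2[a] + S2[b])
                    b += 1
                a += 1
    return result


# Hypothetical rows with duplicate join values on both sides
R = [(2, "x"), (1, "a"), (2, "y")]
S = [(2, "p"), (3, "q"), (2, "r")]
joined = sort_merge_join(R, S, 0, 0)
```

Duplicates on both sides produce every pairing within the run, which is why the S position has to be rewound for each matching r.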
21
Sort-Merge Join (3)
Cost of the join R ⋈ S, having B buffers for sorting
  – Parameters:
  – Cost:
22
Sort-Merge Join
Example: join of Driver and Car, B = 22
  – Cost = 8,000 + 3,200 + 2,000 + 800 = 14,000 I/Os (2.3 min)
      8,000 and 3,200 I/Os to sort Driver and Car, plus 2,000 + 800 I/Os to scan both during the merge
23
Hash Join
Idea:
  – Hash both tables on the join attribute
  – Matching tuples must hash to the same corresponding buckets
  – You can simply inspect corresponding buckets on each table to find matching tuples
  – For this to work you need a lot of memory
      To fit the partitions into an in-memory hash table for probing
24
Hash Join Partition Phase (Phase I)
[Figure: the input relation is read through an input buffer, and hash function H distributes its tuples over the output buffers to build B-1 disk-resident partitions of variable size.]
25
Hash Join Probing Phase (Phase II)
[Figure: a partition of R is loaded into an in-memory hash table built with hash function H2; tuples from the corresponding partition of S are read through an input buffer and probe the hash table; the resulting joined tuples go to the output buffer.]
26
Hash Join (3)
Algorithm:
    T1 = Hash(R)   // partition R
    T2 = Hash(S)   // partition S
    for each l = 0, 1, …, k do
      T3 = new empty in-memory hash table
      for each tuple r in partition B_l of T1 do
        insert r into T3
      for each tuple s in partition C_l of T2 do
        Iterator I = T3.get(s[j])
        for each r ∈ I do
          add t = <r, s> to result
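The two phases can be sketched in Python as follows. Partitions are kept as in-memory lists rather than written to disk, Python's built-in `hash` stands in for the partitioning function, and a dictionary plays the role of T3; all names and sample rows are hypothetical.

```python
from collections import defaultdict


def partition(table, col, k):
    """Phase I: split a table into k partitions by hashing the join attribute."""
    parts = [[] for _ in range(k)]
    for t in table:
        parts[hash(t[col]) % k].append(t)
    return parts


def hash_join(R, S, i, j, k=4):
    """Equijoin on r[i] == s[j] using partitioning followed by probing."""
    T1 = partition(R, i, k)
    T2 = partition(S, j, k)
    result = []
    for l in range(k):                   # matching tuples share a partition number
        T3 = defaultdict(list)           # Phase II: hash table for partition l of R
        for r in T1[l]:
            T3[r[i]].append(r)
        for s in T2[l]:                  # probe with partition l of S
            for r in T3.get(s[j], []):
                result.append(r + s)
    return result


# Hypothetical sample rows: Driver(did, dname) and Car(cid, owner)
drivers = [("d1", "Ana"), ("d2", "Luis"), ("d3", "Eva")]
cars = [("c1", "d2"), ("c2", "d1"), ("c3", "d2")]
joined = hash_join(drivers, cars, 0, 1)
```

The order of the output depends on how the tuples fall into partitions, so compare results as sets or after sorting.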
27
Hash Join (4)
Cost of the join R ⋈ S
  – Cost = (NPages(R)+NPages(S)) + (NPages(R)+NPages(S)) + (NPages(R)+NPages(S))
      Read and write both tables while partitioning, then read both again while probing
  – Cost = 3(NPages(R)+NPages(S))
Example: join of Driver and Car
  – Option 1: Driver is outer and Car is inner
      Cost = 3 * (2,000 + 800) = 8,400 I/Os (1.4 min)
  – Option 2: Car is outer and Driver is inner
      Cost = 3 * (2,000 + 800) = 8,400 I/Os (1.4 min)
  – You need to pick the one whose partitions fit in memory
28
Some Issues
This option for join evaluation is very memory consuming
  – Should only be used if enough buffers are available
How many buffers are enough?
  – Let B be the number of buffers to use
  – At the partitioning phase, we need 1 buffer for the input pages
      We are left with B-1 buffers for the partitions of R
      Hence we will have B-1 partitions
  – Let M be the number of pages with tuples in table R
  – We have M/(B-1) pages in each partition of R
  – The hash table size will be (M/(B-1))*f, where f is a fudge factor to compensate for the extra space needed
  – We must have B > f*M/(B-1) + 2
      The extra 2 buffers are 1 for scanning the partition of S and 1 for the output page
  – Thus, approximately, B > sqrt(f*M)
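The buffer condition B > f*M/(B-1) + 2 can be explored numerically. The sketch below searches for the smallest B that satisfies it; the function name is hypothetical, the fudge factor value 1.2 is an assumed illustration, and M = 800 is the page count of Car from the running example.

```python
import math


def min_buffers(M, f=1.2):
    """Smallest B with B > f*M/(B-1) + 2 (f = 1.2 is an assumed fudge factor)."""
    B = 3                                # fewer than 3 buffers cannot work
    while not B > f * M / (B - 1) + 2:
        B += 1
    return B


needed = min_buffers(800, f=1.2)         # Car: M = 800 pages
bound = math.sqrt(1.2 * 800)             # the approximate bound sqrt(f*M)
```

The exact minimum sits slightly above sqrt(f*M), because the approximation drops the 2 extra buffers and treats B-1 as B.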