Published by Reuben Wherley. Modified over 9 years ago.
1
Join Processing in Database Systems with Large Main Memories
ACM Transactions on Database Systems, Vol. 11, No. 3, Sep 1986
Leonard D. Shapiro
Donghui Zhang, UC Riverside, May 17, 2002
2
Content
Join Definition
Internal-Memory Joins
External-Memory Model
External-Memory Joins
Performance Comparisons
Conclusions
3
Problem Definition
Join operator: given relations R and S, report all pairs (r, s) such that r ∈ R, s ∈ S, and the two records satisfy a given condition.
Equi-join: r and s join if a certain attribute of r is equal to an attribute of s.
4
Faculty
FacName    DeptID
Gunopulos  1
Kumar      2
Li         2
Tsotras    1
Stephen    3
Zaniolo    3

Department
DeptID  DeptName
1       Computer Science
2       Electrical Engineering
3       Mathematics

SELECT FacName, DeptName
FROM Faculty, Department
WHERE Faculty.DeptID = Department.DeptID
5
Internal-Memory Solutions
Here {R} denotes the number of records in R.
Nested-loop join: compare every record in R with every record in S; cost = {R}*{S}.
Sort-merge join: sort R and S, then merge; cost = O({S}*log{S}) (if {R} < {S}).
Hash join: build a hash table for R; for every record in S, probe the hash table; cost = O({S}).
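The three internal-memory strategies above can be sketched in Python on the Faculty/Department example. This is a minimal illustration, not code from the paper: the function names and the tuple layout (join key at a given index) are assumptions made for the example.

```python
from collections import defaultdict

# Example data from the Faculty/Department slide.
faculty = [("Gunopulos", 1), ("Kumar", 2), ("Li", 2),
           ("Tsotras", 1), ("Stephen", 3), ("Zaniolo", 3)]
department = [(1, "Computer Science"), (2, "Electrical Engineering"),
              (3, "Mathematics")]

def nested_loop_join(r_rows, s_rows, r_key, s_key):
    """Compare every R record with every S record: {R}*{S} comparisons."""
    return [(r, s) for r in r_rows for s in s_rows if r[r_key] == s[s_key]]

def sort_merge_join(r_rows, s_rows, r_key, s_key):
    """Sort both inputs on the join key, then merge them in one sweep."""
    r_sorted = sorted(r_rows, key=lambda r: r[r_key])
    s_sorted = sorted(s_rows, key=lambda s: s[s_key])
    out, i, j = [], 0, 0
    while i < len(r_sorted) and j < len(s_sorted):
        rk, sk = r_sorted[i][r_key], s_sorted[j][s_key]
        if rk < sk:
            i += 1
        elif rk > sk:
            j += 1
        else:
            # Find the run of S records sharing this key.
            j_end = j
            while j_end < len(s_sorted) and s_sorted[j_end][s_key] == rk:
                j_end += 1
            # Pair every R record with this key against the whole S run.
            while i < len(r_sorted) and r_sorted[i][r_key] == rk:
                out.extend((r_sorted[i], s) for s in s_sorted[j:j_end])
                i += 1
            j = j_end
    return out

def hash_join(r_rows, s_rows, r_key, s_key):
    """Build a hash table on R, then probe it once per S record."""
    table = defaultdict(list)
    for r in r_rows:
        table[r[r_key]].append(r)
    return [(r, s) for s in s_rows for r in table.get(s[s_key], [])]

# Build on the smaller relation (Department), probe with Faculty.
pairs = hash_join(department, faculty, 0, 1)
```

All three functions return the same set of (r, s) pairs; they differ only in how many comparisons they perform to find them.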
6
External-Memory Model
Both relations R and S reside on disk.
Each disk page holds up to B records.
A disk page (block) has to be read into memory before the records in it can be processed.
The internal-memory join algorithms need to be extended to the external-memory model.
Both I/O cost and CPU cost matter for joins.
7
Block-Nested Loop Join
1. for every block in R
2.     scan through S;
3.     join records in the R block with the S records.
I/O: about |R|*|S| page reads (plus |R| to read R itself), where |R| is the number of blocks R occupies.
Works well with a small buffer (e.g. two blocks).
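A small Python sketch of the block-nested loop join can make the I/O count concrete. It assumes a relation is modelled as a list of pages, each page being a list of records; `paginate` and the `page_reads` counter are hypothetical helpers introduced only for this illustration.

```python
def paginate(rows, b):
    """Split a list of records into pages holding up to b records each."""
    return [rows[i:i + b] for i in range(0, len(rows), b)]

def block_nested_loop_join(r_pages, s_pages, r_key, s_key):
    """Join page by page and count simulated page reads."""
    out, page_reads = [], 0
    for r_page in r_pages:
        page_reads += 1              # read one block of R
        for s_page in s_pages:
            page_reads += 1          # re-scan S for every R block
            for r in r_page:
                for s in s_page:
                    if r[r_key] == s[s_key]:
                        out.append((r, s))
    return out, page_reads

faculty = [("Gunopulos", 1), ("Kumar", 2), ("Li", 2),
           ("Tsotras", 1), ("Stephen", 3), ("Zaniolo", 3)]
department = [(1, "Computer Science"), (2, "Electrical Engineering"),
              (3, "Mathematics")]

r_pages = paginate(faculty, 2)       # |R| = 3 pages
s_pages = paginate(department, 2)    # |S| = 2 pages
pairs, reads = block_nested_loop_join(r_pages, s_pages, 1, 0)
```

With |R| = 3 and |S| = 2, the counter ends at |R| + |R|*|S| page reads, which is dominated by the |R|*|S| term once the relations are large.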
8
External Sort-Merge Join
Extend the internal-memory sort-merge join by replacing the sorting algorithm with external-memory merge sort.
[Figure: merge sort]
9
External Sort-Merge Join (cont.)
Optimization: omit the final pass of merge sort by pipelining the sorted output directly into the join.
If the buffer size is at least about sqrt(|S|) pages, R and S can be sorted by reading each of them twice.
E.g. page size = 8KB, each relation has 10,000 pages (80MB), buffer size = 100 pages (<1MB): two passes are enough.
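The arithmetic behind the example can be checked with a short Python sketch. It rests on two assumptions that are not stated on the slide: initial sorted runs are on average about twice the buffer size (as with replacement selection), and the final merge needs one buffer page per run of R and S combined. The function name and the `run_factor` parameter are hypothetical.

```python
import math

def two_pass_sort_merge_ok(r_pages, s_pages, buffer_pages, run_factor=2):
    """Return (total runs, whether one merge pass can consume them all).

    Assumption: each initial run is run_factor * buffer_pages long, and the
    merge-join step needs one input buffer page per run of R and of S.
    """
    run_len = run_factor * buffer_pages
    runs = math.ceil(r_pages / run_len) + math.ceil(s_pages / run_len)
    return runs, runs <= buffer_pages

# Slide example: two relations of 10,000 pages each, 100 buffer pages.
runs, ok = two_pass_sort_merge_ok(10_000, 10_000, 100)
```

Under these assumptions the example produces 50 runs per relation, 100 in total, which exactly fits the 100-page buffer; a slightly smaller buffer would no longer be enough.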
10
Classic Hash Join
1. Build an in-memory hash table for the smaller relation;
2. For each record in the larger relation, probe the hash table.
Works when the smaller relation R fits in memory.
If the smaller relation does not fit in memory, partition it into smaller buckets.
11
Simple Hash Join
1. for each logical bucket j
2.     for each record r in R
3.         if r is in bucket j then
4.             insert r into the hash table;
5.     for each record s in S
6.         if s is in bucket j then
7.             probe the hash table;
Classic hash join is a special case, with one bucket.
Optimization: write the tuples not in bucket j to disk.
Works well when memory is large (nearly as large as |R|).
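The pseudocode above can be sketched in Python as follows. This is an illustrative version only: it re-scans the in-memory lists on every pass and does not model the optimization of writing passed-over tuples to disk. The bucket function `hash(key) % n_buckets` is an assumption chosen for the example.

```python
from collections import defaultdict

def simple_hash_join(r_rows, s_rows, r_key, s_key, n_buckets):
    """One pass over R and S per logical bucket j."""
    out = []
    for j in range(n_buckets):
        # Build a hash table from the R records that fall in bucket j.
        table = defaultdict(list)
        for r in r_rows:
            if hash(r[r_key]) % n_buckets == j:
                table[r[r_key]].append(r)
        # Probe with the S records that fall in the same bucket.
        for s in s_rows:
            if hash(s[s_key]) % n_buckets == j:
                out.extend((r, s) for r in table.get(s[s_key], []))
    return out

faculty = [("Gunopulos", 1), ("Kumar", 2), ("Li", 2),
           ("Tsotras", 1), ("Stephen", 3), ("Zaniolo", 3)]
department = [(1, "Computer Science"), (2, "Electrical Engineering"),
              (3, "Mathematics")]

one_bucket = simple_hash_join(department, faculty, 0, 1, 1)   # classic hash join
three_buckets = simple_hash_join(department, faculty, 0, 1, 3)
```

Calling it with `n_buckets=1` reduces to the classic hash join; more buckets trade extra scans of R and S for a smaller hash table per pass.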
12
GRACE Hash Join
1. partition R into n buckets so that each bucket fits in memory;
2. partition S into n buckets;
3. for each bucket j do
4.     for each record r in Rj do
5.         insert r into a hash table;
6.     for each record s in Sj do
7.         probe the hash table.
Works well when memory is small.
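A minimal Python sketch of the two GRACE phases follows. Plain lists stand in for the partition files that a real system would write to disk, and `partition` and `hash(key) % n` are assumptions made for the illustration rather than details from the paper.

```python
from collections import defaultdict

def partition(rows, key, n):
    """Phase 1: split a relation into n buckets by hashing the join key."""
    buckets = [[] for _ in range(n)]
    for row in rows:
        buckets[hash(row[key]) % n].append(row)
    return buckets

def grace_hash_join(r_rows, s_rows, r_key, s_key, n):
    """Partition both relations, then join matching bucket pairs."""
    r_parts = partition(r_rows, r_key, n)   # stand-in for partition files on disk
    s_parts = partition(s_rows, s_key, n)
    out = []
    # Phase 2: only records in bucket pair (Rj, Sj) can match each other.
    for r_part, s_part in zip(r_parts, s_parts):
        table = defaultdict(list)
        for r in r_part:
            table[r[r_key]].append(r)
        for s in s_part:
            out.extend((r, s) for r in table.get(s[s_key], []))
    return out

faculty = [("Gunopulos", 1), ("Kumar", 2), ("Li", 2),
           ("Tsotras", 1), ("Stephen", 3), ("Zaniolo", 3)]
department = [(1, "Computer Science"), (2, "Electrical Engineering"),
              (3, "Mathematics")]

pairs = grace_hash_join(department, faculty, 0, 1, 2)
```

Because both relations use the same hash function, each R bucket only needs to be compared with the S bucket of the same index, so the per-bucket hash table stays small.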
13
Hybrid Hash Join
Hybrid of simple hash join and GRACE.
When partitioning R, keep the records of the first bucket in memory as a hash table.
– Typically this means that the first bucket uses more pages in memory (all other partitions use 1 page each).
When partitioning S, for records of the first bucket, probe the hash table directly.
Saving: no need to write R1 and S1 to disk or read them back into memory.
Works well for both large and small memory.
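The saving described above can be illustrated with a Python sketch that counts how many records would be written to disk. Lists again stand in for disk partitions; the `spilled` counter, the choice of bucket 0 as the in-memory bucket, and `hash(key) % n` are assumptions made for the example.

```python
from collections import defaultdict

def hybrid_hash_join(r_rows, s_rows, r_key, s_key, n):
    """Keep bucket 0 of R in memory; spill the other buckets as in GRACE."""
    out, spilled = [], 0
    table0 = defaultdict(list)               # in-memory hash table for bucket 0
    r_parts = [[] for _ in range(n)]         # index 0 stays empty
    s_parts = [[] for _ in range(n)]
    for r in r_rows:
        j = hash(r[r_key]) % n
        if j == 0:
            table0[r[r_key]].append(r)       # never written to disk
        else:
            r_parts[j].append(r)
            spilled += 1
    for s in s_rows:
        j = hash(s[s_key]) % n
        if j == 0:
            # Probe immediately instead of writing s out.
            out.extend((r, s) for r in table0.get(s[s_key], []))
        else:
            s_parts[j].append(s)
            spilled += 1
    # Remaining buckets are joined pairwise, as in GRACE.
    for j in range(1, n):
        table = defaultdict(list)
        for r in r_parts[j]:
            table[r[r_key]].append(r)
        for s in s_parts[j]:
            out.extend((r, s) for r in table.get(s[s_key], []))
    return out, spilled

faculty = [("Gunopulos", 1), ("Kumar", 2), ("Li", 2),
           ("Tsotras", 1), ("Stephen", 3), ("Zaniolo", 3)]
department = [(1, "Computer Science"), (2, "Electrical Engineering"),
              (3, "Mathematics")]

pairs, spilled = hybrid_hash_join(department, faculty, 0, 1, 3)
```

With `n=1` everything stays in memory and nothing is spilled, which matches the classic hash join; with larger `n` only the records outside the first bucket pay the write and re-read cost.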
14
Handling Partition Overflow
Case 1, overflow on disk: an R partition is larger than the memory size (note: the size of the S partitions does not matter).
– Solution (a): create small partitions first and combine them before the join;
– Solution (b): partition recursively.
Case 2, overflow in memory: the in-memory hash table of R becomes too large.
– Solution: revise the partitioning scheme and keep a smaller partition in memory.
15
Conclusions
Addressed the equi-join problem in the external-memory environment.
With the decreasing cost of memory, hash-based join is better than nested-loop and sort-merge joins.
Proposed three hash-based algorithms (simple hash join, GRACE join and hybrid join), of which the hybrid hash join is the best.
16
Hash-based Nested Loop Join
This is a hybrid of hash-based and nested-loop join.
In pure hash-based joins, there are two steps: first, partition the source relations; next, join each partition separately.
Hash-based nested-loop join: no need to partition.
Read some pages of R to fill memory and build a hash table for them, then scan through S; repeat for the remaining pages of R.
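A short Python sketch of this idea follows. The `chunk_size` parameter (how many R records fit in memory at once) and the `s_scans` counter are hypothetical names introduced for the illustration.

```python
from collections import defaultdict

def hash_nested_loop_join(r_rows, s_rows, r_key, s_key, chunk_size):
    """Load R one memory-sized chunk at a time; scan S once per chunk."""
    out, s_scans = [], 0
    for start in range(0, len(r_rows), chunk_size):
        # Build a hash table for the R records currently in memory.
        table = defaultdict(list)
        for r in r_rows[start:start + chunk_size]:
            table[r[r_key]].append(r)
        # One full scan of S, probing the table for each record.
        s_scans += 1
        for s in s_rows:
            out.extend((r, s) for r in table.get(s[s_key], []))
    return out, s_scans

faculty = [("Gunopulos", 1), ("Kumar", 2), ("Li", 2),
           ("Tsotras", 1), ("Stephen", 3), ("Zaniolo", 3)]
department = [(1, "Computer Science"), (2, "Electrical Engineering"),
              (3, "Mathematics")]

pairs, scans = hash_nested_loop_join(faculty, department, 1, 0, 4)
```

Compared with the block-nested loop join, the number of S scans drops from one per R block to one per memory-sized chunk of R, and the hash table replaces the record-by-record comparison inside each chunk.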