Slides adapted from Donghui Zhang, UC Riverside

JOIN PROCESSING
E0 261
Prasad Deshpande, Jayant Haritsa
Computer Science and Automation, Indian Institute of Science

Today’s Paper
Computing joins in systems with “large” amounts of main memory (from √|S| to |S|)
ACM Transactions on Database Systems, Vol. 11, No. 3, Sep 1986

Basic Issues
IO and computational costs
How to use available memory to minimize these costs

External Sort Merge Join
Average run length = 2*√|S|, i.e. twice the memory size when runs are formed by replacement selection
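Runs that are longer than memory are usually obtained with replacement selection. The following is a minimal sketch of that technique, assuming in-memory lists stand in for disk files; the function name `replacement_selection` and the parameter `memory_size` are illustrative and do not come from the slides.

```python
import heapq
import random

_END = object()  # sentinel marking the end of the input


def replacement_selection(records, memory_size):
    """Split `records` into sorted runs using a heap of at most `memory_size` items.

    Assumes memory_size >= 1. Each heap entry is (run_number, key), so records
    tagged for the current run are always output before those held for the next.
    """
    it = iter(records)
    heap = []
    for x in it:  # fill memory with the first records
        heap.append((0, x))
        if len(heap) == memory_size:
            break
    heapq.heapify(heap)

    runs = []
    while heap:
        run_no, key = heapq.heappop(heap)
        if run_no == len(runs):  # first record of a new run
            runs.append([])
        runs[run_no].append(key)
        nxt = next(it, _END)
        if nxt is not _END:
            # A record no smaller than the one just written can still extend
            # the current run; otherwise it must wait for the next run.
            heapq.heappush(heap, (run_no if nxt >= key else run_no + 1, nxt))
    return runs


if __name__ == "__main__":
    random.seed(0)
    data = [random.random() for _ in range(10_000)]
    runs = replacement_selection(data, 100)
    print(len(runs), len(data) / len(runs))  # number of runs, average run length
```

On randomly ordered input the average run length printed by the usage lines should come out close to twice `memory_size`, which is what halves the number of runs the merge phase has to handle.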

External Sort-Merge Join (cont.)
Optimization: omit the final merge pass of the sort by pipelining the merged output directly into the join.
If the buffer size is at least the square root of the relation size (in pages), R and S can each be sorted by reading them twice.
E.g. page size = 8 KB, each relation has 10,000 pages (80 MB), buffer size = 100 pages (< 1 MB): two passes are enough.
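The arithmetic behind the example can be checked with a short sketch. It assumes runs are formed by plain in-memory sorting (one run per buffer-load), that the merge needs one buffer page per run, and that the output buffer is ignored; `two_pass_sort_feasible` is a hypothetical helper name.

```python
import math


def two_pass_sort_feasible(relation_pages, buffer_pages):
    """Return (number_of_runs, True if one merge pass can combine them)."""
    # Pass 1: read the relation one buffer-load at a time, sort, write a run.
    num_runs = math.ceil(relation_pages / buffer_pages)
    # Pass 2: merge all runs at once, which needs one input page per run.
    return num_runs, num_runs <= buffer_pages


if __name__ == "__main__":
    print(two_pass_sort_feasible(10_000, 100))  # the example from the slide
    print(two_pass_sort_feasible(10_000, 50))   # too little memory
```

With 10,000 pages and 100 buffer pages the first pass produces 100 runs, exactly as many as the buffer can merge in one pass; with 50 buffer pages there would be 200 runs and an extra merge pass would be needed.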

Cost of Sort-Merge
Make use of any extra memory beyond √|S| to save IO

Classic Hash Join
Works when the smaller relation R fits in memory.
Build an in-memory hash table for the smaller relation.
For each record in the larger relation, probe the hash table.
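The build and probe steps can be sketched as follows, assuming records are tuples and the join attribute is extracted by a `key` function; these names are illustrative only.

```python
from collections import defaultdict


def classic_hash_join(r, s, key=lambda rec: rec[0]):
    """Equi-join r and s, assuming all of r fits in memory."""
    # Build phase: hash every record of the smaller relation R.
    table = defaultdict(list)
    for rec in r:
        table[key(rec)].append(rec)
    # Probe phase: look up each record of the larger relation S.
    out = []
    for rec in s:
        for match in table.get(key(rec), ()):
            out.append((match, rec))
    return out


if __name__ == "__main__":
    r = [(1, "a"), (2, "b"), (2, "c")]
    s = [(2, "x"), (3, "y"), (1, "z")]
    for pair in classic_hash_join(r, s):
        print(pair)
```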

Simple Hash Join
for each logical bucket j
    for each record r in R
        if r is in bucket j then insert r into the hash table;
    for each record s in S
        if s is in bucket j then probe the hash table;
Classic hash join is a special case, with one bucket.
Optimization: write the tuples not in bucket j to disk, so that later passes read only the remaining tuples.
Works well when memory is large (nearly as large as |R|).
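A minimal sketch of the multi-pass loop, including the optimization of passing over tuples that do not belong to the current bucket. Python lists stand in for the files written to disk, and the bucket of a record is assumed to be `hash(key) % num_buckets`; both choices are assumptions for illustration.

```python
def simple_hash_join(r, s, num_buckets, key=lambda rec: rec[0]):
    """Equi-join r and s in num_buckets passes, one logical bucket per pass."""
    out = []
    r_rest, s_rest = list(r), list(s)  # tuples still to be processed
    for j in range(num_buckets):
        table = {}
        r_next, s_next = [], []  # passed-over tuples, "written to disk"
        for rec in r_rest:
            if hash(key(rec)) % num_buckets == j:
                table.setdefault(key(rec), []).append(rec)
            else:
                r_next.append(rec)
        for rec in s_rest:
            if hash(key(rec)) % num_buckets == j:
                for match in table.get(key(rec), ()):
                    out.append((match, rec))
            else:
                s_next.append(rec)
        # The next pass reads only what this pass skipped.
        r_rest, s_rest = r_next, s_next
    return out


if __name__ == "__main__":
    r = [(i % 7, f"r{i}") for i in range(30)]
    s = [(i % 5, f"s{i}") for i in range(20)]
    print(len(simple_hash_join(r, s, num_buckets=3)))
```

With `num_buckets=1` the loop runs once and the function behaves like the classic hash join.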

Simple Hash Number of passes =

Cost of Simple Hash

GRACE Hash Join
partition R into n buckets so that each bucket fits in memory;
partition S into n buckets;
for each bucket j do
    for each record r in Rj do insert into a hash table;
    for each record s in Sj do probe the hash table.
Works well when memory is small.
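The two phases can be sketched as below. Lists stand in for the partition files on disk, and `hash(key) % n` is an assumed partitioning function; the sketch does not check that each R partition really fits in memory.

```python
def grace_hash_join(r, s, n, key=lambda rec: rec[0]):
    """Equi-join r and s by partitioning both into n buckets first."""
    # Phase 1: partition both relations with the same hash function.
    r_parts = [[] for _ in range(n)]
    s_parts = [[] for _ in range(n)]
    for rec in r:
        r_parts[hash(key(rec)) % n].append(rec)
    for rec in s:
        s_parts[hash(key(rec)) % n].append(rec)

    # Phase 2: join each pair (Rj, Sj) with an in-memory hash table.
    out = []
    for rj, sj in zip(r_parts, s_parts):
        table = {}
        for rec in rj:
            table.setdefault(key(rec), []).append(rec)
        for rec in sj:
            for match in table.get(key(rec), ()):
                out.append((match, rec))
    return out


if __name__ == "__main__":
    r = [(i % 7, f"r{i}") for i in range(30)]
    s = [(i % 5, f"s{i}") for i in range(20)]
    print(len(grace_hash_join(r, s, n=4)))
```

Every record is written once and read once after partitioning, regardless of n, which is why the scheme holds up when memory is small.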

Grace Hash

Cost of Grace Hash

Hybrid Hash Join
Hybrid of simple hash join and GRACE.
When partitioning R, keep the records of the first bucket in memory as a hash table.
When partitioning S, probe the hash table directly for records of the first bucket.
Saving: no need to write R1 and S1 to disk or read them back into memory.
Works well for both large and small memory.
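A sketch of the hybrid scheme under simplifying assumptions: all n buckets are chosen by `hash(key) % n`, so the first bucket (index 0 here) is the same size as the others, whereas a real implementation would size it to the memory left over. Lists again stand in for partition files.

```python
def hybrid_hash_join(r, s, n, key=lambda rec: rec[0]):
    """Equi-join r and s, keeping bucket 0 of r in memory while partitioning."""
    def probe(table, rec, out):
        for match in table.get(key(rec), ()):
            out.append((match, rec))

    out = []
    table0 = {}                       # bucket 0 of R stays in memory
    r_parts = [[] for _ in range(n)]  # index 0 is never written
    s_parts = [[] for _ in range(n)]

    for rec in r:
        j = hash(key(rec)) % n
        if j == 0:
            table0.setdefault(key(rec), []).append(rec)
        else:
            r_parts[j].append(rec)    # "written to disk"

    for rec in s:
        j = hash(key(rec)) % n
        if j == 0:
            probe(table0, rec, out)   # joined immediately, never written
        else:
            s_parts[j].append(rec)

    # Remaining buckets are joined as in GRACE.
    for j in range(1, n):
        table = {}
        for rec in r_parts[j]:
            table.setdefault(key(rec), []).append(rec)
        for rec in s_parts[j]:
            probe(table, rec, out)
    return out


if __name__ == "__main__":
    r = [(i % 7, f"r{i}") for i in range(30)]
    s = [(i % 5, f"s{i}") for i in range(20)]
    print(len(hybrid_hash_join(r, s, n=4)))
```

With `n=1` everything lands in the in-memory bucket and the function reduces to the classic hash join; with large n the in-memory bucket is small and the behaviour approaches GRACE.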

Hybrid Hash Join

Hybrid Hash Join Cost q=

Comparison
Hybrid dominates simple hash
Hybrid dominates GRACE hash
GRACE dominates Sort-Merge
In terms of computation cost

Handle Partition Overflow
Case 1, overflow on disk: an R partition is larger than memory (note: the size of the S partitions does not matter).
    Solution (a): create small partitions first and combine them before the join.
    Solution (b): partition recursively.
Case 2, overflow in memory: the in-memory hash table of R becomes too large.
    Solution: revise the partitioning scheme and keep a smaller partition in memory.
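Solution (b) can be sketched as follows. The sketch splits an oversized R partition again, using successive base-`fanout` digits of the hash value so that each level splits on fresh information; `capacity` (records that fit in memory), `fanout` and `max_depth` are hypothetical parameters, and the matching S partition would have to be split with the same function at each level.

```python
def partition_recursively(records, capacity, fanout,
                          key=lambda rec: rec[0], depth=0, max_depth=8):
    """Split records into partitions of at most `capacity` records each.

    max_depth bounds the recursion, since a partition dominated by one
    repeated key can never be split by hashing.
    """
    if len(records) <= capacity or depth == max_depth:
        return [records]
    parts = [[] for _ in range(fanout)]
    for rec in records:
        # Level `depth` uses the depth-th base-`fanout` digit of the hash.
        digit = (hash(key(rec)) // fanout ** depth) % fanout
        parts[digit].append(rec)
    result = []
    for part in parts:
        if part:
            result.extend(partition_recursively(
                part, capacity, fanout, key, depth + 1, max_depth))
    return result


if __name__ == "__main__":
    r = [(i, "payload") for i in range(1000)]
    parts = partition_recursively(r, capacity=50, fanout=4)
    print(len(parts), max(len(p) for p in parts))
```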

Conclusions
Addressed the equi-join problem in the external-memory environment.
With the decreasing cost of memory, hash-based joins are better than nested-loop and sort-merge joins.
Proposed three hash-based algorithms (simple hash join, GRACE join and hybrid join), of which the hybrid hash join is the best.

END JOINS E0 261
