@ Carnegie Mellon Databases Improving Hash Join Performance Through Prefetching. Shimin Chen, Phillip B. Gibbons, Todd C. Mowry, Anastassia Ailamaki. Carnegie Mellon University and Intel Research Pittsburgh.

@ Carnegie Mellon Databases Hash Join  Simple hash join:  Build hash table on smaller (build) relation  Probe hash table using larger (probe) relation  Random access patterns inherent in hashing  Excessive random I/Os if build relation and hash table cannot fit in memory [Diagram: build relation, hash table, probe relation]
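
To make the access pattern concrete, here is a minimal C sketch of a simple in-memory hash join. It is not the authors' code; the chained-bucket layout, the tuple structs, and names such as NUM_BUCKETS and emit_match are all hypothetical.

#include <stddef.h>

/* Hypothetical tuple and hash-table layout (not from the paper). */
typedef struct build_tuple { int key; struct build_tuple *next; /* payload omitted */ } build_tuple;
typedef struct probe_tuple { int key; /* payload omitted */ } probe_tuple;

#define NUM_BUCKETS 4096
static build_tuple *buckets[NUM_BUCKETS];

static unsigned simple_bucket_of(int key) { return ((unsigned)key * 2654435761u) % NUM_BUCKETS; }

/* Build phase: insert every build tuple into its bucket (random writes into the table). */
void build(build_tuple *tuples, size_t n) {
    for (size_t i = 0; i < n; i++) {
        unsigned b = simple_bucket_of(tuples[i].key);
        tuples[i].next = buckets[b];
        buckets[b] = &tuples[i];
    }
}

/* Probe phase: each probe tuple follows a pointer chain starting at a random bucket (random reads). */
void probe(probe_tuple *tuples, size_t n, void (*emit_match)(build_tuple *, probe_tuple *)) {
    for (size_t i = 0; i < n; i++)
        for (build_tuple *t = buckets[simple_bucket_of(tuples[i].key)]; t != NULL; t = t->next)
            if (t->key == tuples[i].key)
                emit_match(t, &tuples[i]);
}

Every probe touches a bucket and a chain of tuples at effectively random addresses, which is why the slide calls the access pattern inherently random.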

@ Carnegie Mellon Databases I/O Partitioning  Avoid excessive random disk accesses  Join pairs of build and probe partitions separately  Sequential I/O patterns for relations and partitions  Hash join is CPU-bound with reasonable I/O bandwidth [Diagram: build and probe relations divided into matching partitions]
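
A rough sketch of the I/O partitioning step, assuming each relation is a flat file of fixed-size tuples; the tuple layout, the partition count P, and the use of stdio are illustrative only.

#include <stdio.h>

#define P 800   /* number of partitions (illustrative; matches the experiment on the next slide) */

typedef struct { int key; char payload[96]; } tuple_t;   /* hypothetical 100-byte tuple */

static unsigned partition_of(int key) { return ((unsigned)key * 2654435761u) % P; }

/* Route each input tuple to one of P partition files by a hash of its join key.
   Each output file only ever grows at the end, so disk writes stay sequential. */
void partition_relation(FILE *in, FILE *out[P]) {
    tuple_t t;
    while (fread(&t, sizeof t, 1, in) == 1)
        fwrite(&t, sizeof t, 1, out[partition_of(t.key)]);
}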

@ Carnegie Mellon Databases Hash Join Cache Performance  Partition: divides a 1GB relation into 800 partitions  Join: a 50MB build partition joins a 100MB probe partition  Detailed simulations based on Compaq ES40 system  Most of execution time is wasted on data cache misses  82% for partition, 73% for join  Because of random access patterns in memory

@ Carnegie Mellon Databases Employing Partitioning for Cache?  Cache partitioning: generating cache-sized partitions  Effective in main-memory databases [Shatdal et al., 94], [Boncz et al., 99], [Manegold et al., 00]  Two limitations when used in commercial databases: 1) Usually needs an additional in-memory partitioning pass  Cache is much smaller than main memory  50% worse than our techniques 2) Sensitive to cache sharing by multiple activities [Diagram: CPU cache and main memory, with the build partition and hash table]

@ Carnegie Mellon Databases Our Approach: Cache Prefetching  Modern processors support:  Multiple cache misses to be serviced simultaneously  Prefetch assembly instructions for exploiting the parallelism  Overlap cache miss latency with computation  Successfully applied to  Array-based programs [Mowry et al., 92]  Pointer-based programs [Luk & Mowry, 96]  Database B+-trees [Chen et al., 01] [Diagram: CPU, L1 cache, L2/L3 cache, main memory, with prefetch instructions such as pref 0(r2), pref 4(r7), pref 0(r3), pref 8(r9) in flight]
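
On compilers such as GCC and Clang, the pref instructions in the diagram correspond to the __builtin_prefetch intrinsic. The loop below is only a generic illustration of overlapping several independent misses with computation; the array-of-pointers workload and the prefetch distance of 8 are arbitrary.

#include <stddef.h>

/* Touch a list of pointers, prefetching a fixed distance ahead so that
   several cache misses are outstanding at once instead of one at a time. */
void sum_indirect(int **ptrs, size_t n, long *total) {
    const size_t dist = 8;                       /* prefetch distance (tunable) */
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + dist < n)
            __builtin_prefetch(ptrs[i + dist], 0 /* read */, 3 /* high locality */);
        sum += *ptrs[i];                         /* hopefully already in cache  */
    }
    *total = sum;
}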

@ Carnegie Mellon Databases Challenges for Cache Prefetching  Difficult to obtain memory addresses early  Randomness of hashing prohibits address prediction  Data dependencies within the processing of a tuple  Naïve approach does not work  Complexity of hash join code  Ambiguous pointer references  Multiple code paths  Cannot apply compiler prefetching techniques

@ Carnegie Mellon Databases Our Solution  Dependencies are rare across subsequent tuples  Exploit inter-tuple parallelism  Overlap cache misses of one tuple with computation and cache misses of other tuples  We propose two prefetching techniques  Group prefetching  Software-pipelined prefetching

@ Carnegie Mellon Databases Outline  Overview  Our Proposed Techniques  Simplified Probing Algorithm  Naïve Prefetching  Group Prefetching  Software-Pipelined Prefetching  Dealing with Complexities  Experimental Results  Conclusions

@ Carnegie Mellon Databases Simplified Probing Algorithm

foreach probe tuple {
  (0) compute bucket number;
  (1) visit header;
  (2) visit cell array;
  (3) visit matching build tuple;
}

[Diagram: hash bucket headers point to hash cell arrays of (hash code, build tuple ptr) pairs; cells point into the build partition]
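
A hedged C rendering of the four probing stages, using a hypothetical layout in which a bucket header holds a count and a pointer to a cell array of (hash code, build-tuple pointer) pairs. None of these names come from the paper's actual engine.

#include <stddef.h>

typedef struct { int key; /* payload omitted */ } build_tuple;
typedef struct { unsigned hash; build_tuple *tuple; } hash_cell;
typedef struct { int n_cells; hash_cell *cells; } bucket_hdr;

extern bucket_hdr *headers;            /* hash bucket headers               */
extern unsigned    n_buckets;
extern unsigned    hash_of(int key);   /* hash function, defined elsewhere  */

void probe_one(int probe_key, void (*emit)(build_tuple *)) {
    unsigned h = hash_of(probe_key);               /* (0) compute bucket number */
    bucket_hdr *hdr = &headers[h % n_buckets];     /* (1) visit header          */
    for (int i = 0; i < hdr->n_cells; i++)         /* (2) visit cell array      */
        if (hdr->cells[i].hash == h) {
            build_tuple *t = hdr->cells[i].tuple;  /* (3) visit matching tuple  */
            if (t->key == probe_key)
                emit(t);
        }
}

Steps (1) through (3) are dependent pointer dereferences, each of which can miss in the cache.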

@ Carnegie Mellon Databases Naïve Prefetching

foreach probe tuple {
  (0) compute bucket number; prefetch header;
  (1) visit header;          prefetch cell array;
  (2) visit cell array;      prefetch matching build tuple;
  (3) visit matching build tuple;
}

[Timeline diagram: the cache miss latency of each stage remains exposed] Data dependencies make it difficult to obtain addresses early
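
The naïve scheme transcribed into C with __builtin_prefetch, using the same hypothetical types as in the probing sketch above. Each prefetch is issued immediately before the load that needs the data, so there is almost no independent work to hide the miss behind, which is exactly the problem the slide points out.

void probe_one_naive(int probe_key, void (*emit)(build_tuple *)) {
    unsigned h = hash_of(probe_key);                        /* (0) compute bucket number */
    bucket_hdr *hdr = &headers[h % n_buckets];
    __builtin_prefetch(hdr, 0, 3);                          /* prefetch header...        */
    int n = hdr->n_cells;                                   /* ...but it is used at once */
    __builtin_prefetch(hdr->cells, 0, 3);                   /* prefetch cell array       */
    for (int i = 0; i < n; i++)
        if (hdr->cells[i].hash == h) {
            __builtin_prefetch(hdr->cells[i].tuple, 0, 3);  /* prefetch build tuple      */
            build_tuple *t = hdr->cells[i].tuple;           /* ...used immediately       */
            if (t->key == probe_key)
                emit(t);
        }
}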

@ Carnegie Mellon Databases Group Prefetching

foreach group of probe tuples {
  foreach tuple in group {
    (0) compute bucket number; prefetch header;
  }
  foreach tuple in group {
    (1) visit header; prefetch cell array;
  }
  foreach tuple in group {
    (2) visit cell array; prefetch build tuple;
  }
  foreach tuple in group {
    (3) visit matching build tuple;
  }
}

[Timeline diagram: one group of tuples processed stage by stage]
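
A sketch of group prefetching over the same hypothetical layout. The per-tuple intermediate results (hash code and header pointer) are carried across stages in small arrays, so the prefetch issued for one tuple overlaps with the work done for the other tuples in the group. Empty buckets and the 0-or-many-match cases discussed on later slides are not handled here.

#define GROUP_SIZE 16   /* group size (tunable; see the discussion in the paper) */

void probe_group(const int *keys, int count /* <= GROUP_SIZE */, void (*emit)(build_tuple *)) {
    unsigned    h[GROUP_SIZE];
    bucket_hdr *hdr[GROUP_SIZE];

    for (int j = 0; j < count; j++) {                    /* stage 0: bucket numbers */
        h[j]   = hash_of(keys[j]);
        hdr[j] = &headers[h[j] % n_buckets];
        __builtin_prefetch(hdr[j], 0, 3);                /* prefetch headers        */
    }
    for (int j = 0; j < count; j++)                      /* stage 1: visit headers  */
        __builtin_prefetch(hdr[j]->cells, 0, 3);         /* prefetch cell arrays    */
    for (int j = 0; j < count; j++)                      /* stage 2: visit cells    */
        for (int i = 0; i < hdr[j]->n_cells; i++)
            if (hdr[j]->cells[i].hash == h[j])
                __builtin_prefetch(hdr[j]->cells[i].tuple, 0, 3);  /* prefetch tuples */
    for (int j = 0; j < count; j++)                      /* stage 3: visit tuples   */
        for (int i = 0; i < hdr[j]->n_cells; i++)
            if (hdr[j]->cells[i].hash == h[j] &&
                hdr[j]->cells[i].tuple->key == keys[j])
                emit(hdr[j]->cells[i].tuple);
}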

@ Carnegie Mellon Databases Software-Pipelined Prefetching

Prologue;
for j=0 to N-4 do {
  tuple j+3: (0) compute bucket number; prefetch header;
  tuple j+2: (1) visit header;          prefetch cell array;
  tuple j+1: (2) visit cell array;      prefetch build tuple;
  tuple j:   (3) visit matching build tuple;
}
Epilogue;

[Timeline diagram: prologue, steady state with tuples j through j+3 in flight, epilogue]
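
A corresponding sketch of the software-pipelined version. One loop iteration advances four different tuples by one stage each; the guards play the role of the prologue and epilogue that fill and drain the pipeline. As before, the data layout and helper names are hypothetical.

void probe_pipelined(const int *keys, int n_tuples, void (*emit)(build_tuple *)) {
    unsigned    h[4];        /* per-tuple state, indexed by tuple number mod 4 */
    bucket_hdr *hdr[4];

    for (int j = -3; j < n_tuples; j++) {
        if (j + 3 < n_tuples) {                              /* tuple j+3: stage 0 */
            int a = (j + 3) & 3;
            h[a]   = hash_of(keys[j + 3]);
            hdr[a] = &headers[h[a] % n_buckets];
            __builtin_prefetch(hdr[a], 0, 3);
        }
        if (j + 2 >= 0 && j + 2 < n_tuples)                  /* tuple j+2: stage 1 */
            __builtin_prefetch(hdr[(j + 2) & 3]->cells, 0, 3);
        if (j + 1 >= 0 && j + 1 < n_tuples) {                /* tuple j+1: stage 2 */
            int c = (j + 1) & 3;
            for (int i = 0; i < hdr[c]->n_cells; i++)
                if (hdr[c]->cells[i].hash == h[c])
                    __builtin_prefetch(hdr[c]->cells[i].tuple, 0, 3);
        }
        if (j >= 0) {                                        /* tuple j:   stage 3 */
            int d = j & 3;
            for (int i = 0; i < hdr[d]->n_cells; i++)
                if (hdr[d]->cells[i].hash == h[d] &&
                    hdr[d]->cells[i].tuple->key == keys[j])
                    emit(hdr[d]->cells[i].tuple);
        }
    }
}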

@ Carnegie Mellon Databases Dealing with Multiple Code Paths  Multiple code paths:  There could be 0 or many matches  Hash buckets could be empty or full  Previous compiler techniques cannot handle this  Keep state information for tuples being processed:  Record the state when a path is taken  Test the state later to decide: do nothing if state = B, execute D if state = C, execute G if state = F [Diagram: control-flow graph in which a condition at A selects B, C, or F, with D following C and G following F before the paths merge at E]
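
One hedged way to express the state idea in C: an earlier stage records which path each tuple took, and a later stage switches on that record instead of branching inside dependent loads. The states and struct below are invented for the sketch and do not match the paper's actual code.

enum path_state { PATH_EMPTY, PATH_NEEDS_D, PATH_NEEDS_G };

typedef struct {
    enum path_state st;     /* which code path this tuple is on          */
    unsigned        hash;
    bucket_hdr     *hdr;    /* same hypothetical bucket header as above  */
} tuple_ctx;

/* Earlier stage: classify each tuple and prefetch only what its path needs. */
void record_state(tuple_ctx *ctx, int count) {
    for (int j = 0; j < count; j++) {
        if (ctx[j].hdr->n_cells == 0)
            ctx[j].st = PATH_EMPTY;                /* nothing more to do */
        else {
            ctx[j].st = (ctx[j].hdr->n_cells == 1) ? PATH_NEEDS_D : PATH_NEEDS_G;
            __builtin_prefetch(ctx[j].hdr->cells, 0, 3);
        }
    }
}

/* Later stage: test the recorded state to decide what (if anything) to execute. */
void test_state(tuple_ctx *ctx, int count) {
    for (int j = 0; j < count; j++)
        switch (ctx[j].st) {
        case PATH_EMPTY:   break;                  /* do nothing                       */
        case PATH_NEEDS_D: /* handle the single-cell path here */ break;
        case PATH_NEEDS_G: /* walk the full cell array here    */ break;
        }
}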

@ Carnegie Mellon Databases Dealing with Read-write Conflicts  In hash table building:  Use a busy flag in the bucket header to detect conflicts  Postpone hashing the 2nd tuple until the 1st is fully processed  A compiler cannot perform this transformation [Diagram: hash bucket headers pointing to build tuples]
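
A sketch of the busy-flag idea for the build phase, split into a begin stage and a finish stage so that the conflict is visible: if a second tuple in the same group lands on a bucket whose insert is still in flight, it is deferred and handled after the group. The structures and names are again hypothetical.

typedef struct {
    int        busy;        /* set while an insert into this bucket is in flight   */
    int        n_cells;
    hash_cell *cells;       /* same hypothetical hash_cell as in the probe sketch  */
} build_bucket;

/* Earlier stage: claim the bucket (or defer the tuple), prefetch the slot to be written. */
int insert_begin(build_bucket *b, unsigned tuple_idx, unsigned *deferred, int *n_deferred) {
    if (b->busy) {                                  /* read-write conflict in this group */
        deferred[(*n_deferred)++] = tuple_idx;      /* hash this tuple after the group   */
        return 0;
    }
    b->busy = 1;
    __builtin_prefetch(&b->cells[b->n_cells], 1 /* write */, 3);
    return 1;
}

/* Later stage: perform the write that was prefetched above and release the bucket. */
void insert_finish(build_bucket *b, unsigned hash, build_tuple *t) {
    b->cells[b->n_cells].hash  = hash;
    b->cells[b->n_cells].tuple = t;
    b->n_cells++;
    b->busy = 0;
}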

@ Carnegie Mellon Databases More Details In Paper  General group prefetching algorithm  General software-pipelined prefetching algorithm  Analytical models  Discussion of important parameters:  group size, prefetching distance  Implementation details

@ Carnegie Mellon Databases Outline  Overview  Our Proposed Techniques  Experimental Results  Setup  Performance of Our Techniques  Comparison with Cache Partitioning  Conclusions

@ Carnegie Mellon Databases Experimental Setup  Relation schema: 4-byte join attribute + fixed length payload  No selection, no projection  50MB memory available for the join phase  Detailed cycle-by-cycle simulations  1GHz superscalar processor  Memory hierarchy is based on Compaq ES40

@ Carnegie Mellon Databases Joining a Pair of Build and Probe Partitions  Our techniques achieve X speedups over the original hash join [Figure: a 50MB build partition joins a 100MB probe partition, 1:2 matching; the number of tuples decreases as tuple size increases]

@ Carnegie Mellon Databases Varying Memory Latency [Figure: execution time (M cycles) vs. processor-to-memory latency; a 50MB build partition joins a 100MB probe partition, 1:2 matching, 100B tuples]  150 cycles: default parameter  1000 cycles: memory latency in the future  Our techniques achieve 9X speedups over the baseline at 1000 cycles  Absolute performances of our techniques are very close

@ Carnegie Mellon Databases Comparison with Cache Partitioning  Cache partitioning: generating cache-sized partitions [Shatdal et al., 94], [Boncz et al., 99], [Manegold et al., 00]  Additional in-memory partition step after I/O partitioning  At least 50% worse than our prefetching schemes [Figure: partitioning + join time; a 200MB build relation joins a 400MB probe relation, 1:2 matching]

@ Carnegie Mellon Databases Robustness: Impact of Cache Interference  Cache partitioning relies on exclusive use of the cache  Periodically flush the cache: worst-case interference  Execution times are self-normalized to the no-flush case  Cache partitioning degrades 8-38%  Our prefetching schemes are very robust

@ Carnegie Mellon Databases Conclusions  Exploited inter-tuple parallelism  Proposed group prefetching and software-pipelined prefetching  Prior prefetching techniques cannot handle code complexity  Our techniques achieve dramatically better performance  X speedups for join phase  X speedups for partition phase  9X speedups at 1000 cycle memory latency in future  Absolute performances are close to that at 150 cycles  Robust against cache interference  Unlike cache partitioning  Our prefetching techniques are effective for hash joins

@ Carnegie Mellon Databases Thank you !

@ Carnegie Mellon Databases Back Up Slides

@ Carnegie Mellon Databases Is Hash Join CPU-bound?  Quad-processor Pentium III, four disks  A 1.5GB build relation, a 3GB probe relation  Main thread: GRACE hash join  Background I/O thread per disk: I/O prefetching and writing  Hash join is CPU-bound with reasonable I/O bandwidth  Still large room for CPU performance improvement [Figures: partition phase and join phase. Setup: 550MHz CPUs, 512MB RAM, Seagate Cheetah X15 36LP SCSI disks (max transfer rate 68MByte/sec), Linux; B tuples, 4B keys, 1:2 matching; striping unit = 256KB; 10 measurements, std < 10% of mean or std < 1s]

@ Carnegie Mellon Databases Hiding Latency within A Group  Hide cache miss latency across multiple tuples within a group  Group size can be increased to hide most cache miss latency for hash joins  Generic algorithm and analytical model (please see paper)  There are gaps between groups [Timeline diagrams comparing two group sizes, e.g. group size = 3]

@ Carnegie Mellon Databases Prefetching Distance  Prefetching distance (D): the number of iterations between two subsequent code stages for a single tuple  Increase the prefetching distance to hide all cache miss latency  Generic algorithm and analytical model (please see paper) [Timeline diagrams for D=1 and D=2]

@ Carnegie Mellon Databases Group Pref: Multiple Code Paths  We keep state information for tuples in a group  One of the states decides which code path to take [Diagram: control-flow graph in which the condition at A selects B, C, or F; the state (st = 1, 2, or 3) records which was taken, and later stages test it to decide whether to execute D, G, or nothing before E]

@ Carnegie Mellon Databases Prefetching Distance = D

Prologue;
for j=0 to N-3D-1 do {
  tuple j+3D: compute hash bucket number; prefetch the target bucket header;
  tuple j+2D: visit the hash bucket header; prefetch the hash cell array;
  tuple j+D:  visit the hash cell array; prefetch the matching build tuple;
  tuple j:    visit the matching build tuple to compare keys and produce output tuple;
}
Epilogue;

@ Carnegie Mellon Databases Experiment Setup  We have implemented our own hash join engine:  Relations are stored in files with slotted page structure  A simple XOR and shift based hash function is used  GRACE hash join (baseline), Simple prefetching, Group prefetching, Software-pipelined prefetching  Experiment Design:  Same schema for build and probe relations:  4-byte key + fixed length payload  No selection and projection  50MB memory available for the joins
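
For illustration, an XOR-and-shift style hash in the spirit of the one mentioned above (this particular mixing sequence is the xorshift32 step, not the function actually used in the paper's engine):

#include <stdint.h>

static inline uint32_t hash_key(uint32_t k) {
    k ^= k << 13;   /* XOR the key with shifted copies of itself ...            */
    k ^= k >> 17;
    k ^= k << 5;
    return k;       /* ... a cheap, reasonably well-mixed source of bucket bits */
}

/* Example: map a 4-byte join key to one of n_buckets buckets. */
static inline uint32_t bucket_of(uint32_t key, uint32_t n_buckets) {
    return hash_key(key) % n_buckets;
}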

@ Carnegie Mellon Databases Simulation Platform  Detailed cycle-by-cycle simulations  Out-of-order processor pipeline  Integer multiply and divide latencies are based on the Pentium 4  Memory hierarchy is based on Compaq ES40  Memory system parameters reflect the near future  Better prefetching support  Supports TLB prefetching

@ Carnegie Mellon Databases Simulation Parameters

Processor Pipeline Parameters:
  Clock Rate: 1 GHz
  Issue Width: 4 insts/cycle
  Functional Units: 2 Int, 2 FP, 1 Int Divide, 2 Mem, 1 Branch
  Reorder Buffer Size: 128 insts
  Integer Multiply/Divide: 15/56 cycles
  All Other Integer: 1 cycle
  Branch Prediction Scheme: gshare

Memory Parameters:
  Line Size: 64 bytes
  Primary Inst Cache: 64 KB, 2-way set assoc.
  Primary Data Cache: 64 KB, 4-way set assoc.
  Miss Handlers: 32 for data, 2 for inst
  DTLB: 64 entries, fully assoc.
  DTLB Miss Handlers: 1
  Page Size: 8 KB
  Unified Secondary Cache: 1 MB, 4-way set assoc.
  Primary-to-Secondary Miss Latency: 15 cycles (plus contention)
  DTLB Miss Latency: 20 cycles
  Primary-to-Memory Miss Latency: 150 cycles (plus contention)
  Main Memory Bandwidth: 1 access per 10 cycles

@ Carnegie Mellon Databases Simulator vs. Real Machine  Better prefetching support:  Never ignore prefetching  Our prefetches are not hints!  TLB prefetching  When a prefetch incurs a TLB miss, perform TLB loading  More miss handlers:  32 for data

@ Carnegie Mellon Databases Varying Group Size and Prefetching Distance  Too small: latencies are not fully hidden  Too large: many prefetched cache lines are replaced by other memory references  Similar performance even when latency increases to 1000 cycles! [Figure: hash table probing, varying these parameters for the 20B-tuple case in the previous figure]

@ Carnegie Mellon Databases Breakdowns of Cache Misses to Understand the Tuning Curves

@ Carnegie Mellon Databases Cache Performance Breakdowns  Our schemes indeed hide most of the data cache miss latencies  Overheads lead to larger portions of busy times [Figures: join with 100B tuples; partition into 800 partitions]

@ Carnegie Mellon Databases Partition Phase Performance  When the number of partitions is small, use simple prefetching  When the number of partitions is large, use group or software-pipelined prefetching  Combined: X speedups over the baseline

@ Carnegie Mellon Databases Cache Partitioning Schemes  Two ways to employ cache partitioning in GRACE join:  Direct cache: generate the cache partitions in the I/O partition phase  Two-step cache: generate the cache partitions in the join phase as a preprocessing step  Requires generating a larger number of smaller I/O partitions  Bounded by available memory  Bounded by requirements of underlying storage managers  So cache partitioning may not be used when the joining relations are very large  Not robust with multiple activities going on  Requires exclusive use of (part of) the cache  Performance penalty due to cache conflicts

@ Carnegie Mellon Databases Comparison with Cache Partitioning [Figures: I/O partition phase, join phase, overall performance]  Direct cache suffers from the larger number of partitions generated in the I/O partition phase  Two-step cache suffers from the additional partition step  Our schemes are the best (slightly better than direct cache)

@ Carnegie Mellon Databases Robustness: Impact of Cache Interference  Performance degradation when the cache is periodically flushed  The worst cache interference  Direct cache and 2-step cache degrade 15-67% and 8-38%  Our prefetching schemes are very robust ‘100’ corresponds to the join phase execution time when there is no cache flush

@ Carnegie Mellon Databases Group Pref vs. Software-pipelined Pref  Hiding latency:  Software-pipelined pref is always able to hide all latencies (according to our analytical model)  Book-keeping overhead:  Software-pipelined pref has more overhead  Code complexity:  Group prefetching is easier to implement  A natural group boundary provides a place to do any processing left over (e.g. for read-write conflicts)  A natural place to send outputs to the parent operator if a pipelined operator is needed

@ Carnegie Mellon Databases Challenges in Applying Prefetching  Try to hide the latency within the processing of a single tuple  Example: hash table probing  Does not work:  Dependencies essentially form a critical path  Randomness makes prediction almost impossible [Diagram: hash bucket headers, hash cell array, build partition]

@ Carnegie Mellon Databases Naïve Prefetching

foreach probe tuple {
  compute bucket number; prefetch header;
  visit header;          prefetch cell array;
  visit cell array;      prefetch matching build tuple;
  visit matching build tuple;
}

[Timeline diagram: the cache miss latency of each stage remains exposed] Data dependencies make it difficult to obtain addresses early

@ Carnegie Mellon Databases Group Prefetching

foreach group of probe tuples {
  foreach tuple in group {
    compute bucket number; prefetch header;
  }
  foreach tuple in group {
    visit header; prefetch cell array;
  }
  foreach tuple in group {
    visit cell array; prefetch matching build tuple;
  }
  foreach tuple in group {
    visit matching build tuple;
  }
}

[Timeline diagram: one group of tuples processed stage by stage]

@ Carnegie Mellon Databases Software-Pipelined Prefetching

Prologue;
for j=0 to N-4 do {
  tuple j+3: compute bucket number; prefetch header;
  tuple j+2: visit header;          prefetch cell array;
  tuple j+1: visit cell array;      prefetch matching build tuple;
  tuple j:   visit matching build tuple;
}
Epilogue;

[Timeline diagram: prologue, steady state with tuples j through j+3 in flight, epilogue]