
1 Improving Hash Join Performance through Prefetching
By SHIMIN CHEN (Intel Research Pittsburgh), ANASTASSIA AILAMAKI (Carnegie Mellon University), PHILLIP B. GIBBONS (Intel Research Pittsburgh), and TODD C. MOWRY (Carnegie Mellon University and Intel Research Pittsburgh)
Presented by Manisha Singh

2 Outline
-Overview
-Proposed techniques
-Experimental setup
-Performance evaluation
-Conclusion

3 Hash Joins
-Widely used in the implementation of relational database management systems.
-Join two relations: the build relation (smaller) and the probe relation (larger).
-Suffer from excessive random I/Os if the build relation and hash table cannot fit in memory.
(Slide figure: the build relation is hashed into a hash table, which is then probed with tuples of the probe relation.)
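The build-then-probe structure on this slide can be sketched in C. This is a minimal illustration, not the paper's implementation; the names, the bucket count, and the one-int payload are our simplifications of the "4-byte join attribute plus fixed-length payload" schema mentioned later in the talk.

```c
#include <stdlib.h>

#define NBUCKETS 1024

/* Tuple layout from the slides: a 4-byte join attribute plus a
   fixed-length payload (shortened to a single int here). */
typedef struct Entry {
    int key;              /* join attribute */
    int payload;
    struct Entry *next;   /* chain on hash collisions */
} Entry;

static Entry *buckets[NBUCKETS];

static unsigned hash_key(int key) {
    /* Knuth-style multiplicative hash; any decent hash works here. */
    return ((unsigned)key * 2654435761u) % NBUCKETS;
}

/* Build phase: hash every tuple of the (smaller) build relation
   into the in-memory hash table. */
static void build(const int *keys, const int *payloads, int n) {
    for (int i = 0; i < n; i++) {
        Entry *e = malloc(sizeof *e);
        unsigned h = hash_key(keys[i]);
        e->key = keys[i];
        e->payload = payloads[i];
        e->next = buckets[h];
        buckets[h] = e;
    }
}

/* Probe phase: for one tuple of the (larger) probe relation,
   count the matching build tuples in its bucket chain. */
static int probe(int key) {
    int matches = 0;
    for (Entry *e = buckets[hash_key(key)]; e; e = e->next)
        if (e->key == key)
            matches++;
    return matches;
}
```

The bucket-chain walk in `probe` is exactly the pointer-chasing, random-access pattern that causes the cache misses discussed on the next slides.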

4 Hash Join Performance
-Suffers from CPU cache stalls: most of the execution time is wasted on data cache misses (82% for the partition phase, 73% for the join phase).
-The cause is random access patterns in memory.

5 Solution: Cache Prefetching
-Cache prefetching has been successfully applied to several types of applications.
-Idea: exploit cache prefetching to improve hash join performance.

6 Challenges to Cache Prefetching
-Difficult to obtain memory addresses early:
–Randomness of hashing prohibits address prediction
–Data dependencies within the processing of a tuple
-Complexity of hash join code:
–Ambiguous pointer references
–Multiple code paths
–Compiler prefetching techniques cannot be applied

7 Overcoming These Challenges
-The paper evaluates two new prefetching techniques:
–Group prefetching: hide cache miss latency across a group of tuples
–Software-pipelined prefetching: avoid the intermittent stalls at group boundaries

8 Group Prefetching
–Hides cache miss latency across a group of tuples.
–Combines the processing of a group of tuples into a single loop body and rearranges the operations into code stages.
–Processes all tuples in the group for one stage, then moves to the next stage.
–Issues prefetch instructions in one code stage for the memory references of the next code stage.
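The staging described above can be sketched as follows. This is our two-stage simplification over a flat table; the paper's version has more stages and walks chained hash buckets. `__builtin_prefetch` is the GCC/Clang intrinsic; it is only a hint and may be a no-op on some targets.

```c
#define NBUCKETS 256
#define GROUP 8   /* tuples processed together per group */

static int table[NBUCKETS];   /* toy "hash table": one value per bucket */

static unsigned hash_key(int key) {
    return ((unsigned)key * 2654435761u) % NBUCKETS;
}

/* Group prefetching: stage 0 hashes every key in the group and issues
   a prefetch for its bucket; stage 1 then reads the buckets, by which
   time the earlier misses have (hopefully) been serviced in parallel. */
static long probe_sum(const int *keys, int n) {
    long sum = 0;
    for (int base = 0; base < n; base += GROUP) {
        int lim = (base + GROUP < n) ? base + GROUP : n;
        unsigned h[GROUP];
        /* Stage 0: compute hashes, prefetch all buckets for the group. */
        for (int i = base; i < lim; i++) {
            h[i - base] = hash_key(keys[i]);
            __builtin_prefetch(&table[h[i - base]], 0 /* read */, 1);
        }
        /* Stage 1: consume the buckets; loads overlap the misses above. */
        for (int i = base; i < lim; i++)
            sum += table[h[i - base]];
    }
    return sum;
}
```

The key property is that the `GROUP` independent cache misses of stage 0 are all in flight at once, instead of each tuple stalling on its own miss in turn.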

9 Group Prefetching

10 Software-Pipelined Prefetching –Overlaps cache misses across different code stages of different tuples –The code stages of the same tuple are processed in subsequent iterations –Can overlap the cache miss latency of a tuple across all processing in an iteration
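The iteration structure on this slide can be sketched in C. Again this is our simplification over a flat table, not the paper's code: each iteration consumes the bucket of tuple `i` (prefetched `DIST` iterations ago) while issuing the prefetch for tuple `i + DIST`, so there is no group boundary and misses overlap continuously.

```c
#define NBUCKETS 256
#define DIST 4    /* prefetch distance: how many tuples ahead to prefetch */

static int table[NBUCKETS];   /* toy "hash table": one value per bucket */

static unsigned hash_key(int key) {
    return ((unsigned)key * 2654435761u) % NBUCKETS;
}

/* Software-pipelined prefetching over the probe loop. A small ring
   buffer h[] carries each tuple's precomputed hash from the iteration
   that prefetched its bucket to the iteration that consumes it. */
static long probe_sum_pipelined(const int *keys, int n) {
    unsigned h[DIST];
    long sum = 0;
    /* Prologue: fill the pipeline with the first DIST prefetches. */
    int warm = (n < DIST) ? n : DIST;
    for (int i = 0; i < warm; i++) {
        h[i % DIST] = hash_key(keys[i]);
        __builtin_prefetch(&table[h[i % DIST]], 0 /* read */, 1);
    }
    /* Steady state and epilogue. */
    for (int i = 0; i < n; i++) {
        sum += table[h[i % DIST]];       /* consume tuple i (prefetched) */
        int j = i + DIST;
        if (j < n) {                     /* prefetch tuple i + DIST */
            h[j % DIST] = hash_key(keys[j]);
            __builtin_prefetch(&table[h[j % DIST]], 0, 1);
        }
    }
    return sum;
}
```

Compared with the group version, there is no per-group barrier, but the ring buffer and index arithmetic illustrate the extra book-keeping overhead the next slide mentions.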

11 Software-Pipelined Prefetching

12 Group vs. Software-Pipelined Prefetching
Hiding latency:
–Software-pipelined prefetching is always able to hide all latencies
Book-keeping overhead:
–Software-pipelined prefetching has more overhead
Code complexity:
–Group prefetching is easier to implement
–A natural group boundary provides a place to do any remaining processing (e.g. for read-write conflicts)
–It is also a natural place to send outputs to the parent operator if a pipelined operator is needed

13 Experimental Setup
-Use a simple schema for both the build and probe relations.
-Every tuple contains a 4-byte join attribute and a fixed-length payload.
-Perform the join without selections or projections.
-Assume the join phase uses 50 MB of memory to join a pair of build and probe partitions.

14 Performance Evaluation
Hash join is CPU-bound with reasonable I/O bandwidth:
-The main total time is the elapsed real time of an algorithm phase.
-The worker I/O stall time is the largest I/O stall time among the individual worker threads.

15 Performance Evaluation cont.
User-mode CPU cache performance, join phase:
-These techniques achieved X speedups over the original hash join.

16 Performance Evaluation cont.
Join performance with varying memory latency:
-The prefetching techniques remain effective even when the processor/memory speed gap increases dramatically.

17 Performance Evaluation cont.

18 Some Practical Issues
Some issues may arise when implementing these prefetching techniques in a production DBMS that targets multiple architectures and is distributed as binaries:
1. The syntax of prefetch instructions differs across architectures and compilers.
2. Some architectures do not support faulting prefetches.
3. Several architectures (e.g. network processors) require software to explicitly manage the caches.
4. Pre-set parameters for the group size and the prefetch distance may be suboptimal on machines with very different configurations.
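Issue 1 is commonly handled by hiding the per-compiler prefetch syntax behind a macro. The sketch below is one way to do this (ours, not the paper's): GCC/Clang and MSVC get their respective intrinsics, and unknown compilers fall back to a no-op, which is always safe because prefetches are only hints.

```c
/* Portable read-prefetch macro: pick the compiler's intrinsic if we
   recognize it, otherwise expand to nothing. */
#if defined(__GNUC__) || defined(__clang__)
#  define PREFETCH_READ(p) __builtin_prefetch((p), 0 /* read */, 1)
#elif defined(_MSC_VER)
#  include <xmmintrin.h>
#  define PREFETCH_READ(p) _mm_prefetch((const char *)(p), _MM_HINT_T2)
#else
#  define PREFETCH_READ(p) ((void)0)  /* no-op fallback: still correct */
#endif
```

Issue 4 (tuning group size and prefetch distance) is harder to hide behind a macro; those parameters would typically be set per platform or at runtime.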

19 Conclusion
-Even though prefetching is a promising technique for improving CPU cache performance, applying it to the hash join algorithm is not straightforward, due to the dependencies within the processing of a single tuple and the randomness of hashing.
-Experimental results demonstrated that hash join performance can be improved by using the group prefetching and software-pipelined prefetching techniques.
-Several practical issues arise when these techniques are used in a DBMS that targets multiple architectures.

20 Thank you Questions?