Presentation transcript:

Slide 1: Cache Coherence for GPU Architectures. Inderpreet Singh (1), Arrvindh Shriraman (2), Wilson Fung (1), Mike O'Connor (3), Tor Aamodt (1). (1) University of British Columbia, (2) Simon Fraser University, (3) AMD Research.

Slide 2: What is a GPU? [Diagram: the CPU spawns work onto the GPU and continues when the GPU is done. A GPU consists of many cores, each with a private L1 data cache (L1D), connected through an interconnect to shared L2 banks. Work is organized into workgroups, which execute as wavefronts.]

Slide 3: Evolution of GPUs. GPUs began as graphics pipelines driven through OpenGL/DirectX, with vertex shader and pixel shader stages. They have since added general-purpose compute (OpenCL, CUDA), e.g. matrix multiplication.

Slide 4: Evolution of GPUs. The future: a coherent memory space, enabling efficient critical sections, load balancing, and stencil computation. For example, workgroups lock a shared structure, perform their computation, and unlock it.

Slide 5: GPU Coherence Challenges. Challenge 1: coherence traffic. Many GPU accesses do not require coherence, yet conventional protocols still generate traffic for them. [Diagram: cores C1-C4 each hold blocks A and B in their L1Ds; a stream of loads to new blocks (C, D, E, ..., R) at the L2/directory forces recall messages ("rcl A") and acknowledgements that invalidate the unshared L1 copies. Chart: interconnect traffic of MESI and GPU-VI versus no coherence for applications that do not require coherence.]

Slide 6: GPU Coherence Challenges. Challenge 2: tracking in-flight requests. The L2/directory must buffer in-flight coherence requests in MSHRs and hold blocks in transient states (e.g. S_M, between Shared and Modified) until they complete; this tracking consumes a significant percentage of L2 storage.

Slide 7: GPU Coherence Challenges. Challenge 3: complexity. [Table: states and events of the non-coherent L1 and L2 controllers versus the MESI L1 and L2 controllers; MESI requires many more of both.]

Slide 8: GPU Coherence Challenges. All three challenges result from introducing coherence messages on a GPU: (1) traffic from transferring them, (2) storage for tracking them, (3) complexity of managing them. Can we have GPU cache coherence without coherence messages? Yes, using global time.

Slide 9: Temporal Coherence (TC). TC relies on a globally synchronized time visible to every core and L2 bank. Each L1 block carries a local timestamp: the copy is valid only while local timestamp > global time, after which it self-invalidates. Each L2 block carries a global timestamp: once global timestamp < global time, no unexpired L1 copies of the block can exist.

Slide 10: Temporal Coherence (TC). [Example timeline: at T=0, Core 1 loads A; the L2 responds with the data and timestamp 10, so Core 1's L1 copy is valid until T=10 and the L2 records global timestamp 10. At T=11, Core 2 issues Store A=1; Core 1's copy has already self-invalidated, so the store completes (by T=15) with no coherence messages.]
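To make the timestamp mechanics concrete, here is a minimal C++ sketch of the checks described on slides 9 and 10. Everything here (the Time alias, the L1Block/L2Block structs, and the function names) is an illustrative assumption; the real protocol lives in the cache controllers, not in software.

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>

using Time = std::uint64_t; // globally synchronized counter visible to L1s and L2

struct L1Block {
    int data = 0;
    Time local_timestamp = 0; // copy self-invalidates once this expires
};

struct L2Block {
    int data = 0;
    Time global_timestamp = 0; // latest expiry among outstanding L1 copies
};

// L1 hit test: a cached copy is valid only while its lease is unexpired.
bool l1_valid(const L1Block& b, Time now) {
    return b.local_timestamp > now;
}

// L2 load: hand the data out with a lifetime (slide 11 replaces this fixed
// argument with a predictor) and remember the latest expiry time.
void l2_load(L2Block& b, Time now, Time lifetime, L1Block& fill) {
    fill.data = b.data;
    fill.local_timestamp = now + lifetime;
    b.global_timestamp = std::max(b.global_timestamp, now + lifetime);
}

int main() {
    L2Block A;        // A = 0 at the L2
    L1Block copy;
    l2_load(A, /*now=*/0, /*lifetime=*/10, copy);  // Core 1: Load A at T=0
    std::cout << l1_valid(copy, 5) << '\n';   // T=5: copy still valid -> 1
    std::cout << l1_valid(copy, 11) << '\n';  // T=11: self-invalidated -> 0
    A.data = 1;  // Core 2: Store A=1 at T=11, no invalidation messages needed
}
```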

Slide 11: Temporal Coherence (TC). What lifetime values should be requested on loads? Use a predictor to predict lifetime values. What about stores to unexpired blocks? Stall them at the L2?
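As a contrast, here is a hedged sketch of the "stall them at the L2" option this slide raises, reusing the illustrative types from the previous sketch; the commit delay it computes is exactly the stall whose costs the next slide lists.

```cpp
#include <algorithm>
#include <cstdint>

using Time = std::uint64_t;

struct L2Block {
    int data = 0;
    Time global_timestamp = 0;
};

// TC with stalling: a store to a block whose global timestamp has not yet
// expired is held at the L2 until every possible L1 copy has
// self-invalidated. Returns the time at which the store actually commits.
Time l2_store_stalling(L2Block& b, Time now, int value) {
    Time commit = std::max(now, b.global_timestamp); // stall until expiry
    b.data = value;  // in hardware, the write is buffered until `commit`
    return commit;
}

int main() {
    L2Block a{0, /*global_timestamp=*/30};
    // A store arriving at T=11 cannot commit until T=30: the longer the
    // (mis)predicted lifetime, the longer the stall, and the stalled store
    // blocks other accesses to the same L2 bank.
    return l2_store_stalling(a, 11, 1) == 30 ? 0 : 1;
}
```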

Slide 12: TC Stalling Issues. Stalling stores at the L2 has three problems. Problem 1: performance is sensitive to lifetime mispredictions. Problem 2: a stalled store impedes other accesses at the L2. Problem 3: it hurts existing GPU applications. Solution: TC-Weak.

Slide 13: TC-Weak. In TC-Weak, stores do not stall: the L2 commits the write immediately and returns a Global Write Completion Time (GWCT), the global time at which all L1 copies will have expired and the write is therefore visible everywhere. Each core keeps a per-wavefront GWCT table that records the latest GWCT among that wavefront's outstanding stores. [Example: wavefront W0 on GPU Core 1 executes (1) Store data=NEW, (2) FENCE, (3) Store flag=SET. The data store returns GWCT=30, which is recorded in W0's table entry; the FENCE stalls only W0 until global time reaches 30; the flag store then proceeds, so GPU Core 2 can never observe flag=SET together with data=OLD. There is no stalling at the L2.]
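A minimal sketch of TC-Weak's bookkeeping, again with illustrative names; the per-wavefront GWCT table and the fence rule follow the slide, while the types and function signatures are assumptions.

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>

using Time = std::uint64_t;

struct L2Block {
    int data = 0;
    Time global_timestamp = 0;
};

struct WavefrontEntry {
    Time gwct = 0; // max GWCT over this wavefront's outstanding stores
};

// TC-Weak store: commit immediately at the L2; the returned GWCT is the
// global time by which every stale L1 copy will have self-invalidated.
Time l2_store(L2Block& b, int value) {
    b.data = value;
    return b.global_timestamp;
}

// Record the store's GWCT in the issuing wavefront's GWCT table entry.
void record_store(WavefrontEntry& w, Time gwct) {
    w.gwct = std::max(w.gwct, gwct);
}

// FENCE completes only once global time has passed the wavefront's GWCT;
// the stall is local to the wavefront, never at the L2.
bool fence_complete(const WavefrontEntry& w, Time now) {
    return now >= w.gwct;
}

int main() {
    L2Block data{0, /*global_timestamp=*/30}, flag{0, 0};
    WavefrontEntry w0;
    record_store(w0, l2_store(data, 1));         // (1) Store data=NEW, GWCT=30
    std::cout << fence_complete(w0, 11) << '\n'; // (2) FENCE at T=11: wait -> 0
    std::cout << fence_complete(w0, 30) << '\n'; //     at T=30: done -> 1
    record_store(w0, l2_store(flag, 1));         // (3) Store flag=SET, safe now
}
```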

Slide 14: TC with Stalling vs. TC-Weak. TC-Weak has lower misprediction sensitivity, doesn't impede other accesses, and is good for existing GPU applications.

Slide 15: Methodology. GPU core model: GPGPU-Sim v3.1.2. Memory system: GEMS Ruby v2.1.1, with all protocols written in SLICC. We model a generic NVIDIA Fermi-based GPU (see paper for details). Applications: 6 that do not require coherence and 6 that do. The latter (Barnes Hut, Cloth Physics, Versatile Place and Route, Max-Flow Min-Cut, 3D Wave Equation Solver, Octree Partitioning) exercise locks, stencil communication, and load balancing.

Slide 16: Interconnect Traffic. For intra-workgroup applications (those that do not require coherence), TC-Weak reduces interconnect traffic by 53% over MESI and 23% over GPU-VI, and generates less traffic than GPU-VI with a 16x-sized, 32-way directory. [Chart: interconnect traffic of NO-COH, MESI, GPU-VI, and TC-Weak.]

Slide 17: Performance. For applications that require coherence, TC-Weak with a simple lifetime predictor performs 85% better than disabling the L1 caches (NO-L1) and 28% better than TC with stalling. Larger directory sizes do not improve performance. [Chart: speedup of MESI, GPU-VI, and TC-Weak over NO-L1.]

Slide 18: Complexity. [Table: states for the non-coherent L1 and L2, the MESI L1 and L2, and the TC-Weak L1 and L2 controllers; TC-Weak adds far fewer states than MESI.]

Slide 19: Summary. This is the first work to characterize the coherence challenges of GPU architectures. Using global time saves traffic and energy and reduces protocol complexity, yielding an 85% performance improvement over no coherence. Questions?

Slide 20: Backup Slides.

Slide 21: Lifetime Predictor. One prediction value per L2 bank; events local to the L2 bank update it. Events and their effect on the prediction: (1) a load to an expired block increases it; (2) a store to an unexpired block decreases it; (3) an eviction of an unexpired block decreases it. [Example: at T=0, Load A triggers prediction++ and the block is granted timestamp 10; at T=20, Store A finds the block still unexpired (timestamp 30), triggering prediction--.]
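A small C++ sketch of these update rules; the directions of the updates follow the slide, while the initial value, step size, and clamp are assumed for illustration.

```cpp
#include <algorithm>
#include <cstdint>

using Time = std::uint64_t;

// One predictor per L2 bank; every load is granted the current prediction
// as its lifetime, and events local to the bank nudge the prediction.
struct LifetimePredictor {
    Time prediction = 10;               // assumed initial lifetime
    static constexpr Time kStep = 1;    // assumed step size
    static constexpr Time kMax = 1000;  // assumed upper clamp

    // 1. Load to an expired block: lifetimes are too short, grow them.
    void on_expired_load()    { prediction = std::min(prediction + kStep, kMax); }
    // 2. Store to an unexpired block: lifetimes delay writes, shrink them.
    void on_unexpired_store() { prediction -= std::min(prediction, kStep); }
    // 3. Eviction of an unexpired block: lifetimes outlive the data, shrink them.
    void on_unexpired_evict() { prediction -= std::min(prediction, kStep); }
};

int main() {
    LifetimePredictor p;
    p.on_expired_load();     // e.g. the Load A event on this slide
    p.on_unexpired_store();  // e.g. the Store A event at T=20
    return static_cast<int>(p.prediction); // back where it started: 10
}
```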

Slide 22: TC-Strong vs. TC-Weak. [Charts: speedup across all applications for TCSUO, TCS, TCSOO, TCW, and TCW with predictor, under (a) a fixed lifetime for all applications and (b) the best lifetime for each application.]

Slide 23: Interconnect Power and Energy. [Chart omitted.]