Published by Skyler Bristol. Modified over 9 years ago.
1
Cache Coherence for GPU Architectures
Inderpreet Singh (1), Arrvindh Shriraman (2), Wilson Fung (1), Mike O’Connor (3), Tor Aamodt (1)
(1) University of British Columbia, (2) Simon Fraser University, (3) AMD Research
Image source: www.forces.gc.ca
2
What is a GPU?
[Figure: timeline in which the CPU spawns work on the GPU and the GPU reports done. GPU cores, each with an L1D cache, connect through an interconnect to L2 banks. Work is organized into workgroups and wavefronts.]
3
Evolution of GPUs
Graphics pipeline (OpenGL/DirectX): Vertex Shader, Pixel Shader
Compute (OpenCL, CUDA), e.g. Matrix Multiplication
4
Evolution of GPUs
Future: coherent memory space
Efficient critical sections
Load balancing
Stencil computation
[Figure: workgroups lock a shared structure, … computation …, unlock]
5
Inderpreet SinghCache Coherence for GPU Architectures5 C4 L1D A A B B C3 L1D A A B B C2 L1D A A B B GPU Coherence Challenges Challenge 1: Coherence traffic Do not require coherence No coherence MESI GPU-VI 0.5 1.0 1.5 2.2 Interconnect traffic 1.3 Recalls C1 L1D A A B B Load C gets C rcl A ack Load C Load D Load E Load F … Load G Load H Load I Load J … Load K Load L Load M Load N … Load O Load P Load Q Load R … A A B B L2/Directory
6
GPU Coherence Challenges
Challenge 2: Tracking in-flight requests
[Figure: L2 / Directory MSHR tracking a block in the transient state S_M, between S (Shared) and M (Modified).]
Significant % of L2
7
GPU Coherence Challenges
Challenge 3: Complexity
[Figure: state-transition tables (states by events) for Non-coherent L1, Non-coherent L2, MESI L1 and MESI L2.]
8
GPU Coherence Challenges
All three challenges result from introducing coherence messages on a GPU:
1. Traffic: transferring
2. Storage: tracking
3. Complexity: managing
GPU cache coherence without coherence messages? YES, using global time
9
Inderpreet SinghCache Coherence for GPU Architectures9 Core 1 L1D ▪▪▪▪▪▪ Temporal Coherence (TC) Global time Interconnect ▪▪▪▪▪▪ L2 Bank A=0 0 0 0 0 Global Timestamp < Global Time NO L1 COPIES Global Timestamp < Global Time NO L1 COPIES Core 2 L1D Local Timestamp > Global Time VALID Local Timestamp > Global Time VALID
10
Inderpreet SinghCache Coherence for GPU Architectures10 T=0 T=11 T=15 Core 1 L1D Interconnect L2 Bank Core 2 L1D Temporal Coherence (TC) ▪▪▪▪▪▪ A=0 0 0 Load A T=10 A=0 10 A=0 10 A=0 10 Store A=1 A=1 A=0 10 No coherence messages
11
Temporal Coherence (TC)
What lifetime values should be requested on loads?
Use a predictor to predict lifetime values
What about stores to unexpired blocks?
Stall them at the L2?
12
TC Stalling Issues
Stall?
Problem #1: Sensitive to mispredictions
Problem #2: Impedes other accesses
Problem #3: Hurts existing GPU applications
Solution: TC-Weak
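Problem #1 can be illustrated with a small calculation. The sketch below assumes integer cycle times and the L2 rule from the earlier slide (a block has no L1 copies once its global timestamp is less than global time); the function names are hypothetical. It shows how long a stalled store waits after a single earlier load, as a function of the lifetime that load was granted.

```python
def store_completion_time(now, global_ts):
    """Earliest integer cycle at which a stalled store may be performed.

    The store waits until global_ts < time, i.e. until every L1 copy has
    expired. Integer cycle times are an assumption of this sketch.
    """
    return max(now, global_ts + 1)


def stall_cycles(load_time, lifetime, store_time):
    """Cycles a store waits at the L2 after one earlier load of the block."""
    global_ts = load_time + lifetime
    return store_completion_time(store_time, global_ts) - store_time


# A store at T=5 to a block that was loaded at T=0:
print(stall_cycles(load_time=0, lifetime=10, store_time=5))   # 6
print(stall_cycles(load_time=0, lifetime=100, store_time=5))  # 96
print(stall_cycles(load_time=0, lifetime=2, store_time=5))    # 0
```

An overpredicted lifetime translates directly into stall cycles at the L2, which is the sensitivity the slide refers to.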
13
Inderpreet SinghCache Coherence for GPU Architectures13 L2 Bank 47 T=1T=31 TC-Weak Stores return Global Write Completion Time (GWCT) GPU Core 2 L1D Interconnect GWCT Table W0: W1: GWCT Table W0: W1: data=OLD 30 data=OLD flag=NULL GPU Core 1 L1D GWCT Table W0: W1: GWCT Table W0: W1: 1 data=NEW 2 FENCE 3 flag=SET Store data=NEW Store flag=SET 1 data=NEW 2 FENCE 3 flag=SET 30 1 data=NEW 2 FENCE 3 flag=SET 1 data=NEW 2 FENCE 3 flag=SET data=NEW flag=SET data=OLD 30 T=0 47 No stalling at L2
14
TC-Weak
[Table comparing Stalling and TC-Weak on the three criteria below; the per-cell marks did not survive extraction.]
Misprediction sensitivity
Doesn’t impede other accesses
Good for existing GPU applications
15
Methodology
GPGPU-Sim v3.1.2 for GPU core model
GEMS Ruby v2.1.1 for memory system
All protocols written in SLICC
Model a generic NVIDIA Fermi-based GPU (see paper for details)
Applications:
6 do not require coherence
6 require coherence: Barnes Hut, Cloth Physics, Versatile Place and Route, Max-Flow Min-Cut, 3D Wave Equation Solver, Octree Partitioning
Locks, Stencil communication, Load balancing
16
Interconnect Traffic
[Chart: interconnect traffic for NO-COH, MESI, GPU-VI and TC-Weak on applications that do not require coherence. Axis from 0.00 to 1.50, with one value labelled 2.3.]
Reduces traffic by 53% over MESI and 23% over GPU-VI for intra-workgroup applications
Lower traffic than 16x-sized 32-way directory
17
Performance
[Chart: speedup of NO-L1, MESI, GPU-VI and TC-Weak on applications that require coherence. Axis from 0.0 to 2.0.]
TC-Weak with simple predictor performs 85% better than disabling L1 caches
Performs 28% better than TC with stalling
Larger directory sizes do not improve performance
18
Complexity
[Figure: state-transition tables for Non-Coherent L1, Non-Coherent L2, MESI L1, MESI L2, TC-Weak L1 and TC-Weak L2.]
19
Summary
First work to characterize GPU coherence challenges
Save traffic and energy by using global time
Reduce protocol complexity
85% performance improvement over no coherence
Questions?
20
Backup Slides
21
Lifetime Predictor
One prediction value per L2 bank
Events local to L2 bank update prediction value
Events and their effect on the prediction:
1. Expired load: ↑
2. Unexpired store: ↓
3. Unexpired eviction: ↓
[Figure: L2 bank with its prediction value at T = 0 and T = 20. Block A appears with timestamps 10 and 30; a Load A triggers prediction++ and a Store A triggers prediction--.]
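The three update events can be sketched as a small per-bank counter. The class name `LifetimePredictor`, the initial value, the step size and the clamping bounds are assumptions made for illustration; the slide only gives the direction of each update. A block is treated as expired when its timestamp is less than the current time, matching the L2 rule used earlier in the deck.

```python
class LifetimePredictor:
    """One prediction value per L2 bank, nudged by events local to that bank."""

    def __init__(self, initial=10, step=1, lo=0, hi=1000):
        # initial, step, lo and hi are illustrative choices, not from the slides
        self.value = initial
        self.step = step
        self.lo = lo
        self.hi = hi

    def _adjust(self, delta):
        self.value = min(self.hi, max(self.lo, self.value + delta))

    def on_load(self, block_ts, now):
        """Expired load: the previous lifetime was too short, so raise it."""
        if block_ts < now:
            self._adjust(+self.step)
        return self.value  # lifetime to grant to this load

    def on_store(self, block_ts, now):
        """Unexpired store: the lifetime was too long, so lower it."""
        if block_ts >= now:
            self._adjust(-self.step)

    def on_eviction(self, block_ts, now):
        """Unexpired eviction: the lifetime was too long, so lower it."""
        if block_ts >= now:
            self._adjust(-self.step)


p = LifetimePredictor(initial=10)
p.on_load(block_ts=10, now=20)      # expired load: prediction++
p.on_store(block_ts=30, now=20)     # unexpired store: prediction--
p.on_eviction(block_ts=30, now=25)  # unexpired eviction: prediction--
print(p.value)                      # 9
```

Keeping a single value per bank means the predictor needs no per-block state and reacts only to events the bank already observes.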
22
TC-Strong vs TC-Weak
[Charts: speedup over all applications for TCSUO, TCS, TCSOO, TCW and TCW w/ predictor, in two panels: "Fixed lifetime for all applications" and "Best lifetime for each application".]
23
Interconnect Power and Energy