QuickRelease: A Throughput-Oriented Approach to Release Consistency on GPUs

Blake A. Hechtman†§, Shuai Che†, Derek R. Hower†, Yingying Tian†Ϯ, Bradford M. Beckmann†, Mark D. Hill‡†, Steven K. Reinhardt†, David A. Wood‡†
§Duke University, ϮTexas A&M University, ‡University of Wisconsin-Madison, †Advanced Micro Devices, Inc.

Executive summary
- GPU memory systems are designed for high throughput.
- Goal: expand the relevance of GPU compute.
  - Requires good performance on a broader set of applications, including irregular and synchronized data accesses.
- Naïve solution: a CPU-like cache-coherent memory system for GPUs.
  - Efficiently supports irregular and synchronized data accesses.
  - Costs significant graphics and streaming performance → unacceptable.
- QuickRelease:
  - Supports both fine-grain synchronization and streaming applications.
  - Enhances current GPU memory systems with a simple write-tracking FIFO.
  - Avoids expensive cache flushes on synchronization events.

Outline
- Motivation
  - Current GPUs
  - Future GPUs
- Memory system design and supporting synchronization
  - Write-through with cache flushes (WT)
  - CPU-like "read-for-ownership" cache coherence (RfO)
  - QuickRelease
- Results
- Conclusions / future work

Current GPUs
- Maintain streaming and graphics performance.
- High-bandwidth, latency-tolerant L1 and L2 caches.
- Writes are coalesced at the CU and written through.
- Maintaining coherence with the CPU requires coalescing writes with the CPU.
[Figure: CPU cores (CPU0, CPU1) with private L1s and a shared L2, and GPU CUs (CU0, CU1) with private L1s and a shared L2, both backed by an LLC and directory/memory.]

Future GPUs
- Expand the scope of GPU compute applications.
  - Support more irregular workloads efficiently.
  - Support synchronized data efficiently.
  - Leverage more locality.
- Reduce programmer effort.
  - Support over-synchronization efficiently.
  - No labeling of volatile (read-write shared) data structures.
- Allow more sharing than OpenCL 1.x (e.g., HSA or OpenCL 2.0).
  - Global synchronization between workgroups.
  - Heterogeneous kernels with concurrent CPU execution and sharing.

How can we support both?
- To expand the utility of GPUs beyond graphics, support:
  - Irregular parallel applications
  - Fine-grain synchronization
  - Both benefit from coherent caches.
- But traditional CPU coherence is inappropriate for:
  - Regular streaming workloads
  - Coarse-grain synchronization
  - Graphics will remain a primary application for GPUs.
- Thus, we want coherence guided by synchronization:
  - Avoid the scalability challenges of "read-for-ownership" (RfO) coherence.
  - Maintain streaming performance with coarse-grain synchronization.

Synchronization operations
- Traditional synchronization:
  - Kernel begin: all stores from the CPU and prior kernel completions are visible.
  - Kernel end: all stores from a kernel are visible to the CPU and future kernels.
  - Barrier: all members of a workgroup are at the same PC, and all prior stores in program order will be visible.
- The HSA specification includes load-acquire and store-release:
  - Load-acquire (LdAcq): a load that occurs before all memory operations later in program order (like kernel begin).
  - Store-release (StRel): a store that occurs after all prior memory operations in program order (like kernel end or barrier).
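The LdAcq/StRel semantics above map directly onto C++11 acquire/release atomics. The sketch below is an illustrative analogue, not HSA code; `payload` and `flag` are invented names playing the roles of ordinary shared data and a synchronization variable.

```cpp
#include <atomic>
#include <thread>

int payload = 0;            // ordinary shared data
std::atomic<int> flag{0};   // synchronization variable

void producer() {
    payload = 1;                               // plain store
    flag.store(2, std::memory_order_release);  // StRel: all prior stores become visible first
}

int consumer() {
    while (flag.load(std::memory_order_acquire) != 2) {
        // LdAcq: spin until the release is observed
    }
    return payload;  // guaranteed to observe 1 after the acquire succeeds
}

int run_example() {
    std::thread t(producer);
    int seen = consumer();
    t.join();
    return seen;
}
```

The release store orders all earlier memory operations before it, and the acquire load orders all later memory operations after it, so the consumer can never see the flag without also seeing the payload.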

Write-through (WT) memory system
- Clean caches support fast reads.
- Wavefront coalescing of writes at the CU; byte-wise writes are tracked.
- LdAcq → invalidate the entire L1 cache.
- StRel → ensure all prior writes have been written through.
[Figure: GPU CUs (CU0, CU1) with L1s and a shared L2, and CPU cores (CPU0, CPU1) with L1s and a shared L2, backed by an LLC and directory/memory.]
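As a rough illustration of these rules, here is a toy single-CU model. All identifiers are invented for this sketch; real hardware tracks byte-wise dirty masks and coalesces writes per wavefront rather than operating on whole words.

```cpp
#include <cstdint>
#include <unordered_map>

// Toy model of the WT policy: every store writes through immediately,
// and a load-acquire invalidates the entire (clean) L1.
struct WTCacheModel {
    std::unordered_map<uint64_t, int> l1;    // clean read cache
    std::unordered_map<uint64_t, int>& mem;  // shared backing LLC/memory

    explicit WTCacheModel(std::unordered_map<uint64_t, int>& m) : mem(m) {}

    void store(uint64_t addr, int v) {
        mem[addr] = v;   // write-through: memory is updated immediately
        l1[addr] = v;    // keep the line locally for read reuse
    }
    int load(uint64_t addr) {
        auto it = l1.find(addr);
        if (it != l1.end()) return it->second;  // L1 hit (possibly stale)
        return l1[addr] = mem[addr];            // miss: fill from memory
    }
    void load_acquire_done() { l1.clear(); }    // LdAcq: flush the entire L1
    void store_release(uint64_t addr, int v) {  // StRel: prior writes already through
        store(addr, v);
    }
};
```

Until the consumer performs a load-acquire, its L1 may keep returning stale values; the full-cache invalidation is what makes the producer's writes visible, which is exactly the cost QuickRelease later avoids.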

Read-for-ownership (RfO) memory system
- The design of current CPUs.
- Single-writer or multiple-readers invariant.
- Wavefront coalescing of writes at the CU.
- LdAcq and StRel are simply Ld and St operations.
- Requires invalidations, dirty writebacks, and data responses.
[Figure: the same GPU/CPU cache hierarchy backed by an LLC and directory/memory.]
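The single-writer or multiple-readers invariant can be sketched with a toy directory. All names are invented for illustration; a real RfO protocol adds transient states, acknowledgments, and data responses that are omitted here.

```cpp
#include <cstdint>
#include <set>
#include <unordered_map>

// Toy directory enforcing the RfO invariant: at any time a block has
// either one exclusive writer or any number of readers.
struct ToyDirectory {
    struct Entry {
        int owner = -1;          // -1 means no exclusive owner
        std::set<int> sharers;   // CUs/cores holding a readable copy
    };
    std::unordered_map<uint64_t, Entry> dir;

    // A read adds the requester to the sharer set, downgrading any owner.
    void read(int cu, uint64_t addr) {
        Entry& e = dir[addr];
        if (e.owner != -1 && e.owner != cu) e.owner = -1;
        e.sharers.insert(cu);
    }
    // A write must first gain exclusive ownership: probe and invalidate
    // every other sharer before the store can proceed.
    void write(int cu, uint64_t addr) {
        Entry& e = dir[addr];
        e.sharers.clear();
        e.sharers.insert(cu);
        e.owner = cu;
    }
    bool single_writer_or_readers(uint64_t addr) {
        Entry& e = dir[addr];
        return e.owner == -1 || e.sharers.size() == 1;
    }
};
```

The sketch shows why RfO is expensive for streaming GPU workloads: every write to shared data pays an ownership round trip and invalidations, even when no fine-grain synchronization is needed.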

QuickRelease basics
- Coherence by "begging for forgiveness" instead of "asking for permission."
- Separate read and write paths:
  - Track byte-wise writes to avoid reading for ownership.
  - Coalesce writes across wavefronts.
  - Supports irregular local writes and local read-after-writes (RAW).
  - Reduces write-through traffic.
- Only invalidate necessary blocks:
  - Reuse data across synchronization.
- Overlap invalidations with writes to memory:
  - Precise: only synchronization stalls on invalidation acks.
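A minimal sketch of the write-tracking FIFO idea, under the simplifying assumptions of a single CU and instantaneous memory. All identifiers are invented for illustration; the real design keeps byte-wise masks in a separate write cache and overlaps invalidation probes with the drain.

```cpp
#include <cstdint>
#include <deque>
#include <unordered_map>

// Toy model of the QuickRelease write path: stores coalesce in a write
// cache while the S-FIFO remembers which addresses must be made visible
// at the next release.
struct QRWritePath {
    std::deque<uint64_t> s_fifo;                // addresses written since the last release
    std::unordered_map<uint64_t, int> w_cache;  // write-combining cache
    std::unordered_map<uint64_t, int>& mem;     // shared backing memory

    explicit QRWritePath(std::unordered_map<uint64_t, int>& m) : mem(m) {}

    void store(uint64_t addr, int v) {
        if (!w_cache.count(addr)) s_fifo.push_back(addr);  // track first write only
        w_cache[addr] = v;                                 // later writes coalesce
    }
    // StRel: drain the FIFO so all prior writes become visible, then
    // publish the release value itself. Only this point has to wait.
    void store_release(uint64_t addr, int v) {
        while (!s_fifo.empty()) {
            uint64_t a = s_fifo.front();
            s_fifo.pop_front();
            mem[a] = w_cache[a];  // flush the coalesced value to memory
        }
        w_cache.clear();
        mem[addr] = v;
    }
};
```

Between releases, repeated stores to the same address cost one FIFO entry and one eventual write-through, and ordinary stores never stall; that is the source of the traffic reduction claimed above.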

QuickRelease: efficient synchronization and sharing
- Use FIFOs and write caches to support store visibility.
- Lazily invalidate read caches to maintain coherence.
[Figure: each GPU CU has a read L1 (rL1), a write L1 (wL1), and a synchronization FIFO (S-FIFO); a shared rL2/wL2 with its own S-FIFO and a wL3 sit below; CPU cores keep conventional L1s and an L2; all are backed by the LLC and directory/memory.]

QuickRelease example
The running example interleaves four operations:
- CU0: ST X (1), then ST_Rel A (2)
- CU1: LD_Acq A (2), then LD X (1)
[Figure sequence: CU0's store to X enters its write cache and S-FIFO; the StRel to A enqueues a release marker, which drains X and then A to memory while invalidating CU1's stale copy of X; once CU1's LdAcq observes A = 2, its LD of X is guaranteed to return 1.]

Recap: QuickRelease vs. RfO and WT
Design goals, compared across WT, RfO, and QuickRelease:
- High bandwidth (WT: yes; RfO: no)
- Only wait on synchronization
- Avoids L1 data responses
- Coalesce irregular writes
- Precise cache invalidations
- Support read-after-write (RAW)

Benchmarks
- Synchronizing applications:
  - APSP: converges on all-pairs shortest paths (uses LdAcq and StRel).
  - Sort: performs a 4-byte radix sort, byte by byte (uses LdAcq and StRel).
- Rodinia benchmarks:
  - nn: n-nearest neighbors.
  - backprop: trains the connection weights of a neural network.
  - hotspot: performs a transient 2D thermal simulation (5-point stencil).
  - lud: matrix decomposition.
  - kmeans: k-means clustering.
  - nw: global optimization for DNA sequence alignment.
- AMD APP SDK:
  - nbody: simulation of particle-particle interactions.
  - matrixmul: matrix multiplication.
  - reduction: sums the values of an input array.
  - dct: discrete cosine transform for image and video frame compression.

Read-after-read reuse in L1
[Chart: read-after-read reuse in the L1 across benchmarks; intermediate reuse motivates keeping data cached across synchronization.]

Performance of QuickRelease vs. WT and RfO
[Charts: normalized performance of QuickRelease relative to WT and RfO across the benchmarks.]

Conclusions
- QuickRelease (QR) gets the best of both worlds (RfO and WT):
  - High streaming bandwidth.
  - Efficient fine-grain communication and synchronization.
- QR achieves a 7% average performance improvement over WT.
  - For emerging workloads with finer-grain synchronization, a 42% performance improvement over WT.
- QuickRelease costs:
  - Separate read and write caches.
  - Synchronization FIFOs.
  - Probe broadcasts to CUs.

Questions?

Disclaimer & Attribution

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

Attribution: © 2014 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.

Scalability of QuickRelease vs. RfO
QuickRelease outperforms RfO when problem sizes exceed cache capacity.

QuickRelease example (backup)
[Figure: timeline of the four-operation example, ST X (1), ST_Rel A (2), LD_Acq A (2), LD X (1), showing CU0's S-FIFO draining X and then the release to memory before CU1's acquire and load observe X = 1.]

WT example (backup)
[Figure: timeline of the same example under WT; every store writes through to memory immediately, and CU1's LD_Acq invalidates its entire L1 before loading X.]

RfO example (backup)
[Figure: timeline of the same example under RfO; CU0 obtains exclusive ownership of X and A, and CU1's loads trigger invalidations and data responses.]

Reduction of write-throughs

Probes versus data
- More writes than reads generate CPU probes.
- Probes create a lot of traffic, but QuickRelease reduces data messages.

Why GPUs should not have write-back caches