HETEROGENEOUS SYSTEM COHERENCE FOR INTEGRATED CPU-GPU SYSTEMS

Presentation transcript:

HETEROGENEOUS SYSTEM COHERENCE FOR INTEGRATED CPU-GPU SYSTEMS
Jason Power*, Arkaprava Basu*, Junli Gu†, Sooraj Puthoor†, Bradford M. Beckmann†, Mark D. Hill*†, Steven K. Reinhardt†, David A. Wood*†
* University of Wisconsin-Madison
† Advanced Micro Devices, Inc.

| HETEROGENEOUS SYSTEM COHERENCE | DECEMBER 11, 2013 | MICRO-46

SUMMARY
• Physical and logical CPU-GPU integration
• Two key bottlenecks in heterogeneous cache coherence
  ‒ Directory bandwidth: must support more than 1 request per cycle
  ‒ Directory MSHRs: need tens of thousands
• Heterogeneous System Coherence
  ‒ Leverages coarse-grained coherence
  ‒ Moves coherence traffic onto incoherent direct-access bus
  ‒ Directory bandwidth ↓ by 94% and resources ↓ by 95%

ABSTRACT
• Hardware coherence can increase the utility of heterogeneous systems
• Major bottlenecks in current coherence implementations
  ‒ High bandwidth difficult to support at directory
  ‒ Extreme resource requirements
• We propose Heterogeneous System Coherence
  ‒ Leverages spatial locality and region coherence
  ‒ Reduces bandwidth by 94%
  ‒ Reduces resource requirements by 95%

PHYSICAL INTEGRATION
[Figure labels: CPU cores; GPU; stacked high-bandwidth DRAM. Credit: IBM]

LOGICAL INTEGRATION
• General-purpose GPU computing
  ‒ OpenCL
  ‒ CUDA
• Heterogeneous Uniform Memory Access (hUMA)
  ‒ Shared virtual address space
  ‒ Cache coherence
• Allows new heterogeneous apps

OUTLINE
• Motivation
• Background
  ‒ System overview
  ‒ Cache architecture reminder
• Heterogeneous System Bottlenecks
• Heterogeneous System Coherence Details
• Results
• Conclusions

SYSTEM OVERVIEW: SYSTEM LEVEL
[Figure label: High-bandwidth interconnect]

SYSTEM OVERVIEW: APU
[Figure labels: Direct-access bus (used for graphics); Invalidation traffic; GPU compute accesses must stay coherent; Arrow thickness → bandwidth]

SYSTEM OVERVIEW: GPU
[Figure label: Very high bandwidth: L2 has high miss rate]

SYSTEM OVERVIEW
[Figure label: Low bandwidth: low L2 miss rate]

CACHE ARCHITECTURE REMINDER: CPU/GPU L2 CACHE
• Demand requests from L1 cache
• Allocates an MSHR entry
• Searches cache tags for a tag match
• On a hit, return data to the L1
• On a miss, send request to directory
• On a directory probe, check MSHRs and tags
• Tag hit on probe: send data to other core
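
The L2 request flow on this slide can be sketched as a toy model. This is only an illustration under stated assumptions: the class name `L2Cache`, its methods, and the returned strings are hypothetical, the cache is modeled as unbounded sets of block addresses, and a request to a block that already has an outstanding MSHR simply shares that entry.

```python
from dataclasses import dataclass, field


@dataclass
class L2Cache:
    """Toy L2 cache: resident block tags plus a bounded MSHR table (hypothetical model)."""
    num_mshrs: int
    tags: set = field(default_factory=set)   # blocks currently cached
    mshrs: set = field(default_factory=set)  # blocks with an outstanding request

    def demand_request(self, block: int) -> str:
        # Every demand request first needs an MSHR entry; with none free the L1 must wait.
        if len(self.mshrs) >= self.num_mshrs and block not in self.mshrs:
            return "stall"
        self.mshrs.add(block)
        # Search the tags: a hit returns data to the L1 and frees the MSHR again.
        if block in self.tags:
            self.mshrs.discard(block)
            return "hit"
        # A miss keeps the MSHR and forwards the request to the directory.
        return "miss"

    def fill(self, block: int) -> None:
        # Data arrived for an outstanding miss: install the block and free the MSHR.
        self.mshrs.discard(block)
        self.tags.add(block)

    def probe(self, block: int) -> str:
        # A directory probe checks both the tags and the MSHRs.
        if block in self.tags:
            return "data"      # tag hit on probe: send data to the other core
        if block in self.mshrs:
            return "pending"   # request in flight, no data to supply yet
        return "absent"
```

With a single MSHR, a second miss to a different block stalls until the first one is filled, which is the back-pressure mechanism the later slides refer to.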

DIRECTORY ARCHITECTURE REMINDER: DIRECTORY
• Demand requests from L2 cache
• Allocates an MSHR entry
• Searches cache tags for a tag match
• Allocate and send probes to L2 caches
• On a miss, the data comes from DRAM
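
A comparable sketch for the directory side follows. Again this is a simplified illustration, not the protocol from the talk: the `Directory` class, the sharer map, and the `"probe"`/`"dram"`/`"stall"` results are hypothetical names, and coherence states, writebacks, and evictions are left out.

```python
class Directory:
    """Toy block-based directory with a bounded MSHR table (hypothetical model)."""

    def __init__(self, num_mshrs: int):
        self.num_mshrs = num_mshrs
        self.mshrs = set()   # blocks with a request in progress
        self.sharers = {}    # block -> set of L2 cache ids holding the block

    def request(self, block: int, requester: str):
        # An MSHR is needed for the whole duration of the request.
        if len(self.mshrs) >= self.num_mshrs:
            return ("stall", [])
        self.mshrs.add(block)
        # Tag search: find which other L2 caches hold the block.
        others = sorted(self.sharers.get(block, set()) - {requester})
        self.sharers.setdefault(block, set()).add(requester)
        # Probe the other L2 caches if any hold the block; otherwise DRAM supplies it.
        return ("probe", others) if others else ("dram", [])

    def complete(self, block: int) -> None:
        # The request finished, so its MSHR can be reused.
        self.mshrs.discard(block)
```

Because each entry stays allocated until `complete` is called, a long memory latency combined with a high request rate fills the table quickly.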

BACKGROUND SUMMARY
• System under investigation
  ‒ Heterogeneous CPU-GPU on chip
  ‒ High-bandwidth DRAM
• Directory pipeline complex
  ‒ MSHR array is associative
  ‒ Difficult to pipeline with more than 1 request per cycle
  ‒ Important resources: MSHR entries

OUTLINE
• Motivation
• Background
• Heterogeneous System Bottlenecks
  ‒ Simulation overview
  ‒ Directory bandwidth
  ‒ MSHRs
  ‒ Performance is significantly affected
• Heterogeneous System Coherence Details
• Results
• Conclusions

SIMULATION DETAILS
• gem5 simulator
  ‒ Simple CPU
  ‒ GPU simulator based on AMD GCN
  ‒ All memory requests through gem5

  CPU Clock            2 GHz
  CPU Cores            2
  CPU Shared L2        2 MB (16-way banked)
  GPU Clock            1 GHz
  Compute Units        32
  GPU Shared L2        4 MB (64-way banked)
  L3 (Memory-side)     16 MB (16-way banked)
  DRAM                 DDR3, 16 channels
  Peak Bandwidth       700 GB/s
  Baseline Directory   256k entries (8-way banked)

• Workloads
  ‒ Modified to use hUMA
  ‒ Rodinia & AMD APP SDK

GPGPU BENCHMARKS
• Rodinia benchmarks
  ‒ bp: trains the connection weights on a neural network
  ‒ bfs: breadth-first search
  ‒ hs: performs a transient 2D thermal simulation (5-point stencil)
  ‒ lud: matrix decomposition
  ‒ nw: performs a global optimization for DNA sequence alignment
  ‒ km: does k-means clustering
  ‒ sd: speckle-reducing anisotropic diffusion
• AMD SDK
  ‒ bn: bitonic sort
  ‒ dct: discrete cosine transform
  ‒ hg: histogram
  ‒ mm: matrix multiplication

SYSTEM BOTTLENECKS
• Difficult to scale directory bandwidth
  ‒ Difficult to multi-port
  ‒ Complicated pipeline
• High resource usage
  ‒ Must allocate MSHR for entire duration of request
  ‒ MSHR array difficult to scale
[Figure labels: High bandwidth; Designed to support CPU bandwidth]

DIRECTORY TRAFFIC
[Figure annotation: Difficult to support >1 request per cycle]

RESOURCE USAGE
[Figure annotations: Causes significant back-pressure on L2s; Steady state at 700 GB/s; Very difficult to scale MSHR array]

PERFORMANCE OF BASELINE COMPARED TO UNCONSTRAINED RESOURCES
[Figure annotation: Back-pressure from limited MSHRs and bandwidth]

BOTTLENECKS SUMMARY
• Directory bandwidth
  ‒ Must support up to 4 requests per cycle
  ‒ Difficult to construct pipeline
• Resource usage
  ‒ MSHRs are a constraining resource
  ‒ Need more than 10,000
  ‒ Without resource constraints, up to 4x better performance
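
One way to see why the MSHR count grows so large is Little's law: entries in flight equal the request rate times the time each entry is held. The sketch below applies it to the 700 GB/s peak bandwidth from the simulation table and the 64-byte block size given on the hardware-complexity slide. The 1,000 ns holding time is purely an illustrative assumption (the slides do not state a latency), and the function name `mshrs_needed` is hypothetical.

```python
def mshrs_needed(bandwidth_bytes_per_ns: int, latency_ns: int, block_bytes: int = 64) -> int:
    """Little's law estimate of MSHR entries needed to sustain a given bandwidth."""
    # Bytes in flight = bandwidth x time each request stays outstanding.
    bytes_in_flight = bandwidth_bytes_per_ns * latency_ns
    # One MSHR per outstanding block; round up with integer ceiling division.
    return -(-bytes_in_flight // block_bytes)


PEAK_BANDWIDTH_BYTES_PER_NS = 700  # 700 GB/s, from the simulation details
ASSUMED_LATENCY_NS = 1000          # ASSUMPTION: illustrative only, not from the slides

print(mshrs_needed(PEAK_BANDWIDTH_BYTES_PER_NS, ASSUMED_LATENCY_NS))  # 10938
```

Under that assumed latency the estimate lands in the same range as the "more than 10,000" figure; a different latency scales the result linearly.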

OUTLINE
• Motivation
• Background
• Heterogeneous System Bottlenecks
• Heterogeneous System Coherence Details
  ‒ Overall system design
  ‒ Region buffer design
  ‒ Region directory design
  ‒ Example
  ‒ Hardware complexity
• Results
• Conclusions

BASELINE DIRECTORY COHERENCE
[Figure labels: Kernel launch; Initialization; Read result]

HETEROGENEOUS SYSTEM COHERENCE (HSC)
[Figure labels: Kernel launch; Initialization]

HETEROGENEOUS SYSTEM COHERENCE (HSC)
[Figure labels: Region buffers coordinate with region directory; Direct-access bus]

HSC: EXAMPLE MEMORY REQUEST
[Figure labels: GPU region buffer; GPU L2 cache; Region directory]

HSC: L2 CACHE & REGION BUFFER
[Figure labels: Region tags and permissions; Interface for direct-access bus; Only region-level permission traffic]

HSC: REGION DIRECTORY
[Figure label: Region tags, sharers, and permissions]

HSC: HARDWARE COMPLEXITY
• Region protocols reduce directory size
  ‒ Region directory: 8x fewer entries
• Region buffers
  ‒ At each L2 cache
  ‒ 1-KB region (16 blocks of 64 B each)
  ‒ 16-K region entries
  ‒ Overprovisioned for low-locality workloads
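
The sizes on this slide imply a simple address split, sketched below. The constant and function names are hypothetical, and the derived figures (32 K region-directory entries, 16 MiB of region-buffer reach) are inferences from the 256k-entry baseline directory, the 8x reduction, and the 16-K entries of 1 KB each, not numbers quoted in the talk.

```python
BLOCK_BYTES = 64       # cache block size from the slide
REGION_BYTES = 1024    # 1-KB region from the slide
BLOCKS_PER_REGION = REGION_BYTES // BLOCK_BYTES  # 16

BLOCK_SHIFT = BLOCK_BYTES.bit_length() - 1    # 6 offset bits within a block
REGION_SHIFT = REGION_BYTES.bit_length() - 1  # 10 offset bits within a region


def split_address(addr: int):
    """Return (region tag, block index within the region, byte offset within the block)."""
    region_tag = addr >> REGION_SHIFT
    block_in_region = (addr >> BLOCK_SHIFT) & (BLOCKS_PER_REGION - 1)
    byte_in_block = addr & (BLOCK_BYTES - 1)
    return region_tag, block_in_region, byte_in_block


# Derived sizes (inferences, see the lead-in).
BASELINE_DIRECTORY_ENTRIES = 256 * 1024
REGION_DIRECTORY_ENTRIES = BASELINE_DIRECTORY_ENTRIES // 8  # 8x fewer entries
REGION_BUFFER_ENTRIES = 16 * 1024
REGION_BUFFER_REACH_BYTES = REGION_BUFFER_ENTRIES * REGION_BYTES

print(split_address(0x12345))               # (72, 13, 5)
print(REGION_DIRECTORY_ENTRIES)             # 32768
print(REGION_BUFFER_REACH_BYTES // 2**20)   # 16 (MiB)
```

All 16 blocks of a region share one region tag, so a single region-buffer entry covers them.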

HSC SUMMARY
• Key insight
  ‒ GPU-CPU applications exhibit high spatial locality
  ‒ Use direct-access bus present in systems
  ‒ Offload bandwidth onto direct-access bus
• Use coherence network only for permission
• Add region buffer to track region information
  ‒ At each L2 cache
  ‒ Bypass coherence network and directory
• Replace directory with region directory
  ‒ Significantly reduces total size needed
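
The routing decision summarized above can be sketched as follows. This is a deliberately reduced model: the `RegionBuffer` class and its counters are hypothetical, read and write permissions are not distinguished, and region invalidations and evictions are ignored. It only shows that permission traffic reaches the region directory once per region while data moves over the direct-access bus.

```python
REGION_SHIFT = 10  # 1-KB regions


class RegionBuffer:
    """Toy region buffer next to an L2 cache (hypothetical model)."""

    def __init__(self):
        self.permitted = set()       # region tags this cache holds permission for
        self.directory_requests = 0  # permission requests sent on the coherence network
        self.direct_accesses = 0     # data accesses sent on the direct-access bus

    def on_l2_miss(self, addr: int) -> str:
        region = addr >> REGION_SHIFT
        if region in self.permitted:
            # Permission already held: bypass the coherence network and directory.
            path = "direct"
        else:
            # First touch of the region: ask the region directory for permission.
            self.directory_requests += 1
            self.permitted.add(region)
            path = "permission+direct"
        # In both cases the data itself travels over the direct-access bus.
        self.direct_accesses += 1
        return path


buf = RegionBuffer()
paths = [buf.on_l2_miss(block * 64) for block in range(16)]  # 16 blocks of one region
print(paths[0], paths[1], buf.directory_requests, buf.direct_accesses)
# permission+direct direct 1 16
```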

OUTLINE
• Motivation
• Background
• Heterogeneous System Bottlenecks
• Heterogeneous System Coherence Details
• Results
  ‒ Speed-up
  ‒ Latency of loads
  ‒ Bandwidth
  ‒ MSHR usage
• Conclusions

THREE CACHE-COHERENCE PROTOCOLS
• Broadcast: Null-directory that broadcasts on all requests
• Baseline: Block-based, mostly inclusive directory
• HSC: Region-based directory with 1-KB region size

HSC PERFORMANCE
[Figure annotation: Largest slowdowns from constrained resources]

DIRECTORY TRAFFIC REDUCTION
[Figure annotations: Average bandwidth significantly reduced; Theoretical reduction from 16-block regions]
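
The theoretical bound mentioned here follows from the region size: if every block of a 16-block region is touched, one region request replaces 16 block requests, a reduction of 1 - 1/16 = 93.75%. The sketch below compares two synthetic access streams under the simplifying assumption that each distinct block (baseline) or region (region-based) costs exactly one directory request. The function name and both streams are hypothetical.

```python
BLOCK_SHIFT = 6    # 64-B blocks
REGION_SHIFT = 10  # 1-KB regions


def directory_request_reduction(addresses) -> float:
    """Fraction of directory requests saved by tracking regions instead of blocks."""
    blocks = {a >> BLOCK_SHIFT for a in addresses}    # one request per block (baseline)
    regions = {a >> REGION_SHIFT for a in addresses}  # one request per region
    return 1 - len(regions) / len(blocks)


streaming = range(0, 4096 * 64, 64)    # touches every block of each region
strided = range(0, 4096 * 1024, 1024)  # touches one block per region

print(directory_request_reduction(streaming))  # 0.9375
print(directory_request_reduction(strided))    # 0.0
```

A fully streaming pattern reaches the 16-block bound, while a stride of one region gains nothing, which is why the benefit depends on spatial locality.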

HSC RESOURCE USAGE
[Figure annotation: Maximum MSHRs significantly reduced]

RESULTS SUMMARY
• Used a detailed timing simulator for CPU and GPU
• HSC significantly improves performance
  ‒ Reduces the average load latency
  ‒ Decreases bandwidth requirement of directory
• HSC reduces the required MSHRs at the directory

RELATED WORK
• Coarse-grained coherence
  ‒ Region coherence
  ‒ Applied to snooping systems [Cantin, ISCA 2005] [Moshovos, ISCA 2005] [Zebchuk, MICRO 2007]
  ‒ Extended to directories [Fang, PACT 2013] [Zebchuk, MICRO 2013]
  ‒ Spatiotemporal coherence [Alisafaee, MICRO 2012]
  ‒ Dual-grain directory coherence [Basu, UW-TR 2013]
  ‒ Primarily focused on directory size
• GPU coherence [Singh et al., HPCA 2013]
  ‒ Intra-GPU coherence

CONCLUSIONS
• Hardware coherence can increase the utility of heterogeneous systems
• Major bottlenecks in current coherence implementations
  ‒ High bandwidth difficult to support at directory
  ‒ Extreme resource requirements
• We propose Heterogeneous System Coherence
  ‒ Leverages spatial locality and region coherence
  ‒ Reduces bandwidth by 94%
  ‒ Reduces resource requirements by 95%

Questions? Contact:

DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.

Backup Slides

LOAD LATENCY
[Figure annotation: Average load time significantly reduced]

EXECUTION TIME BREAKDOWN