Toward Standardized Near-Data Processing with Unrestricted Data Placement for GPUs
Gwangsun Kim (Arm, Inc.), Niladrish Chatterjee (NVIDIA), Mike O’Connor (NVIDIA and UT-Austin), Kevin Hsieh (CMU)

Near-Data Processing (NDP)
- Offload computation to memory: better bandwidth and energy
- 3D-stacked memory poses a new opportunity for NDP
- Limitations of prior work:
  - Application-specific (e.g., graph processing, 3D rendering, etc.)
  - Architecture-specific (i.e., MMU or TLB in the memory stack)
  - Restricted data placement (e.g., no paged virtual memory for NDP)
(Figure: the DIVA processing-in-memory chip [ICS’02] built logic and DRAM on a single die in the same process; the Hybrid Memory Cube stacks a logic-process layer, optimized for high frequency, beneath DRAM-process layers, optimized for high capacitance.)

Hybrid Memory Cube (HMC)
- Interface BW > DRAM BW
  - Max. DRAM BW: 320 GB/s
  - Max. total link BW: 120 GB/s/link × 4 links = 480 GB/s
- The logic layer has routing capability → HMCs can form a memory network (e.g., a ring topology)
(Figure: DRAM layers connected through TSVs to vault controllers on the logic layer; an intra-HMC network and the I/O links provide an abstracted, packetized interface.)
*Bandwidth numbers are from the HMC specification 2.1

Processor Bandwidth Limitation
- Processor off-chip bandwidth can become a bottleneck with multiple HMCs
- Example: a processor with 8 HMCs
  - Processor off-chip bandwidth = 120 GB/s/link × 8 links ≅ 1 TB/s → the bottleneck
  - Total max. DRAM bandwidth = 320 GB/s/HMC × 8 HMCs ≅ 2.5 TB/s
- NDP through the memory network can address the bottleneck: move the compute to the data instead of pulling all data through the processor links (see the arithmetic sketch below)
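To put the two numbers side by side, the following stand-alone snippet simply redoes the slide's arithmetic for the 8-HMC example; the per-link and per-HMC bandwidths are the ones quoted above (from the HMC 2.1 specification), and the code is purely illustrative.

    #include <cstdio>

    int main() {
        const double link_bw_gbs = 120.0;  // per processor-to-HMC link (HMC spec 2.1)
        const double dram_bw_gbs = 320.0;  // per HMC, internal DRAM bandwidth
        const int num_hmcs = 8;

        double processor_bw  = link_bw_gbs * num_hmcs;  // ~0.96 TB/s at the GPU pins
        double total_dram_bw = dram_bw_gbs * num_hmcs;  // ~2.56 TB/s inside the cubes

        printf("Processor off-chip BW: %.2f TB/s\n", processor_bw / 1000.0);
        printf("Aggregate DRAM BW:     %.2f TB/s\n", total_dram_bw / 1000.0);
        printf("DRAM BW unreachable through the processor: %.0f%%\n",
               100.0 * (1.0 - processor_bw / total_dram_bw));
        return 0;
    }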

Outline
- Background / Motivation
- Proposed NDP Architecture
- Partitioned Execution
- Naïve NDP Implementation Result
- Dynamic Offload Decision
- Partial Offloading & Cache Locality-aware Offloading
- Results
- Conclusion

Proposed NDP Architecture
- Focus on GPUs, as they are bandwidth-intensive
- Goal of this work: overcome the limitations of prior work
  - General-purpose NDP
  - Standardizable: no MMU or TLB in the memory stack
  - No restriction on data placement during NDP
(Figure: each HMC’s logic layer adds an NSU (near-data processing SIMD unit) next to the vault controllers, intra-HMC network, and I/O; the GPU adds NDP buffers alongside the register file and instruction cache, and the two sides communicate over the memory network.)

High-level View of Our Approach
- Offload block: the unit of NDP offloading
  - A memory-intensive chunk of instructions in a given workload
  - Automatically identified at compile time [ISCA’16]
- For each warp (group of threads), the GPU core executes the compute-intensive parts of the instruction stream; when it reaches a memory-intensive offload block, it sends the input context to the NSU (in the HMC), executes other warps in the meantime, and receives the output context back before continuing with the next compute-intensive part
- Overhead: registers transferred; Benefit: memory accesses offloaded next to DRAM

Partitioned Execution (Load)
- Addresses are translated on the GPU, so only physical addresses cross the memory network
- Load by GPU: a normal load; the DRAM (vault controller) returns the data to the GPU core’s register file
- Load by NSU: the GPU issues a Read-and-Forward request carrying the physical address; the DRAM (vault controller) reads the data, and the response delivers it to the NSU’s NDP buffer instead of the GPU

Partitioned Execution (Store)
- Addresses are again translated on the GPU
- Store by GPU: a normal write request (physical address + data) to the DRAM (vault controller), which returns an acknowledgment
- Store by NSU: the GPU sends a WriteAddress packet (physical address only) to the NSU; the NSU pairs it with the data in its NDP buffer, issues the write request (physical address + data) to DRAM, and the write is acknowledged (a code sketch of both paths follows below)
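To make the split concrete, here is a minimal GPU-side sketch of both paths (assumed behavior, not the authors' hardware): the GPU performs the virtual-to-physical translation itself, so no MMU or TLB is needed in the memory stack, and only physical addresses plus NDP-buffer slots travel toward the NSU. The helper names (gpu_translate, send_packet) and the packet layout are hypothetical placeholders.

    #include <cstdint>

    enum PktType {
        READ_AND_FORWARD,  // vault reads PAddr and forwards the data to the NSU's NDP buffer
        WRITE_ADDRESS      // NSU pairs PAddr with the data in its NDP buffer and writes DRAM
    };

    struct NdpPacket {
        PktType  type;
        uint64_t paddr;    // physical address, produced by the GPU's own MMU/TLB
        int      nsu_id;   // NSU (HMC) executing this offload block instance
        int      buf_slot; // NDP-buffer entry at that NSU
    };

    uint64_t gpu_translate(uint64_t vaddr) { return vaddr; }  // identity stub, illustration only
    void send_packet(const NdpPacket&) {}                     // stub, illustration only

    void offloaded_load(uint64_t vaddr, int nsu_id, int slot) {
        // Translation stays on the GPU; the loaded data lands in the NSU's NDP buffer.
        send_packet({READ_AND_FORWARD, gpu_translate(vaddr), nsu_id, slot});
    }

    void offloaded_store(uint64_t vaddr, int nsu_id, int slot) {
        // The GPU supplies only the physical address; the NSU supplies the data.
        send_packet({WRITE_ADDRESS, gpu_translate(vaddr), nsu_id, slot});
    }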

Vector Addition Example
- Timeline (GPU / DRAM vault / NSU): the GPU issues the offload command, turns each LD into a Read-and-Forward request and the ST into a Write-address packet, then the warp is blocked and another warp is scheduled; the forwarded data arrives at the NSU, which computes, issues the write request, receives the write ack, and returns the offload ack to the GPU
- GPU code:
    0xA00: …
    0xA08: OFLD.BEG 0xD08, [], 2, 1
    0xA10: ld %f1, [%rl9]
    0xA18: ld %f2, [%rl8]
    0xA20: add@NSU %f3, %f2, %f1
    0xA28: add %rl10, %rl1, %rl7
    0xA30: st [%rl10], %f3
    0xA38: OFLD.END []
    0xA40: …
- NSU code:
    0xD00: …
    0xD08: OFLD.BEG []
    0xD10: ld %f1
    0xD18: ld %f2
    0xD20: add@NSU %f3, %f2, %f1
    0xD28: st %f3
    0xD30: OFLD.END []
    0xD38: …
(A plain CUDA version of this kernel is sketched below.)
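For reference, the source-level kernel behind this example is ordinary CUDA vector addition; the version below is our illustration (not the authors' benchmark code), with comments marking which statements fall inside the offload block.

    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float x = a[i];  // ld %f1 -> issued as a Read-and-Forward request
            float y = b[i];  // ld %f2 -> issued as a Read-and-Forward request
            c[i] = x + y;    // add@NSU + st -> computed and stored at the NSU, near DRAM
        }
    }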

Considering GPU’s Cache
- Load instruction: the GPU cache may hold the newest copy of a line, so a Read-and-Forward request first checks the GPU cache; on a hit the cached data is forwarded to the NSU, and on a miss the request continues to the DRAM (vault), which forwards the data to the NSU
- Store instruction: the GPU cache could otherwise hold stale data, so an offloaded store invalidates the GPU’s cached copy when the write address is sent; the NSU then issues the write request, and the newest data resides in DRAM
- The added invalidation messages are only 0.4% of all traffic on average
(A short sketch of this handling follows below.)
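A minimal sketch of this cache handling, assuming the GPU's load/store path exposes probe and invalidate hooks (all helper names are hypothetical; the stubs only make the sketch compile):

    #include <cstdint>

    bool probe_gpu_cache(uint64_t, float*) { return false; }  // stub
    void forward_to_nsu(int, int, float) {}                   // stub
    void send_rdf_to_vault(uint64_t, int, int) {}             // stub
    void invalidate_gpu_cache(uint64_t) {}                    // stub
    void send_write_address(uint64_t, int, int) {}            // stub

    // Offloaded load: the Read-and-Forward request checks the GPU cache first,
    // because the cache may hold the newest copy of the line.
    void coherent_offloaded_load(uint64_t paddr, int nsu_id, int slot) {
        float value;
        if (probe_gpu_cache(paddr, &value))
            forward_to_nsu(nsu_id, slot, value);     // hit: forward cached data to the NSU
        else
            send_rdf_to_vault(paddr, nsu_id, slot);  // miss: the vault reads DRAM and forwards
    }

    // Offloaded store: invalidate any GPU-cached copy so it cannot go stale
    // once the NSU writes the new value to DRAM.
    void coherent_offloaded_store(uint64_t paddr, int nsu_id, int slot) {
        invalidate_gpu_cache(paddr);
        send_write_address(paddr, nsu_id, slot);
    }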

Evaluation Methodology
- Performance model: modified GPGPU-sim v3.2.0
  - 1 GPU with 8 HMCs (one NSU per HMC)
  - Memory network topology: 3D hypercube
- Power: GPUWattch + Rambus DRAM model + interconnect model
- Focus on memory-intensive workloads
- Configurations (compute cores = Streaming Multiprocessors, or SMs):
  - Baseline: 64 SMs @ 700 MHz
  - Baseline_MoreSMs: 72 SMs @ 700 MHz
  - NDP: 64 SMs @ 700 MHz + 8 NSUs @ 350 MHz
- NDP-related buffers:
  - GPU SM: 8 B × 364 entries for the NDP packet buffer → 2.8 KB (1.8% storage overhead in the GPU)
  - NSU: 128 B × 512 entries for the NDP buffer → 64 KB (but no scratchpad or data cache)

Naïve NDP Performance
(Chart: performance of Baseline, Baseline_MoreSMs, and NaiveNDP across the workloads; the naïve NDP implementation shows a 55% performance degradation, with individual bars reaching 6.86 and 10.1.)

Partial Offloading
- Only offload a given proportion of the offload block instances
  - e.g., offload ratio = 0.5: across warps, half of the identified offload block instances execute on the NSU (offloaded) and the other half execute on the GPU (not offloaded)
- No single static ratio is optimal for all workloads
- Too many factors would need to be considered to analytically determine the best offload ratio:
  - Compute/memory intensity
  - Current off-chip BW utilization
  - Cache locality
- → Use a simple heuristic: the hill-climbing method

Dynamic Offload Decision
- Hill-climbing method (a minimal code sketch follows below)
  - Based on the IPC measured during an epoch, determine the offload ratio for the next epoch
- Adaptive step size
  - Large step: good when far from the peak
  - Small step: good when near the peak
  - When there is oscillation, reduce the step size
- Cache locality-awareness
  - Suppress offloading of blocks that have good cache locality
  - Do not offload if (offload benefit) < (register transfer overhead)
  - The offload benefit is calculated from cache performance counters (more details in the paper)
(Figure: IPC as a function of offload ratio, with the step size annotated.)
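Below is a minimal host-side sketch of such a hill-climbing controller with an adaptive step; the constants, the fake IPC trace, and the epoch_done interface are illustrative assumptions, not values from the paper.

    #include <algorithm>
    #include <cstdio>

    struct OffloadController {
        double ratio     = 0.5;   // fraction of offload block instances sent to NSUs
        double step      = 0.25;  // adaptive step size
        int    direction = +1;    // current search direction
        double prev_ipc  = 0.0;   // IPC measured during the previous epoch

        // Called once per epoch with the IPC measured during that epoch.
        void epoch_done(double epoch_ipc) {
            if (epoch_ipc < prev_ipc) {  // last move hurt performance: we passed the peak,
                direction = -direction;  // so reverse direction...
                step *= 0.5;             // ...and shrink the step (dampen oscillation)
            }
            prev_ipc = epoch_ipc;
            ratio = std::min(1.0, std::max(0.0, ratio + direction * step));
        }
    };

    int main() {
        OffloadController ctrl;
        // Fake IPC trace, only to show the controller settling near a peak;
        // a real run would feed per-epoch performance-counter readings.
        const double ipc_trace[] = {1.00, 1.20, 1.30, 1.25, 1.28, 1.29};
        for (double ipc : ipc_trace) {
            ctrl.epoch_done(ipc);
            printf("offload ratio -> %.3f (step %.3f)\n", ctrl.ratio, ctrl.step);
        }
        return 0;
    }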

Dynamic Offload Decision Result
(Charts: performance and energy for Baseline, Baseline_MoreSMs, NDP(HC), and NDP(HC+CA); the annotated gains are 15% and 18% in performance, up to 67%, and 8% in energy, up to 38%.)

Conclusion
- Prior work on NDP was limited in several ways: application-specific, architecture-specific, or restrictive on data placement
- Our proposed partitioned execution mechanism overcomes these limitations and enables general-purpose, standardized NDP
  - Key insight: architecture-specific address translation can be decoupled from the execution of LD/ST instructions through NDP buffers
- Dynamic offload decision improves the performance of the proposed NDP architecture
- Performance and energy are improved by up to 67% and 38%, respectively
