Toward Standardized Near-Data Processing with Unrestricted Data Placement for GPUs
Gwangsun Kim (Arm, Inc.), Niladrish Chatterjee (NVIDIA), Mike O’Connor (NVIDIA and UT-Austin), Kevin Hsieh (CMU)

Near-Data Processing (NDP)
- Offload computation to memory: better bandwidth and energy
- 3D-stacked memory poses a new opportunity for NDP
- Limitations of prior work:
  - Application-specific (e.g., graph processing, 3D rendering, etc.)
  - Architecture-specific (i.e., MMU or TLB in the memory stack)
  - Restricted data placement (e.g., no paged virtual memory for NDP)
(Figure: the DIVA processing-in-memory chip [ICS’02] built logic and DRAM on a single die in the same process; the Hybrid Memory Cube stacks a logic-process layer, optimized for high frequency, beneath DRAM-process layers, optimized for high capacitance.)

Hybrid Memory Cube (HMC)
- Interface BW > DRAM BW
  - Max. DRAM BW: 320 GB/s
  - Max. total link BW: 120 GB/s/link × 4 links = 480 GB/s
- The logic layer has routing capability → HMCs can form a memory network (e.g., a ring topology)
(Figure: DRAM layers connected through TSVs to vault controllers on the logic layer; an intra-HMC network and the I/O links provide an abstracted, packetized interface.)
*Bandwidth numbers are from the HMC specification 2.1

Processor Bandwidth Limitation
- Processor off-chip bandwidth can become a bottleneck with multiple HMCs
- Example: a processor with 8 HMCs
  - Processor off-chip bandwidth = 120 GB/s/link × 8 links ≅ 1 TB/s → the bottleneck
  - Total max. DRAM bandwidth = 320 GB/s/HMC × 8 HMCs ≅ 2.5 TB/s
- NDP through the memory network can address the bottleneck: move the compute to the data instead of pulling all data through the processor links (see the arithmetic sketch below)
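To put the two numbers side by side, the following stand-alone snippet simply redoes the slide's arithmetic for the 8-HMC example; the per-link and per-HMC bandwidths are the ones quoted above (from the HMC 2.1 specification), and the code is purely illustrative.

    #include <cstdio>

    int main() {
        const double link_bw_gbs = 120.0;  // per processor-to-HMC link (HMC spec 2.1)
        const double dram_bw_gbs = 320.0;  // per HMC, internal DRAM bandwidth
        const int num_hmcs = 8;

        double processor_bw  = link_bw_gbs * num_hmcs;  // ~0.96 TB/s at the GPU pins
        double total_dram_bw = dram_bw_gbs * num_hmcs;  // ~2.56 TB/s inside the cubes

        printf("Processor off-chip BW: %.2f TB/s\n", processor_bw / 1000.0);
        printf("Aggregate DRAM BW:     %.2f TB/s\n", total_dram_bw / 1000.0);
        printf("DRAM BW unreachable through the processor: %.0f%%\n",
               100.0 * (1.0 - processor_bw / total_dram_bw));
        return 0;
    }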

Outline
- Background / Motivation
- Proposed NDP Architecture
- Partitioned Execution
- Naïve NDP Implementation Result
- Dynamic Offload Decision
- Partial Offloading & Cache Locality-aware Offloading
- Results
- Conclusion

Proposed NDP Architecture
- Focus on GPUs, as they are bandwidth-intensive
- Goal of this work: overcome the limitations of prior work
  - General-purpose NDP
  - Standardizable: no MMU or TLB in the memory stack
  - No restriction on data placement during NDP
(Figure: each HMC’s logic layer adds an NSU (near-data processing SIMD unit) next to the vault controllers, intra-HMC network, and I/O; the GPU adds NDP buffers alongside the register file and instruction cache, and the two sides communicate over the memory network.)

High-level View of Our Approach
- Offload block: the unit of NDP offloading
  - A memory-intensive chunk of instructions in a given workload
  - Automatically identified at compile time [ISCA’16]
- For each warp (group of threads), the GPU core executes the compute-intensive parts of the instruction stream; when it reaches a memory-intensive offload block, it sends the input context to the NSU (in the HMC), executes other warps in the meantime, and receives the output context back before continuing with the next compute-intensive part
- Overhead: registers transferred; Benefit: memory accesses offloaded next to DRAM

Partitioned Execution (Load)
- Addresses are translated on the GPU, so only physical addresses cross the memory network
- Load by GPU: a normal load; the DRAM (vault controller) returns the data to the GPU core’s register file
- Load by NSU: the GPU issues a Read-and-Forward request carrying the physical address; the DRAM (vault controller) reads the data, and the response delivers it to the NSU’s NDP buffer instead of the GPU

Partitioned Execution (Store)
- Addresses are again translated on the GPU
- Store by GPU: a normal write request (physical address + data) to the DRAM (vault controller), which returns an acknowledgment
- Store by NSU: the GPU sends a WriteAddress packet (physical address only) to the NSU; the NSU pairs it with the data in its NDP buffer, issues the write request (physical address + data) to DRAM, and the write is acknowledged (a code sketch of both paths follows below)
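To make the split concrete, here is a minimal GPU-side sketch of both paths (assumed behavior, not the authors' hardware): the GPU performs the virtual-to-physical translation itself, so no MMU or TLB is needed in the memory stack, and only physical addresses plus NDP-buffer slots travel toward the NSU. The helper names (gpu_translate, send_packet) and the packet layout are hypothetical placeholders.

    #include <cstdint>

    enum PktType {
        READ_AND_FORWARD,  // vault reads PAddr and forwards the data to the NSU's NDP buffer
        WRITE_ADDRESS      // NSU pairs PAddr with the data in its NDP buffer and writes DRAM
    };

    struct NdpPacket {
        PktType  type;
        uint64_t paddr;    // physical address, produced by the GPU's own MMU/TLB
        int      nsu_id;   // NSU (HMC) executing this offload block instance
        int      buf_slot; // NDP-buffer entry at that NSU
    };

    uint64_t gpu_translate(uint64_t vaddr) { return vaddr; }  // identity stub, illustration only
    void send_packet(const NdpPacket&) {}                     // stub, illustration only

    void offloaded_load(uint64_t vaddr, int nsu_id, int slot) {
        // Translation stays on the GPU; the loaded data lands in the NSU's NDP buffer.
        send_packet({READ_AND_FORWARD, gpu_translate(vaddr), nsu_id, slot});
    }

    void offloaded_store(uint64_t vaddr, int nsu_id, int slot) {
        // The GPU supplies only the physical address; the NSU supplies the data.
        send_packet({WRITE_ADDRESS, gpu_translate(vaddr), nsu_id, slot});
    }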

Vector Addition Example
- Timeline (GPU / DRAM vault / NSU): the GPU issues the offload command, turns each LD into a Read-and-Forward request and the ST into a Write-address packet, then the warp is blocked and another warp is scheduled; the forwarded data arrives at the NSU, which computes, issues the write request, receives the write ack, and returns the offload ack to the GPU
- GPU code:
    0xA00: …
    0xA08: OFLD.BEG 0xD08, [], 2, 1
    0xA10: ld %f1, [%rl9]
    0xA18: ld %f2, [%rl8]
    0xA20: add@NSU %f3, %f2, %f1
    0xA28: add %rl10, %rl1, %rl7
    0xA30: st [%rl10], %f3
    0xA38: OFLD.END []
    0xA40: …
- NSU code:
    0xD00: …
    0xD08: OFLD.BEG []
    0xD10: ld %f1
    0xD18: ld %f2
    0xD20: add@NSU %f3, %f2, %f1
    0xD28: st %f3
    0xD30: OFLD.END []
    0xD38: …
(A plain CUDA version of this kernel is sketched below.)
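For reference, the source-level kernel behind this example is ordinary CUDA vector addition; the version below is our illustration (not the authors' benchmark code), with comments marking which statements fall inside the offload block.

    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float x = a[i];  // ld %f1 -> issued as a Read-and-Forward request
            float y = b[i];  // ld %f2 -> issued as a Read-and-Forward request
            c[i] = x + y;    // add@NSU + st -> computed and stored at the NSU, near DRAM
        }
    }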

Considering GPU’s Cache
- Load instruction: the GPU cache may hold the newest copy of a line, so a Read-and-Forward request first checks the GPU cache; on a hit the cached data is forwarded to the NSU, and on a miss the request continues to the DRAM (vault), which forwards the data to the NSU
- Store instruction: the GPU cache could otherwise hold stale data, so an offloaded store invalidates the GPU’s cached copy when the write address is sent; the NSU then issues the write request, and the newest data resides in DRAM
- The added invalidation messages are only 0.4% of all traffic on average
(A short sketch of this handling follows below.)
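A minimal sketch of this cache handling, assuming the GPU's load/store path exposes probe and invalidate hooks (all helper names are hypothetical; the stubs only make the sketch compile):

    #include <cstdint>

    bool probe_gpu_cache(uint64_t, float*) { return false; }  // stub
    void forward_to_nsu(int, int, float) {}                   // stub
    void send_rdf_to_vault(uint64_t, int, int) {}             // stub
    void invalidate_gpu_cache(uint64_t) {}                    // stub
    void send_write_address(uint64_t, int, int) {}            // stub

    // Offloaded load: the Read-and-Forward request checks the GPU cache first,
    // because the cache may hold the newest copy of the line.
    void coherent_offloaded_load(uint64_t paddr, int nsu_id, int slot) {
        float value;
        if (probe_gpu_cache(paddr, &value))
            forward_to_nsu(nsu_id, slot, value);     // hit: forward cached data to the NSU
        else
            send_rdf_to_vault(paddr, nsu_id, slot);  // miss: the vault reads DRAM and forwards
    }

    // Offloaded store: invalidate any GPU-cached copy so it cannot go stale
    // once the NSU writes the new value to DRAM.
    void coherent_offloaded_store(uint64_t paddr, int nsu_id, int slot) {
        invalidate_gpu_cache(paddr);
        send_write_address(paddr, nsu_id, slot);
    }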

Evaluation Methodology
- Performance model: modified GPGPU-sim v3.2.0
  - 1 GPU with 8 HMCs (one NSU per HMC)
  - Memory network topology: 3D hypercube
- Power: GPUWattch + Rambus DRAM model + interconnect model
- Focus on memory-intensive workloads
- Configurations (compute cores = Streaming Multiprocessors, or SMs):
  - Baseline: 64 SMs @ 700 MHz
  - Baseline_MoreSMs: 72 SMs @ 700 MHz
  - NDP: 64 SMs @ 700 MHz + 8 NSUs @ 350 MHz
- NDP-related buffers:
  - GPU SM: 8 B × 364 entries for the NDP packet buffer → 2.8 KB (1.8% storage overhead in the GPU)
  - NSU: 128 B × 512 entries for the NDP buffer → 64 KB (but no scratchpad or data cache)

Naïve NDP Performance
(Chart: performance of Baseline, Baseline_MoreSMs, and NaiveNDP across the workloads; the naïve NDP implementation shows a 55% performance degradation, with individual bars reaching 6.86 and 10.1.)

Partial Offloading
- Only offload a given proportion of the offload block instances
  - e.g., offload ratio = 0.5: across warps, half of the identified offload block instances execute on the NSU (offloaded) and the other half execute on the GPU (not offloaded)
- No single static ratio is optimal for all workloads
- Too many factors would need to be considered to analytically determine the best offload ratio:
  - Compute/memory intensity
  - Current off-chip BW utilization
  - Cache locality
- → Use a simple heuristic: the hill-climbing method

Dynamic Offload Decision
- Hill-climbing method (a minimal code sketch follows below)
  - Based on the IPC measured during an epoch, determine the offload ratio for the next epoch
- Adaptive step size
  - Large step: good when far from the peak
  - Small step: good when near the peak
  - When there is oscillation, reduce the step size
- Cache locality-awareness
  - Suppress offloading of blocks that have good cache locality
  - Do not offload if (offload benefit) < (register transfer overhead)
  - The offload benefit is calculated from cache performance counters (more details in the paper)
(Figure: IPC as a function of offload ratio, with the step size annotated.)
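Below is a minimal host-side sketch of such a hill-climbing controller with an adaptive step; the constants, the fake IPC trace, and the epoch_done interface are illustrative assumptions, not values from the paper.

    #include <algorithm>
    #include <cstdio>

    struct OffloadController {
        double ratio     = 0.5;   // fraction of offload block instances sent to NSUs
        double step      = 0.25;  // adaptive step size
        int    direction = +1;    // current search direction
        double prev_ipc  = 0.0;   // IPC measured during the previous epoch

        // Called once per epoch with the IPC measured during that epoch.
        void epoch_done(double epoch_ipc) {
            if (epoch_ipc < prev_ipc) {  // last move hurt performance: we passed the peak,
                direction = -direction;  // so reverse direction...
                step *= 0.5;             // ...and shrink the step (dampen oscillation)
            }
            prev_ipc = epoch_ipc;
            ratio = std::min(1.0, std::max(0.0, ratio + direction * step));
        }
    };

    int main() {
        OffloadController ctrl;
        // Fake IPC trace, only to show the controller settling near a peak;
        // a real run would feed per-epoch performance-counter readings.
        const double ipc_trace[] = {1.00, 1.20, 1.30, 1.25, 1.28, 1.29};
        for (double ipc : ipc_trace) {
            ctrl.epoch_done(ipc);
            printf("offload ratio -> %.3f (step %.3f)\n", ctrl.ratio, ctrl.step);
        }
        return 0;
    }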

Dynamic Offload Decision Result
(Charts: performance and energy for Baseline, Baseline_MoreSMs, NDP(HC), and NDP(HC+CA); the annotated gains are 15% and 18% in performance, up to 67%, and 8% in energy, up to 38%.)

Conclusion
- Prior work on NDP was limited in several ways: application-specific, architecture-specific, or restrictive on data placement
- Our proposed partitioned execution mechanism overcomes these limitations and enables general-purpose, standardized NDP
  - Key insight: architecture-specific address translation can be decoupled from the execution of LD/ST instructions through NDP buffers
- Dynamic offload decision improves the performance of the proposed NDP architecture
- Performance and energy are improved by up to 67% and 38%, respectively
