1
Toward Standardized Near-Data Processing with Unrestricted Data Placement for GPUs
Gwangsun Kim (Arm, Inc.), Niladrish Chatterjee (NVIDIA), Mike O’Connor (NVIDIA and UT-Austin), Kevin Hsieh (CMU)
2
Near-Data Processing (NDP)
- Offload computation to memory: better bandwidth and energy
- 3D-stacked memory poses a new opportunity for NDP
- Limitations of prior work:
  - Application-specific (e.g., graph processing, 3D rendering, etc.)
  - Architecture-specific (i.e., MMU or TLB in the memory stack)
  - Restricted data placement (e.g., no paged virtual memory for NDP)
- [Figure: the DIVA Processing-in-Memory chip [ICS’02] is single-die (same process), whereas the Hybrid Memory Cube stacks a logic process (optimized for high frequency) on a DRAM process (optimized for high capacitance)]
3
Hybrid Memory Cube (HMC)
- Interface BW > DRAM BW
  - Max. DRAM BW: 320 GB/s
  - Max. total link BW: 120 GB/s/link × 4 links = 480 GB/s
- The logic layer has routing capability, so HMCs can form a memory network (e.g., with a ring topology)
- [Figure: HMC structure: DRAM layers connected by TSVs to a logic layer containing vault controllers, an intra-HMC network, and I/O, exposing an abstracted packetized interface]
*Bandwidth numbers are from the HMC specification 2.1
4
Processor Bandwidth Limitation
- Processor bandwidth can become a bottleneck with multiple HMCs
- NDP through the memory network can address the bottleneck
- Example: a processor with 8 HMCs
  - Processor off-chip bandwidth = 120 GB/s/link × 8 links ≅ 1 TB/s (the bottleneck)
  - Total max. DRAM bandwidth = 320 GB/s/HMC × 8 HMCs ≅ 2.5 TB/s
- [Figure: a processor connected to 8 HMCs through a memory network; compute on the processor pulls all data across its off-chip links]
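A few lines of arithmetic make the gap concrete. The sketch below (plain C++) just multiplies out the slide's example numbers; the link and HMC counts are the slide's values, everything else is illustrative.

#include <cstdio>

int main() {
    // Example from this slide: a GPU-class processor with 8 HMCs.
    const double link_bw_gbs = 120.0;  // GB/s per link (HMC spec 2.1)
    const double dram_bw_gbs = 320.0;  // GB/s max. DRAM bandwidth per HMC
    const int links = 8, hmcs = 8;

    const double processor_bw = link_bw_gbs * links;  // ~1 TB/s off-chip
    const double dram_bw      = dram_bw_gbs * hmcs;   // ~2.5 TB/s in the stacks

    // The off-chip links expose well under half of the aggregate DRAM
    // bandwidth; NDP through the memory network sidesteps this bottleneck.
    std::printf("off-chip: %.0f GB/s, aggregate DRAM: %.0f GB/s\n",
                processor_bw, dram_bw);
    return 0;
}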
5
Outline
- Background / Motivation
- Proposed NDP Architecture
  - Partitioned Execution
  - Naïve NDP Implementation Result
- Dynamic Offload Decision
  - Partial Offloading & Cache Locality-aware Offloading
  - Results
- Conclusion
6
Proposed NDP Architecture
- Focus on GPUs as they are bandwidth-intensive
- Goal of this work: overcome the limitations of prior work
  - General-purpose NDP
  - Standardizable: no MMU or TLB in the memory stack
  - No restriction on data placement during NDP
- [Figure: GPU cores (register file, instruction cache, NDP buffers) connected through the memory network to HMCs; each HMC's logic layer holds vault controllers, I/O, an intra-HMC network, and an NSU (Near-data-processing SIMD Unit)]
7
High-level View of Our Approach
- Offload block: the unit of NDP offloading
  - A memory-intensive chunk of instructions in a given workload
  - Automatically identified at compile time [ISCA’16]
- When a warp (group of threads) reaches an offload block, the GPU core ships the input context to the NSU in the HMC and executes other warps in the meantime; the NSU runs the memory-intensive block next to DRAM and returns the output context, after which the GPU core resumes the compute-intensive code
- Overhead: registers transferred (input/output context). Benefit: memory accesses offloaded
- [Figure: instruction stream alternating compute-intensive and memory-intensive phases, with the memory-intensive phase executed on the NSU (in the HMC) instead of the GPU core]
8
Partitioned Execution (Load)
- Addresses are translated on the GPU (no translation hardware needed in the memory stack)
- Load by NSU: the GPU core sends a Read-and-Forward request (physical address) through the memory network to the DRAM (vault controller), which forwards the data to the NSU's NDP buffer
- Load by GPU: executed as a normal load; the response (data) returns to the GPU core's register file
- [Figure: load message flows between the GPU core, memory network, vault controller, and NSU]
9
Partitioned Execution (Store)
- Addresses are again translated on the GPU
- Store by NSU: the GPU core sends a Write-Address message (physical address) to the NSU, telling it where to write; the NSU pairs the address with the data in its NDP buffer and issues a write request (physical address + data) to the DRAM (vault controller), which acknowledges the write
- Store by GPU: executed as a normal store; the write request carries the data from the register file and is acknowledged by the vault controller
- [Figure: store message flows between the GPU core, memory network, NSU, and vault controller]
10
Vector Addition Example
- Timeline: the GPU issues the offload command, executes the block's loads and store as Read-and-Forward requests and a write-address message, then blocks the warp (and schedules another warp); the DRAM (vault) forwards the loaded data to the NSU, which computes, issues the write request, receives the write ack, and returns the offload ack to the GPU
- GPU code:
  0xA00: …
  0xA08: OFLD.BEG 0xD08, [], 2, 1
  0xA10: ld %f1, [%rl9]
  0xA18: ld %f2, [%rl8]
  0xA20: add %f3, %f2, %f1
  0xA28: add %rl10, %rl1, %rl7
  0xA30: st [%rl10], %f3
  0xA38: OFLD.END []
  0xA40: …
- NSU code:
  0xD00: …
  0xD08: OFLD.BEG []
  0xD10: ld %f1
  0xD18: ld %f2
  0xD20: add %f3, %f2, %f1
  0xD28: st %f3
  0xD30: OFLD.END []
  0xD38: …
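For reference, a minimal, self-contained sketch of this flow in plain C++. The toy DRAM map, NDP buffer array, and identity address translation are stand-ins for the real hardware; only the message names (Read-and-Forward, write address) come from the slides.

#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <vector>

static std::unordered_map<uint64_t, float> dram;   // vault-side DRAM (toy model)
static std::vector<float> ndp_buffer(512);         // NSU-side NDP buffer

// Address translation happens only on the GPU, so the memory stack needs no
// MMU or TLB; an identity mapping stands in for the GPU's MMU here.
static uint64_t translate(uint64_t vaddr) { return vaddr; }

// Load by NSU: the GPU sends a Read-and-Forward with the physical address,
// and the vault controller forwards the data into the NSU's NDP buffer.
static void read_and_forward(uint64_t vaddr, int ndp_slot) {
    uint64_t paddr = translate(vaddr);     // done on the GPU
    ndp_buffer[ndp_slot] = dram[paddr];    // vault ctrl -> NDP buffer
}

// Store by NSU: the GPU sends only the write address; the NSU pairs it with
// the value already sitting in its NDP buffer and issues the write request.
static void write_address(uint64_t vaddr, int ndp_slot) {
    uint64_t paddr = translate(vaddr);     // done on the GPU
    dram[paddr] = ndp_buffer[ndp_slot];    // NSU -> vault ctrl
}

int main() {
    dram[0x100] = 1.0f; dram[0x200] = 2.0f;          // one element of A and B
    read_and_forward(0x100, 0);                      // ld %f1
    read_and_forward(0x200, 1);                      // ld %f2
    ndp_buffer[2] = ndp_buffer[0] + ndp_buffer[1];   // add %f3, %f2, %f1 on the NSU
    write_address(0x300, 2);                         // st %f3
    std::printf("C[i] = %f\n", dram[0x300]);
    return 0;
}

The point mirrored from the slide is that the GPU contributes only translated addresses (and the offload command), while the loaded data and the result move between DRAM and the NDP buffer without crossing the GPU's off-chip links.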
11
Considering GPU’s Cache
- Load instruction: on a GPU cache miss, the Read-and-Forward request goes on to the DRAM (vault), which sends the data to the NSU; on a cache hit, the GPU cache supplies the data to the NSU directly
- Store instruction: after the NSU's write, the newest data is in DRAM while the GPU cache can hold stale data, so the matching cache line is invalidated when the write address is sent; this invalidation traffic is only 0.4% of all traffic on average
- [Figure: load and store timelines across the GPU, GPU cache, DRAM (vault), and NSU]
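A small sketch of that interaction, assuming the GPU probes its own cache before forwarding an offloaded load and invalidates the line on an offloaded store; the GpuCache structure and the send_* helpers are hypothetical stand-ins, not the paper's interface.

#include <cstdint>
#include <optional>
#include <unordered_map>

// Stand-ins for messages into the memory network / NSU.
static void send_data_to_nsu(int /*nsu_id*/, int /*slot*/, float /*value*/) {}
static void send_read_and_forward(uint64_t /*paddr*/, int /*nsu_id*/, int /*slot*/) {}
static void send_write_address(uint64_t /*paddr*/, int /*nsu_id*/, int /*slot*/) {}

struct GpuCache {
    std::unordered_map<uint64_t, float> lines;   // toy fully-associative cache
    std::optional<float> lookup(uint64_t paddr) const {
        auto it = lines.find(paddr);
        if (it == lines.end()) return std::nullopt;
        return it->second;
    }
    void invalidate(uint64_t paddr) { lines.erase(paddr); }
};

// Offloaded load: a cache hit supplies the data to the NSU directly; a miss
// turns into a Read-and-Forward that the vault controller answers.
static void offloaded_load(const GpuCache& cache, uint64_t paddr, int nsu_id, int slot) {
    if (auto hit = cache.lookup(paddr)) {
        send_data_to_nsu(nsu_id, slot, *hit);         // cache hit
    } else {
        send_read_and_forward(paddr, nsu_id, slot);   // cache miss
    }
}

// Offloaded store: the NSU will put the newest data in DRAM, so the (possibly
// stale) GPU cache line is invalidated when the write address is sent.
static void offloaded_store(GpuCache& cache, uint64_t paddr, int nsu_id, int slot) {
    cache.invalidate(paddr);
    send_write_address(paddr, nsu_id, slot);
}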
12
Evaluation Methodology
- Performance model: modified GPGPU-sim v3.2.0
  - 1 GPU with 8 HMCs (one NSU per HMC)
  - Memory network topology: 3D hypercube
  - NDP-related buffers modeled
- Power: GPUWattch + Rambus DRAM model + interconnect model
- Focus on memory-intensive workloads
- [Table: compute cores (Streaming Multiprocessors, or SMs) and clock frequencies (MHz) of the Baseline, Baseline_MoreSMs, and NDP configurations]
- NDP-related buffer sizes:
  - GPU SM: 8 B × 364 entries for the NDP packet buffer ≈ 2.8 KB (1.8% storage overhead in the GPU)
  - NSU: 128 B × 512 entries for the NDP buffer = 64 KB (but no scratchpad or data cache)
13
Naïve NDP Performance
[Figure: speedup of Baseline, Baseline_MoreSMs, and NaiveNDP across workloads; naïve NDP shows a 55% degradation, with two bars annotated 6.86 and 10.1]
14
Partial Offloading
- Offload only a given proportion of offload block instances, the offload ratio (as sketched below)
- No single static ratio is optimal for all workloads
- Too many factors need to be considered to analytically determine the best offload ratio:
  - Compute/memory intensity
  - Current off-chip BW utilization
  - Cache locality
- Use a simple heuristic instead: the hill climbing method
- [Figure: instruction streams of Warps 0-3 with identified offload blocks; with an offload ratio of 0.5, half of the block instances execute on the GPU (not offloaded) and half on the NSU (offloaded)]
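A minimal sketch of how a target offload ratio could be applied per offload-block instance; the credit-counter scheme is an assumption, since the slide only says that a given proportion of instances is offloaded.

// Decide, per dynamic instance of an offload block, whether to send it to the
// NSU so that roughly offload_ratio of the instances end up offloaded.
struct OffloadSelector {
    double offload_ratio;   // target fraction, e.g. produced by hill climbing
    double credit = 0.0;

    explicit OffloadSelector(double ratio) : offload_ratio(ratio) {}

    bool should_offload() {
        credit += offload_ratio;   // accumulate the target fraction
        if (credit >= 1.0) {       // enough credit: offload this instance
            credit -= 1.0;
            return true;
        }
        return false;              // otherwise execute it on the GPU
    }
};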
15
Dynamic Offload Decision
- Hill climbing method
  - Based on the IPC measured during an epoch, determine the offload ratio for the next epoch
  - Adaptive step size:
    - Large step: good when far from the peak
    - Small step: good when near the peak
    - When there is oscillation, reduce the step size
- Cache locality-awareness
  - Suppress offloading for blocks that have good cache locality
  - Do not offload if (offload benefit) < (register transfer overhead)
  - The offload benefit is calculated from cache performance counters (more details in the paper)
- [Figure: IPC vs. offload ratio curve, with the step size shrinking as the ratio approaches the peak]
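A minimal sketch of the epoch-based hill climbing with an adaptive step, plus the locality check; the initial values, the halve-on-reversal rule, and the benefit comparison are assumptions filled in for illustration (the slide leaves the exact formulas to the paper).

// Epoch-based hill climbing: compare the IPC of the last epoch with the one
// before it; keep moving the offload ratio in the same direction while IPC
// improves, reverse when it degrades, and shrink the step when reversing
// (i.e., when the ratio starts oscillating around the peak).
struct HillClimber {
    double ratio     = 0.5;    // assumed initial offload ratio
    double step      = 0.25;   // assumed initial step size
    double direction = +1.0;
    double last_ipc  = 0.0;

    double next_ratio(double epoch_ipc) {
        if (epoch_ipc < last_ipc) {   // worse than last epoch: overshot the peak
            direction = -direction;   // reverse direction...
            step *= 0.5;              // ...and take smaller steps near the peak
        }
        last_ipc = epoch_ipc;
        ratio += direction * step;
        if (ratio < 0.0) ratio = 0.0; // keep the ratio in [0, 1]
        if (ratio > 1.0) ratio = 1.0;
        return ratio;
    }
};

// Cache locality-awareness: skip offloading a block whose estimated benefit
// (derived from cache performance counters) is below the cost of shipping its
// input/output registers to and from the NSU.
static bool worth_offloading(double est_offload_benefit, double reg_transfer_overhead) {
    return est_offload_benefit >= reg_transfer_overhead;
}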
16
Dynamic Offload Decision Result
[Figure: performance and energy of Baseline, Baseline_MoreSMs, NDP(HC), and NDP(HC+CA), with annotated improvements of 15%, 18%, 67%, 8%, and 38%; performance improves by up to 67% and energy by up to 38%]
17
Conclusion
- Prior work on NDP was limited in several ways: application-specific, architecture-specific, or restrictive on data placement
- Our proposed partitioned execution mechanism overcomes these limitations and enables general-purpose, standardized NDP
  - Key insight: architecture-specific address translation can be decoupled from the execution of LD/ST instructions through NDP buffers
- Dynamic offload decision improves the performance of the proposed NDP architecture
- Performance and energy are improved by up to 67% and 38%, respectively