RFVP: Rollback-Free Value Prediction with Safe-to-Approximate Loads

Similar presentations
Gennady Pekhimenko Advisers: Todd C. Mowry & Onur Mutlu

Orchestrated Scheduling and Prefetching for GPGPUs Adwait Jog, Onur Kayiran, Asit Mishra, Mahmut Kandemir, Onur Mutlu, Ravi Iyer, Chita Das.
Application-Aware Memory Channel Partitioning † Sai Prashanth Muralidhara § Lavanya Subramanian † † Onur Mutlu † Mahmut Kandemir § ‡ Thomas Moscibroda.
High Performing Cache Hierarchies for Server Workloads
FLEXclusion: Balancing Cache Capacity and On-chip Bandwidth via Flexible Exclusion Jaewoong Sim Jaekyu Lee Moinuddin K. Qureshi Hyesoon Kim.
Chimera: Collaborative Preemption for Multitasking on a Shared GPU
Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency Gennady Pekhimenko, Vivek Seshadri, Yoongu Kim,
Tiered-Latency DRAM: A Low Latency and A Low Cost DRAM Architecture
Neither More Nor Less: Optimizing Thread-level Parallelism for GPGPUs Onur Kayıran, Adwait Jog, Mahmut Kandemir, Chita R. Das.
1 Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures M. Aater Suleman* Onur Mutlu† Moinuddin K. Qureshi‡ Yale N. Patt* *The.
Parallel Application Memory Scheduling Eiman Ebrahimi * Rustam Miftakhutdinov *, Chris Fallin ‡ Chang Joo Lee * +, Jose Joao * Onur Mutlu ‡, Yale N. Patt.
GPGPU overview. Graphics Processing Unit (GPU) GPU is the chip in computer video cards, PS3, Xbox, etc – Designed to realize the 3D graphics pipeline.
Improving the Performance of Object-Oriented Languages with Dynamic Predication of Indirect Jumps José A. Joao *‡ Onur Mutlu ‡* Hyesoon Kim § Rishi Agarwal.
Sunpyo Hong, Hyesoon Kim
CuMAPz: A Tool to Analyze Memory Access Patterns in CUDA
Unifying Primary Cache, Scratch, and Register File Memories in a Throughput Processor Mark Gebhart 1,2 Stephen W. Keckler 1,2 Brucek Khailany 2 Ronny Krashinsky.
1 Reducing DRAM Latencies with an Integrated Memory Hierarchy Design Authors Wei-fen Lin and Steven K. Reinhardt, University of Michigan Doug Burger, University.
Page Overlays An Enhanced Virtual Memory Framework to Enable Fine-grained Memory Management Vivek Seshadri Gennady Pekhimenko, Olatunji Ruwase, Onur Mutlu,
Adaptive GPU Cache Bypassing Yingying Tian *, Sooraj Puthoor†, Joseph L. Greathouse†, Bradford M. Beckmann†, Daniel A. Jiménez * Texas A&M University *,
GPU Power Model Nandhini Sudarsanan Nathan Vanderby Neeraj Mishra Usha Vinodh
The Evicted-Address Filter
Gather-Scatter DRAM In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses Vivek Seshadri Thomas Mullins, Amirali Boroumand,
An Integrated GPU Power and Performance Model (ISCA’10, June 19–23, 2010, Saint-Malo, France. International Symposium on Computer Architecture)
Simultaneous Multi-Layer Access Improving 3D-Stacked Memory Bandwidth at Low Cost Donghyuk Lee, Saugata Ghose, Gennady Pekhimenko, Samira Khan, Onur Mutlu.
On the Importance of Optimizing the Configuration of Stream Prefetches Ilya Ganusov Martin Burtscher Computer Systems Laboratory Cornell University.
CUDA-3. GPGPU ● General Purpose computation using GPU in applications other than 3D graphics – GPU accelerates critical.
Quantifying and Controlling Impact of Interference at Shared Caches and Main Memory Lavanya Subramanian, Vivek Seshadri, Arnab Ghosh, Samira Khan, Onur.
HAT: Heterogeneous Adaptive Throttling for On-Chip Networks Kevin Kai-Wei Chang Rachata Ausavarungnirun Chris Fallin Onur Mutlu.
Managing DRAM Latency Divergence in Irregular GPGPU Applications Niladrish Chatterjee Mike O’Connor Gabriel H. Loh Nuwan Jayasena Rajeev Balasubramonian.
Effect of Instruction Fetch and Memory Scheduling on GPU Performance Nagesh B Lakshminarayana, Hyesoon Kim.
RFVP: Rollback-Free Value Prediction with Safe to Approximate Loads Amir Yazdanbakhsh, Bradley Thwaites, Hadi Esmaeilzadeh Gennady Pekhimenko, Onur Mutlu,
Henry Wong, Misel-Myrto Papadopoulou, Maryam Sadooghi-Alvandi, Andreas Moshovos University of Toronto Demystifying GPU Microarchitecture through Microbenchmarking.
A Case for Toggle-Aware Compression for GPU Systems
Improving Multi-Core Performance Using Mixed-Cell Cache Architecture
Zorua: A Holistic Approach to Resource Virtualization in GPUs
Reza Yazdani Albert Segura José-María Arnau Antonio González
CS427 Multicore Architecture and Parallel Computing
Transparent Offloading and Mapping (TOM) Enabling Programmer-Transparent Near-Data Processing in GPU Systems Kevin Hsieh Eiman Ebrahimi, Gwangsun Kim,
A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Session 2A: Today, 10:20 AM Nandita Vijaykumar.
Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance
Understanding Latency Variation in Modern DRAM Chips Experimental Characterization, Analysis, and Optimization Kevin Chang Abhijith Kashyap, Hasan Hassan,
A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar Gennady Pekhimenko, Adwait.
ISPASS th April Santa Rosa, California
Managing GPU Concurrency in Heterogeneous Architectures
Lecture 5: GPU Compute Architecture
Gwangsun Kim Niladrish Chatterjee Arm, Inc. NVIDIA Mike O’Connor
Energy-Efficient Address Translation
Spare Register Aware Prefetching for Graph Algorithms on GPUs
Rachata Ausavarungnirun
Eiman Ebrahimi, Kevin Hsieh, Phillip B. Gibbons, Onur Mutlu
Introduction, Focus, Overview
Some challenges in heterogeneous multi-core systems
Accelerating Dependent Cache Misses with an Enhanced Memory Controller
Milad Hashemi, Onur Mutlu, Yale N. Patt
Lecture 5: GPU Compute Architecture for the last time
A Case for Richer Cross-layer Abstractions: Bridging the Semantic Gap with Expressive Memory Nandita Vijaykumar Abhilasha Jain, Diptesh Majumdar, Kevin.
Die Stacking (3D) Microarchitecture -- from Intel Corporation
José A. Joao* Onur Mutlu‡ Yale N. Patt*
15-740/ Computer Architecture Lecture 14: Prefetching
ITAP: Idle-Time-Aware Power Management for GPU Execution Units
Lois Orosa, Rodolfo Azevedo and Onur Mutlu
Address-Stride Assisted Approximate Load Value Prediction in GPUs
Haonan Wang, Adwait Jog College of William & Mary
A Case for Richer Cross-layer Abstractions:
DSPatch: Dual Spatial pattern prefetcher
Presentation transcript:

RFVP: Rollback-Free Value Prediction with Safe-to-Approximate Loads
Amir Yazdanbakhsh, Bradley Thwaites, Hadi Esmaeilzadeh (Georgia Institute of Technology)
Gennady Pekhimenko, Onur Mutlu, Todd C. Mowry (Carnegie Mellon University)

Executive Summary
- Problem: the performance of modern GPUs is significantly limited by the available off-chip bandwidth.
- Observations: many GPU applications are amenable to approximation, and data value similarity makes it possible to efficiently predict the values of cache misses.
- Key idea: use simple value prediction mechanisms to avoid accesses to main memory when it is safe to do so.
- Results: higher performance (36% speedup on average) with less than 10% quality loss, and lower energy consumption (27% on average).

[Figure: a GPU surrounded by bandwidth-hungry application domains: virtual reality, data analytics, multimedia, robotics]

Many GPU applications are limited by the off-chip bandwidth
[Figure: the same application domains: virtual reality, data analytics, multimedia, robotics]

Motivation: Bandwidth Bottleneck
[Chart: per-benchmark results for GPGPU applications from Rodinia, MapReduce, and the Nvidia CUDA SDK: backprop, s.srad2, fastwalsh, gaussian, heartwall, matrixmul, particlefilter, similarityscore, s.reduce, stringmatch, geomean]
Off-chip bandwidth is a major performance bottleneck.

Only a Few Loads Matter
[Chart: percentage of load misses vs. number of loads]
A few GPU load instructions generate most of the cache misses.

Many GPU applications are also amenable to approximation
[Figure: the same application domains: virtual reality, data analytics, multimedia, robotics]

Rollback-Free Value Prediction
Key idea: predict the values of safe-to-approximate loads when they miss in the cache.
Design principles (a minimal sketch of the resulting miss path follows below):
- No rollback/recovery, only value prediction
- The drop rate is a tuning knob
- All other requests are serviced normally
- Safety guarantees are provided
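
To make this concrete, here is a minimal C++ sketch of the decision made on an L1 miss; the structure and names (LoadRequest, serviceL1Miss, the random drop-selection policy) are illustrative assumptions, not the paper's exact hardware.

    #include <cstdint>
    #include <random>

    // Illustrative request descriptor; field names are assumptions.
    struct LoadRequest {
        uint64_t pc;          // PC of the load instruction
        uint64_t address;     // effective address
        bool     approximate; // true only for annotated safe-to-approximate loads
    };

    uint32_t predictValue(uint64_t pc)      { return 0u; } // predictor stub (sketched later)
    uint32_t fetchFromMemory(uint64_t addr) { return 0u; } // normal miss-path stub

    // Precise loads always go to memory. For approximate loads, a fraction
    // of misses (the drop rate) is serviced with a predicted value and the
    // memory request is dropped entirely; execution simply continues, with
    // no rollback and no recovery.
    uint32_t serviceL1Miss(const LoadRequest& req, double dropRate) {
        static std::mt19937 rng{42};
        std::uniform_real_distribution<double> coin(0.0, 1.0);
        if (req.approximate && coin(rng) < dropRate) {
            return predictValue(req.pc);     // rollback-free: never verified
        }
        return fetchFromMemory(req.address); // every other request serviced normally
    }

Raising the drop rate services more misses from the predictor, trading output quality for fewer memory requests.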

RFVP: Diagram
[Diagram: GPU cores backed by an L1 data cache and the RFVP value predictor, in front of the memory hierarchy]

Code Example to Support Intuition
Matrixmul:

    float newVal = 0;
    for (int i = 0; i < N; i++) {
        float4 v1 = matrix1[i];
        float4 v2 = matrix2[i];
        newVal += v1.x * v2.x;
        newVal += v1.y * v2.y;
        newVal += v1.z * v2.z;
        newVal += v1.w * v2.w;
    }

Code Example to Support Intuition (2)
s.srad2:

    int d_cN, d_cS, d_cW, d_cE;
    d_cN = d_c[ei];
    d_cS = d_c[d_iS[row] + d_Nr * col];
    d_cW = d_c[ei];
    d_cE = d_c[row + d_Nr * d_jE[col]];

Outline
- Motivation
- Key Idea
- RFVP Design and Operation
- Evaluation
- Conclusion

RFVP Architecture Design
- Instruction annotations by the programmer
- ISA changes: an approximate load instruction, and an instruction for setting the drop rate
- Defining approximate load semantics
- Microarchitecture integration

Programmer Annotations
Safety is a semantic property of the program, so we rely on the programmer to annotate the code. (A hypothetical example of such an annotation follows below.)
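
The slides do not show the annotation syntax itself. Purely as a hypothetical illustration, a qualifier-style annotation over the data of the earlier code examples might look like the following; the APPROX marker is invented here, not taken from the paper.

    // Hypothetical annotation (invented for illustration): the programmer
    // marks only data whose values are safe to approximate; anything used
    // for control flow or address computation must stay precise.
    #define APPROX /* recognized by a hypothetical compiler pass */

    APPROX float4* matrix1;  // loads from here may return predicted values
    APPROX float4* matrix2;
    int*           d_iS;     // index array from s.srad2: must remain precise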

ISA Support
- Approximate loads: load.approx Reg<id>, MEMORY<address> is a probabilistic load; it can assign either a precise or an imprecise value to Reg<id>.
- Drop rate: set.rate DropRateReg sets the fraction (e.g., 50%) of the approximate-load cache misses that do not initiate memory requests.
(A functional model of both instructions is sketched below.)
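
Building on the miss-path sketch above, here is a functional model of the two instructions' semantics; only the load.approx and set.rate behaviors come from the slide, while the modeling code and helper names (hitsInL1, l1Read) are assumptions.

    bool     hitsInL1(uint64_t address);  // hypothetical cache-model hooks,
    uint32_t l1Read(uint64_t address);    // declarations only

    double g_dropRate = 0.0;

    // set.rate DropRateReg: sets the fraction of approximate-load cache
    // misses that do not initiate memory requests.
    void set_rate(double dropRateReg) { g_dropRate = dropRateReg; }

    // load.approx Reg<id>, MEMORY<address>: a probabilistic load that may
    // assign either the precise value or an imprecise (predicted) one.
    uint32_t load_approx(uint64_t pc, uint64_t address) {
        if (hitsInL1(address)) return l1Read(address);   // hits are always precise
        LoadRequest req{pc, address, /*approximate=*/true};
        return serviceL1Miss(req, g_dropRate);           // may predict; see sketch above
    }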

Microarchitecture Integration
[Diagram: each SM pairs its cores and L1 data cache with an RFVP value predictor, all in front of the shared memory hierarchy]

Microarchitecture Integration
[Diagram: as above, with a data request traveling from an L1 data cache to the memory hierarchy]
All the L1 misses are sent to the memory subsystem.

Microarchitecture Integration
[Diagram: as above, with the RFVP value predictor intercepting part of the miss traffic]
A fraction of the requests will be handled by RFVP instead.

Language and Software Support
- Targeting performance-critical loads: only a few critical instructions matter for value prediction
- Providing safety guarantees: programmer annotations and compiler passes
- Drop-rate selection: a new knob that allows controlling the quality vs. performance tradeoff

Base Value Predictor: Two-Delta Stride
[Diagram: a prediction table indexed by Hash(PC); each entry holds Last Value, Stride1, and Stride2, and an adder forms the predicted value from Last Value and Stride2]
(A minimal sketch of this predictor follows below.)
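
The diagram matches the classic two-delta stride scheme, sketched minimally below: the stride used for prediction (Stride2) is updated only after the same stride is observed twice in a row, which filters out one-off jumps. The 192-entry table size is taken from the Pareto-analysis slide at the end; the hash function is an assumption.

    #include <cstddef>
    #include <cstdint>

    // One predictor entry, holding the fields shown in the diagram.
    struct TwoDeltaEntry {
        uint32_t lastValue = 0;
        int32_t  stride1   = 0;  // most recently observed stride
        int32_t  stride2   = 0;  // confirmed stride, used for prediction
    };

    constexpr std::size_t kEntries = 192;  // from the Pareto-analysis slide
    TwoDeltaEntry g_table[kEntries];

    std::size_t hashPC(uint64_t pc) { return (pc >> 2) % kEntries; }  // assumed hash

    // Prediction: last value plus the confirmed stride.
    uint32_t twoDeltaPredict(uint64_t pc) {
        const TwoDeltaEntry& e = g_table[hashPC(pc)];
        return e.lastValue + static_cast<uint32_t>(e.stride2);
    }

    // Training with the actual loaded value.
    void twoDeltaTrain(uint64_t pc, uint32_t actual) {
        TwoDeltaEntry& e = g_table[hashPC(pc)];
        int32_t stride = static_cast<int32_t>(actual - e.lastValue);
        if (stride == e.stride1) e.stride2 = stride;  // same stride twice: confirm it
        e.stride1   = stride;
        e.lastValue = actual;
    }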

Designing the RFVP Predictor for GPUs
[Diagram: the same two-delta structure as above]
How do we design a predictor for GPUs with, for example, 32 threads per warp?

GPU Predictor Design and Operation
[Diagram: two parallel two-delta predictors indexed by Hash(PC); one produces the predicted value for lanes 1-16 of a 32-thread warp, the other for lanes 17-32]
(A sketch of this warp-level operation follows below.)
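
One plausible reading of the diagram, sketched below under stated assumptions: two independent two-delta predictors each cover one half of the warp, and each predicted value is replicated across the 16 lanes of its half. The replication policy is inferred from the slide's layout, not stated on it.

    #include <array>
    #include <cstdint>

    constexpr int kWarpSize = 32;

    // Assumed per-half lookup into one of two independent instances of
    // the two-delta table from the previous sketch (0 = lanes 1-16,
    // 1 = lanes 17-32 in the slide's numbering).
    uint32_t predictHalf(int predictorId, uint64_t pc);  // declaration only

    std::array<uint32_t, kWarpSize> predictWarp(uint64_t pc) {
        std::array<uint32_t, kWarpSize> values{};
        const uint32_t lo = predictHalf(0, pc);  // first half-warp
        const uint32_t hi = predictHalf(1, pc);  // second half-warp
        for (int lane = 0; lane < kWarpSize; ++lane) {
            values[lane] = (lane < kWarpSize / 2) ? lo : hi;  // replicate per half
        }
        return values;
    }

Under this reading, the data value similarity observed earlier is what makes replicating one prediction across 16 threads acceptable.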

Outline
- Motivation
- Key Idea
- RFVP Design and Operation
- Evaluation
- Conclusion

Methodology
- Simulator: GPGPU-Sim (cycle-accurate), version 3.1
- Workloads: GPU benchmarks from the Rodinia, Nvidia SDK, and Mars benchmark suites
- System parameters: GPU with 15 SMs, 32 threads/warp, 48 warps/SM, 6 memory channels, 32KB shared memory, 768KB LLC, and 177.4 GB/sec GDDR5 off-chip bandwidth

RFVP Performance
[Chart: speedup per benchmark: backprop, fastwalsh, gaussian, heartwall, matrixmul, particlefilter, s.srad2, similarityscore, s.reduce, stringmatch, geomean]
Significant speedup for various acceptable quality rates.

RFVP Bandwidth Consumption
[Chart: bandwidth consumption reduction per benchmark]
Reduction in consumed off-chip bandwidth (1.5X on average).

RFVP Energy Reduction
[Chart: energy reduction per benchmark]
Reduction in consumed energy (27% on average).

Sensitivity to the Value Predictor
[Chart: speedup with different base value predictors]
The two-delta predictor was the best option.

Other Results and Analyses in the Paper
- Sensitivity to the drop rate (energy and quality)
- Precise vs. imprecise value distributions
- RFVP for the memory latency wall: CPU performance, CPU energy reduction, and the CPU quality vs. performance tradeoff

Conclusion
- Problem: the performance of modern GPUs is significantly limited by the available off-chip bandwidth.
- Observations: many GPU applications are amenable to approximation, and data value similarity makes it possible to efficiently predict the values of cache misses.
- Key idea: use a simple rollback-free value prediction mechanism to avoid accesses to main memory.
- Results: higher performance (36% speedup on average) with less than 10% quality loss, and lower energy consumption (27% on average).

RFVP: Rollback-Free Value Prediction with Safe-to-Approximate Loads
Amir Yazdanbakhsh, Bradley Thwaites, Hadi Esmaeilzadeh (Georgia Institute of Technology)
Gennady Pekhimenko, Onur Mutlu, Todd C. Mowry (Carnegie Mellon University)

Sensitivity to the Drop Rate
[Chart: speedup per benchmark at different drop rates]
Speedup varies significantly with different drop rates.

Pareto Analysis
The Pareto-optimal configuration uses 192 predictor entries and 2 independent predictors for 32 threads.