ARGO: Aging-aware GPGPU Register File Allocation

Presentation transcript:

ARGO: Aging-aware GPGPU Register File Allocation
Majid Shoushtari (Computer Science), Nikil Dutt (Computer Science), Puneet Gupta (Electrical Engineering), Abbas Rahimi (Computer Science and Engineering), Rajesh Gupta (Computer Science and Engineering)

The Future is Heterogeneous Computing (slide borrowed from the AMD keynote at ISSCC 2013)

CPU+GPU Integration in Mobile SoCs (slide borrowed from NVIDIA)

What's the problem? To support highly parallel execution, GPGPUs contain large register files (RFs): the NVIDIA GTX 480 has 2 MB and the AMD Radeon HD 5870 has 5 MB. Aging mechanisms are becoming one of the most pressing sources of circuit variation as technology shrinks, so these large RFs are threatened by aging.

Outline
– Background on NBTI
– Related Work
– GPGPU Architectural Model
– Observation: RF Underutilization
– ARGO
– Experimental Results

NBTI: A Major Aging Mechanism
Negative Bias Temperature Instability (NBTI) has emerged as a major reliability problem in current and future technology generations. NBTI manifests itself as a shift in Vth:
– Logic: slower circuits → timing errors
– Memory: reduced Static Noise Margin (SNM)
NBTI partially recovers during periods of no stress:
– Full recovery from a stress period is only possible in infinite time
– In practice, the overall Vth shift increases monotonically
Higher temperature → faster aging. NBTI makes the memory cell unstable.
Existing strategies either 1) require a higher Vdd (guardband), or 2) accept a lifetime decreased by NBTI.
ARGO: increase lifetime without a Vdd guardband.
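As a rough illustration of why stress duty cycle and recovery matter, the sketch below uses a generic long-term NBTI approximation, a power law of effective stress time. This is not the model the authors use; the constants A and n are assumed values of the kind reported in the NBTI literature.

```cpp
#include <cmath>
#include <cstdio>

// Illustrative long-term NBTI approximation (assumed constants, not the
// paper's model):
//   delta_Vth(t) ~ A * (duty_cycle * t)^n
// duty_cycle is the fraction of time the cell is under stress, t is total
// elapsed time in seconds, A lumps technology/temperature dependence, and
// n ~ 0.16 is a typical time exponent.
double nbti_delta_vth(double duty_cycle, double t_seconds,
                      double A = 3.9e-3, double n = 0.16) {
    return A * std::pow(duty_cycle * t_seconds, n);
}

int main() {
    const double three_years = 3.0 * 365 * 24 * 3600;
    // A cell stressed 100% of the time vs. one that is idle (recovering)
    // half the time: the second accumulates a smaller Vth shift.
    std::printf("always stressed: %.1f mV\n", 1e3 * nbti_delta_vth(1.0, three_years));
    std::printf("50%% duty cycle: %.1f mV\n", 1e3 * nbti_delta_vth(0.5, three_years));
    return 0;
}
```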

Related Work
RF / Caches:
– Wearout-aware register allocation [Ahmed'12]
– Exploiting RF underutilization for power saving [Tabkhi'12]
– Partitioned cache for reducing NBTI-induced aging [Calimera'11]
GPGPUs:
– Aging in the functional units of GPGPUs [Rahimi'13]
No prior work addresses aging of RFs in multi-threaded GPGPUs.

GPGPU Architecture & Execution Model: AMD Evergreen
Radeon HD 5870 (5 MB RF):
– 20 Compute Units (CUs)
– 16 Stream Cores (SCs) per CU (SIMD execution)
– 5 Processing Elements (PEs) per SC (VLIW execution: X/Y/Z/W plus a T/branch unit)
– 16 KB register file per SC
[Block diagram: an ultra-threaded dispatcher feeds the 20 CUs, which reach the global memory hierarchy through L1 caches and a crossbar; each CU contains a SIMD fetch unit, a wavefront scheduler, local data storage, and its stream cores with general-purpose XYZW registers.]
[Execution model: an OpenCL kernel (__kernel func() { }) launches an ND-Range, which is divided into work-groups (WGs) made up of work-items (WIs).]
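As a quick sanity check of the capacity numbers above, the short sketch below (C++, constants taken directly from the slide) multiplies out the hierarchy and recovers the 5 MB total register file.

```cpp
#include <cstdio>

// Evergreen (Radeon HD 5870) register-file hierarchy, as listed on the slide.
int main() {
    const int compute_units       = 20;  // CUs per device
    const int stream_cores_per_cu = 16;  // SCs per CU
    const int rf_kb_per_sc        = 16;  // 16 KB register file per SC

    const int rf_kb_per_cu = stream_cores_per_cu * rf_kb_per_sc;  // 256 KB
    const int rf_kb_total  = compute_units * rf_kb_per_cu;        // 5120 KB

    std::printf("RF per CU: %d KB\n", rf_kb_per_cu);
    std::printf("RF total:  %d KB (= %d MB)\n", rf_kb_total, rf_kb_total / 1024);
    return 0;
}
```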

Observation: RF Underutilization
Resources are fixed per compute unit: local memory size, the maximum number of threads, and the number of registers. Any one of these resource constraints may limit #WG per CU (≡ occupancy).

Kernel | # of Registers | RF Utilization
Reduction | 4 | 50%
BinarySearch | 2 | 25%
DwtHaar1D | 4 | 50%
BitonicSort | 4 | 13%
FastWalshTransform | 4 | 50%
FloydWarshall | 6 | 75%
BinomialOption | 13 | 81%
DiscreteCosineTransform | 7 | 22%
MatrixTranspose | 3 | 38%
MatrixMultiplication | 22 | 69%
SobelFilter | 9 | 99%
URNG | 6 | 19%
RadixSort | 16 | 6%
Histogram | 16 | 13%
BlackScholes | 19 | 89%

This characteristic is preserved across the set of OpenCL compiler options. On average, 54% of the RF is not utilized at all. ARGO opportunistically exploits this RF underutilization for NBTI recovery.
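The occupancy calculation itself is just a minimum over the per-resource limits. The sketch below illustrates that structure with assumed budgets and per-work-group demands; the names and numbers are hypothetical, not Evergreen's exact limits.

```cpp
#include <algorithm>
#include <cstdio>

// Hypothetical per-CU budgets and per-work-group demands; the exact values
// depend on the device and kernel, but the min-over-constraints structure
// is the point.
struct CuBudget {
    int rf_banks;         // e.g. 256 x 1 KB register banks per CU
    int local_mem_bytes;  // local data storage per CU
    int max_wavefronts;   // limit on resident wavefronts per CU
};

struct KernelDemand {
    int banks_per_wg;       // register banks needed by one work-group
    int local_mem_per_wg;   // local memory bytes per work-group
    int wavefronts_per_wg;  // wavefronts per work-group
};

// Occupancy = number of work-groups a CU can host, limited by whichever
// resource runs out first.
int workgroups_per_cu(const CuBudget& cu, const KernelDemand& k) {
    int by_rf      = cu.rf_banks        / k.banks_per_wg;
    int by_lds     = cu.local_mem_bytes / k.local_mem_per_wg;
    int by_threads = cu.max_wavefronts  / k.wavefronts_per_wg;
    return std::min({by_rf, by_lds, by_threads});
}

int main() {
    CuBudget cu{256, 32 * 1024, 32};  // assumed budgets
    KernelDemand k{16, 4 * 1024, 4};  // assumed per-WG demands
    int wgs = workgroups_per_cu(cu, k);
    std::printf("work-groups per CU: %d\n", wgs);                                     // 8
    std::printf("RF utilization: %d%%\n", 100 * wgs * k.banks_per_wg / cu.rf_banks);  // 50%
    return 0;
}
```

With these assumed numbers the limiter is the wavefront count, so only half of the 256 banks are ever touched, mirroring the 50% utilization of kernels like Reduction in the table above.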

ARGO: Overall Approach
1. Detect aging (which RF banks are stressed?): use a "virtual sensor" to predict the stressed banks.
2. Distribute stress in the RF: perform leveling (rotating allocation) of RF portions.
3. Power gate stressed RF banks: allow the stressed banks to recover.

Sliced RF Organization
– The RF is partitioned into 16 slices; each slice serves one SC.
– The RF is horizontally banked into 256 banks; each bank is 1 KB, has a separate power domain, and serves one WF.
– The RF is allocated at the granularity of a WG: the dispatcher maps a WG to an available CU, the RF allocator assigns a portion of the RF to the WG, and the WG together with the head of its allocated space is inserted into the scheduler queue.
– Logical-to-physical translation: WG # + WI # + allocated RF head → physical address.
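A minimal sketch of that translation step, under an assumed layout in which one register across a 64-wide wavefront occupies exactly one 1 KB bank (64 work-items x 16 B XYZW registers). The slide does not spell out the hardware mapping, so every field and constant here is illustrative.

```cpp
#include <cstdio>

// Hypothetical register-address layout for one compute unit:
// 256 banks of 1 KB each, 16 B (4 x 32-bit XYZW) registers.
constexpr int kBanksPerCU  = 256;
constexpr int kBytesPerReg = 16;

struct PhysAddr { int bank; int byte_offset; };

// Translate (wavefront slot within the WG, register index, lane) into a
// physical bank/offset, given the RF head (base bank) the allocator assigned
// to the WG. Banks per wavefront equal the kernel's register count, since one
// register across a 64-lane wavefront fills 64 x 16 B = 1 KB = one bank.
PhysAddr translate(int rf_head_bank, int wf_in_wg, int regs_per_wi,
                   int reg_index, int lane) {
    int banks_per_wf = regs_per_wi;  // 1 bank per register per wavefront
    int bank = (rf_head_bank + wf_in_wg * banks_per_wf + reg_index) % kBanksPerCU;
    int byte_offset = lane * kBytesPerReg;  // lane-major within the bank
    return {bank, byte_offset};
}

int main() {
    // A WG allocated at RF head bank 64, kernel uses 4 registers per work-item:
    // wavefront 2 of that WG, register 1, lane 10.
    PhysAddr a = translate(64, 2, 4, 1, 10);
    std::printf("bank %d, byte offset %d\n", a.bank, a.byte_offset);  // bank 73, offset 160
    return 0;
}
```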

Baseline (Aging-Oblivious) RF Allocation

Kernel | # Reg. | Limited by | # WF per WG | # WG per CU | # Banks required | RF Utilization
Reduction | 4 | Max # of threads | 4 | 8 | 4 × 8 × 4 = 128 banks | 128 / 256 banks = 50%

[Figure: the allocation of work-groups WG1–WG16 across the 256 RF banks under the baseline policy; the low-indexed banks are always the ones occupied.]
Low-indexed RF banks are stressed more.
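The stress imbalance of this baseline policy can be seen with a toy simulation: if every batch of resident work-groups is packed from bank 0 upward, the low-indexed banks soak up all of the stress while the high-indexed banks sit unused. The residency time and launch count below are assumed values for illustration only.

```cpp
#include <array>
#include <cstdio>

// Baseline, aging-oblivious allocation (illustrative): the RF portions of the
// resident work-groups are always packed from bank 0 upward, so the same
// low-indexed banks are under stress for every kernel launch.
constexpr int kBanks = 256;

int main() {
    std::array<long, kBanks> stress_cycles{};  // per-bank stress accumulator

    const int  banks_per_wg = 16;     // e.g. 4 WF/WG x 4 banks/WF (Reduction)
    const int  wgs_per_cu   = 8;      // occupancy: 8 resident WGs -> 128 banks
    const int  launches     = 1000;   // assumed number of WG batches over time
    const long wg_cycles    = 10000;  // assumed residency time of one batch

    for (int l = 0; l < launches; ++l) {
        for (int wg = 0; wg < wgs_per_cu; ++wg) {
            int head = wg * banks_per_wg;  // always the same low-indexed portions
            for (int b = 0; b < banks_per_wg; ++b)
                stress_cycles[head + b] += wg_cycles;
        }
    }
    std::printf("bank 0 stress:   %ld cycles\n", stress_cycles[0]);    // all the stress
    std::printf("bank 200 stress: %ld cycles\n", stress_cycles[200]);  // never used
    return 0;
}
```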

ARGO: RF Allocation
Distribute stress by rotating the allocated RF portions, so that a "healing level" of idle, power-gated banks gets a chance to recover.
[Figure: the same work-groups WG1–WG16, with their RF portions rotated across the 256 banks instead of always occupying the low-indexed half.]
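A companion toy simulation of the rotated policy, using the same assumed parameters as the baseline sketch above: advancing the allocation head for every newly dispatched work-group spreads the same total stress across all 256 banks, so each bank spends roughly half its lifetime idle and recovering. This is only a sketch of the leveling idea, not ARGO's actual allocator logic.

```cpp
#include <array>
#include <cstdio>

// ARGO-style rotation (illustrative): the allocation head advances for every
// newly dispatched work-group, so over many launches every bank alternates
// between periods of use and periods of (power-gated) recovery.
constexpr int kBanks = 256;

int main() {
    std::array<long, kBanks> stress_cycles{};
    int head = 0;  // rotating allocation head

    const int  banks_per_wg = 16;
    const int  wgs_per_cu   = 8;
    const int  launches     = 1000;
    const long wg_cycles    = 10000;

    for (int l = 0; l < launches; ++l) {
        for (int wg = 0; wg < wgs_per_cu; ++wg) {
            for (int b = 0; b < banks_per_wg; ++b)
                stress_cycles[(head + b) % kBanks] += wg_cycles;
            head = (head + banks_per_wg) % kBanks;  // one rotation per new WG
        }
    }
    // With rotation the stress is spread evenly: every bank ends up stressed
    // about half the time, instead of half the banks being stressed all the time.
    std::printf("bank 0:   %ld cycles\n", stress_cycles[0]);
    std::printf("bank 200: %ld cycles\n", stress_cycles[200]);
    return 0;
}
```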

ARGO: Overview
1. Aging instrumentation options:
– NBTI sensors: area and power overhead.
– Light-weight virtual sensing: estimates the aging profile of RF portions in a relative manner.
2. Modified RF allocator plus RF power gators.

ARGO: Virtual Sensing
The ultra-threaded dispatcher does not allocate different kinds of kernels to the same CU at the same time.
Observation: the variation in execution time across the different WGs of a kernel is < 8% for a wide range of kernels. Why?
1) The round-robin WF scheduler.
2) The strategy GPGPUs follow for handling thread divergence.

ARGO: Virtual Sensing (cont.)
RF portions are allocated per WG, and all cells within an RF portion age at the same rate. At WG granularity, RF banks therefore age at the same rate, because they are all under stress for a near-constant amount of time.
Consequently, the least-degraded portion of the RF is the least-recently-allocated portion.

ARGO: RF Allocator
Based on virtual sensing, the allocator:
– performs one rotation per newly dispatched WG;
– guarantees greedily allocating the least-recently-allocated (= least-degraded) RF portion;
– issues the corresponding power-gating signals. The primary goal is recovery; a side benefit is opportunistic leakage-power savings for unused banks.
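A minimal sketch of this allocation policy, assuming an explicit least-recently-allocated queue and a per-portion power-gate flag. The slide describes the policy, not the implementation, so the data structures below are illustrative rather than ARGO's actual single-cycle hardware; the slide's rotation-based wording suggests the real allocator achieves the same ordering with a simple rotating pointer.

```cpp
#include <cstdio>
#include <deque>
#include <vector>

// Illustrative ARGO-style allocator: RF portions are kept in
// least-recently-allocated order; a new WG receives the oldest (and therefore
// least-degraded) free portion, and every portion that is not currently
// allocated is power gated so it can recover.
struct RfAllocator {
    std::deque<int> lra_order;      // front = least recently allocated portion
    std::vector<bool> power_gated;  // per-portion power-gate signal

    explicit RfAllocator(int num_portions) : power_gated(num_portions, true) {
        for (int p = 0; p < num_portions; ++p) lra_order.push_back(p);
    }

    // Called when the dispatcher maps a new WG to this CU.
    int allocate_for_wg() {
        int portion = lra_order.front();  // least-recently-allocated portion
        lra_order.pop_front();
        power_gated[portion] = false;     // wake its banks for use
        return portion;                   // becomes the WG's RF head
    }

    // Called when the WG retires.
    void release(int portion) {
        power_gated[portion] = true;      // gate it so it recovers
        lra_order.push_back(portion);     // now the most recently allocated
    }
};

int main() {
    RfAllocator rf(16);            // e.g. one portion per WG slot
    int a = rf.allocate_for_wg();  // portion 0
    int b = rf.allocate_for_wg();  // portion 1
    rf.release(a);                 // portion 0 starts recovering
    int c = rf.allocate_for_wg();  // portion 2, not the just-released 0
    std::printf("allocated heads: %d %d %d\n", a, b, c);
    return 0;
}
```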

ARGO: Overheads
What overheads do ARGO's micro-architectural modifications impose?
– Performance: none, thanks to a single-cycle implementation of the ARGO RF allocator, just like the baseline RF allocator.
– Area: < 1% of RF area.
– Power: < 0.5% of the RF's leakage power.
The overheads are negligible.

Experimental Setup
– Multi2Sim: a cycle-accurate simulation framework with a CPU-GPU model for heterogeneous computing, targeting the AMD Evergreen ISA.
– Kernels from the AMD APP SDK 2.5, run with large parameters to put the highest load on resources.
– HSPICE for SNM measurements.

Simulation Result: Vth Shift
On average, a 27% improvement in Vth shift, normalized to the degradation in baseline mode (minimum improvement: 10%, maximum improvement: 43%). The kernel with ~100% RF utilization offers no opportunity for recovery: no improvement there, but no performance degradation either.

Simulation Result: SNM Degradation
On average, a 30% improvement in SNM. The improvements in SNM and Vth show the same trend, as expected [23].

Simulation Result: Trend of SNM Degradation
Depending on the technology and the initial SNM, a 15% to 20% reduction in SNM makes the SRAM cell unreliable (the "unsafe zone"). Compared to the aging-oblivious trend, ARGO shifts the entrance to the unsafe zone from 0.7 to 1.45 years, and all curves stay below 20% SNM degradation even after 5 years of execution.

Summary
– Aging is becoming a reliability threat, and GPGPUs have large RFs susceptible to aging.
– Observation: GPGPU RF utilization is ~46% on average.
– ARGO key ideas: exploit RF underutilization and overcome aging by leveling (rotating) the allocation of stressed RF portions.
– ARGO improves SNM by 30% on average.
Please come to our poster for more details.

Thank you. Q&A.
NSF Expedition in Computing: Variability-Aware Software for Efficient Computing with Nanoscale Devices

Supplementary Slides

Simulation Result: Recovery / Bank Size Tradeoff
[Table: recovery time (%) per kernel for bank sizes of 1 KB, 2 KB, 4 KB, and 8 KB. As given: Rdn 48%, BSe 63%, DH1D 44%, BSo 87%, FWT 53%, FW 29%, BO 13% / 8%, DCT 77% / 73%, MT 56% / 42%, MM 21% / 14%, SF 0% *, URNG 81% / 75%, RS 86%, HS 78%, BSc 9% / 4%.]
Bank size trade-offs:
– An 8 KB bank results in performance degradation.
– The overhead of the power-gating logic can be reduced by a coarser bank size.
– WFs per WG × # of registers is already a multiple of the bank size.
– 2 KB or 4 KB banks are near optimal.

Simulation Result: Different Process Corners
The gain is almost constant over the years, both with temperature held constant while varying voltage and with voltage held constant while varying temperature.