Temporal Memoization for Energy-Efficient Timing Error Recovery in GPGPUs
Abbas Rahimi, Luca Benini, Rajesh K. Gupta
UC San Diego, UNIBO, and ETHZ
NSF Variability Expedition / ERC MultiTherman

Outline
- Motivation
  - Sources of variability
  - Cost of variability tolerance
- Related work
  - Taxonomy of SIMD variability tolerance
- Temporal memoization
  - Temporal instruction reuse in GPGPUs
- Experimental setup and results
- Conclusions and future work

Sources of Variability
Variability in transistor characteristics is a major challenge in nanoscale CMOS:
- Static variation: process (Leff, Vth)
- Dynamic variations: aging, temperature, voltage (VCC) droops
To handle these variations, designers pad the clock period with a conservative guardband on top of the actual circuit delay, at the cost of operational efficiency.
[Figure: clock period = actual circuit delay + guardband; the guardband absorbs process (slow/fast), temperature, aging, and VCC-droop variation.]

Variability is about Cost and Scale
- Eliminating the guardband exposes the pipeline to timing errors [Bowman et al., JSSC'09].
- Error recovery is costly: about 3×N recovery cycles per error for a scalar pipeline, where N is the number of stages [Bowman et al., JSSC'11].

Cost of Recovery is Higher in SIMD!
The cost of recovery is exacerbated in a SIMD pipeline, in two dimensions:
- Vertically: an error within any single lane forces a global stall and recovery of the entire SIMD pipeline, so the effective error rate scales with the machine's width.
- Horizontally: higher pipeline latency raises the cost of each recovery through flushing and replay, so recovery cycles increase linearly with pipeline depth.
Wide lanes and deep pipes thus compound, making recovery quadratically expensive.
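The slide gives no formula, so the following is our own back-of-the-envelope formalization of its scaling argument, assuming independent per-lane timing errors with probability p per instruction, W lanes, and an N-stage pipeline:

```latex
P(\text{global stall}) = 1 - (1 - p)^{W} \approx W\,p \qquad (p \ll 1)
\mathbb{E}[\text{recovery cycles per SIMD instruction}] \approx W\,p \cdot 3N
```

The W·N product in the expected overhead is why growing the machine in both dimensions at once makes recovery quadratically expensive.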

SIMD is the Heart of GPGPU
[Figure: block diagram of the compute device: an ultra-threaded dispatcher feeds the compute units (CU0..CU19), each containing a SIMD fetch unit, a wavefront scheduler, local data storage, and stream cores SC0..SC15; L1 caches connect through a crossbar to the global memory hierarchy. Each stream core holds general-purpose registers, a branch unit, and processing elements X/Y/Z/W/T executing a VLIW bundle, e.g. X: MOV R8.x, 0.0f; Y: AND_INT T0.y, KC0[1].x; Z: ASHR T0.x, KC1[3].x.]
Radeon HD 5870 (AMD Evergreen):
- 20 Compute Units (CUs)
- 16 Stream Cores (SCs) per CU (SIMD execution)
- 5 Processing Elements (PEs) per SC (VLIW execution): 4 identical PEs (PEX, PEY, PEW, PEZ) plus 1 special PET
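Multiplying out this hierarchy: 20 CUs × 16 SCs/CU × 5 PEs/SC = 1,600 processing elements, which matches the HD 5870's advertised count of 1,600 stream processors.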

Taxonomy of SIMD Variability-Tolerance
- Adaptive guardband (no timing errors), predict & prevent: hierarchically focused guardbanding and uniform instruction assignment [Rahimi et al., DATE'13; Rahimi et al., DAC'13]
- Eliminated guardband (timing errors), error recovery via detect-then-correct:
  - Memoization: recalling the recent context of an error-free execution [Rahimi et al., TCAS'13]
  - Decoupled recovery: lane decoupling through private queues [Pawlowski et al., ISSCC'12; Krimer et al., ISCA'12]

Related Work: Predict & Prevent
- Uniform VLIW assignment periodically redistributes instruction stress across the slots, yielding "healthy" code: a dynamic binary optimizer on the host CPU rewrites a naive kernel into a healthy kernel for the GPGPU [Rahimi et al., DAC'13].
- The clock frequency can be tuned through an online model-based rule that accounts for sensors, observation granularity, and reaction times [Rahimi et al., DATE'13].
- Limitation: these predictive techniques cannot eliminate the entire guardband, so they cannot operate efficiently at the edge of failure.

Related Work: Detect-then-Correct
- Lane decoupling via private queues keeps an error in any single lane from stalling all the other lanes; each lane recovers on its own [Pawlowski et al., ISSCC'12; Krimer et al., ISCA'12].
- Drawbacks: decoupling causes slip between lanes, requiring additional mechanisms to guarantee correct execution, and lanes must resynchronize at every microbarrier (load, store), incurring a performance penalty.

Taxonomy of SIMD Variability-Tolerance (revisited)
- Adaptive guardband (no timing errors), predict & prevent: hierarchically focused guardbanding and uniform instruction assignment.
- Eliminated guardband (timing errors):
  - Error ignorance, detect & ignore: safety is ensured by fusing multiple data-parallel values into a single value.
  - Error recovery, detect-then-correct (exactly or approximately through memoization):
    - Memoization: recalling the recent context of an error-free execution.
    - Decoupled recovery: lane decoupling through private queues.

Memoization: in Time or Space
The cost of recovery can be reduced by memoization-based optimizations that exploit spatial or temporal parallelism.
[Figure: in temporal error correction, a lane reuses one of its own recent error-free contexts (times t-k through t); in spatial error correction, an errant lane reuses the error-free context of a concurrent lane, guided by hardware sensors.]
[Spatial Memoization] A. Rahimi, L. Benini, R. K. Gupta, "Spatial Memoization: Concurrent Instruction Reuse to Correct Timing Errors in SIMD," IEEE Trans. on CAS-II, 2013.

Contributions
A temporal memoization technique for the SIMD floating-point units (FPUs) of GPGPUs:
- Recalls the context of an error-free execution of an instruction on an FPU while maintaining lock-step execution.
- For scalable, independent recovery, a single-cycle lookup table (LUT) is tightly coupled to every FPU to hold the contexts of recent error-free executions.
- The LUT reuses these memoized contexts to correct errant FP instructions exactly or approximately, depending on application needs.
- The result is scalability and low-cost self-resiliency even in the face of high timing error rates.

Concurrent/Temporal Instruction Reuse (C/TIR)
Parallel execution in SIMD makes it possible to reuse computation, and thus reduce the cost of recovery, by leveraging inherent value locality:
- CIR: can an instruction be reused spatially, across the parallel lanes?
- TIR: can an instruction be reused temporally, within a single lane?
In both cases, C/TIR memoizes the result of an error-free execution on an instance of data and reuses that memoized context whenever new operands meet a matching constraint (exact or approximate).

FP Temporal Instruction Reuse (TIR)
- A private FIFO is attached to every individual FPU.
- Exact matching constraint, e.g. for Black-Scholes.
- Approximate matching constraint (ignoring the 12 less-significant bits of the fraction), e.g. for Sobel; even with approximate matching, output quality remains at PSNR > 30 dB.
[Plot: TIR hit rates under the two constraints; annotations mark gains of 11% and 13%, and up to 5×, from approximate matching.]
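To make the mechanism concrete, here is a minimal C sketch of such a per-FPU memo table. Only the 12-bit fraction mask and the FIFO organization come from the slides (the 4-entry size is reported on the next slide); all names, and the software framing of what is really a hardware structure, are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

#define FIFO_ENTRIES 4            /* 4-entry FIFOs are the sweet spot per the next slide */
#define APPROX_MASK  0xFFFFF000u  /* drop the 12 low fraction bits of an IEEE-754 float */

typedef struct {
    uint32_t op_a, op_b;          /* operand bit patterns of a past execution */
    uint32_t result;              /* its memoized error-free result */
    bool     valid;
} tir_entry_t;

typedef struct {
    tir_entry_t fifo[FIFO_ENTRIES];
    int  head;                    /* next slot to overwrite (FIFO replacement) */
    bool approximate;             /* matching mode, programmable via a memory-mapped register */
} tir_table_t;

static bool match(uint32_t x, uint32_t y, bool approximate)
{
    uint32_t mask = approximate ? APPROX_MASK : 0xFFFFFFFFu;
    return (x & mask) == (y & mask);
}

/* On a hit, return the memoized result so the FPU can be clock-gated (or an
 * errant result masked); on a miss, the FPU computes and tir_update() runs. */
bool tir_lookup(const tir_table_t *t, uint32_t a, uint32_t b, uint32_t *result)
{
    for (int i = 0; i < FIFO_ENTRIES; i++) {
        if (t->fifo[i].valid &&
            match(t->fifo[i].op_a, a, t->approximate) &&
            match(t->fifo[i].op_b, b, t->approximate)) {
            *result = t->fifo[i].result;
            return true;
        }
    }
    return false;
}

/* Memoize a result only after the error-detection logic confirms it is error-free. */
void tir_update(tir_table_t *t, uint32_t a, uint32_t b, uint32_t error_free_result)
{
    t->fifo[t->head] = (tir_entry_t){
        .op_a = a, .op_b = b, .result = error_free_result, .valid = true
    };
    t->head = (t->head + 1) % FIFO_ENTRIES;
}
```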

Overall TIR Rate of Applications
- The matching mode (exact vs. approximate) is programmable through memory-mapped registers.
- In most cases the hit rate increases by less than 10% when the FIFO grows from 10 to 1,000 entries.
- FIFOs with 4 entries provide an average hit rate of 76% (up to 97%) and a 2.8× higher hit rate per unit of power compared to 10-entry FIFOs.

Temporal Memoization Module
The module (drawn in gray on the slide) is superposed on the baseline recovery scheme of EDS error detection plus an ECU that replays errant instructions. Per hit/error combination, the output comes either from the FP pipeline (Qpipe) or from the LUT (QLUT):

Hit  Error  Output  Action
0    0      Qpipe   LUT update
0    1      Qpipe   Trigger ECU (replay)
1    0      QLUT    LUT reuse + FP clock gating
1    1      QLUT    LUT reuse + FP clock gating + error masking
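The same four cases, written as a minimal C decision function; the hooks named in the comments (lut_update, trigger_ecu_replay, clock_gate_fpu) are hypothetical stand-ins for the hardware actions, not names from the design:

```c
#include <stdbool.h>

typedef enum { USE_QPIPE, USE_QLUT } out_sel_t;

/* Select the result mux for one FP instruction from the LUT hit flag and the
 * EDS timing-error flag, mirroring the table above. */
out_sel_t memo_control(bool hit, bool error)
{
    if (hit) {
        /* LUT reuse: the FP pipeline can be clock-gated; if a timing error
         * was flagged, the LUT output masks it and no replay is needed.
         * clock_gate_fpu(); */
        return USE_QLUT;
    }
    if (error) {
        /* Miss with a timing error: fall back to the baseline multi-cycle
         * replay.  trigger_ecu_replay(); */
        return USE_QPIPE;
    }
    /* Miss, error-free: use the pipeline result and memoize it.
     * lut_update(); */
    return USE_QPIPE;
}
```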

Experimental Setup
- We focus on the energy-hungry, high-latency single-precision FP pipelines: memory blocks are made resilient with tunable replica bits, and the fetch and decode stages show low criticality [Rahimi et al., DATE'12].
- Six frequently exercised units, generated by FloPoCo: ADD, MUL, SQRT, RECIP, MULADD, and FP2FIX, each with a 4-cycle latency (except RECIP, with 16 stages).
- The units are optimized for a signoff frequency of 1 GHz at the (SS, 0.81 V, 125°C) corner, then for power using high-VTH cells, in TSMC 45nm.
- The memoization modules add 0.11% die area overhead to the Radeon HD 5870.
- Simulation uses Multi2Sim, a cycle-accurate CPU-GPU simulator for AMD Evergreen, running the naive binaries of the AMD APP SDK 2.5.

Energy Saving for Various Error Rates
Average energy savings grow with the timing error rate:
- 0% error rate: 8% saving on average
- 1% error rate: 14% saving
- 2% error rate: 20% saving
- 3% error rate: 24% saving
- 4% error rate: 28% saving
The temporal memoization module itself does not produce erroneous results, since it has a positive slack of 14% of the clock period. The gains come from memoization-based error recovery that, unlike the baseline replay, imposes no latency penalty.

Efficiency under Voltage Overscaling
[Plot annotations: 8% saving at nominal voltage, dipping to about 6% under mild overscaling, rising to 66% under deeper overscaling.]
- Near nominal voltage, the baseline FPUs reduce their power as a consequence of the negligible error rate, while the power of the temporal memoization modules cannot be scaled down proportionally.
- Under deeper overscaling, the baseline suffers an abrupt increase in error rate and hence frequent recoveries, which memoization avoids.

Conclusion
- A fast, lightweight temporal memoization module independently stores the recent error-free executions of each FPU.
- To reuse computations efficiently, the technique supports both exact and approximate error correction.
- It reduces total energy by 8%-28% on average, depending on the timing error rate.
- It enhances robustness in the voltage overscaling regime, achieving a relative average energy saving of 66% at 11% voltage overscaling.

Work in Progress
- To further reduce the cost of memoization, we replaced the LUT with an associative memristive (ReRAM) memory module built around a ternary content-addressable memory [Rahimi et al., DAC'14], yielding a 39% reduction in the kernels' average energy use.
- Other directions: collaborative compilation and approximate storage.

Thank you for your attention! (Grazie dell'attenzione!)
NSF Variability Expedition / ERC MultiTherman