Raghuraman Balasubramanian Karthikeyan Sankaralingam

Slides:



Advertisements
Similar presentations
University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science University of Michigan.
Advertisements

Copyright 2001, Agrawal & BushnellVLSI Test: Lecture 261 Lecture 26 Logic BIST Architectures n Motivation n Built-in Logic Block Observer (BILBO) n Test.
Slides based on Kewal Saluja
Microprocessor Reliability
1 Cleared for Open Publication July 30, S-2144 P148/MAPLD 2004 Rea MAPLD 148:"Is Scaling the Correct Approach for Radiation Hardened Conversions.
2007 MURI Review The Effect of Voltage Fluctuations on the Single Event Transient Response of Deep Submicron Digital Circuits Matthew J. Gadlage 1,2, Ronald.
Using Hardware Vulnerability Factors to Enhance AVF Analysis Vilas Sridharan RAS Architecture and Strategy AMD, Inc. International Symposium on Computer.
UW-Madison Computer Sciences Vertical Research Group© 2010 Relax: An Architectural Framework for Software Recovery of Hardware Faults Marc de Kruijf Shuou.
Twin Logic Gates – Improved Logic Reliability by Redundancy concerning Gate Oxide Breakdown Hagen Sämrow, Claas Cornelius, Frank Sill, Andreas Tockhorn,
Mitigating the Performance Degradation due to Faults in Non-Architectural Structures Constantinos Kourouyiannis Veerle Desmet Nikolas Ladas Yiannakis Sazeides.
University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science August 20, 2009 Enabling.
Decomposing Memory Performance Data Structures and Phases Kartik K. Agaram, Stephen W. Keckler, Calvin Lin, Kathryn McKinley Department of Computer Sciences.
June 20 th 2004University of Utah1 Microarchitectural Techniques to Reduce Interconnect Power in Clustered Processors Karthik Ramani Naveen Muralimanohar.
University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science 1 Self-calibrated.
Jieyi Long and Seda Ogrenci Memik Dept. of EECS, Northwestern Univ. Jieyi Long and Seda Ogrenci Memik Dept. of EECS, Northwestern Univ. Automated Design.
Code Coverage Testing Using Hardware Performance Monitoring Support Alex Shye, Matthew Iyer, Vijay Janapa Reddi and Daniel A. Connors University of Colorado.
Trace-Based Framework for Concurrent Development of Process and FPGA Architecture Considering Process Variation and Reliability 1 Lerong Cheng, 1 Yan Lin,
University of Michigan Electrical Engineering and Computer Science 1 Online Timing Analysis for Wearout Detection Jason Blome, Shuguang Feng, Shantanu.
1 Multi-Level Error Detection Scheme based on Conditional DIVA-Style Verification Kevin Lacker and Huifang Qin CS252 Project Presentation 12/10/2003.
1 paper I design and implementation of the aegis single-chip secure processor using physical random functions, isca’05 nuno alves 28/sep/06.
Software-Based Online Detection of Hardware Defects: Mechanisms, Architectural Support, and Evaluation Kypros Constantinides University of Michigan Onur.
Advanced Computing and Information Systems laboratory Device Variability Impact on Logic Gate Failure Rates Erin Taylor and José Fortes Department of Electrical.
TASK ADAPTATION IN REAL-TIME & EMBEDDED SYSTEMS FOR ENERGY & RELIABILITY TRADEOFFS Sathish Gopalakrishnan Department of Electrical & Computer Engineering.
Software Faults and Fault Injection Models --Raviteja Varanasi.
ECE 553: TESTING AND TESTABLE DESIGN OF DIGITAL SYSTES Fault Modeling.
DYNAMIC TEST SET SELECTION USING IMPLICATION-BASED ON-CHIP DIAGNOSIS Nicholas Imbriglia, Nuno Alves, Elif Alpaslan, Jennifer Dworak Brown University NATW.
UW-Madison Computer Sciences Vertical Research Group© 2010 A Unified Model for Timing Speculation: Evaluating the Impact of Technology Scaling, CMOS Design.
Drowsy Caches: Simple Techniques for Reducing Leakage Power Authors: ARM Ltd Krisztián Flautner, Advanced Computer Architecture Lab, The University of.
Warped Gates: Gating Aware Scheduling and Power Gating for GPGPUs
1 Computer Architecture Research Overview Rajeev Balasubramonian School of Computing, University of Utah
SiLab presentation on Reliable Computing Combinational Logic Soft Error Analysis and Protection Ali Ahmadi May 2008.
Soft errors in adder circuits Rajaraman Ramanarayanan, Mary Jane Irwin, Vijaykrishnan Narayanan, Yuan Xie Penn State University Kerry Bernstein IBM.
Canary SRAM Built in Self Test for SRAM VMIN Tracking
SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill,
Eliminating Silent Data Corruptions caused by Soft-Errors Siva Hari, Sarita Adve, Helia Naeimi, Pradeep Ramachandran, University of Illinois at Urbana-Champaign,
Self-* Systems CSE 598B Paper title: Dynamic ECC tuning for caches Presented by: Niranjan Soundararajan.
European Test Symposium, May 28, 2008 Nuno Alves, Jennifer Dworak, and R. Iris Bahar Division of Engineering Brown University Providence, RI Kundan.
On-Chip Reliability Monitor for Measuring Frequency Degradation of Digital Circuits Department of Electrical and Computer Engineering By Han Lin Jiun-Yi.
1 A Cost-effective Substantial- impact-filter Based Method to Tolerate Voltage Emergencies Songjun Pan 1,2, Yu Hu 1, Xing Hu 1,2, and Xiaowei Li 1 1 Key.
1 Compacting Test Vector Sets via Strategic Use of Implications Kundan Nepal Electrical Engineering Bucknell University Lewisburg, PA Nuno Alves, Jennifer.
Relyzer: Exploiting Application-level Fault Equivalence to Analyze Application Resiliency to Transient Faults Siva Hari 1, Sarita Adve 1, Helia Naeimi.
Stochastic Current Prediction Enabled Frequency Actuator for Runtime Resonance Noise Reduction Yiyu Shi*, Jinjun Xiong +, Howard Chen + and Lei He* *Electrical.
Detecting Errors Using Multi-Cycle Invariance Information Nuno Alves, Jennifer Dworak, and R. Iris Bahar Division of Engineering Brown University Providence,
Patricia Gonzalez Divya Akella VLSI Class Project.
Harnessing Soft Computation for Low-Budget Fault Tolerance Daya S Khudia Scott Mahlke Advanced Computer Architecture Laboratory University of Michigan,
Methodology to Compute Architectural Vulnerability Factors Chris Weaver 1, 2 Shubhendu S. Mukherjee 1 Joel Emer 1 Steven K. Reinhardt 1, 2 Todd Austin.
Seok-jae, Lee VLSI Signal Processing Lab. Korea University
Taniya Siddiqua, Paul Lee University of Virginia, Charlottesville.
Raghuraman Balasubramanian Karthikeyan Sankaralingam Understanding the Impact of Gate-Level Physical Reliability Effects on Whole Program Execution.
University of Michigan Electrical Engineering and Computer Science 1 Low Cost Control Flow Protection Using Abstract Control Signatures Daya S Khudia and.
Simone Campanoni A research CAT Simone Campanoni
Warped Gates: Gating Aware Scheduling and Power Gating for GPGPUs
SIMD Lane Decoupling Improved Timing-Error Resilience
Maintaining Data Integrity in Programmable Logic in Atmospheric Environments through Error Detection Joel Seely Technical Marketing Manager Military &
The University of British Columbia
Pattern Compression for Multiple Fault Models
Circuits Aging Min Chen( ) Ran Li( )
Soft Error Detection for Iterative Applications Using Offline Training
Dual Mode Logic An approach for high speed and energy efficient design
Circuit Design Techniques for Low Power DSPs
Yiyu Shi*, Jinjun Xiong+, Howard Chen+ and Lei He*
Analytical Approach for Soft Error Rate Estimation of SRAM-Based FPGAs
Circuits Aging Min Chen( ) Ran Li( )
A High Performance SoC: PkunityTM
Hyesoon Kim Onur Mutlu Jared Stark* Yale N. Patt
The Impact of Aging on FPGAs
Encountering Gate Oxide Breakdown with Shadow Transistors to Increase Reliability Claas Cornelius1, Frank Sill2, Hagen Sämrow1, Jakob Salzmann1, Dirk Timmermann1,
Guihai Yan, Yinhe Han, and Xiaowei Li
HotAging — Impact of Power Dissipation on Hardware Degradation
Phase based adaptive Branch predictor: Seeing the forest for the trees
Presentation transcript:

Raghuraman Balasubramanian Karthikeyan Sankaralingam Virtually-Aged Sampling DMR Unifying Circuit Failure Prediction and Detection Raghuraman Balasubramanian Karthikeyan Sankaralingam

Microprocessor Reliability  More devices will fail on the field in future technology nodes 10nm Failure Rate 16nm 32nm Time (years) A lot of research on how to… Mitigate / Recover / Repair … Detect : DMR, Diva, Argus, BIST, SWAT… Predict : Canaries, Razor, WearMon… Coverage, detection latency, fault type… 2

Circuit Failure Prediction Our goals Low Design Complexity Low Overheads High Accuracy Full Coverage 3

Lets start from a good baseline To get there… Lets start from a good baseline Sampling-DMR 4

Sampling+DMR Permanent fault detection 100% coverage < 2% Energy overheads Nomura, Shuou, et al. "Sampling+dmr: practical and low-overhead permanent fault detection." International Symposium on Computer Architecture (ISCA), 2011 5

Virtual aging makes the gates behave as if they were 6 months older But There is a Problem Time (years) A Gate Fails Architectural Errors Sampling-DMR SamplingWindows With Infrequently Occurring Errors Missed Errors  Sampling-DMR Virtually Aged Virtual aging makes the gates behave as if they were 6 months older 6

Virtually Aged Sampling DMR Virtual Aging Fault Exposure In most gates the faults are automatically exposed A new mechanism to expose faults in other gates Detect Errors 7

Executive Summary Virtually Aged Sampling-DMR Microprocessor Failure Prediction Full logic coverage With < 0.7% energy overhead Negligible performance overhead 8

Outline Motivation and Overview Virtual aging? Are all gates covered? Evaluation Methodology Results Related work Questions 9

Reducing Vdd == 6-month Delay Degradation Virtual Aging As a chip wears out, the gates become slower As we decrease Vdd, the gates become slower Virtual aging => Reducing Vdd == 6-month Delay Degradation 10

Outline Motivation and Overview Virtual aging Are all gates covered? Evaluation Methodology Results Related work Questions 11

Are all gates covered? Most gates (near-critical paths) ✔ Initial worst-case propagation delay ∼ clock period Wearout ➔ propagation delay↑ > clock period Delay fault is naturally exposed Some gates (non-critical paths) ✗ Initial worst-case propagation delay << clock period Wearout ➔ propagation delay↑ < clock period ⇒ Fault is not manifested Delay degradation is benign Eventually catastrophic breakdown  12 Photo credit : Wikimedia Commons

Soft and Hard breakdown Degradation = f(utilization, operating conditions, process variations) Any gate may fail. 13

Fault Capture Logic for Non-Critical Paths 14

Comprehensive Logic coverage Fault Capture Logic for Non-Critical Paths Comprehensive Logic coverage 15

Virtually Aged SDMR 16

Outline Motivation and Overview Virtual aging?? Are all gates covered?? Evaluation Methodology Results Related work Questions 17

Evaluation Methodology Applications Applications DMR Error?? Input Sequences Fault Vector Delay Aware Simulation Full SPEC benchmarks OpenRISC Processor ~400,000 Fault Injection Experiments Synopsys HSPICE + MOSRA Delay as a function of Time/Vdd 18

Outline Motivation and Overview Virtual aging? Are all gates covered? Evaluation Methodology Results Related work Questions 19

Results Is delay degradation measurably observable? Can voltage reduction mimic virtual aging? Do the manifested faults get exposed to the μ arch and cause timing faults? Do the faults exposed to the microarchitecture translate to architectural errors, then detected? What are the overheads? Paper includes results on running 10 SPEC benchmarks to completion spanning almost 400,000 experimental runs 20

Is delay degradation measurably observable? 5 gates represent fault sites Model paths through these gates in HSPICE MOSRA wearout models 21

Can voltage reduction mimic virtual aging? HSPICE @ Vdd = 1.2 V, Vdd = 1.15V 22

What are the overheads? Synthesized with 32nm Synopsys process Implemented additional logic for fast paths OpenRISC OpenSPARC Logic Processor Gates on Fast Path 39% 30% Area Overhead 28.9% 8.9% 22.2% 6.8% Peak Power Increase 3.2% 2.54% 2.21% 0.99% Energy Increase 0.9% 0.7% 1.02% 1.07% 23

Results - Summary Experimental Result Predict failures 9 months in advance using a Vdd reduction of 50mV Empirical result + Mathematical modeling Can predict failure within 0.4 days in all but 1 of 1 billion chips 24

Outline Motivation and Overview Virtual aging? Are all gates covered? Evaluation Methodology Results Related work Questions 25

Circuit Failure Prediction Predict the onset of failures Low Design Complexity Low Overheads High Accuracy Full Coverage 26

Related Work ✓ ✗ On-chip test circuits Canary circuits Technique Complexity Overheads Accuracy Coverage Canary circuits ✓ ✗ On-chip test circuits 27

Detect aging in select near-critical paths Related Work Technique Complexity Overheads Accuracy Coverage Canary circuits ✓ ✗ Age Detection (Shadow) Latches Detect aging in select near-critical paths 27

Periodic testing (offline) using on-chip test vectors Related Work Technique Complexity Overheads Accuracy Coverage Canary circuits ✓ ✗ Age Detection (Shadow) Latches BIST/DFT Aging Analysis Periodic testing (offline) using on-chip test vectors 27

Measure + Analyze (online) Related Work Technique Complexity Overheads Accuracy Coverage Canary circuits ✓ ✗ Age Detection (Shadow) Latches BIST/DFT Aging Analysis Continuous Delay Tracking Measure + Analyze (online) 27

Reduce Vdd + Expose Faults Related Work Technique Complexity Overheads Accuracy Coverage Canary circuits ✓ ✗ Age Detection (Shadow) Latches BIST/DFT Aging Analysis Continuous Delay Tracking Virtually Aged Sampling DMR Reduce Vdd + Expose Faults 27

Contributions Thank You Virtually Aged Sampling-DMR Microprocessor Failure Prediction Full logic coverage With < 0.7% energy overhead Negligible performance overhead A new state-of-the-art in evaluation Accurate wearout models at the gate level And impact on full system (running full benchmarks) Thank You 28

How Devices Degrade Target failure mechanisms for which NBTI, HCI, TDDB Over time, Threshold Voltage Increases Propagation Delay Increases NOT covered: Electromigration, thermal runaway We cannot statically determine which gate will wearout : as this depends on process variations, operating conditions like temperature and most importantly switching activity that depends on the application that’s executing. Target failure mechanisms for which delay degradation is a symptom 33

Variations Process variations (Static) Voltage variations (Dynamic) Some processors are more susceptible Voltage variations (Dynamic) Variations ~1 order of magnitude smaller compared to degradation Similar conditions in actual failure & virtual aging Reddi, Vijay Janapa, et al. "Voltage noise in production processors." Micro, IEEE 31.1 (2011).

When does this not work? Only when the conditions change drastically between prediction and actual failure Change in program behavior Operating conditions (Temperature, Voltage etc.,) Program hides fault exposure (but stresses it) As long as the fault is manifested 0.4 days before the actual failure – Aged-SDMR works.

Evaluation setup

We saw timing faults appear during the sampling windows Do the manifested faults get exposed to the μ-arch and cause timing faults? Delay Aware Simulation Input sequences from OpenRISC FPGA 10 benchmarks (6 SPEC INT, 4 SPEC FP) 5 million cycle traces x 3 phases of the program Cycle accurate fault vectors We saw timing faults appear during the sampling windows

Architecture error rate using 100000 cycle sampling windows Do the faults exposed in the microarchitecture translate to architectural errors, then detected? Fault vector from delay aware simulation Injected on OpenRISC on FPGA + DMR emulation Appln G1 G2 G3 G4 G5 ammp 1.60% 3.10% 5.10% 1.40% art 0.02% 2.70% 0.01% 2.60% bzip 2.30% 1.20% 0.90% 0.20% 0.07% gzip 1.50% 0.03% 0.40% 0.04% mcf 3.40% 0.70% mesa 2.20% 1.00% 0.09% 0.80% parser 4.30% 1.30% 1.90% 0.50% quake twolf 3.30% 1.10% vpr 2.10% Architecture error rate using 100000 cycle sampling windows

Canary based 39/29

Age Detection Latches 40/29

BIST/DFT Based (Offline) 41/29

Continuous Degradation Tracking 42/29

Evaluation Methodology : Key Challenges Aged-SDMR is a Cross-layered Approach Wearout is a gate-level phenomenon Sampling-DMR works at the architecture level Application dependency Technique relies on the application to expose faults Run full applications on a full system simulator & model wearout at the device level