1
University of Michigan Electrical Engineering and Computer Science
Self-calibrated Online Wearout Detection
Authors: Jason Blome, Shuguang Feng, Shantanu Gupta, Scott Mahlke
MICRO-40, December 3, 2007
2
Motivation
“Designing Reliable Systems from Unreliable Components…” - Shekhar Borkar (Intel)
[Srinivasan, DSN '04] [Borkar, MICRO '05]
More failures to come
Failures will be wearout induced
3
Current Approaches
Traditional
  Design margins
  Burn-in
Detection: based on replication of computation
  TMR (Tandem/HP NonStop servers)
  DIVA (Bower, MICRO '05)
Prediction: utilizes precise analytical models and/or sensors
  Canary circuits (Sentinel Silicon, Ridgetop)
  RAMP (Srinivasan, UIUC/IBM)
Slide callouts: Costly, Static, Impractical
4
Wearout Mechanisms
Many failure mechanisms have been shown to be progressive:
  Hot carrier injection (HCI)
  Electromigration (EM)
  Oxide breakdown (OBD)
  Negative bias temperature instability (NBTI)
5
Objective
Propose a failure prediction technique that exploits the progressive nature of wearout
  Monitor impact on path delays
Prediction
  Monitors evolution of wearout
  Proactive: enables failure avoidance/mitigation
  Continuous feedback
  False negatives and positives
Detection
  Identifies existing fault
  Reactive: enables failure recovery
  End-of-life feedback
  False negatives
6
Oxide Breakdown (OBD)
Accumulation of defects leads to a conductive path
Percolation model [Stathis, JAP '06]
7
OBD HSPICE Model
Post-breakdown leakage modeling
[Rodriguez, Stathis, Linder, IRPS '03] [BSIM4.6.0, '06]
8
Characterization Testbench
90nm standard cell library
[Figure labels: t_circuit, t_cell]
9
Impact on Propagation Delay
10
Delay Profiling Unit (DPU)
[Figure labels: uArch Module, input signal, Latency Sampling; sampled bit values omitted]
11
TRIX Analysis
The magnitude of the divergence between TRIX global and TRIX local reflects the amount of degradation.
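The divergence idea on this slide can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' hardware: the smoothing factors `alpha_local` and `alpha_global`, the synthetic latency trace, and the threshold of 2.0 are all hypothetical, and the smoother is a standard recursive triple EMA.

```python
def triple_ema(samples, alpha):
    """Triple-smoothed exponential moving average of a sequence."""
    e1 = e2 = e3 = samples[0]  # seed all three stages with the first sample
    out = []
    for s in samples:
        e1 += alpha * (s - e1)   # first smoothing pass
        e2 += alpha * (e1 - e2)  # second pass smooths the first
        e3 += alpha * (e2 - e3)  # third pass smooths the second
        out.append(e3)
    return out


def divergence(samples, alpha_local=0.2, alpha_global=0.01):
    """Local (fast) trend minus global (slow) trend at every sample.

    Both alpha values are assumptions chosen for illustration.
    """
    local = triple_ema(samples, alpha_local)
    glob = triple_ema(samples, alpha_global)
    return [l - g for l, g in zip(local, glob)]


# Synthetic latency trace (percent of nominal delay): flat, then a slow drift.
healthy = [100.0] * 200
aging = [100.0 + 0.05 * i for i in range(200)]
div = divergence(healthy + aging)

# Hypothetical threshold: flag the first sample whose divergence exceeds it.
THRESHOLD = 2.0
flagged = next(i for i, d in enumerate(div) if d > THRESHOLD)
```

While the latency is stable the two trends coincide and the divergence stays at zero. Once the delay starts drifting, the fast trend follows it and the slow trend lags, so the gap widens with the amount of drift.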
12
TRIX Analysis Details
Exponential Moving Average (EMA)
Triple-smoothed Exponential Moving Average (TRIX)
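The slide's formulas did not survive extraction, so the following is a minimal sketch using the standard recursive EMA, applied three times. It assumes TRIX here means the triple-smoothed value itself, as the slide wording suggests. The class name `Trix`, the parameter `alpha`, and the sample values are hypothetical.

```python
def ema_step(prev, sample, alpha):
    """One recursive EMA update: move prev toward sample by a fraction alpha."""
    return prev + alpha * (sample - prev)


class Trix:
    """Streaming triple-smoothed EMA (assumed form; alpha is a free parameter)."""

    def __init__(self, alpha, initial):
        self.alpha = alpha
        # All three smoothing stages start from the same initial value.
        self.e1 = self.e2 = self.e3 = initial

    def update(self, sample):
        """Feed one latency sample and return the triple-smoothed value."""
        self.e1 = ema_step(self.e1, sample, self.alpha)
        self.e2 = ema_step(self.e2, self.e1, self.alpha)
        self.e3 = ema_step(self.e3, self.e2, self.alpha)
        return self.e3


# Usage: smooth a short, made-up stream of latency samples.
smoother = Trix(alpha=0.5, initial=100.0)
smoothed = [smoother.update(x) for x in [100.0, 104.0, 98.0, 101.0]]
```

Each update needs only the three previous stage values and one new sample, so the state is constant in size no matter how long the profiling runs.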
13
Noisy Latency Profile
[Figure: percent nominal delay (%) versus increasing age]
14
DPU with TRIX Hardware
[Figure labels: input signal, Latency Sampling, TRIX_l Calculation, TRIX_g Calculation, Prediction; sampled bit values omitted]
15
Wearout Detection Unit (WDU)
[Figure labels: Latency Sampling, TRIX_l Calculation + TRIX_g Calculation, Prediction]
16
Evaluation Framework
[Diagram labels:
  OR1200 Verilog
  Synthesis and Place and Route
  Timing, Power, and Temperature Simulations
  MediaBench Suite
  90nm Library
  Fully Synthesized, P&R, OR1200 Core
  Monte Carlo Simulator
  OBD Wearout Model
  HSPICE Simulations
  Gate-level Processor Simulator
  Workload Simulator
  Wearout Simulator]
17
WDU Accuracy
18
WDU Overhead
19
WDU Overhead
20
Long-term Vision
Introspective Reliability Management (IRM)
  Intelligent reliability management directed by on-chip sensor feedback
Prospective sensors
  Delay (WDU)
  Leakage/Vt
  Temperature
21
Introspective Reliability Management
22
Conclusions
Many progressive wearout phenomena impact device-level performance.
It is possible to characterize this impact and anticipate failures.
WDU performance
  Failure predicted within 20% of end of life (tunable)
  Area overhead < 3% (hybrid)
Low-level sensors can be used to enable intelligent reliability management.
23
Questions?