University of Michigan Electrical Engineering and Computer Science Adaptive Online Testing.

Presentation transcript:

University of Michigan Electrical Engineering and Computer Science
Adaptive Online Testing for Efficient Hard Fault Detection
Shantanu Gupta, Amin Ansari, Shuguang Feng and Scott Mahlke
University of Michigan, Ann Arbor
International Conference on Computer Design, October 4-7, 2009

Reliability Threats
Transient faults
► Particle strikes (alpha, neutrons)
► Expected to grow with decreasing Qcrit
Permanent faults / defects
► Time of occurrence: manufacture time / burn-in, or in-field (consumer end)
► Causes: process variation, wafer defects, wearout
[Figure labels: intra-die variations in ILD thickness, electromigration (EM), oxide breakdown (OBD), transient fault]

Online Defect Tolerance
1. Detection and diagnosis
2. System repair
► Use system redundancy
► Graceful performance degradation
► Bulletproof, StageNet, Configurable Isolation
3. System recovery
► Checkpoint state periodically
► Revert to a safe state upon failures
► ReVive, Safetynet
Goal: To perform efficient detection of hard errors. Low-cost solutions are desirable.

Detection and Diagnosis: Continuous test
Redundant execution
► DMR / TMR
► Processor checking
Low-level sensors
Canary circuits
► Early indication of failures
► Predict failures
Wearout sensors
► In-situ measurement of degradation
► Can approximate the remaining life of a module
[Figure labels: original module, redundant module, checker processor]

Detection and Diagnosis: Periodic test
Periodically stall the system and run diagnostic tests
Create checkpoints, and roll back in case of failures
Testing alternatives: BIST, SBST, functional testing…
[Figure: timeline of checkpoint, thread execution, test, thread execution]

Periodic Testing Challenge
Periodic tests are resource intensive: 5%-20% overhead
Testing stalls the main system
Downtimes have a huge impact
Target: Make periodic testing more efficient
[Figure: annual losses (millions of $), 1% downtime impact]

Key Insight
Given a many-core chip…
Health of cores varies spatially and temporally
1. Process variations: spatial
2. Workload-imbalance-related variations: spatial
3. Wearout over the lifetime: temporal
Our Approach
Use sensors for health evaluation
Allocate testing resources on the basis of health
► Strong core → less testing; weak core → more testing
Software-based self test (SBST) programs for testing

Adaptive Online Testing
Step 1. Health Assessment
Low-level sensors can track the health of resources
How to leverage this for saving testing cost?
[Figure labels: CMP with D$, I$, and sensors; health assessment; memory system]

Adaptive Online Testing
Health assessment is used to derive the probability of failure of each core
A core is UNSAFE if: a fault occurs AND it is not caught
Pf: probability of failure
FC: test fault coverage
Safety level is a fault coverage metric that accounts for failure probability
[Figure: per-core Pf and FC for a given safety level; values shown: 5%, 10%, 50%, 5%, 0%, 25%, 80%]
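The UNSAFE condition on this slide can be turned into a small calculation. The following is a minimal sketch, assuming that fault occurrence and test escape are independent events (the slide does not state this) and that the safety level can be modeled as one minus the unsafe probability. The function names are hypothetical.

```python
def unsafe_probability(p_fail: float, fault_coverage: float) -> float:
    """Probability that a fault occurs AND the test does not catch it.

    Assumes the two events are independent (illustrative assumption).
    """
    if not (0.0 <= p_fail <= 1.0 and 0.0 <= fault_coverage <= 1.0):
        raise ValueError("probabilities must lie in [0, 1]")
    return p_fail * (1.0 - fault_coverage)


def safety_level(p_fail: float, fault_coverage: float) -> float:
    """Safety level modeled as 1 - P(unsafe); a hypothetical definition."""
    return 1.0 - unsafe_probability(p_fail, fault_coverage)
```

Under this model, a core with Pf = 10% that is tested at FC = 80% has an unsafe probability of 2%.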

Adaptive Online Testing
Step 2. Test Allocation
Inputs:
► Safety level
► Probability of failure
Test fault coverage (FC) is computed for every core
A software-based self-test program is formulated given the target FC
Test program size grows superlinearly with the FC
[Figure labels: test vectors, CMP, health assessment, test allocator, T (test array), P (failure probability), memory system]
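The allocation step can be sketched as follows. This is an illustration only: it assumes a core is unsafe with probability Pf × (1 − FC), solves that for the smallest FC that meets the target safety level, and uses a hypothetical power-law cost model to stand in for the superlinear growth of test program size. The names `required_fault_coverage`, `program_size`, `allocate_tests`, and the parameters `full_size` and `exponent` are not from the slides.

```python
def required_fault_coverage(p_fail: float, target_safety: float) -> float:
    """Smallest FC such that p_fail * (1 - FC) <= 1 - target_safety."""
    if p_fail <= 0.0:
        return 0.0  # a core that cannot fail needs no testing in this model
    fc = 1.0 - (1.0 - target_safety) / p_fail
    return min(1.0, max(0.0, fc))  # clamp to a valid coverage value


def program_size(fc: float, full_size: int = 10_000, exponent: float = 3.0) -> int:
    """Hypothetical superlinear cost model: size = full_size * fc**exponent."""
    return round(full_size * fc ** exponent)


def allocate_tests(p_fail_per_core, target_safety):
    """Return one (fault coverage, test instructions) pair per core."""
    plan = []
    for p_fail in p_fail_per_core:
        fc = required_fault_coverage(p_fail, target_safety)
        plan.append((fc, program_size(fc)))
    return plan
```

With a target safety level of 95%, a core at Pf = 5% needs no testing in this model, while a core at Pf = 50% needs FC of about 90%, so the weak core receives nearly all of the test effort.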

Adaptive Online Testing
Step 3. Checkpoint and Recovery
Checkpoints are created periodically
Main memory is used for checkpoint storage
We use the ReVive design for checkpoint and recovery
[Figure labels: test vectors, checkpoint, CMP, test allocator, T (test array), health assessment, P (failure probability), memory system]

Adaptive Online Testing: Example
Each core undergoes a different level of testing.
Time and energy saved benefit the actual thread execution.
[Figure: per-core timelines of checkpoint, thread execution, test, and rollback]
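The checkpoint, execute, test, and rollback sequence in this example can be sketched as a simple loop. This is a software-level illustration under stated assumptions, not the ReVive mechanism: the state is an ordinary Python object, and `execute` and `self_test` are hypothetical callbacks standing in for thread execution and the SBST program.

```python
import copy


def run_with_periodic_tests(state, epochs, execute, self_test):
    """Run `epochs` intervals and keep each one only if the self test passes.

    `execute(state)` mutates the state for one interval, and `self_test()`
    returns True when no fault is detected. Both are hypothetical callbacks.
    Returns the final state and the number of rollbacks.
    """
    rollbacks = 0
    for _ in range(epochs):
        checkpoint = copy.deepcopy(state)  # checkpoint before executing
        execute(state)                     # thread execution
        if not self_test():                # test at the end of the interval
            state = checkpoint             # roll back to the last safe state
            rollbacks += 1
    return state, rollbacks
```

A shorter self test on a healthy core shrinks the test portion of every interval, which is where the time and energy savings come from.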

Evaluation: Setup
Architecture:
► 16-core CMP, ARM9-like RISC processors
Testing methodology:
► Software-based self test [Lu 2008]
Process variation:
► VARIUS tool from UIUC for variation modeling
Wearout sensors:
► Oxide breakdown sensors [E. Karl 2008]

Test saving with varying safety level
80% test instruction saving
Reduces performance overhead by 5X
[Plot: number of test instructions (normalized) vs. safety level]

Test saving and sensor accuracy
Huge savings even with error-prone sensors!
[Plot axes: number of test instructions, sensor area overhead, sensor error]

Adaptive testing over the lifetime
[Plot axes: safety level, time (years), average number of test instructions (thousands)]

Limitations
Health assessment is done at the CORE level
The weakest component within a CORE determines the testing effort
Increasing levels of process variation will only aggravate this behavior
Is there any way to overcome this?

Limitations: Example
[Figure: Core 0 to Core 3, each a pipeline of Stage1 through StageN separated by latches; cores are marked as high test effort or low test effort]

StageNet (SN) Fabric [MICRO 08]
[Figure: Core 0 to Core 3, each a pipeline of Stage1 through StageN separated by latches]

StageNet (SN) Fabric [MICRO 08]
[Figure labels: four rows of Stage1 through StageN, configuration manager, StageNet slice, crossbar switch, wearout sensors, delay, current]

SN Fault Tolerance and Adaptive Testing
Defect-free working of StageNet is similar to a traditional CMP
In the presence of failures, working stages can be easily salvaged
Process variation and lifetime wearout can result in a disparity of health for various resources
StageNet can isolate strong/weak resources and improve the efficacy of the proposed Adaptive Online Testing
[Figure: four pipelines of Fetch, Decode, Issue, and Ex/Mem stages with a configuration manager; pipelines are marked strong (low test effort) or weak (high test effort)]
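The benefit of isolating strong and weak resources can be illustrated with a toy model. The sketch below assumes that every stage has a known failure probability, that the weakest stage sets the health of a pipeline, and that any stage of a given type can be combined with any other. The regrouping policy (sort each stage type by health) is an illustrative choice, not the StageNet configuration manager's actual algorithm, and the function names are hypothetical.

```python
def core_level_health(cores):
    """Worst-case failure probability per core: the weakest stage dominates."""
    return [max(stages) for stages in cores]


def regroup_by_health(cores):
    """Form logical pipelines from stages of similar health.

    `cores[i][s]` is the failure probability of stage `s` in core `i`.
    For every stage type, the physical stages are sorted from strongest to
    weakest, and the k-th strongest is assigned to logical pipeline k.
    """
    n_stages = len(cores[0])
    columns = [sorted(core[s] for core in cores) for s in range(n_stages)]
    return [[columns[s][k] for s in range(n_stages)] for k in range(len(cores))]
```

If each of four cores has exactly one weak stage (0.50) and three strong ones (0.05), core-level assessment rates all four cores as weak. After regrouping, three pipelines contain only strong stages and one collects the weak stages, so only that one needs a high test effort.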

Test saving with StageNet
StageNet delivers an additional 10% saving
[Plot: number of test instructions (normalized) vs. safety level]

Conclusions
Health assessment of resources can enable large test savings
Vaguely accurate sensors might be sufficient: 80% saving with 25% sensor error
Adaptive online testing can reduce the performance overhead by 5X
Sensors and testing can work together for a comprehensive and cheap online testing solution

Adaptive Online Testing for Efficient Hard Fault Detection
International Conference on Computer Design, October 4-7, 2009

Backup slides

Journey of Silicon Technology
[Figure: CPU performance (log scale) across 486, Pentium, Pentium II, Pentium III, Pentium 4, Core Duo, and Core 2 Quad; labels: perfect transistors, rising variability and defects, unreliable silicon, memory redundancy, IBM z servers, Cell]

Periodic test options
Testing
► Software: functional testing, software-based self test
► Hardware: built-in self test

Evaluation: Methodology
Lifetime simulations are conducted as a series of interval simulations
At the end of each interval:
► Sensor readings are updated
► Test effort is computed on a per-core basis
Statistics are collected for:
► Total amount of test effort (number of instructions)
► Running average of test effort (over the lifetime)
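The interval-based methodology can be sketched as a short loop. This is a minimal illustration under stated assumptions: the sensor update is modeled as multiplying each core's failure probability by a hypothetical `wearout_rate`, and the per-core test effort comes from a caller-supplied `effort_fn`. Neither is taken from the slides.

```python
def simulate_lifetime(initial_p_fail, intervals, effort_fn, wearout_rate=1.1):
    """Interval-based lifetime simulation (illustrative model).

    At the end of each interval the sensor readings are updated, the test
    effort is recomputed per core, and the statistics are accumulated.
    Returns the total test effort and the running average per interval.
    """
    p_fail = list(initial_p_fail)
    total_effort = 0
    running_average = []
    for interval in range(1, intervals + 1):
        # Sensor update: hypothetical multiplicative wearout, capped at 1.0.
        p_fail = [min(1.0, p * wearout_rate) for p in p_fail]
        # Per-core test effort for this interval (number of instructions).
        total_effort += sum(effort_fn(p) for p in p_fail)
        running_average.append(total_effort / interval)
    return total_effort, running_average
```

Any mapping from failure probability to test instructions can be passed as `effort_fn`, so the same loop can compare core-level and finer-grained health assessment.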

Test saving with varying system coverage
StageNet (SN) gains advantage with increasing coverage target

Test saving and sensor accuracy
80% saving at 25% sensor error!

Adaptive testing over the lifetime