University of Michigan Electrical Engineering and Computer Science
Adaptive Online Testing for Efficient Hard Fault Detection
Shantanu Gupta, Amin Ansari, Shuguang Feng and Scott Mahlke
University of Michigan, Ann Arbor
International Conference on Computer Design, October 4-7, 2009
Reliability Threats
Transient faults
► Particle strikes (alpha, neutrons)
► Expected to grow with the decreasing Q_crit
Permanent faults / Defects
► Time of occurrence: manufacture time / burn-in; in-field (consumer end)
► Causes: process variation; wafer defects; wearout
[Figure labels: intra-die variations in ILD thickness; electromigration (EM); oxide breakdown (OBD); transient fault]
Online Defect Tolerance
1. Detection and Diagnosis
2. System repair
► Use system redundancy
► Graceful performance degradation
► Bulletproof, StageNet, Configurable Isolation
3. System recovery
► Checkpoint state periodically
► Revert back to a safe state upon failures
► ReVive, SafetyNet
Goal: To perform efficient detection of hard errors. Low-cost solutions are desirable.
Detection and Diagnosis: Continuous test
Redundant execution
► DMR / TMR
► Processor checking
Low-level sensors
► Canary circuits: early indication of failures; predict failures
► Wearout sensors: in-situ measurement of degradation; can approximate the remaining life of a module
[Figure labels: original module; redundant module; checker processor]
Detection and Diagnosis: Periodic test
Periodically stall the system and run diagnostic tests
Create checkpoints, and roll back in case of failures
Testing alternatives: BIST, SBST, functional testing…
[Figure: timeline alternating checkpoint, thread execution, and test]
Periodic Testing Challenge
Periodic tests are resource intensive: 5%-20% overhead
Testing stalls the main system
Downtimes have a huge impact
Target: Make periodic testing more efficient
[Figure: annual losses (millions of $) for a 1% downtime impact]
Key Insight
Given a many-core chip, the health of cores varies spatially and temporally
1. Process variations: spatial
2. Workload-imbalance-related variations: spatial
3. Wearout over the lifetime: temporal
Our Approach
Use sensors for health evaluation
Allocate testing resources on the basis of health
Strong core: less testing; weak core: more testing
Software-based self test (SBST) programs for testing
Adaptive Online Testing
Step 1. Health Assessment
Low-level sensors can track the health of resources
How to leverage this for saving testing cost?
[Figure: CMP with I$ and D$, sensors, health assessment, memory system]
Adaptive Online Testing
Health assessment is used to derive the probability of failure of each core
A core is UNSAFE if a fault occurs AND it is not caught
P_f: probability of failure
FC: test fault coverage
Safety level is a fault coverage metric that accounts for failure probability
[Figure: per-core P_f and FC for a given safety level; values shown on the slide: 5%, 10%, 50%, 5%, 0%, 25%, 80%]
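The slide gives the unsafe condition only in words. One plausible reading, assuming the fault event and the test miss are independent, is safety level = 1 - P_f * (1 - FC). The sketch below solves that relation for the fault coverage a core needs; the formula and the function name `required_fault_coverage` are assumptions for illustration, not the authors' stated definition.

```python
def required_fault_coverage(p_fail: float, safety_level: float) -> float:
    """Return the smallest FC in [0, 1] with 1 - p_fail * (1 - FC) >= safety_level.

    Assumed model: a core is unsafe when a fault occurs (probability p_fail)
    and the test misses it (probability 1 - FC), treated as independent.
    """
    if not 0.0 <= p_fail <= 1.0 or not 0.0 <= safety_level <= 1.0:
        raise ValueError("p_fail and safety_level must lie in [0, 1]")
    if p_fail == 0.0:
        return 0.0  # a core that cannot fail needs no testing
    fc = 1.0 - (1.0 - safety_level) / p_fail
    return min(1.0, max(0.0, fc))  # clamp to a valid coverage


if __name__ == "__main__":
    # Healthier cores (lower P_f) need less coverage for the same safety level.
    for p in (0.05, 0.10, 0.50):
        print(f"P_f = {p:.2f} -> FC = {required_fault_coverage(p, 0.95):.2f}")
```

Under this reading, a core whose failure probability is already below 1 minus the safety level needs no testing at all, which is where the savings would come from.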
Adaptive Online Testing
Step 2. Test Allocation
Inputs:
► Safety level
► Probability of failure
Test fault coverage (FC) is computed for every core
A software-based self testing program is formulated given the target FC
Test program size grows superlinearly with the FC
[Figure: CMP, memory system, health assessment (P: failure probability), test allocator (T: test array), test vectors]
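The allocation step could be sketched as follows. The coverage formula is the assumed reading 1 - P_f * (1 - FC) >= safety level, and the cost curve `full_test_size * FC ** exponent` is a hypothetical stand-in for "grows superlinearly"; the slides do not give the actual curve, and `allocate_tests` is an illustrative name.

```python
from typing import List, Tuple


def allocate_tests(
    p_fail_per_core: List[float],
    safety_level: float,
    full_test_size: int = 10_000,  # hypothetical size of a 100%-coverage program
    exponent: float = 3.0,         # hypothetical superlinear growth factor (> 1)
) -> List[Tuple[float, int]]:
    """Return (target FC, test program size) for every core."""
    plan = []
    for p in p_fail_per_core:
        if p <= 0.0:
            fc = 0.0
        else:
            # Assumed relation: safety = 1 - p * (1 - FC), solved for FC.
            fc = min(1.0, max(0.0, 1.0 - (1.0 - safety_level) / p))
        size = round(full_test_size * fc ** exponent)
        plan.append((fc, size))
    return plan


if __name__ == "__main__":
    for core, (fc, size) in enumerate(allocate_tests([0.05, 0.10, 0.50], 0.95)):
        print(f"core {core}: FC = {fc:.2f}, test instructions = {size}")
```

Because the cost curve is convex, lowering the coverage target on a healthy core saves proportionally more instructions than the coverage it gives up.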
Adaptive Online Testing
Step 3. Checkpoint and Recovery
Checkpoints are created periodically
Main memory is used for checkpoint storage
We use the ReVive design for checkpoint and recovery
[Figure: CMP, memory system with checkpoint, health assessment (P: failure probability), test allocator (T: test array), test vectors]
Adaptive Online Testing: Example
Each core undergoes a different level of testing.
The time and energy saved benefit the actual thread execution.
[Figure: per-core timelines of thread execution, test, checkpoint, and rollback]
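The checkpoint, execute, test, rollback cycle in this example can be sketched in software terms as below. This is a simplified illustration, not the ReVive mechanism: `run_epochs`, `work`, and `self_test` are hypothetical names, and state is copied in full rather than logged incrementally.

```python
import copy
from typing import Callable, Tuple


def run_epochs(
    state: dict,
    work: Callable[[dict, int], None],
    self_test: Callable[[int], bool],
    epochs: int,
) -> Tuple[dict, int]:
    """Execute `epochs` intervals, keeping an interval's work only if the test passes."""
    checkpoint = copy.deepcopy(state)  # last known-safe state
    rollbacks = 0
    for epoch in range(epochs):
        work(state, epoch)  # thread execution for this interval
        if self_test(epoch):
            # Test passed: work since the last checkpoint is trusted.
            checkpoint = copy.deepcopy(state)
        else:
            # Test failed: discard everything since the last checkpoint.
            state = copy.deepcopy(checkpoint)
            rollbacks += 1
    return state, rollbacks


if __name__ == "__main__":
    def work(s: dict, epoch: int) -> None:
        s["count"] += 1

    final, n = run_epochs({"count": 0}, work, lambda e: e != 2, epochs=4)
    print(final, n)
```

A real system would also repair or disable the failing resource before resuming; the sketch only shows how a failed test bounds the amount of lost work to one interval.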
Evaluation: Setup
Architecture:
► 16-core CMP, ARM9-like RISC processors
Testing methodology:
► Software-based self test [Lu 2008]
Process variation:
► VARIUS tool from UIUC for variation modeling
Wearout sensors:
► Oxide breakdown sensors [E. Karl 2008]
Test saving with varying safety level
80% test instruction saving
Reduces performance overhead by 5X
[Figure: number of test instructions (normalized) vs. safety level]
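The two numbers on this slide are consistent if testing overhead is assumed to scale linearly with the number of test instructions executed: an 80% saving leaves 20% of the instructions, and 1 / 0.2 = 5. The sketch below makes that arithmetic explicit; the proportionality is an assumption, and `overhead_reduction_factor` is an illustrative name.

```python
def overhead_reduction_factor(test_saving: float) -> float:
    """Return the factor by which overhead shrinks for a given fractional test saving.

    Assumes overhead is proportional to the number of test instructions.
    """
    if not 0.0 <= test_saving < 1.0:
        raise ValueError("test_saving must lie in [0, 1)")
    return 1.0 / (1.0 - test_saving)


if __name__ == "__main__":
    print(f"{overhead_reduction_factor(0.80):.1f}X")
```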
Test saving and sensor accuracy
Huge savings even with error-prone sensors!
[Figure axes: number of test instructions; sensor area overhead; sensor error]
Adaptive testing over the lifetime
[Figure axes: safety level; time (years); average number of test instructions (thousands)]
Limitations
Health assessment is done at the CORE level
The weakest component within a CORE determines the testing effort
Increasing levels of process variation will only aggravate this behavior
Is there any way to overcome this?
Limitations: Example
[Figure: Core 0 to Core 3, each a pipeline of Stage1, Stage2, Stage3, StageN separated by latches; cores are labeled high test effort or low test effort]
StageNet (SN) Fabric [MICRO 08]
[Figure: Core 0 to Core 3, each a pipeline of Stage1, Stage2, Stage3, StageN separated by latches]
StageNet (SN) Fabric [MICRO 08]
[Figure: four rows of Stage1, Stage2, Stage3, StageN; labels: StageNet slice, crossbar switch, configuration manager, wearout sensors (delay, current)]
SN Fault Tolerance and Adaptive Testing
Defect-free working of StageNet is similar to a traditional CMP
In the presence of failures, working stages can be easily salvaged
Process variation and lifetime wearout can result in a disparity of health for various resources
StageNet can isolate strong/weak resources and improve the efficacy of the proposed Adaptive Online Testing
Strong: low test effort; weak: high test effort
[Figure: four slices of Fetch, Decode, Issue, Ex/Mem stages with a configuration manager]
Test saving with StageNet
StageNet delivers an additional 10% saving
[Figure: number of test instructions (normalized) vs. safety level]
Conclusions
Health assessment of resources can enable large test savings
Roughly accurate sensors might be sufficient: 80% saving with 25% sensor error
Adaptive online testing can reduce the performance overhead by 5X
Sensors and testing can work together for a comprehensive and cheap online testing solution
Adaptive Online Testing for Efficient Hard Fault Detection
International Conference on Computer Design, October 4-7, 2009
Backup slides
Journey of Silicon Technology
[Figure: CPU performance (log scale) across 486, Pentium, Pentium II, Pentium III, Pentium 4, Core Duo, Core 2 Quad; regions labeled perfect transistors, rising variability and defects, unreliable silicon; annotations: memory redundancy, IBM z servers, Cell]
Periodic test options
Testing
► Software: functional testing; software-based self test
► Hardware: built-in self test
Evaluation: Methodology
Lifetime simulations are conducted as a series of interval simulations
At the end of each interval:
► Sensor readings are updated
► Test effort is computed on a per-core basis
Statistics are collected for:
► Total amount of test effort (number of instructions)
► Running average of test effort (over the lifetime)
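The interval-based methodology above can be sketched as a simple loop. Everything numeric here is a placeholder: the random wearout increments stand in for sensor updates, the coverage formula is the assumed reading 1 - P_f * (1 - FC) >= safety level, and the cubic cost curve is hypothetical. None of it reproduces the authors' simulator.

```python
import random
from typing import List, Tuple


def required_fc(p_fail: float, safety_level: float) -> float:
    """Assumed relation: safety = 1 - p_fail * (1 - FC), solved for FC."""
    if p_fail <= 0.0:
        return 0.0
    return min(1.0, max(0.0, 1.0 - (1.0 - safety_level) / p_fail))


def simulate_lifetime(
    num_cores: int = 16,
    intervals: int = 10,
    safety_level: float = 0.95,
    seed: int = 0,
) -> Tuple[float, List[float]]:
    """Return total test effort and its running average after each interval."""
    rng = random.Random(seed)
    # Initial spread of health, standing in for process variation.
    p_fail = [rng.uniform(0.0, 0.05) for _ in range(num_cores)]
    total = 0.0
    running_avg = []
    for i in range(1, intervals + 1):
        # End of interval: update sensor readings (wearout only accumulates).
        p_fail = [min(1.0, p + rng.uniform(0.0, 0.02)) for p in p_fail]
        # Compute test effort per core with a hypothetical cubic cost curve.
        effort = sum(required_fc(p, safety_level) ** 3 for p in p_fail)
        total += effort
        running_avg.append(total / i)
    return total, running_avg


if __name__ == "__main__":
    total, avg = simulate_lifetime()
    print(f"total effort: {total:.3f}, final running average: {avg[-1]:.3f}")
```

Since wearout only accumulates in this model, per-interval effort never decreases, so the running average rises over the simulated lifetime.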
Test saving with varying system coverage
StageNet (SN) gains advantage with increasing coverage target
Test saving and sensor accuracy
80% saving at 25% sensor error!
Adaptive testing over the lifetime