Software-Based Online Detection of Hardware Defects: Mechanisms, Architectural Support, and Evaluation

Presentation transcript:

Software-Based Online Detection of Hardware Defects: Mechanisms, Architectural Support, and Evaluation Kypros Constantinides University of Michigan Onur Mutlu Microsoft Research Todd Austin and Valeria Bertacco University of Michigan

Reliability Challenges of Technology Scaling
MICRO-40, December 3rd: Software-Based Detection of Hardware Defects
Figure: cost versus silicon process technology, with curves for cost per transistor, product cost, and reliability cost.
Reliability cost:
1) Cost of built-in defect-tolerance mechanisms
2) Cost of R&D needed to develop reliable technologies
Further scaling is not profitable.
Suggested approach:
1) Build products out of unreliable components/technologies
2) Provide reliability through very low-cost defect-tolerance techniques

Low-Cost Online Defect-Tolerance Mechanisms
Online System Repair
- Exploit resource redundancy
- Gracefully degrade the product over time
- The multi-core trend is supporting this approach
Online System Recovery
- Low-overhead periodic checkpoint and recovery
- Existing mechanisms: ReVive + ReViveI/O, SafetyNet
Online Defect Detection & Diagnosis
- Remaining challenge: need for low-cost detection & diagnosis mechanisms
In this work we focus on a low-cost technique for detecting and diagnosing hard silicon defects.

Continuous Checking Techniques
- Continuously check for execution errors
- Examples (diagram): dual-modular redundancy (original module, copy of the module, checker); main processor checked by a checker processor
Shortcomings of continuous checking:
- Redundant computation requires significant extra hardware: high area overhead
- Continuous checking consumes significant energy: pressure on the power budget

Periodic Checking Techniques
- Periodically stall the processor and check the hardware
- If hardware checking succeeds, all previous computation is correct
- Employ checkpointing and roll-back techniques
- Use Built-In Self-Test (BIST) techniques to check the hardware
BIST structure (diagram): on-chip random test pattern generation, module under test, signature register
Shortcomings:
- Random patterns do not target any specific testing technique (fault model)
- Many patterns are needed for good coverage
- Long testing times
Too slow for online testing: high performance overhead

Our Approach: Software-Based Defect Detection
1) Move the hardware checking overhead to software
2) Firmware periodically stalls the processor and performs hardware checking
3) Provide architectural support to the software checking routines
Diagram: firmware periodically stalls the processor and runs hardware checking routines, relying on architectural support for software-based checking. Accessibility and controllability of the hardware from software are marked as open questions ("??").
Advantages over hardware-based techniques:
- Lower area overhead
- Higher runtime flexibility
  - It can support multiple fault models
  - Dynamic tuning of the testing process
- Easier to upgrade (software patches)

Access-Control Extensions (ACE) Framework
- ACE Hardware: architectural support that enables software access to the processor state
- ACE Instructions: special instructions that can access and control any part of the processor state
- ACE Firmware: firmware that can periodically run directed hardware tests
Layer diagram: hardware (processor, processor state, ACE hardware); ISA (with ACE extension); software (ACE firmware, operating system, applications)

Accessing the Processor State (ACE Hardware)
- We leverage the existing full hold-scan chain infrastructure
- Full hold-scan chains are employed by most modern processors to improve and automate manufacturing testing
Diagram: scan state (shadow processor state) alongside the processor state

Accessing the Processor State (ACE Hardware)
- ACE instructions can move values from the architectural registers to the scan state and vice versa
- ACE instructions can swap data between the scan state and the processor state
Diagram labels: register file, ACE tree, ACE node, scan state, processor state

Software-Based Testing & Diagnosis (ACE Firmware)
- Step 1: Load test pattern into scan state
- Step 2: 3-cycle atomic test operation
  - Cycle 1: Swap scan state with processor state
  - Cycle 2: Test cycle
  - Cycle 3: Swap scan state with processor state
- Step 3: Validate test response
Diagram: ATPG (automatic test pattern and response generation) supplies the test patterns and test responses held in memory. Other labels: register file, ACE node, scan state, processor state, test pattern, test response, validation.
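The three steps above can be sketched as a toy simulation. This is a minimal illustration, not the ACE implementation: the `Core` class, the 16-bit increment standing in for the logic under test, and the `stuck_at_zero_mask` defect model are all hypothetical, and real ACE instructions operate on scan chains rather than Python integers.

```python
from dataclasses import dataclass


@dataclass
class Core:
    """Toy core: processor state and scan (shadow) state are plain integers."""
    processor_state: int = 0
    scan_state: int = 0
    stuck_at_zero_mask: int = 0  # hypothetical defect: these output bits read as 0

    def swap(self):
        # Exchange scan state and processor state (cycles 1 and 3).
        self.processor_state, self.scan_state = self.scan_state, self.processor_state

    def clock(self):
        # Hypothetical logic under test: a 16-bit increment (cycle 2).
        nxt = (self.processor_state + 1) & 0xFFFF
        self.processor_state = nxt & ~self.stuck_at_zero_mask


def run_ace_test(core, pattern, expected):
    """Apply one test pattern and report whether the response matches."""
    core.scan_state = pattern   # Step 1: load test pattern into scan state
    core.swap()                 # Cycle 1: pattern in, running state saved
    core.clock()                # Cycle 2: test cycle
    core.swap()                 # Cycle 3: response out, running state restored
    return core.scan_state == expected  # Step 3: validate test response


healthy = Core(processor_state=0xBEEF)
faulty = Core(processor_state=0xBEEF, stuck_at_zero_mask=0x0100)
print(run_ace_test(healthy, 0x00FF, 0x0100))  # True
print(run_ace_test(faulty, 0x00FF, 0x0100))   # False: bit 8 cannot rise
```

Because the second swap undoes the first, the interrupted program's state is back in place after the test; in this sketch `healthy.processor_state` is still `0xBEEF` afterwards.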

Timeline of Software-Based Testing
Software-based testing is coupled with a checkpointing and recovery mechanism.
Timeline (diagram): each checkpoint interval consists of computation, then the functional test, then the ACE-based test, then a checkpoint.
Functional software test:
- Checks whether the core is capable of running ACE-based testing
- Limited fault coverage: 60-70%
- Very fast: < 1000 instructions
Directed ACE-based testing:
- High-quality testing (ATPG patterns)
- High fault coverage: ~99%
- Runtime: < 1M instructions

Experimental Methodology
- OpenSPARC T1 CMP, based on Sun's Niagara
  - Synopsys Design Compiler to synthesize the OpenSPARC CMP
  - Synopsys TetraMAX ATPG tool for test pattern generation
- RTL implementation of the ACE framework to get area overhead
- Microarchitectural simulation to get performance overhead
  - SESC cycle-accurate simulator
  - Simulate a SPARC core enhanced with the ACE framework
  - Benchmarks from the SPEC CPU2000 suite

Fault Models Used for Test Pattern Generation
- Stuck-at (0 or 1)
  - Industry-standard fault model for test pattern generation
  - Silicon defects behave as a node stuck at 0 or 1
- N-Detect
  - Higher probability of detecting real hardware defects
  - Each stuck-at fault is detected by at least N different patterns
- Path-delay
  - Tests for delay faults that cause timing violations
  - Delay faults can be caused by:
    - Manufacturing defects
    - Wearout-related defects
    - Process variation
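The N-detect criterion can be made concrete with a small check. The sketch below is illustrative only: the pattern names, fault names, and the `detects` table are made up, and a real ATPG tool derives this information by fault simulation.

```python
from collections import Counter


def detection_counts(detects):
    """Count how many patterns detect each fault."""
    counts = Counter()
    for faults in detects.values():
        counts.update(faults)
    return counts


def is_n_detect(detects, all_faults, n):
    """True if every fault is detected by at least n different patterns."""
    counts = detection_counts(detects)
    return all(counts[fault] >= n for fault in all_faults)


# Hypothetical fault-simulation result: pattern -> stuck-at faults it detects.
detects = {
    "p0": {"a/0", "b/1"},
    "p1": {"a/0", "c/0"},
    "p2": {"b/1", "c/0"},
}
all_faults = {"a/0", "b/1", "c/0"}

print(is_n_detect(detects, all_faults, 2))  # True: each fault is hit twice
print(is_n_detect(detects, all_faults, 3))  # False
```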

Preliminary Functional Testing
- Fault injection campaign on a gate-level netlist of a SPARC core
- Software functional test in 3 phases (~700 instructions):
  - Control flow check
  - Register access
  - Use all ISA instructions
- Functional testing coverage is low: ~62%
- Undetected faults do not affect the execution of ACE firmware
- Full coverage is provided by further ACE-based testing

Full-Chip Distributed ACE-Based Testing
- Chip testing is distributed to the eight SPARC cores
- Testing for stuck-at and path-delay fault models
Results per core group:
- Cores [0,1]: 312K test instructions, 99.6% coverage
- Cores [2,4]: 468K test instructions, 98.7% coverage
- Cores [3,5]: 405K test instructions, 98.8% coverage
- Cores [6,7]: 333K test instructions, 99.9% coverage
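The per-group figures can be summarized with a few lines of arithmetic. This is only a sketch: it assumes "K" means 1000 instructions, and the unweighted mean treats every group as covering the same number of faults, which the slide does not state.

```python
# (test instructions, coverage %) per core group, as listed on the slide.
groups = {
    "cores 0,1": (312_000, 99.6),
    "cores 2,4": (468_000, 98.7),
    "cores 3,5": (405_000, 98.8),
    "cores 6,7": (333_000, 99.9),
}

# Longest test among the groups.
longest = max(instr for instr, _ in groups.values())

# Unweighted mean coverage (assumes equally sized fault lists per group).
mean_coverage = sum(cov for _, cov in groups.values()) / len(groups)

print(longest)                  # 468000
print(round(mean_coverage, 2))  # 99.25
```

Under these assumptions the longest test stays below the 1M-instruction bound quoted for ACE-based testing, and the mean is in line with the roughly 99% coverage reported in the conclusions.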

Performance Overhead of ACE-Based Testing
- Performance overhead depends on the fault model used to generate patterns
- The ACE framework is flexible enough to support test patterns from different fault models
Figure: performance overhead by fault model, SPEC CPU2000 average, with an arrow toward higher-quality testing. Surviving axis label: "M Checkpoint Interval".

ACE Framework Area Overhead
- RTL implementation of the ACE framework in Verilog
- Explored several ACE tree configurations
- 8 ACE trees (1 per core) to cover OpenSPARC: ~230K ACE-accessible bits
- Area overhead: 0.7% for each tree, 5.8% for the ACE framework

ACE Framework Future Directions: Other Applications
The overhead of the ACE framework can be amortized by other applications:
- Manufacturing testing
  - Lower cost of testing equipment
  - Faster testing: testing infrastructure embedded on the chip
- Post-silicon debugging: direct software access to processor state
Diagram: ACE firmware gives software accessibility and controllability of the processor hardware, serving online defect detection & diagnosis, manufacturing testing, and post-silicon debugging.

Conclusions
- We proposed a novel software-based online defect detection and diagnosis technique
  - Low area overhead: 5.8%
  - High fault coverage: 99%
  - Low performance overhead: 5.5%
- Demonstrated the flexibility of the proposed technique to support:
  - Dynamic trade-off between performance and reliability
  - A number of fault models with varying test quality
- The ACE infrastructure can be a unified framework that provides hardware accessibility and controllability to software

Thank You! Questions?

Performance-Reliability Trade-off
- Using more test patterns leads to higher reliability (coverage) but also to higher performance overhead
- The software nature of the ACE framework enables flexible runtime tuning between reliability and performance
Figure annotations: "% reduction in coverage", "46% reduction in performance overhead"

Memory Logging Storage Requirements
- Coarse-grain checkpoint intervals of 100M instructions: < 10MB

Performance Overhead of I/O-Intensive Applications

ACE Tree Implementation: Area Overhead
- RTL implementation of the ACE tree in Verilog
- 8 ACE trees (1 per core) to cover OpenSPARC: ~230K bits
- Area overhead: 2.3% for each ACE tree, 18.7% for the ACE framework
Direct-access ACE tree (diagram labels: register file, ACE node, 64 bits):
- Level 0: ACE root
- Level 1: 2 ACE nodes
- Level 2: 8 ACE nodes
- Level 3: 32 ACE nodes
- Level 4: 128 ACE nodes
- 512 x 64-bit segments = 32K bits

Hybrid ACE Tree: Area Overhead
- Hybrid ACE tree
  - Direct-access portion
  - Scan chain portion
- Area overhead: 0.7% for each tree, 5.8% for the ACE framework
- ACE-based testing latency is not affected (serial access to different segments)
Hybrid-access ACE tree (diagram labels: register file, ACE node, 64 bits, 448 bits):
- Level 0: ACE root
- Level 1: 4 ACE nodes
- Level 2: 16 ACE nodes
- 64 x 512-bit segments = 32K bits
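The node and segment counts of the two tree organizations can be compared with a short calculation. The sketch below uses only the per-level node counts and segment sizes shown on the two backup slides; the helper name is illustrative, and treating node count as a proxy for area is an assumption, not a claim from the slides.

```python
def tree_summary(nodes_per_level, segments, bits_per_segment):
    """Return (total ACE nodes, segments per leaf node, accessible bits)."""
    total_nodes = sum(nodes_per_level)
    segments_per_leaf = segments // nodes_per_level[-1]
    total_bits = segments * bits_per_segment
    return total_nodes, segments_per_leaf, total_bits


# Direct-access tree: root, 2, 8, 32, 128 nodes; 512 segments of 64 bits.
direct = tree_summary([1, 2, 8, 32, 128], segments=512, bits_per_segment=64)
# Hybrid tree: root, 4, 16 nodes; 64 segments of 512 bits.
hybrid = tree_summary([1, 4, 16], segments=64, bits_per_segment=512)

print(direct)  # (171, 4, 32768)
print(hybrid)  # (21, 4, 32768)
```

Both organizations reach the same 32K bits per tree, so 8 trees cover 256K bits, enough for the ~230K bits quoted; the hybrid tree does so with far fewer ACE nodes, which is consistent with its lower reported area overhead.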