BulletProof: A Defect-Tolerant CMP Switch Architecture
HPCA, Austin, Texas, February 13, 2006


Kypros Constantinides ‡, Stephen Plaza ‡, Jason Blome ‡, Bin Zhang †, Valeria Bertacco ‡, Scott Mahlke ‡, Todd Austin ‡, Michael Orshansky †
‡ Advanced Computer Architecture Lab, University of Michigan
† Department of Electrical and Computer Engineering, University of Texas at Austin

Introduction
Reliability is a critical aspect of any computer design
System designers target very small failure rates
Today, reliability targets are met by using fault-avoidance design techniques
– use of conservative design margins
For future process technologies, conservative design margins alone will not be enough to avoid system failures
– need defect-tolerant design techniques
[Figure: transistor reliability, shown as transistor lifetime (years), now versus future]

Reliable System Design Space
Need for cost- and performance-efficient techniques that can provide high reliability in the presence of unreliable components – “BulletProof”

DESIGN FEATURE | MANUFACTURING DEFECT | WEAR-OUT DEFECT | TRANSIENT ERROR
NO DETECTION | Untestable defects | System fails in unpredictable way | System glitch manifests in unpredictable way
DETECTION | Testing | Component terminates at first error | Component terminates; hard-reset restore
DETECTION + CORRECTION | Post-manufacturing recovery | Online defect recovery | Transient fault recovery
DETECTION + CORRECTION + REPAIR | Post-manufacturing reconfiguration | Online repair |

Techniques marked on the chart: DMR; ECC - memory; cache-line swap-out; memory-array spares; TMR; Diva; Razor; ECC; TMR; BulletProof
Legend: Mainstream Solutions, High-end Solutions, Specialized Solutions, Research-stage Solutions

CMP Switch Architecture
Goal: a defect-tolerant CMP switch design
Baseline switch architecture is provided by Li-Shiuan Peh
Implements the routing and flow-control functions required for transmitting packets in a 2D torus network
Wormhole switch pipelined at the flit level (32-bit flits)
Dimension-order routing
Specified in Verilog and synthesized to a gate-level netlist
– ~9K logic gates and 1700 sequential elements

Soft Errors (SEU) Vulnerability
In earlier work we studied the vulnerability of the switch architecture to soft errors
– Only 3.2% of faults eventually cause an error
Age-related wear-out silicon defects are a more challenging reliability threat for future technologies
In this work we focus on solutions for in-field silicon defects
These solutions also provide soft-error tolerance to the design

Self-Repairing Systems
Defect-tolerant self-repairing systems need to support:
– Error Detection
– System Diagnosis (locate the origin of the error)
– System Repair
– System Recovery
Key idea:
– Error detection must be performance efficient: it continuously checks execution for errors
– Diagnosis, repair and recovery are not performance critical: they are invoked only when an error is detected (a rare scenario), so performance can be traded for more cost-efficient techniques

Traditional Defect-Tolerant Techniques
Traditional techniques for designing defect-tolerant systems:
– Triple Modular Redundancy (TMR)
  Forward recovery
  Applicable to both combinational and sequential logic
  Cannot tolerate more than one defective module
  Area and power overhead ~3X
– Error Correction Codes (ECC)
  Lower-overhead solution
  Applicable only to state-holding structures and buses
[Figure: TMR, three modules (M) feeding a voter (V)]
[Figure: ECC word layout R1 R2 D1 R3 D2 D3 D4 R4 D5 D6 D7 D8, where R are ECC bits and D are data bits]
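To make the TMR limitation concrete, here is a minimal sketch of a bitwise majority voter. The function name `tmr_vote` and the 8-bit example values are illustrative assumptions, not taken from the slides. It shows that one defective module is masked, while two modules with the same fault outvote the good one.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three module outputs (hypothetical voter model)."""
    return (a & b) | (a & c) | (b & c)


good = 0b10110010           # output of a defect-free module (illustrative value)
faulty = good ^ 0b00000100  # same output with one bit corrupted by a defect

# One defective module out of three: the voter masks the error.
one_bad = tmr_vote(good, faulty, good)

# Two defective modules with the same fault: the voter passes the wrong value.
two_bad = tmr_vote(faulty, faulty, good)

print(one_bad == good, two_bad == good)
```

The voter itself is a single point of failure in this sketch, which is one reason the deck later replaces voting with spare-swapping logic.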

Error Detection: Low-Cost Domain Specific Technique
The synthesized netlist of the added components accounts for ~10% of the total switch area
Provides error detection for both hard and soft errors
[Figure: switch block diagram with Header, Input Buffers, Routing Logic, ARB, Cross-bar Controller and Cross-bar; the added components are a Buffer Checker, CRC generation and CRC Checkers, which inspect each FLIT and raise an Error signal]
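The end-to-end check on a flit can be sketched as follows. The slides do not state which CRC the switch uses, so the 8-bit polynomial 0x07, the zero initial value and the helper names `crc8`, `attach_crc` and `check_flit` are all assumptions chosen for illustration. Only the 32-bit flit width comes from the deck.

```python
CRC8_POLY = 0x07  # assumption: the slides do not name the CRC polynomial


def crc8(data: bytes) -> int:
    """Bitwise CRC-8 (MSB first, initial value 0) over the given bytes."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ CRC8_POLY) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


def attach_crc(flit: int):
    """Return (flit, crc) as it would enter the switch."""
    return flit, crc8(flit.to_bytes(4, "big"))


def check_flit(flit: int, crc: int) -> bool:
    """Model of a CRC checker at a switch output: True if the flit is intact."""
    return crc8(flit.to_bytes(4, "big")) == crc


flit, crc = attach_crc(0xDEADBEEF)
corrupted = flit ^ (1 << 13)  # a defect flips one bit inside the switch
print(check_flit(flit, crc), check_flit(corrupted, crc))
```

Because the check looks only at what leaves the switch, it does not care whether the corruption came from a hard defect or a soft error, which matches the claim on the slide.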

Adding Defect Resiliency With Lower Cost
Automatic Cluster Decomposition
Balanced recursive min-cut heuristic algorithm
Input: a) the design's gate-level netlist, b) the number of partitions
Output: a partitioned netlist
Goal:
– Balance partition sizes: smaller partitions give higher resilience
– Minimize cut edges: reduces cost overhead and vulnerable logic
Partitions can contain both combinational and sequential logic
[Figure: a netlist with nodes A to J, shown before and after partitioning]
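The idea of recursive balanced min-cut partitioning can be sketched on a toy graph. This is not the heuristic used in the work: for clarity it finds each balanced bisection by brute force, which only scales to a handful of nodes, and it assumes the number of partitions is a power of two. The netlist below (nodes A to H and their edges) is invented for the example.

```python
from itertools import combinations


def cut_size(edges, side):
    """Number of edges with exactly one endpoint in `side`."""
    return sum((u in side) != (v in side) for u, v in edges)


def bisect(nodes, edges):
    """Split `nodes` into two equal halves with the fewest cut edges (brute force)."""
    ordered = sorted(nodes)
    node_set = set(ordered)
    inside = [(u, v) for u, v in edges if u in node_set and v in node_set]
    half = len(ordered) // 2
    best = min(combinations(ordered, half),
               key=lambda side: cut_size(inside, set(side)))
    left = set(best)
    return left, node_set - left


def partition(nodes, edges, k):
    """Recursively bisect until there are k partitions (k must be a power of two)."""
    if k < 1 or k & (k - 1):
        raise ValueError("k must be a power of two in this sketch")
    if k == 1:
        return [set(nodes)]
    left, right = bisect(nodes, edges)
    return partition(left, edges, k // 2) + partition(right, edges, k // 2)


# Hypothetical netlist: two tightly connected clusters joined by one wire (D-E).
nodes = "ABCDEFGH"
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A"), ("A", "C"),
         ("E", "F"), ("F", "G"), ("G", "H"), ("H", "E"), ("E", "G"),
         ("D", "E")]

parts = partition(nodes, edges, 2)
print([sorted(p) for p in parts])
```

A production flow would swap the brute-force `bisect` for a scalable heuristic; the recursion and the two goals (balanced sizes, few cut edges) stay the same.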

Partition Sparing – Silicon Protection Factor
Partition sparing:
– Only one spare is active for each partition of the switch
– Replace voting logic with spare-swapping logic
– Lower power overhead
– A defect is fatal if it hits the last spare of a partition or the spare-swapping logic
Silicon Protection Factor (SPF) = Mean Defects to Failure / Area Overhead
– The number of defects in a design is proportional to the design's area
– Enables comparison of different defect-tolerant designs
[Figure: the partitioned design (nodes A to J) with 1 extra spare per partition. Annotations: 15.8X more defects tolerated; SPF (defect tolerance): 7.6X more defects tolerated per unit area]
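The SPF ratio, mean defects to failure divided by area overhead, can be explored with a small Monte Carlo sketch. The fault model here is a simplifying assumption and not the authors' methodology: all partition copies have equal area, each defect lands on a uniformly random copy, the swap logic is ignored, and the design fails once every copy of some partition is defective. Area overhead is approximated as the number of copies per partition.

```python
import random


def spf(mean_defects_to_failure: float, area_overhead: float) -> float:
    """Silicon Protection Factor: defects tolerated per unit of area overhead."""
    return mean_defects_to_failure / area_overhead


def defects_to_failure(partitions: int, copies: int, rng: random.Random) -> int:
    """Inject defects until some partition has no working copy left."""
    working = [copies] * partitions
    defective = set()
    count = 0
    while True:
        count += 1
        hit = (rng.randrange(partitions), rng.randrange(copies))
        if hit in defective:
            continue  # this copy was already dead; the defect changes nothing
        defective.add(hit)
        working[hit[0]] -= 1
        if working[hit[0]] == 0:
            return count


def mean_defects_to_failure(partitions, copies, trials=2000, seed=0):
    """Average number of defects survived, over seeded random trials."""
    rng = random.Random(seed)
    total = sum(defects_to_failure(partitions, copies, rng) for _ in range(trials))
    return total / trials


for copies in (1, 2, 3):
    mdf = mean_defects_to_failure(partitions=12, copies=copies)
    print(copies, round(mdf, 2), round(spf(mdf, area_overhead=copies), 2))
```

With no spares the first defect is always fatal, so the unprotected design has SPF = 1 by construction; that gives a baseline against which the spared configurations can be compared.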

System Recovery
Add a Recovery Pointer to each input buffer
Recovery pointers advance 4 cycles after the input controller grants the requesting output channel
– Guarantees that the flit has been CRC checked
On error detection:
– All CRC checkers drop outgoing flits
– Switch pipeline is flushed
– Head pointers are set to recovery pointers
– Restart execution
[Figure: interconnect switch with a CRC Checker on each routed-flit output; the Error Detection Signal from the checkers drives the Recovery Logic. An input buffer holds flits a, b, c, d, e with Tail, Head and Recovery Head pointers. a: correctly routed flit; b, c: in the switch pipeline; d: next flit to be routed; e: last flit buffered]
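The recovery-pointer mechanism can be modeled with a small cycle-driven sketch. The class `InputBuffer` and its methods are hypothetical names; only the 4-cycle delay and the "set head to recovery pointer" rule come from the slide. The grant schedule in the example is chosen so that the buffer ends up in the state shown in the figure: a confirmed, b and c in flight, d next.

```python
class InputBuffer:
    """Toy model of one input buffer with head and recovery pointers."""

    CHECK_DELAY = 4  # cycles between grant and recovery-pointer advance (from the slide)

    def __init__(self, flits):
        self.flits = list(flits)
        self.head = 0        # next flit to be routed
        self.recovery = 0    # oldest flit not yet known to be CRC checked
        self.pending = []    # cycles at which in-flight flits become confirmed
        self.cycle = 0

    def tick(self, grant: bool):
        """Advance one cycle; if granted, send the flit at the head pointer."""
        self.cycle += 1
        while self.pending and self.pending[0] <= self.cycle:
            self.pending.pop(0)
            self.recovery += 1  # this flit has passed its CRC check
        if grant and self.head < len(self.flits):
            self.pending.append(self.cycle + self.CHECK_DELAY)
            flit = self.flits[self.head]
            self.head += 1
            return flit
        return None

    def on_error(self):
        """Error detected: forget in-flight flits and rewind to the recovery pointer."""
        self.pending.clear()
        self.head = self.recovery


buf = InputBuffer("abcde")
sent = [buf.tick(g) for g in (True, False, False, True, True)]  # a, -, -, b, c
buf.on_error()                  # CRC checker flags an error while b and c are in flight
resent = buf.tick(True)         # execution restarts from flit b
print(sent, resent)
```

Flit a is not resent because its 4-cycle window had already elapsed, while b and c are replayed; this is the forward-progress guarantee the recovery pointer is meant to provide.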

System Diagnosis and Repair
Iterative trial-and-error technique
– Recover to the last correct state of the switch
– For partition i, swap in the spare for the current copy and restart execution
– Error detected? No: continue execution
– Error detected? Yes: if i < # partitions, increase i and repeat; otherwise the defect is fatal
Built-In-Self-Test (BIST)
– For each partition keep automatically generated test vectors in ROM
– Apply test vectors to each partition through scan chains to locate the defective partition
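The iterative trial-and-error loop can be sketched as below. The function `diagnose_and_repair`, the `is_defective` callback and the list-based bookkeeping are hypothetical stand-ins for "replay from the last correct state and watch the CRC checkers". Undoing a swap that did not remove the error is an assumption of this sketch; the slide does not say what happens to an unhelpful swap.

```python
def diagnose_and_repair(active, spares_left, is_defective):
    """Try swapping in a spare for each partition in turn until replay is error free.

    active[i]       index of the copy partition i currently uses
    spares_left[i]  unused spares remaining for partition i
    is_defective    callback (partition, copy) -> bool, a stand-in for hardware replay
    Returns the repaired partition index, or None for a fatal defect.
    """
    def replay_has_error():
        return any(is_defective(i, copy) for i, copy in enumerate(active))

    for i in range(len(active)):
        if spares_left[i] == 0:
            continue                  # nothing left to swap in for this partition
        previous = active[i]
        active[i] = previous + 1      # swap in the spare and replay
        if not replay_has_error():
            spares_left[i] -= 1
            return i                  # error gone: partition i was the culprit
        active[i] = previous          # assumption: undo a swap that did not help
    return None                       # every candidate tried: fatal defect


defects = {(2, 0)}                    # hypothetical defect in partition 2, original copy
active = [0, 0, 0, 0]
spares = [1, 1, 1, 1]
fixed = diagnose_and_repair(active, spares, lambda p, c: (p, c) in defects)
print(fixed, active, spares)
```

The loop needs no test vectors or scan chains, which is the cost advantage over BIST; the price is up to one replay per partition, acceptable because diagnosis runs only in the rare event of an error.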

Exploring Defect-Tolerant CMP Switch Designs
How do these techniques affect the system's lifetime?
[Figure: design-space scatter plot separating Pareto optimal designs from Pareto sub-optimal designs; one axis runs toward cheaper designs, the other toward more robust designs. Several partition counts and SPF values are missing from the transcript and are marked [missing]. Labeled design points:]
– 12 partitions (cmps), 2/5 spare input controllers, 1 spare per cmp. (rest), iterative replay: Area = 1.76X, SPF = [missing]
– [missing] partitions, 2 spares per partition, iterative replay: Area = 3.4X, SPF = [missing]
– [missing] partitions, 1 spare per partition, Built-In-Self-Test: Area = 3.16X, SPF = [missing]
– [missing] partitions, 1 spare per partition, iterative replay: Area = 2.3X, SPF = [missing]
– [missing] partitions (cmps), TMR: Area = 3.04X, SPF = 1.54

“Bathtub Curve”: A model for semiconductor hard failures
The lifetime failure rate for semiconductor systems follows what is known as the bathtub curve
Trend for future process technologies:
– Failure rate during the grace period gets larger
– Breakdown period starts earlier in the system's lifetime
[Figure: failure rate (FIT) versus time, showing the infant period, grace period and breakdown period, with a second curve for future process technologies]

System Lifetime – A Post 65nm Technology Case Scenario
[Figure: failure rate (FIT) over the system lifetime for several designs. Some values are missing from the transcript and are marked [missing]. Labeled curves:]
– TMR, SPF = 1.54
– 3/5 spare IC, 1 spare rest, SPF = 3.01
– [missing] spare, SPF = [missing]
– [missing] spares, SPF = [missing]
Annotation: [missing] defect every two years

Conclusions – Future Work
Conclusions
– Traditional mechanisms are insufficient for tolerating moderate numbers of defects
– Domain-specific techniques along with resource sparing, iterative diagnosis and reconfiguration are more effective
– Modest-sized partitions are the most effective granularity at which to apply redundancy
Future Work
– Use of spare components based on component wear-out profiles
– Explore low-cost defect-tolerant techniques for microprocessors

Questions?