Oct. 2007Defect Avoidance and CircumventionSlide 1 Fault-Tolerant Computing Dealing with Low-Level Impairments.

Slides:



Advertisements
Similar presentations
Interconnect Testing in Cluster Based FPGA Architectures Research by Ian G.Harris and Russel Tessier University of Massachusetts. Presented by Alpha Oumar.
Advertisements

Survey of Detection, Diagnosis, and Fault Tolerance Methods in FPGAs
Defect Tolerance for Yield Enhancement of FPGA Interconnect Using Fine-grain and Coarse-grain Redundancy Anthony J. YuGuy G.F. Lemieux September 15, 2005.
+ CS 325: CS Hardware and Software Organization and Architecture Internal Memory.
An Integrated ECC and Redundancy Repair Scheme for Memory Reliability Enhancement National Tsing Hua University Hsinchu, Taiwan Chin-Lung Su, Yi-Ting Yeh,
Oct. 2007Fault MaskingSlide 1 Fault-Tolerant Computing Dealing with Low-Level Impairments.
COE 444 – Internetwork Design & Management Dr. Marwan Abu-Amara Computer Engineering Department King Fahd University of Petroleum and Minerals.
REDUNDANT ARRAY OF INEXPENSIVE DISCS RAID. What is RAID ? RAID is an acronym for Redundant Array of Independent Drives (or Disks), also known as Redundant.
CS 795 – Spring  “Software Systems are increasingly Situated in dynamic, mission critical settings ◦ Operational profile is dynamic, and depends.
Programmable Logic Devices
(C) 2005 Daniel SorinDuke Computer Engineering Autonomic Computing via Dynamic Self-Repair Daniel J. Sorin Department of Electrical & Computer Engineering.
Oct. 2007State-Space ModelingSlide 1 Fault-Tolerant Computing Motivation, Background, and Tools.
Defect Tolerance for Yield Enhancement of FPGA Interconnect Using Fine-grain and Coarse-grain Redundancy Anthony J. Yu August 15, 2005.
Copyright 2005, Agrawal & BushnellVLSI Test: Lecture 19alt1 Lecture 19alt I DDQ Testing (Alternative for Lectures 21 and 22) n Definition n Faults detected.
Defect Tolerance for Yield Enhancement of FPGA Interconnect Using Fine-grain and Coarse-grain Redundancy Anthony J. Yu August 15, 2005.
Robust Low Power VLSI ECE 7502 S2015 Burn-in/Stress Test for Reliability: Reducing burn-in time through high-voltage stress test and Weibull statistical.
Time-Dependent Failure Models
Oct State-Space Modeling Slide 1 Fault-Tolerant Computing Motivation, Background, and Tools.
Nov Malfunction Diagnosis and Tolerance Slide 1 Fault-Tolerant Computing Dealing with Mid-Level Impairments.
Oct Error Correction Slide 1 Fault-Tolerant Computing Dealing with Mid-Level Impairments.
Oct Combinational Modeling Slide 1 Fault-Tolerant Computing Motivation, Background, and Tools.
Nov. 2006Reconfiguration and VotingSlide 1 Fault-Tolerant Computing Hardware Design Methods.
ENGIN112 L38: Programmable Logic December 5, 2003 ENGIN 112 Intro to Electrical and Computer Engineering Lecture 38 Programmable Logic.
Oct Terminology, Models, and Measures Slide 1 Fault-Tolerant Computing Motivation, Background, and Tools.
DS -V - FDT - 1 HUMBOLDT-UNIVERSITÄT ZU BERLIN INSTITUT FÜR INFORMATIK Zuverlässige Systeme für Web und E-Business (Dependable Systems for Web and E-Business)
Jan. 2015Part II – Defects: Physical ImperfectionsSlide 1.
Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 26: April 18, 2007 Et Cetera…
7. Fault Tolerance Through Dynamic or Standby Redundancy 7.6 Reconfiguration in Multiprocessors Focused on permanent and transient faults detection. Three.
Oct Fault Masking Slide 1 Fault-Tolerant Computing Dealing with Low-Level Impairments.
Oct Fault Testing Slide 1 Fault-Tolerant Computing Dealing with Low-Level Impairments.
Susmit Biswas A Pageable Defect Tolerant Nanoscale Memory System Susmit Biswas, Tzvetan S. Metodi, Frederic T. Chong, Ryan Kastner
FPGA Defect Tolerance: Impact of Granularity Anthony YuGuy Lemieux December 14, 2005.
7. Fault Tolerance Through Dynamic or Standby Redundancy 7.5 Forward Recovery Systems Upon the detection of a failure, the system discards the current.
Oct Defect Avoidance and Circumvention Slide 1 Fault-Tolerant Computing Dealing with Low-Level Impairments.
Nov Malfunction Diagnosis and ToleranceSlide 1 Fault-Tolerant Computing Dealing with Mid-Level Impairments.
Nov. 2007Reconfiguration and VotingSlide 1 Fault-Tolerant Computing Hardware Design Methods.
BIST vs. ATPG.
Oct. 2007Combinational ModelingSlide 1 Fault-Tolerant Computing Motivation, Background, and Tools.
CS 151 Digital Systems Design Lecture 38 Programmable Logic.
EE 587 SoC Design & Test Partha Pande School of EECS Washington State University
1 Reliability Application Dr. Jerrell T. Stracener, SAE Fellow Leadership in Engineering EMIS 7370/5370 STAT 5340 : PROBABILITY AND STATISTICS FOR SCIENTISTS.
Chapter 6 RAID. Chapter 6 — Storage and Other I/O Topics — 2 RAID Redundant Array of Inexpensive (Independent) Disks Use multiple smaller disks (c.f.
CSE 321b Computer Organization (2) تنظيم الحاسب (2) 3 rd year, Computer Engineering Winter 2015 Lecture #4 Dr. Hazem Ibrahim Shehata Dept. of Computer.
2. Fault Tolerance. 2 Fault - Error - Failure Fault = physical defect or flow occurring in some component (hardware or software) Error = incorrect behavior.
Power Reduction for FPGA using Multiple Vdd/Vth
Software Reliability SEG3202 N. El Kadri.
Example 5.8 Non-logistics Network Models | 5.2 | 5.3 | 5.4 | 5.5 | 5.6 | 5.7 | 5.9 | 5.10 | 5.10a a Background Information.
Testing of integrated circuits and design for testability J. Christiansen CERN - EP/MIC
Session objectives Discuss whether or not virtualization makes sense for Exchange 2013 Describe supportability of virtualization features Explain sizing.
A Lightweight Fault-Tolerant Mechanism for Network-on-Chip
©Ian Sommerville 2004Software Engineering, 7th edition. Chapter 22 Slide 1 Software Verification, Validation and Testing.
Background Gaussian Elimination Fault Tolerance Single or multiple core failures: Single or multiple core additions: Simultaneous core failures and additions:
RDIS: A Recursively Defined Invertible Set Scheme to Tolerate Multiple Stuck-At Faults in Resistive Memory Rami Melhem, Rakan Maddah and Sangyeun cho Computer.
Field Programmable Gate Arrays (FPGAs) An Enabling Technology.
L i a b l eh kC o m p u t i n gL a b o r a t o r y Test Economics for Homogeneous Manycore Systems Lin Huang† and Qiang Xu†‡ †CUhk REliable computing laboratory.
CprE 458/558: Real-Time Systems
Section 1  Quickly identify faulty components  Design new, efficient testing methodologies to offset the complexity of FPGA testing as compared to.
Copyright © 2010 Houman Homayoun Houman Homayoun National Science Foundation Computing Innovation Fellow Department of Computer Science University of California.
In-Place Decomposition for Robustness in FPGA Ju-Yueh Lee, Zhe Feng, and Lei He Electrical Engineering Dept., UCLA Presented by Ju-Yueh Lee Address comments.
Design For Manufacturability in Nanometer Era
Defect-tolerant FPGA Switch Block and Connection Block with Fine-grain Redundancy for Yield Enhancement Anthony J. YuGuy G.F. Lemieux August 25, 2005.
A Survey of Fault Tolerant Methodologies for FPGA’s Gökhan Kabukcu
CS203 – Advanced Computer Architecture Dependability & Reliability.
I/O Errors 1 Computer Organization II © McQuain RAID Redundant Array of Inexpensive (Independent) Disks – Use multiple smaller disks (c.f.
ELEC Digital Logic Circuits Fall 2014 Logic Testing (Chapter 12)
CSE 451: Operating Systems Autumn 2010 Module 19 Redundant Arrays of Inexpensive Disks (RAID) Ed Lazowska Allen Center 570.
Fault-Tolerant Clustering for FPGAs
RECONFIGURABLE NETWORK ON CHIP ARCHITECTURE FOR AEROSPACE APPLICATIONS
Part II – Defects: Physical Imperfections
Seminar on Enterprise Software
Presentation transcript:

Oct. 2007Defect Avoidance and CircumventionSlide 1 Fault-Tolerant Computing Dealing with Low-Level Impairments

Oct. 2007Defect Avoidance and CircumventionSlide 2 About This Presentation EditionReleasedRevised FirstOct. 2006Oct This presentation has been prepared for the graduate course ECE 257A (Fault-Tolerant Computing) by Behrooz Parhami, Professor of Electrical and Computer Engineering at University of California, Santa Barbara. The material contained herein can be used freely in classroom teaching or any other educational setting. Unauthorized uses are prohibited. © Behrooz Parhami

Oct. 2007Defect Avoidance and CircumventionSlide 3 Defect Avoidance and Circumvention

Oct. 2007Defect Avoidance and CircumventionSlide 4

Oct. 2007Defect Avoidance and CircumventionSlide 5 Multilevel Model Component Logic Service Result Information System Legend: Tolerance Entry

Oct. 2007Defect Avoidance and CircumventionSlide 6 The Manufacturing Process for an IC Part

Oct. 2007Defect Avoidance and CircumventionSlide 7 The dramatic decrease in yield with larger dies Effect of Die Size on Yield Die yield = def (Number of good dies) / (Total number of dies) Die yield = Wafer yield  [1 + (Defect density  Die area) / a] –a Die cost = (Cost of wafer) / (Total number of dies  Die yield) = (Cost of wafer)  (Die area / Wafer area) / (Die yield) The parameter a ranges from 3 to 4 for modern CMOS processes Shown are some random defects; there are also bulk or clustered defects that affect a large region

Oct. 2007Defect Avoidance and CircumventionSlide 8 Effects of Yield on Testing and Part Reliability Die yield = assume 50% Out of 2,000,000 dies manufactured,  1,000,000 are defective To achieve the goal of 100 defects per million (DPM) in parts shipped, we must catch 999,900 of the 1,000,000 defective parts Therefore, we need a test coverage of 99.99% Testing is imperfect: missed faults (coverage), false positives Going from a coverage of 99.9% to 99.99% involves a significant investment in test development and application times False positives are not a source of difficulty in this context Discarding another 1-2% due to false positives in testing does not change the scale of the loss

Oct. 2007Defect Avoidance and CircumventionSlide 9 Examples of Random Defects in ICs Resistive open due to unfilled via [R. Madge et al., IEEE D&T, 2003] Particle embedded between layers

Oct. 2007Defect Avoidance and CircumventionSlide 10 Defect Modeling Extra-material defects are modeled as circular areas Pinhole defects are tiny breaches in the dielectric From:

Oct. 2007Defect Avoidance and CircumventionSlide 11 Sensitivity of Layouts to Defects Extra material VLSI layout must be done with defect patterns and their impacts in mind A balance must be struck with regard to sensitivity to different defect types Missing material Actual photo of a missing-material defect Killer defect Latent defect

Oct. 2007Defect Avoidance and CircumventionSlide 12 The Bathtub Curve Many components fail early on because of residual or latent defects Components may also wear out due to aging (less so for electronics) In between the two high-mortality regions lies the useful life period Time Failure rate Infant mortality End-of-life wearout Useful life (low, constant failure rate) Mechanical Electronic Primarily due to latent defects

Oct. 2007Defect Avoidance and CircumventionSlide 13 Survival Probability of Electronic Components From: Infant mortality Time in years Percent of parts still working No significant wear-out Bathtub curve

Oct. 2007Defect Avoidance and CircumventionSlide 14 Burn-In and Stress Testing From: Time in years Percent of parts still working Burn-in and stress tests are done in accelerated form Difficult to perform on complex and delicate ICs without damaging good parts Expensive “ovens” are required

Oct. 2007Defect Avoidance and CircumventionSlide 15 Defect Avoidance vs. Circumvention Defect Avoidance Defect awareness in design, particularly layout and routing Extensive quality control during the manufacturing process Comprehensive screening, including burn-in and stress tests Defect Circumvention (Removal) Built-in dynamic redundancy on the die or wafer Identification of defective parts (visual inspection, testing, association) Bypassing or reconfiguration via embedded switches Defect Circumvention (Tolerance) Built-in static redundancy on the die or wafer Identification of defective parts (external test or self-test) Adjustment or tuning of redundant structures

Oct. 2007Defect Avoidance and CircumventionSlide 16 Defect Bypassing via Reconfiguration Works best when the system on die has regular, repetitive structure: Memory FPGA Multicore chip CMP (chip multiprocessor) Irregular (random) logic implies greater redundancy due to replication: Replicated structures must not be close to each other They should not be very far either (wiring/switching overhead)

Oct. 2007Defect Avoidance and CircumventionSlide 17 Peripheral reconfiguration elements Defects in Memory Arrays Defect circumvention (removal) Provide several extra (spare) rows and/or columns Route external connections to defect-free rows and columns Spare rows Memory array Memory array Defective row Defective column Defect circumvention (tolerance) Error-correcting code With m rows and s spares, can model as m-out-of-(m + s) Somewhat more complex with both spare rows and columns (still combinational, though) Modeling with coded scheme to be discussed at the info level Methods in use since the 1970s; e.g., IBM’s defect-tolerant chip Spare columns

Oct. 2007Defect Avoidance and CircumventionSlide 18 Yield Improvement in Memory Arrays Example of IBM’s experimental 16 Mb memory chip Combines the use of spare rows/columns in memory arrays with ECC Four quadrants, each with 16 spare rows & 24 spare columns ECC corrects any single error via 9 check bits (137 data bits) Bits assigned to the same word are separated by 8 bit positions Avg. number of failing cells per chip Yield ECC only Spares only ECC and spares

Oct. 2007Defect Avoidance and CircumventionSlide 19 Defects in FPGAs Defect circumvention (removal) Provide several extra (spare) CLBs, I/O blocks, and connections Route external connections to available blocks Defect circumvention (tolerance) Not applicable

Oct. 2007Defect Avoidance and CircumventionSlide 20 Defects in Multicore Chips or CMPs Defect circumvention (removal) Similar to FPGAs, except that processors are the replacement entities Interprocessor interconnection network is the main challenge Will discuss the switching and reconfiguration aspects in more detail when we get to the malfunction level in our multilevel model

Oct. 2007Defect Avoidance and CircumventionSlide 21 Circumventing Defects in Processor Arrays Extensive research done on how to salvage a working array from one that has been damaged by defects Proposed methods differ in Types and placement of switches (e.g., 4-port, single/double-track) Types and placement of spares Algorithms for determining working configurations Ways of effecting reconfiguration Methods of assessing resilience The next few slides show some methods based on 4-port, 2-state switches

Oct. 2007Defect Avoidance and CircumventionSlide 22 Defect Tolerance Schemes for Linear Arrays A linear array with a spare processor and embedded switching A linear array with a spare processor and reconfiguration switches

Oct. 2007Defect Avoidance and CircumventionSlide 23 Defect Tolerance in 2D Arrays Two types of reconfiguration switching for 2D arrays Assumption: A defective unit can be bypassed in its row/column by means of a separate switching mechanism (not shown)

Oct. 2007Defect Avoidance and CircumventionSlide 24 A Reconfiguration Scheme for 2D Arrays A 5  5 working array salvaged from a 6  6 redundant mesh through reconfiguration switching Seven defective processors in a 5  5 array and their associated compensation paths

Oct. 2007Defect Avoidance and CircumventionSlide 25 Limits of Reconfigurability A set of three defective nodes, one of which cannot be accommodated by the compensation-path method. Extension: We can go beyond the 3-defect limit by providing spare rows on top and bottom and spare columns on either side

Oct. 2007Defect Avoidance and CircumventionSlide 26 Combinational Modeling Pessimistic/Easy: Any 3 bad cells lead to failure Model m  m array as (m 2 – 2)-out-of-m 2 system Realistic/Hard: Enumerate all combinations of bad cells that lead to failure and assess the probability of at least one of these combinations occurring