Oct. 2006 Defect Avoidance and Circumvention Slide 1 Fault-Tolerant Computing Dealing with Low-Level Impairments.

Presentation transcript:

Oct Defect Avoidance and Circumvention Slide 1 Fault-Tolerant Computing Dealing with Low-Level Impairments

Slide 2: About This Presentation
Edition: First. Released: Oct
This presentation has been prepared for the graduate course ECE 257A (Fault-Tolerant Computing) by Behrooz Parhami, Professor of Electrical and Computer Engineering at University of California, Santa Barbara. The material contained herein can be used freely in classroom teaching or any other educational setting. Unauthorized uses are prohibited. © Behrooz Parhami

Slide 3: Defect Avoidance and Circumvention

Slide 4

Slide 5: Multilevel Model
[Figure: multilevel model. Labels: Component, Logic, Service, Result, Information, System. Legend: Tolerance, Entry]

Slide 6: The Manufacturing Process for an IC Part

Slide 7: Effect of Die Size on Yield
Die yield is defined as (number of good dies) / (total number of dies).
\[ \text{Die yield} = \text{Wafer yield} \times \left[ 1 + \frac{\text{Defect density} \times \text{Die area}}{a} \right]^{-a} \]
\[ \text{Die cost} = \frac{\text{Cost of wafer}}{\text{Total number of dies} \times \text{Die yield}} = \frac{\text{Cost of wafer} \times (\text{Die area} / \text{Wafer area})}{\text{Die yield}} \]
The parameter a ranges from 3 to 4 for modern CMOS processes.
[Figure: the dramatic decrease in yield with larger dies] Shown are some random defects; there are also bulk or clustered defects that affect a large region.
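
The yield and cost formulas on this slide can be tried numerically. The sketch below is only an illustration: the function names and the input values (defect density of 0.5 per cm^2, a = 4, perfect wafer yield) are assumptions chosen for the example and do not come from the slides.

```python
def die_yield(wafer_yield, defect_density, die_area, a):
    """Die yield = wafer_yield * [1 + (defect_density * die_area) / a] ** (-a)."""
    return wafer_yield * (1.0 + defect_density * die_area / a) ** (-a)


def die_cost(wafer_cost, wafer_area, die_area, yield_):
    """Die cost, approximating dies per wafer by the area ratio as on the slide."""
    dies_per_wafer = wafer_area / die_area
    return wafer_cost / (dies_per_wafer * yield_)


# Hypothetical inputs: perfect wafer yield, 0.5 defects per cm^2, a = 4.
for area_cm2 in (0.5, 1.0, 2.0, 4.0):
    y = die_yield(1.0, 0.5, area_cm2, a=4)
    print(f"die area {area_cm2:.1f} cm^2 -> die yield {y:.3f}")
```

With these assumed inputs the yield falls as the die area grows, which is the trend the slide's figure illustrates; because cost divides by yield, die cost rises faster than die area.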

Slide 8: Effects of Yield on Testing and Part Reliability
Assume a die yield of 50%.
Out of 2,000,000 dies manufactured, about 1,000,000 are defective.
To achieve the goal of 100 defects per million (DPM) in parts shipped, we must catch 999,900 of the 1,000,000 defective parts.
Therefore, we need a test coverage of 99.99%.
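
The arithmetic on this slide can be reproduced with a few lines of code. This is a sketch with a hypothetical function name; it solves for the number of defective parts that may escape testing so that escapes / (good parts + escapes) meets the DPM target, which the slide rounds to 100 escapes.

```python
def required_test_coverage(total_dies, die_yield, target_dpm):
    """Fraction of defective dies a test must catch to meet a shipped-DPM target."""
    good = total_dies * die_yield
    defective = total_dies - good
    t = target_dpm / 1e6
    # Shipped parts = good + escapes; require escapes / (good + escapes) = t.
    escapes = good * t / (1.0 - t)
    return 1.0 - escapes / defective


coverage = required_test_coverage(2_000_000, 0.5, 100)
print(f"required test coverage: {coverage:.4%}")
```

The exact figure differs from the slide's 99.99% only far beyond the displayed precision, since the slide treats the shipped population as one million parts.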

Slide 9: Examples of Random Defects in ICs
[Figure: resistive open due to unfilled via] [R. Madge et al., IEEE D&T, 2003]
[Figure: particle embedded between layers]

Slide 10: Defect Modeling
Extra-material defects are modeled as circular areas.
Pinhole defects are tiny breaches in the dielectric.

Slide 11: Sensitivity of Layouts to Defects
VLSI layout must be done with defect patterns and their impacts in mind.
A balance must be struck with regard to sensitivity to different defect types.
[Figure: extra-material and missing-material defects on a layout, with an actual photo of a missing-material defect; labels: killer defect, latent defect]

Slide 12: The Bathtub Curve
Many components fail early on because of residual or latent defects.
Components may also wear out due to aging (less so for electronics).
In between the two high-mortality regions lies the useful life period.
[Figure: failure rate vs. time for mechanical and electronic components; regions: infant mortality (primarily due to latent defects), useful life (low, constant failure rate), end-of-life wearout]

Slide 13: Survival Probability of Electronic Components
[Figure: percent of parts still working vs. time in years; annotations: infant mortality, no significant wear-out]

Slide 14: Burn-In and Stress Testing
Burn-in and stress tests are done in accelerated form.
Difficult to perform on complex and delicate ICs without damaging good parts.
Expensive “ovens” are required.
[Figure: percent of parts still working vs. time in years]

Slide 15: Defect Avoidance vs. Circumvention
Defect avoidance:
  Defect awareness in design, particularly layout and routing
  Extensive quality control during the manufacturing process
  Comprehensive screening, including burn-in and stress tests
Defect circumvention (removal):
  Built-in dynamic redundancy on the die or wafer
  Identification of defective parts (visual inspection, testing, association)
  Bypassing or reconfiguration via embedded switches
Defect circumvention (tolerance):
  Built-in static redundancy on the die or wafer
  Identification of defective parts (external test or self-test)
  Adjustment or tuning of redundant structures

Slide 16: Defect Bypassing via Reconfiguration
Works best when the system on die has regular, repetitive structure:
  Memory
  FPGA
  Multicore chip
  CMP (chip multiprocessor)
Irregular (random) logic implies greater redundancy due to replication:
  Replicated structures must not be close to each other
  They should not be very far either (wiring/switching overhead)

Slide 17: Defects in Memory Arrays
Defect circumvention (removal):
  Provide several extra (spare) rows and/or columns
  Route external connections to defect-free rows and columns
Defect circumvention (tolerance):
  Error-correcting code
With m rows and s spares, can model as m-out-of-(m + s).
Somewhat more complex with both spare rows and columns (still combinational, though).
Modeling with coded scheme to be discussed at the info level.
Methods in use since the 1970s; e.g., IBM’s defect-tolerant chip.
[Figure: memory arrays with spare rows, spare columns, a defective row, a defective column, and peripheral reconfiguration elements]
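
The m-out-of-(m + s) model mentioned on this slide can be sketched as a binomial sum. The sketch assumes that each row is defect-free independently with the same probability, and it ignores spare columns; neither assumption is stated in the slides, and the function name and sample numbers are hypothetical.

```python
from math import comb


def array_yield(m, s, p_row_good):
    """Probability that at least m of the m + s rows are defect-free.

    Assumes independent, identically distributed row defects (an
    illustrative assumption, not something the slides specify).
    """
    n = m + s
    return sum(
        comb(n, k) * p_row_good**k * (1.0 - p_row_good) ** (n - k)
        for k in range(m, n + 1)
    )


# Hypothetical example: 1024 required rows, each good with probability 0.999.
for spares in (0, 2, 4, 8):
    print(f"{spares} spare rows -> yield {array_yield(1024, spares, 0.999):.4f}")
```

Under this model the yield never decreases as spare rows are added, since more defective rows can be replaced.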

Slide 18: Yield Improvement in Memory Arrays
Example of IBM’s experimental 16 Mb memory chip:
  Combines the use of spare rows/columns in memory arrays with ECC
  Four quadrants, each with 16 spare rows and 24 spare columns
  ECC corrects any single error via 9 check bits (137 data bits)
  Bits assigned to the same word are separated by 8 bit positions
[Figure: yield vs. average number of failing cells per chip; curves: ECC only, spares only, ECC and spares]
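
As a sanity check on the 9 check bits quoted for 137 data bits, the Hamming bound for single-error correction can be computed directly. The sketch below uses a hypothetical function name. Reading the ninth bit as an added overall parity bit for double-error detection is an assumption; the slide does not say how the code is constructed.

```python
def min_check_bits_sec(data_bits):
    """Smallest r with 2**r >= data_bits + r + 1 (Hamming bound for SEC)."""
    r = 1
    while 2**r < data_bits + r + 1:
        r += 1
    return r


sec = min_check_bits_sec(137)
# Assumption: one extra overall parity bit extends SEC to SEC-DED.
sec_ded = sec + 1
print(f"137 data bits: SEC needs {sec} check bits, SEC-DED needs {sec_ded}")
```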

Slide 19: Defects in FPGAs
Defect circumvention (removal):
  Provide several extra (spare) CLBs, I/O blocks, and connections
  Route external connections to available blocks
Defect circumvention (tolerance):
  Not applicable

Slide 20: Defects in Multicore Chips or CMPs
Defect circumvention (removal):
  Similar to FPGAs, except that processors are the replacement entities
  Interprocessor interconnection network is the main challenge
Will discuss the switching and reconfiguration aspects in more detail when we get to the malfunction level in our multilevel model.

Slide 21: Circumventing Defects in Processor Arrays

Slide 22: Defect Tolerance Schemes for Linear Arrays
[Figure: a linear array with a spare processor and embedded switching]
[Figure: a linear array with a spare processor and reconfiguration switches]
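
The idea of bypassing a defective processor in a linear array with spares can be sketched abstractly. This is a simplified model, not the switching scheme drawn on the slide: it assumes any defective processor can be bypassed and that the switches themselves never fail. All names are hypothetical.

```python
def reconfigure_linear(num_physical, defective, num_needed):
    """Map logical positions 0..num_needed-1 to healthy physical processors.

    Returns the physical indices in array order, or None when too few
    healthy processors remain. Assumes ideal bypass switches.
    """
    healthy = [i for i in range(num_physical) if i not in defective]
    if len(healthy) < num_needed:
        return None
    return healthy[:num_needed]


# Four working processors needed, five built (one spare), processor 2 defective.
print(reconfigure_linear(5, {2}, 4))
# Two defects exceed what a single spare can absorb.
print(reconfigure_linear(5, {1, 3}, 4))
```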

Slide 23: Defect Tolerance in 2D Arrays
[Figure: two types of reconfiguration switching for 2D arrays]
Assumption: A defective unit can be bypassed in its row/column by means of a separate switching mechanism (not shown).

Slide 24: A Reconfiguration Scheme for 2D Arrays
[Figure: a 5 × 5 working array salvaged from a 6 × 6 redundant mesh through reconfiguration switching]
[Figure: seven defective processors in a 5 × 5 array and their associated compensation paths]

Slide 25: Limits of Reconfigurability
[Figure: a set of three defective nodes, one of which cannot be accommodated by the compensation-path method]
Extension: We can go beyond the 3-defect limit by providing spare rows on top and bottom and spare columns on either side.