DS - VI - FTM - 0 HUMBOLDT-UNIVERSITÄT ZU BERLIN INSTITUT FÜR INFORMATIK Dependable Systems Lecture 6 FAULT-TOLERANT AND FAULT-SECURE MEMORIES Winter Semester.

Presentation transcript:

DS - VI - FTM - 0 HUMBOLDT-UNIVERSITÄT ZU BERLIN INSTITUT FÜR INFORMATIK Dependable Systems Lecture 6 FAULT-TOLERANT AND FAULT-SECURE MEMORIES Winter Semester 2002/03 Instructor: Prof. Dr. Miroslaw Malek

DS - VI - FTM - 1 Fault-tolerant and Fault-secure Memories
Objectives:
– To study techniques of fault-tolerant and fault-secure memory design used in memory manufacturing and applications
Contents:
– Fault-tolerant techniques in manufacturing
– Replication
– Codes
– Reconfiguration

DS - VI - FTM - 2 Fault-tolerant Technique in Memory Manufacturing (overhead from 2% to 10%, depending on expected failure density)
– A number of spare rows and/or columns are added to the chip.
– Polysilicon fuses in the decoding circuitry are selectively blown to allow addressing of the spare rows and columns.
– Two methods exist for blowing fuses:
  – By focusing a laser on a given fuse for about one second
  – By applying a voltage signal across a highly resistive fuse
– With rapidly increasing chip densities, the use of redundancy is standard among memory manufacturers.

DS - VI - FTM - 3 Fault-tolerant Memories (overhead from 2% to 200%)
– Identical copies of memory are used to mask erroneous results.
– Replication is usually implemented at the module level to minimize the number of voters needed to determine the correct output, and may consist of static or dynamic redundancy, or a combination of both:
  – Duplex
  – Half-duplex (two halves of memory are encoded into a third half residing in a back-up module, such that the original data may be recovered if one of the three modules fails)
  – N-modular redundancy (usually triple modular redundancy)
– Additional hardware includes:
  – A voter across the memory units
  – Or a disagreement detector
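The voter and the disagreement detector named on this slide can be illustrated with a minimal sketch. It assumes each memory module returns its word as an integer; the function names `majority_vote` and `disagreement` are hypothetical and chosen only for this illustration.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority over the words read from three modules (TMR)."""
    return (a & b) | (a & c) | (b & c)


def disagreement(a: int, b: int) -> bool:
    """Duplex check: True if the two copies differ (detection only, no masking)."""
    return a != b


good = 0b1011_0010
faulty = good ^ 0b0000_1000      # one module delivers a word with one flipped bit

voted = majority_vote(good, faulty, good)
print(voted == good)               # the single faulty module is outvoted
print(disagreement(good, faulty))  # a duplex pair only reports the mismatch
```

Voting bit by bit means a faulty module is masked even if different modules fail in different bit positions, as long as no bit position is wrong in two modules at once.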

DS - VI - FTM - 4 Fault-tolerant Memories (Continued)
Exemplary systems with replicated memories include:
– STAR (self-testing and self-repairing computer)
– FTMP (fault-tolerant multiprocessor)
– SIFT (software-implemented fault tolerance computer)
– COMTRAC (computer-aided traffic control system)
– (4,2) concept (communication controller with four processors and duplicated memory, from Philips)
– Stratus (commercial fault-tolerant system)
– 3B20 from AT&T (commercial fault-tolerant system)

DS - VI - FTM - 5 Memory Codes
Parity codes:
– Even parity
– Odd parity (better coverage, since all-0s or all-1s errors can be detected in any word with an even number of bits)
– Byte parity (a parity bit is appended to every 7 or 8 bits)
– Interlaced parity
– Chip-wide parity
– Two-dimensional parity
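The coverage remark about odd parity can be checked with a small sketch. It assumes 7 data bits plus one parity bit, so the stored word has an even number of bits; `parity_bit` and `passes_check` are hypothetical helper names.

```python
def parity_bit(data: int, odd: bool = False) -> int:
    """Parity bit to store alongside `data` (even parity by default)."""
    ones = bin(data).count("1")
    even_bit = ones & 1            # makes the total number of 1s even
    return even_bit ^ 1 if odd else even_bit


def passes_check(data: int, stored_parity: int, odd: bool = False) -> bool:
    """True if the stored parity bit matches the data that was read back."""
    return parity_bit(data, odd) == stored_parity


# 7 data bits + 1 parity bit = 8-bit stored word (an even number of bits).
# A fault forces the whole stored word to all 0s or to all 1s:
for data, parity in ((0b0000000, 0), (0b1111111, 1)):
    even_ok = passes_check(data, parity, odd=False)  # even parity accepts the word
    odd_ok = passes_check(data, parity, odd=True)    # odd parity rejects the word
    print(even_ok, odd_ok)
```

Under even parity both stuck patterns look like valid words, which is why the slide credits odd parity with better coverage for this case.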

DS - VI - FTM - 6 Chip-wide Parity Method

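DS - VI - FTM - 7 Two-dimensional Parity Method
[Figure: an array of k words with n bits per word. A row parity register flags a parity error for one word (No, Yes, No) and a column parity register flags a parity error for one bit column (No, No, Yes, No, No), together with an overall parity check bit. The flagged row and column intersect at the erroneous bit.]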
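How the row and column parity registers locate a single erroneous bit can be sketched as follows. This is an illustration under assumptions: even parity is used for both registers, words are plain integers, column 0 is the least significant bit, and all function names are hypothetical.

```python
from functools import reduce
from operator import xor


def word_parity(word: int) -> int:
    """Even-parity bit of one word (one entry of the row parity register)."""
    return bin(word).count("1") & 1


def make_parities(words):
    """Return (row parity list, column parity word) for a block of words."""
    rows = [word_parity(w) for w in words]
    cols = reduce(xor, words, 0)       # bit j is the parity of column j
    return rows, cols


def locate_single_error(words, rows, cols, n_bits):
    """Return (row, column) of a single flipped bit, or None otherwise."""
    bad_rows = [i for i, w in enumerate(words) if word_parity(w) != rows[i]]
    diff = reduce(xor, words, 0) ^ cols
    bad_cols = [j for j in range(n_bits) if (diff >> j) & 1]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    return None


words = [0b10110, 0b01101, 0b11100]    # k = 3 words, n = 5 bits per word
rows, cols = make_parities(words)
words[1] ^= 1 << 2                     # inject a single-bit error
hit = locate_single_error(words, rows, cols, n_bits=5)
print(hit)
words[hit[0]] ^= 1 << hit[1]           # flip the located bit back
```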

DS - VI - FTM - 8 Hamming Codes
– Hamming codes provide error detection as well as error correction in a b-bit word.
– About log2 b check bits are generated (the smallest r with 2^r ≥ b + r + 1); their values identify the erroneous bit if a single-bit error occurs.
– As an example, a (7, 4) single-error-correcting Hamming code is shown. There are seven bits in total, four of which are data bits.
– Even though the code requires additional hardware and degrades memory speed (due to encoding and decoding of the check bits), it often increases the mean time between failures (MTBF) of the memory by orders of magnitude or more, a tradeoff which is often accepted.
– Hamming codes may be extended to provide k-error correction and 2k-error detection, but such modifications require even greater hardware and software overheads.
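The number of check bits can be worked out with a short sketch. It relies on the standard Hamming bound for single-error correction, 2^r ≥ b + r + 1 for b data bits and r check bits, which is background knowledge rather than something stated on the slide; `sec_check_bits` is a hypothetical name. One extra overall parity bit turns the code into SEC-DED.

```python
def sec_check_bits(b: int) -> int:
    """Smallest r such that r check bits can point at any of b + r bit positions."""
    r = 0
    while (1 << r) < b + r + 1:
        r += 1
    return r


for b in (4, 8, 16, 32, 64):
    r = sec_check_bits(b)
    print(f"{b:2d} data bits: {r} check bits for SEC, {r + 1} for SEC-DED")
```

For 4 data bits this gives the (7, 4) code, and for 32 and 64 data bits it gives 7 and 8 SEC-DED check bits, which agrees with the word sizes quoted later in the lecture.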

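DS - VI - FTM - 9 Single-error Correction Example for a (7, 4) Hamming Code (Bit d3 Is in Error)
Data bits are $d_1, d_2, d_3, d_4$; check bits are $c_1, c_2, c_3$.
Equations used for syndrome generation:
$s_3 = d_1 \oplus d_2 \oplus d_4 \oplus c_1$
$s_2 = d_1 \oplus d_3 \oplus d_4 \oplus c_2$
$s_1 = d_2 \oplus d_3 \oplus d_4 \oplus c_3$
Parity-check matrix (PCM) applied to the data word from memory, with entries read off from the three equations above:
$$\begin{pmatrix} s_1 \\ s_2 \\ s_3 \end{pmatrix} = \begin{pmatrix} 0 & 0 & 0 & 1 & 1 & 1 & 1 \\ 0 & 1 & 1 & 0 & 0 & 1 & 1 \\ 1 & 0 & 1 & 0 & 1 & 0 & 1 \end{pmatrix} \begin{pmatrix} c_1 & c_2 & d_1 & c_3 & d_2 & d_3 & d_4 \end{pmatrix}^{T} \pmod 2$$
An error-free word gives the syndrome $s_1 s_2 s_3 = 000$. An error in $d_3$ gives $s_1 s_2 s_3 = 110$, the binary index (6) of $d_3$ in the word $c_1 c_2 d_1 c_3 d_2 d_3 d_4$.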
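The syndrome equations of this slide translate directly into a small sketch. The bit order c1 c2 d1 c3 d2 d3 d4 and the three equations are taken from the slide; the function names and the sample data word are assumptions made for illustration.

```python
def encode(d1, d2, d3, d4):
    """Build the 7-bit word c1 c2 d1 c3 d2 d3 d4 so that every syndrome bit is 0."""
    c1 = d1 ^ d2 ^ d4
    c2 = d1 ^ d3 ^ d4
    c3 = d2 ^ d3 ^ d4
    return [c1, c2, d1, c3, d2, d3, d4]


def syndrome(word):
    """Syndrome (s1, s2, s3) computed with the equations from the slide."""
    c1, c2, d1, c3, d2, d3, d4 = word
    s3 = d1 ^ d2 ^ d4 ^ c1
    s2 = d1 ^ d3 ^ d4 ^ c2
    s1 = d2 ^ d3 ^ d4 ^ c3
    return s1, s2, s3


def correct(word):
    """Flip the bit named by the syndrome; return (corrected word, position)."""
    s1, s2, s3 = syndrome(word)
    position = 4 * s1 + 2 * s2 + s3   # 0 means no error, else 1-based bit index
    fixed = list(word)
    if position:
        fixed[position - 1] ^= 1
    return fixed, position


stored = encode(1, 0, 1, 1)
read_back = list(stored)
read_back[5] ^= 1                      # d3 sits at position 6 (index 5)
fixed, position = correct(read_back)
print(syndrome(read_back), position, fixed == stored)
```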

DS - VI - FTM - 10 SEC-DED Memory Design
32-bit error detection and correction unit:
– Corrects all single-bit errors
– Detects all double errors
– Detects some triple errors
– Detection in 32 nsec, correction in 64 nsec
– 7 check bits for a 32-bit word via a modified Hamming code
– May also work on 8-bit bytes
– Built-in diagnostics
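A SEC-DED code with 7 check bits for a 32-bit word can be sketched with a textbook extended Hamming code: six Hamming check bits plus one overall parity bit. This is an assumption made for illustration; the modified Hamming code used by the unit on the slide is not specified and may assign bits differently. All names are hypothetical.

```python
CHECK_POS = [1, 2, 4, 8, 16, 32]                            # Hamming check bits
DATA_POS = [p for p in range(1, 39) if p not in CHECK_POS]  # 32 data positions
# Position 0 holds the overall parity bit (the 7th check bit).


def encode(data: int) -> int:
    """Return a 39-bit code word; bit i of the integer is code position i."""
    word = 0
    for k, p in enumerate(DATA_POS):
        word |= ((data >> k) & 1) << p
    for c in CHECK_POS:                 # make each check group even
        group = sum((word >> p) & 1 for p in range(1, 39) if p & c)
        word |= (group & 1) << c
    word |= bin(word).count("1") & 1    # make the whole word even
    return word


def decode(word: int):
    """Return (status, data); data is None when the error is uncorrectable."""
    syndrome = 0
    for p in range(1, 39):
        if (word >> p) & 1:
            syndrome ^= p               # XOR of the positions of all 1 bits
    odd_overall = bin(word).count("1") & 1
    if odd_overall:                     # odd number of flips: treat as single
        if syndrome > 38:
            return "multiple error", None   # some triple errors end up here
        word ^= 1 << syndrome           # syndrome 0 names the parity bit itself
        status = "corrected"
    elif syndrome:
        return "double error", None     # even number of flips, non-zero syndrome
    else:
        status = "ok"
    data = 0
    for k, p in enumerate(DATA_POS):
        data |= ((word >> p) & 1) << k
    return status, data


code = encode(0xDEADBEEF)
print(decode(code)[0], decode(code ^ (1 << 17))[0], decode(code ^ 0b11000)[0])
```

A triple error whose syndrome happens to name a valid position is miscorrected, which matches the slide's statement that only some triple errors are detected.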

DS - VI - FTM - 11 Block Diagram of Memory System
[Figure: system data bus, bus buffers, dynamic RAM control, error detection and correction (EDC) unit, N×32 memory array (32 data bits) and N×7 check array (7 check bits).]
WRITE: data bus → buffers → EDC → buffers → memory array
READ: memory array → buffers → EDC → buffers → data bus

DS - VI - FTM - 12 EDC Unit Operation
Configuration: 32-bit memory array/data bus, 7-bit check array
Memory read cycle:
1. Data is read from the memory array to the buffers and from the check array to the check-bit inputs
2. EDC unit gets data from the buffers
3. EDC unit computes check bits and syndrome
4. On a non-zero syndrome, the error(s) are indicated via the error or multierror lines, and bit correction occurs (1-bit error)
5. EDC unit passes the (corrected) data to the buffers and then to the data bus
Memory write cycle:
1. Data goes from the data bus via the buffers to the EDC unit
2. Check bits are computed
3. Data goes from the EDC unit via the buffers to memory, and check bits go from the EDC unit to the check-bit memory array
– In the 2 Mbyte memory, MTBF improved from 95 h to 15,000 h
– Up to 35% increase in cost, based on 16K memory cost
– Up to 40% increase in power consumption
PARITY + COMPLEMENT METHOD FOR ERROR CORRECTION

DS - VI - FTM - 13 EDC Unit Operation (Continued)
[Figure: double complement sequence]
– 1st write: original data
– 1st read: PE (parity error); the data is complemented ($D \to \bar{D}$)
– 2nd write: complemented data
– 2nd read: PE (parity error); the data is complemented again ($\bar{D} \to D$), giving the correct data
– XOR ($\oplus$) of the first read with the re-complemented second read gives the hard error location
This double complement method, in combination with an ECC system, can correct additional errors, e.g., the National Semiconductor DP8400 chip (detects 100% of 2-bit errors, and both errors are correctable if no more than one of them is soft)
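The double complement sequence can be simulated on a single memory word with a stuck-at cell. This is a sketch under assumptions: `StuckMemoryWord` is a hypothetical model of a word with hard faults, the parity check that triggers the procedure is left out, and the recovery is exact only when all errors in the word are hard (stuck-at) errors.

```python
class StuckMemoryWord:
    """One memory word in which selected cells are stuck at a fixed value."""

    def __init__(self, width: int, stuck_mask: int = 0, stuck_value: int = 0):
        self.mask = (1 << width) - 1
        self.stuck_mask = stuck_mask
        self.stuck_value = stuck_value & stuck_mask
        self.cells = 0

    def write(self, value: int) -> None:
        self.cells = value & self.mask

    def read(self) -> int:
        # Stuck cells ignore what was written and always return their fixed value.
        return (self.cells & ~self.stuck_mask) | self.stuck_value


def double_complement_read(mem: StuckMemoryWord):
    """Run after a parity error: return (recovered data, hard error location)."""
    first = mem.read()                    # 1st read, contains the stuck bit
    mem.write(~first & mem.mask)          # 2nd write: complemented data
    second = mem.read()                   # 2nd read: stuck bit did not toggle
    recovered = ~second & mem.mask        # complement again
    hard_location = first ^ recovered     # bits that differ mark the stuck cells
    return recovered, hard_location


mem = StuckMemoryWord(width=8, stuck_mask=0b0000_1000, stuck_value=0)
original = 0b1010_1100                    # bit 3 should be 1 but the cell is stuck at 0
mem.write(original)
recovered, location = double_complement_read(mem)
print(bin(mem.read() if False else recovered), bin(location))
```

The stuck cell returns the same value on both reads, so after the second complement it carries the value the original data should have had.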

DS - VI - FTM - 14 Reconfiguration
Reconfiguration involves:
– Permutation of the address and/or data lines between an array of memory chips and the CPU to prevent the build-up of multiple hard errors
– Spare memory locations technique (spare blocks method)
– Spare switchable columns technique

DS - VI - FTM - 15 Spare Blocks Method
– Special-purpose hardware
– Intel's iQX module uses the reallocation technique
– Hard error rate is 0.027% in 1000 hours
– Soft error rate is 0.1% in 1000 hours in the 2 Mbyte memory system
[Figure: the memory address from the host is split into a high-order and a low-order address. The high-order address indexes the memory allocation RAM (mapping table), which selects either a block in main memory or, for a block containing faulty data, one of the spare memory blocks.]
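The mapping-table idea behind the spare blocks method can be sketched in a few lines. Everything here is hypothetical: the class name, the 10-bit block size, and the use of a dictionary as backing store are assumptions for illustration and do not describe the iQX hardware.

```python
class SpareBlockMemory:
    """Memory whose high-order address bits go through a remappable table."""

    def __init__(self, n_blocks: int, n_spares: int, block_bits: int = 10):
        self.block_bits = block_bits
        self.table = list(range(n_blocks))                 # logical -> physical block
        self.free_spares = list(range(n_blocks, n_blocks + n_spares))
        self.store = {}                                    # physical address -> value

    def _physical(self, addr: int) -> int:
        high = addr >> self.block_bits                     # indexes the mapping table
        low = addr & ((1 << self.block_bits) - 1)          # passed through unchanged
        return (self.table[high] << self.block_bits) | low

    def write(self, addr: int, value: int) -> None:
        self.store[self._physical(addr)] = value

    def read(self, addr: int) -> int:
        return self.store.get(self._physical(addr), 0)

    def retire_block(self, addr: int) -> int:
        """Remap the block containing `addr` to a spare and move its contents."""
        if not self.free_spares:
            raise RuntimeError("no spare blocks left")
        high = addr >> self.block_bits
        old, new = self.table[high], self.free_spares.pop(0)
        for low in range(1 << self.block_bits):
            src = (old << self.block_bits) | low
            if src in self.store:
                self.store[(new << self.block_bits) | low] = self.store.pop(src)
        self.table[high] = new
        return new


mem = SpareBlockMemory(n_blocks=4, n_spares=1)
mem.write(0x0805, 42)               # logical block 2, offset 5
spare = mem.retire_block(0x0805)    # block 2 develops a hard error
print(spare, mem.read(0x0805))      # same host address, data now in the spare block
```

The host keeps using the same addresses; only the table entry for the faulty block changes.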

DS - VI - FTM - 16 Spare Switchable Columns Method

DS - VI - FTM - 17 Fault-tolerant Memories in Commercial Systems (1)
– Intel's Series 90/iQX (a SEC-DED code on the data, a parity check on the address bus, scrubbing of memory, which is the periodic dumping and rewriting of data to prevent the build-up of multiple soft errors, and spare memory with a pointer table)
– VAX-11/780 and MicroVAXes (a 7-bit SEC-DED Hamming code for 32-bit words and error logging)
– Memory systems for spaceborne computers (SEC-DED with periodic scrubbing, or bit-per-chip memory organization with row/column, power isolation and error protocol data to assist reconfiguration)

DS - VI - FTM - 18 Fault-tolerant Memories in Commercial Systems (2)
– IBM 30xx and 43xx: use a Hamming SEC-DED code and the parity + complement method
– UNIVAC 1100/60: employs SEC-DED and sends an error signal to the requesting device if a double error is detected
– VAX-11/780 and MicroVAX: employ a Hamming SEC-DED code with error logging
– CRAY X-MP and Y-MP: use an 8-bit SEC-DED code with each 64-bit memory word
– Sun workstations: some use SEC-DED

DS - VI - FTM - 19 Fault-tolerant Memories in Fault-tolerant Computers
Self-testing and repairing (STAR) computer:
– 12 bits of each instruction word are stored in a 2-out-of-4 code, while the remaining 20 bits consist of 16 bits for the address field and 4 check bits.
– An inverse modulo-15 code is used to set the check bits such that the combined 20 bits represent a number that is divisible by 15.
– Operands also use the inverse modulo-15 code (28 data bits and 4 check bits in the data words).
– Critical programs can be written into multiple memory units.
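The inverse modulo-15 check can be illustrated with a short sketch for the 16-bit address field. Two details are assumptions made here and not stated on the slide: the 4 check bits are placed in the low-order end of the 20-bit value, and a residue of 0 is encoded as the check value 15. The function names are hypothetical.

```python
def inverse_mod15_check(value: int) -> int:
    """4 check bits chosen so that value and check together are divisible by 15."""
    return 15 - (value % 15)            # always in 1..15, so it fits in 4 bits


def encode(address: int) -> int:
    """Append the check bits to a 16-bit address, giving a 20-bit value."""
    return (address << 4) | inverse_mod15_check(address)


def is_valid(word: int) -> bool:
    """A fault-free 20-bit value is a multiple of 15."""
    return word % 15 == 0


word = encode(0xBEEF)
print(is_valid(word), is_valid(word ^ (1 << 9)))   # intact word, then one flipped bit
```

Because 16 leaves remainder 1 when divided by 15, shifting the address by four bits does not change its residue, so adding the inverse residue brings the total to a multiple of 15. A single flipped bit changes the value by a power of two, which is never a multiple of 15, so every single-bit error is detected.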

DS - VI - FTM - 20 Examples
Carnegie-Mellon University computers:
– C.mmp uses two parity bits (one odd, one even) in its memory.
– Cm* employs retry and error reporting mechanisms.
– C.vmp uses TMR.
Electronic switching systems (Bell Labs):
– ESS 1 uses two parity bits (one covering both address and data, the other covering just the address). The system also supports error logging, auto-retry and software error handling.
– ESS 3A makes extensive use of totally self-checking checkers and duplication of critical processors to recover from errors.
Fault-tolerant building blocks architecture (JPL-UCLA):
– Uses SEC-DED and two spare switchable bits
Other examples:
– Tandem, Stratus, August Systems, Plessey (Great Britain), Philips (4,2) concept (the Netherlands), COMTRAC (Japan) and COPRA (France)