RDIS: A Recursively Defined Invertible Set Scheme to Tolerate Multiple Stuck-At Faults in Resistive Memory
Rami Melhem, Rakan Maddah and Sangyeun Cho
Computer Science Department, University of Pittsburgh

Introduction
- DRAM is facing physical limitations that are expected to hinder its scalability.
- Resistive memories, e.g. Phase Change Memory (PCM), are regarded as a promising replacement for DRAM.
- PCM is characterized by its scalability and density.
- Initial measurements indicate that PCM is competitive with DRAM in terms of read/write latency and power efficiency.

Challenges
- Limited write endurance is one of the main obstacles to the adoption of PCM.
- PCM cells endure 10^6 to 10^8 write operations on average.
- Repeated writes cause cells to fail and become permanently stuck at either 0 or 1.
- A faulty cell can still be read but not reprogrammed.
- Cell lifetimes vary due to process variation.

Remedies
- Spreading writes evenly across the entire physical space, i.e. wear leveling.
- Suppressing unnecessary writes, e.g. silent writes.
- Multi-bit error correction schemes.

Contribution
- RDIS: an error correction scheme for the stuck-at faults prominent in resistive memories like PCM.
- RDIS exploits the stuck-at fault model exhibited by hard faults in SLC PCM: a worn-out cell can be classified as either stuck-at-right (SA-R) or stuck-at-wrong (SA-W), depending on the data pattern.
[Figure: writing 0 to a stuck-at-1 (SA-1) cell makes it SA-W; writing 0 to a stuck-at-0 (SA-0) cell makes it SA-R.]
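The SA-R/SA-W distinction can be stated in a couple of lines. The following is a minimal sketch; the function name `classify` is a hypothetical helper, not something taken from the slides.

```python
def classify(stuck_at: int, bit_to_write: int) -> str:
    """Classify a worn-out cell relative to the bit we want to store.

    stuck_at     -- the value (0 or 1) the cell is permanently stuck at
    bit_to_write -- the value (0 or 1) the write is trying to store
    """
    # The cell is "right" when its stuck value happens to match the data.
    return "SA-R" if stuck_at == bit_to_write else "SA-W"


if __name__ == "__main__":
    print(classify(1, 0))  # stuck-at-1 cell, writing 0
    print(classify(0, 0))  # stuck-at-0 cell, writing 0
```

The same physical cell changes class whenever the written bit changes, which is why inverting the data turns SA-W cells into SA-R cells.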

Goal
- Identify a set containing all the SA-W cells.
- A simple way to build the set is to keep a list of pointers to the SA-W cells.
- RDIS introduces a systematic method for building the set, allowing it to include non-faulty (NF) cells.
[Figure: Pointer 1 and Pointer 2 marking the SA-W cells.] How?

RDIS Encoding Process
[Figure: a block of 16 cells with stuck-at cells, mapped to a 2D array.]

RDIS Encoding Process
- Introduce an auxiliary flag for each row and column.
- Set the flags of each row and column containing a stuck-at-wrong cell.
- Form a mesh of the cells whose row and column flags are both set.
[Figure: 2D block with flag vectors VX and VY; legend: SA-W, SA-R, NF, mesh.]
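The flag-and-mesh step above can be sketched as follows. This is an illustration only: cells are assumed to be addressed as (row, column) pairs, and `form_mesh` is a hypothetical name.

```python
from typing import Iterable, Set, Tuple

Cell = Tuple[int, int]  # (row, column), an assumed addressing convention


def form_mesh(sa_w_cells: Iterable[Cell]) -> Tuple[Set[int], Set[int], Set[Cell]]:
    """Return (flagged_rows, flagged_cols, mesh) for a set of SA-W cells."""
    cells = list(sa_w_cells)
    flagged_rows = {i for i, _ in cells}   # rows containing an SA-W cell
    flagged_cols = {j for _, j in cells}   # columns containing an SA-W cell
    # The mesh is every intersection of a flagged row with a flagged column.
    mesh = {(i, j) for i in flagged_rows for j in flagged_cols}
    return flagged_rows, flagged_cols, mesh


if __name__ == "__main__":
    rows, cols, mesh = form_mesh([(0, 0), (2, 2)])
    print(sorted(mesh))
```

Note that the mesh contains cells that are not SA-W themselves, here (0, 2) and (2, 0). This is how SA-R and non-faulty cells end up inside the set.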

RDIS Fault Masking
- Write the data inverted within the initial mesh.
- Stuck-at cells switch roles: SA-W → SA-R and SA-R → SA-W.
- Set the auxiliary flags accordingly and form a new mesh.
- Recursively apply the same process until the mesh size becomes zero, i.e. no SA-W cells remain.
[Figure: successive inversions with updated VX and VY flags, each forming a new mesh until done; legend: SA-W, SA-R, NF.]
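The recursion can be sketched as a loop over nesting levels. This is one reading of the slides, not the authors' implementation: it assumes each row and column keeps a counter recording the deepest mesh it belongs to, that the counter capacity is `max_count`, and that faults are given as a dictionary from (row, column) to stuck value. The name `rdis_encode` is hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]  # (row, column), an assumed addressing convention


def rdis_encode(data: List[List[int]],
                stuck: Dict[Cell, int],
                max_count: int = 3) -> Optional[Tuple[List[int], List[int]]]:
    """Return (row_counters, col_counters), or None if masking fails."""
    n, m = len(data), len(data[0])
    row_cnt, col_cnt = [0] * n, [0] * m
    rows, cols = set(range(n)), set(range(m))  # level 0 covers the whole block
    level = 0
    while True:
        parity = level % 2  # cells in the current mesh are stored inverted iff 1
        # SA-W cells of the current mesh: stuck value differs from stored value.
        wrong = [(i, j) for (i, j), v in stuck.items()
                 if i in rows and j in cols and v != data[i][j] ^ parity]
        if not wrong:
            return row_cnt, col_cnt      # mesh size reached zero
        if level == max_count:
            return None                  # counters cannot be increased further
        level += 1
        rows = {i for i, _ in wrong}     # flagged rows of the new, nested mesh
        cols = {j for _, j in wrong}     # flagged columns of the new mesh
        for i in rows:
            row_cnt[i] = level
        for j in cols:
            col_cnt[j] = level


if __name__ == "__main__":
    zeros = [[0] * 4 for _ in range(4)]
    print(rdis_encode(zeros, {(0, 0): 1, (0, 2): 0, (2, 2): 1}))
```

In this sketch a pattern that can never be masked is caught by the counter limit rather than by an explicit check for a non-shrinking mesh, which keeps the loop short.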

RDIS Fault Masking
- After the mesh size is reduced to zero, this is what the original 2D data block looks like:
[Figure: final 2D data block with its VX and VY values.]
- Data retrieval?

Data Retrieval/Decoding
- To retrieve data, read a cell's value inverted if the minimum of its row and column counters is odd.
- If the minimum is even, read the cell un-inverted.
- The cells that are read inverted form the invertible set.
[Figure: 2D block with VX and VY counters, highlighting the invertible set.]
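The decoding rule maps directly to code. A minimal sketch, assuming the raw cell contents and the two counter vectors are available as Python lists; `rdis_decode` is a hypothetical name.

```python
from typing import List


def rdis_decode(raw: List[List[int]],
                row_cnt: List[int],
                col_cnt: List[int]) -> List[List[int]]:
    """Recover the original bits from the raw cell contents and the counters."""
    # A cell is flipped back exactly when min(row counter, column counter) is odd.
    return [[bit ^ (min(r, c) & 1) for bit, c in zip(row, col_cnt)]
            for row, r in zip(raw, row_cnt)]


if __name__ == "__main__":
    raw = [[1, 0, 0, 0],
           [0, 0, 0, 0],
           [1, 0, 1, 0],
           [0, 0, 0, 0]]
    print(rdis_decode(raw, [2, 0, 1, 0], [1, 0, 2, 0]))
```

Decoding needs only the counters, not the fault locations, so a read costs one `min` and one XOR per cell.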

Another Example
[Figure: step-by-step example on a 2D block with VX and VY; legend: SA-W, SA-R, NF, mesh, invertible set.]


RDIS Coverage
- RDIS guarantees recovery from three stuck-at faults.
- However, with high probability RDIS can recover from many more faults than it guarantees.
- Two sources of halting:
  - The stuck-at faults form a cycle.
  - The auxiliary flag counters reach their capacity before the initially formed mesh can be reduced to size zero.

Cycle Example
- The mesh size is not reduced after an inversion, so the fault pattern cannot be masked.
- The faults must form a cycle that is alternately stuck for RDIS to halt.
[Figure: a block with SA-W cells and its VX and VY flags before and after inversion; the mesh size cannot be reduced.]

Counters Capacity Example
- A fault pattern cannot be masked due to counter capacity.
- Assume counter capacity is limited to 3.
[Figure: block with SA-W cells and the VX and VY counters.]


Counters Capacity Example
- A fault pattern cannot be masked due to counter capacity.
- Assume counter capacity is limited to 3.
- The counters cannot be increased further.
[Figure: block with the VX and VY counters at their capacity.]

Counters Capacity Example
- A fault pattern cannot be masked due to counter capacity.
- Assume counter capacity is limited to 3.
- The fault pattern must be an incomplete cycle that is alternately stuck.
[Figure: marks the faulty cell needed for the cycle to be complete.]

Evaluation
- We rely on Monte Carlo simulation to evaluate RDIS.
- We assume that all cells have an equal probability of failure.
- We model an n × m memory block as a bipartite graph with n + m nodes.
- A block is deemed defective when the faults form a cycle or an incomplete cycle.
- The defectiveness of a block is detected through a modification of the DFS algorithm.
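The bipartite model can be illustrated with a plain DFS cycle check. This is a simplified sketch, not the authors' modified DFS: it assumes each faulty cell (i, j) becomes an edge between row node i and column node j, and it only tests whether any cycle exists. It ignores stuck values and incomplete cycles, so it is a conservative necessary condition. The name `faults_form_cycle` is hypothetical.

```python
from collections import defaultdict
from typing import Iterable, Tuple


def faults_form_cycle(faults: Iterable[Tuple[int, int]]) -> bool:
    """Return True if the fault edges contain a cycle in the row/column graph."""
    adj = defaultdict(list)
    for i, j in set(faults):              # de-duplicate so the graph stays simple
        r, c = ("r", i), ("c", j)         # row nodes and column nodes are distinct
        adj[r].append(c)
        adj[c].append(r)
    seen = set()
    for start in list(adj):
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, None)]           # iterative DFS: (node, parent)
        while stack:
            node, parent = stack.pop()
            for nxt in adj[node]:
                if nxt == parent:
                    continue              # do not walk back along the same edge
                if nxt in seen:
                    return True           # reached a visited node by a second path
                seen.add(nxt)
                stack.append((nxt, node))
    return False


if __name__ == "__main__":
    print(faults_form_cycle([(0, 0), (0, 1), (1, 0), (1, 1)]))
    print(faults_form_cycle([(0, 0), (0, 2), (2, 2)]))
```

A cycle in a simple bipartite graph needs at least four edges, which is consistent with the guarantee of recovery from three faults.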

Related Work
- SAFER [MICRO 10]: dynamically partitions a protected data block into a number of groups.
  - Each group contains at most one faulty cell.
  - Guarantees recovery from lg n + 1 faults, where n is the number of groups, and probabilistically from more faults.
- ECP [ISCA 10]: provides a number of programmable correction entries for a protected data block.
  - A correction entry holds a pointer to a faulty cell and a patch cell that replaces it.
  - The number of recovered faults equals the number of provided correction entries.
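As a rough illustration of how the storage cost of pointer-based correction scales, the sketch below counts the bits needed by the correction entries. It assumes each pointer needs enough bits to address any cell in the block and each patch cell is one bit; any additional bookkeeping bits a real design may use are ignored. The name `ecp_overhead_bits` is hypothetical.

```python
def ecp_overhead_bits(block_bits: int, entries: int) -> int:
    """Approximate storage for `entries` correction entries on a block."""
    pointer_bits = (block_bits - 1).bit_length()  # bits to address one cell
    patch_bits = 1                                # one replacement cell per entry
    return entries * (pointer_bits + patch_bits)


if __name__ == "__main__":
    bits = ecp_overhead_bits(1024, 16)
    print(bits, f"{100.0 * bits / 1024:.1f}%")
```

Under these assumptions the cost grows linearly with the number of faults to be tolerated.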

Fault Tolerance Capability
[Figure: overhead (%) versus block size.]

Probability of Defectiveness
[Figure: probability of having a defective pattern for a 1,024-bit block and a 2,048-bit block; series: RDIS-3, RDIS-7, RDIS-max.]
- The probability increases slowly with the relative increase in the number of faults.

Aggregate Protection
- Protect a large block through an aggregation of smaller sub-blocks.
- Declare defectiveness after the failure of the first sub-block.
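Because the large block is declared defective as soon as one sub-block fails, its failure probability can be estimated from the sub-block probability. The sketch below assumes sub-blocks fail independently with the same probability, which the slides do not state; `aggregate_failure_prob` is a hypothetical name.

```python
def aggregate_failure_prob(p_sub: float, num_sub_blocks: int) -> float:
    """Probability that at least one of the sub-blocks is defective."""
    # The large block survives only if every sub-block survives.
    return 1.0 - (1.0 - p_sub) ** num_sub_blocks


if __name__ == "__main__":
    print(aggregate_failure_prob(0.01, 4))
```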


RDIS vs. SAFER
[Figure: overhead (%) versus block size.]
- More faults, less overhead.

RDIS vs. SAFER
[Figure: probability of failure. 1,024-bit block: RDIS-3, SAFER 256, SAFER 128. 2,048-bit block: RDIS-3, SAFER 128, SAFER 64.]

RDIS vs. ECP: Probability of Defectiveness
[Figure: probability of defectiveness versus number of faults for a 1,024-bit block and a 2,048-bit block; series: RDIS-3, ECP 20, ECP 16.]

RDIS vs. ECP
[Figure: overhead (%) and average number of faults versus block size (512, 1,024, 2,048, 4,096 and 8,192 bits); series: RDIS-3, ECP 14, ECP 16, ECP 20, ECP 24, ECP 31.]
- More faults with less overhead.

Conclusion
- Limited write endurance is a major weakness of PCM.
- Multi-bit error correction schemes are needed.
- We have presented RDIS, an error correction scheme that recursively identifies an invertible set containing all the stuck-at-wrong cells.
- RDIS effectively masks a large number of stuck-at faults with an affordable overhead.