
Power of One Bit: Increasing Error Correction Capability with Data Inversion
Rakan Maddah (1), Sangyeun Cho (2,1) and Rami Melhem (1)
(1) Computer Science Department, University of Pittsburgh
(2) Memory Solutions Lab, Memory Division, Samsung Electronics Co.
{rmaddah,cho,melhem}@cs.pitt.edu

Introduction
- DRAM and NAND flash are facing physical limitations that put their continued scaling into question
- An alternative memory technology is being sought
- Phase-Change Memory (PCM) is a promising emerging technology
  - High scalability
  - Low access latency
- Initial measurements and assessments show that PCM competes favorably with both DRAM and NAND flash

PCM: The Basics
- PCM cells are composed of a chalcogenide alloy (Ge, Sb and Te)
- PCM encodes bits in different physical states by applying varying levels of current to the phase-change material
[Figure: programming pulses (power vs. time) — a short, high-power pulse RESETs the cell into the amorphous state; a longer, lower-power pulse SETs it into the crystalline state]

PCM: The Challenges
- Limited endurance
  - 10^6 to 10^8 writes on average
  - Early failure due to parametric variation in manufacturing
- Slow, asymmetric writes
  - 4x slower than reads
  - Writing 0s is faster than writing 1s
- Our focus is on the endurance problem

PCM: Fault Model
- A cell wears out when the heating element detaches from the chalcogenide material due to frequent expansions and contractions
- A worn-out cell gets permanently stuck: stuck-at-1 (SA-1) or stuck-at-0 (SA-0)
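To make the fault model concrete, the sketch below (a minimal Python model with illustrative names, not code from the paper) treats a block as a list of cells where some positions are permanently stuck; writes to healthy cells take effect, while stuck cells silently keep their stuck-at value:

```python
from typing import Dict, List

def write_block(pattern: List[int], stuck: Dict[int, int]) -> List[int]:
    """Write `pattern` to a block whose worn-out cells are given as
    {position: stuck_value}; stuck cells silently keep their value."""
    return [stuck.get(i, bit) for i, bit in enumerate(pattern)]

stuck = {2: 1, 5: 0}           # one SA-1 cell and one SA-0 cell
print(write_block([0, 0, 0, 1, 1, 1, 0, 1], stuck))
# -> [0, 0, 1, 1, 1, 0, 0, 1]: positions 2 and 5 hold the wrong value
```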

Data-Dependent Errors
- A write to a memory block with more faults than the capability of the error correction code does not necessarily fail!
[Figure: three write requests to the same block of cells with SA-1 and SA-0 faults — depending on the data pattern written, a different number of stuck cells ends up holding the wrong value ("errors after write")]

Data-Dependent Errors
- Can we exploit this fact to increase the ECC capability?
- A write fails only when the number of stuck-at-wrong cells exceeds the capability of the ECC code
- Example: with an ECC code of capability 2, only 1 of the 3 example writes fails
[Figure: the same three write requests, annotated with the errors remaining after each write]
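The following sketch (again illustrative Python, building on the abstract fault model above) captures the failure condition: a write fails only when the number of stuck-at-wrong cells, i.e., stuck cells whose value differs from the bit being written, exceeds the ECC capability t:

```python
from typing import Dict, List

def stuck_at_wrong(pattern: List[int], stuck: Dict[int, int]) -> int:
    """Count stuck cells whose stuck-at value differs from the bit written."""
    return sum(1 for i, v in stuck.items() if pattern[i] != v)

def write_fails(pattern: List[int], stuck: Dict[int, int], t: int) -> bool:
    """The write fails only when more than t cells end up stuck-at-wrong,
    no matter how many stuck cells the block contains in total."""
    return stuck_at_wrong(pattern, stuck) > t

stuck = {0: 1, 3: 0, 6: 1}     # 3 faults, beyond what an ECC-2 can correct
print(write_fails([1, 0, 0, 0, 1, 1, 1, 0], stuck, t=2))  # False: 0 SA-W
print(write_fails([0, 0, 0, 1, 1, 1, 0, 0], stuck, t=2))  # True: 3 SA-W
```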

Contribution: Data Inversion
- After a write failure, Data Inversion reattempts a second write with the initial data inverted
  - A polarity bit flags the inversion
- Impact: stuck-at-wrong (SA-W) cells exchange roles with stuck-at-right (SA-R) cells
- Consequence: in the worst case, only half of the faults in the data bits manifest as errors
- The second write succeeds if it brings the number of SA-W cells within the nominal capability of the deployed error correction code
- Achievement: Data Inversion increases the number of faults a block can sustain before it turns defective

Data Inversion: Fault Tolerance Capability
- The number of faults that can be tolerated depends on their distribution within the protected block (t = ECC capability; Q faults among the data bits + polarity bit, R faults among the parity bits)
- Without Data Inversion: the block is defective when Q + R > t faults (Q SA-W + R SA-W in the worst case)
- With Data Inversion: the block is defective when Q/2 + R > t faults (Q/2 SA-W + R SA-W in the worst case)
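As a quick check of the integrated-protection condition, here is a hedged sketch (the helper name and the floor(Q/2) worst-case model are assumptions derived from the slide's Q/2 + R > t condition, not code from the paper):

```python
def integrated_defective(q: int, r: int, t: int) -> bool:
    """q: faults among data bits + polarity bit; r: faults among parity bits.
    Inverting the data swaps SA-W and SA-R roles among the q faults, so at
    most floor(q/2) of them are stuck-at-wrong in the better attempt, while
    the r parity faults may stay stuck-at-wrong in both attempts."""
    return q // 2 + r > t

# With BCH-6 (t = 6): 12 faults in the data bits alone are survivable...
print(integrated_defective(q=12, r=0, t=6))   # False (6 + 0 <= 6)
# ...but faults on the parity bits consume the budget in full.
print(integrated_defective(q=4, r=5, t=6))    # True  (2 + 5 > 6)
```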

Execution Flow: Write (ECC-1)
[Figure: a write pattern is applied to cells with SA-1 and SA-0 faults; the 1st write fails, so the data is inverted, the auxiliary bits are recomputed, and the 2nd write succeeds]

Execution Flow: Read (ECC-1)
[Figure: the physical state is read, the data is decoded through the ECC, and the decoded data is then read inverted to recover the original data]
- Can we do better?
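Combining the two flows, here is a minimal end-to-end sketch of integrated protection. The even-parity `parity()` helper and the abstract `decode_ok()` model (decoding succeeds iff at most t bits are wrong) are stand-ins for a real BCH codec, not the paper's implementation:

```python
from typing import Dict, List, Tuple

def raw_write(bits: List[int], stuck: Dict[int, int]) -> List[int]:
    """Stuck cells silently keep their stuck-at value; healthy cells obey."""
    return [stuck.get(i, b) for i, b in enumerate(bits)]

def parity(data: List[int]) -> List[int]:
    """Stand-in for a real BCH encoder: a single even-parity bit."""
    return [sum(data) % 2]

def decode_ok(stored: List[int], intended: List[int], t: int) -> bool:
    """Abstract decoder model: decoding succeeds iff at most t bits are wrong."""
    return sum(a != b for a, b in zip(stored, intended)) <= t

def write_integrated(data: List[int], stuck: Dict[int, int],
                     t: int) -> Tuple[List[int], List[int]]:
    """Integrated protection: if the first write fails, invert the data,
    recompute the parity bits, set the polarity bit, and write again."""
    for pol in (0, 1):
        payload = [b ^ pol for b in data]
        codeword = payload + parity(payload) + [pol]  # polarity bit inside block
        stored = raw_write(codeword, stuck)
        if decode_ok(stored, codeword, t):
            return stored, codeword
    raise IOError("block is defective: both attempts exceed the ECC capability")

def read_integrated(corrected: List[int], data_len: int) -> List[int]:
    """Read path: once ECC decoding has corrected the raw bits back to the
    stored codeword, undo the inversion if the polarity bit is set."""
    pol = corrected[-1]
    return [b ^ pol for b in corrected[:data_len]]

stuck = {0: 1, 2: 1}                         # two SA-1 cells defeat the 1st write
data = [0, 1, 0, 1]
stored, codeword = write_integrated(data, stuck, t=1)
print(read_integrated(codeword, len(data)))  # [0, 1, 0, 1] -- original data back
```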

Data Inversion: Unintegrated Protection
- Un-integrate the polarity bit from the data bits
  - It is written infrequently, so raw endurance should be enough
  - Or use other protection schemes, e.g., triple modular redundancy (TMR)
- Impact: after a write failure, invert the entire codeword
  - Abolishes the need to recompute the auxiliary information
- Achievement: doubles the number of faults that can be tolerated in a block before it turns defective

Unintegrated Protection: Fault Tolerance Capability
- The number of faults that can be tolerated is doubled, irrespective of the fault distribution within the protected block (t = ECC capability)
- Integrated protection: with Q faults among the data bits + polarity bit and R faults among the parity bits, the block is defective when Q/2 + R > t faults (Q/2 SA-W + R SA-W in the worst case)
- Unintegrated protection: with Q faults anywhere in the codeword (data + parity bits), the block is defective only when Q > 2t + 1 faults (t+1 SA-W and t+1 SA-R in the worst case)
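Side by side, the two defectiveness tests look as follows (same caveats as before: floor division models the worst case over the two polarities, and the helper names are illustrative):

```python
def integrated_defective(q: int, r: int, t: int) -> bool:
    # q data(+polarity) faults count half after the retry; the r parity
    # faults count in full, since the recomputed parity may hit them again.
    return q // 2 + r > t

def unintegrated_defective(f: int, t: int) -> bool:
    # Inverting the whole codeword swaps SA-W/SA-R roles for every fault,
    # so all f faults count half: defective only when f > 2t + 1.
    return f // 2 > t

t = 6  # BCH-6
print(integrated_defective(q=5, r=5, t=t))   # True:  10 faults can be fatal
print(unintegrated_defective(f=13, t=t))     # False: 13 faults always survive
print(unintegrated_defective(f=14, t=t))     # True:  2t+2 faults can be fatal
```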

Execution Flow: Write (ECC-1)
[Figure: after the 1st write fails, the 2nd write is issued with the entire codeword inverted; no auxiliary bits are recomputed]

Execution Flow: Read (ECC-1)
[Figure: the physical state is read, the codeword is read inverted, and the data is then decoded through the ECC to recover the original codeword]
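The corresponding sketch for unintegrated protection: the retry inverts the entire codeword without recomputing parity, and the read path inverts the raw bits back before handing them to the decoder (same abstract decoder model as before, not the paper's implementation):

```python
from typing import Dict, List, Tuple

def raw_write(bits: List[int], stuck: Dict[int, int]) -> List[int]:
    return [stuck.get(i, b) for i, b in enumerate(bits)]

def decode_ok(stored: List[int], intended: List[int], t: int) -> bool:
    return sum(a != b for a, b in zip(stored, intended)) <= t

def write_unintegrated(codeword: List[int], stuck: Dict[int, int],
                       t: int) -> Tuple[List[int], int]:
    """If the first write fails, flip every bit of the codeword (data and
    parity alike) and retry; nothing is recomputed. The polarity bit lives
    outside the block (e.g., TMR-protected) and records which attempt stuck."""
    for pol in (0, 1):
        attempt = [b ^ pol for b in codeword]
        stored = raw_write(attempt, stuck)
        if decode_ok(stored, attempt, t):
            return stored, pol
    raise IOError("block is defective: both attempts failed (> 2t+1 faults)")

def read_unintegrated(stored: List[int], pol: int) -> List[int]:
    """Invert the raw bits back first; the result is then decoded through
    the ECC as usual (decoding operates on the re-inverted codeword)."""
    return [b ^ pol for b in stored]

stuck = {1: 0, 4: 0, 5: 1}                 # three faults, too many for t = 1
codeword = [0, 1, 1, 0, 1, 0]              # data + parity, as the ECC produced it
stored, pol = write_unintegrated(codeword, stuck, t=1)
print(read_unintegrated(stored, pol))      # [0, 1, 1, 0, 1, 0] -- codeword back
```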

Integrated vs. Unintegrated Protection
[Figure: number of tolerated faults for a 512-bit block under three schemes — BCH-6 alone (60 aux bits), BCH-6 + Data Inversion with integrated protection (60 aux bits + 1 polarity bit), and BCH-6 + Data Inversion with unintegrated protection (60 aux bits + a separately protected polarity bit)]

Evaluation
- Monte Carlo simulation over 2000 pages of memory
- 512-bit cache line size for main memory, protected by a BCH-6 code
- 512-byte sector size for secondary storage, protected by a BCH-20 code
- Cell lifetimes assigned from a Gaussian distribution with a mean of 10^8 and a standard deviation of 25 x 10^6
- A block is retired when the number of faults within it turns it defective
- In the case of unintegrated protection, a block is also retired if the polarity bit wears out before the block turns defective
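A condensed, single-block version of such an experiment might look like the sketch below (Python; the distribution parameters come from the slide, while the structure, names and the per-block simplification are assumptions — the actual evaluation covers 2000 pages and, for integrated protection, also needs the positions of the faults):

```python
import random

BLOCK_BITS = 512            # main-memory cache line, protected by BCH-6
T = 6                       # ECC capability
MEAN, STD = 1e8, 25e6       # per-cell endurance ~ Gaussian(10^8, 25*10^6)

def block_lifetime(defective) -> int:
    """Writes a block survives: cells die in order of their drawn endurance,
    and the block is retired once the fault count makes it defective."""
    deaths = sorted(random.gauss(MEAN, STD) for _ in range(BLOCK_BITS))
    for faults, writes in enumerate(deaths, start=1):
        if defective(faults):
            return int(writes)        # write count at which the block retired
    return int(deaths[-1])

# Worst-case defectiveness tests (see the sketches on the earlier slides);
# integrated protection is omitted here since it depends on fault positions.
bch_only = lambda faults: faults > T              # retire at t+1 faults
di_up    = lambda faults: faults // 2 > T         # retire at 2t+2 faults

random.seed(1)                                    # reproducible illustration
print(block_lifetime(bch_only), block_lifetime(di_up))
```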

Main Memory Lifetime
[Figure: lifetime of PCM main memory blocks achieved with BCH-6 and BCH-6 plus data inversion (DI) with integrated protection (IP) and un-integrated protection (UP); annotated lifetime gains: 21.1% and 34.5%]

Secondary Storage Lifetime
[Figure: lifetime of PCM storage blocks achieved with BCH-20 and BCH-20 plus data inversion (DI) with integrated protection (IP) and un-integrated protection (UP); annotated lifetime gains: 18.1% and 25.2%. This experiment assumed that 20% of spare storage capacity was provided.]

Performance Overhead (avg. % of extra writes required by Data Inversion)

Block size | Before nominal capability is exceeded | After (DI + Integrated Protection) | After (DI + Un-Integrated Protection)
512 bits   | 0%                                    | 4.9%                               | 13.1%
4096 bits  | 0%                                    | 6.4%                               | 8.9%

Performance evaluation in terms of extra write operations required by data inversion to complete write requests successfully after the number of faults exceeds the nominal capability of the error correction code.

Conclusion
- Data Inversion is a simple yet powerful technique to increase the number of faults that an error correction code can tolerate
- Two variations:
  - Integrated protection: block defectiveness depends on the distribution of faults within the block
  - Unintegrated protection: doubles the number of faults that can be tolerated
- Data Inversion extends lifetime significantly while incurring a low performance overhead and a marginal physical overhead of one additional bit

Thank You!
Contact info:
Rakan Maddah: www.cs.pitt.edu/~rmaddah
Sangyeun Cho: www.cs.pitt.edu/~cho
Rami Melhem: www.cs.pitt.edu/~melhem