Pay-As-You-Go: Storage-Efficient Hard Error Correction. Moinuddin K. Qureshi, ECE, Georgia Tech. Research done while at IBM T. J. Watson Research.

Presentation transcript:

Pay-As-You-Go: Storage-Efficient Hard Error Correction
Moinuddin K. Qureshi, ECE, Georgia Tech
Research done while at: IBM T. J. Watson Research Center, New York
MICRO 2011, Dec 6, 2011

Introduction
PCM is a scalable technology in which device state is changed by heating.
Over time, write operations break the heater, and the cell gets stuck.
Reported write endurance is on the order of millions of writes per cell.
With good wear leveling, it is still possible to get a lifetime of 8+ years.
PAY-AS-YOU-GO, MICRO-2011

Not All Cells Are Created Equal
Process variation causes variability in lifetime: there are weak cells and strong cells.
Weak cells fail much earlier and greatly reduce system lifetime.
Lifetime is usually modeled as a Gaussian with an SDEV of 10-30% of the mean. We use SDEV = 20% of the mean.
With that SDEV, a cell 5 SDEV below the mean has no endurance at all: for a 1GB memory bank, 8K bits fail at time 0, and more fail as we write.
PCM needs a significant amount of error correction to handle variability.
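The arithmetic behind the "8K bits at time 0" figure can be checked with a short sketch. It assumes one bit per cell and treats the slide's 8K as 8192; the exact one-sided Gaussian tail is shown alongside for comparison, since the slide's figure appears to be a rounded estimate.

```python
import math

BANK_BITS = (1 << 30) * 8        # 1GB bank, assuming one bit per cell
FAILED_AT_TIME_ZERO = 8 * 1024   # "8K bits fail at time 0", read as 8192

# Per-cell probability implied by the slide's 8K figure.
implied_p = FAILED_AT_TIME_ZERO / BANK_BITS

# One-sided tail of a standard Gaussian beyond 5 standard deviations.
gaussian_tail = 0.5 * math.erfc(5 / math.sqrt(2))

# With SDEV = 20% of the mean, 5 SDEV below the mean is zero endurance.
relative_endurance = 1.0 - 5 * 0.20

print(f"implied per-cell probability:  {implied_p:.2e}")
print(f"exact Gaussian 5-sigma tail:   {gaussian_tail:.2e}")
print(f"relative endurance at -5 SDEV: {relative_endurance:.1f}")
```

Both probabilities are on the order of one in a million or smaller, yet at billions of cells per bank they still translate into thousands of cells that are unusable from the start.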

Write Efficient Code
Traditional ECC codes are write intensive, which causes more wear.
Endurance-related (hard) faults are identified with a checker read.
Write-efficient code: Error Correcting Pointers (ECP) [ISCA'10].
[Figure: a 512-bit cache line (bits 0 to 511) with a faulty cell X; an ECP entry holds a 9-bit pointer to the faulty cell and 1 replacement data bit D.]
ECP needs 10 bits per entry. It handles multiple faults (and needs 1 "Full" bit).
For correcting N errors, ECP needs (10N+1) bits.
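A minimal sketch of the ECP idea and its storage cost follows. The function names and the list-of-bits representation are illustrative assumptions, not the hardware design: each entry is modeled as a (pointer, replacement bit) pair that overrides one faulty cell on a read.

```python
LINE_BITS = 512                               # 64B cache line
POINTER_BITS = (LINE_BITS - 1).bit_length()   # 9 bits address any cell
ENTRY_BITS = POINTER_BITS + 1                 # pointer + replacement bit


def ecp_storage_bits(num_errors):
    """Bits needed to correct num_errors faults: 10N entries + 1 Full bit."""
    return ENTRY_BITS * num_errors + 1


def ecp_correct(raw_bits, entries):
    """Return the line with each pointed-to cell replaced by its stored bit."""
    corrected = list(raw_bits)
    for pointer, replacement in entries:
        corrected[pointer] = replacement
    return corrected


# Cell 7 is stuck at 0, but the data wanted a 1 there.
stored = [0] * LINE_BITS
data = ecp_correct(stored, [(7, 1)])

overhead_pct = 100 * ecp_storage_bits(6) / LINE_BITS
print(ecp_storage_bits(6), f"{overhead_pct:.1f}%")
```

For six errors per line this gives the 61 bits/line (about 12% of a 512-bit line) quoted on the next slide.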

Expensive to Correct Many Errors
To get 6+ years of lifetime, we need to correct six errors per line.
Storage: 61 bits/line (about 12%, or 1GB for 8GB), which is expensive.
Unlike ECC in current DRAM chips, this overhead is not optional.
[Figure: system lifetime in years for the baseline with NoECP and with ECP-1 through ECP-6.]
Goal: reduce storage significantly (3X-6X) while retaining lifetime.

Motivation
Uniformly allocating error correction entries is inefficient (by ~20X).
We do not need to pay for error correction of each line upfront.
Pay-As-You-Go: give error correction entries in proportion to errors.
Utilization of error correction entries per line:
Num Writes (Normalized)   No ECP used   Only ECP-1 used   ECP-2 to ECP-6 used   Average ECP Used
50%                       99.02%        0.97%             0.01%                 (missing)
(missing)                 79.63%        18.14%            2.23%                 (missing)
(missing)                 73.24%        22.82%            3.95%                 0.31
Key insight: very few lines have a large number of errors.

Outline
- Introduction & Motivation
- PAYG Design
- Results
- Even More Storage Efficiency
- Related Work
- Summary

Naïve Design for PAYG
Given that 73% of lines have no error, why not give ECP-6 only on error?
[Figure: each 64B memory line carries an OFB; a Global Error Correction (GEC) Pool is organized as sets and ways (the number of GEC entries per set), and each GEC entry holds a valid bit (V), a tag, and ECP-N.]
GEC Pool structure: set associative vs. fully associative (impractical).

Three Key Problems
1. The set associative structure is inefficient (by ~8X for 8-way).
2. If we allocate six ECP entries for each GEC entry, most error correction entries still remain unused.
3. Given that >25% of lines are likely to have at least one error, the latency impact of GEC is significant.

Inefficiency of Set Associative GEC
There are tens or hundreds of thousands of sets, and any set could overflow.
How many entries are used before one set overflows? This is a buckets-and-balls problem.
An 8-way GEC is only 12% full when one set overflows, so it needs 8X the entries.
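The buckets-and-balls question can be explored with a small Monte Carlo sketch. It assumes faulty lines map to sets uniformly at random, which is a modeling assumption; the function name and parameters are illustrative.

```python
import random


def fill_fraction_at_first_overflow(num_sets, ways, rng):
    """Throw entries into random sets until one arrives at a full set.

    Returns the fraction of the table's total capacity that was in use
    at the moment of the first overflow.
    """
    counts = [0] * num_sets
    placed = 0
    while True:
        s = rng.randrange(num_sets)
        if counts[s] == ways:          # this set is already full: overflow
            return placed / (num_sets * ways)
        counts[s] += 1
        placed += 1


rng = random.Random(0)                 # fixed seed for repeatability
fraction = fill_fraction_at_first_overflow(num_sets=128 * 1024, ways=8, rng=rng)
print(f"table is {100 * fraction:.1f}% full at first overflow")
```

The exact fraction depends on the number of sets and on the random draw, so it will not necessarily reproduce the slide's 12%. The qualitative result is the same: with many sets, the first overflow happens while most of the table is still empty.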

Scalable Structure for GEC Pool
A "Hash-Table With Chaining" structure provides flexibility and low latency.
[Figure: OFB, Set Associative Table (SAT), Global Collision Table (GCT), GEC entries, a PTR per set, and GCT-HEAD; some GCT sets are marked "taken by some other set".]
*PTR is two-way replicated.

Scalable Structure for GEC Pool (continued)
Let's say we want to store N entries:
Structure                  Total Entries   Latency
Fully Associative          N               Very High
8-way Set Associative      8*N             1
8-way (SAT+GCT)            1.5*N           1+epsilon
A Global Collision Table (GCT) with half as many sets as the SAT is sufficient.
The proposed GEC structure has latency similar to a Set Associative Table while needing 5X fewer entries.
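The hash-table-with-chaining organization can be sketched in software as follows. This is an illustration under assumptions, not the hardware design: the class name, the modulo set index, and the policy of handing out GCT sets in order from a head counter are all hypothetical, and the two-way replication of PTR is not modeled.

```python
class GECPool:
    """Set Associative Table (SAT) backed by a Global Collision Table (GCT)."""

    def __init__(self, sat_sets, gct_sets, ways):
        self.sat_sets = sat_sets
        self.ways = ways
        total = sat_sets + gct_sets
        self.sets = [[] for _ in range(total)]  # SAT sets first, then GCT sets
        self.ptr = [None] * total               # chain pointer per set
        self.gct_head = sat_sets                # next unclaimed GCT set

    def insert(self, line_addr, entry):
        """Place an entry, chaining into the GCT on overflow. False if full."""
        idx = line_addr % self.sat_sets         # assumed index function
        while True:
            if len(self.sets[idx]) < self.ways:
                self.sets[idx].append((line_addr, entry))
                return True
            if self.ptr[idx] is None:
                if self.gct_head == len(self.sets):
                    return False                # no GCT set left to claim
                self.ptr[idx] = self.gct_head   # claim the next GCT set
                self.gct_head += 1
            idx = self.ptr[idx]

    def lookup(self, line_addr):
        """Return (entries for the line, number of sets accessed)."""
        idx = line_addr % self.sat_sets
        found, accesses = [], 0
        while idx is not None:                  # walk the whole chain
            accesses += 1
            found += [e for a, e in self.sets[idx] if a == line_addr]
            idx = self.ptr[idx]
        return found, accesses


pool = GECPool(sat_sets=4, gct_sets=2, ways=2)
for addr, entry in [(0, "a"), (4, "b"), (8, "c")]:   # all map to SAT set 0
    pool.insert(addr, entry)
print(pool.lookup(8))
```

Only sets that actually overflow pay for a second access, which is the intuition behind the "1+epsilon" latency in the table above.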

Solving Other Two Problems
2. Fine-Grained Allocation for effectively utilizing ECP entries
- Each GEC entry has only ECP-1. Each line can have multiple GEC entries.
- We guarantee that all entries are in the same set of the SAT/GCT.
- A faulty line can get more than ECP-6 as well.
3. Local Error Correction (LEC) for low latency in the common case
- Each line has a dedicated ECP-1 (handles 95% of lines).
- Ensures that extra accesses (to the GEC) are needed for only a few lines.

PAYG: Tying it All Together
PAYG performs on-demand allocation of error correction entries.
PAYG has 3 levels. LEC is the first line of defense (lowers latency). SAT is the second and GCT is the third (flexible).
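The on-demand allocation flow can be sketched as below. This is a simplified model under stated assumptions: the GEC pool is a plain counter and dictionary standing in for the SAT and GCT, OFB is read as a per-line overflow flag, and all names are hypothetical.

```python
class PAYG:
    """Toy model: one local ECP-1 per line, then entries from a shared pool."""

    def __init__(self, gec_capacity):
        self.lec = {}                 # line address -> fault held by local ECP-1
        self.ofb = set()              # lines whose overflow flag is set
        self.gec = {}                 # line address -> faults held in the pool
        self.gec_free = gec_capacity  # remaining pool entries (ECP-1 each)

    def record_fault(self, addr, position):
        """Allocate correction for a new hard fault; return the level used."""
        if addr not in self.lec:
            self.lec[addr] = position          # first fault: local entry
            return "LEC"
        if self.gec_free == 0:
            return None                        # pool exhausted: uncorrectable
        self.gec_free -= 1
        self.ofb.add(addr)                     # mark that the pool must be read
        self.gec.setdefault(addr, []).append(position)
        return "GEC"

    def corrections_for(self, addr):
        """Return (fault positions to patch, extra accesses needed)."""
        positions = [self.lec[addr]] if addr in self.lec else []
        if addr in self.ofb:
            return positions + self.gec[addr], 1
        return positions, 0


payg = PAYG(gec_capacity=2)
print(payg.record_fault(0, 3), payg.record_fault(0, 7), payg.record_fault(1, 1))
print(payg.corrections_for(0), payg.corrections_for(1))
```

Lines with at most one fault never touch the pool, so only the few lines with multiple faults pay for an extra access.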

Evaluation Settings
Assumptions:
1. Mean writes 32 million, SDEV = 20%, no correlation
2. Perfect wear leveling, so all lines get the same number of writes
3. Writes are converted into write-read pairs to detect faults
Configuration:
- PCM bank of 1GB with 64B lines, so 16 million lines per bank
- Write latency of 1 microsecond
- At 100% write traffic, lifetime is 18 years (if zero variance)
Figure of merit: Uniform ECP-6 gets 35% of ideal lifetime, so 6.5 years. We report lifetime with respect to Uniform ECP-6.
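The 18-year ideal lifetime follows from the configuration. The sketch below assumes that "32 million" means 2**25 writes per line, that writes to the bank are serialized, and that a year has 365 days; these are reading assumptions, not statements from the slides.

```python
BANK_BYTES = 1 << 30               # 1GB bank
LINE_BYTES = 64
MEAN_WRITES_PER_LINE = 1 << 25     # assumption: "32 million" read as 2**25
WRITE_LATENCY_S = 1e-6             # 1 microsecond per write
SECONDS_PER_YEAR = 365 * 24 * 3600

lines = BANK_BYTES // LINE_BYTES   # 2**24 lines, about 16 million

# With perfect wear leveling and zero variance, every line absorbs its full
# endurance, and the bank is written continuously, one line at a time.
ideal_seconds = lines * MEAN_WRITES_PER_LINE * WRITE_LATENCY_S
ideal_years = ideal_seconds / SECONDS_PER_YEAR

print(f"{lines} lines, ideal lifetime of about {ideal_years:.1f} years")
```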

Importance of Scalable GEC Pool
[Figure: panels with axes "Num SAT Sets" and "Num GCT Sets (SAT Sets=128K)"; configurations NoFGA-NoGCT and NoFGA-wGCT.]
Total sets: 128K + 64K = 192K.
The proposed structure reduces the storage overhead of GEC by more than 5X.

Importance of Fine-Grained Alloc.
[Table: "Num ECP Entries in Each GEC Entry" takes the values 5, 4, 3, 2, 1; the rows "Num GEC Entry per Set (64B line)" and "Total ECP Entries per Set" have lost their values.]
Fine-Grained Allocation improves the effectiveness of PAYG.

Importance of LEC
We can get a higher lifetime by increasing the GEC size, but we still need LEC.
For the first 5 years, PAYG incurs on average 1 extra access for < 0.4% of accesses.
Without LEC, the latency impact is significant. With LEC, not so much.

Storage Overhead
LEC storage: 13 bits/line (10-bit ECP + 1 valid + 2 OFB)
GEC storage: 6.5 bits/line on average
Total: 19.5 bits/line
Scheme                      Storage Overhead (bits/line)   Lifetime
Uniform ECP-6               61                             1X
Uniform ECP (values missing)
PAYG with ECP-1 in LEC      19.5                           1.13X
PAYG provides lifetime similar to ECP-8 at 3.1X less storage than ECP-6.
(Total storage overhead to protect 1GB reduces from 122MB to 39MB, down 83MB.)
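The storage figures on this slide can be reproduced with a few lines of arithmetic. The sketch assumes a 1GB bank of 64B lines, as in the evaluation settings, and counts a MB as 2**20 bytes.

```python
LINES = (1 << 30) // 64            # lines in a 1GB bank of 64B lines

LEC_BITS = 10 + 1 + 2              # 10-bit ECP entry + valid bit + 2 OFB bits
GEC_BITS_AVG = 6.5                 # average GEC share per line, from the slide
PAYG_BITS = LEC_BITS + GEC_BITS_AVG
ECP6_BITS = 10 * 6 + 1             # uniform ECP-6: (10N + 1) bits with N = 6


def overhead_mb(bits_per_line):
    """Total correction storage for the bank, in MB (2**20 bytes)."""
    return bits_per_line * LINES / 8 / (1 << 20)


print(f"PAYG: {PAYG_BITS} bits/line, {overhead_mb(PAYG_BITS):.0f}MB")
print(f"ECP-6: {ECP6_BITS} bits/line, {overhead_mb(ECP6_BITS):.0f}MB")
print(f"reduction: {ECP6_BITS / PAYG_BITS:.1f}X")
```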

Efficient Single Bit Correction
LEC is responsible for most of the storage overhead (13 bits out of 19.5 bits).
We need efficient schemes for single-bit hard faults: Alternate Data Retry (ADR).
ADR: mask a hard fault by storing data in either normal or inverted form.
[Figure: a cell stuck at 0 (SA-0) storing a 0 and a 1, with the INV bit recording whether the line is stored inverted.]
ADR needs only 1 bit to mask a single stuck-at fault (caveat: double write).
Reduce the storage overhead of PAYG by using ADR instead of ECP-1 in LEC.
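A minimal sketch of the ADR idea is shown below. The line model, function names, and the use of a read-back comparison as the checker read are illustrative assumptions.

```python
class StuckLine:
    """A memory line in which some cells are stuck at a fixed value."""

    def __init__(self, width, stuck=None):
        self.cells = [0] * width
        self.stuck = stuck or {}          # cell position -> stuck value

    def raw_write(self, bits):
        # Stuck cells ignore the written value and keep their stuck value.
        self.cells = [self.stuck.get(i, b) for i, b in enumerate(bits)]

    def raw_read(self):
        return list(self.cells)


def adr_write(line, data):
    """Write data, retrying in inverted form on a mismatch. Returns INV bit."""
    line.raw_write(data)
    if line.raw_read() == data:           # checker read: normal form worked
        return 0
    inverted = [b ^ 1 for b in data]
    line.raw_write(inverted)              # second write: the double-write caveat
    if line.raw_read() == inverted:
        return 1
    raise ValueError("faults conflict in both normal and inverted form")


def adr_read(line, inv):
    """Read the line and undo the inversion if the INV bit is set."""
    return [b ^ inv for b in line.raw_read()]


line = StuckLine(8, stuck={2: 0})         # cell 2 is stuck at 0
data = [1, 1, 1, 0, 0, 0, 1, 0]           # data wants a 1 in cell 2
inv = adr_write(line, data)
print(inv, adr_read(line, inv) == data)
```

A single stuck cell always agrees with either the data or its inverse, which is why one INV bit suffices. Two stuck cells can disagree with both forms, matching the later remark that ADR is hard to scale to multiple faults.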

Comparisons
Scheme                      Storage Overhead (bits/line)   Lifetime
Uniform ECP-6               61                             1X
Uniform ECP (values missing)
PAYG with ECP-1 in LEC      (missing)                      (missing)
PAYG with ADR in LEC        (missing)                      (missing)
PAYG with heterogeneous error correction reduces storage by 6X.
It is hard to scale ADR to multiple faults. SAFER [MICRO'10] partitions lines with multiple faults into single-bit faults. SAFER needs 55 bits/line and gives a lifetime of ~ECP-6.

Non Uniform Error Correction
- Variable Strength ECC (VS-ECC) by Alameldeen+ ISCA'11
  Proposed for cache reliability at low voltages.
  Each way has ECC-4 for one quarter of ways, allocated based on testing.
  Difference: cache line disabling works. Only set associative structure.
- Layered ECP by Schechter+ ISCA'10
  ECP-1 for each line, and some ECP entries for each page.
  In essence, this is a set-associative GEC with ECP-1 in LEC.
  Difference: set associative GEC requires 5X more entries (inefficient).
- Line Sparing with FREE-p by Hyun+ HPCA'11
  A faulty line is remapped to a spare area using an embedded pointer.
  Sparing needs 1 good line for 1 uncorrectable fault.
  Difference: PAYG is much more storage efficient than sparing.

FREE-p: Sparing vs. Correction
For 1 extra error bit, PAYG needs a 20-bit GEC entry, while FREE-p needs 512 bits.
PAYG is more effective than line sparing with FREE-p.

Summary
- PCM has limited endurance, and variability across cells reduces lifetime.
- We need to correct many (six) errors per line.
- Uniform allocation is expensive and inefficient (only 0.3 out of 6 entries used).
- Pay-As-You-Go (PAYG): allocate error correction entries on demand.
- PAYG has LEC + GEC Pool (Set Associative Table + Global Collision Table).
- It provides 1.13X lifetime compared to ECP-6 at 3.1X lower overhead.
- The heterogeneous scheme (ADR for LEC) reduces storage by 6X.
- PAYG is useful for efficient hard-error correction in other technologies too.