ArchShield: Architectural Framework for Assisting DRAM Scaling by Tolerating High Error-Rates
Prashant Nair, Dae-Hyun Kim, Moinuddin K. Qureshi

Introduction
DRAM has been the basic building block of main memory for four decades. DRAM scaling provides higher memory capacity: moving to a smaller technology node provides roughly 2x the capacity. However, shrinking DRAM cells is becoming difficult, which threatens scaling. [Figure: scaling feasibility versus technology node, with and without efficient error mitigation.]
Takeaway: efficient error handling can help DRAM technology scale.

Why is DRAM Scaling Difficult?
Scaling is difficult in general, and more so for DRAM cells. The cell capacitance must remain roughly constant (~25 fF), so the capacitor volume cannot shrink. Scaling reduces each dimension by 0.7x (0.5x area), which forces the capacitor height to roughly double. With scaling, DRAM cells therefore become not only narrower but also taller.

DRAM: Aspect Ratio Trend
As the aspect ratio grows, the narrow cylindrical cells become mechanically unstable and can break.

More Reasons for DRAM Faults
Beyond the unreliability of ultra-thin dielectric material, DRAM cell failures also arise from permanently leaky cells (charge leaks from the capacitor), mechanically unstable cells (capacitors tilting toward ground), and broken links in the DRAM array. Permanent faults in future DRAMs are expected to be much higher; we target an error rate as high as 100 ppm.

Outline Introduction Current Schemes ArchShield Evaluation Summary

Row and Column Sparing
DRAM chips (organized into rows and columns) include spare rows and columns, and laser fuses are used to map faulty rows/columns onto the spares. However, an entire row or column must be sacrificed for just a few faulty cells. [Figure: DRAM chip before and after row/column sparing, showing faults, spare rows/columns, and deactivated rows/columns.]
Takeaway: row and column sparing schemes have large area overheads.

Commodity ECC-DIMM
Commodity ECC DIMMs provide SECDED protection at 8-byte granularity (72,64) and are mainly used for soft-error protection. If SECDED is instead spent on hard errors, there is a high chance of two errors falling in the same word (the birthday paradox). An 8GB DIMM has about 1 billion words, so the expected number of errors before some word holds two errors is roughly 1.25*sqrt(N) ≈ 40K errors, i.e., only about 0.5 ppm.
Takeaway: SECDED alone is not enough for high error rates (and spending it on hard errors forfeits soft-error protection).
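
A minimal sketch (my own, not from the talk) that checks the birthday-paradox estimate above: scatter random single-bit hard errors over the ~1 billion 8-byte words of an 8GB DIMM and count how many land before two share a word. The constants are taken from the slide.

```python
import math, random

WORDS = 1 << 30                 # ~1 billion 8-byte words in an 8GB DIMM
BITS = WORDS * 64               # total data bits in the DIMM

# Analytic estimate: expected faulty bits before some word accumulates two of them.
analytic = 1.25 * math.sqrt(WORDS)
print(f"analytic: ~{analytic:,.0f} errors (~{analytic / BITS * 1e6:.2f} ppm)")

def errors_until_double_error_word(trials=20):
    """Monte Carlo: place faulty bits in random words until one word receives two."""
    counts = []
    for _ in range(trials):
        hit, n = set(), 0
        while True:
            w = random.randrange(WORDS)
            n += 1
            if w in hit:        # second error in the same word -> SECDED is defeated
                counts.append(n)
                break
            hit.add(w)
    return sum(counts) / len(counts)

print(f"simulated: ~{errors_until_double_error_word():,.0f} errors")
```

Both estimates come out near 40K errors, well under one faulty bit per million, which is why SECDED alone cannot cover a 100 ppm error rate.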

Strong ECC
Strong ECC (BCH) codes are robust, but complex and costly: every memory reference incurs encoding/decoding latency, and for a BER of 100 ppm we would need ECC-4, a 50% storage overhead. [Figure: memory controller with a strong ECC encoder/decoder placed between memory requests and the DRAM memory system.]
Takeaway: strong ECC codes are an inefficient solution for tolerating errors.

Dissecting Fault Probabilities
At a bit error rate of 10^-4 (100 ppm), for an 8GB DIMM (~1 billion 8-byte words):

Faulty bits per word (8B)   Probability    Words in 8GB
0                           99.3%          0.99 billion
1                           0.007          7.7 million
2                           26 x 10^-6     28 K
3                           62 x 10^-9     67
4                           10^-10         0.1

Most faulty words have a 1-bit error → this skew in fault probability can be leveraged to develop low-cost error resilience.
Goal: tolerate high error rates with a commodity ECC DIMM while retaining soft-error resilience.
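
A minimal sketch reproducing the table from a binomial model. The only assumptions I add are that bit faults are independent and that each protected word spans 72 bits (64 data + 8 SECDED bits), which matches the numbers on the slide.

```python
from math import comb

BER = 1e-4              # bit error rate of 100 ppm
BITS_PER_WORD = 72      # assumed: 64 data + 8 SECDED bits per 8B word
WORDS = 1 << 30         # ~1 billion words in an 8GB DIMM

for k in range(5):
    p = comb(BITS_PER_WORD, k) * BER**k * (1 - BER)**(BITS_PER_WORD - k)
    print(f"{k} faulty bit(s): probability {p:.3g}, ~{p * WORDS:,.0f} words")
```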

Outline Introduction Current Schemes ArchShield Evaluation Summary

ArchShield: Overview
ArchShield is inspired by Solid State Drives (SSDs), which already tolerate high bit-error rates. Faulty-cell information is exposed to the architecture layer through runtime testing. Most words will be error-free; a 1-bit error is handled with SECDED; a multi-bit error is handled with replication. ArchShield keeps its error-mitigation information, a Fault Map (which is cacheable) and a Replication Area, in main memory.

ArchShield: Runtime Testing
When the DIMM is configured, runtime testing is performed and each 8B word is classified into one of three types:
No Error: replication is not needed; SECDED can correct a soft error.
1-bit Error: SECDED can correct the hard error, but the word needs replication to still cover a soft error.
Multi-bit Error: the word is decommissioned and only its replica is used.
(Information about the faulty cells can be stored on the hard drive for future use.) Runtime testing thus identifies the faulty cells and decides how each word is corrected.
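
A minimal sketch (hypothetical helper, not from the paper) of that classification step: given the number of faulty bits runtime testing found in an 8B word, pick how ArchShield treats it.

```python
def classify_word(faulty_bits: int) -> str:
    """Map a word's hard-fault count to its ArchShield handling."""
    if faulty_bits == 0:
        return "NO_ERROR"        # no replica; SECDED stays free for soft errors
    if faulty_bits == 1:
        return "ONE_BIT_ERROR"   # SECDED corrects the hard error; replicate for soft-error cover
    return "MULTI_BIT_ERROR"     # decommission the word; only its replica is used

assert classify_word(0) == "NO_ERROR"
assert classify_word(1) == "ONE_BIT_ERROR"
assert classify_word(3) == "MULTI_BIT_ERROR"
```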

Architecting the Fault Map
The Fault Map (FM) stores information about the faulty cells. A per-word FM is expensive (2 bits per 8B word, or 4 bits with redundancy), so ArchShield keeps one FM entry per line (4 bits per 64B line). The FM is accessed with a table lookup indexed by the line address. To avoid a second memory access on every reference, FM entries are cached in the on-chip LLC: each 64-byte line holds 128 FM entries, which exploits spatial locality (a sketch of the indexing follows below).
Takeaway: a line-level Fault Map plus caching provides low storage overhead and low latency.
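
A minimal sketch of that indexing (the base address and helper function are hypothetical; the 4-bits-per-64B-line layout follows the slide): consecutive data lines share a Fault Map line, so one cached 64B FM line covers 128 neighbouring data lines.

```python
LINE_SIZE = 64                                      # bytes per cache line
FM_BITS = 4                                         # Fault Map bits per data line
ENTRIES_PER_FM_LINE = (LINE_SIZE * 8) // FM_BITS    # = 128 entries per 64B FM line
FM_BASE = 0x1F000_0000                              # hypothetical base of the Fault Map region

def fault_map_location(phys_addr: int):
    """Return (Fault Map line address, bit offset) of the entry covering phys_addr."""
    line_addr = phys_addr // LINE_SIZE
    fm_line = FM_BASE + (line_addr // ENTRIES_PER_FM_LINE) * LINE_SIZE
    bit_offset = (line_addr % ENTRIES_PER_FM_LINE) * FM_BITS
    return fm_line, bit_offset

print(fault_map_location(0x0000))   # first data line -> first 4-bit slot of the first FM line
print(fault_map_location(0x0040))   # next data line  -> same FM line, next 4-bit slot
```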

Architecting the Replication Area
Words with faulty cells are replicated at word granularity in the Replication Area. A fully associative Replication Area would have prohibitive lookup latency, while a set-associative Replication Area suffers from the set-overflow problem. [Figure: fully associative versus set-associative organizations of the Replication Area; with sets, a set can overflow.]

Overflow of a Set-Associative RA
There are tens to hundreds of thousands of sets, and any one of them could overflow. How many entries are used before the first set overflows? A buckets-and-balls analysis shows that a 6-way table is only 8% full when one set overflows, so a plain set-associative RA would need roughly 12x as many entries as faults (a small simulation is sketched below).
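
A minimal buckets-and-balls sketch (the set count and trial count are my assumptions for illustration): throw replicated words ("balls") into random sets ("buckets") of a 6-way table and record how full the table is when the first set overflows.

```python
import random

def utilization_at_first_overflow(num_sets=500_000, ways=6, trials=5):
    """Average fraction of table entries in use when the first set would overflow."""
    fractions = []
    for _ in range(trials):
        occupancy = [0] * num_sets
        placed = 0
        while True:
            s = random.randrange(num_sets)
            if occupancy[s] == ways:            # this set is full: the next insert overflows
                fractions.append(placed / (num_sets * ways))
                break
            occupancy[s] += 1
            placed += 1
    return sum(fractions) / len(fractions)

# Expect a value near the slide's ~8% (it falls slowly as the number of sets grows),
# which is why a plain set-associative Replication Area needs ~12x over-provisioning.
print(f"table utilization at first overflow: {utilization_at_first_overflow():.1%}")
```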

A Scalable Structure for the RA
Each Replication Area set is augmented with an overflow bit (OFB) and a pointer (PTR). When a set fills up, its OFB is set and its PTR redirects further entries to one of the dedicated overflow sets (the figure shows 16 sets backed by 16 overflow sets); an overflow set may already be taken by some other set.
Takeaway: with overflow sets, the Replication Area can handle non-uniformity in the fault distribution.

ArchShield: Operation
Main memory is partitioned into OS-usable memory (7.7GB), the Fault Map (64MB), and the Replication Area (256MB). On a last-level cache miss, the read transaction first queries the Fault Map entry for the line. On a Fault Map hit that reports no faulty words, the read completes normally; on a hit that reports a faulty word, a second read transaction fetches the replicated word from the Replication Area; on a Fault Map miss, the Fault Map line is first fetched from memory. Each cached line carries an R-bit recording whether it has a replica: a write request checks the R-bit and writes to two locations if it is set, otherwise to one (a simplified read/write flow is sketched below).
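
A minimal control-flow sketch of that access path. The classes and method names are hypothetical stand-ins; only the decision structure follows the slide (query the Fault Map on a miss, use the replica for decommissioned words, double-write when the R-bit is set). Timing, set lookup, and overflow handling are omitted.

```python
NO_ERROR, ONE_BIT, MULTI_BIT = 0, 1, 2

class Memory:
    """Hypothetical backing store: data words, their replicas, and Fault Map entries."""
    def __init__(self):
        self.data, self.replicas, self.fault_map = {}, {}, {}
    def read(self, addr):               return self.data.get(addr, 0)
    def write(self, addr, value):       self.data[addr] = value
    def read_replica(self, addr):       return self.replicas.get(addr, 0)
    def write_replica(self, addr, v):   self.replicas[addr] = v
    def fault_map_entry(self, addr):    return self.fault_map.get(addr, NO_ERROR)

class LLC:
    """Hypothetical last-level cache state: cached Fault Map entries and per-line R-bits."""
    def __init__(self):
        self.fm_cache, self.r_bits = {}, {}

def read_line(addr, llc, mem):
    """LLC miss: query the Fault Map; decommissioned (multi-bit) words read the replica."""
    entry = llc.fm_cache.get(addr)
    if entry is None:                          # Fault Map miss -> fetch the entry from memory
        entry = mem.fault_map_entry(addr)
        llc.fm_cache[addr] = entry
    llc.r_bits[addr] = entry != NO_ERROR       # R-bit: this line also has a replica
    if entry == MULTI_BIT:                     # word decommissioned: only the replica is used
        return mem.read_replica(addr)
    return mem.read(addr)                      # no-error / 1-bit words are read in place

def write_line(addr, value, llc, mem):
    """Write a line; if its R-bit is set, update both the original and the replica."""
    mem.write(addr, value)
    if llc.r_bits.get(addr):
        mem.write_replica(addr, value)

mem, llc = Memory(), LLC()
mem.fault_map[0x40] = ONE_BIT        # one line holds a word with a 1-bit hard error
read_line(0x40, llc, mem)            # establishes the R-bit for this line
write_line(0x40, 0xBEEF, llc, mem)   # R-bit set -> the replica is updated as well
print(hex(mem.read_replica(0x40)))   # -> 0xbeef
```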

Outline Introduction Current Schemes ArchShield Evaluation Summary

Experimental Evaluation
Configuration: 8-core CMP with a shared 8MB LLC; 8GB DIMM over two DDR3-1600 channels.
Workloads: SPEC CPU2006 suite in rate mode.
Assumptions: bit error rate of 100 ppm (random faults).
Performance metric: execution time normalized to a fault-free baseline.

Execution Time
There are two sources of slowdown: Fault Map accesses and Replication Area accesses. [Figure: normalized execution time per workload, grouped by low and high MPKI.] On average, ArchShield causes a 1% slowdown.

Fault Map Hit-Rate
[Figure: Fault Map hit-rate in the LLC per workload, grouped into high-MPKI and low-MPKI benchmarks.] The hit rate of the Fault Map in the LLC is high: 95% on average.

Analysis of Memory Operations

Transaction   1 Access (%)   2 Accesses (%)   3 Accesses (%)
Reads         72.1           0.02             ~0
Writes        22.1           1.2              0.05
Fault Map     4.5            N/A              N/A
Overall       98.75

Only 1.2% of the total accesses use the Replication Area, and Fault Map traffic accounts for less than 5% of all traffic.

Comparison With Other Schemes
[Figure: execution time of FREE-p, ECC-4, and ArchShield, grouped by low and high MPKI.] The read-before-write of FREE-p combined with strong ECC leads to high latency, and ECC-4 incurs decoding delay. ArchShield has the smallest impact on execution time.

Outline Introduction Current Schemes ArchShield Evaluation Summary

Summary
The DRAM scaling challenge brings high fault rates, and current schemes are limited. We propose exposing DRAM errors to the architecture layer: ArchShield. ArchShield uses an efficient Fault Map and selective word replication, and it handles a bit error rate of 100 ppm with less than 4% storage overhead and a 1% slowdown. ArchShield can also be used to relax DRAM refresh by 16x (to a 1-second refresh interval).

Questions

Monte Carlo Simulation
[Figure: probability that a structure is unable to handle a given number of errors (in millions).] We recommend the structure with 16 overflow sets, which tolerates 7.74 million errors in the DIMM.

ArchShield Flowchart
[Figure: ArchShield flowchart; entries in bold are frequently accessed.]

Memory System with ArchShield

Memory Footprint

ArchShield compared to RAIDR