Presentation transcript:

LOT-ECC: LOcalized and Tiered Reliability Mechanisms for Commodity Memory Systems
Ani Udipi§, Naveen Muralimanohar*, Rajeev Balasubramonian, Al Davis, Norm Jouppi*
University of Utah and *HP Labs; §currently with ARM

Memory Reliability
Datacenters are the backbone of the web-connected infrastructure
  – Reliability is essential
Memory reliability is a major concern [Schroeder et al., SIGMETRICS '09]
  – Among the most error-prone parts of a server
  – Even a few uncorrectable errors will require DIMM replacement
    – Ranks near the top of component replacements in datacenters (source: Nagios)
    – Increases downtime
    – Increases operational cost

Some Numbers
A single server blade:
  2 billion DRAM cells per chip
  × 36 DRAM chips per DIMM
  × 2 DIMMs per channel
  × 4 channels per processor
  × 4 processors per blade
  = ~2.5 × 10^12 DRAM cells
A datacenter:
  16 blades per enclosure
  × 4 enclosures per rack
  × 10 racks per container
  × 40 containers per datacenter
  = ~6.4 × 10^16 DRAM cells
Assume the MTTF per cell is the age of the universe, ~14 billion years:
  Blade DRAM MTTF ≈ 2 days
  Datacenter DRAM MTTF ≈ 7 seconds
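As a quick sanity check of these figures, here is a minimal Python sketch of the scaling arithmetic. It assumes independent cell failures, so the aggregate MTTF is simply the per-cell MTTF divided by the cell count; the ~14-billion-year per-cell MTTF is the slide's own assumption.

    # Back-of-the-envelope MTTF scaling from the configuration on this slide.
    # Assumes independent cell failures: aggregate MTTF = per-cell MTTF / cell count.

    CELL_MTTF_YEARS = 14e9  # slide's assumption: roughly the age of the universe

    cells_per_blade = 2e9 * 36 * 2 * 4 * 4   # cells/chip * chips/DIMM * DIMMs/channel * channels/CPU * CPUs/blade
    blades_per_dc   = 16 * 4 * 10 * 40       # blades/enclosure * enclosures/rack * racks/container * containers/DC
    cells_per_dc    = cells_per_blade * blades_per_dc

    blade_mttf_days = CELL_MTTF_YEARS / cells_per_blade * 365
    dc_mttf_seconds = CELL_MTTF_YEARS / cells_per_dc * 365 * 24 * 3600

    print(f"Blade:      {cells_per_blade:.1e} cells -> MTTF ~ {blade_mttf_days:.1f} days")
    print(f"Datacenter: {cells_per_dc:.1e} cells -> MTTF ~ {dc_mttf_seconds:.1f} seconds")
    # Matches the slide's rounded figures: ~2.5e12 and ~6.4e16 cells, ~2 days, ~7 seconds.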

Target Reliability
High-end servers commonly have high reliability expectations
  – Single Symbol Correct, Double Symbol Detect (SSC-DSD)
  – One symbol == one DRAM chip ("Chipkill")
Today's systems employ symbol-based ECC codes

Problems with Existing Solutions
Increased access granularity
  – Every data access is spread across 36 DRAM chips
  – JEDEC standards define a minimum access granularity per chip
  – Massive overfetch of data at multiple levels
    – Wastes energy
    – Wastes bandwidth
    – Reduces rank-level parallelism
x4 device-width restriction
  – Fewer ranks for a given DIMM real estate
Reliability level: 1 failed chip out of 36

A New Approach: LOT-ECC
Operate on a single rank of x8 memory: 9 chips
  – And support 1 failed chip out of 9
Multiple tiers of localized protection
  – Tier 1: Local Error Detection (checksum)
  – Tier 2: Global Error Correction (parity)
  – Tiers 3 and 4 (T3, T4) handle specific failure cases
Data mapping handled by the memory controller with firmware support
  – Transparent to the OS, caches, etc.
  – Strictly commodity DRAM used
Significant power and performance benefits

Tier 1 – Local Error Detection (LED)
Standard x72 DIMM (nine x8 parts): eight data + one ECC
We use all 9 chips for both data and ECC
64 bits per chip per burst: 57 data + 7 checksum
(Figure: 57-bit data + 7-bit LED fields laid out on each of Chip 0 through Chip 8)

Tier 1 – Local Error Detection (LED) 57 bits * 9 = 513 –Only 1 cache line read at a time –57 bits/chip on first 8 chips; 56 bits on 9 th chip  1 bit extra on the 9 th chip –Use in a different tier of protection No performance impact on memory reads or writes –LED ops occur in parallel with data ops Note that LED is local to each chip –Need to pin-point exact failed chip, not simply detect an error in the rank 8

Tier 2 – Global Error Correction (GEC)
(Figure: per-chip layout of data, LED, and GEC fields across Chip 0 through Chip 8; each segment holds 57 data bits + 7 LED bits)
Legend:
  A, B, C, D, E, F, G, H – cache lines, each comprising segments X0 through X8
  L_XN – Tier-1 Local Error Detection for cache line X, segment N
  PX0:PX8 – Tier-2 Global Error Correction across segments X0 through X8
  PP_X – parity across the GEC segments of cache line X

The Devil is in the Details...
...and the details are in the paper!
Need to detect and correct additional errors in the GEC region
  – Parity is 57 bits; write granularity is 72 bits
  – Use the remaining 15 bits wisely: add two more tiers of protection (T3, T4)
(Figure: per-chip GEC word layout – 7-bit GEC pieces PA0-6, PA7-13, ..., PA56, T4 bits, and PP_A across Chip 0 through Chip 8, with the surplus bit borrowed from data + LED)

Optimizing Write Behavior
Every write has to update its GEC bits
  – Already borrowing one bit from [data + LED] for use in the GEC
  – Put them all in the same DRAM row!
  – Guaranteed row-buffer hit
  – Data mapping handled by the memory controller
(Figure: row layout – the 57-bit + 7-bit data/LED segments of cache lines A through H on each chip, followed by the GEC columns PA0-6 ... PH0-6 and the parities PP_A ... PP_H, all co-located in the same DRAM row)

GEC Coalescing
A DDR3 burst of 8 forces 72 bytes per access
  – GEC per cache line is only 72 bits
With sufficient locality, one GEC write can potentially cover 8 data writes
  – In reality, each write becomes 1 + δ writes (for 0.125 ≤ δ ≤ 1)
  – Note that even with δ = 1, the benefits of the row-buffer hit remain
Writes are typically buffered at the memory controller to avoid bus-turnaround overheads
  – The controller can reorder accesses to accommodate coalescing
Results show three cases: basic design (δ = 1), simple coalescing (measured δ), and oracular design (δ = 0.125); see the sketch below
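The following is an illustrative sketch of the coalescing bookkeeping, not the paper's implementation: it assumes a write buffer at the memory controller and that the 72-bit GEC words of eight consecutive cache lines share one 72-byte burst, so buffered writes falling in the same eight-line group can share a single GEC update.

    # Hypothetical model of GEC coalescing in a buffered write batch.
    GROUP = 8  # cache lines whose 72-bit GEC words fit in one 72-byte DDR3 burst

    def coalesced_writes(cache_line_addrs):
        """Return (data_writes, gec_writes, delta) for one buffered batch."""
        data_writes = len(cache_line_addrs)
        gec_writes = len({addr // GROUP for addr in cache_line_addrs})  # one burst per touched group
        delta = gec_writes / data_writes        # each data write becomes 1 + delta writes
        return data_writes, gec_writes, delta

    print(coalesced_writes(range(64)))          # sequential batch -> (64, 8, 0.125), the oracular delta
    print(coalesced_writes([0, 9, 18, 27]))     # scattered batch  -> (4, 4, 1.0), no coalescing benefit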

Constructing the LED Code
Use a 7-bit ECC code to detect errors in 57 data bits
  – We choose a 7-bit 1's complement checksum
The paper details the code's operation and computes FIT rates
  – Single-bit, double-bit, row, column, row-column, pin, chip, multiple random errors, and combinations
Very small rate of undetected errors
  – Caused by very specific, uncommon bit-flip combinations
  – Less than 5E-5 FIT!
Captures some failure modes NOT captured by existing mechanisms (failure of 2 chips out of 18, errors in more than 2 chips per rank, etc.); a checksum sketch follows below
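The paper specifies a 7-bit one's-complement checksum over each 57-bit data segment; the Python sketch below shows one plausible formulation, splitting the segment into 7-bit words and summing with end-around carry. The word split and zero-padding are assumptions made for illustration; the paper defines the exact code and its FIT analysis.

    # Sketch of a 7-bit one's-complement checksum over a chip's 57-bit data segment (Tier-1 LED).
    WORD = 7
    MASK = (1 << WORD) - 1   # 0x7F

    def led_checksum(data57: int) -> int:
        """One's-complement sum of the 57-bit value, taken 7 bits at a time."""
        assert 0 <= data57 < (1 << 57)
        s = 0
        for shift in range(0, 57, WORD):         # 9 words: 8 full, the last one zero-padded
            s += (data57 >> shift) & MASK
            s = (s & MASK) + (s >> WORD)         # end-around carry keeps the sum in 7 bits
        return (~s) & MASK                       # store the complement alongside the data

    def led_check(data57: int, stored_checksum: int) -> bool:
        """True if the 64-bit on-chip segment (57 data + 7 checksum bits) appears error-free."""
        return led_checksum(data57) == stored_checksum

    data = 0x123456789ABCDEF & ((1 << 57) - 1)
    csum = led_checksum(data)
    assert led_check(data, csum)
    assert not led_check(data ^ (1 << 40), csum)   # a single-bit flip in the data is detected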

Checksum Design
Not all error combinations actually occur in DRAM
  – A small number of failure modes with specific root causes
  – The code's effectiveness under those failures is what matters
Current symbol-based codes guarantee capturing 100% of SSC-DSD errors
  – At huge power and performance penalties
  – Likely overkill
Not scalable as error rates increase
  – Use strong yet practical codes + RAS features
  – Example: proactive patrol scrubbing will capture a majority of soft errors; these may not coincide with hard errors

Evaluation Methodology
Performance analysis: in-house DRAM simulator
  – Models refresh, address/command bus, data bus, bank/rank/channel contention, and read/write queues
Power analysis: Micron power calculator spreadsheet
  – Reflects the timing parameters assumed for the performance simulations
  – Bus-utilization and bank-utilization numbers obtained from the performance simulations
  – Accounts for activation power, read/write power, termination power, and background power
  – Includes low-power sleep modes

Evaluation Platforms
Xeon 7500-like system
  – 8 DDR3 channels, 2 DIMMs/channel
  – Dual-ranked x4 or quad-ranked x8 DIMMs
  – "Lockstep mode" is the only supported mode
    – Two ranks operate together to provide a 144-bit bus
    – Wasted bandwidth by masking out half the burst, OR
    – Forced prefetching
Also evaluate Xeon 5500-like systems
  – 3 DDR3 channels, 3 DIMMs/channel
  – "Lockstep mode" wastes one channel entirely and gangs the other two
Evaluate five design points for each platform:
  – Baseline symbol-based SSC-DSD
  – Virtualized ECC (Yoon & Erez, ASPLOS '10)
  – LOT-ECC with no coalescing, simple coalescing, and oracular coalescing

Power Results (chart)

Power Results (chart)

Performance Results
Latency reduction: LOT-ECC 4.6%; + GEC coalescing 7.7%; oracular 16.2%

Performance Results
Latency reduction: LOT-ECC 42.9%; + GEC coalescing 46.9%; oracular 57.3%

Storage Overhead
For each 64-byte cache line:
  – 63 bits of LED checksum
  – 57 bits of GEC parity
  – 7 bits of T3 code
  – 9 bits of T4 code
Total storage overhead of 26.5%
  – Current ECC implementations and DIMMs already accept 12.5% through the extra chip
  – The additional 14% is placed in data memory via firmware
Memory capacity is cheap as long as it is commodity
  – Better to spend capacity on this than on power/performance
(the arithmetic is checked in the sketch below)
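A quick Python check of the arithmetic on this slide; the bit counts are taken directly from the breakdown above.

    # Storage overhead per 64-byte (512-bit) cache line.
    data_bits = 64 * 8
    overhead_bits = 63 + 57 + 7 + 9              # LED + GEC + T3 + T4 = 136 bits
    total = overhead_bits / data_bits            # ~26.6%, which the slide rounds to 26.5%
    ecc_chip = 64 / data_bits                    # 12.5% already provided by the 9th chip on ECC DIMMs
    data_memory = total - ecc_chip               # ~14.1% carved out of data memory via firmware
    print(f"total {total:.1%} = ECC chip {ecc_chip:.1%} + data memory {data_memory:.1%}")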

Key Contributions
A multi-tiered protection design that keeps fault tolerance contained to fewer chips
A unique data layout tailored to the access mechanism of commodity DRAM systems
Exploits row-buffer efficiency
  – Co-locates data and all tiers of fault-tolerance codes
  – Mitigates the overhead of the additional writes typical of parity-based systems
A coalescing optimization to further minimize the impact of parity writes

Key Benefits
Power efficiency: fewer chips activated per access, reduced access granularity, and reduced static energy through better use of low-power modes (43% memory power savings)
Performance gains: more rank-level parallelism and reduced access granularity (7.7% memory latency reduction)
Improved protection: can handle 1 failed chip out of 9, compared to 1 in 36 currently
Flexibility: works with a single rank of x4 DRAMs or the more efficient x8 DRAMs
Implementation ease: changes to the memory controller and system firmware only; commodity processor, memory, and OS

BACKUP SLIDES

Tier 2 – Global Error Correction (GEC)
GEC is a parity written across the cache-line segments in each chip
LED has already pinpointed the erroneous segment
  – Error correction is trivial (see the sketch below)
Storing the parity
  – A portion of memory is set aside to hold the GEC
  – Handled by the memory controller + firmware
No impact on reads unless an error is detected
GEC is also self-contained (single cache line)
  – No read-before-write
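As a rough illustration of why correction is trivial once LED has identified the failed chip, here is a minimal Python sketch assuming the GEC word is a bitwise XOR parity across the nine per-chip segments (the slides say "parity"; the exact construction, including the T3/T4 handling, is in the paper).

    # Illustrative GEC reconstruction: XOR parity across a cache line's nine per-chip segments.
    from functools import reduce
    from operator import xor

    def gec_parity(segments):
        """Parity word across all of a cache line's per-chip segments."""
        return reduce(xor, segments, 0)

    def reconstruct(segments, parity, failed_chip):
        """Rebuild the segment of the chip flagged by Tier-1 LED."""
        survivors = (s for i, s in enumerate(segments) if i != failed_chip)
        return reduce(xor, survivors, parity)

    segments = [(i * 0x0123456789ABCD) & ((1 << 57) - 1) for i in range(9)]  # nine 57-bit segments
    parity = gec_parity(segments)
    assert reconstruct(segments, parity, failed_chip=3) == segments[3]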