LOT-ECC: LOcalized and Tiered Reliability Mechanisms for Commodity Memory Systems Ani Udipi § Naveen Muralimanohar* Rajeev Balasubramonian Al Davis Norm Jouppi * University of Utah and *HP Labs § Currently with ARM
Memory Reliability
Datacenters are the backbone of the web-connected infrastructure
  –Reliability is essential
Memory reliability is a major concern [Schroeder et al., SIGMETRICS ’09]
  –Among the most error-prone parts of a server
  –Even a few uncorrectable errors will require DIMM replacement
  –DIMM replacement ranks near the top of component replacements in datacenters
  –Increases downtime
  –Increases operational cost
Source: Nagios

Some Numbers
A single server blade
  2 billion DRAM cells per chip
  x 36 DRAM chips per DIMM
  x 2 DIMMs per channel
  x 4 channels per processor
  x 4 processors per blade
  = ~2.5 x 10^12 DRAM cells
Datacenter
  16 blades per enclosure
  x 4 enclosures per rack
  x 10 racks per container
  x 40 containers per datacenter
  = ~64 x 10^15 DRAM cells
Assume MTTF per cell is the age of the universe: ~14 billion years
  Blade DRAM MTTF = 2 days
  Datacenter DRAM MTTF = 7 seconds

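The arithmetic on this slide can be reproduced with a short sketch. It assumes independent cell failures at a constant rate, so that system MTTF is the per-cell MTTF divided by the cell count; that model and all names below are assumptions of this sketch, not taken from the slides.

```python
# Back-of-the-envelope DRAM MTTF. Assumption of this sketch: cells fail
# independently at a constant rate, so MTTF scales as 1 / (number of cells).
CELLS_PER_BLADE = 2e9 * 36 * 2 * 4 * 4      # cells/chip x chips x DIMMs x channels x processors
BLADES_PER_DATACENTER = 16 * 4 * 10 * 40    # blades x enclosures x racks x containers
CELL_MTTF_YEARS = 14e9                      # roughly the age of the universe

DAYS_PER_YEAR = 365.25
SECONDS_PER_DAY = 86_400


def system_mttf_years(cell_mttf_years: float, n_cells: float) -> float:
    """MTTF of n_cells independent cells, each with the given MTTF."""
    return cell_mttf_years / n_cells


datacenter_cells = CELLS_PER_BLADE * BLADES_PER_DATACENTER

blade_mttf_days = system_mttf_years(CELL_MTTF_YEARS, CELLS_PER_BLADE) * DAYS_PER_YEAR
datacenter_mttf_seconds = (
    system_mttf_years(CELL_MTTF_YEARS, datacenter_cells) * DAYS_PER_YEAR * SECONDS_PER_DAY
)

print(f"cells per blade:      {CELLS_PER_BLADE:.2e}")
print(f"cells per datacenter: {datacenter_cells:.2e}")
print(f"blade MTTF:           {blade_mttf_days:.1f} days")
print(f"datacenter MTTF:      {datacenter_mttf_seconds:.1f} seconds")
```

With the unrounded cell counts this gives a little over 2 days per blade and between 7 and 8 seconds per datacenter, in line with the rounded figures on the slide.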
Target Reliability
High-end servers commonly have high reliability expectations
  –Single Symbol Correct, Double Symbol Detect (SSC-DSD)
  –One symbol == one DRAM chip (“Chipkill”)
Today’s systems employ symbol-based ECC codes

Problems with Existing Solutions
Increased access granularity
  –Every data access is spread across 36 DRAM chips
  –JEDEC standards define minimum access granularity per chip
  –Massive overfetch of data at multiple levels
    Wastes energy
    Wastes bandwidth
Reduced rank-level parallelism
x4 device width restriction
  –Fewer ranks for given DIMM real estate
Reliability level: 1 failed chip out of 36

A New Approach: LOT-ECC
Operate on a single rank of x8 memory: 9 chips
  –and support 1 failed chip out of 9
Multiple tiers of localized protection
  –Tier 1: Local Error Detection (checksum)
  –Tier 2: Global Error Correction (parity)
  –Tiers 3 and 4 handle specific failure cases
Data mapping handled by memory controller with firmware support
  –Transparent to OS, caches, etc.
  –Strictly commodity DRAM used
Significant power and performance benefits

Tier 1 – Local Error Detection (LED)
Standard x72 DIMM (nine x8 parts): eight data + one ECC
We use all 9 chips for both data and ECC
64 bits per chip per burst: 57 data + 7 checksum
[Figure: the nine chips of the rank, Chip 0 through Chip 8]

Tier 1 – Local Error Detection (LED)
57 bits * 9 = 513, one bit more than a 512-bit cache line
  –Only 1 cache line read at a time
  –57 bits/chip on first 8 chips; 56 bits on 9th chip
  –1 bit extra on the 9th chip
    Use in a different tier of protection
No performance impact on memory reads or writes
  –LED ops occur in parallel with data ops
Note that LED is local to each chip
  –Need to pinpoint exact failed chip, not simply detect an error in the rank

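The 57/56-bit split can be illustrated with a minimal sketch that carves a 64-byte cache line into nine per-chip segments and reassembles it. The bit ordering (most significant bits go to Chip 0) and the function names are assumptions of this sketch; the slides do not specify them.

```python
# Per-chip data widths: 57 bits on chips 0-7, 56 bits on chip 8 (512 bits total).
SEGMENT_BITS = [57] * 8 + [56]


def split_cache_line(line: bytes) -> list[int]:
    """Split a 64-byte cache line into nine integer segments, one per chip.

    Assumption: the most significant bits of the line map to Chip 0.
    """
    if len(line) != 64:
        raise ValueError("expected a 64-byte cache line")
    value = int.from_bytes(line, "big")
    segments = []
    remaining = 512
    for width in SEGMENT_BITS:
        remaining -= width
        segments.append((value >> remaining) & ((1 << width) - 1))
    return segments


def join_segments(segments: list[int]) -> bytes:
    """Inverse of split_cache_line."""
    value = 0
    for segment, width in zip(segments, SEGMENT_BITS):
        value = (value << width) | segment
    return value.to_bytes(64, "big")


line = bytes(range(64))
segments = split_cache_line(line)
assert join_segments(segments) == line
```

Each segment would then be stored alongside its own 7-bit checksum on the same chip, which is what keeps detection local.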
Tier 2 – Global Error Correction (GEC)
[Figure: layout across Chip 0, Chip 1, ..., Chip 7, Chip 8 showing three regions: data (57 bits per segment), LED (7 bits per segment), and GEC. Cache line A appears as segments A0 through A8 with checksums L_A0 through L_A8; its GEC bits appear as PA 0-6 onward, plus PA 56 and PP_A.]
Legend:
  A, B, C, D, E, F, G, H – cache lines, each comprised of segments X0 through X8
  L_XN – L1 Local Error Detection for cache line X, segment N
  [PX0 : PXN] – L2 Global Error Correction across segments X0 through X8
  PP_X – parity across the GEC segments of line X, starting at PX 0-6

The Devil is in the Details
...and the details are in the paper!
Need to detect and correct additional errors in GEC region
  –Parity is 57 bits; write granularity is 72 bits
  –Use the remaining 15 bits wisely, add two more tiers of protection
[Figure: GEC bits for cache line A across Chip 0, Chip 1, ..., Chip 7, Chip 8: PA 0-6, PA 7-13, ..., PA 56 and PP_A, with T4 bits in 7b + 1b fields; the surplus bit is borrowed from data + LED]

Optimizing Write Behavior
Every write has to update its GEC bits
  –Already borrowing one bit from [data + LED] to use in the GEC
  –Put data and GEC bits all in the same DRAM row!
  –Guaranteed row-buffer hit
  –Data mapping handled by the memory controller
[Figure: one DRAM row across Chip 0, ..., Chip 7, Chip 8, holding the data segments (57 bits) and LEDs (7 bits) of cache lines A through H, followed by their GEC bits (PA through PH, PP_A through PP_H)]

GEC Coalescing
DDR3 burst of 8 forces 72 bytes per access
  –GEC per cache line is only 72 bits
With sufficient locality, one GEC write can potentially cover 8 data writes
  –In reality, each write becomes 1 + δ writes (for some δ ≤ 1)
  –Note that even with δ = 1, benefits of row-buffer hit remain
Writes are typically buffered at the memory controller to avoid bus turnaround overheads
  –Controller can re-order accesses to accommodate coalescing
Results show three cases: basic design (δ = 1), simple coalescing (measured δ), and oracular design (δ = 0.125)

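A toy model of the 1 + δ accounting, as a sketch under stated assumptions: the mapping (eight consecutive cache lines share one 72-byte GEC burst), the fixed drain window of the write queue, and the function name are all hypothetical and are not taken from the slides or the paper.

```python
def gec_write_stats(line_addresses, window=8, lines_per_gec_burst=8):
    """Return (data_writes, gec_writes, delta) for a stream of cache-line writes.

    Hypothetical model: the controller drains buffered writes in batches of
    `window`; within a batch, writes whose GEC bits fall in the same burst
    (line address // lines_per_gec_burst) share a single GEC write.
    """
    if not line_addresses:
        raise ValueError("need at least one write")
    data_writes = len(line_addresses)
    gec_writes = 0
    for start in range(0, data_writes, window):
        batch = line_addresses[start:start + window]
        # One GEC write per distinct GEC burst touched by this batch.
        gec_writes += len({addr // lines_per_gec_burst for addr in batch})
    return data_writes, gec_writes, gec_writes / data_writes


# Perfect locality: 8 neighbouring lines, one GEC write, delta = 1/8.
print(gec_write_stats(list(range(8))))
# No locality: every write needs its own GEC write, delta = 1.
print(gec_write_stats([0, 8, 16, 24]))
```

In this model the first call corresponds to the oracular case (δ = 0.125) and the second to the basic design (δ = 1); measured δ for real workloads would fall in between.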
Constructing the LED Code
Use a 7-bit ECC code to detect errors in 57 data bits
  –We choose a 7-bit 1’s complement checksum
Paper details code operation and computes FIT
  –Single-bit, double-bit, row, column, row-column, pin, chip, multiple random, combinations
Very small rate of undetected errors
  –Caused by very specific, uncommon bit-flip combinations
  –Less than 5E-5 FIT!
Captures some failure modes NOT captured by existing mechanisms (failure of 2 chips out of 18, errors in >2 chips/rank, etc.)

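One way a 7-bit 1's complement checksum over 57 bits could be computed is sketched below. The blocking (nine 7-bit blocks taken from the least significant end, with the last block holding the single leftover bit) and the choice to store the sum without inverting it are assumptions of this sketch; the paper should be consulted for the exact construction.

```python
DATA_BITS = 57
BLOCK_BITS = 7
BLOCK_MASK = (1 << BLOCK_BITS) - 1  # 0x7F


def led_checksum(data: int) -> int:
    """7-bit one's complement sum of the 7-bit blocks of a 57-bit value."""
    if not 0 <= data < (1 << DATA_BITS):
        raise ValueError("data must fit in 57 bits")
    total = 0
    for shift in range(0, DATA_BITS, BLOCK_BITS):
        total += (data >> shift) & BLOCK_MASK
        # End-around carry: fold any overflow out of bit 7 back into bit 0.
        total = (total & BLOCK_MASK) + (total >> BLOCK_BITS)
    return total


def led_check(data: int, stored_checksum: int) -> bool:
    """True if the stored checksum matches the data read back from the chip."""
    return led_checksum(data) == stored_checksum


word = 0x123456789ABCDE
checksum = led_checksum(word)
assert led_check(word, checksum)
assert not led_check(word ^ (1 << 20), checksum)  # a single flipped bit is caught
```

Because the check uses only bits stored on one chip, a mismatch identifies the failing chip directly, which is the property the second tier relies on.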
Checksum Design
Not all error combinations actually occur in DRAM
  –Small number of failure modes with specific root causes
  –Code’s effectiveness under those failures is important
Current symbol-based codes guarantee capturing 100% of SSC-DSD errors
  –At huge power and performance penalties
  –Likely overkill
Not scalable as error rates increase
  –Use strong yet practical codes + RAS features
  –Example: Proactive patrol scrubbing will capture a majority of soft errors; may not coincide with hard errors

Evaluation Methodology
Performance analysis: in-house DRAM simulator
  –Models refresh, address/command bus, data bus, bank/rank/channel contention, read/write queues
Power analysis: Micron power calculator spreadsheet
  –Reflects timing parameters assumed for performance simulations
  –Bus utilization and bank utilization numbers obtained from performance simulations
  –Accounts for activation power, read/write power, termination power, and background power
  –Includes low-power sleep modes

Evaluation Platforms
Xeon 7500-like system
  –8 DDR3 channels, 2 DIMMs/channel
  –Dual-ranked x4 or quad-ranked x8 DIMMs
  –“Lockstep mode” is the only supported mode
    Two ranks operate together to provide a 144-bit bus
    Wasted bandwidth by masking out half the burst, OR
    Forced prefetching
Also evaluate Xeon 5500-like systems
  –3 DDR3 channels, 3 DIMMs/channel
  –“Lockstep mode” wastes one channel entirely, gangs other two
Evaluate five design points each
  –Baseline symbol-based SSC-DSD
  –Virtualized ECC (Yoon & Erez, ASPLOS ’10)
  –LOT-ECC with no coalescing, simple coalescing, oracular coalescing

Power Results
[Figure: memory power chart, values in %]

Power Results
[Figure: memory power chart, values in %]

Performance Results
Latency reduction: LOT-ECC 4.6%; + GEC coalescing 7.7%; oracular 16.2%

Performance Results
Latency reduction: LOT-ECC 42.9%; + GEC coalescing 46.9%; oracular 57.3%

Storage Overhead
For each 64-byte cache line
  –63 bits of LED checksum
  –57 bits of GEC parity
  –7 bits of T3 code
  –9 bits of T4 code
Total storage overhead of 26.5%
  –Current ECC implementations and DIMMs already accept 12.5% through extra chip
  –Additional 14% in data memory via firmware
Memory capacity is cheap if commodity
  –Better to spend on this than power/performance

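The overhead figures follow directly from the bit counts listed on the slide; the short check below only restates that arithmetic (variable names are illustrative).

```python
CACHE_LINE_BITS = 64 * 8  # 512 data bits per cache line

# Per-cache-line code bits, as listed on the slide.
code_bits = {
    "LED checksum": 9 * 7,  # 7 checksum bits on each of 9 chips = 63
    "GEC parity": 57,
    "T3 code": 7,
    "T4 code": 9,
}

total_code_bits = sum(code_bits.values())
total_overhead = total_code_bits / CACHE_LINE_BITS

# A standard ECC DIMM already provides one extra chip per eight data chips.
ecc_chip_overhead = 1 / 8
extra_overhead = total_overhead - ecc_chip_overhead

print(f"code bits per line:  {total_code_bits}")
print(f"total overhead:      {total_overhead:.1%}")
print(f"covered by ECC chip: {ecc_chip_overhead:.1%}")
print(f"extra, in data DRAM: {extra_overhead:.1%}")
```

The 136 code bits per 512-bit line come to just over 26.5%, of which 12.5% is the ninth chip and roughly 14% is carved out of data memory.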
Key Contributions
Multi-tiered protection design to keep fault tolerance contained to fewer chips
Unique data layout tailored to the access mechanism of commodity DRAM systems
Exploit row-buffer efficiency
  –Co-locate data and all tiers of fault-tolerance codes
  –Mitigates overheads of additional writes typical in parity-based systems
Coalescing optimization to further minimize impact of parity writes

Key Benefits
Power Efficiency: fewer chips activated per access, reduced access granularity, reduced static energy through better use of low-power modes (43% memory power savings)
Performance Gains: more rank-level parallelism, reduced access granularity (7.7% memory latency reduction)
Improved Protection: can handle 1 failed chip out of 9, compared to 1 in 36 currently
Flexibility: works with a single rank of x4 DRAMs or more efficient x8 DRAMs
Implementation Ease: changes to memory controller and system firmware only; commodity processor/memory/OS

BACKUP SLIDES
Tier 2 – Global Error Correction (GEC)
GEC is a parity written across the cache line segments in each chip
LED has already pinpointed erroneous segment
  –Error correction is trivial
Storing the parity
  –A portion of memory set aside to hold GEC
  –Handled by memory controller + firmware
No impact on reads unless error is detected
GEC also self contained (single cache line)
  –No read-before-write
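
Why correction is trivial once LED has named the failed chip can be shown with a minimal XOR-parity sketch. The segment values and function names are illustrative assumptions; the sketch models only the basic parity and ignores the T3/T4 tiers.

```python
from functools import reduce
from operator import xor


def gec_parity(segments):
    """Bitwise XOR of all per-chip data segments of one cache line."""
    return reduce(xor, segments, 0)


def reconstruct_segment(segments, failed_index, parity):
    """Rebuild the segment of the chip flagged by LED from the survivors.

    XOR-ing the parity with every healthy segment cancels them out and
    leaves the original contents of the failed segment.
    """
    survivors = (s for i, s in enumerate(segments) if i != failed_index)
    return reduce(xor, survivors, parity)


# Nine illustrative segments (chips 0-8) and their parity, computed at write time.
original = [0x1A2B3C4D5E6F7 + i * 0x1111 for i in range(9)]
parity = gec_parity(original)

# Chip 3 fails and returns garbage; LED has flagged index 3.
read_back = list(original)
read_back[3] = 0
recovered = reconstruct_segment(read_back, failed_index=3, parity=parity)
assert recovered == original[3]
```

Since the parity is computed from the segments of a single cache line, a write can regenerate it from the data being written, which is consistent with the slide's point that no read-before-write is needed.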