Justin Meza Qiang Wu Sanjeev Kumar Onur Mutlu Revisiting Memory Errors in Large-Scale Production Data Centers Analysis and Modeling of New Trends from the Field
Overview Study of DRAM reliability: ▪ on modern devices and workloads ▪ at a large scale in the field
New reliability trends Page offlining at scale Error/failure occurrence Technology scaling Modeling errors Architecture & workload Overview
New reliability trends Page offlining at scale Technology scaling Modeling errors Architecture & workload Error/failure occurrence Overview Errors follow a power-law distribution and a large number of errors occur due to sockets/channels
New reliability trends Page offlining at scale Modeling errors Architecture & workload Error/failure occurrence Technology scaling Overview We find that newer cell fabrication technologies have higher failure rates
New reliability trends Page offlining at scale Modeling errors Error/failure occurrence Technology scaling Architecture & workload Overview Chips per DIMM, transfer width, and workload type (not necessarily CPU/memory utilization) affect reliability
New reliability trends Page offlining at scale Error/failure occurrence Technology scaling Architecture & workload Modeling errors Overview We have made publicly available a statistical model for assessing server memory reliability
New reliability trends Error/failure occurrence Technology scaling Architecture & workload Modeling errors Page offlining at scale Overview First large-scale study of page offlining; real-world limitations of technique
Outline ▪ background and motivation ▪ server memory organization ▪ error collection/analysis methodology ▪ memory reliability trends ▪ summary
Background and motivation
DRAM errors are common ▪ examined extensively in prior work ▪ charged particles, wear-out ▪ variable retention time (next talk) ▪ error correcting codes ▪ used to detect and correct errors ▪ require additional storage overheads
Our goal Strengthen understanding of DRAM reliability by studying: ▪ new trends in DRAM errors ▪ modern devices and workloads ▪ at a large scale ▪ billions of device-days, across 14 months
Our main contributions ▪ identified new DRAM failure trends ▪ developed a model for DRAM errors ▪ evaluated page offlining at scale
Server memory organization
Socket
Memory channels
DIMM slots
DIMM
Chip
Banks
Rows and columns
Cell
User data + ECC metadata: additional 12.5% storage overhead
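A quick check on the 12.5% figure (assuming a standard SEC-DED ECC DIMM that stores 8 check bits alongside every 64 data bits, i.e., a 72-bit-wide rank):
\[ \frac{8\ \text{ECC bits}}{64\ \text{data bits}} = 12.5\%\ \text{additional storage} \]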
Reliability events Fault ▪ the underlying cause of an error ▪ DRAM cell unreliably stores charge Error ▪ the manifestation of a fault ▪ permanent: every time ▪ transient: only some of the time
Error collection/ analysis methodology
DRAM error measurement ▪ measured every correctable error ▪ across Facebook's fleet ▪ for 14 months ▪ metadata associated with each error ▪ parallelized Map-Reduce to process ▪ used R for further analysis
System characteristics ▪ 6 different system configurations ▪ Web, Hadoop, Ingest, Database, Cache, Media ▪ diverse CPU/memory/storage requirements ▪ modern DRAM devices ▪ DDR3 communication protocol ▪ (more aggressive clock frequencies) ▪ diverse organizations (banks, ranks,...) ▪ previously unexamined characteristics ▪ density, # of chips, transfer width, workload
Memory reliability trends
New reliability trends Page offlining at scale Technology scaling Modeling errors Architecture & workload Error/failure occurrence
Server error rate [figure: 3% vs. 0.03%]
Memory error distribution
Decreasing hazard rate
Memory error distribution Decreasing hazard rate How are errors mapped to memory organization?
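One way to make "decreasing hazard rate" concrete (an illustrative sketch, assuming the per-server error counts are modeled with a Pareto power-law distribution; other fits are possible):
\[ P(X > x) = \left(\frac{x_{\min}}{x}\right)^{\alpha} \;\Rightarrow\; h(x) = \frac{f(x)}{P(X > x)} = \frac{\alpha}{x} \]
The hazard falls as x grows, so a server that has already accumulated many errors is relatively more likely to keep accumulating them.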
Sockets/channels: many errors
Not mentioned in prior chip-level studies
Sockets/channels: many errors Not accounted for in prior chip-level studies At what rate do components fail on servers?
Bank/cell/spurious failures are common
[Figure: # of servers vs. # of errors; a small number of servers exhibit denial-of-service-like error behavior]
What factors contribute to memory failures at scale?
Analytical methodology ▪ measure server characteristics ▪ not feasible to examine every server ▪ examined all servers with errors (error group) ▪ sampled servers without errors (control group) ▪ bucket devices based on characteristics ▪ measure relative failure rate ▪ of error group vs. control group ▪ within each bucket
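A minimal sketch of this bucketing step in R (illustrative only; the file name, the `group` and `chip_density` columns, and the normalization are assumptions, not the study's actual pipeline):

    # One row per sampled server: `group` is "error" or "control",
    # `chip_density` is the characteristic used for bucketing.
    servers <- read.csv("servers.csv")
    servers$failed <- servers$group == "error"

    # Failure rate within each bucket of the characteristic...
    rate_by_bucket <- tapply(servers$failed, servers$chip_density, mean)

    # ...reported relative to the overall rate, since the control group is
    # sampled and absolute rates are not meaningful here.
    rate_by_bucket / mean(servers$failed)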
New reliability trends Page offlining at scale Modeling errors Architecture & workload Error/failure occurrence Technology scaling
Prior work found inconclusive trends with respect to memory capacity
Prior work found inconclusive trends with respect to memory capacity. Examine a characteristic more closely related to cell fabrication technology
Use DRAM chip density to examine technology scaling (closely related to fabrication technology)
Intuition: shrinking cell feature size yields a quadratic increase in capacity
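Rough reasoning behind the quadratic claim (a sketch, not from the slide itself): if the minimum feature size shrinks from F to F/2, each cell occupies roughly a quarter of the area, so
\[ \text{cells per unit area} \propto \frac{1}{F^2}, \]
which is why chip density (1Gb, 2Gb, 4Gb) serves as a proxy for fabrication technology generation.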
New reliability trends Page offlining at scale Modeling errors Architecture & workload Error/failure occurrence Technology scaling We find that newer cell fabrication technologies have higher failure rates
New reliability trends Page offlining at scale Modeling errors Error/failure occurrence Technology scaling Architecture & workload
DIMM architecture ▪ chips per DIMM, transfer width ▪ 8 to 48 chips ▪ x4, x8 = 4 or 8 bits per cycle ▪ electrical implications
DIMM architecture ▪ chips per DIMM, transfer width ▪ 8 to 48 chips ▪ x4, x8 = 4 or 8 bits per cycle ▪ electrical implications Does DIMM organization affect memory reliability?
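A worked example of how transfer width relates to chip count (assuming a standard 72-bit-wide ECC rank, i.e., 64 data + 8 ECC bits; actual DIMM configurations vary):
\[ \text{x8 chips: } \tfrac{72}{8} = 9 \text{ chips per rank}, \qquad \text{x4 chips: } \tfrac{72}{4} = 18 \text{ chips per rank} \]
so narrower (x4) chips and additional ranks both push the total chip count per DIMM higher.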
No consistent trend across only chips per DIMM
More chips ➔ higher failure rate
More bits per cycle ➔ higher failure rate
Intuition: increased electrical loading
Workload dependence ▪ prior studies: homogeneous workloads ▪ web search and scientific ▪ warehouse-scale data centers: ▪ web, hadoop, ingest, database, cache, media
Workload dependence ▪ prior studies: homogeneous workloads ▪ web search and scientific ▪ warehouse-scale data centers: ▪ web, hadoop, ingest, database, cache, media What effect do heterogeneous workloads have on reliability?
No consistent trend across CPU/memory utilization
Failure rate varies by up to 6.5x across workload types
New reliability trends Page offlining at scale Modeling errors Error/failure occurrence Technology scaling Architecture & workload Chips per DIMM, transfer width, and workload type (not necessarily CPU/memory utilization) affect reliability
New reliability trends Page offlining at scale Error/failure occurrence Technology scaling Architecture & workload Modeling errors
A model for server failure ▪ use statistical regression model ▪ compare control group vs. error group ▪ linear regression in R ▪ trained using data from analysis ▪ enable exploratory analysis ▪ high perf. vs. low power systems
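A minimal sketch of such a regression in R (the data file and column names are hypothetical; this is not the authors' released model):

    # Hypothetical per-server data: `failed` (0/1), chip density in Gb,
    # chips per DIMM, server age in years, CPU utilization, workload type.
    servers <- read.csv("servers.csv")

    model <- lm(failed ~ density_gb + chips + age_years + cpu_util + workload,
                data = servers)

    # Coefficient signs and magnitudes indicate how each characteristic is
    # associated with the failure rate in this illustrative fit.
    summary(model)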
Memory error model [figure]: inputs (density, chips, age, ...) → output (relative server failure rate)
Available online
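Continuing the illustrative fit sketched above (again hypothetical, not the released model), exploratory what-if comparisons reduce to predictions on constructed configurations:

    # Hypothetical high-performance vs. low-power configurations; the
    # "Web" workload level is assumed to appear in the training data.
    hi_perf   <- data.frame(density_gb = 4, chips = 32, age_years = 1,
                            cpu_util = 0.8, workload = "Web")
    low_power <- data.frame(density_gb = 2, chips = 16, age_years = 1,
                            cpu_util = 0.3, workload = "Web")
    predict(model, newdata = rbind(hi_perf, low_power))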
New reliability trends Page offlining at scale Error/failure occurrence Technology scaling Architecture & workload Modeling errors We have made publicly available a statistical model for assessing server memory reliability
New reliability trends Error/failure occurrence Technology scaling Architecture & workload Modeling errors Page offlining at scale
Prior page offlining work ▪ [Tang+,DSN'06] proposed technique ▪ "retire" faulty pages using OS ▪ do not allow software to allocate them ▪ [Hwang+,ASPLOS'12] simulated eval. ▪ error traces from Google and IBM ▪ recommended retirement on first error ▪ large number of cell/spurious errors
Prior page offlining work ▪ [Tang+,DSN'06] proposed technique ▪ "retire" faulty pages using OS ▪ do not allow software to allocate them ▪ [Hwang+,ASPLOS'12] simulated eval. ▪ error traces from Google and IBM ▪ recommended retirement on first error ▪ large number of cell/spurious errors How effective is page offlining in the wild?
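For concreteness, a hedged sketch of one way a page could be retired on Linux, via the memory-failure sysfs interface (an assumption about the mechanism, requiring root and CONFIG_MEMORY_FAILURE; not the production implementation evaluated here):

    # Soft-offline the page containing the given physical address (hex string):
    # the kernel migrates the page's contents and stops allocating the page.
    soft_offline_page <- function(paddr_hex) {
      writeLines(paddr_hex, "/sys/devices/system/memory/soft_offline_page")
    }

    # Example with a hypothetical address:
    # soft_offline_page("0x12345000")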
Cluster of 12,276 servers
Cluster of 12,276 servers: -67% errors
Cluster of 12,276 servers: -67% errors (prior work: -86% to -94%)
Cluster of 12,276 servers: -67% errors (prior work: -86% to -94%); 6% of page offlining attempts failed due to the OS
New reliability trends Error/failure occurrence Technology scaling Architecture & workload Modeling errors Page offlining at scale First large-scale study of page offlining; real-world limitations of technique
New reliability trends Error/failure occurrence Technology scaling Architecture & workload Modeling errors Page offlining at scale
More results in paper ▪ Vendors ▪ Age ▪ Processor cores ▪ Correlation analysis ▪ Memory model case study
Summary
▪ Modern systems ▪ Large scale
New reliability trends Page offlining at scale Error/failure occurrence Technology scaling Modeling errors Architecture & workload Summary
New reliability trends Page offlining at scale Technology scaling Modeling errors Architecture & workload Error/failure occurrence Summary Errors follow a power-law distribution and a large number of errors occur due to sockets/channels
New reliability trends Page offlining at scale Modeling errors Architecture & workload Error/failure occurrence Technology scaling Summary We find that newer cell fabrication technologies have higher failure rates
New reliability trends Page offlining at scale Modeling errors Error/failure occurrence Technology scaling Architecture & workload Summary Chips per DIMM, transfer width, and workload type (not necessarily CPU/memory utilization) affect reliability
New reliability trends Page offlining at scale Error/failure occurrence Technology scaling Architecture & workload Modeling errors Summary We have made publicly available a statistical model for assessing server memory reliability
New reliability trends Error/failure occurrence Technology scaling Architecture & workload Modeling errors Page offlining at scale Summary First large-scale study of page offlining; real-world limitations of technique
Justin Meza Qiang Wu Sanjeev Kumar Onur Mutlu Revisiting Memory Errors in Large-Scale Production Data Centers Analysis and Modeling of New Trends from the Field
Backup slides
Decreasing hazard rate
[Backup figures: memory error model inputs, bucketed by error count and chip density (1Gb, 2Gb, 4Gb)]
Case study
Case study: model inputs and output
Case study: model inputs and output. Do CPUs or density have a higher impact?
Exploratory analysis