DRAM SCALING CHALLENGE

Presentation transcript:

SOLVING THE DRAM SCALING CHALLENGE
Samira Khan

MEMORY IN TODAY’S SYSTEM
Diagram labels: Processor, Memory (DRAM), Storage.
I will start with a high-level view of our systems. In today’s systems, persistent data is stored on disk, and the processor computes over that data. However, disk is very slow. Between the processor and storage we have faster main memory built with DRAM, so that the processor can compute over the data faster. DRAM is therefore critical for performance.

TREND: DATA-INTENSIVE APPLICATIONS
Examples: DNA/protein synthesis, image analysis, virtual reality, in-memory frameworks.
DRAM is even more critical nowadays due to the emergence of data-intensive applications. Scientific applications such as DNA or protein analysis operate on huge data sets, and image and virtual-reality workloads have streams of data to process and analyze. Current analytics frameworks store the whole working set in memory to reduce the overhead of disk access. The result is an increasing demand for high-capacity, high-performance, energy-efficient main memory.

DRAM SCALING TREND
Chart annotations: 2X/1.5 YEARS, 2X/3 YEARS.
This chart shows the trend of DRAM chip capacity over the years. The x-axis is the year of mass production and the y-axis is the capacity in megabits per chip. Over the last 30 years, chip capacity has grown from 1 Mb to 8 Gb. However, capacity used to double every two years, and the scaling trend has since slowed so that capacity now doubles every three years. The trend makes it clear that DRAM scaling is getting difficult.
Source: Flash Memory Summit 2013, Memcon 2014
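
As a rough check on the numbers quoted for this slide, the sketch below computes the average doubling period implied by growth from 1 Mb to 8 Gb over 30 years. The helper name `doubling_period_years` is hypothetical, and the calculation assumes smooth exponential growth over the whole period, which the chart itself does not show.

```python
import math

MEGABIT = 2**20
GIGABIT = 2**30

def doubling_period_years(start_bits, end_bits, years):
    """Average years per capacity doubling, assuming steady exponential growth."""
    doublings = math.log2(end_bits / start_bits)
    return years / doublings

# Figures quoted in the talk: 1 Mb to 8 Gb in roughly 30 years.
period = doubling_period_years(1 * MEGABIT, 8 * GIGABIT, 30)
print(f"{period:.2f} years per doubling")  # 2.31 years per doubling
```

The 30-year average of about 2.3 years sits between the faster early rate and the slower recent rate marked on the chart.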

DRAM SCALING CHALLENGE: WHY?
Diagram labels: Technology Scaling, DRAM Cells.
As cells shrink, there are more failures in DRAM cells, and manufacturing reliable cells at low cost is getting difficult.

WHY IS IT DIFFICULT TO SCALE?
To answer this, we need to take a closer look at a DRAM cell.

DRAM CELL OPERATION
1. A DRAM cell stores data as charge.
2. A DRAM cell is refreshed every 64 ms.
Diagram: a DRAM cell in logical view (transistor, capacitor, bitline) and in vertical cross section (bitline, contact, transistor, capacitor).
The left side shows a logical view of a DRAM cell and the right side shows a vertical cross section of a cell. A DRAM cell stores data as charge in a capacitor. The amount of charge indicates either data zero or one. A transistor acts as a switch between the capacitor and a line called the bitline. When a DRAM cell is read, the transistor is turned on and the capacitor charge is sensed through the bitline.

DRAM RETENTION FAILURE
Retention time: the time during which a cell can still be accessed reliably. Cells need to be refreshed before the retention time elapses to avoid failure.
Diagram: capacitor charge over time against the 64 ms refresh interval, for two cases: retention time greater than the refresh interval, and retention time less than the refresh interval.
Failure depends on the amount of charge.
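
The failure condition on this slide can be stated as a one-line comparison. The sketch below is a minimal illustration, not a model of real DRAM: the function name `cells_that_fail` and the example retention times are hypothetical, and only the 64 ms refresh interval comes from the slides.

```python
REFRESH_INTERVAL_MS = 64  # refresh interval stated on the slides

def cells_that_fail(retention_times_ms, refresh_interval_ms=REFRESH_INTERVAL_MS):
    """Return indices of cells whose retention time is shorter than the refresh interval."""
    return [
        index
        for index, retention in enumerate(retention_times_ms)
        if retention < refresh_interval_ms
    ]

# Hypothetical retention times in milliseconds, chosen only for illustration.
example_retention_ms = [512, 48, 1024, 64, 30]
failing = cells_that_fail(example_retention_ms)
print(failing)  # [1, 4]
```

A cell whose retention time equals the refresh interval is treated as safe here; that boundary choice is an assumption of the sketch.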

SCALING CHALLENGE: CELL-TO-CELL INTERFERENCE
Cell-to-cell interference affects the charge in neighboring cells.
Diagram labels: Technology Scaling, Less Interference, More Interference, Indirect path.
The second scaling challenge is cell-to-cell interference. Charge in DRAM cells can be affected by neighboring cells due to cell coupling. DRAM has faced this interference since the first DRAM chip was manufactured, but as cells get smaller there is more interference. The reason is that interference provides an indirect path to neighboring cells. As the cells get smaller, it becomes easier to disturb the charge in other cells, resulting in DRAM failures. More interference results in more failures.

IMPLICATION: DRAM ERRORS IN THE FIELD
1.52% of DRAM modules failed in Google servers.
1.6% of DRAM modules failed at LANL (Los Alamos National Laboratory).
1.8X more failures in new-generation DRAMs at Facebook.
DRAM comes with a 10-year reliability guarantee. However, as DRAM gets more vulnerable to failures with technology scaling, DRAM errors are occurring in the field. Recent studies have shown that the fraction of DRAM modules that fail in the field is higher than expected.
SIGMETRICS’09, SC’12, DSN’15

GOAL
Enable high-capacity, low-latency memory without sacrificing reliability.

SOLUTION SPACE
Traditional DRAM scaling is ending.
MAKE DRAM SCALABLE (memory: difficult to scale): SIGMETRICS’14, DSN’15, HPCA’15, SIGMETRICS’16, DSN’16, CAL’16, SIGMETRICS’17, HPCA’17, MICRO’17
LEVERAGE NEW TECHNOLOGIES (non-volatile memory: highly scalable): WEED’13, MICRO’15, HPCA’18, ongoing
TOLERATE FAILURES (memory: difficult to scale): WAX’18, ongoing
Image: Loke et al., Science 2012

SOLUTION SPACE
Traditional DRAM scaling is ending.
MAKE DRAM SCALABLE (memory): System-Level Detection and Mitigation of Failures
LEVERAGE NEW TECHNOLOGIES (non-volatile memory): Unifying Memory and Storage with NVM
TOLERATE FAILURES (memory): Restricted Approximation

TRADITIONAL APPROACH TO ENABLE DRAM SCALING
Diagram labels: Unreliable DRAM Cells, Make DRAM Reliable, Reliable DRAM Cells, Reliable System; Manufacturing Time, System in the Field.
The traditional way to enable scaling is to use circuit-level optimizations and testing to make sure every DRAM cell operates correctly. When the chips are used in our systems, the systems assume that DRAM is error-free. Currently, manufacturers have to provide a strict reliability guarantee for DRAM.

MY APPROACH: SHIFT THE RESPONSIBILITY TO SYSTEMS
Diagram labels: Unreliable DRAM Cells, Make DRAM Reliable, Reliable DRAM Cells, Reliable System; Manufacturing Time, System in the Field.
In my work, I show that if the system, instead of the manufacturer, becomes responsible for ensuring the reliability of the DRAM cells, manufacturers can shrink the cells to smaller technology nodes without paying the cost of providing a reliability guarantee. By shifting the responsibility for reliable DRAM operation from manufacturers to systems, we can enable DRAM scaling.

VISION: SYSTEM-LEVEL DETECTION AND MITIGATION
1. Modules are not fully tested at manufacturing time (PASS/FAIL).
2. Ship modules with possible failures.
3. Detect and mitigate failures online, after the system has become operational (ONLINE PROFILING).

BENEFITS OF ONLINE PROFILING
Diagram labels: Technology Scaling, Reliable DRAM Cells, Unreliable DRAM Cells.
1. Improves yield, reduces cost, enables scaling: vendors can make cells smaller without a strong reliability guarantee.

BENEFITS OF ONLINE PROFILING
Diagram labels: Reduce Refresh, HI-REF, LO-REF, Reliable DRAM Cells, Unreliable DRAM Cells.
1. Improves yield, reduces cost, enables scaling: vendors can make cells smaller without a strong reliability guarantee.
2. Improves performance and energy efficiency.

DRAM CELLS ARE NOT EQUAL
Ideal: same size → same charge. Real: different size → different charge (from smallest cell to largest cell).
Ideally, DRAM would have uniform cells of the same size and therefore the same access latency. Unfortunately, due to process variation, each cell has a different size and therefore a different access latency. As a result, DRAM shows large variation in both the amount of charge in its cells and the access latency to them.
Large variation in DRAM cells.

DRAM CELLS ARE NOT EQUAL
Diagram labels: Ideal, Real, Smallest cell, FAST.
Large variation in retention time.
Most cells have a high retention time → they can be refreshed at a lower rate without any failure.
Smaller cells will fail to retain data at a lower refresh rate.

BENEFITS OF ONLINE PROFILING
Diagram labels: LO-REF, HI-REF, Unreliable DRAM Cells.
1. Improves yield, reduces cost, enables scaling: vendors can make cells smaller without a strong reliability guarantee.
2. Improves performance and energy efficiency: reduce the refresh count by using a lower refresh rate overall, and refresh faulty rows more frequently.
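
To make the refresh saving concrete, the sketch below counts row refreshes in a fixed time window when only the faulty rows stay at the high rate. All numbers are hypothetical (1000 rows, 100 of them faulty, a 256 ms LO-REF interval); only the 64 ms HI-REF interval comes from the slides, and the result is not the measured figure reported in the talk.

```python
def refresh_count(num_rows, interval_ms, window_ms):
    """Number of row refreshes issued in a time window at a fixed refresh interval."""
    return num_rows * (window_ms // interval_ms)

def refresh_reduction(total_rows, weak_rows, hi_ref_ms, lo_ref_ms, window_ms):
    """Fraction of refreshes saved versus refreshing every row at the high rate."""
    baseline = refresh_count(total_rows, hi_ref_ms, window_ms)
    multi_rate = (
        refresh_count(weak_rows, hi_ref_ms, window_ms)
        + refresh_count(total_rows - weak_rows, lo_ref_ms, window_ms)
    )
    return 1 - multi_rate / baseline

# Hypothetical configuration, for illustration only.
reduction = refresh_reduction(
    total_rows=1000, weak_rows=100, hi_ref_ms=64, lo_ref_ms=256, window_ms=1024
)
print(f"{reduction:.1%}")  # 67.5%
```

The saving shrinks as the share of faulty rows grows, which is why accurate detection of those rows matters.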

To enable these benefits, we need to detect the failures at the system level.

CHALLENGE: INTERMITTENT FAILURES
Diagram labels: Unreliable DRAM Cells, Detect and Mitigate, Reliable System.
A reliable system depends on accurately detecting DRAM failures. If failures were permanent, a simple boot-up test would have worked, but there are intermittent failures.
What are these intermittent failures?

CELL-TO-CELL INTERFERENCE: DATA-DEPENDENT FAILURES
Diagram labels: NO FAILURE, FAILURE, Indirect path.
Due to the coupling effect in DRAM, neighboring cells provide an indirect path that can interfere with the charge stored in cells. As a result, cells can fail based on the content of the neighboring cells. For example, when the neighboring cells are storing 010, the middle cell can lose charge due to coupling between cells, which can result in a failure.
Some cells can fail depending on the data stored in neighboring cells. How can these failures be detected at the system level?

CHALLENGE: DATA-DEPENDENT FAILURES
MAKE DRAM SCALABLE: System-Level Detection and Mitigation of Failures
Efficacy of Testing Data-Dependent Failures (SIGMETRICS’14)
MEMCON: DRAM-Internal Independent Detection (CAL’16, MICRO’17)

EXPERIMENTAL METHODOLOGY
Custom FPGA-based infrastructure.
Diagram labels: PC (C++ programs to specify commands), PCIe, FPGA (generate command sequence), DDR3, DIMM.
Tested more than a hundred chips from three different manufacturers.

DRAM TESTING INFRASTRUCTURE
Photo labels: Temperature Controller, Heater, FPGAs, PC.
This is a photo of our infrastructure, which we built mostly from scratch. For higher throughput, we employ eight FPGAs, all of which are enclosed in a thermally regulated environment.
Open-source infrastructure to test real DRAM chips. Characterization data publicly available.
HPCA’17, SIGMETRICS’14, DSN’15, HPCA’15, SIGMETRICS’16, DSN’16, CAL’16, SIGMETRICS’17, MICRO’17

DETECT FAILURES WITH TESTING
1. Write some pattern in the module.
2. Wait until the refresh interval.
3. Read and verify.
Repeat: test with different data patterns.
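
The write, wait, read-and-verify loop on this slide can be sketched against a toy software model. Everything here is hypothetical: `SimulatedModule` is not the FPGA infrastructure described in the talk, waiting is simulated rather than timed, and the model assumes a stored 1 leaks to 0 once a cell is left longer than its retention time.

```python
class SimulatedModule:
    """Toy DRAM model: a stored 1 leaks to 0 if left longer than the cell's retention time."""

    def __init__(self, retention_ms):
        self.retention_ms = list(retention_ms)
        self.data = [0] * len(self.retention_ms)

    def write(self, pattern):
        self.data = list(pattern)

    def wait(self, ms):
        # Simulated wait: no real time passes, weak charged cells simply lose their value.
        self.data = [
            0 if bit == 1 and retention < ms else bit
            for bit, retention in zip(self.data, self.retention_ms)
        ]

    def read(self):
        return list(self.data)


def run_test(module, patterns, wait_ms):
    """Write each pattern, wait, read back, and collect the indices of cells that changed."""
    failing = set()
    for pattern in patterns:
        module.write(pattern)
        module.wait(wait_ms)
        observed = module.read()
        failing.update(
            index
            for index, (written, read_back) in enumerate(zip(pattern, observed))
            if written != read_back
        )
    return sorted(failing)


# Hypothetical retention times (ms) and data patterns.
module = SimulatedModule([100, 50, 200, 30])
patterns = [[1, 1, 1, 1], [0, 0, 0, 0], [1, 0, 1, 0]]
found = run_test(module, patterns, wait_ms=64)
print(found)  # [1, 3]
```

Only the all-ones pattern exposes both weak cells in this model, which is a small-scale reason for testing with several data patterns.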

DETECTING DATA-DEPENDENT FAILURES
Even after hundreds of rounds, a small number of new cells keep failing.
Conclusion: tests with many rounds of random patterns cannot detect all failures.

WHY SO MANY ROUNDS OF TESTS?
Data-dependent failure: a cell fails when a specific pattern is stored in the neighboring cells (diagram labels: L, D, R).
Linear mapping: X-1, X, X+1.
Scrambled mapping (not exposed to the system): X-1, X-4, X, X+2, X+1.
Even many rounds of random patterns cannot detect all failures.
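
The effect of scrambled mapping can be shown with a small model. The sketch below assumes a cell fails when it stores 1 and both physical neighbours store 0 (the 010 case from the earlier slide), and it uses the physical order X-1, X-4, X, X+2, X+1 from this slide with X = 4. The 8-cell array and the function names are hypothetical.

```python
def data_dependent_failures(data, physical_order):
    """Return system addresses of cells holding 1 whose two physical neighbours hold 0."""
    failures = []
    for pos in range(1, len(physical_order) - 1):
        left = physical_order[pos - 1]
        victim = physical_order[pos]
        right = physical_order[pos + 1]
        if (data[left], data[victim], data[right]) == (0, 1, 0):
            failures.append(victim)
    return failures


def ones_with_zeros_at(addresses, size):
    """Build a data pattern of all ones with zeros at the given system addresses."""
    data = [1] * size
    for address in addresses:
        data[address] = 0
    return data


X = 4
linear_order = [0, 1, 2, 3, 4, 5, 6, 7]
# Contains the slide's sequence X-1, X-4, X, X+2, X+1 as 3, 0, 4, 6, 5.
scrambled_order = [1, 3, 0, 4, 6, 5, 2, 7]

# Pattern that targets X assuming its neighbours are X-1 and X+1.
guess = ones_with_zeros_at([X - 1, X + 1], 8)
print(data_dependent_failures(guess, linear_order))     # [4]
print(data_dependent_failures(guess, scrambled_order))  # []

# Pattern that targets the real physical neighbours X-4 and X+2.
exact = ones_with_zeros_at([X - 4, X + 2], 8)
print(data_dependent_failures(exact, scrambled_order))  # [4]
```

The same pattern that exposes cell X under linear mapping misses it under scrambled mapping, so a system that cannot see the mapping has to rely on chance to hit the right neighbour content.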

CHALLENGE IN DETECTION
Scrambled mapping: X-?, X, X+?
How can we detect data-dependent failures when we do not even know which cells are neighbors?

CHALLENGE: DATA-DEPENDENT FAILURES
MAKE DRAM SCALABLE: System-Level Detection and Mitigation of Failures
Efficacy of Testing Data-Dependent Failures (SIGMETRICS’14)
MEMCON: DRAM-Internal Independent Detection (CAL’16, MICRO’17)

CURRENT DETECTION MECHANISM
Timeline: Initial Failure Detection and Mitigation, then Execution of Applications.
Detection is done with some initial testing isolated from system execution. Detect and mitigate all failures with every possible content, and only after that start program execution.

CURRENT DETECTION MECHANISM
Detect every possible failure with all content before execution.
Initial Failure Detection and Mitigation: test the unreliable DRAM cells and build a list of failures covering all possible failing cells: Pattern x, Cell A; Pattern y, Cell B; Pattern z, Cell C; …
Execution of Applications: applications run only after this list is complete.

CURRENT DETECTION MECHANISM
Detect every possible failure with all content before execution?
Diagram labels: Unreliable DRAM Cells, List of Failures (??), Applications; Initial Failure Detection and Mitigation, Execution of Applications.
No reliability guarantee: online profiling cannot detect all failures because the address mapping is not visible to the system.

MEMCON: MEMORY CONTENT-BASED DETECTION AND MITIGATION
No need to detect every possible failure.
Diagram labels: Unreliable DRAM Cells with Program Content, List of Failures (Current content, Cell A), Application; Simultaneous Detection and Execution.
Detection is based on the current memory content of running applications, so failures need to be detected and mitigated only for the current content.

MEMCON: HIGH-LEVEL DESIGN
Diagram labels: Simultaneous Detection and Execution, LO-REF, HI-REF, Current content, Cell A, Application, Unreliable DRAM Cells.
No initial detection and mitigation.
Start running the application with a high refresh rate.
Detect failures with the current memory content.
If no failure is found, use a low refresh rate.
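
The decision flow on this slide can be sketched as follows. This is only an illustration of the listed steps, not the MEMCON implementation: the per-row granularity, the function names, and the `toy_failure_check` stand-in for an actual test with the current content are all assumptions.

```python
HI_REF = "HI-REF"
LO_REF = "LO-REF"


def memcon_refresh_states(row_contents, fails_with_content):
    """Start every row at HI-REF, then move rows that pass a test with their current content to LO-REF."""
    states = {row: HI_REF for row in row_contents}  # no initial detection: begin at the high rate
    for row, content in row_contents.items():
        if not fails_with_content(row, content):
            states[row] = LO_REF  # no failure found with the current content
    return states


def toy_failure_check(row, content):
    """Hypothetical stand-in for a real test: row 2 fails only while it holds 0b010."""
    return row == 2 and content == 0b010


# Hypothetical current memory content, one small integer per row.
rows = {0: 0b010, 1: 0b111, 2: 0b010, 3: 0b000}
states = memcon_refresh_states(rows, toy_failure_check)
print(states)  # {0: 'LO-REF', 1: 'LO-REF', 2: 'HI-REF', 3: 'LO-REF'}
```

Row 2 stays at the high rate only because of what it currently stores; with different content the same row would pass the check, which is the sense in which detection depends on current content rather than on every possible pattern.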

SUMMARY: ONLINE PROFILING
Diagram labels: Unreliable DRAM Cells, Detect and Mitigate, Reliable System.
Detection at the system level is challenging due to data-dependent failures.
It is possible to detect and mitigate data-dependent failures simultaneously with program execution.
65%-74% reduction in refresh count.
40%-50% performance improvement.

SOLVING THE DRAM SCALING CHALLENGE
Samira Khan