Architecting Phase Change Memory as a Scalable DRAM Alternative

Slides:

Advertisements

Similar presentations

A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy J. Zebchuk, E. Safi, and A. Moshovos.

Advertisements

Jaewoong Sim Alaa R. Alameldeen Zeshan Chishti Chris Wilkerson Hyesoon Kim MICRO-47 | December 2014.

Thank you for your introduction.

Better I/O Through Byte-Addressable, Persistent Memory

4/17/20151 Improving Memory Bank-Level Parallelism in the Presence of Prefetching Chang Joo Lee Veynu Narasiman Onur Mutlu* Yale N. Patt Electrical and.

1 Lecture 6: Chipkill, PCM Topics: error correction, PCM basics, PCM writes and errors.

Tiered-Latency DRAM: A Low Latency and A Low Cost DRAM Architecture

Data Mapping for Higher Performance and Energy Efficiency in Multi-Level Phase Change Memory HanBin Yoon*, Naveen Muralimanohar ǂ, Justin Meza*, Onur Mutlu*,

†The Pennsylvania State University

STT-RAM as a sub for SRAM and DRAM

Reducing Read Latency of Phase Change Memory via Early Read and Turbo Read Feb 9 th 2015 HPCA-21 San Francisco, USA Prashant Nair - Georgia Tech Chiachen.

Efficient Data Mapping and Buffering Techniques for Multi-Level Cell Phase-Change Memories HanBin Yoon, Justin Meza, Naveen Muralimanohar*, Onur Mutlu,

Moinuddin K. Qureshi ECE, Georgia Tech

Phase Change Memory What to wear out today? Chris Craik, Aapo Kyrola, Yoshihisa Abe.

1 Lecture 15: DRAM Design Today: DRAM basics, DRAM innovations (Section 5.3)

1 Lecture 16: Virtual Memory Today: DRAM innovations, virtual memory (Sections )

1 Coordinated Control of Multiple Prefetchers in Multi-Core Systems Eiman Ebrahimi * Onur Mutlu ‡ Chang Joo Lee * Yale N. Patt * * HPS Research Group The.

1 Lecture 14: DRAM, PCM Today: DRAM scheduling, reliability, PCM Class projects.

1 Tiered-Latency DRAM: A Low Latency and A Low Cost DRAM Architecture Donghyuk Lee, Yoongu Kim, Vivek Seshadri, Jamie Liu, Lavanya Subramanian, Onur Mutlu.

Due to the economic downturn, Microsoft Research has eliminated all funding for title slides. We sincerely apologize for any impact these austerity measures.

Lecture on Electronic Memories. What Is Electronic Memory? Electronic device that stores digital information Types –Volatile v. non-volatile –Static v.

Ovonic Unified Memory.

Defining Anomalous Behavior for Phase Change Memory

A Novel Cache Architecture with Enhanced Performance and Security Zhenghong Wang and Ruby B. Lee.

© 2007 IBM Corporation HPCA – 2010 Improving Read Performance of PCM via Write Cancellation and Write Pausing Moinuddin Qureshi Michele Franceschini and.

1 Reducing DRAM Latencies with an Integrated Memory Hierarchy Design Authors Wei-fen Lin and Steven K. Reinhardt, University of Michigan Doug Burger, University.

Row Buffer Locality Aware Caching Policies for Hybrid Memories HanBin Yoon Justin Meza Rachata Ausavarungnirun Rachael Harding Onur Mutlu.

Optimizing DRAM Timing for the Common-Case Donghyuk Lee Yoongu Kim, Gennady Pekhimenko, Samira Khan, Vivek Seshadri, Kevin Chang, Onur Mutlu Adaptive-Latency.

A Row Buffer Locality-Aware Caching Policy for Hybrid Memories HanBin Yoon Justin Meza Rachata Ausavarungnirun Rachael Harding Onur Mutlu.

Outline Motivation Simulation Framework Experimental methodology STT-RAM (Spin Torque Transference) ReRAM (Resistive RAM) PCRAM (Phase change) Comparison.

By Edward A. Lee, J.Reineke, I.Liu, H.D.Patel, S.Kim

Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative

COMP203/NWEN Memory Technologies 0 Plan for Memory Technologies Topic Static RAM (SRAM) Dynamic RAM (DRAM) Memory Hierarchy DRAM Accelerating Techniques.

Data Retention in MLC NAND FLASH Memory: Characterization, Optimization, and Recovery. 서동화

Simultaneous Multi-Layer Access Improving 3D-Stacked Memory Bandwidth at Low Cost Donghyuk Lee, Saugata Ghose, Gennady Pekhimenko, Samira Khan, Onur Mutlu.

Optimizing DRAM Timing for the Common-Case Donghyuk Lee Yoongu Kim, Gennady Pekhimenko, Samira Khan, Vivek Seshadri, Kevin Chang, Onur Mutlu Adaptive-Latency.

CS203 – Advanced Computer Architecture Main Memory Slides adapted from Onur Mutlu (CMU)

15-740/ Computer Architecture Lecture 5: Project Example Justin Meza Yoongu Kim Fall 2011, 9/21/2011.

CS161 – Design and Architecture of Computer Main Memory Slides adapted from Onur Mutlu (CMU)

Hang Zhang1, Xuhao Chen1, Nong Xiao1,2, Fang Liu1

Improving Multi-Core Performance Using Mixed-Cell Cache Architecture

Reducing Memory Interference in Multicore Systems

18-447: Computer Architecture Lecture 23: Caches

Adaptive Cache Partitioning on a Composite Core

18-447: Computer Architecture Lecture 34: Emerging Memory Technologies

Understanding Latency Variation in Modern DRAM Chips Experimental Characterization, Analysis, and Optimization Kevin Chang Abhijith Kashyap, Hasan Hassan,

Lecture 23: Cache, Memory, Security

Computer Architecture & Operations I

Moinuddin K. Qureshi ECE, Georgia Tech Gabriel H. Loh, AMD

Samira Khan University of Virginia Oct 9, 2017

Scalable High Performance Main Memory System Using PCM Technology

Better I/O Through Byte-Addressable, Persistent Memory

Vulnerabilities in MLC NAND Flash Memory Programming: Experimental Analysis, Exploits, and Mitigation Techniques Yu Cai, Saugata Ghose, Yixin Luo, Ken.

Row Buffer Locality Aware Caching Policies for Hybrid Memories

William Stallings Computer Organization and Architecture 7th Edition

BIC 10503: COMPUTER ARCHITECTURE

Lecture 6: Reliability, PCM

Adapted from slides by Sally McKee Cornell University

Jianbo Dong, Lei Zhang, Yinhe Han, Ying Wang, and Xiaowei Li

Reducing DRAM Latency via

Lecture 22: Cache Hierarchies, Memory

Samira Khan University of Virginia Nov 14, 2018

15-740/ Computer Architecture Lecture 19: Main Memory

Literature Review A Nondestructive Self-Reference Scheme for Spin-Transfer Torque Random Access Memory (STT-RAM) —— Yiran Chen, et al. Fengbo Ren 09/03/2010.

Samira Khan University of Virginia Mar 27, 2019

Prof. Onur Mutlu Carnegie Mellon University 9/21/2012

Presented by Florian Ettinger

Presentation transcript:

Architecting Phase Change Memory as a Scalable DRAM Alternative Benjamin C. Lee, Engin Ipek, Onur Mutlu, Doug Burger ISCA 2009 Presented by Skanda Koppula December 2018

Problem Charge-based storage is hard to scale! Subthreshold charge leakage Neighbor interference and capacitive coupling (e.g. Rowhammer) Capacitors: large enough to store charge for reliable sensing Access transistors: large enough to exercise control over bit cell Manufacturability of <40nm DRAM was unknown Image adapted from https://users.ece.cmu.edu/~omutlu/pub/understanding-latency-variation-in-DRAM-chips_sigmetrics16.pdf

PCM is an alternative memory technology Phase Change Memory (PCM) is a non-volatile resistive-based memory Each bit cell contains a deposit of chalcogenide glass (GST) GST is either a crystallized state (low resistance) or amorphous state (high resistance) Image adapted from https://people.inf.ethz.ch/omutlu/pub/pcm_isca09.pdf

PCM is an alternative memory technology Phase Change Memory (PCM) is a non-volatile resistive-based memory Each bit cell contains a deposit of chalcogenide glass (GST) GST is either a crystallized state (low resistance) or amorphous state (high resistance) Application of current changes GST state Amorphous Crystalline Image adapted from https://people.inf.ethz.ch/omutlu/pub/pcm_isca09.pdf

PCM is an alternative memory technology Phase Change Memory (PCM) is a non-volatile resistive-based memory Each bit cell contains a deposit of chalcogenide glass (GST) GST is either a crystallized state (low resistance) or amorphous state (high resistance) Scalable with decreasing technology node But naive application of PCM consumes 2.2x more energy than DRAM and is 1.6x slower on SPEC benchmarks Image adapted from https://people.inf.ethz.ch/omutlu/pub/pcm_isca09.pdf

Key Ideas: Phase Change Memory can be a competitive main memory solution if the system is architected with a PCM-optimized memory design This paper contributes: A study of row buffer reorganizations demonstrating: Narrow row buffers mitigate high energy writes, making PCM power comparable to DRAM Multiple row buffers allow write coalescing, reducing the PCM slowdown from 1.6x to 1.2x A proposal to track data modifications, and execute partial row writes Extends PCM lifetime by four orders of magnitude

This is now! Images from https://www.pcper.com/news/Storage/Intels-Optane-DC-Persistent-Memory-DIMMs-Push-Latency-Closer-DRAM, https://en.wikipedia.org/wiki/3D_XPoint#/media/File:3D_XPoint.png

PCM Characteristics Survey of PCM Prototypes: 2003 - 2008 Speculated future PCM settings in 2008: GST bit cell material BJT access device 108 write cycles 42ns read latency // 40uW read power <100ns write latency // 480 uW write power 9 - 12F2 density using BJTs 0.05W idle power

PCM Characteristics: Then and Now Survey of PCM Prototypes: 2003 - 2008 Speculated future PCM settings in 2008: GST bit cell material (same as in 2017) BJT access device (same as in 2017) 108 write cycles lifetime (same as in 2017) 42ns read latency // 40uW read power (same in 2017) <100ns write latency // 480 uW write power (lower in 2017) 9 - 12F2 density using BJTs (lower in 2017) 0.05W idle power (same in 2017) Derived from Table 1 in Dynamic Adaptive Replacement Policy in Shared Last-Level Cache of DRAM/PCM Hybrid Memory for Big Data Storage

PCM Characteristics vs. DRAM Survey of PCM Prototypes: 2003 - 2008 Speculated future PCM settings in 2008: GST bit cell material (N/A in DRAM) BJT access device (N/A in DRAM) 108 write cycle lifetime (108x lower than DRAM) <50ns read latency // 40uW read power (5x higher than DRAM) <100ns write latency // 480 uW write power (12x higher than DRAM) 9 - 12F2 density using BJTs (1.3x higher than DRAM) 0.05W idle power (14x lower than DRAM) From https://people.inf.ethz.ch/omutlu/pub/lee_isca09_talk.pdf

Mechanisms How do we fix we high read and write latencies? How do we fix the high read and write power consumption? Key PCM/DRAM difference: PCM row activation + read is non-destructive ⇒ Narrow and multiple row buffers!

Mechanisms DRAM Selected row’s charge setting is lost and subsequently restored MEMORY ARRAY activated row Row Decoder SENSE AMP ROW BUFFER 1-8KB

Mechanisms PCM Proposal No row restoration enables ‘Narrow Buffers’ Divide the length of the sense amp and row buffers by up to 32X to read/write only the pertinent row section MEMORY ARRAY activated row Row Decoder SENSE AMP SENSE AMP SENSE AMP ROW BUFFER ROW BUFFER ROW BUFFER

Mechanisms PCM Proposal No row restoration enables ‘Narrow Buffers’ Multiple row buffers allow write coalescing (LRU eviction) Reduce number of writes that hit the memory array: better endurance! MEMORY ARRAY activated row Row Decoder SENSE AMP SENSE AMP SENSE AMP ROW BUFFER 1 ROW BUFFER 1 ROW BUFFER 1 ROW BUFFER 2 ROW BUFFER 2 ROW BUFFER 2

What does this buy us? Smaller sense-amps avoid fan-out area blow-up of wide sense-amps Partial row writes avoid large current/power waste for parts of row that are not modified Multiple row buffers increase write coalescing!

What does this buy us? Each dot is a different buffer width/number of buffers combination PCM base is (presumably) the undivided, nominal width.

What does this buy us? % of Nominal Runtime (

What does this cost us? Need additional decoders to mux between multiple row buffers PCM uses current sense amp which consumes more area than the voltage sense amp of DRAM Latches to keep data in the multiple row buffers Narrow buffers reduce write-coalescing

Mechanism Buffer reorganization improves PCM energy consumption Buffer reorganization improves PCM application slowdown Can we further improve the lifetime and write endurance problem? Yes, if we reduce the number of writes even more!

Mechanism: Partial Writes Reduce number of writes to PCM array by tracking dirty data from caches During eviction, only dirty words are written back Trade-off: ‘Modest’ increase in cache state (3.1% extra cache bits) to reduce writes

Evaluation: Partial Writes Uses an analytical model to estimate number of PCM array writes. Factors in: Different application write intensity Buffer organizations Dirty-tracking granularities Nominal lifetime running the benchmarks 24/7 is 525 hours ~ 0.05 years

Evaluation: Partial Writes Improves endurance to: 0.7 years (cacheline gran.) 5.6 years (word granularity) Reduced writes to PCM array by 41% (cacheline) and 92% (word granularity)

Strengths Demonstrates that PCM has the potential to be a viable memory solution Highly prescient of future trends! Presents two very effective techniques to reduce energy, latency, and write wear of PCM memories: buffer reorganization and partial writes Does all of this in a time when physical PCM modules were not widespread or very available Good exposition of how PCM works

Weaknesses Baselines are weak Uses DDR2-800 as a baseline (2008 paper) No comparison with equivalently optimized DRAM (e.g. multiple row-buffers) DRAM energy analysis is analytical, based on MICRON datasheets and technical notes VAMPIRE: “Don’t always trust the datasheets: actual consumption may be lower” Endurance estimates are based on self-created approximate analytical model, no evaluation on how this aligns with empirical data Exposition of core idea can be improved Some graph axis mislabeled and other labels missing (e.g. Figure 8) Details missing on evaluation experiments

Benjamin C. Lee, Engin Ipek, Onur Mutlu, Doug Burger Questions? (before proposals and discussion starters) Architecting Phase Change Memory as a Scalable DRAM Alternative Benjamin C. Lee, Engin Ipek, Onur Mutlu, Doug Burger Presented by Skanda Koppula December 2018

Questions and Proposals Can we extend the characterization? How does DDR4/LPDDR4/NAND Flash compare to PCM today? How valid are the energy simulation results and endurance models to commercially available PCM modules?

Questions and Proposals Can we extend the characterization? How does DDR4/LPDDR4/NAND Flash compare to PCM today? How valid are the energy simulation results and endurance models to commercially available PCM modules? Does PCM introduce new security threads? How sensitive is the read and write process in PCM to external temperature variation? Are RowHammer like attacks possible in PCM by playing with external temperatures or targeted heating of chip area? What are current wear-leveling algorithms for PCM? Do manage at fine enough granularity to prevent targeted attacks that induce early aging in specific memory sections?

Questions and Proposals How can we combine memory architectures? Hybrid memory: how would we architect a PCM-DRAM-Flash-HBM-SRAM memory systems for current and future workloads? What would the cache hierarchy look like? Buffer organization? Do we have simulators for such systems? What are the ideal partitioning and controller policies?

Questions and Proposals How can we combine memory architectures? Hybrid memory: how would we architect a PCM-DRAM-Flash-HBM-SRAM memory systems for current and future workloads? What would the cache hierarchy look like? Buffer organization? Do we have simulators for such systems? What are the ideal partitioning and controller policies? Miscellaneous Is it possible to reduce the expensive SET/RESET voltages/latencies and compensate with ECC?