Efficient Data Mapping and Buffering Techniques for Multi-Level Cell Phase-Change Memories HanBin Yoon, Justin Meza, Naveen Muralimanohar*, Onur Mutlu, Norm Jouppi*

Presentation transcript:

Efficient Data Mapping and Buffering Techniques for Multi-Level Cell Phase-Change Memories HanBin Yoon, Justin Meza, Naveen Muralimanohar*, Onur Mutlu, Norm Jouppi* † Carnegie Mellon University * Hewlett-Packard Labs † Google, Inc.

2 Executive Summary Phase-change memory (PCM) is a promising emerging technology  More scalable than DRAM, faster than flash  Multi-level cell (MLC) PCM = multiple bits per cell → high density Problem: Higher latency/energy compared to non-MLC PCM Observation: MLC bits have asymmetric read/write characteristics  Some bits can be read quickly but written slowly and vice versa

3 Executive Summary Goal: Read data from fast-read bits; write data to fast-write bits Solution:  Decouple bits to expose fast-read/write memory regions  Map read/write-intensive data to appropriate memory regions  Split device row buffers to leverage decoupling for better locality Result: – Improved performance (+19.2%) and energy efficiency (+14.4%) – Across SPEC CPU2006 and data-intensive/cloud workloads

4 Outline Background Problem and Goal Key Observations – MLC-PCM cell read asymmetry – MLC-PCM cell write asymmetry Our Techniques – Decoupled Bit Mapping (DBM) – Asymmetric Page Mapping (APM) – Split Row Buffering (SRB) Results Conclusions

5 Background: PCM Emerging high-density memory technology – Potential as a scalable DRAM alternative Projected to be 3 to 12x denser than DRAM Access latency within an order of magnitude of DRAM Stores data as the resistance of the cell material

6 PCM Resistance → Value [Figure: a single-level PCM cell's resistance range mapped to a 1-bit value]

7 Background: MLC-PCM Multi-level cell: more than 1 bit per cell  Further increases density by 2 to 4x [Lee+,ISCA'09] But MLC-PCM also has drawbacks  Higher latency and energy than single-level cell PCM  Let's take a look at why this is the case

8 MLC-PCM Resistance → Value [Figure: the cell's resistance range divided into four levels, mapped to 2-bit values (Bit 1, Bit 0)]

9 MLC-PCM Resistance → Value [Figure: four resistance levels mapped to 2-bit values] Less margin between values → need more precise sensing/modification of cell contents → higher latency/energy (~2x for reads, ~4x for writes)
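
The level-to-value mapping described above can be sketched in a few lines. This is an illustrative model only: the resistance thresholds and the Gray-coded value order are assumptions for the example, not the device's actual parameters.

```python
# Illustrative sketch: mapping an MLC-PCM cell's analog resistance to a
# 2-bit value via three sensing thresholds. The thresholds and the value
# ordering here are invented for illustration; real devices use
# device-specific levels and iterative sensing.

# Hypothetical resistance thresholds (ohms) separating the four levels.
THRESHOLDS = [10_000, 100_000, 1_000_000]

# Gray-coded 2-bit values, lowest resistance first, so adjacent bands
# differ in only one bit (which limits the damage of a sensing error).
GRAY_VALUES = ["11", "10", "00", "01"]

def sense(resistance: float) -> str:
    """Return the 2-bit value stored in a cell with this resistance."""
    for level, threshold in enumerate(THRESHOLDS):
        if resistance < threshold:
            return GRAY_VALUES[level]
    return GRAY_VALUES[-1]   # highest-resistance band

print(sense(5_000))      # lowest band  -> "11"
print(sense(500_000))    # third band   -> "00"
```

Squeezing four bands into the same resistance range is what shrinks the margin between values and drives up sensing latency and energy.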

10 Problem and Goal Want to leverage MLC-PCM's strengths – Higher density – More scalability than existing technologies (DRAM) But, also want to mitigate MLC-PCM's weaknesses – Higher latency/energy Our goal in this work is to design new hardware/software optimizations that mitigate the weaknesses of MLC-PCM

11 Outline Background Problem and Goal Key Observations – MLC-PCM cell read asymmetry – MLC-PCM cell write asymmetry Our Techniques – Decoupled Bit Mapping (DBM) – Asymmetric Page Mapping (APM) – Split Row Buffering (SRB) Results Conclusions

12 Observation 1: Read Asymmetry The read latency/energy of Bit 1 is lower than that of Bit 0 This is due to how MLC-PCM cells are read

13 Observation 1: Read Asymmetry Simplified example: [Figure: a capacitor charged to a reference voltage, connected to an MLC-PCM cell of unknown resistance]

14 Observation 1: Read Asymmetry [Figure: the capacitor discharges through the cell]

15 Observation 1: Read Asymmetry [Figure: the discharge time is measured to infer the data value]

16–19 Observation 1: Read Asymmetry [Figure: voltage vs. time — the capacitor starts at its initial voltage (fully charged), then drains once the PCM cell is connected]

20 Observation 1: Read Asymmetry [Figure: capacitor drained → data value known (01)]

21 Observation 1: Read Asymmetry In existing devices – Both MLC bits are read at the same time – Must wait the maximum sensing time to read both bits However, we can infer Bit 1's value before this time

22–23 Observation 1: Read Asymmetry [Figure: voltage vs. time — discharge curves for the possible cell states]

24 Observation 1: Read Asymmetry [Figure: time to determine Bit 1's value (early in the discharge)]

25 Observation 1: Read Asymmetry [Figure: time to determine Bit 0's value (requires the full discharge)]
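
A toy RC-discharge model makes the asymmetry concrete: the time the capacitor takes to fall to a reference voltage grows with cell resistance, so the coarse distinction Bit 1 needs resolves before the fine one Bit 0 needs. All numbers (V0, C, the resistance levels) are invented for illustration.

```python
# Toy model of the capacitor-discharge read sketched in the figures above.
import math

V0 = 1.0       # initial capacitor voltage (volts, illustrative)
C = 1e-12      # capacitance (farads, illustrative)
V_REF = 0.5    # sensing reference voltage

def discharge_time(resistance: float) -> float:
    """Time (s) for an RC discharge from V0 to fall to V_REF."""
    return resistance * C * math.log(V0 / V_REF)

# Four illustrative resistance levels, lowest to highest.
levels = [1e4, 1e5, 1e6, 1e7]
times = [discharge_time(r) for r in levels]

# Bit 1 splits the levels into two halves, so it is resolved as soon as
# the low pair can be told apart from the high pair. Bit 0 needs full
# resolution, i.e. waiting until the slowest level has discharged.
t_bit1 = times[1]   # enough to separate levels {0,1} from {2,3}
t_bit0 = times[3]   # must wait for the slowest level

assert t_bit1 < t_bit0   # Bit 1 is the fast-read bit
```

The gap between `t_bit1` and `t_bit0` is exactly the latency the paper's techniques recover for reads that only need the fast-read bit.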

26 Observation 2: Write Asymmetry The write latency/energy of Bit 0 is lower than that of Bit 1 This is due to how PCM cells are written In PCM, cell resistance must physically be changed – Requires applying different amounts of current – For different amounts of time

27 Observation 2: Write Asymmetry
– Writing both bits in an MLC cell: 250 ns
– Only writing Bit 0: 210 ns
– Only writing Bit 1: 250 ns
Existing devices write both bits simultaneously (250 ns)
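
The slide's timings reduce to a simple comparison: a coupled baseline always pays the worst-case write latency, while a decoupled design can pay the fast-write latency whenever the write lands in the fast-write region.

```python
# Write latencies from the slide above (ns).
T_WRITE_BOTH = 250   # program both bits of a cell (coupled baseline)
T_WRITE_FW   = 210   # program only Bit 0 (fast-write)
T_WRITE_FR   = 250   # program only Bit 1 (slow-write)

def write_latency(decoupled: bool, writes_fw_region: bool) -> int:
    """Per-write latency in ns under the coupled baseline vs. decoupling."""
    if not decoupled:
        return T_WRITE_BOTH              # baseline always writes both bits
    return T_WRITE_FW if writes_fw_region else T_WRITE_FR

# A write-intensive page mapped to the FW region saves 40 ns per write.
saving = write_latency(False, True) - write_latency(True, True)
print(saving)  # 40
```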

28 Key Observation Summary Bit 1 is faster to read than Bit 0 Bit 0 is faster to write than Bit 1 We refer to Bit 1 as the fast-read/slow-write bit (FR) We refer to Bit 0 as the slow-read/fast-write bit (FW) We leverage read/write asymmetry to enable several optimizations

29 Outline Background Problem and Goal Key Observations – MLC-PCM cell read asymmetry – MLC-PCM cell write asymmetry Our Techniques – Decoupled Bit Mapping (DBM) – Asymmetric Page Mapping (APM) – Split Row Buffering (SRB) Results Conclusions

30 Technique 1: Decoupled Bit Mapping (DBM) Key Idea: Logically decouple FR bits from FW bits – Expose FR bits as low-read-latency regions of memory – Expose FW bits as low-write-latency regions of memory

31 Technique 1: Decoupled Bit Mapping (DBM) MLC-PCM cell Bit 1 (FR) Bit 0 (FW)

32 Technique 1: Decoupled Bit Mapping (DBM) bit Coupled (baseline): Contiguous bits alternate between FR and FW MLC-PCM cell Bit 1 (FR) Bit 0 (FW)

33 Technique 1: Decoupled Bit Mapping (DBM) bit Coupled (baseline): Contiguous bits alternate between FR and FW MLC-PCM cell Bit 1 (FR) Bit 0 (FW)

Technique 1: Decoupled Bit Mapping (DBM) bit Coupled (baseline): Contiguous bits alternate between FR and FW bit Decoupled: Contiguous regions alternate between FR and FW MLC-PCM cell Bit 1 (FR) Bit 0 (FW)

35 Technique 1: Decoupled Bit Mapping (DBM) By decoupling, we've created regions with distinct characteristics – We examine the use of 4KB regions (e.g., OS page size) Want to match frequently read data to FR pages and vice versa Toward this end, we propose a new OS page allocation scheme [Figure: physical address space divided into alternating fast-read and fast-write pages]
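
At page granularity the mapping can be sketched as below. The even/odd interleaving is an assumption for the example; the slide's point is only that DBM exposes homogeneous FR and FW regions in the physical address space.

```python
# Sketch of Decoupled Bit Mapping at page granularity: with 4 KB regions
# (matching the OS page size), alternate physical pages are served entirely
# by fast-read (FR) bits or entirely by fast-write (FW) bits. The exact
# interleaving here is a hypothetical choice for illustration.
PAGE_SIZE = 4096  # bytes

def region_of_page(physical_page_number: int) -> str:
    """Classify a physical page as fast-read (FR) or fast-write (FW)."""
    return "FR" if physical_page_number % 2 == 0 else "FW"

print([region_of_page(p) for p in range(4)])  # ['FR', 'FW', 'FR', 'FW']
```

Contrast this with the coupled baseline, where every cache block mixes FR and FW bits and so inherits the worst of both latencies.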

36 Technique 2: Asymmetric Page Mapping (APM) Key Idea: predict page read/write intensity and map accordingly – Measure write intensity of instructions that access data – If an instruction has high write intensity and first touches a page » OS allocates an FW page; otherwise, it allocates an FR page Implementation (full details in paper) – Small hardware cache of instructions that often write data – Updated by cache controller when data written to memory – New instruction for OS to query table for prediction
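
The predictor described above can be sketched as a small LRU table keyed by instruction PC (this matches the PC-table backup slide). The table capacity and the write-intensity threshold below are illustrative assumptions, not the paper's parameters.

```python
# Minimal sketch of the APM predictor: a small table of program counters
# of instructions that write back to memory, with a per-entry write count.
# On a first-touch page fault, the OS queries the table; pages first
# touched by a write-intensive instruction get a fast-write (FW) page.
from collections import OrderedDict

class APMPredictor:
    def __init__(self, capacity: int = 16, threshold: int = 4):
        self.table = OrderedDict()   # PC -> writeback count, LRU-ordered
        self.capacity = capacity     # illustrative size
        self.threshold = threshold   # illustrative write-intensity cutoff

    def record_writeback(self, pc: int) -> None:
        """Cache-controller path: count writebacks per instruction PC."""
        self.table[pc] = self.table.get(pc, 0) + 1
        self.table.move_to_end(pc)
        if len(self.table) > self.capacity:
            self.table.popitem(last=False)   # evict least recently used

    def allocate_region(self, first_touch_pc: int) -> str:
        """OS path: choose a region for a newly touched page."""
        if self.table.get(first_touch_pc, 0) >= self.threshold:
            return "FW"   # write-intensive instruction -> fast-write page
        return "FR"       # default: fast-read page

pred = APMPredictor()
for _ in range(5):
    pred.record_writeback(0x00400F94)   # hypothetical store instruction
print(pred.allocate_region(0x00400F94))  # FW
print(pred.allocate_region(0x00400FA6))  # FR (PC not in table)
```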

37 Technique 3: Split Row Buffering (SRB) Row buffer stores contents of currently-accessed data – Used to buffer data when sending/receiving across I/O ports Key Idea: With DBM, buffer FR bits independently from FW bits – Coupled (baseline): must use large monolithic row buffer (8KB) – DBM: can use two smaller associative row buffers (2x4KB) – Can improve row buffer locality, reducing latency and energy Implementation (full details in paper) – No additional SRAM buffer storage – Requires multiplexer logic for selecting FR/FW buffers
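
The benefit of splitting the row buffer can be shown with a tiny hit/miss model: with two independent half-row buffers, an access to one region no longer evicts the other region's buffered row. The single-entry-per-region policy below is a simplification for illustration.

```python
# Sketch of Split Row Buffering (SRB): instead of one monolithic 8 KB
# buffer holding a whole coupled row, keep two independent 4 KB buffers,
# one for the FR half-row and one for the FW half-row.

class SplitRowBuffer:
    def __init__(self):
        # The currently buffered row per region (None = empty).
        self.buffers = {"FR": None, "FW": None}

    def access(self, region: str, row: int) -> bool:
        """Return True on a row-buffer hit, False on a miss (and fill)."""
        if self.buffers[region] == row:
            return True
        self.buffers[region] = row   # miss: fetch this half-row
        return False

srb = SplitRowBuffer()
print(srb.access("FR", 7))   # False: cold miss
print(srb.access("FW", 3))   # False, but does not evict the FR half-row
print(srb.access("FR", 7))   # True: FR half-row is still buffered
```

With a single coupled buffer, the third access would have missed, since the second access would have replaced the entire row; this is the locality gain SRB provides without extra SRAM.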

38 Outline Background Problem and Goal Key Observations – MLC-PCM cell read asymmetry – MLC-PCM cell write asymmetry Our Techniques – Decoupled Bit Mapping (DBM) – Asymmetric Page Mapping (APM) – Split Row Buffering (SRB) Results Conclusions

39 Evaluation Methodology Cycle-level x86 CPU-memory simulator – CPU: 8 cores, 32KB private L1/512KB private L2 per core – Shared L3: 16MB on-chip eDRAM – Memory: MLC-PCM, dual channel DDR3 1066MT/s, 2 ranks Workloads – SPEC CPU2006, NASA parallel benchmarks, GraphLab Performance metrics – Multi-programmed (SPEC): weighted speedup – Multi-threaded (NPB, GraphLab): execution time
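
For reference, the multi-programmed metric named above, weighted speedup, is the sum over programs of shared-mode IPC divided by alone-mode IPC. The IPC values in the example are made up.

```python
# Weighted speedup: each program's IPC when sharing the system,
# normalized to its IPC when running alone, summed over all programs.
def weighted_speedup(shared_ipc, alone_ipc):
    return sum(s / a for s, a in zip(shared_ipc, alone_ipc))

# Two-program example with hypothetical IPCs.
print(round(weighted_speedup([0.8, 1.5], [1.0, 2.0]), 2))  # 1.55
```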

40 Comparison Points Conventional: coupled bits (slow read, slow write) All-FW: hypothetical all-FW memory (slow read, fast write) All-FR: hypothetical all-FR memory (fast read, slow write) DBM: decoupled bit mapping (50% FR pages, 50% FW pages) DBM+: techniques that leverage DBM (APM and SRB) Ideal: idealized cells with best characteristics (fast read, fast write)

41 System Performance [Chart: speedup normalized to Conventional — All fast write +10%, All fast read +16%, DBM +13%, DBM+APM+SRB +19%, Ideal +31%]

42 System Performance (chart repeated) All-FR > All-FW → dependent on workload access patterns

43 System Performance (chart repeated) DBM allows systems to take advantage of reduced read latency (FR region) and reduced write latency (FW region)

44 Memory Energy Efficiency [Chart: performance per watt normalized to Conventional — All fast write +5%, All fast read +12%, DBM +8%, DBM+APM+SRB +14%, Ideal +30%]

45 Memory Energy Efficiency (chart repeated) Benefits from lower read energy by exploiting read asymmetry (the dominant case) and from lower write energy by exploiting write asymmetry

46 Other Results in the Paper Improved thread fairness (less resource contention) – From speeding up per-thread execution Techniques do not exacerbate PCM wearout problem – ~6 year operational lifetime possible

47 Outline Background Problem and Goal Key Observations – MLC-PCM cell read asymmetry – MLC-PCM cell write asymmetry Our Techniques – Decoupled Bit Mapping (DBM) – Asymmetric Page Mapping (APM) – Split Row Buffering (SRB) Results Conclusions

48 Conclusions Phase-change memory (PCM) is a promising emerging technology  More scalable than DRAM, faster than flash  Multi-level cell (MLC) PCM = multiple bits per cell → high density Problem: Higher latency/energy compared to non-MLC PCM Observation: MLC bits have asymmetric read/write characteristics  Some bits can be read quickly but written slowly and vice versa

49 Conclusions Goal: Read data from fast-read bits; write data to fast-write bits Solution:  Decouple bits to expose fast-read/write memory regions  Map read/write-intensive data to appropriate memory regions  Split device row buffers to leverage decoupling for better locality Result: – Improved performance (+19.2%) and energy efficiency (+14.4%) – Across SPEC CPU2006 and data-intensive/cloud workloads

50 Thank You!

Efficient Data Mapping and Buffering Techniques for Multi-Level Cell Phase-Change Memories HanBin Yoon, Justin Meza, Naveen Muralimanohar*, Onur Mutlu, Norm Jouppi* † Carnegie Mellon University * Hewlett-Packard Labs † Google, Inc.

52 Backup Slides

53 PCM Cell Operation

54 Integrating ADC

55 APM Implementation [Figure: program execution (e.g., mov/lea/neg instructions at PCs around 0x00400f91) indexes a PC table whose entries pair a program counter with a writeback count; the cache controller updates the counts on write accesses to memory]