Presentation transcript:

Slide 1: Improving Read Performance of PCM via Write Cancellation and Write Pausing
HPCA 2010. Moinuddin Qureshi, Michele Franceschini, and Luis Lastras, IBM T. J. Watson Research Center, Yorktown Heights, NY. (© 2007 IBM Corporation)

Slide 2: Introduction
More cores in the system → more concurrency → larger working sets. DRAM-based memory systems are hitting power, cost, and scaling walls. Phase Change Memory (PCM) is an emerging technology projected to be more scalable, denser, and more power-efficient.

Slide 3: PCM Operation
A PCM cell consists of an access device and a memory element, and is switched by heating the element with electrical pulses. A large RESET current heats the element above T_melt, quenching it into the amorphous, high-resistance RESET state; a smaller, longer SET current heats it above T_cryst, crystallizing it into the low-resistance SET state. Read latency is 2x-4x that of DRAM; write latency is much higher. [Figure: temperature vs. time for the RESET and SET pulses; device photo courtesy of Bipin Rajendran, IBM]

Slide 4: Problem of Contention from Slow Writes
PCM writes are 4x-8x slower than reads. Writes are not latency-critical, so the typical response is to use large buffers and intelligent scheduling. But once a write is scheduled to a bank, a later-arriving read must wait: write requests cause contention, increasing read latency.

Slide 5: Outline
- Introduction
- Quantifying the Problem
- Adaptive Write Cancellation
- Write Pausing
- Combining Cancellation & Pausing
- Summary

Slide 6: Configuration: Hybrid Memory
The processor chip sits in front of a DRAM cache (256MB), which in turn fronts PCM-based main memory. Each PCM bank has a separate read queue (RDQ) and write queue (WRQ), each 32 entries. The baseline uses read-priority scheduling while the WRQ is less than 80% full; once the WRQ exceeds 80% full, writes are drained oldest-first as "forced writes" (rare: <0.1% of writes).
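A minimal sketch of this baseline per-bank policy (the function and queue names are illustrative, not from the paper):

    # Hypothetical sketch of the baseline scheduler: reads have priority
    # unless the WRQ is nearly full, in which case the oldest write is
    # issued as an uninterruptible "forced write".
    WRQ_SIZE = 32
    FORCED_THRESHOLD = 0.8 * WRQ_SIZE          # 80% occupancy

    def schedule_next(rdq, wrq):
        """Pick the next request for an idle PCM bank.
        rdq, wrq: lists of pending requests, oldest first.
        Returns (request, forced)."""
        if len(wrq) > FORCED_THRESHOLD:
            return wrq.pop(0), True            # forced write: oldest-first, rare
        if rdq:
            return rdq.pop(0), False           # read priority
        if wrq:
            return wrq.pop(0), False           # normal (interruptible) write
        return None, False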

Slide 7: Problem
Writes significantly increase read latency (a problem only for asymmetric memories). Read latency = 1K cycles, write latency = 8K cycles (sensitivity study in the paper). 12 workloads, each with 8 benchmarks from SPEC2006. [Figure: effective read latency (cycles) and normalized execution time for Baseline, No Read Priority, Write Latency=1K, and Write Latency=0]

Slide 8: Outline
- Introduction
- Problem: Writes Delaying Reads
- Adaptive Write Cancellation
- Write Pausing
- Combining Cancellation & Pausing
- Summary

Slide 9: Write Cancellation
Write Cancellation: "abort" an on-going write to improve read latency. The aborted line is left in a non-deterministic state, so a read matching that line is serviced from its pending copy in the WRQ. Cancellation is performed as soon as a read request arrives at a bank (as long as the write is not being done in forced mode).
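A hedged sketch of this rule; the bank and request interfaces below are assumptions for illustration:

    # Cancel an in-flight write as soon as a read arrives at the bank,
    # unless the write is a forced write. A line mid-write is in a
    # non-deterministic state, so reads to it are served from the WRQ.
    def on_read_arrival(bank, read):
        if bank.busy_writing and read.addr == bank.write_addr:
            return bank.wrq_data(read.addr)    # forward data from the WRQ entry
        if bank.busy_writing and not bank.forced_mode:
            bank.abort_write()                 # write stays queued in the WRQ
        bank.rdq.append(read)                  # read gets serviced next
        return None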

Slide 10: Write Cancellation with Static Threshold
WCST: cancel a write request only if less than K% of its service is done. Canceling a write request close to completion is wasteful and causes episodes of forced writes (low performance). [Figure: effective read latency vs. K, spanning NeverCancel (2365 cycles) to AlwaysCancel]

Slide 11: Adaptive Write Cancellation
The best threshold depends on the number of pending entries in the WRQ: fewer entries → higher threshold (best read latency); more entries → lower threshold (fewer forced writes). Write Cancellation with Adaptive Threshold (WCAT): Threshold = 100 - (4 × NumEntriesInWRQ). [Figure: threshold falling linearly from 100% toward 0% as WRQ occupancy grows, trading read latency against forced writes]
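A sketch of the cancellation decision using the slide's formula (function and parameter names are ours, not the paper's):

    # WCST uses a fixed threshold K; WCAT shrinks the threshold as the
    # WRQ fills, per Threshold = 100 - 4 * NumEntriesInWRQ.
    def wcat_threshold(num_entries_in_wrq):
        return max(0, 100 - 4 * num_entries_in_wrq)

    def should_cancel(pct_service_done, num_entries_in_wrq, adaptive=True, k=75):
        threshold = wcat_threshold(num_entries_in_wrq) if adaptive else k
        return pct_service_done < threshold    # cancel only early-stage writes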

Slide 12: Adaptivity of WCAT
We sampled every WRQ every 2M cycles to measure occupancy:

    Num entries in WRQ    Low (0-1)   Med (2-13)   High (14-25)   Forced (26+)
    WCST (K=75%)          61.4%       29.8%        7.4%           1.43%
    WCAT                  58.2%       35.4%        5.6%           0.72%

WCAT uses a higher threshold initially, while the WRQ is empty, but its lower threshold later reduces the episodes of forced writes.

Slide 13: Results for WCAT
Baseline: 2365 cycles; Ideal: 1K cycles. The adaptive threshold reduces latency further and incurs half the overhead of the static threshold.

Slide 14: Outline
- Introduction
- Problem: Writes Delaying Reads
- Adaptive Write Cancellation
- Write Pausing
- Combining Cancellation & Pausing
- Summary

Slide 15: Iterative Write in PCM Devices
In multi-level cells (MLC), the programming precision requirement increases linearly with the number of levels, and PCM cells respond differently to the same programming pulse. The acknowledged solution to this uncertainty is iterative writes: each iteration consists of write, read, and verify steps, repeated until the cell verifies.
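A minimal sketch of such a program-and-verify loop, assuming a hypothetical per-cell interface and an arbitrary iteration cap:

    # One MLC PCM cell: pulse, read back, verify; repeat until the cell
    # reaches the target level. All names here are illustrative.
    def iterative_write(cell, target_level, max_iters=20):
        for i in range(1, max_iters + 1):
            cell.apply_pulse(target_level)                   # write step
            level = cell.read()                              # read step
            if abs(level - target_level) <= cell.tolerance:  # verify step
                return i                                     # done after i iterations
        raise RuntimeError("cell failed to converge")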

Slide 16: Model for Iterative Writes
We develop an analytical model of the number of iterations, parameterized by bits/cell, the number of levels written in one shot, and learning. The time required to write a line is the worst case over all cells in the line. For MLC with 3 bits/cell, the average number of iterations is 8.3 (consistent with the MLC literature).
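The worst-case rule above amounts to taking a max over the line; a toy illustration (the per-cell counts are made-up example values, not measured data):

    # The line finishes only when its slowest cell finishes, so the line
    # write time is governed by the maximum per-cell iteration count.
    per_cell_iterations = [6, 9, 7, 11, 8]      # assumed example values
    line_iterations = max(per_cell_iterations)  # 11: the worst cell governs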

Slide 17: Concept of Write Pausing
Iterative writes can be paused to service pending read requests: reads can be performed at the end of each iteration (a potential pause point).

    Without pausing:  [Iter 1][Iter 2][Iter 3][Iter 4]      (pause points at iteration boundaries)
    With pausing:     [Iter 1][Iter 2][Rd X][Iter 3][Rd X][Iter 4]

This yields better read latency with negligible write overhead. We extend the iterative write algorithm of Nirschl et al. [IEDM'07] to support Write Pausing.
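A hedged sketch of the pausing loop, extending the verify loop shown earlier (cell and bank interfaces are assumptions):

    # At the end of every write iteration, drain any pending reads
    # before starting the next iteration.
    def paused_iterative_write(cell, target_level, bank):
        iterations = 0
        while True:
            cell.apply_pulse(target_level)          # one write iteration
            iterations += 1
            if cell.verify(cell.read(), target_level):
                return iterations                   # write complete
            while bank.rdq:                         # potential pause point:
                bank.service(bank.rdq.pop(0))       # service all pending reads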

Slide 18: Results for Write Pausing
Write Pausing at the end of each iteration captures 85% of the benefit of "Anytime" Pause.

Slide 19: Outline
- Introduction
- Problem: Writes Delaying Reads
- Adaptive Write Cancellation
- Write Pausing
- Combining Cancellation & Pausing
- Summary

Slide 20: Write Pausing + WCAT
With pausing alone, a read arriving mid-iteration must wait for the iteration to finish. Adding WCAT, the on-going iteration (e.g., Iter 2) is cancelled, the read (Rd X) is serviced immediately, and the cancelled iteration is re-executed. Only one iteration is cancelled → this "micro-cancellation" has low overhead.

    Pause only:           [Iter 1][Iter 2][Rd X][Iter 3][Rd X][Iter 4]
    Pause + cancellation: [Iter 1][Iter 2 (cancelled)][Rd X][Iter 2][Iter 3][Rd X][Iter 4]
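A sketch combining the two mechanisms (the interruptible-pulse interface is an assumption for illustration):

    # A read arriving mid-iteration aborts only the current iteration,
    # which is then redone; the paper reports <4% extra iterations.
    def write_with_micro_cancellation(cell, target_level, bank):
        iterations = 0
        while True:
            cell.start_pulse(target_level)          # interruptible iteration
            if bank.read_arrived():                 # read showed up mid-iteration
                cell.abort_pulse()                  # micro-cancel this iteration only
                bank.service_pending_reads()        # Rd X serviced immediately
                continue                            # re-execute the iteration
            cell.finish_pulse()
            iterations += 1
            if cell.verify(cell.read(), target_level):
                return iterations
            bank.service_pending_reads()            # normal pause point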

Slide 21: Results
Baseline: 2365 cycles; Ideal: 1K cycles. Write Pause + Micro Cancellation comes very close to Anytime Pause (the re-execution overhead of micro-cancellation is <4% extra iterations).

Slide 22: Impact of Write Queue Size
Large write buffers are needed to best exploit the benefit of pausing. [Figure: speedup w.r.t. the 32-entry baseline as WRQ size varies]

Slide 23: Outline
- Introduction
- Problem: Writes Delaying Reads
- Adaptive Write Cancellation
- Write Pausing
- Combining Cancellation & Pausing
- Summary

Slide 24: Summary
- Slow writes increase the effective read latency (2.3x)
- Write Cancellation: cancel an on-going write to service a read
- Threshold-based write cancellation; an adaptive threshold gives better performance at half the overhead
- Write Pausing exploits iterative writes to service pending reads
- Write Pausing + Micro Cancellation is close to optimal ("anytime") pausing
- Effective read latency: from 2365 to 1330 cycles (1.45x speedup)
- Large write buffers are needed to exploit the benefit of Pausing

Slide 25: Questions

Slide 26: Write Pausing in Iterative Algorithms (Nirschl et al., IEDM'07)

Slide 27: Workloads and Figure of Merit
12 memory-intensive workloads from SPEC2006: 6 rate-mode (eight copies of the same benchmark) and 6 mix-mode (two copies each of four benchmarks). Key metric: Effective Read Latency. Tin = time at which a read request enters the RDQ; Tout = time at which the read finishes service at memory. Effective Read Latency = Tout - Tin (average reported).

Slide 28: Sensitivity to Write Latency
At a write latency of 4K cycles, the speedup is 1.35x instead of the 1.45x obtained at 8K cycles.