FIRM: Fair and HIgh-PerfoRmance Memory Control for Persistent Memory Systems
Jishen Zhao, Onur Mutlu, Yuan Xie
MICRO-47, December 15, 2014

New Design Opportunity with NVMs
In a conventional system, the CPU accesses DRAM main memory through loads and stores, but that memory is not persistent; persistent data lives in disk/flash storage, reached through fopen/fread/fwrite and page faults.

New Design Opportunity with NVMs
Non-volatile memories (STT-MRAM, PCM, ReRAM, NV-DIMM, battery-backed DRAM, etc.) can serve as persistent memory that the CPU accesses directly with loads and stores, alongside non-persistent DRAM main memory and conventional disk/flash storage. Example persistent applications: databases, file systems, key-value stores; their in-memory data structures can immediately become permanent.

New Design Opportunity with NVMs
A new use case of NVM: concurrently running two types of applications, persistent applications that use NVM as persistent memory and non-persistent applications that use DRAM (and NVM) as non-persistent working memory [Kannan+ HPCA'14, Liu+ ASPLOS'14, Meza+ WEED'14].

Focus of Our Work: Memory Controller Design
Memory requests from both persistent and non-persistent applications go through the shared memory interface: a memory controller (MC) serving both DRAM and NVM. Our goal is fair and high-performance memory control.

Why Another Memory Control Scheme?
Many read/write scheduling schemes already exist: FR-FCFS [ISCA'00], STFM [MICRO'07], PAR-BS [ISCA'08], ATLAS [HPCA'10], TCM [MICRO'10], SMS [ISCA'12], BLISS [ICCD'14], and more. Persistent memory, however, requires careful control over writes to guarantee data persistence.

Memory Controller
The memory controller buffers CPU memory requests in a read queue and an NVM write queue, each with a limited number of entries; its scheduler determines which requests can be sent on the memory bus to be serviced. If one of the queues overflows, the MC needs to drain it by stalling the service of the other.
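To make the drain behavior concrete, here is a minimal C sketch of the conventional read-priority, drain-on-overflow policy described above; the queue capacity, state layout, and function name are illustrative assumptions, not any real controller's implementation:

```c
#include <stdbool.h>

#define WQ_CAP 32  /* write queue entries (assumed capacity) */

typedef struct {
    int reads;      /* pending requests in the read queue  */
    int writes;     /* pending requests in the write queue */
    bool draining;  /* true while force-draining the write queue */
} mc_state_t;

/* Decide what to issue this scheduling cycle: reads have priority,
 * unless the write queue overflowed and must be drained first. */
const char *schedule_next(mc_state_t *mc) {
    if (mc->writes >= WQ_CAP) mc->draining = true;   /* overflow: stall reads */
    if (mc->writes == 0)      mc->draining = false;  /* drained: resume reads */
    if (mc->draining)   { mc->writes--; return "write"; }
    if (mc->reads > 0)  { mc->reads--;  return "read";  }
    if (mc->writes > 0) { mc->writes--; return "write"; } /* bus otherwise idle */
    return "idle";
}
```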

This talk: why conventional memory control schemes are inefficient in persistent memory systems, and how to design fair and high-performance memory control in this new scenario.

Assumptions and Design Choices
Conventional memory control schemes rest on two assumptions, each with a matching design choice:
1. Reads are on the critical path of application execution (application execution is read-dependent), so prioritize reads over writes.
2. Applications are usually read-intensive, so delay writes until they overflow the write queue (write queue drains are infrequent).

These assumptions no longer hold in persistent memory, which needs to support data persistence: data consistency even when the system suddenly crashes or loses power. Persistence is provided by two mechanisms, multiversioning and write-order control.

Implication of Multiversioning
Updating a persistent data structure A involves multiple write requests to NVM. If the system crashes midway through the update, A is left in an inconsistent state.

Implication of Multiversioning
To survive crashes, persistent memory systems keep two versions via logging or copy-on-write: the new version A' (version N) and the original data A (version N-1). The two versions are not updated at the same time, so a crash always leaves one version intact. The cost is significantly increased write traffic: two writes for each data update [Volos+ ASPLOS'11, Coburn+ ASPLOS'11, Condit+ SOSP'09, Venkataraman+ FAST'11].

Assumptions and Design Choices
Multiversioning breaks assumption 2: persistent applications are usually write-intensive, not read-intensive. Write queue drains are therefore no longer infrequent, and delaying writes until they fill up the write queue becomes a poor design choice.

Implication of Write-order Control
The two versions are not supposed to be updated at the same time. The processor issues the writes in order, A' = {A'1, A'2, A'3} before A = {A1, A2, A3}, but caches and memory controllers can reorder them on the way to NVM (e.g., A2, A'2, A'1, A1, A3). A crash in the middle of such a reordered stream can then leave both versions inconsistent.

Implication of Write-order Control
To enforce the order, software issues: update A'; cache flush; memory barrier; update A. The cache flushes and memory barriers guarantee that writes to A' complete before writes to A start. This restricts the ordering of writes arriving at memory and makes application execution write-dependent: subsequent writes, reads, and computation can all depend on a previously issued write [Volos+ ASPLOS'11, Coburn+ ASPLOS'11, Condit+ SOSP'09, Venkataraman+ FAST'11].
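As a concrete illustration, here is a minimal C sketch of the update pattern above, assuming x86 _mm_clflush/_mm_sfence intrinsics and a 64-byte cache line; persistent_update, flush_range, and the log layout are hypothetical names used for illustration, not the paper's code:

```c
#include <emmintrin.h>  /* _mm_clflush, _mm_sfence (x86, assumed) */
#include <stdint.h>
#include <string.h>

/* Flush every 64-byte cache line covering [addr, addr+len), then fence
 * so the flushes complete before any later store is issued. */
static void flush_range(const void *addr, size_t len) {
    uintptr_t p = (uintptr_t)addr & ~(uintptr_t)63;
    for (; p < (uintptr_t)addr + len; p += 64)
        _mm_clflush((const void *)p);
    _mm_sfence();  /* memory barrier */
}

/* Multiversioned update: persist the new version A' (in the log)
 * before overwriting the original A, so a crash leaves one intact. */
void persistent_update(char *a, const char *new_val, size_t len, char *log) {
    memcpy(log, new_val, len);   /* update A': write new version to the log */
    flush_range(log, len);       /* cache flush + memory barrier            */
    memcpy(a, new_val, len);     /* update A: now safe to write in place    */
    flush_range(a, len);         /* persist A; log entry may be reclaimed   */
}
```

Note how every update pays two writes (log, then in-place) and how the fence makes later work depend on an earlier write completing, which is exactly the write-dependence described above.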

Assumptions and Design Choices
Write-order control breaks assumption 1: writes are also on the critical path of application execution (application execution is write-dependent, not only read-dependent), so unconditionally prioritizing reads over writes is no longer appropriate.

Assumptions and Design Choices
In persistent memory systems, both conventional assumptions are replaced:
1. Reads are on the critical path of application execution → writes are also on the critical execution path.
2. Applications are usually read-intensive → persistent applications are usually write-intensive.

Assumptions and Design Choices
The mismatch has two consequences. Prioritizing reads over writes, while writes are also on the critical execution path, causes unfairness. Delaying writes until they overflow the write queue, while persistent applications are write-intensive, causes performance degradation: write queue drains become frequent instead of infrequent, so the controller frequently stalls reads to drain the write queue and frequently switches between servicing reads and servicing writes.

Bus Turnaround Overhead
Each switch between reads and writes pays a bus turnaround penalty: tRTW ≈ 7.5 ns and tWTR ≈ 15 ns [Kim+ ISCA'12]. When the write queue overflows, the MC forces a write queue drain by stalling reads, and up to 17% of bus cycles end up wasted on bus turnarounds (tRTW and tWTR) rather than on performing memory accesses. For intuition: if the bus serviced requests for about 110 ns between direction switches, the two turnarounds (7.5 ns + 15 ns = 22.5 ns) would already waste 22.5 / 132.5 ≈ 17% of bus cycles.

Assumptions and Design Choices
Beyond the two broken assumptions above, a third observation compounds the problem: writes in persistent memory have low bank-level parallelism (BLP).

Low Bank-level Parallelism (BLP)
A log is one or several sets of contiguous memory locations. Even though it spans multiple banks of the byte-addressable NVM (BA-NVM), the log is written sequentially, so its write requests pile up on one bank at a time (e.g., Log0 filling Bank 1) while the other banks sit idle. Reads are stalled for a long time while the bus is servicing these writes: no service for reads, idle banks, wasted memory bandwidth.

Assumptions and Design Choices
Given the corrected assumptions, what design choices deliver fairness and high performance?
1. Writes are also on the critical execution path → ?
2. Persistent applications are usually write-intensive → ?
3. Writes of persistent applications have low BLP → ?

Design Principles
1. Persistence-aware memory scheduling: minimize write queue drains and bus turnarounds, while ensuring fairness.
2. Persistent write striding: increase the BLP of writes to persistent memory to fully utilize memory bandwidth.

Persistence-aware Memory Scheduling
Goal: minimize write queue drains and bus turnarounds, while ensuring fairness. Requests are grouped into batches: requests from the same source, to the same NVM row, in the same read/write direction. The problem is deciding when to switch between servicing read batches and write batches. Switching rarely yields low bus turnaround overhead but risks frequent write queue drains.

Persistence-aware Memory Scheduling
Switching frequently, in contrast, is less likely to starve reads and writes, but incurs higher bus turnaround overhead.

Persistence-aware Memory Scheduling
Key idea 1: balance the amount of time spent continuously servicing reads and writes, i.e., keep T(read) / Tmax(read) = T(write) / Tmax(write). In the example, T(read) = 100 ns out of Tmax(read) = 200 ns and T(write) = 200 ns out of Tmax(write) = 400 ns, so both ratios equal 1/2.

Persistence-aware Memory Scheduling
Key idea 2: make the time spent servicing read batches and write batches JUST long enough, i.e., keep the bus turnaround overhead below a user-defined bound µ:
(tRTW + tWTR) / (T(read batches) + T(write batches)) < µ
Among the service times that satisfy this bound, pick the choice with the shortest times.
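Combining the two key ideas, here is a minimal C sketch of the batch-time computation; the closed-form scaling, function name, and the example µ are my illustrative assumptions, not FIRM's actual algorithm:

```c
#include <stdio.h>

#define T_RTW 7.5   /* read-to-write turnaround, ns [Kim+ ISCA'12] */
#define T_WTR 15.0  /* write-to-read turnaround, ns */

/* Key idea 1 (fairness): keep T(read)/Tmax(read) == T(write)/Tmax(write),
 * i.e., scale both maxima by a common factor k in (0, 1].
 * Key idea 2 (performance): keep (tRTW + tWTR) / (T_read + T_write) < mu
 * with the shortest times, i.e., the smallest such k. */
static void pick_service_times(double tmax_read, double tmax_write, double mu,
                               double *t_read, double *t_write) {
    double k = (T_RTW + T_WTR) / (mu * (tmax_read + tmax_write));
    if (k > 1.0) k = 1.0;  /* cannot service longer than the queued batches */
    *t_read  = k * tmax_read;
    *t_write = k * tmax_write;
}

int main(void) {
    double tr, tw;
    pick_service_times(200.0, 400.0, 0.05, &tr, &tw);  /* mu = 5% (assumed) */
    printf("service reads %.1f ns, writes %.1f ns per turn\n", tr, tw);
    return 0;  /* prints 150.0 and 300.0: ratios balanced, overhead at mu */
}
```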

Persistent Write Striding
Goal: increase the BLP of persistent writes to fully utilize memory bandwidth. The problem, again: a log spanning multiple banks of the BA-NVM is written sequentially, so its writes concentrate on one bank while the other banks idle and reads receive no service.

Persistent Write Striding
Key idea: stride writes by an offset to remap them to different banks, so consecutive log writes (Log0, Log1, Log2, ...) land in different banks and are serviced in parallel. Because the remapping is a fixed arithmetic transformation, no bookkeeping is required.
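A minimal C sketch of such a striding remap, assuming a layout where the log occupies a contiguous region in each bank; the constants (chunk size, bank count, per-bank region size) and the function name stride_remap are illustrative assumptions, not FIRM's parameters:

```c
#include <stdint.h>

#define NUM_BANKS 8u
#define CHUNK     4096u        /* bytes written to one bank at a time (assumed) */
#define BANK_SPAN (1u << 20)   /* log bytes owned by each bank (assumed)        */

/* Remap a log-relative byte offset so that consecutive CHUNK-sized writes
 * rotate round-robin across banks instead of filling one bank first.
 * The mapping is pure arithmetic, so reads apply the same function and
 * no translation table (bookkeeping) is needed. */
static inline uint64_t stride_remap(uint64_t log_offset) {
    uint64_t chunk  = log_offset / CHUNK;
    uint64_t within = log_offset % CHUNK;
    uint64_t bank   = chunk % NUM_BANKS;   /* which bank gets this chunk */
    uint64_t slot   = chunk / NUM_BANKS;   /* position inside that bank  */
    return bank * BANK_SPAN + slot * CHUNK + within;
}
```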

Experimental Setup
Simulator: McSimA+ [Ahn+ ISPASS'13] (modified).
Configuration: four-core processor, eight threads; private L1/L2 SRAM caches, 64 KB / 256 KB per core; shared L3 SRAM last-level cache, 2 MB per core; main memory: 8 GB STT-MRAM DIMM.
Benchmarks: 7 non-persistent applications and 3 persistent applications, combined into 18 workloads.
Metrics: weighted speedup (performance) and maximum slowdown (fairness).
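For reference, a small C sketch of the two metrics under their standard definitions (the slide does not define them, so the formulas are an assumption on my part): weighted speedup sums each application's shared-mode IPC over its alone-mode IPC, and maximum slowdown takes the worst alone-over-shared ratio:

```c
/* Weighted speedup: system throughput; higher is better. */
double weighted_speedup(const double *ipc_shared, const double *ipc_alone, int n) {
    double ws = 0.0;
    for (int i = 0; i < n; i++)
        ws += ipc_shared[i] / ipc_alone[i];
    return ws;
}

/* Maximum slowdown: unfairness measure; lower is better. */
double maximum_slowdown(const double *ipc_shared, const double *ipc_alone, int n) {
    double ms = 0.0;
    for (int i = 0; i < n; i++) {
        double s = ipc_alone[i] / ipc_shared[i];
        if (s > ms) ms = s;
    }
    return ms;
}
```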

Performance
FIRM improves system performance (weighted speedup) by 18%.

Fairness
FIRM improves fairness (maximum slowdown) by 23%.

Bus Turnaround Overhead
The fraction of bus cycles wasted on bus turnarounds drops from 6.7% under TCM to 1% under our design (the worst case among previous designs is 17%).

Other Results
Sensitivity studies on NVM latencies, row-buffer sizes, and the number of threads.

Summary
FIRM [MICRO'14] achieves both fairness and high performance for persistent memory, where prior read/write scheduling schemes (FR-FCFS [ISCA'00], STFM [MICRO'07], PAR-BS [ISCA'08], ATLAS [HPCA'10], TCM [MICRO'10], SMS [ISCA'12], BLISS [ICCD'14], ...) fall short. Design principles: persistence-aware memory scheduling and persistent write striding.

Thank you!
