Improving DRAM Performance by Parallelizing Refreshes with Accesses

Presentation transcript:

Improving DRAM Performance by Parallelizing Refreshes with Accesses
Kevin Chang, Donghyuk Lee, Zeshan Chishti, Alaa Alameldeen, Chris Wilkerson, Yoongu Kim, Onur Mutlu

Executive Summary

DRAM refresh interferes with memory accesses:
- It degrades system performance and energy efficiency
- The problem becomes exacerbated as DRAM density increases

Goal: Serve memory accesses in parallel with refreshes to reduce refresh interference on demand requests.

Our mechanisms:
1. Enable more parallelization between refreshes and accesses across different banks with new per-bank refresh scheduling algorithms
2. Enable serving accesses concurrently with refreshes in the same bank by exploiting DRAM subarrays

These mechanisms improve system performance and energy efficiency for a wide variety of workloads and DRAM densities: by 20.2% and 9.0%, respectively, for 8-core systems using 32Gb DRAM, very close to an ideal scheme with no refreshes.

Outline
- Motivation and Key Ideas
- DRAM and Refresh Background
- Our Mechanisms
- Results

Refresh Penalty

Refresh interferes with memory accesses: the memory controller must periodically issue refresh commands, and a pending read cannot be served until the refresh completes. [Figure: a DRAM cell is a capacitor plus an access transistor; the processor's memory controller issues a Refresh to DRAM, delaying a Read's data.]

Refresh delays requests by 100s of ns.

Existing Refresh Modes

- All-bank refresh in commodity DRAM (DDRx): a refresh occupies all banks (Bank 0 through Bank 7) at once.
- Per-bank refresh in mobile DRAM (LPDDRx): refreshes one bank at a time in a fixed round-robin order, allowing accesses to other banks while that bank is refreshing.

Shortcomings of Per-Bank Refresh

Problem 1: Refreshes to different banks are scheduled in a strict round-robin order.
- The static ordering is hardwired into DRAM chips.
- It refreshes busy banks with many queued requests even when other banks are idle.

Key idea: Schedule per-bank refreshes to idle banks opportunistically, in a dynamic order.

Shortcomings of Per-Bank Refresh

Problem 2: Banks that are being refreshed cannot concurrently serve memory requests. [Figure: a read (RD) to Bank 0 is delayed until Bank 0's per-bank refresh completes.]

Shortcomings of Per-Bank Refresh

Problem 2: Refreshing banks cannot concurrently serve memory requests.

Key idea: Exploit subarrays within a bank to parallelize refreshes and accesses across subarrays. [Figure: within Bank 0, Subarray 0 refreshes while Subarray 1 serves a read (RD) in parallel.]

Outline
- Motivation and Key Ideas
- DRAM and Refresh Background
- Our Mechanisms
- Results

DRAM System Organization

A memory system is organized into ranks (e.g., Rank 0 and Rank 1), each containing multiple banks (Bank 0 through Bank 7). Banks can serve multiple requests in parallel.

DRAM Refresh Frequency

The DRAM standard requires memory controllers to send periodic refreshes to DRAM:
- tRefLatency (tRFC): varies based on DRAM chip density (e.g., 350ns); a read or write, by comparison, takes roughly 50ns.
- tRefPeriod (tREFI): remains constant across densities.

Increasing Performance Impact

DRAM is unavailable to serve requests for tRefLatency / tRefPeriod of the time:
- 6.7% for today's 4Gb DRAM
- Unavailability increases with higher density due to higher tRefLatency: 23% / 41% for future 32Gb / 64Gb DRAM
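As a back-of-the-envelope check, this unavailability is just the ratio of the two timing parameters. A minimal sketch, assuming the DDR3 extended-temperature refresh interval of tREFI = 3.9µs; the tRFC values below are illustrative numbers chosen to be consistent with the slide's percentages, not vendor-specified figures:

```python
# Fraction of time DRAM is unavailable = tRefLatency (tRFC) / tRefPeriod (tREFI).
tREFI_ns = 3900  # assumed: DDR3 extended-temperature refresh interval

for density, tRFC_ns in [("4Gb", 260), ("32Gb", 900), ("64Gb", 1600)]:
    unavailable = tRFC_ns / tREFI_ns
    print(f"{density}: {unavailable:.1%} of time spent refreshing")
# 4Gb:  6.7%  -> matches the slide's figure for today's DRAM
# 32Gb: 23.1%, 64Gb: 41.0% -> matches the projections for future densities
```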

All-Bank vs. Per-Bank Refresh

All-bank refresh, employed in commodity DRAM (DDRx, LPDDRx): all banks refresh together, staggered across banks to limit power; a read must wait until the refresh completes.

Per-bank refresh, in mobile DRAM (LPDDRx): one bank refreshes at a time, so reads to other banks can proceed in parallel.
- Shorter tRefLatency than that of all-bank refresh
- More frequent refreshes (shorter tRefPeriod)
- Can serve memory accesses in parallel with refreshes across banks

Shortcomings of Per-Bank Refresh

1) Per-bank refreshes are strictly scheduled in round-robin order (fixed by DRAM's internal logic).
2) A refreshing bank cannot serve memory accesses.

Goal: Enable more parallelization between refreshes and accesses using practical mechanisms.

Outline
- Motivation and Key Ideas
- DRAM and Refresh Background
- Our Mechanisms
  1. Dynamic Access-Refresh Parallelization (DARP)
  2. Subarray Access-Refresh Parallelization (SARP)
- Results

Our First Approach: DARP

Dynamic Access-Refresh Parallelization (DARP): an improved scheduling policy for per-bank refreshes that exploits the refresh scheduling flexibility in DDR DRAM.

- Component 1: Out-of-order per-bank refresh. Avoids poor static scheduling decisions by dynamically issuing per-bank refreshes to idle banks.
- Component 2: Write-refresh parallelization. Avoids refresh interference on latency-critical reads by parallelizing refreshes with a batch of writes.

1) Out-of-Order Per-Bank Refresh

A dynamic scheduling policy that prioritizes refreshes to idle banks: the memory controller, rather than the DRAM chip's hardwired logic, decides which bank to refresh.

1) Out-of-Order Per-Bank Refresh

Reduces the refresh penalty on demand requests by refreshing idle banks first, in a flexible order. [Figure: request queues hold a read for Bank 0 and a read for Bank 1; the round-robin baseline refreshes a bank that has a pending read, delaying it, while DARP reorders the refreshes so each bank serves its read first, saving cycles on both banks.]
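To make the idle-bank-first selection concrete, here is a minimal sketch of the decision a DARP-style controller makes each time a per-bank refresh is due. This is a simplified model under assumed names (pick_bank_to_refresh, request_queues), not the controller implementation from the paper:

```python
# Out-of-order per-bank refresh: the memory controller, not the DRAM's
# hardwired round-robin logic, picks the next bank to refresh.

def pick_bank_to_refresh(owing_refresh, request_queues):
    """Pick a bank for the next per-bank refresh command.

    owing_refresh:  list of bank IDs that still owe a refresh this window.
    request_queues: dict of bank ID -> number of queued demand requests.
    """
    # Prefer an idle bank so the refresh does not delay pending reads.
    idle = [b for b in owing_refresh if request_queues[b] == 0]
    if idle:
        return idle[0]
    # No idle bank: fall back to the bank with the fewest queued requests,
    # minimizing the expected interference.
    return min(owing_refresh, key=lambda b: request_queues[b])

# Example: Bank 0 has 2 queued reads, Bank 1 is idle -> refresh Bank 1 first.
print(pick_bank_to_refresh([0, 1], {0: 2, 1: 0}))  # -> 1
```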

Outline
- Motivation and Key Ideas
- DRAM and Refresh Background
- Our Mechanisms
  1. Dynamic Access-Refresh Parallelization (DARP)
     1) Out-of-Order Per-Bank Refresh
     2) Write-Refresh Parallelization
  2. Subarray Access-Refresh Parallelization (SARP)
- Results

Refresh Interference on Upcoming Requests

Problem: A refresh may collide with a request arriving in the near future. [Figure: Bank 0 starts a refresh while idle, but a read to Bank 0 arrives shortly after and is delayed by the ongoing refresh.]

DRAM Write Draining

Observations:
1) Transitioning the bus from writes to reads (or vice versa) incurs a bus-turnaround latency. To mitigate it, writes are typically buffered and drained to DRAM in a batch.
2) Writes are not latency-critical.

2) Write-Refresh Parallelization

Proactively schedules refreshes when banks are serving write batches, avoiding stalls on latency-critical reads by overlapping refreshes with non-latency-critical writes. [Figure: in the baseline, a refresh delays a read; with write-refresh parallelization, the refresh is (1) postponed past the read and (2) issued during the write drain, saving the cycles the read would have stalled.]
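A minimal sketch of this decision, assuming the DDR3 flexibility to postpone a bounded number of refresh commands (up to 8); the function and parameter names are illustrative, not the paper's implementation:

```python
MAX_POSTPONED = 8  # DDR3 allows up to 8 refresh commands to be postponed

def refresh_decision(in_write_drain: bool, postponed: int) -> str:
    """Return the action for a bank's pending refresh."""
    if in_write_drain:
        # Writes are not latency-critical: overlapping the refresh with a
        # draining write batch hides its latency from reads.
        return "issue_refresh"
    if postponed < MAX_POSTPONED:
        # Postpone, hoping a write drain starts before the window closes.
        return "postpone"
    # Out of slack: must refresh now, even if reads are waiting.
    return "issue_refresh"
```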

Outline
- Motivation and Key Ideas
- DRAM and Refresh Background
- Our Mechanisms
  1. Dynamic Access-Refresh Parallelization (DARP)
  2. Subarray Access-Refresh Parallelization (SARP)
- Results

Our Second Approach: SARP

Observations:
1. A bank is further divided into subarrays, each with its own row buffer to perform refresh operations.
2. Some subarrays and the bank I/O remain completely idle during a refresh. [Figure: inside a bank, one subarray refreshes through its row buffer while the other subarrays and the bank I/O sit idle.]

Our Second Approach: SARP

Subarray Access-Refresh Parallelization (SARP): parallelizes refreshes and accesses within a bank. [Figure: in Bank 1, Subarray 0 refreshes while Subarray 1 serves a read through the bank I/O.]

Requires only very modest DRAM modifications: 0.71% die area overhead.
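The scheduling check SARP enables can be sketched as below, assuming the controller tracks which subarray of a bank is refreshing; the class and field names are illustrative, not the DRAM interface from the paper:

```python
class BankState:
    def __init__(self):
        self.refreshing_subarray = None  # None means no refresh in flight

def can_serve_access(bank: BankState, target_subarray: int) -> bool:
    # Each subarray has its own row buffer and the bank I/O is otherwise
    # idle during refresh, so an access may proceed in parallel as long as
    # it targets a different subarray than the one being refreshed.
    return bank.refreshing_subarray != target_subarray

# Example: while subarray 0 of Bank 1 refreshes, a read to subarray 1 proceeds.
bank1 = BankState()
bank1.refreshing_subarray = 0
assert can_serve_access(bank1, 1)
assert not can_serve_access(bank1, 0)
```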

Outline
- Motivation and Key Ideas
- DRAM and Refresh Background
- Our Mechanisms
- Results

Methodology

Simulator configuration:
- 8-core processor; L1$: 32KB, L2$: 512KB/core
- Memory controller connected to a DDR3 rank of 8 banks (Bank 0 through Bank 7)
- 100 workloads: SPEC CPU2006, STREAM, TPC-C/H, random access
- System performance metric: weighted speedup
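For reference, weighted speedup is conventionally computed as below; this is the standard multi-programmed-workload definition, sketched here rather than taken from the paper's simulator:

```python
def weighted_speedup(ipc_shared, ipc_alone):
    # ipc_shared[i]: IPC of application i when all applications run together.
    # ipc_alone[i]:  IPC of application i when it runs alone on the system.
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))

# Example with two applications:
print(weighted_speedup([0.8, 1.5], [1.0, 2.0]))  # -> 1.55
```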

Comparison Points

- All-bank refresh [DDR3, LPDDR3, ...]
- Per-bank refresh [LPDDR3]
- Elastic refresh [Stuecheli et al., MICRO '10]: postpones refreshes by a time delay based on the predicted rank idle time to avoid interference with memory requests. Proposed to schedule all-bank refreshes without exploiting per-bank refreshes, so it cannot parallelize refreshes and accesses within a rank.
- Ideal (no refresh)

System Performance

1. Both DARP and SARP provide performance gains, and combining them (DSARP) improves performance even more: 7.9%, 12.3%, and 20.2% as DRAM density increases to 32Gb.
2. The system performance improvement is consistent across DRAM densities: within 0.9%, 1.2%, and 3.8% of the ideal no-refresh scheme.

Energy Efficiency

Consistent reduction in energy consumption across DRAM densities: 3.0%, 5.2%, and 9.0%.

Other Results and Discussion in the Paper

- Detailed multi-core results and analysis
- Result breakdown based on memory intensity
- Sensitivity results on number of cores, subarray counts, refresh interval length, and DRAM parameters
- Comparisons to DDR4 fine-granularity refresh

Executive Summary

DRAM refresh interferes with memory accesses:
- It degrades system performance and energy efficiency
- The problem becomes exacerbated as DRAM density increases

Goal: Serve memory accesses in parallel with refreshes to reduce refresh interference on demand requests.

Our mechanisms:
1. Enable more parallelization between refreshes and accesses across different banks with new per-bank refresh scheduling algorithms
2. Enable serving accesses concurrently with refreshes in the same bank by exploiting DRAM subarrays

These mechanisms improve system performance and energy efficiency for a wide variety of workloads and DRAM densities: by 20.2% and 9.0%, respectively, for 8-core systems using 32Gb DRAM, very close to an ideal scheme with no refreshes.

Improving DRAM Performance by Parallelizing Refreshes with Accesses
Kevin Chang, Donghyuk Lee, Zeshan Chishti, Alaa Alameldeen, Chris Wilkerson, Yoongu Kim, Onur Mutlu