Reducing DRAM Latency via Charge-Level-Aware Look-Ahead Partial Restoration

Presentation transcript:

Reducing DRAM Latency via Charge-Level-Aware Look-Ahead Partial Restoration

Yaohua Wang1,2  Arash Tavakkol1  Lois Orosa1,4  Saugata Ghose3  Nika Mansouri Ghiasi1  Minesh Patel1  Jeremie S. Kim1  Hasan Hassan1  Mohammad Sadrosadati1  Onur Mutlu1,3
1 ETH Zürich  2 National University of Defense Technology  3 Carnegie Mellon University  4 University of Campinas

1: Summary

DRAM latency is a major bottleneck. It consists mainly of two components: activation and restoration. Partially restoring soon-to-be-refreshed cells can reduce restoration latency.

Motivation:
- The potential of partial restoration improves greatly when it is also applied to soon-to-be-reactivated cells
- We can trade off restoration and activation latency reductions to maximize the overall benefit

Charge-Level-Aware Look-Ahead Partial Restoration (CAL):
- Accurately predicts the next access-to-access interval of a row
- Carefully applies partial restoration according to the prediction and the next scheduled refresh
- Greatly improves overall performance and energy efficiency at low cost

2: Background and Key Ideas

(a) To compensate for charge depletion and avoid data loss, DRAM fully restores the charge level of a cell during an access.
(b) Restore Truncation [Zhang+, HPCA'16] partially restores the charge of a soon-to-be-refreshed cell, to a level just high enough that the upcoming refresh operation can still correctly read the data.
(c) CAL enables partial restoration for both soon-to-be-refreshed and soon-to-be-reactivated cells, while still effectively exploiting the benefits of activation latency reduction.

[Figure: DRAM subarray structure]

Two critical parts of DRAM access latency:
- Activation: sensing and amplifying the charge of a cell (tRCD)
- Restoration: restoring the charge of a cell after an access (tRAS)

3: Motivation

- A charge-level-aware partial restoration can trade a smaller tRCD reduction for a larger tRAS reduction.
- Applying partial restoration to soon-to-be-reactivated DRAM cells provides a larger benefit.
- How do we know the next access-to-access interval? If the last access-to-access interval of a row is small (e.g., <16ms), the next one is highly likely (98%) to also be small.

4: Charge-Level-Aware Look-Ahead Partial Restoration

Key mechanism:
1. Use the last access-to-access interval of a row to predict whether the row will be reactivated again soon.
2. Decide by how much the restoration latency should be reduced, based on the prediction and the activation/restoration trade-off.

Hardware structure: a timer table. Each entry holds a Tag (row number), a 4-bit Timer, a PR bit (partially restored), and a V bit (valid).

High-level operations (a minimal code sketch of these operations follows this section):
- Insertion: new entries are inserted upon a PRE command, potentially evicting other entries (an eviction issues ACT and PRE commands)
- Initialization: the timer value of an entry is initialized upon the PRE command
- Update: the timer table updates itself every 1ms to record access intervals
- Lookup: each ACT/WRITE looks up the timer table to reduce the timing parameters accordingly

Reduced timing parameters. There are three cases for reducing tRCD/tRAS/tWR, according to the timer value T:
- T == 15 (<1ms since the last restoration): 11.2/16.1/6.8 ns
- 0 < T < 15 (1-15ms since the last restoration): 13.75/19.4/8.4 ns
- T == 0 (>15ms passed, or a lookup miss): default parameters
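To make the timer-table operations above concrete, here is a minimal C++ sketch of Section 4's mechanism. The class and field names, the set-indexing scheme, and the eviction policy are illustrative assumptions of mine; the poster does not specify the authors' actual implementation.

```cpp
// Minimal sketch of CAL's timer table (Section 4). Names, indexing, and the
// eviction policy are assumptions for illustration, not the actual design.
#include <array>
#include <cstdint>
#include <vector>

struct Timing { double tRCD, tRAS, tWR; };             // nanoseconds

// Timing values from Section 4; defaults are the DDR4 values in Section 5.
constexpr Timing kHighlyCharged {11.20, 16.1,  6.8};   // T == 15: <1 ms since restoration
constexpr Timing kPartialCharge {13.75, 19.4,  8.4};   // 0 < T < 15: 1-15 ms
constexpr Timing kDefault       {13.75, 35.0, 15.0};   // T == 0, or lookup miss

struct Entry {
  uint32_t tag   = 0;      // row number
  uint8_t  timer = 0;      // 4-bit down-counter: 15 means "restored <1 ms ago"
  bool     pr    = false;  // row was left partially restored
  bool     valid = false;
};

class TimerTable {
public:
  explicit TimerTable(size_t num_sets) : sets_(num_sets) {}

  // Lookup [ACT][WRITE]: pick timing parameters from the timer value T.
  Timing lookup(uint32_t row) const {
    for (const Entry& e : sets_[row % sets_.size()])
      if (e.valid && e.tag == row) {
        if (e.timer == 15) return kHighlyCharged;
        if (e.timer > 0)   return kPartialCharge;
      }
    return kDefault;
  }

  // Insertion + initialization [PRE]: restart the row's timer at 15. If the
  // row is still tracked with T > 0, its last access-to-access interval was
  // <16 ms, so CAL predicts a short next interval and restores it partially.
  void on_precharge(uint32_t row) {
    auto& set = sets_[row % sets_.size()];
    Entry* slot = &set[0];                       // crude eviction: way 0
    for (Entry& e : set) {
      if (e.valid && e.tag == row) { slot = &e; break; }
      if (!e.valid) slot = &e;
    }
    // NOTE: a real design must fully restore an evicted partially-restored
    // row (by issuing ACT and PRE), which this sketch omits.
    bool predict_soon = slot->valid && slot->tag == row && slot->timer > 0;
    *slot = Entry{row, /*timer=*/15, /*pr=*/predict_soon, /*valid=*/true};
  }

  // Self-update (every 1 ms): age all timers, so T encodes roughly how many
  // milliseconds have passed since the row's last restoration (15 - T).
  void tick_1ms() {
    for (auto& set : sets_)
      for (Entry& e : set)
        if (e.valid && e.timer > 0) --e.timer;
  }

private:
  std::vector<std::array<Entry, 8>> sets_;       // 8-way set-associative (Section 5)
};
```

On every ACT or WRITE, the memory controller would call lookup() and program the returned (possibly reduced) timing parameters; tick_1ms() models the poster's 1ms self-update.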
5: Evaluation

Methodology:
- Simulator: Ramulator [Kim+, CAL'15], https://github.com/CMU-SAFARI/ramulator
- Workloads: 20 single-core workloads (SPEC CPU2006, TPC, BioBench), and 20 multi-programmed 8-core workloads built by randomly choosing from the single-core workloads
- Mechanism parameters: an 8-way, cache-like, set-associative timer table; DDR4 timing with default tRCD/tRAS/tWR of 13.75/35/15 ns

Mechanisms evaluated:
- CAL
- ChargeCache (CC) [Hassan+, HPCA'16]: reduces tRCD and tRAS for highly-charged rows
- Restore Truncation (RT) [Zhang+, HPCA'16]: reduces tRAS and tWR for soon-to-be-refreshed rows
- Combinations of activation and restoration latency reductions:
  - CCRT: a simple combination of CC and RT
  - GreedyPR: similar to CAL, but unaware of the future charge level
  - IdealCAL: an idealized version of CAL

Performance improvement: CAL always outperforms the other mechanisms, by an average of 7.4% (14.7%) for single-core (8-core) workloads.
[Figure: performance improvement for single-core and 8-core workloads]

System energy breakdown: on average, CAL reduces system energy by 10.1% and 18.9% for memory-intensive single-core and 8-core workloads, respectively.
[Figure: system energy breakdown for memory-non-intensive and memory-intensive workloads]

6: Hardware Overhead & Sensitivity Analysis

Hardware overhead:
- Area: 0.034 mm2, 0.11% of a 16MB LLC
- Power: 0.202 mW, 0.08% of a 16MB LLC

Sensitivity analysis:
- CAL's performance is robust across various configurations: timer table size, page management policy, and address mapping policy.
- The partial restoration level plays an important role: trading a tRCD reduction for a tRAS reduction can provide opportunities for performance gain (the arithmetic below makes this trade-off concrete).
- CAL remains effective at high temperatures, where the refresh rate is increased.
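As a worked example of that trade-off, the timing values in Section 4, measured against the DDR4 defaults from Section 5 (tRCD/tRAS/tWR = 13.75/35/15 ns), imply the following reductions (my arithmetic, not numbers stated on the poster):

```latex
% T == 15 (highly charged rows): all three parameters shrink.
\frac{13.75 - 11.2}{13.75} \approx 18.5\% \; (t_{RCD}), \qquad
\frac{35 - 16.1}{35} = 54\% \; (t_{RAS}), \qquad
\frac{15 - 6.8}{15} \approx 54.7\% \; (t_{WR})

% 0 < T < 15 (partially restored rows): t_{RCD} stays at its default.
\frac{35 - 19.4}{35} \approx 44.6\% \; (t_{RAS}), \qquad
\frac{15 - 8.4}{15} = 44\% \; (t_{WR})
```

This is the sense in which CAL trades a smaller tRCD reduction for a larger tRAS reduction: rows predicted to be reactivated soon keep the default tRCD but cut their restoration latency nearly in half.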