Manjunath Shevgoor, Rajeev Balasubramonian, University of Utah


Addressing Service Interruptions in Memory with Thread-to-Rank Assignment
Manjunath Shevgoor, Rajeev Balasubramonian, University of Utah
Niladrish Chatterjee, NVIDIA
Jung-Sik Kim, Samsung Electronics
ISPASS 2016, 4/18/2016

DRAM Refresh: Quick Recap
- A DRAM cell leaks charge through its access transistor
- Leakage increases with temperature
- Every DRAM cell must be refreshed within 64 ms
- 1/8K of the DRAM rank is refreshed every 7.8 µs
[Figure: a DRAM cell connected to a word line and bit line; the cell leaks more at higher temperature]
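The refresh cadence follows directly from the retention requirement: covering the rank in 8K refresh bins within 64 ms means one refresh command every 64 ms / 8192 ≈ 7.8 µs. A minimal sketch of the arithmetic:

```python
# Refresh interval (tREFI) derived from the retention time and the
# number of refresh commands needed to cover the whole rank.
RETENTION_MS = 64          # DDR retention requirement at normal temperature
REFRESH_COMMANDS = 8192    # rank is covered in 8K refresh bins

tREFI_us = RETENTION_MS * 1000 / REFRESH_COMMANDS
print(f"tREFI = {tREFI_us:.1f} us")            # tREFI = 7.8 us

# At extended temperature (> 85 C), DDR4 doubles the refresh rate.
print(f"tREFI (extended temp) = {tREFI_us / 2:.1f} us")  # 3.9 us
```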

Refresh Timing Parameters
- tREFI (refresh interval): 7.8 µs, or 3.9 µs at extended temperature
- tRFC (refresh cycle time): the rank is unavailable for tRFC after each refresh command; projected to reach 640 ns at 32 Gb. tRFC covers the refresh itself plus recovery
[Timing diagram: a refresh command is issued every tREFI; the rank is busy for tRFC after each one]
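Together, tREFI and tRFC determine what fraction of time a rank is busy refreshing, and that fraction grows with density. A back-of-the-envelope sketch, using DDR4 datasheet tRFC values for 8 Gb and 16 Gb and the 640 ns projection above for 32 Gb:

```python
# Fraction of time a rank is unavailable due to refresh: tRFC / tREFI.
tREFI_ns = 7800  # 7.8 us at normal temperature

for density, tRFC_ns in [("8 Gb", 350), ("16 Gb", 550), ("32 Gb", 640)]:
    overhead = tRFC_ns / tREFI_ns
    print(f"{density}: rank busy refreshing {overhead:.1%} of the time")
```

At 32 Gb the rank spends roughly 8% of all time refreshing, which is why refresh-induced stalls become a first-order performance concern.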

tRFC Projections

Refresh Power in DRAM

Command   Current (mA)
ACT       67
Read      125
Write     (value not shown)
Refresh   245

Refresh determines memory peak power.
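The table makes the peak-power argument concrete: refresh draws roughly twice the current of a read, so if all ranks refresh at once the memory's peak draw is dominated by refresh. A sketch of the comparison that motivates staggering (rank count taken from the system on the next slide):

```python
# Peak refresh current: all ranks refreshing at once vs. staggered.
I_REFRESH_MA = 245   # per-rank refresh current from the table above
RANKS = 4            # 2 channels x 2 ranks, as in the evaluated system

simultaneous_peak = RANKS * I_REFRESH_MA
staggered_peak = I_REFRESH_MA    # only one rank refreshes at a time
print(simultaneous_peak, staggered_peak)  # 980 245
```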

Stagger Refresh to Reduce Peak Power
[Figure: an 8-core CMP with two memory controllers; each channel has two ranks, and rank refreshes are staggered so that only one rank refreshes at a time]

Effect of Staggered Refresh

Talk Outline
- DRAM refresh background
- Goal: the low peak power of staggered refresh with the performance of simultaneous refresh
- Analyzing stalls from refresh
- Solution: thread-to-rank assignment
- Results

Each Staggered Refresh Stalls Many Cores
[Figure: 8-core CMP, two channels, four ranks; because each thread's pages are spread across ranks, pending requests pair threads with multiple ranks (e.g., T1→R2, T2→R3, T1→R1, T2→R2, T2→R1, T3→R1)]
Rank 1 refreshing => 3 threads stalled
Rank 3 refreshing => 3 threads stalled
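The stall counts follow from a simple model: a refresh of rank R stalls every thread with an outstanding request to R. A sketch using the (thread, rank) request pairs from the slide's example table, reproducing the Rank 1 count:

```python
# Each pending request is a (thread, rank) pair. A refresh of rank r
# stalls every distinct thread that has a request queued for r.
pending = [("T1", "R2"), ("T2", "R3"), ("T1", "R1"),
           ("T2", "R2"), ("T2", "R1"), ("T3", "R1")]

def stalled_threads(rank):
    return {t for t, r in pending if r == rank}

print(len(stalled_threads("R1")))  # 3 threads stalled by a Rank 1 refresh
```

With four ranks refreshing in a staggered schedule, nearly every thread ends up stalled by some refresh in every tREFI window unless its pages are confined to few ranks.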

Limit the Spread: Address Mapping

% of Refreshes Affecting a Thread
[Chart: fraction of refreshes that affect a thread; the highest-performance-loss case is annotated]

Highest Performance Loss: 37% increase in execution time

Rank-Assigned Page Mapping
[Figure: threads 1 to 8 on an 8-core CMP, each mapped strictly to one of the four ranks across two channels]
Strict mapping of threads to ranks, e.g., as used for cache partitioning by Lin et al., HPCA 2008.

Limit the Spread: Page Mapping
[Figure: the same 8-core CMP; threads are preferentially, rather than strictly, mapped to ranks]
Relaxed mapping of threads to ranks.
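The relaxed policy can be sketched as a best-effort page allocator: prefer the thread's home rank, but spill to another rank instead of failing when it is full. Names, frame numbering, and the round-robin assignment are hypothetical illustrations, not the paper's implementation:

```python
# Best-effort (relaxed) thread-to-rank page mapping.
RANKS = 4
# Hypothetical layout: rank r owns physical frames [r*1000, (r+1)*1000).
free_frames = {r: list(range(r * 1000, (r + 1) * 1000)) for r in range(RANKS)}

def alloc_page(thread_id):
    home = thread_id % RANKS                  # assumed round-robin assignment
    candidates = [home] + [r for r in range(RANKS) if r != home]
    for rank in candidates:                   # try home rank first, then spill
        if free_frames[rank]:
            return free_frames[rank].pop()
    raise MemoryError("out of memory")

print(alloc_page(thread_id=5) // 1000)  # 1: the page lands in home rank 1
```

A strict policy would raise an error instead of spilling; the best-effort variant is what makes this a pure software solution, as the conclusions note.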

Modified Clock Algorithm
[Figure: the baseline CLOCK algorithm keeps one circular list of all pages in memory with a single hand; the modified algorithm keeps one list of pages per rank, each with its own hand]
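The per-rank variant can be sketched as a CLOCK replacement policy with one circular list and hand per rank, so a victim page can be found within a specific rank. This is a minimal sketch under that interpretation; the OS-level details are not given in the slides:

```python
# CLOCK with one circular list and one hand per rank: eviction can be
# directed at a particular rank's pages.
class RankClock:
    def __init__(self, pages):
        self.pages = pages      # list of (page_id, referenced_bit)
        self.hand = 0

    def evict(self):
        while True:
            page_id, ref = self.pages[self.hand]
            if ref:             # second chance: clear the bit and advance
                self.pages[self.hand] = (page_id, False)
                self.hand = (self.hand + 1) % len(self.pages)
            else:
                return page_id  # victim found within this rank

# Hypothetical example: 4 ranks, 4 pages each, alternating referenced bits.
clocks = {r: RankClock([(f"p{r}_{i}", i % 2 == 0) for i in range(4)])
          for r in range(4)}
print(clocks[1].evict())  # p1_1: first unreferenced page in rank 1's list
```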

Methodology
Simics + USIMM
- 8 RISC cores, UltraSPARC III ISA
- 3.2 GHz, 4-wide OoO, 64-entry ROB
- 32 KB I&D L1 caches, 4 cycles
- 4/8 MB shared L2 cache, 10 cycles
DRAM specifications:
- 2 channels, 2 ranks per channel, 16 banks per rank
- 800 MHz DDR4 DRAM
Workloads: SPEC 2006, NPB, CloudSuite, PARSEC

Thread-to-Rank Assignment: 18% better than staggered refresh

Relaxing Rank Assignment

Comparisons to Prior Work

Conclusions
- Exposes an important artifact in memory stalls
- Service interruptions require a re-evaluation of data placement
- RA (rank assignment) is a simple solution to an emerging problem
- RA can also be leveraged to reduce the impact of NVM write drain
- RA is a software solution that requires only best-effort page mapping
- Outperforms hardware-only schemes

Thank You