Leveraging Heterogeneity in DRAM Main Memories to Accelerate Critical Word Access Niladrish Chatterjee Manjunath Shevgoor Rajeev Balasubramonian Al Davis.

Presentation transcript:

Leveraging Heterogeneity in DRAM Main Memories to Accelerate Critical Word Access
Niladrish Chatterjee, Manjunath Shevgoor, Rajeev Balasubramonian, Al Davis, Zhen Fang‡†, Ramesh Illikkal*, Ravi Iyer*
University of Utah, NVIDIA‡ and Intel Labs*
† Work done while at Intel

Memory Bottleneck
DRAM power can be as high as 25% of total datacenter power.
Low-power DRAM in place of DDR3:
– BOOM from HP Labs
– Energy Proportional Memory from Stanford
[Figure: baseline (CPU with DDR3) versus low-power memory (CPU with LPDDR)]

Latency Wall
The memory latency wall is not going away:
– Emerging scale-out workloads, e.g. CloudSuite
– Move towards energy-efficient in-order cores
Reduced Latency DRAM (RLDRAM) offers very low latency:
– Row-cycle time (tRC) of 8-12 ns (DDR3 tRC = 48.75 ns, LPDDR2 tRC = 60 ns)
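To put the tRC figures above side by side, here is a minimal Python sketch that computes how many times longer the DDR3 and LPDDR2 row-cycle times are than the RLDRAM range. The names `T_RC_NS`, `RLDRAM_T_RC_NS`, and `trc_ratio_range` are illustrative, not from the slides, and tRC is only one component of access latency, so the ratios are not end-to-end speedups.

```python
# Row-cycle times in nanoseconds, as quoted on the slide.
T_RC_NS = {"DDR3": 48.75, "LPDDR2": 60.0}
RLDRAM_T_RC_NS = (8.0, 12.0)  # quoted range for RLDRAM


def trc_ratio_range(other_ns, rldram_range=RLDRAM_T_RC_NS):
    """Return (min_ratio, max_ratio) of another device's tRC to RLDRAM's tRC."""
    fastest, slowest = rldram_range
    # Dividing by the slowest RLDRAM tRC gives the smallest ratio, and vice versa.
    return other_ns / slowest, other_ns / fastest


if __name__ == "__main__":
    for name, t_rc in T_RC_NS.items():
        low, high = trc_ratio_range(t_rc)
        print(f"{name} tRC is {low:.2f}x to {high:.2f}x the RLDRAM tRC")
```

With the quoted numbers, the LPDDR2 row-cycle time of 60 ns is 5 to 7.5 times the RLDRAM range of 8-12 ns.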

No one memory works best

Heterogeneous Memory
Combine high-performance and low-power DRAM to outperform DDR3 at a lower energy cost.
Large number of possible designs:
– Different DRAM device combinations
– Channel organization
– Data placement granularity
[Figure: CPU connected to performance-optimized DRAM and power-optimized DRAM]

Critical Word Regularity
Most DRAM requests are for word 0 of the cache line.
[Figure: Frequency of accesses to individual words of a cache line]
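A distribution like the one described on this slide could be gathered from a trace of miss addresses roughly as follows. This is a sketch under assumptions: the slides do not state the line or word size, so a 64-byte cache line and 8-byte words are assumed here, and `critical_word_index` and `critical_word_histogram` are hypothetical helper names.

```python
from collections import Counter

LINE_BYTES = 64  # assumed cache-line size (not stated on the slide)
WORD_BYTES = 8   # assumed word size (not stated on the slide)
WORDS_PER_LINE = LINE_BYTES // WORD_BYTES


def critical_word_index(addr):
    """Index of the word within its cache line that the miss address falls in."""
    return (addr % LINE_BYTES) // WORD_BYTES


def critical_word_histogram(miss_addrs):
    """Count how often each word position is the critical (requested) word."""
    counts = Counter(critical_word_index(a) for a in miss_addrs)
    return [counts.get(i, 0) for i in range(WORDS_PER_LINE)]
```

For a made-up trace such as `[0x1000, 0x2000, 0x2008, 0x3000, 0x4038]`, the histogram is `[3, 1, 0, 0, 0, 0, 0, 1]`: three requests for word 0, one for word 1, and one for word 7.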

Critical Word Acceleration
Critical word fetched from RLDRAM to boost performance.
Rest of the cache line placed in and retrieved from LPDRAM for energy efficiency.
[Figure: CPU with RLDRAM holding word 0 and LPDDR holding the remaining words]
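The placement described on this slide can be sketched as a simple split of each cache line. This is an illustration only, not the authors' implementation: it assumes a 64-byte line and 8-byte words (neither is stated on the slide), models the two memories as dictionary entries, and uses the hypothetical names `place_line`, `fetch_line`, and `device_for_word`.

```python
LINE_BYTES = 64  # assumed cache-line size
WORD_BYTES = 8   # assumed word size


def place_line(line):
    """Split a cache line: word 0 goes to RLDRAM, the rest goes to LPDRAM."""
    if len(line) != LINE_BYTES:
        raise ValueError(f"expected {LINE_BYTES} bytes, got {len(line)}")
    return {"RLDRAM": line[:WORD_BYTES], "LPDRAM": line[WORD_BYTES:]}


def fetch_line(placed):
    """Reassemble the full cache line from the two devices."""
    return placed["RLDRAM"] + placed["LPDRAM"]


def device_for_word(word_index):
    """Name the device that serves a given word of the line."""
    if not 0 <= word_index < LINE_BYTES // WORD_BYTES:
        raise ValueError(f"word index out of range: {word_index}")
    return "RLDRAM" if word_index == 0 else "LPDRAM"
```

Under this static split, a request whose critical word is word 0 is served by the low-latency device, while a request for any other word waits on the low-power device, which is why the scheme relies on the word-0 regularity noted earlier.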

Results
Throughput improved by 12.9%
System energy improved by 6%