1 Reducing DRAM Latencies with an Integrated Memory Hierarchy Design
Authors: Wei-fen Lin and Steven K. Reinhardt, University of Michigan; Doug Burger, University of Texas at Austin
Presentation by Pravin Dalale

2 OUTLINE
- Motivation
- Main idea in the paper
  - Analysis
  - Main idea
- Prefetch engine
  - Insertion policy
  - Prefetch scheduling
- Results
- Conclusion

3 Motivation
Memory density and capacity have grown along with CPU power and complexity, but memory speed has not kept pace.

4 Solutions
- Multithreading
- Multiple levels of caches
- Prefetching

5 OUTLINE
- Motivation
- Main idea in the paper
  - Analysis
  - Main idea
- Prefetch engine
  - Insertion policy
  - Prefetch scheduling
- Results
- Conclusion

6 Analysis (1)
- IPC_Real: instructions per cycle with the real memory system
- IPC_PerfectL2: instructions per cycle with the real L1 cache but a perfect L2 cache
- IPC_PerfectMem: instructions per cycle with perfect L1 and L2 caches

7 Analysis (2)
Fraction of performance lost due to imperfect L1 and L2 = (IPC_PerfectMem - IPC_Real) / IPC_PerfectMem
Fraction of performance lost due to imperfect L2 = (IPC_PerfectL2 - IPC_Real) / IPC_PerfectL2
A worked example follows.
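As a quick worked example of these two fractions (the IPC values below are invented for illustration and are not measurements from the paper):

```python
# Hypothetical IPC values chosen for illustration; they are NOT
# measurements from the paper.
ipc_real = 0.8         # real L1, L2, and DRAM
ipc_perfect_l2 = 1.5   # real L1, perfect L2
ipc_perfect_mem = 2.0  # perfect L1 and L2

# Fraction of performance lost to the imperfect cache hierarchy overall.
lost_l1_l2 = (ipc_perfect_mem - ipc_real) / ipc_perfect_mem  # 0.60

# Fraction of performance lost to L2 misses alone.
lost_l2 = (ipc_perfect_l2 - ipc_real) / ipc_perfect_l2  # ~0.467

print(f"lost to L1+L2: {lost_l1_l2:.0%}, lost to L2: {lost_l2:.0%}")
```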

8 Analysis (3)
Simulated system:
- 1.6 GHz out-of-order core
- 64 KB L1 cache
- 1 MB L2 cache
- Direct Rambus memory system with four 1.6 GB/s channels
The 26 SPEC CPU2000 benchmarks were run on this system to obtain IPC_Real, IPC_PerfectL2, and IPC_PerfectMem.

9 Analysis (4)
- The L2 stall fraction is 80% for the mcf benchmark.
- The average stall fraction caused by L2 misses is 57% across the 26 SPEC CPU2000 benchmarks.

10 Main idea
The paper describes a technique to reduce L2 miss latencies. It introduces a prefetch engine that prefetches data into the L2 cache upon an L2 demand miss.

11 OUTLINE
- Motivation
- Main idea in the paper
  - Analysis
  - Main idea
- Prefetch engine
  - Insertion policy
  - Prefetch scheduling
- Results
- Conclusion

12 Prefetch Engine
[Figure: block diagram of the prefetch engine; its three numbered components are described on the next slide.]

13 Prefetch Engine
1. The prefetch queue maintains the list of n region entries not in the L2 cache.
2. The prefetch prioritizer uses the bank state and the region age to determine which prefetch to issue next.
3. The access prioritizer selects a prefetch only when no demand misses are pending.
A code sketch of these components follows.
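A minimal Python sketch of the first two components (the class names, fields, and the open-row/oldest-first tie-break are assumptions made for illustration, not the paper's hardware design; the access prioritizer is sketched under Prefetch Scheduling below):

```python
from collections import deque

class RegionEntry:
    """One candidate region to prefetch into L2 (fields are illustrative)."""
    def __init__(self, region_addr, bank, age):
        self.region_addr = region_addr
        self.bank = bank  # DRAM bank this region maps to
        self.age = age    # how long the entry has waited in the queue

class PrefetchEngine:
    def __init__(self, n):
        # (1) Prefetch queue: up to n region entries not present in L2.
        self.queue = deque(maxlen=n)

    def enqueue(self, entry, in_l2):
        # Regions already resident in L2 are filtered out.
        if not in_l2(entry.region_addr):
            self.queue.append(entry)

    def pick_prefetch(self, bank_is_open):
        # (2) Prefetch prioritizer: favor regions whose DRAM bank has an
        # open row, breaking ties by region age (oldest first).
        if not self.queue:
            return None
        best = max(self.queue, key=lambda e: (bank_is_open(e.bank), e.age))
        self.queue.remove(best)
        return best
```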

14 Insertion policy (1)
A prefetched block may be loaded into L2 with one of four priorities:
1. most-recently-used (MRU)
2. second-most-recently-used (SMRU)
3. second-least-recently-used (SLRU)
4. least-recently-used (LRU)

15 Insertion policy (2)
The benchmarks were divided into two classes:
1. high prefetch accuracy (above 20%)
2. low prefetch accuracy (below 20%)
All benchmarks were tested with each of the four insertion policies. LRU insertion gives the best results in both categories (sketched below).
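A minimal sketch of what LRU-priority insertion means for one cache set, modeling recency as an ordered list (this illustrates the policy only; the names and structure are assumptions, not the paper's implementation):

```python
class CacheSet:
    """One set of the L2, with recency modeled as an ordered list."""
    def __init__(self, assoc):
        self.assoc = assoc
        self.blocks = []  # index 0 = MRU position, last index = LRU

    def insert_demand(self, tag):
        # Demand-fetched blocks enter at the MRU position, as usual.
        self._make_room()
        self.blocks.insert(0, tag)

    def insert_prefetch(self, tag):
        # Prefetched blocks enter at the LRU position: a useless prefetch
        # is the next victim and displaces at most one useful block.
        self._make_room()
        self.blocks.append(tag)

    def _make_room(self):
        if len(self.blocks) >= self.assoc:
            self.blocks.pop()  # evict the current LRU block
```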

16 Prefetch Scheduling
Aggressive prefetching can consume a large amount of bandwidth and cause channel contention. This contention can be avoided by scheduling prefetch accesses only when the Rambus channels are idle, as the sketch below illustrates.
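Under the same illustrative assumptions, the scheduling rule (component 3, the access prioritizer) reduces to a per-cycle decision like the following (the predicate names are hypothetical):

```python
def schedule_next_request(demand_pending, prefetch_pending, channel_idle):
    """Pick what the memory controller issues this cycle.

    Demand misses always take priority; a prefetch is issued only in a
    cycle where the Rambus channel would otherwise sit idle, so
    prefetches never contend with demand traffic for the channel.
    """
    if demand_pending:
        return "demand"
    if prefetch_pending and channel_idle:
        return "prefetch"
    return None  # leave the channel idle this cycle
```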

17 OUTLINE
- Motivation
- Main idea in the paper
  - Analysis
  - Main idea
- Prefetch engine
  - Insertion policy
  - Prefetch scheduling
- Results
- Conclusion

18 Results (1/3) - Overall performance improvement
The performance with prefetching is very close to that with a perfect L2.

19 Results (2/3) - Sensitivity of the prefetch scheme to DRAM latencies
The base DRDRAM had a 40 ns latency and an 800 MHz data transfer rate. If the latency is increased to 50 ns, the mean performance of the prefetch scheme relative to the base system drops by less than 1%; if the latency is reduced to 34 ns, it again changes by less than 2%.

20 Results (3/3) - Interaction with software prefetching
When the prefetch scheme is combined with software prefetching, none of the benchmarks improves significantly (at most 2%). The proposed hardware prefetch scheme thus subsumes the benefits of software prefetching.

21 OUTLINE
- Motivation
- Main idea in the paper
  - Analysis
  - Main idea
- Prefetch engine
  - Insertion policy
  - Prefetch scheduling
- Results
- Conclusion

22 Conclusions
The authors proposed and evaluated a prefetch architecture integrated with the on-chip L2 cache. The architecture aggressively prefetches large regions of data into L2 on demand misses. By scheduling these prefetches only during idle cycles and inserting them into the cache with low replacement priority, a significant improvement is obtained in 10 of the 26 SPEC benchmarks.

23 QUESTIONS