DSPatch: Dual Spatial Pattern Prefetcher

Presentation transcript:

DSPatch: Dual Spatial Pattern Prefetcher
Rahul Bera*, Anant V. Nori*, Onur Mutlu+, Sreenivas Subramoney*
*Processor Architecture Research Lab (PARL), Intel Labs
+ETH Zürich

1. Motivation
- The performance of current state-of-the-art spatial prefetchers plateaus despite increasing memory bandwidth.
- Prefetchers need to speculate more aggressively and boost coverage to maximize utilization of the memory bandwidth resource.

2. Challenge
- Traditional prefetcher design faces a fundamental tradeoff between coverage and accuracy.
- This tradeoff limits the prefetcher's ability to adapt dynamically to memory bandwidth headroom and significantly boost coverage.

3. Goal
A new prefetcher design should include:
- Pattern representations best suited to capture spatially co-located program accesses and boost coverage
- Mechanisms to optimize for both coverage and accuracy simultaneously
- The ability to dynamically adjust aggressiveness (coverage vs. accuracy) based on available DRAM bandwidth

4. DSPatch Key Insights
- A bit-pattern representation, rotated and anchored to the first ("trigger") access to a page, captures all spatially identical patterns while subsuming any temporal variability: it records all "global deltas" relative to the trigger access.
- A bit-wise OR of rotated bit-patterns adds missing bits to the pattern, biasing it toward coverage.
- A bit-wise AND of rotated bit-patterns keeps only the repeating bits, biasing it toward accuracy.
- Maintaining both modulated bit-patterns allows DSPatch to optimize for coverage and accuracy simultaneously.

5. DSPatch Design
- A Page Buffer (PB) [1.23 KB] tracks the accesses observed in each active page. A Signature Prediction Table (SPT) [2.375 KB], indexed by a PC signature, stores two patterns per entry: a coverage-biased pattern (CovP) and an accuracy-biased pattern (AccP); e.g., signature 0x7ffecca with CovP 1100011001111100 and AccP 1000001000011000.
- Pattern predict: on a trigger access, the slide's decision tree selects AccP, CovP, or no prediction based on DRAM bandwidth utilization thresholds (75% and 50%) and on whether the MeasureCovP/MeasureAccP quality counters have saturated; at high utilization only the accuracy-biased AccP is used, or prefetching is suppressed (e.g., on a vulnerable fill).
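The anchored bit-pattern and the OR/AND modulation described above can be sketched in software. This is a minimal illustrative model, not the hardware implementation; the 16-block region size and the function names are assumptions made for the example.

```python
REGION_BLOCKS = 16  # blocks tracked per region (illustrative size)

def anchor_pattern(accessed_offsets, trigger_offset):
    """Build a bit-pattern of accessed blocks, rotated so that the first
    ("trigger") access maps to bit 0. Two access sequences with the same
    spatial shape but different trigger positions yield the same anchored
    pattern, which subsumes temporal variability."""
    pattern = 0
    for off in accessed_offsets:
        pattern |= 1 << ((off - trigger_offset) % REGION_BLOCKS)
    return pattern

def update_dual_patterns(covp, accp, new_pattern):
    """Modulate the two stored patterns with a newly observed one:
    OR grows CovP toward coverage, AND shrinks AccP toward accuracy."""
    covp |= new_pattern   # keep every bit ever seen  -> higher coverage
    accp &= new_pattern   # keep only always-seen bits -> higher accuracy
    return covp, accp
```

Because the pattern is anchored to the trigger offset, accesses to blocks {5, 6, 8} and {9, 10, 12} of two different pages produce the identical pattern 0b1011, so one SPT entry covers both.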
- Pattern update: when a page's anchored access bit-pattern (e.g., 1001011100010001) retires from the Page Buffer, it is merged into the matching SPT entry: OR-ed into CovP and AND-ed into AccP (per the slide, the example entry becomes CovP 1101111101111101 and AccP 1000001000010000).
- Quality counters, evaluated against measured bandwidth utilization:
  MeasureCovP: incremented if CovP coverage < 50% or CovP accuracy < 50%
  MeasureAccP: incremented if AccP accuracy < 50%

6. DSPatch Results
[Result charts: Single-Core Performance, Single-Core DRAM Bandwidth Scaling, and Multi-Core Performance; only headline figures survive in the transcript. Reported speedups range from 6% to 15.1% as DRAM bandwidth scales from 1ch-2133 to 2ch-2400 (chart labels: 15.1%, 10%, 7.4%, 6%).]
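The bandwidth-gated pattern selection and the two quality counters can likewise be sketched. This is one plausible reading of the slide's flattened decision tree, with an assumed 2-bit saturating counter width; it is illustrative, not the exact hardware logic.

```python
SAT_MAX = 3  # assumed 2-bit saturating counter

def bump(counter, condition):
    """Saturating increment when the condition holds, decrement otherwise."""
    return min(counter + 1, SAT_MAX) if condition else max(counter - 1, 0)

def update_quality_counters(mcov, macc, cov_coverage, cov_accuracy, acc_accuracy):
    """Per the slide: MeasureCovP counts epochs where CovP's coverage or
    accuracy fell below 50%; MeasureAccP counts epochs where AccP's
    accuracy fell below 50%."""
    mcov = bump(mcov, cov_coverage < 0.5 or cov_accuracy < 0.5)
    macc = bump(macc, acc_accuracy < 0.5)
    return mcov, macc

def choose_pattern(bw_util, mcov_saturated, macc_saturated, covp, accp):
    """Select which pattern to prefetch with, given DRAM bandwidth
    utilization and the saturation state of the quality counters."""
    if bw_util >= 0.75:
        # Bandwidth is scarce: only the accuracy-biased pattern is safe,
        # and even that is dropped if AccP has proven low-quality.
        return accp if not macc_saturated else None
    if bw_util >= 0.50:
        # Moderate headroom: prefer CovP unless it has proven low-quality.
        return accp if mcov_saturated else covp
    # Plenty of headroom: bias toward coverage.
    return covp
```

The key design point survives any ambiguity in the transcript: the choice between the coverage-biased and accuracy-biased pattern is made per prediction, driven by measured DRAM bandwidth headroom rather than by a static policy.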