DSPatch: Dual Spatial Pattern Prefetcher

Presentation transcript:

DSPatch: Dual Spatial Pattern Prefetcher
Rahul Bera1, Anant V. Nori1, Onur Mutlu2, Sreenivas Subramoney1
1Processor Architecture Research Lab, Intel Labs  2ETH Zürich

Summary

Motivation: DRAM bandwidth increases with every generation; prefetchers need to adapt to effectively utilize this resource.

Challenge: significantly boost Coverage AND simultaneously optimize for Accuracy.

Dual Spatial Pattern Prefetcher (DSPatch):
- Simultaneously learn two spatial bit-pattern representations of program accesses per page: Coverage-biased (CovP) and Accuracy-biased (AccP)
- Predict with one of the patterns based on current DRAM bandwidth headroom

Results:
- 6% average speedup over baseline with PC-stride @ L1 and SPP @ L2
- 10% average speedup when DRAM bandwidth is doubled
- Only 3.6 KB of hardware storage

Outline
1. Motivation and Observations
2. Dual Spatial Pattern Prefetcher (DSPatch)
3. Evaluation
4. Conclusion

Motivation
- Current prefetchers' performance scales poorly with increasing DRAM bandwidth.
- Traditional prefetcher designs trade off Coverage against Accuracy, which limits their ability to significantly boost Coverage when needed.

Observation 1
Anchored spatial bit-patterns are well suited to capture spatial access patterns.

Cacheline offset sequences within a page:
A: [02]-[06]-[05]-[12]-[13]
B: [01]-[05]-[04]-[11]-[12]
C: [01]-[05]-[11]-[04]-[12]
D: [01]-[11]-[05]-[04]-[12]
E: [01]-[11]-[04]-[05]-[12]
F: [01]-[04]-[11]-[05]-[12]
G: [01]-[04]-[05]-[11]-[12]

Cacheline delta sequences within a page (for B through G):
+4,-1,+7,+1 / +4,+6,-7,+8 / +10,-6,-1,+8 / +10,-7,+1,+7 / +3,+7,-6,+7 / +3,+1,+6,+7

Spatial bit-patterns: A: 0010 0110 0000 1100, B: 0100 1100 0001 1000
Anchored spatial bit-pattern (anchored at the first, triggering access to the page): 1001 1000 0011 0000

- Spatial bit-patterns subsume any small temporal variability in accesses.
- Anchored spatial bit-patterns capture all "global" deltas from the triggering access.
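The mapping from offset sequences to bit-patterns can be modeled in a few lines of Python. This is an illustrative sketch (the hardware operates on 64-bit vectors, one bit per cacheline of a 4 KB page):

```python
def spatial_pattern(offsets):
    """Set one bit per accessed cacheline offset within a page."""
    pattern = 0
    for off in offsets:
        pattern |= 1 << off
    return pattern

def anchored_pattern(offsets):
    """Anchor the pattern at the first (triggering) access, so it records
    global deltas from the trigger rather than absolute in-page positions."""
    trigger = offsets[0]
    return spatial_pattern([off - trigger for off in offsets])

# Sequences B and D from the slide visit the same offsets in different
# temporal orders, so both collapse to one spatial bit-pattern:
assert spatial_pattern([1, 5, 4, 11, 12]) == spatial_pattern([1, 11, 5, 4, 12])

# Sequence A starts at a different offset, but its deltas from the trigger
# match sequence B's, so the *anchored* patterns are equal:
assert anchored_pattern([2, 6, 5, 12, 13]) == anchored_pattern([1, 5, 4, 11, 12])
```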

Observation 2
Anchored spatial bit-patterns allow modulation for Coverage versus Accuracy.

Cacheline delta sequences within a page:
+4,-1,+7,+1 / +4,+6,-7,+8 / ... / +10,-6,-1,+8 / +5,-1,+6,+1 / +11,-1,-6,+1 / +3,+2,+4,+2 / +9,+1,-5,-2

Anchored spatial bit-patterns:
1001 1000 0011 0000
1000 1100 0011 0000
1001 0100 0110 0000
...

Modulated anchored spatial bit-patterns:
Bit-wise OR:  1001 1100 0011 1000  → Coverage-biased (100% Coverage)
Bit-wise AND: 1000 1000 0001 0000  → Accuracy-biased (100% Accuracy)

Bit-wise OR modulation is biased towards Coverage; bit-wise AND modulation is biased towards Accuracy.
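The two modulations are plain bitwise reductions. A minimal sketch, using small hypothetical patterns (not the exact ones partially elided on the slide):

```python
def bits(*positions):
    """Build a bit-pattern with the given bit positions set."""
    p = 0
    for i in positions:
        p |= 1 << i
    return p

def modulate(patterns, bias):
    """Fold several anchored bit-patterns into one stored pattern.
    Bitwise OR keeps every offset seen in any pattern (coverage-biased);
    bitwise AND keeps only offsets common to all of them (accuracy-biased)."""
    acc = patterns[0]
    for p in patterns[1:]:
        acc = (acc | p) if bias == "coverage" else (acc & p)
    return acc

# Three anchored patterns with bits {0,3,4,10,11}, {0,4,5,10,11}, {0,3,5,9,10}:
ps = [bits(0, 3, 4, 10, 11), bits(0, 4, 5, 10, 11), bits(0, 3, 5, 9, 10)]
covp = modulate(ps, "coverage")   # covers every access in every pattern
accp = modulate(ps, "accuracy")   # only offsets present in all patterns
assert covp == bits(0, 3, 4, 5, 9, 10, 11)
assert accp == bits(0, 10)
```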

Outline
1. Motivation and Observations
2. Dual Spatial Pattern Prefetcher (DSPatch)
3. Evaluation
4. Conclusion

DSPatch: Key Idea
- Simultaneously learn two modulated spatial bit-pattern representations of program accesses: Coverage-biased (CovP) and Accuracy-biased (AccP).
- Predict with one of the patterns based on current DRAM bandwidth headroom.
- Optimize simultaneously for both: significantly higher Coverage and high Accuracy.

DSPatch: Overall Design

Page Buffer (PB):
Page    Trigger PC    Bit-pattern
0x65    0x7ffecca     0010 1100 0001 0010
0x66    0x7ff8980     0001 0010 0011 0001

Signature Prediction Table (SPT):
PC Signature    CovP                   AccP
0x7ffecca       1011 0011 1110 0011    0001 0000 0011 0010

Flow: a program access trains the Page Buffer; the trigger PC's signature indexes the SPT, whose CovP and AccP patterns drive prefetching; the current DRAM bandwidth utilization selects between the two; the PB's learned bit-pattern updates the SPT.

DSPatch: Quantifying Coverage, Accuracy and Bandwidth

Quantifying Coverage and Accuracy:
- Identify good bits in the SPT pattern(s): PopCount-compare against the program pattern
- Quantize to quartile buckets (boundaries at 25%, 50%, 75%)

Example:
Pattern         Value                  PopCount
From program    1011 0100 0011 1100    8
From SPT        1010 0110 0000 0001    5
Bitwise-AND     1010 0100 0000 0000    3
Coverage = 3/8 → [25%-50%]; Accuracy = 3/5 → [50%-75%]

Quantifying DRAM bandwidth utilization:
- Count DRAM accesses in a 4 x tRC window
- Compare against the peak possible access count
- Quantize to quartile buckets (boundaries at 25%, 50%, 75%)
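The popcount-and-quantize step maps directly to code. A sketch reproducing the slide's worked example (the bucket indices 0-3 are my own encoding of the four quartiles):

```python
def popcount(x):
    """Number of set bits in the pattern."""
    return bin(x).count("1")

def quartile(fraction):
    """Quantize a fraction into quartile buckets 0..3 (25/50/75% boundaries)."""
    return min(int(fraction * 4), 3)

def coverage_accuracy(program, predicted):
    """Score a stored SPT pattern against the observed program pattern:
    coverage = fraction of the program's accesses the prediction hit,
    accuracy = fraction of predicted lines that were actually accessed."""
    hits = popcount(program & predicted)
    return (quartile(hits / popcount(program)),
            quartile(hits / popcount(predicted)))

# Slide example: program pattern PopCount 8, SPT pattern PopCount 5, and
# intersection PopCount 3 -> coverage 3/8 falls in the [25%,50%) bucket (1),
# accuracy 3/5 in the [50%,75%) bucket (2).
prog = 0b1011_0100_0011_1100
spt  = 0b1010_0110_0000_0001
assert coverage_accuracy(prog, spt) == (1, 2)
```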

DSPatch: Pattern Update

Example SPT entry, before the update:
PC Signature    CovP                   AccP
0x7ffecca       1100 0110 0111 1100    1000 0010 0001 1000

Program pattern: 1001 0111 0001 0001. CovP is updated by bit-wise OR with the program pattern; AccP by bit-wise AND.

After the update:
PC Signature    CovP                   AccP
0x7ffecca       1101 1111 0111 1101    1000 0010 0001 0000

- ORcount (2-bit) limits the number of OR modulation operations.
- MeasureCovP (2-bit): incremented if CovP coverage < 50% or CovP accuracy < 50%.
- MeasureAccP (2-bit): incremented if AccP accuracy < 50%.
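The update path can be sketched as follows. The 2-bit counter widths follow the slide; how the very first observation seeds the patterns, and the exact ordering of measurement versus update, are assumptions here (the paper has the details):

```python
def popcount(x):
    return bin(x).count("1")

class SPTEntry:
    """Sketch of one Signature Prediction Table entry's update path."""

    def __init__(self):
        self.covp = 0
        self.accp = 0
        self.measure_covp = 0   # 2-bit saturating counter
        self.measure_accp = 0   # 2-bit saturating counter
        self.seen = False

    def update(self, program):
        if not self.seen:       # assumed: first observation seeds both patterns
            self.covp = self.accp = program
            self.seen = True
            return
        # Bump the measure counters when a stored pattern underperforms.
        cov_hits = popcount(self.covp & program)
        if (cov_hits < popcount(program) / 2 or       # CovP coverage < 50%
                cov_hits < popcount(self.covp) / 2):  # CovP accuracy < 50%
            self.measure_covp = min(self.measure_covp + 1, 3)
        if popcount(self.accp & program) < popcount(self.accp) / 2:
            self.measure_accp = min(self.measure_accp + 1, 3)
        self.covp |= program    # OR: biased toward coverage
        self.accp &= program    # AND: biased toward accuracy

entry = SPTEntry()
entry.update(0b1001_1000_0011_0000)
entry.update(0b1000_1100_0011_0000)
assert entry.covp == 0b1001_1100_0011_0000   # union of both patterns
assert entry.accp == 0b1000_1000_0011_0000   # intersection of both
```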

DSPatch: Pattern Predict

- BW utilization >= 75%: if MeasureAccP is saturated, do not prefetch; otherwise prefetch using AccP (a "vulnerable fill").
- BW utilization >= 50% (but below 75%): if MeasureCovP is saturated, prefetch using AccP; otherwise prefetch using CovP.
- BW utilization < 50%: prefetch using CovP.
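One plausible reading of the decision diagram as code (the "vulnerable fill" handling and the exact arbitration are detailed in the paper, so treat this as a sketch): back off under high bandwidth pressure, and fall back from the coverage-biased to the accuracy-biased pattern as the measure counters saturate.

```python
def select_pattern(bw_quartile, measure_covp, measure_accp):
    """Choose which stored pattern to prefetch with.
    bw_quartile is the quantized DRAM bandwidth utilization, 0..3
    (boundaries at 25/50/75%); the measure counters saturate at 3."""
    if bw_quartile >= 3:                           # utilization >= 75%
        return None if measure_accp == 3 else "AccP"
    if bw_quartile >= 2:                           # utilization >= 50%
        return "AccP" if measure_covp == 3 else "CovP"
    return "CovP"                                  # ample headroom

assert select_pattern(0, 0, 0) == "CovP"   # low utilization: go for coverage
assert select_pattern(2, 3, 0) == "AccP"   # CovP unreliable: use AccP
assert select_pattern(3, 0, 0) == "AccP"   # high utilization: accurate only
assert select_pattern(3, 0, 3) is None     # AccP unreliable too: no prefetch
```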

Outline
1. Motivation and Observations
2. Dual Spatial Pattern Prefetcher (DSPatch)
3. Evaluation
4. Conclusion

Evaluation Methodology

Configuration:
- 4 x86 cores @ 4 GHz, 4-wide OoO, 224-entry ROB
- 32 KB, 8-way L1 data and instruction caches; PC-stride prefetcher into the L1 data cache
- 256 KB L2 cache; BOP, SPP, SMS and DSPatch evaluated @ L2
- 2 MB LLC per core
- Memory: 1Ch-DDR4-2133 for single-core evaluations, 2Ch-DDR4-2133 for multi-core evaluations

Workloads:
- Single-core: 75 workloads spanning Cloud, Client, Server, HPC, Sysmark, SPEC06 and SPEC17
- Multi-core: 42 high-MPKI homogeneous mixes, 75 heterogeneous mixes

DSPatch: Single-Core Performance 6% Standalone DSPatch outperforms SPP and SMS by 1% and 3% respectively DSPatch+SPP outperforms SPP by 6% on average

DSPatch: Performance Scaling With Bandwidth 10% 1ch-1600 6% 2ch-2133 2ch-2400 2ch-1600 DSPatch performance scales well with increase in memory bandwidth

DSPatch: Multi-Core Performance 15.1% 7.9% 5.9% 7.4% DSPatch performs well in bandwidth-constrained multi-core scenarios

DSPatch: Hardware Storage

Structure    Size per entry    No. of entries    Size
PB           19.75 B           64                1.23 KB
SPT          9.5 B             256               2.375 KB
Total                                            3.6 KB

DSPatch is an order of magnitude smaller than SMS and takes only two-thirds the storage of SPP.
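The totals in the table check out (taking 1 KB = 1024 B):

```python
# Storage arithmetic from the table: bytes per entry x number of entries.
pb_bytes = 19.75 * 64    # Page Buffer: 19.75 B/entry x 64 entries
spt_bytes = 9.5 * 256    # Signature Prediction Table: 9.5 B/entry x 256 entries

assert round(pb_bytes / 1024, 2) == 1.23
assert spt_bytes / 1024 == 2.375
assert round((pb_bytes + spt_bytes) / 1024, 1) == 3.6
```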

More In Paper

Design optimizations:
- Bit-pattern compression (reduces storage)
- Multiple triggers per 4 KB page (increases coverage)

Results:
- Breakdown of DSPatch's coverage and accuracy
- Performance comparison against iso-sized adjunct prefetchers
- Contribution of Coverage- vs. Accuracy-biased patterns

Outline
1. Motivation and Observations
2. Dual Spatial Pattern Prefetcher (DSPatch)
3. Evaluation
4. Conclusion

Summary

Motivation: DRAM bandwidth increases with every generation; prefetchers need to adapt to effectively utilize this resource.

Challenge: significantly boost Coverage AND simultaneously optimize for Accuracy.

Dual Spatial Pattern Prefetcher (DSPatch):
- Simultaneously learn two spatial bit-pattern representations of program accesses per page: Coverage-biased (CovP) and Accuracy-biased (AccP)
- Predict with one of the patterns based on current DRAM bandwidth headroom

Results:
- 6% average speedup over baseline with PC-stride @ L1 and SPP @ L2
- 10% average speedup when DRAM bandwidth is doubled
- Only 3.6 KB of hardware storage

DSPatch: Dual Spatial Pattern Prefetcher
Rahul Bera1, Anant V. Nori1, Onur Mutlu2, Sreenivas Subramoney1
1Processor Architecture Research Lab, Intel Labs  2ETH Zürich

BACKUP

Observation: Pollution

Coverage and Accuracy
For DSPatch, every 2% increase in coverage comes at a cost of only a 1% increase in mispredictions.