Prefetching On-time and When it Works Sequential Prefetcher With Adaptive Distance (SPAD) Ibrahim Burak Karsli Mustafa Cavus

Presentation transcript:

Prefetching On-time and When it Works
Sequential Prefetcher With Adaptive Distance (SPAD)
Ibrahim Burak Karsli, Mustafa Cavus, Resit Sendag
Department of Electrical, Computer, and Biomedical Engineering, University of Rhode Island

Outline
- Motivation
- Sequential Prefetcher with Adaptive Distance (SPAD)
- Hardware Budget
- Results

Motivation
- The next-line prefetcher (offset: +1) is simple and performs quite well (score ~4.439), but it loses opportunity because it has no feedback mechanism:
  - Timeliness: late prefetches are the most important problem.
  - Accuracy: there is no on/off mechanism.
  - Adaptivity: it does not adapt to changes in program behavior.
- Basic idea: add an adaptive distance to the next-line prefetcher.
  - Start with +1, then increment or decrement the distance based on feedback.
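The basic idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' decision engine: the class name, the 64-byte block size, the distance bounds, and the `late`/`useless` feedback flags are all hypothetical, since the slides only say that the distance starts at +1 and moves up or down based on feedback.

```python
BLOCK_SIZE = 64  # bytes per cache line (assumed, not stated in the slides)


class AdaptiveSequentialPrefetcher:
    """Next-line prefetcher whose distance (offset) can grow or shrink."""

    def __init__(self, min_distance=1, max_distance=16):
        # The bounds are illustrative assumptions.
        self.min_distance = min_distance
        self.max_distance = max_distance
        self.distance = 1  # start with +1, i.e. plain next-line prefetching

    def predict(self, addr):
        """Return the block-aligned address `distance` lines ahead of addr."""
        line = addr // BLOCK_SIZE
        return (line + self.distance) * BLOCK_SIZE

    def feedback(self, late, useless):
        """Hypothetical update rule: go further ahead when prefetches arrive
        late, pull back when they are not being used."""
        if late and self.distance < self.max_distance:
            self.distance += 1
        elif useless and self.distance > self.min_distance:
            self.distance -= 1


if __name__ == "__main__":
    pf = AdaptiveSequentialPrefetcher()
    print(hex(pf.predict(0x1000)))   # prefetch target at distance 1
    pf.feedback(late=True, useless=False)
    print(pf.distance, hex(pf.predict(0x1000)))
```

With a fixed `distance` of 1 this degenerates to the next-line prefetcher; the only addition is the feedback hook that moves the offset.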

Motivation: Sequential Prefetcher Performance with FIXED distance (offset)
- Distance 1 (next-line) score:
- Distance 3 (best) score: 4.484

Terminology
- Interval: a period of 512 L2 demand accesses
- L2miss: number of L2 misses in an interval
- Testing Queue (TQ): a FIFO queue
  - Every predicted address is inserted into the TQ.
  - The TQ also acts as a prefetch filter.
  - tqhits: number of L2 demand accesses found in the TQ in an interval
  - tqmhits: number of L2 demand misses found in the TQ in an interval
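The terms above map naturally onto a small data structure. The following Python sketch shows one way the Testing Queue and its per-interval counters could be modeled; the class and method names, the queue capacity, and the use of a set for membership lookup are assumptions made for illustration. Only the FIFO behavior, the filtering role, the 512-access interval, and the counter definitions come from the slide.

```python
from collections import deque

INTERVAL = 512  # L2 demand accesses per interval (from the slide)


class TestingQueue:
    """FIFO of predicted lines that doubles as a prefetch filter."""

    def __init__(self, capacity=128):  # capacity is an assumed value
        self.capacity = capacity
        self.fifo = deque()
        self.members = set()  # mirrors the FIFO contents for fast lookup
        self.accesses = 0
        self.l2miss = 0
        self.tqhits = 0
        self.tqmhits = 0

    def insert(self, line):
        """Insert a predicted line. Returns False when the line is already
        queued, which is how the queue filters duplicate prefetches."""
        if line in self.members:
            return False
        if len(self.fifo) == self.capacity:
            self.members.discard(self.fifo.popleft())  # evict the oldest
        self.fifo.append(line)
        self.members.add(line)
        return True

    def demand_access(self, line, is_miss):
        """Record one L2 demand access. Returns (L2miss, tqhits, tqmhits)
        when an interval completes, otherwise None."""
        self.accesses += 1
        if is_miss:
            self.l2miss += 1
        if line in self.members:
            self.tqhits += 1
            if is_miss:
                self.tqmhits += 1
        if self.accesses < INTERVAL:
            return None
        stats = (self.l2miss, self.tqhits, self.tqmhits)
        self.accesses = self.l2miss = self.tqhits = self.tqmhits = 0
        return stats
```

A decision engine would consume the tuple returned at the end of each interval; how SPAD turns those counts into a distance update is not specified in this transcript.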

SPAD Prefetcher Components

SPAD Decision Engine: Distance Update Mechanism

SPAD Adaptiveness
[Figure: SPAD compared with the fixed-distance sequential prefetcher using best distances (BD). BD labels shown in the figure: 3, 4, 6, 1, 5, 1.]

SPAD Hardware & Performance

SPAD Performance
Prefetcher                                Score
Sequential
Sequential +3 (best performing offset)
Ampm lite                                 4.511
Sandbox (+/- 16), 32 offsets
SPAD                                      4.584

SPAD Hardware Budget
- Test Queue: 4103 bits
- Registers & Counters: 160 bits
- Total: 4263 bits

IP-Stride and SPAD
- The score of SPAD is significantly better than the score of the IP stride prefetcher.
- However, IP stride works significantly better than SPAD for some benchmarks, such as bzip2 and soplex.
- Integrating SPAD with IP stride improves SPAD performance by 5.5%.

Submission Hardware Budget
- SPAD (4263 bits)
  - Test Queue (4103 bits)
  - Registers & Counters (160 bits)
- IP Stride (67584 bits)
- Global Prefetch Queue (4103 bits)
- Total (75950 bits)
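The budget figures are consistent with each other, which a short Python check makes explicit. The dictionary keys are just illustrative labels; the bit counts are the ones listed on the slide.

```python
# Bit counts taken from the slide; the key names are illustrative.
spad_components = {"test_queue": 4103, "registers_and_counters": 160}
spad_bits = sum(spad_components.values())

submission = {
    "spad": spad_bits,
    "ip_stride": 67584,
    "global_prefetch_queue": 4103,
}
total_bits = sum(submission.values())

print(spad_bits, total_bits)  # prints "4263 75950"
```

IP stride accounts for most of the submission's storage; SPAD itself contributes 4263 of the 75950 bits.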

Benchmarks
- 40 benchmarks from the SPEC CPU2000, SPEC CPU2006, and Olden benchmark suites.
- We used SimPoint 2.0 to generate representative 100M-instruction traces.
  - 10M instructions for warmup
  - 90M instructions for simulation

Results

Prefetcher                      Score
Sequential
Sequential
Ampm lite                       4.511
Sandbox                         4.578
IP stride                       4.300
SPAD                            4.584
SPAD & IP Stride (Combined)     4.616

Conclusion
- Adaptive distance in sequential prefetchers has significant benefits.
- Our submitted version is not optimized. It can be significantly improved, as we observed in our later tests.
- Combining SPAD with the IP stride prefetcher boosts performance.

Questions? Thank You