Prefetching Using a Global History Buffer
Kyle J. Nesbit and James E. Smith
Electrical and Computer Engineering, University of Wisconsin-Madison

Presentation transcript:

Outline
• Motivation
• Related Work
• Global History Buffer Prefetching
• Results
• Conclusion

Motivation
• D-Cache misses to main memory are of increasing importance
  – Main memory is getting farther away (in clock cycles)
  – Many demanding, memory-intensive workloads
• Computation is inexpensive compared to data accesses
  – A good opportunity to reevaluate prefetching data structures
  – Simple computation can supplement table information
• We consider prefetches from main memory into the lowest-level cache (the L2 cache in this study)

Markov Prefetching
• Markov prefetching forms address correlations
  – Joseph and Grunwald (ISCA '97)
• Uses global memory addresses as states in the Markov graph
• A correlation table approximates the Markov graph (a toy sketch follows below)
[Figure: a miss address stream, the resulting Markov graph, and the correlation table holding first and second predictions for each miss address]
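
To make the correlation-table idea concrete, here is a minimal software sketch (not from the slides): a toy table that keeps the two most recently seen successors of each miss address, which is roughly what the first and second predictions in the figure correspond to. The class name, entry width, and miss stream are illustrative.

```python
# Toy address-correlation (Markov) prefetcher: maps a miss address to its
# most recently observed successor addresses.
class MarkovPrefetcher:
    def __init__(self, predictions_per_entry=2):
        self.table = {}                  # miss address -> successors, newest first
        self.width = predictions_per_entry
        self.last_miss = None

    def on_miss(self, addr):
        # Record that `addr` followed the previous miss address.
        if self.last_miss is not None:
            succ = self.table.setdefault(self.last_miss, [])
            if addr in succ:
                succ.remove(addr)
            succ.insert(0, addr)
            del succ[self.width:]        # keep at most `width` predictions per entry
        self.last_miss = addr
        # Prefetch the recorded successors of the current miss address.
        return list(self.table.get(addr, []))

p = MarkovPrefetcher()
for miss in [0xA, 0xB, 0xC, 0xA, 0xB, 0xC, 0xA]:
    print(hex(miss), "->", [hex(x) for x in p.on_miss(miss)])
```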

Correlation Prefetching
• Distance prefetching forms delta correlations
  – Kandiraju and Sivasubramaniam (ISCA '02)
• Delta-based prefetching leads to a much smaller table than "classical" Markov prefetching
• Delta-based prefetching can remove compulsory misses (a sketch follows below)
[Figure: Markov prefetching (table keyed by miss address) vs. distance prefetching (table keyed by global delta), each holding first and second predictions]
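
A matching sketch for the delta-based variant (again illustrative, not the paper's exact table): the table is indexed by the global delta between consecutive misses rather than by the miss address itself, which is why it stays much smaller and can predict addresses that have never missed before.

```python
# Toy distance (delta-correlation) prefetcher: the table is indexed by the
# global delta and records the deltas that tended to follow it.
class DistancePrefetcher:
    def __init__(self, predictions_per_entry=2):
        self.table = {}                  # delta -> follow-on deltas, newest first
        self.width = predictions_per_entry
        self.last_addr = None
        self.last_delta = None

    def on_miss(self, addr):
        prefetches = []
        if self.last_addr is not None:
            delta = addr - self.last_addr
            if self.last_delta is not None:
                succ = self.table.setdefault(self.last_delta, [])
                if delta in succ:
                    succ.remove(delta)
                succ.insert(0, delta)
                del succ[self.width:]
            # Predict the deltas that usually follow the current delta.
            prefetches = [addr + d for d in self.table.get(delta, [])]
            self.last_delta = delta
        self.last_addr = addr
        return prefetches

p = DistancePrefetcher()
for miss in [100, 101, 163, 164, 226, 227, 289]:   # global deltas repeat 1, 62
    print(miss, "->", p.on_miss(miss))
```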

Global History Buffer (GHB)
• Holds the miss address history in FIFO order
• Linked lists within the GHB connect related addresses (sketched in code below)
  – Same static load
  – Same global miss address
  – Same global delta
• The linked-list walk is short compared with the L2 miss latency
[Figure: an index table (keyed here by load PC) pointing into the FIFO global history buffer of miss addresses]
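
Below is a rough software sketch of the GHB organization described on this slide: a circular FIFO of miss addresses plus an index table whose entries point at the newest history entry for a key (a load PC here, but a global miss address or a global delta works the same way). The entry counts, the modulo hashing of keys, and the walk limit are assumptions for illustration.

```python
from collections import namedtuple

# Each GHB entry holds a miss address and a link to the previous entry with the same key.
Entry = namedtuple("Entry", ["addr", "link"])

class GlobalHistoryBuffer:
    def __init__(self, ghb_entries=512, it_entries=512):
        self.ghb = [None] * ghb_entries    # circular FIFO of Entry
        self.head = 0                      # global position of the next slot to write
        self.index = {}                    # key -> global position of the newest entry for that key
        self.it_entries = it_entries

    def insert(self, key, miss_addr):
        key %= self.it_entries             # the index table is finite, so keys can conflict
        prev = self.index.get(key)
        self.ghb[self.head % len(self.ghb)] = Entry(miss_addr, prev)
        self.index[key] = self.head
        self.head += 1

    def history(self, key, max_walk=8):
        """Walk the linked list for `key`, newest first, stopping at stale (overwritten) entries."""
        key %= self.it_entries
        pos, out = self.index.get(key), []
        while pos is not None and len(out) < max_walk:
            if pos < self.head - len(self.ghb):
                break                      # the FIFO has already overwritten this entry
            entry = self.ghb[pos % len(self.ghb)]
            out.append(entry.addr)
            pos = entry.link
        return out

ghb = GlobalHistoryBuffer()
for pc, addr in [(0x40, 100), (0x44, 7), (0x40, 108), (0x40, 116)]:
    ghb.insert(pc, addr)
print(ghb.history(0x40))   # [116, 108, 100]
```

The staleness check in history() is where the FIFO organization pays off: history that has aged out of the buffer is simply never followed, which is the "eliminates stale history in a natural way" point from the conclusions.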

GHB - Example
[Figure: worked GHB example; the index table is keyed by the global miss address, each GHB entry holds a miss address and a link pointer, and the current miss and the resulting prefetches are marked]

GHB – Deltas
[Figure: width, depth, and hybrid prefetching applied to the same miss address stream and global delta stream, with the Markov graph of deltas and the prefetch addresses each method generates]

GHB – Hybrid Delta
• Width prefetching suffers from poor accuracy and short look-ahead
• Depth prefetching has good look-ahead, but may miss prefetch opportunities when a number of "next" addresses have similar probability
• The hybrid method combines depth and width (a sketch of all three follows below)
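
The toy function below illustrates how width, depth, and hybrid prefetches could be generated from a global delta history. For clarity it scans a flat list of deltas instead of walking GHB linked lists, and the way the hybrid simply concatenates the width and depth candidates is an assumption for illustration, not necessarily the paper's exact policy.

```python
def delta_prefetches(miss_addrs, degree=4, mode="depth"):
    """Toy width/depth/hybrid prefetch generation from a miss-address stream."""
    deltas = [b - a for a, b in zip(miss_addrs, miss_addrs[1:])]
    if not deltas:
        return []
    cur_addr, cur_delta = miss_addrs[-1], deltas[-1]
    # Earlier positions where the current delta occurred (excluding the newest one).
    matches = [i for i, d in enumerate(deltas[:-1]) if d == cur_delta]

    prefetches = []
    if mode in ("width", "hybrid"):
        # Width: one prefetch per prior occurrence, using the delta that followed it.
        for i in reversed(matches):                # newest matches first
            prefetches.append(cur_addr + deltas[i + 1])
    if mode in ("depth", "hybrid") and matches:
        # Depth: chain the deltas that followed the most recent occurrence.
        addr = cur_addr
        for d in deltas[matches[-1] + 1:]:
            addr += d
            prefetches.append(addr)
    # Deduplicate while keeping order, then honor the prefetch degree.
    seen, out = set(), []
    for a in prefetches:
        if a not in seen:
            seen.add(a)
            out.append(a)
    return out[:degree]

stream = [100, 101, 163, 164, 226, 227]           # global deltas: 1, 62, 1, 62, 1
for mode in ("width", "depth", "hybrid"):
    print(mode, delta_prefetches(stream, mode=mode))
```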

GHB - Hybrid Example
[Figure: worked hybrid example; the index table is keyed by the global delta, each GHB entry holds a miss address and a link pointer, and the prefetches generated from the current miss are marked]

Simulation Methodology
• Simulated SPEC CPU2000 benchmarks
• Fast-forwarded 1 billion instructions and simulated 1 billion instructions
• Used peak binaries compiled with -O4 optimization
• Results include all benchmarks that have at least a 5% IPC improvement with an ideal L2 cache

Issue Width: 4 instructions
Load/Store Queue: 64 entries
RUU Size: 128 entries
Level 1 D-Cache: 16 KB, 2-way
Level 1 I-Cache: 16 KB, 2-way
Level 2 Cache: 512 KB, 4-way
Memory Latency: 140 cycles

Simulation Methodology
• Table walk: one cycle per access
• The index table (IT) size reduces table conflicts
• The GHB size reflects the prefetch history working set
• In general, GHB prefetching requires less history

Prefetching method, table configuration, and size:
• Conventional Distance Prefetching: 512 table entries (18 KB)
• GHB Distance Prefetching: 512 IT entries and 512 GHB entries (8 KB)

Results
• Our results compare:
  – IPC improvement (harmonic mean) vs. prefetch degree
  – Increase in memory traffic per instruction (arithmetic mean) vs. prefetch degree
  – Prefetch accuracy: the percent of prefetches that are used by the program

Distance Prefetching (Performance)
[Figure: IPC improvement (harmonic mean) vs. prefetch degree for Table (width), GHB (width), GHB (depth), and GHB (hybrid)]

Distance Prefetching (Performance)
[Figure: per-benchmark IPC improvement for Table (width), GHB (width), GHB (depth), and GHB (hybrid) across ammp, art, wupwise, swim, lucas, mgrid, applu, galgel, apsi, mcf, twolf, vpr, parser, gap, bzip2, and the harmonic mean; one bar is clipped at roughly 300%]

Distance Prefetching (Memory Traffic)
[Figure: increase in memory traffic vs. prefetch degree for Table (width), GHB (width), GHB (depth), and GHB (hybrid)]

Conclusions
• A more complete picture of history
  – Allows width, depth, and hybrid prefetching
  – Also can improve other prefetching methods (covered in depth in the paper)
• Eliminates stale history in a natural way
  – The FIFO discards old history to make room for new history
  – In a conventional table, old history can remain for a very long time and trigger inaccurate prefetches

Acknowledgements
• This research was funded by:
  – An Intel Undergraduate Research scholarship
  – A University of Wisconsin Hilldale Undergraduate Research fellowship
  – The National Science Foundation under grants CCR and EIA

Backup Slides

Prefetching Metrics
• Accuracy is the percent of prefetches that are actually used (sketched below).
• Coverage is the percent of memory references prefetched rather than demand fetched.
• Timeliness indicates whether prefetched data arrives early enough to prevent the processor from stalling.
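
Accuracy and coverage are simple ratios over counters a simulator would already keep; a minimal sketch with illustrative counter names:

```python
def prefetch_accuracy(useful_prefetches, issued_prefetches):
    # Fraction of issued prefetches that the program actually used.
    return useful_prefetches / issued_prefetches if issued_prefetches else 0.0

def prefetch_coverage(misses_covered_by_prefetch, baseline_misses):
    # Fraction of baseline demand misses that a prefetch turned into hits.
    return misses_covered_by_prefetch / baseline_misses if baseline_misses else 0.0

# e.g. 800 of 1000 issued prefetches were used, covering 800 of 2000 baseline misses
print(prefetch_accuracy(800, 1000))   # 0.8
print(prefetch_coverage(800, 2000))   # 0.4
```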

GHB – Deltas
[Figure: backup deltas example showing the miss address stream, the global delta stream, and the Markov graph of deltas, with the current miss and the resulting prefetches marked]

Prefetch Taxonomy
• To simplify the discussion and illustrate the relation between prefetching methods, we introduce a consistent naming convention.
• Each name is an X/Y pair.
  – X is the key used for localizing the address stream.
  – Y is the method for detecting address patterns.

Prefetch Taxonomy
• We study two localizing methods:
  – No localization, i.e. global (G)
  – Program Counter (PC)
• And three pattern detection methods:
  – Address Correlation (AC)
  – Delta Correlation (DC)
  – Constant Stride (CS)

Prefetch Taxonomy
• Markov Prefetching - G/AC
• Distance Prefetching - G/DC
• Stride Prefetching - PC/CS

Stride Prefetching
• A table tracks the local history of loads.
• If a constant stride is detected in a load's local history, then n + s, n + 2s, …, n + ds are prefetched (see the sketch below).
  – n is the current target address
  – s is the detected stride
  – d is the prefetch degree, i.e. the aggressiveness of the prefetching
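
A minimal sketch of the prefetch-address generation described above. Stride detection here just compares the last two strides of the load's recent addresses; the real Reference Prediction Table adds a tag and a small state machine, as shown on the next slide.

```python
def stride_prefetch_addresses(local_history, degree):
    """If the last addresses of a load share a constant stride s, return
    n + s, n + 2s, ..., n + d*s, where n is the current address and d the degree."""
    if len(local_history) < 3:
        return []
    s = local_history[-1] - local_history[-2]
    if s == 0 or local_history[-2] - local_history[-3] != s:
        return []                                  # no constant stride detected
    n = local_history[-1]
    return [n + i * s for i in range(1, degree + 1)]

print(stride_prefetch_addresses([100, 116, 132], degree=4))   # [148, 164, 180, 196]
```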

Stride Prefetching
[Figure: the Reference Prediction Table, indexed by the PC of the load; each entry holds a tag, last address, stride, and state; the stride is computed by subtracting the last address from the target address, and the prefetch address by adding the stride to the target address]

GHB – Stride Prefetching
• GHB-Stride uses the PC to access the index table.
• The linked lists contain the local history of each load.
• Compare the last two local strides; if they are the same, prefetch n + s, n + 2s, …, n + ds (sketched below).
[Figure: the index table keyed by load PC, with each load's linked list of miss addresses in the GHB]
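
Reusing the GlobalHistoryBuffer sketch from the GHB slide above, GHB-Stride can be expressed roughly as follows: walk the PC-indexed linked list to recover the load's local miss history, then apply the same last-two-strides check. The walk length and degree are illustrative.

```python
def ghb_stride_prefetches(ghb, load_pc, degree=4):
    # `ghb` is an instance of the GlobalHistoryBuffer sketch defined earlier.
    # Walk the PC-indexed linked list to get this load's local history, newest first.
    local = ghb.history(load_pc, max_walk=3)
    if len(local) < 3:
        return []
    s1 = local[0] - local[1]              # most recent local stride
    s2 = local[1] - local[2]              # the stride before that
    if s1 == 0 or s1 != s2:
        return []
    n = local[0]
    return [n + i * s1 for i in range(1, degree + 1)]

ghb = GlobalHistoryBuffer()
for addr in (100, 108, 116):
    ghb.insert(0x40, addr)
print(ghb_stride_prefetches(ghb, 0x40))   # [124, 132, 140, 148]
```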

GHB – Local Delta Correlation
• Form delta correlations within each load's local history (sketched in code below).
• For example, consider a local miss address stream whose deltas repeat the pattern 1, 1, 62:

Correlation   Prefetch Predictions
(1, 1)        62, 1, 1
(1, 62)       1, 1, 62
(62, 1)       1, 62, 1
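
A toy version of the local delta-correlation lookup: compute the load's local deltas, match the most recent delta pair against that load's own delta history, and replay the deltas that followed the match. The example stream and degree are illustrative, chosen to reproduce the 1, 1, 62 pattern from the table above.

```python
def local_delta_prefetches(local_addrs, degree=3):
    """Match the last two local deltas against this load's delta history and
    replay the deltas that followed the most recent earlier match."""
    deltas = [b - a for a, b in zip(local_addrs, local_addrs[1:])]
    if len(deltas) < 3:
        return []
    key = (deltas[-2], deltas[-1])
    # Search backwards (excluding the newest pair) for the same delta pair.
    for i in range(len(deltas) - 3, -1, -1):
        if (deltas[i], deltas[i + 1]) == key:
            addr, out = local_addrs[-1], []
            for d in deltas[i + 2:]:
                addr += d
                out.append(addr)
                if len(out) == degree:
                    break
            return out
    return []

# Local stream whose delta pattern repeats 1, 1, 62; the last two deltas are
# (62, 1), so the replayed deltas 1, 62, 1 yield 130, 192, 193.
stream = [0, 1, 2, 64, 65, 66, 128, 129]
print(local_delta_prefetches(stream))
```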
