Prefetching Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John P. Shen and Mark Hill Updated by Mikko Lipasti.

Prefetching
– Even "demand fetching" prefetches the other words in a block (spatial locality)
– Prefetching is useless unless a prefetch costs less than a demand miss
– Ideally, prefetches should:
  – Always get data before it is referenced
  – Never get data that is not used
  – Never prematurely replace useful data
  – Never interfere with other cache activity

Software Prefetching
– Use the compiler to prefetch early and prefetch accurately
– Prefetch into a register (binding)
  – Can use normal loads with stall-on-use (Alpha 21164)
  – But what about page faults and exceptions?
– Prefetch into the cache (non-binding): preferred
  – Needs ISA support

Software Prefetching
For example:
  do j = 1, cols
    do ii = 1, rows, BLOCK
      prefetch(&x[ii,j] + BLOCK)   ! prefetch one block ahead
      do i = ii, ii + BLOCK-1
        sum = sum + x[i,j]
How many blocks ahead should we prefetch?
– Affects timeliness of prefetches
– Must be scaled based on miss latency
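The prefetch distance question above can be answered with a simple latency calculation. A minimal sketch, assuming we know the miss latency and the compute time per iteration (the function name and parameters are illustrative, not from the slides):

```python
import math

def prefetch_distance(miss_latency_cycles, cycles_per_iteration, iterations_per_block):
    """How many blocks ahead to prefetch so data arrives in time.

    The prefetch must be issued at least miss_latency_cycles before the
    block is consumed, so we divide the latency by the cycles spent
    computing on one block and round up.
    """
    cycles_per_block = cycles_per_iteration * iterations_per_block
    return math.ceil(miss_latency_cycles / cycles_per_block)

# e.g. a 200-cycle miss, 4 cycles per iteration, 8 iterations per block:
# prefetches must be issued ceil(200 / 32) = 7 blocks ahead
print(prefetch_distance(200, 4, 8))  # 7
```

A longer miss latency or a tighter loop body both push the distance up, which is what "must be scaled based on miss latency" refers to.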

Hardware Prefetching
– What to prefetch:
  – One block spatially ahead
  – N blocks spatially ahead
  – Based on observed stride; track/prefetch multiple strides
– Training the hardware prefetcher:
  – On every reference (expensive)
  – On every miss (information loss)
  – Misses at what level of cache?
  – Prefetchers at every level of cache?
– Creates pressure for nonblocking miss support (MSHRs)
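Stride-based training can be sketched as a small table indexed by load PC, in the spirit of a reference prediction table; this is a minimal illustration (the class and field names are mine, and a stride is confirmed once before any prefetch is issued):

```python
class StridePrefetcher:
    """Minimal per-PC stride detector. Each table entry remembers the
    last address and last stride seen by a static load; a prefetch is
    issued only after the same nonzero stride repeats."""

    def __init__(self):
        self.table = {}  # pc -> (last_addr, stride, confirmed)

    def access(self, pc, addr):
        """Train on one reference; return a prefetch address or None."""
        if pc not in self.table:
            self.table[pc] = (addr, 0, False)
            return None
        last, stride, _ = self.table[pc]
        new_stride = addr - last
        if new_stride == stride and stride != 0:
            self.table[pc] = (addr, stride, True)
            return addr + stride          # stride confirmed: prefetch ahead
        self.table[pc] = (addr, new_stride, False)
        return None

p = StridePrefetcher()
for a in (100, 164, 228, 292):            # a 64-byte stride
    hint = p.access(pc=0x400, addr=a)
print(hint)  # 356: the next address along the detected stride
```

Training on every reference, as here, is the expensive option the slide mentions; a real design might train only on misses and track multiple strides per PC.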

Prefetching for Pointer-based Data Structures
– What to prefetch:
  – Next level of a tree: n+1, n+2, n+? The entire tree, or just one path?
  – Next node in a linked list: n+1, n+2, n+?
  – Jump-pointer prefetching
– How to prefetch:
  – Software places jump pointers in the data structure
  – Hardware scans fetched blocks for pointers (content-driven data prefetching)
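The software jump-pointer idea can be illustrated on a linked list: each node gets an extra pointer several hops ahead, so a traversal can prefetch that far into the future. A sketch under assumed parameters (the node fields, helper name, and the distance of 3 are all illustrative):

```python
JUMP_DIST = 3  # assumed prefetch distance, in list hops

class Node:
    """Linked-list node augmented with a jump pointer JUMP_DIST hops ahead."""
    def __init__(self, value):
        self.value = value
        self.next = None
        self.jump = None

def install_jump_pointers(head):
    """Software pass: link each node to the node JUMP_DIST hops ahead."""
    lead = trail = head
    for _ in range(JUMP_DIST):
        if lead is None:
            return            # list shorter than the jump distance
        lead = lead.next
    while lead is not None:
        trail.jump = lead     # traversal of trail can prefetch lead's block
        trail, lead = trail.next, lead.next

# Build a 6-node list 0..5 and install jump pointers
nodes = [Node(i) for i in range(6)]
for a, b in zip(nodes, nodes[1:]):
    a.next = b
install_jump_pointers(nodes[0])
print(nodes[0].jump.value)  # 3: while visiting node 0, prefetch node 3
```

In a real system the traversal would issue a non-binding prefetch on `node.jump` before working on `node`, hiding the pointer-chasing latency that sequential `next` walks cannot.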

Stream or Prefetch Buffers
– Prefetching into the cache causes capacity and conflict misses (pollution): it can displace useful blocks
– Stream buffers are aimed at compulsory and capacity misses
– Prefetch into buffers, NOT into the cache:
  – On a miss, start filling the stream buffer with successive lines
  – Check both the cache and the stream buffer on each access
  – Hit in stream buffer => move the line into the cache (promote)
  – Miss in both => clear and refill the stream buffer
– Performance:
  – Very effective for I-caches, less so for D-caches
  – Multiple buffers capture multiple streams (better for D-caches)
– Can be used with any prefetching scheme to avoid pollution
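The fill/probe/promote behavior above can be sketched for a single sequential stream buffer; this is a simplified illustration (block addresses are line-aligned integers, the depth is arbitrary, and the miss-in-both refill path is left to the caller):

```python
from collections import deque

class StreamBuffer:
    """One sequential stream buffer of `depth` pending lines."""

    def __init__(self, depth=4):
        self.depth = depth
        self.fifo = deque()

    def refill(self, miss_block):
        """On a demand miss, start fetching the lines after the miss."""
        self.fifo = deque(miss_block + i + 1 for i in range(self.depth))

    def probe(self, block):
        """Check the head entry. On a hit, the line would be promoted to
        the cache, and one more line is fetched to keep the stream full."""
        if self.fifo and self.fifo[0] == block:
            self.fifo.popleft()
            self.fifo.append(block + self.depth)
            return True
        return False

sb = StreamBuffer(depth=4)
sb.refill(100)            # demand miss on block 100
print(sb.probe(101))      # True: next sequential block hits in the buffer
print(sb.probe(250))      # False: not part of this stream
```

Because prefetched lines wait in the buffer and enter the cache only on promotion, useless prefetches never evict live cache blocks, which is the pollution-avoidance argument on this slide.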

Example: Global History Buffer
K. Nesbit, J. Smith, "Prefetching using a global history buffer", HPCA [slides from the conference talk follow]
– A hardware prefetching scheme
– Monitors the miss stream
– Learns correlations
– Issues prefetches for the likely next address

Markov Prefetching
– Markov prefetching forms address correlations
  – Joseph and Grunwald (ISCA '97)
– Uses global memory addresses as states in the Markov graph
– A correlation table approximates the Markov graph: each miss address indexes an entry holding its most likely successors (1st and 2nd predictions)
– [Figure: miss address stream A B C A B C ..., the resulting Markov graph, and the corresponding correlation table]
© Nesbit, Smith
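The correlation table can be sketched as a map from each miss address to a counter of the addresses that missed next; this is a minimal illustration of the idea, not the paper's hardware organization (the table here is unbounded, and the prediction width of 2 matches the 1st/2nd predictions in the figure):

```python
from collections import defaultdict, Counter

class MarkovPrefetcher:
    """Address-correlation (Markov) table sketch: for each miss
    address, remember which addresses most often missed next, and
    predict the top `width` of them."""

    def __init__(self, width=2):
        self.width = width
        self.successors = defaultdict(Counter)
        self.last_miss = None

    def miss(self, addr):
        """Record one miss; return up to `width` predicted next misses."""
        if self.last_miss is not None:
            self.successors[self.last_miss][addr] += 1
        self.last_miss = addr
        return [a for a, _ in self.successors[addr].most_common(self.width)]

m = MarkovPrefetcher()
preds = []
for a in "ABCABCAB":          # the A B C pattern from the figure
    preds = m.miss(a)
print(preds)  # ['C']: after B, the table has only ever seen C next
```

Note the table is keyed by full addresses, which is why it grows large; the delta-based variant on the next slide shrinks it.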

Correlation Prefetching with Deltas
– Distance prefetching forms delta correlations
  – Kandiraju and Sivasubramaniam (ISCA '02)
– Correlates on the deltas between successive global miss addresses rather than on the addresses themselves
– Delta-based prefetching leads to a much smaller table than "classical" Markov prefetching
– Delta-based prefetching can remove compulsory misses
– [Figure: miss address stream, the derived global delta stream, and the Markov vs. distance correlation tables (miss address, 1st prediction, 2nd prediction)]
© Nesbit, Smith
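A delta-correlation sketch shows why the table shrinks and why compulsory misses can be removed: the table is keyed by deltas, so a pattern learned in one address region predicts never-before-seen addresses in another. This is an illustrative simplification, not the paper's design:

```python
from collections import defaultdict, Counter

class DistancePrefetcher:
    """Delta (distance) correlation sketch: the table maps the last
    observed delta to the deltas that followed it; predicted deltas are
    added to the current miss address to form prefetches."""

    def __init__(self, width=1):
        self.width = width
        self.next_deltas = defaultdict(Counter)
        self.last_addr = None
        self.last_delta = None

    def miss(self, addr):
        prefetches = []
        if self.last_addr is not None:
            delta = addr - self.last_addr
            if self.last_delta is not None:
                self.next_deltas[self.last_delta][delta] += 1
            for d, _ in self.next_deltas[delta].most_common(self.width):
                prefetches.append(addr + d)
            self.last_delta = delta
        self.last_addr = addr
        return prefetches

d = DistancePrefetcher()
out = []
for a in (0, 8, 16, 1000, 1008, 1016):    # the +8 pattern recurs at a new region
    out = d.miss(a)
print(out)  # [1024]: a learned delta predicts an address never missed before
```

Predicting 1024 before it has ever missed is exactly the compulsory-miss removal an address-keyed Markov table cannot do.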

Global History Buffer (GHB)
– Holds the miss address history in FIFO order
– Linked lists within the GHB connect related addresses:
  – Same static load
  – Same global miss address
  – Same global delta
– An index table (e.g., indexed by load PC) holds the head pointer of each linked list
– A linked-list walk is short compared with the L2 miss latency
© Nesbit, Smith
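The FIFO-plus-linked-list organization can be sketched as follows; this is a minimal illustration keyed by global miss address, with FIFO eviction omitted for brevity (entry layout and names are mine):

```python
class GHB:
    """Global history buffer sketch. The buffer holds misses in FIFO
    order; each entry links to the previous entry with the same key, so
    walking a list visits past occurrences, and the entry AFTER each
    occurrence is what followed it historically."""

    def __init__(self):
        self.buffer = []          # list of (addr, prev_index_same_key)
        self.index_table = {}     # key (miss address) -> newest buffer index

    def miss(self, addr, degree=2):
        prefetches = []
        i = self.index_table.get(addr)
        while i is not None and len(prefetches) < degree:
            if i + 1 < len(self.buffer):
                prefetches.append(self.buffer[i + 1][0])
            i = self.buffer[i][1]   # follow the link to the older occurrence
        # Append the new miss at the tail, linked to its previous occurrence.
        self.buffer.append((addr, self.index_table.get(addr)))
        self.index_table[addr] = len(self.buffer) - 1
        return prefetches

g = GHB()
hints = []
for a in (1, 2, 3, 1, 2, 3):
    hints = g.miss(a)
print(hints)  # [1]: last time we missed on 3, the next miss was 1
```

Separating history (the FIFO buffer) from the index is what lets one structure support PC-localized, address-correlated, and delta-correlated prefetching just by changing the key.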

GHB Example
– [Figure: a miss address stream; the index table, keyed by global miss address, with head pointers into the GHB; the GHB entries (miss address, pointer); and the current entry and resulting prefetches from walking the linked list]
© Nesbit, Smith

GHB: Deltas
– [Figure: a miss address stream and the derived global delta stream; the Markov graph over deltas; and the prefetches produced by width, depth, and hybrid traversals of the delta lists]
© Nesbit, Smith

GHB: Hybrid Delta
– Width prefetching suffers from poor accuracy and short look-ahead
– Depth prefetching has good look-ahead, but may miss prefetch opportunities when a number of "next" addresses have similar probability
– The hybrid method combines depth and width
© Nesbit, Smith

GHB: Hybrid Example
– [Figure: a miss address stream, the global delta stream, the index table and GHB contents (miss address, pointer, head pointer), and the prefetches produced by the hybrid width/depth traversal]
© Nesbit, Smith

Summary
– Prefetching anticipates future memory references:
  – Software prefetching
  – Next-block and stride prefetching
  – Global history buffer prefetching
– Issues/challenges:
  – Accuracy
  – Timeliness
  – Overhead (bandwidth)
  – Conflicts (displacing useful data)