Lecture 25: Advanced Data Prefetching Techniques. Prefetching and data prefetching overview, stride prefetching, Markov prefetching, precomputation-based prefetching.


1 Lecture 25: Advanced Data Prefetching Techniques Prefetching and data prefetching overview, stride prefetching, Markov prefetching, precomputation-based prefetching. Zhao Zhang, CPRE 585 Fall 2003

2 Memory Wall Consider a memory latency of 1000 processor cycles, or a few thousand instructions …

3 Where Are Solutions?
1. Reducing miss rate: larger block size, larger cache size, higher associativity, victim caches, way prediction and pseudo-associativity, compiler optimizations
2. Reducing miss penalty: multilevel caches, critical word first, read miss first, merging write buffers
3. Reducing miss penalty or miss rate via parallelism: non-blocking caches, hardware prefetching, compiler prefetching
4. Reducing cache hit time: small and simple caches, avoiding address translation, pipelined cache access, trace caches

4 Where Are Solutions? Consider a 4-way issue OOO processor with a 20-entry issue queue, an 80-entry ROB, and 100 ns main memory access latency. For how many cycles will the processor stall on a cache miss to main memory? (See the estimate below.) OOO processors may tolerate L2 latency but not main memory latency. Increase the cache size? Add more levels to the memory hierarchy? Itanium: 2-4 MB L3 cache. IBM Power4: 32 MB eDRAM cache. Large caches are still very useful but may not fully address the issue.
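A rough back-of-envelope estimate, assuming a 2 GHz clock (the clock rate is an assumption, not stated on the slide): 100 ns of memory latency is about 200 cycles. Filling the 80-entry ROB at 4 instructions per cycle takes only 80 / 4 = 20 cycles; after that, no instruction can retire past the miss, so the processor sits stalled for roughly 200 - 20 = 180 cycles per miss to main memory.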

5 Prefetching Evaluation Prefetch: predict future accesses and fetch data before it is demanded. Accuracy: how many prefetched items are really needed? False prefetching: fetching the wrong data. Cache pollution: replacing “good” data with “bad” data. Coverage: how many cache misses are removed? Timeliness: does the data return before it is demanded? Other considerations: complexity and cost. (The first two metrics are written as ratios below.)
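Stated as ratios (standard definitions consistent with the slide, not quoted from it):
    accuracy = useful prefetches / prefetches issued
    coverage = misses removed by prefetching / misses without prefetching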

6 Prefetching Targets Instruction prefetching: a stream buffer is very useful. Data prefetching: more complicated because of the diversity of data access patterns. Prefetching for dynamic data (hashing, heap, sparse array, etc.): usually irregular access patterns. Linked-list prefetching (pointer chasing): a special type of data prefetching for data in a linked list; see the sketch below.
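To see why pointer chasing defeats sequential and stride prefetching, consider a minimal C sketch (a hypothetical list type, not from the slides): each load address is produced by the previous load, so the next address is unknown until the current miss resolves, and there is no fixed stride to extrapolate.

    #include <stddef.h>

    struct node { int key; struct node *next; };

    /* Each iteration's load address comes out of memory itself, so a
       sequential or stride prefetcher has nothing to extrapolate from. */
    int sum_list(const struct node *head) {
        int sum = 0;
        for (const struct node *p = head; p != NULL; p = p->next)
            sum += p->key;
        return sum;
    }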

7 Prefetching Implementations Sequential and stride prefetching: tagged prefetching, simple stream buffer, stride prefetching. Correlation-based prefetching: Markov prefetching, dead-block correlating prefetching. Precomputation-based: keep running the program on cache misses; or use separate hardware for prefetching; or use compiler-generated threads on multithreaded processors. Other considerations: predict on miss addresses or reference addresses? Prefetch into the cache or a temporary buffer? Demand-based or decoupled prefetching?

8 Recall Stream Buffer Diagram [Figure: a direct-mapped cache backed by a stream buffer; each buffer entry holds a tag, a comparator, and one cache block of data; the head entry is checked in parallel with the cache, and +1 logic allocates the next sequential block from the next level of cache; shown with a single stream buffer (way), though multiple ways and a filter may be used. Source: Jouppi, ISCA ’90]
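A minimal C sketch of the lookup path of one such buffer (field names, depth, and block size are illustrative assumptions, not Jouppi's exact design): a head hit supplies the block, shifts the queue, and prefetches one more sequential block; a miss reallocates the buffer starting just past the miss address.

    #include <stdint.h>
    #include <string.h>

    #define SB_DEPTH 4
    #define BLOCK 64                /* cache block size in bytes */

    struct stream_buffer {
        uint64_t tag[SB_DEPTH];     /* block addresses; tag[0] is the head */
        int valid[SB_DEPTH];
        uint64_t next;              /* next sequential block to prefetch */
    };

    /* Called on a cache miss to block address blk. Returns 1 if the
       stream buffer can supply the block. */
    int sb_lookup(struct stream_buffer *sb, uint64_t blk,
                  void (*prefetch)(uint64_t byte_addr)) {
        if (sb->valid[0] && sb->tag[0] == blk) {
            /* Head hit: shift entries up and refill the tail (+1 logic). */
            memmove(&sb->tag[0], &sb->tag[1],
                    (SB_DEPTH - 1) * sizeof(sb->tag[0]));
            memmove(&sb->valid[0], &sb->valid[1],
                    (SB_DEPTH - 1) * sizeof(sb->valid[0]));
            sb->tag[SB_DEPTH - 1] = sb->next;
            sb->valid[SB_DEPTH - 1] = 1;
            prefetch(sb->next * BLOCK);
            sb->next++;
            return 1;
        }
        /* Miss: restart the stream at the block after the miss address. */
        for (int i = 0; i < SB_DEPTH; i++) {
            sb->tag[i] = blk + 1 + i;
            sb->valid[i] = 1;
            prefetch((blk + 1 + i) * BLOCK);
        }
        sb->next = blk + 1 + SB_DEPTH;
        return 0;
    }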

9 Stride Prefetching Limits of the stream buffer: the program may access data in either direction, e.g.
    for (i = N-1; i >= 0; i--) ...
and data may be accessed in strides, e.g.
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            sum[i] += X[i][j];

10 Stride Prefetching Diagram [Figure: reference prediction table, indexed by the load PC; each entry holds an instruction tag, the previous effective address, a stride, and a state field; the stride is the difference between the current and previous effective addresses, and the prefetch address is the current effective address plus the stride]
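A minimal C sketch of the per-entry update rule (the state names follow Chen and Baer's scheme, but the machine here is simplified relative to their full four-state design; table size and indexing are assumptions): each executed load looks up its PC, compares the newly computed stride against the stored one, and issues a prefetch of addr + stride once the stride has been confirmed.

    #include <stdint.h>

    enum rpt_state { INIT, TRANSIENT, STEADY };

    struct rpt_entry {
        uint64_t tag;            /* PC of the load instruction */
        uint64_t prev_addr;      /* last effective address seen */
        int64_t  stride;
        enum rpt_state state;
    };

    #define RPT_SIZE 64
    static struct rpt_entry rpt[RPT_SIZE];

    /* Called once per executed load with its PC and effective address. */
    void rpt_update(uint64_t pc, uint64_t addr, void (*prefetch)(uint64_t)) {
        struct rpt_entry *e = &rpt[(pc >> 2) % RPT_SIZE];
        if (e->tag != pc) {      /* allocate a fresh entry */
            e->tag = pc; e->prev_addr = addr;
            e->stride = 0; e->state = INIT;
            return;
        }
        int64_t new_stride = (int64_t)(addr - e->prev_addr);
        if (new_stride == e->stride) {
            e->state = STEADY;   /* stride confirmed: safe to prefetch */
        } else {
            e->state = (e->state == STEADY) ? INIT : TRANSIENT;
            e->stride = new_stride;
        }
        e->prev_addr = addr;
        if (e->state == STEADY)
            prefetch(addr + e->stride);
    }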

11 Stride Prefetching Example
float a[100][100], b[100][100], c[100][100];
...
for (i = 0; i < 100; i++)
    for (j = 0; j < 100; j++)
        for (k = 0; k < 100; k++)
            a[i][j] += b[i][k] * c[k][j];

Iteration 1:
    tag      addr   stride  state
    Load b   20000  0       init
    Load c   30000  0       init
    Load a   30000  0       init
Iteration 2:
    tag      addr   stride  state
    Load b   20004  4       transient
    Load c   30400  400     transient
    Load a   30000  0       steady
Iteration 3:
    tag      addr   stride  state
    Load b   20008  4       steady
    Load c   30800  400     steady
    Load a   30000  0       steady
(a[i][j] is invariant across the k loop, so its stride is 0; b[i][k] advances 4 bytes per iteration; c[k][j] advances one row, 100 floats = 400 bytes, per iteration.)
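Tracing the rpt_update sketch above over the Load b addresses reproduces this table: 20000 allocates the entry (init, stride 0); 20004 computes stride 4, which mismatches the stored 0, so the entry goes transient; 20008 sees stride 4 again, the entry goes steady, and a prefetch of 20012 is issued.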

12 Markov Prefetching Target: irregular memory access patterns. Example miss address stream: A B C D C E A C F F E A A B C D E A B C D C. [Figure: Markov model; a table maps each miss address (miss 1 ... miss N) to up to four predicted successor addresses (pred 1 ... pred 4); on a miss, the matching row's predicted addresses are pushed into the prefetch queue] Joseph and Grunwald, ISCA 1997
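A minimal C sketch of the table (the row count, the four-predictor width from the figure, and MRU replacement are assumptions): on each miss, the matching row's predictions are prefetched, then the transition from the previous miss to this one is recorded.

    #include <stdint.h>

    #define MKV_ROWS  1024
    #define MKV_PREDS 4

    struct mkv_row {
        uint64_t miss_addr;
        uint64_t pred[MKV_PREDS];   /* most-recent-first successors */
        int npred;
    };

    static struct mkv_row mkv[MKV_ROWS];
    static uint64_t last_miss;      /* previous miss (0 until the first miss) */

    /* Called on each cache miss to address addr. */
    void markov_miss(uint64_t addr, void (*prefetch)(uint64_t)) {
        /* Predict: push this row's remembered successors to the queue. */
        struct mkv_row *row = &mkv[addr % MKV_ROWS];
        if (row->miss_addr == addr)
            for (int i = 0; i < row->npred; i++)
                prefetch(row->pred[i]);

        /* Train: record the transition last_miss -> addr, newest first. */
        struct mkv_row *prev = &mkv[last_miss % MKV_ROWS];
        if (prev->miss_addr != last_miss) {   /* (re)allocate on conflict */
            prev->miss_addr = last_miss;
            prev->npred = 0;
        }
        if (prev->npred < MKV_PREDS)
            prev->npred++;
        for (int i = prev->npred - 1; i > 0; i--)
            prev->pred[i] = prev->pred[i - 1];
        prev->pred[0] = addr;
        last_miss = addr;
    }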

13 Markov Prefetching Performance [Figure: from left to right, bars show an increasing number of addresses in the table]

14 Markov Prefetching Performance [Figure: from left to right: stream, stride, correlation (Pomerene and Puzak), Markov, stream+stride+Markov serial, stream+stride+Markov parallel]

15 Predictor-Directed Stream Buffer Cons of the existing approaches: stride prefetching (using an updated stream buffer) is only useful for strided accesses and is disturbed by non-stride accesses; Markov prefetching works for general access patterns but requires large history storage (megabytes). PSB combines the two methods to improve the coverage of the stream buffer and to keep the required storage low (several kilobytes). Sair et al., MICRO 2000

16 Predictor-Directed Stream Buffer The Markov prediction table filters out irregular address transitions (reducing stream buffer thrashing). The stream buffer filters out the addresses used to train the Markov prediction table (reducing storage). Which prefetcher is used for each address? A sketch of the changed refill step follows.
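A minimal sketch of how the refill step changes relative to the sequential stream buffer above (function names are assumptions): the +1 next-address logic is replaced by a call into an address predictor, so each buffer follows a predicted stream rather than a purely sequential one.

    #include <stdint.h>

    /* Follow the predictor's chain from the buffer's last allocated
       address, prefetching up to nfill predicted blocks. */
    void psb_refill(uint64_t *last_addr, int nfill,
                    uint64_t (*predict)(uint64_t),  /* stride or Markov */
                    void (*prefetch)(uint64_t)) {
        for (int i = 0; i < nfill; i++) {
            uint64_t next = predict(*last_addr);
            if (next == 0)
                break;          /* no confident prediction: stop filling */
            prefetch(next);
            *last_addr = next;
        }
    }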

17 Precomputation-Based Prefetching Potential problems of stream buffer or Markov prefetching: low accuracy => high memory bandwidth waste. Another approach: use some computation resources for prefetching, because computation is increasingly cheap. Speculative execution for prefetching: no architectural changes, not limited by hardware tables, with high accuracy and good coverage.
    for (i = 0; i < 100; i++) {
        j = ptr[i];
        val[j]++;
    }
    Loop: I1 load  r1=[r2]
          I2 add   r3=r3+1
          I3 add   r6=r3-100
          I4 add   r2=r2+8
          I5 add   r1=r4+r1
          I6 load  r5=[r1]
          I7 add   r5=r5+1
          I8 store [r1]=r5
          I9 blt   r6, Loop
Collins et al., MICRO 2001
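A C rendering of the resulting precomputation slice (a sketch of the idea, not the exact slice from the paper; __builtin_prefetch is the GCC/Clang builtin, and the names mirror the loop above): only the instructions feeding the delinquent load I6, namely I1, I4, and I5, are kept, and I6 itself becomes a non-binding prefetch that runs ahead of the main thread.

    /* I1 + I4 + I5 compute the address of val[j]; I6 becomes a prefetch. */
    void pslice(const long *ptr, const char *val_base, int ahead) {
        for (int d = 0; d < ahead; d++) {
            const char *addr = val_base + ptr[d]; /* I1 (load), I5 (add base) */
            __builtin_prefetch(addr);             /* I6 as a prefetch */
            /* I4 (r2 += 8) is the walk over ptr[], folded into d */
        }
    }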

18 Prefetching by Dynamically Building the Data-Dependence Graph Annavaram et al., “Data prefetching by dependence graph precomputation”, ISCA 2001. Needs external help to identify problematic loads. Builds the dependence graph in reverse order. Uses a separate prefetching engine. [Figure: pipeline with stages IF, PRE-DECODE, DECODE, EXE, WB, COMMIT; a DG generator watches instructions (fields INST, OP1, OP2, IT) in the updated instruction fetch queue, fills a DG buffer, and a separate execution engine runs the graph to issue prefetches]

19 Using “Future” Threads for Prefetching Balasubramonian et al., “Dynamically allocating processor resources between nearby and distant ILP,” ISCA 2001. OOO processors stall on cache misses to DRAM because they exhaust some resource (IQ, ROB, or registers). Why not keep the program running during the stall time for prefetching? Then resources must be reserved for a “future” thread. The future thread continues the execution for prefetching, using the existing OOO pipeline and FUs. It may release registers or ROB entries speculatively, and thus can examine a much larger instruction window while remaining accurate in producing reference addresses.

20 Precomputation with SMT Supporting Speculative Threads Collins et al., “Speculative precomputation: long-range prefetching of delinquent loads,” ISCA 2001. Precomputation is done by an explicit speculative thread (p-thread). The code of p-threads may be constructed by the compiler or by hardware. The main thread spawns p-threads on triggers (e.g., when a trigger PC is encountered) and passes some register values and the initial PC to the p-thread. A p-thread may trigger another p-thread for further prefetching. For more compiler issues, see Luk, “Tolerating Memory Latency through Software-Controlled Pre-Execution in Simultaneous Multithreading Processors”, ISCA 2001.

21 Summary of Advanced Prefetching Prefetching is being actively studied because of the increasing CPU-memory speed gap; it improves cache performance beyond the limit of cache size. Precomputation may be limited in prefetching distance (how good is its timeliness?). Note there is no perfect cache/prefetching solution, e.g.
    while (1) {
        myload(addr);
        addr = myrandom() + addr;
    }
How do we design complexity-effective memory systems for future processors?