Instruction Prefetching Smruti R. Sarangi

Contents  Motivation for Prefetching  Simple Schemes  Recent Work  Proactive Instruction Fetching  Return Address Stack Directed Prefetching  Pentium 4 Trace Cache

Journey till now...  [Figure: the system seen so far: the processor connects to the optimized i-cache and d-cache (2-4 cycle access), then the L2 cache, then main memory and the rest of the system; the access latency grows at each level, reaching hundreds of cycles at main memory]

Misses in the Caches  We can incur a large performance penalty if there is a miss in the i-cache  Even if the line hits in the L2, there will be no instructions to fetch for the duration of the L2 access  If there is a miss in the L2 as well, we have nothing to do for hundreds of cycles  The pipeline will have bubbles (idle cycles), and the IPC will suffer  What is the solution?  Prefetch memory addresses  Meaning: predict the memory addresses that will be accessed in the future, and fetch them from the lower levels of the memory hierarchy before they are required.

Patterns in Programs  Temporal Locality  Tend to use the same data (cache lines) repeatedly  Spatial Locality  High probability of accessing data in the same memory region (similar range of addresses) once again

Contents  Motivation for Prefetching  Simple Schemes  Recent Work  Proactive Instruction Fetching  Return Address Stack Directed Prefetching  Pentium 4 Trace Cache

Instruction Prefetching  1. Find patterns in the i-cache access sequence  2. Leverage these patterns for prefetching

Next line prefetching  Pattern: Spatial locality  If cache line X is accessed, there is a high probability of accessing lines: X+1 and X+2  Reason: The code of most functions spans multiple i-cache lines  Leverage this pattern: If a cache line X incurs an i-cache miss, read X and the next 2 lines from the L2 cache
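A minimal sketch of this policy; the 64-byte line size, the prefetch degree of 2, and the queue name are illustrative assumptions rather than part of the original scheme:

```cpp
#include <cstdint>
#include <queue>

// Assumed parameters: a 64-byte line and a prefetch degree of 2 (lines X+1 and X+2).
constexpr uint64_t LINE_SIZE = 64;
constexpr int PREFETCH_DEGREE = 2;

// Prefetch requests wait here and are later sent to the L2 at low priority.
std::queue<uint64_t> prefetchQueue;

void onICacheMiss(uint64_t missAddress) {
    // Align the miss address to a line boundary to get line X.
    uint64_t lineX = missAddress & ~(LINE_SIZE - 1);
    // The demand miss for X is serviced normally; we additionally enqueue
    // prefetches for the next few sequential lines (spatial locality).
    for (int i = 1; i <= PREFETCH_DEGREE; ++i) {
        prefetchQueue.push(lineX + i * LINE_SIZE);
    }
}
```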

Markov prefetching  Pattern: i-cache miss sequences have high repeatability  The miss sequence of a core looks like this: X Y … X Y … X Y … X Y …  There is high correlation between consecutive misses  Leverage this pattern for prefetching

Markov prefetching  Record the i-cache miss sequence in a table  Given that cache line X incurs a miss, predict the line that will incur a miss next  Miss sequence: X Y … X Y … X Z … X Y …  Markov table entry for line X: Option 1: address Y, frequency 3 (the higher-probability successor); Option 2: address Z, frequency 1

Markov Predictors - II  We can instead have an n-history predictor that takes the last n misses into account  Whenever there is a miss, access a prefetch table. Find the next few addresses to prefetch. Issue prefetch instructions.  Also, update the frequencies of the entries for the last n-1 misses.  All prefetch requests go to a prefetch queue. These requests are subsequently sent to the L2 cache (lower priority).
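A software sketch of a 1-history Markov prefetcher along these lines; the table organization (an unbounded map with per-line successor lists) is an assumption made for clarity, whereas a real table would be a fixed-size, set-associative structure:

```cpp
#include <algorithm>
#include <cstdint>
#include <queue>
#include <unordered_map>
#include <vector>

// Candidate next-miss line together with how often it followed the indexing line.
struct Successor { uint64_t line; uint32_t frequency; };

std::unordered_map<uint64_t, std::vector<Successor>> markovTable;
std::queue<uint64_t> prefetchQueue;   // drained towards the L2 at low priority
uint64_t lastMissLine = 0;
bool haveLastMiss = false;

void onICacheMiss(uint64_t missLine) {
    // Predict: prefetch the recorded successors of the line that just missed.
    auto it = markovTable.find(missLine);
    if (it != markovTable.end()) {
        for (const Successor& s : it->second) prefetchQueue.push(s.line);
    }
    // Learn: record that 'missLine' followed the previous miss, bumping its frequency.
    if (haveLastMiss) {
        auto& succ = markovTable[lastMissLine];
        auto pos = std::find_if(succ.begin(), succ.end(),
                                [&](const Successor& s) { return s.line == missLine; });
        if (pos != succ.end()) ++pos->frequency;
        else succ.push_back({missLine, 1});
    }
    lastMissLine = missLine;
    haveLastMiss = true;
}
```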

Call Graph Prefetching  Pattern: The function access sequence has high repeatability  Leverage this pattern: predict and prefetch the function that may be called next

Call graph prefetching  Example: create_record( ) calls prepare_page( ), update_page( ), and unlock_page( ) in sequence  prepare_page( ) checks whether the page is in memory; if NOT IN MEMORY it calls read_from_disk( ), and then it calls lock_page( )  The resulting call graph: create_record calls prepare_page, update_page, and unlock_page; prepare_page calls read_from_disk and lock_page

Call graph prefetching  [Figure: the call graph of the example: create_record has edges to prepare_page, update_page, and unlock_page; prepare_page has edges to read_from_disk and lock_page]

Call graph prefetching  create_record originally executes: prepare_page( ); update_page( ); unlock_page( )  create_record becomes: prefetch prepare_page; … ; prepare_page( ); prefetch update_page; update_page( ); prefetch unlock_page; unlock_page( )

Call Graph History Cache  Each entry in the data array contains a list of functions to be subsequently executed.  The index maintains a pointer to a function in the data array entry.  [Figure: each tag array entry holds a function id and an index; the corresponding data array entry holds the call sequence func 1, func 2, func 3, func 4, func 5]

Operation (Function Call)  Function call: X calls Y  Access the entry of Y in the CGHC  Does it exist?  NO: create the entry  YES: prefetch the first function in Y’s call sequence  Put Y in X’s call sequence at the appropriate index

Operation (Function Return)  Y returns to X  Reset the index of Y to 1  Access X’s call sequence  Increment the index  Prefetch the next function in the call sequence
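The two flowcharts can be condensed into a small software model of the CGHC; the entry size, the use of start PCs as function ids, and the prefetch() hook are assumptions made for illustration:

```cpp
#include <array>
#include <cstdint>
#include <unordered_map>

// Assumed sizes and hooks: 5 slots per call sequence, functions identified by
// their start PC, and prefetch() standing in for an i-cache prefetch request.
constexpr int SEQ_LEN = 5;

struct CGHCEntry {
    std::array<uint64_t, SEQ_LEN> callSeq{};  // functions called from this function, in order
    int index = 1;                            // 1-based position of the next callee
};

std::unordered_map<uint64_t, CGHCEntry> cghc;

void prefetch(uint64_t funcPC) { /* enqueue an i-cache prefetch for funcPC (stub) */ }

void onCall(uint64_t callerX, uint64_t calleeY) {
    CGHCEntry& entryY = cghc[calleeY];          // creates the entry if it does not exist
    if (entryY.callSeq[0] != 0)
        prefetch(entryY.callSeq[0]);            // first function Y is expected to call
    CGHCEntry& entryX = cghc[callerX];
    if (entryX.index <= SEQ_LEN)
        entryX.callSeq[entryX.index - 1] = calleeY;   // record Y at X's current index
}

void onReturn(uint64_t calleeY, uint64_t callerX) {
    cghc[calleeY].index = 1;                    // Y's call sequence starts over next time
    CGHCEntry& entryX = cghc[callerX];
    ++entryX.index;                             // move past the callee that just returned
    if (entryX.index <= SEQ_LEN && entryX.callSeq[entryX.index - 1] != 0)
        prefetch(entryX.callSeq[entryX.index - 1]);   // next function X is expected to call
}
```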

Patterns:  Next line prefetching: spatial locality  Markov prefetching: the i-cache miss sequence has high repeatability  Call graph prefetching: the function access sequence has high repeatability

Contents  Motivation for Prefetching  Simple Schemes  Recent Work  Proactive Instruction Fetching  Return Address Stack Directed Prefetching  Pentium 4 Trace Cache

Proactive Instruction Fetch  Insight:  Instructions on the wrong path lead to spurious prefetches and corrupt the state of the prefetcher  Hence, train the prefetcher using the committed (retired) instruction stream  Divide addresses into spatial regions  Each region has a trigger address (first instruction in the region)  [Figure: retired instructions flow from the commit stage into the compactor]

 Operation of the spatial compactor  Compare the block address of the retiring instruction with the block address of the previously retired instruction. If they are the same, discard.  The spatial compactor creates a region of blocks. A region has N blocks preceding the trigger and M blocks after the trigger.  The trigger is the first accessed block in a spatial region.  For each region we maintain a bit vector (1 bit per block). When a block is accessed, we set its bit.  If the program accesses a block in another spatial region, send the trigger PC and the bit vector to the temporal compactor.  Example: the bit vector tracks blocks A-2 … A … A+2; accesses to block A-2 and block A set the corresponding bits
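A sketch of the spatial compactor under assumed parameters (N = 2, M = 5, 64-byte blocks); sendToTemporalCompactor() is a placeholder for the hand-off described on the next slide:

```cpp
#include <bitset>
#include <cstdint>

// Assumed region shape (N = 2 blocks before the trigger, M = 5 after) and block size.
constexpr int N = 2, M = 5;
constexpr uint64_t BLOCK_SIZE = 64;

struct SpatialRecord {
    uint64_t triggerPC = 0;                 // PC of the first access in the region
    uint64_t triggerBlock = 0;
    std::bitset<N + 1 + M> accessed;        // one bit per block in [trigger-N, trigger+M]
    bool valid = false;
};

SpatialRecord current;
uint64_t lastRetiredBlock = 0;

void sendToTemporalCompactor(const SpatialRecord& r) { /* downstream hook (stub) */ }

void onRetire(uint64_t pc) {
    uint64_t block = pc / BLOCK_SIZE;
    if (current.valid && block == lastRetiredBlock) return;   // same block as before: discard
    lastRetiredBlock = block;

    if (current.valid) {
        int64_t offset = (int64_t)block - (int64_t)current.triggerBlock;
        if (offset >= -N && offset <= M) {      // still inside the current spatial region
            current.accessed.set(offset + N);
            return;
        }
        sendToTemporalCompactor(current);        // left the region: emit (trigger PC, bit vector)
    }
    current = SpatialRecord{};                   // start a new region with this block as trigger
    current.triggerPC = pc;
    current.triggerBlock = block;
    current.accessed.set(N);
    current.valid = true;
}
```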

Temporal Compactor  Maintains a list of spatial region records (say, the last 4)  Code typically has a lot of temporal locality  If the given record matches any one of the stored records, discard the new record and promote the stored one (mark it as most recently used)  Otherwise, discard the least recently used record, create a new record, and also send it to the history buffer
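A sketch of this filter, assuming 4 stored records and exact matching on the (trigger PC, bit vector) pair:

```cpp
#include <cstdint>
#include <deque>

// Assumed: 4 stored records, and records compared by trigger PC and bit vector.
struct Record { uint64_t triggerPC; uint32_t bitVector; };

constexpr size_t CAPACITY = 4;
std::deque<Record> mruList;                      // front = most recently used

void appendToHistoryBuffer(const Record& r) { /* downstream hook (stub) */ }

void onSpatialRecord(const Record& r) {
    for (auto it = mruList.begin(); it != mruList.end(); ++it) {
        if (it->triggerPC == r.triggerPC && it->bitVector == r.bitVector) {
            Record hit = *it;
            mruList.erase(it);                   // discard the duplicate record...
            mruList.push_front(hit);             // ...and promote the stored one to MRU
            return;
        }
    }
    if (mruList.size() == CAPACITY) mruList.pop_back();   // evict the least recently used record
    mruList.push_front(r);
    appendToHistoryBuffer(r);                    // only new records reach the history buffer
}
```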

History Buffer  Circular buffer  Stores records of retired instruction streams  Each entry stores the trigger PC and the bit vector of a spatial region  An index table saves a mapping: trigger PC → most recent record in the history buffer  Operation:  There are two kinds of instruction fetches: instructions that were prefetched, and instructions that were not  When a core fetches an instruction that was not prefetched, trigger the prefetching mechanism  Search for the PC in the index table  If we find an entry, prefetch the recorded instruction stream (proactive fetching)

Stream Address Buffer  The stream address buffer maintains a list of all the blocks that need to be prefetched  This information is computed from the records saved in the history buffer  [Figure: a trigger PC indexes the history buffer; the retrieved records fill the stream address buffer, which issues prefetches into the L1 cache]
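The lookup path of the last three slides can be summarized in a short sketch; the record layout, the stream length, and the flat (non-circular) history buffer are simplifying assumptions:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Assumed record layout and stream length; the real history buffer is a circular
// hardware structure, modeled here as a flat vector.
struct Record { uint64_t triggerPC; uint32_t bitVector; };

std::vector<Record> historyBuffer;
std::unordered_map<uint64_t, size_t> indexTable;   // trigger PC -> most recent record
constexpr size_t STREAM_LEN = 8;                   // records replayed per trigger (assumed)

void fillStreamAddressBuffer(const Record& r) { /* expand the bit vector into block prefetches (stub) */ }

void onDemandFetch(uint64_t pc, bool wasPrefetched) {
    if (wasPrefetched) return;                     // only non-prefetched fetches act as triggers
    auto it = indexTable.find(pc);
    if (it == indexTable.end()) return;            // no recorded stream starts at this PC
    // Replay the stream recorded after this trigger: each record's blocks are placed in
    // the stream address buffer and prefetched into the L1 cache.
    size_t end = std::min(historyBuffer.size(), it->second + STREAM_LEN);
    for (size_t i = it->second; i < end; ++i)
        fillStreamAddressBuffer(historyBuffer[i]);
}
```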

Contents  Motivation for Prefetching  Simple Schemes  Recent Work  Proactive Instruction Fetching  Return Address Stack Directed Prefetching  Pentium 4 Trace Cache

Motivation  I-cache misses are a crucial determinant of performance  Particularly in server-based applications  They can account for 10-30% of the overall performance  PIF (proactive instruction fetching) has high storage overheads: the history buffer and the stream address buffer  Goal: reduce the storage overheads  Use the RAS (return address stack) instead  It effectively captures the current program context

RDIP: Basic Idea  Three steps  Record the i-cache misses in a given program context (sequence of function calls)  Predict the next program context based on the current context  Prefetch the cache misses associated with the next context  Representing the program context  Take all the RAS entries and XOR them together to form a 32-bit signature  This is a signature of the path taken to reach the current function  Problem: assume a caller calls multiple callees. After returning from each callee, the state of the RAS is identical.  Solution: take the signature before processing a return instruction.
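A sketch of the signature computation; the way 64-bit return addresses are folded into 32 bits is an assumption, the slide only says the signature is 32 bits wide:

```cpp
#include <cstdint>
#include <vector>

// Compute the RDIP context signature by XOR-ing the return addresses currently on
// the RAS. Taking the signature *before* a return is processed keeps the finishing
// callee's address in the mix, so returns from different callees of the same caller
// produce different signatures.
uint32_t rasSignature(const std::vector<uint64_t>& ras) {
    uint32_t sig = 0;
    for (uint64_t retAddr : ras)
        sig ^= static_cast<uint32_t>(retAddr ^ (retAddr >> 32));   // fold 64 bits into 32
    return sig;
}
```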

Miss Table  The miss table maps a signature to a set of i-cache misses  Maintain a buffer of the last k misses  Whenever the call signature changes, update the miss table  Compress each set of misses using a scheme similar to PIF  When updating a miss table entry, merge the new sequence of misses with the sequence already stored

Program Context Prediction  Instead of logging misses with the current signature, log them with a previous signature  To increase the lookahead, log them with the previous k signatures  Maintain a circular buffer with the last k signatures  This keeps things simple: at run time, just issue prefetches for the entry corresponding to the current signature
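Combining the miss table and the lookahead scheme into one sketch; the lookahead depth, the containers, and the use of plain block sets (instead of PIF-style compressed records) are assumptions:

```cpp
#include <cstdint>
#include <deque>
#include <set>
#include <unordered_map>
#include <vector>

// Assumed: lookahead depth K, simple set-of-blocks storage, and a prefetch() hook.
constexpr size_t K = 2;

std::unordered_map<uint32_t, std::set<uint64_t>> missTable;  // signature -> i-cache miss set
std::deque<uint32_t> recentSignatures;                       // circular buffer of last K signatures
std::vector<uint64_t> pendingMisses;                         // misses seen since the last signature change

void prefetch(uint64_t blockAddr) { /* enqueue an i-cache prefetch (stub) */ }

void onICacheMiss(uint64_t blockAddr) { pendingMisses.push_back(blockAddr); }

void onSignatureChange(uint32_t newSig) {
    // Lookahead: credit the buffered misses to the *previous* K signatures, merging
    // them with whatever those entries already hold.
    for (uint32_t oldSig : recentSignatures)
        missTable[oldSig].insert(pendingMisses.begin(), pendingMisses.end());
    pendingMisses.clear();

    recentSignatures.push_back(newSig);
    if (recentSignatures.size() > K) recentSignatures.pop_front();

    // At run time, simply prefetch whatever was recorded under the current signature;
    // because of the lookahead, those misses lie ahead of the current context.
    auto it = missTable.find(newSig);
    if (it != missTable.end())
        for (uint64_t block : it->second) prefetch(block);
}
```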

Contents  Motivation for Prefetching  Simple Schemes  Recent Work  Proactive Instruction Fetching  Return Address Stack Directed Prefetching  Pentium 4 Trace Cache

Trace Cache  Execution of a typical program proceeds through a sequence of basic blocks: Basic Block 1 → Basic Block 2 → Basic Block 3 → Basic Block 4  Basic block: a set of instructions with a single point of entry and a single point of exit  We fetch a sequence of basic blocks from the i-cache  Such a sequence is called a trace  Let us have a cache that stores such traces. We then have the option of sequentially reading out a full trace  This is called a trace cache (first major use: Pentium 4, circa 2000)

Earlier approaches  Traditional approach  A cache line contains contiguous cache blocks  Peleg and Weiser  Store successive basic blocks in a cache line  Trace segments cannot span multiple cache lines  Melvin et al.  CISC processors decode instructions into micro-ops  The idea is to store the decoded micro-ops of each instruction in a cache line  Saves some decoding effort  Can we combine and augment these solutions?

Structure of a Solution  A trace consists of multiple cache lines  It is a linked list of cache lines (each line is a trace segment)  The segments are classified as the trace head, trace body, and trace tail  We need a method to place a marker that terminates a trace and starts a new one
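One way to picture a trace segment is as a node of a doubly linked list threaded through the cache; the field names and widths below are illustrative assumptions, not the actual Pentium 4 layout:

```cpp
#include <array>
#include <cstdint>

// Assumed field names and sizes; in hardware this information is split between the
// tag array (segment type, pointers) and the data array (micro-ops).
enum class SegmentKind { Head, Body, Tail };

struct TraceLine {
    bool valid = false;
    SegmentKind kind = SegmentKind::Head;   // head, body, or tail of the trace
    std::array<uint32_t, 6> uops{};         // decoded micro-ops in this data line (capacity assumed)
    uint8_t uopCount = 0;                   // the entire line need not hold valid micro-ops
    uint16_t nextSet = 0;                   // set index of the next trace segment
    uint8_t  nextWay = 0;                   // way index of the next trace segment
    uint16_t prevSet = 0;                   // back pointer to the previous trace segment
    uint8_t  prevWay = 0;
};
```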

Operation  Look up a fetched address. See if it is a trace head  If yes, start prefetching the trace  Each line of the trace contains a pointer to its previous trace segment and its next trace segment  The entire data line need not contain valid instructions  Never distribute the micro-ops of a macro instruction across cache lines  Terminate a data line if it contains more branch micro-ops than a threshold  Trace segment termination conditions:  An indirect branch is encountered  An interrupt or branch misprediction notification arrives  The maximum length of the trace (64 sets) is reached

Bird’s eye view of traces  [Figure: a 4-way trace cache (Way 0 to Way 3) shown across Sets 1 to 5; a trace snakes through the array with its head in one set, body segments in successive sets (possibly in different ways), and its tail at the end]

Hardware Support  Augment each entry in the tag array with:  Trace head, body, and tail information  The id of the next trace segment: store the way index, and the set index (if it is in a different set)  There is also a need to do some predecoding within the same line  Save some additional pointers: uIP (identifying the next micro-instruction) and NLIP (the address of the next macro instruction)

Execution Mode  The trace cache starts in the idle state  It transitions to the head lookup state  Look up the fetched address in the tag array and see if it is a trace head  Once we find a head  Transition to the body lookup state  By default the next trace segment is in the next set; the way is specified in the current line  Keep reading the micro-ops and sending them to the processor  If there is an interrupt or branch misprediction, abandon the trace and go back to head lookup  Once we reach the tail  Transition to the tail state  Then restart from the head lookup state

Other States  During this process several things can go wrong  We may encounter a complex CISC macro-instruction  Transition to the MS (microcode sequencer) state  For that particular instruction, read its micro-ops from the microcode memory and insert them into the pipeline  Then transition back to the body lookup state  Body miss state  We enter this state when there is a cache miss while reading a trace segment  Abandon the read and start building a trace
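The last two slides describe a state machine; below is a sketch of its transitions, with state and event names paraphrased from the slide text rather than taken from the actual design:

```cpp
enum class TCState { Idle, HeadLookup, BodyLookup, MicrocodeSequencer, BodyMiss, Tail };

enum class TCEvent { Start, HeadHit, ComplexInstruction, MicrocodeDone,
                     CacheMiss, TailReached, MispredictOrInterrupt };

// Next-state function for the trace cache's read path.
TCState next(TCState s, TCEvent e) {
    switch (s) {
        case TCState::Idle:
            return (e == TCEvent::Start) ? TCState::HeadLookup : s;
        case TCState::HeadLookup:
            // Keep looking up fetched addresses until one matches a trace head.
            return (e == TCEvent::HeadHit) ? TCState::BodyLookup : s;
        case TCState::BodyLookup:
            // While in this state, micro-ops are streamed to the processor; the next
            // segment is by default in the next set, in the way named by the current line.
            if (e == TCEvent::ComplexInstruction)    return TCState::MicrocodeSequencer;
            if (e == TCEvent::CacheMiss)             return TCState::BodyMiss;
            if (e == TCEvent::TailReached)           return TCState::Tail;
            if (e == TCEvent::MispredictOrInterrupt) return TCState::HeadLookup;  // abandon the trace
            return s;
        case TCState::MicrocodeSequencer:
            // Micro-ops for the complex instruction come from microcode memory.
            return (e == TCEvent::MicrocodeDone) ? TCState::BodyLookup : s;
        case TCState::BodyMiss:
            return TCState::HeadLookup;   // abandon reading; trace building takes over
        case TCState::Tail:
            return TCState::HeadLookup;   // the trace is done; restart the lookup
    }
    return s;
}
```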

Building a Trace  During trace building, we do not read from the trace cache  We start in the fetch request state  Send the PC to the lower level of the hierarchy for its contents  After the line comes back, wait for it to get decoded  After an instruction is decoded into its uOPs, place them in the fill buffer  Keep doing this until a data line termination condition is encountered  Then write the contents of the fill buffer to the trace cache  There are several methods to find an entry in the trace cache to write to:  The next set  The next way in the current set  Another heuristic: the least used entry  Keep building until a trace end condition is encountered
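A sketch of the fill-buffer side of trace building; the line capacity, the branch-micro-op threshold, and writeLineToTraceCache() (which stands in for the victim-selection heuristics listed above) are assumptions:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed line capacity and branch-micro-op threshold.
constexpr size_t LINE_CAPACITY = 6;
constexpr int MAX_BRANCH_UOPS = 2;

struct FillBuffer { std::vector<uint32_t> uops; };

void writeLineToTraceCache(const FillBuffer& fb) { /* pick a victim entry and write (stub) */ }

bool dataLineTerminated(const FillBuffer& fb, int branchUopsInLine) {
    // Stop filling the current data line when it is full or when it already holds
    // more branch micro-ops than the threshold.
    return fb.uops.size() >= LINE_CAPACITY || branchUopsInLine > MAX_BRANCH_UOPS;
}

void onInstructionDecoded(FillBuffer& fb, const std::vector<uint32_t>& uopsOfOneInstr,
                          int branchUopsInLine) {
    // Never split one macro-instruction's micro-ops across data lines: if they do not
    // fit, terminate the current line first.
    if (fb.uops.size() + uopsOfOneInstr.size() > LINE_CAPACITY && !fb.uops.empty()) {
        writeLineToTraceCache(fb);
        fb.uops.clear();
    }
    for (uint32_t uop : uopsOfOneInstr) fb.uops.push_back(uop);
    if (dataLineTerminated(fb, branchUopsInLine)) {
        writeLineToTraceCache(fb);   // commit the partially built data line
        fb.uops.clear();             // continue the trace in a fresh line
    }
}
```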

Important Claims  A separate trace segment reader and a separate trace builder  The trace cache state machine  The notion of partially building a trace and storing it in the fill buffer