EENG449b/Savvides Lec /17/04 February 17, 2004 Prof. Andreas Savvides Spring EENG 449bG/CPSC 439bG Computer Systems Lecture 11 Instruction Level Parallelism II
Announcements
Midterm next Thursday, 02/19/04
TA extra office hour
  –Sobeeh will have an extra office hour tomorrow
  –Office hours 5:00 – 7:00pm, AKW 201
Reading for this lecture: Chapter 3 pages
Dynamic Hardware Prediction
Last time: Tomasulo’s Algorithm for ILP
  –Dynamic scheduling
  –Register renaming
  –Dynamic memory disambiguation
    »Avoid conflicts in load and store instructions
  –Tomasulo’s algorithm deals with data dependences
Today: Dynamic branch prediction
  –Deal with control dependences
  –Control dependences become the limiting factor in ILP optimizations
    »Remember from last lecture: basic block sizes are between 4 and 7 instructions
Predicting Branches
In Appendix A: static techniques
  –Delay slot execution
  –Action taken does not depend on the dynamic behavior of a branch
Dynamic branch prediction
  –Try to predict the outcome of a branch early on in order to avoid stalls
  –Branch prediction is critical for multiple issue processors
    »In an n-issue processor, branches arrive up to n times faster than in a single issue processor
Branch Prediction Metrics
To evaluate the effectiveness of branch prediction you need to consider
  –Prediction accuracy
  –Penalties associated with branch taken and branch not taken
  –The associated penalties are artifacts of
    »Pipeline design
    »Type of predictor
    »Branch frequency
    »Strategy to deal with the misprediction
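One common way to combine these metrics is to estimate the stall cycles that mispredicted branches add to each instruction. The sketch below assumes a single fixed misprediction penalty (the slide distinguishes taken and not-taken penalties); the function name and all numbers are illustrative and do not come from the lecture.

```python
def branch_stall_cpi(branch_freq, mispredict_rate, penalty_cycles):
    """Average stall cycles per instruction caused by mispredicted branches.

    Assumes every misprediction costs the same number of cycles.
    """
    return branch_freq * mispredict_rate * penalty_cycles


# Hypothetical workload: 20% branches, 10% of them mispredicted, 3-cycle penalty.
extra = branch_stall_cpi(0.20, 0.10, 3)
print(f"CPI = {1.0 + extra:.2f}")
```

Under these assumed numbers the branches add 0.06 cycles per instruction on top of a base CPI of 1.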
Basic Branch Predictor
Use a 1-bit branch prediction buffer, also called a branch history table
1 bit of memory stating whether the branch was recently taken or not
  –Indexed by the lower portion of the branch instruction address
Bit entry updated each time the branch instruction is executed
Problem with 1-bit prediction
  –A branch that is almost always taken is mispredicted twice, rather than once, each time it is not taken
  –Imagine executing a loop
    »Predictor will be wrong on the last iteration, and again on the first iteration the next time the loop runs
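The loop problem can be checked with a small simulation. This is a minimal sketch of a single 1-bit entry, assuming it starts out predicting not taken; the function name and the loop length are made up for illustration.

```python
def run_one_bit(outcomes, initial=False):
    """Count mispredictions of one 1-bit predictor entry (True means taken)."""
    prediction = initial
    misses = 0
    for taken in outcomes:
        if prediction != taken:
            misses += 1
        prediction = taken  # the entry remembers only the most recent outcome
    return misses


# Hypothetical loop branch: taken 9 times, then not taken on loop exit.
loop = [True] * 9 + [False]
print(run_one_bit(loop * 3))  # the same loop executed three times
```

Each execution of the loop costs two mispredictions (first and last iteration), so the three executions above give 6 mispredictions out of 30 branches.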
A 2-bit Prediction Scheme
2-bit prediction scheme
  –Can be generalized to n-bit prediction
A prediction must miss twice before it is changed
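A 2-bit scheme is often realized as a saturating counter. The sketch below assumes that variant (states 0 to 3, upper half predicts taken); the state diagram from the slide is not reproduced in this text, so its transitions may differ in detail. Names and the loop length are illustrative.

```python
def run_two_bit(outcomes, counter=0):
    """Count mispredictions of one 2-bit saturating counter (states 0..3).

    States 2 and 3 predict taken; states 0 and 1 predict not taken.
    """
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        # Saturating update: step toward 3 on taken, toward 0 on not taken.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses


# Hypothetical loop branch: taken 9 times, then not taken on loop exit.
loop = [True] * 9 + [False]
print(run_two_bit(loop * 3, counter=3))  # start in the strongly-taken state
```

Starting from the strongly-taken state, three executions of this loop cost three mispredictions, one per loop exit, because a single not-taken outcome is not enough to flip the prediction.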
Branch Prediction Implementation Implications
Branch predictors are held in branch prediction buffers
  –Implemented as small caches accessed with the instruction address during the IF phase of the pipeline
  –OR implemented as a pair of bits attached to each block in the instruction cache
This branch prediction scheme does not help in the basic 5-stage pipeline
  –The decision whether a branch is taken and the target address are computed in the same stage, so predicting the direction alone saves no cycles
Branch Prediction Accuracy on SPEC 89 Benchmark
Using 2-bit prediction, 4KB cache
[Figure: results grouped into FP programs and Integer programs]
Performance on SPEC 89 Benchmark
Remember
  –To evaluate performance you need to know the branch frequencies and misprediction penalties
FP programs typically come from scientific applications and are more loop based
Branches are harder to predict in integer programs
  –Typically have higher branch frequency
How can this be improved?
  –Perhaps increase the size of the prediction buffer
  –Increase the effectiveness of the predictor
Effects of Cache Buffer Size
Correlating Bit Predictors
What about considering the behavior of branches other than the one we are trying to predict?
Goal: Use correlating or 2-level predictors to exploit the correlation between consecutive branches
Branch Correlation Example
    if (aa==2) aa=0;
    if (bb==2) bb=0;
    if (aa!=bb){

        DSUBUI R3, R1, #2
        BNEZ   R3, L1      ; branch b1 (aa!=2)
        DADD   R1, R0, R0  ; aa=0
    L1: DSUBUI R3, R2, #2
        BNEZ   R3, L2      ; branch b2 (bb!=2)
        DADD   R2, R0, R0  ; bb=0
    L2: DSUBU  R3, R1, R2  ; R3=aa-bb
        BEQZ   R3, L3      ; branch b3 (aa==bb)

Branch b3 is correlated with b1 and b2: if b1 and b2 are both not taken, aa and bb are both 0, so b3 is taken
Correlated Branch Example
Consider the following code:
    if (d==0) d=1;
    if (d==1)

        BNEZ   R1, L1      ; branch b1
        DADDUI R1, R0, #1
    L1: DADDUI R3, R1, #-1
        BNEZ   R3, L2      ; branch b2
        …
    L2:

What are the possible execution sequences when d=0,1,2?
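The question can be answered by tracing the fragment for each starting value. This sketch assumes that d lives in R1, which the assembly suggests but the slide does not state; the function name is made up for illustration.

```python
def run_fragment(d):
    """Return (b1_taken, b2_taken) for one pass through the fragment."""
    b1_taken = d != 0          # BNEZ R1, L1 is taken when d != 0
    if not b1_taken:
        d = 1                  # DADDUI R1, R0, #1 runs only on fall-through
    b2_taken = (d - 1) != 0    # BNEZ R3, L2 with R3 = d - 1
    return b1_taken, b2_taken


for d in (0, 1, 2):
    b1, b2 = run_fragment(d)
    print(f"d={d}: b1 {'T' if b1 else 'NT'}, b2 {'T' if b2 else 'NT'}")
```

The trace shows the correlation: whenever b1 is not taken, d has just been set to 1, so b2 is not taken either.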
Using a 1-bit Predictor
Consider a sequence of d=2,0,2,0 and a 1-bit predictor for each branch, initialized to not taken
(P. = prediction, A. = action, NP. = new prediction)

    d    P. b1  A. b1  NP. b1  P. b2  A. b2  NP. b2
    2    NT     T      T       NT     T      T
    0    T      NT     NT      T      NT     NT
    2    NT     T      T       NT     T      T
    0    T      NT     NT      T      NT     NT
All branches are mispredicted!
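The 1-bit result can be reproduced with a short simulation. This is a sketch under the same assumptions as the table (one 1-bit entry per branch, both initialized to not taken, d held in R1); the helper names are illustrative.

```python
def outcomes(d):
    """(b1_taken, b2_taken) for one pass through the fragment."""
    b1 = d != 0
    if not b1:
        d = 1
    return b1, d != 1


def simulate_one_bit(values):
    """Separate 1-bit predictors for b1 and b2; returns (misses, branches)."""
    pred = {"b1": False, "b2": False}   # False means predict not taken
    misses = 0
    total = 0
    for d in values:
        for name, taken in zip(("b1", "b2"), outcomes(d)):
            misses += pred[name] != taken
            total += 1
            pred[name] = taken          # remember only the last outcome
    return misses, total


print(simulate_one_bit([2, 0, 2, 0]))
```

For the alternating sequence both counts come out as 8: each branch flips direction on every pass, so the remembered outcome is always wrong.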
Using a 1-bit Predictor with 1-bit Correlation
Each branch has a pair of prediction bits, written X/X
  –First X: prediction if the last branch was NOT taken
  –Second X: prediction if the last branch was taken
NOTE: "last branch" refers to the preceding branch instruction, not the previous execution of the current branch instruction
Using a 1-bit Predictor with 1-bit Correlation
Consider a sequence of d=2,0,2,0 and a 1-bit predictor with 1 bit of correlation, initialized to NT/NT
(P. = prediction, A. = action, NP. = new prediction)

    d    P. b1   A. b1  NP. b1  P. b2   A. b2  NP. b2
    2    NT/NT   T      T/NT    NT/NT   T      NT/T
    0    T/NT    NT     T/NT    NT/T    NT     NT/T
    2    T/NT    T      T/NT    NT/T    T      NT/T
    0    T/NT    NT     T/NT    NT/T    NT     NT/T
Mispredictions occur only on the first iteration (d=2)!
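The correlating predictor can be simulated the same way. This sketch assumes all prediction bits start as not taken and that the branch preceding the first b1 counts as not taken (the slide does not say which initial history applies); the helper names are illustrative.

```python
def outcomes(d):
    """(b1_taken, b2_taken) for one pass through the fragment."""
    b1 = d != 0
    if not b1:
        d = 1
    return b1, d != 1


def simulate_correlating(values):
    """Each branch keeps two 1-bit predictions, selected by whether the most
    recently executed branch was taken. Returns the misprediction count."""
    # pred[name][0]: used if the last branch was not taken
    # pred[name][1]: used if the last branch was taken
    pred = {"b1": [False, False], "b2": [False, False]}
    last_taken = False  # assumed initial history
    misses = 0
    for d in values:
        for name, taken in zip(("b1", "b2"), outcomes(d)):
            misses += pred[name][last_taken] != taken
            pred[name][last_taken] = taken
            last_taken = taken
    return misses


print(simulate_correlating([2, 0, 2, 0]))
```

Only the two branches of the first pass are mispredicted; after that, b1 has learned T/NT and b2 has learned NT/T, matching the last rows of the table.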
(m,n) Predictors
Use the behavior of the last m branches to choose from 2^m branch predictors
Each is an n-bit predictor for a single branch
Ex. A (2,2) branch predictor
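A minimal sketch of an (m,n) predictor follows. It assumes n-bit saturating counters, a global history register shared by all branches, and indexing by the branch address modulo the number of entries; the class and method names are hypothetical, not from the lecture.

```python
class MNPredictor:
    """(m, n) predictor: m bits of global history select one of 2**m n-bit
    saturating counters in the entry indexed by the branch address."""

    def __init__(self, m, n, entries):
        self.m, self.n, self.entries = m, n, entries
        self.history = 0  # outcomes of the last m branches, newest in bit 0
        # table[entry][history] holds an n-bit counter, initially 0.
        self.table = [[0] * (2 ** m) for _ in range(entries)]

    def predict(self, pc):
        counter = self.table[pc % self.entries][self.history]
        return counter >= 2 ** (self.n - 1)  # upper half predicts taken

    def update(self, pc, taken):
        row = self.table[pc % self.entries]
        top = 2 ** self.n - 1
        c = row[self.history]
        row[self.history] = min(c + 1, top) if taken else max(c - 1, 0)
        # Shift the new outcome into the m-bit global history.
        self.history = ((self.history << 1) | int(taken)) & (2 ** self.m - 1)

    def size_bits(self):
        """Total storage: 2**m counters of n bits for each entry."""
        return (2 ** self.m) * self.n * self.entries
```

For example, `MNPredictor(2, 2, 1024).size_bits()` evaluates to 8192, so a (2,2) predictor with 1024 entries needs 8K bits of state. With m=0 the class reduces to a plain n-bit predictor.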
Tournament Predictors
N-bit predictors: use local information
(m,n) predictors: use global information
Tournament predictors
  –Local + global: enhanced performance
Example of tournament predictors
  –Multilevel branch predictors
    »Use several levels of branch prediction tables
    »Have an algorithm to select from multiple predictors
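One way to select between a local and a global predictor is a per-branch 2-bit selector counter. The sketch below assumes that design, with 2-bit counters for both components and a 4-bit global history; the lecture does not specify the selection algorithm, and all names and sizes here are hypothetical.

```python
def bump(counter, taken):
    """Saturating 2-bit update: toward 3 on taken, toward 0 on not taken."""
    return min(counter + 1, 3) if taken else max(counter - 1, 0)


class Tournament:
    """Chooses per branch between a local and a global 2-bit predictor."""

    def __init__(self, entries=1024, history_bits=4):
        self.entries = entries
        self.mask = 2 ** history_bits - 1
        self.history = 0                          # recent outcomes, all branches
        self.local = [1] * entries                # indexed by branch address
        self.glob = [1] * (2 ** history_bits)     # indexed by global history
        self.choice = [1] * entries               # >= 2 means "trust global"

    def predict(self, pc):
        i = pc % self.entries
        use_global = self.choice[i] >= 2
        counter = self.glob[self.history] if use_global else self.local[i]
        return counter >= 2

    def update(self, pc, taken):
        i = pc % self.entries
        local_ok = (self.local[i] >= 2) == taken
        global_ok = (self.glob[self.history] >= 2) == taken
        # Move the selector only when exactly one component was right.
        if global_ok and not local_ok:
            self.choice[i] = min(self.choice[i] + 1, 3)
        elif local_ok and not global_ok:
            self.choice[i] = max(self.choice[i] - 1, 0)
        self.local[i] = bump(self.local[i], taken)
        self.glob[self.history] = bump(self.glob[self.history], taken)
        self.history = ((self.history << 1) | int(taken)) & self.mask
```

On a branch that strictly alternates between taken and not taken, the local counter in this sketch keeps guessing wrong, while the global history distinguishes the two cases; after a few executions the selector settles on the global component.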
Comparing Predictors
High Performance Instruction Delivery
What else can be done besides branch prediction?
Need to have high bandwidth instruction delivery
  –Modern multiple issue processors require 4-8 instructions per clock cycle
Branch-Target Buffers (BTB)
How can we further reduce branch penalty?
We need to know the address of the next instruction to fetch
If we know that the instruction is a branch and what the next PC should be, the penalty is zero
Branch-target buffer: stores the predicted address of the next instruction after a branch
Advantage for a 5-stage pipeline
  –Know the predicted instruction address 1 cycle earlier: in the IF stage instead of the ID stage
BTB has a cache structure
Note that only predicted-taken branches need to be stored
Stored entries represent addresses of known branches
Branch Target Buffer Operation
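The lookup and update steps of a branch-target buffer can be sketched as follows. This is a simplification, not the lecture's design: it uses an unbounded dictionary instead of a fixed-size tagged cache, assumes 4-byte instructions, and the class and method names are hypothetical.

```python
class BranchTargetBuffer:
    """Maps the address of a predicted-taken branch to its predicted target."""

    def __init__(self):
        self.targets = {}  # branch PC -> predicted target PC

    def next_fetch_pc(self, pc):
        # IF stage: a hit means "predicted taken"; a miss falls through.
        return self.targets.get(pc, pc + 4)

    def resolve(self, pc, taken, target):
        """Update once the branch outcome is known.

        Returns True if the fetch address predicted in IF was wrong.
        """
        predicted = self.next_fetch_pc(pc)
        actual = target if taken else pc + 4
        if taken:
            self.targets[pc] = target       # enter or refresh the entry
        else:
            self.targets.pop(pc, None)      # only taken branches are stored
        return predicted != actual


btb = BranchTargetBuffer()
print(btb.resolve(100, True, 200))   # first encounter: not in the buffer yet
print(btb.next_fetch_pc(100))        # later fetches of PC 100 redirect in IF
```

After the first taken execution the branch at the hypothetical address 100 is in the buffer, so the next fetch of that address yields the target 200 without waiting for decode.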
Integrated Instruction Fetch Units
Instead of using instruction fetch as one of the pipeline phases, use a more advanced instruction fetch unit
  –To support the demands of multiple issue processors
Integrated IF has 3 main units
  –Integrated branch prediction
  –Instruction prefetch
    »Autonomously fetches instructions ahead of when they are needed
  –Instruction memory access and buffering
    »Tries to hide the overhead associated with fetching instructions from multiple cache lines by buffering instructions
Return Address Predictors
Predict the target of jumps whose destination is not known at compile time
  –Returns from procedure calls
    »Procedures get called from different points in the code
Use a small stack of return addresses
  –When a procedure is called, push the return address on the stack; pop the stack on return
  –If the stack is deep enough for the call depth, returns are predicted perfectly
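A return address stack of limited depth can be sketched in a few lines. The overflow policy here (discard the oldest entry) is an assumption, as are the class and method names; real hardware may behave differently when the stack is full or empty.

```python
class ReturnAddressStack:
    """Small fixed-depth stack that predicts procedure return targets."""

    def __init__(self, depth):
        self.depth = depth
        self.stack = []

    def call(self, return_address):
        if len(self.stack) == self.depth:
            self.stack.pop(0)            # overflow: discard the oldest entry
        self.stack.append(return_address)

    def predict_return(self):
        # An empty stack gives no prediction (None).
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack(depth=2)
for addr in (10, 20, 30):                # three nested calls, depth only 2
    ras.call(addr)
print([ras.predict_return() for _ in range(3)])
```

With three nested calls and a depth of two, the two innermost returns are predicted correctly and the outermost one is lost. A stack at least as deep as the call nesting would predict all three.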
Prediction Stack Performance
Results based on a number of SPEC benchmarks
Recap
So far we have seen
Dynamic Scheduling: reduce stalls from data dependences
  –Tomasulo’s algorithm
Dynamic Branch Prediction: trying to reduce stalls from control dependences
  –N-bit predictors, (m,n) predictors, Tournament Predictors
Achieve an ideal CPI of 1
  –Branch target buffer, integrated IF, return address prediction
Multiple Issue Processors
Try to issue multiple instructions per clock cycle
Two basic flavors
  –Superscalar Processors
    »Issue a variable number of instructions per clock cycle
    »Can be statically or dynamically scheduled
  –VLIW (Very Long Instruction Word) Processors
    »Issue a fixed number of instructions formatted as a packet of smaller instructions
    »Parallelism across instructions is explicitly indicated
    »Statically scheduled by the compiler
Next Time
Midterm
Next Tuesday
  –Multiple Issue Processors