ILP: Advanced HW, CSCE430/830. Instruction-level parallelism: Advanced HW Approaches. CSCE430/830 Computer Architecture. Lecturer: Prof. Hong Jiang. Fall, 2006.


1 Instruction-level parallelism: Advanced HW Approaches
CSCE430/830 Computer Architecture
Lecturer: Prof. Hong Jiang
Fall, 2006

2 ILP: Advanced HW Approaches
Dynamic Hardware Branch Prediction: control dependences rapidly become the limiting factor as the amount of ILP to be exploited increases, particularly when multiple instructions are issued per cycle.
–Basic Branch Prediction and Branch-Prediction Buffers
»A branch-prediction buffer is a small memory indexed by the lower portion of the address of the branch instruction, containing a bit that says whether the branch was recently taken or not. It is simple, and useful only when the branch delay is longer than the time to calculate the target address.
»The prediction bit is inverted each time there is a wrong prediction. This causes an accuracy problem: a branch that is almost always taken mispredicts twice, rather than once, each time it is not taken. A remedy is the 2-bit predictor, a special case of the n-bit predictor (saturating counter), which performs well (accuracy: 82% to 99%).
[State diagram of the 2-bit prediction scheme: four states (11, 10, 01, 00), each labeled "Predict taken" or "Predict not taken", with transitions on Taken / Not taken.]
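The gap between a 1-bit and a 2-bit predictor can be illustrated with a minimal Python sketch. It assumes a single loop branch (taken nine times, then not taken once, with the loop run ten times), a counter that starts at 0, and a plain saturating counter. The function name `count_mispredictions` and the workload are illustrative assumptions, not part of the lecture.

```python
def count_mispredictions(outcomes, bits):
    """Count mispredictions of one n-bit saturating counter on one branch.

    outcomes: sequence of booleans, True meaning the branch was taken.
    bits: counter width n. The counter starts at 0 (assumed initial state).
    """
    max_count = (1 << bits) - 1
    threshold = 1 << (bits - 1)          # counter >= threshold predicts taken
    counter = 0
    mispredictions = 0
    for taken in outcomes:
        predicted_taken = counter >= threshold
        if predicted_taken != taken:
            mispredictions += 1
        # Saturating update: move toward the actual outcome, clamp at the ends.
        if taken:
            counter = min(counter + 1, max_count)
        else:
            counter = max(counter - 1, 0)
    return mispredictions


# Hypothetical workload: a loop-closing branch, 10 executions of a 10-trip loop.
loop_branch = ([True] * 9 + [False]) * 10
one_bit = count_mispredictions(loop_branch, bits=1)
two_bit = count_mispredictions(loop_branch, bits=2)
print(one_bit, two_bit)
```

In this workload the 1-bit counter mispredicts twice per loop execution (on the exit and again on re-entry), while the 2-bit counter settles to one misprediction per loop execution after warm-up.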

3 ILP: Advanced HW Approaches
Dynamic Hardware Branch Prediction:
–Correlating Branch Predictors
»The behavior of branch b3 is correlated with the behavior of branches b1 and b2: if b1 and b2 are both not taken, aa and bb are both set to 0, so b3 will be taken. A predictor that uses only the behavior of a single branch to predict the outcome of that branch can never capture this behavior.
»Branch predictors that use the behavior of other branches to make a prediction are called correlating predictors or two-level predictors.
C code:
    if (aa==2) aa=0;
    if (bb==2) bb=0;
    if (aa!=bb) {
MIPS code, with aa and bb assigned to registers R1 and R2:
        DSUBUI R3,R1,#2
        BNEZ   R3,L1      ;branch b1 (aa!=2)
        DADD   R1,R0,R0   ;aa=0
    L1: DSUBUI R3,R2,#2
        BNEZ   R3,L2      ;branch b2 (bb!=2)
        DADD   R2,R0,R0   ;bb=0
    L2: DSUBU  R3,R1,R2   ;R3=aa-bb
        BEQZ   R3,L3      ;branch b3 (aa==bb)
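The correlation between b1, b2 and b3 can be checked by brute force. The following Python sketch mirrors the branch conditions of the fragment; the function name `branch_outcomes` and the range of test values are illustrative assumptions.

```python
def branch_outcomes(aa, bb):
    """Return (b1_taken, b2_taken, b3_taken) for the code fragment."""
    b1_taken = aa != 2          # BNEZ R3,L1 skips "aa=0" when aa != 2
    if not b1_taken:
        aa = 0
    b2_taken = bb != 2          # BNEZ R3,L2 skips "bb=0" when bb != 2
    if not b2_taken:
        bb = 0
    b3_taken = aa == bb         # BEQZ R3,L3 is taken when aa - bb == 0
    return b1_taken, b2_taken, b3_taken


# Whenever b1 and b2 are both not taken, b3 must be taken.
correlated = all(
    b3
    for aa in range(4)
    for bb in range(4)
    for (b1, b2, b3) in [branch_outcomes(aa, bb)]
    if not b1 and not b2
)
print(correlated)
```

A predictor that looks only at b3's own history has no access to the b1 and b2 outcomes that determine this case.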


7 ILP: Advanced HW Approaches
Dynamic Hardware Branch Prediction:
C code:
    if (d==0) d=1;
    if (d==1)
MIPS code, with d assigned to register R1:
        BNEZ   R1,L1      ;branch b1 (d!=0)
        DADDIU R1,R0,#1   ;d==0, so d=1
    L1: DADDIU R3,R1,#-1
        BNEZ   R3,L2      ;branch b2 (d!=1)
        …
    L2:
Possible execution sequences:
Initial value of d | d==0? | b1 | Value of d before b2 | d==1? | b2
0 | Yes | Not taken | 1 | Yes | Not taken
1 | No | Taken | 1 | Yes | Not taken
2 | No | Taken | 2 | No | Taken
Behavior of a 1-bit Standard Predictor Initialized to Not Taken, with d alternating between 2 and 0 (100% wrong prediction):
d=? | b1 prediction | b1 action | New b1 prediction | b2 prediction | b2 action | New b2 prediction
2 | NT | T | T | NT | T | T
0 | T | NT | NT | T | NT | NT
2 | NT | T | T | NT | T | T
0 | T | NT | NT | T | NT | NT
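The 100% misprediction result can be reproduced with a short Python sketch. It assumes one prediction bit per branch, initialized to not taken, and d alternating between 2 and 0; the names `branch_actions` and `one_bit_trace` are illustrative.

```python
def branch_actions(d):
    """Return (b1_taken, b2_taken) for one pass through the fragment."""
    b1_taken = d != 0           # BNEZ R1,L1
    if not b1_taken:
        d = 1                   # DADDIU R1,R0,#1
    b2_taken = d != 1           # BNEZ R3,L2 with R3 = d - 1
    return b1_taken, b2_taken


def one_bit_trace(values):
    """Return (mispredictions, total branches) for a 1-bit predictor."""
    prediction = {"b1": False, "b2": False}   # False means predict not taken
    wrong = 0
    total = 0
    for d in values:
        for name, taken in zip(("b1", "b2"), branch_actions(d)):
            total += 1
            if prediction[name] != taken:
                wrong += 1
            prediction[name] = taken          # 1-bit update: remember last outcome
    return wrong, total


wrong, total = one_bit_trace([2, 0, 2, 0])
print(wrong, total)
```

Because each branch alternates between taken and not taken, the single bit always holds the opposite of the next outcome.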


15 ILP: Advanced HW Approaches
Dynamic Hardware Branch Prediction:
–Correlating Branch Predictors
»The standard predictor mispredicted all branches!
Meaning of the 2 prediction bits:
Prediction bits (p1/p2) | Prediction if last branch not taken (p1) | Prediction if last branch taken (p2)
NT/NT | NT | NT
NT/T | NT | T
T/NT | T | NT
T/T | T | T
The Action of the 1-bit Predictor with 1-bit Correlation, Initialized to Not Taken/Not Taken:
d=? | b1 prediction | b1 action | New b1 prediction | b2 prediction | b2 action | New b2 prediction
2 | NT/NT | T | T/NT | NT/NT | T | NT/T
0 | T/NT | NT | T/NT | NT/T | NT | NT/T
2 | T/NT | T | T/NT | NT/T | T | NT/T
0 | T/NT | NT | T/NT | NT/T | NT | NT/T
Possible execution sequences:
Initial value of d | d==0? | b1 | Value of d before b2 | d==1? | b2
0 | Yes | Not taken | 1 | Yes | Not taken
1 | No | Taken | 1 | Yes | Not taken
2 | No | Taken | 2 | No | Taken
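The same sequence can be replayed against a 1-bit predictor with 1 bit of correlation. This Python sketch assumes that each branch keeps two prediction bits selected by the outcome of the most recently executed branch, that all bits start at not taken, and that the initial "last branch" outcome is not taken; the function names are illustrative.

```python
def branch_actions(d):
    """Return the (name, taken) pairs for b1 and b2 given the initial d."""
    b1_taken = d != 0
    if not b1_taken:
        d = 1
    b2_taken = d != 1
    return (("b1", b1_taken), ("b2", b2_taken))


def correlating_trace(values):
    """Return the list of (iteration, branch) pairs that were mispredicted."""
    # bits[name][h] is the prediction for `name` when the last branch outcome was h
    # (h = 0: last branch not taken, h = 1: last branch taken).
    bits = {"b1": [False, False], "b2": [False, False]}
    last_taken = False            # assumed initial global history
    wrong = []
    for iteration, d in enumerate(values):
        for name, taken in branch_actions(d):
            h = int(last_taken)
            if bits[name][h] != taken:
                wrong.append((iteration, name))
            bits[name][h] = taken     # update only the selected bit
            last_taken = taken        # shift the outcome into the 1-bit history
    return wrong


mispredicted = correlating_trace([2, 0, 2, 0])
print(mispredicted)
```

Only the two branches of the first iteration are mispredicted; after that the selected bits match every outcome.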

16 ILP: Advanced HW Approaches
Dynamic Hardware Branch Prediction:
–Correlating Branch Predictors
»With the 1-bit correlation predictor, also called a (1,1) predictor, the only misprediction is on the first iteration!
»In the general case, an (m,n) predictor uses the behavior of the last m branches to choose from 2^m branch predictors, each of which is an n-bit predictor for a single branch.
[Figure: a (2,2) predictor. The 4 low-order bits of the branch address and a 2-bit global branch history (shift register) select one of the 2-bit per-branch predictors, which supplies the prediction.]
»The number of bits in an (m,n) predictor is: 2^m * n * (number of prediction entries selected by the branch address)
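The size formula is easy to evaluate directly. In this Python sketch the function name `predictor_bits` and the two sample configurations are illustrative assumptions: a (2,2) predictor with 16 entries (4 low-order address bits) and a plain 2-bit predictor with 4K entries, treated as a (0,2) predictor.

```python
def predictor_bits(m, n, entries):
    """Total state bits: 2^m predictors of n bits for each address-selected entry."""
    return (2 ** m) * n * entries


bits_2_2 = predictor_bits(m=2, n=2, entries=16)     # 4 address bits -> 16 entries
bits_0_2 = predictor_bits(m=0, n=2, entries=4096)   # no global history
print(bits_2_2, bits_0_2)
```

Each added bit of global history doubles the storage, so for a fixed bit budget a longer history leaves fewer entries selected by the branch address.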

17 ILP: Advanced HW Approaches
Dynamic Hardware Branch Prediction:
–Performance of Correlating Branch Predictors

18 ILP: Advanced HW Approaches
Dynamic Hardware Branch Prediction:
–Tournament Predictors: Adaptively Combining Local and Global Predictors
»Tournament predictors take the insight that adding global information to local predictors improves performance to the next level, by using multiple predictors, usually one based on global information and one based on local information, and combining them with a selector.
»Better accuracy at medium sizes (8K bits to 32K bits) and more effective use of very large numbers of prediction bits: the right predictor for the right branch.
»Existing tournament predictors use a 2-bit saturating counter per branch to choose between two different predictors. The counter is incremented whenever the "predicted" predictor is correct and the other predictor is incorrect, and it is decremented in the reverse situation.
[State transition diagram: four states, two labeled "Use predictor 1" and two labeled "Use predictor 2". Each edge is labeled with the correctness of predictor 1 / predictor 2 (0 = incorrect, 1 = correct), using the pairs 0/0, 0/1, 1/0 and 1/1.]
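A selector of this kind can be sketched in a few lines of Python. The encoding is an assumption made for illustration: counter values 0 and 1 select predictor 1, values 2 and 3 select predictor 2, and the counter moves toward whichever predictor was right when the two disagree in correctness. The class name `TournamentSelector` is hypothetical.

```python
class TournamentSelector:
    """2-bit saturating counter that picks between two component predictors."""

    def __init__(self):
        self.counter = 0          # assumed encoding: 0,1 -> predictor 1; 2,3 -> predictor 2

    def choose(self, p1_prediction, p2_prediction):
        """Return the prediction of the currently favored predictor."""
        return p2_prediction if self.counter >= 2 else p1_prediction

    def update(self, p1_prediction, p2_prediction, taken):
        """Move toward the predictor that was right when exactly one was right."""
        p1_correct = p1_prediction == taken
        p2_correct = p2_prediction == taken
        if p2_correct and not p1_correct:
            self.counter = min(self.counter + 1, 3)
        elif p1_correct and not p2_correct:
            self.counter = max(self.counter - 1, 0)
        # Both right or both wrong: the counter is left unchanged.


selector = TournamentSelector()
for _ in range(2):
    selector.update(p1_prediction=False, p2_prediction=True, taken=True)
favored_after_two = selector.choose(p1_prediction=False, p2_prediction=True)
print(selector.counter, favored_after_two)
```

Two consecutive cases where only predictor 2 is correct are needed before the selection flips, which gives the same hysteresis as a 2-bit branch predictor.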

19 ILP: Advanced HW Approaches
Dynamic Hardware Branch Prediction:
–Performance of Tournament Predictors:
[Figure: prediction due to local predictor.]
[Figure: misprediction rate of 3 different predictors.]

20 Instruction-Level Parallelism
Dynamic Hardware Branch Prediction:
–The Alpha 21264 Branch Predictor:
»4K 2-bit saturating counters, indexed by the local branch address, choose between:
  A Global Predictor that has
  –4K entries that are indexed by the history of the last 12 branches;
  –each entry is a standard 2-bit predictor.
  A Local Predictor that consists of a two-level predictor:
  –At the top level is a local history table consisting of 1024 10-bit entries, with each entry corresponding to the most recent 10 branch outcomes for the entry;
  –At the bottom level is a table of 1K entries, indexed by the 10-bit entry of the top level, consisting of 3-bit saturating counters which provide the local prediction.
»It uses a total of 29K bits for branch prediction, resulting in very high accuracy: 1 misprediction in 1000 for SPECfp95 and 11.5 in 1000 for SPECint95.
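The 29K-bit total follows from the table sizes listed on the slide. The Python sketch below simply adds them up; the variable names are illustrative.

```python
K = 1024

selector_bits = 4 * K * 2          # 4K 2-bit saturating counters (the chooser)
global_bits = 4 * K * 2            # 4K entries, each a 2-bit predictor
local_history_bits = 1024 * 10     # 1024 entries of 10-bit local history
local_counter_bits = 1 * K * 3     # 1K entries of 3-bit saturating counters

total_bits = selector_bits + global_bits + local_history_bits + local_counter_bits
print(total_bits, total_bits // K)
```

The local history table is the largest single component at 10K bits, followed by the selector and the global predictor at 8K bits each.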

21 ILP: Advanced HW Approaches
High-Performance Instruction Delivery:
–Branch-Target Buffers
»A branch-target buffer is a branch-prediction cache that stores the predicted address for the next instruction after a branch:
  Predicting the next instruction address before decoding the current instruction!
  Accessing the target buffer during the IF stage using the instruction address of the fetched instruction (a possible branch) to index the buffer.
[Figure: branch-target buffer. The PC of the instruction to fetch is looked up and compared (=) against the stored instruction addresses; each entry also holds a predicted PC and a field for "Branch predicted taken or untaken". The height of the table is the number of entries in the branch-target buffer. No match: the instruction is not predicted to be a branch; proceed normally. Match: the instruction is a taken branch and the predicted PC should be used as the next PC.]
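The lookup performed during IF can be sketched as follows. This Python model is deliberately simplified and rests on stated assumptions: the buffer is a dictionary keyed by the full instruction address (a fully associative lookup with no capacity limit), instructions are 4 bytes, and the addresses are made up for the example.

```python
INSTRUCTION_BYTES = 4   # assumed fixed instruction size


def next_fetch_pc(btb, pc):
    """Return (next_pc, hit) for the instruction being fetched at `pc`.

    btb maps the address of a branch to its predicted next PC.
    """
    predicted_pc = btb.get(pc)
    if predicted_pc is None:
        # No entry: not predicted to be a branch, fetch falls through.
        return pc + INSTRUCTION_BYTES, False
    # Entry found: treat as a taken branch and fetch from the predicted PC.
    return predicted_pc, True


btb = {0x400: 0x380}                   # hypothetical loop branch at 0x400
hit_result = next_fetch_pc(btb, 0x400)
miss_result = next_fetch_pc(btb, 0x404)
print(hit_result, miss_result)
```

The key point is that the next PC is available before the instruction at `pc` has been decoded, so a correctly predicted taken branch costs no fetch bubbles.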

22 ILP: Advanced HW Approaches
Handling branch-target buffers (flowchart across the IF, ID and EX stages):
IF: Send PC to memory and branch-target buffer. Entry found in branch-target buffer?
–No. ID: Is instruction a taken branch?
  »No: Normal instruction execution (0 cycle penalty)
  »Yes. EX: Enter branch instruction address and next PC into branch-target buffer (2 cycle penalty)
–Yes. ID: Send out predicted PC. Taken branch?
  »No. EX: Mispredicted branch, kill fetched instruction; restart fetch at other target; delete entry from target buffer (2 cycle penalty)
  »Yes. EX: Branch correctly predicted; continue execution with no stalls (0 cycle penalty)
Integrated Instruction Fetch Units: to meet the demands of multiple-issue processors, recent designs have used an integrated instruction fetch unit that integrates several functions:
–Integrated branch prediction: the branch predictor becomes part of the instruction fetch unit and is constantly predicting branches, so as to drive the fetch pipeline
–Instruction prefetch: to deliver multiple instructions per clock, the instruction fetch unit will likely need to fetch ahead, autonomously managing the prefetching of instructions and integrating it with branch prediction
–Instruction memory access and buffering: encapsulates the complexity of fetching multiple instructions per clock, trying to hide the cost of crossing cache blocks, and provides buffering, acting as an on-demand unit to provide instructions to the issue stage as needed and in the quantity needed
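The four outcomes of the branch-target buffer flowchart can be written as a small update routine. This Python sketch assumes a dictionary-based buffer and assumes that a hit on a taken branch always holds the correct target; the function name `resolve_branch` and the addresses are illustrative.

```python
def resolve_branch(btb, pc, taken, target):
    """Update the buffer once the instruction at `pc` is resolved.

    Returns the penalty in cycles according to the flowchart.
    """
    hit = pc in btb
    if hit and taken:
        return 0                 # correctly predicted, no stalls
    if hit and not taken:
        del btb[pc]              # mispredicted: kill fetch, delete the entry
        return 2
    if taken:
        btb[pc] = target         # taken branch not in buffer: enter it
        return 2
    return 0                     # not in buffer, not a taken branch


btb = {}
penalties = [
    resolve_branch(btb, 0x400, taken=True, target=0x380),    # first encounter
    resolve_branch(btb, 0x400, taken=True, target=0x380),    # now predicted
    resolve_branch(btb, 0x400, taken=False, target=0x380),   # loop exit
    resolve_branch(btb, 0x400, taken=False, target=0x380),   # entry was deleted
]
print(penalties)
```

Only the two disagreement cases (taken but absent, present but not taken) cost cycles, and both also change the buffer contents.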

23 ILP: Advanced HW Approaches
Taking Advantage of More ILP with Multiple Issue
–Superscalar: issue varying numbers of instructions per cycle that are either statically scheduled (using compiler techniques, thus in-order execution) or dynamically scheduled (using techniques based on Tomasulo's algorithm, thus out-of-order execution);
–VLIW (very long instruction word): issue a fixed number of instructions formatted either as one large instruction or as a fixed instruction packet with the parallelism among instructions explicitly indicated by the instruction (hence, they are also known as EPIC, explicitly parallel instruction computers). VLIW and EPIC processors are inherently statically scheduled by the compiler.
Common name | Issue structure | Hazard detection | Scheduling | Distinguishing characteristics | Examples
Superscalar (static) | Dynamic (IS packet <= 8) | Hardware | Static | In-order execution | Sun UltraSPARC II/III
Superscalar (dynamic) | Dynamic (split&piped) | Hardware | Dynamic | Some out-of-order execution | IBM Power2
Superscalar (speculative) | Dynamic | Hardware | Dynamic with speculation | Out-of-order execution with speculation | Pentium III/4, MIPS R10K, Alpha 21264, HP PA 8500, IBM RS64III
VLIW/LIW | Static | Software | Static | No hazards between issue packets | Trimedia, i860
EPIC | Mostly static | Mostly software | Mostly static | Explicit dependences marked by compiler | Itanium

24 ILP: Advanced HW Approaches
Taking Advantage of More ILP with Multiple Issue
–Multiple Instruction Issue with Dynamic Scheduling: dual-issue with Tomasulo's
Iteration No. | Instructions | Issues at | Executes | Mem Access | Write CDB | Comments
1 | L.D F0,0(R1) | 1 | 2 | 3 | 4 | First issue
1 | ADD.D F4,F0,F2 | 1 | 5 | | 8 | Wait for L.D
1 | S.D F4,0(R1) | 2 | 3 | 9 | | Wait for ADD.D
1 | DADDIU R1,R1,#-8 | 2 | 4 | | 5 | Wait for ALU
1 | BNE R1,R2,Loop | 3 | 6 | | | Wait for DADDIU
2 | L.D F0,0(R1) | 4 | 7 | 8 | 9 | Wait for BNE complete
2 | ADD.D F4,F0,F2 | 4 | 10 | | 13 | Wait for L.D
2 | S.D F4,0(R1) | 5 | 8 | 14 | | Wait for ADD.D
2 | DADDIU R1,R1,#-8 | 5 | 9 | | 10 | Wait for ALU
2 | BNE R1,R2,Loop | 6 | 11 | | | Wait for DADDIU
3 | L.D F0,0(R1) | 7 | 12 | 13 | 14 | Wait for BNE complete
3 | ADD.D F4,F0,F2 | 7 | 15 | | 18 | Wait for L.D
3 | S.D F4,0(R1) | 8 | 13 | 19 | | Wait for ADD.D
3 | DADDIU R1,R1,#-8 | 8 | 14 | | 15 | Wait for ALU
3 | BNE R1,R2,Loop | 9 | 16 | | | Wait for DADDIU

25 ILP: Advanced HW Approaches
Taking Advantage of More ILP with Multiple Issue: resource usage
Clock number | Integer ALU | FP ALU | Data cache | CDB
2 | 1/L.D | | |
3 | 1/S.D | | 1/L.D |
4 | 1/DADDIU | | | 1/L.D
5 | | 1/ADD.D | | 1/DADDIU
6 | | | |
7 | 2/L.D | | |
8 | 2/S.D | | 2/L.D | 1/ADD.D
9 | 2/DADDIU | | 1/S.D | 2/L.D
10 | | 2/ADD.D | | 2/DADDIU
11 | | | |
12 | 3/L.D | | |
13 | 3/S.D | | 3/L.D | 2/ADD.D
14 | 3/DADDIU | | 2/S.D | 3/L.D
15 | | 3/ADD.D | | 3/DADDIU
16 | | | |
17 | | | |
18 | | | | 3/ADD.D
19 | | | 3/S.D |
20 | | | |

26 ILP: Advanced HW Approaches
Taking Advantage of More ILP with Multiple Issue
–Multiple Instruction Issue with Dynamic Scheduling: + an adder and a CDB
Iteration No. | Instructions | Issues at | Executes | Mem Access | Write CDB | Comments
1 | L.D F0,0(R1) | 1 | 2 | 3 | 4 | First issue
1 | ADD.D F4,F0,F2 | 1 | 5 | | 8 | Wait for L.D
1 | S.D F4,0(R1) | 2 | 3 | 9 | | Wait for ADD.D
1 | DADDIU R1,R1,#-8 | 2 | 3 | | 4 | Executes earlier
1 | BNE R1,R2,Loop | 3 | 5 | | | Wait for DADDIU
2 | L.D F0,0(R1) | 4 | 6 | 7 | 8 | Wait for BNE complete
2 | ADD.D F4,F0,F2 | 4 | 9 | | 12 | Wait for L.D
2 | S.D F4,0(R1) | 5 | 7 | 13 | | Wait for ADD.D
2 | DADDIU R1,R1,#-8 | 5 | 6 | | 7 | Executes earlier
2 | BNE R1,R2,Loop | 6 | 8 | | | Wait for DADDIU
3 | L.D F0,0(R1) | 7 | 9 | 10 | 11 | Wait for BNE complete
3 | ADD.D F4,F0,F2 | 7 | 12 | | 15 | Wait for L.D
3 | S.D F4,0(R1) | 8 | 10 | 16 | | Wait for ADD.D
3 | DADDIU R1,R1,#-8 | 8 | 9 | | 10 | Executes earlier
3 | BNE R1,R2,Loop | 9 | 11 | | | Wait for DADDIU

27 ILP: Advanced HW Approaches
Taking Advantage of More ILP with Multiple Issue: more resource
Clock number | Integer ALU | Address adder | FP ALU | Data cache | CDB#1 | CDB#2
2 | | 1/L.D | | | |
3 | 1/DADDIU | 1/S.D | | 1/L.D | |
4 | | | | | 1/L.D | 1/DADDIU
5 | | | 1/ADD.D | | |
6 | 2/DADDIU | 2/L.D | | | |
7 | | 2/S.D | | 2/L.D | 2/DADDIU |
8 | | | | | 1/ADD.D | 2/L.D
9 | 3/DADDIU | 3/L.D | 2/ADD.D | 1/S.D | |
10 | | 3/S.D | | 3/L.D | 3/DADDIU |
11 | | | | | 3/L.D |
12 | | | 3/ADD.D | | 2/ADD.D |
13 | | | | 2/S.D | |
14 | | | | | |
15 | | | | | 3/ADD.D |
16 | | | | 3/S.D | |
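One way to see why the faster schedule needs a second CDB is to count result writes per clock in the two timing tables. The Python sketch below copies the "Write CDB" clocks from those tables (S.D and BNE write no result); the 2/DADDIU write in the second schedule is taken as clock 7, as shown in the resource-usage table. The names `peak_cdb_writes`, `ONE_ADDER_SCHEDULE` and `EXTRA_ADDER_SCHEDULE` are illustrative.

```python
from collections import Counter

# (iteration, instruction) -> clock cycle of the CDB write.
ONE_ADDER_SCHEDULE = {
    (1, "L.D"): 4, (1, "ADD.D"): 8, (1, "DADDIU"): 5,
    (2, "L.D"): 9, (2, "ADD.D"): 13, (2, "DADDIU"): 10,
    (3, "L.D"): 14, (3, "ADD.D"): 18, (3, "DADDIU"): 15,
}
EXTRA_ADDER_SCHEDULE = {
    (1, "L.D"): 4, (1, "ADD.D"): 8, (1, "DADDIU"): 4,
    (2, "L.D"): 8, (2, "ADD.D"): 12, (2, "DADDIU"): 7,   # 7 per the resource table
    (3, "L.D"): 11, (3, "ADD.D"): 15, (3, "DADDIU"): 10,
}


def peak_cdb_writes(schedule):
    """Return the largest number of CDB writes that fall in the same clock."""
    writes_per_clock = Counter(schedule.values())
    return max(writes_per_clock.values())


peak_single = peak_cdb_writes(ONE_ADDER_SCHEDULE)
peak_dual = peak_cdb_writes(EXTRA_ADDER_SCHEDULE)
print(peak_single, peak_dual)
```

With the separate address adder, results from different instructions complete in the same clock (clocks 4 and 8), so a single CDB would become the new bottleneck.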

