
1 Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures
Henk Corporaal, TU Eindhoven, 2007

2 Topics
- Introduction
- Hazards: data dependences, control dependences
- Branch prediction
- Dependences limit ILP: scheduling
- Out-of-order execution: hardware speculation
- Multiple issue
- How much ILP is there?
9/9/2019 ACA H.Corporaal

3 Introduction
ILP = instruction-level parallelism: multiple operations (or instructions) can be executed in parallel.
Needed:
- sufficient resources
- parallel scheduling (a hardware or a software solution)
- the application should contain ILP

4 Hazards
Three types of hazards:
- structural
- data dependence
- control dependence
Hazards cause scheduling problems.

5 Data dependences
- RaW: read after write (true dependence)
- WaR: write after read (anti-dependence)
- WaW: write after write (output dependence)

6 Control Dependences
C input code:
if (a > b) { r = a % b; }
else       { r = b % a; }
y = a*b;
CFG with compiled code:
1: sub t1, a, b
   bgz t1, 2, 3
2: rem r, a, b
   goto 4
3: rem r, b, a
   goto 4
4: mul y, a, b
How real are control dependences?

7 Branch Prediction

8 Branch Prediction Motivation
High branch penalties in pipelined processors:
- with on average one out of five instructions being a branch, the maximum ILP is five
- the situation is even worse for multiple-issue processors, because we need to provide an instruction stream of n instructions per cycle
Idea: predict the outcome of branches based on their history and execute instructions speculatively.

9 Five Branch Prediction Schemes
1. 1-bit branch prediction buffer
2. 2-bit branch prediction buffer
3. Correlating branch prediction buffer
4. Branch target buffer
5. Return address predictors
Plus: a way to get rid of those malicious branches altogether.

10 1-bit Branch Prediction Buffer
1-bit branch prediction buffer, or branch history table (BHT):
- the buffer is like a cache without tags, indexed by low-order PC bits
- does not help for the simple MIPS pipeline, because the target address is calculated in the same stage as the branch condition
(figure: the PC indexes the BHT, which delivers the 1-bit prediction)

11 Branch Prediction Buffer: 1-bit prediction
The lower K bits of the branch address index a table of 2^K prediction bits.
Problems:
- Aliasing: the lower K bits of different branch instructions can be the same.
  Solution: use tags (the buffer becomes a cache); however, this is very expensive.
- Loops are predicted wrong twice (on the last iteration and on re-entry).
  Solution: use an n-bit saturating counter:
    predict taken     if counter >= 2^(n-1)
    predict not-taken if counter <  2^(n-1)
  A 2-bit saturating counter predicts a loop wrong only once.
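To illustrate the aliasing problem, here is a minimal Python sketch of a 1-bit, tagless BHT (the class name and PC values are made up for the example): two branches whose addresses agree in the lower K bits share one table entry and disturb each other's prediction.

```python
class OneBitBHT:
    """1-bit branch history table: indexed by the low K bits of the
    branch PC, with no tags (a cache without tags, as on the slide)."""

    def __init__(self, k):
        self.k = k
        self.table = [False] * (1 << k)   # False = predict not taken

    def index(self, pc):
        # Drop the 2 byte-offset bits, keep the next K bits.
        return (pc >> 2) & ((1 << self.k) - 1)

    def predict(self, pc):
        return self.table[self.index(pc)]

    def update(self, pc, taken):
        self.table[self.index(pc)] = taken

bht = OneBitBHT(k=4)
pc_a, pc_b = 0x1000, 0x2000   # identical low bits -> same entry (aliasing)
bht.update(pc_a, True)        # branch A trains the shared entry to "taken"
```

After training branch A, branch B at a different address now also predicts "taken", even though it has never executed: that is exactly the aliasing the slide warns about.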

12 2-bit Branch Prediction Buffer
Solution: a 2-bit scheme where the prediction is changed only after two successive mispredictions.
Can be implemented as a saturating counter (figure: four-state diagram with T/NT transitions between "predict taken" and "predict not taken" states).
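A small Python sketch (illustrative, not from the slides) comparing 1-bit and 2-bit saturating-counter predictors on a loop branch that is taken 9 times and then falls through, with the loop executed twice:

```python
class SaturatingCounter:
    """n-bit saturating up/down counter (one branch-prediction table entry)."""

    def __init__(self, n, init=0):
        self.n = n
        self.max = (1 << n) - 1
        self.counter = init

    def predict(self):
        # Predict taken when the counter is in the upper half of its range.
        return self.counter >= (1 << (self.n - 1))

    def update(self, taken):
        if taken:
            self.counter = min(self.counter + 1, self.max)
        else:
            self.counter = max(self.counter - 1, 0)

def mispredictions(predictor, outcomes):
    wrong = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            wrong += 1
        predictor.update(taken)
    return wrong

# A loop branch: taken 9 times, then falls through; the loop runs twice.
# Both predictors start out warmed up to "predict taken".
trace = ([True] * 9 + [False]) * 2
one_bit = mispredictions(SaturatingCounter(n=1, init=1), trace)  # -> 3
two_bit = mispredictions(SaturatingCounter(n=2, init=3), trace)  # -> 2
```

Per steady-state loop execution the 1-bit predictor is wrong twice (the exit and the re-entry), while the 2-bit counter is wrong only once (the exit), matching the slide's claim.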

13 Correlating Branches
Fragment from the SPEC92 benchmark eqntott:
if (aa==2) aa = 0;
if (bb==2) bb = 0;
if (aa!=bb) {..}
    subi R3,R1,#2
b1: bnez R3,L1
    add  R1,R0,R0
L1: subi R3,R2,#2
b2: bnez R3,L2
    add  R2,R0,R0
L2: sub  R3,R1,R2
b3: beqz R3,L3
If b1 and b2 are both not taken (aa==2 and bb==2), then b3 will be taken: the outcome of b3 correlates with the outcomes of b1 and b2.

14 Correlating Branch Predictor
Idea: the behavior of the current branch is related to the taken/not-taken history of recently executed branches.
The behavior of recent branches then selects between, say, 4 predictions for the next branch, updating just that prediction.
A (k,n) predictor uses the behavior of the last k branches to choose from 2^k predictors, each of which is an n-bit predictor; a (2,2) predictor has a 2-bit global history and 2-bit local predictors.
(figure: 4 bits from the branch address select a row of 2-bit per-branch predictors; a 2-bit global branch history shift register (01 = not taken, then taken) selects the column.)

15 Branch Correlation Using Branch History
Two schemes, parameterized as (a, k, m, n):
- PA: per-address history, a > 0
- GA: global history, a = 0
a bits of the branch address select a k-bit entry in the branch history table; that history, together with m branch-address bits, indexes a pattern history table of n-bit saturating up/down counters, which deliver the prediction.
Table size (usually n = 2): #bits = k * 2^a + 2^k * 2^m * n
Variant: gshare (Scott McFarling '93): GA which XORs PC address bits with the branch history bits.
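A compact Python sketch of gshare (the parameter k and the training loop are mine, for illustration): XORing the global history into the index gives one branch several counters, one per history context, so even a strictly alternating branch becomes predictable.

```python
class Gshare:
    """Gshare (McFarling '93): global branch history XORed with PC bits
    indexes one shared table of 2-bit saturating counters."""

    def __init__(self, k):
        self.k = k
        self.mask = (1 << k) - 1
        self.history = 0                  # last k outcomes, newest in bit 0
        self.counters = [1] * (1 << k)    # start weakly not-taken

    def index(self, pc):
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc):
        return self.counters[self.index(pc)] >= 2

    def update(self, pc, taken):
        i = self.index(pc)
        self.counters[i] = min(self.counters[i] + 1, 3) if taken \
            else max(self.counters[i] - 1, 0)
        self.history = ((self.history << 1) | int(taken)) & self.mask

g = Gshare(k=4)
wrong = 0
for i in range(100):
    taken = (i % 2 == 0)              # a strictly alternating branch
    if i >= 50 and g.predict(0x40) != taken:
        wrong += 1                    # count mispredictions after warm-up
    g.update(0x40, taken)
```

A plain bimodal 2-bit counter would hover around 50% on this pattern; gshare learns the two history contexts and, after warm-up, predicts the alternation perfectly.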

16 Accuracy (taking the best combination of parameters)
(figure: branch prediction accuracy (%) versus predictor size (64 bytes to 64K) for bimodal, GAs, and PAs schemes; the accuracy axis runs from 89% to 98%. The best global scheme, GA(0,11,5,2), reaches about 98%; the best per-address scheme, PA(10,6,4,2), about 97%; bimodal is lowest.)

17 Accuracy of Different Branch Predictors
(figure: misprediction rates, from 0% to 18%, for a 4096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a (2,2) correlating BHT; the correlating predictor performs best.)

18 BHT Accuracy
Mispredict because either:
- wrong guess for that branch, or
- got the branch history of the wrong branch when indexing the table (aliasing)
4096-entry table: misprediction rates vary from 1% (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%.
For SPEC92, 4096 entries are about as good as an infinite table.
Real programs + OS behave more like gcc.

19 Branch Target Buffer
The branch condition alone is not enough: we also need the target address!
Branch Target Buffer (BTB): each entry holds a tag (the branch PC) and the target address (PC if taken); the branch prediction itself is often kept in a separate table.
Lookup: compare the fetch PC against the tags.
- Hit: the instruction is a branch; use the predicted PC as the next PC if the branch is predicted taken.
- Miss: the instruction is not a (known) branch; proceed normally.
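A minimal direct-mapped BTB model in Python (the class name and table size are illustrative; prediction bits are kept separately, as the slide notes):

```python
class BTB:
    """Direct-mapped branch target buffer: tag = full branch PC,
    data = predicted target PC."""

    def __init__(self, entries):
        self.entries = entries
        self.tags = [None] * entries
        self.targets = [0] * entries

    def lookup(self, pc):
        i = (pc >> 2) % self.entries
        if self.tags[i] == pc:       # hit: pc is a known branch
            return self.targets[i]
        return None                  # miss: fetch falls through to pc + 4

    def update(self, pc, target):    # fill/replace on a resolved taken branch
        i = (pc >> 2) % self.entries
        self.tags[i], self.targets[i] = pc, target

btb = BTB(16)
btb.update(0x100, 0x40)              # branch at 0x100 jumps to 0x40
```

Because the full PC is stored as the tag, a lookup at a different address misses cleanly instead of aliasing, which is what distinguishes the BTB from the tagless BHT.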

20 Instruction Fetch Stage
(figure: the PC feeds both the instruction memory and the BTB in parallel; on a BTB hit with a taken prediction, the target address becomes the next PC, otherwise PC + 4.)
Not shown: the hardware needed to recover when the prediction was wrong.

21 Special Case: Return Addresses
Register-indirect branches: the target address is hard to predict.
MIPS instruction: jr r31 ; PC = r31
Useful for implementing:
- switch/case statements
- FORTRAN computed GOTOs
- procedure returns (mainly): in SPEC89, 85% of such branches are procedure returns
Since procedures follow a stack discipline, save return addresses in a small buffer that acts like a stack: 8 to 16 entries give a very high hit rate.
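The return-address stack can be sketched in a few lines of Python (the entry count and the wrap-around-on-overflow policy are illustrative assumptions):

```python
class ReturnAddressStack:
    """Small hardware return-address predictor: push on a call,
    pop on a return. On overflow the oldest entry is overwritten."""

    def __init__(self, entries=8):
        self.stack = [0] * entries
        self.top = 0
        self.entries = entries

    def push(self, return_pc):
        # On a call: remember where the matching return should go.
        self.top = (self.top + 1) % self.entries
        self.stack[self.top] = return_pc

    def pop(self):
        # On a return: the predicted target is the most recent call site.
        pc = self.stack[self.top]
        self.top = (self.top - 1) % self.entries
        return pc

ras = ReturnAddressStack(entries=8)
ras.push(0x100)   # call from 0x0FC, return to 0x100
ras.push(0x200)   # nested call, return to 0x200
```

As long as call depth stays within the buffer size, every pop returns the right address, which is why 8 to 16 entries already achieve a very high hit rate.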

22 Dynamic Branch Prediction Summary
- Prediction is an important part of scalar execution
- Branch history table: 2 bits per entry for loop accuracy
- Correlation: recently executed branches are correlated with the next branch (either different branches, or different executions of the same branch)
- Branch target buffer: include the branch target address (and prediction)
- Return address stack for prediction of indirect jumps

23 Or: Avoid branches !

24 Predicated Instructions
Avoid branch prediction by turning branches into conditional, or predicated, instructions: if the condition is false, the instruction neither stores its result nor causes an exception.
- Expanded ISAs of Alpha, MIPS, PowerPC, and SPARC have conditional move; PA-RISC can annul any following instruction
- IA-64/Itanium: conditional execution of any instruction
Examples:
if (R1==0) R2 = R3;          CMOVZ  R2,R3,R1

if (R1 < R2) R3 = R1;        SLT    R9,R1,R2
else         R3 = R2;        CMOVNZ R3,R1,R9
                             CMOVZ  R3,R2,R9
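The CMOVZ/CMOVNZ semantics can be modeled in Python (register names follow the slide; the helper functions are mine): a predicated instruction writes nothing at all when its condition fails.

```python
def slt(regs, rd, rs, rt):
    # SLT rd, rs, rt: rd = 1 if rs < rt else 0
    regs[rd] = 1 if regs[rs] < regs[rt] else 0

def cmovz(regs, rd, rs, rc):
    # CMOVZ rd, rs, rc: move only when the condition register is zero;
    # if the predicate is false, no result is stored (and no exception raised).
    if regs[rc] == 0:
        regs[rd] = regs[rs]

def cmovnz(regs, rd, rs, rc):
    # CMOVNZ rd, rs, rc: move only when the condition register is non-zero.
    if regs[rc] != 0:
        regs[rd] = regs[rs]

# The slide's if/else (r3 = r1 if r1 < r2 else r2) without any branch:
regs = {"r1": 3, "r2": 7, "r3": 0, "r9": 0}
slt(regs, "r9", "r1", "r2")       # r9 = (r1 < r2)
cmovnz(regs, "r3", "r1", "r9")    # executes when r1 < r2
cmovz(regs, "r3", "r2", "r9")     # executes otherwise
```

Exactly one of the two moves takes effect for any input, so the control dependence has been converted into a data dependence on r9 and there is nothing left to mispredict.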

25 Dynamic Scheduling

26 Dynamic Scheduling Principle
What we examined so far is static scheduling: the compiler reorders instructions so as to avoid hazards and reduce stalls.
Dynamic scheduling: the hardware rearranges instruction execution to reduce stalls.
Example:
DIV.D F0,F2,F4    ; takes 24 cycles and is not pipelined
ADD.D F10,F0,F8   ; depends on the DIV.D result, must wait
SUB.D F12,F8,F14  ; cannot continue under in-order issue, even though it does not depend on anything
Key idea: allow instructions behind a stall to proceed.
The book describes the Tomasulo algorithm, but we describe the general idea.

27 Advantages of Dynamic Scheduling
- Handles cases where dependences are unknown at compile time, e.g., because they may involve a memory reference
- Simplifies the compiler
- Allows code compiled for one pipeline (or no particular pipeline) to run efficiently on a different pipeline
- Hardware speculation, a technique with significant performance advantages, builds on dynamic scheduling

28 Superscalar Concept
(block diagram: instruction memory/instruction cache feed a decoder; decoded instructions are placed in reservation stations in front of the functional units (branch unit, ALU-1, ALU-2, logic & shift, load unit, store unit); loads and stores exchange addresses and data with the data cache/data memory; results flow through a reorder buffer into the register file.)

29 Superscalar Issues
- How to fetch multiple instructions in time (across basic block boundaries)?
- Predicting branches
- Non-blocking memory system
- Tuning #resources (FUs, ports, entries, etc.)
- Handling dependencies
- How to support precise interrupts?
- How to recover from a mispredicted branch path?
For the latter two issues we need to look at sequential look-ahead and architectural state.
Ref: Johnson 1991.

30 Example of Superscalar Processor Execution
Superscalar processor organization:
- simple pipeline: IF, EX, WB
- fetches 2 instructions each cycle
- 2 ld/st units, dual-ported memory; 2 FP adders; 1 FP multiplier
- instruction window (buffer between IF and EX stage) of size 2
- FP ld/st takes 1 cc; FP +/- takes 2 cc; FP * takes 4 cc; FP / takes 8 cc
Code to execute:
L.D   F6,32(R2)
L.D   F2,48(R3)
MUL.D F0,F2,F4
SUB.D F8,F2,F6
DIV.D F10,F0,F6
ADD.D F6,F8,F2
MUL.D F12,F2,F4

31 Example of Superscalar Processor Execution
(Organization and latencies as on slide 30.)
Cycle:            1
L.D   F6,32(R2)   IF
L.D   F2,48(R3)   IF
MUL.D F0,F2,F4
SUB.D F8,F2,F6
DIV.D F10,F0,F6
ADD.D F6,F8,F2
MUL.D F12,F2,F4

32 Example of Superscalar Processor Execution
(Organization and latencies as on slide 30.)
Cycle:            1   2
L.D   F6,32(R2)   IF  EX
L.D   F2,48(R3)   IF  EX
MUL.D F0,F2,F4        IF
SUB.D F8,F2,F6        IF
DIV.D F10,F0,F6
ADD.D F6,F8,F2
MUL.D F12,F2,F4

33 Example of Superscalar Processor Execution
(Organization and latencies as on slide 30.)
Cycle:            1   2   3
L.D   F6,32(R2)   IF  EX  WB
L.D   F2,48(R3)   IF  EX  WB
MUL.D F0,F2,F4        IF  EX
SUB.D F8,F2,F6        IF  EX
DIV.D F10,F0,F6           IF
ADD.D F6,F8,F2            IF
MUL.D F12,F2,F4

34 Example of Superscalar Processor Execution
(Organization and latencies as on slide 30.)
Cycle:            1   2   3   4
L.D   F6,32(R2)   IF  EX  WB
L.D   F2,48(R3)   IF  EX  WB
MUL.D F0,F2,F4        IF  EX  EX
SUB.D F8,F2,F6        IF  EX  EX
DIV.D F10,F0,F6           IF
ADD.D F6,F8,F2            IF
MUL.D F12,F2,F4
DIV.D stalls because of the data dependence on F0 (the MUL.D result); MUL.D F12,F2,F4 cannot be fetched because the window is full.

35 Example of Superscalar Processor Execution
(Organization and latencies as on slide 30.)
Cycle:            1   2   3   4   5
L.D   F6,32(R2)   IF  EX  WB
L.D   F2,48(R3)   IF  EX  WB
MUL.D F0,F2,F4        IF  EX  EX  EX
SUB.D F8,F2,F6        IF  EX  EX  WB
DIV.D F10,F0,F6           IF
ADD.D F6,F8,F2            IF      EX
MUL.D F12,F2,F4                   IF

36 Example of Superscalar Processor Execution
(Organization and latencies as on slide 30.)
Cycle:            1   2   3   4   5   6
L.D   F6,32(R2)   IF  EX  WB
L.D   F2,48(R3)   IF  EX  WB
MUL.D F0,F2,F4        IF  EX  EX  EX  EX
SUB.D F8,F2,F6        IF  EX  EX  WB
DIV.D F10,F0,F6           IF
ADD.D F6,F8,F2            IF      EX  EX
MUL.D F12,F2,F4                   IF
MUL.D F12,F2,F4 cannot execute: structural hazard (the single FP multiplier is still busy).

37 Example of Superscalar Processor Execution
(Organization and latencies as on slide 30.)
Cycle:            1   2   3   4   5   6   7
L.D   F6,32(R2)   IF  EX  WB
L.D   F2,48(R3)   IF  EX  WB
MUL.D F0,F2,F4        IF  EX  EX  EX  EX  WB
SUB.D F8,F2,F6        IF  EX  EX  WB
DIV.D F10,F0,F6           IF              EX
ADD.D F6,F8,F2            IF      EX  EX  WB
MUL.D F12,F2,F4                   IF      ?
DIV.D can finally start executing now that F0 is available; can MUL.D F12,F2,F4 also issue in cycle 7 (the "?" on the slide)?

38 Register Renaming
A technique to eliminate anti- (WaR) and output (WaW) dependencies.
Can be implemented:
- by the compiler: advantage: low cost; disadvantage: "old" codes perform poorly
- in hardware: advantage: binary compatibility; disadvantage: extra hardware needed
We describe the general idea.

39 Register Renaming
- There is a physical register file larger than the logical register file
- A mapping table associates logical registers with physical registers
- When an instruction is decoded:
  - its physical source registers are obtained from the mapping table
  - its physical destination register is obtained from a free list
  - the mapping table is updated
Example: add r3,r3,4
Before: mapping table r0->R8, r1->R7, r2->R5, r3->R1, r4->R9; free list: R2, R6
After (renamed to add R2,R1,4): mapping table r0->R8, r1->R7, r2->R5, r3->R2, r4->R9; free list: R6

40 Eliminating False Dependencies
How register renaming eliminates false dependencies (a WaR on r2 between the first two instructions, and a WaW on r1 between the first and third):
Before:
addi r1, r2, 1
addi r2, r0, 0
addi r1, r0, 1
After (free list: R7, R8, R9):
addi R7, R5, 1
addi R8, R0, 0
addi R9, R0, 1
All three renamed instructions can now execute in parallel.
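A sketch of the renaming algorithm in Python (the function is mine, and the initial mapping r0->R0, r2->R5 is inferred from the renamed output shown on the slide, so treat both as illustrative):

```python
def rename(instrs, mapping, free_list):
    """Register renaming: each destination gets a fresh physical register
    from the free list, removing WaR and WaW dependences. Illustrative
    model of the mapping-table idea, not cycle-accurate hardware."""
    mapping = dict(mapping)
    free = list(free_list)
    renamed = []
    for op, dst, src, imm in instrs:
        psrc = mapping[src]      # read sources via the mapping table first
        pdst = free.pop(0)       # destination comes from the free list
        mapping[dst] = pdst      # later readers of dst see the new register
        renamed.append((op, pdst, psrc, imm))
    return renamed

# The slide's example: logical regs "rN", physical regs "RN".
code = [("addi", "r1", "r2", 1),
        ("addi", "r2", "r0", 0),
        ("addi", "r1", "r0", 1)]
out = rename(code, {"r0": "R0", "r1": "R1", "r2": "R5"}, ["R7", "R8", "R9"])
# out == [("addi","R7","R5",1), ("addi","R8","R0",0), ("addi","R9","R0",1)]
```

Reading sources before allocating the destination is the crucial ordering: it is what lets instruction 1 still see the old r2 (R5) even though instruction 2 immediately overwrites the r2 mapping.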

41 Limitations of Multiple-Issue Processors
- Available ILP is limited (we're not programming with parallelism in mind)
- Hardware cost:
  - adding more functional units is easy
  - more memory ports and register ports are needed
  - dependency checking needs O(n^2) comparisons
Limitations of VLIW processors:
- loop unrolling increases code size
- unfilled slots waste bits
- a cache miss stalls the pipeline (research topic: scheduling loads)
- binary incompatibility (not in EPIC)

42 Measuring Available ILP: How?
- Using an existing compiler
- Using trace analysis:
  - track all the real data dependencies (RaW) of instructions from the issue window (register and memory dependences)
  - check for correct branch prediction: if the prediction is correct, continue; if wrong, flush the schedule and start in the next cycle

43 Trace Analysis
Program:
for i := 0..2
  A[i] := i;
S := X+3;
Compiled code:
      set  r1,0
      set  r2,3
      set  r3,&A
Loop: st   r1,0(r3)
      add  r1,r1,1
      add  r3,r3,4
      brne r1,r2,Loop
      add  r1,r5,3
Trace (first iteration shown; the loop body and branch repeat three times):
set r1,0
set r2,3
set r3,&A
st r1,0(r3)
add r1,r1,1
add r3,r3,4
brne r1,r2,Loop
...
add r1,r5,3
How parallel can this code be executed?
Explain trace analysis using this trace for different models:
1. no renaming + oracle prediction + unlimited window size
2. full renaming + oracle prediction + unlimited window size (shown on the next slide)
3. full renaming + 2-bit prediction (assuming back-edge taken) + unlimited window size
etc.

44 Trace Analysis
Parallel trace (full renaming + oracle prediction, unlimited window):
cycle 1: set r1,0 ; set r2,3 ; set r3,&A
cycle 2: st r1,0(r3) ; add r1,r1,1 ; add r3,r3,4
cycle 3: st r1,0(r3) ; add r1,r1,1 ; add r3,r3,4 ; brne r1,r2,Loop
cycle 4: st r1,0(r3) ; add r1,r1,1 ; add r3,r3,4 ; brne r1,r2,Loop
cycle 5: brne r1,r2,Loop
cycle 6: add r1,r5,3
Max ILP = speedup = Lserial / Lparallel = 16 / 6 = 2.7
Is this the maximum? Note that with oracle prediction and renaming the last operation, add r1,r5,3, can be put in the first cycle.
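The trace analysis can be reproduced with a tiny ASAP scheduler in Python (an illustrative model of mine, assuming full renaming, oracle prediction, single-cycle latencies, and unlimited resources). Because it honors only true (RaW) dependences, it places the final add r1,r5,3 in cycle 1, confirming the slide's remark: the pure data-flow length is 5 cycles, one less than the 6-cycle schedule shown, which keeps that add after the loop.

```python
def asap_length(trace):
    """ASAP schedule length of a dynamic trace under true (RaW)
    dependences only: full renaming + oracle prediction + unlimited
    window and resources, one-cycle latency per instruction."""
    ready = {}                      # register -> cycle its value is available
    length = 0
    for dests, srcs in trace:
        cycle = max([ready.get(r, 1) for r in srcs], default=1)
        for d in dests:
            ready[d] = cycle + 1    # result usable one cycle later
        length = max(length, cycle)
    return length

# The 16-instruction trace (3 loop iterations) as (dests, sources) pairs:
iteration = [([], ["r1", "r3"]),    # st   r1,0(r3)
             (["r1"], ["r1"]),      # add  r1,r1,1
             (["r3"], ["r3"]),      # add  r3,r3,4
             ([], ["r1", "r2"])]    # brne r1,r2,Loop
trace = ([(["r1"], []), (["r2"], []), (["r3"], [])]   # the three sets
         + iteration * 3
         + [(["r1"], ["r5"])])      # add r1,r5,3 (only depends on r5)
```

Under this model the data-flow-limited speedup is 16 / 5 = 3.2; the serial chain through the r1 increments and the three branches is what fixes the 5-cycle floor.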

45 Ideal Processor
Assumptions for an ideal/perfect processor:
1. Register renaming: infinite number of virtual registers => all register WAW & WAR hazards avoided
2. Branch and jump prediction: perfect => all program instructions available for execution
3. Memory-address alias analysis: addresses are known; a store can be moved before a load provided the addresses are not equal
Also:
- unlimited number of instructions issued per cycle (unlimited resources), and unlimited instruction window
- perfect caches
- 1-cycle latency for all instructions (including FP * and /)
Programs were compiled using the MIPS compiler with the maximum optimization level.

46 Upper Limit to ILP: Ideal Processor
(figure: achievable IPC per benchmark, integer and FP, under the ideal-processor assumptions.)

47 Window Size and Branch Impact
Change from an infinite window: examine up to 2000 instructions and issue at most 64 instructions per cycle.
(figure: IPC per benchmark, integer and FP, for perfect, tournament, BHT(512), profile-based, and no prediction; integer programs reach about 6-12 IPC.)

48 Impact of Limited Renaming Registers
Changes: 2000-instruction window, 64-instruction issue, 8K 2-level predictor (slightly better than the tournament predictor).
(figure: IPC versus the number of renaming registers, up to infinite, for integer and FP programs.)

49 Memory Address Alias Impact
Changes: 2000-instruction window, 64-instruction issue, 8K 2-level predictor, 256 renaming registers.
(figure: IPC under perfect, global/stack-perfect, inspection-based, and no alias analysis; integer programs reach about 4-9 IPC; FP programs (Fortran, no heap) do better.)

50 Window Size Impact
Assumptions: perfect disambiguation, 1K selective predictor, 16-entry return stack, 64 renaming registers, issue as many instructions as the window allows.
(figure: IPC versus instruction window size for integer and FP programs.)

51 How to Exceed the ILP Limits of this Study?
- WAR and WAW hazards through memory: the study eliminated WAW and WAR hazards through register renaming, but not for memory operands
- Unnecessary dependences: the compiler did not unroll loops, so there is an iteration-variable dependence
- Overcoming the data-flow limit: value prediction, i.e., predicting values and speculating on the prediction
- Address value prediction and speculation: predict addresses and speculate by reordering loads and stores; this could also provide better alias analysis

52 Workstation Microprocessors, 3/2001
- Max issue: 4 instructions (many CPUs)
- Max rename registers: 128 (Pentium 4)
- Max BHT: 4K x 9 bits (Alpha 21264B), 16K x 2 bits (UltraSPARC III)
- Max window size (OoO): 126 instructions (Pentium 4)
- Max pipeline: 22/24 stages (Pentium 4)
Source: Microprocessor Report

53 SPEC 2000 Performance, 3/2001
(figure: SPEC 2000 scores per processor; the annotated performance ratios range from 1.2X to 3.8X. Source: Microprocessor Report.)

54 Conclusions
1985-2002: >1000X performance improvement (55% per year).
Hennessy: industry has been following a roadmap of ideas known in 1985 to exploit instruction-level parallelism and (real) Moore's Law to get 1.55X/year: caches, (super)pipelining, superscalar execution, branch prediction, out-of-order execution, trace caches.
After 2002: slowdown to about 20% per year.

55 Conclusions (cont'd)
ILP limits: to make performance progress in the future, we need explicit parallelism from the programmer instead of the implicit parallelism of ILP exploited by compiler/hardware.
Other problems:
- processor-memory performance gap
- VLSI scaling problems (wiring)
- energy/leakage problems
However: other forms of parallelism come to the rescue.

