Published by Alfred Boast. Modified over 9 years ago.
2
Instruction-level Parallelism
3
Compiler Perspectives on Code Movement
Dependences are a property of the code; whether a dependence becomes a HW hazard depends on the given pipeline.
The compiler must respect (true) data dependences (RAW). Instruction j is data dependent on instruction i if either:
    Instruction i produces a result used by instruction j, or
    Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i.
4
Compiler Perspectives on Code Movement
The other kinds of dependence are called name (false) dependences: two instructions use the same name but do not exchange data.
    Antidependence (WAR dependence): instruction j writes a register or memory location that instruction i reads, and instruction i is executed first.
    Output dependence (WAW dependence): instruction i and instruction j write the same register or memory location; the ordering between the instructions must be preserved.
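The three dependence kinds can be told apart by comparing the sets of names each instruction reads and writes. The following is a minimal sketch, not part of the original slides: the function name `dependences` and the (writes, reads) encoding are assumptions made for illustration, and the sample instructions are DLX-style examples.

```python
def dependences(first, second):
    """Classify dependences between two instructions.

    Each instruction is a (writes, reads) pair of sets of register names.
    `first` is assumed to come earlier in program order than `second`.
    """
    writes_1, reads_1 = first
    writes_2, reads_2 = second
    found = set()
    if writes_1 & reads_2:
        found.add("RAW")   # true data dependence
    if reads_1 & writes_2:
        found.add("WAR")   # antidependence (name dependence)
    if writes_1 & writes_2:
        found.add("WAW")   # output dependence (name dependence)
    return found


# LD F0, 0(R1) followed by MULD F4, F0, F2: MULD reads the loaded F0.
ld = ({"F0"}, {"R1"})
muld = ({"F4"}, {"F0", "F2"})
print(dependences(ld, muld))

# SD 0(R1), F4 followed by ADDI R1, R1, 8: ADDI overwrites R1 after SD reads it.
sd = (set(), {"F4", "R1"})
addi = ({"R1"}, {"R1"})
print(dependences(sd, addi))

# Two MULD F4, F0, F2 instructions: both write F4.
print(dependences(muld, muld))
```

Only the RAW case carries data between the instructions; the other two disappear if one of the instructions is given a different destination name.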
5
Control Dependence
Example:
    if (c1) I1;
    if (c2) I2;
I1 is control dependent on c1, and I2 is control dependent on c2 but not on c1.
6
A sample loop
    Loop: LD    F0, 0(R1)     ; F0 = array element, R1 = X[]
          MULD  F4, F0, F2    ; multiply by scalar in F2
          SD    F4, 0(R1)     ; store result
          ADDI  R1, R1, 8     ; increment pointer 8B (DW)
          SEQ   R3, R1, R2    ; R2 = &X[1001]
          BNEZ  R3, Loop      ; branch R3 != zero
          NOP                 ; delayed branch slot

    Operation   Latency (stalls)
    FP Mult     6 (5)
    LD          2 (1)
    Int ALU     1 (0)

Where are the dependencies and stalls?
7
Instruction Scheduling
    Loop: LD    F0, 0(R1)
          MULD  F4, F0, F2
          SD    0(R1), F4
          ADDI  R1, R1, 8
          SEQ   R3, R1, R2
          BNEZ  R3, Loop
          NOP
Number of cycles per iteration?
8
Instruction Scheduling
Before:
    Loop: LD    F0, 0(R1)
          MULD  F4, F0, F2
          SD    0(R1), F4
          ADDI  R1, R1, 8
          SEQ   R3, R1, R2
          BNEZ  R3, Loop
          NOP
After:
    Loop: LD    F0, 0(R1)
          ADDI  R1, R1, 8
          MULD  F4, F0, F2
          SEQ   R3, R1, R2
          BNEZ  R3, Loop
          SD    -8(R1), F4
Cycles/iteration?
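One way to answer the cycle-count question is to replay each schedule through a simple timing model. The sketch below is an illustration under stated assumptions, not the slides' own method: a single-issue, in-order pipeline whose only stalls are RAW stalls, with the latencies from the sample-loop slide (FP Mult 6, LD 2, Int ALU 1). The function name `cycles` and the tuple encoding are hypothetical.

```python
# Result latencies taken from the sample-loop slide; ADDI and SEQ are "ALU".
LATENCY = {"LD": 2, "MULD": 6, "ALU": 1}


def cycles(schedule):
    """Count cycles for one pass over `schedule`.

    Each entry is (kind, dest, sources). Assumes single issue, in-order
    execution, and that an instruction stalls only until its sources are ready.
    """
    ready = {}    # register -> first cycle in which a consumer may issue
    issue = -1    # cycle in which the previous instruction issued
    for kind, dest, sources in schedule:
        earliest = issue + 1
        for reg in sources:
            earliest = max(earliest, ready.get(reg, 0))
        issue = earliest
        if dest is not None:
            ready[dest] = issue + LATENCY[kind]
    return issue + 1


original = [
    ("LD", "F0", ["R1"]),
    ("MULD", "F4", ["F0", "F2"]),
    ("SD", None, ["F4", "R1"]),
    ("ALU", "R1", ["R1"]),          # ADDI R1, R1, 8
    ("ALU", "R3", ["R1", "R2"]),    # SEQ R3, R1, R2
    ("BNEZ", None, ["R3"]),
    ("NOP", None, []),
]

scheduled = [
    ("LD", "F0", ["R1"]),
    ("ALU", "R1", ["R1"]),
    ("MULD", "F4", ["F0", "F2"]),
    ("ALU", "R3", ["R1", "R2"]),
    ("BNEZ", None, ["R3"]),
    ("SD", None, ["F4", "R1"]),     # fills the branch delay slot
]

print(cycles(original))    # 13 under this model
print(cycles(scheduled))   # 9 under this model
```

Under this model the remaining stalls in the scheduled loop all come from the store waiting for the multiply, which is what motivates overlapping several iterations.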
9
Loop Unrolling
    Loop: LD    F0, 0(R1)
          ADDI  R1, R1, 8
          MULD  F4, F0, F2
          SEQ   R3, R1, R2
          BNEZ  R3, Loop
          SD    -8(R1), F4
Can extract more parallelism
10
Loop Unrolling
Before:
    Loop: LD    F0, 0(R1)
          ADDI  R1, R1, 8
          MULD  F4, F0, F2
          SEQ   R3, R1, R2
          BNEZ  R3, Loop
          SD    -8(R1), F4
After:
    Loop: LD    F0, 0(R1)
          ADDI  R1, R1, 8
          MULD  F4, F0, F2
          SEQ   R3, R1, R2
          BNEZ  R3, Loop
          SD    -8(R1), F4
          LD    F0, 0(R1)
          ADDI  R1, R1, 8
          MULD  F4, F0, F2
          SEQ   R3, R1, R2
          BNEZ  R3, Loop
          SD    -8(R1), F4
What is the problem here?
11
Loop Unrolling
Before:
    Loop: LD    F0, 0(R1)
          ADDI  R1, R1, 8
          MULD  F4, F0, F2
          SEQ   R3, R1, R2
          BNEZ  R3, Loop
          SD    -8(R1), F4
After:
    Loop: LD    F0, 0(R1)
          ADDI  R1, R1, 8
          MULD  F4, F0, F2
          SEQ   R3, R1, R2
          BNEZ  R3, Loop
          SD    -8(R1), F4
          LD    F0, 0(R1)
          ADDI  R1, R1, 8
          MULD  F4, F0, F2
          SEQ   R3, R1, R2
          BNEZ  R3, Loop
          SD    -8(R1), F4
Unnecessary instructions and redundant instructions
12
Loop Unrolling
Before:
    Loop: LD    F0, 0(R1)
          ADDI  R1, R1, 8
          MULD  F4, F0, F2
          SEQ   R3, R1, R2
          BNEZ  R3, Loop
          SD    -8(R1), F4
After:
    Loop: LD    F0, 0(R1)
          MULD  F4, F0, F2
          SD    0(R1), F4
          LD    F0, 8(R1)
          ADDI  R1, R1, 16
          MULD  F4, F0, F2
          SEQ   R3, R1, R2
          BNEZ  R3, Loop
          SD    -8(R1), F4
Still problems with scheduling?
13
Register Renaming
Before:
    Loop: LD    F0, 0(R1)
          ADDI  R1, R1, 8
          MULD  F4, F0, F2
          SEQ   R3, R1, R2
          BNEZ  R3, Loop
          SD    -8(R1), F4
After:
    Loop: LD    F0, 0(R1)
          MULD  F4, F0, F2
          SD    0(R1), F4
          LD    F10, 8(R1)
          ADDI  R1, R1, 16
          MULD  F14, F10, F2
          SEQ   R3, R1, R2
          BNEZ  R3, Loop
          SD    -8(R1), F14
Let’s schedule now
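The renaming step can be sketched as a single pass that gives every repeated definition of a register a fresh name and rewrites later uses to match. This is an illustrative sketch only: the function `rename`, the tuple encoding, and the caller-supplied pool of fresh names are assumptions, and it applies to straight-line code whose renamed values are not needed under their old names afterwards.

```python
def rename(body, fresh):
    """Rename repeated register definitions in straight-line code.

    `body` is a list of (op, dest, sources); `fresh` is a list of unused
    register names (supplied by the caller in this sketch).
    """
    fresh = iter(fresh)
    current = {}       # original name -> name that now holds its latest value
    defined = set()    # original names that have already been written once
    renamed = []
    for op, dest, sources in body:
        # Uses read whichever name holds the most recent definition.
        sources = [current.get(reg, reg) for reg in sources]
        if dest is not None:
            if dest in defined:
                current[dest] = next(fresh)   # second definition: new name
            defined.add(dest)
            dest = current.get(dest, dest)
        renamed.append((op, dest, sources))
    return renamed


# FP part of the unrolled loop body, both copies still using F0 and F4.
unrolled = [
    ("LD", "F0", ["R1"]),
    ("MULD", "F4", ["F0", "F2"]),
    ("SD", None, ["F4", "R1"]),
    ("LD", "F0", ["R1"]),
    ("MULD", "F4", ["F0", "F2"]),
    ("SD", None, ["F4", "R1"]),
]

for instruction in rename(unrolled, ["F10", "F14"]):
    print(instruction)
```

With the pool `["F10", "F14"]` the second copy ends up using F10 and F14, so the two copies no longer share any FP register except the read-only scalar F2.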
14
Register Renaming
Before:
    Loop: LD    F0, 0(R1)
          ADDI  R1, R1, 8
          MULD  F4, F0, F2
          SEQ   R3, R1, R2
          BNEZ  R3, Loop
          SD    -8(R1), F4
After:
    Loop: LD    F0, 0(R1)
          LD    F10, 8(R1)
          MULD  F4, F0, F2
          MULD  F14, F10, F2
          ADDI  R1, R1, 16
          SEQ   R3, R1, R2
          SD    -16(R1), F4    ; offset adjusted because the store now follows the ADDI
          BNEZ  R3, Loop
          SD    -8(R1), F14
Cycles/iteration?
15
How easy is it to determine dependences?
Easy to determine for registers (fixed names)
Hard for memory:
    Does 100(R4) = 20(R6)?
    From different loop iterations, does 20(R6) = 20(R6)?
    Another example:
        ST R5, R6
        LD R4, R3
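The difficulty can be made concrete with a conservative address comparison. The sketch below is an assumption-laden illustration, not an algorithm from the slides: references are (base register, offset) pairs, every access is `size` bytes wide, and the base register is assumed not to change between the two references. The name `may_alias` is hypothetical.

```python
def may_alias(ref_a, ref_b, size=8):
    """Compare two base+offset memory references.

    Returns "yes" or "no" when the answer follows from the offsets alone,
    and "unknown" when the base registers differ, because their run-time
    values are not visible here.
    """
    base_a, offset_a = ref_a
    base_b, offset_b = ref_b
    if base_a != base_b:
        return "unknown"
    # Same base register: the accesses overlap only if the offsets are
    # closer together than the access width.
    return "yes" if abs(offset_a - offset_b) < size else "no"


print(may_alias(("R4", 100), ("R6", 20)))   # unknown
print(may_alias(("R1", 0), ("R1", 8)))      # no
print(may_alias(("R1", -8), ("R1", -8)))    # yes
```

A compiler has to treat every "unknown" as a possible dependence, which is why the 100(R4) versus 20(R6) case blocks reordering even when the addresses never actually match.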
16
Memory Disambiguation
Problem: in many cases, it is likely but not certain that two memory instructions reference different addresses
Disambiguation is much harder in languages with pointers
Example:
    void annoy_compiler1(char *foo, char *bar) {
        foo[2] = bar[2];
        bar[3] = foo[3];
    }
Memory references are independent unless foo and bar alias (point into overlapping storage)
17
Disambiguation 2
Making things worse, some programs have independent memory references only some of the time
Example:
    void annoy_compiler2(int *a, int *b) {
        int I;
        for (I = 0; I < 256; I++) {
            a[I] = b[f(I)];
        }
    }
A conventional compiler must assume that any references that could be to the same location are to the same location, and serialize them
18
HW Schemes: Instruction Parallelism
Why in HW at run time?
    Works when dependences cannot be known until run time
        Variable latency
        Control-dependent data dependence
    Can schedule differently every time through the code
    Compiler is simpler
    Code for one machine runs well on another
Hardware techniques to find/extract ILP
    Tomasulo’s Algorithm for Out-of-order Execution
19
Tomasulo’s Algorithm
Developed for the architecture of the IBM 360/91 (1967)
The 360/91 system’s goal was to significantly improve performance (especially floating-point) without requiring people to change their code
    Sound familiar?
16 MHz, 2 MB memory, 50X faster than SOA
20
Tomasulo Organization
21
Tomasulo Algorithm
Consider three input instructions
Common Data Bus (CDB) broadcasts results to all FUs
RSs (FUs), registers, etc. are responsible for collecting their own data off the CDB
Load and Store Queues are treated as FUs as well
22
Reservation Station Components
    Op: operation to perform in the unit (e.g., + or –)
    Qj, Qk: reservation stations producing the source registers
    Vj, Vk: values of the source operands
    Rj, Rk: flags indicating when Vj, Vk are ready
    Busy: indicates the reservation station is busy
Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.
23
Three Stages of Tomasulo Algorithm
1. Issue: get an instruction from the FP Op Queue
    If a reservation station is free, issue the instruction and send its operands (renames registers).
2. Execution: operate on operands (EX)
    When both operands are ready, execute; if not ready, watch the CDB for the result.
3. Write result: finish execution (WB)
    Write on the Common Data Bus to all waiting units; mark the reservation station available.
24
Tomasulo Example
    ADDD F4, F2, F0
    MULD F8, F4, F2
    ADDD F6, F8, F6
    SUBD F8, F2, F0
    ADDD F2, F8, F0
Multiply takes 10 clocks; add/sub take 4.
25
Tomasulo – cycle 0
Instruction Queue: ADDD F4, F2, F0; MULD F8, F4, F2; ADDD F6, F8, F6; SUBD F8, F2, F0; ADDD F2, F8, F0
Registers: F0 = 0.0, F2 = 2.0, F4 = 4.0, F6 = 6.0, F8 = 8.0
FP adders (stations 1-3): empty
FP mult’s (stations 1-2): empty
26
Tomasulo – cycle 1
Instruction Queue: MULD F8, F4, F2; ADDD F6, F8, F6; SUBD F8, F2, F0; ADDD F2, F8, F0
Registers: F0 = 0.0, F2 = 2.0, F4 = 4.0 (add1), F6 = 6.0, F8 = 8.0
FP adders: add1: ADDD 2.0, 0.0
FP mult’s: empty
27
Tomasulo – cycle 2
Instruction Queue: ADDD F6, F8, F6; SUBD F8, F2, F0; ADDD F2, F8, F0
Registers: F0 = 0.0, F2 = 2.0, F4 = 4.0 (add1), F6 = 6.0, F8 = 8.0 (mult1)
FP adders: add1: ADDD 2.0, 0.0
FP mult’s: mult1: MULD add1, 2.0
Reservation station fields for mult1: Op = MULD, Qj = add1, Qk = -, Vj = -, Vk = 2.0, Busy = Y
29
Tomasulo – cycle 3
Instruction Queue: SUBD F8, F2, F0
Registers: F0 = 0.0, F2 = 2.0, F4 = 4.0 (add1), F6 = 6.0 (add2), F8 = 8.0 (mult1)
FP adders: add1: ADDD 2.0, 0.0; add2: ADDD mult1, 6.0
FP mult’s: mult1: MULD add1, 2.0
30
Tomasulo – cycle 4
Registers: F0 = 0.0, F2 = 2.0, F4 = 4.0 (add1), F6 = 6.0 (add2), F8 = 8.0 (add3)
FP adders: add1: ADDD 2.0, 0.0; add2: ADDD mult1, 6.0; add3: SUBD 2.0, 0.0
FP mult’s: mult1: MULD add1, 2.0
31
Tomasulo – cycle 5
Broadcast: 2.0 (add1 result), written to F4 and captured by mult1
Registers: F0 = 0.0, F2 = 2.0, F4 = 2.0, F6 = 6.0 (add2), F8 = 8.0 (add3)
FP adders: add1: ADDD 2.0, 0.0; add2: ADDD mult1, 6.0; add3: SUBD 2.0, 0.0
FP mult’s: mult1: MULD 2.0, 2.0
32
Tomasulo – cycle 6
Registers: F0 = 0.0, F2 = 2.0 (add1), F4 = 2.0, F6 = 6.0 (add2), F8 = 8.0 (add3)
FP adders: add1: ADDD add3, 0.0; add2: ADDD mult1, 6.0; add3: SUBD 2.0, 0.0
FP mult’s: mult1: MULD 2.0, 2.0
33
Tomasulo – cycle 8
Broadcast: 2.0 (add3 result), written to F8 and captured by add1
Registers: F0 = 0.0, F2 = 2.0 (add1), F4 = 2.0, F6 = 6.0 (add2), F8 = 2.0
FP adders: add1: ADDD 2.0, 0.0; add2: ADDD mult1, 6.0; add3: SUBD 2.0, 0.0
FP mult’s: mult1: MULD 2.0, 2.0
34
Tomasulo – cycle 9
Registers: F0 = 0.0, F2 = 2.0 (add1), F4 = 2.0, F6 = 6.0 (add2), F8 = 2.0
FP adders: add1: ADDD 2.0, 0.0; add2: ADDD mult1, 6.0
FP mult’s: mult1: MULD 2.0, 2.0
35
Tomasulo – cycle 12
Broadcast: 2.0 (add1 result), written to F2
Registers: F0 = 0.0, F2 = 2.0, F4 = 2.0, F6 = 6.0 (add2), F8 = 2.0
FP adders: add1: ADDD 2.0, 0.0; add2: ADDD mult1, 6.0
FP mult’s: mult1: MULD 2.0, 2.0
36
Tomasulo – cycle 15
Broadcast: 4.0 (mult1 result), captured by add2
Registers: F0 = 0.0, F2 = 2.0, F4 = 2.0, F6 = 6.0 (add2), F8 = 2.0
FP adders: add2: ADDD 4.0, 6.0
FP mult’s: mult1: MULD 2.0, 2.0
37
Tomasulo – cycle 16
Registers: F0 = 0.0, F2 = 2.0, F4 = 2.0, F6 = 6.0 (add2), F8 = 2.0
FP adders: add2: ADDD 4.0, 6.0
FP mult’s: empty
38
Tomasulo – cycle 19
Broadcast: 10.0 (add2 result), written to F6
Registers: F0 = 0.0, F2 = 2.0, F4 = 2.0, F6 = 10.0, F8 = 2.0
FP adders: add2: ADDD 4.0, 6.0
FP mult’s: empty
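The walk-through above can be replayed with a small simulator. This is a minimal sketch under stated assumptions rather than a model of the 360/91: one instruction issues per cycle in program order, a station freed in a cycle can be reused only from the next cycle, execution takes the latencies from the example (multiply 10, add/sub 4) counted from the cycle in which the last operand arrives, and loads, stores and CDB contention are ignored. The names `simulate`, `LATENCY` and `APPLY` are hypothetical.

```python
LATENCY = {"ADDD": 4, "SUBD": 4, "MULD": 10}   # clocks, from the example slide
APPLY = {
    "ADDD": lambda a, b: a + b,
    "SUBD": lambda a, b: a - b,
    "MULD": lambda a, b: a * b,
}


def simulate(program, regs, n_add=3, n_mult=2):
    """Replay `program` and return (broadcasts, final register values)."""
    regs = dict(regs)
    status = {}   # register -> tag of the station that will write it
    stations = {f"add{i}": None for i in range(1, n_add + 1)}
    stations.update({f"mult{i}": None for i in range(1, n_mult + 1)})
    queue = list(program)
    broadcasts = []   # (cycle, station tag, value) for each CDB write
    cycle = 0
    while queue or any(rs is not None for rs in stations.values()):
        cycle += 1
        # Issue: take the head of the queue if a matching station is free.
        if queue:
            op, dest, src_j, src_k = queue[0]
            kind = "mult" if op == "MULD" else "add"
            free = [t for t, rs in stations.items()
                    if rs is None and t.startswith(kind)]
            if free:
                queue.pop(0)
                q = [status.get(src_j), status.get(src_k)]
                v = [regs[src] if tag is None else None
                     for src, tag in zip((src_j, src_k), q)]
                finish = cycle + LATENCY[op] if q == [None, None] else None
                stations[free[0]] = {"op": op, "q": q, "v": v, "finish": finish}
                status[dest] = free[0]   # renaming: dest now names a station
        # Write result: finished stations broadcast on the CDB.
        for tag, rs in list(stations.items()):
            if rs is None or rs["finish"] != cycle:
                continue
            value = APPLY[rs["op"]](*rs["v"])
            broadcasts.append((cycle, tag, value))
            stations[tag] = None
            for reg in [r for r, t in status.items() if t == tag]:
                regs[reg] = value
                del status[reg]
            for other in stations.values():
                if other is None:
                    continue
                for i in (0, 1):
                    if other["q"][i] == tag:
                        other["q"][i], other["v"][i] = None, value
                if other["finish"] is None and other["q"] == [None, None]:
                    other["finish"] = cycle + LATENCY[other["op"]]
    return broadcasts, regs


program = [
    ("ADDD", "F4", "F2", "F0"),
    ("MULD", "F8", "F4", "F2"),
    ("ADDD", "F6", "F8", "F6"),
    ("SUBD", "F8", "F2", "F0"),
    ("ADDD", "F2", "F8", "F0"),
]
initial = {"F0": 0.0, "F2": 2.0, "F4": 4.0, "F6": 6.0, "F8": 8.0}
broadcasts, final = simulate(program, initial)
for event in broadcasts:
    print(event)
print(final)
```

Under these assumptions the broadcasts fall in cycles 5, 8, 12, 15 and 19, matching the slides. The mult1 result in cycle 15 is not written to F8, because the later SUBD had already claimed F8; this is how renaming removes the WAW hazard.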
39
Tomasulo Summary
Prevents the register file from becoming a bottleneck
Avoids WAR and WAW hazards
Lasting contributions
    Dynamic scheduling
    Register renaming (in what way does the register name change?)
    Load/store disambiguation
40
Limitations
Exceptions/interrupts
    Can’t identify a particular point in the program at which an interrupt/exception occurs
    How do you know where to go back to after an interrupt handler completes?
    OOO completion???
Interaction with pipelined ALUs
    A reservation station cannot be released until its instruction completes, so many reservation stations would be needed.