Presentation is loading. Please wait.

Presentation is loading. Please wait.

Professor Nigel Topham Director, Institute for Computing Systems Architecture School of Informatics Edinburgh University Informatics 3 Computer Architecture.

Similar presentations


Presentation on theme: "Professor Nigel Topham Director, Institute for Computing Systems Architecture School of Informatics Edinburgh University Informatics 3 Computer Architecture."— Presentation transcript:

1 Professor Nigel Topham Director, Institute for Computing Systems Architecture School of Informatics Edinburgh University Informatics 3 Computer Architecture

2 Inf3 Computer Architecture - 2006-2007 2 Terminology  Instruction fetch: fetch instruction from memory  Instruction issue: decode instruction, check for structural hazards, and send to execution units  Instruction execution: execute instruction once dependences are cleared  Instruction completion (or retire or commit): finish instruction and update processor state  Some combinations are possible: –In-order issue, execution and completion –Out-of-order issue and execution and in-order completion –Out-of-order issue, execution and completion

3 Inf3 Computer Architecture - 2006-2007 3 Dynamic Scheduling 1: Scoreboarding  Handles all RAW, WAR, and WAW with proper stalls, but allows independent instructions to proceed  Step 1: Issue (part of original ID stage) –Issue instruction to functional unit iff functional unit is free and no earlier instruction writes to the same destination register (WAW)  Step 2: Read operands (part of original ID stage) –Wait until source registers become available from earlier instructions through register file (RAW)  Step 3: Execute (original EXE stage) –Execute instruction and notify scoreboard when done  Step 4: Write result (original WB stage) –Wait until earlier instructions read operands before writing to register file (WAR)

4 Inf3 Computer Architecture - 2006-2007 4 Scoreboard Organization  Instruction status: either one of the four steps of the instruction operation (Issue, Read Op, Execute, Write)  Functional unit status: –Busy – functional unit is being used –Op – type of operation to be performed (e.g., add, subtract, etc) –F i – destination register –F j, F k – source registers –Q j, Q k – functional units producing F j and F k –R j, R k – flag to indicate when F j and F k are ready (set to No after operand is read)  Register result status: indicate which functional unit will write the register next (one per register)

5 Inf3 Computer Architecture - 2006-2007 5  Instruction sequence:  Latencies: –Integer  1 cycle –FP add  2 cycles –FP multiply  10 cycles –FP divide  40 cycles  Functional units: 1 integer (also for ld/st), 1 FP adder, 2 FP multipliers, 1 FP divider L.D F6, 34(R2) L.D F2, 45(R3) MUL.D F0, F2, F4 SUB.D F8, F6, F2 DIV.D F10, F0, F6 ADD.D F6, F8, F2 Scoreboard Example

6 Inf3 Computer Architecture - 2006-2007 6 Instruction L.D F6, 34(R2) L.D F2, 45(R3) MUL.D F0, F2, F4 SUB.D F8, F6, F2 DIV.D F10, F0, F6 ADD.D F6, F8, F2 IssueRead OpExecuteWrite BusyOpFiFjName Integer Mult1 Mult2 Add Divide FkQjQkRjRk F0F2F4F6F8F10F12…F30 cycle 1  YLoadF6R2Y Integer Scoreboard Example

7 Inf3 Computer Architecture - 2006-2007 7 cycle 3cycle 4cycle 2 Scoreboard Example Instruction L.D F6, 34(R2) L.D F2, 45(R3) MUL.D F0, F2, F4 SUB.D F8, F6, F2 DIV.D F10, F0, F6 ADD.D F6, F8, F2 IssueRead OpExecuteWrite BusyOpFiFjName Integer Mult1 Mult2 Add Divide FkQjQkRjRk F0F2F4F6F8F10F12…F30  YLoadF6R2N Integer   MultYF0F2F4NY Mult1  Integer  SubYF8F6F2NN Add   YDivF10F0F6Mult1NIntegerN Divide

8 Inf3 Computer Architecture - 2006-2007 8 Scoreboard Example Instruction L.D F6, 34(R2) L.D F2, 45(R3) MUL.D F0, F2, F4 SUB.D F8, F6, F2 DIV.D F10, F0, F6 ADD.D F6, F8, F2 IssueRead OpExecuteWrite BusyOpFiFjName Integer Mult1 Mult2 Add Divide FkQjQkRjRk F0F2F4F6F8F10F12…F30 cycle 5  YLoadF2R3Y Integer   MultYF0F2F4NY Mult1   SubYF8F6F2YN Add   YDivF10F0F6Mult1NY Divide  Integer

9 Inf3 Computer Architecture - 2006-2007 9 cycle 6cycle 7cycle 8 Scoreboard Example Instruction L.D F6, 34(R2) L.D F2, 45(R3) MUL.D F0, F2, F4 SUB.D F8, F6, F2 DIV.D F10, F0, F6 ADD.D F6, F8, F2 IssueRead OpExecuteWrite BusyOpFiFjName Integer Mult1 Mult2 Add Divide FkQjQkRjRk F0F2F4F6F8F10F12…F30  YLoadF2R3N Integer   MultYF0F2F4NY Mult1   SubYF8F6F2YN Add   YDivF10F0F6Mult1NY Divide  Integer  H&P A.52

10 Inf3 Computer Architecture - 2006-2007 10 Scoreboard Example Instruction L.D F6, 34(R2) L.D F2, 45(R3) MUL.D F0, F2, F4 SUB.D F8, F6, F2 DIV.D F10, F0, F6 ADD.D F6, F8, F2 IssueRead OpExecuteWrite BusyOpFiFjName Integer Mult1 Mult2 Add Divide FkQjQkRjRk F0F2F4F6F8F10F12…F30 cycle 9   MultYF0F2F4NN Mult1   SubYF8F6F2NN Add   YDivF10F0F6Mult1NY Divide    cycle 10  cycle 11  cycle 12

11 Inf3 Computer Architecture - 2006-2007 11 Scoreboard Example Instruction L.D F6, 34(R2) L.D F2, 45(R3) MUL.D F0, F2, F4 SUB.D F8, F6, F2 DIV.D F10, F0, F6 ADD.D F6, F8, F2 IssueRead OpExecuteWrite BusyOpFiFjName Integer Mult1 Mult2 Add Divide FkQjQkRjRk F0F2F4F6F8F10F12…F30 cycle 13   MultYF0F2F4NN Mult1   AddYF6F8F2YY Add   YDivF10F0F6Mult1NY Divide    

12 Inf3 Computer Architecture - 2006-2007 12 cycle 14cycle 15cycle 16cycle 17cycle 18cycle 19cycle 20 Scoreboard Example Instruction L.D F6, 34(R2) L.D F2, 45(R3) MUL.D F0, F2, F4 SUB.D F8, F6, F2 DIV.D F10, F0, F6 ADD.D F6, F8, F2 IssueRead OpExecuteWrite BusyOpFiFjName Integer Mult1 Mult2 Add Divide FkQjQkRjRk F0F2F4F6F8F10F12…F30   MultYF0F2F4NN Mult1   AddYF6F8F2NN Add   YDivF10F0F6Mult1NY Divide      H&P A.53 

13 Inf3 Computer Architecture - 2006-2007 13 Scoreboard Example Instruction L.D F6, 34(R2) L.D F2, 45(R3) MUL.D F0, F2, F4 SUB.D F8, F6, F2 DIV.D F10, F0, F6 ADD.D F6, F8, F2 IssueRead OpExecuteWrite BusyOpFiFjName Integer Mult1 Mult2 Add Divide FkQjQkRjRk F0F2F4F6F8F10F12…F30 cycle 21     AddYF6F8F2NN Add   YDivF10F0F6NN Divide       cycle 22 

14 Inf3 Computer Architecture - 2006-2007 14 Scoreboard Example Instruction L.D F6, 34(R2) L.D F2, 45(R3) MUL.D F0, F2, F4 SUB.D F8, F6, F2 DIV.D F10, F0, F6 ADD.D F6, F8, F2 IssueRead OpExecuteWrite BusyOpFiFjName Integer Mult1 Mult2 Add Divide FkQjQkRjRk F0F2F4F6F8F10F12…F30 cycle 23       YDivF10F0F6NN Divide      H&P A.54    cycle 62 

15 Inf3 Computer Architecture - 2006-2007 15 Scoreboard Summary  Dynamically schedules instructions  Forces instructions to wait on RAW, WAR, WAW dependences and structural hazards  First used in the CDC 6600 in 1964 and yielded performance improvements of 1.7 to 2.5 times  Hardware cost (size) of scoreboard equivalent to one of the functional units  Many more buses required to move around operands, results, and instructions

16 Inf3 Computer Architecture - 2006-2007 16 Dynamic Scheduling 2: Tomasulo’s Algorithm  Handles RAW with proper stalls and eliminates WAR and WAW through register renaming  Step 1: Issue –Issue instruction to the reservation stations if there is a free reservation station –Read operands if available or rename operands if pending (WAR and WAW)  Step 2: Execute –Execute instruction when both operands are ready in the reservation station (RAW)  Step 3: Write result –Send results to register file and to all reservation stations requiring the values (RAW)  Results are communicated through common data bus (CDB) (forwarding)

17 Inf3 Computer Architecture - 2006-2007 17 Address unit IBM S/360 model 91 used Tomasulo’s Algorithm  Dynamic O-O-O execution  Tags used to name flow dependencies  5 reservation stations  6 load buffers  Issue instructions to reservation stations, load buffers and store buffers  Instructions wait in reservation stations or store buffers until all their operands are collected  Functional units broadcast result and tag on the Common Data Bus (CDB) for all reservation stations, store buffers and FP registers to pick up. FP addersFP multipliersMemory unit Address unit From instruction fetch unit Instruction Queue FP registers Store buffers Load buffers Reservation stations 1 2 3 4 5 6 11...... ld f1, 4(r1) st f4, 8(r2) mul f3, f1, f2 add f4, f5, f3

18 Inf3 Computer Architecture - 2006-2007 18 Data Structures for Tomasulo’s Algorithm  Reservation station (RS) fields: OpThe operation to be performed Qj, QkTags to identify the producing RS of each source operand Vj, VkActual values of the two source operands BusyIndicates if RS contains a valid instruction  Register file fields: QiTag to identify producing RS of next value for this register  Load buffers: AThe computed address from which to load  Store buffers: AThe computed address to which a store will take place QjTag to identify the producing RS of the store data

19 Inf3 Computer Architecture - 2006-2007 19 Operation of Tomasulo’s Algorithm  Instruction Issue: Get next instruction from head of the issue queue If reservation station RS is available then: For each p in { i, j } representing operand register u If Reg[u].Qi == 0 then RS.Vp = Reg[u].value // value ready now If Reg[u].Qi != 0 then RS.Qp = Reg[u].Qi // value not yet ready RS.Busy = 1 // reserve this RS RS.Op = instruction opcode // set the operation  Execution: Wait until (RS.Qj == 0) and (RS.Qk == 0), and whilst waiting: For each p in { i, j } If CDB.tag == RS.Qp then { RS.Vp = CDB.value; RS.Qp = 0 } When (RS.Qj == 0) and (RS.Qk == 0), perform operation in RS.Op  Write Result: When CDB is free, broadcast CDB = { tag = RS.id, value = RS.result } and clear RS.Busy

20 Inf3 Computer Architecture - 2006-2007 20 Tomasulo’s Example Instruction L.D F6, 34(R2) L.D F2, 45(R3) MUL.D F0, F2, F4 SUB.D F8, F6, F2 DIV.D F10, F0, F6 ADD.D F6, F8, F2 IssueExecuteWrite BusyOpVjName Add_1 Add_2 Add_3 Mult_1 Mult_2 VkQjQk F0F2F4F6F8F10F12…F30 cycle 1  Load_1  YMultR[F4]  Load_1 Mult_1 cycle 2cycle 3   YSubM[34+R[R2]]Load_1 Add_1

21 Inf3 Computer Architecture - 2006-2007 21 Tomasulo’s Example Instruction L.D F6, 34(R2) L.D F2, 45(R3) MUL.D F0, F2, F4 SUB.D F8, F6, F2 DIV.D F10, F0, F6 ADD.D F6, F8, F2 IssueExecuteWrite BusyOpVjName Add_1 Add_2 Add_3 Mult_1 Mult_2 VkQjQk F0F2F4F6F8F10F12…F30 cycle 4  Load_1  YMultR[F4]  Load_1 Mult_1   YSubM[34+R[R2]]Load_1 Add_1   YDivMult_1R[F6] Mult_2 cycle 5cycle 6   YAddM[45+R[R3]]Add_1 Add_2

22 Inf3 Computer Architecture - 2006-2007 22 Tomasulo’s Example Instruction L.D F6, 34(R2) L.D F2, 45(R3) MUL.D F0, F2, F4 SUB.D F8, F6, F2 DIV.D F10, F0, F6 ADD.D F6, F8, F2 IssueExecuteWrite BusyOpVjName Add_1 Add_2 Add_3 Mult_1 Mult_2 VkQjQk F0F2F4F6F8F10F12…F30 cycle 7  YMultR[F4]  Load_1 Mult_1   YSubM[34+R[R2]]Load_1 Add_1   YDivMult_1R[F6] Mult_2   YAddM[45+R[R3]]Add_1 Add_2  cycle 8 

23 Inf3 Computer Architecture - 2006-2007 23 Tomasulo’s Example Instruction L.D F6, 34(R2) L.D F2, 45(R3) MUL.D F0, F2, F4 SUB.D F8, F6, F2 DIV.D F10, F0, F6 ADD.D F6, F8, F2 IssueExecuteWrite BusyOpVjName Add_1 Add_2 Add_3 Mult_1 Mult_2 VkQjQk F0F2F4F6F8F10F12…F30 cycle 9  YMultR[F4]  Load_1 Mult_1     YDivMult_1R[F6] Mult_2   YAddM[45+R[R3]]Add_1 Add_2  cycle 10  cycle 11

24 Inf3 Computer Architecture - 2006-2007 24 Tomasulo’s Example Instruction L.D F6, 34(R2) L.D F2, 45(R3) MUL.D F0, F2, F4 SUB.D F8, F6, F2 DIV.D F10, F0, F6 ADD.D F6, F8, F2 IssueExecuteWrite BusyOpVjName Add_1 Add_2 Add_3 Mult_1 Mult_2 VkQjQk F0F2F4F6F8F10F12…F30 cycle 12  YMultR[F4]  Load_1 Mult_1     YDivMult_1R[F6] Mult_2     cycle 15  cycle 16 

25 Inf3 Computer Architecture - 2006-2007 25 Tomasulo’s Example Instruction L.D F6, 34(R2) L.D F2, 45(R3) MUL.D F0, F2, F4 SUB.D F8, F6, F2 DIV.D F10, F0, F6 ADD.D F6, F8, F2 IssueExecuteWrite BusyOpVjName Add_1 Add_2 Add_3 Mult_1 Mult_2 VkQjQk F0F2F4F6F8F10F12…F30 cycle 55       YDivMult_1R[F6] Mult_2      

26 Inf3 Computer Architecture - 2006-2007 26 Tomasulo’s Advantages  Register renaming: –Q j and Q k can come from any reservation station independent of the register file  in fact we could have many more reservation stations than registers –V j and V k store the actual value to be used  Parallel release of all instructions dependent as soon as the earlier instruction completes (both SUB.D and ADD.D get the value from Load_1 )  No need to wait on WAR and WAW (notice that ADD.D has issued before DIV.D has read its f6 operand and will execute as soon as the SUB.D finishes)  Dynamic unrolling of loops in hardware with dynamic register renaming


Download ppt "Professor Nigel Topham Director, Institute for Computing Systems Architecture School of Informatics Edinburgh University Informatics 3 Computer Architecture."

Similar presentations


Ads by Google