1
CSC 4250 Computer Architectures
October 17, 2006
Chapter 3. Instruction-Level Parallelism & Its Dynamic Exploitation
2
MIPS FP Unit using Tomasulo’s Algorithm
3
MIPS Processor with Scoreboard
4
Three Steps in Execution for Tomasulo’s Algorithm
1. Issue: if no structural hazards
2. Execute: if both operands are available
3. Write result: on the CDB (common data bus), and from there into the reservation stations waiting for the result

Recall that for Scoreboard: Four Steps in Execution
1. Issue: if no structural or WAW hazards
2. Read operands: if no RAW hazards
3. Execute: if both operands are received
4. Write result: if no WAR hazards
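The three Tomasulo steps above can be sketched in a few lines of Python. This is a minimal illustration, not the hardware design: the `Station` class and the functions `issue`, `can_execute`, and `write_result` are hypothetical names, all stations are assumed to sit in one pool, and execution latency is not modeled.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Station:
    """One reservation station (hypothetical, simplified)."""
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None   # value of first operand, once known
    vk: Optional[float] = None   # value of second operand, once known
    qj: Optional[str] = None     # tag of the producer of operand 1, None if ready
    qk: Optional[str] = None     # tag of the producer of operand 2, None if ready

def issue(stations, reg_status, regs, op, dest, src1, src2):
    """Step 1: issue only if a station is free (no structural hazard)."""
    free = next((s for s in stations if not s.busy), None)
    if free is None:
        return None                      # stall: structural hazard
    free.busy, free.op = True, op
    free.qj = reg_status.get(src1)       # pending producer, if any
    free.vj = regs[src1] if free.qj is None else None
    free.qk = reg_status.get(src2)
    free.vk = regs[src2] if free.qk is None else None
    reg_status[dest] = free.name         # the destination now waits on this station
    return free

def can_execute(station):
    """Step 2: execute only when both operands are available."""
    return station.busy and station.qj is None and station.qk is None

def write_result(stations, reg_status, regs, tag, value):
    """Step 3: broadcast (tag, value) on the CDB to stations and registers."""
    for s in stations:
        if s.qj == tag:
            s.vj, s.qj = value, None
        if s.qk == tag:
            s.vk, s.qk = value, None
        if s.name == tag:                # the producing station is released
            s.busy, s.op = False, None
    for reg in [r for r, t in reg_status.items() if t == tag]:
        regs[reg] = value
        del reg_status[reg]
```

For example, issuing `SUB.D F8,F2,F6` while F2 is still owed by `Load2` leaves the station with `qj == "Load2"`; it becomes executable only after `write_result(..., "Load2", value)` is called.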
5
How Hazards are Handled
Structural hazards: Reservation stations allow more instructions to be issued
RAW hazards: An instruction is executed only when its operands are available
WAR and WAW hazards: Register renaming eliminates these hazards by renaming all destination registers, including those with a pending read or write for an earlier instruction, so that an out-of-order write does not affect any instruction that depends on an earlier value of an operand
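The effect of renaming on WAR and WAW hazards can be illustrated with a small static sketch. In Tomasulo's scheme the new names are reservation-station tags assigned at issue; here, purely as an assumption for illustration, each destination gets a fresh temporary `T0`, `T1`, ... and the hypothetical function `rename` rewrites later sources accordingly.

```python
def rename(program):
    """Give every destination a fresh name and redirect later readers to it.

    program: list of (op, dest, src1, src2) tuples in program order.
    Returns the renamed program; registers never written keep their names.
    """
    current = {}            # architectural register -> latest fresh name
    renamed = []
    for n, (op, dest, src1, src2) in enumerate(program):
        a = current.get(src1, src1)   # read the newest version of each source
        b = current.get(src2, src2)
        fresh = f"T{n}"               # hypothetical temporary name
        current[dest] = fresh
        renamed.append((op, fresh, a, b))
    return renamed

# Arithmetic instructions from the running example.
program = [
    ("MUL.D", "F0", "F2", "F4"),
    ("SUB.D", "F8", "F2", "F6"),
    ("DIV.D", "F10", "F0", "F6"),
    ("ADD.D", "F6", "F8", "F2"),   # writes F6, which DIV.D still has to read (WAR)
]
renamed = rename(program)
```

After renaming, `ADD.D` writes `T3` rather than F6, so it may complete before `DIV.D` reads F6 without changing the value `DIV.D` sees. Every destination is unique, so no WAW hazard remains either.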
6
Tags
A tag is a 4-bit quantity that denotes one of five reservation stations or one of six load buffers
Tag fields are found in the reservation stations, the register file, and the store buffers
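A quick check of why 4 bits suffice: five reservation stations plus six load buffers give eleven producers. The sketch below assumes, for illustration only, that code 0 is reserved for "no pending producer" and that codes are assigned in the order shown; the slide does not specify the actual encoding.

```python
# Hypothetical tag assignment; only the counts (5 stations, 6 load buffers)
# come from the slide. The ordering and the use of 0 are assumptions.
STATIONS = ["Add1", "Add2", "Add3", "Mult1", "Mult2"]
LOAD_BUFFERS = [f"Load{i}" for i in range(1, 7)]
NO_PRODUCER = 0

TAG = {name: code for code, name in enumerate(STATIONS + LOAD_BUFFERS, start=1)}

def bits_needed(n_codes):
    """Smallest number of bits that can represent n_codes distinct values."""
    return (n_codes - 1).bit_length()

# 11 producers plus the "no producer" code need 12 distinct values.
width = bits_needed(len(TAG) + 1)
```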
7
Example
L.D    F6,34(R2)
L.D    F2,45(R3)
MUL.D  F0,F2,F4
SUB.D  F8,F2,F6
DIV.D  F10,F0,F6
ADD.D  F6,F8,F2
8
Three Tables (1st table is not part of hardware; 2nd and 3rd tables are distributed)
1. Instruction status: indicates which of the three steps each instruction is in
2. Reservation stations: busy, op, Vj, Vk, Qj, Qk, A (V = value; Q = reservation station)
3. Register status: indicates which reservation station will write this register
9
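Figure 0.0

Instruction status
| Instruction | Issue | Execute | Write Result |
|---|---|---|---|
| L.D F6,34(R2) | √ | √ | |
| L.D F2,45(R3) | √ | √ | |
| MUL.D F0,F2,F4 | √ | | |
| SUB.D F8,F2,F6 | | | |
| DIV.D F10,F0,F6 | | | |
| ADD.D F6,F8,F2 | | | |

Reservation stations
| Name | Busy | Op | Vj | Vk | Qj | Qk | A |
|---|---|---|---|---|---|---|---|
| Load1 | Yes | Load | | | | | 34+Reg[R2] |
| Load2 | Yes | Load | | | | | 45+Reg[R3] |
| Add1 | No | | | | | | |
| Add2 | No | | | | | | |
| Add3 | No | | | | | | |
| Mult1 | Yes | Mult | | Reg[F4] | Load2 | | |
| Mult2 | No | | | | | | |

Register status
| Field | F0 | F2 | F4 | F6 | F8 | F10 | F12 | … | F30 |
|---|---|---|---|---|---|---|---|---|---|
| Qi | Mult1 | Load2 | | Load1 | | | | | |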
10
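Figure 0.1

Instruction status
| Instruction | Issue | Execute | Write Result |
|---|---|---|---|
| L.D F6,34(R2) | √ | √ | |
| L.D F2,45(R3) | √ | √ | |
| MUL.D F0,F2,F4 | √ | | |
| SUB.D F8,F2,F6 | √ | | |
| DIV.D F10,F0,F6 | | | |
| ADD.D F6,F8,F2 | | | |

Reservation stations
| Name | Busy | Op | Vj | Vk | Qj | Qk | A |
|---|---|---|---|---|---|---|---|
| Load1 | Yes | Load | | | | | 34+Reg[R2] |
| Load2 | Yes | Load | | | | | 45+Reg[R3] |
| Add1 | Yes | Sub | | | Load2 | Load1 | |
| Add2 | No | | | | | | |
| Add3 | No | | | | | | |
| Mult1 | Yes | Mult | | Reg[F4] | Load2 | | |
| Mult2 | No | | | | | | |

Register status
| Field | F0 | F2 | F4 | F6 | F8 | F10 | F12 | … | F30 |
|---|---|---|---|---|---|---|---|---|---|
| Qi | Mult1 | Load2 | | Load1 | Add1 | | | | |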
11
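Figure 0.2 (Suppose LD is slow)

Instruction status
| Instruction | Issue | Execute | Write Result |
|---|---|---|---|
| L.D F6,34(R2) | √ | √ | |
| L.D F2,45(R3) | √ | √ | |
| MUL.D F0,F2,F4 | √ | | |
| SUB.D F8,F2,F6 | √ | | |
| DIV.D F10,F0,F6 | √ | | |
| ADD.D F6,F8,F2 | | | |

Reservation stations
| Name | Busy | Op | Vj | Vk | Qj | Qk | A |
|---|---|---|---|---|---|---|---|
| Load1 | Yes | Load | | | | | 34+Reg[R2] |
| Load2 | Yes | Load | | | | | 45+Reg[R3] |
| Add1 | Yes | Sub | | | Load2 | Load1 | |
| Add2 | No | | | | | | |
| Add3 | No | | | | | | |
| Mult1 | Yes | Mult | | Reg[F4] | Load2 | | |
| Mult2 | Yes | Div | | | Mult1 | Load1 | |

Register status
| Field | F0 | F2 | F4 | F6 | F8 | F10 | F12 | … | F30 |
|---|---|---|---|---|---|---|---|---|---|
| Qi | Mult1 | Load2 | | Load1 | Add1 | Mult2 | | | |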
12
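Figure 0.3 (Suppose LD is slow)

Instruction status
| Instruction | Issue | Execute | Write Result |
|---|---|---|---|
| L.D F6,34(R2) | √ | √ | |
| L.D F2,45(R3) | √ | √ | |
| MUL.D F0,F2,F4 | √ | | |
| SUB.D F8,F2,F6 | √ | | |
| DIV.D F10,F0,F6 | √ | | |
| ADD.D F6,F8,F2 | √ | | |

Reservation stations
| Name | Busy | Op | Vj | Vk | Qj | Qk | A |
|---|---|---|---|---|---|---|---|
| Load1 | Yes | Load | | | | | 34+Reg[R2] |
| Load2 | Yes | Load | | | | | 45+Reg[R3] |
| Add1 | Yes | Sub | | | Load2 | Load1 | |
| Add2 | Yes | Add | | | Add1 | Load2 | |
| Add3 | No | | | | | | |
| Mult1 | Yes | Mult | | Reg[F4] | Load2 | | |
| Mult2 | Yes | Div | | | Mult1 | Load1 | |

Register status
| Field | F0 | F2 | F4 | F6 | F8 | F10 | F12 | … | F30 |
|---|---|---|---|---|---|---|---|---|---|
| Qi | Mult1 | Load2 | | Add2 | Add1 | Mult2 | | | |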
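The state in Figure 0.3, where all six instructions have issued but neither load has written its result, can be reproduced with a short issue-only sketch. The function `issue_all` and the `POOLS` mapping are hypothetical; the sketch assumes a free station always exists and ignores integer address registers, which the figures do not track.

```python
def issue_all(program, pools):
    """Issue instructions in order with no results written back yet.

    Returns (stations, reg_status). Each operand is recorded as
    ("Q", tag) if a producer is pending, or ("V", "Reg[Fx]") if ready.
    """
    reg_status = {}     # FP register -> tag of the station that will write it
    stations = {}       # station name -> (op, operands)
    for op, dest, srcs in program:
        # Take the first free station of the right kind (assumed to exist).
        name = next(n for n in pools[op] if n not in stations)
        operands = [("Q", reg_status[s]) if s in reg_status else ("V", f"Reg[{s}]")
                    for s in srcs]
        stations[name] = (op, operands)
        reg_status[dest] = name       # later readers of dest wait on this tag
    return stations, reg_status

ADDERS = ["Add1", "Add2", "Add3"]
MULTS = ["Mult1", "Mult2"]
POOLS = {"L.D": ["Load1", "Load2"], "ADD.D": ADDERS, "SUB.D": ADDERS,
         "MUL.D": MULTS, "DIV.D": MULTS}

PROGRAM = [
    ("L.D", "F6", []),            # address operands are not tracked here
    ("L.D", "F2", []),
    ("MUL.D", "F0", ["F2", "F4"]),
    ("SUB.D", "F8", ["F2", "F6"]),
    ("DIV.D", "F10", ["F0", "F6"]),
    ("ADD.D", "F6", ["F8", "F2"]),
]
stations, reg_status = issue_all(PROGRAM, POOLS)
```

Note that F6 ends up tagged with `Add2`, not `Load1`: the later `ADD.D` has taken over as the last writer, while `DIV.D` in `Mult2` still holds the older tag `Load1` for its own F6 operand.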
13
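Figure 3.3

Instruction status
| Instruction | Issue | Execute | Write Result |
|---|---|---|---|
| L.D F6,34(R2) | √ | √ | √ |
| L.D F2,45(R3) | √ | √ | |
| MUL.D F0,F2,F4 | √ | | |
| SUB.D F8,F2,F6 | √ | | |
| DIV.D F10,F0,F6 | √ | | |
| ADD.D F6,F8,F2 | √ | | |

Reservation stations
| Name | Busy | Op | Vj | Vk | Qj | Qk | A |
|---|---|---|---|---|---|---|---|
| Load1 | No | | | | | | |
| Load2 | Yes | Load | | | | | 45+Reg[R3] |
| Add1 | Yes | Sub | | Mem[34+Reg[R2]] | Load2 | | |
| Add2 | Yes | Add | | | Add1 | Load2 | |
| Add3 | No | | | | | | |
| Mult1 | Yes | Mult | | Reg[F4] | Load2 | | |
| Mult2 | Yes | Div | | Mem[34+Reg[R2]] | Mult1 | | |

Register status
| Field | F0 | F2 | F4 | F6 | F8 | F10 | F12 | … | F30 |
|---|---|---|---|---|---|---|---|---|---|
| Qi | Mult1 | Load2 | | Add2 | Add1 | Mult2 | | | |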
14
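Figure 0.4 (2nd load just completes)

Instruction status
| Instruction | Issue | Execute | Write Result |
|---|---|---|---|
| L.D F6,34(R2) | √ | √ | √ |
| L.D F2,45(R3) | √ | √ | √ |
| MUL.D F0,F2,F4 | √ | √ | |
| SUB.D F8,F2,F6 | √ | √ | |
| DIV.D F10,F0,F6 | √ | | |
| ADD.D F6,F8,F2 | √ | | |

Reservation stations
| Name | Busy | Op | Vj | Vk | Qj | Qk | A |
|---|---|---|---|---|---|---|---|
| Load1 | No | | | | | | |
| Load2 | No | | | | | | |
| Add1 | Yes | Sub | Mem[45+Reg[R3]] | Mem[34+Reg[R2]] | | | |
| Add2 | Yes | Add | | Mem[45+Reg[R3]] | Add1 | | |
| Add3 | No | | | | | | |
| Mult1 | Yes | Mult | Mem[45+Reg[R3]] | Reg[F4] | | | |
| Mult2 | Yes | Div | | Mem[34+Reg[R2]] | Mult1 | | |

Register status
| Field | F0 | F2 | F4 | F6 | F8 | F10 | F12 | … | F30 |
|---|---|---|---|---|---|---|---|---|---|
| Qi | Mult1 | | | Add2 | Add1 | Mult2 | | | |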
15
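Figure 3.4

Instruction status
| Instruction | Issue | Execute | Write Result |
|---|---|---|---|
| L.D F6,34(R2) | √ | √ | √ |
| L.D F2,45(R3) | √ | √ | √ |
| MUL.D F0,F2,F4 | √ | √ | |
| SUB.D F8,F2,F6 | √ | √ | √ |
| DIV.D F10,F0,F6 | √ | | |
| ADD.D F6,F8,F2 | √ | √ | √ |

Reservation stations
| Name | Busy | Op | Vj | Vk | Qj | Qk | A |
|---|---|---|---|---|---|---|---|
| Load1 | No | | | | | | |
| Load2 | No | | | | | | |
| Add1 | No | | | | | | |
| Add2 | No | | | | | | |
| Add3 | No | | | | | | |
| Mult1 | Yes | Mult | Mem[45+Reg[R3]] | Reg[F4] | | | |
| Mult2 | Yes | Div | | Mem[34+Reg[R2]] | Mult1 | | |

Register status
| Field | F0 | F2 | F4 | F6 | F8 | F10 | F12 | … | F30 |
|---|---|---|---|---|---|---|---|---|---|
| Qi | Mult1 | | | | | Mult2 | | | |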
16
Loop-Based Example
Loop:  L.D     F0,0(R1)
       MUL.D   F4,F0,F2
       S.D     F4,0(R1)
       DADDIU  R1,R1,#-8
       BNE     R1,R2,Loop
17
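Figure 0.5. One active iteration of loop

Instruction status
| Instruction | Iteration | Issue | Execute | Write Result |
|---|---|---|---|---|
| L.D F0,0(R1) | 1 | √ | √ | |
| MUL.D F4,F0,F2 | 1 | √ | | |
| S.D F4,0(R1) | 1 | √ | | |
| L.D F0,0(R1) | 2 | | | |
| MUL.D F4,F0,F2 | 2 | | | |
| S.D F4,0(R1) | 2 | | | |

Reservation stations
| Name | Busy | Op | Vj | Vk | Qj | Qk | A |
|---|---|---|---|---|---|---|---|
| Load1 | Yes | Load | | | | | Reg[R1] |
| Load2 | No | | | | | | |
| Add1 | No | | | | | | |
| Add2 | No | | | | | | |
| Add3 | No | | | | | | |
| Mult1 | Yes | Mult | | Reg[F2] | Load1 | | |
| Mult2 | No | | | | | | |
| Store1 | Yes | Store | | | | Mult1 | Reg[R1] |
| Store2 | No | | | | | | |

Register status
| Field | F0 | F2 | F4 | F6 | F8 | F10 | F12 | … | F30 |
|---|---|---|---|---|---|---|---|---|---|
| Qi | Load1 | | Mult1 | | | | | | |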
18
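Figure 0.6. One+ active iteration of loop

Instruction status
| Instruction | Iteration | Issue | Execute | Write Result |
|---|---|---|---|---|
| L.D F0,0(R1) | 1 | √ | √ | |
| MUL.D F4,F0,F2 | 1 | √ | | |
| S.D F4,0(R1) | 1 | √ | | |
| L.D F0,0(R1) | 2 | √ | | |
| MUL.D F4,F0,F2 | 2 | | | |
| S.D F4,0(R1) | 2 | | | |

Reservation stations
| Name | Busy | Op | Vj | Vk | Qj | Qk | A |
|---|---|---|---|---|---|---|---|
| Load1 | Yes | Load | | | | | Reg[R1] |
| Load2 | Yes | Load | | | | | Reg[R1]-8 |
| Add1 | No | | | | | | |
| Add2 | No | | | | | | |
| Add3 | No | | | | | | |
| Mult1 | Yes | Mult | | Reg[F2] | Load1 | | |
| Mult2 | No | | | | | | |
| Store1 | Yes | Store | | | | Mult1 | Reg[R1] |
| Store2 | No | | | | | | |

Register status
| Field | F0 | F2 | F4 | F6 | F8 | F10 | F12 | … | F30 |
|---|---|---|---|---|---|---|---|---|---|
| Qi | Load2 | | Mult1 | | | | | | |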
19
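Figure 0.7. One++ active iteration of loop

Instruction status
| Instruction | Iteration | Issue | Execute | Write Result |
|---|---|---|---|---|
| L.D F0,0(R1) | 1 | √ | √ | |
| MUL.D F4,F0,F2 | 1 | √ | | |
| S.D F4,0(R1) | 1 | √ | | |
| L.D F0,0(R1) | 2 | √ | √ | |
| MUL.D F4,F0,F2 | 2 | √ | | |
| S.D F4,0(R1) | 2 | | | |

Reservation stations
| Name | Busy | Op | Vj | Vk | Qj | Qk | A |
|---|---|---|---|---|---|---|---|
| Load1 | Yes | Load | | | | | Reg[R1] |
| Load2 | Yes | Load | | | | | Reg[R1]-8 |
| Add1 | No | | | | | | |
| Add2 | No | | | | | | |
| Add3 | No | | | | | | |
| Mult1 | Yes | Mult | | Reg[F2] | Load1 | | |
| Mult2 | Yes | Mult | | Reg[F2] | Load2 | | |
| Store1 | Yes | Store | | | | Mult1 | Reg[R1] |
| Store2 | No | | | | | | |

Register status
| Field | F0 | F2 | F4 | F6 | F8 | F10 | F12 | … | F30 |
|---|---|---|---|---|---|---|---|---|---|
| Qi | Load2 | | Mult2 | | | | | | |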
20
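Figure 3.6. Two active iterations of loop

Instruction status
| Instruction | Iteration | Issue | Execute | Write Result |
|---|---|---|---|---|
| L.D F0,0(R1) | 1 | √ | √ | |
| MUL.D F4,F0,F2 | 1 | √ | | |
| S.D F4,0(R1) | 1 | √ | | |
| L.D F0,0(R1) | 2 | √ | √ | |
| MUL.D F4,F0,F2 | 2 | √ | | |
| S.D F4,0(R1) | 2 | √ | | |

Reservation stations
| Name | Busy | Op | Vj | Vk | Qj | Qk | A |
|---|---|---|---|---|---|---|---|
| Load1 | Yes | Load | | | | | Reg[R1] |
| Load2 | Yes | Load | | | | | Reg[R1]-8 |
| Add1 | No | | | | | | |
| Add2 | No | | | | | | |
| Add3 | No | | | | | | |
| Mult1 | Yes | Mult | | Reg[F2] | Load1 | | |
| Mult2 | Yes | Mult | | Reg[F2] | Load2 | | |
| Store1 | Yes | Store | | | | Mult1 | Reg[R1] |
| Store2 | Yes | Store | | | | Mult2 | Reg[R1]-8 |

Register status
| Field | F0 | F2 | F4 | F6 | F8 | F10 | F12 | … | F30 |
|---|---|---|---|---|---|---|---|---|---|
| Qi | Load2 | | Mult2 | | | | | | |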
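The way two iterations come to coexist in the stations can be sketched as follows. This is a hand-rolled illustration rather than a general simulator: `issue_iterations` is a hypothetical helper, the branch is assumed to be predicted taken, the effect of `DADDIU` is folded into the address strings, the store's pending data tag is assumed to live in Qk, and at most two iterations are supported because only two load, multiply, and store entries exist.

```python
def issue_iterations(n_iter):
    """Issue L.D / MUL.D / S.D for n_iter loop iterations, nothing completing.

    Returns (stations, reg_status) in the same field names as the figure.
    """
    assert 1 <= n_iter <= 2, "only two Load/Mult/Store entries are modeled"
    reg_status = {}     # FP register -> tag of its pending producer
    stations = {}       # station name -> dict of non-empty fields
    for i in range(1, n_iter + 1):
        offset = -8 * (i - 1)                       # R1 drops by 8 each iteration
        addr = "Reg[R1]" if offset == 0 else f"Reg[R1]{offset}"

        load = f"Load{i}"                           # L.D F0,0(R1)
        stations[load] = {"Op": "Load", "A": addr}
        reg_status["F0"] = load

        mult = f"Mult{i}"                           # MUL.D F4,F0,F2
        stations[mult] = {"Op": "Mult", "Qj": reg_status["F0"], "Vk": "Reg[F2]"}
        reg_status["F4"] = mult

        # S.D F4,0(R1): waits on whichever station currently owns F4.
        stations[f"Store{i}"] = {"Op": "Store", "Qk": reg_status["F4"], "A": addr}
    return stations, reg_status

stations, reg_status = issue_iterations(2)
```

Because each iteration's `MUL.D` and `S.D` capture the tag that was current when they issued, the second iteration's reuse of F0 and F4 does not disturb the first: `Store1` still waits on `Mult1` while the register status already points at `Load2` and `Mult2`.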
21
IBM 360/91
Great ideas:
Data tagging
Register renaming
Dynamic detection of memory hazards
Generalized forwarding
These ideas are now broadly used in microprocessors
Was the 360/91 successful commercially?
22
IBM 360/85 (1968)
First commercial computer with a cache. Compared with the 360/91:
Slower clock cycle time (80 ns versus 60 ns)
Less memory interleaving (4 versus 16)
Slower main memory (1.04 μs versus 0.75 μs)
Cheaper in price
Which machine was faster on applications?