Presentation is loading. Please wait.

Presentation is loading. Please wait.

HY425 – Αρχιτεκτονική Υπολογιστών Διάλεξη 05

Similar presentations


Presentation on theme: "HY425 – Αρχιτεκτονική Υπολογιστών Διάλεξη 05"— Presentation transcript:

1 HY425 – Αρχιτεκτονική Υπολογιστών Διάλεξη 05
Δημήτρης Νικολόπουλος, Αναπληρωτής Καθηγητής Τμήμα Επιστήμης Υπολογιστών Πανεπιστήμιο Κρήτης CS252 S05

2 Example: Device Interrupt (Say, arrival of network message)
add r1,r2,r3 subi r4,r1,#4 slli r4,r4,#2 Hiccup(!) lw r2,0(r4) lw r3,4(r4) add r2,r2,r3 sw 8(r4),r2 Raise priority Reenable All Ints Save registers lw r1,20(r0) lw r2,0(r1) addi r3,r0,#5 sw 0(r1),r3 Restore registers Clear current Int Disable All Ints Restore priority RTE Disable All Ints PC saved Supervisor Mode Restore PC User Mode External Interrupt “Interrupt Handler” 7/18/2018 HY425 - Διάλεξη 05

3 Alternative: Polling (again, for arrival of network message)
Disable Network Intr Subi r4,r1,#4 Slli r4,r4,#2 lw r2,0(r4) lw r3,4(r4) add r2,r2,r3 sw 8(r4),r2 lw r1,12(r0) beq r1,no_mess lw r1,20(r0) lw r2,0(r1) addi r3,r0,#5 sw 0(r1),r3 Clear Network Intr External Interrupt Polling Point (check device register) “Handler” no_mess: 7/18/2018 HY425 - Διάλεξη 05

4 Polling is faster/slower than Interrupts.
Polling can be faster than interrupts because Compiler knows which registers in use at polling point. Hence, no need to save and restore registers Other interrupt overhead avoided (pipeline flush, trap priorities, etc) Polling is slower than interrupts because Overhead of polling instructions is incurred regardless of whether or not handler is run. This could add to inner-loop delay. Device may have to wait for service for a long time When to use one or the other Multi-axis tradeoff Frequent/regular events good for polling, as long as device can be controlled at user level Interrupts good for infrequent/irregular events Interrupts good for ensuring regular/predictable service of events. 7/18/2018 HY425 - Διάλεξη 05

5 Exception/Interrupt classifications
Exceptions: relevant to the current process Faults, arithmetic traps, and synchronous traps Invoke software on behalf of the currently executing process Interrupts: caused by asynchronous, outside events I/O devices requiring service (DISK, network) Clock interrupts (real time scheduling) Machine Checks: caused by serious hardware failure Not always restartable Indicate that bad things have happened. Non-recoverable ECC error Machine room fire Power outage 7/18/2018 HY425 - Διάλεξη 05

6 A related classification: Synchronous vs. Asynchronous
Synchronous: means related to the instruction stream, i.e. during the execution of an instruction Must stop an instruction that is currently executing Page fault on load or store instruction Arithmetic exception Software Trap Instructions Asynchronous: means unrelated to the instruction stream, i.e. caused by an outside event. Does not have to disrupt instructions that are already executing Interrupts are asynchronous Machine checks are asynchronous Semisynchronous (or high-availability interrupts): Caused by external event but may have to disrupt current instructions in order to guarantee service 7/18/2018 HY425 - Διάλεξη 05

7 Interrupt controller hardware and mask levels
Interrupt disable mask may be multi-bit word accessed through some special memory address Operating system constructs a hierarchy of masks that reflects some form of interrupt priority Software needs to identify source of interrupt and select the appropriate interrupt handler Typically done with a table lookup 7/18/2018 HY425 - Διάλεξη 05

8 Recap: Device Interrupt (Say, arrival of network message)
add r1,r2,r3 subi r4,r1,#4 slli r4,r4,#2 Hiccup(!) lw r2,0(r4) lw r3,4(r4) add r2,r2,r3 sw 8(r4),r2 Raise priority Reenable All Ints Save registers lw r1,20(r0) lw r2,0(r1) addi r3,r0,#5 sw 0(r1),r3 Restore registers Clear current Int Disable All Ints Restore priority RTE Disable All Ints PC saved Supervisor Mode Restore PC User Mode External Interrupt Could be interrupted by disk “Interrupt Handler” Note that priority must be raised to avoid recursive interrupts! 7/18/2018 HY425 - Διάλεξη 05

9 SPARC (and RISC I) had register windows
On interrupt or procedure call, simply switch to a different set of registers Really saves on interrupt overhead Interrupts can happen at any point in the execution, so compiler can not help with knowledge of live registers. Conservative handlers must save all registers Short handlers might be able to save only a few, but this analysis is complicated Not a major limitation with procedure calls Good compilers can allocate registers across procedure boundaries Good compilers know what registers are live at any one time 7/18/2018 HY425 - Διάλεξη 05

10 Supervisor State Typically, processors have some amount of state that user programs are not allowed to touch. Page mapping hardware/TLB TLB prevents one user from accessing memory of another TLB protection prevents user from modifying mappings Interrupt controllers -- User code prevented from crashing machine by disabling interrupts. Ignoring device interrupts, etc. Real-time clock interrupts ensure that users cannot lockup/crash machine even if they run code that goes into a loop: “Preemptive Multitasking” vs “non-preemptive multitasking” Access to hardware devices restricted Prevents malicious user from stealing network packets Prevents user from writing over disk blocks Distinction made with at least two-levels: USER/SYSTEM (one hardware mode-bit) x86 architectures actually provide 4 different levels, only two usually used by OS (or only 1 in older Microsoft OSs) Virtualization technology (VM monitors or hypervisors) use >2 privilege levels 7/18/2018 HY425 - Διάλεξη 05

11 Entry into Supervisor Mode
Entry into supervisor mode typically happens on interrupts, exceptions, and special trap instructions. Entry goes through kernel instructions: interrupts, exceptions, and trap instructions change to supervisor mode, then jump (indirectly) through table of instructions in kernel OS “System Calls” are just trap instructions: read(fd,buffer,count) => st 20(r0),r st 24(r0),r st 28(r0),r trap $READ OS overhead can be serious concern for achieving fast interrupt behavior. 7/18/2018 HY425 - Διάλεξη 05

12 Precise Interrupts/Exceptions
An interrupt or exception is considered precise if there is a single instruction (or interrupt point) for which all instructions before that instruction have committed their state and no following instructions including the interrupting instruction have not modified any state. This means, effectively, that you can restart execution at the interrupt point and “get the right answer” Implicit in our previous example of a device interrupt: Interrupt point is at first lw instruction Disable All Ints PC saved Supervisor Mode add r1,r2,r3 subi r4,r1,#4 slli r4,r4,#2 lw r2,0(r4) lw r3,4(r4) add r2,r2,r3 sw 8(r4),r2 External Interrupt Int handler Restore PC User Mode 7/18/2018 HY425 - Διάλεξη 05

13 Precise interrupt point requires multiple PCs to describe in presence of delayed branches
addi r4,r3,#4 sub r1,r2,r3 bne r1,there and r2,r3,r5 <other insts> PC: PC+4: Interrupt point described as <PC,PC+4> addi r4,r3,#4 sub r1,r2,r3 bne r1,there and r2,r3,r5 <other insts> Interrupt point described as: <PC+4,there> (branch was taken) or <PC+4,PC+8> (branch was not taken) PC: PC+4: 7/18/2018 HY425 - Διάλεξη 05

14 Why are precise interrupts desirable?
Restartability doesn’t require preciseness. However, preciseness makes it a lot easier to restart. Simplify the task of the operating system a lot Less state needs to be saved away if unloading process. Many types of interrupts/exceptions need to be restartable. Easier to figure out what actually happened: I.e. TLB faults. Need to fix translation, then restart load/store IEEE gradual underflow, illegal operation, etc: . 7/18/2018 HY425 - Διάλεξη 05

15 Precise Exceptions in simple 5-stage pipeline:
Exceptions may occur at different stages in pipeline (i.e. out of order): Arithmetic exceptions occur in execution stage TLB faults can occur in instruction fetch or memory stage What about interrupts? “do no harm” applies here: try to interrupt the pipeline as little as possible All of this solved by tagging instructions in pipeline as “cause exception or not” and wait until end of memory stage to flag exception Interrupts become marked NOPs (like bubbles) that are placed into pipeline instead of an instruction. Assume that interrupt condition persists in case NOP flushed Clever instruction fetch might start fetching instructions from interrupt vector, but this is complicated by need for supervisor mode switch, saving of one or more PCs, etc 7/18/2018 HY425 - Διάλεξη 05

16 Another look at the exception problem
Time Data TLB IFetch Dcd Exec Mem WB Bad Inst IFetch Dcd Exec Mem WB IFetch Dcd Exec Mem WB Inst TLB fault Program Flow IFetch Dcd Exec Mem WB Overflow Use pipeline to sort this out! Pass exception status along with instruction. Keep track of PCs for every instruction in pipeline. Don’t act on exception until it reaches WB stage Handle interrupts through “faulting noop” in IF stage When instruction reaches WB stage: Save PC  EPC, Interrupt vector addr  PC Turn all instructions in earlier stages into noops! 7/18/2018 HY425 - Διάλεξη 05

17 Approximations to precise interrupts
Hardware has imprecise state at time of interrupt Exception handler must figure out how to find a precise PC at which to restart program. Done by emulating instructions that may remain in pipeline Example: SPARC allows limited parallelism between FP and integer core: possible that integer instructions #1 - #4 have already executed at time that the first floating instruction gets a recoverable exception Interrupt handler code must fixup <float 1>, then emulate both <float 1> and <float 2> At that point, precise interrupt point is integer instruction #5 Vax had string move instructions that could be in middle at time that page-fault occurred. Could be arbitrary processor state that needs to be restored to restart execution. <float 1> <int 1> <int 2> <int 3> <float 2> <int 4> <int 5> 7/18/2018 HY425 - Διάλεξη 05

18 How to achieve precise interrupts when instructions executing in arbitrary order?
Jim Smith’s classic paper discusses several methods for getting precise interrupts: In-order instruction completion Reorder buffer History buffer We will discuss some of these after we see the advantages of out-of-order execution. 7/18/2018 HY425 - Διάλεξη 05

19 Changes in the flow of instructions make pipelining difficult
Must avoid adding too much overhead in pipeline startup and drain. Branches and Jumps cause fast alteration of PC. Things that get in the way: Instructions take time to decode, introducing delay slots. The next PC takes time to compute For conditional branches, the branch direction takes time to compute. Interrupts and Exceptions also cause problems Must make decisions about when to interrupt flow of instructions Must preserve sufficient pipeline state to resume execution 7/18/2018 HY425 - Διάλεξη 05

20 Jumps and Calls (JAL) (unconditional branches)
Even though we know that we will change PC, still require delay slot because of: Instruction Decode - Pretty fast PC Computation - Could fix with absolute jumps/calls (not necessarily a good solution) Basically, there is a decision being made, which takes time. This suggests single delay slot: i.e. next instruction after jump or JAL is always executed 7/18/2018 HY425 - Διάλεξη 05

21 Achieving “zero-cycle” jump
What really has to be done at runtime? Once an instruction has been detected as a jump or JAL, we might recode it in the internal cache. Very limited form of dynamic compilation? Use of “Pre-decoded” instruction cache Called “branch folding” in the Bell-Labs CRISP processor. Original CRISP cache had two addresses and could thus fold a complete branch into the previous instruction Notice that JAL introduces a structural hazard on write and r3,r1,r5 addi r2,r3,#4 sub r4,r2,r1 jal doit subi r1,r1,#1 A: doit addi r2,r3,#4 A+8 N L --- - -- -- A+4 subi r1,r1,#1 A+20 Internal Cache state: Folding attempts to assign a value to the PC, while the PC is being incremented by 4. At the same time, the instruction updates a register (r4). Therefore, we have a structural hazard. 7/18/2018 HY425 - Διάλεξη 05

22 Why not do this for branches? (original CRISP idea, applied to DLX)
Delay slot eliminated (good) Branch has been “folded” into sub instruction (good). Increases size of instruction cache (not so good) Requires another read port in register file (BAD) Potentially doubles clock period to access reg and mem (BAD) Internal Cache state: A+16 A+8 --- A+4 A+20 N/A loop Next Branch and r3,r1,r5 N BnR4 -- and r3,r1,r5 addi r2,r3,#4 sub r4,r2,r1 bne r4,loop subi r1,r1,#1 A: addi r2,r3,#4 sub r4,r2,r1 sub r4,r2,r1 --- A+16: subi r1,r1,#1 7/18/2018 HY425 - Διάλεξη 05

23 Way of looking at timing:
Clock: Instruction Cache Access Beginning of IFetch Ready to latch new PC Branch Register Lookup Mux Register file access time might be close to original clock period 7/18/2018 HY425 - Διάλεξη 05

24 However, one could use the first technique to reflect PREDICTIONS and remove delay slots
This causes the next instruction to be immediately fetched from branch destination (predict taken) If branch ends up being not taken, then squash destination instruction and restart pipeline at address A+16 Internal Cache state: and r3,r1,r5 addi r2,r3,#4 sub r4,r2,r1 bne r4,loop subi r1,r1,#1 A: addi r2,r3,#4 bne loop subi r1,r1,#1 N A+12 A+8 loop A+4 A+20 Next A+16: 7/18/2018 HY425 - Διάλεξη 05

25 Can we use HW to get CPI closer to 1?
Why in HW at run time? Works when can’t know real dependence at compile time Compiler simpler Code for one machine runs well on another Key idea: Allow instructions behind stall to proceed DIVD F0,F2,F4 ADDD F10,F0,F8 SUBD F12,F8,F14 Out-of-order execution => out-of-order completion. 7/18/2018 HY425 - Διάλεξη 05

26 Problems? How do we prevent WAR and WAW hazards?
How do we deal with variable latency? Forwarding for RAW hazards harder. 7/18/2018 HY425 - Διάλεξη 05

27 Scoreboard: a bookkeeping technique
Out-of-order execution divides ID stage: 1.Issue—decode instructions, check for structural hazards 2.Read operands—wait until no data hazards, then read operands Scoreboards date to CDC6600 in 1963 Instructions execute whenever not dependent on previous instructions and no hazards. CDC 6600: In order issue, out-of-order execution, out-of-order commit (or completion) No forwarding! Imprecise interrupt/exception model for now 7/18/2018 HY425 - Διάλεξη 05

28 Scoreboard Architecture (CDC 6600)
FP Mult FP Divide FP Add Integer Registers Functional Units SCOREBOARD Memory 7/18/2018 HY425 - Διάλεξη 05

29 Scoreboard Implications
Out-of-order completion => WAR, WAW hazards? Solutions for WAR: Stall writeback until registers have been read Read registers only during Read Operands stage Solution for WAW: Detect hazard and stall issue of new instruction until other instruction completes No register renaming! Need to have multiple instructions in execution phase => multiple execution units or pipelined execution units Scoreboard keeps track of dependencies between instructions that have already issued. Scoreboard replaces ID, EX, WB with 4 stages 7/18/2018 HY425 - Διάλεξη 05

30 Four Stages of Scoreboard Control
Issue—decode instructions & check for structural hazards (ID1) Instructions issued in program order (for hazard checking) Don’t issue if structural hazard Don’t issue if instruction is output dependent on any previously issued but uncompleted instruction (no WAW hazards) Read operands—wait until no data hazards, then read operands (ID2) All real dependencies (RAW hazards) resolved in this stage, since we wait for instructions to write back data. No forwarding of data in this model! 7/18/2018 HY425 - Διάλεξη 05

31 Four Stages of Scoreboard Control
Execution—operate on operands (EX) The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution. Write result—finish execution (WB) Stall until no WAR hazards with previous instructions: Example: DIVD F0,F2,F4 ADDD F10,F0,F8 SUBD F8,F8,F14 CDC 6600 scoreboard would stall SUBD until ADDD reads operands 7/18/2018 HY425 - Διάλεξη 05

32 Three Parts of the Scoreboard
Instruction status: Which of 4 steps the instruction is in Functional unit status:—Indicates the state of the functional unit (FU). 9 fields for each functional unit Busy: Indicates whether the unit is busy or not Op: Operation to perform in the unit (e.g., + or –) Fi: Destination register Fj,Fk: ource-register numbers Qj,Qk: Functional units producing source registers Fj, Fk Rj,Rk: Flags indicating when Fj, Fk are ready Register result status—Indicates which functional unit will write each register, if one exists. Blank when no pending instructions will write that register What you might have thought 1. 4 stages of instruction executino 2.Status of FU: Normal things to keep track of (RAW & structura for busyl): Fi from instruction format of the mahine (Fi is dest) Add unit can Add or Sub Rj, Rk - status of registers (Yes means ready) Qj,Qk - If a no in Rj, Rk, means waiting for a FU to write result; Qj, Qk means wihch FU waiting for it 3.Status of register result (WAW &WAR)s: which FU is going to write into registers Scoreboard on 6600 = size of FU 6.7, 6.8, 6.9, 6.12, 6.13, 6.16, 6.17 FU latencies: Add 2, Mult 10, Div 40 clocks 7/18/2018 HY425 - Διάλεξη 05

33 Scoreboard Example 7/18/2018 HY425 - Διάλεξη 05

34 Detailed Scoreboard Pipeline Control
Read operands Execution complete Instruction status Write result Issue Rj and Rk Functional unit done Wait until f((Fj(f)Fi(FU) or Rj(f)=No) & (Fk(f)Fi(FU) or Rk( f )=No)) Not busy (FU) and not result(D) Bookkeeping Rj No; Rk No f(if Qj(f)=FU then Rj(f) Yes); f(if Qk(f)=FU then Rk(f) Yes); Result(Fi(FU)) 0; Busy(FU) No Busy(FU) yes; Op(FU) op; Fi(FU) `D’; Fj(FU) `S1’; Fk(FU) `S2’; Qj Result(‘S1’); Qk Result(`S2’); Rj not Qj; Rk not Qk; Result(‘D’) FU; 1.Issue - if no structural haards AND non wAW (no Funtional Unit is going to write this destination register; 1 per clock cycle 2. Read -(RAW) if no instructions is going to write a source register of this instruction (alternatively, no write signal this clock cycle) +> gein exection of the instruction; many read ports, so can read many times 3. Execution Complete; multiple during clock cyle 4. Write result - (WAR) If no instructiion is watiing to read the destination register; assume multiple wriite ports; wait for clock cycle to write and tehn read the results; assume can oerlap issue & write show clock cyclesneed 20 or so Latency: minimum is 4 through 4 stages 7/18/2018 HY425 - Διάλεξη 05

35 Scoreboard Example: Cycle 1
7/18/2018 HY425 - Διάλεξη 05

36 Scoreboard Example: Cycle 2
Issue 2nd LD? 7/18/2018 HY425 - Διάλεξη 05

37 Scoreboard Example: Cycle 3
Issue MULT? 7/18/2018 HY425 - Διάλεξη 05

38 Scoreboard Example: Cycle 4
7/18/2018 HY425 - Διάλεξη 05

39 Scoreboard Example: Cycle 5
7/18/2018 HY425 - Διάλεξη 05

40 Scoreboard Example: Cycle 6
7/18/2018 HY425 - Διάλεξη 05

41 Scoreboard Example: Cycle 7
Read multiply operands? 7/18/2018 HY425 - Διάλεξη 05

42 Scoreboard Example: Cycle 8a (First half of clock cycle)
7/18/2018 HY425 - Διάλεξη 05

43 Scoreboard Example: Cycle 8b (Second half of clock cycle)
7/18/2018 HY425 - Διάλεξη 05

44 Scoreboard Example: Cycle 9
Note Remaining Read operands for MULT & SUB? Issue ADDD? 7/18/2018 HY425 - Διάλεξη 05

45 Scoreboard Example: Cycle 10
7/18/2018 HY425 - Διάλεξη 05

46 Scoreboard Example: Cycle 11
7/18/2018 HY425 - Διάλεξη 05

47 Scoreboard Example: Cycle 12
Read operands for DIVD? 7/18/2018 HY425 - Διάλεξη 05

48 Scoreboard Example: Cycle 13
7/18/2018 HY425 - Διάλεξη 05

49 Scoreboard Example: Cycle 14
7/18/2018 HY425 - Διάλεξη 05

50 Scoreboard Example: Cycle 15
7/18/2018 HY425 - Διάλεξη 05

51 Scoreboard Example: Cycle 16
7/18/2018 HY425 - Διάλεξη 05

52 Scoreboard Example: Cycle 17
WAR Hazard! Why not write result of ADD??? 7/18/2018 HY425 - Διάλεξη 05

53 Scoreboard Example: Cycle 18
7/18/2018 HY425 - Διάλεξη 05

54 Scoreboard Example: Cycle 19
7/18/2018 HY425 - Διάλεξη 05

55 Scoreboard Example: Cycle 20
7/18/2018 HY425 - Διάλεξη 05

56 Scoreboard Example: Cycle 21
WAR Hazard is now gone... 7/18/2018 HY425 - Διάλεξη 05

57 Scoreboard Example: Cycle 22
7/18/2018 HY425 - Διάλεξη 05

58 Scoreboard Example: Cycle 22

59 Faster than light computation (skip a couple of cycles)

60 Scoreboard Example: Cycle 61

61 Scoreboard Example: Cycle 62

62 Review: Scoreboard Example: Cycle 62
In-order issue; out-of-order execute & commit

63 CDC 6600 Scoreboard Speedup 1.7 from compiler; 2.5 by hand BUT slow memory (no cache) limits benefit Limitations of 6600 scoreboard: No forwarding hardware Limited to instructions in basic block (small window) Small number of functional units (structural hazards), especially integer/load store units Do not issue on structural hazards Wait for WAR hazards Prevent WAW hazards

64 Another Dynamic Algorithm: Tomasulo Algorithm
For IBM 360/91 about 3 years after CDC 6600 (1966) Goal: High Performance without special compilers Differences between IBM 360 & CDC 6600 ISA IBM has only 2 register specifiers/instr vs. 3 in CDC 6600 IBM has 4 FP registers vs. 8 in CDC 6600 IBM has memory-register ops Why Study? lead to Alpha 21264, HP 8000, MIPS 10000, Pentium II, PowerPC 604, …

65 Tomasulo Algorithm vs. Scoreboard
Control & buffers distributed with Function Units (FU) vs. centralized in scoreboard; FU buffers called “reservation stations”; have pending operands Registers in instructions replaced by values or pointers to reservation stations(RS); called register renaming ; avoids WAR, WAW hazards More reservation stations than registers, so can do optimizations compilers can’t Results to FU from RS, not through registers, over Common Data Bus that broadcasts results to all FUs Load and Stores treated as FUs with RSs as well Integer instructions can go past branches, allowing FP ops beyond basic block in FP queue

66 Tomasulo Organization
FP Op Queue FP Registers From Mem Load Buffers Load1 Load2 Load3 Load4 Load5 Load6 Store Buffers Add1 Add2 Add3 Mult1 Mult2 Reservation Stations To Mem Resolve RAW memory conflict? (address in memory buffers) Integer unit executes in parallel FP adders FP multipliers Common Data Bus (CDB)

67 Reservation Station Components
Op: Operation to perform in the unit (e.g., + or –) Vj, Vk: Value of Source operands Store buffers has V field, result to be stored Qj, Qk: Reservation stations producing source registers (value to be written) Note: No ready flags as in Scoreboard; Qj,Qk=0 => ready Store buffers only have Qi for RS producing result Busy: Indicates reservation station or FU is busy Register result status—Indicates which functional unit will write each register, if one exists. Blank when no pending instructions that will write that register. What you might have thought 1. 4 stages of instruction executino 2.Status of FU: Normal things to keep track of (RAW & structura for busyl): Fi from instruction format of the mahine (Fi is dest) Add unit can Add or Sub Rj, Rk - status of registers (Yes means ready) Qj,Qk - If a no in Rj, Rk, means waiting for a FU to write result; Qj, Qk means wihch FU waiting for it 3.Status of register result (WAW &WAR)s: which FU is going to write into registers Scoreboard on 6600 = size of FU 6.7, 6.8, 6.9, 6.12, 6.13, 6.16, 6.17 FU latencies: Add 2, Mult 10, Div 40 clocks

68 Three Stages of Tomasulo Algorithm
1. Issue—get instruction from FP Op Queue If reservation station free (no structural hazard), control issues instr & sends operands (renames registers). 2. Execution—operate on operands (EX) When both operands ready then execute; if not ready, watch Common Data Bus for result 3. Write result—finish execution (WB) Write on Common Data Bus to all awaiting units; mark reservation station available Normal data bus: data + destination (“go to” bus) Common data bus: data + source (“come from” bus) 64 bits of data + 4 bits of Functional Unit source address Write if matches expected Functional Unit (produces result) Does the broadcast

69 Tomasulo Example

70 Tomasulo Example Cycle 1

71 Tomasulo Example Cycle 2
Note: Unlike 6600, can have multiple loads outstanding

72 Tomasulo Example Cycle 3
Note: registers names are removed (“renamed”) in Reservation Stations; MULT issued vs. scoreboard Load1 completing; what is waiting for Load1?

73 Tomasulo Example Cycle 4
Load2 completing; what is waiting for Load1?

74 Tomasulo Example Cycle 5

75 Tomasulo Example Cycle 6
Issue ADDD here vs. scoreboard?

76 Tomasulo Example Cycle 7
Add1 completing; what is waiting for it?

77 Tomasulo Example Cycle 8

78 Tomasulo Example Cycle 9

79 Tomasulo Example Cycle 10
Add2 completing; what is waiting for it?

80 Tomasulo Example Cycle 11
Write result of ADDD here vs. scoreboard? All quick instructions complete in this cycle!

81 Tomasulo Example Cycle 12

82 Tomasulo Example Cycle 13

83 Tomasulo Example Cycle 14

84 Tomasulo Example Cycle 15

85 Tomasulo Example Cycle 16

86 Faster than light computation (skip a couple of cycles)

87 Tomasulo Example Cycle 55

88 Tomasulo Example Cycle 56
Mult2 is completing; what is waiting for it?

89 Tomasulo Example Cycle 57
Once again: In-order issue, out-of-order execution and completion.

90 Compare to Scoreboard Cycle 62
Why take longer on scoreboard/6600? Structural Hazards Lack of forwarding

91 Tomasulo v. Scoreboard (IBM 360/91 v. CDC 6600)
Pipelined Functional Units Multiple Functional Units (6 load, 3 store, 3 +, 2 x/÷) (1 load/store, 1 + , 2 x, 1 ÷) window size: ≤ 14 instructions ≤ 5 instructions No issue on structural hazard same WAR: renaming avoids stall completion WAW: renaming avoids stall issue Broadcast results from FU Write/read registers Control: reservation stations central scoreboard

92 Tomasulo Drawbacks Complexity delays of 360/91, MIPS 10000, IBM 620?
Many associative stores (CDB) at high speed Performance limited by Common Data Bus Multiple CDBs => more FU logic for parallel assoc stores

93 Summary #1 HW exploiting ILP
Works when can’t know dependence at compile time. Code for one machine runs well on another Key idea of Scoreboard: Allow instructions behind stall to proceed (Decode => Issue instr & read operands) Enables out-of-order execution => out-of-order completion ID stage checked both for structural & data dependencies Original version didn’t handle forwarding. No automatic register renaming

94 Summary #2 Reservations stations: renaming to larger set of registers + buffering source operands Prevents registers as bottleneck Avoids WAR, WAW hazards of Scoreboard Allows loop unrolling in HW Not limited to basic blocks (integer units gets ahead, beyond branches) Helps cache misses as well Lasting Contributions Dynamic scheduling Register renaming Load/store disambiguation 360/91 descendants are Pentium II; PowerPC 604; MIPS R10000; HP-PA 8000; Alpha 21264


Download ppt "HY425 – Αρχιτεκτονική Υπολογιστών Διάλεξη 05"

Similar presentations


Ads by Google