1 Images from Patterson-Hennessy book. Machines that introduced pipelining and instruction-level parallelism. Clockwise from top: IBM Stretch, IBM 360/91, and CDC 6600

2 COMP 740: Computer Architecture and Implementation
Montek Singh
Thu, Feb 12, 2009
Topic: Instruction-Level Parallelism I (Dynamic Scheduling: Scoreboarding)

3 Outline
- A more complex pipeline, the MIPS R4000
  - Look at the effects of memory with longer latency
  - Also long floating point instructions
- Dynamic scheduling
  - Scoreboarding

4 R4000 Pipeline
- From early 90s
- Just before SGI bought MIPS
- Superpipelined
  - Approx. 2 instructions per cycle
- Caches were pipelined
  - Which is what most of the book’s discussion is about
- R4000: 100 MHz, 1.3M transistors, 2 levels of cache
- R4400: up to 250 MHz, larger caches

5 Block Diagram

6 Pipeline Diagram
- Same logic as before, but now multiple cycles for memory access
- Deeper pipeline will lead to more hazards
  - More forwarding
  - Longer branch delays
- Figure labels: Decode; Address calculation, branching

7 Forwarding, 2 cycle delay

8 Or a 2 cycle stall
- ADD stalled for R1
- SUB uses the forwarded value, OR reads it from the register file

9 Branch Delay = 3 Cycles

10 Predicted not Taken
- If branch taken, need to stall for 2 cycles beyond delay slot

11 Eight Stages in FP pipeline
- Stages are used one or more times, depending on instruction (next)

12 Some FP Instructions
- Note latencies and initiation intervals
- Individual stages may result in structural hazards

13 Structural Hazard Example 1
- Units needed at same time highlighted

14 Structural Hazard Example 2
- The shorter ADD instruction clears the pipeline fast so doesn’t stall MUL

15 Structural Hazard Example 3
- Notice how these long instructions can have long-lasting effects

16 Performance
- CPI for base case (1.0), and with stalls
- Left 4 programs integer
- Cache effects not included
- Load stalls: 2 cycles now
- Branch stalls now more expensive
- FP result is a RAW hazard
- Structural not a big problem
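
The stall categories listed on this slide add to the base CPI of 1.0. A rough sketch of that sum follows; the symbols are illustrative notation rather than the slide's, and each s term stands for average stall cycles per instruction from that source.

```latex
\[
\mathrm{CPI}_{\text{pipeline}}
  = 1.0
  + s_{\text{load}}
  + s_{\text{branch}}
  + s_{\text{FP result}}
  + s_{\text{FP structural}}
\]
```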

17 What Do We Have So Far?
- Multiple instructions in flight at one time
- If data hazard, no new instructions issue until hazard cleared (stall)
- Could minimize stalls by reordering instructions
  - static scheduling
    - a smart compiler could reorder instructions to minimize stalls
    - using a detailed description of the architecture
  - dynamic scheduling … next topic
    - or, add hardware to do this at run time

18 Out of Order Execution
- With dynamic scheduling, we can do out of order execution
  - Execute instructions with no dependencies
  - Implies out of order completion
- Today discuss one method: scoreboarding
- So far, instructions issued in order
  - Later we’ll look at out of order issue

19 Decode Stage
- Split the ID stage into 2 stages
  - 1st = issue stage
    - decode and check for structural hazards
  - 2nd = read operand stage
    - wait until operands available, read and proceed

20 Scoreboarding
- Use a new hardware unit called the scoreboard
  - hardware data structure
- Keeps track of dependencies, and executes instructions out of order as their operands become available
  - First used on CDC 6600
    - 16 functional units

21 MIPS with Scoreboard
- Complex EX stage
- Each functional unit has
  - 2 inputs
  - 1 output

22 What is a Scoreboard?
- A scoreboard is a table maintained by the hardware:
  - keeps track of instructions being fetched, issued, executed etc.
  - keeps track of the resources (functional units and operands) they use/need
  - keeps track of which instructions modify which registers
- uses this information to dynamically schedule instructions
- very similar to a pen and paper calculation
- simple step-by-step procedure easily implemented in hardware

23 Dynamic Scheduling with a Scoreboard
- Original development in CDC 6600
- Simplified example in HP4 for MIPS FP operations
  - Using neither renaming nor forwarding
- Values always move from registers to function units, and from function units back to registers
  - However, write-back of results happens as soon as possible, not in a statically scheduled slot
- Out-of-order completion can give rise to WAR and WAW hazards
- Remember: machine “knows” original program order (needed for hazard detection)
- Machine model
  - 2 FP multipliers (10 cycles), 1 FP adder (2 cycles), 1 FP divider (40 cycles), all non-pipelined
  - 1 integer unit for everything else (incl. memory references)

24 New Worry: WAR Hazards
- Didn’t exist before, because read occurred early
- Example
    DIV.D F0, F2, F4
    ADD.D F10, F0, F8
    SUB.D F8, F8, F14
  - ADD could easily stall for DIV’s F0
  - If SUB allowed to execute, then ADD might use wrong value for F8
- SUB has a WAR hazard with ADD through register F8!
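
The three-instruction example can be checked mechanically. The sketch below classifies the hazards between an earlier and a later instruction purely from their destination and source registers; `Instr` and `hazards` are illustrative names, not part of the lecture.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: Tuple[str, ...]


def hazards(earlier: Instr, later: Instr) -> List[str]:
    """Name the data hazards the later instruction has with the earlier one."""
    found = []
    if earlier.dest in later.srcs:
        found.append("RAW")  # later reads what earlier writes
    if later.dest in earlier.srcs:
        found.append("WAR")  # later overwrites what earlier still has to read
    if later.dest == earlier.dest:
        found.append("WAW")  # both write the same register
    return found


div = Instr("DIV.D", "F0", ("F2", "F4"))
add = Instr("ADD.D", "F10", ("F0", "F8"))
sub = Instr("SUB.D", "F8", ("F8", "F14"))

print(hazards(div, add))  # ['RAW']  ADD waits for DIV's F0
print(hazards(add, sub))  # ['WAR']  SUB must not overwrite F8 before ADD reads it
print(hazards(div, sub))  # []
```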

25 Scoreboard Implications
- Out-of-order completion
- WAW, WAR hazards?
  - for WAW: stall in Issue until previous write completes
  - for WAR: stall in Write Result until previous read completes
- Need to have multiple instructions in execution phase
  - multiple execution units or pipelined execution units
- Scoreboard keeps track of dependences, state of operations
- Scoreboard replaces ID, EX, WB with 4 stages

26 New Stages
- The fetch is the same, others have changed.
- Let’s look at them one by one
- Stages: Fetch, Issue, Read Operands, EX, WB

27 Issue
- If
  - the required functional unit is available, and
  - no other unit is pending a write to same register
- Then an instruction is issued
- Moves to “read operands” stage
- The register restriction prevents WAW hazards
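
As a minimal sketch, the issue test is two table lookups. The dictionaries `busy` (unit name to busy flag) and `result` (register to the unit that will write it) are assumed layouts chosen for illustration.

```python
def can_issue(busy, result, fu, dest):
    """Issue only if the unit is free (no structural hazard) and no
    active instruction is going to write dest (no WAW hazard)."""
    return not busy[fu] and result.get(dest) is None


busy = {"Integer": False, "Add": True, "Mult1": False}
result = {"F8": "Add"}  # the Add unit has a pending write to F8

print(can_issue(busy, result, "Mult1", "F0"))  # True: unit free, F0 not pending
print(can_issue(busy, result, "Add", "F6"))    # False: structural hazard on Add
print(can_issue(busy, result, "Mult1", "F8"))  # False: WAW hazard on F8
```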

28 Read Operands
- By now, the functional unit is assigned
- If operands are available, allows functional unit to read operands from register file
- This design has no forwarding
  - So one extra cycle of latency
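
A small sketch of this stage, assuming each unit carries two ready flags (called Rj and Rk later in the lecture) and two producer fields (Qj, Qk). The dictionary layout and the function name are illustrative only.

```python
def try_read_operands(unit):
    """Read both operands together once both are ready; otherwise keep waiting."""
    if unit["Rj"] and unit["Rk"]:
        unit["Rj"] = unit["Rk"] = False   # mark operands as already read
        unit["Qj"] = unit["Qk"] = None    # no producers are awaited any more
        return True
    return False


# A multiply waiting for F2, which a load on the Integer unit will produce.
mul = {"Fj": "F2", "Fk": "F4", "Qj": "Integer", "Qk": None, "Rj": False, "Rk": True}

print(try_read_operands(mul))  # False: F2 has not been written yet
mul["Rj"] = True               # the load writes F2 to the register file
print(try_read_operands(mul))  # True: both operands are read from registers
```

Because there is no forwarding, the read happens in the cycle after the producer writes the register, which is the extra cycle of latency noted above.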

29 EX
- Has more functional units
- Notifies scoreboard when done

30 Write Result
- Prevent WAR hazards
- In this case
    DIV.D F0, F2, F4
    ADD.D F10, F0, F8
    SUB.D F8, F8, F14
- Will stall the WB of the SUB.D until ADD.D reads F8
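
The write-result check can be sketched as a scan over all units for a source that matches the destination and is ready but not yet read. The unit names below are hypothetical, and the example assumes ADD.D and SUB.D occupy separate adders so that both can be in flight at once.

```python
def can_write(fu, units):
    """Allow the write only if no unit still has to read the old value of Fi."""
    fi = units[fu]["Fi"]
    return all((u["Fj"] != fi or not u["Rj"]) and (u["Fk"] != fi or not u["Rk"])
               for u in units.values())


units = {
    # DIV.D F0, F2, F4: operands already read, still executing
    "Div":  {"Fi": "F0",  "Fj": "F2", "Fk": "F4",  "Rj": False, "Rk": False},
    # ADD.D F10, F0, F8: F0 not produced yet, F8 ready but not yet read
    "Add1": {"Fi": "F10", "Fj": "F0", "Fk": "F8",  "Rj": False, "Rk": True},
    # SUB.D F8, F8, F14: operands read, execution finished
    "Add2": {"Fi": "F8",  "Fj": "F8", "Fk": "F14", "Rj": False, "Rk": False},
}

print(can_write("Add2", units))  # False: ADD.D has not read F8 yet
units["Add1"]["Rk"] = False      # ADD.D reads its operands
print(can_write("Add2", units))  # True: SUB.D may now write F8
```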

31 Components of Scoreboard
- Hardware data structure
- Look at pieces, one by one
- Instructions (in order) listed on top left

32 Instruction Status
- All but last issued (ADD is waiting in Issue stage)
- First LD complete
- MUL, SUB waiting for register F2 (LD)
- DIV waiting for F0 (result of MUL)

33 Status of Each Functional Unit
- Fi is destination; j, k sources
- Q lists producers of inputs
- R column indicates that input registers are ready, but not yet read (set to No after read)

34 Register Result
- Shows which unit is producing which register
- Needed by Issue stage

35 Later in Execution
- LD and SUB (fast ops) have completed
- ADD and MUL in process
- DIV waiting for MUL to write F0

36 Almost Done
- DIV about ready to write
- Most everything complete and pipeline almost flushed

37 Cost of Extra Performance
- Scoreboard hardware
- Extra functional units
- Extra buses
  - Which may result in structural hazard
  - Hardware needs to assign buses
- Performance depends on
  - Amount of parallelism in code sequence
  - Window size of the scoreboard
  - Size of basic block (i.e., code without branches), … next

38 Status – Our Pipeline Now
- Can execute instructions out of order
- Have not discussed out of order issue
  - Could extend our scoreboarding to do this
- Still, the opportunities in basic block limited
  - Basic blocks tend to be short
- Would like to issue past branches

39 Next
- We’ll first look at techniques to increase issue potential
  - Compiler techniques
- Then look at branch prediction
- Look at Tomasulo’s algorithm for dynamic scheduling
- Begin reading Chapter 2 of HP

40 Self-Study Material
- Summary of scoreboarding algorithm
- One long scoreboarding example
- Formal logic equations for scoreboarding logic

41 Four Stages of Scoreboard Control
1. Issue: decode instr. & check for structural hazards (ID1)
   - If functional unit is free and no WAW hazard with other active instruction …
     - … scoreboard issues the instruction to the functional unit and updates its internal data structure.
   - If a structural or WAW hazard exists …
     - … instruction issue stalls; unless there is buffering between fetch and issue, no further instructions can issue until these hazards are cleared.
2. Read operands: wait until no data hazards, then read (ID2)
   - A source operand is available if no earlier issued active instruction is going to write it.
   - When all source operands are available …
     - … scoreboard tells the functional unit to proceed to read the operands from registers and begin execution.
   - Thus, scoreboard resolves RAW hazards dynamically in this step
     - instructions may be sent into execution out of order

42 Four Stages of Scoreboard Control (cont.)
3. Execution: operate on operands
   - The functional unit begins execution upon receiving operands
   - When result is ready, it notifies the scoreboard
4. Write Result: finish execution (WB)
   - Once scoreboard is aware that functional unit has completed execution, scoreboard checks for WAR hazards.
   - If no WAR hazard …
     - … it writes results
   - If WAR hazard …
     - … it stalls the completing instruction
   - Example:
       DIV.D F0, F2, F4
       ADD.D F10, F0, F8
       SUB.D F8, F8, F14
     - CDC 6600 scoreboard would stall SUB.D until ADD.D reads ops

43 Three Parts of the Scoreboard
1. Instruction status: Which of 4 steps instruction is in
2. Functional unit (FU) status: Indicates state of FU
   - Nine fields for each functional unit
     - Busy: Indicates whether the unit is busy or not
     - Op: Operation to perform in the unit (e.g., + or -)
     - Fi: Destination register
     - Fj, Fk: Source registers
     - Qj, Qk: Functional units producing source registers Fj, Fk
     - Rj, Rk: Flags indicating when Fj, Fk are ready
3. Register result status: Indicates which functional unit will write each register, if any
   - blank when no pending instructions will write that register
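
One way to picture the three parts is as plain records. This is only a sketch: the class names, the stage strings and the sample entries (a load of F2 in flight and a multiply waiting on it) are assumptions made for illustration.

```python
from dataclasses import dataclass, field, fields
from typing import Dict, List, Optional


@dataclass
class FUStatus:
    busy: bool = False
    op: Optional[str] = None   # operation to perform
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # source registers
    fk: Optional[str] = None
    qj: Optional[str] = None   # units producing Fj, Fk
    qk: Optional[str] = None
    rj: bool = False           # Fj, Fk ready and not yet read
    rk: bool = False


@dataclass
class Scoreboard:
    instruction_status: List[str] = field(default_factory=list)    # step of each instruction
    fu_status: Dict[str, FUStatus] = field(default_factory=dict)   # one entry per unit
    register_result: Dict[str, str] = field(default_factory=dict)  # register -> writing unit


sb = Scoreboard()
# A load of F2 has read its operands and is executing on the Integer unit.
sb.fu_status["Integer"] = FUStatus(True, "LD", "F2", None, "R3", None, None, False, False)
sb.register_result["F2"] = "Integer"
# A multiply F0 <- F2 * F4 has issued and waits for the load to produce F2.
sb.fu_status["Mult1"] = FUStatus(True, "MUL", "F0", "F2", "F4", "Integer", None, False, True)
sb.register_result["F0"] = "Mult1"
sb.instruction_status = ["execute", "issue"]

print(len(fields(FUStatus)))        # 9 fields per functional unit
print(sb.register_result.get("F4")) # None: no pending write to F4
```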

44 Scoreboard Example, Cycle 0

45 Scoreboard Example, Cycle 1: First LD issues

46 Scoreboard Example, Cycle 2: Structural hazard on Integer unit; second LD stalls in IF stage

47 Scoreboard Example, Cycle 3: Second LD is still stalled

48 Scoreboard Example, Cycle 4: Second LD still stalled; first LD done

49 Scoreboard Example, Cycle 5: Second LD issues as the structural hazard on Integer unit has cleared

50 Scoreboard Example, Cycle 6: MULT issues

51 Scoreboard Example, Cycle 7: SUBD issues; MULT stalled on LD

52 Scoreboard Example, Cycle 8a: DIVD issues; SUBD stalled on LD

53 Scoreboard Example, Cycle 8b: LD writes F2; MULT and SUBD enabled

54 Scoreboard Example, Cycle 9: MULT and SUBD read operands and enter execution

55 Scoreboard Example, Cycle 10: Structural hazard on Add unit stalls the final ADDD

56 Scoreboard Example, Cycle 11: SUBD and MULT are still in execution

57 Scoreboard Example, Cycle 12: SUBD writes results; Add unit free; structural hazard resolves

58 Scoreboard Example, Cycle 13: Note WAR hazard between DIVD and ADDD

59 Scoreboard Example, Cycle 14: MULT still executing; DIVD stalled on F0 (RAW hazard)

60 Scoreboard Example, Cycle 15: MULT still executing

61 Scoreboard Example, Cycle 16: ADDD completes execution, ready to write result into F6

62 Scoreboard Example, Cycle 17: WAR hazard: ADDD stalls in Write Result stage

63 Scoreboard Example, Cycle 18: DIVD stalled (RAW hazard on F0), ADDD stalled (WAR hazard on F6)

64 Scoreboard Example, Cycle 19: MULT completes execution

65 Scoreboard Example, Cycle 20: MULT writes result; DIVD can proceed to read operands at next cycle

66 Scoreboard Example, Cycle 21: DIVD reads operands; WAR hazard on F6 is resolved

67 Scoreboard Example, Cycle 22: 40 cycle Divide! ADDD completes writing of result

68 Scoreboard Example, Cycle 61: DIVD completes execution; ready to write result

69 Scoreboard Summary
- CDC designers measured performance improvement of 1.7 for compiled FORTRAN code, 2.5 for assembly
  - No pipeline scheduling in software
  - Slow memory (no cache)
- Limitations of 6600 scoreboard
  - No forwarding
  - Limited to instructions in basic block (small issue window)
  - Number of functional units (structural hazards)
  - Wait for WAR hazards
  - Prevent WAW hazards

70 Scoreboard: Bookkeeping Actions
Issue
  Wait until: not Busy[FU] and not Result[D]
  Bookkeeping: Busy[FU] ← yes; Op[FU] ← op; Fi[FU] ← D; Fj[FU] ← S1; Fk[FU] ← S2; Qj[FU] ← Result[S1]; Qk[FU] ← Result[S2]; Rj ← not Qj; Rk ← not Qk; Result[D] ← FU
Read Operands
  Wait until: Rj and Rk
  Bookkeeping: Rj ← No; Rk ← No; Qj ← 0; Qk ← 0
Execution Complete
  Wait until: functional unit done
Write Result
  Wait until: ∀f ((Fj[f] ≠ Fi[FU] or Rj[f] = No) & (Fk[f] ≠ Fi[FU] or Rk[f] = No))
  Bookkeeping: ∀f (if Qj[f] = FU then Rj[f] ← yes); ∀f (if Qk[f] = FU then Rk[f] ← yes); Result[Fi[FU]] ← 0; Busy[FU] ← No
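
The bookkeeping rules above are small enough to run as a cycle-by-cycle sketch. Several things here are assumptions rather than material from the slides: the operands in `PROGRAM` follow the usual textbook sequence (the slides only name the opcodes and a few registers), the integer unit is given a 1-cycle latency, each instruction advances at most one stage per cycle, and every decision in a cycle is taken from the state left by the previous cycle.

```python
from dataclasses import dataclass
from typing import Optional

LATENCY = {"Integer": 1, "Mult": 10, "Add": 2, "Divide": 40}  # execution cycles


@dataclass
class FU:
    name: str
    kind: str
    busy: bool = False
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # source registers
    fk: Optional[str] = None
    qj: Optional[str] = None   # units producing the sources
    qk: Optional[str] = None
    rj: bool = False           # source ready and not yet read
    rk: bool = False
    stage: str = "idle"        # idle -> issued -> exec -> idle
    done: int = 0              # cycle in which execution completes
    instr: int = -1            # index of the instruction in this unit


def simulate(program):
    fus = [FU("Integer", "Integer"), FU("Mult1", "Mult"), FU("Mult2", "Mult"),
           FU("Add", "Add"), FU("Divide", "Divide")]
    result = {}                         # register -> unit that will write it
    times = [dict() for _ in program]   # stage -> cycle, per instruction
    pc, cycle = 0, 0
    while pc < len(program) or any(f.busy for f in fus):
        cycle += 1
        # Decide all actions from the state at the end of the previous cycle.
        issue_to = None
        if pc < len(program):
            kind, dest, s1, s2 = program[pc]
            free = [f for f in fus if f.kind == kind and not f.busy]
            if free and dest not in result:   # no structural or WAW hazard
                issue_to = free[0]
        readers = [f for f in fus if f.stage == "issued" and f.rj and f.rk]
        writers = [f for f in fus if f.stage == "exec" and f.done < cycle
                   and all((g.fj != f.fi or not g.rj) and (g.fk != f.fi or not g.rk)
                           for g in fus)]     # no unread use of Fi (WAR)
        # Apply the bookkeeping: issue, then write result, then read operands.
        if issue_to is not None:
            f = issue_to
            f.busy, f.fi, f.fj, f.fk = True, dest, s1, s2
            f.qj, f.qk = result.get(s1), result.get(s2)
            f.rj, f.rk = f.qj is None, f.qk is None
            f.stage, f.instr = "issued", pc
            result[dest] = f.name
            times[pc]["issue"] = cycle
            pc += 1
        for f in writers:
            for g in fus:                     # wake up units waiting on f
                if g.qj == f.name:
                    g.rj = True
                if g.qk == f.name:
                    g.rk = True
            del result[f.fi]
            f.busy, f.stage = False, "idle"
            times[f.instr]["write"] = cycle
        for f in readers:
            f.rj = f.rk = False
            f.qj = f.qk = None
            f.stage, f.done = "exec", cycle + LATENCY[f.kind]
            times[f.instr]["read"] = cycle
            times[f.instr]["exec"] = f.done
    return times


# (unit kind, destination, source 1, source 2); operands are assumed.
PROGRAM = [
    ("Integer", "F6", None, "R2"),   # LD    F6, 34(R2)
    ("Integer", "F2", None, "R3"),   # LD    F2, 45(R3)
    ("Mult", "F0", "F2", "F4"),      # MULTD F0, F2, F4
    ("Add", "F8", "F6", "F2"),       # SUBD  F8, F6, F2
    ("Divide", "F10", "F0", "F6"),   # DIVD  F10, F0, F6
    ("Add", "F6", "F8", "F2"),       # ADDD  F6, F8, F2
]
TIMES = simulate(PROGRAM)
for instr, t in zip(PROGRAM, TIMES):
    print(instr, t)
```

Under these assumptions the sketch has ADDD stalled in Write Result until cycle 22 and DIVD finishing execution in cycle 61, which agrees with the cycle notes of the example slides.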

