Speeding it up
Part 2: Pipeline problems & tricks
dr.ir. A.C. Verschueren
Eindhoven University of Technology, faculty of Electrical Engineering, Section of Digital Information Systems

Giving more time to a pipeline stage

A pipeline stage which cannot handle the next instruction in one clock cycle has to 'stall' the stages in front of it.

[Timing diagram: five instructions (r1 := r2 + 3; r3 := r4 x r5; r6 := r4 - 2; r7 := r2 - r5; r0 := r5 + 22) move through the F, D, E and W stages. The multiply uses 2 extra clocks in the ALU, so the following instructions must wait for F, D and E respectively: 'stall cycles'.]

The bad thing about pipeline stalls

Stalls force 'no operation' cycles on the expensive hardware of the previous stages. The following instructions finish later than absolutely necessary.

Pipeline stalls should be avoided whenever possible!

Another pipeline problem: 'dependencies'

In the standard pipeline, instructions which depend upon each other's results give problems.

[Timing diagram: with initial values r1 = 11, r2 = 22, r3 = 34, the sequence r1 := r2 + 3; r4 := r3 - r1 runs overlapped through F-D-E-W. The second instruction reads r1 in its D stage before the first has written the new value 25 in its W stage, so it computes r3 - r1 with the old r1 = 11 instead of the correct r4 = 34 - 25 = 9: a wrong value, '25' not written yet...]

Solving the dependency problem

Compare D, E and W stage operands, and stall the pipeline if a match is found.

[Timing diagram: r1 := r2 + 3 followed by r4 := r3 - r1. The second instruction stalls in the D stage while the "D source = E destination" and "D source = W destination" matches are detected; it only proceeds after r1 := 25 has been written, and then correctly computes r4 := 9.]

Result forwarding to solve dependencies

[Datapath diagram: a four-stage pipeline (stage 1 Fetch, stage 2 Decode, stage 3 Execute, stage 4 Write) with program memory, PC, data registers and the ALU, connected by the pipeline registers I1, I2 and I3. A result forwarding 'path' feeds the Execute-stage result D back to control multiplexers on the source operands s1 and s2.]

Source operand control and multiplexer specification:

  IF I3.dest = I2.source1 THEN s1 := D ELSE s1 := S1;
  IF I3.dest = I2.source2 THEN s2 := D ELSE s2 := S2;
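The multiplexer specification above can be played through in software. A minimal Python sketch (the function name and the concrete register numbers are illustrative assumptions, only the I2/I3 comparison logic comes from the slide):

```python
# Minimal model of the result-forwarding multiplexers: if the
# Execute-stage instruction (I3) writes the register that the
# Decode-stage instruction (I2) reads, take the ALU result D
# instead of the stale register-file value S1/S2.

def forward_operands(i3_dest, i2_source1, i2_source2, D, S1, S2):
    """Return the operand values (s1, s2) fed to the Execute stage."""
    s1 = D if i3_dest == i2_source1 else S1
    s2 = D if i3_dest == i2_source2 else S2
    return s1, s2

# Example: r1 := r2 + 3 is in Execute (dest r1, ALU result D = 25),
# r4 := r3 - r1 is in Decode (sources r3 and r1, stale r1 value 11).
s1, s2 = forward_operands(i3_dest=1, i2_source1=3, i2_source2=1,
                          D=25, S1=34, S2=11)
print(s1, s2)   # the stale 11 for r1 is replaced by the forwarded 25
print(s1 - s2)  # r4 := r3 - r1 = 34 - 25 = 9, the correct result
```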

Parallel pipelines to speed things up

There is no need to wait for the completion of slow operations if they are handled by separate hardware.

[Timing diagram: r1 := r2 + 3; r3 := [r4]; r6 := r4 - 2; r7 := r2 - r5; r0 := r… The memory load r3 := [r4] runs through separate memory pipeline hardware (M stages) while the ALU instructions continue; note the forwarding, the 2 write stages, and that the write order is reversed!]

The 'order of completion'

In this example we have 'out-of-order completion': r6 is written before r3, while the instruction ordering suggests r3 before r6! The normal case is called 'in-order completion'. Shorthand: 'OOO'.

Dependencies with OOO completion

(Two instructions in time order, each with a source and a destination.)

– Write/read or 'true data dependency': reading the 2nd source must wait for the 1st destination write, otherwise the 2nd instruction uses a wrong source value
– Write/write dependency: writing the 2nd destination must be done after writing the 1st destination, otherwise a wrong result is left in the destination at the end
– Read/write dependency or 'antidependency': writing the 2nd destination must be done after reading the 1st source value, otherwise the 1st instruction uses a wrong source value
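The three dependency kinds can be detected mechanically. A small illustrative Python classifier (the tuple representation and register names are assumptions, the terminology is the slide's):

```python
# Classify the dependency between two instructions in program order.
# Each instruction is (destination, set_of_sources).

def dependencies(first, second):
    d1, src1 = first
    d2, src2 = second
    found = set()
    if d1 in src2:
        found.add("true data dependency")  # write then read
    if d1 == d2:
        found.add("write/write")           # both write the same register
    if d2 in src1:
        found.add("antidependency")        # read then write
    return found

# r1 := r2 + 3 followed by r4 := r3 - r1: the second reads r1,
# so there is a true data dependency.
print(dependencies(("r1", {"r2"}), ("r4", {"r3", "r1"})))
```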

'Scoreboarding' instead of forwarding

Result forwarding helps in a simple pipeline
– It becomes rather complex in a multiple pipeline with out-of-order completion
– One of the earlier DEC Alpha processors used more than 40 result forwarding paths

A 'register scoreboard' can be used to make sure that dependency relations are kept in order.

Operation of a register scoreboard

All registers have a 'scoreboard' bit, initially reset:
– Instructions wait in the Decode stage until all their source and destination scoreboard bits are reset (to zero)
– Instructions which exit the Decode stage set the scoreboard bit in their destination register(s)
– A scoreboard bit is reset during the writing of a destination register in any Writeback stage
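The three rules above fit in a few lines of software. A toy Python model (the register file size and the function names are assumptions; the issue/stall conditions follow the slide):

```python
# Toy register scoreboard: one bit per register, initially reset.
# An instruction may leave Decode only if all its source AND
# destination scoreboard bits are reset; on leaving, it sets its
# destination bit. The Writeback stage resets the bit on write.

scoreboard = {r: False for r in ("r1", "r2", "r3", "r4")}

def can_issue(dest, sources):
    """Decode-stage check: wait until every involved bit is reset."""
    return not scoreboard[dest] and not any(scoreboard[s] for s in sources)

def issue(dest):
    scoreboard[dest] = True    # instruction exits Decode

def writeback(dest):
    scoreboard[dest] = False   # destination written, bit reset

# r1 := r2 + 3 issues; r4 := r3 - r1 must then wait for r1.
issue("r1")
print(can_issue("r4", {"r3", "r1"}))  # False: stalled in Decode on r1
writeback("r1")
print(can_issue("r4", {"r3", "r1"}))  # True: dependency resolved
```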

Scoreboard performance

A simple scoreboard is very conservative in its stalling decisions
– It stalls the pipeline for true data dependencies, but removes all forwarding paths in return!
– Write-write and antidependencies are stalled much longer than absolutely necessary: they should be stalled in the Writeback stage, not the Decode stage!

The real reason for some dependencies

Write-write and antidependencies exist because a register is re-used to hold another value! If we use a different destination register for each write action, these dependencies vanish (every result a different register?)
– This requires changing the program, which is not always possible
– The number of available registers may not be enough

Register 'renaming' as solution

Write-write and antidependencies can be removed by writing each result in a different hardware register
– This removes the direct relation between a register number in the program and a real register: register numbers are renamed into something else!
– Have to make sure that source register references always use the correct (renamed) hardware register

Register renaming example

All registers start as R..a.

   Before renaming:     After renaming:
1) R1 := R2 + 3         R1b := R2a + 3
2) R3 := R1 x 2         R3b := R1b x 2
3) R1 := R6 + R2        R1c := R6a + R2a
4) R2 := R1 - 15        R2b := R1c - 15

The true dependencies (2 reads the R1 of 1, 4 reads the R1 of 3) remain; the antidependencies and the write-write dependency on R1 vanish, because each write lands in a different hardware register.

An implementation of register renaming

Use a lookup table in the Decode stage which indicates the 'current' hardware register for each of the software-visible registers
– Source values are read from the hardware registers currently referenced from the lookup table
– Each destination register gets a 'fresh' hardware register, whose reference is placed in the lookup table
– Later pipeline stages all use the hardware register references for result forwarding and/or writeback
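The lookup-table scheme can be sketched directly in Python. This is a toy model under stated assumptions (the hardware register naming `h0`, `h1`, ... and the `rename` helper are invented for illustration; sources are looked up before the destination gets its fresh register, as on the slide):

```python
# Toy register renaming: a Decode-stage lookup table maps each
# software-visible register to its 'current' hardware register.
# Every destination write gets a fresh hardware register.

from itertools import count

rename_table = {}   # e.g. "R1" -> "h3"
fresh = count()     # endless supply of hardware register names

def rename(dest, sources):
    """Return the renamed (dest, sources) of one instruction."""
    renamed_sources = [rename_table.setdefault(s, f"h{next(fresh)}")
                       for s in sources]          # read current mapping
    rename_table[dest] = f"h{next(fresh)}"        # fresh register for result
    return rename_table[dest], renamed_sources

# The slide's sequence: both writes to R1 land in different
# hardware registers, so write-write and antidependencies vanish.
print(rename("R1", ["R2"]))        # R1 := R2 + 3
print(rename("R3", ["R1"]))        # R3 := R1 x 2  (reads the first R1)
print(rename("R1", ["R6", "R2"]))  # R1 := R6 + R2 (a NEW hardware register)
print(rename("R2", ["R1"]))        # R2 := R1 - 15 (reads the second R1)
```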

The problem with register renaming

When is a hardware register not needed anymore? Or, in other words: when can a hardware register be re-used?
– There must be another hardware register assigned for its software register number, AND
– All source value references to it must have been done

This will be solved later.

Flow control instructions in the pipeline

When the PC is changed by an instruction, the Fetch stage must wait for the actual update
– For instance: a relative jump calculated by the ALU, with the PC updated in the Writeback stage

[Timing diagram: the jump PC := PC + 5 runs through F-D-E-W and the PC is only updated in the W stage; meanwhile the following instructions r3 := r4 x r5 and r8 := r1 - 22 have already been fetched at the wrong address!]

Improving the flow control handling

The number of stall cycles can be reduced a lot: update the PC earlier in the pipeline
– For instance in the Decode stage

[Timing diagram: the jump PC := 25 now updates the PC in the D stage. Only one instruction (r8 := r1 - 22) is fetched at the wrong address and must be cancelled, turned into a no-operation (NOP), before fetching continues at the target.]

Another method: use 'delay slots'

The pipeline stall can be removed by executing instructions following the flow control instruction
– These are executed before the actual jump is made

[Timing diagram: the jump at address X (PC := 25) updates the PC in the D stage. The instruction at X+1 (r3 := r4 x r5) sits in the 'delay slot' and is executed anyway; the next instruction fetched is at the target address (25: r8 := r…).]

Delay slots: to have or not to have

Using delay slots changes processor behaviour: old programs will not run anymore!

Compilers try to find useful instructions for delay slots
– Able to fill ≈75% of the first delay slots
– But only filling ≈40% of the second delay slots

If no useful instruction can be found, insert a NOP.

An alternative to delay slots

Sometimes there are several stages between the fetching and the execution (PC update) of a jump instruction
– This would lead to many (unfillable) delay slots

Alternative solution: a 'branch target cache' (BTC)
– This cache contains, for out-of-sequence jumps, the new PC value and the first (few) instruction(s)
– It is indexed on the address of the jump instruction: the BTC 'knows' a jump is coming before it is fetched!
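A BTC can be modelled as a small table keyed by the jump instruction's address. A minimal Python sketch, assuming an invented entry layout (`next_pc`, `cached_instructions`) and fetch helper; the indexing-by-jump-address idea is the slide's:

```python
# Toy branch target cache (BTC), indexed by the ADDRESS of the jump
# instruction itself. A hit yields the PC to continue fetching at,
# plus the first target instruction(s), so fetch redirects before
# the jump is even decoded.

btc = {
    # jump at address 10 goes to target 22: the BTC caches the first
    # target instruction and the PC where fetching continues.
    10: {"next_pc": 23, "cached_instructions": ["r3 := r4 x r5"]},
}

def fetch(pc):
    """One fetch step: (instructions supplied, next fetch PC)."""
    entry = btc.get(pc)
    if entry:                          # BTC hit: redirect immediately
        return entry["cached_instructions"], entry["next_pc"]
    return [f"imem[{pc}]"], pc + 1     # normal sequential fetch

print(fetch(9))    # miss: plain sequential fetch
print(fetch(10))   # hit: the BTC supplies the target instruction itself
```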

Operation of the Branch Target Cache

If the Branch Target Cache hits, the fetch stage starts fetching after the target address
– The BTC provides the first (few) instruction(s) itself

[Timing diagram: the BTC checks address 10 while the jump 10: PC := 22 is fetched, and hits. The BTC itself provides the target instruction 22: r3 := r4 x r5, the PC is updated, and fetching continues at 23: r8 := r…; the instruction 11: r2 := r6 + 3 after the jump is never executed.]

Jump prediction saves time

By predicting the outcome of a conditional jump, there is no need to wait until the test outcome is known
– Example: condition test outcome known in the W stage

[Timing diagram: 10: JNZ r1,22 is predicted taken; 11: r2 := 3 fills the 'delay slot'. Fetching continues at 22: r3 := r4 x r5 and 23: r7 := r9. If the prediction turns out correct, no time is lost. If it is wrong, the wrongly fetched instructions (from 24: r6 := 5 on) are cancelled and fetching restarts at 12: r8 := 0. Wrong predictions must be avoided!]

How to predict a test outcome (1)

The prediction may be given with a bit in the instruction
– This shifts the prediction problem to the assembler/compiler
– The instruction set must be changed to hold this flag

The prediction may be based upon the type of test and/or the jump direction
– End-of-loop jumps are taken most of the time
– A single bit test is generally unpredictable...

How to predict a test outcome (2)

The prediction can be based upon the previous outcome(s) of the condition test
– This is done with a 'branch history buffer': a cache which holds information for the most recently executed conditional jumps
– May be based solely on the last execution, or on more complex (statistical) algorithms
– Implemented in separate hardware, or combined with branch target/instruction caches

The combination can achieve a 'hit rate' of more than 90%!
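One widely used "previous outcomes" scheme (an example of such an algorithm, not necessarily the exact one meant here) is a 2-bit saturating counter per jump address. A minimal Python sketch:

```python
# Toy branch history buffer: one 2-bit saturating counter (0..3) per
# jump address; predict 'taken' when the counter is >= 2. Two wrong
# outcomes in a row are needed to flip the prediction, which keeps an
# end-of-loop jump predicted 'taken' across its single exit.

from collections import defaultdict

history = defaultdict(lambda: 2)   # every jump starts 'weakly taken'

def predict(addr):
    return history[addr] >= 2      # True means: predict taken

def update(addr, taken):
    if taken:
        history[addr] = min(3, history[addr] + 1)
    else:
        history[addr] = max(0, history[addr] - 1)

# A loop branch that is taken 9 times and then falls through once:
outcomes = [True] * 9 + [False]
correct = 0
for taken in outcomes:
    correct += (predict(0x10) == taken)
    update(0x10, taken)
print(f"{correct}/10 predicted correctly")  # 9/10: only the exit misses
```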

CALL and RETURN handling

A subroutine CALL can be seen as a jump combined with a memory write
– It is not more problematic than a normal JUMP

A subroutine RETURN gives more problems
– The new PC value cannot be determined from the instruction location and contents
– Special tricks exist to bypass the memory stack read (for instance a 'return address cache')

Calculated and indirect jumps

These give huge problems in a pipeline
– The new PC value must be determined before fetching can continue

Most of the speedup tricks break down on this problem
– A Branch Target Cache can help a little bit, but only if the actual target remains stable
– The predicted target must be checked afterwards!