Published by Nigel Summers. Modified over 9 years ago.
1
Faculty of Electrical Engineering, Eindhoven University of Technology
Speeding it up
Part 2: Pipeline problems & tricks
dr.ir. A.C. Verschueren
Eindhoven University of Technology
Section of Digital Information Systems
2
Giving more time to a pipeline stage
A pipeline stage which cannot handle the next instruction in one clock cycle has to 'stall' the stages in front of it.
[Figure: pipeline timing diagram (stages F, D, E, W against time) for the sequence
  r1 := r2 + 3
  r3 := r4 x r5
  r6 := r4 - 2
  r7 := r2 - r5
  r0 := r5 + 22
The multiply uses 2 extra clocks in the ALU (E stage). During these 'stall cycles', r6 := r4 - 2 must wait for E, r7 := r2 - r5 must wait for D, and r0 := r5 + 22 must wait for F.]
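The stall behaviour described above can be sketched as a small simulation. This is a minimal model under stated assumptions, not the lecture's own: a strictly in-order four-stage pipeline (F, D, E, W) in which each stage holds one instruction, and only the E stage may take more than one cycle. The names `simulate`, `write_cycles` and `PROGRAM` are hypothetical.

```python
STAGES = ("F", "D", "E", "W")

# (instruction text, number of cycles spent in the E stage)
PROGRAM = [
    ("r1 := r2 + 3", 1),
    ("r3 := r4 x r5", 3),   # multiply: 2 extra clocks in the ALU
    ("r6 := r4 - 2", 1),
    ("r7 := r2 - r5", 1),
    ("r0 := r5 + 22", 1),
]

def simulate(program):
    """Return, per instruction, a dict: stage -> (enter_cycle, leave_cycle)."""
    timeline = []
    for i, (_text, execute_cycles) in enumerate(program):
        prev = timeline[i - 1] if i > 0 else None
        times = {}
        for s, stage in enumerate(STAGES):
            if s > 0:
                # Enter a stage the cycle after leaving the previous one.
                enter = times[STAGES[s - 1]][1] + 1
            elif prev is not None:
                # Fetch can start once the previous instruction left F.
                enter = prev["F"][1] + 1
            else:
                enter = 1
            busy = execute_cycles if stage == "E" else 1
            leave = enter + busy - 1
            if prev is not None and s + 1 < len(STAGES):
                # Stall: stay here until the next stage has been vacated.
                leave = max(leave, prev[STAGES[s + 1]][1])
            times[stage] = (enter, leave)
        timeline.append(times)
    return timeline

def write_cycles(program):
    """Cycle in which each instruction reaches the W stage."""
    return [times["W"][0] for times in simulate(program)]

if __name__ == "__main__":
    for (text, _), cycle in zip(PROGRAM, write_cycles(PROGRAM)):
        print(f"{text:15s} writes in cycle {cycle}")
```

In this model the three instructions behind the multiply each finish two cycles later than they would if the multiply took a single E cycle.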
3
The bad thing about pipeline stalls
Stalls force 'no operation' cycles on the expensive hardware of the previous stages
The following instructions finish later than absolutely necessary
Pipeline stalls should be avoided whenever possible!
4
Another pipeline problem: ‘dependencies’
In the standard pipeline, instructions which depend upon each other's results give problems.
[Figure: pipeline timing diagram (stages F, D, E, W) for
  r1 := r2 + 3
  r4 := r3 - r1
with initial values r1 = 11, r2 = 22, r3 = 34. The first instruction computes r2+3 = 22+3 = 25, but the second reads r1 before '25' has been written, so it computes r3-r1 = 34-11 = 23 instead of 34-25 = 9: wrong value!]
5
Solving the dependency problem
Compare D, E and W stage operands, stall the pipeline if a match is found.
[Figure: pipeline timing diagram for r1 := r2 + 3 followed by r4 := r3 - r1, with the comparisons 'D source = E destination' and 'D source = W destination'. The second instruction is held in D until r1 := 25 has been written, then computes r3-r1 = 34-25 = 9, so r4 := 9.]
6
Result forwarding to solve dependencies
[Figure: four-stage datapath (stage 1 Fetch, stage 2 Decode, stage 3 Execute, stage 4 Write) showing the PC with its +1 incrementer, program memory, data registers, ALU, instruction registers I1, I2 and I3, register outputs S1 and S2, control multiplexers producing s1 and s2, the result D/d, and the result forwarding 'path'.]
{ source operand control and multiplexer specification: }
IF I3.dest = I2.source1 THEN s1 := D ELSE s1 := S1;
IF I3.dest = I2.source2 THEN s2 := D ELSE s2 := S2;
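The multiplexer specification above can be tried out in a few lines of Python. This is only a sketch: it assumes that I2 is the instruction whose operands are being selected, that I3 is the instruction just ahead of it whose result D has not yet reached the register file, and that S1 and S2 are the (possibly stale) values read from the data registers. The `Instr` class and `select_operands` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Instr:
    dest: Optional[str]
    source1: Optional[str]
    source2: Optional[str]

def select_operands(i2, i3, S1, S2, D):
    """Model the two control multiplexers: forward D when I3 writes a source of I2."""
    forward1 = i3.dest is not None and i3.dest == i2.source1
    forward2 = i3.dest is not None and i3.dest == i2.source2
    s1 = D if forward1 else S1
    s2 = D if forward2 else S2
    return s1, s2

if __name__ == "__main__":
    # r1 := r2 + 3 followed by r4 := r3 - r1, with r1 = 11, r2 = 22, r3 = 34.
    i3 = Instr(dest="r1", source1="r2", source2=None)
    i2 = Instr(dest="r4", source1="r3", source2="r1")
    D = 22 + 3                      # result of I3, not yet written to r1
    s1, s2 = select_operands(i2, i3, S1=34, S2=11, D=D)
    print(s1 - s2)                  # uses the forwarded 25, not the stale 11
```

With forwarding, the dependent subtraction sees 25 for r1 without waiting for the write to complete.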
7
Parallel pipelines to speed things up
No need to wait for the completion of slow operations, if handled by separate hardware.
[Figure: pipeline timing diagram for
  r1 := r2 + 3
  r3 := [r4]
  r6 := r4 - 2
  r7 := r2 - r5
  r0 := r3 + 22
The memory read r3 := [r4] passes through M stages. Annotations: 'memory pipeline', 'hardware forwarding!', '2 write stages', 'write order reversed!']
8
The ‘order of completion’
In this example, we have 'out-of-order completion' (shorthand: ‘OOO’)
r6 is written before r3, while the instruction ordering suggests r3 before r6!
The normal case is called 'in-order completion'
9
Dependencies with OOO completion
[Figure: three pairs of instructions, each with source and dest fields, drawn in time order; one pair per dependency type]
write/read or 'true data dependency': reading the 2nd source must wait for the 1st destination write, otherwise wrong source value in the 2nd instruction
write/write dependency: writing the 2nd destination must be done after writing the 1st destination, otherwise this leaves the wrong result in the destination at the end
read/write dependency or 'antidependency': writing the 2nd destination must be done after reading the 1st source value, otherwise wrong source value in the 1st instruction
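The three dependency types can be checked mechanically by comparing the source and destination registers of two instructions. The sketch below assumes each instruction is reduced to a `(dest, sources)` pair; the function name `classify_dependencies` and this representation are illustrative choices, not part of the lecture.

```python
def classify_dependencies(first, second):
    """Return the dependency types that 'second' has on the earlier 'first'.

    Each instruction is a (dest, sources) pair, e.g. ("r1", ["r2"]).
    """
    first_dest, first_sources = first
    second_dest, second_sources = second
    found = []
    if first_dest in second_sources:
        # 2nd reads what 1st writes
        found.append("write/read (true data dependency)")
    if first_dest == second_dest:
        # both write the same register
        found.append("write/write")
    if second_dest in first_sources:
        # 2nd overwrites what 1st still has to read
        found.append("read/write (antidependency)")
    return found

if __name__ == "__main__":
    load = ("r3", ["r4"])          # r3 := [r4]
    use = ("r0", ["r3"])           # r0 := r3 + 22
    print(classify_dependencies(load, use))
```

Only the first type reflects a real flow of data; the other two arise from re-using a register name.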
10
‘Scoreboarding’ instead of forwarding
Result forwarding helps in a simple pipeline
 – It becomes rather complex in a multiple pipeline with out-of-order completion
 – One of the earlier DEC Alpha processors used more than 40 result forwarding paths
A 'register scoreboard' can be used to make sure that dependency relations are kept in order
11
Operation of a register scoreboard
All registers have a 'scoreboard' bit, initially reset
Instructions wait in the Decode stage until all their source and destination scoreboard bits are reset (to zero)
Instructions which exit the Decode stage set the scoreboard bit in their destination register(s)
A scoreboard bit is reset during the writing of a destination register in any Writeback stage
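The four scoreboard rules above translate almost directly into code. The following is a minimal sketch with one boolean per register; the class and method names (`Scoreboard`, `can_leave_decode`, `leave_decode`, `writeback`) are hypothetical.

```python
class Scoreboard:
    def __init__(self, registers):
        # All scoreboard bits are initially reset.
        self.busy = {reg: False for reg in registers}

    def can_leave_decode(self, sources, dests):
        """True when every source and destination scoreboard bit is reset."""
        return not any(self.busy[reg] for reg in (*sources, *dests))

    def leave_decode(self, sources, dests):
        """Instruction exits Decode: set the bit of each destination register."""
        if not self.can_leave_decode(sources, dests):
            raise RuntimeError("instruction must wait in the Decode stage")
        for reg in dests:
            self.busy[reg] = True

    def writeback(self, dest):
        """A Writeback stage writes 'dest': reset its scoreboard bit."""
        self.busy[dest] = False

if __name__ == "__main__":
    sb = Scoreboard(["r0", "r3", "r4"])
    sb.leave_decode(sources=["r4"], dests=["r3"])       # r3 := [r4]
    print(sb.can_leave_decode(["r3"], ["r0"]))          # r0 := r3 + 22 must wait
    sb.writeback("r3")
    print(sb.can_leave_decode(["r3"], ["r0"]))          # now it may proceed
```

Because destination bits are also checked in Decode, this simple form stalls write-write and antidependencies there as well.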
12
Scoreboard performance
A simple scoreboard is very conservative in its stalling decisions
 – It stalls the pipeline for true data dependencies
   But removes all forwarding paths in return!
Write-write and antidependencies are stalled much longer than absolutely necessary
 – They should be stalled in the Writeback stage, not the Decode stage!
13
The real reason for some dependencies
Write-write and antidependencies exist because a register is re-used to hold another value!
If we use a different destination register for each write action, these dependencies vanish
 – This requires changing the program, which is not always possible
 – The number of available registers may not be enough
[Callout: every result a different register?]
14
Register ‘renaming’ as solution
Write-write and antidependencies can be removed by writing each result in a different hardware register
 – This removes the direct relation between a register number in the program and a real register
   Register numbers are renamed into something else!
 – Have to make sure that source register references always use the correct (renamed) hardware register
15
Register renaming example
     before renaming:     after renaming:
 1)  R1 := R2 + 3         R1b := R2a + 3
 2)  R3 := R1 x 2         R3b := R1b x 2
 3)  R1 := R6 + R2        R1c := R6a + R2a
 4)  R2 := R1 - 15        R2b := R1c - 15
All registers start as R..a
[Figure legend: arrows between the instructions mark true dependencies, antidependencies and write-write dependencies]
16
An implementation of register renaming
Use a lookup table in the Decode stage which indicates the 'current' hardware register for each of the software-visible registers
 – Source values are read from the hardware registers currently referenced from the lookup table
 – Each destination register gets a 'fresh' hardware register whose reference is placed in the lookup table
 – Later pipeline stages all use the hardware register references for result forwarding and/or writeback
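The lookup-table scheme can be sketched in Python and applied to the four-instruction renaming example from the previous slide. As a simplification, a 'fresh' hardware register is modelled here by bumping a per-register letter suffix (a, b, c, ...) rather than drawing from a pool of free registers, so the sketch is limited to 26 versions per register. The names `rename` and `PROGRAM`, and the `(dest, operator, operands)` encoding, are assumptions made for illustration.

```python
import string

# (destination, operator, operands); operands not in the table are constants
PROGRAM = [
    ("R1", "+", ["R2", "3"]),
    ("R3", "x", ["R1", "2"]),
    ("R1", "+", ["R6", "R2"]),
    ("R2", "-", ["R1", "15"]),
]

def rename(program, registers):
    """Rename registers using a lookup table of 'current' hardware registers."""
    version = {reg: 0 for reg in registers}      # the lookup table

    def current(reg):
        return reg + string.ascii_lowercase[version[reg]]

    renamed = []
    for dest, op, operands in program:
        # Sources are read through the table BEFORE the destination is renamed.
        sources = [current(x) if x in version else x for x in operands]
        version[dest] += 1                       # destination gets a fresh register
        renamed.append(f"{current(dest)} := {sources[0]} {op} {sources[1]}")
    return renamed

if __name__ == "__main__":
    for line in rename(PROGRAM, ["R1", "R2", "R3", "R6"]):
        print(line)
```

Reading the sources before updating the table matters for instructions such as R1 := R1 + 1, where the source must still refer to the old hardware register.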
17
The problem with register renaming
When is a hardware register not needed anymore?
OR, in other words
When can a hardware register be re-used?
 – There must be another hardware register assigned for its software register number
   AND
 – All source value references to it must have been done
Will be solved later
18
Flow control instructions in the pipeline
When the PC is changed by an instruction, the Fetch stage must wait for the actual update
 – For instance: a relative jump calculated by the ALU, with PC updated in the Writeback stage
[Figure: pipeline timing diagram for PC := PC + 5 (is a jump), r3 := r4 x r5 and r8 := r1 - 22. Annotations: 'PC updated here', 'fetch at wrong address!']
19
Improving the flow control handling
The number of stall cycles can be reduced a lot: update the PC earlier in the pipeline
 – For instance in the Decode stage
[Figure: pipeline timing diagram for PC := 25 (is a jump), r3 := r4 x r5 and r8 := r1 - 22. Annotations: 'PC updated here', 'fetch at wrong address!'. A '–' in the diagram marks a no-operation (NOP).]
20
Another method: use ‘delay slots’
The pipeline stall can be removed by executing instructions following the flow control instruction
 – These are executed before the actual jump is made
[Figure: pipeline timing diagram for
  X:   PC := 25        (is a jump)
  X+1: r3 := r4 x r5   ('delay slot', execute anyway...)
  25:  r8 := r1 - 22
Annotation: 'PC updated here']
21
Delay slots: to have or not to have
Using delay slots changes processor behaviour: old programs will not run anymore!
Compilers try to find useful instructions for delay slots
 – Able to fill 75% of the first delay slots
 – But only filling 40% of the second delay slots
If no useful instruction can be found, insert a NOP
22
An alternative to delay slots
Sometimes there are several stages between fetching and execution (PC update) of a jump instruction
 – Would lead to many (unfillable) delay slots
Alternative solution: a 'branch target cache' (BTC)
 – This cache contains for out-of-sequence jumps the new PC value and the first (few) instruction(s)
 – Is indexed on the address of the jump instruction
   The BTC ‘knows’ a jump is coming before it is fetched!
23
Operation of the Branch Target Cache
If the Branch Target Cache hits, the fetch stage starts fetching after the target address
 – The BTC provides the first (few) instruction(s) itself
[Figure: pipeline timing diagram for
  10: PC := 22
  11: r2 := r6 + 3
  22: r3 := r4 x r5
  23: r8 := r1 - 22
Annotations: 'BTC checks address 10', 'Hit!', 'BTC provides instruction' (22: r3 := r4 x r5), 'PC updated to 23...']
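The hit case can be modelled with a dictionary indexed on the address of the jump instruction. This is a simplified sketch, assuming one instruction per address and a BTC entry that stores the PC to continue from together with the instruction(s) at the target; `fetch_sequence`, `PROGRAM` and `BTC` are hypothetical names, and the addresses follow the example on this slide.

```python
PROGRAM = {
    10: "PC := 22",
    11: "r2 := r6 + 3",
    22: "r3 := r4 x r5",
    23: "r8 := r1 - 22",
}

# jump address -> (PC to continue fetching from, instructions supplied by the BTC)
BTC = {10: (23, ["r3 := r4 x r5"])}

def fetch_sequence(start_pc, count, program, btc):
    """Return the first 'count' instructions entering the pipeline."""
    fetched = []
    pc = start_pc
    while len(fetched) < count:
        fetched.append(program[pc])
        if pc in btc:
            # Hit: the BTC supplies the target instruction(s) itself and
            # the fetch stage continues after the target address.
            next_pc, supplied = btc[pc]
            fetched.extend(supplied)
            pc = next_pc
        else:
            pc += 1
    return fetched[:count]

if __name__ == "__main__":
    print(fetch_sequence(10, 3, PROGRAM, BTC))
    print(fetch_sequence(10, 2, PROGRAM, {}))   # no BTC: falls through to 11
```

With the BTC entry present, the instruction at address 11 never enters the pipeline; without it, the fetch stage simply continues sequentially.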
24
Jump prediction saves time
By predicting the outcome of a conditional jump, no need to wait until test outcome is known
 – Example: condition test outcome known in W stage
[Figure: pipeline timing diagram for
  10: JNZ r1,22       (Prediction: taken)
  11: r2 := 3         ('delay slot')
  22: r3 := r4 x r5
  23: r7 := r9
  24: r6 := 5
  12: r8 := 0
showing two outcomes, 'Prediction correct!' and 'Prediction wrong!'; the row for 12: r8 := 0 contains NOP ('–') cycles.]
Must avoid wrong predictions!
25
How to predict a test outcome (1)
Prediction may be given with bit in instruction
 – Shifts prediction problem to the assembler/compiler
 – Instruction set must be changed to hold this flag
The prediction may be based upon the type of test and/or jump direction
 – End of loop jumps are taken most of the time
 – A single bit test is generally unpredictable...
26
How to predict a test outcome (2)
Prediction can be based upon the previous outcome(s) of the condition test
 – This is done with a 'branch history buffer'
   A cache which holds information for the most recently executed conditional jumps
   May be based solely on last execution or more complex (statistical) algorithms
   Implemented in separate hardware or combined with branch target/instruction caches
   Combination can achieve a 'hit rate' of > 90%!
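Two possible history schemes are sketched below. The first predicts purely from the last execution, as mentioned above. The second uses a 2-bit saturating counter, which is a common textbook scheme chosen here only as an example of a 'more complex' algorithm; the slides do not say which algorithm is meant. All class and function names are hypothetical, and an unseen jump is assumed to be predicted taken.

```python
class LastOutcomePredictor:
    """Predict that a jump does what it did the previous time."""
    def __init__(self):
        self.history = {}                      # jump address -> last outcome

    def predict(self, address):
        return self.history.get(address, True)

    def update(self, address, taken):
        self.history[address] = taken

class TwoBitPredictor:
    """Saturating counter 0..3 per jump; 2 or 3 means 'predict taken'."""
    def __init__(self):
        self.counters = {}                     # jump address -> counter

    def predict(self, address):
        return self.counters.get(address, 2) >= 2

    def update(self, address, taken):
        counter = self.counters.get(address, 2)
        self.counters[address] = min(3, counter + 1) if taken else max(0, counter - 1)

def count_hits(predictor, address, outcomes):
    """Number of correct predictions for one jump over a list of outcomes."""
    hits = 0
    for taken in outcomes:
        if predictor.predict(address) == taken:
            hits += 1
        predictor.update(address, taken)
    return hits

if __name__ == "__main__":
    # An end-of-loop jump: taken 9 times, then falls through; loop run 10 times.
    outcomes = ([True] * 9 + [False]) * 10
    print(count_hits(LastOutcomePredictor(), 10, outcomes), "of", len(outcomes))
    print(count_hits(TwoBitPredictor(), 10, outcomes), "of", len(outcomes))
```

On this loop pattern the last-outcome scheme mispredicts twice per loop run (at the exit and again on re-entry), while the counter only mispredicts at the exit.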
27
CALL and RETURN handling
A subroutine CALL can be seen as a jump combined with a memory write
 – Is not more problematic than a normal JUMP
A subroutine RETURN gives more problems
 – The new PC value cannot be determined from the instruction location and contents
 – Special tricks exist to bypass the memory stack read (for instance a ‘return address cache’)
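One way a 'return address cache' might work is as a small hardware stack: each CALL pushes its return address, and each RETURN pops the most recent one as the predicted new PC. The slides do not describe the mechanism, so the sketch below is an assumption throughout, including the fixed depth, the one-address-per-instruction model and the names `ReturnAddressCache`, `on_call` and `predict_return`.

```python
class ReturnAddressCache:
    def __init__(self, depth=4):
        self.depth = depth
        self.entries = []

    def on_call(self, call_address, instruction_length=1):
        """A CALL at 'call_address' returns to the instruction following it."""
        if len(self.entries) == self.depth:
            self.entries.pop(0)                # full: forget the oldest entry
        self.entries.append(call_address + instruction_length)

    def predict_return(self):
        """Predicted PC for a RETURN, or None if nothing is cached."""
        return self.entries.pop() if self.entries else None

if __name__ == "__main__":
    cache = ReturnAddressCache()
    cache.on_call(100)                         # outer CALL at address 100
    cache.on_call(250)                         # nested CALL at address 250
    print(cache.predict_return())              # inner RETURN
    print(cache.predict_return())              # outer RETURN
```

As with any prediction, the value popped here would still have to be checked against the real return address read from the memory stack.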
28
Calculated and indirect jumps
These give huge problems in a pipeline
 – The new PC value must be determined before fetching can continue
Most of the speedup tricks break down on this problem
 – A Branch Target Cache can help a little bit, but only if the actual target remains stable
 – The predicted target must be checked afterwards!