Architectures of Digital Information Systems
Part 4: Caches, pipelines and superscalar machines
dr.ir. A.C. Verschueren
Eindhoven University of Technology, Faculty of Electrical Engineering, Section of Digital Information Systems
The memory speed 'gap'
High-performance processors are much too fast for the main memory they are connected to:
– Processors running at 1000 MegaHertz would like a memory read/write cycle time of 1 nanosecond
– Large memories built from (relatively) cheap RAMs have cycle times on the order of 100 nanoseconds
That is 100 times slower, and this speed gap continues to grow...
Wide words and memory banking
The gap can be closed IF the processor tolerates a long delay between the start and end of a cycle:
1) Wide memory words: 4 words read in parallel — but this costs lots of pins
2) Multiple memory 'banks': 4 accesses in parallel — but this requires complex timing
The big IF in closing the gap
Long memory access delays can be tolerated IF addresses are known in advance:
– True for sequential instruction reads
– NOT true for most other read operations
Memory reading MUST become quicker!
We are not interested in (the timing of) write operations:
– Send data & address to memory, then forget about it...
Small-scale virtual memory: the cache
A 'cache' is a small but very fast memory which contains the 'most active' memory words.
IF a requested memory word is in the cache
THEN supply the word from the cache {very fast}
ELSE supply the word from main memory {rather slow} and place it in the cache for later references (throwing out unused words when needed)
– An ideal cache knows which words will be used soon
– A good cache reaches 95% THEN and only 5% ELSE
('Cache' is French: 'secret hiding place')
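The IF/THEN/ELSE behaviour above can be sketched as a small simulation — a hypothetical word-level cache with least-recently-used eviction (the slides do not prescribe a replacement policy; LRU is used here purely for illustration):

```python
from collections import OrderedDict

class Cache:
    """Word-level cache with LRU eviction (illustrative sketch)."""
    def __init__(self, capacity, main_memory):
        self.capacity = capacity
        self.memory = main_memory      # dict: address -> word
        self.lines = OrderedDict()     # cached address -> word
        self.hits = self.misses = 0

    def read(self, address):
        if address in self.lines:              # THEN: supply from cache
            self.hits += 1
            self.lines.move_to_end(address)    # mark as recently used
            return self.lines[address]
        self.misses += 1                       # ELSE: go to main memory
        word = self.memory[address]
        if len(self.lines) >= self.capacity:   # throw out an unused word
            self.lines.popitem(last=False)
        self.lines[address] = word             # keep it for later references
        return word

memory = {a: a * 10 for a in range(100)}
cache = Cache(capacity=4, main_memory=memory)
for a in [1, 2, 1, 3, 1, 2]:       # repeated addresses hit the cache
    cache.read(a)
print(cache.hits, cache.misses)    # 3 3
```

A good cache would see far more reuse than this toy trace: the 95%/5% split on the slide corresponds to real programs' locality of reference.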
Keeping the cache hidden
The cache must keep an exact copy of memory words.
Memory-mapped I/O ports are problematic:
– These can spontaneously change their value!
– They have to be made 'non-cacheable' at all times
Shared memory is problematic too:
– Make it non-cacheable (from all sides), or better:
– Inform all attached caches of changes (write actions)
Cache writing policies
'Write-through': written data is copied into memory
– Option: write to the cache only if the word is already present; this reduces the amount of data in the cache
– A read after a non-cached write then requires a true memory read
'Posted write': writes are buffered until the bus is free
– Gives priority to reads, allows high-speed write bursts
– Costs more hardware, and introduces a delay between the CPU and the memory write
'Late write': write to memory only to make free space in the cache (used in the Pentium)
– Drastically reduces the number of memory write cycles
– Complex cache control, especially with shared memory!
An example of a cache
To reduce the amount of administration memory, a single cache 'line' administrates blocks of 8 words.
[Diagram: an 80386 CPU connects via its CPU bus (data, address, control) to a bus switch and an 82385 cache controller with its administration; the cache memory sits beside them, and the main memory is reached over the system bus.]
Intel 82385 'direct mapped' cache mode
Also known as '1-way set associative'; prone to 'tag clashing'!
The 32-bit address is split into: byte (2 bits), word (3 bits), line (10 bits) and 'tag' (17 bits).
[Diagram: 1024 cache lines, each holding a 17-bit tag, a 'line valid' bit and 8 words (#0..#7) of 32-bit data with 'word valid' bits. The line field selects the line, the stored tag is compared with the address tag to generate a 'hit', and the word field selects the word.]
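The address split above can be sketched in a few lines, with the field widths taken from the slide (2-bit byte, 3-bit word, 10-bit line, 17-bit tag):

```python
def split_address(addr):
    """Decompose a 32-bit address for the 82385 direct-mapped mode."""
    byte = addr & 0x3              # bits 1..0: byte within the word
    word = (addr >> 2) & 0x7       # bits 4..2: word within the 8-word line
    line = (addr >> 5) & 0x3FF     # bits 14..5: selects one of 1024 lines
    tag  = (addr >> 15) & 0x1FFFF  # bits 31..15: compared to the stored tag
    return tag, line, word, byte

# Two addresses 2**15 apart map to the SAME line with DIFFERENT tags:
# exactly the 'tag clashing' a direct-mapped cache is prone to.
print(split_address(0x00008000))  # (1, 0, 0, 0)
print(split_address(0x00010000))  # (2, 0, 0, 0)
```

Alternating accesses to two such clashing addresses evict each other on every access, even though the rest of the cache is empty.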
Intel 82385 '2-way set associative' mode
The cache is split into two sets of 512 lines each, with 18-bit tags; the 32-bit address now has a 9-bit line field.
– 'Least Recently Used' bits indicate which set in each line has been used last (the other is the replacement target)
[Diagram: both sets are checked in parallel on their 18-bit tags and 'line valid'/'word valid' bits; hit logic combines the two 'hit' signals.]
The MESI protocol
Late write and shared memory combine badly. The 'MESI' protocol solves this with four states for each of the cache words (or lines):
Modified: cached data differs from main memory and is located only in this cache
Exclusive: cached data is the same as main memory and is located only in this cache
Shared: cached data is the same as main memory and is also located in one or more other caches
Invalid: cache word/line not loaded with memory data
State changes in the MESI protocol
These are induced by processor read/write actions and by actions of other cache controllers.
Caches keep track of the other caches' read/write actions:
– Using 'bus snooping': monitoring the address and control buses when they are driven by someone else
– During a memory access, the other cache controllers indicate whether one of them contains the accessed location
This is needed to decide between the Shared and Exclusive states!
Intel 82496 CPU accesses (the Pentium's cache controller)
A read hit reads the cache and does not change the state.
A read miss reads memory; the other controllers check whether they also contain the address read.
Write hit handling depends on the state:
– If Shared, the write is done in main memory too
– If Exclusive or Modified, the write is done only in the cache
A write miss writes to memory, but not to the cache (normal MESI would write the cache too).
Other caches may change their state!
Intel 82496 state diagram
[State diagram over the four MESI states:]
– Invalid → Shared: read miss, also present somewhere else
– Invalid → Exclusive: read miss, only here
– Shared → Exclusive: write hit (write to memory)
– Exclusive → Modified: write hit (setup for late write)
– Modified → Shared: snoop read (*)
– Exclusive → Shared: snoop read
– Modified/Exclusive/Shared → Invalid: snoop write
– Read hits (and read/write hits on Modified) do not change the state; a write miss or any snoop leaves Invalid unchanged
(*): This controller copies its local data to memory immediately
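A minimal sketch of these processor-side and snoop-side transitions, as read from the state diagram (function names and the `shared_elsewhere` parameter are illustrative):

```python
# States: 'M' Modified, 'E' Exclusive, 'S' Shared, 'I' Invalid.

def cpu_access(state, op, shared_elsewhere=False):
    """Next state after a processor 'read' or 'write'."""
    if state == 'I':
        if op == 'read':                      # read miss
            return 'S' if shared_elsewhere else 'E'
        return 'I'                            # write miss: memory only
    if op == 'read':                          # read hit: no state change
        return state
    if state == 'S':                          # write hit: also to memory
        return 'E'
    return 'M'                                # E/M write hit: cache only

def snoop(state, op):
    """Next state after observing another cache's 'read' or 'write'."""
    if op == 'write':                         # someone else writes: invalidate
        return 'I'
    if state in ('M', 'E'):                   # snoop read: fall back to Shared
        return 'S'                            # (M copies its data to memory)
    return state

s = cpu_access('I', 'read', shared_elsewhere=False)  # -> 'E'
s = cpu_access(s, 'write')                           # -> 'M'
s = snoop(s, 'read')                                 # -> 'S'
print(s)
```

The trace at the bottom follows one line through the diagram: loaded exclusively, modified locally by a late write, then demoted to Shared when another cache reads it.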
Final remarks on caches (1)
High-performance processors rely on caches:
– Main memory must (appear to) be accessed in a single clock cycle
At 1 GHz, the cache must be on the CPU chip:
– But a large & fast cache takes a lot of chip space!
[Diagram: a small & fast on-chip cache (first level) backed by a larger & slower off-chip cache (second level), backed by the huge & very slow main memory.]
Final remarks on caches (2)
The off-chip cache becomes as slow as main memory was some time ago...
The second-level cache is therefore placed on the CPU chip too:
– Examples: PowerPC, Crusoe (both > 256 KiloByte!)
– The external cache then becomes a third-level cache
– Data transfer between the on-chip caches can be done a complete cache line in parallel: a huge speedup
Speeding it up: which speed?
It is nice to talk for hours about how to increase the speed of a processor, but... what do we actually want?
We first have to look at the application side, where speed is measured more in terms of algorithm execution performance than processor performance.
Different applications, different speeds
Fixed-function (control) applications: the required algorithms must be executed in a given amount of time; it is expensive and unnecessary to go any faster!
Transaction processing and databases: the algorithms must be executed at a speed such that the system is not perceived as 'slow' by human standards.
'Number crunching' and simulations: the algorithms must be executed as fast as possible.
The last category is the least common!
Number crunching and simulation (1)
These are the only applications where algorithm processing speed is of major concern:
– A single operation may take hours or even days!
It may be worthwhile to spend a lot of money to increase processing speed by 'only' 10%:
– These users are willing to upgrade their computer once a year to follow the latest technology trend...
Number crunching and simulation (2)
'No holds barred': all the tricks in the book are used:
– Massively parallel processor systems
– Special-purpose hardware
– Vector processors and 'systolic arrays' ('Single Instruction Multiple Data' machines)
– 'Normal' processors sped up by all kinds of tricks, often based upon the type of operations to be performed
We will focus on some of these tricks.
Algorithm processing speed
The clock speed of a processor doesn't say much:
– The Rekursiv machine (vCISC) at 10 MHz beats a TI 'LISP engine' (RISC) at 40 MHz running LISP
– The reason: the Rekursiv can 'malloc' in one clock cycle
It is possible to optimise a processor architecture to fit the programming language:
– which may give tremendous speedups (for instance for LISP, Smalltalk or Java)
The problem with 'benchmarks'
'Million Instructions Per Second' is an empty measurement unless scaled to some normalised instruction set and 'mix' — hence MIPS: Meaningless Information about Processor Speed.
'Standard' benchmark programs are not representative of real applications:
– their instruction mix is non-standard, and
– results are influenced by the compiler which is used
Reduced Instruction Set Computers
Execute a 'simple' instruction set (load/store philosophy: operations between registers only).
Have fixed-length instructions with a few formats (easy to decode, but sometimes space-inefficient).
Use a large number of general-purpose registers (needed for calculating addresses and to reduce reads/writes).
Tuned for high-speed instruction execution:
– But not high-speed 'C' execution, as some believe
Complex Instruction Set Computers
Execute a complex instruction set: much more is done in one instruction, which makes it difficult to decode.
Have variable-length instructions: this gives higher storage efficiency and shorter programs.
Use a moderate number of registers (some of them special-purpose).
Tuneable towards high-level language execution:
– for instance the 'ENTER' and 'LEAVE' instructions in the 80286
– or even operating system support (task switching)
The RISC/CISC boundary fades fast
'RISC' is sometimes a completely misplaced label:
– The IBM 'POWER' architecture knows more instructions than an average CISC
RISC speed ('one instruction per clock') can also be reached by modern CISC processors:
– Which then perform the equivalent of several RISC instructions in that same 'single clock'
The number of instructions per clock
'One instruction per clock' (1 IPC) is hardly ever reached, even by RISC CPUs.
Early RISCs reached 0.3..0.5 IPC:
– it takes a lot of hardware to reach 0.6..0.7 IPC when running normal programs!
Only 'superscalar' processors can reach (and even exceed) 1 IPC.
Standard CISC instruction execution
In the old days, a CISC processor took a lot of clocks to execute a single instruction:
1: fetch the (first part of the) instruction
2: decode the (first part of the) instruction
3: fetch more parts of the instruction if needed
4: fetch operands from memory (after address calculations) and/or registers
5: perform the ALU operation (may take several cycles)
6: write result(s) to registers and/or memory
A program to execute programs
These old machines interpreted the actual program:
– They ran a lower-level ('microcode') program!
Hardware was expensive, so it was re-used for different purposes during different clock cycles:
– A single bus to transfer data inside the processor
– One ALU for both addresses and actual operations
Streamlining the execution on a RISC
Early RISC processors could break instruction execution into four basic steps:
1: Fetch instruction (always the same size)
2: Decode instruction and read source operands (s1, s2)
3: Execute the actual operation in the ALU
4: Write result to destination operand (d)
We will denote these four steps with the letters F, D, E, W from now on...
Single clock RISC instruction execution
The basic instruction execution steps can all be executed within one clock.
[Diagram: the PC addresses the program memory; the fetched instruction's s1/s2 fields read the data registers, the ALU computes, and the result is written back to register d — all within a single clock cycle, while the PC is incremented.]
Single clock RISC execution timing
This is a bit slow in terms of clock speed: within one clock cycle, the program address, the instruction, the source operands and the ALU result must all settle in sequence (clock delays plus setup time to the next clock edge).
Extra registers for the one-clock RISC
The clock speed can be increased by adding extra registers.
[Diagram: registers 'I' (instruction), 'S1'/'S2' (source operands) and 'D' (result) are inserted around the ALU; a control unit tells all registers when to load.]
Timing of RISC with extra registers
The control unit tells all registers when to load:
1: Read program memory and store in 'I', PC++
2: Read source registers and store in 'S1'/'S2'
3: Perform the ALU operation and store the result in 'D'
4: Write the 'D' contents into the destination register
Less is done in each clock cycle, so the clock speed is higher — but the number of clocks per instruction goes up, and the total instruction execution time increases!
Reducing hardware costs
The previous solution can be optimised a lot to reduce hardware costs.
[Diagram: a multiplexer shares a single internal bus between the ALU, the program memory and now single-ported data registers.]
The 'reduced hardware costs' timing
Separate clock cycles are used for reading the two source operands:
– The data registers have become single-ported (much less hardware than 3-ported)
– It is even possible to do PC++ with the ALU
Back at square one: this is how they used to do it in the old days... VERY slow.
Splitting the processor in 'stages'
By adding even more registers, we can split the processor into 'stages': stage 1 Fetch, stage 2 Decode, stage 3 Execute, stage 4 Write.
[Diagram: pipeline registers I1, S1/S2 + I2, D + I3 separate the four stages; the Write stage returns result d to the data registers.]
The stages form a 'pipeline'
Each stage uses independent hardware:
– It performs one of the basic instruction execution steps
The stages can all work at the same time:
– In general on different instructions!
This way of splitting a processor into stages is called 'pipelining'.
The timing of a pipeline
These stages handle 4 instructions in parallel, at roughly four times the clock speed of the first hardware implementation!

clock:    X             X+1            X+2            X+3
stage 1:  fetch N       fetch N+1      fetch N+2      fetch N+3
stage 2:  ?             read src N     read src N+1   read src N+2
stage 3:  ?             ?              ALU op N       ALU op N+1
stage 4:  ?             ?              ?              write dest N
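The table above can be generated by a tiny scheduling sketch: instruction i occupies stage s at clock i + s, so once the pipeline is full all four stages work in parallel.

```python
# Sketch of a 4-stage pipeline (F, D, E, W) filling up over time,
# with no stalls: each instruction advances one stage per clock.
STAGES = ['F', 'D', 'E', 'W']

def pipeline_schedule(n_instructions, n_clocks):
    """Return, per clock, which instruction occupies each stage."""
    table = []
    for clock in range(n_clocks):
        row = {}
        for s, name in enumerate(STAGES):
            instr = clock - s        # instruction i is in stage s at clock i+s
            if 0 <= instr < n_instructions:
                row[name] = instr
        table.append(row)
    return table

for clock, row in enumerate(pipeline_schedule(4, 5)):
    print(clock, row)
# From clock 3 onwards the pipeline is full: four instructions in flight.
```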
Giving more time to a pipeline stage
A pipeline stage which cannot handle the next instruction in one clock cycle has to 'stall' the stages in front of it.
[Diagram: r3 := r4 x r5 needs 2 extra clocks in the ALU (E); the following instructions r6 := r4 - 2, r7 := r2 - r5 and r0 := r5 + 22 must each wait for the stage ahead of them (E, D and F respectively), producing 'stall cycles'.]
The bad thing about pipeline stalls
Stalls force 'no operation' cycles upon the expensive hardware of the previous stages.
The following instructions finish later than absolutely necessary.
Pipeline stalls should be avoided whenever possible!
Another pipeline problem: 'dependencies'
In the standard pipeline, instructions which depend upon each other's results give problems:
r1 := r2 + 3
r4 := r3 - r1
With initial values r1 = 11, r2 = 22, r3 = 34: the first instruction computes r1 := 22 + 3 = 25, but the second reads r1 in its D stage before the '25' has been written, so it computes 34 - 11 = 23 instead of 34 - 25 = 9 — a wrong value!
Solving the dependency problem
Compare the D, E and W stage operands and stall the pipeline if a match is found (D source = E destination, or D source = W destination):
r1 := r2 + 3 proceeds normally; r4 := r3 - r1 is held in the D stage until r1's new value (25) has been written, and then correctly computes 34 - 25 = 9.
Result forwarding to solve dependencies
A result forwarding 'path' feeds the D register back to the ALU inputs; the control unit steers multiplexers on the source operands:
{ source operand control and multiplexer specification: }
IF I3.dest = I2.source1 THEN s1 := D ELSE s1 := S1;
IF I3.dest = I2.source2 THEN s2 := D ELSE s2 := S2;
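The multiplexer specification above can be sketched directly, following the slide's I2/I3 notation (the function and parameter names are illustrative):

```python
# If the instruction now in the Write stage (I3) produces the register
# that the instruction in the Execute stage (I2) needs, take the value
# from the D register instead of the stale S1/S2 copy.

def forward(i2_src1, i2_src2, i3_dest, S1, S2, D):
    s1 = D if i3_dest == i2_src1 else S1
    s2 = D if i3_dest == i2_src2 else S2
    return s1, s2

# r1 := r2 + 3 is in Write (i3_dest = 1, D holds 25);
# r4 := r3 - r1 is in Execute and needs r1 as its second source.
s1, s2 = forward(i2_src1=3, i2_src2=1, i3_dest=1, S1=34, S2=11, D=25)
print(s1 - s2)  # 34 - 25 = 9, the correct result, with no stall cycle
```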
Parallel pipelines to speed things up
There is no need to wait for the completion of slow operations if they are handled by separate hardware.
[Diagram: the memory load r3 := [r4] goes through a separate memory (M) pipeline with its own write stage, while r6 := r4 - 2 and r7 := r2 - r5 continue through the ALU pipeline; r0 := r3 + 22 receives r3 by hardware forwarding. With 2 write stages, the write order of r3 and r6 is reversed!]
The 'order of completion'
In this example, we have 'out-of-order completion' (shorthand: 'OOO'):
– r6 is written before r3, while the instruction ordering suggests r3 before r6!
The normal case is called 'in-order completion'.
Dependencies with OOO completion
Write/read or 'true data dependency': reading the 2nd source must wait for the 1st destination write, otherwise the 2nd instruction uses a wrong source value.
Write/write dependency: writing the 2nd destination must be done after writing the 1st destination, otherwise a wrong result is left in the destination at the end.
Read/write dependency or 'antidependency': writing the 2nd destination must be done after reading the 1st source value, otherwise the 1st instruction uses a wrong source value.
'Scoreboarding' instead of forwarding
Result forwarding helps in a simple pipeline:
– It becomes rather complex in a multiple pipeline with out-of-order completion
– One of the earlier DEC Alpha processors used more than 40 result forwarding paths
A 'register scoreboard' can be used to make sure that dependency relations are kept in order.
Operation of a register scoreboard
All registers have a 'scoreboard' bit, initially reset.
Instructions wait in the Decode stage until all their source and destination scoreboard bits are reset (to zero).
Instructions which exit the Decode stage set the scoreboard bit of their destination register(s).
A scoreboard bit is reset during the writing of a destination register in any Writeback stage.
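The four scoreboard rules above can be sketched as one 'pending write' bit per register (class and method names are illustrative):

```python
class Scoreboard:
    def __init__(self, n_registers):
        self.pending = [False] * n_registers   # all bits initially reset

    def can_issue(self, sources, dests):
        # Wait in Decode until all source AND destination bits are reset.
        return not any(self.pending[r] for r in sources + dests)

    def issue(self, dests):
        for r in dests:                        # leaving Decode sets the bit
            self.pending[r] = True

    def writeback(self, dest):
        self.pending[dest] = False             # writing the result resets it

sb = Scoreboard(8)
sb.issue(dests=[1])                  # r1 := r2 + 3 leaves Decode
print(sb.can_issue([3, 1], [4]))     # r4 := r3 - r1 must wait: False
sb.writeback(1)                      # r1 is written back
print(sb.can_issue([3, 1], [4]))     # now it may issue: True
```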
Scoreboard performance
A simple scoreboard is very conservative in its stalling decisions:
– It stalls the pipeline for true data dependencies, but removes all forwarding paths in return!
– Write-write and antidependencies are stalled much longer than absolutely necessary: they should be stalled in the Writeback stage, not the Decode stage!
The real reason for some dependencies
Write-write and antidependencies exist because a register is re-used to hold another value!
If we use a different destination register for each write action, these dependencies vanish (every result a different register?):
– This requires changing the program, which is not always possible
– The number of available registers may not be enough
Register 'renaming' as solution
Write-write and antidependencies can be removed by writing each result into a different hardware register:
– This removes the direct relation between a register number in the program and a real register: register numbers are renamed into something else!
– We have to make sure that source register references always use the correct (renamed) hardware register
Register renaming example (all registers start as R..a)

before renaming:       after renaming:
1) R1 := R2 + 3        R1b := R2a + 3
2) R3 := R1 x 2        R3b := R1b x 2
3) R1 := R6 + R2       R1c := R6a + R2a
4) R2 := R1 - 15       R2b := R1c - 15

The true dependencies (1→2 on R1, 3→4 on R1) remain; the antidependencies and write-write dependencies disappear.
An implementation of register renaming
Use a lookup table in the Decode stage which indicates the 'current' hardware register for each of the software-visible registers:
– Source values are read from the hardware registers currently referenced by the lookup table
– Each destination register gets a 'fresh' hardware register, whose reference is placed in the lookup table
– Later pipeline stages all use the hardware register references for result forwarding and/or writeback
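The lookup-table mechanism can be sketched as follows; it reproduces the renaming example from the earlier slide, with the a/b/c suffixes standing for successive hardware registers (the class shape is illustrative, not a real design):

```python
class Renamer:
    def __init__(self, n_soft_regs):
        # Software register i initially maps to hardware register 'R{i}a'.
        self.table = {i: f'R{i}a' for i in range(n_soft_regs)}
        self.counter = {i: 0 for i in range(n_soft_regs)}

    def rename(self, dest, sources):
        # Sources use the CURRENT mapping; the destination gets a FRESH one.
        renamed_sources = [self.table[s] for s in sources]
        self.counter[dest] += 1
        self.table[dest] = f'R{dest}' + 'abcdefgh'[self.counter[dest]]
        return self.table[dest], renamed_sources

r = Renamer(8)
print(r.rename(1, [2]))      # R1 := R2 + 3   -> ('R1b', ['R2a'])
print(r.rename(3, [1]))      # R3 := R1 x 2   -> ('R3b', ['R1b'])
print(r.rename(1, [6, 2]))   # R1 := R6 + R2  -> ('R1c', ['R6a', 'R2a'])
print(r.rename(2, [1]))      # R2 := R1 - 15  -> ('R2b', ['R1c'])
```

Note how the third instruction gets the fresh register R1c while the second still reads R1b: the write-write and antidependencies on R1 are gone.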
The problem with register renaming
When is a hardware register not needed anymore? Or, in other words: when can a hardware register be re-used?
– There must be another hardware register assigned for its software register number, AND
– All source value references to it must have been done
(This will be solved later.)
Flow control instructions in the pipeline
When the PC is changed by an instruction, the Fetch stage must wait for the actual update.
– For instance: a relative jump calculated by the ALU, with the PC updated in the Writeback stage
[Diagram: PC := PC + 5 is a jump whose PC update only happens in W; the following instructions r3 := r4 x r5 and r8 := r1 - 22 are fetched at the wrong address.]
Improving the flow control handling
The number of stall cycles can be reduced a lot by updating the PC earlier in the pipeline:
– For instance in the Decode stage
[Diagram: PC := 25 updates the PC in D; only one instruction (r3 := r4 x r5) is fetched at the wrong address and must be turned into a no-operation (NOP) before r8 := r1 - 22 is fetched correctly.]
Another method: use 'delay slots'
The pipeline stall can be removed by executing the instruction(s) following the flow control instruction:
– These are executed before the actual jump is made
[Diagram: X: PC := 25 is a jump; the instruction at X+1 (r3 := r4 x r5) sits in the 'delay slot' and is executed anyway. The PC is updated in D, so 25: r8 := r1 - 22 is fetched next, without stall cycles.]
Delay slots: to have or not to have
Using delay slots changes processor behaviour:
– old programs will not run anymore!
Compilers try to find useful instructions for the delay slots:
– They are able to fill 75% of the first delay slots
– But they fill only 40% of the second delay slots
If no useful instruction can be found, a NOP is inserted.
An alternative to delay slots
Sometimes there are several stages between the fetching and the execution (PC update) of a jump instruction:
– This would lead to many (unfillable) delay slots
Alternative solution: a 'branch target cache' (BTC):
– This cache contains, for out-of-sequence jumps, the new PC value and the first (few) instruction(s)
– It is indexed on the address of the jump instruction: the BTC 'knows' a jump is coming before it is fetched!
Operation of the Branch Target Cache
If the Branch Target Cache hits, the fetch stage starts fetching after the target address:
– The BTC provides the first (few) instruction(s) itself
[Diagram: while the jump 10: PC := 22 is fetched, the BTC checks address 10 and hits. It provides instruction 22: r3 := r4 x r5 itself and updates the PC to 23, so fetching continues with 23: r8 := r1 - 22 without stall cycles; the fall-through instruction 11: r2 := r6 + 3 is never fetched.]
Jump prediction saves time
By predicting the outcome of a conditional jump, there is no need to wait until the test outcome is known:
– Example: the condition test outcome only becomes known in the W stage
[Diagram: 10: JNZ r1,22 is predicted taken; 11: r2 := 3 fills the delay slot, after which 22: r3 := r4 x r5, 23: r7 := r9 and 24: r6 := 5 are fetched. If the prediction is correct, no time is lost; if it is wrong, the fetched instructions must be cancelled and fetching restarts at 12: r8 := 0.]
Wrong predictions must be avoided!
How to predict a test outcome (1)
The prediction may be given with a bit in the instruction:
– This shifts the prediction problem to the assembler/compiler
– The instruction set must be changed to hold this flag
The prediction may be based upon the type of test and/or the jump direction:
– End-of-loop jumps are taken most of the time
– A single bit test is generally unpredictable...
How to predict a test outcome (2)
The prediction can be based upon the previous outcome(s) of the condition test:
– This is done with a 'branch history buffer': a cache which holds information for the most recently executed conditional jumps
– It may be based solely on the last execution, or on more complex (statistical) algorithms
– It can be implemented in separate hardware or combined with the branch target/instruction caches
The combination can achieve a 'hit rate' of > 90%!
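One common 'more complex than last-outcome' scheme is a two-bit saturating counter per jump; the slide does not name a specific algorithm, so the sketch below is an illustration, not the lecture's design:

```python
# Branch history buffer sketch: one two-bit saturating counter (0..3)
# per conditional jump address; counter >= 2 means 'predict taken'.

class BranchHistoryBuffer:
    def __init__(self):
        self.counters = {}   # jump address -> counter 0..3

    def predict(self, addr):
        return self.counters.get(addr, 1) >= 2

    def update(self, addr, taken):
        c = self.counters.get(addr, 1)
        self.counters[addr] = min(c + 1, 3) if taken else max(c - 1, 0)

bhb = BranchHistoryBuffer()
# An end-of-loop jump that is taken eight times, then falls through once:
outcomes = [True] * 8 + [False]
correct = 0
for taken in outcomes:
    correct += (bhb.predict(0x10) == taken)
    bhb.update(0x10, taken)
print(correct, 'of', len(outcomes))  # 7 of 9
```

Only the first and the final iteration are mispredicted; the two-bit counter also keeps predicting 'taken' if the loop is entered again, which a single last-outcome bit would not.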
CALL and RETURN handling
A subroutine CALL can be seen as a jump combined with a memory write:
– It is no more problematic than a normal JUMP
A subroutine RETURN gives more problems:
– The new PC value cannot be determined from the instruction location and contents
– Special tricks exist to bypass the memory stack read (for instance a 'return address cache')
Calculated and indirect jumps
These give huge problems in a pipeline:
– The new PC value must be determined before fetching can continue
Most of the speedup tricks break down on this problem:
– A Branch Target Cache can help a little bit, but only if the actual target remains stable
– The predicted target must be checked afterwards!
Moving instructions around
It is possible to change the execution order of instructions which do not have dependencies. Using the renaming example from before:

without renaming:      with renaming:
1) R1 := R2 + 3        R1b := R2a + 3
2) R3 := R1 x 2        R3b := R1b x 2
3) R1 := R6 + R2       R1c := R6a + R2a
4) R2 := R1 - 15       R2b := R1c - 15

True dependencies: 2) comes after 1), and 4) comes after 3). With renaming, these are the only sequence restrictions — for instance the orders 3, 4, 1, 2 and 3, 1, 2, 4 are also valid!
Out-of-order (OOO) execution
Changing the order of instruction execution can remove pipeline stalls and/or fill delay slots, increasing performance.
– Instructions can also be re-ordered in the program, but that is not OOO execution!
OOO execution: instructions are sent to the operational units (ALU, load/store...) in a different order than the program specifies.
(OOO memory accessing is not discussed here.)
Instruction buffers for OOO execution
To be able to change the execution order, fetched instructions must be buffered. Two strategies:
1) Separate instruction buffers ('reservation stations') for each functional unit, each with its own scheduler
2) A central instruction buffer ('central instruction window') with a single scheduler
[Diagram: in both cases the fetch & decode unit feeds the buffer(s) from program memory, and the buffers dispatch to the ALU and load/store units operating on (renamed) registers.]
Differences between buffer strategies
Reservation stations have advantages:
+ Smaller buffers, and the schedulers are simpler
+ Buffer entries can be tailored to the instruction format
+ Routing of instructions across the chip is simpler
The central instruction window also has advantages:
+ The total number of buffered instructions can be smaller
+ The single scheduler can take better decisions
+ No 'false locking' with identical functional units
False locking between functional units
[Diagram: the instruction sequence A1, B1, A2, B2, A3, B3, A4, B4 is distributed over the reservation stations of two identical ALUs — the A's to one, the B's to the other. Each station issues in order, so an instruction that is not yet ready blocks ('locks') instructions queued behind it in the same station even when the other ALU could have executed them: 'false locking'.]
This will not happen with a central instruction window!
Hybrid solution: one reservation station + one scheduler for multiple identical functional units.
Scheduler operation
The schedulers actually have only a simple task: pick ready-to-execute instructions from their buffers and send them to the appropriate operational units.
– 'Ready-to-execute' means: all source values known
– Try to calculate conditional jump results ASAP
– Otherwise: oldest instructions first
'Ready to execute' determination
The scheduler(s) depend on other system parts to determine which instructions can be executed:
– The fetch unit knows the original order of the instructions and must determine the dependencies
– The operational units signal the end of a dependency when writing a result operand
– The instruction buffer(s) determine from this information which instructions are ready to execute, and store this knowledge in status flags
The 'scoreboard', again
A simple scoreboarding technique can be used for 'ready to execute' determination:
– Renamed registers get a flag bit which indicates whether the register contains a result yet
– Each renamed destination register write sets the attached flag bit to indicate the result is available
An instruction is ready to execute when all the flag bits of its renamed source registers are set.
The problem with interrupts and traps
OOO completion means instruction results may be written in an order which differs from the instruction sequence in the program:
– If an instruction generates a trap, instructions following it may already have changed registers (and/or memory locations!)
– If an interrupt must break off processing, some instructions may not complete while later ones in the program have already completed
75
Solution: a 'safe state' register set With these imprecise interrupts and traps, it is almost impossible to get the processor into a state from which it can be safely restarted. We must find a way to maintain the 'visible' set of processor registers in a 'safe state': updated in the normal program order. We don't care if this updating of the safe state lags behind the normal updating of the renamed set.
76
1/1/ / faculty of Electrical Engineering eindhoven university of technology 'reorder buffer' renamed registers safe register set Implementation of the safe state One common way to provide this 'safe' register set is by using a so-called 'reorder buffer' result bus(es) renamed register number read pointerwrite pointer 'head' 'tail' simulated FIFO renamed real register number in-order updates source operand 0 1 valid flags operand valid
77
1/1/ / faculty of Electrical Engineering eindhoven university of technology Safe register set Operation of the reorder buffer Four instructions writing to (real) registers R2, R1, R2 & R3 renamed renamed register real register value valid real register result renamed register number ‘head ’ NR2?6 N?R1:12 Y6R2:47 N?R3:114 N?R4:0 NR1?7 NR2?6 Y7R1:12 Y6R2:47 N?R3:114 N?R4:0 Y7R1:12 Y8R2:47 N?R3:114 N?R4:0 NR2?8 NR1?7 NR2?6 Y7R1:12 Y8R2:47 Y9R3:114 N?R4:0 NR3?9 NR2?8 NR1?7 NR2?6 N?R1:12 N?R2:47 N?R3:114 N?R4:0 Y7R1:12 Y8R2:47 Y9R3:114 N?R4:0 NR3?9 NR2?8 YR1337 NR2?63 Y7R1:12 Y8R2:47 Y9R3:114 N?R4:0 NR3?9 NR2?8 YR1337 YR26786 678 Y7R1:12 Y8R2:678 Y9R3:114 N?R4:0 NR3?9 NR2?8 YR1337 N?R1:33 Y8R2:678 Y9R3:114 N?R4:0 NR3?9 NR2?8 YR1337 N?R1:33 Y8R2:678 Y9R3:114 N?R4:0 NR3?9 NR2?8 6 N?R1:33 Y8R2:678 Y9R3:114 N?R4:0 YR369 NR2?8 10 N?R1:33 Y8R2:678 Y9R3:114 N?R4:0 YR369 YR2108 N?R1:33 N?R2:10 Y9R3:114 N?R4:0 YR369 YR2108 N?R1:33 N?R2:10 Y9R3:114 N?R4:0 YR369 N?R1:33 N?R2:10 N?R3:6 N?R4:0 YR369 N?R1:33 N?R2:10 N?R3:6 N?R4:0 Y7R1:12 Y8R2:678 Y9R3:114 N?R4:0 NR3?9 NR2?8 YR1337 YR26786 ‘retiring’ Reorder buffer FIFO
78
Other solutions and variations Both 'history buffer' and 'future file' are (minor) variations/extensions of the reorder buffer. A central instruction window can combine the reorder buffer and instruction buffer functions. 'Checkpoint repair' makes backups of the complete register set when problems may occur – only instructions which were already in execution at the time of the backup modify the backup's state (these must complete execution).
79
OOO execution & conditional jumps Machines incapable of moving instructions across (conditional) jumps will not perform well – basic block sizes of 4..6 instructions are normal for CISCs (6..8 instructions for RISCs), and around half of the jumps are conditional! The problem with conditional jumps: if the prediction is wrong, the processor state must be restored to the point of the jump instruction – in fact, the same as if a trap occurred.
80
'Speculative' OOO conditional jumps (1) 'Speculative fetching' fetches and decodes instructions after the conditional jump, but does not take them into execution. 'Speculative execution' also executes instructions in the predicted path, using renaming as a buffer for the in-order (safe) state – the speculative renamed registers are discarded when the prediction was incorrect, and the rename indexes must be restored! (checkpoint repair?)
81
'Speculative' OOO conditional jumps (2) 'Multi-path speculative execution' extends speculative execution to handle both paths following a conditional branch – it may also allow multiple condition tests to remain unresolved (needs more checkpointing buffers). Retiring of renamed registers is frozen for speculative renamed registers until the branch outcome is known.
82
Handling more instructions per clock Fetching more than one instruction per clock is generally not such a problem – make the bus to the instruction memory wider! We need more than one functional unit to actually execute the instructions in parallel, and must also decode more than one instruction per clock to get a 'superscalar' processor.
83
Superscalar parts we have already seen Instruction decoders can easily send multiple instructions to separate reservation stations – with a minor increase in complexity, even multiple instructions to the same reservation station. The central instruction window can be modified to receive multiple instructions in a single cycle, and the scheduler can be changed to handle multiple instructions in parallel.
84
Superscalar dependency detection Instruction dependency determination must now be partially implemented in a parallel form – renamed register indexes must be forwarded between concurrently decoded instructions, and it must be possible to create multiple renamed registers in a single cycle. It must also be possible to update multiple in-order (safe) registers in parallel!
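The forwarding of renamed register indexes between co-decoded instructions can be sketched as follows. The structure is invented for illustration (the `rename_group` name, the map table, and the `p…` renamed-register naming are assumptions); the key point is that a source is first looked up among destinations renamed earlier in the same cycle, and only then in the map table.

```python
# Sketch: rename a group of co-decoded instructions in one cycle.
# Each source first checks destinations renamed earlier IN THE SAME GROUP
# (intra-group forwarding), then the map table; each destination gets a
# fresh renamed register.

def rename_group(group, map_table, next_free):
    """group: [(dest, [sources]), ...] in program order."""
    renamed = []
    local = {}                           # dests renamed earlier this cycle
    for dest, srcs in group:
        rsrcs = [local.get(s, map_table.get(s, s)) for s in srcs]
        new = f'p{next_free}'; next_free += 1
        local[dest] = new
        renamed.append((new, rsrcs))
    map_table.update(local)              # commit all renames at end of cycle
    return renamed, next_free

table = {'R1': 'p3'}
group = [('R2', ['R1']),                 # R2 := f(R1)
         ('R1', ['R2'])]                 # R1 := g(R2) -- must see the NEW R2
out, nxt = rename_group(group, table, next_free=4)
print(out)                               # prints [('p4', ['p3']), ('p5', ['p4'])]
```

Without the `local` forwarding step, the second instruction would read the stale mapping of R2 and silently pick up the wrong producer, which is exactly the dependency error parallel decoding has to avoid.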
85
Another method to go superscalar Very Long Instruction Word (VLIW) machines pack several 'normal' instructions in a single 'superinstruction' and execute this superinstruction using separate functional units – with all scheduling done by the compiler! Programming VLIW machines in assembly language is virtually impossible.
86
VLIW, but not exactly The Intel 80860 processor uses another trick which resembles VLIW operation – it always fetches two instructions at a time; if the first one is a floating point operation, it checks a flag in this instruction; if this flag is set, it assumes the second one is not a floating point operation and executes both in parallel. The Intel Pentium 'pairs' instructions without flags.