COMPUTER ARCHITECTURE Assoc. Prof. Stasys Maciulevičius, Computer Dept.
Instruction execution A computer executes a sequence of instructions I1, I2, I3, ..., In. Every instruction Ii consists of several steps or phases, which can be described as follows: F – instruction fetch, D – instruction decoding, O – operand fetch, X – operation execution, W – result storing. Of course, the partitioning can be different (it depends on the processor).
Sequential execution In the case of sequential execution, the (i+1)-th instruction starts only after execution of the i-th instruction has finished: [timing diagram: the F D O X W phases of one instruction complete before the F D O X W of the next begins]. The phases have different durations.
Pipeline Pipelined execution of instructions requires the pipeline to work with a fixed rhythm. Duration of a stage (phase), i.e. the common clock period: t = max(t_F, t_D, t_O, t_X, t_W).
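A minimal sketch (not from the slides), assuming some illustrative phase latencies in nanoseconds, of what this formula implies: the pipeline clock period is set by the slowest phase, and overlapping n instructions takes k + n - 1 cycles instead of n times the sum of the phase times.

    # Hypothetical phase latencies in nanoseconds (illustrative values only).
    phase_latency = {"F": 2.0, "D": 1.0, "O": 1.5, "X": 3.0, "W": 1.0}

    # The pipeline clock period is dictated by the slowest stage:
    # t = max(t_F, t_D, t_O, t_X, t_W)
    clock_period = max(phase_latency.values())

    n = 1000                                                      # number of instructions
    sequential_time = n * sum(phase_latency.values())             # next instruction waits for the previous one
    pipelined_time = (len(phase_latency) + n - 1) * clock_period  # k + n - 1 cycles of the common clock

    print(f"clock period    : {clock_period} ns")
    print(f"sequential time : {sequential_time} ns")
    print(f"pipelined time  : {pipelined_time} ns")
    print(f"speedup         : {sequential_time / pipelined_time:.2f}x")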
Pipeline Then execution of the (i+1)-th instruction starts one step later than the i-th: [timing diagram: instructions i, i+1, i+2, i+3 each pass through F D O X W, shifted by one stage with respect to the previous instruction].
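A small illustrative generator (an assumption of this text, not part of the slides) that prints the staggered diagram above for a handful of instructions:

    # Print a staggered pipeline timing diagram; stage names follow the slides,
    # the 4-instruction window is just an example.
    STAGES = ["F", "D", "O", "X", "W"]

    def pipeline_diagram(n_instructions: int) -> str:
        rows = []
        for i in range(n_instructions):
            label = "i)  " if i == 0 else f"i+{i})"
            # Each instruction starts one clock cycle (one stage column) later.
            rows.append(f"{label} " + "   " * i + "  ".join(STAGES))
        return "\n".join(rows)

    print(pipeline_diagram(4))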
Pipeline implementation Pipelined execution of instructions requires correct transfer of information between the stages: [figure: stage circuits separated by latches, with common data and clock lines]. Of course, the latches between the stages may be excluded; the pipeline design complexity will then be higher, but the pipeline can be accelerated.
Example of a pipeline A 4-stage pipeline can look as in this picture: [figure: the PC supplies the instruction address, the instruction ADD R3, R1, R2 is fetched and decoded, the values of R1 and R2 are read from the register file (OF), the ALU performs the addition (X), and the result is written back to R3 (W); the stages are driven by a common clock].
PowerPC pipelines [figure: an eight-entry instruction queue IQ-7 ... IQ-0 is filled from the cache, with IQ-0 acting as the IU decode stage; the integer unit continues with an IU buffer, IU execute and Write; the floating-point unit has an FPU buffer, FPU decoding, FPU execute 1, FPU execute 2 and Write; the branch processing unit has a combined BPU decode/execute stage and Write; there is also a Load/Store path].
PowerPC pipeline – IU [figure: timing of a sequence add, mul, cmp, add as it passes through IQ-1, IQ-0 (decoding), the IU buffer, IU execution and Writing; the shading distinguishes decoding, execution, writing, waiting in the IQ and waiting in the IU buffer].
PowerPC pipeline – IU Decoding an IU instruction takes 1 cycle. After decoding, the operation is executed in the integer pipeline. The mul instruction needs 5 cycles to execute, so cmp cannot be executed in the 5th cycle; it falls into the IU buffer and stays there until the functional unit becomes free. As a result, the following add instruction is held in the decoding stage. A rough sketch of this behaviour is given below.
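A rough illustrative model (an assumption of this text, not the real PowerPC microarchitecture) of why the mul latency delays the following instructions; the single-entry IU-buffer backpressure that also holds add in decode is deliberately not modelled:

    # Assumed execution latencies (cycles) of the integer unit.
    LATENCY = {"add": 1, "mul": 5, "cmp": 1}

    def simulate(program):
        busy_until = 0                      # cycle in which the integer unit becomes free
        for n, instr in enumerate(program):
            decode_cycle = n                # one instruction is decoded per cycle
            start = max(decode_cycle + 1, busy_until)
            wait = start - (decode_cycle + 1)
            busy_until = start + LATENCY[instr]
            note = f"waits {wait} cycles in the IU buffer" if wait else "no stall"
            print(f"{instr:>3}: decoded in cycle {decode_cycle}, executes from cycle {start} ({note})")

    simulate(["add", "mul", "cmp", "add"])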
Pipeline hazards In reality pipeline operation is not as perfect as depicted above. There are typically three types of hazards: a structural hazard occurs when a part of the processor's hardware is needed by two or more instructions at the same time; a data hazard refers to a situation where an instruction needs as an operand the result of a previous instruction; a control hazard occurs when the processor executes a branch or jump operation and the pipeline must be refilled from the target address.
Data hazards Data hazards occur when the pipeline changes the order of read/write accesses to operands so that it differs from the order seen when the instructions execute sequentially on an unpipelined machine: [timing diagram: ADD R1, R2, R3 followed by SUB R4, R5, R1; AND R6, R1, R7; OR R8, R1, R9; XOR R10, R1, R11, all overlapped F D O X W, where the following instructions all use the R1 written by ADD].
Data hazards Let us have two such instructions: add r1, r2, r3 (r1 := r2 + r3) and sub r4, r1, r5 (r4 := r1 – r5). [timing diagram: add writes r1 in its W stage, while sub already wants to read r1 in its earlier O stage]. A similar situation occurs in such a case: ld r1, a (r1 := MEM[a]) followed by add r4, r1, r5 (r4 := r1 + r5).
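A minimal sketch of detecting such a read-after-write dependence; the tuple encoding of instructions is an assumption made for this illustration:

    # Each instruction: (mnemonic, destination register, list of source registers).
    program = [
        ("add", "r1", ["r2", "r3"]),   # r1 := r2 + r3
        ("sub", "r4", ["r1", "r5"]),   # r4 := r1 - r5, reads r1 written just above
    ]

    def raw_hazards(instrs, distance=2):
        """Report pairs closer than `distance` where a later source matches an earlier destination."""
        found = []
        for i, (_, dest, _) in enumerate(instrs):
            for j in range(i + 1, min(i + distance + 1, len(instrs))):
                if dest in instrs[j][2]:
                    found.append((i, j, dest))
        return found

    for i, j, reg in raw_hazards(program):
        print(f"RAW hazard: instruction {j} reads {reg} before instruction {i} has written it back")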
Data hazards Data hazards can be eliminated using: software tools – inserting NOOPs, changing the order of instructions; hardware tools – stalling the pipeline, adding special data lines (bypassing). [timing diagram: with the gap inserted, the consumer's O stage no longer precedes the producer's W stage]. A sketch of the software fix follows below.
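A sketch of the first software fix from the list above (inserting NOOPs); the instruction encoding and the hazard distance of two instructions are assumptions of this example:

    NOP = ("nop", None, [])

    def insert_nops(instrs, distance=2):
        """Pad with NOOPs so no instruction reads a register written fewer than `distance` instructions earlier."""
        out = []
        for instr in instrs:
            _, _, sources = instr
            # Keep padding while one of the last `distance` emitted instructions produces a needed source.
            while any(prev[1] in sources for prev in out[-distance:] if prev[1]):
                out.append(NOP)
            out.append(instr)
        return out

    program = [("add", "r1", ["r2", "r3"]), ("sub", "r4", ["r1", "r5"])]
    for mnemonic, dest, sources in insert_nops(program):
        print(mnemonic, dest or "", ",".join(sources))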
Data hazards – Bypassing [figure: datapath in which main memory and the register file feed the ALU inputs through a multiplexer; a "bypass for result" path returns the ALU result buffer directly to the ALU input, and a "bypass for data load" path forwards data arriving over the data bus].
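A minimal sketch of the operand-selection (bypass) idea in the figure; the function and variable names are assumptions of this text:

    # If the register being read is the one the previous instruction is just producing,
    # take the value from the result buffer (or the load path) instead of the stale
    # copy in the register file.
    def select_operand(reg, register_file, result_buffer, load_buffer):
        if result_buffer is not None and result_buffer[0] == reg:
            return result_buffer[1]        # bypass for the ALU result of the previous instruction
        if load_buffer is not None and load_buffer[0] == reg:
            return load_buffer[1]          # bypass for data just loaded from memory
        return register_file[reg]          # normal path through the register file

    regs = {"r1": 0, "r2": 7, "r3": 5}
    # Previous instruction add r1, r2, r3: its result (12) is still in the result buffer,
    # not yet written back to r1.
    print(select_operand("r1", regs, ("r1", 12), None))   # prints 12, not the stale 0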
Control hazards Control hazards can cause a greater performance loss for the pipeline than data hazards. When a branch is executed, it may or may not change the PC (program counter) to something other than its current value plus 4. If a branch changes the PC to its target address, it is a taken branch; if it falls through, it is not taken.
Control hazards Branches and jumps: [timing diagram: the branch computes the new PC in its X stage; the instructions i+1 ... i+4 behind it are stalled until then]. After recognizing a branch, the pipeline is stalled until the branch target address is calculated.
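A back-of-the-envelope estimate of this loss; all numbers are assumed for illustration only:

    base_cpi = 1.0          # ideal CPI of the pipeline
    branch_fraction = 0.20  # fraction of executed instructions that are branches
    stall_cycles = 2        # cycles lost per branch while the target address is computed

    effective_cpi = base_cpi + branch_fraction * stall_cycles
    print(f"effective CPI with branch stalls: {effective_cpi:.2f}")          # 1.40
    print(f"slowdown vs. ideal pipeline: {effective_cpi / base_cpi:.2f}x")   # 1.40x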
Control hazards What can be done to reduce the possible time losses? Find out as early as possible whether the branch is taken, and calculate the new value of the PC as early as possible. Measures to reduce the delay time: using branch prediction, changing the instruction order, using multithreading, using buffers to store instructions that may go unused. A sketch of a simple predictor follows below.
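A sketch of one common form of branch prediction, a table of 2-bit saturating counters; the table size and the example outcome stream are assumptions of this text:

    class TwoBitPredictor:
        def __init__(self, entries=1024):
            self.counters = [2] * entries       # states 0..3, start at "weakly taken"
            self.mask = entries - 1

        def predict(self, pc):
            return self.counters[pc & self.mask] >= 2    # True = predict taken

        def update(self, pc, taken):
            i = pc & self.mask
            if taken:
                self.counters[i] = min(3, self.counters[i] + 1)
            else:
                self.counters[i] = max(0, self.counters[i] - 1)

    bp = TwoBitPredictor()
    outcomes = [True, True, False, True, True, True]     # loop-like branch behaviour
    hits = 0
    for taken in outcomes:
        hits += bp.predict(0x400) == taken
        bp.update(0x400, taken)
    print(f"correct predictions: {hits} of {len(outcomes)}")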
Superpipelining Superpipelining simply refers to pipelining that uses a longer pipeline (with more stages) than "regular" pipelining. In theory, a design with more stages, each doing less work, can be scaled to a higher clock frequency. However, this depends a lot on other design characteristics, and it is not true by default that a processor claiming superpipelining is "better".
Superpipeline The pipeline rhythm can also be achieved in a different way: the longer phases F, X and W are split into two sub-stages each (F1/F2, X1/X2, W1/W2), so the duration of a stage (phase) becomes t = max(t_X/2, t_D).
Superpipeline Such a superpipeline looks like this: [timing diagram: six successive instructions flow through the sub-stages F1, F2, D, O, X1, X2, W1, W2, each starting one (short) clock cycle after the previous one].
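A small numeric sketch, reusing the illustrative phase latencies assumed earlier and assuming that F, X and W are each split into two equal halves:

    phase_latency = {"F": 2.0, "D": 1.0, "O": 1.5, "X": 3.0, "W": 1.0}
    regular_clock = max(phase_latency.values())                 # 3.0 ns, limited by X

    # Assumed split of the longer phases into two equal sub-stages.
    super_stages = {"F1": 1.0, "F2": 1.0, "D": 1.0, "O": 1.5,
                    "X1": 1.5, "X2": 1.5, "W1": 0.5, "W2": 0.5}
    super_clock = max(super_stages.values())                    # 1.5 ns

    print(f"regular pipeline clock: {regular_clock} ns ({1000 / regular_clock:.0f} MHz)")
    print(f"superpipeline clock   : {super_clock} ns ({1000 / super_clock:.0f} MHz)")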
Superpipeline in Pentium II IFU – Instruction Fetch Unit, ID – Instruction Decode, RAT – Register Allocator, ROB – Reorder Buffer, DIS – Dispatcher, EX – Execute Stage, RET – Retire Unit. The pipeline stages are: IFU1, IFU2, IFU3, ID1, ID2, RAT, ROB, DIS, EX, RET1, RET2.
Haswell pipeline The Haswell pipeline can be seen on the next two slides: the first part of the pipeline is the Front End; the second part is the Back End, which is usually presented as the Haswell Execution Engine.
[figure: Haswell pipeline – Front End]
[figure: Haswell pipeline – Back End (Execution Engine)]