Introduction
The speed of execution of a program is influenced by many factors. i) One way to improve it is to use faster circuit technology to build the processor and the main memory. ii) Another way is to arrange the hardware so that more than one operation can be performed at the same time. In this way, the number of operations performed per second is increased even though the elapsed time needed to perform any one operation is not changed.
What Is Pipelining?
Pipelining is a key implementation technique whereby multiple instructions are overlapped in execution. It offers an economical way to realize temporal parallelism in digital computers and has led to a large improvement in system throughput.
Pipelining (contd.)
In this technique a sequential process is decomposed into suboperations, each executed in a dedicated stage called a segment. Each segment performs partial processing, and the result obtained from one segment is transferred to the next segment.
Implementation of Pipelining
Temporal parallelism is obtained in the execution of instructions. It is an important method of increasing the speed of a processing element (PE), but it is effective only when certain ideal conditions are satisfied.
Ideal conditions for pipelining
- Successive instructions are such that the work done during the execution of an instruction can be effectively used by the next and successive instructions.
- Successive instructions are independent of one another.
- Sufficient resources are available in the processor, so that a resource required by successive instructions in the pipeline is readily available.
- Tasks are broken into a number of independent sub-tasks, and those sub-tasks should take nearly equal time to execute.
- There should be locality in instruction execution, i.e. instructions are executed in sequence, one after another, in the order in which they are written. If the instruction stream involves many branch and jump instructions, pipelining is not effective.
- The technique is efficient for applications that need to repeat the same task many times.
Practical Situation
In practice it is not possible to break every task into parts that take exactly the same time. For example, execution of a floating-point division takes much more time than, say, decoding an instruction. Delays are therefore introduced where required, so that the different parts of an instruction take an equal amount of time. All real programs have branch instructions, which disturb the sequential execution of instructions, but statistical studies show that the probability of encountering a branch is less than 17%.
Successive instructions are not always independent. The result produced by an instruction may be required by the next instruction and must be made available at the right time. There are always resource conflicts in the processor because of the limitation of chip size. It is not cost-effective to duplicate resources indiscriminately; for example, it is not useful to have more than two floating-point arithmetic units.
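Example: Ai * Bi + Ci for i = 1, 2, …, 7
[Figure: three-segment pipeline. seg1: Ai and Bi are loaded into R1 and R2. seg2: the Multiplier output goes to R3 and Ci is loaded into R4. seg3: the Adder output (R3 + R4) goes to R5.]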
clk   R1   R2   R3      R4   R5
1     A1   B1   -       -    -
2     A2   B2   A1*B1   C1   -
3     A3   B3   A2*B2   C2   A1*B1+C1
4     A4   B4   A3*B3   C3   A2*B2+C2
5     A5   B5   A4*B4   C4   A3*B3+C3
6     A6   B6   A5*B5   C5   A4*B4+C4
7     A7   B7   A6*B6   C6   A5*B5+C5
8     -    -    A7*B7   C7   A6*B6+C6
9     -    -    -       -    A7*B7+C7
(R1, R2: seg1; R3, R4: seg2; R5: seg3)
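The register contents in the table above can be reproduced with a short simulation. This is only a sketch: the function name `simulate_pipeline` and the use of `None` for an empty register are choices made here for illustration, and it assumes that all registers load on the same clock pulse.

```python
def simulate_pipeline(a, b, c):
    """Simulate the three-segment pipeline computing a[i] * b[i] + c[i].

    Returns one row per clock pulse: (clock, R1, R2, R3, R4, R5).
    None marks a register that holds no valid data on that clock.
    """
    n = len(a)
    r1 = r2 = r3 = r4 = r5 = None
    rows = []
    for clock in range(1, n + 3):  # n inputs plus 2 clocks to drain
        # Update the last segment first so that every register is loaded
        # from the values its predecessor held before this clock pulse.
        r5 = r3 + r4 if r3 is not None else None          # seg3: adder
        if r1 is not None:                                # seg2: multiplier
            r3, r4 = r1 * r2, c[clock - 2]
        else:
            r3 = r4 = None
        if clock <= n:                                    # seg1: load inputs
            r1, r2 = a[clock - 1], b[clock - 1]
        else:
            r1 = r2 = None
        rows.append((clock, r1, r2, r3, r4, r5))
    return rows


if __name__ == "__main__":
    for row in simulate_pipeline([1, 2, 3, 4, 5, 6, 7], [2] * 7, [10] * 7):
        print(row)
```

With seven input triples the loop runs for nine clock pulses: the first result appears in R5 on clock 3 and one further result appears on every clock after that.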
Space-time diagram
clk    1   2   3   4   5   6   7   8   9
seg1   T1  T2  T3  T4  T5  T6  T7
seg2       T1  T2  T3  T4  T5  T6  T7
seg3           T1  T2  T3  T4  T5  T6  T7
Hardware Organization
The simplest way of viewing the basic pipeline structure is to imagine that each segment consists of an input register followed by a combinational circuit. The register holds the data and the combinational circuit performs the suboperation of that segment. The output of the combinational circuit in a given segment is connected to the input register of the next segment.
[Figure (b) Hardware organization: instruction fetch unit, interstage buffer, execution unit]
Consider a computer that has two separate hardware units, one for fetching instructions and another for executing them. An instruction fetched by the fetch unit is deposited in an intermediate storage buffer, B1. This buffer is needed to enable the execution unit to execute the instruction while the fetch unit is fetching the next instruction.
Let Fi and Ei refer to the fetch and execute steps for instruction Ii. Execution of a program consists of a sequence of fetch and execute steps.
[Figure (a) Sequential execution, along the time axis: F1 E1 (I1), F2 E2 (I2), F3 E3 (I3), F4 E4 (I4)]
[Figure (c) Pipelined execution]
clock cycle   1    2    3    4
I1            F1   E1
I2                 F2   E2
I3                      F3   E3
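The difference between figures (a) and (c) can be put into numbers with a small sketch. It assumes that every fetch step and every execute step takes exactly one clock cycle and that nothing ever stalls; the function names are made up for this illustration.

```python
def sequential_cycles(n):
    """Clock cycles for n instructions when fetch and execute never overlap."""
    return 2 * n


def pipelined_schedule(n):
    """Map instruction number -> (fetch cycle, execute cycle) for the
    two-stage fetch/execute pipeline: Fi overlaps with E(i-1)."""
    return {i: (i, i + 1) for i in range(1, n + 1)}


def pipelined_cycles(n):
    """Clock cycles for n instructions in the two-stage pipeline."""
    return n + 1 if n > 0 else 0


if __name__ == "__main__":
    print(pipelined_schedule(3))
    print(sequential_cycles(4), pipelined_cycles(4))
```

For four instructions the sequential organization needs eight cycles while the overlapped one needs five; as the number of instructions grows, the ratio approaches two, the number of stages.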
Space-time diagram
The behaviour of a pipeline can be illustrated using a space-time diagram, in which segment utilization is shown as a function of time. The space-time diagram for a four-segment pipeline is shown in the figure.
Pipelining Hazards
DEFINITION: Situations that prevent the parallel execution of subtasks within the pipe are called pipelining hazards.
CLASSIFICATION: There are three classes of pipeline hazards:
- Structural hazards
- Data hazards
- Control hazards
Structural Hazards
Basic reason: resource conflict.
Definition: When a machine is pipelined, the overlapped execution of instructions requires several resources concurrently. If the hardware cannot supply them all at the same time, the machine is said to have a structural hazard.
Contd.
If some resource has not been duplicated enough to allow all combinations of instructions in the pipeline to execute concurrently, that resource may cause a structural hazard.
For example, suppose a system has a single memory shared by data and instructions. When an instruction contains a data-memory reference, that access conflicts with the instruction fetch of a later instruction, and a hazard results.
Solution: The common solution is to stall the pipeline for one clock cycle when the data-memory access occurs.
Stall: A stall is commonly called a pipeline bubble, or just a bubble; it floats through the pipeline but carries no useful work.
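The one-cycle stall can be illustrated with a rough sketch. It assumes a five-stage pipeline in which an instruction fetched in cycle f reaches its memory-access stage in cycle f + 3, and a single memory that can serve only one access per cycle. The stage offset and the name `fetch_cycles` are assumptions made here, not details from the slides.

```python
MEM_OFFSET = 3  # assumed: fetch = +0, decode = +1, execute = +2, memory = +3


def fetch_cycles(uses_data_memory):
    """Return the clock cycle in which each instruction is fetched.

    uses_data_memory[i] is True when instruction i accesses data memory.
    A fetch is delayed (a bubble is inserted) whenever an earlier
    instruction needs the single memory for data in the same cycle.
    """
    busy = set()   # cycles in which the memory is taken by a data access
    fetched = []
    cycle = 1
    for uses_mem in uses_data_memory:
        while cycle in busy:
            cycle += 1                      # stall: one bubble per conflict
        fetched.append(cycle)
        if uses_mem:
            busy.add(cycle + MEM_OFFSET)
        cycle += 1
    return fetched


if __name__ == "__main__":
    # A load followed by four instructions that do not touch data memory.
    print(fetch_cycles([True, False, False, False, False]))
```

In this model a load fetched in cycle 1 occupies the memory in cycle 4, so the instruction that would have been fetched in cycle 4 is fetched in cycle 5 instead.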
Data Hazards
Definition: A data hazard occurs when an instruction modifies a register or memory location and a succeeding instruction attempts to access that register or memory location before it has actually been updated.
For an instruction I followed by an instruction J, there are three types of data hazards:
i) RAW (Read After Write): J tries to read a source before I writes it, so J incorrectly gets the old value.
ii) WAW (Write After Write): J tries to write an operand before it is written by I, i.e. the writes end up being performed in the wrong order, leaving the value written by I rather than J in the destination.
iii) WAR (Write After Read): J tries to write a destination before it is read by I, so I incorrectly gets the new value. This never happens here because all reads are early (in ID) and all writes are late (in WB).
Example:
    ADD R1, R2, R3
    SUB R4, R1, R5
    AND R6, R1, R7
Here the ADD instruction writes the sum of the contents of registers R2 and R3 into register R1, and the SUB instruction uses that result in R1.
The ADD instruction will not update R1 until the end of stage 5, while the SUB instruction needs the value in stage 2. Hence there is a RAW data hazard.
What is the solution?
Special hardware is required to detect the hazard and to determine the separation needed between the two instructions; stalls can then be inserted accordingly. But this means SUB has to be held at stage 1 until ADD has completed stage 5, so it is not a desirable solution.
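The cost of stalling can be estimated with a small sketch. It assumes, as in the ADD/SUB example, that the producer's result is usable only after the end of its stage 5 and that the consumer needs it at the start of its stage 2; the name `raw_stalls` and its parameters are hypothetical.

```python
def raw_stalls(distance, write_stage=5, read_stage=2):
    """Stall cycles needed when no forwarding is available.

    distance: how many instructions after the producer the consumer is
              issued (1 = the very next instruction).
    The consumer's read stage must start after the producer's write
    stage has finished.
    """
    return max(0, write_stage - read_stage - distance + 1)


if __name__ == "__main__":
    for d in range(1, 5):
        print(d, raw_stalls(d))
```

Under these assumptions SUB, issued right after ADD, waits three cycles, and an instruction issued four or more slots later needs no stall. Designs that write the register file in the first half of a cycle and read it in the second half would need fewer stalls than this model reports.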
Contd.
Data forwarding: The result is moved from where ADD produces it, the EX/MEM register, to where SUB needs it, the ALU input. If the forwarding hardware detects that the previous ALU operation has written the register corresponding to a source for the current ALU operation, control logic selects the forwarded result as the ALU input rather than the value read from the register file.
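The selection made by the forwarding logic can be sketched as follows. This is a simplified model, not a description of any particular processor: the register file and the EX/MEM pipeline register are plain dictionaries, the field names `dest` and `value` are invented for the illustration, and only the EX/MEM forwarding path described above is modelled.

```python
def select_alu_input(src_reg, regfile, ex_mem):
    """Choose the ALU operand for src_reg.

    ex_mem describes the result of the previous ALU operation that is
    still sitting in the EX/MEM pipeline register, or is None.
    """
    if ex_mem is not None and ex_mem["dest"] == src_reg:
        return ex_mem["value"]      # forwarded result
    return regfile[src_reg]         # normal read from the register file


# ADD R1, R2, R3 has finished its ALU stage; R1 in the register file
# still holds the old value 0.
regfile = {"R1": 0, "R2": 4, "R3": 6, "R4": 0, "R5": 3}
ex_mem = {"dest": "R1", "value": regfile["R2"] + regfile["R3"]}

# SUB R4, R1, R5 now enters the ALU stage.
a = select_alu_input("R1", regfile, ex_mem)
b = select_alu_input("R5", regfile, ex_mem)
sub_result = a - b

if __name__ == "__main__":
    print(a, b, sub_result)
```

SUB receives the forwarded sum for R1 and the register-file value for R5, so it computes with the new value of R1 without waiting for ADD to write it back.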
Performance issues of pipelining
Pipelining increases the CPU instruction throughput.
Instruction throughput: the number of instructions completed per unit time.
Pipelining does not reduce the execution time of an individual instruction; because of the overhead in pipeline control, that time usually increases slightly.
Pipelined instruction processing in Pentium
Pentium has five pipelines:
- Integer pipeline (5 stages)
- FP pipeline (8 stages)
- Load pipeline (5 stages)
- Store pipeline (4 stages)
- Branch pipeline (5 stages)
Two stages are common to all pipelines: instruction fetch (F) and first decode (D1). After D1, instructions are decoded and either a single control word is generated (for a simple operation) or a sequence of control words is initiated. These control words are decoded in the D2 stage of the pipeline.
[Figure: overlapped execution of four instructions along the time axis t, t+1, t+2, t+3, …; each instruction passes through the stages labelled IM, RF, ALU, RF, DM]
Six tasks: T1, T2, T3, T4, T5, T6
During the 1st clock: segment 1 is busy with T1.
During the 2nd clock: segment 1 is busy with T2 and segment 2 is busy with T1.
…
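The pattern continues in the same way for the remaining clocks, and the whole space-time diagram can be generated mechanically. The sketch below assumes that every task spends exactly one clock in each segment and that a new task enters on every clock; `space_time` is a name chosen for this example.

```python
def space_time(k, n):
    """Space-time diagram for n tasks in a k-segment pipeline.

    Returns k rows (one per segment); each row has k + n - 1 entries,
    one per clock, holding a task label such as "T3" or "" when idle.
    """
    total_clocks = k + n - 1
    table = []
    for seg in range(1, k + 1):
        row = []
        for clk in range(1, total_clocks + 1):
            task = clk - seg + 1   # task occupying this segment on this clock
            row.append(f"T{task}" if 1 <= task <= n else "")
        table.append(row)
    return table


if __name__ == "__main__":
    for seg, row in enumerate(space_time(4, 6), start=1):
        print(f"seg{seg}", " ".join(f"{cell:>2}" for cell in row))
```

For four segments and six tasks the diagram spans nine clocks: T1 leaves the last segment on clock 4, and one task completes on every clock after that.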
Speedup
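A common way to quantify the gain is sketched below. It assumes a pipeline of k segments with clock period t_p that processes n tasks, and a non-pipelined unit that needs time t_n per task. These symbols are introduced here for illustration and are not taken from the slides.

```latex
\begin{align*}
T_{\text{pipe}} &= (k + n - 1)\, t_p
  && \text{first task takes } k \text{ clocks, each later task one more} \\
T_{\text{seq}}  &= n\, t_n \\
S &= \frac{T_{\text{seq}}}{T_{\text{pipe}}}
   = \frac{n\, t_n}{(k + n - 1)\, t_p} \\
\lim_{n \to \infty} S &= \frac{t_n}{t_p}
  \quad \bigl(= k \ \text{if } t_n = k\, t_p \bigr)
\end{align*}
```

For example, with k = 4 segments, n = 6 tasks and t_n = 4 t_p, the speedup is 24/9, or about 2.67; the upper limit of 4 is approached only for a long stream of tasks.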
What is a Hazard?
Any condition that causes the pipeline to stall is called a hazard. Hazards are situations that prevent the next instruction in the instruction stream from executing during its designated clock cycle.
Classification of Hazards
Hazards may occur for different reasons. There are three types of hazards:
1) Structural hazard
2) Data hazard
3) Control hazard (or instruction hazard)
Structural hazard: arises from resource conflicts, when the hardware cannot support simultaneous overlapped instruction execution.
Data hazard: due to dependency between sequential instructions.
Control hazard: due to branch instructions.
Structural Hazard
This is the situation in which two instructions need to use the same hardware resource at the same time. It arises when:
- some functional unit is not fully pipelined (an un-pipelined unit);
- a resource has not been duplicated enough;
- a single memory is used for both data and instructions.
How to tackle structural hazard?
- Duplication of important resources.
- Using modular memory: separate data memory and instruction memory.
Data hazard
There are three types of data hazards (I1 precedes I2):
- RAW (Read After Write): the destination register of I1 and a source register of I2 are the same.
- WAW (Write After Write): the destination registers of I1 and I2 are the same.
- WAR (Write After Read): a source register of I1 and the destination register of I2 are the same.
Example
RAW:  add r0, r1, r2
      sub r4, r3, r0
WAW:  add r0, r1, r2
      sub r0, r4, r5
WAR:  add r2, r1, r0
      sub r0, r3, r4
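The three register-matching rules can be checked mechanically. The sketch below assumes the three-operand format used in the examples, with the destination register written first; `parse` and `data_hazards` are hypothetical helper names.

```python
def parse(instr):
    """Split 'add r0, r1, r2' into (opcode, destination, [sources])."""
    opcode, operands = instr.split(None, 1)
    regs = [r.strip() for r in operands.split(",")]
    return opcode, regs[0], regs[1:]


def data_hazards(first, second):
    """List the potential data hazards when `second` follows `first`."""
    _, dest1, srcs1 = parse(first)
    _, dest2, srcs2 = parse(second)
    found = []
    if dest1 in srcs2:
        found.append("RAW")   # second reads what first writes
    if dest1 == dest2:
        found.append("WAW")   # both write the same register
    if dest2 in srcs1:
        found.append("WAR")   # second writes what first reads
    return found


if __name__ == "__main__":
    print(data_hazards("add r0, r1, r2", "sub r4, r3, r0"))
    print(data_hazards("add r0, r1, r2", "sub r0, r4, r5"))
    print(data_hazards("add r2, r1, r0", "sub r0, r3, r4"))
```

The function reports only potential hazards based on register names. Whether a given pair actually stalls depends on the pipeline, for instance on the stages in which registers are read and written.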
Control hazard
Caused by branch and jump instructions. A branch may be conditional or unconditional.
Handling of control hazard
- Pre-fetch target instruction: both the branch-target instruction and the instruction following the branch are fetched.
- Branch Target Buffer (BTB): an associative memory that contains previously executed branch instructions.
- Loop buffer: a small, very high-speed register file.
- Branch prediction: guess the outcome of the condition.
- Delayed branch: the compiler rearranges the machine-language code so that useful instructions keep the pipeline busy after the branch.
Contd.
Stall until the decision is made:
- Insert "no-op" instructions, which accomplish nothing and just take time.
- Drawback: branches take 3 clock cycles each (assuming the comparator is put in the ALU stage).
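The effect of this drawback on average performance can be estimated with a rough calculation. The sketch assumes that every non-branch instruction completes in one cycle and that every branch costs the stated three cycles; the branch fraction is a free parameter, and `effective_cpi` is a name invented for this example.

```python
def effective_cpi(branch_fraction, branch_cycles=3, other_cycles=1):
    """Average clock cycles per instruction when every branch stalls."""
    return (1 - branch_fraction) * other_cycles + branch_fraction * branch_cycles


if __name__ == "__main__":
    # 17% is used here only as an illustrative upper bound on branch frequency.
    print(round(effective_cpi(0.17), 2))
```

With 17% branches the average comes to about 1.34 cycles per instruction instead of the ideal 1, which is why the prediction and delayed-branch techniques are preferred over plain stalling.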