CS 1104 Help Session IV Five Issues in Pipelining Colin Tan, S15-04-15.

Issue 1: Pipeline Registers
Instruction execution typically involves several stages:
– Fetch: Instructions are read from memory.
– Decode: The instruction is interpreted, and data is read from registers.
– Execute: The instruction is actually executed on the data.
– Memory: Any data memory accesses are performed (e.g. reads or writes of data memory).
– Writeback: Results are written to destination registers.

Issue 1: Pipeline Registers
Stages are built using combinational devices:
– The moment you put something on the input of a combinational device, the outputs change.
– These outputs form inputs to other stages, causing their outputs to change.
– Hence an instruction in the fetch stage affects every other stage in the CPU.
– It is not possible to have multiple instructions at different stages, since each stage would affect the stages further down.
The effect is that the CPU can only execute 1 instruction at a time.

Issue 1: Pipeline Registers
To support pipelining, stages must be decoupled from each other:
– Add pipeline registers!
– Pipeline registers allow each stage to hold a different instruction, as they prevent one stage from affecting another until the appropriate time.
Hence we can now execute 5 different instructions at the 5 stages of the pipeline:
– 1 different instruction at each stage.

Issue 2: Speedup
The figure below shows a non-pipelined CPU executing 2 instructions.
Assuming each stage takes 1 cycle, each instruction will take 5 cycles (i.e. CPI = 5).

Issue 2: Speedup
For the pipelined case:
– The first instruction takes 5 cycles.
– Subsequent instructions take 1 cycle each.
The other 4 cycles of each subsequent instruction are amortized by overlap with the previous instructions (see diagram).

Issue 2: Speedup
For N+1 instructions, the 1st instruction takes 5 cycles, and the subsequent N instructions take 1 cycle each.
The total number of cycles is thus 5+N. Hence the average number of cycles per instruction is:
CPI = (5+N)/(N+1)
As N tends to infinity, CPI tends to 1. Compared with the non-pipelined case, a 5-stage pipeline gives a 5:1 speedup!
Ideally, an M-stage pipeline will give an M-times speedup.

Issue 3: Hazards
A hazard is a situation where computation is prevented from proceeding correctly:
– Hazards can cause performance problems.
– Even worse, hazards can cause computation to be incorrect.
– Hence hazards must be resolved.

Issue 3A: Structural Hazards
Generally a shared resource (e.g. memory, a disk drive) can be used by only 1 processor or pipeline stage at a time.
If more than 1 processor or pipeline stage needs to use the resource, we have a structural hazard.
E.g. if 2 processors want to use a bus, we have a structural hazard that must be resolved by arbitration (see the I/O notes).
Structural hazards are reduced in MIPS by having separate instruction and data memories:
– If they were in the same memory, the IF stage might try to fetch an instruction at the same time as the MEM stage accesses data => a structural hazard results.

Issue 3B: Data Hazards
Data hazards are caused by having more than 1 instruction executing concurrently. Consider the following instructions:
add $1, $2, $3
add $4, $1, $1
The first add instruction will update the contents of $1 in the WB stage. But the second instruction will read $1 two cycles earlier, in the ID stage!
It will obviously read the old value of $1, and the second add instruction will give the wrong result.

Issue 3B: Data Hazards
The result that will be written to $1 in the WB stage first becomes available at the ALU in the EX stage.
The result of the first add becomes available from the ALU just as the second instruction needs it.
If we can just send this result over, the hazard is resolved! This is called "forwarding".

Issue 3B: Data Hazards
Sometimes forwarding doesn't quite work. Consider:
lw $1, 4($3)
add $4, $1, $1
For the lw instruction, the EX stage actually computes 4 + $3 (i.e. the 4($3) portion of the instruction). This forms the load address.
It is no use forwarding this address to the add instruction!

Issue 3B: Data Hazards
The result of the ALU stage (i.e. the load address of the lw instruction) is sent to the memory system in the MEM stage, and the contents of that address become available only at the end of the MEM stage.
But the add instruction needs the data at the start of its EX stage. We have this situation:

Issue 3B: Data Hazards
The forwarding would have to be done backwards in time: at the point when the add instruction needs the data, the data is not yet available.
Since it is not yet available, we cannot forward!

Issue 3B: Data Hazards
This form of hazard is called a "load-use" hazard, and it is the only type of data hazard that cannot be resolved by forwarding alone.
The way to resolve it is to stall the add instruction by 1 cycle.

Issue 3B: Data Hazards
Stalling is a bad idea, as the processor spends 1 cycle doing nothing.
Alternatively, we can find an independent instruction to place between the lw and the add.

Issue 3B Data Hazards Forwarding can be done either from the ALU stage or from the MEM stage.

Issue 3B: Data Hazards
Hazards between adjacent instructions are resolved by forwarding from the ALU stage (e.g. between add $1,$2,$3 and add $4,$1,$5).
Hazards between instructions separated by another instruction (e.g. between add $1,$2,$3 and add $6,$1,$7) are resolved by forwarding from the MEM stage.
– This is because if we tried to resolve this from the ALU stage, the add $6,$1,$7 instruction would actually get the result of the previous (add $4,$1,$5) instruction instead (since that is the instruction now in the EX stage).

Issue 3C: Control Hazards
In an unoptimized pipeline, branch decisions are made after the EX stage:
– The EX stage is where the comparisons are made.
– Hence the branching decision becomes available only at the end of the EX stage.
The pipeline can be optimized by moving the comparisons to the ID stage:
– Comparisons are always made between register contents (e.g. beq $1, $3, R).
– The register contents are available by the end of the ID stage.

Issue 3C: Control Hazards
However, we still have a problem. E.g.:
L1: add $3, $1, $1
beq $3, $4, L1
sub $4, $3, $3
Depending on whether the beq is taken or not, the next instruction to be fetched is either add (if the branch is taken) or sub (if the branch is not taken).

Issue 3C: Control Hazards
We don't know which instruction to fetch until the end of the ID stage. But the IF stage must still fetch something!
– Fetch add, or fetch sub?

Issue 3C: Control Hazards - Solutions
Always assume that the branch is taken:
– The fetch stage will fetch the add instruction.
– By the time this fetch is complete, the outcome of the branch is known.
– If the branch is taken, the add instruction proceeds to completion.
– If the branch is not taken, the add instruction is flushed from the pipeline, and the sub instruction is fetched and executed. This causes a 1-cycle stall.

Issue 3C: Control Hazards - Solutions
Always assume that the branch is not taken:
– The IF stage will fetch the sub instruction first.
– By the time this fetch is complete, the outcome of the branch will be known.
– If the branch is not taken, the sub instruction executes to completion.
– If the branch is taken, the sub instruction is flushed, and the add instruction is fetched and executed. This results in a 1-cycle stall.

Issue 3C: Control Hazards - Solutions
Delayed branching:
L1: add $3, $1, $1
beq $3, $4, L1
sub $4, $3, $3
ori $5, $2, $3
Just like the assume-not-taken strategy, the IF stage fetches the sub instruction. However, the sub instruction executes to completion regardless of the outcome of the branch.

Issue 3C: Control Hazards - Solutions
– By then, the outcome of the branch will be known.
– If the branch is taken, the add instruction is fetched and executed.
– Otherwise the ori instruction is fetched and executed.
This strategy is called "delayed branching" because the effect of the branch is not felt until after the sub instruction (i.e. 1 instruction later).
The sub instruction here is called the delay-slot (or delay) instruction, and it will always be executed regardless of the outcome of the branch.

Issue 4: Instruction Scheduling
We must prevent pipeline stalls in order to get maximum pipeline performance.
For example, for the load-use hazard, we must find an instruction to place between the lw instruction and the instruction that uses the lw result, to prevent stalling.
We may also need to place instructions into delay slots.
This rearrangement of instructions is called instruction scheduling.

Issue 4: Instruction Scheduling
Basic criteria:
– An instruction I3 to be placed between two instructions I1 and I2 must be independent of both I1 and I2.
lw $1, 0($3)
add $2, $1, $4
sub $4, $3, $2
In this example, the sub instruction modifies $4, which is used by the add instruction. Hence it cannot be moved between the lw and the add.

Issue 4: Instruction Scheduling
– An instruction I3 that is moved must not violate dependency orderings. For example:
add $1, $2, $3
sub $5, $1, $7
lw $4, 0($6)
ori $9, $4, $4
The add instruction cannot be moved between the lw and ori instructions, as that would violate its dependency ordering with the sub instruction.
– I.e. the sub depends on the add, and moving the add after the sub would cause the sub to compute an incorrect result.

Issue 4 Instruction Scheduling The nop instruction stands for “no operation”. When the CPU reads and executes the nop instruction, absolutely nothing happens, except that 1 cycle is wasted executing this instruction. The nop instruction can be used in delay slots or simply to waste time.

Issue 4: Instruction Scheduling
Delayed branch example: Suppose we have the following program, and suppose that branches are not delayed:
add $3, $4, $5
add $2, $3, $7
beq $2, $3, L1
sub $7, $2, $4
L1:
In this program, the 2 add instructions will be executed regardless of the outcome of the branch, but the sub instruction will not be executed if the branch is taken.

Issue 4: Instruction Scheduling
Suppose a hardware designer modifies the beq instruction to become a delayed branch:
– The sub instruction is now in the delay slot, and will be executed regardless of the outcome of the branch!
– This is obviously not what the programmer originally intended.
– To correct this, we must place an instruction that will be executed regardless of the outcome of the branch into the delay slot.
– Either of the 2 add instructions qualifies, since they will be executed no matter how the branch turns out.
– BUT moving either of them into the delay slot would cause incorrect computation: it would violate the dependency orderings between the first and second add, and between the second add and the beq.

Issue 4: Instruction Scheduling
But if we don't move anything into the delay slot, the program will not execute correctly.
Solution: place a nop instruction there.
add $3, $4, $5
add $2, $3, $7
beq $2, $3, L1
nop            # delay slot here
sub $7, $2, $4
L1:
The sub instruction now moves out of the delay slot.

Issue 4: Instruction Scheduling
Loop Unrolling
– Idea: if we loop 16 times to perform an operation, we can instead duplicate that operation 4 times and loop only 4 times. E.g.:
for (int i = 0; i < 16; i++)
    my[i] = my[i] + 3;

Issue 4: Instruction Scheduling
This loop can be unrolled to:
for (int i = 0; i < 16; i = i + 4) {
    my[i]   = my[i]   + 3;
    my[i+1] = my[i+1] + 3;
    my[i+2] = my[i+2] + 3;
    my[i+3] = my[i+3] + 3;
}

Issue 4: Instruction Scheduling
But why even bother doing this?
– Loop unrolling actually generates more instructions! Previously we had only 1 instruction doing my[i] = my[i]+3; now we have 4 such instructions!
– Increasing the number of instructions gives us more flexibility in scheduling the code. This allows us to eliminate pipeline stalls etc. more effectively.

Issue 5: Improving Performance
Super-pipelines:
– Each pipeline stage is further broken down.
– Effectively, each pipeline stage is in turn pipelined.
– E.g. suppose each stage can be further broken down into 2 sub-stages:

Issue 5: Improving Performance
This allows us to accommodate more instructions in the pipeline:
– Now we can have 10 instructions operating simultaneously.
– So now we can get a 10x speedup over the non-pipelined architecture instead of just a 5x speedup!

Issue 5: Improving Performance
Unfortunately, when things go wrong, the penalties are higher:
– E.g. a branch misprediction resulting in an IF-stage flush will now cause 2 bubbles (in the IF1 and IF2 stages) instead of just 1.
– In a load-use stall, 2 bubbles must be inserted.

Issue 5: Improving Performance
Superscalar architectures:
– Have 2 or more pipelines working at the same time! In a single pipeline, the best CPI normally possible is 1. With 2 pipelines, the average CPI can go down to 1/2!
– This allows us to execute twice as many instructions per second.
– The real situation is not that ideal: instructions issued to the 2 pipelines simultaneously must be independent of each other.
– There is NO forwarding between the pipelines!

Issue 5: Improving Performance
If the CPU is unable to find independent instructions, then 1 pipeline will remain idle.
Example of a superscalar machine:
– The Pentium processor: 2 integer pipelines and 1 floating-point pipeline, giving a total of 3 pipelines!

Summary
Issue 1: Pipeline registers
– These decouple the stages so that a different instruction can exist in each stage.
– This allows us to execute multiple instructions in a single pipeline.
Issue 2: Speedup
– Ideally, an N-stage pipeline should give you an N-times speedup.

Summary
Issue 3: Hazards
– Structural hazards: solved by having multiple resources, e.g. separate memories for instructions and data.
– Data hazards: solved by forwarding or stalling.
– Control hazards: solved by branch prediction or delayed branching.

Summary
Issue 4: Instruction Scheduling
– Instructions may need to be rearranged to avoid pipeline stalls (e.g. load-use hazards) or to ensure correct execution (e.g. filling in delay slots).
– Loop unrolling gives extra instructions, which in turn give better scheduling opportunities.

Summary
Issue 5: Improving Performance
– Super-pipelines: increase pipeline depth. A 5-stage pipeline becomes a 10-stage pipeline, improving the ideal speedup to 10x instead of 5x, but also causing larger penalties.
– Superscalar pipelines: have multiple pipelines, which can increase the instruction execution rate. The average CPI can actually fall below 1!