Speed up on cycle time Stalls – Optimizing compilers for pipelining


Recall processor

Steps to an instruction

Observations While the Read Registers stage is executing, the fetch and decode steps are already done, so that hardware is not being used, and the execute and write back hardware is idle because it cannot begin yet. So there is a lot of hardware sitting idle. If we knew where to fetch the next instruction from, we could begin fetching it as soon as the current instruction's fetch step has completed. We would have to assume that the next instruction is the one that comes right after the current instruction; this will be wrong when the program branches (or when the program hits a page boundary), but it will be correct with very high probability.

Adding latches between stages

Goals of Pipelining Our goal is to get maximum use out of each stage of the hardware (i.e., utilize the idle hardware). Since each stage produces results that are used as input to the next stage, we can put a set of latches (D flip-flops) between each pair of stages to record each result and feed it into the next stage.

Clock cycles by instruction stages Start a new instruction immediately after each stage completes, instead of after the entire instruction.

Speed up Cycle time of pipelined vs. non-pipelined:

Cycle time (pipelined) = Cycle time (unpipelined) / Number of pipeline stages + Pipeline latch latency

Example: an unpipelined processor has a cycle time of 25 ns. What is the cycle time of a pipelined version of the processor with 5 evenly divided pipeline stages, if each pipeline latch has a latency of 1 ns?

Cycle time (pipelined) = 25 ns / 5 + 1 ns = 6 ns

What if there were 50 stages?

Cycle time (pipelined) = 25 ns / 50 + 1 ns = 1.5 ns
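The formula above can be sketched as a small helper (function name hypothetical, values from the slide):

```python
def pipelined_cycle_time(unpipelined_ns, stages, latch_ns):
    """Cycle time of a pipeline with evenly divided stages.

    unpipelined_ns: cycle time of the unpipelined processor (ns)
    stages: number of pipeline stages
    latch_ns: latency added by each pipeline latch (ns)
    """
    return unpipelined_ns / stages + latch_ns

print(pipelined_cycle_time(25, 5, 1))   # 6.0 ns
print(pipelined_cycle_time(25, 50, 1))  # 1.5 ns
```

Note how the latch latency puts a floor on the cycle time: no matter how many stages we add, we can never get below 1 ns here.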

Non-pipelined instructions When consecutive instructions read and write the same register, there is no issue without pipelining. Suppose the register values were R1) 1, R2) 2, R3) 3, R4) 44, R5) 55, and the following instructions were run:

ADD R5, R2, R3 // add R2 and R3 and store the result in R5
SUB R4, R5, R1 // subtract R1 from R5 and store the result in R4

The registers end up as R1) 1, R2) 2, R3) 3, R4) 4, R5) 5.

Pipelined instructions When consecutive instructions read and write the same register, we have a hazard. Suppose the register values were R1) 1, R2) 2, R3) 3, R4) 44, R5) 55, and the following instructions were run:

ADD R5, R2, R3 // add R2 and R3 and store the result in R5
SUB R4, R5, R1 // subtract R1 from R5 and store the result in R4

During the EX step (step 4) of the ADD instruction, the values of R5 and R1 are being latched for the RR step (step 3) of the SUB instruction. However, R5 has not been written yet, so SUB reads the stale value 55. The registers end up as R1) 1, R2) 2, R3) 3, R4) 54, R5) 5 instead of the intended R4) 4.
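A toy simulation (register values as on the slide) shows the stale read: with pipelining, SUB latches its operands while ADD is still in EX, before ADD has written R5 back.

```python
# Registers before execution, as on the slide.
regs = {"R1": 1, "R2": 2, "R3": 3, "R4": 44, "R5": 55}

# Pipelined timing: SUB's RR stage overlaps ADD's EX stage, so SUB
# latches the OLD value of R5.
sub_src1 = regs["R5"]  # stale read: 55, not 5
sub_src2 = regs["R1"]

regs["R5"] = regs["R2"] + regs["R3"]  # ADD R5, R2, R3 writes back
regs["R4"] = sub_src1 - sub_src2      # SUB R4, R5, R1 uses the stale R5

print(regs["R4"])  # 54, not the intended 4
print(regs["R5"])  # 5
```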

Pipelining with Stalls Insert stalls where needed. The hazard is detected after the Decode step of the SUB instruction.

Pipelining - Result Forwarding The read and write register fields of consecutive instructions can be compared for hazards, so a hazard can be detected after the Decode step of the second instruction. If a Read After Write hazard is detected, the result of the ALU can be passed into the latched register values of the next instruction via a MUX. With this extra hardware, the stalls can be avoided.

RaW (Read after Write) Hazard

ADD R5, R2, R3 // add R2 and R3 and store the result in R5
SUB R4, R1, R5 // subtract R5 from R1 and store the result in R4

The issue is that one of the input values of the second instruction is read before the previous instruction has correctly written it.
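The detection the previous slide describes can be sketched as comparing the destination register of one instruction against the source registers of the next (a simplification; names hypothetical):

```python
def raw_hazard(first, second):
    """True if `second` reads a register that `first` writes.

    Each instruction is a tuple (dest, src1, src2),
    e.g. ("R5", "R2", "R3") for ADD R5, R2, R3.
    """
    return first[0] in (second[1], second[2])

add_ = ("R5", "R2", "R3")  # ADD R5, R2, R3
sub_ = ("R4", "R1", "R5")  # SUB R4, R1, R5
print(raw_hazard(add_, sub_))  # True: SUB reads R5 before ADD writes it
```

In hardware this comparison is just a few equality checks on the register-number bits, which is why it can happen right after Decode.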

Pipeline Optimizing Compiler

A = (B+C) - D; E = 1;

may compile into the following code, with a RaW pipeline hazard needing stalls:

ADD R5, R2, R3 // add R2 and R3 and store the result in R5
SUB R4, R5, R1 // subtract R1 from R5 and store the result in R4
STOR A, R4 // store R4 into memory location A
STOR E, #1 // store 1 into memory location E

The compiler can reorder the independent store of E to separate the dependent pair, so the code runs without needing pipeline stalls:

ADD R5, R2, R3
STOR E, #1
SUB R4, R5, R1
STOR A, R4
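One way to see the compiler's job is to count how many adjacent instruction pairs have a RAW dependence before and after scheduling. A rough sketch under a crude one-cycle model (one stall per adjacent RAW pair; representation hypothetical):

```python
def adjacent_raw(program):
    """Indices i where instruction i+1 reads a register that
    instruction i writes. Each instruction is (dest, *sources);
    dest is None for stores to memory."""
    return [i for i, (a, b) in enumerate(zip(program, program[1:]))
            if a[0] is not None and a[0] in b[1:]]

original = [
    ("R5", "R2", "R3"),  # ADD R5, R2, R3
    ("R4", "R5", "R1"),  # SUB R4, R5, R1  (depends on ADD)
    (None, "R4"),        # STOR A, R4      (depends on SUB)
    (None,),             # STOR E, #1      (independent)
]
scheduled = [original[0], original[3], original[1], original[2]]

print(adjacent_raw(original))   # [0, 1]
print(adjacent_raw(scheduled))  # [2]
```

This toy model still flags the SUB-to-STOR pair in the scheduled version; the slide's point is that the ADD-to-SUB stall disappears, and the forwarding hardware from the earlier slide can cover what remains.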

WaR (Write after Read) Hazard

SUB R5, R2, R3 // subtract R3 from R2 and store the result in R5
ADD R2, R4, R1 // add R1 to R4 and store the result in R2

This should not be an issue provided all instruction stages have the same latency (i.e., the time for each stage's computation to complete electrically is the same). If not, a longer instruction could complete after a later, quicker instruction, and the write could land before the earlier read.

WaW (Write after Write) Hazard

SUB R5, R2, R3 // subtract R3 from R2 and store the result in R5
ADD R5, R4, R1 // add R1 to R4 and store the result in R5

This should not be an issue provided all instruction stages have the same latency (i.e., the time for each stage's computation to complete electrically is the same). If not, a longer instruction could complete after a later, quicker instruction, leaving the wrong value in R5.

Branches (i.e. Jumps)

LOAD R5, A // load R5 with memory location A
JZ else_label // if the last instruction's result was 0, jump
ADD R1, R2, R3
SUB R4, R1, R2
JMP after_label
else_label: ADD R1, R8, R9
after_label: DIV R9, R1, R3

If A == 0, then the three instructions after the JZ (the ADD, SUB, and JMP) should NOT run, but they would already be pipelined and partially executed, and the clock would naturally complete the instructions once started.
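A rough sketch of the wasted work: by the time the branch outcome is known, some number of wrong-path instructions have already entered the pipeline. The resolve-delay model below is an assumption for illustration, not part of the slide:

```python
program = [
    "LOAD R5, A",
    "JZ else_label",
    "ADD R1, R2, R3",  # fetched assuming the branch is not taken
    "SUB R4, R1, R2",  # fetched assuming the branch is not taken
    "JMP after_label",
]

def fetched_before_branch_resolves(branch_index, resolve_delay):
    """Instructions fetched while the branch is still unresolved.
    resolve_delay: cycles between fetching the branch and knowing
    its outcome (hypothetical)."""
    start = branch_index + 1
    return program[start:start + resolve_delay]

# If the branch resolves 2 cycles after fetch, two wrong-path
# instructions are already in the pipeline when A == 0.
print(fetched_before_branch_resolves(1, 2))
```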

Insert stalls in the pipeline Insert stalls where needed.

Time for stages of an instruction
Fetch 8 ns (calc next, copy in instruction)
Decode 4 ns (wires activate hardware)
Read 5 ns (register values copied)
Execute 3, 10, or 30 ns (passthrough, And/Compare, Add)
Write 5 or 20 ns (write to registers / memory)
Non-pipelined, we can set the 5-output clock that turns on the stage lines to 8+4+5+30+20 = 67 ns per instruction. Note: the max value for each stage must be used to cover the worst case.

Time for 5 stages of an instruction
Fetch 8 ns (calc next, copy in instruction)
Decode 4 ns (wires activate hardware)
Read 5 ns (register values copied)
Execute 3, 10, or 30 ns (passthrough, And/Compare, Add)
Write 5 or 20 ns (write to registers / memory)
Pipelined, we must set all 5 clock outputs to the same time, the max of all stage times: 30 ns. With between-stage latches (2 ns), we get a cycle time of (30+2) ns = 32 ns when there are no hazards. Note: every stage must take the max stage time to cover the worst case.
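Putting the two slides together (all values from the slides): a non-pipelined instruction takes the sum of the stage times, while the pipelined cycle is the max stage time plus the latch delay.

```python
stage_times = [8, 4, 5, 30, 20]  # Fetch, Decode, Read, Execute, Write (ns, worst case)
latch_ns = 2

non_pipelined = sum(stage_times)               # one instruction, start to finish
pipelined_cycle = max(stage_times) + latch_ns  # one instruction completes per cycle
                                               # once the pipeline is full

print(non_pipelined)    # 67
print(pipelined_cycle)  # 32
print(round(non_pipelined / pipelined_cycle, 2))  # 2.09x throughput, ignoring hazards
```

The speedup is only about 2x, not 5x, because the stages are badly unbalanced: the 30 ns Execute stage sets the clock for everything.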

Time for 14 stages of an instruction
Fetch1 5 ns (calc next, copy in instruction)
Fetch2 3 ns (copy in instruction)
Decode 4 ns (wires activate hardware)
Read 5 ns (register values copied)
Execute1 5 ns (passthrough, And/Compare, Add)
Execute2 5 ns (And/Compare, Add)
Execute3 5 ns (Add)
Execute4 5 ns (Add)
Execute5 5 ns (Add)
Execute6 5 ns (Add)
Write1 5 ns (write to registers / memory)
Write2 5 ns (write to memory)
Write3 5 ns (write to memory)
Write4 5 ns (write to memory)

Time for 14 stages of an instruction With 2 ns for the pipeline latches between stages, we can get a (5+2) ns = 7 ns cycle time when there are no hazards (and therefore no stalls). With more stages, of course, there is a greater chance of hazards.
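The same calculation for the 14-stage split (stage times from the previous slide) shows why deeper pipelines are attractive when hazards are rare:

```python
stage_times_14 = [5, 3, 4, 5] + [5] * 6 + [5] * 4  # Fetch1..Write4 (ns)
latch_ns = 2

cycle_14 = max(stage_times_14) + latch_ns  # stages are nearly balanced now
print(len(stage_times_14), cycle_14)       # 14 stages, 7 ns per cycle
print(round(67 / cycle_14, 2))             # vs. the non-pipelined 67 ns per instruction
```

Against the 67 ns non-pipelined time this is nearly a 10x throughput gain, but every stall or branch misprediction now wastes more of that deeper pipeline.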