COMP25212 Lecture 5. Pipelining: Reducing Instruction Execution Time


The Fetch-Execute Cycle
Instruction execution is a simple repetitive cycle: fetch an instruction, execute it, repeat.
[Figure: CPU and Memory, with a Fetch Instruction / Execute Instruction loop.]

Cycles of Operation
Most logic circuits are driven by a clock.
In its simplest form, one operation would take one clock cycle.
This assumes that getting an instruction and accessing data memory can each be done in 1/5th of a cycle (i.e. a cache hit).

Fetch-Execute Detail
The two parts of the cycle can be subdivided.
Fetch
– Get instruction from memory
– Decode instruction & select registers
Execute
– Perform operation or calculate address
– Access an operand in data memory
– Write result to a register

Processor Detail
[Figure: datapath with PC, Instruction Cache, Register Bank, MUX, ALU and Data Cache, divided into five stages: IF (Instruction Fetch), ID (Instruction Decode), EX (Execute Instruction), MEM (Access Memory), WB (Write Back).]

Logic to do this
Each stage will do its work and pass its work on to the next.
Each block is only doing useful work for 1/5th of each cycle.
[Figure: Fetch Logic, Decode Logic, Exec Logic, Mem Logic and Write Logic in sequence, with the Inst Cache and Data Cache.]

Can We Overlap Operations?
E.g. while decoding one instruction we could be fetching the next.

Insert Buffers Between Stages
Instead of a direct connection between stages, use extra buffers to hold state.
Clock the buffers once per cycle.
[Figure: Fetch Logic, Decode Logic, Exec Logic, Mem Logic and Write Logic separated by clocked buffers, one of them labelled Instruction Reg., with the Inst Cache and Data Cache.]

This is a Pipeline
Just like a car production line: one stage puts the engine in, the next puts the wheels on, etc.
We still execute one instruction every cycle.
We can now increase the clock speed by 5x.
5x faster!
But it isn’t quite that easy!
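
The 5x figure can be checked with a little arithmetic. The sketch below is an idealised model, not part of the original slides: it assumes equal stage delays, no buffer overhead and no hazards, and the function names are made up for illustration. It shows that the speedup only approaches 5x once many instructions are in flight, because the pipeline has to fill first.

```python
def unpipelined_time(n_instructions, n_stages, stage_delay):
    """Each instruction passes through every stage before the next one starts."""
    return n_instructions * n_stages * stage_delay


def pipelined_time(n_instructions, n_stages, stage_delay):
    """The first instruction needs n_stages cycles; each later one finishes a cycle after."""
    return (n_stages + n_instructions - 1) * stage_delay


def speedup(n_instructions, n_stages=5, stage_delay=1.0):
    """Ratio of unpipelined to pipelined execution time (ideal case, no hazards)."""
    return unpipelined_time(n_instructions, n_stages, stage_delay) / pipelined_time(
        n_instructions, n_stages, stage_delay
    )


if __name__ == "__main__":
    for n in (1, 5, 100, 10_000):
        print(n, round(speedup(n), 3))
```

With a single instruction there is no gain at all; the ratio climbs towards, but never reaches, the number of stages.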

Why 5 Stages?
Simply because the designers of early pipelined processors determined that dividing into these 5 stages of roughly equal complexity was appropriate.
Some recent processors have used more than 30 pipeline stages.
We will consider 5 for simplicity at the moment.

Control Hazards

The Control Transfer Problem
The obvious way to fetch instructions is in serial program order (i.e. just incrementing the PC).
What if we fetch a branch?
We only know it’s a branch when we decode it in the second stage of the pipeline.
By that time we are already fetching the next instruction in serial order.

A Pipeline ‘Bubble’
[Figure: pipeline occupancy cycle by cycle for the sequence Inst 1, Inst 2, Inst 3, Branch n, Inst 5, ..., Inst n. The branch is recognised at the decode stage ("Decode here"), by which point Inst 5 has already been fetched.]
We must mark Inst 5 as unwanted and ignore it as it goes down the pipeline.
But we have wasted a cycle.

Conditional Branches
It gets worse!
Suppose we have a conditional branch.
It is possible that we might not be able to determine the branch outcome until the execute (3rd) stage.
We would then have 2 ‘bubbles’.
We can often avoid this by reading registers during the decode stage.

Longer Pipelines
‘Bubbles’ due to branches are usually called Control Hazards.
They occur when it takes one or more pipeline stages to detect the branch.
The more stages, the less each does, so it is more likely to take multiple stages.
Longer pipelines usually suffer more degradation from control hazards.
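
The cost of these bubbles can be estimated with a simple average. This is a rough sketch under stated assumptions, not a figure from the lecture: every branch is assumed to cost a fixed number of bubbles (no prediction), the 20% branch fraction is a made-up illustrative value, and the clock rate is assumed to scale perfectly with the number of stages.

```python
def effective_cpi(branch_fraction, bubbles_per_branch, base_cpi=1.0):
    """Average cycles per instruction when every branch costs a fixed number of bubbles."""
    return base_cpi + branch_fraction * bubbles_per_branch


def relative_speed(n_stages, branch_fraction, bubbles_per_branch):
    """Throughput relative to an unpipelined design, assuming clock rate scales with n_stages."""
    return n_stages / effective_cpi(branch_fraction, bubbles_per_branch)


if __name__ == "__main__":
    BRANCH_FRACTION = 0.2  # assumed value, for illustration only
    # (stages, bubbles per branch): the 10-stage / 4-bubble case is hypothetical.
    for stages, bubbles in ((5, 1), (5, 2), (10, 4)):
        print(stages, bubbles, round(relative_speed(stages, BRANCH_FRACTION, bubbles), 2))
```

Doubling the number of stages in this model does not double the speed, because the penalty per branch grows with the pipeline depth.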

Branch Prediction
In most programs a branch instruction is executed many times.
Also, the instructions will be at the same (virtual) address in memory.
What if, when a branch was executed,
– We ‘remembered’ its address
– We ‘remembered’ the address that was fetched next

Branch Target Buffer
We could do this with some sort of cache.
As we fetch the branch we check the buffer.
If there is a valid entry in the buffer, we use it to fetch the next instruction.
[Figure: buffer entries with an Address field (branch address) and a Data field (target address).]

Branch Target Buffer
For an unconditional branch we would always get it right.
For a conditional branch it depends on the probability that the branch goes the same way as it did the previous time.
E.g. for a ‘for’ loop which jumps back many times, we will get it right most of the time.
But it is only a prediction; if we get it wrong we correct it next cycle (and suffer a ‘bubble’).
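
The ‘for’ loop case can be sketched in a few lines. This is a toy model, not the hardware design: the buffer is an unbounded Python dictionary (a real BTB is a small, finite cache), and the class name, addresses and instruction size are invented for illustration.

```python
class BranchTargetBuffer:
    """Maps a branch instruction's address to the address that was fetched after it last time."""

    def __init__(self):
        self.entries = {}  # branch address -> last observed next address

    def predict(self, pc, fallthrough):
        # No valid entry: fall back to fetching in serial program order.
        return self.entries.get(pc, fallthrough)

    def update(self, pc, actual_next):
        self.entries[pc] = actual_next


def run_loop(iterations, branch_pc=0x20, loop_start=0x10, inst_size=4):
    """Count mispredictions for a loop-closing branch executed `iterations` times."""
    btb = BranchTargetBuffer()
    fallthrough = branch_pc + inst_size
    mispredictions = 0
    for i in range(iterations):
        taken = i < iterations - 1  # jump back on every pass except the last
        actual_next = loop_start if taken else fallthrough
        if btb.predict(branch_pc, fallthrough) != actual_next:
            mispredictions += 1  # one bubble while the fetch is corrected
        btb.update(branch_pc, actual_next)
    return mispredictions


if __name__ == "__main__":
    print(run_loop(100))
```

In this model the loop branch is mispredicted only on its first execution (no entry yet) and on loop exit, however many iterations the loop runs.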

Outline Implementation
[Figure: Fetch Stage containing the PC, Inst Cache, Branch Target Buffer (with a valid signal) and an incrementer (inc).]

Other Branch Prediction
A BTB is simple to understand but expensive to implement.
Also, as described, it just uses the last branch to predict.
In practice branch prediction depends on
– More history (several previous branches)
– Context (how did we get to this branch)
Real branch predictors are more complex and vital to performance, especially with long pipelines.

Data Hazards

Data Hazards
The pipeline can cause other problems.
Consider
    ADD R1,R2,R3
    MUL R0,R1,R1
The ADD instruction is producing a value in R1.
The following MUL instruction uses R1 as input.

Instructions in the Pipeline
[Figure: the five-stage datapath (IF, ID, EX, MEM, WB) with ADD R1,R2,R3 and MUL R0,R1,R1 in adjacent stages.]

The Data isn’t Ready
At the end of the ID cycle, the MUL instruction should have selected the value in R1 to put into the buffer at the input to the EX stage.
But the correct value for R1 from the ADD instruction is being put into the buffer at the output of the EX stage at this time.
It won’t get to the input of the Register Bank until one cycle later, and it will then probably take another cycle to write into R1.

Insert Delays?
One solution is to detect such data dependencies in hardware and hold the instruction in the decode stage until the data is ready: ‘bubbles’ and wasted cycles again.
Another is to use the compiler to try to reorder instructions.
This only works if we can find something useful to do; otherwise the compiler must insert NOPs, which is wasteful.
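
The NOP-insertion fallback can be sketched as a small pass over the instruction list. This is only an illustration: the `Inst` tuple and `insert_nops` function are hypothetical names, and `min_distance=3` is an assumed figure for a pipeline without forwarding (the real value depends on exactly when the register bank is written and read).

```python
from typing import NamedTuple, Optional, Tuple


class Inst(NamedTuple):
    op: str
    dest: Optional[str]
    srcs: Tuple[str, ...]


NOP = Inst("NOP", None, ())


def insert_nops(program, min_distance):
    """Pad with NOPs so a consumer issues at least min_distance slots after its producer."""
    out = []
    for inst in program:
        while True:
            # The most recent min_distance - 1 slots are too close to read from.
            window = out[-(min_distance - 1):] if min_distance > 1 else []
            if any(p.dest is not None and p.dest in inst.srcs for p in window):
                out.append(NOP)  # a wasted slot
            else:
                break
        out.append(inst)
    return out


if __name__ == "__main__":
    program = [
        Inst("ADD", "R1", ("R2", "R3")),
        Inst("MUL", "R0", ("R1", "R1")),
    ]
    for inst in insert_nops(program, min_distance=3):  # assumed distance
        print(inst.op, inst.dest, inst.srcs)
```

A real compiler would first try to move independent instructions into those slots and would only emit NOPs when none can be found.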

Forwarding
We can add extra paths for specific cases.
Control becomes more complex.
[Figure: the datapath (PC, Instruction Cache, Register Bank, MUX, ALU, Data Cache) with ADD R1,R2,R3 and MUL R0,R1,R1 in adjacent stages.]

Why did it Occur?
It is due to the design of our pipeline.
In this case, the result we want is ready one stage ahead of where it is needed, so why pass it down the pipeline?
But what if we have the sequence
    LDR R1,[R2,R3]
    MUL R0,R1,R1
The LDR instruction means load R1 from memory address R2+R3.

Pipeline Sequence for LDR
– Fetch
– Decode and read registers (R2 & R3)
– Execute: add R2+R3 to form address
– Memory access: read from address
– Now we can write the value into register R1
We have designed the ‘worst case’ pipeline to work for all instructions.

Forwarding
We can add extra paths for specific cases.
Control becomes more complex.
[Figure: the datapath (PC, Instruction Cache, Register Bank, MUX, ALU, Data Cache) with LDR R1,[R2,R3], a NOP and MUL R0,R1,R1 in successive stages.]
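
The NOP between the LDR and the MUL can be explained by counting cycles. In the sketch below, which is a simplified model and not taken from the slides, an ALU result is assumed to be forwardable one cycle after the producer's EX stage and a loaded value two cycles after (it only exists after MEM). The table and function name are illustrative, and MUL is assumed to complete in a single EX cycle.

```python
# Cycles after the producer's EX stage before its result can be forwarded
# to a consumer's EX stage (assumed values for the 5-stage pipeline).
RESULT_LATENCY = {"ADD": 1, "MUL": 1, "LDR": 2}


def forwarding_stalls(producer_op, distance=1):
    """Bubbles needed when the consumer follows the producer by `distance` slots."""
    return max(0, RESULT_LATENCY[producer_op] - distance)


if __name__ == "__main__":
    print("ADD then MUL:", forwarding_stalls("ADD"))
    print("LDR then MUL:", forwarding_stalls("LDR"))
    print("LDR, one other instruction, then MUL:", forwarding_stalls("LDR", distance=2))
```

If the compiler can place one independent instruction after the load, the bubble disappears in this model.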

Longer Pipelines
As mentioned previously, we can go to longer pipelines:
– Do less per pipeline stage
– Each step takes less time
– So can increase clock frequency
– But greater penalty for hazards
– More complex control
Negative returns?

Where Next?
Despite these difficulties, it is possible to build processors which approach 1 cycle per instruction (CPI).
Given that the computational model is one of serial instruction execution, can we do any better than this?

Instruction Level Parallelism

Instruction Level Parallelism (ILP)
Suppose we have an expression of the form x = (a+b) * (c-d).
Assuming a, b, c & d are in registers, this might turn into
    ADD R0, R2, R3
    SUB R1, R4, R5
    MUL R0, R0, R1
    STR R0, x

ILP (cont)
The MUL has a dependence on the ADD and the SUB, and the STR has a dependence on the MUL.
However, the ADD and SUB are independent.
In theory, we could execute them in parallel, even out of order.
    ADD R0, R2, R3
    SUB R1, R4, R5
    MUL R0, R0, R1
    STR R0, x

The Data Flow Graph
We can see this more clearly if we draw the data flow graph.
[Figure: data flow graph. R2 and R3 feed ADD, R4 and R5 feed SUB, ADD and SUB feed MUL, and MUL produces x.]
As long as R2, R3, R4 & R5 are available, we can execute the ADD & SUB in parallel.
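
The levels of the data flow graph can be computed mechanically. The sketch below is illustrative only: it tracks true (read-after-write) dependences through registers, ignores other hazard types and resource limits, assumes every instruction takes one step, and uses a made-up tuple encoding of the four instructions.

```python
def schedule_levels(program):
    """Earliest step at which each instruction could run, given only data dependences."""
    ready_at = {}  # register -> step that produces its newest value
    levels = []
    for op, dest, srcs in program:
        # An instruction can run one step after its latest-arriving operand.
        level = 1 + max((ready_at.get(r, 0) for r in srcs), default=0)
        levels.append(level)
        if dest is not None:
            ready_at[dest] = level
    return levels


PROGRAM = [
    ("ADD", "R0", ("R2", "R3")),
    ("SUB", "R1", ("R4", "R5")),
    ("MUL", "R0", ("R0", "R1")),
    ("STR", None, ("R0",)),
]

if __name__ == "__main__":
    levels = schedule_levels(PROGRAM)
    print(levels)  # ADD and SUB share step 1
    print("average ILP:", len(PROGRAM) / max(levels))
```

Four instructions spread over three steps give an average of about 1.3 instructions per step for this fragment.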

Amount of ILP?
This is obviously a very simple example.
However, real programs often have quite a few independent instructions which could be executed in parallel.
The exact number is clearly program dependent, but analysis has shown that maybe 4 is not uncommon (in parts of the program anyway).

How to Exploit?
– We need to fetch multiple instructions per cycle (wider instruction fetch)
– Need to decode multiple instructions per cycle
– But must use common registers: they are logically the same registers
– Need multiple ALUs for execution
– But also access common data cache

Dual Issue Pipeline Structure
Two instructions can now execute in parallel.
(Potentially) double the execution rate.
Called a ‘Superscalar’ architecture.
[Figure: datapath with PC, Instruction Cache, Register Bank and Data Cache shared by two instruction slots (I1, I2), each with its own MUX and ALU.]

Register & Cache Access
Note that the access rate to both registers & cache will be doubled.
To cope with this we may need a dual ported register bank & dual ported cache.
This can be done either by duplicating the access circuitry or even by duplicating the whole register & cache structure.

Selecting Instructions
To get the doubled performance out of this structure, we need to have independent instructions.
We can have a ‘dispatch unit’ in the fetch stage which uses hardware to examine the instruction dependencies and only issues two in parallel if they are independent.

Instruction order
If we had
    ADD R1,R1,R0
    MUL R0,R1,R1
    ADD R3,R4,R5
    MUL R4,R3,R3
issued in pairs as above, we wouldn’t be able to issue any in parallel because of dependencies.

Compiler Optimisation
But if the compiler had examined dependencies and produced
    ADD R1,R1,R0
    ADD R3,R4,R5
    MUL R0,R1,R1
    MUL R4,R3,R3
we can now execute pairs in parallel (assuming appropriate forwarding logic).
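
The pairing check a dispatch unit (or compiler) performs can be sketched as follows. This is a simplified illustration, not the actual hardware logic: it only rejects a pair when the second instruction reads or overwrites the first one's destination, and the function names and tuple encoding are made up for the example.

```python
def can_dual_issue(first, second):
    """True if `second` neither reads nor overwrites the result of `first`."""
    _, dest1, _ = first
    _, dest2, srcs2 = second
    reads_result = dest1 in srcs2      # read-after-write within the pair
    overwrites_result = dest1 == dest2  # write-after-write within the pair
    return not (reads_result or overwrites_result)


def count_parallel_pairs(program):
    """Issue instructions two at a time, in order, and count pairs that can run together."""
    pairs = zip(program[0::2], program[1::2])
    return sum(can_dual_issue(a, b) for a, b in pairs)


ORIGINAL = [
    ("ADD", "R1", ("R1", "R0")),
    ("MUL", "R0", ("R1", "R1")),
    ("ADD", "R3", ("R4", "R5")),
    ("MUL", "R4", ("R3", "R3")),
]
REORDERED = [ORIGINAL[0], ORIGINAL[2], ORIGINAL[1], ORIGINAL[3]]

if __name__ == "__main__":
    print("original order:", count_parallel_pairs(ORIGINAL))
    print("reordered:", count_parallel_pairs(REORDERED))
```

The original order yields no pair that can issue together, while the reordered sequence yields two.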

Relying on the Compiler
If the compiler can’t manage to reorder the instructions, we still need hardware to avoid issuing conflicts.
But if we could rely on the compiler, we could get rid of the expensive checking logic.
This is the principle of VLIW (Very Long Instruction Word).
The compiler must add NOPs if necessary.

Out of Order Execution
There are arguments against relying on the compiler:
– Legacy binaries: optimum code is tied to a particular hardware configuration
– ‘Code Bloat’ in VLIW: useless NOPs
Instead, rely on hardware to re-order instructions if necessary.
Complex but effective.

Out of Order Execution Processor
An instruction buffer needs to be added to store all issued instructions.
A scheduler is in charge of sending non-conflicting instructions to execute.
Memory and register accesses need to be delayed until all older instructions are finished, to comply with application semantics.

Out of Order Execution
Instruction dispatching and scheduling; memory and register accesses deferred.
[Figure: PC, Instr. Cache, Instruction Buffer, Dispatch, Schedule, ALU, Register Bank, Register Queue, Memory Queue, Data Cache, Delay.]

Programmer Assisted ILP / Vector Instructions
Linear algebra operations such as vector product and matrix multiplication have LOTS of parallelism.
This can be hard to detect in languages like C.
Instructions can be too far apart for hardware detection.
The programmer can use types such as float4.

Limits of ILP
Modern processors are up to 4 way superscalar (but rarely achieve 4x speed).
Not much beyond this:
– Hardware complexity
– Limited amounts of ILP in real programs
Limited ILP is not surprising: conventional programs are written assuming a serial execution model. What next?