COMP25212 System Architecture: Pipelining. Dr. Javier Navaridas, javier.navaridas@manchester.ac.uk.


Overview and Learning Outcomes
- Deepen the understanding of how modern processors work
- Learn how pipelining can improve processor performance and efficiency
- Be aware of the problems arising from using pipelined processors
- Understand instruction dependencies

The Fetch-Execute Cycle
As explained in COMP15111, instruction execution is a simple repetitive cycle: fetch instruction, then execute instruction. The PC selects the next instruction in memory, e.g.:
LDR R0, x
LDR R1, y
ADD R2, R1, R0
STR R2, z

Fetch-Execute Detail
The two parts of the cycle can be further subdivided:
Fetch
- Get instruction from memory (IF)
- Decode instruction & select registers (ID)
Execute
- Perform operation or calculate address (EX)
- Access an operand in data memory (MEM)
- Write result to a register (WB)
We have designed the 'worst case' datapath: it works for all instructions.

Processor Detail
The five stages: Instruction Fetch (IF), Instruction Decode (ID), Execute Instruction (EX), Access Memory (MEM), Write Back (WB). The datapath connects the PC, instruction cache, register bank, ALU, multiplexers and data cache.
Cycle i, LDR R0, x: select register (PC); compute address x; get value from [x]; write in R0.
Cycle i+1, ADD R2, R1, R0: select registers (R0, R1); add R0 & R1; do nothing in MEM; write in R2.

Cycles of Operation
Most logic circuits are driven by a clock. In its simplest form, one instruction would take one clock cycle (single-cycle processor). This assumes that getting the instruction and accessing data memory can each be done in 1/5th of a cycle (i.e. a cache hit). For this part we will assume a perfect cache replacement strategy.

Logic to do this
The datapath is a chain of logic blocks: Fetch Logic (fed by the instruction cache), Decode Logic, Exec Logic, Mem Logic (fed by the data cache) and Write Logic. Each stage will do its work and pass to the next. Each block is only doing useful work once every 1/5th of a cycle.

Application Execution
In the single-cycle design, LDR passes through IF, ID, EX, MEM and WB within clock cycle 1; ADD cannot start until cycle 2. Can we do it any better? Increase utilization; accelerate execution.

Insert Buffers Between Stages
Instead of a direct connection between stages, use extra buffers (e.g. an instruction register after fetch) to hold state. Clock the buffers once per cycle.

In a Pipelined Processor
Just like a car production line, we can still execute one instruction every cycle, but now the clock frequency is increased by 5x: 5x faster execution!
Clock cycle:  1   2   3   4   5   6
LDR:          IF  ID  EX  MEM WB
ADD:              IF  ID  EX  MEM WB
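The timing above can be checked with a short calculation. A minimal sketch in Python (assuming an ideal pipeline with no hazards; the function names are illustrative, not from the course):

```python
def pipeline_cycles(n_instructions, stages):
    # (stages - 1) cycles to fill the pipeline, then one
    # instruction completes every cycle after that.
    return (stages - 1) + n_instructions

def speedup(n_instructions, stages):
    # A single-cycle design with the same stage delay needs
    # 'stages' fast cycles per instruction.
    return (n_instructions * stages) / pipeline_cycles(n_instructions, stages)

print(pipeline_cycles(2, 5))       # LDR then ADD: 6 cycles, matching the table
print(round(speedup(1000, 5), 2))  # approaches 5x for long programs: 4.98
```

Note that the 5x figure is an asymptote: the pipeline-fill cycles make short programs fall slightly short of it.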

Benefits of Pipelining

Why 5 Stages?
Simply because designers of early pipelined processors found that dividing execution into these 5 stages of roughly equal complexity was appropriate. Some recent processors have used more than 30 pipeline stages. We will consider 5 for simplicity at the moment.

Real-world Pipelines
ARM7TDMI: 3-stage pipeline
ARM9TDMI and ARM9E-S: 5-stage pipeline

Exercise: Imagine we have a non-pipelined processor running at 10MHz and we want to run a program with 1000 instructions.
a) How much time would it take to execute the program?
Assuming ideal conditions (perfect pipelining and no hazards), how much time would it take to execute the same program in:
b) A 10-stage pipeline?
c) A 100-stage pipeline?
Looking at those results, it seems clear that increasing pipeline depth should increase the execution speed of a processor. Why do you think that processor designers (see Intel, below) have not only stopped increasing pipeline length but, in fact, reduced it?
Pentium III – Coppermine (1999): 10-stage pipeline
Pentium IV – NetBurst (2000): 20-stage pipeline
Pentium Prescott (2004): 31-stage pipeline
Core i7 9xx – Bloomfield (2008): 24-stage pipeline
Core i7 5Yxx – Broadwell (2014): 19-stage pipeline
Core i7 77XX – Kaby Lake (2017): ~20-stage pipeline
Why not longer pipelines? A higher frequency means more power. More stages mean more extra hardware, a more complex design (forwarding?), more difficulty splitting the work into uniform-size chunks, and the loading time of the pipeline registers limits the cycle period. These issues prevent this kind of scaling.
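Under the exercise's ideal assumptions, one way to set up the arithmetic is sketched below (it assumes the k-stage pipeline clock runs k times faster than the 10MHz base clock and counts the k-1 cycles needed to fill the pipeline):

```python
BASE_HZ = 10e6   # 10 MHz non-pipelined clock
N = 1000         # instructions in the program

def exec_time(stages):
    # Non-pipelined: one instruction per cycle of the base clock.
    if stages == 1:
        return N / BASE_HZ
    # Ideal k-stage pipeline: clock scales by k, plus (k - 1) cycles
    # to fill the pipeline before the first instruction completes.
    return ((stages - 1) + N) / (BASE_HZ * stages)

for k in (1, 10, 100):
    print(f"{k:3}-stage: {exec_time(k) * 1e6:.2f} us")
```

This prints roughly 100.00 us, 10.09 us and 1.10 us, which shows why deeper pipelines look attractive on paper: the hazard-free assumption is doing all the work.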

Limits to Pipeline Scalability
Higher frequency => higher power
More stages =>
- more extra hardware
- more complex design (control logic, forwarding?)
- more difficult to split into uniform-size chunks
- loading time of the registers limits the cycle period
Hazards (control and data): a longer datapath means a higher probability of hazards occurring and worse penalties when they happen.

Control Hazards

The Control Transfer Problem
Instructions are normally fetched sequentially (i.e. just incrementing the PC). What if we fetch a branch? We only know it is a branch when we decode it in the second stage of the pipeline. By that time we are already fetching the next instruction in serial order: we have a 'bubble' in the pipeline.

A Pipeline 'Bubble'
Inst 1, Inst 2, Inst 3, B n, Inst 5, Inst 6, …, Inst n. We only know the instruction is a branch when it reaches decode, by which point Inst 5 is already fetched. We must mark Inst 5 as unwanted and ignore it as it goes down the pipeline. But we have wasted a cycle.

Conditional Branches
It gets worse! Suppose we have a conditional branch. We are not able to determine the branch outcome until the execute (3rd) stage. We would then have 2 'bubbles'. We can often avoid this by reading registers during the decode stage.

Conditional Branches
Inst 1, Inst 2, Inst 3, BEQ n, Inst 5, Inst 6, …, Inst n. We do not know whether we have to branch until EX; Inst 5 & 6 are already fetched. If the condition is true, we must mark Inst 5 & 6 as unwanted and ignore them as they go down the pipeline: 2 wasted cycles now.

Deeper Pipelines
'Bubbles' due to branches are called Control Hazards. They occur because it takes one or more pipeline stages to detect the branch. The more stages, the less each does, so detection and resolution are more likely to take multiple stages. Longer pipelines therefore suffer more degradation from control hazards. Is there any way around this?
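The penalty grows with how late the branch is resolved. A tiny sketch (assuming one instruction is fetched per cycle and the outcome is known at the end of stage resolve_stage, numbering IF as stage 1):

```python
def branch_bubbles(resolve_stage):
    # Every stage between fetch and resolution holds one instruction
    # fetched after the branch, and each must be squashed.
    return resolve_stage - 1

print(branch_bubbles(2))   # branch detected in ID: 1 bubble
print(branch_bubbles(3))   # conditional branch resolved in EX: 2 bubbles
print(branch_bubbles(10))  # a deep pipeline resolving late pays far more
```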

Branch Prediction
In most programs many branch instructions are executed many times, e.g. loops and function calls. What if, when a branch is executed, we take note of its address and of the target address, and use this info the next time the branch is fetched?

Branch Target Buffer
We could do this with some sort of (small) cache. As we fetch the branch we check the BTB. If there is a valid entry in the BTB, we use its target to fetch the next instruction (rather than the incremented PC). Each BTB entry maps an address (the branch address) to data (the target address).

Branch Target Buffer
For unconditional branches we always get it right. For conditional branches it depends on the probability of repeating the target. E.g. for a 'for' loop which jumps back many times, we will get it right most of the time (only the first and last iterations will mispredict). But it is only a prediction: if we get it wrong, we pay a penalty (bubbles).
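The "only the first and last iterations mispredict" behaviour can be simulated with a dictionary standing in for the BTB. This is a sketch only: a real BTB is a small hardware cache with a limited number of entries, and the addresses below are made up for illustration.

```python
def run_loop_branch(iterations, branch_pc=0x100, target_pc=0x40):
    # A backward conditional branch at the end of a loop body:
    # taken on every iteration except the last.
    btb = {}            # branch address -> predicted target
    mispredictions = 0
    for i in range(iterations):
        taken = (i < iterations - 1)
        actual = target_pc if taken else branch_pc + 4
        # No BTB entry means we predict fall-through (PC + 4).
        predicted = btb.get(branch_pc, branch_pc + 4)
        if predicted != actual:
            mispredictions += 1
        if taken:
            btb[branch_pc] = target_pc   # remember the taken target
    return mispredictions

# First iteration (no entry yet) and last (falls through) both miss:
print(run_loop_branch(10))   # 2
print(run_loop_branch(100))  # still 2: accuracy improves with trip count
```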

Outline Implementation
[Diagram: the fetch stage, where the PC indexes both the instruction cache and the Branch Target Buffer; if the matching BTB entry is valid, its target supplies the next PC, otherwise the PC is simply incremented.]

Other Branch Prediction
The BTB is simple to understand, but expensive to implement, and it just uses the last outcome of a branch to predict. In practice, prediction accuracy depends on more history (several previous branches) and context (how did we get to this branch?). Real-world branch predictors are more complex, and vital to performance for deep pipelines.

Benefits of Branch Prediction
Without prediction, the comparison is not done until the 3rd stage, so 2 instructions have been issued and need to be eliminated from the pipeline, and we have wasted 2 cycles. If we predict that the next instruction will be 'n', then we pay no penalty.

Exercise: Consider a simple program with two nested loops, as the following:
while (true) {
  for (i=0; i<x; i++) {
    do_stuff
  }
}
With the following assumptions:
- do_stuff has 20 instructions that can be executed ideally in the pipeline.
- The overhead for control hazards is 3 cycles, regardless of the branch being static or conditional.
- Each of the two loops can be translated into a single branch instruction.
Calculate the instructions-per-cycle that can be achieved for different values of x (2, 4, 10, 100):
a) Without branch prediction.
b) With a simple branch prediction policy: do the same as the last time.
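One way to set up the counting for this exercise is sketched below. This is only one possible accounting: it assumes steady state (the always-taken outer branch predicts correctly once seen) and that the inner branch mispredicts exactly on the first and last iteration of each pass, as on the earlier BTB slide; a different treatment of the boundary iterations would shift the numbers slightly.

```python
def ipc(x, predict):
    # One steady-state pass of the outer loop: x iterations of the body
    # (20 instructions + 1 inner branch each) plus 1 outer branch.
    instructions = x * (20 + 1) + 1
    if not predict:
        # Every branch (x inner + 1 outer) pays the 3-cycle penalty.
        wasted = 3 * (x + 1)
    else:
        # Predict "same as last time": the inner branch mispredicts on
        # the first and last iteration of each pass; the outer branch
        # is always taken, so it predicts correctly in steady state.
        wasted = 3 * 2
    return instructions / (instructions + wasted)

for x in (2, 4, 10, 100):
    print(x, round(ipc(x, False), 3), round(ipc(x, True), 3))
```

Whatever the exact boundary assumptions, the trend is the point: prediction helps more as x grows, because the fixed misprediction cost is amortised over more useful instructions.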

Guest Lectures Next Week
Wed, 14-March, 9am: Prof. John Goodacre (UoM, ARM, Kaleao), "Scalable processing for cloud computing"
Fri, 16-Mar, 2pm: Dr. Mark Mawson (Hartree HPC Centre, STFC), "High Performance Computing: Use Cases and Architectures"