CSCI-365 Computer Organization Lecture Note: Some slides and/or pictures in the following are adapted from: Computer Organization and Design, Patterson & Hennessy, ©2005

Recap: Single Cycle Datapath
Uses the clock cycle inefficiently – the clock cycle must be timed to accommodate the slowest instruction, which is especially problematic for more complex instructions like floating-point multiply.
(Timing diagram: lw in cycle 1 and sw in cycle 2, with wasted time left over in each cycle.)

Instruction Times (Critical Paths)

    Instr.          I Mem   Reg Rd   ALU Op   D Mem   Reg Wr   Total
    R-type (45%)     200      100      200              100     600 ps
    Load   (25%)     200      100      200      200     100     800 ps
    Store  (10%)     200      100      200      200             700 ps
    Beq    (15%)     200      100      200                      500 ps
    Jump    (5%)     200                                        200 ps

What is the clock cycle time, assuming negligible delays for muxes, control unit, sign extend, PC access, shift left 2, wires, and setup and hold times, except:
– Instruction and Data Memory (200 ps)
– ALU and adders (200 ps)
– Register File access, read or write (100 ps)

Simplified MIPS Pipelined Datapath

Stages
Five stages, one step per stage:
1. IF:  Instruction fetch from memory
2. ID:  Instruction decode and register read
3. EX:  Execute operation or calculate address
4. MEM: Access memory operand
5. WB:  Write result back to register

Single Cycle vs. Multiple Cycle
(Timing diagrams: the multiple-cycle implementation spreads lw over five short cycles (IF, ID, EX, MEM, WB) and sw over four, while the single-cycle implementation stretches every cycle to fit the slowest instruction, wasting time in each cycle.)

Gotta Do Laundry
Michael, Conan, Jimmy, and Pat each have one load of clothes to wash, dry, fold, and put away
– Washer takes 30 minutes
– Dryer takes 30 minutes
– “Folder” takes 30 minutes
– “Stasher” takes 30 minutes to put clothes into drawers

Sequential Laundry
Sequential laundry takes 8 hours for 4 loads.
(Timing chart: loads M, C, J, P done one after another in 30-minute steps, from 6 PM to 2 AM.)

Pipelined Laundry
Pipelined laundry takes 3.5 hours for 4 loads!
(Timing chart: loads M, C, J, P overlapped in 30-minute stages, starting at 6 PM.)

General Definitions
Latency: time to completely execute a certain task
– E.g., time to read a sector from disk is disk access time or disk latency
Throughput: amount of work that can be done over a period of time

Pipelining Lessons
– Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
– Multiple tasks operate simultaneously using different resources
– Potential speedup = number of pipe stages
– Time to “fill” the pipeline and time to “drain” it reduce speedup: 2.3x vs. 4x in this example
(Timing chart: the four overlapped laundry loads from 6 PM onward.)
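
As a sanity check on the numbers above, here is a minimal C sketch (mine, not from the slides) that computes sequential versus pipelined completion time for the 4-load, 4-stage, 30-minutes-per-stage laundry example and the resulting speedup:

    #include <stdio.h>

    int main(void) {
        int loads = 4;        /* four loads of laundry       */
        int stages = 4;       /* wash, dry, fold, stash      */
        int stage_min = 30;   /* each stage takes 30 minutes */

        /* Sequential: every load runs through all stages alone. */
        int sequential = loads * stages * stage_min;        /* 480 min = 8 h   */

        /* Pipelined: fill the pipe once, then one load finishes
           every stage time. */
        int pipelined = (stages + loads - 1) * stage_min;   /* 210 min = 3.5 h */

        printf("sequential = %d min, pipelined = %d min, speedup = %.1fx\n",
               sequential, pipelined, (double)sequential / pipelined);
        /* prints: sequential = 480 min, pipelined = 210 min, speedup = 2.3x */
        return 0;
    }

With many loads the speedup approaches the number of stages (4x); with only 4 loads, the fill and drain time caps it at about 2.3x, exactly the point made above.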

Suppose a new washer takes 20 minutes and a new stasher takes 20 minutes. How much faster is the pipeline?
– Pipeline rate is limited by the slowest pipeline stage
– Unbalanced lengths of pipe stages reduce speedup
A calculation illustrating this is sketched below.
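
A minimal C sketch of that calculation (again mine, not from the slides), treating the pipeline as synchronous so every stage advances at the pace of the slowest stage:

    #include <stdio.h>

    int main(void) {
        int stage_min[] = {20, 30, 30, 20};   /* new washer, dryer, folder, new stasher */
        int stages = 4, loads = 4;

        int per_load = 0, slowest = 0;
        for (int i = 0; i < stages; i++) {
            per_load += stage_min[i];
            if (stage_min[i] > slowest) slowest = stage_min[i];
        }

        int sequential = loads * per_load;                /* 4 * 100 = 400 min */
        /* The pipeline advances at the rate of its slowest stage, so the
           faster washer and stasher buy nothing. */
        int pipelined  = (stages + loads - 1) * slowest;  /* 7 * 30 = 210 min  */

        printf("sequential = %d min, pipelined = %d min, speedup = %.1fx\n",
               sequential, pipelined, (double)sequential / pipelined);
        /* speedup drops to about 1.9x: unbalanced stages reduce speedup */
        return 0;
    }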

A Pipelined MIPS Processor
Start the next instruction before the current one has completed
– improves throughput
– instruction latency is not reduced
– clock cycle (pipeline stage time) is limited by the slowest stage
– for some instructions, some stages are wasted cycles
(Pipeline diagram: lw, sw, and an R-type instruction overlapped, each passing through IF, ID, EX, MEM, WB in successive cycles.)

Single Cycle vs. Multiple Cycle vs. Pipelined
(Timing diagrams: the single-cycle implementation wastes part of each long cycle; the multiple-cycle implementation takes 5 short cycles for lw and 4 for sw; the pipelined implementation overlaps lw, sw, and an R-type instruction so one instruction completes per cycle once the pipeline is full.)

Single Cycle vs. Pipelined
Example: Compare the average time between lw instructions of a single-cycle implementation to a pipelined implementation. Assume the following operation times for the major functional units:
– 200 ps for memory access
– 200 ps for ALU operation
– 100 ps for register file read or write
(Done in class; try 3 instructions, 100 instructions, and n instructions. A sketch of the calculation follows.)
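
A minimal C sketch of the comparison, under the usual assumptions that the single-cycle clock must fit the 800 ps lw critical path while the pipelined clock is set by the slowest 200 ps stage, with 4 extra cycles to fill the pipeline (the large count stands in for “n”):

    #include <stdio.h>

    /* Single cycle: every instruction takes the full 800 ps lw critical path.
       Pipelined: 200 ps per cycle, plus 4 extra cycles to fill the pipeline. */
    static long single_cycle_ps(long n) { return n * 800; }
    static long pipelined_ps(long n)    { return (n + 4) * 200; }

    int main(void) {
        long counts[] = {3, 100, 1000000};
        for (int i = 0; i < 3; i++) {
            long n = counts[i];
            printf("n = %7ld: single = %ld ps, pipelined = %ld ps, speedup = %.2fx\n",
                   n, single_cycle_ps(n), pipelined_ps(n),
                   (double)single_cycle_ps(n) / pipelined_ps(n));
        }
        /* As n grows, the speedup approaches 800 / 200 = 4x: the ratio of the
           single-cycle time to the pipeline stage time. */
        return 0;
    }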

Pipelined Control (Simplified)

Pipelined Control

Simplified MIPS Pipelined Datapath Can you foresee any problems with these right-to-left flows?

Pipeline Registers
Need registers between stages to hold information produced in the previous cycle.
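
As an illustration of what those registers hold, here is a rough C sketch of two of them; the field names are made up for clarity and are not the textbook's exact signal names:

    #include <stdint.h>

    struct IF_ID {               /* written at the end of IF, read in ID */
        uint32_t pc_plus_4;      /* needed later for branch target calculation */
        uint32_t instruction;    /* the fetched instruction word */
    };

    struct ID_EX {               /* written at the end of ID, read in EX */
        uint32_t pc_plus_4;
        uint32_t read_data_1;    /* register file outputs */
        uint32_t read_data_2;
        int32_t  sign_ext_imm;   /* sign-extended immediate */
        uint8_t  rt, rd;         /* candidate destination register numbers,
                                    carried along so WB writes the right register */
        /* ...plus the control signals needed by the EX, MEM, and WB stages */
    };

    /* The EX/MEM and MEM/WB registers follow the same pattern, each carrying
       only what the remaining stages still need. */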

IF

ID

EX for Load

MEM for Load

WB for Load: wrong register number. There is a BUG here: the write-register number reaching WB belongs to a later instruction unless it is carried forward through the pipeline registers along with the data.

Corrected Datapath for Load

Pipelined Control
Control signals are derived from the instruction, as in the single-cycle implementation.

Hazards
Situations that prevent starting the next instruction in the next cycle:
– Structural hazards
– Data hazards
– Control hazards

Structural Hazards
An instruction cannot execute in its proper clock cycle because the hardware cannot support the combination of instructions that are set to execute in that cycle.
In a MIPS pipeline with a single memory:
– A load/store requires a data access
– Instruction fetch would have to stall for that cycle, causing a pipeline “bubble”
Hence, pipelined datapaths require separate instruction/data memories (or separate instruction/data caches).

Data Hazards
An instruction cannot execute in its proper clock cycle because data that it needs is not yet available:
    add $s0, $t0, $t1
    sub $t2, $s0, $t3

Simplified MIPS Pipelined Datapath

Forwarding (aka Bypassing)
Use the result when it is computed:
– Don't wait for it to be stored in a register
– Requires extra connections in the datapath
The decision the forwarding unit makes is sketched below.
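
A C sketch of the forwarding decision for the ALU's first operand; the structure follows the usual textbook conditions, but the names (id_ex_rs, ex_mem_*, mem_wb_*, FWD_*) are my own shorthand:

    /* Forward from the most recent producer first (EX/MEM), otherwise from
       MEM/WB; register $zero is never forwarded. */
    enum fwd { FWD_NONE, FWD_FROM_EXMEM, FWD_FROM_MEMWB };

    enum fwd forward_a(int id_ex_rs,
                       int ex_mem_regwrite, int ex_mem_rd,
                       int mem_wb_regwrite, int mem_wb_rd)
    {
        if (ex_mem_regwrite && ex_mem_rd != 0 && ex_mem_rd == id_ex_rs)
            return FWD_FROM_EXMEM;
        if (mem_wb_regwrite && mem_wb_rd != 0 && mem_wb_rd == id_ex_rs)
            return FWD_FROM_MEMWB;
        return FWD_NONE;   /* use the value read from the register file */
    }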

Load-Use Data Hazard
Can't always avoid stalls by forwarding:
– The value may not yet be computed when it is needed
– Can't forward backward in time!
The stall condition is sketched after this slide.
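
A matching C sketch of the load-use stall check (names are shorthand, not the book's exact signal names): if the instruction in EX is a load and its destination matches a source register of the instruction in ID, one bubble must be inserted:

    int must_stall(int id_ex_mem_read, int id_ex_rt,
                   int if_id_rs, int if_id_rt)
    {
        return id_ex_mem_read &&
               (id_ex_rt == if_id_rs || id_ex_rt == if_id_rt);
        /* On a stall: hold the PC and IF/ID register, and clear the control
           signals entering ID/EX, which is the bubble. */
    }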

Code Scheduling to Avoid Stalls
Reorder code to avoid using a load result in the next instruction.
C code: A = B + E; C = B + F;

Original order (two load-use stalls):
    lw  $t1, 0($t0)
    lw  $t2, 4($t0)
    add $t3, $t1, $t2    # stall: $t2 not ready
    sw  $t3, 12($t0)
    lw  $t4, 8($t0)
    add $t5, $t1, $t4    # stall: $t4 not ready
    sw  $t5, 16($t0)

Reordered (no stalls):
    lw  $t1, 0($t0)
    lw  $t2, 4($t0)
    lw  $t4, 8($t0)
    add $t3, $t1, $t2
    sw  $t3, 12($t0)
    add $t5, $t1, $t4
    sw  $t5, 16($t0)

Control Hazards
An instruction cannot execute in its proper clock cycle because the instruction that was fetched is not the one that is needed:
– A branch determines the flow of control
– Fetching the next instruction depends on the branch outcome
– The pipeline can't always fetch the correct instruction
In the MIPS pipeline:
– Need to compare registers and compute the target early in the pipeline
– Add hardware to do it in the ID stage

Stall on Branch
Wait until the branch outcome is determined before fetching the next instruction.

Branch Prediction
Correct branch prediction is very important and can produce substantial performance improvements:
– static prediction
– dynamic prediction
To take full advantage of branch prediction, we can not only fetch the predicted instructions but also begin executing them. This is known as speculative execution.

MIPS with Predict Not Taken
(Pipeline diagrams for the two cases: prediction correct, and prediction incorrect.)

More Realistic Branch Prediction
Static branch prediction
– Based on typical branch behavior
– Example: loop and if-statement branches
  - Predict backward branches taken
  - Predict forward branches not taken
Dynamic branch prediction
– Hardware measures actual branch behavior, e.g., records the recent history of each branch
– Assumes future behavior will continue the trend
  - When wrong, stall while re-fetching, and update the history
The backward-taken / forward-not-taken heuristic is sketched below.
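
A tiny C sketch of the “backward taken, forward not taken” static heuristic (mine, purely illustrative): loop-closing branches jump backward and are usually taken, so a backward target predicts taken:

    #include <stdbool.h>
    #include <stdint.h>

    bool static_predict_taken(uint32_t branch_pc, uint32_t target_pc)
    {
        return target_pc < branch_pc;   /* backward branch => predict taken */
    }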

Branches
Branch instructions can dramatically affect pipeline performance. Control operations are very frequent in current programs:
– 20% to 35% of the instructions executed are branches (conditional and unconditional)
– 65% of the branches actually take the branch
– Conditional branches are much more frequent than unconditional ones (more than twice as frequent)
– More than 50% of conditional branches are taken

Static Branch Prediction
Static prediction techniques do not take execution history into consideration.
– Predict never taken (Motorola 68020): assumes that the branch is not taken
– Predict always taken: assumes that the branch is taken

Dynamic Branch Prediction
Improve the accuracy of prediction by recording the history of conditional branches.
One-bit prediction scheme
– One bit records whether the last execution resulted in the branch being taken or not; the system predicts the same behavior as last time
Two-bit prediction scheme
– With a two-bit scheme, predictions are made based on the last two instances of execution

One-Bit Prediction Scheme

Two-Bit Prediction Scheme
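
A minimal C sketch of how the two-bit scheme in the state diagram above is usually implemented, as a saturating counter that must be wrong twice in a row before the prediction flips (names are illustrative):

    #include <stdbool.h>

    typedef enum {
        STRONG_NOT_TAKEN, WEAK_NOT_TAKEN, WEAK_TAKEN, STRONG_TAKEN
    } two_bit_t;

    bool predict_taken(two_bit_t s) {
        return s >= WEAK_TAKEN;
    }

    two_bit_t update(two_bit_t s, bool actually_taken) {
        if (actually_taken)
            return s == STRONG_TAKEN ? s : (two_bit_t)(s + 1);
        else
            return s == STRONG_NOT_TAKEN ? s : (two_bit_t)(s - 1);
    }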

Branch History Table
History information can be used not only to predict the outcome of a conditional branch but also to avoid recalculating the target address. Together with the bits used for prediction, the target address can be stored for later use in a branch history table. Using dynamic branch prediction with history tables, up to 90% of predictions can be correct. The Pentium and PowerPC 620 use speculative execution with dynamic branch prediction based on a branch history table.
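
To make the table concrete, here is a rough C sketch of a direct-mapped branch history table that stores both the prediction state and the target; the size, field names, and indexing are illustrative assumptions, not a description of the Pentium or PowerPC 620:

    #include <stdbool.h>
    #include <stdint.h>

    #define BHT_ENTRIES 1024

    struct bht_entry {
        uint32_t branch_pc;   /* full branch address, to detect aliasing */
        uint32_t target;      /* stored branch target address            */
        uint8_t  counter;     /* two-bit prediction state, 0..3          */
        bool     valid;
    };

    static struct bht_entry bht[BHT_ENTRIES];

    /* Look up a branch during fetch: a hit yields both a taken/not-taken
       prediction and the target, so the target need not be recalculated. */
    bool bht_lookup(uint32_t pc, uint32_t *predicted_target)
    {
        struct bht_entry *e = &bht[(pc >> 2) % BHT_ENTRIES];
        if (e->valid && e->branch_pc == pc) {
            *predicted_target = e->target;
            return e->counter >= 2;      /* predict taken */
        }
        return false;                    /* miss: fall back to predict not taken */
    }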

Branch History Table