
Lecture 3 Introduction to Pipelines Topics Single Cycle CPU Revisited Components Pipelining Readings: Appendix Sections C.1 and C.2 August 31, 2015 CSCE 513 Computer Architecture

– 2 – CSCE 513 Fall 2015 Overview
Last Time: Quantifying performance, Amdahl's Law, the CPU time equation
New: Review of the single-cycle design, the 5-stage pipeline, hazards, performance with stalls
References: Appendix A

– 3 – CSCE 513 Fall 2015 Homework Set #2
- 1.8 a-d (change 2015 throughout the question to 2025)
- 1.9
- 1.12
- 1.18
- Matrix multiply (mm.c will be e-mailed and placed on the website)
  - Compile with gcc -S
  - Compile with gcc -O2 -S and note the differences
George K. Zipf (1949), Human Behavior and the Principle of Least Effort. Addison-Wesley

– 4 – CSCE 513 Fall 2015 Finish Up Slides from Lecture 2 (Slides 18-38)
- Amdahl's Law
- Geometric means vs. arithmetic means
- Availability: MTTF example
- CPU performance equation
- Fallacies and pitfalls
- List of appendices
- HW Set 2 (Slide 38)

– 5 – CSCE 513 Fall 2015 Patterson's 5 Steps to Design a Processor
1. Analyze the instruction set => datapath requirements
2. Select a set of datapath components & establish a clocking methodology
3. Assemble a datapath meeting the requirements
4. Analyze the implementation of each instruction to determine the settings of the control points that effect the register transfers
5. Assemble the control logic

– 6 – CSCE 513 Fall 2015 Components We Are Assuming You Know
Basic gates: ANDs, ORs, XORs, NANDs, NORs
Combinational components: adders, ALU, multiplexers (MUXs), decoders
Not going to need: PLAs, PALs, FPGAs ...
Sequential components: registers; register file (banks of registers, a pair of MUXs, a decoder for the load lines); memories
Register Transfer Language
Non-branch: PC ← PC + 4 (control signal: register transfer)

– 7 – CSCE 513 Fall 2015 MIPS Simplifies Processor Design
Instructions are the same size
Source registers are always in the same place
Immediates are always the same size, in the same location
Operations are always on registers or immediates
Single-cycle datapath means ...
CPI is ...
CCT is ...
Ref:

– 8 – CSCE 513 Fall 2015 Register File
32:1 MUX; R0 ... R31; 5x32 decoder; inputs Rd, Rt, Rs, Data In; outputs Bus A, Bus B
Notes
- How big are the lines? Some 5, some 32, some 1
- Data In goes to every register
- R0 = 0

– 9 – CSCE 513 Fall 2015 Instruction Fetch (non-branch)

– 10 – CSCE 513 Fall 2015 High Level for Register-Register Instructions

– 11 – CSCE 513 Fall 2015 Stores: Store Rs, disp(Rb)
Notes
- Sign-extend the 16-bit immediate
- Write trace
- Read trace

– 12 – CSCE 513 Fall 2015 Loads: LD Rd, disp(Rr)
Notes
- Sign-extend the 16-bit displacement to calculate the address = disp + Rr

– 13 – CSCE 513 Fall 2015 Branches
Notes
- Sign extend for backwards branches
- Note shift left 2 = multiply by 4, which means the displacement is in words
Register Transfer Language
Cond ← (R[rs] == R[rt])
if (Cond) PC ← PC + 4 + (SE(imm16) × 4) else PC ← PC + 4
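The branch RTL above can be checked with a short sketch. The helper name is hypothetical, and it assumes the MIPS convention that the word displacement is added to the already-incremented PC:

```python
def branch_target(pc, imm16, taken):
    # Sign-extend the 16-bit immediate, multiply by 4 (shift left 2,
    # since the displacement is in words), and add to PC + 4 when the
    # branch is taken; otherwise simply fall through to PC + 4.
    se = imm16 - (1 << 16) if imm16 & 0x8000 else imm16
    return pc + 4 + (se << 2) if taken else pc + 4

print(hex(branch_target(0x1000, 0x0003, True)))   # forward branch
print(hex(branch_target(0x1000, 0xFFFF, True)))   # backward branch (imm16 = -1)
```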

– 14 – CSCE 513 Fall 2015 Branch Hardware
[Figure: branch datapath — imm16 with sign extension (Ext), two adders (one with a constant 4 input), a MUX controlled by nPC_sel, the PC register clocked by Clk, and the instruction address output]

– 15 – CSCE 513 Fall 2015 Adding Instruction Fetch / PC Increment

– 16 – CSCE 513 Fall 2015 Simple Data Path for All Instructions

– 17 – CSCE 513 Fall 2015 Pulling It All Together
Note: PC = PC + 4 (all MIPS instructions are 4 bytes)

– 18 – CSCE 513 Fall 2015 Adding Control

– 19 – CSCE 513 Fall 2015 Non-pipelined RISC Operations (Fig. C.21)
Stores: 4 cycles (10%)
Branches: 2 cycles (12%)
Others: 5 cycles (78%)
CPI = ?
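A quick sanity check of the CPI question, using the frequencies and cycle counts on the slide:

```python
# Mix-weighted CPI for the non-pipelined machine of Fig. C.21:
# stores 10% at 4 cycles, branches 12% at 2, everything else 78% at 5.
mix = {"store": (0.10, 4), "branch": (0.12, 2), "other": (0.78, 5)}
cpi = sum(freq * cycles for freq, cycles in mix.values())
print(f"CPI = {cpi:.2f}")
```

which gives a CPI of 4.54.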

– 20 – CSCE 513 Fall 2015 Pipeline Motivational Example
System: Computation requires a total of 300 picoseconds, plus an additional 20 picoseconds to save the result in a register, so the clock cycle must be at least 320 ps.
Combinational logic (300 ps) → Reg (20 ps); Clock
Delay = 320 ps, Throughput = 3.12 GOPS
The next few slides are from Bryant and O'Hallaron, Computer Systems: A Programmer's Perspective

– 21 – CSCE 513 Fall 2015 3-Way Pipelined Version
System: Divide the combinational logic into 3 blocks of 100 ps each. We can begin a new operation as soon as the previous one passes through stage A, i.e., every 120 ps. Overall latency increases to 360 ps from start to finish.
Comb. logic A (100 ps) → Reg (20 ps) → Comb. logic B (100 ps) → Reg (20 ps) → Comb. logic C (100 ps) → Reg (20 ps)
Delay = 360 ps, Throughput = 8.33 GOPS
Reference: Bryant and O'Hallaron, Computer Systems: A Programmer's Perspective
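The delay and throughput numbers on these two slides follow from the slowest stage plus the 20 ps register overhead; a small sketch (the helper name is ours, not from the slides):

```python
def pipeline_metrics(stage_delays_ps, reg_overhead_ps=20):
    # The clock period is set by the slowest stage plus the register
    # overhead; latency is the period times the number of stages, and
    # throughput is one operation per period (1000/period gives GOPS).
    period = max(stage_delays_ps) + reg_overhead_ps
    latency = period * len(stage_delays_ps)
    gops = 1000 / period
    return period, latency, gops

print(pipeline_metrics([300]))            # unpipelined: 320 ps, about 3.12 GOPS
print(pipeline_metrics([100, 100, 100]))  # 3-way: 360 ps latency, about 8.33 GOPS
```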

– 22 – CSCE 513 Fall 2015 Pipeline Diagrams (from Bryant and O'Hallaron)
Unpipelined: cannot start a new operation until the previous one completes
3-Way pipelined: up to 3 operations (OP1, OP2, OP3) in process simultaneously, each passing through stages A, B, C offset by one stage time
Reference: Bryant and O'Hallaron, Computer Systems: A Programmer's Perspective

– 23 – CSCE 513 Fall 2015 Operating a Pipeline (from Bryant and O'Hallaron)
[Figure: snapshots of the 3-stage pipeline (Comb. logic A/B/C at 100 ps each, plus 20 ps registers) at clock times 239, 241, 300, and 359 ps, showing OP1-OP3 advancing one stage per 120 ps cycle]
Reference: Bryant and O'Hallaron, Computer Systems: A Programmer's Perspective

– 24 – CSCE 513 Fall 2015 Limitations: Nonuniform Delays (from Bryant and O'Hallaron)
Throughput is limited by the slowest stage; the other stages sit idle for much of the time. It is challenging to partition a system into balanced stages.
Comb. logic A (50 ps) → Reg (20 ps) → Comb. logic B (150 ps) → Reg (20 ps) → Comb. logic C (100 ps) → Reg (20 ps)
Delay = 510 ps, Throughput = 5.88 GOPS
Reference: Bryant and O'Hallaron, Computer Systems: A Programmer's Perspective

– 25 – CSCE 513 Fall 2015 Limitations: Register Overhead (from Bryant and O'Hallaron)
As we try to deepen the pipeline, the overhead of loading registers becomes more significant.
Percentage of clock cycle spent loading registers:
- 1-stage pipeline: 6.25%
- 3-stage pipeline: 16.67%
- 6-stage pipeline: 28.57%
The high speeds of modern processor designs are obtained through very deep pipelining.
Six stages of Comb. logic (50 ps) → Reg (20 ps): Delay = 420 ps, Throughput = 14.29 GOPS
Reference: Bryant and O'Hallaron, Computer Systems: A Programmer's Perspective
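The percentages above are just the 20 ps register overhead divided by the full cycle; a one-liner to reproduce them:

```python
def reg_overhead_fraction(logic_ps, overhead_ps=20):
    # Fraction of each clock cycle spent loading the pipeline register.
    return overhead_ps / (logic_ps + overhead_ps)

# (stage count, per-stage logic delay) for the 1-, 3-, and 6-stage designs
for stages, logic_ps in [(1, 300), (3, 100), (6, 50)]:
    print(f"{stages}-stage: {reg_overhead_fraction(logic_ps):.2%}")
```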

– 26 – CSCE 513 Fall 2015 Multicycle Data Path (Appendix C)
Execute instructions in stages:
- Shorter clock cycle time (CCT)
- Executing an instruction takes a few cycles (however many stages we have)
- We can execute different things in each stage at the same time; a precursor to the pipelined version
Stages: Fetch, Decode, Execute, Memory, Write Back

– 27 – CSCE 513 Fall 2015 Stages of the Classical 5-Stage Pipeline
Instruction fetch (IF): IR ← Mem[PC]; NPC ← PC + 4
Decode (ID): A ← Regs[rs]; B ← Regs[rt]; Imm ← sign-extend of the immediate field of IR
Execute, Memory, Write Back: see the following slides.

– 28 – CSCE 513 Fall 2015 Execute
Based on the type of instruction:
Memory reference – calculate the effective address d(rb): ALUOutput ← A + Imm
Register-register ALU instruction: ALUOutput ← A func B
Register-immediate ALU instruction: ALUOutput ← A op Imm
Branch: ALUOutput ← NPC + (Imm << 2); Cond ← (A == 0)

– 29 – CSCE 513 Fall 2015 Memory
All instructions: PC ← NPC
Memory reference: LMD ← Mem[ALUOutput], or Mem[ALUOutput] ← B
Branch: if (Cond) PC ← ALUOutput

– 30 – CSCE 513 Fall 2015 Write-Back (WB) Cycle
Register-register ALU instruction: Regs[rd] ← ALUOutput
Register-immediate ALU instruction: Regs[rt] ← ALUOutput
Load instruction: Regs[rt] ← LMD
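The stage-by-stage register transfers above can be traced with a toy walk of one register-register instruction through the five stages. Names like IR, NPC, A, B, and ALUOutput mirror the slides; the register and memory contents are made up:

```python
regs = {1: 0, 2: 7, 3: 5}       # made-up register-file contents
imem = {0: ("DADD", 1, 2, 3)}   # (opcode, rd, rs, rt) at address 0

pc = 0
ir = imem[pc]; npc = pc + 4     # IF:  IR <- Mem[PC]; NPC <- PC + 4
op, rd, rs, rt = ir
a, b = regs[rs], regs[rt]       # ID:  A <- Regs[rs]; B <- Regs[rt]
alu_output = a + b              # EX:  ALUOutput <- A func B (add here)
pc = npc                        # MEM: PC <- NPC (no memory access for ALU ops)
regs[rd] = alu_output           # WB:  Regs[rd] <- ALUOutput

print(regs[1], pc)              # 12 4
```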

– 31 – CSCE 513 Fall 2015 Simple RISC Pipeline
Clock cycle number (time →)
Instruction        1    2    3    4    5    6    7    8    9
Instruction n      IF   ID   EX   MEM  WB
Instruction n+1         IF   ID   EX   MEM  WB
Instruction n+2              IF   ID   EX   MEM  WB
Instruction n+3                   IF   ID   EX   MEM  WB
Instruction n+4                        IF   ID   EX   MEM  WB

– 32 – CSCE 513 Fall 2015 Performance Analysis in a Perfect World
Assume S stages in the pipeline and that a new instruction is initiated each cycle. To execute N instructions takes:
- N cycles to start up the instructions
- (S-1) cycles to flush the pipeline
- TotalTime = N + (S-1)
Example for S=5 from the previous slide, N=100 instructions:
- Time to execute in the non-pipelined version = 100 × 5 = 500 cycles
- Time to execute in the pipelined version = 100 + (5-1) = 104 cycles
- SpeedUp = ...
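A quick check of the N + (S-1) formula and the resulting speedup (the helper name is ours):

```python
def pipelined_cycles(n, s):
    # S cycles for the first instruction to drain through all stages,
    # then one additional cycle per remaining instruction: N + (S - 1).
    return n + (s - 1)

n, s = 100, 5
nonpipelined = n * s
pipelined = pipelined_cycles(n, s)
print(nonpipelined, pipelined, round(nonpipelined / pipelined, 2))  # 500 104 4.81
```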

– 33 – CSCE 513 Fall 2015 Implement Pipelines Supp. Fig C.4

– 34 – CSCE 513 Fall 2015 Pipeline Example with a Problem (A.5-like)
Instruction            1    2    3    4    5    6    7    8    9
DADD R1, R2, R3        IM   ID   EX   DM   WB
DSUB R4, R1, R5             IM   ID   EX   DM   WB
AND  R6, R1, R7                  IM   ID   EX   DM   WB
OR   R8, R1, R9                       IM   ID   EX   DM   WB
XOR  R10, R1, R11                          IM   ID   EX   DM   WB

– 35 – CSCE 513 Fall 2015 Inserting Pipeline Registers into the Data Path (Fig. A.18)

– 36 – CSCE 513 Fall 2015 Major Hurdle of Pipelining
Consider executing the code below:
DADD R1, R2, R3   /* R1 ← R2 + R3 */
DSUB R4, R1, R5   /* R4 ← R1 - R5 */
AND  R6, R1, R7   /* R6 ← R1 & R7 */
OR   R8, R1, R9   /* R8 ← R1 | R9 */
XOR  R10, R1, R11 /* R10 ← R1 ^ R11 */

– 37 – CSCE 513 Fall 2015 RISC Pipeline Problems
Clock cycle number (time →)
Instruction            1    2    3    4    5    6    7    8    9
DADD R1, R2, R3        IM   ID   EX   DM   WB
DSUB R4, R1, R5             IM   ID   EX   DM   WB
AND  R6, R1, R7                  IM   ID   EX   DM   WB
OR   R8, R1, R9                       IM   ID   EX   DM   WB
XOR  R10, R1, R11                          IM   ID   EX   DM   WB
So what's the problem?

– 38 – CSCE 513 Fall 2015 Hazards
Data hazards – a data value computed in one stage is not ready when it is needed in another stage of the pipeline. Simple solution: stall until it is ready, but we can do better.
Control (branch) hazards
Structural hazards – arise when resources are not sufficient to completely overlap the instruction sequence, e.g., having two floating-point add units but needing to do three additions simultaneously.
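A minimal sketch of detecting the RAW (read-after-write) data hazards in a short instruction sequence. The representation and the hazard window are simplifications of ours; the exact number of dangerous cycles depends on the pipeline depth and on register-file timing:

```python
def raw_hazards(instrs, window=2):
    # Flag any instruction that reads a register written by an
    # instruction fewer than `window` positions earlier.
    found = []
    for i, (dest, srcs) in enumerate(instrs):
        for j in range(max(0, i - window), i):
            if instrs[j][0] in srcs:
                found.append((j, i, instrs[j][0]))
    return found

# (destination, sources) for the DADD/DSUB/AND sequence from the slides
code = [("R1", ("R2", "R3")),
        ("R4", ("R1", "R5")),
        ("R6", ("R1", "R7"))]
print(raw_hazards(code))   # both later instructions depend on R1
```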

– 39 – CSCE 513 Fall 2015 Performance of Pipelines with Stalls
Pipelining can thus be thought of as improving CPI or improving CCT.
[Relationships and equations shown on slide]

– 40 – CSCE 513 Fall 2015 Performance Equations with Stalls
If we ignore the overhead of pipelining:
Speedup = (CPI_unpipelined / CPI_pipelined) × (CCT_unpipelined / CCT_pipelined), where CPI_pipelined = 1 + pipeline stall cycles per instruction
Special case: if we assume every instruction takes the same number of cycles, i.e., CPI is a constant equal to the depth of the pipeline, then
Speedup = Pipeline depth / (1 + pipeline stall cycles per instruction)

– 41 – CSCE 513 Fall 2015 Performance Equations with Stalls
Alternatively, focusing on the improvement in CCT:
Speedup = 1 / (1 + pipeline stall cycles per instruction) × (CCT_unpipelined / CCT_pipelined)
Then simplify using the formula for CCT_pipelined (next slide).

– 42 – CSCE 513 Fall 2015 Performance Equations with Stalls
Simplifying using CCT_pipelined = CCT_unpipelined / Pipeline depth, we obtain
Speedup = Pipeline depth / (1 + pipeline stall cycles per instruction)
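Under the special-case assumptions above (unpipelined CPI equal to the pipeline depth), the resulting formula is easy to evaluate (a sketch, helper name ours):

```python
def pipeline_speedup(depth, stalls_per_instr):
    # Speedup = pipeline depth / (1 + pipeline stall cycles per instruction)
    return depth / (1 + stalls_per_instr)

print(pipeline_speedup(5, 0.0))   # ideal 5-stage pipeline: speedup of 5
print(pipeline_speedup(5, 0.5))   # half a stall cycle per instruction
```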

– 43 – CSCE 513 Fall 2015 Structural Hazards
When a combination of instructions cannot be accommodated because of resource conflicts, we have a structural hazard. Examples:
- Single-port memory (what is a dual-port memory, anyway?)
- One write port on the register file
- A single floating-point adder
- ...
A stall in the pipeline is frequently called a pipeline bubble, or just a bubble. A bubble floats through the pipeline, occupying space.

– 44 – CSCE 513 Fall 2015 Example Structural Hazard Fig C.4

– 45 – CSCE 513 Fall 2015 Pipeline Stalled for a Structural Hazard
Clock cycle number (time →)
Instruction        1    2    3     4     5     6    7    8    9
Instruction n      IF   ID   EX    MEM   WB
Instruction n+1         IF   ID    EX    MEM*  WB
Instruction n+2              IF    ID    EX    MEM  WB
Instruction n+3                    stall IF    ID   EX   MEM  WB
Instruction n+4                          stall IF   ID   EX   ...
MEM – a memory cycle that is a load or store
MEM* – a memory cycle that is not a load or store

– 46 – CSCE 513 Fall 2015 Data Hazards

– 47 – CSCE 513 Fall 2015 Data Hazard
Clock cycle number (time →)
Instruction            1    2    3    4    5    6    7    8    9
DADD R1, R2, R3        IM   ID   EX   DM   WB
DSUB R4, R1, R5             IM   ID   EX   DM   WB
AND  R6, R1, R7                  IM   ID   EX   DM   WB
OR   R8, R1, R9                       IM   ID   EX   DM   WB
XOR  R10, R1, R11                          IM   ID   EX   DM   WB
IM – instruction memory, DM – data memory

– 48 – CSCE 513 Fall 2015 Figure C.6

– 49 – CSCE 513 Fall 2015 Minimizing Data Hazard Stalls by Forwarding

– 50 – CSCE 513 Fall 2015 Fig C.7 Forwarding

– 51 – CSCE 513 Fall 2015 Forwarding of Operands for Stores (Fig. C.8)
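The idea behind these forwarding figures can be sketched in a few lines: the EX stage prefers a value sitting in a later pipeline latch over the stale register-file copy. The names and values here are illustrative, not from the figures:

```python
def ex_operand(reg, regfile, forwarded):
    # Use the forwarded value if one exists for this register;
    # otherwise fall back to the register file.
    return forwarded.get(reg, regfile[reg])

regfile = {"R1": 0, "R2": 7, "R3": 5, "R5": 1}   # R1 not yet written back
# DADD R1,R2,R3 has just left EX; its result sits in the EX/MEM latch.
forwarded = {"R1": regfile["R2"] + regfile["R3"]}
# DSUB R4,R1,R5 in EX reads R1 over the forwarding path, not from the file.
r4 = ex_operand("R1", regfile, forwarded) - ex_operand("R5", regfile, forwarded)
print(r4)   # 12 - 1 = 11
```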

– 52 – CSCE 513 Fall 2015 Control Hazards

– 53 – CSCE 513 Fall 2015 Copyright © 2011, Elsevier Inc. All rights reserved.
Figure C.18 The states in a 2-bit prediction scheme. By using 2 bits rather than 1, a branch that strongly favors taken or not taken—as many branches do—will be mispredicted less often than with a 1-bit predictor. The 2 bits are used to encode the four states in the system. The 2-bit scheme is actually a specialization of a more general scheme that has an n-bit saturating counter for each entry in the prediction buffer. With an n-bit counter, the counter can take on values between 0 and 2^n − 1. When the counter is greater than or equal to one-half of its maximum value (2^(n−1)), the branch is predicted as taken; otherwise, it is predicted as untaken. Studies of n-bit predictors have shown that 2-bit predictors do almost as well; thus most systems rely on 2-bit branch predictors rather than the more general n-bit predictors.
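The caption's n-bit saturating counter is straightforward to model (a sketch; n=2 gives the 2-bit scheme of Figure C.18):

```python
class SaturatingPredictor:
    # Predict taken when the counter is at or above half its maximum,
    # i.e., in the upper half of the 0 .. 2^n - 1 range.
    def __init__(self, n=2):
        self.max = (1 << n) - 1
        self.threshold = 1 << (n - 1)
        self.counter = 0

    def predict(self):
        return self.counter >= self.threshold

    def update(self, taken):
        # Saturating increment on taken, decrement on not taken.
        if taken:
            self.counter = min(self.counter + 1, self.max)
        else:
            self.counter = max(self.counter - 1, 0)

p = SaturatingPredictor()
for outcome in [True, True, False, True, True]:   # a mostly-taken branch
    print(p.predict(), outcome)
    p.update(outcome)
```

Starting from the strongly-taken state (counter 3), a single not-taken outcome leaves the prediction unchanged, which is exactly the caption's point.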

– 54 – CSCE 513 Fall 2015 Pop Quiz
Suppose that your application is 60% parallelizable. What is the overall speedup in going from 1 core to 2?
Assuming power and frequency are linearly related, how is the dynamic power affected by the improvement?
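The first quiz question is a direct Amdahl's Law calculation (a sketch that answers only the speedup part):

```python
def amdahl_speedup(parallel_fraction, cores):
    # Amdahl's Law: the serial fraction limits the overall speedup.
    return 1 / ((1 - parallel_fraction) + parallel_fraction / cores)

print(round(amdahl_speedup(0.60, 2), 3))   # going from 1 core to 2: about 1.43x
```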

– 55 – CSCE 513 Fall 2015 Plan of Attack
Chapter reading plan: Chapter 1 → Appendix C (pipeline review) → Appendix B (cache review) → Chapter 2 (Memory Hierarchy) → Appendix A (ISA review, not really) → Chapter 3 (Instruction-Level Parallelism, ILP) → Chapter 4 (Data-Level Parallelism) → Chapter 5 (Thread-Level Parallelism) → Chapter 6 (Warehouse-Scale Computing) → sprinkle in other appendices
Website: Lectures, HW, Links, Errata??, Moodle
CEC login/password
Systems: SimpleScalar (pipeline), Beowulf cluster (MPI), GTX (multithreaded)