Slide 1: Lecture 3 — Introduction to Pipelines
Topics: single-cycle CPU revisited; components; pipelining.
Readings: Appendix C, Sections C.1 and C.2.
August 31, 2015. CSCE 513 Computer Architecture, Fall 2015.
Slide 2: Overview
Last time: quantifying performance, Amdahl's law, the CPU time equation.
New: review of the single-cycle design, the 5-stage pipeline, hazards, performance with stalls.
References: Appendix A.
Slide 3: Homework Set #2
Problems 1.8 a-d (change 2015 throughout the question to 2025), 1.9, 1.12, 1.18.
Matrix multiply (mm.c will be emailed and placed on the website): compile with gcc -S, then compile with gcc -O2 -S and note the differences.
George K. Zipf (1949), Human Behavior and the Principle of Least Effort, Addison-Wesley.
Slide 4: Finish Up Slides from Lecture 2
Starting at slide 18: Amdahl's law; geometric means vs. arithmetic means; availability (MTTF example); the CPU performance equation; fallacies and pitfalls; list of appendices. HW set 2 is on slide 38.
Slide 5: Patterson's 5 Steps to Design a Processor
1. Analyze the instruction set to derive the datapath requirements.
2. Select the set of datapath components and establish a clocking methodology.
3. Assemble a datapath meeting the requirements.
4. Analyze the implementation of each instruction to determine the settings of the control points that effect the register transfers.
5. Assemble the control logic.
Slide 6: Components We Assume You Know
Basic gates: AND, OR, XOR, NAND, NOR.
Combinational components: adders, the ALU, multiplexers (MUXes), decoders. Not needed here: PLAs, PALs, FPGAs, etc.
Sequential components: registers; register files (banks of registers, a pair of muxes, and a decoder for the load lines); memories.
Register transfer language, e.g. non-branch: PC ← PC + 4 (a register transfer gated by a control signal).
Slide 7: MIPS Simplifies Processor Design
Instructions are all the same size; source registers are always in the same place; immediates are always the same size and location; operations are always on registers or immediates.
A single-cycle datapath means CPI is 1, but the clock cycle time (CCT) must accommodate the slowest instruction.
Reference: http://en.wikipedia.org/wiki/MIPS_architecture
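To make the CPI/CCT point concrete: in a single-cycle design CPI is 1, so the clock must be long enough for the slowest instruction's full path. A minimal sketch, where the per-stage delays are illustrative assumptions, not values from the slides:

```python
# Hypothetical stage delays in picoseconds (illustrative, not from the slides).
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

# Which stages each instruction class actually exercises on a single-cycle datapath.
PATHS = {
    "R-type": ["IF", "ID", "EX", "WB"],
    "load":   ["IF", "ID", "EX", "MEM", "WB"],
    "store":  ["IF", "ID", "EX", "MEM"],
    "branch": ["IF", "ID", "EX"],
}

# Single-cycle design: CPI = 1, so the clock must fit the *longest* path.
cct = max(sum(STAGE_PS[s] for s in path) for path in PATHS.values())
print(cct)  # 800 ps, set by the load instruction
```

With these numbers the load's 800 ps path sets the clock, even though a branch finishes in 500 ps; that wasted slack is the motivation for the multicycle and pipelined designs later in the lecture.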
Slide 8: Register File
Structure: registers R0 through R31, a 5-to-32 decoder on the write port, and 32:1 muxes selecting buses A and B; ports Rd, Rt, Rs; a Data In bus.
Notes: How wide are the lines? Some are 5 bits, some 32, some 1. Data In goes to every register. R0 = 0.
Slide 9: Instruction Fetch (non-branch)
Slide 10: High Level for Register-Register Instructions
Slide 11: Stores — Store Rs, disp(Rb)
Notes: sign-extend the 16-bit immediate. Write trace; read trace.
Slide 12: Loads — LD Rd, disp(Rr)
Notes: sign-extend the 16-bit displacement to calculate the address = disp + Rr.
Slide 13: Branches
Notes: sign extension handles backward branches. A shift left by 2 is a multiply by 4, which means the displacement is in words.
Register transfer language:
Cond ← (R[rs] == R[rt])
if (Cond) PC ← PC + 4 + (SE(imm16) × 4) else PC ← PC + 4
Slide 14: Branch Hardware
Datapath pieces: imm16, PC, Clk, two adders (PC + 4 and target), a mux selected by nPC_sel, sign extension (Ext), and the instruction address.
Slide 15: Adding Instruction Fetch / PC Increment
Slide 16: Simple Data Path for All Instructions
Slide 17: Pulling It All Together
Note: PC = PC + 4 (all MIPS instructions are 4 bytes).
Slide 18: Adding Control
Slide 19: Non-pipelined RISC Operations (Fig. C.21)
Stores: 4 cycles (10%); branches: 2 cycles (12%); all others: 5 cycles (78%). CPI = ?
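The CPI question on this slide is a frequency-weighted average of the cycle counts; a quick check of the Fig. C.21 mix:

```python
# Frequency-weighted CPI from the Fig. C.21 instruction mix.
mix = [(0.10, 4),   # stores: 4 cycles
       (0.12, 2),   # branches: 2 cycles
       (0.78, 5)]   # all other instructions: 5 cycles
cpi = sum(freq * cycles for freq, cycles in mix)
print(round(cpi, 2))   # 4.54
```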
Slide 20: Pipeline Motivational Example
The computation requires a total of 300 picoseconds, and saving the result in a register takes an additional 20 picoseconds, so the clock cycle must be at least 320 ps.
Combinational logic (300 ps) → Reg (20 ps). Delay = 320 ps; throughput = 3.12 GOPS.
The next few slides are from Bryant and O'Hallaron, Computer Systems: A Programmer's Perspective.
Slide 21: 3-Way Pipelined Version
Divide the combinational logic into 3 blocks of 100 ps each. A new operation can begin as soon as the previous one passes through stage A, i.e., every 120 ps. Overall latency increases to 360 ps from start to finish.
Reg → logic A (100 ps) → Reg (20 ps) → logic B (100 ps) → Reg (20 ps) → logic C (100 ps) → Reg (20 ps). Delay = 360 ps; throughput = 8.33 GOPS.
(From Bryant and O'Hallaron, Computer Systems: A Programmer's Perspective.)
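The delay and throughput figures on these two slides follow from the cycle time alone; a quick sketch (GOPS here means billions of operations per second):

```python
def throughput_gops(cycle_ps):
    # One operation completes per cycle: ops/sec = 1e12 / cycle_ps; /1e9 gives GOPS.
    return 1e12 / cycle_ps / 1e9

unpipelined_cycle = 300 + 20          # all the logic plus one register delay
pipelined_cycle   = 100 + 20          # one 100 ps stage plus one register delay

print(round(throughput_gops(unpipelined_cycle), 2))   # 3.12 GOPS
print(round(throughput_gops(pipelined_cycle), 2))     # 8.33 GOPS
print(3 * pipelined_cycle)                            # latency rises to 360 ps
```

Throughput goes up by nearly 3x while each individual operation's latency gets worse (360 ps vs. 320 ps); that trade is the essence of pipelining.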
Slide 22: Pipeline Diagrams
Unpipelined: a new operation cannot start until the previous one completes.
3-way pipelined: up to 3 operations in flight simultaneously, with stages A, B, and C of OP1, OP2, and OP3 overlapping in time.
(From Bryant and O'Hallaron.)
Slide 23: Operating a Pipeline
Snapshots of the 3-stage pipeline (100 ps logic + 20 ps register per stage) at times 239, 241, 300, and 359 ps, showing which operation occupies each stage as the clock advances.
(From Bryant and O'Hallaron.)
Slide 24: Limitations — Nonuniform Delays
Throughput is limited by the slowest stage; the other stages sit idle much of the time, and it is challenging to partition a system into balanced stages.
Example: stages of 50, 150, and 100 ps of logic, each followed by a 20 ps register. Delay = 510 ps; throughput = 5.88 GOPS.
(From Bryant and O'Hallaron.)
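The 5.88 GOPS figure follows from clocking every stage at the slowest stage's delay:

```python
stage_logic_ps = (50, 150, 100)            # the unbalanced stages from the slide
reg_ps = 20
cycle_ps = max(stage_logic_ps) + reg_ps    # the clock must wait for the slowest stage
print(cycle_ps)                            # 170 ps cycle
print(round(1e12 / cycle_ps / 1e9, 2))     # 5.88 GOPS
print(len(stage_logic_ps) * cycle_ps)      # 510 ps overall delay
```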
Slide 25: Limitations — Register Overhead
As we deepen the pipeline, the overhead of loading the pipeline registers becomes more significant. Percentage of the clock cycle spent loading registers: 1-stage pipeline, 6.25%; 3-stage, 16.67%; 6-stage, 28.57%. The high clock speeds of modern processor designs are obtained through very deep pipelining.
Example: six 50 ps stages, each with a 20 ps register. Delay = 420 ps; throughput = 14.29 GOPS.
(From Bryant and O'Hallaron.)
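The overhead percentages can be reproduced by splitting the 300 ps of logic evenly across the stages while the 20 ps register cost stays fixed per stage:

```python
TOTAL_LOGIC_PS, REG_PS = 300, 20

for stages in (1, 3, 6):
    cycle = TOTAL_LOGIC_PS / stages + REG_PS   # per-stage logic + register load
    print(f"{stages}-stage: {100 * REG_PS / cycle:.2f}% of the cycle loads registers")
```

Because the register delay does not shrink as the logic is subdivided, it eats a growing fraction of each (shorter) cycle, which is why pipeline depth eventually hits diminishing returns.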
Slide 26: Multicycle Data Path (Appendix C)
Execute instructions in stages: the clock cycle time (CCT) gets shorter, but executing one instruction now takes several cycles (one per stage). Different instructions can occupy different stages at the same time; this is the precursor to the pipelined version.
Stages: Fetch, Decode, Execute, Memory, Write Back.
Slide 27: Stages of the Classical 5-Stage Pipeline
Instruction fetch: IR ← Mem[PC]; NPC ← PC + 4.
Decode: A ← Regs[rs]; B ← Regs[rt]; Imm ← sign-extend of the immediate field.
Execute, Memory, and Write Back are detailed on the next slides.
Slide 28: Execute
Based on the type of instruction:
Memory reference (calculate the effective address of d(Rb)): ALUOutput ← A + Imm.
Register-register ALU instruction: ALUOutput ← A func B.
Register-immediate ALU instruction: ALUOutput ← A op Imm.
Branch: ALUOutput ← NPC + (Imm << 2); Cond ← (A == 0).
Slide 29: Memory
PC ← NPC.
Memory access: LMD ← Mem[ALUOutput], or Mem[ALUOutput] ← B.
Branch: if (Cond) PC ← ALUOutput.
Slide 30: Write-Back (WB) Cycle
Register-register ALU instruction: Regs[rd] ← ALUOutput.
Register-immediate ALU instruction: Regs[rt] ← ALUOutput.
Load instruction: Regs[rt] ← LMD.
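The per-stage register transfers on slides 27-30 can be traced end to end for one register-register ALU instruction. This toy sketch mirrors the Appendix C names (IR, NPC, A, B, ALUOutput); the instruction encoding as a tuple is purely illustrative:

```python
# Toy trace of the per-stage register transfers for ADD R3, R1, R2.
mem  = {0: ("ADD", 3, 1, 2)}          # instruction memory: (opcode, rd, rs, rt)
regs = [0, 10, 32, 0]                 # R0..R3; R0 is hardwired to 0

pc = 0
ir = mem[pc]                          # IF:  IR  <- Mem[PC]
npc = pc + 4                          #      NPC <- PC + 4
op, rd, rs, rt = ir
A, B = regs[rs], regs[rt]             # ID:  A <- Regs[rs]; B <- Regs[rt]
alu_output = A + B                    # EX:  ALUOutput <- A func B (func = add here)
pc = npc                              # MEM: PC <- NPC (no memory access for ALU ops)
regs[rd] = alu_output                 # WB:  Regs[rd] <- ALUOutput
print(regs[3], pc)                    # 42 4
```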
Slide 31: Simple RISC Pipeline
Clock cycle number (time):

Instruction    1   2   3   4    5    6    7    8    9
n              IF  ID  EX  MEM  WB
n+1                IF  ID  EX   MEM  WB
n+2                    IF  ID   EX   MEM  WB
n+3                        IF   ID   EX   MEM  WB
n+4                             IF   ID   EX   MEM  WB
Slide 32: Performance Analysis in a Perfect World
Assume S stages in the pipeline and that a new instruction is initiated every cycle. Executing N instructions then takes N cycles to start the instructions plus (S - 1) cycles to flush the pipeline: TotalTime = N + (S - 1).
Example for S = 5 (previous slide) with N = 100 instructions: non-pipelined time = 100 × 5 = 500 cycles; pipelined time = 100 + (5 - 1) = 104 cycles. Speedup = 500 / 104 ≈ 4.8.
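The TotalTime formula, and the speedup left as a blank on the slide, sketched:

```python
def pipelined_cycles(n, s):
    # n cycles to issue the instructions plus (s - 1) cycles to drain the pipeline.
    return n + (s - 1)

N, S = 100, 5
unpipelined = N * S                  # every instruction takes all S stages serially
pipelined = pipelined_cycles(N, S)
print(unpipelined, pipelined, round(unpipelined / pipelined, 2))   # 500 104 4.81
```

As N grows, the (S - 1) drain cost is amortized and the speedup approaches the pipeline depth S, which is the ideal case.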
Slide 33: Implementing Pipelines (see Fig. C.4)
Slide 34: Pipeline Example with a Problem (like A.5)

Instruction          1   2   3   4   5   6   7   8   9
DADD R1, R2, R3      IM  ID  EX  DM  WB
DSUB R4, R1, R5          IM  ID  EX  DM  WB
AND  R6, R1, R7              IM  ID  EX  DM  WB
OR   R8, R1, R9                  IM  ID  EX  DM  WB
XOR  R10, R1, R11                    IM  ID  EX  DM  WB
Slide 35: Inserting Pipeline Registers into the Data Path (Fig. A.18)
Slide 36: The Major Hurdle of Pipelining
Consider executing the code below:
DADD R1, R2, R3   /* R1  ← R2 + R3 */
DSUB R4, R1, R5   /* R4  ← R1 - R5 */
AND  R6, R1, R7   /* R6  ← R1 & R7 */
OR   R8, R1, R9   /* R8  ← R1 | R9 */
XOR  R10, R1, R11 /* R10 ← R1 ^ R11 */
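A minimal check that flags the RAW hazards in this sequence. It assumes, as in the classic 5-stage pipeline with a register file written in the first half of WB and read in the second half of ID, that only the two instructions immediately following a write can read a stale value:

```python
# The sequence above, as (opcode, destination, sources) tuples.
code = [
    ("DADD", "R1",  ("R2", "R3")),
    ("DSUB", "R4",  ("R1", "R5")),
    ("AND",  "R6",  ("R1", "R7")),
    ("OR",   "R8",  ("R1", "R9")),
    ("XOR",  "R10", ("R1", "R11")),
]

# Without forwarding, an instruction 1 or 2 slots after a write sees stale data.
hazards = []
for i, (op, dest, srcs) in enumerate(code):
    for j in range(max(0, i - 2), i):
        prev_dest = code[j][1]
        if prev_dest in srcs:
            hazards.append((op, prev_dest, i - j))

for op, reg, dist in hazards:
    print(f"{op} reads {reg} written {dist} instruction(s) earlier")
```

Only DSUB and AND are flagged: by the time OR decodes, DADD's WB write has already happened in the first half of the same cycle.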
Slide 37: RISC Pipeline Problems
Clock cycle number (time):

Instruction          1   2   3   4   5   6   7   8   9
DADD R1, R2, R3      IM  ID  EX  DM  WB
DSUB R4, R1, R5          IM  ID  EX  DM  WB
AND  R6, R1, R7              IM  ID  EX  DM  WB
OR   R8, R1, R9                  IM  ID  EX  DM  WB
XOR  R10, R1, R11                    IM  ID  EX  DM  WB

So what's the problem?
Slide 38: Hazards
Data hazards: a value computed in one stage is not ready when it is needed in another stage of the pipeline. The simple solution is to stall until it is ready, but we can do better.
Control (branch) hazards.
Structural hazards: arise when resources are insufficient to completely overlap the instruction sequence, e.g., having two floating-point add units but needing to perform three adds simultaneously.
Slide 39: Performance of Pipelines with Stalls
Pipelining can thus be thought of as improving CPI or as improving CCT; the next slides develop the relationship equations both ways.
Slide 40: Performance Equations with Stalls
Ignoring the overhead of pipelining, the speedup is the ratio of unpipelined to pipelined CPI. Special case: if we assume every instruction takes the same number of cycles, i.e., CPI is a constant equal to the depth of the pipeline, then the speedup is the pipeline depth divided by (1 + stall cycles per instruction).
Slide 41: Performance Equations with Stalls (continued)
Alternatively, focus on the improvement in CCT, then simplify using the formula for CCT_pipelined.
Slide 42: Performance Equations with Stalls (continued)
Simplifying using the formulas for CCT_pipelined, we obtain the speedup in terms of pipeline depth and stall cycles per instruction.
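The equations on slides 40-42 were embedded images that did not survive extraction; a hedged reconstruction, following the standard Appendix C forms these slides track, is:

```latex
\begin{aligned}
\text{Speedup} &= \frac{\text{CPI}_{\text{unpipelined}} \times \text{CCT}_{\text{unpipelined}}}
                       {\text{CPI}_{\text{pipelined}} \times \text{CCT}_{\text{pipelined}}} \\[4pt]
\text{CPI}_{\text{pipelined}} &= 1 + \text{pipeline stall cycles per instruction} \\[4pt]
\text{CCT}_{\text{pipelined}} &= \frac{\text{CCT}_{\text{unpipelined}}}{\text{pipeline depth}} \\[4pt]
\text{Speedup} &= \frac{1}{1 + \text{pipeline stall cycles per instruction}}
                  \times \text{pipeline depth}
\end{aligned}
```

With zero stalls the speedup collapses to the pipeline depth, matching the perfect-world analysis on slide 32.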
Slide 43: Structural Hazards
When a combination of instructions cannot be accommodated because of resource conflicts, we call it a structural hazard. Examples: a single-ported memory (what is a dual-ported memory, anyway?); one write port on the register file; a single floating-point adder.
A stall in the pipeline is frequently called a pipeline bubble, or just a bubble; a bubble floats through the pipeline occupying space.
Slide 44: Example Structural Hazard (Fig. C.4)
Slide 45: Pipeline Stalled for a Structural Hazard
Clock cycle number (time):

Instruction   1   2   3   4      5      6    7    8    9
n             IF  ID  EX  MEM    WB
n+1               IF  ID  EX     MEM*   WB
n+2                   IF  ID     EX     MEM  WB
n+3                       stall  IF     ID   EX   MEM  WB
n+4                              stall  IF   ID   EX   ...

MEM: a memory cycle that is a load or store. MEM*: a memory cycle that is not a load or store.
Slide 46: Data Hazards
Slide 47: Data Hazard
Clock cycle number (time):

Instruction          1   2   3   4   5   6   7   8   9
DADD R1, R2, R3      IM  ID  EX  DM  WB
DSUB R4, R1, R5          IM  ID  EX  DM  WB
AND  R6, R1, R7              IM  ID  EX  DM  WB
OR   R8, R1, R9                  IM  ID  EX  DM  WB
XOR  R10, R1, R11                    IM  ID  EX  DM  WB

IM: instruction memory; DM: data memory.
Slide 48: Figure C.6
Slide 49: Minimizing Data Hazard Stalls by Forwarding
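The forwarding idea can be sketched as a simple operand-source check. The latch names (`EX/MEM.ALUOutput`, `MEM/WB.value`) follow the usual Appendix C naming, and the function is an illustration of the selection logic, not the actual hardware control:

```python
def forward_source(src, ex_mem_rd, mem_wb_rd):
    """Return which pipeline latch supplies operand `src` (names are illustrative)."""
    if src == ex_mem_rd:
        return "EX/MEM.ALUOutput"   # result computed one cycle ago: forward it
    if src == mem_wb_rd:
        return "MEM/WB.value"       # result computed two cycles ago: forward it
    return "register file"          # no recent writer: normal read

# DSUB R4, R1, R5 right after DADD R1, ...: R1 comes from the EX/MEM latch.
print(forward_source("R1", ex_mem_rd="R1", mem_wb_rd=None))
# AND R6, R1, R7 two instructions after DADD: R1 comes from the MEM/WB latch.
print(forward_source("R1", ex_mem_rd="R4", mem_wb_rd="R1"))
```

With these two bypass paths, the DADD/DSUB/AND sequence from slide 47 runs without any stall cycles.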
Slide 50: Forwarding (Fig. C.7)
Slide 51: Forwarding of Operands for Stores (Fig. C.8)
Slide 52: Control Hazards
Slide 53: Figure C.18 — The States in a 2-Bit Prediction Scheme
Copyright © 2011, Elsevier Inc. All rights reserved.
By using 2 bits rather than 1, a branch that strongly favors taken or not taken, as many branches do, will be mispredicted less often than with a 1-bit predictor. The 2 bits encode the four states in the system. The 2-bit scheme is actually a specialization of a more general scheme that keeps an n-bit saturating counter for each entry in the prediction buffer. With an n-bit counter, the counter can take on values between 0 and 2^n - 1: when the counter is greater than or equal to one-half of its maximum value (2^(n-1)), the branch is predicted as taken; otherwise, it is predicted as untaken. Studies of n-bit predictors have shown that 2-bit predictors do almost as well, so most systems rely on 2-bit branch predictors rather than the more general n-bit predictors.
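Figure C.18's four states behave like a 2-bit saturating counter; a small sketch (the class name and state encoding are illustrative):

```python
class TwoBitPredictor:
    """2-bit saturating counter: 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, state=0):
        self.state = state          # 0 = strongly not taken ... 3 = strongly taken

    def predict(self):
        return self.state >= 2      # taken iff the counter is in its upper half

    def update(self, taken):
        # Saturating increment/decrement toward the actual outcome.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
for taken in (True, True, False, True):   # e.g. a loop branch, mostly taken
    print(p.predict(), taken)
    p.update(taken)
print(p.state)   # ends in state 2 (weakly taken)
```

Note how the single not-taken outcome only nudges the counter from 3-ish territory toward 1 rather than flipping the prediction outright; that hysteresis is exactly what a 1-bit predictor lacks.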
Slide 54: Pop Quiz
Suppose that your application is 60% parallelizable: what is the overall speedup in going from 1 core to 2? Assuming power and frequency are linearly related, how is the dynamic power affected by the improvement?
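For the first quiz question, Amdahl's law gives the answer directly (the dynamic-power half is left for discussion):

```python
def amdahl_speedup(parallel_fraction, n_cores):
    # Amdahl's law: the serial fraction is untouched; the rest scales by n_cores.
    return 1 / ((1 - parallel_fraction) + parallel_fraction / n_cores)

print(round(amdahl_speedup(0.6, 2), 3))   # 1.429
```

So even a perfectly efficient second core yields only about a 1.43x speedup, because the 40% serial fraction dominates.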
Slide 55: Plan of Attack
Chapter reading plan: Chapter 1; Appendix C (pipeline review); Appendix B (cache review); Chapter 2 (memory hierarchy); Appendix A (ISA review, not really); Chapter 3 (instruction-level parallelism, ILP); Chapter 4 (data-level parallelism); Chapter 5 (thread-level parallelism); Chapter 6 (warehouse-scale computing); other appendices sprinkled in.
Website: lectures, HW, links, errata(?), Moodle, https://dropbox.cse.sc.edu/ (CEC login/password).
Systems: SimpleScalar (pipeline), Beowulf cluster (MPI), GTX (multithreaded).