
Presentation transcript:

Figure 8.1. Basic idea of instruction pipelining: (a) sequential execution of instructions I1, I2, I3; (b) hardware organization, with an instruction fetch unit and an execution unit separated by interstage buffer B1; (c) pipelined execution over clock cycles 1 to 4, in which the fetch of each instruction overlaps the execution of the previous one.
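To put numbers on the figure, the cycle counts work out as follows (a simple sketch, assuming one clock cycle per stage and no stalls, which the figure idealizes):

    T_{\mathrm{sequential}} = n \cdot k \qquad\qquad T_{\mathrm{pipelined}} = k + (n - 1)

For the two-stage Fetch/Execute organization of Figure 8.1, with n = 3 instructions and k = 2 stages, sequential execution takes 3 × 2 = 6 cycles, while pipelined execution completes after 2 + (3 − 1) = 4 cycles, which is the four clock cycles shown in part (c).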

(b) Position of the source and result registers in the processor pipeline; E denotes the Execute (ALU) stage.

Figure 8.9. Branch timing: (a) branch address computed in the Execute stage; (b) branch address computed in the Decode stage. The X entries mark instruction fetches that are discarded when the branch is taken.
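With the four-stage F, D, E, W pipeline used here, the two cases in Figure 8.9 translate into a branch penalty (a small worked count, not stated explicitly on the slide):

    \text{penalty}_{\mathrm{Execute}} = 2 \text{ cycles} \qquad\qquad \text{penalty}_{\mathrm{Decode}} = 1 \text{ cycle}

In case (a) the target address is known only at the end of the Execute stage, so both instructions fetched behind the branch are discarded; in case (b) it is known after Decode, so only one fetched instruction is lost.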

Figure 8.13. Execution timing showing the branch delay slot being filled during the last two passes through the loop: Decrement, Branch, Shift (delay slot) while the branch is taken, then Decrement, Branch, Shift (delay slot), Add on the final pass when the branch is not taken.

Figure 8.14. Timing when a branch decision has been incorrectly predicted as not taken: I1 (Compare) and I2 (Branch>0) proceed normally, but the speculatively fetched I3 and I4 are discarded (marked X) and fetching resumes at Ik.

Figure. State-machine representation of branch prediction algorithms: (a) a 2-state algorithm with states BT (branch taken) and BNT (branch not taken); (b) a 4-state algorithm with states ST (strongly taken), LT (likely taken), LNT (likely not taken), and SNT (strongly not taken).
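As a concrete reading of the 4-state algorithm, the sketch below models one predictor entry in C as a saturating two-bit counter. The state names follow the figure, but the exact transition policy and the helper names predict/update are assumptions for illustration, not the hardware's actual logic; the 2-state algorithm of part (a) is the same idea restricted to BT/BNT, changing state on every misprediction.

    #include <stdbool.h>

    /* One branch-prediction entry, states as in part (b) of the figure:
     * SNT = strongly not taken, LNT = likely not taken,
     * LT  = likely taken,       ST  = strongly taken.                  */
    enum bp_state { SNT, LNT, LT, ST };

    /* Predict taken when the entry is in one of the "taken" states. */
    static bool predict(enum bp_state s)
    {
        return s == LT || s == ST;
    }

    /* After the outcome is known, move one state toward the observed
     * behaviour, saturating at SNT and ST.                            */
    static enum bp_state update(enum bp_state s, bool taken)
    {
        if (taken)
            return (s == ST) ? ST : (enum bp_state)(s + 1);
        else
            return (s == SNT) ? SNT : (enum bp_state)(s - 1);
    }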

Figure. Equivalent operations using complex and simple addressing modes: (a) with the complex addressing mode, a single Load computes X + [R1], fetches [X + [R1]], and finally obtains [[X + [R1]]]; (b) with simple addressing modes, an Add followed by two Loads produces the same result, with the intermediate value forwarded between instructions.
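The two memory accesses implied by the complex mode can be seen in a loose C analogy (this is only an illustration of the address arithmetic, not the processor's semantics; the function name and types are invented for the sketch):

    #include <stdint.h>

    /* Complex mode: compute X + [R1], then dereference twice, all in one
     * "instruction".  The Add + Load + Load sequence of part (b) performs
     * the same three steps as separate instructions.                     */
    int complex_mode_load(intptr_t x, intptr_t r1)
    {
        int **p = (int **)(x + r1);   /* address arithmetic: X + [R1]     */
        int  *q = *p;                 /* first memory access: [X + [R1]]  */
        return *q;                    /* second access: [[X + [R1]]]      */
    }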

Figure 8.18. Datapath modified for pipelined execution, with interstage buffers at the input and output of the ALU.
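The interstage buffers in Figure 8.18 are essentially wide registers that carry an instruction's operands, destination, and control information from one stage to the next. A minimal C sketch of what the buffer feeding the ALU might hold is shown below; the struct and field names are hypothetical, chosen only to illustrate the idea:

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical contents of the buffer between the Decode stage and the
     * ALU input; real designs carry more control signals for later stages. */
    struct decode_execute_buffer {
        uint32_t operand_a;   /* first ALU input, read from the register file  */
        uint32_t operand_b;   /* second ALU input or sign-extended immediate   */
        uint8_t  dest_reg;    /* destination register number, carried forward  */
        uint8_t  alu_op;      /* operation the ALU should perform              */
        bool     is_load;     /* the ALU result is a load address              */
        bool     is_store;    /* the ALU result is a store address             */
    };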

Figure. Instruction completion in program order for I1 (Fadd), I2 (Add), I3 (Fsub), and I4 (Sub), where the floating-point Execute step spans several cycles (E1A, E1B, E1C): (a) delayed write; (b) using temporary registers.

Figure. Main building blocks of the UltraSPARC II processor.

ADDcc R3,R4,R7        R7 ← [R3] + [R4], set condition codes
BRZ,a Label           Branch if zero, set annul bit to 1
FCMP F1,F5            FP: compare [F1] and [F5]
FADD F2,F3,F6         FP: F6 ← [F2] + [F3]
FMOVs F3,F4           Move single-precision operand from F3 to F4
...
Label: FSUB F2,F3,F6  FP: F6 ← [F2] − [F3]
LDSW R3,R4,R7         Load single word at location [R3] + [R4] into R7
...

(a) Program fragment

ADDcc R3,R4,R7 / BRZ,a Label / FCMP F1,F5 / FSUB F2,F3,F6

(b) Instruction grouping, branch taken

ADDcc R3,R4,R7 / BRZ,a Label / FCMP F1,F5 / FADD F2,F3,F6

(c) Instruction grouping, branch not taken

Figure. Example of instruction grouping.

Figure. Execution flow among the main building blocks: main memory, external cache, instruction cache and instruction buffer on the instruction path, data cache and load/store unit on the data path, the internal registers and execution units, and an elastic interface queue.

Table 8.1 Examples of SPARC instructions.

Instruction              Description
ADD R5,R6,R7             Integer add: R7 ← [R5] + [R6]
ADDcc R2,R3,R5           R5 ← [R2] + [R3], set condition code flags
SUB R5,Imm,R7            Integer subtract: R7 ← [R5] − Imm (sign-extended)
AND R3,Imm,R5            Bitwise AND: R5 ← [R3] AND Imm (sign-extended)
XOR R3,R4,R5             Bitwise Exclusive-OR: R5 ← [R3] XOR [R4]
FADDq F4,F12,F16         Floating-point add, quad precision: F16 ← [F4] + [F12]
FSUBs F2,F5,F7           Floating-point subtract, single precision: F7 ← [F2] − [F5]
FDIVs F5,F10,F18         Floating-point divide, single precision: F18 ← [F5] / [F10]
LDSW R3,R5,R7            Load the 32-bit word at [R3] + [R5], sign-extended to a 64-bit value, into R7
LDX R3,R5,R7             Load the 64-bit extended word at [R3] + [R5] into R7
LDUB R4,Imm,R5           Load the unsigned byte at [R4] + Imm into the least significant 8 bits of R5, filling all higher-order bits with 0s
STW R3,R6,R12            Store the word in register R3 into memory location [R6] + [R12]
LDF R5,R6,F3             Load the 32-bit word at address [R5] + [R6] into floating-point register F3
LDDF R5,R6,F8            Load the doubleword (two 32-bit words) at address [R5] + [R6] into floating-point registers F8 and F9
STF F14,R6,Imm           Store the word in floating-point register F14 into memory location [R6] + Imm
BLE icc,Label            Test the icc flags and branch to Label if less than or equal to zero
BZ,pn xcc,Label          Test the xcc flags and branch to Label if equal to zero; the branch is predicted not taken
BGT,a,pt icc,Label       Test the 32-bit integer condition codes and branch to Label if greater than zero; set the annul bit; the branch is predicted taken
FBNE,pn Label            Test the floating-point status flags and branch if not equal; the annul bit is set to zero and the branch is predicted not taken
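The load instructions in Table 8.1 differ mainly in how the fetched value is extended to the 64-bit register width. The C fragment below mimics those extension rules (the function names are invented for illustration; only the sign- and zero-extension behaviour is taken from the table):

    #include <stdint.h>

    uint64_t ldsw_value(int32_t word)  { return (uint64_t)(int64_t)word; } /* LDSW: sign-extend 32 -> 64 */
    uint64_t ldub_value(uint8_t byte)  { return (uint64_t)byte; }          /* LDUB: zero-extend  8 -> 64 */
    uint64_t ldx_value(uint64_t dword) { return dword; }                   /* LDX : full 64-bit word     */

For example, ldsw_value(-1) yields 0xFFFFFFFFFFFFFFFF, while ldub_value(0xFF) yields 0x00000000000000FF, matching the table's sign-extended versus zero-filled descriptions.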