“Iron Law” of Processor Performance

Slides:



Advertisements
Similar presentations
Morgan Kaufmann Publishers The Processor
Advertisements

Pipeline Computer Organization II 1 Hazards Situations that prevent starting the next instruction in the next cycle Structural hazards – A required resource.
CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)
Review: Pipelining. Pipelining Laundry Example Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold Washer takes 30 minutes Dryer.
Instruction-Level Parallelism (ILP)
Chapter 2 Pipelined Processors
Hazard Analysis Procedure  Memory Data Dependences ­Output Dependence (WAW) ­Anti Dependence (WAR) ­True Data Dependence (RAW)  Register Data Dependences.
1 Lecture 17: Basic Pipelining Today’s topics:  5-stage pipeline  Hazards and instruction scheduling Mid-term exam stats:  Highest: 90, Mean: 58.
ENEE350 Ankur Srivastava University of Maryland, College Park Based on Slides from Mary Jane Irwin ( )
©UCB CS 162 Computer Architecture Lecture 3: Pipelining Contd. Instructor: L.N. Bhuyan
1 Stalling  The easiest solution is to stall the pipeline  We could delay the AND instruction by introducing a one-cycle delay into the pipeline, sometimes.
Computer ArchitectureFall 2007 © October 24nd, 2007 Majd F. Sakr CS-447– Computer Architecture.
Computer ArchitectureFall 2007 © October 22nd, 2007 Majd F. Sakr CS-447– Computer Architecture.
DLX Instruction Format
Pipelining - II Adapted from CS 152C (UC Berkeley) lectures notes of Spring 2002.
Appendix A Pipelining: Basic and Intermediate Concepts
Computer ArchitectureFall 2008 © October 6th, 2008 Majd F. Sakr CS-447– Computer Architecture.
ENGS 116 Lecture 51 Pipelining and Hazards Vincent H. Berk September 30, 2005 Reading for today: Chapter A.1 – A.3, article: Patterson&Ditzel Reading for.
Pipelining Basics Assembly line concept An instruction is executed in multiple steps Multiple instructions overlap in execution A step in a pipeline is.
Instruction Sets and Pipelining Cover basics of instruction set types and fundamental ideas of pipelining Later in the course we will go into more depth.
Pipeline Hazard CT101 – Computing Systems. Content Introduction to pipeline hazard Structural Hazard Data Hazard Control Hazard.
1 Pipelining Reconsider the data path we just did Each instruction takes from 3 to 5 clock cycles However, there are parts of hardware that are idle many.
Pipelining. 10/19/ Outline 5 stage pipelining Structural and Data Hazards Forwarding Branch Schemes Exceptions and Interrupts Conclusion.
Chapter 2 Summary Classification of architectures Features that are relatively independent of instruction sets “Different” Processors –DSP and media processors.
1 Appendix A Pipeline implementation Pipeline hazards, detection and forwarding Multiple-cycle operations MIPS R4000 CDA5155 Spring, 2007, Peir / University.
EEL5708 Lotzi Bölöni EEL 5708 High Performance Computer Architecture Pipelining.
CSE 340 Computer Architecture Summer 2014 Basic MIPS Pipelining Review.
CS.305 Computer Architecture Enhancing Performance with Pipelining Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005, and from.
CMPE 421 Parallel Computer Architecture
1 Designing a Pipelined Processor In this Chapter, we will study 1. Pipelined datapath 2. Pipelined control 3. Data Hazards 4. Forwarding 5. Branch Hazards.
Electrical and Computer Engineering University of Cyprus LAB3: IMPROVING MIPS PERFORMANCE WITH PIPELINING.
ECE3056A Architecture, Concurrency, and Energy Lecture: Pipelined Microarchitectures Prof. Moinuddin Qureshi Slides adapted from: Prof. Mutlu (CMU)
1 Pipelining Part I CS What is Pipelining? Like an Automobile Assembly Line for Instructions –Each step does a little job of processing the instruction.
CECS 440 Pipelining.1(c) 2014 – R. W. Allison [slides adapted from D. Patterson slides with additional credits to M.J. Irwin]
CSIE30300 Computer Architecture Unit 04: Basic MIPS Pipelining Hsin-Chou Chi [Adapted from material by and
ECE/CS 552: Pipelining Instructor: Mikko H Lipasti Fall 2010 University of Wisconsin-Madison Lecture notes based on set created by Mark Hill and John P.
Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam.
Oct. 18, 2000Machine Organization1 Machine Organization (CS 570) Lecture 4: Pipelining * Jeremy R. Johnson Wed. Oct. 18, 2000 *This lecture was derived.
Pipelining Example Laundry Example: Three Stages
CMSC 611: Advanced Computer Architecture Pipelining Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted.
Lecture 2: Pipelining and Superscalar Review. Motivation: Increase throughput with little increase in cost (hardware, power, complexity, etc.) Bandwidth.
EE524/CptS561 Jose G. Delgado-Frias 1 Processor Basic steps to process an instruction IFID/OFEXMEMWB Instruction Fetch Instruction Decode / Operand Fetch.
11 Pipelining Kosarev Nikolay MIPT Oct, Pipelining Implementation technique whereby multiple instructions are overlapped in execution Each pipeline.
CMSC 611: Advanced Computer Architecture Pipelining Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted.
ECE/CS 552: Pipeline Hazards © Prof. Mikko Lipasti Lecture notes based in part on slides created by Mark Hill, David Wood, Guri Sohi, John Shen and Jim.
Real-World Pipelines Idea –Divide process into independent stages –Move objects through stages in sequence –At any given times, multiple objects being.
Computer Architecture Lecture 7: Pipelining Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 1/29/2014.
Real-World Pipelines Idea Divide process into independent stages
CDA3101 Recitation Section 8
Pipelining: Hazards Ver. Jan 14, 2014
ARM Organization and Implementation
CMSC 611: Advanced Computer Architecture
Single Clock Datapath With Control
Appendix C Pipeline implementation
ECE232: Hardware Organization and Design
\course\cpeg323-08F\Topic6b-323
Chapter 4 The Processor Part 2
\course\cpeg323-05F\Topic6b-323
Pipeline control unit (highly abstracted)
The Processor Lecture 3.6: Control Hazards
Instruction Execution Cycle
Pipeline control unit (highly abstracted)
Pipeline Control unit (highly abstracted)
Pipelining Basic concept of assembly line
Pipelining Basic concept of assembly line
Pipelining Appendix A and Chapter 3.
Introduction to Computer Organization and Architecture
A relevant question Assuming you’ve got: One washer (takes 30 minutes)
Spring 2010 Ilam University
Pipelining Hazards.
Presentation transcript:

“Iron Law” of Processor Performance Wall-Clock Time Processor Performance = Program Instructions Cycles Program Instruction Time Cycle (code size) (CPI) (cycle time) = X X Architecture  Implementation  Realization Compiler Designer Processor Designer Chip Designer

Pipelined Design Motivation: Increase throughput with little increase in hardware Bandwidth or Throughput = Performance Bandwidth (BW) = no. of tasks/unit time For a system that operates on one task at a time: BW = 1/ latency BW can be increased by pipelining if many operands exist which need the same operation, i.e. many repetitions of the same task are to be performed. Latency required for each task remains the same or may even increase slightly.

Pipeline Illustrated:

Performance Model Starting from an unpipelined version with propagation delay T and BW = 1/T Ppipelined=BWpipelined = 1 / (T/ k +S ) where S = delay through latch k-stage pipelined unpipelined T T/k S T/k S

Hardware Cost Model Starting from an unpipelined version with hardware cost G Costpipelined = kL + G where L = cost of adding each latch, and k = number of stages k-stage pipelined unpipelined G G/k L G/k L

Cost/Performance Trade-off [Peter M. Kogge, 1981] C/P Cost/Performance: C/P = [Lk + G] / [1/(T/k + S)] = (Lk + G) (T/k + S) = LT + GS + LSk + GT/k Optimal Cost/Performance: find min. C/P w.r.t. choice of k k

“Optimal” Pipeline Depth (kopt) x104 G=175, L=41, T=400, S=22 Cost/Performance Ratio (C/P) G=175, L=21, T=400, S=11 Pipeline Depth k

Pipelining Idealism Uniform Suboperations The operation to be pipelined can be evenly partitioned into uniform-latency suboperations Repetition of Identical Operations The same operations are to be performed repeatedly on a large number of different inputs Repetition of Independent Operations All the repetitions of the same operation are mutually independent, i.e. no data dependence and no resource conflicts Good Examples: automobile assembly line floating-point multiplier instruction pipeline???

Instruction Pipeline Design Uniform Suboperations ... NOT!  balance pipeline stages - stage quantization to yield balanced stages - minimize internal fragmentation (some waiting stages) Identical operations ... NOT!  unifying instruction types - coalescing instruction types into one “multi-function” pipe - minimize external fragmentation (some idling stages) Independent operations ... NOT!  resolve data and resource hazards - inter-instruction dependency detection and resolution - minimize performance lose

The Generic Instruction Cycle The “computation” to be pipelined Instruction Fetch (IF) Instruction Decode (ID) Operand(s) Fetch (OF) Instruction Execution (EX) Operand Store (OS) Update Program Counter (PC)

The GENERIC Instruction Pipeline (GNR) Based on Obvious Subcomputations:

Balancing Pipeline Stages Without pipelining Tcyc TIF+TID+TOF+TEX+TOS = 31 Pipelined Tcyc  max{TIF, TID, TOF, TEX, TOS} = 9 Speedup= 31 / 9 Can we do better in terms of either performance or efficiency? IF ID OF EX OS TIF= 6 units TID= 2 units TID= 9 units TEX= 5 units TOS= 9 units

Balancing Pipeline Stages Two Methods for Stage Quantization: Merging of multiple subcomputations into one. Subdividing a subcomputation into multiple subcomputations. Current Trends: Deeper pipelines (more and more stages). Multiplicity of different (subpipelines). Pipelining of memory access (tricky).

Granularity of Pipeline Stages Coarser-Grained Machine Cycle: 4 machine cyc / instruction cyc Finer-Grained Machine Cycle: 11 machine cyc /instruction cyc TIF&ID= 8 units TID= 9 units TEX= 5 units TOS= 9 units Tcyc= 3 units

Hardware Requirements Logic needed for each pipeline stage Register file ports needed to support all the stages Memory accessing ports needed to support all the stages

Pipeline Examples MIPS R2000/R3000 AMDAHL 470V/7 IF RD ID OF EX OS ALU MEM WB 1 2 3 4 5 PC GEN . PC GEN 1 Cache Read PC GEN . Decode Add GEN Read REG EX 1 EX 2 Write Result 2 3 4 5 6 7 8 9 10 Check Result 11 PC GEN . 12

Unifying Instruction Types Procedure: Analyze the sequence of register transfers required by each instruction type. Find commonality across instruction types and merge them to share the same pipeline stage. If there exists flexibility, shift or reorder some register transfers to facilitate further merging.

Coalescing Resource Requirements The 6-stage TYPICAL (TYP) pipeline:

Interface to Memory Subsystem

Pipeline Interface to Register File:

6-stage TYP Pipeline IF

ALU Instruction Flow Path IF

Load Instruction Flow Path IF

Store Instruction Flow Path IF What is wrong in this figure?

Branch Instruction Flow Path IF

Pipeline Resource Diagram t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 IF I1 I2 I3 I4 I5 I6 I7 I8 I9 I10 I11 ID RD ALU MEM WB

Pipelining: Steady State Insti IF ID RD ALU MEM WB Insti+1 IF ID RD ALU MEM WB Insti+2 IF ID RD ALU MEM WB Insti+3 IF ID RD ALU MEM Insti+4 IF ID RD ALU IF ID RD IF ID F

Instruction Dependencies Data Dependence True dependence (RAW) Instruction must wait for all required input operands Anti-Dependence (WAR) Later write must not clobber a still-pending earlier read Output dependence (WAW) Earlier write must not clobber an already-finished later write Control Dependence (aka Procedural Dependence) Conditional branches cause uncertainty to instruction sequencing Instructions following a conditional branch depends on the resolution of the branch instruction (more exact definition later)

Example: Quick Sort on MIPS R2000 bge $10, $9, $36 mul $15, $10, 4 addu $24, $6, $15 lw $25, 0($24) mul $13, $8, 4 addu $14, $6, $13 lw $15, 0($14) bge $25, $15, $36 $35: addu $10, $10, 1 . . . $36: addu $11, $11, -1 # for (;(j<high)&&(array[j]<array[low]);++j); # $10 = j; $9 = high; $6 = array; $8 = low

Instruction Dependences and Pipeline Hazards A true dependence between Sequential Code Semantics two instructions may only involve one subcomputation of each instruction. i1: xxxx i1 i1: i2: xxxx i2 i2: i3: xxxx i3 i3: The implied sequential precedences are overspecifications. It is sufficient but not necessary to ensure program correctness.

Necessary Conditions for Data Hazards stage X j:rk_ Reg Write j:rk_ Reg Write j:_rk Reg Read iOj iAj iDj stage Y Reg Write Reg Read Reg Write i:rk_ i:_rk i:rk_ WAW Hazard WAR Hazard RAW Hazard dist(i,j)  dist(X,Y)  Hazard!! dist(i,j) > dist(X,Y)  Safe dist(i,j)  dist(X,Y)  ?? dist(i,j) > dist(X,Y)  ??

Hazards due to Memory Data Dependences Pipe Stage ALU Inst. Load inst. Store inst. Branch inst. 1. IF I-cache PC<PC+4 2. ID decode 3. RD read reg. 4. ALU ALU op. addr. gen. cond. gen. 5. MEM ----- read mem. write mem. PC<-br. addr. 6. WB write reg.

Hazards due to Register Data Dependences Pipe Stage ALU Inst. Load inst. Store inst. Branch inst. 1. IF I-cache PC<PC+4 2. ID decode 3. RD read reg. 4. ALU ALU op. addr. gen. cond. gen. 5. MEM ----- read mem. write mem. PC<-br. addr. 6. WB write reg.

Hazards due to Control Dependences Pipe Stage ALU Inst. Load inst. Store inst. Branch inst. 1. IF I-cache PC<PC+4 2. ID decode 3. RD read reg. 4. ALU ALU op. addr. gen. cond. gen. 5. MEM ----- read mem. write mem. PC<-br. addr. 6. WB write reg.