Chapter Six (© 1998 Morgan Kaufmann Publishers)
Pipelining
Improve performance by increasing instruction throughput.
Ideal speedup is the number of stages in the pipeline. Do we achieve this?
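A rough illustration (the numbers are ours, not from the slide): if each of the 5 stages takes 200 ps, a single instruction still needs about 1000 ps, but n instructions finish in roughly (n + 4) × 200 ps instead of n × 1000 ps. The speedup 5n / (n + 4) approaches 5, the number of stages, only for long instruction sequences and only if the pipeline never stalls.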
Pipelining
What makes it easy?
– all instructions are the same length
– just a few instruction formats
– memory operands appear only in loads and stores
What makes it hard?
– structural hazards: suppose we had only one memory
– control hazards: need to worry about branch instructions
– data hazards: an instruction depends on a previous instruction
We will build a simple pipeline and look at these issues.
We will talk about modern processors and what really makes it hard:
– exception handling
– trying to improve performance with out-of-order execution, etc.
Basic Idea
What do we need to add to actually split the datapath into stages?
Pipelined Datapath
Can you find a problem even if there are no dependencies? What instructions can we execute to manifest the problem?
Corrected Datapath
Instruction fetch stage of lw
Instruction decode stage of lw
Execution stage of lw
Memory stage of lw
Write back stage of lw
Execution stage of sw
Memory stage of sw
Write back stage of sw
Instructions flowing in the pipeline (continued over the following figures)
Graphically Representing Pipelines
Can help with answering questions like:
– how many cycles does it take to execute this code?
– what is the ALU doing during cycle 4?
– use this representation to help understand datapaths
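As a small illustration (a sketch assuming an ideal pipeline with no stalls, not taken from the slides), the cycle count these diagrams answer can also be computed directly:

    /* Ideal pipeline timing: the first instruction needs one cycle per stage
       to drain through, and each later instruction finishes one cycle after
       the previous one (no stalls assumed). */
    unsigned pipeline_cycles(unsigned n_instructions, unsigned n_stages)
    {
        if (n_instructions == 0)
            return 0;
        return n_stages + (n_instructions - 1);
    }

    /* Example: 5 instructions in a 5-stage pipeline -> 9 cycles,
       matching the multiple-clock-cycle pipeline diagrams. */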
Pipeline Control
Can these operations be completed one stage earlier? Think about the critical path!
Why does the instruction memory (IM) have no control signal at all?
Why does the register file (RF) have only a write control?
Can the ALU result be written back in MEM? Consider the cost of an extra RF write port!
Pipeline design considerations
Simplification of the control mechanism
– units that are active in every clock cycle can be always enabled (no control signal)
– Ex: the instruction memory and the pipeline registers have no control signals
Minimization of power consumption
– explicit control for infrequent operations
– Ex: both read and write controls for the data memory
Cost considerations
– Ex: alternatives for writing the ALU result back: in MEM or in WB
Pipeline control
We have 5 stages. What needs to be controlled in each stage?
– Instruction Fetch and PC Increment
– Instruction Decode / Register Fetch
– Execution
– Memory Stage
– Write Back
How would control be handled in an automobile plant?
– a fancy control center telling everyone what to do?
– should we use a finite state machine?
– centralized or distributed?
Pipeline Control
Pass control signals along just like the data
– generate all control signals in the decode stage, similar to the single-cycle implementation
– pass the generated control signals along the pipeline and consume each one at its corresponding stage, similar to the multicycle implementation
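A minimal sketch of "pass control signals along just like the data" (the field names are illustrative, not the book's exact signal list): the control bits produced in ID simply ride in the pipeline registers and are peeled off stage by stage.

    /* Control bits generated in ID and carried forward with the data. */
    struct ex_ctrl  { unsigned reg_dst:1, alu_src:1, alu_op:2; };
    struct mem_ctrl { unsigned branch:1, mem_read:1, mem_write:1; };
    struct wb_ctrl  { unsigned reg_write:1, mem_to_reg:1; };

    struct id_ex_reg  { struct ex_ctrl ex; struct mem_ctrl m; struct wb_ctrl wb; /* ...data fields... */ };
    struct ex_mem_reg { struct mem_ctrl m; struct wb_ctrl wb; /* ...data fields... */ };
    struct mem_wb_reg { struct wb_ctrl wb; /* ...data fields... */ };

    /* Each stage uses its own group and passes the rest along:
       EX consumes id_ex.ex and copies m and wb into ex_mem, and so on. */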
Datapath with Control
Dependencies
Problem with starting the next instruction before the first has finished
– dependencies that point backward in time are data hazards
Software Solution
Have the compiler guarantee no hazards. Where do we insert the nops?
    sub $2, $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)
Insert nops after the sub:
    sub $2, $1, $3
    nop
    nop
    nop
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)
Problem: this really slows us down!
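A rough sketch of how a compiler or assembler could decide where nops are required (illustrative only; it assumes a 5-stage pipeline with no forwarding of any kind, which is why the sub above needs three nops):

    /* Returns how many nops must precede an instruction that reads rs and rt,
       given the destination registers of the previous three instructions
       (index 0 = most recent; register 0 means "writes nothing"). */
    int nops_needed(int rs, int rt, const int prev_dest[3])
    {
        for (int back = 0; back < 3; back++)        /* 1, 2 or 3 instructions back */
            if (prev_dest[back] != 0 &&
                (prev_dest[back] == rs || prev_dest[back] == rt))
                return 3 - back;                    /* closer producer -> more nops */
        return 0;
    }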
Forwarding
Use temporary results; don't wait for them to be written back
– ALU forwarding
– register file forwarding (latch-based register file) to handle a read and a write of the same register in the same cycle: read what you just wrote, through a transparent latch
What if this $2 were $13?
No forwarding datapath
With forwarding datapath
The control values for the forwarding multiplexors

Mux control   | Source | Explanation
ForwardA = 00 | ID/EX  | The first ALU operand comes from the register file
ForwardA = 10 | EX/MEM | The first ALU operand is forwarded from the prior ALU result
ForwardA = 01 | MEM/WB | The first ALU operand is forwarded from data memory or an earlier ALU result
ForwardB = 00 | ID/EX  | The second ALU operand comes from the register file
ForwardB = 10 | EX/MEM | The second ALU operand is forwarded from the prior ALU result
ForwardB = 01 | MEM/WB | The second ALU operand is forwarded from data memory or an earlier ALU result

MEM-hazard forwarding conditions:
if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd ≠ ID/EX.RegisterRs)
    and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) then ForwardA = 01
if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd ≠ ID/EX.RegisterRt)
    and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) then ForwardB = 01
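To make the selection logic concrete, here is a small C sketch of the whole forwarding unit for one operand (the EX-hazard case plus the MEM-hazard case above). The 00/10/01 encoding follows the table; the function and parameter names are ours, not the book's.

    enum { FWD_REG = 0, FWD_EXMEM = 2, FWD_MEMWB = 1 };   /* 00, 10, 01 */

    /* Returns the mux control for one ALU operand whose source register is rs. */
    int forward_select(int rs,
                       int exmem_regwrite, int exmem_rd,
                       int memwb_regwrite, int memwb_rd)
    {
        /* EX hazard: forward the result sitting in EX/MEM (the most recent one). */
        if (exmem_regwrite && exmem_rd != 0 && exmem_rd == rs)
            return FWD_EXMEM;
        /* MEM hazard: forward from MEM/WB only if EX/MEM is not already forwarding. */
        if (memwb_regwrite && memwb_rd != 0 && memwb_rd == rs)
            return FWD_MEMWB;
        /* No hazard: take the value read from the register file. */
        return FWD_REG;
    }
    /* ForwardA = forward_select(ID/EX.RegisterRs, ...); ForwardB uses RegisterRt. */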
The modified datapath resolves hazards via forwarding
Register file forwarding of $2 (through the transparent latch)
Forwarding and register write
The $4 operand matches simultaneously in the MEM stage and the WB stage.
Forward from the nearest stage, MEM (this follows the sequential programming model).
The WB stage still writes its value to the register file (RF).
Can't always forward
Load word can still cause a hazard:
– an instruction tries to read a register immediately after a load instruction that writes the same register
Thus we need a hazard detection unit to stall the dependent instruction.
(Reminder: the latch-based RF only lets an instruction read what is written in the same cycle — write, then read.)
Stalling
We can stall the pipeline by keeping an instruction in the same stage
Hazard Detection Unit
Stall by letting an instruction that won't write anything go forward
– on a stall, insert a NOP (bubble) while the load continues to the next stage
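The condition the hazard detection unit checks can be written down directly (a sketch; the field names mirror the pipeline-register notation used above):

    /* Load-use hazard: the instruction in ID needs a register that the
       load currently in EX will not produce until MEM.  Stall one cycle. */
    int load_use_stall(int idex_memread, int idex_rt,
                       int ifid_rs, int ifid_rt)
    {
        return idex_memread &&
               (idex_rt == ifid_rs || idex_rt == ifid_rt);
    }
    /* On a stall: keep the PC and IF/ID unchanged, and zero the control
       signals going into ID/EX so a bubble (nop) flows down the pipe. */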
Branch Hazards
When we decide to branch, other instructions are in the pipeline!
We are predicting branch not taken
– need to add hardware for flushing instructions if we are wrong
Flushing Instructions
Original datapath vs. a datapath optimized for branch performance: the branch is resolved earlier, reducing the branch delay from 3 cycles to 1.
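A rough sketch of predict-not-taken with flushing (illustrative names, not the slide's exact design): fetch falls through, and only when the branch turns out to be taken is the wrongly fetched instruction squashed.

    /* With the branch decided in ID only IF/ID is flushed (1-cycle penalty);
       deciding in MEM would flush the 3 instructions behind the branch. */
    void resolve_branch(int taken, unsigned target,
                        unsigned *pc, unsigned *ifid_instr)
    {
        if (taken) {
            *pc = target;        /* redirect fetch to the branch target        */
            *ifid_instr = 0;     /* 0x00000000 is a MIPS nop (sll $0, $0, 0)   */
        }
    }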
Instruction Flushing for Branch
Instruction Flushing for Branch (the and instruction is flushed)
Improving Performance
Try to avoid stalls! E.g., reorder these instructions:
    Before:               After:
    lw  $t0, 0($t1)       lw  $t0, 0($t1)
    lw  $t2, 4($t1)       lw  $t2, 4($t1)
    sw  $t2, 0($t1)       sw  $t0, 4($t1)
    sw  $t0, 4($t1)       sw  $t2, 0($t1)
Add a branch delay slot (delayed branch)
– the instruction immediately after a branch is always executed
– rely on the compiler to fill the slot with something useful
    Before:               After:
    add $2, $3, $4        beq $9, $10, 400
    beq $9, $10, 400      add $2, $3, $4      ; always executed
                          sub $11, $12, $13
Superscalar: start more than one instruction in the same cycle
Utilizing the branch delay slot (the compiler's task)
Final data/control path for hazard handling
Dynamic Branch Prediction
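Dynamic prediction guesses each branch's outcome from its recent history. A common scheme is a 2-bit saturating counter per branch-history-table entry; this sketch is illustrative (table size and indexing are our assumptions), not the specific design on the slide.

    /* 2-bit saturating counter: 0,1 predict not taken; 2,3 predict taken. */
    static unsigned char bht[1024];                 /* branch history table */

    int predict_taken(unsigned pc) { return bht[(pc >> 2) & 1023] >= 2; }

    void update_predictor(unsigned pc, int taken)
    {
        unsigned idx = (pc >> 2) & 1023;
        if (taken && bht[idx] < 3)        bht[idx]++;
        else if (!taken && bht[idx] > 0)  bht[idx]--;
    }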
Final data/control path for exception handling
1. flush the instructions in the pipeline
2. save the restart PC (PC + 4) and record the Cause
3. set the new PC (the exception handler address)
4. the overflowed instruction (in EX) becomes a NOP
Overflow!
Flushing (NOPs) … execution continues
Fig. 6.56: Complete datapath and control for Chapter 6
Superscalar Execution

Instruction type          | Pipe stages
ALU or branch instruction | IF ID EX MEM WB
Load or store instruction | IF ID EX MEM WB
ALU or branch instruction |    IF ID EX MEM WB
Load or store instruction |    IF ID EX MEM WB
ALU or branch instruction |       IF ID EX MEM WB
Load or store instruction |       IF ID EX MEM WB
ALU or branch instruction |          IF ID EX MEM WB
Load or store instruction |          IF ID EX MEM WB

One ALU/branch instruction and one load/store instruction are issued together each clock cycle.
Simple Superscalar Code Scheduling

Loop: lw   $t0, 0($s1)        # $t0 = array element ($s1 is I)
      addu $t0, $t0, $s2      # add scalar in $s2 (B)
      sw   $t0, 0($s1)        # store result
      addi $s1, $s1, -4       # decrement pointer
      bne  $s1, $zero, Loop   # branch if $s1 != 0

C equivalent: do { *I = *I + B; I = I - 4; } while (I != 0);

Scheduled for dual issue:
      ALU or branch instruction | Data transfer instruction | Clock cycle
Loop:                           | lw   $t0, 0($s1)          | 1
      addi $s1, $s1, -4         |                           | 2
      addu $t0, $t0, $s2        |                           | 3
      bne  $s1, $zero, Loop     | sw   $t0, 4($s1)          | 4

(The store offset becomes 4($s1) because the addi now executes before the sw.)
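A quick check (ours, not on the original slide): the five instructions of the loop body complete in 4 clock cycles, a CPI of 4/5 = 0.8. The two-issue machine's ideal CPI would be 0.5, so the dependences leave several issue slots empty.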
Loop Unrolling for Superscalar Pipelines

      ALU or branch instruction | Data transfer instruction | Clock cycle
Loop: addi $s1, $s1, -16        | lw   $t0, 0($s1)          | 1
                                | lw   $t1, 12($s1)         | 2
      addu $t0, $t0, $s2        | lw   $t2, 8($s1)          | 3
      addu $t1, $t1, $s2        | lw   $t3, 4($s1)          | 4
      addu $t2, $t2, $s2        | sw   $t0, 16($s1)         | 5
      addu $t3, $t3, $s2        | sw   $t1, 12($s1)         | 6
                                | sw   $t2, 8($s1)          | 7
      bne  $s1, $zero, Loop     | sw   $t3, 4($s1)          | 8

C equivalent (loop unrolled four times):
do { I = I - 16; *(I+16) = *(I+16) + B; *(I+12) = *(I+12) + B;
     *(I+8) = *(I+8) + B; *(I+4) = *(I+4) + B; } while (I != 0);
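A quick check (ours, not on the original slide): the unrolled body has 14 instructions (1 addi, 4 lw, 4 addu, 4 sw, 1 bne) and completes in 8 clock cycles, a CPI of 8/14 ≈ 0.57, noticeably better than the 0.8 of the rolled loop because more independent operations are available to pair up.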
Loop Unrolling
A superscalar has the hardware to perform calculations in parallel.
For the C source code:
    for (i = 100; i != 0; i--) { A[i] = A[i] + 1; }
the compiler can unroll the loop:
    for (i = 100; i != 0; i = i - 4) {
        A[i]   = A[i]   + 1;
        A[i-1] = A[i-1] + 1;
        A[i-2] = A[i-2] + 1;
        A[i-3] = A[i-3] + 1;
    }
On a single-issue processor the two versions behave the same, but on a superscalar the larger pool of independent operations provides a richer opportunity for parallel execution.
Dynamic Scheduling: dispatch add
Dynamic Scheduling: dispatch subi
Dynamic Scheduling: execute add
Dynamic Scheduling: execute subi
Dynamic Scheduling: write back add
Dynamic Scheduling: write back subi
Dynamic Scheduling
The hardware performs the scheduling
– hardware tries to find instructions to execute
– out-of-order execution is possible
– speculative execution and dynamic branch prediction
All modern processors are very complicated
– DEC Alpha 21264: 9-stage pipeline, 6-instruction issue
– PowerPC and Pentium: branch history table
– compiler technology is important
This class has given you the background you need to learn more.
Video: An Overview of Intel Pentium Processor (available from University Video Communications)
Figure 6.52: The performance consequences of single-cycle, multiple-cycle, and pipelined implementations
Figure 6.53: Basic relationship between the datapaths in Figure 6.52