Mehmet Can Vuran, Instructor, University of Nebraska-Lincoln. Acknowledgement: Overheads adapted from those provided by the authors of the textbook.

- Pipelining: overlapped instruction execution
- Hazards that limit pipelined performance gain
- Hardware/software implications of pipelining
- Influence of pipelining on instruction sets

- Circuit technology and hardware arrangement influence the speed of execution for programs
- All computer units benefit from faster circuits
- Pipelining involves arranging the hardware to perform multiple operations simultaneously
- Similar to an assembly line, where the product moves through stations that perform specific tasks
- Same total time for each item, but overlapped

[Figure: stages F, G, and H of successive items overlapped along a time axis]

- Focus on pipelining of instruction execution
- The multistage datapath in Chapter 5 consists of: Fetch, Decode, Execute, Memory, Write Back
- Instructions are fetched and executed one at a time, with only one stage active in any cycle
- With pipelining, multiple stages are active simultaneously for different instructions
- Still 5 cycles to execute an instruction, but the completion rate is 1 per cycle
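
A quick way to see the throughput claim is to count cycles. The sketch below (Python; the function names are ours for illustration, not from the textbook) compares a non-pipelined datapath, where each instruction takes all 5 cycles before the next starts, with a pipeline that completes one instruction per cycle once full:

    # Non-pipelined: each of the n instructions occupies the datapath
    # for all 'stages' cycles before the next one starts.
    def cycles_nonpipelined(n_instructions, stages=5):
        return n_instructions * stages

    # Pipelined: the first instruction takes 'stages' cycles; after
    # that, one instruction completes in every cycle.
    def cycles_pipelined(n_instructions, stages=5):
        return stages + (n_instructions - 1)

    print(cycles_nonpipelined(100))  # 500 cycles
    print(cycles_pipelined(100))     # 104 cycles, close to a 5x speedup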

- Use the program counter (PC) to fetch instructions
- A new instruction enters the pipeline every cycle
- Instruction-specific information is carried along as instructions flow through the different stages
- Interstage buffers hold this information
- These buffers incorporate the RA, RB, RM, RY, RZ, IR, and PC-Temp registers from Chapter 5
- The buffers also hold control signal settings
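
As an illustration of what an interstage buffer carries, here is a minimal Python sketch. It collapses all the buffers into one record type for brevity; the field names echo the Chapter 5 registers, but the structure itself is our assumption, not the textbook's design:

    from dataclasses import dataclass, field
    from typing import Dict, Optional, Tuple

    @dataclass
    class InterstageBuffer:
        ir: str = "nop"                 # instruction held in this stage
        pc_temp: int = 0                # saved PC value for this instruction
        ra: int = 0                     # first source operand (RA)
        rb: int = 0                     # second source operand (RB)
        rz: int = 0                     # ALU result (RZ)
        ry: int = 0                     # value to be written back (RY)
        dst_reg: Optional[int] = None   # destination register identifier
        src_regs: Tuple[int, ...] = ()  # source register identifiers
        controls: Dict[str, int] = field(default_factory=dict)  # control signals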

[Figure: pipeline organization with five stages, Stage 1 through Stage 5]

Adapting Chapter 5 Design to Pipeline

[Figure: the five stages (Stage 1: Instruction Fetch, Stage 2: Instruction Decode / Register File, Stage 3: ALU, Stage 4: MEM Access, Stage 5: Write Back) separated by interstage buffers carrying IR, RA, RB, RZ, RY and control signals Ctl3, Ctl4, Ctl5]

Specific Example

[Figure: the sequence sub r2, r1, r3; and r3, r6, r7; or r2, r4, r5; add r3, r6, r7; sw r8, 100(r9) flowing through the five pipeline stages]

- Consider two successive instructions I_j and I_j+1
- Assume that the destination register of I_j matches one of the source registers of I_j+1
- The result of I_j is written to its destination in cycle 5
- But I_j+1 reads the old value of that register in cycle 3
- Due to pipelining, the I_j+1 computation is incorrect
- So stall (delay) I_j+1 until I_j writes the new value
- A condition requiring this stall is a data hazard

- Now consider the specific instructions:

      Add      R2, R3, #100
      Subtract R9, R2, #30

- Destination R2 of Add is a source for Subtract
- There is a data dependency between them, because R2 carries data from Add to Subtract
- On a non-pipelined datapath, the result is available in R2 because Add completes before Subtract begins

- With pipelined execution, the old value is still in register R2 when Subtract is in the Decode stage
- So stall Subtract for 3 cycles in the Decode stage
- The new value of R2 is then available in cycle 6

- The control circuitry must recognize the dependency while Subtract is being decoded in cycle 3
- Interstage buffers carry register identifiers for the source(s) and destination of each instruction
- In cycle 3, compare the destination identifier in the Compute stage against the source(s) in Decode
- R2 matches, so Subtract is kept in Decode while Add is allowed to continue normally

- Stall the Subtract instruction for 3 cycles by keeping the contents of interstage buffer B1 unchanged
- What happens after Add leaves the Compute stage?
- Control signals are set in cycles 3 to 5 to create an implicit NOP (no-operation) in Compute
- The NOP control signals in interstage buffer B2 create a cycle of idle time in each later stage
- The idle time from each NOP is called a bubble
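
The detect-and-stall behavior can be summarized in a few lines. The sketch below (Python, with illustrative names) returns True when the instruction in Decode reads a register that an in-flight instruction has not yet written; on True, the control circuitry would hold buffer B1 unchanged and load NOP control signals into buffer B2, creating the bubble:

    def needs_stall(decode_srcs, compute_dst, memory_dst, writeback_dst):
        """True if the instruction in Decode reads a register whose new
        value is still in flight in Compute, Memory, or Write Back."""
        in_flight = {compute_dst, memory_dst, writeback_dst} - {None}
        return any(src in in_flight for src in decode_srcs)

    # Subtract (source R2) in Decode while Add (destination R2) is in
    # Compute: a stall is required.
    print(needs_stall(("R2",), "R2", None, None))  # True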

Stalling: Example

[Figure: the sequence sub r2, r1, r3; and r12, r2, r5; or r13, r6, r2; add r14, r2, r2; sw r15, 100($2) in the pipeline; the data dependency of "and" on r2 is detected while it is in Decode]

Stalling

[Figure: sub is allowed to proceed, "and" is held in Decode, and NOP control signals create a bubble]

Stalling

[Figure: the bubble moves down the pipeline]

Stalling

[Figure: "and" remains held in Decode until completion of sub's register write]

Stalling

[Figure: "and" can now be decoded and proceeds down the pipeline]

- Operand forwarding handles dependencies without the penalty of stalling the pipeline
- For the preceding sequence of instructions, the new value for R2 is available at the end of cycle 3
- Forward the value to where it is needed in cycle 4

- Introduce multiplexers before the ALU inputs so that the contents of register RZ can be used as the forwarded value
- The control circuitry now recognizes the dependency in cycle 4, when Subtract is in the Compute stage
- Interstage buffers still carry register identifiers
- Compare the destination of Add in the Memory stage with the source(s) of Subtract in the Compute stage
- Set the multiplexer control based on this comparison
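
The multiplexer-control decision reduces to a register-identifier comparison. A minimal sketch (Python; the buffer names follow the B3/B4 labels used in the figures below, but the code is illustrative, and a real design would also check that the in-flight instruction actually writes a register):

    def alu_operand(src_reg, reg_file, buf_b3, buf_b4):
        # Newest value wins: buffer B3 (after Compute) holds RZ for the
        # instruction one ahead; buffer B4 (after Memory) holds RY for
        # the instruction two ahead.
        if buf_b3.dst_reg == src_reg:
            return buf_b3.rz          # forward the ALU result from B3
        if buf_b4.dst_reg == src_reg:
            return buf_b4.ry          # forward the older result from B4
        return reg_file[src_reg]      # no hazard: use the register file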

Forwarding from Buffer 3

[Figure: multiplexers at the ALU inputs select the value forwarded from interstage buffer B3, so that "and" receives the r2 value produced by sub]

Forwarding from Buffer 4

[Figure: the same sequence with a value forwarded from interstage buffer B4 through the ALU input multiplexers]

- The compiler can generate and analyze instructions
- Data dependencies are evident from the registers used
- The compiler puts three explicit NOP instructions between instructions having a dependency
- The delay ensures the new value is available in the register, but total execution time increases
- The compiler can optimize by moving useful instructions into the NOP slots (if data dependencies permit)
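
A compiler pass for this software approach is easy to sketch. In the illustrative Python below, a program is a list of (destination, sources) register tuples, a NOP is (None, ()), and enough NOPs are inserted so that a dependent instruction always sits at least three slots after the writer; this encoding is ours, for illustration only:

    def insert_nops(program, window=3):
        out = []
        for dst, srcs in program:
            # Find the most recent emitted instruction (within the hazard
            # window) that writes one of our source registers.
            for back, (pdst, _) in enumerate(reversed(out[-window:]), start=1):
                if pdst is not None and pdst in srcs:
                    out.extend([(None, ())] * (window - back + 1))  # NOPs
                    break
            out.append((dst, srcs))
        return out

    # Add R2, R3, #100 followed by Subtract R9, R2, #30, in this encoding:
    prog = [("R2", ("R3",)), ("R9", ("R2",))]
    print(insert_nops(prog))
    # three (None, ()) NOP entries now separate the two instructions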

- Memory delays can also cause pipeline stalls
- A cache memory holds instructions and data from the main memory, but is faster to access
- With a cache, the typical access time is one cycle
- But a cache miss requires accessing the slower main memory, with a much longer delay
- In a pipeline, a memory delay for one instruction causes all subsequent instructions to be delayed

- Even with a cache hit, a Load instruction may cause a short delay due to a data dependency
- A one-cycle stall is required for the correct value to be forwarded to the instruction needing that value
- Optimize by filling the delay with a useful instruction

[Figure: a load followed by a dependent instruction; the loaded value is forwarded from Stage 4]

- C code for A = B + E; C = B + F;

      lw   r2, 0(r1)
      lw   r3, 4(r1)
           (stall)
      add  r4, r2, r3
      sw   r4, 12(r1)
      lw   r5, 8(r1)
           (stall)
      add  r6, r2, r5
      sw   r6, 16(r1)

  13 cycles

- Does this code require forwarding to work? If so, where?

- Reorder the code to avoid use of a load result in the next instruction
- C code for A = B + E; C = B + F; reordered (11 cycles instead of 13):

      lw   r2, 0(r1)
      lw   r3, 4(r1)
      lw   r5, 8(r1)
      add  r4, r2, r3
      sw   r4, 12(r1)
      add  r6, r2, r5
      sw   r6, 16(r1)
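
The two cycle counts follow from simple arithmetic: with 5 stages, n instructions need n + 4 cycles, plus one cycle per load-use stall. A quick check (Python, illustrative):

    def total_cycles(n_instructions, n_stalls, stages=5):
        # n + (stages - 1) cycles to fill and drain the pipeline,
        # plus one cycle per stall
        return n_instructions + (stages - 1) + n_stalls

    print(total_cycles(7, 2))  # original order: 13 cycles (2 load-use stalls)
    print(total_cycles(7, 0))  # reordered:      11 cycles (no stalls)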

- We have a 1-cycle stall due to a data dependency even when there is a cache hit.
- How can we assess the impact of this on pipeline throughput?
- We need statistics on how often the 1-cycle stall penalty occurs, i.e., how often:
  - Inst_i is a load/store (called data-transfer or D-type in the project), and
  - Inst_i+1 has a data dependency on Inst_i.

- A miss can occur either during instruction fetch or during data transfer.
- Miss rates may differ for the two.
- We also need to know the miss penalty (the number of cycles to read the slower non-cache memory) and the frequency of data-transfer instructions with a dependency, as in the last example.

- Instruction miss rate: m_i
- Data miss rate: m_d
- Frequency of D-type instructions: d
- Miss penalty: p_m cycles
- Increase in cycles/instruction over the ideal pipeline:

      delta_miss = (m_i + d * m_d) * p_m

- For typical values (see p. 211), the impact is much larger than for the cache-hit stall
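
Plugging in numbers makes the formula concrete. The values below are assumptions chosen for this sketch, not the ones on p. 211 of the textbook:

    m_i = 0.05   # instruction miss rate (assumed)
    m_d = 0.10   # data miss rate (assumed)
    d   = 0.30   # fraction of D-type (load/store) instructions (assumed)
    p_m = 10     # miss penalty in cycles (assumed)

    delta_miss = (m_i + d * m_d) * p_m
    print(delta_miss)  # 0.8 extra cycles per instruction, versus a
                       # fraction of a cycle for the cache-hit stall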

- For more analysis of the impact of various delays and hazards on pipeline performance, read Section

- Ideal pipelining: fetch each new instruction while the previous instruction is being decoded
- Branch instructions alter the execution sequence, but they must be processed before their effect is known
- Any delay in determining the branch outcome leads to an increase in total execution time
- Techniques to mitigate this effect are desired
- Understanding branch behavior helps in finding solutions

- Consider instructions I_j, I_j+1, I_j+2 in sequence
- I_j is an unconditional branch with target I_k
- In Chapter 5, the Compute stage determined the target address using the offset and the PC + 4 value
- In the pipeline, the target I_k is known for I_j in cycle 4, but instructions I_j+1 and I_j+2 have already been fetched in cycles 2 and 3
- Target I_k should have followed I_j immediately, so discard I_j+1 and I_j+2 and incur a two-cycle penalty

- In the pipeline, the adder for the PC is used every cycle, so it cannot also calculate the branch target address
- So introduce a second adder just for branches
- Place this second adder in the Decode stage to enable earlier determination of the target address
- For the previous example, now only I_j+1 is fetched
- Only one instruction needs to be discarded
- The branch penalty is reduced to one cycle
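
The benefit of the second adder can be quantified with a one-line model of average cycles per instruction. Assuming, purely for illustration, that 20% of executed instructions are branches that incur the full penalty:

    def effective_cpi(branch_fraction, branch_penalty, base_cpi=1.0):
        # Each branch adds 'branch_penalty' cycles of discarded work.
        return base_cpi + branch_fraction * branch_penalty

    print(effective_cpi(0.20, 2))  # target known in Compute: 1.4 CPI
    print(effective_cpi(0.20, 1))  # target known in Decode:  1.2 CPI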

- Consider a conditional branch instruction: Branch_if_[R5]=[R6] LOOP
- This requires not only target address calculation, but also a comparison to evaluate the condition
- In Chapter 5, the ALU performed the comparison
- The target address is now calculated in the Decode stage
- To maintain the one-cycle penalty, introduce a comparator just for branches in the Decode stage

- Let both the branch decision and the target address be determined in the Decode stage of the pipeline
- The instruction immediately following a branch is always fetched, regardless of the branch decision
- That next instruction is discarded, with a penalty, except when the conditional branch is not taken
- The location immediately following the branch is called the branch delay slot

- Instead of conditionally discarding the instruction in the delay slot, always let it complete execution
- Let the compiler find an instruction before the branch to move into the slot, if data dependencies permit
- This is called delayed branching, due to the reordering
- If a useful instruction is put in the slot, the penalty is zero
- If that is not possible, insert an explicit NOP in the delay slot, for a one-cycle penalty whether or not the branch is taken