Appendix A Pipelining: Basic and Intermediate Concepts

Pipelining An implementation technique whereby multiple instructions are overlapped in execution. Each step in the pipeline (called a pipe stage) completes a part of an instruction. Because all stages proceed at the same time, the length of a processor (clock) cycle is determined by the time required for the slowest pipe stage. CSCE 614 Fall 2009

Pipelining: The designer's goal is to balance the length of each pipeline stage. If the stages are perfectly balanced, the time per instruction on the pipelined processor is

$$\text{Time per instruction}_{\text{pipelined}} = \frac{\text{Time per instruction on unpipelined machine}}{\text{Number of pipe stages}}$$

so, under these ideal conditions, speedup from pipelining = number of pipe stages.
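
A minimal sketch of the two statements above, assuming made-up stage delays (the numbers are illustrative, not from the slides): the clock cycle is set by the slowest stage, so the speedup equals the number of stages only when the stages are balanced. The function name `ideal_speedup` is hypothetical.

```python
def ideal_speedup(stage_times_ns):
    """Speedup of a pipeline over an unpipelined machine, ignoring hazards
    and pipeline-register overhead (both are idealizing assumptions)."""
    unpipelined_time = sum(stage_times_ns)  # an instruction passes through every stage
    cycle_time = max(stage_times_ns)        # the clock is set by the slowest stage
    return unpipelined_time / cycle_time

# Illustrative stage delays in ns (hypothetical values).
balanced = ideal_speedup([2.0, 2.0, 2.0, 2.0, 2.0])    # 10 / 2
unbalanced = ideal_speedup([1.0, 2.0, 3.0, 2.0, 2.0])  # 10 / 3
print(balanced, round(unbalanced, 2))
```

With five balanced stages the speedup is 5; with the unbalanced delays the same five stages give only about 3.33, because every stage must wait for the 3 ns one.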

RISC Instruction Set (MIPS64): The 64-bit version of the MIPS instruction set, with 32 registers and 3 classes of instructions: ALU instructions (DADD, DSUB, …); load and store instructions (LD, SD, …); branches and jumps.

Implementation of a RISC (Unpipelined, Multicycle): An implementation of an integer subset of a RISC architecture in which every instruction takes at most 5 clock cycles: (1) Instruction Fetch (IF); (2) Instruction Decode/Register Fetch (ID); (3) Execution/Effective Address Calculation (EX); (4) Memory Access (MEM); (5) Write-Back (WB).

Instruction Format (32-bit Version): All MIPS instructions are 32 bits long.
R-format (add, sub, …): OP | rs | rt | rd | sa | funct
I-format (lw, sw, …): OP | rs | rt | immediate
J-format (j): OP | jump target

Instruction Fetch Cycle (IF): Send the program counter (PC) to memory. Fetch the current instruction from memory. Update the PC to the next sequential PC by adding 4 to the PC.

Instruction Decode/Register Fetch Cycle (ID): Decode the instruction and read the registers from the register file. Do the equality test on the registers for a possible branch. Sign-extend the offset field of the instruction in case it is needed. Compute the possible branch target address by adding the sign-extended offset to the incremented PC.

Execution/Effective Address Calculation (EX): The ALU operates on the operands prepared in the prior cycle. Memory reference instructions: the ALU adds the base register and the offset to form the effective address. Register-register: the ALU performs the operation specified by the ALU opcode on the values from the register file. Register-immediate: the ALU performs the operation specified by the opcode on the first value from the register file and the sign-extended immediate.

Memory Access (MEM): If the instruction is a load, memory does a read using the effective address computed in the previous cycle. If it is a store, the memory writes the data from the second register read from the register file, using the effective address.

Write-Back Cycle (WB): For a register-register ALU instruction or a load instruction, write the result into the register file.

In this implementation, branch instructions require 2 cycles, store instructions require 4 cycles, and all other instructions require 5 cycles. Assuming a branch frequency of 12% and a store frequency of 10%, what is the overall CPI?
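
The question can be worked as a weighted average of cycle counts. A short sketch, assuming that "all other instructions" make up the remaining 78% of the mix (the name `overall_cpi` is only illustrative):

```python
def overall_cpi(mix):
    """mix: list of (frequency, cycles) pairs whose frequencies sum to 1."""
    return sum(frequency * cycles for frequency, cycles in mix)

mix = [
    (0.12, 2),  # branches
    (0.10, 4),  # stores
    (0.78, 5),  # everything else (assumed to be the remaining 78%)
]
cpi = overall_cpi(mix)  # 0.24 + 0.40 + 3.90
print(round(cpi, 2))
```

Under that assumption the overall CPI comes out to 4.54.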

Classic 5-Stage Pipeline for a RISC Processor

Performance Issues in Pipelining: Pipelining increases the CPU instruction throughput, that is, the number of instructions completed per unit of time. It does not decrease the execution time of an individual instruction. In fact, it slightly increases that time because of overhead in the control of the pipeline (clock skew and pipeline register delay).

Example (p. A-10): Consider the unpipelined processor. Assume that it has a 1 ns clock cycle and that it uses 4 cycles for ALU operations and branches and 5 cycles for memory operations. Assume that the relative frequencies of these operations are 40%, 20%, and 40%, respectively. Suppose that due to clock skew and setup, pipelining the processor adds 0.2 ns of overhead to the clock. Ignoring any latency impact, how much speedup in the instruction execution rate will we gain from a pipeline?
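
One way to work this example, assuming the pipelined processor ideally completes one instruction per clock cycle (CPI of 1) and that the variable names below are only illustrative:

```python
clock_ns = 1.0     # unpipelined clock cycle
overhead_ns = 0.2  # added by clock skew and setup when pipelining

# (frequency, cycles) for ALU operations, branches, and memory operations
mix = [(0.40, 4), (0.20, 4), (0.40, 5)]

average_cpi = sum(frequency * cycles for frequency, cycles in mix)
unpipelined_time_ns = clock_ns * average_cpi  # average time per instruction
pipelined_time_ns = clock_ns + overhead_ns    # one instruction per (slower) cycle

speedup = unpipelined_time_ns / pipelined_time_ns
print(round(unpipelined_time_ns, 2), round(pipelined_time_ns, 2), round(speedup, 2))
```

The unpipelined average is 4.4 ns per instruction, the pipelined machine needs 1.2 ns, and the speedup is about 3.67.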

Classic 5-Stage Pipeline: What happens in the pipeline? One resource cannot be used for two different operations in the same clock cycle, so separate instruction and data memories are used. The register file is used in two stages, ID (two reads) and WB (one write), so the register write is done in the first half of the clock cycle and the register read in the second half.

Pipeline Hazards

Pipeline Hazards: Situations that prevent the next instruction in the instruction stream from executing during its designated clock cycle. Hazards reduce the performance from the ideal speedup gained by pipelining. There are three classes: structural hazards, data hazards, and control hazards. Hazards can make it necessary to stall the pipeline.

Pipeline Hazards: When an instruction is stalled, all instructions issued later than the stalled instruction are also stalled. No new instructions are fetched during the stall.

Structural Hazards: The hardware cannot support the combination of instructions that we want to execute in the same clock cycle. Example: suppose we have a single memory instead of two memories.
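
A rough sketch of the single-memory case, assuming the classic five-stage schedule with no stalls, so that instruction i (counting from 0) is in IF during cycle i + 1 and in MEM during cycle i + 4. The function name and the tuple layout are hypothetical. It lists the cycles in which a data access and an instruction fetch would both need the one memory.

```python
MEMORY_OPS = {"LD", "SD"}  # loads and stores access data memory in MEM

def single_memory_conflicts(program):
    """Return (cycle, mem_instr_index, fetch_instr_index) for every clock cycle
    in which a data access (MEM) and an instruction fetch (IF) would need the
    one shared memory, under the unstalled schedule described above."""
    conflicts = []
    for i, op in enumerate(program):
        if op in MEMORY_OPS:
            j = i + 3  # the instruction being fetched while instruction i is in MEM
            if j < len(program):
                conflicts.append((i + 4, i, j))
    return conflicts

program = ["LD", "DADD", "DSUB", "DADD", "DADD"]
print(single_memory_conflicts(program))  # [(4, 0, 3)]
```

Here the load is in MEM in cycle 4, exactly when the fourth instruction should be fetched, so one of the two has to wait. The sketch only flags conflicts in the ideal schedule; it does not model how the resulting stall shifts later instructions.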

Control Hazards: These arise from the need to make a decision based on the results of one instruction (a branch instruction) while others are executing. The simple remedy is a pipeline stall (or bubble). How can we overcome this problem?

Branch Hazards: To minimize the branch penalty, put in enough hardware so that we can test registers, calculate the branch target address, and update the PC during the second stage.

Example: Estimate the impact on the CPI of stalling on branches. Assume all other instructions have a CPI of 1.
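
The slide does not state the branch frequency or the stall length, so the sketch below assumes both: a 12% branch frequency (borrowed from the earlier CPI exercise) and a one-cycle stall per branch (the branch is resolved in the second stage). The function name is hypothetical.

```python
def cpi_with_branch_stalls(branch_frequency, stall_cycles_per_branch, base_cpi=1.0):
    """Every instruction costs base_cpi cycles; each branch adds its stall cycles."""
    return base_cpi + branch_frequency * stall_cycles_per_branch

# Assumed inputs, not given on the slide.
cpi = cpi_with_branch_stalls(branch_frequency=0.12, stall_cycles_per_branch=1)
print(round(cpi, 2))
```

With these assumed inputs the CPI rises from 1 to about 1.12; a different branch frequency or a longer stall scales the extra term linearly.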

Branch Prediction: Computers do indeed use prediction to handle branches. Simplest: always predict that branches will not be taken. If you're right, the pipeline proceeds at full speed. Dynamic hardware predictors make their guesses depending on the behavior of each branch. A popular approach keeps a history for each branch as taken or untaken and then uses the past to predict the future, which gives about 90% accuracy.
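
A toy sketch of the history idea, assuming the simplest possible history: one remembered outcome per branch address (a last-outcome predictor). The trace, the address 0x40, and the function names are invented for illustration, and the result is not meant to reproduce the 90% figure.

```python
def last_outcome_accuracy(trace, first_guess_taken=False):
    """trace: non-empty list of (branch_pc, taken) pairs in program order.
    Predict that each branch repeats its previous outcome."""
    history = {}  # branch_pc -> outcome seen last time
    correct = 0
    for pc, taken in trace:
        guess = history.get(pc, first_guess_taken)
        if guess == taken:
            correct += 1
        history[pc] = taken  # remember the real outcome for next time
    return correct / len(trace)

def not_taken_accuracy(trace):
    """Static scheme: always predict not taken."""
    return sum(1 for _, taken in trace if not taken) / len(trace)

# A loop branch that is taken 9 times and then falls through once.
loop_trace = [(0x40, True)] * 9 + [(0x40, False)]
print(last_outcome_accuracy(loop_trace))  # 0.8
print(not_taken_accuracy(loop_trace))     # 0.1
```

On this loop the history-based guess is wrong only on the first and last iterations, while always predicting not taken is right only once.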

Branch Prediction: When the guess is wrong, the pipeline must make sure that the instructions following the wrongly guessed branch have no effect, and it must restart the pipeline from the proper branch address.

Delayed Branch (delayed decision): Used in MIPS. The delayed branch always executes the next sequential instruction, with the branch taking place after that one-instruction delay.

MIPS software will place an instruction immediately after the delayed branch instruction that is not affected by the branch, and a taken branch changes the address of the instruction that follows this safe instruction. Compilers typically fill about 50% of the branch delay slots with useful instructions.

Data Hazards: An instruction depends on the result of a previous instruction still in the pipeline. For example:
    add $s0, $t0, $t1
    sub $t2, $s0, $t3
The add instruction does not write its result until the 5th stage, so the sub would need 3 bubbles.

Solution: forwarding (or bypassing), that is, getting the missing item early from the internal resources. For example, as soon as the ALU creates the sum for the add, we can supply it as the input for the subtract.

Load-Use Data Hazard

Even with forwarding, we still have to stall for one cycle on a load-use data hazard. Delayed loads: the load is followed by an instruction that is independent of that load.
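
A small sketch of the rule above, assuming full forwarding so that the only remaining stall is a load immediately followed by an instruction that reads the loaded register. The instruction encoding as `(op, dest, sources)` tuples and the function name are hypothetical simplifications.

```python
def count_load_use_stalls(program):
    """program: list of (op, dest, sources) tuples in program order.
    Counts one stall cycle for each load whose destination register is read
    by the very next instruction (assumes all other hazards are forwarded)."""
    stalls = 0
    for previous, current in zip(program, program[1:]):
        op, dest, _ = previous
        if op == "lw" and dest in current[2]:
            stalls += 1
    return stalls

hazard = [
    ("lw",  "$t1", ("$t0",)),
    ("add", "$t2", ("$t1", "$t3")),  # uses $t1 right after the load
    ("sub", "$t4", ("$t5", "$t6")),
]
reordered = [
    ("lw",  "$t1", ("$t0",)),
    ("sub", "$t4", ("$t5", "$t6")),  # independent instruction fills the slot
    ("add", "$t2", ("$t1", "$t3")),
]
print(count_load_use_stalls(hazard), count_load_use_stalls(reordered))  # 1 0
```

Moving the independent `sub` between the load and its consumer removes the stall, which is the scheduling idea behind delayed loads.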

Implementation of the MIPS Datapath

Events on Every Pipe Stage of the MIPS Pipeline: See Figure A.19 on page A-32.

Revised Datapath

Revised Pipeline Structure: See Figure A.25 on page A-39.

Extending the MIPS to Handle Multicycle Operations

Floating-Point Operations: The floating-point pipeline will allow for a longer latency for operations. The EX cycle may be repeated as many times as needed to complete the operation, and the number of repetitions can vary for different operations. There may be multiple floating-point functional units.

Assumptions: There are four separate functional units: (1) the main integer unit, which handles loads and stores, integer ALU operations, and branches; (2) the FP and integer multiplier; (3) the FP adder, which handles FP add, subtract, and conversion; (4) the FP and integer divider. The EX stages of these functional units are not pipelined.

MIPS with 3 FP Functional Units

Because EX is not pipelined, no other instruction using that functional unit may issue until the previous instruction leaves EX. Instruction issue (p. A-33): the process of letting an instruction move from the ID stage into the EX stage of the pipeline. If an instruction cannot proceed to the EX stage, the entire pipeline behind that instruction will be stalled.

Latency: the number of intervening cycles between an instruction that produces a result and an instruction that uses the result. Initiation interval: the number of cycles that must elapse between issuing two operations of a given type.

Example (Figure A.30):

Functional unit                   Latency   Initiation interval
Integer ALU                       0         1
Data memory (integer/FP loads)    1         1
FP add                            3         1
FP multiply (integer multiply)    6         1
FP divide (integer divide)        24        25
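
A brief sketch of what the initiation interval means for back-to-back operations of the same type, using the FP divide interval of 25 from the figure. The function name and the choice of cycle 1 as the first issue cycle are assumptions made for illustration.

```python
def earliest_issue_cycles(n_ops, initiation_interval, first_cycle=1):
    """Earliest cycles in which n_ops operations of one type can issue, if the
    only constraint is the initiation interval of their functional unit."""
    return [first_cycle + k * initiation_interval for k in range(n_ops)]

print(earliest_issue_cycles(3, 25))  # FP divides: [1, 26, 51]
print(earliest_issue_cycles(3, 1))   # interval of 1: [1, 2, 3]
```

An interval of 1 lets a new operation start every cycle, whereas three dependent-free divides still cannot start closer than 25 cycles apart.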

Since most operations consume their operands at the beginning of the EX stage, the latency is usually the number of stages after EX in which an instruction produces a result: 0 for integer ALU operations and 1 for loads. Pipeline latency is essentially equal to 1 cycle less than the depth of the execution pipeline, which is the number of stages from the EX stage to the stage that produces the result.
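
The relation between latency and execution-pipeline depth can be sketched as follows, using the latencies listed on the slides. The helper names are hypothetical, and the stall count assumes the consumer would otherwise issue right behind the producer.

```python
# Latencies as listed on the slides (Figure A.30 and the note above).
LATENCY = {"integer ALU": 0, "load": 1, "FP add": 3, "FP multiply": 6}

def execution_depth(latency):
    """Stages from EX to the stage that produces the result: latency + 1."""
    return latency + 1

def stalls_for_dependent(latency, independent_between=0):
    """Stall cycles for a consumer placed after `independent_between`
    independent instructions following the producer."""
    return max(0, latency - independent_between)

for unit, latency in LATENCY.items():
    print(unit, execution_depth(latency), stalls_for_dependent(latency))
```

For example, an FP add with latency 3 corresponds to an execution pipeline 4 stages deep, and a dependent instruction issued right behind it would wait 3 cycles unless independent instructions are scheduled in between.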

To achieve a higher clock rate, fewer logic levels are put in each pipe stage, so the number of pipe stages required for more complex operations is larger. The penalty for the faster clock rate is longer latency for operations.

Supporting Multiple FP Operations (figure label: unpipelined)