PIPELINE AND VECTOR PROCESSING
The compiler analyzes the instructions before and after the branch and rearranges the program sequence by inserting useful instructions into the delay slots.

Pipeline Conflicts: three major difficulties
1) Resource conflicts: two segments access memory at the same time
2) Data dependency: an instruction depends on the result of a previous instruction, but that result is not yet available
3) Branch difficulties: branches and other instructions that change the value of the PC (interrupt, return, ...)

Solutions to data dependency:
- Hardware interlock: hardware delays the instruction until the result of the previous instruction is available
- Operand forwarding: hardware routes the result of the previous instruction directly to the unit that needs it, bypassing the register file
- Delayed load (software): the compiler inserts no-operation instructions after the previous (load) instruction

Handling of branch instructions begins with prefetching the target instruction: for a conditional branch, the branch target instruction is fetched in addition to the fall-through instruction.

Handling of Branch Instructions
- Prefetch target instruction: fetch instructions in both streams, branch not taken and branch taken. Both are saved until the branch is executed; then the right instruction stream is selected and the wrong stream discarded.
- Branch target buffer (BTB; associative memory): each entry holds the address of a previously executed branch together with the target instruction and the next few instructions. When fetching an instruction, the BTB is searched; if the branch is found, the instruction stream is fetched from the BTB, and if not, the new stream is fetched and the BTB is updated.
- Loop buffer (high-speed register file): stores an entire loop so that the loop can be executed without accessing memory.
- Branch prediction: guess the branch condition and fetch an instruction stream based on the guess; a correct guess eliminates the branch penalty.
- Delayed branch: the compiler detects the branch and rearranges the instruction sequence by inserting useful instructions that keep the pipeline busy in the presence of a branch instruction.
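The cost a branch-handling scheme saves can be quantified; here is a minimal sketch in Python, with every number (branch frequency, misprediction rate, flush penalty) invented purely for illustration:

```python
# Effect of branch prediction on average CPI: a mispredicted branch
# flushes the pipeline and costs extra cycles. All figures assumed.
base_cpi = 1.0          # ideal CPI of the pipeline
branch_frac = 0.20      # fraction of instructions that are branches (assumed)
mispredict_rate = 0.10  # fraction of branches guessed wrongly (assumed)
penalty = 3             # cycles lost per misprediction (assumed)

avg_cpi = base_cpi + branch_frac * mispredict_rate * penalty
print(round(avg_cpi, 2))  # 1.06
```

With these numbers a correct guess 90% of the time keeps the average CPI within 6% of ideal, which is why the slide calls a correct guess "eliminating the branch penalty".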

Mechanisms for Instruction Pipelining
Goal: achieve maximum parallelism in the pipeline by smoothing the instruction flow and minimizing idle cycles. Mechanisms:
- Prefetch buffers
- Multiple functional units
- Internal data forwarding
- Hazard avoidance

Prefetch Buffers
Used to match the instruction fetch rate to the pipeline consumption rate: in a single memory access, a block of consecutive instructions is fetched into a prefetch buffer. Three types of prefetch buffers:
- Sequential buffers, used to store sequential instructions
- Target buffers, used to store branch-target instructions
- Loop buffers, used to store loop instructions

Multiple Functional Units
At times a specific pipeline stage becomes the bottleneck; it can be identified by a large number of check marks in one row of the reservation table. To resolve the resulting dependencies, reservation stations are used. Each reservation station is uniquely identified by a tag monitored by a tag unit (register tagging); the stations help in conflict resolution while also serving as buffers.

Multifunctional Arithmetic Pipeline
A multifunctional arithmetic pipeline can perform many functions. Types of multifunctional pipelines:
- Static pipeline: performs a single function at a given time, and another function at some other time
- Dynamic pipeline: performs multiple functions at the same time; care must be taken in sharing the pipeline

Static Multifunctional Pipeline
Example: the Advanced Scientific Computer (TI ASC). Key features:
- Four pipelined arithmetic units
- A large number of working registers in the processor, which controls the operation of the memory buffer units and the arithmetic units
- An instruction processing unit (IPU) that handles fetching and decoding of instructions

Pipeline Interconnections
Example: the Advanced Scientific Computer. Its arithmetic pipeline has eight stages and is an example of a static multifunctional pipeline: by changing the interconnections among the stages, different functions (fixed-point and floating-point) can be performed.

Performance Considerations
The execution time T of a program that has a dynamic instruction count N is given by:

    T = (N × S) / R

where S is the average number of clock cycles it takes to fetch and execute one instruction, and R is the clock rate. Instruction throughput is defined as the number of instructions executed per second:

    Ps = R / S
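As a quick numeric check of the two formulas, here is a small Python sketch; the instruction count, average CPI, and clock rate below are invented for illustration:

```python
# Execution time T = (N * S) / R and throughput = R / S.
N = 1_000_000   # dynamic instruction count (assumed)
S = 1.2         # average clock cycles per instruction (assumed)
R = 500e6       # clock rate in Hz, i.e. 500 MHz (assumed)

T = (N * S) / R
throughput = R / S
print(T)           # 0.0024 seconds
print(throughput)  # ~4.17e8 instructions per second
```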

Overview
An n-stage pipeline has the potential to increase throughput by n times. However, the only real measure of performance is the total execution time of a program; higher instruction throughput will not necessarily lead to higher performance. Two questions regarding pipelining:
- How much of this potential increase in instruction throughput can be realized in practice?
- What is a good value of n?

“Iron Law” of Processor Performance

    Time/Program = (Instructions/Program) × (Cycles/Instruction) × (Time/Cycle)

- Instructions per program depends on the source code, compiler technology, and the ISA
- Cycles per instruction (CPI) depends on the ISA and the microarchitecture
- Time per cycle depends on the microarchitecture and the base technology

    Microarchitecture          CPI   Cycle time
    Microcoded                 >1    short
    Single-cycle unpipelined   1     long
    Pipelined                  1     short

CPI Examples
- Microcoded machine: three instructions taking 7, 5, and 10 cycles respectively — 3 instructions in 22 cycles, CPI = 7.33
- Unpipelined machine: 3 instructions in 3 (long) cycles, CPI = 1
- Pipelined machine: 3 instructions in 3 (short) cycles, CPI = 1
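The microcoded CPI above follows directly from the Iron Law; a minimal sketch (the 500 MHz clock is an assumption added only to complete the time calculation):

```python
# Iron Law: time/program = instructions/program * cycles/instruction * time/cycle
cycles = [7, 5, 10]            # per-instruction cycle counts from the example
instructions = len(cycles)
cpi = sum(cycles) / instructions
print(round(cpi, 2))           # 7.33

cycle_time = 1 / 500e6         # seconds per cycle at an assumed 500 MHz clock
time_per_program = instructions * cpi * cycle_time
print(time_per_program)        # ~4.4e-08 seconds (22 cycles at 2 ns each)
```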

Technology Assumptions
- A small amount of very fast memory (caches) backed up by a large, slower memory
- A fast ALU (at least for integers)
- Multiported register files (slower!)
A 5-stage pipeline will be the focus of our detailed design; some commercial designs have over 30 pipeline stages to do an integer add!

Speedup Equation for Pipelining

    Speedup = Pipeline depth / (1 + Pipeline stall cycles per instruction)
              × (Clock cycle time unpipelined / Clock cycle time pipelined)

For a simple RISC pipeline with ideal CPI = 1, the denominator reduces to one plus the average number of stall cycles per instruction.

Example
- Machine A: dual-ported memory (“Harvard architecture”)
- Machine B: single-ported memory, but its pipelined implementation has a 1.05 times faster clock rate
- Ideal CPI = 1 for both; loads are 40% of instructions executed

SpeedUpA = Pipe depth / (1 + 0) × (clockunpipe / clockpipe) = Pipeline depth
SpeedUpB = Pipe depth / (1 + 0.4 × 1) × (clockunpipe / (clockunpipe / 1.05)) = (Pipe depth / 1.4) × 1.05 = 0.75 × Pipe depth
SpeedUpA / SpeedUpB = Pipe depth / (0.75 × Pipe depth) = 1.33

Machine A is 1.33 times faster.
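The same comparison can be reproduced numerically; the pipeline depth of 5 is an arbitrary assumption (it cancels out of the final ratio):

```python
# Machine A vs. Machine B speedup, following the slide's formula.
depth = 5                  # pipeline depth (assumed; cancels in the ratio)
load_frac = 0.40           # loads are 40% of instructions executed
stall_per_load = 1         # each load stalls Machine B's single-ported memory
clock_gain_b = 1.05        # Machine B's 1.05x faster clock

speedup_a = depth / (1 + 0)                                # no memory stalls
speedup_b = depth / (1 + load_frac * stall_per_load) * clock_gain_b
print(round(speedup_a / speedup_b, 2))                     # 1.33
```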

Designing a Pipelined Processor
What do we need to do to pipeline the process?
[Datapath figure: the stages Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc, Memory Access, and Write Back, built from the next-PC mux and adder, instruction memory, register file (RS1/RS2/RD), sign-extend unit, ALU, data memory, and write-back muxes.]

5 Steps of MIPS/DLX Datapath
[Pipelined datapath figure: the five stages — Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc, Memory Access, Write Back — separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers.]
Data-stationary control: local decode for each instruction phase / pipeline stage.

Graphically Representing Pipelines
A pipeline diagram can help with answering questions like:
- How many cycles does it take to execute this code?
- What is the ALU doing during cycle 4?
Use this representation to help understand datapaths.

Visualizing Pipelining
[Figure: instructions listed down the page in instruction order against time in clock cycles (Cycle 1 through Cycle 7) across the page; each instruction occupies the Ifetch, Reg, ALU, DMem, and Reg stages in successive cycles, offset one cycle per instruction.]
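A diagram like the one described can be generated mechanically. The sketch below follows the standard 5-stage structure with common stage abbreviations; the three-instruction count is arbitrary:

```python
# Print a classic pipeline staircase: each instruction enters one cycle
# after the previous one and flows through the five stages in order.
stages = ["IF", "ID", "EX", "MEM", "WB"]
rows = []
for i in range(3):
    # instruction i is delayed by i cycles, so pad with i blank slots
    row = f"I{i + 1}: " + "    " * i + "".join(f"{s:<4}" for s in stages)
    rows.append(row)
    print(row)
```

Each printed row shifts right by one cycle relative to the row above it, which is exactly the staircase shape of the slide's figure.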

Conventional Pipelined Execution Representation
[Figure: program flow down the page against time across the page; each instruction passes through IFetch, Dcd, Exec, Mem, and WB, with successive instructions offset by one stage.]

Single Cycle, Multiple Cycle, vs. Pipeline
[Timing-diagram figure comparing the three implementations over ten clock cycles for a Load / Store / R-type instruction sequence.]
These timing diagrams show the differences between the single-cycle, multiple-cycle, and pipeline implementations. In the pipeline implementation, we can finish executing the Load, Store, and R-type instruction sequence in seven cycles. In the multiple-cycle implementation, however, we cannot start executing the Store until Cycle 6, because we must wait for the Load instruction to complete; similarly, we cannot start executing the R-type instruction until the Store instruction has completed its execution in Cycle 9. In the single-cycle implementation, the cycle time is set to accommodate the longest instruction, the Load. Consequently, the cycle time for the single-cycle implementation can be five times longer than for the multiple-cycle implementation. Perhaps more importantly, since the cycle time has to be long enough for the Load instruction, it is too long for the Store instruction, so the last part of that cycle is wasted.
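The seven-cycle figure for three instructions follows from the standard pipeline timing formula; a quick check (the 5-stage depth matches the datapath discussed above):

```python
# With a k-stage pipeline and n instructions, the first instruction needs
# k cycles and each later one completes one cycle after that: k + (n - 1).
def pipeline_cycles(k, n):
    return k + (n - 1)

print(pipeline_cycles(5, 3))    # 7 cycles for the Load/Store/R-type sequence
print(pipeline_cycles(5, 100))  # 104 cycles for 100 instructions
```

For large n the cycle count approaches one per instruction, which is the ideal-CPI-of-1 behavior the earlier slides assume.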

Vector Processing
Science and engineering applications: long-range weather forecasting, petroleum exploration, seismic data analysis, medical diagnosis, aerodynamics and space-flight simulations, artificial intelligence and expert systems, mapping the human genome, image processing.
Vector operations are arithmetic operations on large arrays of numbers. A conventional scalar processor executes the Fortran loop

       DO 20 I = 1, 100
    20 C(I) = A(I) + B(I)

as a sequence of machine-language steps:

       Initialize I = 0
    20 Read A(I)
       Read B(I)
       Store C(I) = A(I) + B(I)
       Increment I = I + 1
       If I ≤ 100 go to 20
       Continue

A vector processor performs the whole operation with a single vector instruction:

    C(1:100) = A(1:100) + B(1:100)
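The same scalar-versus-vector contrast can be sketched in Python: an explicit element-by-element loop versus one whole-array operation (a list comprehension standing in for the vector instruction):

```python
# Scalar-vs-vector contrast from the slide, on plain Python lists.
a = list(range(100))        # A(1:100), illustrative data
b = list(range(100, 200))   # B(1:100), illustrative data

# Scalar-processor style: explicit loop, one element per iteration.
c_scalar = [0] * 100
for i in range(100):
    c_scalar[i] = a[i] + b[i]

# Vector-processor style: one whole-array operation,
# analogous to the Fortran statement C(1:100) = A(1:100) + B(1:100).
c_vector = [x + y for x, y in zip(a, b)]

print(c_scalar == c_vector)  # True
```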

Vector Instruction Format
A vector instruction such as ADD A B C 100 specifies the operation code, the base addresses of the source operands (A, B) and the destination (C), and the vector length (100).

Matrix Multiplication
Multiplying two 3 × 3 matrices requires n² = 9 inner products, and each inner product takes 3 multiply-add operations, so the total is 9 × 3 = n³ = 27 cumulative multiply-adds. Each element is built up as a cumulative multiply-add starting from zero, e.g.:

    C11 = 0
    C11 = C11 + A11·B11 + A12·B21 + A13·B31
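The 27-operation count can be verified directly by instrumenting a naive matrix multiply (the matrix values are arbitrary illustration data):

```python
# Count the multiply-add operations in a 3x3 matrix multiply:
# n*n inner products, each of length n, giving n**3 multiply-adds.
n = 3
A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
C = [[0] * n for _ in range(n)]

ops = 0
for i in range(n):
    for j in range(n):
        for k in range(n):              # cumulative multiply-add, from C[i][j] = 0
            C[i][j] += A[i][k] * B[k][j]
            ops += 1

print(ops)      # 27 multiply-adds
print(C[0][0])  # 1*9 + 2*6 + 3*3 = 30
```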

Pipeline for Calculating an Inner Product
A 4-segment floating-point multiplier pipeline feeds a 4-segment floating-point adder pipeline. After the first clock, the product A1B1 enters the multiplier; after the fourth clock, products begin to emerge one per cycle and enter the adder; by the eighth clock the products A1B1 through A8B8 are streaming through both pipelines. From the ninth clock onward the adder accumulates four partial sums in its segments, and the inner product is obtained with a final four-section summation:

    S = (A1B1 + A5B5 + ...) + (A2B2 + A6B6 + ...) + (A3B3 + A7B7 + ...) + (A4B4 + A8B8 + ...)
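The four-partial-sum decomposition gives the same result as a direct dot product; a small sketch with made-up element values:

```python
# Inner product via four interleaved partial sums, mirroring how the
# 4-segment adder pipeline accumulates them (element values invented).
a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
b = [8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0]

partial = [0.0] * 4
for i in range(len(a)):
    partial[i % 4] += a[i] * b[i]   # product A(i)B(i) lands in segment i mod 4

s = sum(partial)                    # final four-section summation
print(s)                            # 120.0, equal to the direct dot product
```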

Memory Interleaving
Simultaneous access to memory from two or more sources using one memory bus system. With address interleaving, different sets of addresses are assigned to different memory modules.
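Low-order interleaving is one common address-assignment scheme (an assumption on my part; the slide does not say which bits select the module). A minimal sketch:

```python
# Low-order interleaving across 4 modules: the low address bits select the
# module, so consecutive addresses fall in different modules and can be
# accessed in parallel. Module count is an assumed example value.
n_modules = 4

def module_and_offset(addr):
    return addr % n_modules, addr // n_modules

for addr in range(8):
    m, off = module_and_offset(addr)
    print(f"address {addr} -> module {m}, word {off}")
```

With this mapping, a stride-1 vector fetch touches all four modules in rotation instead of hammering one of them.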

Supercomputers
A supercomputer combines vector instructions with pipelined floating-point arithmetic. Performance evaluation indexes:
- MIPS: million instructions per second
- FLOPS: floating-point operations per second (megaflops = 10^6 FLOPS, gigaflops = 10^9 FLOPS)
Cray supercomputers (Cray Research):
- Cray-1: 80 megaflops, 4 million 64-bit words of memory
- Cray-2: 12 times more powerful than the Cray-1
VP supercomputers (Fujitsu):
- VP-200: 300 megaflops, 32 million words of memory, 83 vector instructions, 195 scalar instructions
- VP-2600: 5 gigaflops

Array Processors
An array processor performs computations on large arrays of data.
Attached array processor: an auxiliary processor attached to a general-purpose computer, designed as a peripheral for a conventional host. Its purpose is to enhance the performance of the host by providing vector processing, and it achieves high performance by means of parallel processing with multiple functional units.

SIMD array processor: a computer with multiple processing units operating in parallel. The processing units are synchronized to perform the same task under the control of a common control unit. Each processing element (PE) includes an ALU, a floating-point arithmetic unit, and working registers.
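The lock-step behavior of a SIMD array can be sketched abstractly: one "control unit" broadcasts a single operation and every PE applies it to its own local data (the PE count and operand values are invented):

```python
# SIMD lock-step sketch: the control unit broadcasts one operation to all
# processing elements, each operating on its own local registers.
pe_data = [(1, 2), (3, 4), (5, 6), (7, 8)]   # local operands of 4 assumed PEs

def broadcast(op, pes):
    # every PE executes the identical instruction on its local operands
    return [op(x, y) for x, y in pes]

print(broadcast(lambda x, y: x + y, pe_data))  # [3, 7, 11, 15]
```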