Chapter One Introduction to Pipelined Processors.

Speeding up of pipeline segments © Fr Dr Jaison Mulerikkal CMI.
Principles of Designing Pipeline Processors (Design Problems of Pipeline Processors)

Data Buffering and Busing Structures

Speeding up of pipeline segments
The processing speeds of pipeline segments are usually unequal. Consider the example below: a three-segment pipeline S1, S2, S3 with segment delays T1, T2, T3.

If T1 = T3 = T and T2 = 3T, S2 becomes the bottleneck and we need to remove it. How? One method is to subdivide the bottleneck; two subdivisions are possible:

First method: subdivide S2 into two segments with delays T and 2T, giving the stage sequence S1 (T), S2a (T), S2b (2T), S3 (T).


Second method: subdivide S2 into three segments of delay T each, giving five stages of delay T: S1, S2a, S2b, S2c, S3.

If the bottleneck is not subdivisible, we can duplicate S2 in parallel: S1 (T) feeds three parallel copies of S2 (3T each), which feed S3 (T).
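As a rough check of these options, here is a small Python sketch (an illustrative model, not from the slides) that computes the steady-state throughput of a linear pipeline, optionally with the slowest stage replicated in parallel:

```python
# Steady-state throughput of a linear pipeline is limited by its slowest
# stage: 1 / max(stage delays). Replicating the slowest stage k times in
# parallel divides its effective delay by k.

def throughput(stage_delays, copies_of_slowest=1):
    """Results completed per unit time (illustrative model only)."""
    bottleneck = max(stage_delays)
    effective = [d / copies_of_slowest if d == bottleneck else d
                 for d in stage_delays]
    return 1.0 / max(effective)

T = 1.0
print(throughput([T, 3 * T, T]))        # original pipeline: bottleneck S2
print(throughput([T, T, 2 * T]))        # first subdivision: T + 2T
print(throughput([T, T, T, T, T]))      # second subdivision: five T stages
print(throughput([T, 3 * T, T], 3))     # three parallel copies of S2
```

Under this model the two subdivisions raise throughput from 1/(3T) to 1/(2T) and 1/T respectively, and triplicating S2 also reaches 1/T, matching the argument above.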

Control and synchronization are more complex for parallel segments.

Data Buffering
Instruction and data buffering provide a continuous flow of work to the pipeline units. Example: the TI ASC (Advanced Scientific Computer).

In this system, a memory buffer unit (MBU):
– supplies the arithmetic unit with a continuous stream of operands
– stores results back into memory
The MBU has three double buffers X, Y and Z (one octet per buffer): X and Y for input and Z for output.

This supports pipeline processing at a high rate and alleviates the bandwidth mismatch between memory and the arithmetic pipeline.
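The double-buffering idea can be sketched in plain Python (the function name and toy model here are illustrative, not the actual MBU design): while the consumer drains one buffer, the memory system refills the other, and the two then swap roles.

```python
BUFFER_SIZE = 8  # one octet (eight words) per buffer, as in the slides

def double_buffered_stream(memory_words):
    """Yield operands one at a time, refilling in octet-sized chunks
    so the consumer never waits on a whole-stream fetch (toy model)."""
    i = 0
    active = memory_words[i:i + BUFFER_SIZE]    # buffer being consumed
    i += BUFFER_SIZE
    standby = memory_words[i:i + BUFFER_SIZE]   # buffer being refilled
    i += BUFFER_SIZE
    while active:
        for word in active:
            yield word
        # swap roles: the refilled buffer becomes the active one
        active, standby = standby, memory_words[i:i + BUFFER_SIZE]
        i += BUFFER_SIZE

# every word arrives exactly once, in order
assert list(double_buffered_stream(list(range(20)))) == list(range(20))
```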

Busing Structures
Problem: Ideally, the subfunctions in a pipeline should be independent; otherwise the pipeline must be halted until the dependency is removed.
Solution: an efficient internal busing structure.
Example: TI ASC

In the TI ASC, once an instruction dependency is recognized, an update capability is incorporated by transferring the contents of the Z buffer to the X or Y buffer.

Internal Data Forwarding and Register Tagging

Internal Forwarding: replacing unnecessary memory accesses with register-to-register transfers.
Register Tagging: the use of tagged registers to exploit concurrent activity among multiple ALUs.

Internal Forwarding
Memory access is slower than register-to-register operations, so performance can be enhanced by eliminating unnecessary memory accesses.

This concept can be explored in three directions:
1. Store – Load Forwarding
2. Load – Load Forwarding
3. Store – Store Forwarding

Store – Load Forwarding
A store to a memory location followed by a load from the same location: the load can be replaced by a register-to-register move from the register that supplied the store.

Load – Load Forwarding
Two loads from the same memory location: the second load can be replaced by a move from the register filled by the first load.

Store – Store Forwarding
Two stores to the same memory location with no intervening load: the first store can be eliminated, since the second overwrites its effect.
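The three rewrites can be sketched as a toy peephole pass (the tuple instruction notation and the MOVE opcode are illustrative, not a real ISA):

```python
# Toy instruction forms: ("STORE", mem, reg)  mem <- (reg)
#                        ("LOAD", reg, mem)   reg <- (mem)
#                        ("MOVE", dst, src)   dst <- (src)

def forward(prog):
    """Apply store-load, load-load and store-store forwarding to
    adjacent instruction pairs (illustrative sketch only)."""
    out = []
    for ins in prog:
        if out:
            prev = out[-1]
            if prev[0] == "STORE" and ins[0] == "LOAD" and ins[2] == prev[1]:
                ins = ("MOVE", ins[1], prev[2])   # store-load forwarding
            elif prev[0] == "LOAD" and ins[0] == "LOAD" and ins[2] == prev[2]:
                ins = ("MOVE", ins[1], prev[1])   # load-load forwarding
            elif prev[0] == "STORE" and ins[0] == "STORE" and ins[1] == prev[1]:
                out.pop()                         # store-store: drop dead store
        out.append(ins)
    return out

# store-load: the load of m becomes a move from r1
assert forward([("STORE", "m", "r1"), ("LOAD", "r2", "m")]) == \
       [("STORE", "m", "r1"), ("MOVE", "r2", "r1")]
```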

Register Tagging

Example: IBM Model 91 Floating-Point Execution Unit

The floating-point execution unit consists of:
– data registers
– transfer paths
– a floating-point adder unit
– a multiply/divide unit
– reservation stations
– a Common Data Bus (CDB)

There are 3 reservation stations for the adder, named A1, A2 and A3, and 2 for the multiplier, named M1 and M2. Each station has source and sink registers together with their tag and control fields. The stations hold operands for the next execution.

3 store data buffers (SDBs) and 4 floating-point registers (FLRs) are tagged. A busy bit in an FLR indicates that its value depends on an instruction still in execution. The Common Data Bus (CDB) transfers operands and results.

There are 11 units that can supply information to the CDB: 6 FLBs, 3 adder stations and 2 multiply/divide stations. Their tags are:

Unit   Tag     Unit   Tag
FLB1   0001    ADD1   1010
FLB2   0010    ADD2   1011
FLB3   0011    ADD3   1100
FLB4   0100    M1     1000
FLB5   0101    M2     1001
FLB6   0110

Internal forwarding can be achieved with the tagging scheme on the CDB. Example: let F refer to an FLR, let FLBi stand for the i-th FLB, and let (F) and (FLBi) denote their contents. Consider the instruction sequence:
ADD F, FLB1    F ← (F) + (FLB1)
MPY F, FLB2    F ← (F) × (FLB2)

During the addition:
– the busy bit of F is set to 1
– the contents of F and FLB1 are sent to adder station A1
– the tag of F is set to 1010 (the tag of the adder)

[Figure: floating-point unit organization — instruction unit, floating-point operand stack (FLOS), floating-point buffers (FLB), store data buffers (SDB), reservation stations with tag/sink/tag/source/ctrl fields, adder, multiplier and the Common Data Bus; station A1 holds F as sink and FLB1 as source; F: busy bit = 1, tag = 1010]

Meanwhile, decoding MPY reveals that F is busy, so:
– the sink tag of M1 is set to 1010 (the tag of the adder)
– the tag of F is changed to 1000 (the tag of the multiplier)
– the contents of FLB2 are sent to M1

[Figure: the same organization after MPY is decoded — station M1 holds F as sink and FLB2 as source; F: busy bit = 1, tag = 1000]

When the addition completes, the CDB broadcast shows that the result should be sent to M1. The multiplication is performed once both operands are available.
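The tag flow in this example can be mimicked with a small Python sketch (tags from the table above; the dictionaries, the chosen operand values and the broadcast function are illustrative, not the Model 91 hardware):

```python
# Illustrative model of register tagging and a CDB broadcast.
TAG_ADD1, TAG_M1 = "1010", "1000"

# FLR F: busy bit plus the tag of the unit that will produce its value.
# After ADD is issued, F is busy and tagged with the adder's tag.
F = {"busy": True, "tag": TAG_ADD1, "value": None}

# Station M1 waits for its sink operand, identified by the adder's tag;
# its source operand (FLB2) is already present (2.0 is an arbitrary value).
M1 = {"sink_tag": TAG_ADD1, "sink": None, "source": 2.0}

# Decoding MPY retags F to the multiplier, which produces F's final value.
F["tag"] = TAG_M1

def cdb_broadcast(tag, value):
    """Every waiting consumer whose tag matches grabs the value."""
    if M1["sink_tag"] == tag:
        M1["sink"] = value
    if F["busy"] and F["tag"] == tag:
        F["value"], F["busy"] = value, False

cdb_broadcast(TAG_ADD1, 5.0)      # adder finishes: M1 captures 5.0, F does not
assert M1["sink"] == 5.0 and F["busy"]

product = M1["sink"] * M1["source"]
cdb_broadcast(TAG_M1, product)    # multiplier finishes: F receives the result
assert F["value"] == 10.0 and not F["busy"]
```

Note how the adder's result goes straight to M1 over the CDB, never through memory or through F: this is internal forwarding via tags.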

Hazard Detection and Resolution

Hazards are caused by resource-usage conflicts among instructions and are triggered by inter-instruction dependencies.
Terminology:
Resource objects: the set of working registers, memory locations and special flags.

Data objects: the contents of resource objects.
Each instruction can be considered a mapping from a set of data objects to a set of data objects.
Domain D(I): the set of resource objects whose data objects may affect the execution of instruction I.

Range R(I): the set of resource objects whose data objects may be modified by the execution of instruction I.
An instruction reads from its domain and writes into its range.

Consider the execution of instructions I and J, where J appears immediately after I. There are 3 types of data-dependent hazards:
1. RAW (Read After Write)
2. WAW (Write After Write)
3. WAR (Write After Read)

RAW (Read After Write)
The necessary condition for this hazard is R(I) ∩ D(J) ≠ Ø.

Example:
I1: LOAD r1, a
I2: ADD r2, r1
I2 cannot be correctly executed until r1 is loaded; thus I2 is RAW-dependent on I1.

WAW (Write After Write)
The necessary condition is R(I) ∩ R(J) ≠ Ø.

Example:
I1: MUL r1, r2
I2: ADD r1, r4
Here I1 and I2 write to the same destination register, hence they are WAW-dependent.

WAR (Write After Read)
The necessary condition is D(I) ∩ R(J) ≠ Ø.

Example:
I1: MUL r1, r2
I2: ADD r2, r3
Here I2 has r2 as its destination while I1 uses r2 as a source, hence they are WAR-dependent.
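The three conditions translate directly into set intersections. Here is a sketch (representing resource objects as Python sets of names; the function name is illustrative) that classifies the hazards between two instructions:

```python
# Classify the hazard(s) between instructions I and J (J issued after I)
# by intersecting their domains D and ranges R, given as sets of
# resource-object names.

def hazards(D_I, R_I, D_J, R_J):
    found = set()
    if R_I & D_J:
        found.add("RAW")   # J reads an object that I writes
    if R_I & R_J:
        found.add("WAW")   # J writes an object that I also writes
    if D_I & R_J:
        found.add("WAR")   # J writes an object that I reads
    return found

# I1: LOAD r1,a ; I2: ADD r2,r1 (r2 <- r2 + r1): RAW on r1
assert hazards({"a"}, {"r1"}, {"r1", "r2"}, {"r2"}) == {"RAW"}
# I1: MUL r1,r2 ; I2: ADD r2,r3 (r2 <- r2 + r3): WAR on r2
assert hazards({"r1", "r2"}, {"r1"}, {"r2", "r3"}, {"r2"}) == {"WAR"}
```

Note that a single instruction pair can trigger more than one hazard type at once, which is why the function returns a set.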

Hazards can be detected in the fetch stage by comparing the domain and range of an incoming instruction with those of the instructions already in the pipe. Once a hazard is detected, there are two resolution methods:
1. Generate a warning signal to prevent the hazard.
2. Allow the incoming instruction through the pipe and distribute detection to all pipeline stages.