Speeding up of pipeline segments © Fr Dr Jaison Mulerikkal CMI

Speeding up of pipeline segments The processing speeds of pipeline segments are usually unequal. Consider the example below: a three-segment pipeline with segments S1, S2, S3 taking times T1, T2, T3 respectively.

Speeding up of pipeline segments If T1 = T3 = T and T2 = 3T, S2 becomes the bottleneck and we need to remove it. How? One method is to subdivide the bottleneck. Two possible divisions are:

Speeding up of pipeline segments First method: subdivide S2 into two sub-segments of delays T and 2T, giving the stage sequence S1 (T), S2a (T), S2b (2T), S3 (T). The clock is still limited by the 2T sub-stage.

Speeding up of pipeline segments Second method: subdivide S2 into three sub-segments of delay T each, giving S1 (T), S2a (T), S2b (T), S2c (T), S3 (T). All stages are now balanced at T.

Speeding up of pipeline segments If the bottleneck is not sub-divisible, we can duplicate S2 in parallel: S1 (T) feeds three parallel copies of S2 (3T each), which feed S3 (T). Tasks are distributed among the copies in rotation, so on average a new task enters every T.

Speeding up of pipeline segments Control and synchronization are more complex with parallel segments. A rough comparison of the arrangements above is sketched below.
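
As a rough illustration (not part of the original slides), the following Python sketch compares the effective cycle time of the arrangements discussed above, with T normalized to 1:

```python
# Compare effective cycle times for the three ways of handling a bottleneck.
T = 1.0

def cycle_time(stage_delays):
    """A linear pipeline is clocked at the delay of its slowest stage."""
    return max(stage_delays)

# Original pipeline: S1 = T, S2 = 3T, S3 = T; the clock is limited by 3T.
original = [T, 3 * T, T]
# First subdivision of S2 into T + 2T: the 2T sub-stage still dominates.
subdivided_1 = [T, T, 2 * T, T]
# Second subdivision of S2 into T + T + T: all stages balanced.
subdivided_2 = [T, T, T, T, T]

for name, stages in [("original", original),
                     ("S2 -> T+2T", subdivided_1),
                     ("S2 -> T+T+T", subdivided_2)]:
    ct = cycle_time(stages)
    print(f"{name:12s} cycle time = {ct}T, throughput = {1/ct:.2f} tasks/T")

# Replication: if S2 (3T) is not subdivisible, three parallel copies accept
# tasks in rotation, so a new task can still enter every T on average.
copies = 3
effective_interval = (3 * T) / copies
print(f"replicated   effective initiation interval = {effective_interval}T")
```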

Internal Forwarding and Register Tagging © Fr Dr Jaison Mulerikkal CMI

Internal Forwarding and Register Tagging Internal forwarding: replacing unnecessary memory accesses by register-to-register transfers. Register tagging: the use of tagged registers to exploit concurrent activities among multiple ALUs.

Internal Forwarding Memory access is slower than register-to-register operations, so performance can be enhanced by eliminating unnecessary memory accesses.

Internal Forwarding This concept can be explored in 3 directions:
1. Store-fetch forwarding
2. Fetch-fetch forwarding
3. Store-store forwarding

Store-Fetch Forwarding A store followed by a fetch of the same word is replaced by the store and a register-to-register transfer.

Fetch-Fetch Forwarding Here, two fetches of the same word are replaced by one fetch and one register transfer.

Store-Store Forwarding Here, two memory updates (stores) of the same word can be combined into one, since the second store overwrites the first. A peephole-style sketch of all three rewrites follows.
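
The following minimal Python sketch illustrates the three rewrites as a peephole pass over a toy instruction list; the STO/LDA/MOVE mnemonics and tuple encoding are assumptions for illustration, not the slides' notation:

```python
# Instructions are tuples: ("STO", M, R) stores register R to memory word M,
# ("LDA", R, M) loads M into R, ("MOVE", R2, R1) copies R1 to R2.

def forward(prog):
    out = []
    for ins in prog:
        prev = out[-1] if out else None
        # Store-fetch: STO M,R1 ; LDA R2,M  ->  STO M,R1 ; MOVE R2,R1
        if prev and prev[0] == "STO" and ins[0] == "LDA" and ins[2] == prev[1]:
            out.append(("MOVE", ins[1], prev[2]))
        # Fetch-fetch: LDA R1,M ; LDA R2,M  ->  LDA R1,M ; MOVE R2,R1
        elif prev and prev[0] == "LDA" and ins[0] == "LDA" and ins[2] == prev[2]:
            out.append(("MOVE", ins[1], prev[1]))
        # Store-store: adjacent STO M,R1 ; STO M,R2  ->  STO M,R2
        # (safe here because nothing fetches M between the two stores)
        elif prev and prev[0] == "STO" and ins[0] == "STO" and ins[1] == prev[1]:
            out[-1] = ins
        else:
            out.append(ins)
    return out

prog = [("STO", "M", "R1"), ("LDA", "R2", "M"),   # store-fetch pair
        ("STO", "M", "R3"), ("STO", "M", "R4")]   # store-store pair
print(forward(prog))
# [('STO', 'M', 'R1'), ('MOVE', 'R2', 'R1'), ('STO', 'M', 'R4')]
```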

Example

Original Dataflow

Register Tagging

Example : IBM Model 360/91 : Floating Point Execution Unit

Example : IBM Model 360/91-FPU The floating point execution unit consists of:
– Data registers
– Transfer paths
– Floating Point Adder Unit
– Multiply-Divide Unit
– Reservation stations
– Common Data Bus

Example : IBM Model 91-FPU There are 3 reservation stations for the adder, named A1, A2 and A3, and 2 for the multiplier, named M1 and M2. Each station has source and sink registers together with their tag and control fields. The stations hold operands for the next execution.

Example : IBM Model 91-FPU 3 store data buffers (SDBs) and 4 floating point registers (FLRs) are tagged. A busy bit in an FLR indicates that instructions in subsequent execution depend on it. The Common Data Bus (CDB) transfers operands and results.

Example : IBM Model 91-FPU There are 11 units that supply information to the CDB: 6 FLBs, 3 adders & 2 multiply/divide units. The tags for these units are:

Unit   Tag    Unit   Tag
FLB1   0001   ADD1   1010
FLB2   0010   ADD2   1011
FLB3   0011   ADD3   1100
FLB4   0100   M1     1000
FLB5   0101   M2     1001
FLB6   0110

Example : IBM Model 91-FPU Internal forwarding can be achieved with the tagging scheme on the CDB. Example: let F refer to an FLR, let FLBi stand for the i-th FLB, and let their contents be (F) and (FLBi). Consider the instruction sequence:

ADD F, FLB1    F ← (F) + (FLB1)
MPY F, FLB2    F ← (F) × (FLB2)

Example : IBM Model 91-FPU During the addition:
– The busy bit of F is set to 1
– The contents of F and FLB1 are sent to adder station A1
– The tag of F is set to 1010 (the adder's tag)

F: busy bit = 1, tag = 1010

[Figure: Model 91 FPU dataflow. The instruction unit and decoder feed the floating point operand stack (FLOS); the floating point buffers (FLB), 3 store data buffers (SDB) and the FLRs connect through the Common Data Bus to the adder and multiplier reservation stations, each with tag/sink/tag/source/control fields. Adder station A1 now holds sink F (tag 1010) and source FLB1 (tag 0001); F shows busy bit = 1, tag = 1010.]

Example : IBM Model 91-FPU Meanwhile, the decode of MPY reveals that F is busy, so:
– F sets the tag of M1 to 1010 (the adder's tag), marking the adder as the producer of M1's pending operand
– F changes its own tag to 1000 (the multiplier's tag), since the multiply produces F's final value
– The content of FLB2 is sent to M1

F: busy bit = 1, tag = 1000

[Figure: the same dataflow after decoding MPY. Multiplier station M1 holds sink F and source FLB2 (tag 0010), with its pending tag set as described above; F shows busy bit = 1, tag = 1000. When the adder later broadcasts its result with tag 1010 on the CDB, M1 captures it directly, without going through F.]
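
The following Python sketch (an assumed, much simplified model of the tagging scheme, not the actual Model 91 hardware logic) traces the ADD/MPY sequence above, showing how the multiplier picks up the adder's result from the CDB by tag match rather than through F:

```python
# Tag-based forwarding over a common data bus: a "busy" register holds the
# tag of the unit that will produce its value; every waiting consumer snoops
# the CDB and grabs the broadcast value when the matching tag appears.

class Register:
    def __init__(self, value):
        self.value, self.busy, self.tag = value, False, None

ADDER_TAG, MULT_TAG = 0b1010, 0b1000   # tags from the table above

F = Register(2.0)        # a floating point register (FLR); value assumed
FLB1, FLB2 = 3.0, 4.0    # floating point buffer contents; values assumed

# Decode ADD F, FLB1: F <- (F) + (FLB1). The adder station captures its
# operands, and F is marked busy with the adder's tag.
add_operands = (F.value, FLB1)
F.busy, F.tag = True, ADDER_TAG

# Decode MPY F, FLB2: F is busy, so station M1 records the adder's tag as
# its pending source instead of reading F; F is re-tagged with the
# multiplier's tag, since the multiply produces F's final value.
m1 = {"pending_tag": F.tag, "sink": None, "source": FLB2}
F.tag = MULT_TAG

# The adder finishes and broadcasts (tag, result) on the CDB.
cdb = (ADDER_TAG, add_operands[0] + add_operands[1])
if m1["pending_tag"] == cdb[0]:
    m1["sink"] = cdb[1]                # forwarded without touching F

# The multiplier fires and broadcasts; F snoops the bus for its own tag.
cdb = (MULT_TAG, m1["sink"] * m1["source"])
if F.busy and F.tag == cdb[0]:
    F.value, F.busy, F.tag = cdb[1], False, None

print(F.value)   # (2.0 + 3.0) * 4.0 = 20.0
```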

Hazard Detection and Resolution

Hazards are caused by resource usage conflicts among various instructions. They are triggered by inter-instruction dependencies. Terminology: Resource objects: the set of working registers, memory locations and special flags.

Hazard Detection and Resolution Data objects: the contents of resource objects. Each instruction can be considered a mapping from a set of data objects to a set of data objects. Domain D(I): the set of resource objects whose data objects may affect the execution of instruction I.

Hazard Detection and Resolution Range R(I): the set of resource objects whose data objects may be modified by the execution of instruction I. An instruction reads from its domain and writes into its range.

Hazard Detection and Resolution There are 3 types of data-dependent hazards:
1. RAW (Read After Write)
2. WAW (Write After Write)
3. WAR (Write After Read)

RAW (Read After Write) Consider the execution of instructions I and J, where J appears immediately after I. Recall that the domain D(I) of an instruction I is the set of resource objects that may affect the execution of I, and the range R(I) is the set of resource objects whose data objects may be modified by the execution of I.

RAW (Read After Write) A RAW hazard between two instructions I & J may occur when J attempts to read some data object that has been modified by I.

RAW (Read After Write) The necessary condition for this hazard is R(I) ∩ D(J) ≠ ∅.

WAW (Write After Write) A WAW hazard may occur if both I & J attempt to modify the same data object.

WAW (Write After Write) The necessary condition is R(I) ∩ R(J) ≠ ∅.

WAR (Write After Read) A WAR hazard may occur when J attempts to modify some data object that is read by I.

WAR (Write After Read) The necessary condition is D(I) ∩ R(J) ≠ ∅.

Hazard Detection and Resolution Hazards can be detected in the fetch stage by comparing the domain and range of the incoming instruction with those of the instructions in the pipe. Once a hazard is detected, there are a few methods:
1. Stop the pipe and suspend the execution of instructions J, J+1, etc. until instruction I has passed the point of resource conflict.
2. Generate a warning signal to prevent the hazard and distribute detection across all pipeline stages.
A small sketch of domain/range-based detection follows.
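
A minimal Python sketch of this detection rule, using sets for the domains and ranges (the example instructions are hypothetical):

```python
# Detect the three hazard types by intersecting the domain D(I) (objects
# read) and range R(I) (objects written) of instructions I and J, where J
# is issued after I.

def hazards(D_I, R_I, D_J, R_J):
    found = []
    if R_I & D_J:
        found.append("RAW")   # J reads something I writes
    if R_I & R_J:
        found.append("WAW")   # J writes something I also writes
    if D_I & R_J:
        found.append("WAR")   # J writes something I reads
    return found

# I:  R1 <- R2 + R3      D(I) = {R2, R3},  R(I) = {R1}
# J:  R4 <- R1 + R2      D(J) = {R1, R2},  R(J) = {R4}
print(hazards({"R2", "R3"}, {"R1"}, {"R1", "R2"}, {"R4"}))   # ['RAW']
```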

Arithmetic Pipeline Design

Arithmetic Pipelines Pipelined arithmetic units are usually found in very high speed computers. They are used to implement floating point operations, multiplication of fixed point numbers, and similar computations encountered in scientific problems.

Arithmetic Pipelines Arithmetic pipelines differ from instruction pipelines in some important ways:
– They are generally synchronous: each stage executes in a fixed number of clock cycles.
– They may be nonlinear: instead of a steady progression through a fixed sequence of stages, a task in a nonlinear pipeline may use more than one stage at a time, or may return to the same stage at several points in processing.

Pipelining of Floating Point Adder Unit

Floating Point Adder Unit This pipeline is linearly constructed with 4 functional stages. The inputs to this pipeline are two normalized floating point numbers of the form A = a × 10^p and B = b × 10^q, where a and b are two fractions (mantissas) and p and q are their exponents. For simplicity, base 10 is assumed.

Floating Point Adder Unit Our purpose is to compute the sum C = A + B = c × 10^r = d × 10^s, where r = max(p, q) and 0.1 ≤ d < 1. For example:
A = 0.9504 × 10^3, B = 0.8200 × 10^2
a = 0.9504, b = 0.8200, p = 3 & q = 2

4 steps
1. Find r = max(p, q) and t = |p − q|.
2. Shift right the fraction associated with the smaller exponent by t units.
3. Perform fixed-point addition of the two fractions to produce the intermediate sum fraction c.
4. Count the number of leading zeros (u) in fraction c and shift c left by u units to produce the normalized fraction sum.

Floating Point Adder Unit Operations performed in the four pipeline stages are:
1. Compare p and q, choose the larger exponent r = max(p, q), and compute t = |p − q|.
Example: r = max(p, q) = 3, t = |p − q| = |3 − 2| = 1

Floating Point Adder Unit 2. Shift right the fraction associated with the smaller exponent by t units to equalize the two exponents before fraction addition.
Example: the fraction with the smaller exponent is b = 0.8200; shifting it right by 1 unit gives 0.08200.

Floating Point Adder Unit 3. Perform fixed-point addition of the two fractions to produce the intermediate sum fraction c (which may temporarily exceed 1, as here).
Example: a = 0.9504, b = 0.08200, c = a + b = 0.9504 + 0.08200 = 1.0324

Floating Point Adder Unit 4. Count the number of leading zeros (u) in fraction c and shift c left by u units to produce the normalized fraction sum d = c × 10^u, with a nonzero leading digit. Update the exponent by s = r − u to produce the output exponent.
Example: c = 1.0324, so u = −1 (one right shift); d = 0.10324, s = r − u = 3 − (−1) = 4; C = 0.10324 × 10^4

Floating Point Adder Unit The above 4 steps can all be implemented with combinational logic circuits, and the 4 stages are:
1. Comparator / subtractor
2. Shifter
3. Fixed-point adder
4. Normalizer (leading zero counter and shifter)
A minimal software rendering of the four stages follows.
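
For illustration, the following Python sketch (not from the slides) renders the four stages in software and reproduces the worked example:

```python
# Four-stage floating point addition on normalized decimal fractions
# a x 10^p and b x 10^q, as described in the slides.

def fp_add(a, p, b, q):
    # Stage 1: comparator/subtractor -- r = max(p, q), t = |p - q|
    r, t = max(p, q), abs(p - q)
    # Stage 2: shifter -- shift the fraction with the smaller exponent
    # right by t digits
    if p < q:
        a = a / 10 ** t
    else:
        b = b / 10 ** t
    # Stage 3: fixed-point adder -- intermediate sum fraction c
    c = a + b
    # Stage 4: normalizer -- shift so 0.1 <= d < 1 and adjust the exponent
    u = 0
    while c >= 1.0:           # e.g. c = 1.0324 needs one right shift (u = -1)
        c, u = c / 10, u - 1
    while 0 < c < 0.1:        # leading zeros need left shifts (u > 0)
        c, u = c * 10, u + 1
    return c, r - u           # d and s = r - u

d, s = fp_add(0.9504, 3, 0.8200, 2)
print(f"{d:.5f} x 10^{s}")    # 0.10324 x 10^4
```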

4-STAGE FLOATING POINT ADDER

[Figure: Arithmetic pipeline for a floating-point adder, with latches (R) between segments. Segment 1: compare exponents by subtraction (difference = 3 − 2 = 1) and choose the larger exponent. Segment 2: align the mantissas. Segment 3: add the mantissas. Segment 4: normalize the result and adjust the exponent. Worked example: X = 0.9504 × 10^3, Y = 0.8200 × 10^2; S = 0.9504 + 0.0820 = 1.0324, normalized to 0.10324 × 10^4.]

Floating point adder As an example of a pipelined arithmetic unit, we consider a floating point adder. This pipeline accepts as input two normalized floating point numbers of the form:
– A = a × 2^p
– B = b × 2^q

Floating point adder Here a and b are 2's complement fractions with magnitudes in the range 0.5 < |f| < 1.0, and p and q are the corresponding base-2 exponents. The normalized sum is to be computed. Four stages can be identified for this pipeline.

Floating point adder
1. Input the original fractions and exponents. Compute the larger exponent and the exponent difference. Shift the fraction corresponding to the smaller exponent right by a number of places equal to the difference. Both fractions are now adjusted to the same (larger) exponent. Output the exponent and the two fractions.

Floating point adder
2. Add the two fractions, producing a sum. Pass the exponent through unchanged. Output the exponent and fractions.
3. Count leading zeros in the result fraction. Shift the fraction to normalize. Output the original exponent, the fraction, and the count.
4. Add the exponent and count. Output the adjusted exponent and the normalized fraction.
A brief base-2 sketch of these stages follows.
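
For comparison, a brief Python sketch of the base-2 version; the operand values are made up for illustration:

```python
# Base-2 variant of the same four stages, with fractions normalized so
# that 0.5 <= |f| < 1.

def fp_add2(a, p, b, q):
    r, t = max(p, q), abs(p - q)                  # stage 1: exponents
    if p < q:                                     # stage 2: align fractions
        a /= 2 ** t
    else:
        b /= 2 ** t
    c = a + b                                     # stage 3: fixed-point add
    u = 0
    while abs(c) >= 1.0:                          # stage 4: normalize so
        c, u = c / 2, u - 1                       # that 0.5 <= |c| < 1
    while 0 < abs(c) < 0.5:
        c, u = c * 2, u + 1
    return c, r - u

print(fp_add2(0.75, 3, 0.625, 1))   # 0.75*2^3 + 0.625*2^1 -> (0.90625, 3)
```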