
Speeding up of pipeline segments © Fr Dr Jaison Mulerikkal CMI.




1 Speeding up of pipeline segments © Fr Dr Jaison Mulerikkal CMI

2 Speeding up of pipeline segments The processing speeds of pipeline segments are usually unequal. Consider the example below: three segments S1, S2 and S3 in series, with delays T1, T2 and T3 respectively.

3 Speeding up of pipeline segments If T1 = T3 = T and T2 = 3T, S2 becomes the bottleneck and we need to remove it. How? One method is to subdivide the bottleneck. Two possible divisions are:

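4 Speeding up of pipeline segments First Method: split S2 into two subsegments with delays T and 2T. [Figure: S1 (T) → first part of S2 (T) → second part of S2 (2T) → S3 (T)]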


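6 Speeding up of pipeline segments Second Method: split S2 into three subsegments, each with delay T. [Figure: S1 (T) → three subsegments of S2 (T each) → S3 (T)]

7 Speeding up of pipeline segments If the bottleneck is not sub-divisible, we can duplicate S2 in parallel. [Figure: S1 (T) feeds three parallel copies of S2 (3T each), which feed S3 (T)]

8 Speeding up of pipeline segments Control and synchronization are more complex in parallel segments.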
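
The effect of each option on the pipeline clock can be checked with a few lines of Python. This is only a sketch: it assumes the clock period of a linear pipeline equals the delay of its slowest segment, and that three parallel copies of S2 accept tasks in turn, so a result leaves the S2 group every 3T/3 = T on average. The function names are illustrative and not part of the slides.

```python
def clock_period(stage_delays):
    """Clock period of a linear pipeline: the delay of its slowest segment."""
    return max(stage_delays)


def replicated_interval(delay, copies):
    """Average time between results when `copies` identical units of the
    given delay take turns accepting tasks (assumed round-robin)."""
    return delay / copies


T = 1.0  # unit delay

original = [T, 3 * T, T]           # S1, S2, S3
first_method = [T, T, 2 * T, T]    # S2 split into T and 2T
second_method = [T, T, T, T, T]    # S2 split into three parts of T each

print(clock_period(original))       # 3.0
print(clock_period(first_method))   # 2.0
print(clock_period(second_method))  # 1.0

# Three parallel copies of S2 (3T each) between S1 and S3.
parallel_interval = max(T, replicated_interval(3 * T, 3), T)
print(parallel_interval)            # 1.0
```

Under these assumptions the second method and the parallel duplication both bring the interval between results down to T, while the first method only reaches 2T.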

9 Internal Forwarding and Register Tagging © Fr Dr Jaison Mulerikkal CMI

10 Internal Forwarding and Register Tagging Internal Forwarding: replacing unnecessary memory accesses with register-to-register transfers. Register Tagging: using tagged registers to exploit concurrent activities among multiple ALUs.

11 Internal Forwarding Memory access is slower than register-to-register operations. Performance can be enhanced by eliminating unnecessary memory accesses.

12 Internal Forwarding This concept can be explored in 3 directions: 1. Store-Fetch Forwarding 2. Fetch-Fetch Forwarding 3. Store-Store Forwarding

13 Store-Fetch Forwarding A store followed by a fetch of the same word can be replaced by one store and one register transfer.

14 Fetch-Fetch Forwarding Two fetches of the same word can be replaced by one fetch and one register transfer.

15 Store-Store Forwarding Two memory updates (stores) of the same word can be combined into one, since the second store overwrites the first.
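
The three forwarding rules can be sketched as a small rewrite pass over a list of operations. The tuple encoding, the `forward` function and the register and memory names are assumptions made for illustration. The pass only looks at adjacent pairs, and it assumes nothing else modifies the memory word or the registers in between.

```python
def forward(ops):
    """Apply store-fetch, fetch-fetch and store-store forwarding to
    adjacent pairs of operations.

    Each op is a tuple:
      ("store", mem, reg)   meaning  mem <- reg
      ("fetch", reg, mem)   meaning  reg <- mem
      ("move",  dst, src)   meaning  dst <- src  (register transfer)
    """
    out = []
    for op in ops:
        prev = out[-1] if out else None
        if prev and prev[0] == "store" and op[0] == "fetch" and prev[1] == op[2]:
            # Store-fetch: mem <- r1; r2 <- mem  becomes  mem <- r1; r2 <- r1
            out.append(("move", op[1], prev[2]))
        elif prev and prev[0] == "fetch" and op[0] == "fetch" and prev[2] == op[2]:
            # Fetch-fetch: r1 <- mem; r2 <- mem  becomes  r1 <- mem; r2 <- r1
            out.append(("move", op[1], prev[1]))
        elif prev and prev[0] == "store" and op[0] == "store" and prev[1] == op[1]:
            # Store-store: the second store overwrites the first, so drop the first
            out[-1] = op
        else:
            out.append(op)
    return out


print(forward([("store", "M", "R1"), ("fetch", "R2", "M")]))
print(forward([("fetch", "R1", "M"), ("fetch", "R2", "M")]))
print(forward([("store", "M", "R1"), ("store", "M", "R2")]))
```

In each case the number of memory accesses drops from two to one; the first two cases pay for it with a register transfer.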

16 Example

17 Original Dataflow


22 Register Tagging

23 Example : IBM Model 360/91 : Floating Point Execution Unit

24 Example : IBM Model 360/91-FPU The floating point execution unit consists of : – Data registers – Transfer paths – Floating Point Adder Unit – Multiply-Divide Unit – Reservation stations – Common Data Bus


26 Example : IBM Model 91-FPU There are 3 reservation stations for the adder, named A1, A2 and A3, and 2 for the multiplier, named M1 and M2. Each station has source and sink registers with their tag and control fields. The stations hold operands for the next execution.


28 Example : IBM Model 91-FPU The 3 store data buffers (SDBs) and 4 floating point registers (FLRs) are tagged. Busy bits in the FLRs indicate dependences among instructions in subsequent execution. The Common Data Bus (CDB) transfers operands.

29 Example : IBM Model 91-FPU There are 11 units that supply information to the CDB: 6 FLBs, 3 adders and 2 multiply/divide units. Tags for these stations are:
Unit  Tag    Unit  Tag
FLB1  0001   ADD1  1010
FLB2  0010   ADD2  1011
FLB3  0011   ADD3  1100
FLB4  0100   M1    1000
FLB5  0101   M2    1001
FLB6  0110

30 Example : IBM Model 91-FPU Internal forwarding can be achieved with the tagging scheme on the CDB. Example: Let F refer to an FLR and FLBi stand for the i-th FLB, with contents (F) and (FLBi). Consider the instruction sequence:
ADD F, FLB1    F ← (F) + (FLB1)
MPY F, FLB2    F ← (F) × (FLB2)

31 Example : IBM Model 91-FPU During addition: – The busy bit of F is set to 1 – The contents of F and FLB1 are sent to adder A1 – The tag of F is set to 1010 (tag of the adder) State of F: busy bit = 1, tag = 1010

32 [Figure: block diagram of the Model 91 floating point unit, showing the storage bus, instruction unit, Floating Point Operand Stack (FLOS), decoder, Floating Point Buffers (FLB) 1 to 6, store data buffers (SDB) 1 to 3, the adder and multiplier reservation stations with fields Tag | Sink | Tag | Source | CTRL, and the Common Data Bus. One adder station holds 1010 | F | 0001 | FLB1 | CTRL. Register F shows busy bit = 1, tag = 1010.]

33 Example : IBM Model 91-FPU Meanwhile, decoding MPY reveals that F is busy, so: – The current tag of F, 1010 (tag of the adder), is copied into M1, so that M1 waits for the adder result – F changes its tag to 1000 (tag of the multiplier) – The content of FLB2 is sent to M1 State of F: busy bit = 1, tag = 1000

34 [Figure: the same block diagram of the Model 91 floating point unit after MPY is decoded. One multiplier station holds 1000 | F | 0010 | FLB2 | CTRL. Register F shows busy bit = 1, tag = 1000.]
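
The ADD/MPY sequence can be traced with a toy model of the tagging scheme. This is a heavily simplified sketch, not a model of the actual Model 91 hardware: it assumes one register F, one adder station and one multiplier station, and the class names, function names and operand values (2.0, 3.0, 4.0) are invented for illustration. Only the tag values 1010 and 1000 come from the slides.

```python
class Register:
    def __init__(self, value):
        self.value = value
        self.busy = False
        self.tag = None        # tag of the unit that will produce the next value


class Station:
    """Reservation station: the sink operand is either a value or a
    tag to wait for on the common data bus."""
    def __init__(self, tag, op):
        self.tag = tag
        self.op = op
        self.sink = None       # sink operand value, once available
        self.sink_tag = None   # tag awaited on the CDB while the value is pending
        self.source = None


def issue(station, reg, source_value):
    """Issue `reg <- reg op source` to a station."""
    if reg.busy:
        station.sink_tag = reg.tag   # operand still being produced: forward its tag
    else:
        station.sink = reg.value
    station.source = source_value
    reg.busy = True
    reg.tag = station.tag            # reg now waits for this station's result


def execute(station):
    if station.op == "add":
        return station.sink + station.source
    return station.sink * station.source


def broadcast(tag, value, stations, reg):
    """Common data bus: deliver `value` to everything waiting on `tag`."""
    for s in stations:
        if s.sink_tag == tag:
            s.sink, s.sink_tag = value, None
    if reg.busy and reg.tag == tag:
        reg.value, reg.busy, reg.tag = value, False, None


F = Register(2.0)
FLB1, FLB2 = 3.0, 4.0
A1 = Station(0b1010, "add")
M1 = Station(0b1000, "mul")

issue(A1, F, FLB1)                   # ADD F, FLB1
tag_after_add = F.tag                # 0b1010
issue(M1, F, FLB2)                   # MPY F, FLB2: F is busy, so M1 waits on 1010
tag_after_mpy = F.tag                # 0b1000

broadcast(A1.tag, execute(A1), [A1, M1], F)   # sum goes straight to M1, not to F
value_after_add_done = F.value                # still 2.0
broadcast(M1.tag, execute(M1), [A1, M1], F)   # product is written to F
print(F.value, F.busy)                        # 20.0 False
```

The point of the trace is that the intermediate sum is never written to F: it travels from the adder to M1 over the bus, and F receives only the final product.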

35 Hazard Detection and Resolution

36 Hazards are caused by resource usage conflicts among various instructions. They are triggered by inter-instruction dependencies. Terminology: Resource objects: the set of working registers, memory locations and special flags.

37 Hazard Detection and Resolution Data objects: the contents of resource objects. Each instruction can be considered as a mapping from a set of data objects to a set of data objects. Domain D(I): the set of resource objects whose data objects may affect the execution of instruction I.

38 Hazard Detection and Resolution Range R(I): the set of resource objects whose data objects may be modified by the execution of instruction I. An instruction reads from its domain and writes to its range.

39 Hazard Detection and Resolution There are 3 types of data-dependent hazards: 1. RAW (Read After Write) 2. WAW (Write After Write) 3. WAR (Write After Read)

40 RAW (Read After Write) Consider execution of instructions I and J, and J appears immediately after I. Where, domain D(I) of an instruction I is the set of resource objects that may affect the execution of I The range R(I) of an instruction I is the set of resource objects whose data objects may be modified by the execution of instruction I.

41 RAW (Read After Write) RAW hazard between two instructions I & J may occur when J attempts to read some data object that has been modified by I

42 RAW (Read After Write) The necessary condition for this hazard is R(I) ∩ D(J) ≠ ∅

43 WAW (Write After Write) WAW may occur if both I and J attempt to modify the same data object.

44 WAW (Write After Write) The necessary condition is R(I) ∩ R(J) ≠ ∅

45 WAR (Write After Read) WAR may occur when J attempts to modify some data object that is read by I.

46 WAR (Write After Read) The necessary condition is D(I) ∩ R(J) ≠ ∅
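
A minimal sketch of hazard detection by comparing domains and ranges, modelled as Python sets. It uses the standard necessary conditions: RAW when R(I) ∩ D(J) is non-empty, WAW when R(I) ∩ R(J) is non-empty, and WAR when D(I) ∩ R(J) is non-empty. The function name, the register names and the sample instructions are illustrative only. These conditions are necessary, not sufficient, so the result lists hazards that may occur.

```python
def possible_hazards(domain_i, range_i, domain_j, range_j):
    """Return the hazard types that may occur between instruction I and a
    later instruction J, given their domains (reads) and ranges (writes)."""
    found = set()
    if range_i & domain_j:    # J reads something I writes
        found.add("RAW")
    if range_i & range_j:     # I and J write the same object
        found.add("WAW")
    if domain_i & range_j:    # J writes something I reads
        found.add("WAR")
    return found


# I: R1 <- R2 + R3
d_i, r_i = {"R2", "R3"}, {"R1"}

print(possible_hazards(d_i, r_i, {"R1", "R5"}, {"R4"}))  # J: R4 <- R1 * R5  gives {'RAW'}
print(possible_hazards(d_i, r_i, {"R6"}, {"R1"}))        # J: R1 <- R6       gives {'WAW'}
print(possible_hazards(d_i, r_i, {"R6"}, {"R2"}))        # J: R2 <- R6       gives {'WAR'}
print(possible_hazards(d_i, r_i, {"R6"}, {"R7"}))        # J: R7 <- R6       gives set()
```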

47 Hazard Detection and Resolution Hazards can be detected in the fetch stage by comparing the domains and ranges of the instructions involved. Once a hazard is detected, there are two ways to resolve it: 1. Stop the pipe and suspend the execution of instructions J, J+1, etc. until instruction I has passed the point of resource conflict. 2. Generate a warning signal to prevent the hazard, and distribute detection to all pipeline stages.

48 Arithmetic Pipeline Design

49 Arithmetic Pipelines Pipelined arithmetic units are usually found in very high speed computers. They are used to implement floating point operations, multiplication of fixed point numbers and similar computations encountered in scientific problems.


51 Arithmetic Pipelines Arithmetic pipelines differ from instruction pipelines in some important ways. – They are generally synchronous: each stage executes in a fixed number of clock cycles. – They may be nonlinear: instead of a steady progression through a fixed sequence of stages, a task in a nonlinear pipeline may use more than one stage at a time, or may return to the same stage at several points in processing.

52 Pipelining of Floating Point Adder Unit

53 Floating Point Adder Unit This pipeline is linearly constructed with 4 functional stages. The inputs to this pipeline are two normalized floating point numbers of the form A = a × 10^p and B = b × 10^q, where a and b are two fractions (mantissas) and p and q are their exponents. For simplicity, base 10 is assumed.

54 Floating Point Adder Unit Our purpose is to compute the sum C = A + B = c × 10^r = d × 10^s, where r = max(p, q) and 0.1 ≤ d < 1. For example: A = 0.9504 × 10^3 and B = 0.8200 × 10^2, so a = 0.9504, b = 0.8200, p = 3 and q = 2.

55 4 steps 1. Find r = max(p, q) and t = |p – q|. 2. Shift right the fraction associated with the smaller exponent by t units. 3. Perform fixed-point addition of the two fractions to produce the intermediate sum fraction c. 4. Count the number of leading zeros (u) in fraction c and shift c left by u units to produce the normalized fraction sum.


57 Floating point adder 1.Input the original fractions and exponents. Compute the larger exponent and the exponent difference. Shift the fraction corresponding to the smaller exponent right for a number of places equal to the difference. Both fractions are now adjusted to match the same (larger) exponent. Output the exponent and the two fractions.

58 Floating point adder 2.Add the two fractions, producing a sum. Pass through the exponent unchanged. Output the exponent and fractions. 3.Count leading zeros in the result fraction. Shift the fraction to normalize. Output the original exponent, the fraction, and the count. 4.Add the exponent and count. Output the adjusted exponent and the normalized fraction.


60 Floating Point Adder Unit Operations performed in the four pipeline stages are: 1. Compare p and q, choose the larger exponent r = max(p, q), and compute t = |p – q|. Example: r = max(p, q) = 3, t = |p – q| = |3 – 2| = 1

61 Floating Point Adder Unit 2. Shift right the fraction associated with the smaller exponent by t units to equalize the two exponents before fraction addition. Example: B has the smaller exponent, so b = 0.8200 is shifted right by 1 unit, giving 0.0820.

62 Floating Point Adder Unit 3. Perform fixed-point addition of the two fractions to produce the intermediate sum fraction c. Without a carry, 0 ≤ c < 1; a carry out of the fraction, as in this example, gives c ≥ 1. Example: a = 0.9504, b = 0.0820, c = a + b = 0.9504 + 0.0820 = 1.0324

63 Floating Point Adder Unit 4. Count the number of leading zeros (u) in fraction c and shift c left by u units to produce the normalized fraction sum d = c × 10^u, with a nonzero leading digit. Update the larger exponent r by computing s = r – u to produce the output exponent. Example: c = 1.0324, so u = –1 (a right shift by one digit), d = 0.10324, s = r – u = 3 – (–1) = 4, and C = 0.10324 × 10^4

64 Floating Point Adder Unit The above 4 steps can all be implemented with combinational logic circuits, and the 4 stages are: 1. Comparator / Subtractor 2. Shifter 3. Fixed Point Adder 4. Normalizer (leading zero counter and shifter)
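
The four stages can be sketched as four Python functions chained together, one per stage. This is an illustration in software, not the combinational hardware the slides describe. It assumes base 10 and exact decimal fractions via the standard `decimal` module, and the function names are invented here. The normalizer also handles the carry case as a right shift with u = –1, as in the worked example.

```python
from decimal import Decimal


def stage1_compare(a, p, b, q):
    """Comparator / subtractor: r = max(p, q), t = |p - q|."""
    return a, p, b, q, max(p, q), abs(p - q)


def stage2_align(a, p, b, q, r, t):
    """Shifter: shift right the fraction with the smaller exponent by t digits."""
    if p < q:
        a = a.scaleb(-t)
    else:
        b = b.scaleb(-t)   # when p == q, t is 0 and nothing moves
    return a, b, r


def stage3_add(a, b, r):
    """Fixed point adder: intermediate sum fraction c."""
    return a + b, r


def stage4_normalize(c, r):
    """Normalizer: shift c so that 0.1 <= |d| < 1 and return (d, s = r - u)."""
    if c == 0:
        return c, r
    u = 0
    while abs(c) >= 1:                # carry: right shift, u goes negative
        c = c.scaleb(-1)
        u -= 1
    while abs(c) < Decimal("0.1"):    # leading zeros: left shift
        c = c.scaleb(1)
        u += 1
    return c, r - u


def fp_add(a, p, b, q):
    """Run one pair of operands through the four stages in order."""
    return stage4_normalize(*stage3_add(*stage2_align(*stage1_compare(a, p, b, q))))


d, s = fp_add(Decimal("0.9504"), 3, Decimal("0.8200"), 2)
print(d == Decimal("0.10324"), s)  # True 4
```

In a real pipeline each function would be a separate segment with a latch after it, so that four different operand pairs could occupy the four stages at once.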

65 4-STAGE FLOATING POINT ADDER

66 Arithmetic Pipeline - Floating-point adder [Figure: four-segment floating-point adder pipeline with inputs A, B (exponents a, b and mantissas) and registers R between segments. Segment 1: compare exponents by subtraction. Segment 2: choose exponent, align mantissas. Segment 3: add mantissas. Segment 4: normalize result, adjust exponent. Worked example: X = 0.9504 × 10^3, Y = 0.8200 × 10^2; difference = 3 – 2 = 1; aligned Y = 0.082 × 10^3; S = 0.9504 + 0.082 = 1.0324; result 0.10324 × 10^4.]


68 Floating point adder As an example of a pipelined arithmetic unit we consider a floating point adder. This pipeline accepts as input two normalized floating point numbers of the form: – A = a × 2^p – B = b × 2^q

69 Floating point adder Here a and b are 2's complement fractions in the range 0.5<f<1.0. p and q are corresponding base 2 exponents. The normalized sum is to be computed. Four stages can be identified for this pipeline.

70 Floating point adder 1.Input the original fractions and exponents. Compute the larger exponent and the exponent difference. Shift the fraction corresponding to the smaller exponent right for a number of places equal to the difference. Both fractions are now adjusted to match the same (larger) exponent. Output the exponent and the two fractions.

71 Floating point adder 2.Add the two fractions, producing a sum. Pass through the exponent unchanged. Output the exponent and fractions. 3.Count leading zeros in the result fraction. Shift the fraction to normalize. Output the original exponent, the fraction, and the count. 4.Add the exponent and count. Output the adjusted exponent and the normalized fraction.
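
The base 2 staging listed above can be sketched the same way. Note that here alignment happens in stage 1 and the exponent adjustment is a separate stage 4. The sketch makes several assumptions beyond the slides: fractions are exact `fractions.Fraction` values rather than 2's complement words, and the count passed from stage 3 to stage 4 is a signed exponent adjustment (+1 for each right shift after a carry, –1 for each left shift over a leading zero), so that stage 4 can simply add it to the exponent. The function names are illustrative.

```python
from fractions import Fraction


def stage1(a, p, b, q):
    """Pick the larger exponent and align both fractions to it."""
    r = max(p, q)
    return r, a / 2 ** (r - p), b / 2 ** (r - q)


def stage2(r, a, b):
    """Add the two fractions; pass the exponent through unchanged."""
    return r, a + b


def stage3(r, c):
    """Normalize the fraction to 0.5 <= |c| < 1 and output the signed count."""
    count = 0
    if c == 0:
        return r, c, count
    while abs(c) >= 1:               # carry out: shift right
        c /= 2
        count += 1
    while abs(c) < Fraction(1, 2):   # leading zero: shift left
        c *= 2
        count -= 1
    return r, c, count


def stage4(r, c, count):
    """Add the exponent and the count."""
    return c, r + count


def fp_add2(a, p, b, q):
    """Run one pair of operands through the four stages in order."""
    return stage4(*stage3(*stage2(*stage1(a, p, b, q))))


# 0.75 * 2^3 + 0.5 * 2^1 = 6 + 1 = 7 = 0.875 * 2^3
print(fp_add2(Fraction(3, 4), 3, Fraction(1, 2), 1))
# 0.75 * 2^3 + 0.75 * 2^3 = 12 = 0.75 * 2^4
print(fp_add2(Fraction(3, 4), 3, Fraction(3, 4), 3))
```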

