Published by Μεθόδιος Δημητρακόπουλος. Modified over 6 years ago.
1
The IEEE Floating Point Standard and execution units for it
1/8/ L24 IEEE Floating Point Basics Copyright Joanne DeGroat, ECE, OSU
2
Copyright 2006 - Joanne DeGroat, ECE, OSU
Lecture overview
The standard
Floating point basics
A floating point adder design
3
The floating point standard
Single Precision
Value of the bits stored in the representation (sign s, 8-bit biased exponent e, 23-bit fraction f):
If e = 255 and f /= 0, then v is NaN regardless of s
If e = 255 and f = 0, then v = (-1)^s x ∞
If 0 < e < 255, then v = (-1)^s x 2^(e-127) x (1.f), a normalized number
If e = 0 and f /= 0, then v = (-1)^s x 2^(-126) x (0.f), a denormalized number (allows for graceful underflow)
If e = 0 and f = 0, then v = (-1)^s x 0 (zero)
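The five single-precision cases can be sketched as a small decoder. This is a minimal illustration rather than part of the lecture; the function name `decode_single` is hypothetical, and it relies on the standard scale of 2^(-126) for denormalized numbers.

```python
import math

def decode_single(bits: int) -> float:
    """Decode a 32-bit IEEE 754 single-precision pattern case by case."""
    s = (bits >> 31) & 0x1        # sign bit
    e = (bits >> 23) & 0xFF       # 8-bit biased exponent
    f = bits & 0x7FFFFF           # 23-bit stored fraction
    sign = -1.0 if s else 1.0
    if e == 255:
        # All-ones exponent: NaN if any fraction bit is set, else infinity
        return math.nan if f != 0 else sign * math.inf
    if e == 0:
        # Denormalized (or zero when f == 0): no hidden 1, fixed scale 2^-126
        return sign * (f / 2**23) * 2.0**-126
    # Normalized: hidden leading 1, exponent bias of 127
    return sign * (1 + f / 2**23) * 2.0**(e - 127)
```

For example, `decode_single(0x42C80000)` evaluates the normalized case with e = 133 and f = 1001000…, which is 1.5625 x 2^6.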
4
The floating point standard
Double Precision
Value of the bits in the word representation (sign s, 11-bit biased exponent e, 52-bit fraction f):
If e = 2047 and f /= 0, then v is NaN regardless of s
If e = 2047 and f = 0, then v = (-1)^s x ∞
If 0 < e < 2047, then v = (-1)^s x 2^(e-1023) x (1.f), a normalized number
If e = 0 and f /= 0, then v = (-1)^s x 2^(-1022) x (0.f), a denormalized number (allows for graceful underflow)
If e = 0 and f = 0, then v = (-1)^s x 0 (zero)
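To see the double-precision fields of a concrete value, the bit pattern can be pulled apart with Python's `struct` module. This is only a sketch; `double_fields` is a hypothetical helper name, and it assumes Python's `float` is an IEEE 754 double, which holds on common platforms.

```python
import struct

def double_fields(x: float):
    """Return (s, e, f) for the IEEE 754 double-precision encoding of x."""
    # Reinterpret the 8 bytes of the double as one unsigned 64-bit integer
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    s = bits >> 63                    # 1 sign bit
    e = (bits >> 52) & 0x7FF          # 11-bit biased exponent
    f = bits & ((1 << 52) - 1)        # 52-bit stored fraction
    return s, e, f
```

For 100.0 this gives a biased exponent of 6 + 1023 = 1029 and a fraction field that starts with the bits 1001.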
5
The floating point standard
Notes on single and double precision
The leading 1 of the mantissa (1.f) is not stored for normalized numbers
Mantissa ….
The representation allows for +0 and -0, indicating the direction from which 0 was approached (a distinction that might matter if rounding was used)
Denormalized numbers allow graceful underflow towards 0
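These notes can be checked on concrete bit patterns. The snippet below is an illustrative sketch, with `single_bits` a hypothetical helper that shows how a value is stored as a single-precision word.

```python
import struct

def single_bits(x: float) -> int:
    """Bit pattern of x when stored as an IEEE 754 single."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

pos_zero = single_bits(0.0)       # all bits clear
neg_zero = single_bits(-0.0)      # only the sign bit differs from +0
one = single_bits(1.0)            # fraction field is all zeros: the leading 1 is implied
smallest_normal = 2.0**-126       # e = 1, f = 0
denormal = smallest_normal / 4    # 2^-128 falls below the normalized range
denormal_bits = single_bits(denormal)   # e = 0, f = 0100…: precision is lost gradually
```

Note that `0.0 == -0.0` still compares equal; only the stored sign bit distinguishes the two zeros.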
6
Conversion Examples
Converting from base 10 to the representation
Single precision example
Convert 100 (base 10)
Step 1: convert to binary: 100 (base 10) = 1100100 (base 2)
In a binary representation of the form 1.xxx, have 1100100 = 1.100100 x 2^6
7
Conversion Example Continued
1.1001 x 2^6 is binary for 100
Thus the exponent is 6
Biased exponent will be 6 + 127 = 133 = 10000101
Sign will be 0 for positive
Stored fractional part f will be 1001
Thus we have s e f = 0 10000101 1001000….
Grouped in fours: 0100 0010 1100 1000 0000 …., or 4 2 C 8 0 0 0 0 in hexadecimal
$42C8 0000 is the representation for 100
8
Another example
Representation for -175 (sign magnitude representation)
175 = 128 + 32 + 8 + 4 + 2 + 1 = 10101111
Or 1.0101111 x 2^7
S = 1
Exponent is 7 + 127 = 134 = 10000110
Fractional part f = 0101111
Representation 1 10000110 0101111000….
Or in hex $C32F 0000
9
A fractional example
Decimal value 0.25
Convert to binary: 0.01
In power form: 1.0 x 2^-2
Sign is +, so 0
Exponent is -2 + 127 = 125 = 01111101
Fractional part is 00000…
Representation is 0 01111101 00000…
And in hex $3E80 0000
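The hand conversions for 100, -175, and 0.25 follow the same recipe: write the magnitude as 1.f x 2^exp, bias the exponent, and pack the fields. A minimal sketch of that recipe follows; `encode_single` is a hypothetical name, and it assumes a finite, nonzero input that fits a normalized single exactly (no rounding is modeled).

```python
import math

def encode_single(x: float) -> int:
    """Encode x as a single-precision bit pattern using the hand method."""
    s = 1 if x < 0 else 0
    m, exp = math.frexp(abs(x))     # abs(x) = m * 2**exp with 0.5 <= m < 1
    m, exp = m * 2, exp - 1         # rescale to the 1.f form used on the slides
    e = exp + 127                   # biased exponent
    f = int((m - 1) * 2**23)        # drop the hidden 1, keep 23 fraction bits
    return (s << 31) | (e << 23) | f
```

Usage: `hex(encode_single(0.25))` should reproduce the pattern derived by hand for the fractional example, since the exponent is -2 + 127 = 125 and the fraction is all zeros.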
10
Converting back
Convert $C32F 0000 into decimal
Extract components from 1100 0011 0010 1111 0000 ….
S = 1
Exponent = 10000110 = 128 + 4 + 2 = 134; unbias: 134 - 127 = 7
f = 0101111, so mantissa is 1.0101111
Adjust mantissa by exponent (move binary point 7 places): 10101111
Or 128 + 32 + 8 + 4 + 2 + 1 = 175
Sign is negative, so -175
11
Another example
Convert $41C8 0000 to decimal
0100 0001 1100 1000 0000 ….
S is 0, so positive number
Exponent = 10000011 = 128 + 3 = 131; 131 - 127 = 4
f = 1001, so mantissa is 1.1001
With 4 binary positions have 11001 as the final number in binary, which is 25
12
Arithmetic with floating point numbers
Add op1 $42C8 0000 and op2 $41C8 0000
First divide into component parts
Op1 $42C8 0000 = 0100 0010 1100 1000 0000 ….
S = 0
E = 10000101 = 133; 133 - 127 = 6
Mop1 = 1.1001000…
Op2 $41C8 0000 = 0100 0001 1100 1000 0000 ….
E = 10000011 = 131; 131 - 127 = 4
Mop2 = 1.1001000…
13
Now add the mantissas
But first align the mantissas
Op1 1.1001000….
Op2 1.1001000….
Which is the smaller number and needs to be aligned? Op2, which has the smaller exponent
Exponent difference between op1 and op2 is 2
So shift op2 right by 2 binary places
Op2 becomes 0.011001000…
14
Add
Add op1 mantissa with the aligned op2 mantissa
1.100100…
0.011001…
Sum: 1.111101…
Result exponent is 6
Value is 1.111101 x 2^6, or 1111101 = 125
Values added were 100 and 25
15
Constructing Result Value
Sign 0
Exponent 6: E = 10000101 = 133; 133 - 127 = 6
Mantissa of result 1.111101
Fractional part 111101000….
Constructed value 0100 0010 1111 1010 0000 …. = $4 2 F A 0 0 0 0 (125)
16
Floating point representation of 125
Positive, so s is 0
125 = 1111101 = 1.111101 x 2^6
Exponent is 6 + 127 = 133 = 10000101
Fractional part from mantissa of 1.111101, or 111101
Constructed value $42FA 0000
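The addition steps (split into fields, align the smaller operand, add the mantissas, rebuild the word) can be sketched with integer arithmetic. This is an illustration under simplifying assumptions, not the adder design from the lecture: both operands are positive normalized singles, shifted-out bits are truncated rather than rounded, and exponent overflow is ignored. The function names are hypothetical.

```python
def split_single(bits: int):
    """Return (sign, unbiased exponent, 24-bit mantissa) of a normalized single."""
    s = bits >> 31
    e = ((bits >> 23) & 0xFF) - 127
    m = (bits & 0x7FFFFF) | (1 << 23)    # restore the hidden leading 1
    return s, e, m

def add_positive_singles(a_bits: int, b_bits: int) -> int:
    """Add two positive normalized singles, truncating instead of rounding."""
    _, ea, ma = split_single(a_bits)
    _, eb, mb = split_single(b_bits)
    if ea < eb:                          # make operand a the one with the larger exponent
        ea, ma, eb, mb = eb, mb, ea, ma
    mb >>= (ea - eb)                     # align: shift the smaller operand right
    m = ma + mb
    e = ea
    if m >> 24:                          # carry out of the 24-bit mantissa: renormalize
        m >>= 1
        e += 1
    return ((e + 127) << 23) | (m & 0x7FFFFF)
```

With the operands from the example, op2 is shifted right by 2 places before the add, and no renormalization is needed because the sum stays below 2.0.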
17
Multiplication example
Multiply op1 $42C8 0000 and op2 $41C8 0000
First divide into component parts
Op1 $42C8 0000 = 0100 0010 1100 1000 0000 ….
S = 0
E = 10000101 = 133; 133 - 127 = 6
Mop1 = 1.1001000…
Op2 $41C8 0000 = 0100 0001 1100 1000 0000 ….
E = 10000011 = 131; 131 - 127 = 4
Mop2 = 1.1001000…
18
Multiplication basics
Base 10 example: 3 x 10^2 * 1.1 x 10^2 = 3.3 x 10^4
Have 2 numbers, A x 2^ea and B x 2^eb
Multiply and get result = A*B x 2^(ea+eb)
19
So here
Have sign of both is +, so result is +
Exponent addition
Both exponents are biased as stored
If you add the stored binary exponents, you need to subtract the extra bias of 127
Or, using pencil and paper (or PowerPoint), can just add the unbiased exponent of one operand to the biased exponent of the other
Here have 133 + 4 = 137 = 10001001
20
The mantissas
Do a binary multiplication: 1.1001 * 1.1001
Partial products (ignoring the binary point): 11001, 11001000, and 110010000
Add them: 1001110001
Adjusting for the binary point (8 fractional places) have 10.01110001
21
Final result
Exponent is 137, or 10 unbiased
Mantissa is 10.01110001
Adjusted for exponent (move binary point 10 places)
Value is 100111000100
Or 2048 + 256 + 128 + 64 + 4 = 2500
And we were multiplying 100 * 25
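The multiplication steps can be sketched the same way: add the stored exponents and subtract one bias, multiply the 24-bit mantissas, then normalize. This is a hedged illustration with a hypothetical function name; it assumes positive normalized operands, truncates instead of rounding, and ignores exponent overflow and underflow. Unlike the slide, which leaves the mantissa as 10.0111… with exponent 10, the sketch normalizes to 1.00111… with exponent 11 before packing.

```python
def mul_positive_singles(a_bits: int, b_bits: int) -> int:
    """Multiply two positive normalized singles, truncating instead of rounding."""
    ea = (a_bits >> 23) & 0xFF
    eb = (b_bits >> 23) & 0xFF
    ma = (a_bits & 0x7FFFFF) | (1 << 23)   # 1.f as a 24-bit integer
    mb = (b_bits & 0x7FFFFF) | (1 << 23)
    e = ea + eb - 127                      # both stored exponents are biased: remove one bias
    p = ma * mb                            # up to 48 bits, with 46 fraction bits
    if p >> 47:                            # product in [2, 4): shift right, bump exponent
        p >>= 1
        e += 1
    f = (p >> 23) & 0x7FFFFF               # keep 23 fraction bits below the hidden 1
    return (e << 23) | f
```

For the example operands the biased exponent starts at 133 + 131 - 127 = 137 and becomes 138 after normalization, so the packed word represents 1.220703125 x 2^11 = 2500.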
22
Specification of a FPA
Floating Point Add/Subtract Unit
Specification
Inputs in IEEE 754 Double Precision
Must perform both addition and subtraction
Must handle the full floating point standard:
Normalized numbers
Not a Number values (NaNs)
+/- Infinity
Denormalized numbers
23
Specifications continued
Result will be an IEEE 754 Double Precision representation
Unit will correctly handle the invalid operation of adding +∞ and -∞, which gives NaN per the standard
Unit latches its inputs into registers from parallel 64-bit data busses
There is a separate signal line that indicates the operation: add or subtract
24
Specifications continued
Outputs
The correctly represented result
Flags that are output are:
Zero result
Overflow to infinity from normalized numbers as inputs
NaN result
Overshift (result is the larger of the two operands)
Denormalized result
Inexact (result was rounded)
Invalid operation for addition
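One way to get a feel for the seven flags is a behavioral model built on Python's own double-precision addition. The sketch below is an assumption-laden illustration, not the unit's actual logic: `fpa_flags` and its dictionary keys are hypothetical names, the overshift test is approximated as "the rounded result equals the larger operand", and the invalid flag only covers the addition of infinities with opposite signs.

```python
import math
import sys
from fractions import Fraction

def fpa_flags(a: float, b: float, subtract: bool = False) -> dict:
    """Return the add/subtract result together with a hypothetical flag set."""
    if subtract:
        b = -b                                   # subtraction is addition of the negated operand
    r = a + b
    finite_in = math.isfinite(a) and math.isfinite(b)
    # Compare against exact rational arithmetic; only meaningful for finite values
    inexact = (finite_in and math.isfinite(r)
               and Fraction(a) + Fraction(b) != Fraction(r))
    larger = a if abs(a) >= abs(b) else b
    return {
        "result": r,
        "zero": r == 0.0,
        "overflow": finite_in and math.isinf(r),
        "nan": math.isnan(r),
        "overshift": inexact and r == larger,    # assumed model of overshift
        "denormalized": r != 0.0 and abs(r) < sys.float_info.min,
        "inexact": inexact,
        "invalid": math.isinf(a) and math.isinf(b) and a != b,
    }
```

For instance, `fpa_flags(math.inf, math.inf, subtract=True)` models +∞ plus -∞ and sets both the invalid and NaN flags.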
25
High level block diagram
Basic architecture interface
Data: 64-bit A, B, and C busses
Control signals: Latch, Add/Sub, Asel, Drive
Condition flags output: 7 flag signals
Clocks: Phi1 and Phi2 (a 2-phase clocked architecture)
26
Start the VHDL
The entity interface
In the next lecture
27
Denormalized Example
Denormalized example (multiplication by 100)
$0010 0000 is 0000 0000 0001 0000 0000 ……, so f = 0010000…
Multiply: $0010 0000 * $42C8 0000 (100)
Change denormalized to 2^(e-127) form (was 2^(e-126))
S = 0
E = -127 (scale of 2^-127)
M = x.f = 0.010000……
28
Do multiplication
Had the FP representation of 100
S = 0
E = 6 (133 - 127)
M = 1.f = 1.1001
Multiply and get a result with
E = -127 + 6 = -121
M = 1.1001 * 0.01 = 0.011001
29
Renormalize
The values to store
S = 0
E = -121 for a mantissa of 0.011001 (stored e 6)
Adjust to normalized form
E = -123 for a mantissa of 1.1001 (stored e 4)
Construct value to store
S E F = 0 00000100 1001000……… = $0248 0000
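The denormalized multiplication can be cross-checked numerically. The sketch below assumes the denormalized operand is the pattern 0x00100000 (e = 0, f = 001000…); that value is an inference from the 0.01 mantissa in the 2^(e-127) form and the stored exponents 6 and 4, not something the slide text states outright. The helper names are hypothetical.

```python
import struct

def bits_to_single(bits: int) -> float:
    """Interpret a 32-bit pattern as an IEEE 754 single."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def single_to_bits(x: float) -> int:
    """Bit pattern of x when stored as an IEEE 754 single."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

denorm = bits_to_single(0x00100000)     # assumed operand: 0.001 x 2^-126 = 2^-129
hundred = bits_to_single(0x42C80000)    # 100 = 1.1001 x 2^6
product_bits = single_to_bits(denorm * hundred)
stored_e = (product_bits >> 23) & 0xFF  # biased exponent of the renormalized result
fraction = product_bits & 0x7FFFFF      # stored fraction, hidden 1 removed
```

Under that assumption the product is 1.1001 x 2^-123, a normalized number, so the stored exponent is -123 + 127 = 4 and the fraction field begins 1001.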