Data Representation: Floating Point for Real Numbers
Computer Organization and Assembly Language: Module 11
Floating Point Representation
• The IEEE-754 Floating Point Standard is a widely used floating point representation, one of many alternative formats.
• The representation of a floating point number contains:
    a mantissa (a variant of a scaled, sign-magnitude integer)
    an exponent (in single precision, an 8-bit, biased-127 integer)
• In this way floating point representation resembles scientific notation. Any positive number N can be represented as M*10^e, where
    e = floor(log10 N)
    M = N/(10^e)
    1 ≤ M < 10
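As a small illustration of the scientific-notation relations above, the following Python sketch computes e and M for a positive number. The function name `scientific` is made up for this example, and the printed mantissa is subject to ordinary floating point rounding.

```python
import math


def scientific(n: float) -> tuple[float, int]:
    """Split a positive number n into (M, e) with n = M * 10**e and 1 <= M < 10."""
    if n <= 0:
        raise ValueError("this sketch only handles positive numbers")
    e = math.floor(math.log10(n))   # e = floor(log10 N)
    m = n / 10**e                   # M = N / 10^e
    return m, e


m, e = scientific(22.625)
print(m, e)   # mantissa close to 2.2625, exponent 1
```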
Floating Point Representation
• A number N represented in floating point is determined by its mantissa m, its exponent e, and its sign s:
    N = (-1)^s * m * 2^e
  If the sign is negative, s = 1. If the sign is positive, s = 0. The mantissa is normalized, i.e., 1 ≤ m < 2.
• In the IEEE-754 single precision format, the mantissa is represented with 23 bits (only the fractional part is stored):
    m = 1.f_22 f_21 … f_1 f_0
• Double precision floating point works the same way, but the bit fields are larger: 1-bit sign, 11-bit exponent, 52 bits for the fractional part of the mantissa.
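One way to inspect the sign, mantissa, and exponent of an actual number is Python's `math.frexp`, which returns a fraction in [0.5, 1) together with a power of two. The sketch below rescales that result to the 1 ≤ m < 2 convention used here. The helper name `normalize` is an assumption of this example, not part of any standard.

```python
import math


def normalize(x: float) -> tuple[int, float, int]:
    """Return (s, m, e) with x = (-1)**s * m * 2**e and 1 <= m < 2."""
    if x == 0 or math.isinf(x) or math.isnan(x):
        raise ValueError("only finite, nonzero numbers can be normalized")
    s = 1 if x < 0 else 0
    frac, exp = math.frexp(abs(x))   # abs(x) = frac * 2**exp, 0.5 <= frac < 1
    return s, frac * 2, exp - 1      # shift one bit so that 1 <= m < 2


print(normalize(22.625))   # (0, 1.4140625, 4)
print(normalize(-0.75))    # (1, 1.5, -1)
```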
Conversion to base-2
1. Break the decimal number into two parts: an integer and a fraction.
2. Convert the integer into binary and place it to the left of the binary point.
3. Convert the fraction into binary and place it to the right of the binary point.
4. Write it in base-2 scientific notation and normalize.
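Example: Convert 22.625 to floating point representation
1. Convert 22 to binary: 22 = 10110_2
2. Convert .625 to binary:
     2 * .625 = 1.25  → 1
     2 * .25  = 0.5   → 0
     2 * .5   = 1.0   → 1
   Thus .625 = .101_2
3. Thus 22.625 = 10110.101_2
4. In base-2 scientific notation: 10110.101_2 * 2^0
   Normalized form: 1.0110101_2 * 2^4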
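The conversion steps can be sketched in Python as repeated division by 2 for the integer part and repeated multiplication by 2 for the fraction. The function names and the `max_bits` cutoff are choices made for this illustration. The sketch assumes a positive input whose leading 1 appears within `max_bits` fraction bits.

```python
def int_to_binary(n: int) -> str:
    """Binary digits of a non-negative integer, by repeated division by 2."""
    bits = ""
    while n > 0:
        bits = str(n % 2) + bits   # each remainder is the next bit, right to left
        n //= 2
    return bits or "0"


def frac_to_binary(f: float, max_bits: int = 24) -> str:
    """Binary digits of a fraction 0 <= f < 1, by repeated multiplication by 2."""
    bits = ""
    while f > 0 and len(bits) < max_bits:
        f *= 2
        bit = int(f)               # the integer part (0 or 1) is the next bit
        bits += str(bit)
        f -= bit
    return bits


def normalized_form(x: float, max_bits: int = 24) -> tuple[str, int]:
    """Return (mantissa digits, exponent) with x = mantissa * 2**exponent, x > 0."""
    whole = int(x)
    int_bits = int_to_binary(whole)
    frac_bits = frac_to_binary(x - whole, max_bits)
    if whole > 0:
        exponent = len(int_bits) - 1          # move the point left past all but one bit
        tail = int_bits[1:] + frac_bits
    else:
        first_one = frac_bits.index("1")      # move the point right to the first 1
        exponent = -(first_one + 1)
        tail = frac_bits[first_one + 1:]
    return "1." + (tail.rstrip("0") or "0"), exponent


print(int_to_binary(22), frac_to_binary(0.625))  # 10110 101
print(normalized_form(22.625))                   # ('1.0110101', 4)
```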
IEEE-754 SPFP Representation
• Given the floating point representation N = (-1)^s * m * 2^e, where m = 1.f_22 f_21 … f_1 f_0,
• we can convert it to the IEEE-754 SPFP format using the relations:
    F = (m - 1) * 2^23   (hence F is an integer)
    E = e + 127
    S = s
  The fields are stored in the order S, E, F.
Single-Precision Floating Point
The IEEE-754 single precision format has 32 bits distributed as

    S (1 bit) | E (8 bits) | F (23 bits)

Since 0 ≤ E ≤ 255, the actual exponent e (E interpreted as biased-127) is restricted so that -127 ≤ e ≤ 128. But e = -127 and e = 128 have special meaning.
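A minimal Python sketch of how the three fields could be assembled into one 32-bit word, assuming the mantissa is already normalized. The name `pack_fields` is hypothetical. Extra mantissa bits are simply truncated here, whereas real hardware normally rounds to nearest.

```python
def pack_fields(s: int, m: float, e: int) -> int:
    """Assemble the 32-bit pattern from sign s, mantissa m (1 <= m < 2), exponent e."""
    if not (1 <= m < 2):
        raise ValueError("mantissa must be normalized: 1 <= m < 2")
    if not (-126 <= e <= 127):
        raise ValueError("e = -127 and e = 128 are reserved for special values")
    S = s
    E = e + 127                  # biased exponent, 1..254
    F = int((m - 1) * 2**23)     # 23 fraction bits; anything beyond is truncated
    return (S << 31) | (E << 23) | F


word = pack_fields(0, 1.4140625, 4)   # 22.625 = 1.0110101_2 * 2^4
print(f"{word:032b}")                 # sign bit, 8 exponent bits, 23 fraction bits
print(f"{word:08X}")                  # 41B50000
```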
Special values and the hidden bit
• In IEEE-754, zero is represented by setting E = F = 0 regardless of the sign bit; thus there are two representations for zero: +0 and -0.
• +∞ is represented by S = 0, E = 255, F = 0
• -∞ is represented by S = 1, E = 255, F = 0
• NaN, or Not-a-Number, is represented by E = 255, F ≠ 0 (it may result from 0 divided by 0)
• The leading 1 of the normalized mantissa is not stored. It is the hidden bit.
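The special encodings can be recognized by looking only at the E and F fields. The following Python sketch (the name `classify` and the returned labels are invented for this example) classifies a 32-bit pattern given as an unsigned integer.

```python
def classify(word: int) -> str:
    """Name the kind of value encoded by a 32-bit single precision pattern."""
    S = word >> 31
    E = (word >> 23) & 0xFF
    F = word & 0x7FFFFF
    sign = "-" if S else "+"
    if E == 255:
        return "NaN" if F != 0 else sign + "infinity"
    if E == 0 and F == 0:
        return sign + "0"
    return "finite nonzero"


print(classify(0x00000000))   # +0
print(classify(0x80000000))   # -0
print(classify(0x7F800000))   # +infinity
print(classify(0xFF800000))   # -infinity
print(classify(0x7FC00000))   # NaN
```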
Converting to IEEE-754 SPFP
1. Convert into a normalized base-2 representation.
2. Bias the exponent (add 127). The result will be E.
3. Put the values into the correct fields. Note that only the fractional part of the mantissa is stored in F.
Example: Convert 22.625 to IEEE-754 SPFP format
1. In scientific notation: 10110.101_2 * 2^0
   Normalized form: 1.0110101_2 * 2^4
2. Bias the exponent: 4 + 127 = 131 = 10000011_2
3. Place into the correct fields:
     S = 0
     E = 10000011
     F = 011 0101 0000 0000 0000 0000
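A hand conversion can be cross-checked against the encoding the machine itself produces. The Python sketch below uses the standard `struct` module to obtain the 32-bit pattern of a value and slice it into the S, E, and F fields. The helper name `spfp_fields` and the test value 22.625 are chosen for illustration. Note that `struct` rounds to the nearest single precision value rather than truncating.

```python
import struct


def spfp_fields(x: float) -> tuple[str, str, str]:
    """Return the S, E, F bit strings of x encoded as IEEE-754 single precision."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))   # reinterpret the 4 bytes
    bits = f"{word:032b}"
    return bits[0], bits[1:9], bits[9:]                   # 1 + 8 + 23 bits


S, E, F = spfp_fields(22.625)
print(S, E, F)   # S is 0, E is 10000011, F is 0110101 followed by sixteen 0s
```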
Example: Convert to IEEE FPS format
1. In scientific notation: … * 2^0
   Normalized form: …
2. Bias the exponent: …
3. Place into the correct fields:
     S = 0
     E = …
     F = …
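Example: Convert -83.7 to IEEE FPS format (single precision)
Convert 83 to binary: 83 = 1010011_2
Convert .7 to binary:
     2 * .7 = 1.4  → 1
     2 * .4 = 0.8  → 0
     2 * .8 = 1.6  → 1
     2 * .6 = 1.2  → 1
     2 * .2 = 0.4  → 0
     2 * .4 = 0.8  → 0
     2 * .8 = 1.6  → 1
     2 * .6 = 1.2  → 1
     2 * .2 = 0.4  → 0
The pattern 0110 repeats, so .7 = .1 0110 0110 0110 …_2 and 83.7 = 1010011.1 0110 0110 0110 …_2
1. In binary scientific notation: 1010011.1 0110 0110 …_2 * 2^0
   Normalized: 1.0100111 0110 0110 …_2 * 2^6
2. Bias the exponent: 6 + 127 = 133 = 10000101_2
3. Place into the correct fields (F is cut off after 23 bits):
     S = 1
     E = 10000101
     F = 010 0111 0110 0110 0110 0110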
Representing as hexadecimal
• It is difficult for people to read binary; one bit pattern looks much like another.
• Raw data, which is not being interpreted as representing a particular data type, is often displayed using hexadecimal instead of binary.
• The final step in many IEEE-754 SPFP problems will be to convert the result to hexadecimal, for example:
    1100 0010 1010 0111 0110 0110 0110 0110 = C2A76666
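Going the other way, a hexadecimal word can be split back into its fields with shifts and masks. This is a short Python sketch; `split_hex` is a hypothetical helper name, and the input is assumed to be an 8-digit hexadecimal string.

```python
def split_hex(hex_word: str) -> tuple[int, int, int]:
    """Split an 8-digit hexadecimal word into the S, E, F fields as integers."""
    word = int(hex_word, 16)
    S = word >> 31              # top bit
    E = (word >> 23) & 0xFF     # next 8 bits
    F = word & 0x7FFFFF         # low 23 bits
    return S, E, F


S, E, F = split_hex("C2A76666")
print(S, E, f"{F:06X}")   # 1 133 276666
print(f"{E:08b}")         # 10000101
```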
Graceful underflow
• Given a single precision floating point number with bit fields S, E, and F (interpreted as unsigned integers), the value of the number is normally calculated as
    N = (-1)^S * (1 + F/2^23) * 2^(E-127)
• This interpretation is not used when
    E = 255 (+∞, -∞, or NaN)
    E = 0, F = 0 (+0 or -0)
• What about E = 0, F ≠ 0?
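The normal interpretation translates directly into code. The Python sketch below evaluates the formula for 1 ≤ E ≤ 254 and, as a sanity check, compares it with the value the `struct` module decodes from the same bit pattern. Both function names are assumptions of this example.

```python
import struct


def decode_normal(S: int, E: int, F: int) -> float:
    """Value of a normalized single precision number from its integer fields."""
    if not 1 <= E <= 254:
        raise ValueError("E = 0 and E = 255 use other interpretations")
    return (-1) ** S * (1 + F / 2**23) * 2.0 ** (E - 127)


def decode_with_struct(S: int, E: int, F: int) -> float:
    """Let the machine decode the same 32-bit pattern, for comparison."""
    word = (S << 31) | (E << 23) | F
    return struct.unpack(">f", struct.pack(">I", word))[0]


print(decode_normal(0, 131, 0x350000))   # 22.625
print(decode_normal(1, 133, 0x276666) == decode_with_struct(1, 133, 0x276666))   # True
```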
Graceful underflow (continued)
• Given a single precision floating point number with bit fields S, E = 0, and F (interpreted as unsigned integers), the value of the number is calculated as
    N = (-1)^S * (0 + F/2^23) * 2^(-126)
• This allows representation of numbers as small as 2^(-149), though each factor of two below 2^(-126) results in the loss of one bit of precision.
Graceful underflow (continued)
• Normal interpretation (E = 1, F = 0): N = 1.0_2 * 2^(1 - 127) = 2^(-126)
    24 bits of precision (counting the hidden bit)
• E = 0 interpretation: N = (.1_2) * 2^(-126) = (.5) * 2^(-126) = 2^(-127)
    Only 23 bits of precision
• E = 0 interpretation: N = (.0001_2) * 2^(-126) = (.0625) * 2^(-126) = 2^(-130)
    Only 20 bits of precision
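The E = 0 interpretation and its shrinking precision can be checked numerically. In this Python sketch, `decode_denormal` and `significant_bits` are hypothetical helper names. The number of significant bits is taken to be the position of the highest 1 in F, since there is no hidden bit when E = 0.

```python
import struct


def decode_denormal(S: int, F: int) -> float:
    """Value of a single precision number with E = 0 (no hidden bit)."""
    return (-1) ** S * (F / 2**23) * 2.0 ** -126


def significant_bits(F: int) -> int:
    """Bits of precision left when E = 0: from the highest 1 in F downward."""
    return F.bit_length()


for F in (1 << 22, 1 << 19, 1):           # .1_2, .0001_2, and the smallest F
    machine = struct.unpack(">f", struct.pack(">I", F))[0]   # S = 0, E = 0
    print(f"{F:023b}", decode_denormal(0, F) == machine, significant_bits(F))
```

With these three inputs the loop reports 23, 20, and 1 significant bits, and the formula agrees with the machine's own decoding in each case.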