IT11004: Data Representation and Organization Floating Point Representation
Normalized form
A real number is called normalized if it is in the form $d_0.d_1 d_2 d_3 \ldots \times 10^{n}$, where $n$ is an integer, $d_0, d_1, d_2, d_3, \ldots$ are the digits of the number in base 10, and $d_0$ is not zero. As examples,
– the number … in normalized form is $\ldots \times 10^{2}$
– the number … in normalized form is $\ldots \times 10^{-3}$
Clearly, any non-zero real number can be normalized.
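The definition above can be sketched in a few lines of Python. This is only an illustration: the function name `normalize` and the sample values 123.45 and 0.00123 are assumptions chosen for the sketch, not values taken from the slides.

```python
from decimal import Decimal


def normalize(value: str) -> tuple[str, int]:
    """Return (mantissa, n) so that value = mantissa x 10^n with mantissa = d0.d1d2... and d0 != 0.

    Hypothetical helper for illustration; takes the number as a decimal string.
    """
    d = Decimal(value)
    if d == 0:
        raise ValueError("zero has no normalized form")
    sign, digits, _ = d.as_tuple()      # digits has no leading zeros
    n = d.adjusted()                    # exponent of the leading digit
    rest = "".join(str(k) for k in digits[1:]) or "0"
    mantissa = f"{'-' if sign else ''}{digits[0]}.{rest}"
    return mantissa, n


print(normalize("123.45"))    # ('1.2345', 2)
print(normalize("0.00123"))   # ('1.23', -3)
```

Using `Decimal` rather than `float` keeps the decimal digits exactly as written, which is what the base-10 definition refers to.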
Floating Point Precisions
Encoding (field layout, most significant bit first: s | exp | frac)
– MSB is the sign bit s
– exp field encodes E
– frac field encodes M
Sizes
– Single precision: 8 exp bits, 23 frac bits (32 bits total)
– Double precision: 11 exp bits, 52 frac bits (64 bits total)
– Extended precision: 15 exp bits, 63 frac bits (only found in Intel-compatible machines; stored in 80 bits, 1 bit wasted)
Single-precision floating-point format (binary32)
A computer number format that occupies 4 bytes (32 bits) in computer memory and represents a wide dynamic range of values by using a floating point. One of the first programming languages to provide single- and double-precision floating-point data types was Fortran. Single-precision binary floating-point is used because it covers a wider range than fixed point of the same bit width. Single precision is known as
– float in C, C++, C#, and Java[1], and Float in Haskell, and
– single in Pascal, Visual Basic, and MATLAB.
IEEE single-precision binary floating-point format: binary32
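As a minimal sketch of the binary32 layout (1 sign bit, 8 exponent bits, 23 fraction bits), the following Python snippet splits a value into its three stored fields. The function name `float32_fields` is an assumption made for this example; `struct` is the standard-library module used to obtain the raw 32-bit pattern.

```python
import struct


def float32_fields(x: float) -> tuple[int, int, int]:
    """Return (sign, exp, frac) as stored in the binary32 encoding of x.

    Hypothetical helper: x is first rounded to the nearest binary32 value.
    """
    (bits,) = struct.unpack(">I", struct.pack(">f", x))  # raw 32-bit pattern
    sign = bits >> 31                # 1 bit
    exp = (bits >> 23) & 0xFF        # 8 bits, stored (encoded) exponent field
    frac = bits & 0x7FFFFF           # 23 bits
    return sign, exp, frac


print(float32_fields(1.0))    # (0, 127, 0)
print(float32_fields(-2.0))   # (1, 128, 0)
```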
Convert a base 10 real number into binary32 format
Consider a real number with an integer and a fraction part, such as 12.375:
– convert the integer part into binary
– convert the fraction part using the following technique
– add the two results and adjust them to produce a proper final conversion
Conversion of the fractional part (multiply by 2 repeatedly; the integer part of each product is the next binary digit):
$0.375 \times 2 = 0.750 = 0 + 0.750$
$0.750 \times 2 = 1.500 = 1 + 0.500$
$0.500 \times 2 = 1.000 = 1 + 0.000$
fraction = 0.000, terminate
$(0.375)_{10}$ can be exactly represented in binary as $(0.011)_2$.
Therefore $(12.375)_{10} = (12)_{10} + (0.375)_{10} = (1100)_2 + (0.011)_2 = (1100.011)_2$.
In normalized form, $(12.375)_{10} = (1.100011)_2 \times 2^{3}$.
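The repeated-doubling technique for the fractional part can be sketched as follows. The function name `fraction_to_binary` and the `max_bits` cut-off are assumptions for this illustration; `Fraction` is used so that the decimal input is handled exactly.

```python
from fractions import Fraction


def fraction_to_binary(frac: str, max_bits: int = 32) -> str:
    """Convert a decimal fraction 0 <= frac < 1 (given as a string) to a binary string.

    Hypothetical helper: stops when the remaining fraction is 0 or after max_bits digits.
    """
    f = Fraction(frac)
    if not 0 <= f < 1:
        raise ValueError("expected a value in [0, 1)")
    bits = []
    while f != 0 and len(bits) < max_bits:
        f *= 2
        bit = int(f)          # integer part of the product: the next binary digit
        bits.append(str(bit))
        f -= bit              # keep only the fractional part
    return "0." + "".join(bits) if bits else "0.0"


print(bin(12)[2:])                    # 1100
print(fraction_to_binary("0.375"))    # 0.011
```

A fraction such as 0.1 never reaches zero under doubling, which is why the sketch needs a limit on the number of digits.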
Convert a base 10 real number into binary32 format…
In normalized form, $(12.375)_{10} = (1.100011)_2 \times 2^{3}$, from which we deduce:
– The exponent is 3 (and in the biased form it is therefore $127 + 3 = 130 = (1000\,0010)_2$)
– The fraction is 100011 (looking to the right of the binary point)
From these we can form the resulting 32-bit IEEE 754 binary32 format representation of 12.375 as:
$(0\ 10000010\ 10001100000000000000000)_2 = \mathrm{41460000}_{\mathrm{H}}$
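To check a hand conversion like this one, the three fields can be packed into a 32-bit word and compared with what Python's `struct` module produces. This is a sketch under stated assumptions: `encode_binary32` is a hypothetical helper and handles only normalized values (biased exponent between 1 and 254).

```python
import struct


def encode_binary32(sign: int, exponent: int, fraction_bits: str) -> int:
    """Pack sign, unbiased exponent and fraction bit string into a binary32 word.

    Hypothetical helper for normalized numbers only.
    """
    biased = exponent + 127                        # binary32 exponent bias
    if not 1 <= biased <= 254:
        raise ValueError("exponent outside the normalized range")
    frac = int(fraction_bits.ljust(23, "0"), 2)    # pad the fraction to 23 bits
    return (sign << 31) | (biased << 23) | frac


word = encode_binary32(0, 3, "100011")             # 1.100011 x 2^3
print(f"{word:08X}")                               # 41460000
print(struct.pack(">f", 12.375).hex())             # 41460000
```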
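Ex 1
Consider the value 0.25.
$0.25 \times 2 = 0.50 = 0 + 0.50$
$0.50 \times 2 = 1.00 = 1 + 0.00$
so $(0.25)_{10} = (0.01)_2$. We can see that $(0.25)_{10} = (1.0)_2 \times 2^{-2}$, from which we deduce:
– The exponent is −2 (and in the biased form it is $127 + (-2) = 125 = (0111\,1101)_2$)
– The fraction is 0 (to the right of the binary point in 1.0, all digits are zero)
From these we can form the resulting 32-bit IEEE 754 binary32 format representation of the real number 0.25 as:
$(0\ 01111101\ 00000000000000000000000)_2 = \mathrm{3e800000}_{\mathrm{H}}$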
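Going in the opposite direction, a binary32 word can be decoded back to its value with $(-1)^{s} \times (1 + \text{frac}/2^{23}) \times 2^{\text{exp} - 127}$. The sketch below applies this to the word 0x3E800000, the binary32 encoding of 0.25. The function name `decode_binary32` is an assumption, and the sketch deliberately ignores zeros, denormalized numbers, infinities and NaNs.

```python
def decode_binary32(word: int) -> float:
    """Decode a 32-bit word as a normalized binary32 value.

    Hypothetical helper: rejects exponent fields 0 and 255 (special cases).
    """
    sign = word >> 31
    biased = (word >> 23) & 0xFF
    frac = word & 0x7FFFFF
    if biased in (0, 255):
        raise ValueError("only normalized values are handled in this sketch")
    significand = 1 + frac / 2**23          # implicit leading 1
    return (-1) ** sign * significand * 2.0 ** (biased - 127)


print(decode_binary32(0x3E800000))   # 0.25
print(decode_binary32(0x41460000))   # 12.375
```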
Ex 2
Convert … into binary32 floating-point format
Double-precision floating-point format (binary64)
A computer number format that occupies 8 bytes (64 bits), i.e. two adjacent 32-bit storage locations, in computer memory. A double-precision number, sometimes simply called a double, may be defined to be an integer, fixed point, or floating point. binary64 has:
– Sign bit: 1 bit
– Exponent width: 11 bits
– Significand precision: 53 bits (52 explicitly stored)
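Python's built-in `float` is a binary64 value on common platforms, so the field widths can be inspected directly. The sketch below is an illustration only: `float64_fields` is a hypothetical helper, and the exponent bias of 1023 is the standard IEEE 754 value for binary64, which the slide itself does not state.

```python
import struct


def float64_fields(x: float) -> tuple[int, int, int]:
    """Return (sign, exp, frac) as stored in the binary64 encoding of x (hypothetical helper)."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))  # raw 64-bit pattern
    sign = bits >> 63                       # 1 bit
    exp = (bits >> 52) & 0x7FF              # 11 bits, biased by 1023
    frac = bits & ((1 << 52) - 1)           # 52 stored fraction bits
    return sign, exp, frac


sign, exp, frac = float64_fields(12.375)
print(sign, exp - 1023, f"{frac:052b}"[:6])    # 0 3 100011

# With 53 significand bits, integers up to 2**53 are exact,
# but 2**53 + 1 cannot be represented and rounds back to 2**53.
print(float(2**53) + 1 == float(2**53))        # True
```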