IEEE floating point format (V1.0, 22/10/2005)
Most computers use a standard format known as the IEEE floating-point format, defined by the IEEE 754-1985 standard for binary floating-point arithmetic.
Single Precision
The IEEE single-precision floating-point standard representation requires a 32-bit word:

    S EEEEEEEE FFFFFFFFFFFFFFFFFFFFFFF
    0 1      8 9                    31
      Sign / Exponent / Fraction

A sign bit of 0 indicates a positive number; a 1 indicates a negative number.
The exponent field is represented in "excess 127" notation.
The 23 fraction bits actually represent 24 bits of precision, as a leading 1 in front of the binary point is implied (the hidden bit).
In the 32-bit IEEE format, 1 bit is allocated as the sign bit, the next 8 bits are the exponent field, and the last 23 bits are the fractional part of the normalized number.
For E > 0 and E < 255, the value may be determined as:

    V = (-1)^S * 1.F * 2^(E-127)

where
    S = sign bit
    E = exponent in excess-127 representation
    F = fractional part in binary notation
"1.F" represents the binary number created by prefixing F with an implicit leading 1 and a binary point (the hidden bit).
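As a quick illustration (a minimal Python sketch, not part of the original notes; decode_single is a hypothetical helper name), the formula above can be applied to the raw bit fields and cross-checked against the machine's own single-precision encoding via the standard struct module:

    import struct

    def decode_single(bits: int) -> float:
        """Apply V = (-1)^S * 1.F * 2^(E-127) for a normalized 32-bit pattern (0 < E < 255)."""
        s = (bits >> 31) & 0x1            # 1-bit sign
        e = (bits >> 23) & 0xFF           # 8-bit exponent field (excess 127)
        f = bits & 0x7FFFFF               # 23-bit fraction field
        return (-1) ** s * (1 + f / 2 ** 23) * 2.0 ** (e - 127)

    pattern = 0b0_10000001_10100000000000000000000   # S=0, E=129, F=.101 -> 1.101 * 2^2
    assert decode_single(pattern) == 6.5
    assert struct.unpack('>f', pattern.to_bytes(4, 'big'))[0] == 6.5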
There are some exceptions
E = 255 (specials: NaN, +Infinity, -Infinity):
    If F <> 0, then V = NaN ("Not a Number"), signalling overflow or an invalid operation.
    If F = 0 and S is 1, then V = -Infinity.
    If F = 0 and S is 0, then V = +Infinity.
E = 0 (denormals):
    If F = 0 and S is 1, then V = -0.
    If F = 0 and S is 0, then V = +0.
    If F <> 0, then V is a denormalized number: a tiny value, smaller than the smallest allowed normalized number, given by

        V = (-1)^S * 0.F * 2^(-126)
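The two reserved exponent values can be recognized directly from the bit fields. The sketch below (Python; classify_single is a hypothetical helper, not from the original notes) mirrors the case analysis above:

    def classify_single(bits: int) -> str:
        s = (bits >> 31) & 0x1
        e = (bits >> 23) & 0xFF
        f = bits & 0x7FFFFF
        if e == 255:                               # specials
            return 'NaN' if f != 0 else ('-Infinity' if s else '+Infinity')
        if e == 0:                                 # zeros and denormals
            if f == 0:
                return '-0' if s else '+0'
            return 'denormalized'                  # V = (-1)^S * 0.F * 2^(-126)
        return 'normalized'

    print(classify_single(0x7FC00000))   # NaN          (E=255, F<>0)
    print(classify_single(0xFF800000))   # -Infinity    (E=255, F=0, S=1)
    print(classify_single(0x00000001))   # denormalized (2^(-149), smallest positive value)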
The range
Since exponent field values 0 and 255 are reserved, the normalized range is restricted to 2^(-126) (smallest positive normalized value) up to (2 - 2^(-23)) * 2^(127) (largest normalized value).
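These two extremes can be checked by unpacking the corresponding bit patterns (a small Python sketch using the standard struct module; not part of the original notes):

    import struct

    smallest_normal = struct.unpack('>f', bytes.fromhex('00800000'))[0]   # S=0, E=1,   F=0
    largest_normal  = struct.unpack('>f', bytes.fromhex('7f7fffff'))[0]   # S=0, E=254, F=all ones
    print(smallest_normal == 2.0 ** -126)                    # True
    print(largest_normal == (2 - 2.0 ** -23) * 2.0 ** 127)   # True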
Example
Finding the IEEE 32-bit floating-point representation of the decimal number -11.5:
STEPS:
1. Convert to binary: -11.5 (decimal) = -1011.1 (binary)
2. Convert to normalized binary scientific notation (shift the number to the form 1.F x 2^E): -1011.1 = -1.0111 x 2^3
3. Use only the fractional part (remove the "1." preceding the fractional part: the hidden bit): F = 01110000000000000000000
4. Add 127 (excess-127 code) to the exponent and convert to binary: 3 + 127 = 130 = 10000010, so E = 10000010
5. Determine the sign bit: negative number, so set it to 1 (otherwise 0): S = 1
6. Assemble the 32 bits (S & E & F): 1 10000010 01110000000000000000000
   V = 11000001001110000000000000000000
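The result can be confirmed by asking the machine for its own encoding of -11.5 (a brief Python sketch, not part of the original notes, using the standard struct module):

    import struct

    raw = struct.pack('>f', -11.5)                        # big-endian single precision
    print(raw.hex())                                      # c1380000
    print(format(int.from_bytes(raw, 'big'), '032b'))     # 11000001001110000000000000000000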
Finding the IEEE 32-bit floating-point representation of the decimal number 9 + 97/128:
STEPS:
1. Convert to binary: 9 + 97/128 (decimal) = 1001.1100001 (binary)
2. Convert to normalized binary scientific notation (shift the number to the form 1.F x 2^E): 1001.1100001 = 1.0011100001 x 2^3
3. Use only the fractional part (remove the "1." preceding the fractional part: the hidden bit): F = 00111000010000000000000
4. Add 127 (excess-127 code) to the exponent and convert to binary: 3 + 127 = 130 = 10000010, so E = 10000010
5. Determine the sign bit: positive number, so set it to 0: S = 0
6. Assemble the 32 bits (S & E & F): 0 10000010 00111000010000000000000
   V = 01000001000111000010000000000000
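The same check works here (a Python sketch as above; 9 + 97/128 = 9.7578125 is exactly representable in single precision, so no rounding occurs):

    import struct

    raw = struct.pack('>f', 9 + 97 / 128)
    print(raw.hex())                                      # 411c2000
    print(format(int.from_bytes(raw, 'big'), '032b'))     # 01000001000111000010000000000000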
IEEE floating point format examples
0 00000000 00000000000000000000000 = +0
1 00000000 00000000000000000000000 = -0
0 11111111 00000000000000000000000 = +Infinity
1 11111111 00000000000000000000000 = -Infinity
0 11111111 00000100000000000000000 = NaN
1 11111111 01100010001001010001010 = NaN
0 10000000 00000000000000000000000 = +1.0 * 2^(128-127) = 2
0 10000001 10100000000000000000000 = +1.101 * 2^(129-127) = 6.5
1 10000001 10100000000000000000000 = -1.101 * 2^(129-127) = -6.5
0 00000001 00000000000000000000000 = +1.0 * 2^(1-127) = 2^(-126)
0 00000000 10000000000000000000000 = +0.1 * 2^(-126) = 2^(-127)
0 00000000 00000000000000000000001 = +0.00000000000000000000001 * 2^(-126) = 2^(-149) (smallest positive value)
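A few of these rows can be verified mechanically (a short Python sketch, not part of the original notes):

    import struct

    for hexpat, expected in [('40000000', 2.0),           # 0 10000000 000... = 2
                             ('c0d00000', -6.5),          # 1 10000001 101... = -6.5
                             ('00000001', 2.0 ** -149)]:  # smallest positive denormal
        value = struct.unpack('>f', bytes.fromhex(hexpat))[0]
        print(hexpat, value, value == expected)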
Precision
Fixed-point representations: the number of digits before and after the decimal point is fixed.
Floating point: there is no fixed number of digits before and after the decimal point; that is, the decimal point can float.
Floating-point representations are slower and less accurate than fixed-point representations, but they can handle a larger range of numbers.
Because computers are integer machines, encodings such as the IEEE format are needed to represent real numbers.
Floating-point numbers are just approximations; small discrepancies in the approximations can produce meaningless results.
One of the challenges in programming with floating-point values is ensuring that the approximations still lead to reasonable results.
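A familiar consequence (a tiny Python demonstration, not part of the original notes): 0.1 and 0.2 have no exact binary representation, so their sum is not exactly 0.3, and comparisons should use a tolerance rather than equality:

    print(0.1 + 0.2 == 0.3)                 # False
    print(0.1 + 0.2)                        # 0.30000000000000004
    print(abs((0.1 + 0.2) - 0.3) < 1e-9)    # True: compare with a tolerance instead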
Precision: the number of bits used to hold the fractional part.
Floating-point numbers are often classified as single precision or double precision; a floating-point number with more precision than a single-precision number carries more digits to the right of the decimal point.
The more precision, the more exactly fractional quantities can be represented.
A double-precision number uses twice as many bits as a single-precision value, so it can represent fractional quantities much more exactly. For example, if a single-precision number requires 32 bits, its double-precision counterpart will be 64 bits long.
The extra bits increase not only the precision but also the range of magnitudes that can be represented; the exact amount by which the precision and range increase depends on the format the program uses to represent floating-point values.
The term "double precision" is something of a misnomer, because the precision is not really doubled.
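The difference shows up as soon as a value is rounded to single precision and back (a Python sketch, not part of the original notes; Python floats are double precision, and struct.pack('>f', ...) rounds to single precision):

    import struct

    x = 1 / 3                                             # held in double precision
    x32 = struct.unpack('>f', struct.pack('>f', x))[0]    # rounded to single precision
    print(f'{x:.17f}')     # 0.33333333333333331  (about 16 significant decimal digits)
    print(f'{x32:.17f}')   # 0.33333334326744080  (only about 7 digits survive)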
Double Precision
The IEEE double-precision floating-point standard representation requires a 64-bit word:

    S EEEEEEEEEEE FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
    0 1        11 12                                                 63
      Sign / Exponent / Fraction

A sign bit of 0 indicates a positive number; a 1 indicates a negative number.
The exponent field is represented in "excess 1023" notation.
The 52 fraction bits actually represent 53 bits of precision, as a leading 1 in front of the binary point is implied (the hidden bit).
In the 64-bit IEEE format, 1 bit is allocated as the sign bit, the next 11 bits are the exponent field, and the last 52 bits are the fractional part of the normalized number.
For E > 0 and E < 2047, the value may be determined as:

    V = (-1)^S * 1.F * 2^(E-1023)

where
    S = sign bit
    E = exponent in excess-1023 representation
    F = fractional part in binary notation
"1.F" represents the binary number created by prefixing F with an implicit leading 1 and a binary point (the hidden bit).
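The decoding sketch from the single-precision section carries over with the wider fields (Python; decode_double is a hypothetical helper, not from the original notes):

    import struct

    def decode_double(bits: int) -> float:
        """Apply V = (-1)^S * 1.F * 2^(E-1023) for a normalized 64-bit pattern (0 < E < 2047)."""
        s = (bits >> 63) & 0x1                # 1-bit sign
        e = (bits >> 52) & 0x7FF              # 11-bit exponent field (excess 1023)
        f = bits & ((1 << 52) - 1)            # 52-bit fraction field
        return (-1) ** s * (1 + f / 2 ** 52) * 2.0 ** (e - 1023)

    pattern = int.from_bytes(struct.pack('>d', -6.5), 'big')
    print(decode_double(pattern))             # -6.5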
There are some exceptions
E = 2047 (specials: NaN, +Infinity, -Infinity):
    If F <> 0, then V = NaN ("Not a Number"), signalling overflow or an invalid operation.
    If F = 0 and S is 1, then V = -Infinity.
    If F = 0 and S is 0, then V = +Infinity.
E = 0 (denormals):
    If F = 0 and S is 1, then V = -0.
    If F = 0 and S is 0, then V = +0.
    If F <> 0, then V is a denormalized number: a tiny value, smaller than the smallest allowed normalized number, given by

        V = (-1)^S * 0.F * 2^(-1022)
The range
Since exponent field values 0 and 2047 are reserved, the normalized range is restricted to 2^(-1022) (smallest positive normalized value) up to (2 - 2^(-52)) * 2^(1023) (largest normalized value).
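Since Python floats are IEEE double precision, these limits can be read directly from sys.float_info (a quick check, not part of the original notes):

    import sys

    print(sys.float_info.min)                  # 2.2250738585072014e-308, i.e. 2^(-1022)
    print(sys.float_info.max)                  # 1.7976931348623157e+308, i.e. (2 - 2^(-52)) * 2^(1023)
    print(sys.float_info.min == 2.0 ** -1022)  # True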