Floating Point Representations
Updated: 06/03/2010
Decimal Floating Point Numbers
Examples: 1.5, 102.450
Decimal fractions: 3/4 -> 0.75, 1/100 -> 0.01
Scientific Notation
How can we more compactly represent these values?
1,000,000,000 -> 1.0 x 10^9
0.000025 -> 2.5 x 10^-5
What are the two parts of a scientific notation value called? The mantissa and the exponent.
Normalize the mantissa so it has one digit to the left of the decimal point.
Floating Point Numbers
Floating point numbers such as 7.519, -0.01, and 4.3 x 10^8 are represented using the IEEE 754 standard format.
A floating point number is represented using a mantissa and an exponent.
Example: 7.51 x 2^5
The mantissa is 7.51
The exponent is 5 --> Note: the exponent is a power of 2
A set number of bits is assigned to represent the mantissa and exponent:
32-bit single precision: 1 sign bit | 8-bit exponent | 23-bit mantissa
64-bit double precision: 1 sign bit | 11-bit exponent | 52-bit mantissa
Rounding
Not every floating point value can be represented exactly in binary using a finite number of bits.
Question: What are some examples? 1/3 = 0.3333..., pi = 3.141...
In these cases, we must round to the nearest number that can be represented.
If a number is halfway between two representable values, round to the one whose least-significant digit is even.
Examples of Rounding
Round each of these numbers to two significant digits:
1.345 --> 1.3 (1.345 is nearer to 1.3 than to 1.4)
78.953 --> 79 (78.953 is nearer to 79 than to 78)
12.5 --> 12 (12.5 is halfway between 12 and 13; choose 12 since its least significant digit is even)
13.5 --> 14 (13.5 is halfway between 13 and 14; choose 14 since its least significant digit is even)
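The round-half-to-even rule above is the same rule Python's built-in round() implements, so the whole-number examples can be checked directly (a quick sketch; the two-significant-digit cases are expressed here as rounding to integers):

```python
# Python's round() uses round-half-to-even ("banker's rounding"),
# the same default rounding rule as IEEE 754.
print(round(78.953))  # nearer to 79 than to 78
print(round(12.5))    # halfway: round to the even neighbor -> 12
print(round(13.5))    # halfway: round to the even neighbor -> 14
```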
Fractional Binary Numbers
Fractional binary numbers use the familiar place-value representation, but with a base of 2 instead of 10.
Example:
11.101b = 1x2^1 + 1x2^0 + 1x2^-1 + 0x2^-2 + 1x2^-3
        = 2 + 1 + 1/2 + 0/4 + 1/8
        = 3 + 0.5 + 0.0 + 0.125
        = 3.625
Exercises
Convert this binary fraction into decimal: 101.011
Answer: 5 + 0x(1/2) + 1x(1/4) + 1x(1/8) = 5.375
Express the decimal value 6.5 as a binary fraction
Answer: 110.1
Express the decimal value 11.75 as a binary fraction
Answer: 1011.11
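These conversions can be checked with a small helper that applies the place-value expansion directly (bin_frac_to_decimal is an illustrative name, not a library function):

```python
def bin_frac_to_decimal(s: str) -> float:
    """Evaluate a binary fraction string such as '101.011' as a decimal value."""
    int_part, _, frac_part = s.partition('.')
    value = int(int_part, 2) if int_part else 0
    for i, bit in enumerate(frac_part, start=1):
        value += int(bit) * 2 ** -i   # fractional bit i contributes bit * 2^-i
    return value

print(bin_frac_to_decimal('101.011'))   # 5.375
print(bin_frac_to_decimal('110.1'))     # 6.5
print(bin_frac_to_decimal('1011.11'))   # 11.75
```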
Normalized Mantissa for Scientific Notation
Scientific notation expresses the mantissa with one digit to the left of the decimal point.
Given your original number:
Shift the decimal point left or right until one non-zero digit is to the left of the decimal point.
For each shift left, increase the power-of-ten exponent by 1.
For each shift right, decrease the power-of-ten exponent by 1.
Examples:
102.5 x 10^4 = 1.025 x 10^6
7589 x 10^5 = 7.589 x 10^8
0.0045 x 10^0 = 4.5 x 10^-3
Normalized Binary Mantissa
Binary Fraction    Normalized IEEE 754 Mantissa
11.110          -> 1.1110  (shift binary point left 1 place)
1.01            -> 1.01    (no shift)
101.01          -> 1.0101  (shift binary point left 2)
0.001011        -> 1.011   (shift binary point right 3)
Question: What do all of the normalized binary mantissas have in common?
Fractional Representation of the Mantissa
What do all of the normalized binary mantissas have in common?
The one bit to the left of the binary point is always a 1.
So if we use 23 bits for the single-precision mantissa, we can "save" a bit by not storing this leading 1.
Simply discard the leading 1 after normalizing the binary mantissa.
Mantissa Example
What is the binary representation of the mantissa in IEEE 754 for 6.25?
Solution:
6.25 = 1x2^2 + 1x2^1 + 0x2^0 + 0x2^-1 + 1x2^-2 = 110.01b
Shift the binary point to the left until a single 1 bit remains to the left of the binary point:
110.01b --> 1.1001b (shift left by 2 places)
This shift gives us the assumed 1 bit in the integer part of the mantissa; the fractional representation effectively gains one additional bit of precision.
The mantissa encodes only the bits to the right of the binary point: 1001b
Mantissa Example, Continued
What is the binary representation of the mantissa in IEEE 754 for 6.25?
Solution, continued:
Keep only the bits to the right of the binary point: 1001b
Pad the 4 bits out to the 23 bits of the single-precision mantissa by appending zeros on the right (it is a binary fraction, so the extra bits go on the right):
1001 0000 0000 0000 0000 000
(with an imaginary binary point to the left)
Updating the Exponent
6.25 = 110.01b has an implied exponent of 2^0.
Following the IEEE 754 convention of shifting the binary point to the left, in this case by 2 positions, updates the exponent:
110.01b --> 1.1001b (after shifting the binary point left by 2 positions)
For each left shift of the binary point, add 1 to the binary exponent:
6.25 = 1.1001b x 2^2
Updating the Binary Exponent
Binary Fraction      Normalized
1.01 x 2^0        -> 1.01 x 2^0
11.110 x 2^0      -> 1.1110 x 2^1
1101.01 x 2^0     -> 1.10101 x 2^3
0.001011 x 2^0    -> 1.011 x 2^-3
0.0000101 x 2^0   -> 1.01 x 2^-5
Representing the Exponent in IEEE 754
The exponent is represented as a biased integer.
For single precision, add 127 to the value of the normalized base-ten integer exponent.
For double precision, add 1023 to the value of the normalized base-ten integer exponent.
Representing the Exponent in IEEE 754
The exponent is represented as a biased integer.
For single precision, add 127 to the value of the exponent.
For double precision, add 1023 to the value of the exponent.
Example: How would the values -45 and 123 be represented in the 8-bit biased format for single precision?
Answer:
-45 + 127 = 82 = 01010010b
123 + 127 = 250 = 11111010b
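The biased encoding can be sketched in a couple of lines (BIAS and encode_exponent are illustrative names, not part of any standard API):

```python
BIAS = 127  # single precision; double precision uses 1023

def encode_exponent(e: int) -> str:
    """Return the 8-bit biased encoding of a single-precision exponent."""
    return format(e + BIAS, '08b')

print(encode_exponent(-45))  # 01010010  (-45 + 127 = 82)
print(encode_exponent(123))  # 11111010  (123 + 127 = 250)
```

Because the bias maps every legal exponent to an unsigned value, bit patterns sort in the same order as the exponents they encode.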
Encoding the Biased Binary Exponent
Binary Fraction      Normalized           Biased Exponent
1.01 x 2^0        -> 1.01 x 2^0            0 + 127 = 127
11.110 x 2^0      -> 1.1110 x 2^1          1 + 127 = 128
1101.01 x 2^0     -> 1.10101 x 2^3         3 + 127 = 130
0.001011 x 2^0    -> 1.011 x 2^-3         -3 + 127 = 124
0.0000101 x 2^0   -> 1.01 x 2^-5          -5 + 127 = 122
Encode each biased exponent as an unsigned 8-bit number.
Encode each biased exponent in 8-bit two's complement.
Suppose you had to rapidly sort by exponents: which format would be more efficient?
Floating Point Example #1
Recall that 6.25 = 1.1001b x 2^2
Encode 6.25 as a 32-bit single precision binary number:
Sign bit = 0
Mantissa = 1.1001 (the encoding omits the assumed leading 1)
Exponent = 2 + 127 = 129 = 10000001
The 32-bit single precision encoding is:
0 10000001 10010000000000000000000
(sign bit | exponent | mantissa)
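The hand encoding can be checked against the machine's own IEEE 754 representation using Python's struct module ('>f' packs a big-endian 32-bit float, '>I' reinterprets the same four bytes as an unsigned integer):

```python
import struct

# Reinterpret the 32-bit single-precision pattern of 6.25 as an unsigned int
bits = struct.unpack('>I', struct.pack('>f', 6.25))[0]
b = format(bits, '032b')
print(b[0], b[1:9], b[9:])   # sign, exponent, mantissa fields
# 0 10000001 10010000000000000000000
```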
Floating Point Example #2
What is the value of the single-precision floating-point number represented by the following 32-bit binary encoding?
0 10000000 110 0000 0000 0000 0000 0000
Sign bit = 0
Encoded exponent = 10000000 = 128
Encoded mantissa = 110 0000 0000 0000 0000 0000
Subtract the added bias of 127 to reveal an exponent of 1.
Mantissa = .110 0000 0000 0000 0000 0000
Mantissa = 1.11 (replace the assumed 1 before the binary point)
Mantissa = 1.11 = 1x2^0 + 1x2^-1 + 1x2^-2 = 1.75
Value = 1.75 x 2^1 = 3.5
Floating Point Example #3
-6.25 = -1.1001b x 2^2
Encode -6.25 as a 32-bit single precision binary number:
Sign bit = 1 (the mantissa uses signed magnitude)
Mantissa = 1.1001 (the encoding omits the assumed leading 1)
Exponent = 2 + 127 = 129 = 10000001
The 32-bit single precision encoding is:
1 10000001 10010000000000000000000
(sign bit | exponent | mantissa)
Exercise
Exercise 2.18 (a) on page 42 of Computer Architecture by N. Carter:
What value is represented by this IEEE single precision value?
1 01111010 100 0000 0000 0000 0000 0000
Exercise: Solution
What value is represented by this IEEE single precision value?
1 01111010 100 0000 0000 0000 0000 0000
Sign bit = 1
Encoded exponent = 01111010 = 122
Encoded mantissa = 100 0000 0000 0000 0000 0000
Subtract the added bias of 127 from the encoded exponent: the actual exponent is -5.
Mantissa = .100 0000 0000 0000 0000 0000 = .1
Mantissa = 1.1 (add back the assumed 1 before the binary point)
Mantissa with sign applied = -(1x2^0 + 1x2^-1) = -1.5
Value = -1.5 x 2^-5 = -1.5 x (1/32) = -0.046875
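The decoding procedure above can be sketched as a function (decode_single is an illustrative name; for simplicity it handles normalized values only, not zero, denormals, infinity, or NaN):

```python
def decode_single(pattern: str) -> float:
    """Decode a 32-bit IEEE 754 single-precision bit pattern (normalized values only)."""
    bits = int(pattern.replace(' ', ''), 2)
    sign = -1.0 if bits >> 31 else 1.0
    biased_exp = (bits >> 23) & 0xFF      # 8-bit encoded exponent
    fraction = bits & 0x7FFFFF            # 23-bit encoded mantissa
    mantissa = 1 + fraction / 2**23       # add back the assumed leading 1
    return sign * mantissa * 2.0 ** (biased_exp - 127)

print(decode_single('1 01111010 10000000000000000000000'))  # -0.046875
print(decode_single('0 10000001 10010000000000000000000'))  # 6.25
```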
IEEE 754 Single Precision Range
Smallest positive normalized number:
1.00000000000000000000000 x 2^-126
Largest normalized number:
1.11111111111111111111111 x 2^127
Representing 1.0
1.0 = 1.0 x 2^0
Sign bit = 0
Mantissa = all zeros (the leading 1 is assumed)
Exponent = 0 + 127 = 127 = 01111111
0 01111111 00000000000000000000000
Representing 0.0
The assumed 1 bit in the mantissa gains an extra bit of precision.
But zero cannot be represented exactly this way, since a mantissa of 0 is interpreted as 1.0.
The IEEE 754 standard therefore specifies that zero is represented using an exponent of 0 with a mantissa of 0.
NaN
NaN = Not a Number
A special value used to represent the result of an invalid operation, such as 0/0, infinity - infinity, or the square root of a negative number. (Overflow and division of a nonzero value by zero instead produce infinity.)
NaN is represented by all 1's in the exponent field and a non-zero mantissa field.
Any math operation using NaN results in NaN.
Example: NaN + 4.5 = NaN
Infinity
IEEE 754 represents infinity using all 1's in the exponent field and a fraction field of 0.
The sign bit designates positive or negative infinity.
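Both special encodings can be inspected with Python's math and struct modules (bits_of is an illustrative helper; the exact NaN mantissa bits may vary by platform, but the exponent field is always all 1's with a non-zero mantissa):

```python
import math
import struct

def bits_of(x: float) -> str:
    """Return the 32-bit single-precision pattern of x as sign|exponent|mantissa."""
    b = format(struct.unpack('>I', struct.pack('>f', x))[0], '032b')
    return f'{b[0]} {b[1:9]} {b[9:]}'

print(bits_of(math.inf))    # exponent all 1's, mantissa all 0's
print(bits_of(-math.inf))   # same, but with sign bit 1
print(bits_of(math.nan))    # exponent all 1's, non-zero mantissa
print(math.nan + 4.5)       # nan -- NaN propagates through arithmetic
```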
Floating Point Addition (Decimal Example)
Example: 9.999 x 10^1 + 1.610 x 10^-1
Step 1: Shift the decimal point of the smaller number to the left until its updated exponent matches the exponent of the larger number:
1.610 x 10^-1 --> 0.01610 x 10^1
Step 2: Add the mantissas (assume only 4 significant digits):
 9.999 x 10^1
+0.016 x 10^1
10.015 x 10^1
Step 3: Re-normalize to get one non-zero digit left of the decimal point:
10.015 x 10^1 --> 1.0015 x 10^2
Step 4: Round the mantissa to 4 significant digits:
1.0015 x 10^2 --> 1.002 x 10^2
Floating Point Addition Example
Use single-precision floating point to compute 0.25 + 1.5.
0.25 (base 10) = 1/4 = 0.01b = 1.0 x 2^-2
1.5 (base 10) = 1 + 1/2 = 1.1b = 1.1 x 2^0
Shift the binary point of the smaller number to the left so the exponents match:
1.0 x 2^-2 --> 0.01 x 2^0
Floating Point Addition Example (Continued)
Use single-precision floating point to compute 0.25 + 1.5.
Next, add the mantissas, both with an exponent of 0:
 0.01 x 2^0
+1.10 x 2^0
 1.11 x 2^0
Floating Point Addition Example (Continued)
Use single-precision floating point to compute 0.25 + 1.5.
 0.01 x 2^0
+1.10 x 2^0
 1.11 x 2^0
Encode the result using 32-bit single precision:
Sign bit = 0
Mantissa = 11000000000000000000000 (23 bits)
Exponent = 0 + 127 = 127 = 01111111
The 32-bit single precision encoding is:
0 01111111 11000000000000000000000
Floating Point Addition Exercise: Solution 2.20 (b)
Use single precision to compute 147.5 + 0.25.
147.5 (base 10) = 128 + 16 + 2 + 1 + 1/2 = 10010011.1b
Convert to normalized mantissa format:
10010011.1 x 2^0 --> 1.00100111 x 2^7 (shifted the binary point 7 places to the left)
See Computer Architecture by N. Carter, page 43
Floating Point Addition Exercise: Solution 2.20 (b)
Use single precision to compute 147.5 + 0.25.
0.25 (base 10) = 1/4 = 0.01b
Convert to normalized mantissa format:
0.01 x 2^0 --> 1.0 x 2^-2 (shift the binary point 2 places to the right)
Floating Point Addition Exercise: Solution 2.20 (b)
Use single precision to compute 147.5 + 0.25:
1.00100111 x 2^7 + 1.0 x 2^-2
Shift the binary point of the smaller number to the left to match the exponent (7) of the larger number:
1.0 x 2^-2 --> 0.000000001 x 2^7 (shift the binary point 9 places to the left to go from an exponent of -2 to 7)
Floating Point Addition Exercise: Solution 2.20 (b)
Use single precision to compute 147.5 + 0.25.
Add the mantissas, both expressed with exponent 7:
 1.001001110 x 2^7
+0.000000001 x 2^7
 1.001001111 x 2^7
Floating Point Addition Exercise: Solution 2.20 (b)
Use single precision to compute 147.5 + 0.25.
Encode the result 1.001001111 x 2^7 in single precision:
Sign bit = 0 since the result is positive
Mantissa = 00100111100000000000000 (23 bits)
Exponent = 7 + 127 = 134 = 10000110
The 32-bit single precision encoding is:
0 10000110 00100111100000000000000
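The hand-worked encoding can be checked by letting the hardware perform the same addition and dumping the resulting bit pattern (both operands and the sum here are exactly representable, so no rounding intrudes):

```python
import struct

total = 147.5 + 0.25   # = 147.75 = 1.001001111b x 2^7, exactly representable
b = format(struct.unpack('>I', struct.pack('>f', total))[0], '032b')
print(b[0], b[1:9], b[9:])
# 0 10000110 00100111100000000000000
```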
Addition with Negative Values
If a value is negative, you must first convert the negative value into two's complement.
Example: -0.111
Convert to two's complement by...
 1.000   inverting all bits
+0.001   adding 1
 1.001
Use the two's complement version of the value when adding the mantissas. Discard the carry overflow bit.
Addition with Negative Value(s)
 1.000 x 2^0   (1.0 in base ten)
-1.000 x 2^-1  (-0.5 in base ten)
Move the binary point of the smaller number so the exponents match:
-1.000 x 2^-1 --> -0.100 x 2^0 (-0.5 in base ten)
Convert the mantissa of -0.5 into two's complement, then add:
 1.000 x 2^0
+1.100 x 2^0   (-0.5 in two's complement)
10.100 x 2^0
Two's complement addition discards the carry overflow bit, so the sum is 0.100 x 2^0.
Normalize the exponent to get a sum of 1.000 x 2^-1 (0.5 base ten).
Floating Point Addition (page 282 of H&P)
Example: Compute 0.5 + -0.4375 (base 10) using binary arithmetic.
0.5 (base 10) = 0.1b x 2^0
Normalize to get a 1 to the left of the binary point:
0.5 = 0.1 x 2^0 = 1.0 x 2^-1
-0.4375 = -0.0111b = -((1/4) + (1/8) + (1/16))
Normalize to get a 1 to the left of the binary point:
-0.0111 --> -1.11 x 2^-2
Floating Point Addition (page 282 of H&P)
Compute 0.5 = 1.0 x 2^-1 plus -0.4375 = -1.11 x 2^-2.
Step 1: Shift the binary point of the smaller number to the left until its updated exponent matches the exponent of the larger number:
-1.11 x 2^-2 --> -0.111 x 2^-1
Step 2: Add the mantissas (convert the negative value to two's complement, then add, discarding the carry overflow bit):
 1.000   (1.0 decimal)
-0.111   (-0.875 decimal)
 0.001   (0.125 decimal)
Result: 0.001 x 2^-1
Floating Point Addition (page 282 of H&P)
Step 3: Normalize to get a 1 to the left of the binary point:
0.001 x 2^-1 --> 1.0 x 2^-4
The exponent of -4 lies between -126 and 127 (the range of single precision exponents), so there is no overflow or underflow.
Express the exponent in biased notation by adding 127:
Encoded exponent = -4 + 127 = 123
Step 4: Round to 23 binary digits of mantissa precision:
1.0 x 2^-4 (no rounding needed)
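This result, too, can be cross-checked against the hardware: 0.5 - 0.4375 = 0.0625 = 1.0 x 2^-4 is exactly representable, so the exponent field of the machine's answer should be 123.

```python
import struct

result = 0.5 - 0.4375   # 0.0625 = 1.0 x 2^-4, exactly representable
b = format(struct.unpack('>I', struct.pack('>f', result))[0], '032b')
print(b[1:9])                 # 01111011, i.e. 123 = -4 + 127
print(int(b[1:9], 2) - 127)   # -4
```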
Floating Point Multiplication
Multiply the mantissas and add the exponents:
Result = (mantissa1 x mantissa2) x 2^(exp1 + exp2)
Example (in decimal): 5 x 10^3 times 2 x 10^6 = 10 x 10^9
If the mantissa is >= 10, shift the mantissa down one place (divide by 10) and increment the result exponent:
10 x 10^9 = 1 x 10^10
Floating Point Multiplication
Since IEEE 754 uses biased integers to represent the exponents, the bias must be considered when adding them:
Add the two biased integer exponents, then subtract the bias value from the result.
Example: Add the biased (+127) exponents 150 and 45.
Break down the exponents to see the bias values of 127:
150 = 23 + 127
45 = -82 + 127
Add the biased exponents: 150 + 45 = 195
Subtract the bias of 127: 150 + 45 - 127 = 68, the result's biased exponent
Check it: 68 - 127 = actual exponent of -59 = 23 + -82
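The biased-exponent bookkeeping above can be sketched directly (mul_biased_exponents is an illustrative name):

```python
BIAS = 127

def mul_biased_exponents(e1: int, e2: int) -> int:
    """Biased exponent of a product: add the biased exponents, subtract one bias."""
    return e1 + e2 - BIAS

biased = mul_biased_exponents(150, 45)
print(biased)           # 68
print(biased - BIAS)    # -59, which equals 23 + (-82)
```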
Floating Point Multiplication: Example Exercise 2.20 (a)
Use IEEE single precision to compute 32 x 16.
32 (base 10) = 100000.0b x 2^0
Convert to normalized binary mantissa format:
100000.0 x 2^0 --> 1.0 x 2^5 (shift the binary point 5 places to the left)
Biased exponent = 5 + 127 = 132
16 (base 10) = 10000.0b x 2^0
10000.0 x 2^0 --> 1.0 x 2^4 (shift the binary point 4 places to the left)
Biased exponent = 4 + 127 = 131
See Computer Architecture by N. Carter, page 43
Floating Point Multiplication: Example Exercise 2.20 (a)
Use IEEE single precision to compute 32 x 16.
1.0 x 2^5, biased exponent = 5 + 127 = 132
1.0 x 2^4, biased exponent = 4 + 127 = 131
Multiply the mantissas:
  1.0
x 1.0
-----
  0 0
+ 1 0
-----
 1.0 0
Count the number of bits to the right of the binary points of the operands (1 + 1 = 2) and place the binary point that many places from the right in the product: 1.00
Add the +127 biased exponents: 132 + 131 - 127 = 136
Actual unbiased exponent = 136 - 127 = 9
Product = 1.0 x 2^9 = 512
Floating Point Multiplication: Example Exercise 2.20 (a)
Use IEEE single precision to compute 32 x 16.
1.0 x 2^5, biased exponent = 5 + 127 = 132
1.0 x 2^4, biased exponent = 4 + 127 = 131
Multiply the mantissas: 1.0 x 1.0 = 1.0 (binary)
Add the +127 biased exponents: 132 + 131 - 127 = 136
Actual unbiased exponent = 136 - 127 = 9
Product = 1.0 x 2^9 = 512
Sign bit = 0
Mantissa = 1.00000000000000000000000
Exponent = 136 = 10001000
The encoded IEEE 754 single precision number is:
0 10001000 00000000000000000000000
Floating Point Multiplication Exercise: Solution 2.20 (c)
Compute 0.125 x 8 using single-precision binary.
0.125 (base 10) = 0.001b x 2^0 = 1.0 x 2^-3 (normalized binary mantissa)
8 (base 10) = 1000.0b x 2^0 = 1.0 x 2^3 (normalized binary mantissa)
See Computer Architecture by N. Carter, page 43
Floating Point Multiplication Exercise: Solution 2.20 (c)
Compute 0.125 x 8 using single-precision binary.
1.0 x 2^-3, biased exponent = -3 + 127 = 124
1.0 x 2^3, biased exponent = 3 + 127 = 130
Multiply the mantissas: 1.0 x 1.0 = 1.0 (binary)
Add the biased exponents: 124 + 130 - 127 = 127
Actual exponent: 127 - 127 = 0
Sign bit = 0
Mantissa = 1.00000000000000000000000
Exponent = 127 = 01111111
0 01111111 00000000000000000000000 is the encoded binary number
Floating Point Multiplication Exercise
Multiply 0.75 x 32 using IEEE 754 single-precision format.
0.75 = 0.11b x 2^0, normalized: 1.1 x 2^-1, biased exponent = -1 + 127 = 126
32 = 100000.0b x 2^0, normalized: 1.0 x 2^5, biased exponent = 5 + 127 = 132
Multiply the mantissas:
  1.1
x 1.0
-----
  0 0
+ 1 1
-----
 1.1 0
To place the binary point, count the bits to the right of the binary points of the two operands 1.1 and 1.0: a total of 2 places, so place the binary point two places from the right in the product.
Floating Point Multiplication Exercise
Multiply 0.75 x 32 using IEEE 754 single-precision format.
Multiply the mantissas:
  1.1
x 1.0
-----
  0 0
+ 1 1
-----
 1.1 0
Add the biased exponents: 126 + 132 - 127 = 131 (the unbiased exponent is 4)
Floating Point Multiplication Exercise
Multiply 0.75 x 32 using IEEE 754 single-precision format.
Product of the mantissas: 1.10
Add the biased exponents: 126 + 132 - 127 = 131 (the unbiased exponent is 4)
The product is already normalized.
Encode the product using the IEEE 32-bit format:
Sign bit = 0
Exponent = 131 = 10000011
Mantissa = 10000000000000000000000
0 10000011 10000000000000000000000
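As with the addition exercises, the encoding can be verified by letting the machine multiply and dumping the bit pattern of the product (0.75 x 32 = 24 = 1.1b x 2^4, exactly representable):

```python
import struct

product = 0.75 * 32.0   # 24.0 = 1.1b x 2^4
b = format(struct.unpack('>I', struct.pack('>f', product))[0], '032b')
print(b[0], b[1:9], b[9:])
# 0 10000011 10000000000000000000000
```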
Rounding of Floating Point Numbers
Accurate rounding requires the hardware to use a few extra bits to hold intermediate results.
These extra bits are used to decide how to round when the final result is stored in the 32-bit single precision or 64-bit double precision format.
The IEEE 754 standard uses up to three additional bits, called the guard, round, and sticky bits, to assist in accurate rounding.
See pages 297-298 of Computer Organization and Design
Rounding of Floating Point Numbers
Compute this base-ten addition, rounding all intermediate values to three significant digits:
 2.56 x 10^0
+2.34 x 10^2
First shift the decimal point of the top number to align the exponents:
2.56 x 10^0 --> 0.02 x 10^2 (rounding to three digits loses information)
 0.02 x 10^2
+2.34 x 10^2
 2.36 x 10^2
Rounding of Floating Point Numbers
Now compute the same base-ten addition using intermediate values that keep an extra two digits:
 2.56 x 10^0
+2.34 x 10^2
First shift the decimal point of the top number to align the exponents:
 0.0256 x 10^2   (intermediate values keep two extra digits)
+2.3400 x 10^2
 2.3656 x 10^2
Use the extra two digits to round the result down to three significant digits: 2.37 x 10^2
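Python's decimal module can reproduce both outcomes: its context keeps exact intermediate digits and rounds only each operation's result to prec significant digits, which mirrors the guard-digit computation; pre-rounding the aligned operand reproduces the lossy version.

```python
from decimal import Decimal, getcontext

getcontext().prec = 3  # three significant digits per operation

# Guard-digit behavior: the sum is formed exactly, then rounded once.
print(Decimal('2.56') + Decimal('2.34E+2'))    # 237, i.e. 2.37 x 10^2

# Lossy behavior: the aligned operand is rounded to 0.02 x 10^2 first.
print(Decimal('0.02E+2') + Decimal('2.34E+2'))  # 236, i.e. 2.36 x 10^2
```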
Java Applets for IEEE Floating Point
A Java applet that converts decimal numbers to IEEE single or double precision encodings can be found at:
http://babbage.cs.qc.edu/courses/cs341/IEEE-754.html
This applet may be used to make up your own sample problems, to convert between decimal and IEEE format, and to check the results of other calculations in IEEE floating point format.
Interactive floating point addition demo:
http://tima-cmp.imag.fr/~guyot/Cours/Oparithm/english/Flottan.htm