Problems with Floating-Point Representations Douglas Wilhelm Harder Department of Electrical and Computer Engineering University of Waterloo Copyright © 2007 by Douglas Wilhelm Harder. All rights reserved. ECE 204 Numerical Methods for Computer Engineers
This topic will cover a number of the problems with using a floating-point representation, including:
–underflow and overflow
–subtractive cancellation
–adding large and small numbers
–non-associativity: (a + b) + c ≠ a + (b + c)
Underflow and Overflow
In our six-decimal-digit floating-point representation, the largest number we can represent is 9.999 × 10^50.
The largest double is approximately 1.8 × 10^308:
    >> format long; realmax
    realmax = 1.797693134862316e+308
    >> format hex; realmax
    realmax = 7fefffffffffffff
or, more precisely, (2 – 2^–52) × 2^1023.
Any number larger than these values cannot be represented using these formats.
To solve this problem, we can introduce a floating-point infinity:
    >> format long; 2e308
    ans = Inf
    >> format hex; 2e308
    ans = 7ff0000000000000
The properties of infinity include:
–any real plus infinity is infinity
–one over infinity is 0
–any positive number times infinity is infinity
–any negative number times infinity is –infinity
For example:
    >> Inf + 1e100
    ans = Inf
    >> 325*Inf
    ans = Inf
    >> 1/Inf
    ans = 0
    >> -2*Inf
    ans = -Inf
The introduction of a floating-point infinity allows computations to continue and removes the necessity of signaling overflows through exceptions.
An example where infinity may not cause a problem is where its reciprocal is immediately taken:
    >> 5 + 1/2e400
    ans = 5
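The properties of infinity listed above can be checked directly. The following is a minimal sketch in Python rather than Matlab, assuming a platform whose floats are IEEE 754 doubles; the variable names are illustrative only.

```python
import math

inf = math.inf                    # IEEE 754 positive infinity

overflowed = 1e308 * 10           # too large for a double, so it becomes inf
sum_with_inf = inf + 1e100        # any real plus infinity is infinity
reciprocal = 1 / inf              # one over infinity is 0
positive_scaled = 325 * inf       # positive number times infinity is infinity
negative_scaled = -2 * inf        # negative number times infinity is -infinity
absorbed = 5 + 1 / overflowed     # the reciprocal of the overflow vanishes

print(overflowed, sum_with_inf, reciprocal)
print(positive_scaled, negative_scaled, absorbed)
```

The last line mirrors the 5 + 1/2e400 example: the overflow never has to be signaled because its reciprocal is simply 0.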
In our six-decimal-digit floating-point representation, the smallest positive number we can represent is 10^–49.
The smallest positive double (using the normal representation) is approximately 2.2 × 10^–308:
    >> format long; realmin
    realmin = 2.225073858507201e-308
    >> format hex; realmin
    realmin = 0010000000000000
or, more precisely, 2^–1022.
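The exact values quoted for the largest and smallest normal doubles can be confirmed with a short check. This is a sketch in Python, not Matlab, assuming IEEE 754 doubles; `sys.float_info.max` and `sys.float_info.min` play the roles of `realmax` and `realmin`.

```python
import sys

largest = sys.float_info.max            # counterpart of Matlab's realmax
smallest_normal = sys.float_info.min    # counterpart of Matlab's realmin

# Both expressions below are exactly representable, so no rounding occurs.
largest_exact = (2 - 2.0**-52) * 2.0**1023
smallest_exact = 2.0**-1022

print(largest == largest_exact, smallest_normal == smallest_exact)

# Halving the smallest normal number does not give 0: the result is a
# denormalized number, stored with reduced precision.
half_of_smallest = smallest_normal / 2
print(half_of_smallest > 0)
```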
Storing real numbers on a computer:
–we must use a fixed amount of memory,
–we should be able to represent a wide range of numbers, both large and small,
–we should be able to represent numbers with a small relative error, and
–we should be able to easily test if one number is greater than, equal to, or less than another
Any number too small in magnitude to be represented is stored as 0.
Zero is represented by a double whose bits are all 0, with the possible exception of the sign bit:
    >> format hex; 0
    ans = 0000000000000000
    >> -0
    ans = 8000000000000000
    >> format long; 1/0
    ans = Inf
    >> 1/-0
    ans = -Inf
You may have noticed that we did not use both the largest and smallest exponents:
    >> format hex; realmax
    realmax = 7fefffffffffffff
    >> realmin
    realmin = 0010000000000000
The largest and smallest exponents should have been 7ff and 000, respectively.
These “special” exponents are used to represent special numbers, such as:
–infinity: 7ff000··· (positive) and fff000··· (negative)
–not-a-number: 7ff800···
–denormalized numbers: numbers existing between 0 and realmin, but at reduced precision
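The bit patterns behind these special values can be inspected outside Matlab as well. The sketch below uses Python's `struct` module to play the role of `format hex`; `double_hex` is a hypothetical helper name, and IEEE 754 doubles are assumed.

```python
import math
import struct
import sys


def double_hex(x: float) -> str:
    """Return the 64 bits of a double as 16 hexadecimal digits."""
    return struct.pack(">d", x).hex()


print(double_hex(sys.float_info.max))   # largest double, exponent field 7fe
print(double_hex(sys.float_info.min))   # smallest normal double, exponent field 001
print(double_hex(math.inf))             # exponent field 7ff, mantissa all 0
print(double_hex(-math.inf))            # same, with the sign bit set
print(double_hex(0.0), double_hex(-0.0))  # zeros differ only in the sign bit
print(double_hex(5e-324))               # smallest denormalized number, exponent field 000
```

The exponent fields 7ff and 000 appear only for the special values, which is why `realmax` and `realmin` stop at 7fe and 001.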
Thus, we can classify real numbers according to whether they:
–are represented by 0,
–are not represented with full precision (the denormalized numbers),
–are represented using 53 bits of precision, or
–are represented by infinity
Subtractive Cancellation
The next problem we will look at deals with subtracting similar numbers.
Suppose we take the difference between π and the 3-digit approximation 3.14 using our six-digit floating-point representation.
Performing the calculation: 3.142 – 3.140 = 0.002 = 2.000 × 10^–3, which has the representation 0462000.
How accurate is this difference?
Recall that 3.14 is represented precisely by our floating-point representation, but our representation of π, 3.142, has a relative error of about 0.00013.
By calculating the difference of almost-equal numbers, we lose a significant amount of precision.
The actual value of the difference is π – 3.14 = 0.00159265···, and therefore the relative error of our approximation of this difference is |0.002 – 0.00159265···| / 0.00159265··· ≈ 0.2558.
Thus, the relative error of the computed difference is significant: 25.58%.
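The arithmetic above can be reproduced with a four-significant-digit decimal model. This is a sketch using Python's `decimal` module as a stand-in for the six-digit representation (only the four-digit mantissa is modeled, not the exponent range); the variable names are illustrative.

```python
import math
from decimal import Decimal, localcontext

with localcontext() as ctx:
    ctx.prec = 4                                # four significant decimal digits
    pi_rounded = +Decimal("3.14159265358979")   # unary plus rounds to 3.142
    difference = pi_rounded - Decimal("3.14")   # 3.142 - 3.14 = 0.002

# Reference value computed in double precision, which is ample here.
exact_difference = math.pi - 3.14
relative_error = abs(float(difference) - exact_difference) / exact_difference

print(pi_rounded, difference, f"{relative_error:.2%}")
```

A relative error of about 0.013% in the stored value of π has grown to roughly 25% in the difference.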
Subtractive cancellation is the phenomenon where the subtraction of similar numbers results in a significant reduction in precision.
As another example, recall the definition of the derivative:
f^(1)(x) = lim_{h → 0} (f(x + h) – f(x))/h
Assuming that this limit exists, using a smaller and smaller value of h should result in a very good approximation to f^(1)(x).
Let’s try this out with f(x) = sin(x) and let us approximate f^(1)(1).
From calculus, we know that the actual derivative is cos(1) = 0.5403023058···
Let us use Matlab to approximate this derivative using h = 0.1, 0.01, 0.001, ...
    >> for i=1:8
           h = 10^-i;
           (sin(1 + h) - sin(1))/h
       end
Matlab prints one value of ans for each of the eight step sizes.
    >> for i=8:16
           h = 10^-i;
           (sin(1 + h) - sin(1))/h
       end
Matlab again prints one value of ans for each step size; the last, for h = 10^–16, is
    ans = 0
What happened here?
With h = 10^–8, we had an approximation which has a relative error of 2.6 × 10^–8, or 7 decimal digits of precision.
With smaller and smaller values of h, however, the error increases until we have a completely useless approximation when h = 10^–16.
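The same experiment can be repeated with the relative error printed beside each estimate. This is a sketch in Python rather than Matlab, assuming IEEE 754 doubles; `forward_difference` is a hypothetical helper name for the quotient used in the loops above.

```python
import math


def forward_difference(f, x, h):
    """Approximate f'(x) by (f(x + h) - f(x))/h."""
    return (f(x + h) - f(x)) / h


exact = math.cos(1.0)
errors = {}
for i in range(1, 17):
    h = 10.0 ** -i
    estimate = forward_difference(math.sin, 1.0, h)
    errors[i] = abs(estimate - exact) / exact   # relative error for h = 10^-i
    print(f"h = 1e-{i:02d}  estimate = {estimate:.15f}  rel. error = {errors[i]:.1e}")
```

For h = 10^–16 the sum 1 + h rounds back to 1, so the numerator is exactly 0 and the relative error is 100%.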
Looking at sin(1 + h) and sin(1) when h = 10^–12:
    >> h = 1e-12
    h = 1.000000000000000e-12
    >> sin(1 + h)
    ans = 0.841470984808437
    >> sin(1)
    ans = 0.841470984807897
Consequently, we are subtracting two numbers which are almost equal.
The next slide shows the bits using h = 2^–n for n = 1, 2, ..., 53.
Note that double-precision floating-point numbers have 53 bits of precision.
The red digits show the digits lost as a result of the subtractive cancellation.
Approximating the derivative of sin(x) at x = 1:
–green digits show accuracy, while
–red digits show loss of precision
    >> for i=1:53
           h = 2^-i;
           (sin(1 + h) - sin(1))/h
       end
Later in this course, we will find a formula which approximates the derivative of sin(x) at x = 1 with a result significantly closer to cos(1) than any approximation we saw before.
Thus, we cannot simply use the formulae covered in calculus to calculate numbers numerically.
We will now see how an algebraic formula you learned in high school can also fail:
–the quadratic formula
Rather than using doubles, we will use our six-digit floating-point numbers to show how the quadratic formula can fail.
Suppose we wish to find the smaller root (in magnitude) of a quadratic equation ax² + bx + c = 0 whose roots are both negative.
Using four decimal digits of precision for each calculation, we find that our approximation to the smaller of the two roots has a relative error of 34%.
Approximating the larger of the two roots, we get x = –144.2.
The relative error of this approximation is far smaller.
Why does one formula work so well while the other fails so miserably?
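Stepping through the calculation, each of b, b², 4ac, and b² – 4ac is rounded to four decimal digits.
When 4ac is small relative to b², √(b² – 4ac) is almost equal to b, so –b + √(b² – 4ac) subtracts two nearly equal numbers and most of the significant digits cancel.
For the other root, –b – √(b² – 4ac) adds two numbers of the same sign, and no cancellation occurs.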
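The failure can be reproduced in a four-significant-digit decimal model. In the sketch below, the coefficients a = 1, b = 144.2, c = 25 are illustrative assumptions and not the ones used on the slides, and `quadratic_roots_four_digits` is a hypothetical helper built on Python's `decimal` module.

```python
import math
from decimal import Decimal, localcontext


def quadratic_roots_four_digits(a, b, c):
    """Apply the quadratic formula, rounding every operation to 4 digits."""
    with localcontext() as ctx:
        ctx.prec = 4
        a, b, c = +Decimal(a), +Decimal(b), +Decimal(c)
        root_of_discriminant = (b * b - 4 * a * c).sqrt()
        small = (-b + root_of_discriminant) / (2 * a)   # nearly equal terms cancel
        large = (-b - root_of_discriminant) / (2 * a)   # same signs, no cancellation
    return small, large


# Hypothetical coefficients chosen so that 4ac is small relative to b^2.
small, large = quadratic_roots_four_digits("1", "144.2", "25")

# Reference roots in double precision.
disc = math.sqrt(144.2**2 - 4 * 25)
exact_small = (-144.2 + disc) / 2
exact_large = (-144.2 - disc) / 2

error_small = abs(float(small) - exact_small) / abs(exact_small)
error_large = abs(float(large) - exact_large) / abs(exact_large)
print(small, f"{error_small:.1%}")
print(large, f"{error_large:.3%}")
```

With these assumed coefficients the root of smaller magnitude loses most of its digits, while the other root keeps nearly all four.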
Non-Associativity
Normally, the operations of addition and multiplication are associative, that is:
(a + b) + c = a + (b + c)
(ab)c = a(bc)
Unfortunately, floating-point addition and multiplication are not associative.
If we add a large number to a small number, the large number dominates: with four digits of precision, adding a sufficiently small number to 5593. leaves the result at 5593.
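Consider an expression of the form a + 54.73 – 54.39, where a is small, evaluated with four digits of precision.
If we calculate the first sum first: (a + 54.73) – 54.39 = 54.74 – 54.39 = 0.35
If we calculate the second sum first: a + (54.73 – 54.39) = a + 0.3400, which retains the digits of a that were rounded away in the first ordering.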
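Doubles show the same behaviour. The values below are illustrative choices rather than the numbers from the slides; the sketch is in Python and assumes IEEE 754 doubles.

```python
# Grouping changes the result of a floating-point sum.
left_first = (0.1 + 0.2) + 0.3
right_first = 0.1 + (0.2 + 0.3)
print(left_first == right_first, left_first, right_first)

# A large number dominates a small one: adding 1 to 1e16 changes nothing,
# because neighbouring doubles near 1e16 are 2 apart.
large_plus_small = 1e16 + 1.0
print(large_plus_small == 1e16)

# The same three numbers, grouped two ways.
small_lost = (1e16 + 1.0) - 1e16    # the 1 is absorbed before the subtraction
small_kept = (1e16 - 1e16) + 1.0    # the large terms cancel first
print(small_lost, small_kept)
```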
Order of Operations
Consider calculating in Matlab a sum whose terms are indexed from 1 to 10^6.
The correct answer can be found to 20 decimal digits of precision for comparison.
Adding the numbers in the natural order, from 1 to 10^6, gives one result; adding the numbers in the reverse order gives a different one.
The second result is off by only the last digit (and only by 0.76).
To see why this happens, consider a decimal floating-point model which stores only four decimal digits of precision.
Adding a sum of several terms from left to right, rounding after each addition, we get 52.37.
Adding the expression from right to left, we get a different value.
This second value has a smaller relative error when compared to the correct answer (if we keep all precision).
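The effect of summation order can be observed with doubles. In the sketch below the summand 1/n² is an assumption made for illustration, since any sum of decreasing positive terms behaves similarly; `math.fsum` supplies a correctly rounded reference sum. The code is Python rather than Matlab.

```python
import math

N = 10**6


def term(n):
    """Assumed summand for illustration: 1/n^2."""
    return 1.0 / (n * n)


# Natural order: large terms first, so later small terms are partly lost.
forward = 0.0
for n in range(1, N + 1):
    forward += term(n)

# Reverse order: small terms accumulate before meeting the large ones.
reverse = 0.0
for n in range(N, 0, -1):
    reverse += term(n)

reference = math.fsum(term(n) for n in range(1, N + 1))

print(f"forward error: {abs(forward - reference):.2e}")
print(f"reverse error: {abs(reverse - reference):.2e}")
```

Adding the smallest terms first lets them build up to a size that survives rounding when the large terms are finally added.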
Usage Notes
These slides are made publicly available on the web for anyone to use.
If you choose to use them, or a part thereof, for a course at another institution, I ask only three things:
–that you inform me that you are using the slides,
–that you acknowledge my work, and
–that you alert me of any mistakes which I made or changes which you make, and allow me the option of incorporating such changes (with an acknowledgment) in my set of slides
Sincerely, Douglas Wilhelm Harder, MMath