Published by Ross Roberts. Modified over 9 years ago.
1
Problems with Floating-Point Representations
Douglas Wilhelm Harder
Department of Electrical and Computer Engineering
University of Waterloo
Copyright © 2007 by Douglas Wilhelm Harder. All rights reserved.
ECE 204 Numerical Methods for Computer Engineers
2
This topic covers a number of the problems with using a floating-point representation, including:
–underflow and overflow
–subtractive cancellation
–adding large and small numbers
–non-associativity: $(a + b) + c \ne a + (b + c)$
3
Underflow and Overflow
In our six-decimal-digit floating-point representation, the largest number we can represent is $9.999 \times 10^{50}$
The largest double is approximately $1.8 \times 10^{308}$:
>> format long; realmax
realmax = 1.79769313486232e+308
>> format hex; realmax
realmax = 7fefffffffffffff
or, more precisely, $1.1111111111111111111111111111111111111111111111111111_2 \times 2^{1023}$
4
Underflow and Overflow
Any number larger than these values cannot be represented in these formats
To solve this problem, we can introduce a floating-point infinity:
>> format long; 2e308
ans = Inf
>> format hex; 2e308
ans = 7ff0000000000000
5
Underflow and Overflow
The properties of infinity include:
–any real plus infinity is infinity
–one over infinity is 0
–any positive number times infinity is infinity
–any negative number times infinity is –infinity
For example:
>> Inf + 1e100
ans = Inf
>> 325*Inf
ans = Inf
>> 1/Inf
ans = 0
>> -2*Inf
ans = -Inf
6
Underflow and Overflow
The introduction of a floating-point infinity allows computations to continue and removes the need to signal overflows through exceptions
An example where infinity may not cause a problem is where its reciprocal is immediately taken:
>> 5 + 1/2e400
ans = 5
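The infinity rules on these slides can also be checked outside Matlab. The following is a minimal sketch in Python (a choice made for this example; the slides use Matlab), whose `float` type is also an IEEE 754 double. The dictionary name `results` is illustrative only.

```python
import math

inf = float("inf")  # IEEE 754 positive infinity

# Each key is the Matlab expression from the slides; each value is the Python equivalent.
results = {
    "Inf + 1e100": inf + 1e100,
    "325*Inf": 325 * inf,
    "1/Inf": 1 / inf,
    "-2*Inf": -2 * inf,
    "5 + 1/2e400": 5 + 1 / 2e400,  # the literal 2e400 overflows to infinity
}

for expression, value in results.items():
    print(f"{expression:>12} = {value}")
```

The last entry illustrates the point of the slide: the overflow is absorbed by the reciprocal, so the computation continues without any exception being raised.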
7
Underflow and Overflow
In our six-decimal-digit floating-point representation, the smallest positive number we can represent is $1.000 \times 10^{-49}$
The smallest positive double (using the normal representation) is approximately $2.2 \times 10^{-308}$:
>> format long; realmin
realmin = 2.22507385850720e-308
>> format hex; realmin
realmin = 0010000000000000
or, more precisely, $2^{-1022}$
8
Underflow and Overflow
Storing real numbers on a computer:
–we must use a fixed amount of memory,
–we should be able to represent a wide range of numbers, both large and small,
–we should be able to represent numbers with a small relative error, and
–we should be able to easily test if one number is greater than, equal to, or less than another
9
Underflow and Overflow
Any number smaller in magnitude than these values is represented by 0
Zero is stored as a double with all bits 0, with the possible exception of the sign bit:
>> format hex; 0
ans = 0000000000000000
>> -0
ans = 8000000000000000
>> format long; 1/0
ans = Inf
>> 1/-0
ans = -Inf
10
Underflow and Overflow
You may have noticed that we did not use both the largest and smallest exponents:
>> format hex; realmax
realmax = 7fefffffffffffff
>> realmin
realmin = 0010000000000000
Here realmax uses the exponent 7fe and realmin uses the exponent 001, whereas the largest and smallest exponents should have been 7ff and 000, respectively
11
Underflow and Overflow
These “special” exponents are used to represent special numbers, such as:
–infinity: 7ff000··· (Inf) and fff000··· (–Inf)
–not-a-number: 7ff800···
–0: 000000··· (0) and 800000··· (–0)
–denormalized numbers: numbers existing between 0 and realmin, but at reduced precision
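The bit patterns quoted on these slides can be inspected directly. The sketch below uses Python rather than Matlab (an assumption of this example); `hex_bits` is a hypothetical helper written for this illustration, playing the role of Matlab's `format hex`.

```python
import struct
import sys

def hex_bits(x: float) -> str:
    """Return the 64-bit IEEE 754 pattern of x as 16 hexadecimal digits."""
    return struct.pack(">d", x).hex()

samples = {
    "realmax": sys.float_info.max,       # largest finite double
    "realmin": sys.float_info.min,       # smallest positive normal double
    "+0": 0.0,
    "-0": -0.0,
    "+Inf": float("inf"),
    "-Inf": float("-inf"),
    "NaN": float("nan"),                 # exact NaN pattern may vary by platform
    "denormal": sys.float_info.min / 2,  # below realmin, exponent field 000
}

for name, value in samples.items():
    print(f"{name:>8}: {hex_bits(value)}")
```

One difference from the Matlab session is worth noting: Python raises `ZeroDivisionError` for `1/0.0` instead of returning `Inf`, so the sign of zero is shown here only through its bit pattern.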
12
Underflow and Overflow
Thus, we can classify numbers into those which:
–are represented by 0,
–are not represented with full precision,
–are represented using 53 bits of precision, and
–are represented by infinity
13
Subtractive Cancellation
The next problem we will look at deals with subtracting similar numbers
Suppose we take the difference between $\pi$ and the 3-digit approximation 3.14 using our six-digit floating-point representation, in which the two numbers are stored as 0493142 and 0493140
Performing the calculation:
$3.142 - 3.140 = 0.002 = 2.000 \times 10^{-3}$
which has the representation 0462000
14
Subtractive Cancellation
How accurate is this difference?
Recall that 3.14 is represented precisely by our floating-point representation, but our representation of $\pi$, 3.142, has a relative error of approximately 0.00013
By calculating the difference of almost-equal numbers, we lose a significant amount of precision
15
Subtractive Cancellation
The actual value of the difference is
$\pi - 3.14 = 0.001592654\cdots$
and therefore, the relative error of our approximation 0.002 of this difference is
$\frac{|0.002 - 0.001592654|}{|0.001592654|} \approx 0.2558$
Thus, the relative error of the difference we were trying to calculate is significant: 25.58%
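The loss of precision can be reproduced with a short sketch. It uses Python's `decimal` module with four significant digits as a stand-in for the four-digit mantissa of the slides' decimal format; this is an assumption of the example, not the representation itself, and the variable names are illustrative.

```python
import math
from decimal import Decimal, localcontext

with localcontext() as ctx:
    ctx.prec = 4                     # keep four significant decimal digits
    pi4 = +Decimal(math.pi)          # unary plus rounds to the context precision
    approx = pi4 - Decimal("3.14")   # difference of two nearly equal numbers

exact = math.pi - 3.14               # reference value in double precision
rel_err = abs(float(approx) - exact) / abs(exact)

print("stored pi:     ", pi4)
print("computed diff: ", approx)
print("relative error:", rel_err)
```

A relative error of about 0.00013 in the stored value of π grows to roughly 25% in the difference, because the leading digits 3.14 cancel and only the rounded last digit remains.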
16
Subtractive Cancellation
Subtractive cancellation is the phenomenon where the subtraction of similar numbers results in a significant reduction in precision
17
Subtractive Cancellation
As another example, recall the definition of the derivative:
$f^{(1)}(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$
Assuming that this limit converges, using smaller and smaller values of h should result in a very good approximation to $f^{(1)}(x)$
18
Subtractive Cancellation
Let’s try this out with f(x) = sin(x) and let us approximate $f^{(1)}(1)$
From calculus, we know that the actual derivative is cos(1) = 0.5403023058681397···
Let us use Matlab to approximate this derivative using h = 0.1, 0.01, 0.001, ...
19
Subtractive Cancellation
>> for i=1:8
h = 10^-i;
(sin(1 + h) - sin(1))/h
end
ans = 0.497363752535389
ans = 0.536085981011869
ans = 0.539881480360327
ans = 0.540260231418621
ans = 0.540298098505865
ans = 0.540301885010308
ans = 0.540302264040449
ans = 0.540302291796024
20
Subtractive Cancellation
>> for i=8:16
h = 10^-i;
(sin(1 + h) - sin(1))/h
end
ans = 0.540302291796024
ans = 0.540302358409406
ans = 0.540302247387103
ans = 0.540301137164079
ans = 0.540345546085064
ans = 0.539568389967826
ans = 0.532907051820075
ans = 0.555111512312578
ans = 0
21
Subtractive Cancellation
What happened here?
With $h = 10^{-8}$, we had an approximation with a relative error of $2.6 \times 10^{-8}$, or about 7 decimal digits of precision
With smaller and smaller values of h, however, the error increases until we have a completely useless approximation when $h = 10^{-16}$
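The experiment is easy to repeat. The sketch below is a Python translation of the Matlab loops above (Python is an assumption of this example; its floats are the same IEEE 754 doubles), with the relative error printed next to each approximation. The last few digits may differ slightly from the Matlab output, depending on the platform's `sin`.

```python
import math

exact = math.cos(1.0)  # the true derivative of sin(x) at x = 1
approx = {}

for i in range(1, 17):
    h = 10.0 ** -i
    # Forward difference: the numerator subtracts two nearly equal numbers.
    approx[i] = (math.sin(1.0 + h) - math.sin(1.0)) / h
    rel_err = abs(approx[i] - exact) / exact
    print(f"h = 1e-{i:02d}  approx = {approx[i]:.15f}  relative error = {rel_err:.1e}")
```

The error shrinks until h is near $10^{-8}$ and then grows again. At $h = 10^{-16}$ the sum 1 + h rounds to exactly 1, so the numerator, and with it the approximation, is 0.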
22
Subtractive Cancellation
Looking at sin(1 + h) and sin(1) when $h = 10^{-12}$:
>> h = 1e-12
h = 1.00000000000000e-12
>> sin(1 + h)
ans = 0.841470984808437
>> sin(1)
ans = 0.841470984807897
Consequently, we are subtracting two numbers which are almost equal
23
Subtractive Cancellation
The next slide shows the bits using $h = 2^{-n}$ for n = 1, 2, ..., 53
Note that double-precision floating-point numbers have 53 bits of precision
The red digits show the bits lost as a result of subtractive cancellation
24
Approximating the derivative of sin(x) at x = 1:
–green digits show accuracy, while
–red digits show loss of precision
>> for i=1:53
h = 2^-i;
(sin(1 + h) - sin(1))/h
end
ans = 0011111111010011111110001001100000110000100011011000001001110100
ans = 0011111111011011100001100000001101111000010000011010010011110000
ans = 0011111111011111001000001011101110110001001110100000111000000000
ans = 0011111111100000011011111110110111010001101110101110100101110000
ans = 0011111111100000110111011011110010010001110111011111011010000000
ans = 0011111111100001000101000001111110010011011101000110100000000000
ans = 0011111111100001001011110010111100111101010111001011000110000000
ans = 0011111111100001001111001010111010000100110110111000100000000000
ans = 0011111111100001010000110110110000000010010010001101100000000000
ans = 0011111111100001010001101100101000110111000011001000110000000000
ans = 0011111111100001010010000111100100101110111001011110000000000000
ans = 0011111111100001010010010101000010100010001011101111000000000000
ans = 0011111111100001010010011011110001011001101010100110000000000000
ans = 0011111111100001010010011111001000110100110111011100000000000000
ans = 0011111111100001010010100000110100100010010101010000000000000000
ans = 0011111111100001010010100001101010011001000010000000000000000000
ans = 0011111111100001010010100010000101010100010111100000000000000000
ans = 0011111111100001010010100010010010110010000010000000000000000000
ans = 0011111111100001010010100010011001100000111000000000000000000000
ans = 0011111111100001010010100010011100111000010000000000000000000000
ans = 0011111111100001010010100010011110100100000000000000000000000000
ans = 0011111111100001010010100010011111011001110000000000000000000000
ans = 0011111111100001010010100010011111110100100000000000000000000000
ans = 0011111111100001010010100010100000000010000000000000000000000000
ans = 0011111111100001010010100010100000001000000000000000000000000000
ans = 0011111111100001010010100010100000001100000000000000000000000000
ans = 0011111111100001010010100010100000010000000000000000000000000000
ans = 0011111111100001010010100010100000000000000000000000000000000000
ans = 0011111111100001010010100010000000000000000000000000000000000000
ans = 0011111111100001010010100000000000000000000000000000000000000000
ans = 0011111111100001010010000000000000000000000000000000000000000000
ans = 0011111111100001010000000000000000000000000000000000000000000000
ans = 0011111111100001000000000000000000000000000000000000000000000000
ans = 0011111111100000000000000000000000000000000000000000000000000000
ans = 0000000000000000000000000000000000000000000000000000000000000000
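A listing of this kind can be regenerated with the sketch below, written in Python rather than Matlab (an assumption of this example). The helper `bits` is hypothetical and exists only for this illustration; individual low-order bits may differ from the slide depending on the platform's `sin`.

```python
import math
import struct

def bits(x: float) -> str:
    """Return the 64-bit IEEE 754 pattern of x as a string of 0s and 1s."""
    (as_int,) = struct.unpack(">Q", struct.pack(">d", x))
    return format(as_int, "064b")

rows = []
for n in range(1, 54):
    h = 2.0 ** -n  # a power of two, so dividing by h is exact
    rows.append(bits((math.sin(1.0 + h) - math.sin(1.0)) / h))

for n, row in enumerate(rows, start=1):
    print(f"{n:2d} {row}")
```

Because h is a power of two, the division introduces no rounding of its own, so the growing run of trailing zeros comes from the subtraction: each halving of h cancels roughly one more leading bit of the difference.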
25
Subtractive Cancellation
Later in this course, we will find a formula which will approximate the derivative of sin(x) at x = 1 using h = 0.001 by
0.540302305868125
which is significantly closer to cos(1) = 0.540302305868140 than any approximation we saw before
26
Subtractive Cancellation
Thus, we cannot simply use the formulae covered in calculus to calculate numbers numerically
We will now see how an algebraic formula you learned in high school can also fail:
–the quadratic formula
27
Subtractive Cancellation
Rather than using doubles, we will use our six-digit floating-point numbers to show how the quadratic formula can fail
Suppose we wish to find the smaller root (the root smaller in magnitude) of the quadratic equation
$0.05231x^2 + 7.539x + 0.1094 = 0$
This equation has roots at
$x = -144.1070702$, $x = -0.01451266977$
28
Subtractive Cancellation
Using four decimal digits of precision for each calculation, we find that our approximation to the smaller of the two roots is
$x = -0.009560$
The relative error of this approximation is approximately 0.341, or 34%
29
Subtractive Cancellation
Approximating the larger of the two roots, we get
$x = -144.2$
The relative error of this approximation is only 0.0006449, or 0.0645%
Why does one formula work so well while the other fails so miserably?
30
Subtractive Cancellation
Stepping through the calculation:
$b = 7.539$
$b^2 = 56.84$
$4ac = 0.02289$
$b^2 - 4ac = 56.82$
$\sqrt{b^2 - 4ac} = 7.538$
$-b + \sqrt{b^2 - 4ac} = -0.001$
The actual value of $-b + \sqrt{b^2 - 4ac}$ is –0.0015183155···
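The steps above can be replayed with a short sketch. It uses Python's `decimal` module with a precision of four significant digits to mimic rounding after every operation; this is an assumption standing in for the slides' decimal format, and the variable names are illustrative.

```python
from decimal import Decimal, localcontext

a, b, c = Decimal("0.05231"), Decimal("7.539"), Decimal("0.1094")

with localcontext() as ctx:
    ctx.prec = 4                    # every operation is rounded to 4 significant digits
    disc = b * b - 4 * a * c        # 56.84 - 0.02289, rounded to 56.82
    root = disc.sqrt()              # 7.538
    small = (-b + root) / (2 * a)   # cancellation: -7.539 + 7.538
    large = (-b - root) / (2 * a)   # no cancellation: both terms are negative

# Reference roots as quoted on the slides.
true_small, true_large = -0.01451266977, -144.1070702
print("small root:", small, "relative error:", abs((float(small) - true_small) / true_small))
print("large root:", large, "relative error:", abs((float(large) - true_large) / true_large))
```

The two roots use the same rounded intermediate values; only the final numerator differs. Adding $-b$ and the square root cancels all but one digit, while subtracting them keeps all four.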
31
Non-Associativity
Normally, the operations of addition and multiplication are associative, that is:
$(a + b) + c = a + (b + c)$
$(ab)c = a(bc)$
Unfortunately, floating-point addition and multiplication are not associative
If we add a large number to a small number, the large number dominates:
5592. + 0.5923 = 5593.
32
Non-Associativity
Consider the example
0.005312 + 54.73 – 54.39
If we calculate the first sum first:
(0.005312 + 54.73) – 54.39 = 54.74 – 54.39 = 0.35
If we calculate the second sum first:
0.005312 + (54.73 – 54.39) = 0.005312 + 0.34 = 0.3453
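Both groupings can be checked with a minimal sketch, again using Python's `decimal` module at four significant digits as an assumed stand-in for the slides' decimal format. The second half shows a well-known instance of the same effect in ordinary binary doubles; that example is not from the slides.

```python
from decimal import Decimal, localcontext

x, y, z = Decimal("0.005312"), Decimal("54.73"), Decimal("54.39")

with localcontext() as ctx:
    ctx.prec = 4
    left = (x + y) - z    # the small term is mostly lost when added to 54.73
    right = x + (y - z)   # the large terms cancel first, so the small term survives

print("(x + y) - z =", left)
print("x + (y - z) =", right)

# The same effect with IEEE 754 doubles.
double_left = (0.1 + 0.2) + 0.3
double_right = 0.1 + (0.2 + 0.3)
print(double_left, double_right, double_left == double_right)
```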
33
Order of Operations
Consider calculating the following sum in Matlab:
$\sum_{n = 1}^{10^6} \frac{1}{n}$
The correct answer, to 20 decimal digits of precision, is
14.392726722865723632
34
Order of Operations
Adding the numbers in the natural order, from 1 to $10^6$, we get the following result:
14.3927267228648
Adding the numbers in the reverse order, we get the result
14.3927267228658
The second result is off in only the last digit (and only by 0.76 of a unit in that digit)
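The comparison can be repeated with the following sketch in Python (an assumption of this example; the slides use Matlab). It uses explicit loops so that each addition is a plain double-precision operation, and `math.fsum` as an accurately rounded reference; the printed digits need not match the Matlab values quoted above.

```python
import math

N = 10**6

forward = 0.0
for n in range(1, N + 1):    # largest terms first
    forward += 1.0 / n

backward = 0.0
for n in range(N, 0, -1):    # smallest terms first
    backward += 1.0 / n

# math.fsum tracks the rounding errors and returns an accurately rounded sum.
reference = math.fsum(1.0 / n for n in range(1, N + 1))

print(f"forward:   {forward:.15f}  error = {abs(forward - reference):.1e}")
print(f"backward:  {backward:.15f}  error = {abs(backward - reference):.1e}")
print(f"reference: {reference:.15f}")
```

Summing the small terms first lets them accumulate before they meet the large partial sum, which is why the reverse order loses less.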
35
Order of Operations
To see why this happens, consider a decimal floating-point model which stores only four decimal digits of precision:
52.37 + 0.004291 + 0.0009023
Adding from left to right, we get:
(52.37 + 0.004291) + 0.0009023 = 52.37 + 0.0009023 = 52.37
36
Order of Operations
Adding the expression from right to left, we get:
52.37 + (0.004291 + 0.0009023) = 52.37 + 0.005193 = 52.38
This second value has a smaller relative error when compared to the correct answer (if we keep all precision) of 52.3751933
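The two orders of summation can be reproduced with a small sketch. The helper `sum_in_order` is hypothetical, written for this illustration; it uses Python's `decimal` module at four significant digits as an assumed stand-in for the model described above.

```python
from decimal import Decimal, localcontext

def sum_in_order(terms, digits=4):
    """Add the terms one at a time, rounding each partial sum to `digits` digits."""
    with localcontext() as ctx:
        ctx.prec = digits
        total = Decimal(0)
        for term in terms:
            total += term
        return total

terms = [Decimal("52.37"), Decimal("0.004291"), Decimal("0.0009023")]

largest_first = sum_in_order(terms)            # left to right, as written
smallest_first = sum_in_order(sorted(terms))   # ascending order of size

print("largest first: ", largest_first)
print("smallest first:", smallest_first)
print("exact:         ", sum(terms))           # default precision keeps every digit here
```

Sorting by increasing magnitude before summing is a simple way to reduce this kind of loss when all the terms are positive.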
37
Usage Notes
These slides are made publicly available on the web for anyone to use
If you choose to use them, or a part thereof, for a course at another institution, I ask only three things:
–that you inform me that you are using the slides,
–that you acknowledge my work, and
–that you alert me of any mistakes which I made or changes which you make, and allow me the option of incorporating such changes (with an acknowledgment) in my set of slides
Sincerely,
Douglas Wilhelm Harder, MMath
dwharder@alumni.uwaterloo.ca