Recall our hypothetical computer Marc-32
Sign: 1 bit; Mantissa: 23 bits; Exponent: 8 bits
Normalized floating-point representation: $0 \ne x = \pm q \times 2^{m}$ $(1 \le q < 2,\ -126 \le m \le 127)$
Unit roundoff error: $2^{-24}$
Floating-point machine number $fl(x)$: $fl(x) = x(1+\delta)$, $|\delta| \le 2^{-24}$
Example 1 (P46) What is the binary form of $x = 2/3$? What are the two nearby machine numbers $x_-$ and $x_+$ in the Marc-32? Which one is taken to be $fl(x)$? What are the absolute roundoff error and the relative roundoff error in representing $x$ by $fl(x)$?
Solution. First, we write 2/3 in binary form:
$2/3 = (0.a_1a_2a_3\ldots)_2$  (1)
where each $a_i$ is either 0 or 1. Multiplying both sides by 2 gives
$4/3 = (a_1.a_2a_3\ldots)_2$
Taking the integer part of both sides gives $a_1 = 1$. So
$1/3 = 4/3 - 1 = (a_1.a_2a_3\ldots)_2 - 1 = (0.a_2a_3a_4\ldots)_2$
and multiplying by 2 again,
$2/3 = (a_2.a_3a_4a_5\ldots)_2$  (2)
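Comparing (1) and (2): the integer part of (2) gives $a_2 = 0$, and the fractional parts give $a_{i+2} = a_i$. Hence
$1 = a_1 = a_3 = a_5 = a_7 = \cdots$
$0 = a_2 = a_4 = a_6 = a_8 = \cdots$
Thus,
$x = 2/3 = (0.101010\ldots)_2 = (1.01010\ldots)_2 \times 2^{-1}$
In Marc-32, the two nearby machine numbers (23 mantissa bits after the binary point) are
$x_- = (1.0101\ldots010)_2 \times 2^{-1}$ (by chopping)
$x_+ = (1.0101\ldots011)_2 \times 2^{-1}$ (by rounding up)
Recall: $x_- < x < x_+$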
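The repeated doubling used in the solution above can be sketched in a few lines of Python. This is only an illustration: the helper name `binary_digits` is hypothetical, and exact rational arithmetic (`fractions.Fraction`) is used so that the digits themselves are not affected by roundoff.

```python
from fractions import Fraction

def binary_digits(x, count):
    """Return the first `count` binary digits a1, a2, ... of a fraction 0 <= x < 1."""
    digits = []
    for _ in range(count):
        x *= 2                # shift the binary point one place to the right
        bit = int(x >= 1)     # the integer part is the next digit
        digits.append(bit)
        x -= bit              # keep only the fractional part
    return digits

digits = binary_digits(Fraction(2, 3), 12)
print(digits)  # the digits alternate 1, 0, 1, 0, ...
```

The same loop works for any rational number in [0, 1); for 2/3 the state returns to 2/3 after two steps, which is exactly the periodicity read off from (1) and (2).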
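Which of the two is $fl(x)$? Compare the distances:
$x - x_- = (1.01010\ldots)_2 \times 2^{-24} \times 2^{-1} = (0.101010\ldots)_2 \times 2^{-23} \times 2^{-1} = \tfrac{2}{3} \times 2^{-24}$
$x_+ - x = (x_+ - x_-) - (x - x_-) = 2^{-24} - \tfrac{2}{3} \times 2^{-24} = \tfrac{1}{3} \times 2^{-24}$
Since $x_+$ is the closer machine number, $fl(x) = x_+$.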
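The absolute roundoff error: $|fl(x) - x| = \tfrac{1}{3} \times 2^{-24}$
The relative roundoff error: $\dfrac{|fl(x) - x|}{|x|} = \dfrac{\tfrac{1}{3} \times 2^{-24}}{\tfrac{2}{3}} = 2^{-25}$
Check your textbook (P47) for the decimal forms of $x_-$, $x_+$ and their distances from $x$.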
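These values can be checked numerically. The sketch below assumes that IEEE 754 single precision is an acceptable stand-in for Marc-32 (both keep 23 stored mantissa bits and round to nearest); the helper names `to_float32` and `float32_below` are hypothetical. Errors are computed exactly with `fractions.Fraction`.

```python
import struct
from fractions import Fraction

def to_float32(value):
    """Round a Python float to the nearest IEEE 754 single-precision number."""
    return struct.unpack("<f", struct.pack("<f", value))[0]

def float32_below(value):
    """Return the single-precision number just below a positive float32 value."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    return struct.unpack("<f", struct.pack("<I", bits - 1))[0]

x = Fraction(2, 3)
fl_x = to_float32(2 / 3)          # lies above x, so it plays the role of x_+
x_minus = float32_below(fl_x)     # the neighbour below x, i.e. x_-

gap_below = x - Fraction(x_minus)     # x - x_-
abs_err = abs(Fraction(fl_x) - x)     # |fl(x) - x|
rel_err = abs_err / x                 # |fl(x) - x| / |x|

print(gap_below, abs_err, rel_err)
```

Under that assumption the script reproduces the three quantities derived above: $\tfrac{2}{3} \times 2^{-24}$, $\tfrac{1}{3} \times 2^{-24}$ and $2^{-25}$.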
Input numbers a, b, c, …
→ Normalized
→ Rounded off
→ Stored as machine numbers fl(a), fl(b), …
→ Do one arithmetic operation/calculation
→ Obtain a number (result), e.g. fl(a) × fl(b)
The computer with 5 decimal digits stores those results in rounded form as The relative errors are respectively
Let $\odot$ denote one of the four basic arithmetic operations: $+, -, \times, \div$.
Assume $x, y$ are machine numbers; then there is some constant $\delta$ such that
$fl(x \odot y) = [x \odot y](1+\delta)$, where $|\delta| \le \varepsilon$;
here, $\varepsilon$ can be taken to be the unit roundoff error for the machine. In Marc-32, $\varepsilon = 2^{-24}$.
Q: How to compute $x \odot y$ if $x, y$ are not machine numbers?
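The bound on $\delta$ can be observed experimentally. The following is a minimal sketch for multiplication only, assuming IEEE 754 single precision as a stand-in for Marc-32; `to_float32` and `relative_delta` are hypothetical helper names, and the sample range [1, 2] is chosen simply to stay clear of overflow and underflow.

```python
import random
import struct
from fractions import Fraction

UNIT_ROUNDOFF = Fraction(1, 2**24)

def to_float32(value):
    """Round a Python float to the nearest IEEE 754 single-precision number."""
    return struct.unpack("<f", struct.pack("<f", value))[0]

def relative_delta(x, y):
    """Return delta such that fl(x*y) = (x*y)(1 + delta) for float32 inputs x, y."""
    exact = Fraction(x) * Fraction(y)
    # The double-precision product of two float32 values is exact,
    # so to_float32 performs the only rounding.
    computed = to_float32(x * y)
    return (Fraction(computed) - exact) / exact

random.seed(0)
worst = Fraction(0)
for _ in range(1000):
    x = to_float32(random.uniform(1.0, 2.0))
    y = to_float32(random.uniform(1.0, 2.0))
    worst = max(worst, abs(relative_delta(x, y)))

print(float(worst), float(UNIT_ROUNDOFF))
```

In this experiment the largest observed $|\delta|$ stays below $2^{-24}$, in line with the stated bound.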
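If $x, y$ are not machine numbers, then there are still constants $\delta_1, \delta_2, \delta_3$ such that
$fl(x) = x(1+\delta_1)$
$fl(y) = y(1+\delta_2)$
$fl(x \odot y) = fl(fl(x) \odot fl(y)) = (fl(x) \odot fl(y))(1+\delta_3) = [(x(1+\delta_1)) \odot (y(1+\delta_2))](1+\delta_3)$
For multiplication, for instance, this equals $(xy)(1+\delta_1+\delta_2+\delta_1\delta_2)(1+\delta_3)$, which is close to, but in general not equal to, $xy$.
Here $|\delta_1|, |\delta_2|, |\delta_3| \le \varepsilon$; still, $\varepsilon$ can be taken to be the unit roundoff error for the machine.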
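Q: How about compound arithmetic operations?
Assume $x, y, z \in A = \{\text{machine numbers of Marc-32}\}$.
$fl(x(y+z)) = [x \cdot fl(y+z)](1+\delta_1)$, $|\delta_1| \le 2^{-24}$
$= [x(y+z)(1+\delta_2)](1+\delta_1)$, $|\delta_2| \le 2^{-24}$
$= x(y+z)(1+\delta_2+\delta_1+\delta_1\delta_2)$
$\approx x(y+z)(1+\delta_2+\delta_1)$
$= x(y+z)(1+\delta_3)$, $|\delta_3| \le\ ?$
Here $\delta_3 = \delta_2 + \delta_1$.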
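The accumulated error $\delta_3$ can be measured directly. The sketch below again assumes IEEE 754 single precision in place of Marc-32, with hypothetical helper names, and rounds after each of the two operations. It compares the worst observed $|\delta_3|$ with $2 \times 2^{-24} = 2^{-23}$, the size suggested by $\delta_3 \approx \delta_1 + \delta_2$.

```python
import random
import struct
from fractions import Fraction

def to_float32(value):
    """Round a Python float to the nearest IEEE 754 single-precision number."""
    return struct.unpack("<f", struct.pack("<f", value))[0]

def fl_compound(x, y, z):
    """Evaluate x*(y+z) with a float32 rounding after each operation."""
    s = to_float32(y + z)        # y+z is exact in double for these inputs
    return to_float32(x * s)     # x*s is exact in double; one rounding here

def delta3(x, y, z):
    """Return delta3 with fl(x(y+z)) = x(y+z)(1 + delta3)."""
    exact = Fraction(x) * (Fraction(y) + Fraction(z))
    return (Fraction(fl_compound(x, y, z)) - exact) / exact

random.seed(1)
worst = Fraction(0)
for _ in range(1000):
    x, y, z = (to_float32(random.uniform(1.0, 2.0)) for _ in range(3))
    worst = max(worst, abs(delta3(x, y, z)))

print(float(worst), 2.0**-23)
```

Keeping the inputs in [1, 2] is a deliberate simplification: it guarantees that the intermediate double-precision results are exact, so each `to_float32` call models exactly one machine rounding.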
Exercise. Find fl(x(y+z)) for x, y, z A={machine numbers of Marc-32}.
Theorem on Relative Roundoff Error in Adding
2.2 Absolute & Relative Errors: Loss of Significance/Precision
Assume a real number $x$ is approximated by another number $x^*$; the error is $x - x^*$.
The absolute error: $|x - x^*|$
The relative error: $\dfrac{|x - x^*|}{|x|}$
The relative error involved in representing a real number $x$ by a nearby floating-point machine number $fl(x)$ is bounded by the unit roundoff error $\varepsilon$:
$\dfrac{|x - fl(x)|}{|x|} \le \varepsilon$
Roundoff errors are inevitable and difficult to control.
Loss of Significance
The subject of numerical analysis is largely concerned with understanding and controlling errors of various kinds.
For example,
$x = 0.3721478693$, $y = 0.3720230572$
$x - y = 0.0001248121$
If this calculation were to be performed in a decimal computer having a five-digit mantissa, we would have
$fl(x) = 0.37215$, $fl(y) = 0.37202$
$fl(x) - fl(y) = 0.00013$
The relative error: $\dfrac{|x - y - [fl(x) - fl(y)]|}{|x - y|} \approx 4\%$
The result is usually stored as a normalized floating-point number, i.e.,
$fl(x) - fl(y) = 0.00013 = 0.13000 \times 10^{-3}$
The three 0's appended above do NOT represent additional accuracy, i.e., those three additional 0's are NOT significant digits.
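The five-digit example can be replayed with Python's `decimal` module. This is a sketch under the assumption that a `Context` with `prec=5` (round-half-even by default) is a fair model of the five-digit decimal computer; the variable names are illustrative.

```python
from decimal import Decimal, Context

five_digits = Context(prec=5)   # hypothetical five-digit decimal machine

x = Decimal("0.3721478693")
y = Decimal("0.3720230572")

fl_x = five_digits.plus(x)                    # round x to five significant digits
fl_y = five_digits.plus(y)                    # round y to five significant digits
computed = five_digits.subtract(fl_x, fl_y)   # machine subtraction

exact = x - y                                 # exact at the default 28-digit precision
rel_err = abs(exact - computed) / exact

print(fl_x, fl_y, computed, exact, rel_err)
```

The inputs are each accurate to five significant digits, yet the computed difference carries only two, which is where the relative error of roughly 4% comes from.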
Subtraction of Nearly Equal Quantities
Example 1 The assignment statement
$y \leftarrow \sqrt{x^2+1} - 1$
can cause loss of significance for small values of $x$. How to avoid this trouble?
Solution. The statement can be replaced by
$y \leftarrow x^2 / (\sqrt{x^2+1} + 1)$
in programming to avoid such trouble.
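A quick experiment in double precision shows the difference between the two statements. The function names `naive` and `stable` are illustrative only, and the test point $x = 10^{-8}$ is an arbitrary choice of a "small" value.

```python
import math

def naive(x):
    """Direct form: subtracts two nearly equal numbers when x is small."""
    return math.sqrt(x * x + 1.0) - 1.0

def stable(x):
    """Rationalized form: no subtraction of nearly equal quantities."""
    return x * x / (math.sqrt(x * x + 1.0) + 1.0)

x = 1e-8
print(naive(x))    # x*x + 1 rounds to exactly 1, so every digit is lost
print(stable(x))   # close to x**2 / 2
```

For moderate $x$ the two forms agree to within roundoff, so the rationalized form can simply replace the direct one.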
Ex. 2: Find the root (keep 10 digits after the decimal point). 5.00000125000062500039e-4
Loss of Precision
Theorem 1 (P57) Theorem on Loss of Precision
If $x$ and $y$ are positive normalized floating-point binary machine numbers such that $x > y$ and
$2^{-q} \le 1 - \dfrac{y}{x} \le 2^{-p}$
then at most $q$ and at least $p$ significant binary bits are lost in the subtraction $x - y$.
Proof. We prove only the lower bound and leave the upper bound as an after-class exercise.
The normalized binary floating-point forms of $x, y$ are
$x = r \times 2^{n}$, $y = s \times 2^{m}$, $(\tfrac{1}{2} \le r, s < 1)$
Since $x > y$, the computer may have to shift $y$ so that $y$ has the same exponent as $x$ before performing $x - y$. So, we must write $y$ as
$y = (s \times 2^{m-n}) \times 2^{n}$
and then
$x - y = (r - s \times 2^{m-n}) \times 2^{n}$
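By assumption, we have
$r - s\,2^{m-n} = r\left(1 - \dfrac{s\,2^{m}}{r\,2^{n}}\right) = r\left(1 - \dfrac{y}{x}\right) < 1 - \dfrac{y}{x} \le 2^{-p}$
WLOG, assume the mantissa in the computer has $p+k$ digits ($k \ge 1$); then
$r - s\,2^{m-n} = (0.\underbrace{00\ldots0}_{p}\,a_1a_2\ldots a_k)_2$
The normalized floating-point form of $x - y$ is
$x - y = (0.a_1a_2a_3\ldots a_k\,0\ldots0)_2 \times 2^{n-p}$ if $a_1 \ne 0$
$x - y = (0.a_2a_3\ldots a_k\,0\ldots00)_2 \times 2^{n-p-1}$ if $a_1 = 0,\ a_2 \ne 0$
$\cdots$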
i.e., a shift of at least 𝑝 bits to the left is required; meanwhile, at least 𝑝 spurious 0’s are attached to the right end of the mantissa, which means that at least 𝑝 bits of precision have been lost.
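The bounds in the theorem are easy to evaluate for a given pair. The sketch below uses a hypothetical helper `bits_lost_bounds` that returns the tightest integers $p$ and $q$ satisfying $2^{-q} \le 1 - y/x \le 2^{-p}$. The inputs are treated as ordinary Python floats, which is an assumption made purely for illustration.

```python
import math

def bits_lost_bounds(x, y):
    """Return (p, q): at least p and at most q bits are lost in x - y, for x > y > 0."""
    gap = 1.0 - y / x
    exponent = -math.log2(gap)
    p = math.floor(exponent)   # largest integer p with gap <= 2**-p
    q = math.ceil(exponent)    # smallest integer q with 2**-q <= gap
    return p, q

# The pair from the loss-of-significance example earlier in these slides.
p, q = bits_lost_bounds(0.3721478693, 0.3720230572)
print(p, q)
```

For that pair, $1 - y/x \approx 3.35 \times 10^{-4}$ lies between $2^{-12}$ and $2^{-11}$, so between 11 and 12 binary bits are lost.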
Example 3. Consider the assignment statement
$y \leftarrow x - \sin(x)$
This calculation involves a loss of significance for small values of $x$. How to avoid this trouble?
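Solution. By the Taylor series for $\sin(x)$, we have
$y = x - \sin x = x - \left(x - \dfrac{x^3}{3!} + \dfrac{x^5}{5!} - \dfrac{x^7}{7!} + \cdots\right) = \dfrac{x^3}{3!} - \dfrac{x^5}{5!} + \dfrac{x^7}{7!} - \dfrac{x^9}{9!} + \cdots$
If $x$ is near 0, a truncated series can be used, e.g.,
$y \leftarrow (x^3/6)(1 - (x^2/20)(1 - (x^2/42)(1 - x^2/72)))$
Note that, to cover a wide range of values of $x$, both assignment statements may be used, each in its proper range: the truncated series near 0 and the direct formula elsewhere.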
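One way to combine the two assignment statements is sketched below. The switch-over point `SERIES_CUTOFF = 0.1` is an assumption chosen for illustration and is not taken from the slides or the textbook; the function names are likewise hypothetical.

```python
import math

SERIES_CUTOFF = 0.1   # hypothetical switch-over point, not from the slides

def x_minus_sin_series(x):
    """Truncated Taylor series in nested form, intended for x near 0."""
    x2 = x * x
    return (x**3 / 6.0) * (1.0 - (x2 / 20.0) * (1.0 - (x2 / 42.0) * (1.0 - x2 / 72.0)))

def x_minus_sin(x):
    """Use the series near 0 and the direct formula elsewhere."""
    if abs(x) < SERIES_CUTOFF:
        return x_minus_sin_series(x)
    return x - math.sin(x)

x = 1e-5
print(x - math.sin(x))   # direct formula: few correct digits for small x
print(x_minus_sin(x))    # series: close to x**3 / 6
```

With this cutoff the first neglected term, $x^{11}/11!$, is below about $10^{-15}$ relative to the leading term $x^3/6$, so the truncation error is comparable to double-precision roundoff.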
Homework & Programming Check the course’s webpage for Homework #4 Due Thursday, 9. 29 Programming #1 Due Thursday, 9. 29