Introduction to Numerical Analysis I


1 Introduction to Numerical Analysis I
MATH/CMPSC 455: Introduction to Numerical Analysis I. Floating Point Representation of Real Numbers

2 Floating Point Representation of Real Numbers
This lecture is about how computers represent real numbers and operate on them. It helps us understand rounding errors. We discuss the IEEE 754 floating point standard. Representing binary numbers in a computer involves two parts: the format and the machine representation.

3 Floating Point Format
Formats for the decimal system:
Standard notation
Scientific notation
Normalized scientific notation

4 Floating Point Format
Format for a floating point number (binary representation), normalized IEEE floating point standard: $\pm 1.bbb\ldots b \times 2^{p}$
sign (+ or -)
mantissa, which contains the significant bits (N b's)
exponent (p, represented as an M-bit binary number)
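
As an illustration of the normalized form, here is a minimal Python sketch that splits a double into a sign, a significand in [1, 2), and an exponent p. The helper name `normalized` is hypothetical and not from the slides; it relies on the standard `math.frexp`, which returns a fraction in [0.5, 1), so the result is rescaled by a factor of 2.

```python
import math

def normalized(x: float):
    """Return (sign, significand, p) with x == sign * significand * 2**p
    and 1 <= significand < 2. Hypothetical helper for illustration."""
    if x == 0 or math.isinf(x) or math.isnan(x):
        raise ValueError("only finite nonzero values have a normalized form")
    sign = -1 if x < 0 else 1
    m, e = math.frexp(abs(x))  # abs(x) == m * 2**e with 0.5 <= m < 1
    return sign, 2 * m, e - 1  # rescale so the leading digit is 1

print(normalized(9.0))     # 9 = +1.001 (binary) * 2**3
print(normalized(-0.375))  # -0.375 = -1.1 (binary) * 2**-2
```

This gives the mathematical normalized form only; it does not show how the bits are stored.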

5 Precision | sign | exponent (M) | mantissa (N)
single | 1 | 8 | 23
double | 1 | 11 | 52
long double | 1 | 15 | 64
Definition (machine epsilon, $\epsilon_{mach}$): It is the distance between 1 and the smallest floating point number greater than 1.
For the IEEE double precision floating point standard: $\epsilon_{mach} = 2^{-52}$.
It is NOT the smallest representable number!
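
The definition of machine epsilon can be checked numerically. The sketch below assumes Python floats are IEEE doubles (true for CPython on common platforms). It finds epsilon by repeated halving and compares it with much smaller representable numbers; the variable names are illustrative only.

```python
import math
import sys

# Halve eps until adding half of it to 1.0 no longer changes the result.
eps = 1.0
while 1.0 + eps / 2 > 1.0:
    eps /= 2

print(eps == 2.0**-52)                # distance from 1 to the next double
print(eps == sys.float_info.epsilon)  # matches the value Python reports

# Machine epsilon is far larger than the smallest positive doubles.
smallest_normal = sys.float_info.min         # 2**-1022
smallest_subnormal = math.ldexp(1.0, -1074)  # 2**-1074
print(smallest_subnormal < smallest_normal < eps)
```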

6 Rounding
How do we fit an infinite binary number into a finite number of bits?
IEEE Rounding to Nearest Rule: For double precision, if the 53rd bit to the right of the binary point is 0, then round down (truncate after the 52nd bit). If the 53rd bit is 1, then round up (add 1 to bit 52), unless all known bits to the right of the 1 are 0's, in which case 1 is added to bit 52 if and only if bit 52 is 1.
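
The three cases of the rounding rule can be observed directly, assuming Python floats are IEEE doubles and the hardware uses the default round-to-nearest mode. The sums below are chosen so that the exact result needs a 53rd bit; the variable names are illustrative.

```python
half_ulp = 2.0**-53  # a 1 in bit 53 relative to the leading 1 of 1.0

# Exact tie, bit 52 is 0: round down, the sum collapses to 1.0.
tie_bit52_zero = 1.0 + half_ulp

# Bit 53 is 1 and a later bit is also 1: round up to 1 + 2**-52.
above_tie = 1.0 + (half_ulp + 2.0**-60)

# Exact tie, bit 52 is 1: add 1 to bit 52, giving 1 + 2**-51.
tie_bit52_one = (1.0 + 2.0**-52) + half_ulp

print(tie_bit52_zero == 1.0)
print(above_tie == 1.0 + 2.0**-52)
print(tie_bit52_one == 1.0 + 2.0**-51)
```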

7 Rounding
Notation: Denote the IEEE double precision floating point number associated to x, using the Rounding to Nearest Rule, by fl(x).
Definition (absolute error & relative error): Let $x_c$ be a computed version of the exact quantity $x$. Then the absolute error is $|x_c - x|$ and the relative error is $\frac{|x_c - x|}{|x|}$ (provided $x \neq 0$).

8 Rounding Example: Example: Relative rounding error: $\frac{|fl(x) - x|}{|x|} \le \frac{1}{2}\epsilon_{mach}$
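
To make fl(x) and the two error measures concrete, the sketch below computes them exactly with rational arithmetic. The value 9.4 is chosen here for illustration and is not taken from the slides, and the helper `rounding_errors` is hypothetical. It relies on `float()` of a `Fraction` being correctly rounded in CPython, and on the standard round-to-nearest bound of half of machine epsilon on the relative error.

```python
from fractions import Fraction

def rounding_errors(exact: Fraction):
    """Return (fl(x), absolute error, relative error) for a nonzero rational x.
    Hypothetical helper; errors are computed exactly as Fractions."""
    fl = float(exact)                    # nearest double to x
    abs_err = abs(Fraction(fl) - exact)  # Fraction(fl) is the exact stored value
    rel_err = abs_err / abs(exact)
    return fl, abs_err, rel_err

fl, abs_err, rel_err = rounding_errors(Fraction(94, 10))  # x = 9.4
half_eps = Fraction(1, 2**53)                             # eps_mach / 2

print(fl)
print(float(abs_err), float(rel_err))
print(rel_err <= half_eps)
```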

9 Machine Representation
Sign: 1 bit, 0 for positive, 1 for negative.
Mantissa: 52 bits, …
Exponent: 11 bits, the positive binary integer that results from adding 1023 to the exponent. Stored values:
1 to 2046 → exponents -1022 to 1023;
2047 → infinity if the mantissa is all zeros, NaN otherwise;
0 → subnormal floating point numbers (small numbers, including 0).
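
The three fields can be inspected by reinterpreting the 8 bytes of a double as a 64-bit integer. This is a minimal sketch using the standard `struct` module; the function name `decode_double` is hypothetical.

```python
import math
import struct

def decode_double(x: float):
    """Return (sign bit, 11-bit stored exponent, 52-bit mantissa field)."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))  # raw 64-bit pattern
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF      # biased: stored value = p + 1023
    mantissa = bits & ((1 << 52) - 1)    # the bits after the implicit leading 1
    return sign, exponent, mantissa

for value in (1.0, -2.0, math.inf, 5e-324):
    print(value, decode_double(value))
```

Under IEEE 754 double precision, 1.0 has stored exponent 1023, infinity has 2047 with a zero mantissa, and the smallest subnormal has stored exponent 0.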

10 Addition of Floating Point Numbers
Step 1: line up the two numbers (double precision)
Step 2: add them (higher precision)
Step 3: store the result as a floating point number (double precision)
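
The three steps can be imitated in software by adding exactly and rounding once at the end. In the sketch below, exact rational arithmetic stands in for the higher-precision register, which is an assumption made for illustration rather than a description of the hardware. The function `add_via_exact` is hypothetical, and the sample values are chosen here, not taken from the slides.

```python
from fractions import Fraction

def add_via_exact(x: float, y: float) -> float:
    """Add two doubles exactly, then round once to double precision."""
    exact = Fraction(x) + Fraction(y)  # steps 1-2: align and add without error
    return float(exact)                # step 3: store as the nearest double

# The model agrees with the built-in addition on these inputs.
print(add_via_exact(1.0, 2.0**-53) == 1.0 + 2.0**-53)
print(add_via_exact(0.1, 0.2) == 0.1 + 0.2)

# Rounding of the operands makes an algebraically zero expression nonzero.
residual = (9.4 - 9.0) - 0.4
print(residual)
```

The nonzero residual comes from the rounding of 9.4 and 0.4 when they are stored, not from the subtractions themselves, which are exact here.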

11 Example : Example :

