Representing fractions – Fixed point The problem: how to represent fractions with a finite number of bits?
Representing fractions – Fixed point A number with 10 bits: $a_1 a_2 a_3 a_4 a_5 a_6 a_7 a_8 a_9 a_{10}$
Representing fractions – Fixed point Fixing the point: $a_1 a_2 a_3 a_4 a_5 a_6 a_7 a_8 \,.\, a_9 a_{10}$ (8 integer bits, 2 fraction bits)
Representing fractions – Fixed point Range of representation:

| | Number | Fraction |
|---|---|---|
| Signed (2's complement) | $-2^7 \dots 2^7 - 1$ | Quanta of 0.25 |
| Unsigned | $0 \dots 2^8 - 1$ | Quanta of 0.25 |
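A minimal Python sketch of this 10-bit layout, assuming 8 integer bits and 2 fraction bits so that one step (quantum) is 0.25. The names `to_fixed` and `from_fixed` are hypothetical helpers for illustration, not part of any library.

```python
FRACTION_BITS = 2                 # assumed: 2 bits to the right of the point
SCALE = 1 << FRACTION_BITS        # 4 steps per unit, i.e. a quantum of 0.25
TOTAL_BITS = 10

def to_fixed(x):
    """Return the stored integer for x, rounded to the nearest quantum."""
    return round(x * SCALE)

def from_fixed(n):
    """Return the real value represented by the stored integer n."""
    return n / SCALE

# Unsigned: stored integers 0 .. 2**10 - 1
unsigned_max = from_fixed(2 ** TOTAL_BITS - 1)
# Signed (two's complement): stored integers -2**9 .. 2**9 - 1
signed_min = from_fixed(-(2 ** (TOTAL_BITS - 1)))
signed_max = from_fixed(2 ** (TOTAL_BITS - 1) - 1)

print(unsigned_max, signed_min, signed_max)
```

The largest values include the fraction bits, so they sit 0.75 above the largest integer part listed in the table.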
Fixed point: the problem Fixed point cannot represent wide ranges of numbers, which scientific applications need.
Representing Fractions – Floating point A number is written as mantissa * base^exponent, for example $m \cdot 10^{1}$: 10 is the base (radix) $r$, 1 is the exponent, and $m$ is the number (mantissa).
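Representing Fractions – Floating point Sign bit $s$: the mantissa is multiplied by $(-1)^s$, so $s = 0$ gives a positive number and $s = 1$ a negative one (the slide's example uses the mantissa 1.23).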
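Problem of uniqueness The same value can be written with many mantissa/exponent pairs, e.g. $0.0061 \cdot 10^{2} = 6.1 \cdot 10^{-1}$. The representation is not unique.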
Problem of uniqueness – Normalization Standardization: keep exactly one nonzero digit to the left of the point, e.g. $6.1 \cdot 10^{-1}$.
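Normalization can be sketched in a few lines of Python. This is an illustrative sketch in base 10 only; `normalize` is a hypothetical helper, and `Decimal` is used so that the shifts by 10 stay exact.

```python
from decimal import Decimal

def normalize(mantissa, exponent):
    """Rescale (mantissa, exponent) in base 10 so that 1 <= |mantissa| < 10."""
    if mantissa == 0:
        return mantissa, exponent          # zero has no normalized form
    while abs(mantissa) >= 10:
        mantissa, exponent = mantissa / 10, exponent + 1
    while abs(mantissa) < 1:
        mantissa, exponent = mantissa * 10, exponent - 1
    return mantissa, exponent

# Two different spellings of the same value end up with the same exponent.
print(normalize(Decimal("0.0061"), 2))
print(normalize(Decimal("61"), -2))
```

Both calls return a mantissa equal to 6.1 with exponent -1, which is the unique normalized form.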
Normalized Binary Floating point $D = (-1)^{a_0} \cdot (1.a_1 a_2 a_3 \dots)_2 \cdot 2^{(b_1 b_2 b_3 \dots)_2}$, stored as the string of bits $a_0 \; b_1 b_2 \dots b_n \; a_1 a_2 a_3 \dots a_m$.
Floating point – Questions How to represent the (signed) exponent? How to represent zero, NaN and infinity? How to add, subtract and multiply? What about rounding errors?
Floating point – Representing the exponent How to represent a signed number? Sign bit? Two's complement? Neither.
Floating point – Representing the exponent We want the exponent fields to be ordered like unsigned binary numbers, so that a larger bit pattern always means a larger exponent: 0000 < 0001 < … < 1000 < … < 1111
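Floating point – Representing the exponent Biased representation: exponent = stored number − $B$. Usually $B = 2^{n-1} - 1$ for an $n$-bit exponent field. We define the following quantities: $e_{min}$ is the exponent of the stored field 00…01, and $e_{max}$ is the exponent of the stored field 11…10.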
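A small sketch of the biased exponent, assuming an 8-bit field (so $B = 127$) and that the all-zeros and all-ones fields are reserved. `encode_exponent` is a hypothetical name used only for this illustration.

```python
def encode_exponent(e, n=8):
    """Return the n-bit stored field for exponent e, using bias B = 2**(n-1) - 1."""
    bias = 2 ** (n - 1) - 1
    stored = e + bias
    # Assumption: fields 00...0 and 11...1 are reserved for special values.
    if not 1 <= stored <= 2 ** n - 2:
        raise ValueError("exponent out of range for a normalized number")
    return format(stored, f"0{n}b")

for e in (-126, -1, 0, 1, 127):
    print(e, encode_exponent(e))
```

Because the stored field is an unsigned number, comparing the bit strings gives the same order as comparing the exponents, which is the ordering property neither a sign bit nor two's complement provides.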
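Floating point – Representing zero, NaN, ±∞ IEEE 754 special values:

| Exponent | Mantissa | Represents |
|---|---|---|
| $e = e_{min} - 1$ | $M = 0$ | ±0 |
| $e = e_{min} - 1$ | $M \neq 0$ | $0.M \cdot 2^{e_{min}}$ (denormalized number) |
| $e_{min} \le e \le e_{max}$ | any | $1.M \cdot 2^{e}$ (normalized number) |
| $e = e_{max} + 1$ | $M = 0$ | ±∞ |
| $e = e_{max} + 1$ | $M \neq 0$ | NaN |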
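The classification above can be checked on real bit patterns. The sketch below assumes the IEEE 754 single format (1 sign bit, 8 exponent bits, 23 stored mantissa bits) and uses Python's standard `struct` module to get at the bits; `classify_float32` is a hypothetical helper.

```python
import struct

def classify_float32(x):
    """Split x (stored as an IEEE 754 single) into its fields and name its class."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exp_field = (bits >> 23) & 0xFF      # stored (biased) exponent
    mantissa = bits & 0x7FFFFF           # 23 stored mantissa bits
    if exp_field == 0:                   # corresponds to e = e_min - 1
        kind = "zero" if mantissa == 0 else "denormalized"
    elif exp_field == 0xFF:              # corresponds to e = e_max + 1
        kind = "infinity" if mantissa == 0 else "NaN"
    else:
        kind = "normalized"
    return sign, exp_field, mantissa, kind

for value in (1.0, -0.0, 1e-40, float("inf"), float("nan")):
    print(value, classify_float32(value))
```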
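IEEE 754

| Parameter | Single | Double |
|---|---|---|
| Mantissa bits (counting the implicit leading 1) | 24 | 53 |
| $e_{max}$ | +127 | +1023 |
| $e_{min}$ | -126 | -1022 |
| Exponent width (bits) | 8 | 11 |
| Format width (bits, including the sign bit) | 32 | 64 |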
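Python exposes the double-precision parameters through `sys.float_info`. The sketch assumes the platform's `float` is an IEEE 754 double, which holds on common platforms. Note that `max_exp` and `min_exp` follow the C convention and are each one larger than the exponents of a mantissa written as 1.M, hence the `- 1` below.

```python
import struct
import sys

info = sys.float_info

double_width = struct.calcsize("<d") * 8   # standard size of a double, in bits
single_width = struct.calcsize("<f") * 8   # standard size of a single, in bits
precision = info.mant_dig                  # mantissa bits, counting the implicit 1
e_max = info.max_exp - 1                   # largest exponent of a 1.M number
e_min = info.min_exp - 1                   # smallest exponent of a 1.M number

print(single_width, double_width, precision, e_max, e_min)
```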
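What is NaN (not a number)?

| Operation | NaN produced by |
|---|---|
| + | $\infty + (-\infty)$ |
| * | $0 \cdot \infty$ |
| / | $0/0$, $\infty/\infty$ |

(Partial list.) Operations with a NaN operand also produce NaN: X ± NaN, X * NaN, X / NaN.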
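These cases are easy to try in Python, whose `float` follows IEEE 754 arithmetic for them. One caveat: Python raises `ZeroDivisionError` for `0.0 / 0.0` instead of returning NaN, so that case is left out of this sketch.

```python
import math

inf = math.inf

produced = {
    "inf + (-inf)": inf + (-inf),
    "0 * inf": 0.0 * inf,
    "inf / inf": inf / inf,
}

nan = math.nan
propagated = {
    "X + NaN": 1.0 + nan,
    "X - NaN": 1.0 - nan,
    "X * NaN": 2.0 * nan,
    "X / NaN": 1.0 / nan,
}

for name, value in {**produced, **propagated}.items():
    print(name, "->", value, math.isnan(value))
```

`math.isnan` is used for the check because NaN compares unequal to everything, including itself.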
Infinity Provides a safe way to continue a calculation when overflow is encountered.
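A short illustration in Python, assuming IEEE 754 doubles: a multiplication that overflows yields infinity rather than stopping the program, and later operations on the result still behave sensibly. (Some Python operations, such as `**` on floats, raise `OverflowError` instead; plain multiplication does not.)

```python
import math

big = 1e308                 # close to the largest finite double
overflowed = big * 10       # too large to represent: becomes +infinity

print(overflowed)           # the overflow is recorded as inf
print(overflowed > big)     # comparisons still work
print(1.0 / overflowed)     # and the calculation can continue
```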
Calculations with Floating Point numbers Addition: (1) equalize the exponents (smaller → larger exponent); (2) sum the mantissas; (3) renormalize if necessary.
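The three steps can be sketched for a toy base-10 format. This is an illustration under stated assumptions, not how any particular hardware does it: both operands are positive, a number is a pair `(m, e)` meaning $m \cdot 10^{e - p + 1}$ with `m` an integer of exactly `p` digits, the sum is kept exact until the end, and the final rounding is round-half-even. `fp_add` is a hypothetical name.

```python
from fractions import Fraction

def fp_add(x, y, p=3):
    """Add two positive base-10 floats given as (m, e) = m * 10**(e - p + 1)."""
    (mx, ex), (my, ey) = x, y
    # Step 1: equalize the exponents by shifting the smaller one up to the larger.
    if ex < ey:
        (mx, ex), (my, ey) = (my, ey), (mx, ex)
    my_shifted = Fraction(my, 10 ** (ex - ey))
    # Step 2: sum the mantissas (exactly, thanks to Fraction).
    total, e = mx + my_shifted, ex
    # Step 3: renormalize if the sum needs more than p digits, then round.
    if total >= 10 ** p:
        total, e = total / 10, e + 1
    m = round(total)                 # round-half-even to an integer mantissa
    if m == 10 ** p:                 # rounding carried into an extra digit
        m, e = m // 10, e + 1
    return m, e

# 91 = 9.10 * 10**1 is (910, 1); 9.7 = 9.70 * 10**0 is (970, 0)
print(fp_add((910, 1), (970, 0)))
```

With `p = 3` the call returns `(101, 2)`, i.e. $1.01 \cdot 10^{2}$: the exact sum 100.7 needs four digits, so the last step has to round.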
Calculations with Floating Point numbers Example (in base 10), with $|E| = 1$ digit and $|M| = 3$ digits: $91 = 9.10 \cdot 10^{1}$ and $9.7 = 9.70 \cdot 10^{0}$.
Calculations with Floating Point numbers $9.10 \cdot 10^{1} + 9.70 \cdot 10^{0}$: not the same order (the exponents differ).
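Calculations with Floating Point numbers $9.10 \cdot 10^{1} + 0.97 \cdot 10^{1} = 10.07 \cdot 10^{1}$; renormalize: $1.007 \cdot 10^{2}$, which a 3-digit mantissa stores as $1.01 \cdot 10^{2}$.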
Calculations with Floating Point numbers Example II (in base 10), with $|E| = 1$ digit and $|M| = 3$ digits: $91 = 9.10 \cdot 10^{1}$ and $9.75 = 9.75 \cdot 10^{0}$.
Calculations with Floating Point numbers $9.10 \cdot 10^{1} + 9.75 \cdot 10^{0}$: not the same order (the exponents differ).
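Calculations with Floating Point numbers $9.10 \cdot 10^{1} + 0.975 \cdot 10^{1} = 10.075 \cdot 10^{1}$; renormalize: $1.0075 \cdot 10^{2}$, which a 3-digit mantissa stores as $1.01 \cdot 10^{2}$ (rounding error).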
Rounding Errors The problem: squeezing infinitely many real numbers into a finite number of bits.
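A familiar consequence, shown here in Python with IEEE 754 doubles: the decimal fraction 0.1 has no finite binary expansion, so the stored value is only the nearest representable number, and small sums stop being exact. `Fraction` is used to read back the exact stored value.

```python
from fractions import Fraction

stored = Fraction(0.1)                 # exact value of the double nearest to 0.1
print(stored == Fraction(1, 10))       # False: 0.1 itself is not representable
print(float(stored - Fraction(1, 10))) # the size of the representation error
print(0.1 + 0.2 == 0.3)                # False: rounding errors show up in sums
```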
Measuring Rounding Errors Two measures: units in the last place (ulps) and relative error.
Measuring Rounding Errors – ULP If $d.dd\dots d \cdot r^{e}$ (with $p$ digits) represents $z$, then error $= |d.dd\dots d - z / r^{e}| \cdot r^{p-1}$ ulps.
Measuring Rounding Errors – ULP Example I: $r = 10$, $p = 3$. The number $3.14 \cdot 10^{-2}$ represents $z = 0.0314159$. Error $= |3.14 - 3.14159| \cdot 10^{2} = 0.159$ ulps.
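The ulp formula translates directly into code. In this sketch `ulp_error` is a hypothetical helper, `Decimal` keeps the base-10 arithmetic exact, and the represented value $z = 0.0314159$ is an assumption chosen to be consistent with the stated error of 0.159.

```python
from decimal import Decimal

def ulp_error(d, e, z, r=10, p=3):
    """Error in ulps when d * r**e (a p-digit mantissa d) represents z."""
    return abs(d - z / Decimal(r) ** e) * Decimal(r) ** (p - 1)

# Assumed example: 3.14 * 10**-2 standing in for z = 0.0314159
error = ulp_error(Decimal("3.14"), -2, Decimal("0.0314159"))
print(error)
```

The result equals 0.159, well under the 0.5 ulp bound that round-to-nearest guarantees.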
Measuring Rounding Errors – ULP What is the maximum error in ulps when rounding to the nearest representable number? 0.5 ulp.
Measuring Rounding Errors – Relative Error If $d.dd\dots d \cdot r^{e}$ (with $p$ digits) represents $z$, then relative error $= |d.dd\dots d \cdot r^{e} - z| / |z|$.
Measuring Rounding Errors – Relative errors Example I: $r = 10$, $p = 3$. The number $3.14 \cdot 10^{-2}$ represents $z = 0.0314159$. Relative error $= 0.0000159 / 0.0314159 \approx 0.0005$.
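The relative error can be computed the same way. As before, this is a sketch: `relative_error` is a hypothetical helper and $z = 0.0314159$ is an assumed value for the number being represented.

```python
from decimal import Decimal

def relative_error(d, e, z, r=10):
    """Relative error when d * r**e represents the real number z."""
    return abs(d * Decimal(r) ** e - z) / abs(z)

# Assumed example: 3.14 * 10**-2 standing in for z = 0.0314159
rel = relative_error(Decimal("3.14"), -2, Decimal("0.0314159"))
print(rel)
```

Unlike the ulp measure, this value does not depend on the number of digits $p$; it only compares the size of the error with the size of $z$.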