Number Representation: Fixed and Floating Point
- No Method is Capable of Representing ALL Real Numbers Using Finite Register Lengths
- Must Use Approximations to Represent Values
- Concentrate on Two Forms:
  - Fixed Point
  - Floating Point
- Others are:
  - Rational Number Systems – use ratios of integers
  - Logarithmic Number Systems – use signs and logarithms of values
Fixed Versus Floating Point
- Fixed Point Values
  - Represent Values where Any Two Differ by 1 unit in the last place (ulp)
  - Equal Spacing Between Numbers
- Floating Point Values Use Two Multi-Bit Words:
  - Mantissa
  - Exponent
- Both Forms Must be Capable of Representing Signed Quantities
- Fixed Point Values CAN be Used to Represent Fractional Quantities
Floating Point Characteristics
- Total Number of Representations = Total Number of Bit Strings
  - For an n-bit Register we have 2^n
- Range of Values is Larger than Fixed Point
- Precision of Values is Smaller
- Distance Between Two Consecutive Values Increases with Magnitude (see the sketch below)
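A minimal sketch of the last point, using the standard C library's nextafter to measure the gap (one ulp) between consecutive doubles at different magnitudes; the numeric comments are approximate:

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    /* The spacing between consecutive doubles (one ulp) grows with magnitude,
     * unlike fixed point where the spacing is constant. */
    printf("ulp near 1.0   : %g\n", nextafter(1.0,   2.0)   - 1.0);    /* ~2.2e-16 */
    printf("ulp near 1e6   : %g\n", nextafter(1e6,   2e6)   - 1e6);    /* ~1.2e-10 */
    printf("ulp near 1e300 : %g\n", nextafter(1e300, 2e300) - 1e300);  /* ~1.5e284 */
    return 0;
}
```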
Floating Point Format: s | e | m
- s – Sign Bit (signed magnitude)
- e – Exponent (in 2's Complement Form)
- m – Mantissa (significand or fraction); m_MAX = 1 - ulp, so m is in [0, 1), with a hidden bit supplying the leading 1
- float – BIAS = 127 (32 bits: 23 for m and 8 for e)
- double – BIAS = 1023 (64 bits: 52 for m and 11 for e)
- Sign of Exponent is the Complement of its MSb
- Thus, adding/subtracting the bias is just complementation of the MSb
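A minimal sketch (the variable names are mine, not the lecture's) of unpacking the s, e, and m fields of a 32-bit float with BIAS = 127 and reconstructing the value with the hidden bit restored:

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

int main(void) {
    float x = 6.5f;                            /* 6.5 = 1.625 * 2^2        */
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);            /* reinterpret the raw bits  */

    unsigned s = bits >> 31;                   /* sign bit                  */
    unsigned e = (bits >> 23) & 0xFF;          /* 8-bit biased exponent     */
    unsigned m = bits & 0x7FFFFF;              /* 23-bit fraction           */

    double significand = 1.0 + m / (double)(1 << 23);   /* hidden bit = 1   */
    double value = (s ? -1.0 : 1.0) * ldexp(significand, (int)e - 127);

    printf("s=%u e=%u m=0x%06X value=%g\n", s, e, m, value);
    return 0;
}
```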
Floating Point Example
- double stored in memory as 00000000 bfe80000 (words listed from low to high address; the MSW is at the higher address, i.e., a little-endian layout)
- Bit fields (s | e | m):
  1 | 011 1111 1110 | 1000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
- s = 1; e = 1022; m = 0.5
- Value = (-1)^1 × (1.5) × 2^(1022-1023)
- Value = -(1.5)(0.5) = -0.75
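The hand calculation can be checked by reinterpreting the same 64-bit pattern as a double; this is a sketch, not part of the lecture:

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    uint64_t bits = 0xBFE8000000000000ULL;   /* s=1, e=0x3FE=1022, m=0.5 */
    double d;
    memcpy(&d, &bits, sizeof d);             /* reinterpret bits as double */
    printf("%g\n", d);                       /* prints -0.75               */
    return 0;
}
```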
Floating Point Normalization
- Redundant representations are Possible! (the Hidden Bit Helps)
- Out of All Possible Representations, Choose the One With the Fewest Leading Zeros in the Significand
- This is Normalization
- After Performing Arithmetic, Renormalization May Need to be Accomplished (see the sketch below)
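A minimal software sketch of renormalization, assuming a 32-bit fraction-style significand that is normalized when its MSB is 1 (value in [1/2, 1)); the function name is mine:

```c
#include <stdint.h>

/* Shift out leading zeros left over after an operation (e.g. an effective
 * subtraction) and compensate the exponent accordingly. */
static void renormalize(uint32_t *sig, int *exp) {
    if (*sig == 0) return;                    /* true zero: nothing to normalize */
    while ((*sig & 0x80000000u) == 0) {       /* leading zero present            */
        *sig <<= 1;                           /* shift significand left ...      */
        (*exp)--;                             /* ... and decrement the exponent  */
    }
}
```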
Floating Point Special Numbers
- Value v when exponent e and fraction f take special values (IEEE standard):
  - e = 0,        f = 0  : v = ±0
  - e = 0,        f ≠ 0  : v = denormalized number
  - e = all 1s,   f = 0  : v = ±infinity
  - e = all 1s,   f ≠ 0  : v = NaN
- Note: NaN = Not a Number
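A minimal sketch (the helper name is mine) that classifies a double by inspecting its raw exponent and fraction fields, mirroring the table above:

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

static const char *classify(double d) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    uint64_t e = (bits >> 52) & 0x7FF;          /* 11-bit exponent field */
    uint64_t f = bits & 0xFFFFFFFFFFFFFull;     /* 52-bit fraction field */
    if (e == 0)     return f ? "denormal" : "zero";
    if (e == 0x7FF) return f ? "NaN"      : "infinity";
    return "normalized";
}

int main(void) {
    printf("%s\n", classify(0.0));       /* zero      */
    printf("%s\n", classify(INFINITY));  /* infinity  */
    printf("%s\n", classify(NAN));       /* NaN       */
    printf("%s\n", classify(1e-320));    /* denormal  */
    return 0;
}
```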
IEEE/ANSI 754/854 Standard
Denormalized Numbers
- Allow for Gradual Degradation of Precision on Underflow (gradual underflow; see the sketch below)
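A minimal sketch of gradual underflow: halving below DBL_MIN yields denormalized (subnormal) values rather than flushing abruptly to zero.

```c
#include <stdio.h>
#include <float.h>

int main(void) {
    double x = DBL_MIN;                /* smallest normalized double, 2^-1022 */
    for (int i = 0; i < 6; i++) {
        printf("%.3e\n", x);           /* values after the first are denormal */
        x /= 2.0;
    }
    return 0;
}
```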
Denormals
Operations – Internal Precision
Floating Point Addition/Subtraction
Floating Point Multiplication/Division
Conversions and Roundings
Exceptions
Rounding Schemes
- Signed Magnitude
- Two's Complement
Round to Nearest (Signed Magnitude)
Rounding Comments
Round to Nearest Even/Odd
- Round to Nearest Odd (R*)
Jamming/von Neumann Rounding
ROM Rounding
Rounding
Rounding Examples
- Round Towards +∞
- Downward Directed Rounding
Floating Point Operations
Adders/Subtractors
Operand Packing/Unpacking
Other Key Parts of FP Add/Sub Unit
Pre-Shifting
Four-stage Combinational Shifter Pre-shifts Operand by 0 to 15 Bits
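A minimal software model of such a shifter (the function name is mine): stage i shifts by 2^i when bit i of the 4-bit shift amount is set, so any distance from 0 to 15 is covered with only four multiplexer levels. A right shift is shown because pre-shifting aligns the smaller operand's significand before addition.

```c
#include <stdint.h>

static uint32_t preshift_right(uint32_t x, unsigned amount) {
    if (amount & 1u) x >>= 1;   /* stage 0: shift by 1 */
    if (amount & 2u) x >>= 2;   /* stage 1: shift by 2 */
    if (amount & 4u) x >>= 4;   /* stage 2: shift by 4 */
    if (amount & 8u) x >>= 8;   /* stage 3: shift by 8 */
    return x;                   /* total shift: 0..15  */
}
```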
Leading Zeros/Ones – Counting vs. Prediction
Leading Zeros Prediction
Guard Digits
- What is the smallest number of extra digits needed for rounding? For post-normalization?
- Multiplication – Double-Length Result
- Add/Sub with differing exponents – Can have a Double-Length Result
- FP Unit Provides a Single-Length Result
Significand Ranges
- Assume Significand M ∈ (0, 1 - ulp]
- Then Normalized M ranges as: 1/2 ≤ M ≤ 1 - ulp
- Multiplication: prod = M1 × M2, so 1/4 ≤ prod ≤ (1 - ulp)^2
- For postnormalization, need at most one shift left to get: 1/2 ≤ prod ≤ 1 - ulp
Significand Ranges (cont.)
- Division: quot = M1 / M2, so 1/2 < quot < 2
- Need at most one shift right to get: 1/2 ≤ quot ≤ 1 - ulp
- Conclusion:
  - 1 Extra Digit Needed for Postnormalization
  - 1 Extra Digit Needed for Round-to-Nearest
  - 2 Extra Digits Needed in total: G (guard) and R (round)
- A postnormalization sketch follows
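A minimal sketch of both postnormalization cases, assuming significands are held as 32-bit fractions (value = m / 2^32, normalized when the MSB is set, i.e., value in [1/2, 1)); the function names are mine. Note that a real unit would keep the discarded low-order bits as guard/round/sticky information instead of dropping them as done here.

```c
#include <stdint.h>

/* Multiplication: product lies in [1/4, 1), so at most one left shift. */
static uint32_t postnorm_mul(uint32_t m1, uint32_t m2, int *exp) {
    uint32_t prod = (uint32_t)(((uint64_t)m1 * m2) >> 32);
    if ((prod & 0x80000000u) == 0) {   /* below 1/2: shift left once */
        prod <<= 1;
        (*exp)--;
    }
    return prod;
}

/* Division: quotient lies in (1/2, 2), so at most one right shift. */
static uint32_t postnorm_div(uint32_t m1, uint32_t m2, int *exp) {
    uint64_t quot = ((uint64_t)m1 << 32) / m2;
    if (quot >> 32) {                  /* 1.0 or more: shift right once */
        quot >>= 1;
        (*exp)++;
    }
    return (uint32_t)quot;
}
```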
“Sticky Bit” in Std 754
- Round-to-Nearest-Even Requires 1 Extra Bit: the “Sticky Bit”, S
- S Turns Out to be the Logical OR of the Other Additional (discarded) Bits
- A rounding sketch using G, R, and S follows
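A minimal sketch of round-to-nearest-even driven by the guard (G), round (R), and sticky (S) bits kept to the right of the retained significand; the function name and argument layout are mine, and g, r, s are each assumed to be 0 or 1.

```c
#include <stdint.h>

static uint64_t round_nearest_even(uint64_t sig, int g, int r, int s) {
    int grs = (g << 2) | (r << 1) | s;   /* discarded bits as a 3-bit value   */
    if (grs > 4)                          /* more than half an ulp: round up   */
        sig += 1;
    else if (grs == 4 && (sig & 1))       /* exact tie: round to even LSB      */
        sig += 1;
    return sig;                           /* caller handles any carry-out and  */
}                                         /* the renormalization it requires   */
```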
Floating Point Multiplier
Floating Point Divider