Floating Point Arithmetic

Presentation on theme: "Floating Point Arithmetic"— Presentation transcript:

1 Floating Point Arithmetic

2 Hardware vs. Software
Floating point arithmetic can be provided two ways:
- Build the ALU (Arithmetic Logic Unit) to perform floating point arithmetic in hardware
  - Faster, but more expensive
  - Less of an issue as technology improves
- Simulate floating point with multiple integer operations in software
  - Done by the compiler
  - Slower, but cheaper

3 IEEE Floating Point Layout
Single precision – 32 bits:
- Leftmost bit is the sign bit
- Next 8 bits are the exponent
- Next 23 bits are the mantissa
Double precision – 64 bits:
- Leftmost bit is the sign bit
- Next 11 bits are the exponent
- Next 52 bits are the mantissa
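The fields above can be pulled apart directly; a minimal Python sketch (the helper name float32_fields is mine), using the standard struct module to reach the raw bits of a single-precision value:

```python
import struct

def float32_fields(x):
    """Split a value into its IEEE 754 single-precision fields."""
    # Pack as a big-endian 32-bit float, then view the raw bits as an integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign     = bits >> 31            # leftmost bit
    exponent = (bits >> 23) & 0xFF   # next 8 bits (excess-127)
    mantissa = bits & 0x7FFFFF       # next 23 bits (implied leading 1)
    return sign, exponent, mantissa

# -6.5 = -1.625 * 2^2, so the stored exponent is 2 + 127 = 129
print(float32_fields(-6.5))  # (1, 129, 5242880)
```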

4 Floating Point Addition
Performed in several steps:
- Line up the binary points, so that the exponents are the same
- Add the mantissas; the exponent of the result is the common exponent of the operands
- Normalize if necessary, placing the result in proper scientific notation
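The steps can be sketched with integer mantissas; fp_add below is a hypothetical helper that handles only positive, already-normalized operands with a 24-bit mantissa, as in single precision:

```python
def fp_add(m1, e1, m2, e2):
    """Add m1*2^e1 + m2*2^e2 following the slide's steps. A sketch only:
    positive operands, 24-bit mantissas, no rounding."""
    # Step 1: line up the points by shifting the smaller-exponent
    # mantissa right (its low-order bits are lost).
    if e1 < e2:
        m1, e1, m2, e2 = m2, e2, m1, e1
    m2 >>= (e1 - e2)
    # Step 2: add the mantissas; the result keeps the common exponent.
    m, e = m1 + m2, e1
    # Step 3: normalize by shifting right if the sum overflowed 24 bits.
    while m >= (1 << 24):
        m >>= 1
        e += 1
    return m, e

# 1.0*2^0 + 1.0*2^0 = 1.0*2^1 (mantissa 1 << 23 represents 1.0)
print(fp_add(1 << 23, 0, 1 << 23, 0))  # (8388608, 1)
```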

5 Equalizing Exponents
In math, we could shift the value with the larger exponent left while decreasing its exponent until the exponents are equal.
But the hardware has no place to shift the value left into: the leading bit already sits at the implied binary point.
So we must shift the value with the smaller exponent right and increase its exponent.
- The bits shifted off the right end are lost
- They are insignificant low-order bits, so they won't affect the answer much
- Some hardware keeps extra bits just for the computation, not for the answer

6 Adding
Now the bits of the mantissas can be added, just like adding integers (but with fewer than 32 bits).
The exponent of the answer is the same as the common exponent of the operands.

7 Normalizing
In scientific notation, the mantissa of each operand is between 1 and 2; after equalizing the exponents, a shifted mantissa can be between 0 and 2.
- So the result is between 1 and 4 (unless one of the operands is negative, in which case the result can be between 0 and 4 in absolute value)
- We may need to shift the result left to get a 1 bit into the leftmost bit of the answer
- We may need to shift the result right to bring the result into the proper range

8 Correct Results
What happens when we add two values of very different magnitude?
- We must shift one of the values many places
- The rightmost bits "fall off" the end
- The answer will not be exact, but it will be very close
When would this happen? When summing many, many values:
  Sum = Sum + A[I]
Sum can get so big compared to A[I] that Sum stops changing.
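This stalling effect is easy to reproduce with Python's double-precision floats:

```python
# 1.0e16 needs so large an exponent that adding 1.0 shifts the 1.0
# completely off the right end of the mantissa.
big = 1.0e16
print(big + 1.0 == big)  # True: the sum never changes
```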

9 Multiplication
Actually a little easier than addition:
- Do unsigned multiplication with the mantissas
- Add the exponents
- Normalize the result
- Set the sign bit of the result

10 Multiplication Details
We have already done unsigned multiplication, so the new work is in the exponents.
To add the exponents we need to look at the notation: the exponents use excess-127 notation, fpe1 = reale1 + 127.
Adding the stored exponents gives
  fpe1 + fpe2 = reale1 + reale2 + 254
which counts the excess twice, so we subtract 127 from the result to get the appropriate stored value, (reale1 + reale2) + 127.
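A one-line sketch of this exponent bookkeeping (multiply_exponents is a hypothetical name):

```python
BIAS = 127  # excess-127, single precision

def multiply_exponents(fpe1, fpe2):
    """Stored exponents each carry the bias once, so their sum
    carries it twice; subtract it once to re-bias the result."""
    return fpe1 + fpe2 - BIAS

# real exponents 3 and 4 are stored as 130 and 131;
# the product's real exponent 7 should be stored as 134
print(multiply_exponents(130, 131))  # 134
```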

11 Sign
The sign of the result depends on the signs of the operands: if both operands have the same sign, the result is positive; otherwise the result is negative.
This is the XOR function:
  S1 S2 | R
   0  0 | 0
   0  1 | 1
   1  0 | 1
   1  1 | 0
Of course, we must still normalize the result, which may take many more shifts.
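In code the sign rule is a single XOR of the sign bits (result_sign is a hypothetical helper):

```python
def result_sign(s1, s2):
    """Sign bit of a product or quotient: XOR of the operand sign bits."""
    return s1 ^ s2

# same signs give 0 (positive); different signs give 1 (negative)
print(result_sign(0, 0), result_sign(0, 1),
      result_sign(1, 0), result_sign(1, 1))  # 0 1 1 0
```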

12 True Division
- Do unsigned division on the mantissas (discussed with integers)
- Subtract the exponents; subtracting cancels the excess-127 bias, so we now need to add 127 back to get the correct representation of the value
- Normalize the result, same as the previous methods
- Set the sign, same as with multiplication

13 Division by Reciprocal
Calculate a/b as a * (1/b).
This is useful only if we can compute 1/b without using division.
Use a Newton-Raphson technique (discussed in CSCI 381):
  Repeat
    r = r * (2 - r*b)
  Until r does not change
r starts as a first guess at the reciprocal and gets closer with each iteration.
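The loop above can be sketched in Python; a fixed iteration count stands in for the slide's "until r does not change" test, and the starting guess r0 must lie in (0, 2/b) for the iteration to converge:

```python
def reciprocal(b, r0, iterations=6):
    """Approximate 1/b with the slide's Newton-Raphson step r = r*(2 - r*b).
    Quadratic convergence: the error roughly squares on each pass."""
    r = r0
    for _ in range(iterations):
        r = r * (2.0 - r * b)
    return r

# a/b computed as a * (1/b), without a divide in the loop
print(7.0 * reciprocal(3.0, 0.3))  # approximately 2.3333...
```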

14 Errors
Floating point numbers are not exact, so do NOT compare floating point numbers for equality; adding 0.1 ten times, for example, need not give exactly 1.
Instead of writing "if (a == b)" when a and b are floating point, use
  if (abs(a - b) < .0001)
or some other reasonable measure of "close enough".
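In Python, for instance:

```python
import math

# Ten additions of 0.1 accumulate rounding error:
a = 0.0
for _ in range(10):
    a += 0.1
print(a == 1.0)               # False
# Compare with a tolerance instead, as the slide suggests:
print(abs(a - 1.0) < 0.0001)  # True
# The standard library also offers a relative-tolerance version:
print(math.isclose(a, 1.0))   # True
```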

15 Rounding in Base 2
- Round to the nearest: ties are broken so that the least significant bit is 0
- Round towards 0: truncation
- Round towards positive infinity: round up (careful with negative values)
- Round towards negative infinity: round down
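Python's integer-rounding helpers mirror the four modes (they round to whole numbers rather than to a mantissa width, but the directions are the same):

```python
import math

x = -2.5
print(math.trunc(x))  # -2: round towards 0 (truncation)
print(math.floor(x))  # -3: round towards negative infinity
print(math.ceil(x))   # -2: round towards positive infinity
print(round(x))       # -2: round to the nearest, ties to even (LSB 0)
```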

16 Overflow and Underflow
Overflow for integers occurs when the result is too big to be held in the number of bits allocated. The same is true for floating point, but there it is determined by the size of the exponent field rather than the size of the mantissa field.
Underflow occurs when a value becomes so small that it rounds to 0. Again, this is governed by the exponent field, but with negative exponents.
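Both conditions are easy to trigger with Python's double-precision floats:

```python
big = 1.0e308
print(big * 10.0)      # inf: the exponent field cannot go any higher
tiny = 1.0e-308
print(tiny / 1.0e308)  # 0.0: the result is too small even for a subnormal
```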

