Published by Madeline Lamb. Modified over 9 years ago.

1 Floating Point Arithmetic
Ellen Spertus
MCS 111
October 11, 2001

2 Decimal addition (1)
Problem: 9.999×10^1 + 1.610×10^-1
Estimate answer:

3 Decimal addition (2)
Problem: 9.999×10^1 + 1.610×10^-1
Calculate answer:
  9.999×10^1
+ 1.610×10^-1

4 Decimal addition (3)
Problem: 9.999×10^1 + 1.610×10^-1
How should we add them?

5 Floating point addition
Adjust numbers to have same exponent
Add the significands
Normalize the sum

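The three steps can be sketched in Python on the decimal problem from the earlier slides. This is an illustration only, not the lecture's own code: the function name `fp_add` is hypothetical, it works with exact fractions, and it performs no rounding, so the result carries more digits than the four-digit operands.

```python
from fractions import Fraction

def fp_add(sig_a, exp_a, sig_b, exp_b, base=10):
    """Add sig_a*base**exp_a and sig_b*base**exp_b exactly.

    Returns (significand, exponent) with 1 <= |significand| < base.
    Hypothetical helper for illustration; no rounding is applied.
    """
    sig_a, sig_b = Fraction(sig_a), Fraction(sig_b)

    # Step 1: adjust both numbers to the larger of the two exponents.
    exp = max(exp_a, exp_b)
    sig_a /= Fraction(base) ** (exp - exp_a)
    sig_b /= Fraction(base) ** (exp - exp_b)

    # Step 2: add the significands.
    total = sig_a + sig_b

    # Step 3: normalize the sum so one nonzero digit precedes the point.
    if total == 0:
        return total, 0
    while abs(total) >= base:
        total /= base
        exp += 1
    while abs(total) < 1:
        total *= base
        exp -= 1
    return total, exp

sig, exp = fp_add("9.999", 1, "1.610", -1)
print(float(sig), exp)  # 1.00151 2
```

The exact sum is 1.00151×10^2; keeping only four significant digits would require a rounding step on top of this.
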
6 Binary addition (1)
Problem: 1.01×2^2 + 1.101×2^-1
Adjust numbers to have same exponent:
Add the significands
Normalize the sum

7 Binary addition (2)
Problem: 1.11×2^1 + 1.01×2^3
Adjust numbers to have same exponent:
Add the significands
Normalize the sum

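One way to check answers to the two binary addition problems is the sketch below. The helper names (`parse_bin`, `format_bin`, `add_binary`) are made up for this example, and the sum is kept exact rather than rounded to a fixed width.

```python
from fractions import Fraction

def parse_bin(text):
    """Convert a binary string such as '1.101' to an exact Fraction."""
    whole, _, frac = text.partition(".")
    value = Fraction(int(whole, 2)) if whole else Fraction(0)
    if frac:
        value += Fraction(int(frac, 2), 2 ** len(frac))
    return value

def format_bin(value, max_bits=16):
    """Format a Fraction with 0 <= value < 2 as binary digits."""
    whole = int(value)
    frac = value - whole
    digits = ""
    for _ in range(max_bits):
        if frac == 0:
            break
        frac *= 2
        bit = int(frac)
        digits += str(bit)
        frac -= bit
    return f"{whole}.{digits or '0'}"

def add_binary(sig_a, exp_a, sig_b, exp_b):
    """Add two positive binary floating-point numbers; no rounding."""
    a, b = parse_bin(sig_a), parse_bin(sig_b)
    # Adjust both numbers to the larger exponent.
    exp = max(exp_a, exp_b)
    a /= 2 ** (exp - exp_a)
    b /= 2 ** (exp - exp_b)
    # Add the significands.
    total = a + b
    # Normalize the sum to the form 1.xxx.
    while total >= 2:
        total /= 2
        exp += 1
    while 0 < total < 1:
        total *= 2
        exp -= 1
    return format_bin(total), exp

print(add_binary("1.01", 2, "1.101", -1))  # ('1.011101', 2)
print(add_binary("1.11", 1, "1.01", 3))    # ('1.1011', 3)
```

In decimal these are 5 + 0.8125 = 5.8125 and 3.5 + 10 = 13.5.
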
8 8-bit floating-point format (2)
Exponent (3 bits) is biased by 3
The leading one of significand is implicit
Zero is represented by all zeros

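A decoder for this format might look like the sketch below. The slide gives the exponent width, the bias, the implicit leading one, and the zero encoding; the bit layout used here (1 sign bit, then 3 exponent bits, then 4 fraction bits) is an assumption, as is the absence of any special codes for infinity or NaN. The name `decode8` is hypothetical.

```python
def decode8(byte):
    """Decode an 8-bit pattern, assuming the layout s eee ffff."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a value that fits in 8 bits")
    if byte == 0:
        return 0.0                        # zero is all zeros
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = ((byte >> 4) & 0b111) - 3  # remove the bias of 3
    fraction = byte & 0b1111
    significand = 1 + fraction / 16       # implicit leading one
    return sign * significand * 2.0 ** exponent

print(decode8(0b0_011_0000))  # 1.0     (1.0000 × 2^0)
print(decode8(0b0_101_0100))  # 5.0     (1.0100 × 2^2)
print(decode8(0b1_010_1010))  # -0.8125 (-1.1010 × 2^-1)
```

Under these assumptions the largest magnitude is 1.1111×2^4 = 31.0 and the smallest nonzero magnitude is 1.0000×2^-3 = 0.125.
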
9 Practice
Add two numbers from previous slide

10 Problem

11 Rounding (1)
Round 1.00011 to have one fewer digit
Modes
–Always round up (IRS)
–Always round down
–Truncate
–Round to nearest even

12 Rounding (2)
Round -1.00011 to have one fewer digit
Modes
–Always round up (IRS)
–Always round down
–Truncate
–Round to nearest even

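The four modes can be compared on both examples with a short sketch. It assumes the digits are binary, that "round up" means toward positive infinity, and that "round down" means toward negative infinity; the function name `round_to_bits` is made up for this example.

```python
import math
from fractions import Fraction

def round_to_bits(value, frac_bits, mode):
    """Round value to frac_bits binary fraction bits using the given mode."""
    scaled = Fraction(value) * 2 ** frac_bits
    if mode == "up":              # toward +infinity
        n = math.ceil(scaled)
    elif mode == "down":          # toward -infinity
        n = math.floor(scaled)
    elif mode == "truncate":      # toward zero
        n = math.trunc(scaled)
    elif mode == "nearest_even":  # ties go to the even neighbor
        n = round(scaled)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return Fraction(n, 2 ** frac_bits)

x = Fraction(0b100011, 2 ** 5)    # binary 1.00011 = 1.09375

for mode in ("up", "down", "truncate", "nearest_even"):
    pos = round_to_bits(x, 4, mode)
    neg = round_to_bits(-x, 4, mode)
    print(f"{mode:13} {float(pos):8} {float(neg):8}")
```

Here 1.0625 is binary 1.0001 and 1.125 is binary 1.0010. Truncation and rounding down agree for the positive value but differ for the negative one, which is why the slides ask about both signs.
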
13 Ensuring accurate results
Our significands are 4 bits wide.
We use 6 bits when adding two significands.
–Guard bit
–Round bit
Purpose: Accurate rounding

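The effect of the two extra bits can be seen in a small sketch. It assumes significands of the form 1.ffff (four fraction bits) stored as 5-bit integers, truncation of bits shifted out during alignment, and a final round to nearest even; the function names are hypothetical and the operand values are chosen only for illustration.

```python
from fractions import Fraction

FRAC_BITS = 4  # assumed: four fraction bits after the leading one

def round_to_precision(value, sig_bits=FRAC_BITS + 1):
    """Round a positive Fraction to sig_bits significant bits, ties to even."""
    exp = 0
    while value >= 2:
        value /= 2
        exp += 1
    while value < 1:
        value *= 2
        exp -= 1
    scale = 2 ** (sig_bits - 1)
    return Fraction(round(value * scale), scale) * Fraction(2) ** exp

def add_with_extra_bits(sig_a, exp_a, sig_b, exp_b, extra_bits):
    """Add two positive numbers whose value is (sig / 16) * 2**exp."""
    if exp_a < exp_b:  # make a the operand with the larger exponent
        sig_a, exp_a, sig_b, exp_b = sig_b, exp_b, sig_a, exp_a
    # Widen both significands, then shift b right to align it.
    # Bits shifted past the extra positions are lost.
    a = sig_a << extra_bits
    b = (sig_b << extra_bits) >> (exp_a - exp_b)
    total = Fraction(a + b, 2 ** (FRAC_BITS + extra_bits)) * Fraction(2) ** exp_a
    return round_to_precision(total)

# 1.0000×2^1 + 1.0111×2^0; the exact sum is 3.4375
no_extra = add_with_extra_bits(0b10000, 1, 0b10111, 0, extra_bits=0)
with_extra = add_with_extra_bits(0b10000, 1, 0b10111, 0, extra_bits=2)
print(float(no_extra), float(with_extra))  # 3.375 3.5
```

With no extra bits the low bit of the smaller operand is discarded before the addition and the result is 3.375. With two extra bits that bit survives until rounding, and the result is 3.5, which is what rounding the exact sum gives.
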
14 Adding large numbers
What if we add 1.1111×2^4 + 1.1111×2^4?

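The sum is 1.1111×2^5, so the question is whether the exponent 5 fits. The sketch below assumes the 8-bit format with a 3-bit exponent biased by 3 and every exponent code usable, which makes 4 the largest exponent. It then shows the same effect in Python's `float`, assuming the usual IEEE 754 double-precision implementation.

```python
import math
import sys

# Assumed 8-bit format: 3-bit exponent field biased by 3, all codes usable.
MAX_EXPONENT = 0b111 - 3   # 4
result_exponent = 4 + 1    # doubling 1.1111×2^4 gives 1.1111×2^5
overflows_8bit = result_exponent > MAX_EXPONENT

# The same experiment with doubles: add the largest finite value to itself.
largest = sys.float_info.max
doubled = largest + largest

print(overflows_8bit, doubled)  # True inf
```
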
15 How can we get underflow?

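One way to get underflow is to multiply two very small numbers, so that the result is closer to zero than any nonzero value the format can hold. A minimal sketch using Python's `float`, assuming IEEE 754 doubles:

```python
import sys

tiny = sys.float_info.min  # smallest positive normalized double, 2**-1022
product = tiny * tiny      # 2**-2044 is far too small to represent

print(tiny > 0, product)   # True 0.0
```

Both factors are nonzero, yet the computed product is exactly zero.
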
16 Associativity of arithmetic
(x+y)+z = x+(y+z)
When is this true?

17 Breakdown of associativity
Values
–x = 1.0000
–y = 0.00001
–z = 0.00001
Assume rounding by truncation.
Compute: (x+y)+z and x+(y+z)

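The two groupings can be evaluated with a sketch that truncates after every addition. It assumes the values are binary and that each result keeps five significant bits (the leading one plus four fraction bits, as in x = 1.0000); the helper names `truncate` and `add` are made up for this example.

```python
from fractions import Fraction

def truncate(value, sig_bits=5):
    """Truncate a non-negative Fraction to sig_bits significant bits."""
    if value == 0:
        return value
    exp = 0
    while value >= 2:
        value /= 2
        exp += 1
    while value < 1:
        value *= 2
        exp -= 1
    scale = 2 ** (sig_bits - 1)
    kept = Fraction(int(value * scale), scale)  # drop the low-order bits
    return kept * Fraction(2) ** exp

def add(a, b):
    """Floating-point add modeled as exact add followed by truncation."""
    return truncate(a + b)

x = Fraction(1)          # binary 1.0000
y = z = Fraction(1, 32)  # binary 0.00001

left = add(add(x, y), z)   # (x+y)+z
right = add(x, add(y, z))  # x+(y+z)
print(float(left), float(right))  # 1.0 1.0625
```

In (x+y)+z each small term is truncated away on its own, so the result stays 1.0000. In x+(y+z) the two small terms first combine to 0.0001, which is large enough to survive, giving 1.0001.
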
18 MIPS floating point
32 floating-point registers (32 bits each)
Instructions
–Addition: add.s, add.d
–Subtraction: sub.s, sub.d
–Multiplication: mul.s, mul.d
–Division: div.s, div.d
–Comparison: c.x.s and c.x.d, where x is one of eq, neq, lt, le, gt, ge
–Conditional branch: bc1t, bc1f

19 Summary
Computers aren't limited to integers
Floating-point arithmetic is quirky
–Loss of precision due to rounding
–Underflow
–Overflow
Big picture: Floating point arithmetic can be implemented with enough ______________________.