Published by Corey Griffith. Modified over 9 years ago.
1
Floating Point in Computers. Complies with the standards IEEE 754 and ISO/IEC 559.
2
Timeline: Introduction: quite short. Binary review: not so long. Integer Arithmetic: 1/3. Floating Point: 1/3. Floating Point Arithmetic: 1/3. Other issues: extra short.
3
Introduction: Who does computer arithmetic? Intel’s spare money. How is it done in hardware? How integer arithmetic relates to floating point. Now we go back to “computer structure”.
4
Binary numbers: What is 1001011.00101 (binary)? The set integer bits have weights 64, 8, 2 and 1.
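The question on this slide can be checked mechanically. The following is a minimal sketch, not part of the original slides; the helper name `binary_to_float` is made up for illustration. It sums the weight of every set bit on both sides of the binary point.

```python
def binary_to_float(text: str) -> float:
    """Convert a binary string such as '101.01' to its numeric value."""
    whole, _, frac = text.partition(".")
    value = float(int(whole, 2)) if whole else 0.0
    # Bit k after the binary point has weight 2**-k.
    for position, bit in enumerate(frac, start=1):
        if bit == "1":
            value += 2.0 ** -position
    return value


print(binary_to_float("1001011.00101"))  # 64 + 8 + 2 + 1 + 1/8 + 1/32
```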
5
Signed Binary Integers: sign-magnitude, 2’s complement, 1’s complement, biased.
6
Sign-Magnitude: High-order bit = sign. 0101 = 5, 1101 = -5. Two zeros.
7
2’s complement: Number + Negative = $2^n$. 0101 = 5, 1011 = -5. Easy addition (drop the carry). Formula: $-a_{n-1}2^{n-1} + a_{n-2}2^{n-2} + \dots + a_1 2^1 + a_0$
8
1’s Complement: Negative = complement of every bit. 0101 = 5, 1010 = -5. Two zeros. Number + Negative = $2^n - 1$.
9
Biased: Binary = Number + Bias. With Bias = 5: 1010 = 5 (5 + 5 = 10), 0000 = -5 ((-5) + 5 = 0). Relative order remains.
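The four signed representations can be compared side by side. This is an illustrative sketch; the function names are hypothetical and each decoder takes a bit string whose first character is the high-order bit.

```python
def sign_magnitude(bits: str) -> int:
    """High-order bit is the sign, the rest is the magnitude."""
    magnitude = int(bits[1:], 2)
    return -magnitude if bits[0] == "1" else magnitude


def twos_complement(bits: str) -> int:
    """Negative values are stored as 2**n - |value|."""
    raw = int(bits, 2)
    return raw - (1 << len(bits)) if bits[0] == "1" else raw


def ones_complement(bits: str) -> int:
    """Negative values are stored as (2**n - 1) - |value|."""
    raw = int(bits, 2)
    return raw - ((1 << len(bits)) - 1) if bits[0] == "1" else raw


def biased(bits: str, bias: int) -> int:
    """Stored pattern = value + bias."""
    return int(bits, 2) - bias


print(sign_magnitude("1101"), twos_complement("1011"), ones_complement("1010"))
print(biased("1010", 5), biased("0000", 5))
```

Note that `ones_complement("1111")` and `sign_magnitude("1000")` both decode to 0, which is the second zero those formats carry.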
10
Integer Arithmetic
11
Adding (unsigned) Integers. Elementary school (carries out of bit positions 2, 3 and 7):
    11001101
  + 10000110
  ----------
   101010011
Result has n+1 bits!
12
Adding Integers - hardware. [Figure: a half adder (inputs a, b; output s) and a full adder (inputs a, b, C_in; outputs s, C_out).] 2 logical levels.
13
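Ripple carry Adder. [Figure: a chain of full adders; stage i takes $a_i$, $b_i$ and the carry from stage i-1 and produces $s_i$, from $a_0, b_0, s_0$ up to $a_{n-1}, b_{n-1}, s_{n-1}$ and C_out.] Slow: 2n logical levels. Small constant (CMOS). Other ways exist.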
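A software model of the ripple-carry adder makes the carry chain explicit. This is a sketch for illustration only; `full_adder` and `add_unsigned` are names chosen here, and the loop stands in for the n chained hardware stages.

```python
def full_adder(a: int, b: int, c_in: int):
    """One-bit full adder: returns (sum bit, carry out)."""
    s = a ^ b ^ c_in
    c_out = (a & b) | (c_in & (a ^ b))
    return s, c_out


def add_unsigned(x: int, y: int, n: int) -> int:
    """Add two n-bit unsigned integers by rippling the carry from bit 0 up."""
    carry = 0
    total = 0
    for i in range(n):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        total |= s << i
    return total | (carry << n)  # bit n holds the final carry out


print(bin(add_unsigned(0b11001101, 0b10000110, 8)))
```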
14
Adding Signed Integers. In 2’s complement: $b + (-a) = b + (2^n - a) = 2^n + (b - a)$ and $(-b) + (-a) = (2^n - b) + (2^n - a) = (2^n - (b + a)) + 2^n$; hence add as integers and discard the carry out. Example: 0011 + 1100 = ?
15
Subtracting Integers: Add the negation. Negating 2’s complement: 11010100101011000110000 = ? 00001001010110101001110
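Signed addition and negation in 2's complement can be sketched with plain integer masks. The helper names below are assumptions for illustration, and the negation uses the standard invert-and-add-one rule, which the slide does not spell out.

```python
def add_signed(a: int, b: int, n: int) -> int:
    """Add two n-bit 2's complement patterns and discard the carry out."""
    return (a + b) & ((1 << n) - 1)


def negate(x: int, n: int) -> int:
    """Negate an n-bit 2's complement pattern: invert all bits, then add 1."""
    mask = (1 << n) - 1
    return ((x ^ mask) + 1) & mask


def subtract(a: int, b: int, n: int) -> int:
    """Subtract by adding the negation."""
    return add_signed(a, negate(b, n), n)


print(format(add_signed(0b0011, 0b1100, 4), "04b"))  # 3 + (-4)
print(format(negate(0b0101, 4), "04b"))              # -5
```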
16
Integer (unsigned) Multiplication. Elementary school:
       1101
     * 1001
     ------
       1101
      0
     0
    1101
   --------
   01110101
Result is 2n bits!
17
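Hardware Multiplier: P = 0; loop (n times): (i) if $A_0 = 1$, add B to P; (ii) right-shift the register pair P & A. [Figure: n-bit registers P, A and B, with a carry bit and a shift path through P and A.]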
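The loop on this slide translates almost directly into code. The sketch below is an assumption-laden model rather than the slide's hardware: `shift_add_multiply` is a made-up name, and Python's unbounded integers stand in for the carry bit that catches the overflow of P.

```python
def shift_add_multiply(a: int, b: int, n: int) -> int:
    """Multiply two n-bit unsigned integers with the P/A/B register scheme."""
    p = 0
    for _ in range(n):
        if a & 1:          # (i) low bit of A set: add B to P
            p += b
        # (ii) right-shift the pair (P, A): P's low bit moves into A's top bit
        a = (a >> 1) | ((p & 1) << (n - 1))
        p >>= 1
    return (p << n) | a    # the 2n-bit product sits in P (high) and A (low)


print(format(shift_add_multiply(0b1101, 0b1001, 4), "08b"))
```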
18
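Integer (unsigned) Division. Elementary school: 1101 / 11
  step 1: quotient bit 0, subtract 00, bring down the next bit: 011
  step 2: quotient bit 1, subtract 11, bring down the next bit: 000
  step 3: quotient bit 0, subtract 00, bring down the next bit: 001
  step 4: quotient bit 0, subtract 00, remainder: 01
Result: 0100, Rem 1. Dec: 13/3 = 4, Rem 1.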
19
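Hardware Divider: P = 0; loop (n times): (i) left-shift the register pair P & A; (ii) subtract B from P: if the result is positive, set $a_0 = 1$; if negative, set $a_0 = 0$ and restore P (add B). [Figure: registers A (n bits), P and B (n+1 bits), with 0 shifted in.]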
20
Example: 13 / 3 = 4 (rem 1). n = 4, A = 1101, B = 00011, P = 00000.
21
Registers at the start: P = 00000, A = 1101, B = 00011.
22
Registers at the end: P = 00001 (remainder), A = 0100 (quotient), B = 00011.
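The restoring divider and its 13 / 3 example can be traced in a few lines. This is a minimal sketch under the assumption that B is nonzero; the function name `restoring_divide` is chosen here for illustration.

```python
def restoring_divide(a: int, b: int, n: int):
    """Divide n-bit unsigned a by nonzero b; returns (quotient, remainder)."""
    p = 0
    for _ in range(n):
        # (i) left-shift the pair (P, A): A's top bit moves into P
        p = (p << 1) | ((a >> (n - 1)) & 1)
        a = (a << 1) & ((1 << n) - 1)
        # (ii) trial subtraction
        p -= b
        if p >= 0:
            a |= 1      # quotient bit 1
        else:
            p += b      # quotient bit 0, restore P
    return a, p


quotient, remainder = restoring_divide(0b1101, 0b0011, 4)
print(format(quotient, "04b"), format(remainder, "05b"))
```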
23
Division - remarks: Non-restoring algorithm: load P only if positive. Check for 0. (Total) result is 2n bits!
24
Integer arithmetic - remarks: Signed multiply and division: algorithms exist; we will not use them. What to do with the extra bits? Faster methods.
25
Floating Point
26
Non Integers - Other Methods. Fixed point: example: ###.#; binary point shifted; integer arithmetic (extra shifting); small number magnitude. Rational: a/b ($a, b \in \mathbb{Z}$).
27
Floating Point: Exponent + Significand (= Mantissa). $x = s \times 2^e$. Example: s = 101, e = 011 (binary): $x = 101_2 \times 2^{11_2} = 5 \times 2^3 = 40 = 101000_2$.
28
Uniqueness. Denormal numbers: $123.456 \times 10^7$, $0.123 \times 10^4$. Normalized: $\#.\#\#\# \times 10^{\#}$, e.g. $1.123 \times 10^4$. What about 0?
29
Floating Point Standard. Why standardize? Hardware accelerators; software compatibility; build software libraries; etc. IEEE 754-1985, ISO/IEC 559. Includes: structure, arithmetic results.
30
Float Types. 4 precision types: Single; Single extended; Double; Double extended.
31
Single Precision. 32 bits: Sign (1) | Exponent (8) | Significand (23). Exponent (e): biased (+127). Significand (f): fixed fraction 0.### … Number: $1.f \times 2^{e-127}$, with the sign given by the sign bit.
32
Single Precision - Example: 1 10000001 01000000000000000000000. Sign bit 1: negative. Exponent: 10000001 = 129, and 129 - 127 = 2. Fraction: 01000… = 0.01000…, so the significand is $1.01_2 = 1.25$. $X = -1.25 \times 2^2 = -5$.
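The example bit pattern can be decoded both by hand and by letting the machine reinterpret the 32 bits. The sketch below uses Python's standard `struct` module; the function names are made up, and `decode_by_fields` handles normalized numbers only.

```python
import struct


def decode_by_fields(bits: str) -> float:
    """Apply (-1)**sign * 1.f * 2**(e - 127) to a normalized single."""
    bits = bits.replace(" ", "")
    sign = int(bits[0])
    e = int(bits[1:9], 2)
    f = int(bits[9:], 2) / 2 ** 23      # fraction 0.f
    return (-1) ** sign * (1 + f) * 2.0 ** (e - 127)


def decode_by_hardware(bits: str) -> float:
    """Reinterpret the same 32 bits as an IEEE 754 single."""
    raw = int(bits.replace(" ", ""), 2)
    return struct.unpack(">f", struct.pack(">I", raw))[0]


pattern = "1 10000001 01000000000000000000000"
print(decode_by_fields(pattern), decode_by_hardware(pattern))
```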
33
Single Precision - Range. $E_{max} = 127$ (e = 254); $E_{min} = -126$ (e = 1). Why $|E_{min}| < |E_{max}|$? So that $1/2^{E_{min}}$ does not overflow. Why biased notation? What about e = 0 and e = 255?
34
Floating Point Precision
35
Examples. We shall use base 10 sometimes: f will have 3 digits, $E_{max}$ will be 98, $E_{min}$ will be -97. Ex: $5.34 \times 10^{70}$.
36
NaN: Not a Number. Result of an illegal computation [the slide's example formulas are not recoverable], and of any computation involving a NaN. Encoding: $e = E_{max} + 1$ and $f \neq 0$: # 11111111 #######################. Many NaNs (different f's).
37
NaNs in use. Zero finder evaluating outside the domain: f(x) = sqrt(x) - 1. Works, since all such computations yield NaN. No exception caused!
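NaN propagation can be shown with the slide's f(x) = sqrt(x) - 1. One caveat is labeled loudly here: Python's `math.sqrt` raises `ValueError` for negative input instead of returning NaN, so this sketch wraps it in a hypothetical `ieee_sqrt` helper that mimics the IEEE behaviour.

```python
import math


def ieee_sqrt(x: float) -> float:
    """Square root that returns NaN outside its domain (assumed helper)."""
    return math.sqrt(x) if x >= 0.0 else math.nan


def f(x: float) -> float:
    return ieee_sqrt(x) - 1.0


probe = f(-4.0)             # sqrt is outside its domain here
print(math.isnan(probe))    # the NaN survived the subtraction
print(probe == probe)       # a NaN compares unequal even to itself
print(f(1.0))               # a genuine zero of f
```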
38
Zeros. 0 00000000 00000000000000000000000 = ? This is NOT $1.0 \times 2^{E_{min}}$. 1 00000000 00000000000000000000000 = ? 0 is signed! +0 and -0 both exist! What is the difference?
39
Signed zeros. +0 = -0, BUT multiply/divide keep the sign rules. Motivation: using inf correctly (described later); log(x): log(0) = -inf, log(negative) = NaN; what is log(x) if x = (-0)?
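The difference between the two zeros is invisible to `==` but visible to sign-sensitive functions. A small sketch using only the standard `math` module; `sign_of` is a name introduced here for illustration.

```python
import math


def sign_of(x: float) -> float:
    """Return +1.0 or -1.0 according to the sign bit, even for zeros."""
    return math.copysign(1.0, x)


pos_zero, neg_zero = 0.0, -0.0
print(pos_zero == neg_zero)                   # the zeros compare equal
print(sign_of(pos_zero), sign_of(neg_zero))   # but carry different signs
print(sign_of(neg_zero * 3.0))                # multiplication keeps sign rules
# atan2 distinguishes the zeros on the negative real axis:
print(math.atan2(pos_zero, -1.0), math.atan2(neg_zero, -1.0))
```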
40
± inf. More logic. Encoding: $e = E_{max} + 1$ and $f = 0$: # 11111111 00000000000000000000000
41
Inf usage. Example (if $\tan^{-1}$ is defined properly).
42
More on zeros and infs. General rule for 0/inf arithmetic: take the appropriate limit, e.g. 1/(1/x) where x = 0 or inf. Why not use the maximum representable number instead?
43
Zeros and infs - yet again. $x/(x^2+1)$ is bad! Why? $1/(x + x^{-1})$ is better. Do we need to check for x = 0? Using 2 zeros and infs saves some special-case checks.
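The two formulas differ for large x because x*x overflows long before x itself does. A rough sketch in Python doubles; `bad` and `better` are names chosen here. Note one assumption that departs from the slide: Python raises `ZeroDivisionError` for `1.0 / 0.0` instead of returning inf, so in Python the x = 0 check is still needed.

```python
def bad(x: float) -> float:
    """x / (x**2 + 1): the square overflows to inf for large x."""
    return x / (x * x + 1.0)


def better(x: float) -> float:
    """1 / (x + 1/x): no intermediate overflow for large x."""
    return 1.0 / (x + 1.0 / x)


big = 1e200
print(bad(big))      # x*x is inf, so the quotient collapses to 0.0
print(better(big))   # close to the true value, about 1e-200
print(bad(2.0), better(2.0))
```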
44
Denormalized numbers. Example: $x = 1.23 \times 10^{-98}$, $y = 1.11 \times 10^{-98}$; $x - y = 1.20 \times 10^{-99} = 0$; so x - y = 0 but x ≠ y. Think of: if (x ≠ y) then z = 1/(x - y). Solution: use denormalized numbers!
45
Denormal Numbers. Smallest normal: $1.0 \times 2^{E_{min}}$. Below that, use denormals: $0.f \times 2^{E_{min}}$. Encoding: $e = E_{min} - 1$ and $f \neq 0$: # 00000000 #######################. Gradual underflow: $1.23 \times 10^{-4}$ → (/10) → $0.12 \times 10^{-4}$ → (/10) → $0.01 \times 10^{-4}$ → (/10) → 0.
46
Denormal Numbers. Back to our example: $x = 1.23 \times 10^{-98}$, $y = 1.11 \times 10^{-98}$; $x - y = 0.12 \times 10^{-98}$, and this is not 0!
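The same effect can be observed in binary doubles. This sketch transposes the slide's base-10 example to IEEE double precision (an assumption of this example, not the slide's numbers), using `sys.float_info.min` as the smallest positive normal value.

```python
import sys

smallest_normal = sys.float_info.min   # 2**-1022 for IEEE doubles
x = 1.5 * smallest_normal
y = 1.25 * smallest_normal
diff = x - y                           # 0.25 * smallest_normal, a denormal

print(x != y)                  # the operands differ ...
print(diff != 0.0)             # ... and so their difference is not zero
print(diff < smallest_normal)  # it lies below the normal range
print(1.0 / diff)              # so dividing by it is safe
```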
47
Flush to 0 vs Gradual Underflow. [Figure: two number lines marked 0, $2^{-4}$, $2^{-3}$, $2^{-2}$, $2^{-1}$, one for flush to 0 and one for gradual underflow.]
48
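Special Values - Summary
  Exponent                          | Fraction   | Represents
  $E_{min} - 1$                     | $f = 0$    | ±0
  $E_{min} - 1$                     | $f \neq 0$ | $0.f \times 2^{E_{min}}$
  $E_{min} \le e \le E_{max}$       | ----       | $1.f \times 2^{e}$
  $E_{max} + 1$                     | $f = 0$    | ±inf
  $E_{max} + 1$                     | $f \neq 0$ | NaN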
49
Rounding. Why is rounding needed? Infinitely many numbers, finite representation. Integers only overflow. Almost all floating-point operations need rounding. IEEE specifies algorithms for arithmetic.
50
Numbers need rounding. Out of range: $x > 2 \times 2^{E_{max}}$, $x < 1 \times 2^{E_{min}}$. Between 2 floats: $0.1_{10} = 0.00011001100\ldots_2 = 1.1001100\ldots_2 \times 2^{-4}$, which must be rounded, e.g. to $1.1001_2 \times 2^{-4}$.
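That 0.1 falls between two floats is easy to see in any language with IEEE doubles. A short illustration using only the Python standard library; the variable names are arbitrary.

```python
from decimal import Decimal

hex_form = (0.1).hex()    # significand in hex, exponent as a power of 2
stored = Decimal(0.1)     # the exact value of the double nearest to 0.1

print(hex_form)           # repeating 1001 pattern (hex 9), rounded at the end
print(stored)
print(stored == Decimal("0.1"))
```

The exponent `p-4` matches the slide's $2^{-4}$, and the final hex digit `a` instead of `9` is where the rounding happened.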
51
Measuring Error. ULPs (units in the last place): $1.2 \times 10^{-1}$ vs 0.124: 0.4 ulps; $1.2 \times 10^{-1}$ vs 0.118: 0.2 ulps. Relative error: difference/original; $1.2 \times 10^{-1}$ vs 0.124: Err = 0.004/0.124 = 0.032.
52
Calculate Using Rounding. Benign cancellation: calculate 10.1 - 9.93 (= 0.17): $1.01 \times 10^1 - 0.99 \times 10^1 = 0.02 \times 10^1 = 2.00 \times 10^{-1}$. 30 ulps!
53
Rounding problems. Catastrophic cancellation: $b^2 - 4ac$; both $b^2$ and $4ac$ are rounded; the subtraction exposes the error. b = 3.34, a = 1.22, c = 2.28: $b^2 = 11.2$, $4ac = 11.1$, $b^2 - 4ac = 0.10$; correct = 0.0292 (70.8 ulps).
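The three-digit arithmetic of this example can be reproduced with Python's `decimal` module. This is a sketch under the assumption that a `Context` with `prec=3` is an acceptable stand-in for the slide's base-10 format; `discriminant` is a name chosen for illustration.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

three_digits = Context(prec=3, rounding=ROUND_HALF_EVEN)
exact = Context(prec=28)
a, b, c = Decimal("1.22"), Decimal("3.34"), Decimal("2.28")


def discriminant(ctx: Context) -> Decimal:
    """Compute b*b - 4*a*c, rounding every operation in the given context."""
    b2 = ctx.multiply(b, b)
    four_ac = ctx.multiply(ctx.multiply(Decimal(4), a), c)
    return ctx.subtract(b2, four_ac)


rounded = discriminant(three_digits)
true_value = discriminant(exact)
# One ulp of 1.00e-1 with a 3-digit significand is 0.001.
error_in_ulps = (rounded - true_value) / Decimal("0.001")
print(rounded, true_value, error_in_ulps)
```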
54
IEEE Arithmetic. Requirement: +, -, ×, ÷, √ should be EXACTLY rounded; remainder should be EXACTLY rounded; integer conversion should be EXACTLY rounded. Not all operations (transcendental functions, binary to decimal). “Tie break”: round to even.
55
Round to Even. How will 1.005 be rounded? Round up: 1.01. Round even: 1.00. Why? Example: $x_i = x_{i-1} + y - y$, $x_0 = 1.00$, y = 0.125. Round up: 1.00, 1.01, 1.02, …. Round even: 1.00, 1.00, 1.00, ….
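The tie-break on 1.005 can be tried with the `decimal` module, which offers both rules. A minimal sketch; decimal strings are used so that 1.005 is an exact tie rather than a binary approximation.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

x = Decimal("1.005")
two_places = Decimal("0.01")

rounded_up = x.quantize(two_places, rounding=ROUND_HALF_UP)
rounded_even = x.quantize(two_places, rounding=ROUND_HALF_EVEN)
print(rounded_up, rounded_even)

# With round to even, ties land on the neighbour whose last digit is even.
print(Decimal("1.015").quantize(two_places, rounding=ROUND_HALF_EVEN))
```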
56
Float Multiplication. Significands: integer multiply. Exponents: biased addition. “Biased addition”: detect overflow: use an (n+1)-bit adder; detect underflow: harder (denormals).
57
Rounding Multiplication. Decimal examples: 1.23 × 6.78 = 8.3394, round to 8.34; 2.83 × 4.47 = 12.6501, round to 1.27 (shift needed); 1.28 × 7.81 = 09.9968, round to 1.00. Binary examples (kept digits | round bit): 1.0001|1: round bit 1, all rest 0; 1.0010|0: round bit 0; 1.0010|1: round bit 1, all rest 0; 0.1101|0: shift needed.
58
Round, Guard, Sticky. 0.110100010: fields are number, guard, round, sticky. 1.001000100: fields are number, round, sticky.
59
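Rounding Multiplication. [Figure: registers P, A and B (n bits each, with carry and shift).] Product: $x_0 x_1 . x_2 x_3 x_4 x_5\ g\ r\ s\ s\ s\ s$. Case 1: $x_0 = 0$, shift: result $x_1 . x_2 x_3 x_4 x_5 g$. Case 2: $x_0 = 1$, increment the exponent: result $x_0 . x_1 x_2 x_3 x_4 x_5$. In both cases the figure marks the round digit and the sticky bit.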
60
Rounding rules. r = 0: the result is already rounded. r = 1, s = 1: add 1 to the LSB. r = 1, s = 0: add 1 only if LSB = 1 (tie, round to even). Denormals: extra shifting.
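The three rules fit in a few lines when the significand is held as an integer. A sketch with assumed names: `round_nearest_even` drops the lowest `drop` bits of `value`, taking r as the first discarded bit and s as the OR of the rest. A carry out of the top bit is not renormalized here.

```python
def round_nearest_even(value: int, drop: int) -> int:
    """Drop the low `drop` bits of value, rounding to nearest, ties to even."""
    if drop <= 0:
        return value
    kept = value >> drop
    r = (value >> (drop - 1)) & 1                     # round bit
    s = int(value & ((1 << (drop - 1)) - 1) != 0)     # sticky bit
    if r and (s or (kept & 1)):
        kept += 1
    return kept


# 1.0001|1 is a tie with an odd LSB, so it rounds up to 1.0010
print(format(round_nearest_even(0b100011, 1), "05b"))
# 1.0010|1 is a tie with an even LSB, so it stays 1.0010
print(format(round_nearest_even(0b100101, 1), "05b"))
```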
61
Float addition. Compute all digits and round? $1.00 \times 2^{20} + 1.00 \times 2^{-20} = 10000000\ldots0000001$: too long! Use round and sticky bits: shift to the same exponent; r = first discarded digit; s = OR of the remaining discarded digits.
62
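Float addition - example. Calculate: $1.10011 \times 2^0 + 1.10001 \times 2^{-5}$. Shift exponents: $1.10011 \times 2^0 + 0.0000110001 \times 2^0$. Keep 0.00001; discarded bits 10001 give r = 1, s = 0|0|0|1 = 1. Add: 1.10011 + 0.00001 = 1.10100. r = 1, s = 1: round needed! Result: 1.10101.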
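The alignment step and the round/sticky bits of this example can be modelled on integer significands. This is a simplified sketch with a hypothetical name, `add_aligned`: it handles two positive operands only and does not renormalize after a carry out.

```python
def add_aligned(big: int, small: int, shift: int) -> int:
    """Add significands where `small` has an exponent `shift` lower than `big`.

    Both arguments are significands stored as integers with the same number
    of fraction bits; the result is rounded to nearest, ties to even.
    """
    if shift == 0:
        return big + small
    kept = small >> shift                              # bits that survive
    r = (small >> (shift - 1)) & 1                     # first discarded bit
    s = int(small & ((1 << (shift - 1)) - 1) != 0)     # OR of the rest
    total = big + kept
    if r and (s or (total & 1)):
        total += 1
    return total


# 1.10011 * 2**0 + 1.10001 * 2**-5, with 5 fraction bits
result = add_aligned(0b110011, 0b110001, 5)
print(format(result, "06b"))   # read as 1.10101
```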
63
Signed Addition/Subtraction. Simplest way: convert to 2’s complement. Cancellation of the high-order bit: shift. More bits cancel: how many guard digits? [Worked examples from the slide, layout not recoverable: 1.00000, 1.11111, 0.11111, +, 1.00000, 0.00000101111, -, 1.1111101000, 1’s complement.]
64
Float Division. Significands: integer division. Exponents: biased subtraction. Very similar to multiplication. Dividing using integer divide: compute 2 more bits (round, guard); use the remainder as the sticky bit (why?). Sign bit: XOR.
65
More on floats
66
Rounding modes. IEEE specifies 4 modes: nearest (default); towards 0; towards +inf; towards -inf. The mode affects overflow (how?).
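Python's `decimal` module has counterparts of all four modes, which makes them easy to compare. This is a base-10 analogue chosen for illustration, not the binary hardware modes themselves; the mapping in `MODES` and the name `round_all` are assumptions of this sketch.

```python
from decimal import (Decimal, ROUND_CEILING, ROUND_DOWN, ROUND_FLOOR,
                     ROUND_HALF_EVEN)

MODES = {
    "nearest": ROUND_HALF_EVEN,   # ties go to the even neighbour
    "toward 0": ROUND_DOWN,
    "toward +inf": ROUND_CEILING,
    "toward -inf": ROUND_FLOOR,
}


def round_all(text: str) -> dict:
    """Round a decimal string to an integer under each of the four modes."""
    value = Decimal(text)
    return {name: value.quantize(Decimal("1"), rounding=mode)
            for name, mode in MODES.items()}


print(round_all("2.5"))
print(round_all("-2.5"))
```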
67
Exceptions. Set a flag at: underflow: $1.0 \times 2^{E_{min}} \times 1.0 \times 2^{E_{min}}$; overflow: $1.0 \times 2^{E_{max}} \times 1.0 \times 2^{E_{max}}$; divide by 0: 1/0; inexact: rounding was needed; invalid: operations that return NaN. Flags are sticky.
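Sticky flags can be observed with a `decimal` context, which models the same flag mechanism in base 10. A sketch under the assumption that traps are disabled (`traps=[]`), so that operations set flags and return a result instead of raising.

```python
from decimal import Context, Decimal, DivisionByZero, Inexact

ctx = Context(prec=3, traps=[])          # no traps: flags only

ctx.divide(Decimal(1), Decimal(3))       # 0.333, rounding was needed
inexact_after_division = bool(ctx.flags[Inexact])

ctx.add(Decimal(1), Decimal(1))          # exact, but the flag stays set
inexact_still_set = bool(ctx.flags[Inexact])

infinity = ctx.divide(Decimal(1), Decimal(0))
div_by_zero_set = bool(ctx.flags[DivisionByZero])

print(inexact_after_division, inexact_still_set, infinity, div_by_zero_set)

ctx.clear_flags()                        # flags persist until cleared
print(bool(ctx.flags[Inexact]))
```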
68
Speeding up. Different algorithms may be used; the result should be exact. Divide: SRT algorithm in the Pentium; 5/2048 entries in a table; 1/9,000,000 chance; check:
69
Precision. Why extended precisions? Return higher accuracy (D*D ext. D); use for computations: