Floating Point in computers Comply with standards: IEEE 754 ISO/IEC 559.

Timeline
Introduction: quite short
Binary review: not so long
Integer Arithmetic: 1/3
Floating Point: 1/3
Floating Point Arithmetic: 1/3
Other issues: extra short

Introduction
Who does computer arithmetic?
Intel’s spare money
How is it done in hardware?
How integer relates to floating point
Now, we go back to “computer structure”

Binary numbers
What is 1001011.00101?
Integer part: 64 + 8 + 2 + 1 = 75; fraction part: 1/8 + 1/32 = 0.15625
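The positional weights on this slide can be checked mechanically. The following is a minimal sketch; the helper name `binary_to_decimal` is hypothetical and not part of the slides.

```python
def binary_to_decimal(bits: str) -> float:
    """Evaluate a binary numeral that may contain a binary point."""
    whole, _, frac = bits.partition(".")
    value = float(int(whole, 2)) if whole else 0.0
    # Digit i after the point carries the weight 2**-i.
    for i, bit in enumerate(frac, start=1):
        if bit == "1":
            value += 2.0 ** -i
    return value


print(binary_to_decimal("1001011.00101"))  # 75.15625
```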

Signed Binary Integers
Sign-magnitude
2’s complement
1’s complement
Biased

Sign-Magnitude
High-order bit = sign
0101 = 5, 1101 = -5
2 zeros

2’s complement
Number + Negative = $2^n$
0101 = 5, 1011 = -5
Easy addition (drop the carry)
Formula: $-a_{n-1} 2^{n-1} + a_{n-2} 2^{n-2} + \dots + a_1 2^1 + a_0$

1’s Complement
Negative = bitwise complement
0101 = 5, 1010 = -5
2 zeros
Number + Negative = $2^n - 1$

Biased
Binary = Number + Bias
Bias = 5:
1010 = 5 (5 + 5 = 10)
0000 = -5 ((-5) + 5 = 0)
Relative order remains
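The four signed encodings can be compared side by side. This is a rough sketch for 4-bit words; the function names are illustrative, and the bias of 5 is the one used on the slide.

```python
N = 4  # word width in bits (assumption: the 4-bit words used on the slides)


def sign_magnitude(x: int) -> str:
    sign = 1 if x < 0 else 0
    return format((sign << (N - 1)) | abs(x), f"0{N}b")


def ones_complement(x: int) -> str:
    # Number + Negative = 2**N - 1, so -a is stored as (2**N - 1) - a.
    return format(x if x >= 0 else (2**N - 1) + x, f"0{N}b")


def twos_complement(x: int) -> str:
    # Number + Negative = 2**N, so -a is stored as 2**N - a.
    return format(x % 2**N, f"0{N}b")


def biased(x: int, bias: int = 5) -> str:
    return format(x + bias, f"0{N}b")


for encode in (sign_magnitude, ones_complement, twos_complement, biased):
    print(f"{encode.__name__:16s} +5 = {encode(5)}   -5 = {encode(-5)}")
```

Only the biased form keeps the unsigned bit patterns in the same order as the values they stand for.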

Integer Arithmetic

Adding (unsigned) Integers
Elementary school:
    11001101
  + 10000110
  ----------
   101010011
Result has n+1 bits!

Adding Integers - hardware
Half adder: inputs a, b; outputs s, $C_{out}$
Full adder: inputs a, b, $C_{in}$; outputs s, $C_{out}$
2 logical levels

Ripple carry Adder
[Figure: chain of full adders; stage i takes $a_i$, $b_i$ and the carry of stage i-1 and produces $s_i$, from stage 0 up to stage n-1, whose carry is $C_{out}$]
Slow: 2n logical levels
Small constant (CMOS)
Other ways exist
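The adder chain can be modelled in software with one function per gate-level block. This is a minimal sketch, not a hardware description; the bit-list helpers `to_bits` and `from_bits` are hypothetical conveniences.

```python
def half_adder(a: int, b: int) -> tuple[int, int]:
    """Return (sum, carry) for two input bits."""
    return a ^ b, a & b


def full_adder(a: int, b: int, c_in: int) -> tuple[int, int]:
    """Two half adders plus an OR gate for the carry."""
    s1, c1 = half_adder(a, b)
    s, c2 = half_adder(s1, c_in)
    return s, c1 | c2


def to_bits(x: int, n: int) -> list[int]:
    return [(x >> i) & 1 for i in range(n)]  # least significant bit first


def from_bits(bits: list[int]) -> int:
    return sum(bit << i for i, bit in enumerate(bits))


def ripple_carry_add(a_bits: list[int], b_bits: list[int]) -> list[int]:
    """Add two n-bit numbers; the result has n+1 bits."""
    carry = 0
    out = []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)  # the carry ripples to the next stage
        out.append(s)
    return out + [carry]


total = ripple_carry_add(to_bits(0b11001101, 8), to_bits(0b10000110, 8))
print(format(from_bits(total), "09b"))  # 101010011
```

Each stage has to wait for the carry of the stage before it, which is where the 2n logic levels come from.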

Adding Signed Integers
In 2’s complement:
$b + (-a) = b + (2^n - a) = 2^n + (b - a)$
$(-b) + (-a) = (2^n - b) + (2^n - a) = 2^n + (2^n - (b + a))$
Hence: add as unsigned integers, discard the carry out
Example: 0011 + 1100 = ?
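The slide's example can be tried directly. The sketch below assumes 4-bit words; `add_signed` and `to_signed` are illustrative names.

```python
N = 4                 # word width in bits
MASK = (1 << N) - 1   # keeps the low N bits, i.e. discards the carry out


def add_signed(a_bits: int, b_bits: int) -> int:
    """Add two 2's complement patterns as plain unsigned integers."""
    return (a_bits + b_bits) & MASK


def to_signed(bits: int) -> int:
    """Interpret an N-bit pattern as a 2's complement value."""
    return bits - (1 << N) if bits & (1 << (N - 1)) else bits


result = add_signed(0b0011, 0b1100)          # 3 + (-4)
print(format(result, "04b"), to_signed(result))  # 1111 -1
```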

Subtracting Integers
Add the negation
Negating 2’s complement: 11010100101011000110000 = ? 00001001010110101001110

Integer (unsigned) Multiplication
Elementary school:
        1101
      * 1001
      ------
        1101
       0
      0
     1101
    --------
    01110101
Result is 2n bits!

Hardware Multiplier
P = 0
Loop:
(i) if $a_0 = 1$, add B to P
(ii) right-shift P & A
[Figure: registers P and A (n bits each) shifting together with a carry bit; register B (n bits)]
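The loop can be simulated with ordinary integers standing in for the P, A and B registers. This is a sketch under the assumption that the carry of the addition is shifted into the top of P, as in the usual shift-and-add design; the function name is hypothetical.

```python
def shift_add_multiply(a: int, b: int, n: int) -> int:
    """Multiply two unsigned n-bit numbers with the shift-and-add loop."""
    p = 0
    for _ in range(n):
        if a & 1:              # (i) low bit of A selects whether B is added
            p += b             # P may now need n+1 bits (the carry)
        # (ii) right-shift the pair (P, A): the low bit of P moves into the top of A
        a = (a >> 1) | ((p & 1) << (n - 1))
        p >>= 1
    return (p << n) | a        # high half in P, low half in A: 2n bits


print(format(shift_add_multiply(0b1101, 0b1001, 4), "08b"))  # 01110101
```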

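Integer (unsigned) Division
Elementary school: 1101 / 11
  1    < 11: quotient bit 0
  11  >= 11: quotient bit 1, subtract 11, leaving 0
  00   < 11: quotient bit 0
  01   < 11: quotient bit 0, leaving remainder 1
Result: 0100, Rem 1
Dec: 13/3 = 4, Rem 1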

Hardware Divider
P = 0
Loop:
(i) left-shift P & A
(ii) subtract B from P:
– positive: $a_0 = 1$
– negative: $a_0 = 0$, restore P (add B)
[Figure: registers P (n+1 bits) and A (n bits) shifting together, with 0 shifted in; register B]

Example
13 / 3 = 4 (remainder 1), n = 4
Start: P = 00000, A = 1101, B = 00011
End:   P = 00001 (remainder), A = 0100 (quotient), B = 00011
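The restoring divider can be traced in the same style. This is a minimal sketch with integers standing in for the registers; the function name is hypothetical, and a zero divisor is not checked.

```python
def restoring_divide(a: int, b: int, n: int) -> tuple[int, int]:
    """Divide unsigned n-bit a by b; return (quotient, remainder)."""
    p = 0
    for _ in range(n):
        # (i) left-shift the pair (P, A): the top bit of A moves into P
        p = (p << 1) | ((a >> (n - 1)) & 1)
        a = (a << 1) & ((1 << n) - 1)
        # (ii) trial subtraction
        p -= b
        if p >= 0:
            a |= 1     # positive: quotient bit 1
        else:
            p += b     # negative: quotient bit 0, restore P
    return a, p


quotient, remainder = restoring_divide(0b1101, 0b0011, 4)
print(format(quotient, "04b"), format(remainder, "05b"))  # 0100 00001
```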

Division - remarks
Non-restoring algorithm
Load P only if positive
Check for 0
(Total) result is 2n bits!

Integer arithmetic - remarks
Signed multiply and division
– Algorithms exist
– We will not use them
What to do with extra bits?
Faster methods

Floating Point

Non Integers - Other Methods
Fixed point
– Example: # # # . #
– Binary point shifted
– Integer arithmetic (extra shifting)
– Small number magnitude
Rational
– a/b (a, b ∈ Z)

Floating Point
Exponent + Significand (= Mantissa)
$x = s \times 2^e$
Example: s = 101, e = 011 (binary)
$x = 101_2 \times 2^{11_2} = 5 \times 2^3 = 40 = 101000_2$

Uniqueness
Denormal numbers: $123.456 \times 10^7$, $0.123 \times 10^4$
Normalized: $\#.\#\#\# \times 10^{\#}$, e.g. $1.123 \times 10^4$
What about 0?

Floating Point Standard
Why standardize?
– Hardware accelerators
– Software compatibility
– Build software libraries
– etc.
IEEE 754-1985, ISO/IEC 559
Includes: structure, arithmetic results

Float Types
4 precision types:
– Single
– Single extended
– Double
– Double extended

Single Precision
32 bits: Sign (1) | Exponent (8) | Significand (23)
Exponent (e): biased (+127)
Significand (f): fixed fraction 0.###…
Number: $1.f \times 2^{e-127}$

Single Precision - Example
1 10000001 01000000000000000000000
Exponent: 10000001 = 129, so 129 - 127 = 2
Fraction: 01000… = 0.01000…, so 1.f = 1.01 (binary) = 1.25
$X = -1.25 \times 2^2 = -5$
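The decoding above can be reproduced in a few lines and cross-checked against the machine's own interpretation of the same 32 bits. This is a sketch that handles normalized patterns only; `decode_single` is a hypothetical helper, while `struct.unpack` is the standard-library call used for the cross-check.

```python
import struct


def decode_single(pattern: str) -> float:
    """Decode a normalized single-precision pattern written as 0/1 text."""
    bits = pattern.replace(" ", "")
    sign = int(bits[0], 2)
    e = int(bits[1:9], 2)              # biased exponent
    f = int(bits[9:], 2) / 2**23       # fraction 0.###...
    return (-1) ** sign * (1 + f) * 2.0 ** (e - 127)


pattern = "1 10000001 01000000000000000000000"
value = decode_single(pattern)

# Cross-check: let struct reinterpret the same 32 bits as a float.
raw = int(pattern.replace(" ", ""), 2).to_bytes(4, "big")
print(value, struct.unpack(">f", raw)[0])  # -5.0 -5.0
```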

Single Precision - Range
$E_{max} = 127$ (e = 254)
$E_{min} = -126$ (e = 1)
Why $|E_{min}| < |E_{max}|$?
– $1/2^{E_{min}}$ does not overflow
Why biased notation?
What about e = 0 and e = 255?

Floating Point Precision

Examples
We shall use base 10 sometimes:
f will have 3 digits
$E_{max}$ will be 98
$E_{min}$ will be -97
Ex: $5.34 \times 10^{70}$

NaN
Not a Number
Result of illegal computation:
– Any computation involving a NaN
$e = E_{max} + 1$ and $f \neq 0$
# 11111111 #######################
Many NaNs (different f’s)

NaNs in use
Zero finder probing outside the domain
– f(x) = sqrt(x) - 1
Works, since all such computations give NaN
No exception caused!

Zeros
0 00000000 00000000000000000000000 = ?
This is NOT $1.0 \times 2^{E_{min}}$
1 00000000 00000000000000000000000 = ?
0 is signed!
±0 both exist!
What is the difference?

Signed 0s
+0 = -0
BUT: multiply/divide keep the sign rules
Motivation:
– Using inf correctly (described later)
– log(x): log(0) = -inf, log(negative) = NaN
– log(x) if x → (-0)?

± inf
More logic:
$e = E_{max} + 1$ and $f = 0$
# 11111111 00000000000000000000000

Inf usage
Example (if $\tan^{-1}$ is defined properly)

More on 0s and infs
General rule for 0/inf arithmetic:
– Take the appropriate limit: 1/(1/x) where x = 0 or inf
Why not use the maximum representable number instead?

Zeros and infs - yet again
$x/(x^2 + 1)$ is bad! Why? ($x^2$ can overflow even when the result is representable.)
$1/(x + x^{-1})$ is better
Do we need to check for x = 0?
Using 2 zeros and infs saves some special-case checks.
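The difference between the two formulas shows up for large x. This is a small sketch in double precision; the function names are illustrative. One caveat: Python raises `ZeroDivisionError` for `1.0 / 0.0` instead of returning inf, so the x = 0 case from the slide cannot be demonstrated with plain Python floats.

```python
def naive(x: float) -> float:
    return x / (x * x + 1.0)      # x * x overflows to inf for large x


def rewritten(x: float) -> float:
    return 1.0 / (x + 1.0 / x)    # no intermediate overflow


x = 1e200
print(naive(x))       # 0.0, although the true value is about 1e-200
print(rewritten(x))
```

For moderate x both forms agree to within rounding error, so the rewritten form costs nothing in accuracy there.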

Denormalized numbers
Example:
– $x = 1.23 \times 10^{-98}$, $y = 1.11 \times 10^{-98}$
– $x - y = 1.20 \times 10^{-99}$, which is flushed to 0
– So: x - y = 0 but x ≠ y
– Think of: if (x ≠ y) then z = 1/(x - y)
Solution:
– Use denormalized numbers!

Denormal Numbers
Smallest normal: $1.0 \times 2^{E_{min}}$
Below, use denormal: $0.f \times 2^{E_{min}}$
$e = E_{min} - 1$ and $f \neq 0$
# 00000000 #######################
Gradual underflow: $1.23 \times 10^{-4}$ → (/10) → $0.12 \times 10^{-4}$ → (/10) → $0.01 \times 10^{-4}$ → (/10) → 0

Denormal Numbers
Back to our example:
– $x = 1.23 \times 10^{-98}$, $y = 1.11 \times 10^{-98}$
– $x - y = 0.12 \times 10^{-98}$
– And this is not 0!
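The same effect can be observed with binary doubles. The sketch below assumes the platform does not run in a flush-to-zero mode, which is the normal case for CPython.

```python
import sys

tiny = sys.float_info.min      # smallest normalized double, 2**-1022

x = 1.5 * tiny
y = 1.25 * tiny
diff = x - y                   # 0.25 * tiny: below the normalized range

print(x != y)                  # True
print(diff != 0.0)             # True: the difference is a denormal, not 0
print(diff < tiny)             # True
```

With gradual underflow, x ≠ y guarantees x - y ≠ 0, so a guard such as `if x != y: z = 1 / (x - y)` does not divide by zero.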

Flush to 0 vs. Gradual Underflow
[Figure: two number lines, one per scheme, each marked at 0, $2^{-4}$, $2^{-3}$, $2^{-2}$, $2^{-1}$]

Special Values - Summary
Exponent                      Fraction   Represents
$e = E_{min} - 1$             f = 0      ±0
$e = E_{min} - 1$             f ≠ 0      $0.f \times 2^{E_{min}}$
$E_{min} \le e \le E_{max}$   ----       $1.f \times 2^e$
$e = E_{max} + 1$             f = 0      ±inf
$e = E_{max} + 1$             f ≠ 0      NaN
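The summary translates directly into a classifier for single-precision bit patterns. This is a minimal sketch; `classify_single` is a hypothetical name, and the stored exponent field values 0 and 255 correspond to $E_{min} - 1$ and $E_{max} + 1$.

```python
def classify_single(bits: int) -> str:
    """Classify a 32-bit single-precision pattern by exponent and fraction."""
    e = (bits >> 23) & 0xFF     # 8-bit biased exponent field
    f = bits & 0x7FFFFF         # 23-bit fraction field
    if e == 0:
        return "zero" if f == 0 else "denormal"
    if e == 255:
        return "infinity" if f == 0 else "NaN"
    return "normal"


for pattern in (0x00000000, 0x80000000, 0x00000001,
                0x3F800000, 0x7F800000, 0x7FC00000):
    print(f"{pattern:08X} {classify_single(pattern)}")
```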

Rounding
Why is rounding needed?
Infinitely many numbers → finite representation
Integers only overflow
Almost all operations need rounding
IEEE specifies algorithms for arithmetic

Numbers need rounding
Out of range:
– $x > 2 \times 2^{E_{max}}$, $x < 1 \times 2^{E_{min}}$
Between 2 floats:
– $0.1_{10} = 0.00011001100\ldots_2 = 1.1001100\ldots \times 2^{-4}$
– Rounded: $1.1001 \times 2^{-4}$
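The "between two floats" case is easy to observe in double precision. This short sketch uses only standard-library calls (`float.hex` and `decimal.Decimal`).

```python
from decimal import Decimal

# The hexadecimal form shows the repeating 1001 pattern (hex digit 9)
# and the exponent -4; the final digit is the result of rounding.
print((0.1).hex())  # 0x1.999999999999ap-4

# Decimal(0.1) is the exact value of the stored double, which differs
# from the decimal number 0.1.
print(Decimal(0.1) == Decimal("0.1"))  # False
```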

Measuring Error
ULPs (units in the last place)
– $1.2 \times 10^{-1}$ vs. 0.124: 0.4 ulps (one ulp is 0.01 here)
– $1.2 \times 10^{-1}$ vs. 0.118: 0.2 ulps
Relative error
– Difference / Original
– $1.2 \times 10^{-1}$ vs. 0.124: Err = 0.004/0.124 = 0.032

Calculate Using Rounding
Benign cancellation
– Calculate 10.1 - 9.93 (= 0.17)
    $1.01 \times 10^1$
  - $0.99 \times 10^1$
  = $0.02 \times 10^1 = 2.00 \times 10^{-1}$
– 30 ulps!

Rounding problems
Catastrophic cancellation
– $b^2 - 4ac$
– Both $b^2$ and $4ac$ are rounded
– The (-) exposes the error
– b = 3.34, a = 1.22, c = 2.28
– $b^2 = 11.2$, $4ac = 11.1$, $b^2 - 4ac = 0.10$
– Correct = 0.0292 (error: 70.8 ulps)
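The three-digit arithmetic can be replayed with the standard `decimal` module. This is a sketch that assumes round-half-even on every operation and measures the error in units of the last place of the computed result ($1.00 \times 10^{-1}$, so one ulp is 0.001).

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

three = Context(prec=3, rounding=ROUND_HALF_EVEN)  # 3 significant digits
a, b, c = Decimal("1.22"), Decimal("3.34"), Decimal("2.28")

b2 = three.multiply(b, b)                                    # 11.1556 -> 11.2
four_ac = three.multiply(three.multiply(Decimal(4), a), c)   # 11.1264 -> 11.1
rounded = three.subtract(b2, four_ac)                        # 0.1

exact = b * b - 4 * a * c          # default 28-digit context: 0.0292
error_in_ulps = (rounded - exact) / Decimal("0.001")

print(rounded, exact, error_in_ulps)
```

The subtraction itself is exact; it only exposes the rounding errors already made in the two products.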

IEEE Arithmetic
Requirement:
+, -, ×, ÷, √ should be EXACTLY rounded
Remainder should be EXACTLY rounded
Integer conversion should be EXACTLY rounded
Not all operations (transcendental functions, binary to decimal)
“Tie break”: round to even

Round to Even
How will 1.005 be rounded?
– Round up: 1.01
– Round even: 1.00
Why? Example:
– $x_i = (x_{i-1} + y) - y$, $x_0 = 1.00$, y = 0.125
– Round up: 1.00, 1.01, 1.02, …
– Round even: 1.00, 1.00, 1.00, …
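The drift can be reproduced with the `decimal` module. This sketch assumes that every intermediate result is rounded to two digits after the decimal point, which is one reading of the slide's example; `iterate` is a hypothetical helper.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

STEP = Decimal("0.01")  # assumption: keep two digits after the point


def iterate(rounding: str, steps: int = 3) -> list[Decimal]:
    """Apply x = (x + y) - y repeatedly, rounding after each operation."""
    x, y = Decimal("1.00"), Decimal("0.125")
    sequence = [x]
    for _ in range(steps):
        x = (x + y).quantize(STEP, rounding=rounding)
        x = (x - y).quantize(STEP, rounding=rounding)
        sequence.append(x)
    return sequence


print([str(v) for v in iterate(ROUND_HALF_UP)])    # ['1.00', '1.01', '1.02', '1.03']
print([str(v) for v in iterate(ROUND_HALF_EVEN)])  # ['1.00', '1.00', '1.00', '1.00']
```

Rounding ties upward adds a small bias at every step, while rounding ties to even lets the errors cancel.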

Float Multiplication
Integer multiply (significands)
Biased addition (exponents)
“Biased addition”:
– Detect overflow: use an n+1 bit adder
– Detect underflow: harder (denormals)

Rounding Multiplication
1.23 × 6.78 = 8.3394: round to 8.34
2.83 × 4.47 = 12.6501: round to 1.27 (× $10^1$)
1.28 × 7.81 = 09.9968: round to 1.00 (× $10^1$)
[Figure: binary cases 1.0001|1, 1.0010|0, 1.0010|1 and 0.1101|0, labelled “Round bit 0”, “Round bit 1, all rest 0” and “Shift needed”]

Round, Guard, Sticky
0.110100010: number, guard, round, sticky
1.001000100: number, round, sticky

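Rounding Multiplication
[Figure: registers P and A (n bits each, shifting together with a carry) and B (n bits); labels: product, results, round digit, sticky bit]
Product: $x_0 x_1 . x_2 x_3 x_4 x_5 \; g \; r \; s \; s \; s \; s$
Case 1: $x_0 = 0$, shift: result is $x_1 . x_2 x_3 x_4 x_5 g$
Case 2: $x_0 = 1$, increment the exponent: result is $x_0 . x_1 x_2 x_3 x_4 x_5$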

Rounding rules
r = 0 → rounded OK
r = 1, s = 1 → add 1 to LSB
r = 1, s = 0 → add 1 if LSB = 1
Denormals → extra shifting
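The three rules amount to round-to-nearest with ties going to even. This is a minimal sketch operating on the kept significand as an integer; the function name is hypothetical, and a carry out of the significand (which would require renormalizing) is not handled.

```python
def round_nearest_even(sig: int, r: int, s: int) -> int:
    """Round a truncated significand using the round bit r and sticky bit s."""
    if r == 0:
        return sig              # less than half an ulp discarded: keep
    if s == 1:
        return sig + 1          # more than half an ulp discarded: round up
    return sig + (sig & 1)      # exact tie: round so that the LSB becomes 0


print(format(round_nearest_even(0b10001, 1, 0), "05b"))  # 10010 (tie, odd LSB)
print(format(round_nearest_even(0b10010, 1, 0), "05b"))  # 10010 (tie, even LSB)
print(format(round_nearest_even(0b10010, 1, 1), "05b"))  # 10011
```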

Float addition
Compute all digits and round?
– $1.00 \times 2^{20} + 1.00 \times 2^{-20} = 10000000\ldots0000001$
– Too long!
Use round and sticky bits:
– Shift to the same exponent
– r = first discarded digit
– s = OR of the rest of the discarded digits

Float addition - example
Calculate: $1.10011 \times 2^0 + 1.10001 \times 2^{-5}$
Shift exponents: $1.10011 \times 2^0 + 0.0000110001 \times 2^0$
Add the kept digits: 1.10011 + 0.00001 = 1.10100
Discarded digits 10001: r = 1, s = 0|0|0|1 = 1
Round needed! → 1.10101

Signed Addition/Subtraction
Simplest way: convert to 2’s complement
Cancellation of the high-order bit: shift
More bits cancel: how many guard digits?
Example: 1.00000 + 1.11111 = 0.11111
Example: 1.00000 - 0.00000101111, with complemented operand 1.1111101000 1 (1’s complement)

Float Division
Integer division (significands)
Biased subtraction (exponents)
Very similar to multiplication
Dividing using integer divide
Compute 2 more bits (round, guard)
Use the remainder as the sticky bit (Why?)
Sign bit: XOR
