CSE 246: Computer Arithmetic Algorithms and Hardware Design Instructor: Prof. Chung-Kuan Cheng Fall 2006 Lecture 9: Floating Point Numbers
Motivation: Maximal information within a given number of bits. Arithmetic with proper precision. Fairness of rounding. These features come at the expense of more complex operations.
Topics: Floating Point Numbers (IEEE P754): Standard, Operations, Exceptional Situations, Rounding Modes. Reference: Numerical Computing with IEEE Floating Point Arithmetic, Michael L. Overton, SIAM.
Standard: A word is typically 32 bits, giving $2^{32}$ bit patterns. Goal: dynamic range = largest number / smallest number. If the range is too large, there are holes between the representable numbers.
Standard: ulp (unit in the last place): the difference between two consecutive values of the significand. A number has 3 parts, $x = \pm s \times b^{e}$: sign, significand, exponent. The 32-bit format uses 1 sign bit, an 8-bit exponent, and a 23-bit significand.
Standard (32-bit format): bit layout is ± e1 e2 … e8 s1 s2 … s23.
Exponent field all zeros: denormalized number, $x = \pm 0.s_1 s_2 \dots s_{23} \times 2^{-126}$.
Exponent field $e$ from 1 to 254: normalized number, $x = \pm 1.s_1 s_2 \dots s_{23} \times 2^{e-127}$.
Exponent field all ones: $x = \pm\mathrm{Inf}$ if $(s_1 \dots s_{23}) = 0$, NaN otherwise.
NaN: Not a Number.
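The 32-bit layout above (sign bit, 8-bit exponent, 23-bit significand) can be decoded by hand. The following is a minimal sketch, not part of the lecture: the function names `decode_single` and `hardware_single` are hypothetical, and the exponent bias of 127 is taken from the IEEE 754 single format.

```python
import struct

def decode_single(bits: int) -> float:
    """Decode a 32-bit pattern by hand, following the single-format layout."""
    sign = -1.0 if (bits >> 31) & 1 else 1.0
    exponent = (bits >> 23) & 0xFF      # 8-bit exponent field
    fraction = bits & 0x7FFFFF          # 23-bit significand field
    if exponent == 0xFF:                # all ones: Inf or NaN
        return sign * float("inf") if fraction == 0 else float("nan")
    if exponent == 0:                   # all zeros: denormalized, 0.s * 2^-126
        return sign * (fraction / 2**23) * 2.0**-126
    # normalized: implicit leading 1, biased exponent
    return sign * (1 + fraction / 2**23) * 2.0**(exponent - 127)

def hardware_single(bits: int) -> float:
    """Let the struct module reinterpret the same 32 bits as a float."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

if __name__ == "__main__":
    for pattern in (0x00000001, 0x00800000, 0x3F800000, 0x7F7FFFFF, 0x7F800000):
        print(f"{pattern:#010x} -> {decode_single(pattern)!r}")
```

Comparing `decode_single` against `hardware_single` on a few patterns is a quick way to check the table by machine.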
Standard: $0.01 \times 2^{-3} = 0.001 \times 2^{-2}$ is the same number, so normalize to remove the redundancy. Use a default 1 in front for one more bit of precision. Smallest number: $0.00 \dots 01 \times 2^{-126} = 1.0 \times 2^{-23} \times 2^{-126} = 1 \times 2^{-149}$.
Standard - Example: bit layout is ± eeeeeeee sssss sssss sssss sssss sss. [Table of example encodings; the bit patterns and values were lost in extraction. Surviving labels: "normalized minimum"; final row scaled by $2^{1}$.]
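Standard - Example (cont.): [Remaining rows of the example table; the bit patterns were lost in extraction. Surviving labels: "Normalized Maximum", "Inf".] $N_{min} = 1.0 \times 2^{-126}$, $N_{max} = (2 - 2^{-23}) \times 2^{127}$.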
Double Floating Point: bit layout is ± e1 e2 … e11 s1 s2 … s52.
Exponent field all zeros: $x = \pm 0.s_1 s_2 \dots s_{52} \times 2^{-1022}$.
Exponent field $e$ from 1 to 2046: $x = \pm 1.s_1 s_2 \dots s_{52} \times 2^{e-1023}$.
Exponent field all ones: $x = \pm\mathrm{Inf}$ if $(s_1 \dots s_{52}) = 0$, NaN otherwise.
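Python's built-in `float` is an IEEE double on common platforms, so the limits of the double format can be checked directly. This is a small sketch under that assumption; the variable names are illustrative only.

```python
import math
import sys

# Smallest normalized double: 1.0 * 2^-1022
n_min = sys.float_info.min
# Largest finite double: (2 - 2^-52) * 2^1023
n_max = sys.float_info.max
# Smallest denormalized double: 2^-52 * 2^-1022 = 2^-1074
smallest_denormal = math.ldexp(1.0, -1074)

if __name__ == "__main__":
    print("N_min             =", n_min)
    print("N_max             =", n_max)
    print("smallest denormal =", smallest_denormal)
    print("significand bits (incl. hidden 1) =", sys.float_info.mant_dig)
```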
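Overflow/Underflow: [Figure: number line from $N_{min}$ to $N_{max}$. Representable numbers are denser near $N_{min}$ and sparser toward $N_{max}$. Magnitudes beyond $N_{max}$ overflow; magnitudes below $N_{min}$ underflow.]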
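Addition/Multiplication:
Addition: $(\pm s_1 \times b^{e_1}) + (\pm s_2 \times b^{e_2}) = \pm s \times b^{e}$, computed as $\pm s_1 \times b^{e_1} + (\pm s_2 / b^{e_1 - e_2}) \times b^{e_1} = (\pm s_1 + \pm s_2 / b^{e_1 - e_2}) \times b^{e_1}$.
Multiplication: $(\pm s_1 \times b^{e_1}) \times (\pm s_2 \times b^{e_2}) = \pm (s_1 \times s_2) \times b^{e_1 + e_2}$.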
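The alignment step in the addition formula can be sketched with exact rational arithmetic. This is an illustration only: `fp_add` and `fp_mul` are hypothetical names, signed significands are passed as plain numbers, and the sketch deliberately skips the normalization and rounding that real hardware performs afterwards.

```python
from fractions import Fraction

def fp_add(s1, e1, s2, e2, b=2):
    """Return (s, e) with s * b**e == s1 * b**e1 + s2 * b**e2.

    The operand with the smaller exponent is shifted right by the
    exponent difference so that both share the larger exponent.
    """
    if e1 < e2:
        s1, e1, s2, e2 = s2, e2, s1, e1
    aligned = Fraction(s2) / Fraction(b) ** (e1 - e2)
    return Fraction(s1) + aligned, e1

def fp_mul(s1, e1, s2, e2):
    """Multiply significands and add exponents (no normalization)."""
    return Fraction(s1) * Fraction(s2), e1 + e2

if __name__ == "__main__":
    # 1.5 * 2^3 + 1.25 * 2^1 = 12 + 2.5 = 14.5
    s, e = fp_add(Fraction(3, 2), 3, Fraction(5, 4), 1)
    print(s, e, float(s * 2**e))
    # 1.5 * 2^3 * 1.25 * 2^1 = 30
    s, e = fp_mul(Fraction(3, 2), 3, Fraction(5, 4), 1)
    print(s, e, float(s * 2**e))
```

Using `Fraction` keeps the shifted significand exact, which makes it easy to see which low-order bits a fixed-width adder would have to round away.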
Exceptions (a finite unless stated otherwise):
a / 0 = Inf if a > 0
a / Inf = 0 if a != 0
a · 0 = 0
a · Inf = Inf if a > 0
a + Inf = Inf
0 · Inf = invalid operation (NaN)
0 / 0 = invalid operation (NaN)
Inf - Inf = NaN
NaN op a = NaN
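Several of these rules can be observed directly with Python floats, assuming the platform uses IEEE doubles. Note that Python raises `ZeroDivisionError` for float division by zero instead of returning Inf or NaN, so the a/0 and 0/0 rows are left out of this sketch.

```python
import math

inf = math.inf
nan = math.nan
a = 3.0  # an arbitrary finite positive value, chosen for illustration

results = {
    "a / Inf": a / inf,
    "a * 0": a * 0.0,
    "a * Inf": a * inf,
    "a + Inf": a + inf,
    "0 * Inf": 0.0 * inf,
    "Inf - Inf": inf - inf,
    "NaN op a": nan + a,
}

if __name__ == "__main__":
    for name, value in results.items():
        print(f"{name:10s} = {value}")
```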
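Rounding Mode: Adder output = $c_{out}\, z_1 z_0 . z_{-1} z_{-2} \dots z_{-l}\, G\, R\, S$. G: guard bit. R: round bit. S: sticky bit, the OR of all bits below bit R. [Example operands lost in extraction; result scaled by $2^{4}$.] Normalize: the result may then need to be rounded.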
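The guard, round, and sticky bits can be illustrated on an integer that carries a few extra low-order bits. The sketch below is an assumption-laden toy, not the adder from the slide: `grs_bits` and `round_nearest_even` are hypothetical names, and `extra` is the number of low bits to be discarded (at least 2).

```python
from typing import Tuple

def grs_bits(value: int, extra: int) -> Tuple[int, int, int]:
    """Return (guard, round, sticky) for the `extra` low bits of `value`."""
    if extra < 2:
        raise ValueError("need at least a guard and a round bit")
    guard = (value >> (extra - 1)) & 1            # first discarded bit
    rnd = (value >> (extra - 2)) & 1              # second discarded bit
    below = value & ((1 << (extra - 2)) - 1)      # everything below the round bit
    sticky = 1 if below != 0 else 0               # OR of those bits
    return guard, rnd, sticky

def round_nearest_even(value: int, extra: int) -> int:
    """Drop `extra` low bits, rounding to nearest with ties to even."""
    kept = value >> extra
    guard, rnd, sticky = grs_bits(value, extra)
    more_than_half = guard and (rnd or sticky)
    exactly_half = guard and not (rnd or sticky)
    if more_than_half or (exactly_half and kept & 1):
        kept += 1
    return kept

if __name__ == "__main__":
    for v in (0b10111000, 0b10101000, 0b10101001, 0b10100111):
        print(f"{v:08b} -> GRS={grs_bits(v, 4)} -> {round_nearest_even(v, 4):b}")
```

The sticky bit is what lets the rounder tell an exact tie from a value slightly above the halfway point without keeping every discarded bit.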
Rounding: [Figure: the normalize step and the guard bit.]
Rounding: four modes. Round to the nearest even. Toward 0. Toward +Inf. Toward -Inf.
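The four modes can be compared side by side with Python's `decimal` module. This sketch rounds decimal values to integers, not binary significands, so it only illustrates the direction of each mode; the dictionary `MODES` and the function `round_all` are illustrative names.

```python
from decimal import (Decimal, ROUND_CEILING, ROUND_DOWN, ROUND_FLOOR,
                     ROUND_HALF_EVEN)

# Map each rounding mode named on the slide to a decimal-module constant.
MODES = {
    "nearest even": ROUND_HALF_EVEN,
    "toward 0": ROUND_DOWN,
    "toward +Inf": ROUND_CEILING,
    "toward -Inf": ROUND_FLOOR,
}

def round_all(text: str) -> dict:
    """Round the decimal string `text` to an integer under every mode."""
    x = Decimal(text)
    return {name: int(x.quantize(Decimal("1"), rounding=mode))
            for name, mode in MODES.items()}

if __name__ == "__main__":
    for sample in ("2.5", "3.5", "-2.5"):
        print(sample, round_all(sample))
```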
Conventional Rounding: [Table of values and their rounding errors; entries lost in extraction apart from the value 1.101.] Average error = 0.5/4 = 0.125.
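The average error of 0.125 can be reproduced under one plausible reading of the lost table: two fractional bits are dropped, so the four cases are .00, .01, .10, and .11, and conventional rounding sends the halfway case upward. That reading is an assumption, as are the function names below. The sketch also evaluates round-to-nearest-even over the same cases for comparison.

```python
from fractions import Fraction
from math import floor

HALF = Fraction(1, 2)

def round_half_up(x: Fraction) -> int:
    """Conventional rounding: the halfway case always goes up."""
    return floor(x + HALF)

def round_half_even(x: Fraction) -> int:
    """Round to nearest; the halfway case goes to the even neighbour."""
    low = floor(x)
    rest = x - low
    if rest > HALF or (rest == HALF and low % 2 == 1):
        return low + 1
    return low

def average_error(rounder) -> Fraction:
    """Mean of (rounded - exact) over k + j/4 for k in {0, 1}, j in 0..3."""
    values = [Fraction(k) + Fraction(j, 4) for k in (0, 1) for j in range(4)]
    return sum(rounder(v) - v for v in values) / len(values)

if __name__ == "__main__":
    print("conventional :", average_error(round_half_up))
    print("nearest even :", average_error(round_half_even))
```

Both an even and an odd integer part are included so that the tie case is exercised in each direction for round-to-nearest-even.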