Chapter 6: Computer Arithmetic and the ALU

Chapter 6: Computer Arithmetic and the ALU
Topics:
6.1 Number Systems and Radix Conversion
6.2 Fixed Point Arithmetic
6.3 Seminumeric Aspects of ALU Design
6.4 Floating-Point Arithmetic

Number Systems
Number systems consist of a base (or radix) and a set of symbols:
Decimal notation (0d1234) - Base: 10; Symbols: 0-9; Values: obvious
Binary notation (0b1010) - Base: 2; Symbols: 0, 1; Values: obvious
Hexadecimal notation (0x10EF) - Base: 16; Symbols: 0-9, A-F; Values: 0-9 = 0-9, A = 10, B = 11, C = 12, D = 13, E = 14, F = 15
Octal notation (0c077) - Base: 8; Symbols: 0-7; Values: obvious

Number Representation
An m-digit, base-b number is written as a string of m digits:
x = x_(m-1) x_(m-2) ... x_1 x_0
where each digit x_i is in the range 0 ≤ x_i ≤ b-1. So:
base 2: 0 ≤ x_i ≤ 1 (x_i = 0, 1)
base 8: 0 ≤ x_i ≤ 7 (x_i = 0-7)
base 10: 0 ≤ x_i ≤ 9 (x_i = 0-9)
base 16: 0 ≤ x_i ≤ 15 (x_i = 0-9, A-F)
The value of the ith digit is value(x_i) = x_i * b^i, and the value of x is the sum of its digit values:
value(x) = x_(m-1)*b^(m-1) + ... + x_1*b^1 + x_0*b^0

Number Representation Examples
327_10 = 3 * 10^2 + 2 * 10^1 + 7 * 10^0
1011_2 = 1 * 2^3 + 0 * 2^2 + 1 * 2^1 + 1 * 2^0 = 11_10
ED7_16 = 14(E) * 16^2 + 13(D) * 16^1 + 7 * 16^0 = 3584 + 208 + 7 = 3799_10
3755_8 = 3 * 8^3 + 7 * 8^2 + 5 * 8^1 + 5 * 8^0 = 1536 + 448 + 40 + 5 = 2029_10

Number Representation
For a base-b number of m digits, the maximum value that can be represented is x_max = b^m - 1.
Examples:
4-digit base 10: x_max = 10^4 - 1 = 9999_10
4-digit base 2: x_max = 2^4 - 1 = 1111_2 = 15_10
3-digit base 8: x_max = 8^3 - 1 = 777_8 = 511_10

Fractions
Fractions are numbers with a radix point: x_(n-1) x_(n-2) ... x_1 x_0 . x_(-1) x_(-2) ... x_(-m)
A number in a fixed-length computer register with its radix point assumed to be in a fixed position (even all the way to the right) is called a fixed point number.
Each digit to the right of the radix point has a value of value(x_i) = x_i * b^i, but now i runs from -1 to -m.
Examples:
43.5_10 = 4 * 10^1 + 3 * 10^0 + 5 * 10^-1
1101.1010_2 = 1 * 2^3 + 1 * 2^2 + 1 * 2^0 + 1 * 2^-1 + 1 * 2^-3 = 13.625_10

Radix Conversion
Converting from base b to the calculator's base c (e.g., converting numbers into base 10):
1) Start with the base-b number x = x_(m-1) x_(m-2) ... x_1 x_0
2) Initialize the base-c value y = 0
3) Left to right, get the next digit (symbol) x_i
4) Convert the base-b symbol x_i to a base-c number y_i (by means of a table)
5) Update the base-c value: y = y*b + y_i
6) If there are more digits, repeat from step 3
Example: convert AC2_16 to base 10
y = 0
y = 0*16 + A(=10) = 10
y = 10*16 + C(=12) = 172
y = 172*16 + 2 = 2754_10
Example: convert 753_8 to base 10
y = 0
y = 0*8 + 7 = 7
y = 7*8 + 5 = 61
y = 61*8 + 3 = 491_10
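
A minimal Python sketch of this left-to-right update (function name and digit table are mine, not from the slides):

```python
DIGITS = "0123456789ABCDEF"  # symbol-to-value table for bases up to 16

def to_decimal(digits: str, base: int) -> int:
    """Convert a base-b digit string to an integer, left to right (y = y*b + yi)."""
    y = 0
    for symbol in digits.upper():
        yi = DIGITS.index(symbol)     # step 4: convert the symbol to a value
        y = y * base + yi             # step 5: update the accumulated value
    return y

print(to_decimal("AC2", 16))   # 2754
print(to_decimal("753", 8))    # 491
```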

Radix Conversion
Converting from the calculator's base c to base b (e.g., converting numbers from base 10 to another base):
1) Start with the base-c integer x to be converted
2) Initialize i = 0 and v = x; digits are produced right to left
3) Calculate D_i = v mod b and v = floor(v/b); convert D_i to base b to get y_i
4) Set i = i + 1; if v ≠ 0, repeat from step 3
Example: convert 3661_10 to base 16
3661 ÷ 16 = 228 (rem = 13) → y_0 = D(=13)
228 ÷ 16 = 14 (rem = 4) → y_1 = 4
14 ÷ 16 = 0 (rem = 14) → y_2 = E
Thus 3661_10 = E4D_16

Radix Conversion Examples
Example: convert 235_10 to base 2
235 ÷ 2 = 117 (rem = 1) → y_0 = 1
117 ÷ 2 = 58 (rem = 1) → y_1 = 1
58 ÷ 2 = 29 (rem = 0) → y_2 = 0
29 ÷ 2 = 14 (rem = 1) → y_3 = 1
14 ÷ 2 = 7 (rem = 0) → y_4 = 0
7 ÷ 2 = 3 (rem = 1) → y_5 = 1
3 ÷ 2 = 1 (rem = 1) → y_6 = 1
1 ÷ 2 = 0 (rem = 1) → y_7 = 1
Thus 235_10 = 11101011_2

More Radix Conversion Examples
Example: convert 1257_10 to base 8
1257 ÷ 8 = 157 (rem = 1) → y_0 = 1
157 ÷ 8 = 19 (rem = 5) → y_1 = 5
19 ÷ 8 = 2 (rem = 3) → y_2 = 3
2 ÷ 8 = 0 (rem = 2) → y_3 = 2
Thus 1257_10 = 2351_8
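
The repeated-division procedure, sketched in Python (the function name is illustrative):

```python
DIGITS = "0123456789ABCDEF"

def from_decimal(x: int, base: int) -> str:
    """Convert a non-negative integer to a base-b digit string by repeated division."""
    if x == 0:
        return "0"
    digits = []
    v = x
    while v != 0:
        v, d = divmod(v, base)     # step 3: d = v mod b, v = floor(v / b)
        digits.append(DIGITS[d])   # digits are produced right to left
    return "".join(reversed(digits))

print(from_decimal(3661, 16))  # E4D
print(from_decimal(235, 2))    # 11101011
print(from_decimal(1257, 8))   # 2351
```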

Radix Conversion - a more ad-hoc approach
Example: convert 1135_10 to base 2 by subtracting out the largest remaining power of 2 at each step:
1135 - 1024 (2^10) = 111
111 - 64 (2^6) = 47
47 - 32 (2^5) = 15
15 - 8 (2^3) = 7
7 - 4 (2^2) = 3
3 - 2 (2^1) = 1
1 - 1 (2^0) = 0
Setting a 1 in each bit position used (2^10, 2^6, 2^5, 2^3, 2^2, 2^1, 2^0) and 0 elsewhere:
Thus 1135_10 = 10001101111_2

Radix Conversion - Digit Grouping
Hex-binary table:
0 = 0000, 1 = 0001, 2 = 0010, 3 = 0011, 4 = 0100, 5 = 0101, 6 = 0110, 7 = 0111,
8 = 1000, 9 = 1001, A = 1010, B = 1011, C = 1100, D = 1101, E = 1110, F = 1111
Octal-binary table:
0 = 000, 1 = 001, 2 = 010, 3 = 011, 4 = 100, 5 = 101, 6 = 110, 7 = 111
Example: convert 10001101111_2 to base 16 - group into 4-bit groups from the right:
0100 0110 1111 → 4 6 F
Thus 10001101111_2 = 46F_16
Example: convert 10001101111_2 to base 8 - group into 3-bit groups from the right:
010 001 101 111 → 2 1 5 7
Thus 10001101111_2 = 2157_8

Representing Negative Integer Numbers
There are four methods that can be used to represent negative numbers:
Sign-magnitude - decimal: +435, -3102, etc.; binary: use a sign bit, e.g. +3_10 = 0011_2, -3_10 = 1011_2
Radix complement (e.g. 2's complement)
Diminished radix complement (e.g. 1's complement)
Bias or excess - used in floating point numbers

Complement Operations for m-Digit Base b Numbers
The radix complement of an m-digit base-b number x is: x_c = (b^m - x) mod b^m
The diminished radix complement of x is: x_c = b^m - 1 - x
The complement operation is used to define the relationship between the unsigned number representing a negative number and its absolute value.
Complement number systems use unsigned base-b numbers to represent both positive and negative numbers.
The complement of a number in the range 0 ≤ x ≤ b^m - 1 is in the same range.

Tbl 6.1 Complement Representations of Negative Numbers
Radix Complement:
  0 ≤ x < b^m/2:    representation is x
  -b^m/2 ≤ x < 0:   representation is |x|_c = b^m - |x|
Diminished Radix Complement:
  0:                representation is 0 or b^m - 1
  0 < x < b^m/2:    representation is x
  -b^m/2 < x < 0:   representation is |x|_c = b^m - 1 - |x|
For even b, the radix complement system represents one more negative value than positive values, while the diminished radix complement system has 2 zeros but represents the same number of positive and negative values.

Tbl 6.2 Base 2 Complement Representations
8-Bit 2's Complement:
  0 ≤ x < 128:    representation is x
  -128 ≤ x < 0:   representation is 256 - |x|
  Numbers go from -128 to 127
8-Bit 1's Complement:
  0:              representation is 0 or 255
  0 < x < 128:    representation is x
  -127 ≤ x < 0:   representation is 255 - |x|
  Numbers go from -127 to 127
In 1's complement, 255 = 11111111_2 is often called -0
In 2's complement, -128 = 10000000_2 is a legal value, but trying to negate it gives overflow

Example: base 2 - 4 bits
2's complement:
Number  Representation  Negative
0       0000            0000
1       0001            1111
2       0010            1110
3       0011            1101
4       0100            1100
5       0101            1011
6       0110            1010
7       0111            1001
8       -----           1000
Calculation of the negative value: x_c = (b^m - x) mod b^m
Ex. 7: -7 = 10000 - 0111 = 1001
Or simply invert and add 1: 0111_c = 1000 + 1 = 1001
1's complement:
Number  Representation  Negative
0       0000            1111
1       0001            1110
2       0010            1101
3       0011            1100
4       0100            1011
5       0101            1010
6       0110            1001
7       0111            1000
Calculation of the negative value: x_c = b^m - 1 - x
Ex. 7: -7 = 10000 - 1 - 0111 = 1111 - 0111 = 1000
Or simply invert: 0111_c = 1000
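
A sketch of both negation rules for m-bit binary words (helper names are mine):

```python
def ones_complement(x: int, m: int) -> int:
    """Diminished radix complement in base 2: invert all m bits, i.e. 2^m - 1 - x."""
    return (1 << m) - 1 - x

def twos_complement(x: int, m: int) -> int:
    """Radix complement in base 2: invert and add 1, i.e. (2^m - x) mod 2^m."""
    return ((1 << m) - x) % (1 << m)

m = 4
print(format(twos_complement(0b0111, m), "04b"))  # 1001  (-7 in 2's complement)
print(format(ones_complement(0b0111, m), "04b"))  # 1000  (-7 in 1's complement)
```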

Shifting
Shifting a number left one digit is equivalent to multiplication by the base b (insert a 0 on the right).
Examples:
327_10 left shift → 3270_10
0101_2 = 5_10 left shift → 1010_2 = 10_10
1101_2 = -3_10 left shift → 1010_2 = -6_10 (2's complement)
In a left shift, overflow occurs if the sign changes.
Example: 1010_2 = -6_10 left shift → 0100_2  Overflow!

Shifting (cont.)
Shifting a number right one digit is equivalent to division by the base b (for signed numbers, copy the sign bit from the MSB into the MSB-1 position - called an arithmetic right shift).
No overflow can occur, but any fractional part of the division is lost.
Examples:
0111_2 = 7_10 right shift → 0011_2 = 3_10 (the ½ in the true quotient 3.5 is lost)
1011_2 = -5_10 right shift → 1101_2 = -3_10 (the ½ in the true quotient -2.5 is lost; the result rounds toward -∞)
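
Python integers are unbounded, so an m-bit arithmetic shift has to be modelled explicitly; a minimal sketch (helper names are mine):

```python
def to_signed(x: int, m: int) -> int:
    """Interpret an m-bit pattern as a 2's complement signed integer."""
    return x - (1 << m) if x & (1 << (m - 1)) else x

def arith_shift_left(x: int, m: int) -> int:
    """Shift an m-bit pattern left one place (low bit becomes 0); may overflow."""
    return (x << 1) & ((1 << m) - 1)

def arith_shift_right(x: int, m: int) -> int:
    """Shift an m-bit pattern right one place, copying the sign bit."""
    return (x >> 1) | (x & (1 << (m - 1)))

m = 4
print(to_signed(arith_shift_left(0b1101, m), m))    # -3 * 2 -> -6  (1010)
print(to_signed(arith_shift_right(0b1011, m), m))   # -5 / 2 -> -3  (1101)
```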

Fixed Point Arithmetic - Addition
Complement number systems allow the addition of signed numbers with an adder for unsigned numbers.
Overflow can only occur when the operands have the same sign, and it is detected when the result has the opposite sign from the operands.
Unsigned addition of m-digit base-b numbers x and y:
1) Initialize the digit counter j = 0 and the carry-in c_0 = 0
2) Produce digit j of the sum, s_j = (x_j + y_j + c_j) mod b, and the carry c_(j+1) = floor((x_j + y_j + c_j)/b)
3) Increment j = j + 1 and repeat from step 2 if j < m

Fixed Point Arithmetic - Addition
Example:
   101101_2
 + 011001_2
 = 1000110_2
Hardware implementation - the ripple carry adder: a chain of full adders, one per bit, where
s_j = x_j ⊕ y_j ⊕ c_j
c_(j+1) = x_j·y_j + x_j·c_j + y_j·c_j
The carry and sum of stage j+1 depend on the carry from stage j - thus the correct values ripple from right to left.
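
A bit-level sketch of the ripple-carry scheme (gate delays aside), using the sum and carry equations above; function names are mine:

```python
def full_adder(x, y, c):
    """One full-adder stage: s = x XOR y XOR c, carry = xy + xc + yc."""
    s = x ^ y ^ c
    carry = (x & y) | (x & c) | (y & c)
    return s, carry

def ripple_add(x_bits, y_bits):
    """Add two equal-length bit lists (LSB first); return sum bits and the final carry."""
    c = 0
    s_bits = []
    for x, y in zip(x_bits, y_bits):
        s, c = full_adder(x, y, c)   # the carry ripples into the next stage
        s_bits.append(s)
    return s_bits, c

# 101101 + 011001 = 1000110 (45 + 25 = 70); bit lists are LSB first
x = [1, 0, 1, 1, 0, 1]
y = [1, 0, 0, 1, 1, 0]
s, c = ripple_add(x, y)
print(c, list(reversed(s)))  # 1 [0, 0, 0, 1, 1, 0]
```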

Fig 6.2 Base b Radix Complement Subtracter
To do subtraction in the radix complement system, it is only necessary to negate (radix complement) the 2nd operand.
It is easy to take the diminished radix complement, and the adder's carry-in supplies the +1.
(Figure: an adder whose y input passes through a base-b diminished radix complementer, with the carry-in tied to 1.)

Fig 6.3 2's Complement Adder/Subtracter
A multiplexer to select y or its complement reduces to an exclusive-OR gate on each bit:
q_j = y_j·r' + y_j'·r = y_j ⊕ r
where r is the subtract control line, which also drives the carry-in of the full-adder chain.

Problem with Ripple Carry Addition/Subtraction
t_c - full adder carry propagation time
t_s - full adder sum propagation time
The final sum bit is available after (m-1)·t_c + t_s and the final carry out after m·t_c, so the total propagation time of the adder is proportional to the number of bits.
A 64-bit adder with t_c = 200 ps would take 12.8 ns to compute the final carry out - which corresponds to less than an 80 MHz clock speed - SLOW!

Solution - Carry Lookahead
Generate a carry across groups of bits using only the bits to be summed and the carry into that group.
Consider the jth bit (stage) of an addition operation - there will be a carry out of this stage if:
- the bits in this stage generate a carry, or
- there is a carry into this stage and the bits in this stage propagate that carry to the next stage
What conditions generate and propagate carry bits?

Binary Propagate and Generate Signals
Generate for digit j: G_j = x_j·y_j
Propagate for digit j: P_j = x_j + y_j
(Of course x_j + y_j covers x_j·y_j, but it still corresponds to a carry out for a carry in.)
Carries can then be written:
c_(j+1) = G_j + P_j·c_j
c_1 = G_0 + P_0·c_0
c_2 = G_1 + P_1·G_0 + P_1·P_0·c_0
c_3 = G_2 + P_2·G_1 + P_2·P_1·G_0 + P_2·P_1·P_0·c_0
c_4 = G_3 + P_3·G_2 + P_3·P_2·G_1 + P_3·P_2·P_1·G_0 + P_3·P_2·P_1·P_0·c_0
In words, the c_2 logic reads: c_2 is one if digit 1 generates a carry, or if digit 0 generates one and digit 1 propagates it, or if digits 0 and 1 both propagate a carry-in.
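
A sketch of these lookahead equations for a 4-bit group: all carries are produced from G, P, and c_0 without waiting for a ripple (function name is mine):

```python
def lookahead_carries(x_bits, y_bits, c0):
    """Compute c1..c4 for a 4-bit group (LSB first) from generate/propagate signals."""
    G = [x & y for x, y in zip(x_bits, y_bits)]   # Gj = xj * yj
    P = [x | y for x, y in zip(x_bits, y_bits)]   # Pj = xj + yj
    c1 = G[0] | (P[0] & c0)
    c2 = G[1] | (P[1] & G[0]) | (P[1] & P[0] & c0)
    c3 = G[2] | (P[2] & G[1]) | (P[2] & P[1] & G[0]) | (P[2] & P[1] & P[0] & c0)
    c4 = (G[3] | (P[3] & G[2]) | (P[3] & P[2] & G[1])
          | (P[3] & P[2] & P[1] & G[0]) | (P[3] & P[2] & P[1] & P[0] & c0))
    return c1, c2, c3, c4

# 1101 + 0110 given LSB first: the carries match a ripple-carry addition
print(lookahead_carries([1, 0, 1, 1], [0, 1, 1, 0], 0))   # (0, 0, 1, 1)
```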

Speed Gains with Carry Lookahead
It takes one gate to produce a G or P, two levels of gates for any carry, and 2 more for the full adders.
The number of OR gate inputs (terms) and AND gate inputs (literals in a term) grows as the number of carries generated by lookahead.
The real power of this technique comes from applying it recursively. For a group of 4 digits, an overall (group) generate is:
G_0^1 = G_3 + P_3·G_2 + P_3·P_2·G_1 + P_3·P_2·P_1·G_0
and an overall (group) propagate is:
P_0^1 = P_3·P_2·P_1·P_0
(the superscript denotes the lookahead level)

Recursive Carry Lookahead Scheme
It is not practical to apply this directly to wide (e.g. 64-bit) adders - however, the idea extends to a multilevel scheme.
Rewrite the equation for c_4 as: c_4 = G_0^1 + P_0^1·c_0, where
G_0^1 = G_3 + P_3·G_2 + P_3·P_2·G_1 + P_3·P_2·P_1·G_0 and P_0^1 = P_3·P_2·P_1·P_0
For the next level up (level 2 - assuming 4 "bits" per group), for group 0:
G_0^2 = G_3^1 + P_3^1·G_2^1 + P_3^1·P_2^1·G_1^1 + P_3^1·P_2^1·P_1^1·G_0^1
P_0^2 = P_3^1·P_2^1·P_1^1·P_0^1
Keep building up in groups of k bits (a tree structure).
The delay is now proportional to log_k(m), because there are log_k(m) levels in the carry-lookahead tree.

Fig 6.4 Carry Lookahead Adder for Group Size k = 2
(Figure: full adders for each bit pair feed a lookahead level that computes the group G and P signals and the intermediate carries.)

64-Bit Adder Using Carry Lookahead with Group Size k = 4
Delays:
Level 0 propagate & generate: 1 gate delay
Level 1 propagate & generate: 2 gate delays
Level 2 propagate & generate: 2 gate delays
Level 3 intermediate carry: 2 gate delays
Level 2 intermediate carry: 2 gate delays
Level 1 intermediate carry: 2 gate delays
MSB sum & carry out from carry in: 1 gate delay
Total gate delays: 12

Binary Multiplication (unsigned)
Multiplication can be done using successive addition (as you have seen in the lab assignment).
A faster method is to multiply Y by X, k bits at a time, and add the resulting terms (k usually equals 1) - as is done by hand.
Example:
      1010    Multiplicand Y
    × 1101    Multiplier X = X3 X2 X1 X0
      1010    X0·Y·2^0
     0000     X1·Y·2^1
    1010      X2·Y·2^2
   1010       X3·Y·2^3
  10000010
There are many ways to implement this in hardware, but perhaps the simplest is the shift-add mechanism.

Unsigned Binary Multiplication Datapath
(Figure: an n-bit multiplicand register M, an n-bit adder, and a 2n+1-bit shift register made up of the carry bit C, accumulator A, and multiplier register Q, all sequenced by a control unit.)

Unsigned Binary Multiplication Algorithm
1) C, A ← 0; M ← multiplicand; Q ← multiplier; Count ← 0
2) If Q0 = 1, then C, A ← A + M
3) Shift C, A, Q right one bit; Count ← Count + 1
4) If Count ≠ n, repeat from step 2; otherwise the product is in A, Q

Unsigned Binary Multiplication Example (Y = 1010, X = 1101)
M = 1010
C  A     Q
0  0000  1101   initialize registers; count = 0; test: Q0 = 1
0  1010  1101   A ← A + M
0  0101  0110   shift C,A,Q; count = 1; test: Q0 = 0
0  0010  1011   shift C,A,Q; count = 2; test: Q0 = 1
0  1100  1011   A ← A + M
0  0110  0101   shift C,A,Q; count = 3; test: Q0 = 1
1  0000  0101   A ← A + M
0  1000  0010   shift C,A,Q; count = 4; end
Product = A,Q = 10000010
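
A register-level sketch of the shift-add algorithm, mirroring the C, A, Q registers above (the function name and n parameter are mine):

```python
def unsigned_multiply(multiplicand: int, multiplier: int, n: int) -> int:
    """Shift-add multiplication of two n-bit unsigned numbers using C, A, Q registers."""
    mask = (1 << n) - 1
    M, A, Q, C = multiplicand & mask, 0, multiplier & mask, 0
    for _ in range(n):
        if Q & 1:                           # test Q0
            total = A + M
            C, A = total >> n, total & mask
        # shift C,A,Q right one bit: C -> A MSB, A LSB -> Q MSB
        Q = ((A & 1) << (n - 1)) | (Q >> 1)
        A = (C << (n - 1)) | (A >> 1)
        C = 0
    return (A << n) | Q                     # the product is in A,Q

print(unsigned_multiply(0b1010, 0b1101, 4))   # 130 = 0b10000010
```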

Signed Binary Multiplication
Many algorithms for signed multiplication exist, but the simplest one is similar to the one for unsigned multiplication except:
- Shifts are arithmetic (the sign bit is copied)
- The adder is 2's complement - the carry out is ignored
- If the sign bit of the multiplier is 1 (it is negative), then the last arithmetic operation before the final shift is a subtraction instead of an addition
A datapath very similar to the one for unsigned multiplication can be used - the C register is not needed, the shift register must be arithmetic, and the adder must be modified to do 2's complement addition and subtraction.
The algorithm is also very similar, with the exception of the final arithmetic operation.

Signed Binary Multiplication Example (Y = 3, X = -5)
M = 00000011
A         Q
00000000  11111011   initialize registers; count = 0; test: Q0 = 1
00000011  11111011   A ← A + M
00000001  11111101   arithmetic shift A,Q; count = 1; test: Q0 = 1
00000100  11111101   A ← A + M
00000010  01111110   arithmetic shift A,Q; count = 2; test: Q0 = 0
00000001  00111111   arithmetic shift A,Q; count = 3; test: Q0 = 1
00000100  00111111   A ← A + M
00000010  00011111   arithmetic shift A,Q; count = 4; test: Q0 = 1
00000101  00011111   A ← A + M
00000010  10001111   arithmetic shift A,Q; count = 5; test: Q0 = 1
00000101  10001111   A ← A + M
00000010  11000111   arithmetic shift A,Q; count = 6; test: Q0 = 1
00000101  11000111   A ← A + M
00000010  11100011   arithmetic shift A,Q; count = 7; test: Q0 = 1
11111111  11100011   A ← A - M (subtract because X is negative)
11111111  11110001   arithmetic shift A,Q; count = 8; end
Product = A,Q = -15

Booth's Algorithm for Signed Multiplication
A more efficient algorithm was developed by Booth in 1951.
It uses a single-bit register to the right of the least significant bit of Q, called Q-1.
The action to be taken on each step depends on the values of Q0 and Q-1:
if Q0 Q-1 = 01 then A ← A + M; arithmetic shift A,Q,Q-1
if Q0 Q-1 = 10 then A ← A - M; arithmetic shift A,Q,Q-1
if Q0 Q-1 = 00 or 11 then arithmetic shift A,Q,Q-1 only
This algorithm is more efficient than the previous one in that it "skips over" runs of consecutive 1s or 0s in the multiplier.

Booth's Algorithm Datapath
(Figure: an n-bit multiplicand register M, an n-bit 2's complement adder/subtractor, and a 2n+1-bit arithmetic shift register made up of A, Q, and the single bit Q-1, all sequenced by a control unit.)

Booth’s Algorithm

Booth's Algorithm Example (Y = 3, X = -5)
M = 00000011
A         Q         Q-1
00000000  11111011  0   initialize registers; count = 0; test: Q0 Q-1 = 10
11111101  11111011  0   A ← A - M
11111110  11111101  1   arithmetic shift A,Q,Q-1; count = 1; test: Q0 Q-1 = 11
11111111  01111110  1   arithmetic shift A,Q,Q-1; count = 2; test: Q0 Q-1 = 01
00000010  01111110  1   A ← A + M
00000001  00111111  0   arithmetic shift A,Q,Q-1; count = 3; test: Q0 Q-1 = 10
11111110  00111111  0   A ← A - M
11111111  00011111  1   arithmetic shift A,Q,Q-1; count = 4; test: Q0 Q-1 = 11
11111111  10001111  1   arithmetic shift A,Q,Q-1; count = 5; test: Q0 Q-1 = 11
11111111  11000111  1   arithmetic shift A,Q,Q-1; count = 6; test: Q0 Q-1 = 11
11111111  11100011  1   arithmetic shift A,Q,Q-1; count = 7; test: Q0 Q-1 = 11
11111111  11110001  1   arithmetic shift A,Q,Q-1; count = 8; end
Product = A,Q = -15
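
A sketch of Booth's recoding in Python; A, Q, and Q-1 are kept as n-bit and 1-bit values, and the shifts are arithmetic (helper names are mine):

```python
def booth_multiply(multiplicand: int, multiplier: int, n: int) -> int:
    """Booth's algorithm for n-bit 2's complement operands; returns the 2n-bit product."""
    mask = (1 << n) - 1
    M = multiplicand & mask
    A, Q, Q_1 = 0, multiplier & mask, 0
    for _ in range(n):
        pair = ((Q & 1) << 1) | Q_1
        if pair == 0b01:
            A = (A + M) & mask              # Q0 Q-1 = 01: A <- A + M
        elif pair == 0b10:
            A = (A - M) & mask              # Q0 Q-1 = 10: A <- A - M
        # arithmetic shift A,Q,Q-1 right one bit (the sign bit of A is copied)
        Q_1 = Q & 1
        Q = ((A & 1) << (n - 1)) | (Q >> 1)
        A = (A >> 1) | (A & (1 << (n - 1)))
    product = (A << n) | Q
    return product - (1 << 2 * n) if product & (1 << (2 * n - 1)) else product

print(booth_multiply(3, -5, 8))   # -15
```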

Floating Point Numbers
Fixed point numbers with a limited number of bits suffer from two problems: limited range and limited precision.
The solution is to use a representation equivalent to decimal scientific notation, such as -2.72 × 10^-2.
This format has four parts:
sign (+, -): -
significand (also sometimes called the mantissa): 2.72
exponent: -2
base of the exponent: 10

Fig 6.14 Floating-Point Number Format
A floating-point number is stored as a sign bit s, an exponent field e, and a significand (fraction) field f; with a base-2 exponent its value is (-1)^s × f × 2^e.
The base of the exponent is implied by the specific format.
The radix point position in the fraction is also implied (i.e., the fraction is really a fixed point number).

Signs in Floating-Point Numbers
Both the significand and the exponent have signs.
A complement representation could be used for f, but sign-magnitude is most common now.
The sign is placed at the left instead of with f, so the test for negative always looks at the leftmost bit.
The exponent could be 2's complement, but it is better to use a biased exponent:
if -e_min ≤ e ≤ e_max, where e_min, e_max > 0, then ê = e_min + e is always positive, so e is replaced by ê in the stored format; e_min is called the bias.

Normalized Floating-Point Numbers
There are multiple representations for a given floating-point value: if f2 = 2^d·f1 and e2 = e1 - d, then (s, f1, e1) and (s, f2, e2) have the same value.
Scientific notation example: 0.819 × 10^3 = 0.0819 × 10^4
A normalized floating-point number has a nonzero leftmost digit (exponent as small as possible).
Zero cannot fit this rule; it is usually written as all 0s.
In base 2, when the number is normalized, the bit immediately to the right of the radix point is always 1, so it can be left out - the so-called hidden or implied bit.

Comparison of Normalized Floating Point Numbers
If normalized numbers are viewed as integers, a biased exponent field to the left means an exponent unit is worth more than a significand unit.
The largest magnitude number with a given exponent is followed by the smallest one with the next higher exponent.
Thus normalized FP numbers can be compared for <, ≤, >, ≥, =, ≠ as if they were integers.
This is the reason for the s, e, f ordering of the fields and the use of a biased exponent, and one reason for normalized numbers.

Example 32-bit Floating Point Format
Bit 31: sign s; bits 30-23: exponent e; bits 22-0: fraction f
MSB is the sign bit
8-bit exponent to the right of the sign bit
Exponent is biased by 127
Exponent base is 2
23-bit significand
Normalized - the radix point is to the left of the stored significand; the implied '1' to the left of the radix point yields a 24-bit effective significand
The largest positive number that can be represented is:
1.11111111111111111111111_2 × 2^128 ≈ 6.81 × 10^38

Numbers in the Example Format
Example: represent 3.0
Sign: 0_2
Exponent: 00000000_2; Significand: 11.00000000000000000000000_2
Normalization requires moving the radix point one place to the left:
New exponent: 00000001_2; New significand: 1.10000000000000000000000_2
Biasing the exponent requires adding 127_10 (1111111_2):
New exponent: 10000000_2
Now construct the final result (sometimes called packing):
Result: 0 10000000 10000000000000000000000_2
(sign bit, exponent, significand 1.1_2 - the leading '1' and the radix point are implied)
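
For normalized values this example format coincides with IEEE-754 single precision, so Python's struct module can be used to check the packed bit pattern (a small sketch; the helper name is mine):

```python
import struct

def float_bits(x: float) -> str:
    """Return the 32-bit IEEE single-precision pattern of x as sign|exponent|fraction."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    bits = format(word, "032b")
    return f"{bits[0]} {bits[1:9]} {bits[9:]}"

print(float_bits(3.0))   # 0 10000000 10000000000000000000000
```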

Numbers in the Example Format (cont.)
Example: 4.6283 × 10^15
4.6283 × 10^15 = 1.027689044975 × 2^52
Exponent: 0110100_2 (= 52)
Significand: 1.00000111000101101010000_2
Biasing the exponent requires adding 127_10 (1111111_2):
New exponent: 10110011_2
The significand is already normalized.
Now pack the final result:
Result: 0 10110011 00000111000101101010000_2
(sign bit, exponent, significand - the leading '1' and the radix point are implied)

Comparison of Numbers in the Example Format
Example: 4.6283 × 10^15 vs. 3.0
01011001100000111000101101010000_2   4.6283 × 10^15
01000000010000000000000000000000_2   3.0
Example: 256.25 vs. 357.75
01000011100000000010000000000000_2   256.25
01000011101100101110000000000000_2   357.75
Example: 14.0 vs. 10.0
01000001011000000000000000000000_2   14.0
01000001001000000000000000000000_2   10.0
In each pair, the larger value also has the larger bit pattern when viewed as an unsigned integer.

Floating Add (FA) and Floating Subtract (FS) Procedure
Add or subtract (s1, e1, f1) and (s2, e2, f2):
1) Unpack (s, e, f); handle special operands
2) Shift the fraction of the number with the smaller exponent right by |e1 - e2| bits
3) Set the result exponent e_r = max(e1, e2)
4) For FA with s1 = s2, or FS with s1 ≠ s2, add the significands; otherwise subtract them
5) Count the leading zeros, z; a carry out can make z = -1; shift left z bits, or right 1 bit if z = -1
6) Round the result; shift right and adjust z if rounding overflow occurs
7) e_r ← e_r - z; check for overflow or underflow; bias and pack

Decimal Floating-Point Add and Subtract Examples
Addition example:
Operands:            6.144 × 10^2  +  9.975 × 10^4
Alignment:           0.06144 × 10^4 + 9.975 × 10^4 = 10.03644 × 10^4
Normalize & round:   1.003644 × 10^5 + 0.0005 × 10^5 (rounding) → 1.004 × 10^5
Subtraction example:
Operands:            1.076 × 10^-7  -  9.987 × 10^-8
Alignment:           1.076 × 10^-7 - 0.9987 × 10^-7 = 0.0773 × 10^-7
Normalize & round:   7.7300 × 10^-9 + 0.0005 × 10^-9 (rounding) → 7.730 × 10^-9

Floating Point Add Example
91673.4375 = 01000111101100110000110010111000
  357.75   = 01000011101100101110000000000000
Unpack:
10001111  1.01100110000110010111000   (91673.4375)
10000111  1.01100101110000000000000   (357.75)
Shift the smaller significand right until the exponents are equal:
10001111  1.01100110000110010111000
10001111  0.0000000101100101110000000000000
Add the significands:
10001111  1.0110011101111111001100000000000
Round:
10001111  1.01100111011111110011000
Pack: 0 10001111 01100111011111110011000  =  92031.1875

Another Floating Point Add Example
256.25 = 01000011100000000010000000000000
357.75 = 01000011101100101110000000000000
Unpack:
10000111  1.00000000010000000000000   (256.25)
10000111  1.01100101110000000000000   (357.75)
Add the significands:
10000111  10.01100110000000000000000
Renormalize:
10001000  1.001100110000000000000000
Round:
10001000  1.00110011000000000000000
Pack: 0 10001000 00110011000000000000000  =  614.0
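
A simplified sketch of steps 2-5 for same-sign operands, representing each number as an unbiased exponent and a 24-bit significand integer with the hidden 1 included (rounding, signs, and special cases are omitted; names are mine):

```python
def fp_add(e1: int, f1: int, e2: int, f2: int):
    """Add two positive FP values given as (exponent, 24-bit significand with hidden 1)."""
    # step 2: align -- shift the significand of the smaller exponent right
    if e1 < e2:
        e1, f1, e2, f2 = e2, f2, e1, f1
    f2 >>= (e1 - e2)
    # steps 3/4: the result exponent is the larger one; add the significands
    er, fr = e1, f1 + f2
    # step 5: renormalize if the sum carried out past bit 23
    while fr >= (1 << 24):
        fr >>= 1
        er += 1
    return er, fr

# 256.25 + 357.75: both exponents are 8; significands 1.0000000001 and 1.0110010111
er, fr = fp_add(8, 0b100000000010000000000000, 8, 0b101100101110000000000000)
print(er, format(fr, "024b"))    # 9 100110011000000000000000
print(fr / (1 << 23) * 2 ** er)  # 614.0
```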

Fig 6.17 Hardware Structure for Floating-Point Add and Subtract
The datapath contains:
- Adders for the exponents and significands
- Shifters for alignment and normalization
- Multiplexers to select the exponent and to swap the significands
- A leading-zeros counter

Floating-Point Multiply of Normalized Numbers
Multiply: (s_r, e_r, f_r) = (s1, e1, f1) × (s2, e2, f2)
1) Unpack (s, e, f); handle special operands
2) Compute s_r = s1 ⊕ s2; e_r = e1 + e2; f_r = f1 × f2
3) If necessary, normalize by one left shift and subtract 1 from e_r; round and shift right if rounding overflow occurs
4) Handle overflow for an exponent too positive and underflow for an exponent too negative
5) Pack the result, encoding or reporting exceptions

Decimal Floating-Point Examples for Multiply and Divide
Multiply: multiply the fractions and add the exponents
Operands:                   (-0.1403 × 10^-3) × (+0.3021 × 10^6)
Sign, fraction & exponent:  -0.04238463 × 10^(-3+6)
Normalize & round:          -0.4238463 × 10^2 - 0.00005 × 10^2 (rounding) → -0.4238 × 10^2
Divide: divide the fractions and subtract the exponents
Operands:                   (-0.9325 × 10^2) ÷ (-0.1002 × 10^-6)
Sign, fraction & exponent:  +9.306387 × 10^(2-(-6))
Normalize & round:          +0.9306387 × 10^9 + 0.00005 × 10^9 (rounding) → +0.9306 × 10^9

Floating Point Multiply Example
201.5 = 01000011010010011000000000000000
 14.0 = 01000001011000000000000000000000
Unpack:
10000110  1.10010011000000000000000   (201.5)
10000010  1.11000000000000000000000   (14.0)
Multiply the significands and add the exponents (subtracting the bias once):
exponent: 10000110 + 10000010 - 1111111 (127) = 10001001
significand: 10.11000001010000000000000
Normalize:
10001010  1.011000001010000000000000
Round:
10001010  1.01100000101000000000000
Pack: 01000101001100000101000000000000  =  2821.0

Floating-Point Divide of Normalized Numbers
Divide: (s_r, e_r, f_r) = (s1, e1, f1) ÷ (s2, e2, f2)
1) Unpack (s, e, f); handle special operands
2) Compute s_r = s1 ⊕ s2; e_r = e1 - e2; f_r = f1 ÷ f2
3) If necessary, normalize by one right shift and add 1 to e_r; round and shift right if rounding overflow occurs
4) Handle overflow for an exponent too positive and underflow for an exponent too negative
5) Pack the result, encoding or reporting exceptions

Fig 6.15 IEEE Single-Precision Floating Point Format
ê (biased)   e      Value                             Type
255          none   none                              Infinity or NaN
254          127    (-1)^s (1.f1 f2 ...) × 2^127      Normalized
...          ...    ...                               ...
2            -125   (-1)^s (1.f1 f2 ...) × 2^-125     Normalized
1            -126   (-1)^s (1.f1 f2 ...) × 2^-126     Normalized
0            -126   (-1)^s (0.f1 f2 ...) × 2^-126     Denormalized
The exponent bias is 127 for normalized numbers.

Special Numbers in IEEE Floating Point
An all-zero word is a normalized 0.
Other numbers with biased exponent ê = 0 are called denormalized.
Denormalized numbers have a hidden bit of 0 and an exponent of -126; they may have leading 0s.
Numbers with a biased exponent of 255 are used for ±∞ and other special values called NaN (not a number).
For example, one NaN represents 0/0.
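
A sketch that classifies a 32-bit pattern using these rules; the field extraction follows Fig 6.15 (function name is mine):

```python
def classify_ieee_single(word: int) -> str:
    """Classify a 32-bit IEEE single-precision pattern by its biased exponent."""
    sign = (word >> 31) & 1
    e_hat = (word >> 23) & 0xFF          # biased exponent
    frac = word & 0x7FFFFF               # 23 fraction bits
    if e_hat == 255:
        return "infinity" if frac == 0 else "NaN"
    if e_hat == 0:
        if frac == 0:
            return "zero"
        return "denormalized (hidden bit 0, exponent -126)"
    return f"normalized, unbiased exponent {e_hat - 127}, sign {sign}"

print(classify_ieee_single(0x40400000))  # normalized, unbiased exponent 1, sign 0  (3.0)
print(classify_ieee_single(0x7F800000))  # infinity
print(classify_ieee_single(0x00000001))  # denormalized
```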

Fig 6.16 IEEE Standard Double-Precision Binary Floating Point Format
The exponent bias for normalized numbers is 1023.
The denormalized biased exponent of 0 corresponds to an unbiased exponent of -1022.
Infinity and NaNs have a biased exponent of 2047.
The range increases from about 10^-38 ≤ |x| ≤ 10^38 to about 10^-308 ≤ |x| ≤ 10^308.