Presentation is loading. Please wait.

Presentation is loading. Please wait.

COMPUTER DESIGN CENG-212 Dr. Ceyhun ÇELİK

Similar presentations


Presentation on theme: "COMPUTER DESIGN CENG-212 Dr. Ceyhun ÇELİK"— Presentation transcript:

1 COMPUTER DESIGN CENG-212 Dr. Ceyhun ÇELİK celik.ceyhun@gmail.com
Computer Engineering Gazi University asdasdasd

2 Arithmetic for Computers
Chapter 3 Arithmetic for Computers

3 1. Introduction Chapter 2 shows that integers can be represented either in decimal or binary form, but what about the other numbers that commonly occur? What about fractions and other real numbers? What happens if an operation creates a number bigger than can be represented? And underlying these questions is a mystery: How does hardware really multiply or divide numbers? asdasdasd

4 2. Addition and Subtraction
Let’s try adding 6 to 7 in binary and then subtracting 6 from 7 in binary. asdasdasd

5 2. Addition and Subtraction
Let’s try adding 6 to 7 in binary and then subtracting 6 from 7 in binary. asdasdasd

6 2. Addition and Subtraction
Recall that overflow occurs when the result from an operation cannot be represented with the available hardware, in this case a 32-bit word. When can overflow occur in addition? No overflow can occur when adding positive and negative operands. = -6 There are similar restrictions to the occurrence of overflow during subtract, but it’s just the opposite principle: c - a = c + (-a) asdasdasd

7 2. Addition and Subtraction
Adding or subtracting two 32-bit numbers can yield a result that needs 33 bits to be fully expressed. The lack of a 33rd bit means that when overflow occurs, the sign bit is set with the value of the result instead of the proper sign of the result. asdasdasd

8 2. Addition and Subtraction
What about overflow with unsigned integers? Unsigned integers are commonly used for memory addresses where overflows are ignored. We need a way to ignore overflow in some cases and to recognize it in others. The MIPS solution is to have two kinds of arithmetic instructions to recognize the two choices: Add (add), add immediate (addi), and subtract (sub) cause exceptions on overflow. Add unsigned (addu), add immediate unsigned (addiu), and subtract unsigned (subu) do not cause exceptions on overflow. asdasdasd

9 2. Addition and Subtraction
The programmer or the programming environment must then decide what to do when overflow occurs. MIPS detects overflow with an exception, also called an interrupt on many computers. Exception (interrupt): An unscheduled event that disrupts program execution; used to detect overflow. MIPS includes a register called the exception program counter (EPC) to contain the address of the instruction that caused the exception. asdasdasd

10 2. Addition and Subtraction
The instruction move from system control (mfc0) is used to copy EPC into a general-purpose register. so that MIPS software has the option of returning to the off ending instruction via a jump register instruction. asdasdasd

11 2. Addition and Subtraction
Arithmetic Logic Unit (ALU): Hardware that performs addition, subtraction, and usually logical operations such as AND and OR. We construct an ALU from four hardware building blocks (AND and OR gates, inverters, and multiplexors). The main pupose is building a 32-bit ALU. Let’s assume that we will connect 32 1-bit ALUs to create the desired ALU. asdasdasd

12 2. Addition and Subtraction
A 1-Bit ALU: Hardware that performs addition, subtraction, and usually logical operations such as AND and OR. We construct an ALU from four hardware building blocks (AND and OR gates, inverters, and multiplexors). The main pupose is building a 32-bit ALU. Let’s assume that we will connect 32 1-bit ALUs to create the desired ALU. asdasdasd

13 2. Addition and Subtraction
A 1-Bit ALU The 1-bit logical unit for AND and OR 1 AND gate, 1 OR gate, 1 Multiplexer, asdasdasd

14 2. Addition and Subtraction
A 1-Bit ALU Addition 1 Full adder, asdasdasd

15 2. Addition and Subtraction
A 1-Bit ALU asdasdasd

16 2. Addition and Subtraction
32-Bit ALU The full 32-bit ALU is created by connecting adjacent “black boxes.” The adder created by directly linking the carries of 1-bit adders is called a ripple carry adder. asdasdasd

17 2. Addition and Subtraction
32-Bit ALU We use an inverter for subtraction. Recall that shortcut for negating a two’s complement number is to invert each bit and then add 1. How can we add 1 to inverted number? asdasdasd

18 2. Addition and Subtraction
32-Bit ALU A MIPS ALU also needs a NOR function. DeMorgan’s theorem. asdasdasd

19 2. Addition and Subtraction
32-Bit ALU One instruction that still needs support is the set on less than instruction (slt). 1 if rs < rt, and 0 otherwise We call that new input Less and use it only for slt. We must connect 0 to the Less input for the upper 31 bits of the ALU, since those bits are always set to 0. asdasdasd

20 2. Addition and Subtraction
32-Bit ALU slt instruction What happens if we subtract b from a? If the difference is negative, then a<b since (a-b)<0((a-b)+b)<(0+b)  a<b This desired result corresponds exactly to the sign bit values: 1 means negative and 0 means positive. asdasdasd

21 2. Addition and Subtraction
32-Bit ALU slt instruction ALU output for the slt operation is obviously the input value Less. Thus, we need a new 1-bit ALU for the most significant bit that has an extra output bit: the adder output (Set). asdasdasd

22 2. Addition and Subtraction
32-Bit ALU As long as we need a special ALU for the most significant bit, we added the overflow detection logic since it is also associated with that bit. Notice that every time we want the ALU to subtract, we set both CarryIn and Binvert to 1. For adds or logical operations, we want both control lines to be 0. Wecan therefore simplify control of the ALU by combining the CarryIn and Binvert to a single control line called Bnegate. asdasdasd

23 2. Addition and Subtraction
32-Bit ALU Conditional brunch The easiest way to test equality with the ALU is to subtract b from a and then test to see if the result is 0, since (a-b=0) ⇒ a=b if we add hardware to test if the result is 0, we can test for equality. asdasdasd

24 2. Addition and Subtraction
32-Bit ALU asdasdasd

25 3. Multiplication Let’s review the multiplication of decimal numbers in longhand to remind ourselves of the steps of multiplication and the names of the operands. Shifting the immediate product one digit to the left. asdasdasd

26 3. Multiplication if we ignore the sign bits, the length of the multiplication of an n-bit multiplicand and an m-bit multiplier is a product that is n + m bits long. We must cope with overflow because we frequently want a 32-bit product as the result of multiplying two 32-bit numbers. We have two rules, 1. Just place a copy of the multiplicand (1 x multiplicand) in the proper place if the multiplier digit is a 1, or 2. Place 0 (0 x multiplicand) in the proper place if the digit is 0. asdasdasd

27 3. Multiplication Let’s assume that we are multiplying only positive numbers. asdasdasd

28 3. Multiplication asdasdasd

29 3. Multiplication asdasdasd

30 3. Multiplication Replacing arithmetic by shifts can also occur when multiplying by constants as mentioned in Chapter 2. Let’s multiply 2x3 (0010 x 0011). asdasdasd

31 3. Multiplication The easiest way to understand how to deal with signed numbers is to first convert the multiplier and multiplicand to positive numbers and then remember the original signs. The algorithms should then be run for 31 iterations, leaving the signs out of the calculation. we remember that we are dealing with numbers that have infinite digits, and we are only representing them with 32 bits. When the algorithm completes, the lower word would have the 32-bit product. asdasdasd

32 3. Multiplication Fast multiplication hardware. Instead of waiting for 32 add times, we wait just the log2 (32) or five 32-bit add times. asdasdasd

33 3. Multiplication MIPS provides a separate pair of 32-bit registers to contain the 64-bit product, called Hi and Lo. To produce a properly signed or unsigned product, MIPS has two instructions: multiply (mult) and multiply unsigned (multu). To fetch the integer 32-bit product, the programmer uses move from lo (mflo). The MIPS assembler generates a pseudoinstruction for multiply that specifies three general-purpose registers, generating mflo and mfhi instructions to place the product into registers. asdasdasd

34 3. Multiplication Multiplication hardware simply shift s and add, as derived from the paper-and pencil method learned in grammar school. Compilers even use shift instructions for multiplications by powers of 2. With much more hardware we can do the adds in parallel, and do them much faster. asdasdasd

35 4. Division Let’s start with an example of long division using decimal numbers to recall the names of the operands and the grammar school division algorithm. Divide by 1000 asdasdasd

36 4. Division The Remainder register is initialized with the dividend.
asdasdasd

37 4. Division asdasdasd

38 4. Division Let’s try dividing 7 by 2 ( by 0010) asdasdasd

39 4. Division asdasdasd

40 4. Division So far, we have ignored signed numbers in division.
The simplest solution is to remember the signs of the divisor and dividend and then negate the quotient if the signs disagree. Moore’s Law applies to division hardware as well as multiplication, so we would like to be able to speed up division by throwing hardware at it. We used many adders to speed up multiply, but we cannot do the same trick for divide. So what can we do? asdasdasd

41 4. Division The SRT division technique tries to predict several quotient bits per step, using a table lookup based on the upper bits of the dividend and remainder. MIPS uses the 32-bit Hi and 32-bit Lo registers for both multiply and divide. Since the designs are similar for both multiply and divide. (except shift left or right) As we might expect from the algorithm above, Hi contains the remainder, and Lo contains the quotient aft er the divide instruction completes. asdasdasd

42 4. Division To handle both signed integers and unsigned integers, MIPS has two instructions: divide (div) and divide unsigned (divu). The MIPS assembler allows divide instructions to specify three registers, generating the mflo or mfhi instructions to place the desired result into a general-purpose register. We accelerate division by predicting multliple quotient bits and then correcting mispredictions later. asdasdasd

43 4. Division asdasdasd

44 5. Floating Point scientific notation : A notation that renders numbers with a single digit to the left of the decimal point. Normalized : A number in floating-point notation that has no leading 0s. floating point : Computer arithmetic that represents numbers in which the binary point is not fixed. asdasdasd

45 5. Floating Point Fraction: The value, generally between 0 and 1, placed in the fraction field. The fraction is also called the mantissa. Exponent: In the numerical representation system of floating-point arithmetic, the value that is placed in the exponent field. asdasdasd

46 5. Floating Point General form of floating-point numbers
F involves the value in the fraction field and E involves the value in the exponent field; overflow (floating-point): A situation in which a positive exponent becomes too large to fit in the exponent field. underflow (floating-point): A situation in which a negative exponent becomes too large to fit in the exponent field. asdasdasd

47 5. Floating Point IEEE 754 floating-point standard asdasdasd

48 5. Floating Point Example: what is the IEEE 754 binary representation of the number 0.75. Solution: asdasdasd

49 5. Floating Point Solution: Single precision Double precision
asdasdasd

50 5. Floating Point Example: What decimal number is represented by this single precision float? Solution: The sign bit is 1, the exponent field contains 129, and the fraction field contains 1/4 or Using the basic equation, asdasdasd

51 6. Parallelism and Computer Arithmetic: Subword Parallelism
Since every desktop microprocessor by definition has its own graphical displays, as transistor budgets increased it was inevitable that support would be added for graphics operations. 8 bits for colors + 8 bits for location of a pixel. The addition of speakers and microphones for teleconferencing and video games suggested support of sound as well. Audio samples need more than 8 bits of precision, but 16 bits are sufficient. asdasdasd

52 6. Parallelism and Computer Arithmetic: Subword Parallelism
Every microprocessor has special support so that bytes and halfwords take up less space when stored in memory. By partitioning the carry chains within a 128-bit adder, a processor could use parallelism to perform simultaneous operations on short vectors of sixteen 8-bit etc. Given that the parallelism occurs within a wide word, the extensions are classified as subword parallelism. It is also classified under the more general name of data level parallelism. asdasdasd

53 7. Real Stuff: Streaming SIMD Extensions and Advanced Vector Extensions in x86
Floating-point instructions of the x86. asdasdasd

54 8. Going Faster: Subword Parallelism and Matrix Multiply
1. void dgemm (int n, double* A, double* B, double* C) 2. { 3. for (int i = 0; i < n; ++i) 4. for (int j = 0; j < n; ++j) 5. { 6. double cij = C[i+j*n]; /* cij = C[i][j] */ 7. for( int k = 0; k < n; k++ ) 8. cij += A[i+k*n] * B[k+j*n]; /* cij += A[i][k]*B[k][j] */ 9. C[i+j*n] = cij; /* C[i][j] = cij */ 10. } } Unoptimized C version of a double precision matrix multiply (dgemm) asdasdasd

55 8. Going Faster: Subword Parallelism and Matrix Multiply
1. #include <x86intrin.h> 2. void dgemm (int n, double* A, double* B, double* C) 3. { 4. for ( int i = 0; i < n; i+=4 ) 5. for ( int j = 0; j < n; j++ ) { 6. __m256d c0 = _mm256_load_pd(C+i+j*n); /* c0 = C[i][j] */ 7. for( int k = 0; k < n; k++ ) 8. c0 = _mm256_add_pd(c0, /* c0 += A[i][k]*B[k][j] */ 9. _mm256_mul_pd(_mm256_load_pd(A+i+k*n), 10. _mm256_broadcast_sd(B+k+j*n))); 11. mm256_store_pd(C+i+j*n, c0); /* C[i][j] = c0 */ 12. } 13. } asdasdasd

56 8. Going Faster: Subword Parallelism and Matrix Multiply
1. vmovsd (%r10),%xmm0 2. mov %rsi,%rcx 3. xor %eax,%eax 4. vmovsd (%rcx),%xmm1 5. add %r9,%rcx 6.vmulsd (%r8,%rax,8),%xmm1,%xmm1 7. add $0x1,%rax 8. cmp %eax,%edi 9. vaddsd %xmm1,%xmm0,%xmm0 10. jg 30 <dgemm+0x30> 11. add $0x1,%r11d 12. vmovsd %xmm0,(%r10) 1. vmovapd (%r11),%ymm0 #(x4) 2. mov %rbx,%rcx 4. vbroadcastsd (%rax,%r8,1),%ymm1 #(x4) 5. add $0x8,%rax 6. vmulpd (%rcx),%ymm1,%ymm1 #(x4) 7. add %r9,%rcx 8. cmp %r10,%rax 9. vaddpd %ymm1,%ymm0,%ymm0 #(x4) 10. jne 50 <dgemm+0x50> 11. add $0x1,%esi 12. vmovapd %ymm0,(%r11) asdasdasd

57 8. Going Faster: Subword Parallelism and Matrix Multiply
The unoptimized DGEMM runs at 1.7 GigaFLOPS (FLoating point Operations Per Second) on one core of a 2.6 GHz Intel Core i7 (Sandy Bridge). The optimized code performs at 6.4 GigaFLOPS. The AVX version is 3.85 times as fast. You might hope for from performing 4 times as many operations at a time by using subword parallelism. asdasdasd

58 9. Fallacies and Pitfalls
Fallacy: Just as a left shift instruction can replace an integer multiply by a power of 2, a right shift is the same as an integer division by a power of 2. Pitfall: Floating-point addition is not associative. Fallacy: Parallel execution strategies that work for integer data types also work for floating-point data types. asdasdasd

59 9. Fallacies and Pitfalls
Pitfall: The MIPS instruction add immediate unsigned (addiu) sign-extends its 16-bit immediate field. Fallacy: Only theoretical mathematicians care about floating-point accuracy. asdasdasd

60 10. Concluding Remarks Two’s complement binary integer arithmetic is found in every computer sold today, and if it includes floating point support, it offers the IEEE 754 binary floating-point arithmetic. “overflow” or “underflow,” may result in exceptions or interrupts, emergency events similar to unplanned subroutine calls. asdasdasd

61 10. Concluding Remarks Long considered safe on sequential computers must be reconsidered when trying to find the fastest algorithm for parallel computers that still achieves a correct result. Data-level parallelism, specifically subword parallelism, offers a simple path to higher performance for programs that are intensive in arithmetic operations for either integer or floating-point data. asdasdasd

62 Thanks for your listening


Download ppt "COMPUTER DESIGN CENG-212 Dr. Ceyhun ÇELİK"

Similar presentations


Ads by Google