Optimizing Multipliers for the CPU: A ROM based approach Michael Moeng Jason Wei Electrical Engineering and Computer Science University of California:

1 Optimizing Multipliers for the CPU: A ROM-Based Approach
Michael Moeng, Jason Wei
Electrical Engineering and Computer Science, University of California, Berkeley

2 Problem
Many CPU applications are power-limited:
- Media/graphics
- Portable applications
We investigate how different multiplier designs affect CPU power and performance:
- SimpleScalar models the CPU and runs the benchmarks
- SimpleScalar's multiplier cycle times are modified to model different multiplier architectures

3 Array Multipliers
- An AND gate multiplies each pair of operand bits, forming the partial-product bits
- The critical path runs through the carry chain
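A minimal behavioural sketch of the array idea, assuming unsigned operands: each row of partial products is formed by ANDing the multiplicand bits with one multiplier bit, and the rows are accumulated with a bit-level ripple-carry adder, whose carry propagation is the long path mentioned above. The function names are hypothetical, and a real array multiplier lays its full adders out as a grid rather than reusing one adder in a loop.

```python
def ripple_add(x: int, y: int, width: int) -> int:
    """Bit-serial model of a width-bit ripple-carry adder (carry out kept as bit `width`)."""
    result, carry = 0, 0
    for i in range(width):
        a_i, b_i = (x >> i) & 1, (y >> i) & 1
        result |= (a_i ^ b_i ^ carry) << i                   # full-adder sum bit
        carry = (a_i & b_i) | (a_i & carry) | (b_i & carry)  # carry ripples to bit i + 1
    return result | (carry << width)


def array_multiply(a: int, b: int, n: int) -> int:
    """Unsigned n-bit x n-bit product, built row by row like an array multiplier."""
    product = 0
    for j in range(n):
        b_j = (b >> j) & 1
        # Row j: AND each multiplicand bit with multiplier bit j, shifted left by j.
        row = sum((((a >> i) & 1) & b_j) << (i + j) for i in range(n))
        # The full product fits in 2n bits, so a 2n-bit adder never overflows here.
        product = ripple_add(product, row, 2 * n)
    return product


print(array_multiply(13, 11, 4))  # 143
```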

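4 Wallace Multipliers
- The critical path is shortened, because partial products are reduced in a tree rather than added row by row
- A final adder is still needed to combine the remaining partial products
- Power consumption is approximately the same as the array multiplier's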
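The sketch below models the Wallace idea at word level, assuming unsigned operands: 3:2 carry-save adders turn three rows into two without propagating carries, so the row count shrinks geometrically, and one ordinary carry-propagate addition (the final adder) finishes the product. A true Wallace tree applies the same reduction column by column on individual bits; the names here are hypothetical.

```python
def carry_save_add(x: int, y: int, z: int) -> tuple[int, int]:
    """3:2 compressor on whole words: x + y + z == s + c, with no carry propagation."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c


def wallace_multiply(a: int, b: int, n: int) -> tuple[int, int]:
    """Unsigned product via carry-save reduction; returns (product, reduction levels)."""
    # Partial-product rows: row j is the multiplicand ANDed with bit j, shifted by j.
    rows = [(a if (b >> j) & 1 else 0) << j for j in range(n)]
    levels = 0
    while len(rows) > 2:
        groups = len(rows) // 3
        reduced = []
        for g in range(groups):
            reduced.extend(carry_save_add(*rows[3 * g : 3 * g + 3]))
        reduced.extend(rows[3 * groups :])  # 0-2 leftover rows pass straight through
        rows = reduced
        levels += 1
    # Final carry-propagate adder: the only place carries ripple across the word.
    return sum(rows), levels


print(wallace_multiply(200, 123, 8))  # (24600, 4)
```

For 8-bit operands the eight rows shrink to two in four carry-save levels; 32-bit operands need eight.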

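5 Modified Booth Representation
- Three multiplier bits, $y_{i+1} y_i y_{i-1}$, are examined at a time, for even values of $i$ only, with $y_{-1} = 0$
- Each group is recoded as a radix-4 digit $d = -2y_{i+1} + y_i + y_{i-1} \in \{-2, -1, 0, 1, 2\}$
- This halves the number of partial products
- However, it adds overhead: logic to generate the select signals, and MUXes to choose the multiple of the multiplicand
- Examples (multiplier bits with the appended $y_{-1}$ in brackets → recoded digits, most significant first):
  - 1 1 1 1 [0] → 0 -1 (value 0·4 - 1 = -1)
  - 0 1 1 0 [0] → 2 -2 (value 2·4 - 2 = 6)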
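A short sketch of radix-4 (modified) Booth recoding, assuming an n-bit two's-complement multiplier with n even; it reproduces both examples above. The function names are hypothetical, and the multiply step only models what the MUX-selected partial products add up to.

```python
def booth_radix4_digits(y: int, n: int) -> list[int]:
    """Recode the n-bit two's-complement multiplier y (n even) into n/2 radix-4 digits.

    Digits are returned least significant first; each lies in {-2, -1, 0, 1, 2}.
    """
    def bit(i: int) -> int:
        return 0 if i < 0 else (y >> i) & 1  # y_{-1} = 0

    # Even i only: examine y_{i+1}, y_i, y_{i-1}.
    return [-2 * bit(i + 1) + bit(i) + bit(i - 1) for i in range(0, n, 2)]


def booth_multiply(a: int, y: int, n: int) -> int:
    """a times the two's-complement value of y, with one partial product per digit."""
    # In hardware each digit drives a MUX choosing 0, +/-a or +/-2a, shifted by 2j bits.
    return sum(d * a << (2 * j) for j, d in enumerate(booth_radix4_digits(y, n)))


print(booth_radix4_digits(0b1111, 4)[::-1])  # [0, -1]
print(booth_radix4_digits(0b0110, 4)[::-1])  # [2, -2]
```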

6 Read-Only Memory
- Desirable because of its low power requirements
- Drawbacks stem from read delay and size
- 240 MHz corresponds to a 4.2 ns delay
- Consumes 3.24 mW at 100 MHz (10 ns delay)
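As a quick check of these figures, the cycle time is the reciprocal of the clock rate, and power times cycle time gives energy per cycle. The energy line assumes, as a labeled simplification, one ROM read per 10 ns cycle:

```latex
\begin{aligned}
t &= \frac{1}{f}, \qquad \frac{1}{240\ \mathrm{MHz}} \approx 4.17\ \mathrm{ns}, \qquad \frac{1}{100\ \mathrm{MHz}} = 10\ \mathrm{ns} \\
E_{\text{cycle}} &= P \cdot t = 3.24\ \mathrm{mW} \times 10\ \mathrm{ns} = 32.4\ \mathrm{pJ}
\end{aligned}
```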

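7 ROM-Based Multipliers
- ROM-based multipliers are attractive, but space is an issue: a direct 32-bit multiplier table requires $2^{32} \times 2^{32} \times 64$ bits, which is unrealistic
- Techniques exist to reduce table sizes
- Karatsuba-style operand splitting: write $A = A_{31:16} A_{15:0}$ and $B = B_{31:16} B_{15:0}$; then
  $A \cdot B = (A_{31:16} B_{31:16} \ll 32) + (A_{15:0} B_{31:16} \ll 16) + (A_{31:16} B_{15:0} \ll 16) + A_{15:0} B_{15:0}$
- This reduces the table size to $2^{16} \times 2^{16} \times 32$ bits, but requires 4 lookups and 3 additions. (Strictly, the Karatsuba algorithm needs only three half-size products; this four-product split is the schoolbook decomposition it starts from.)
- Even with multiple parallel lookups, this uses far fewer bits than a direct table lookup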
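A scaled-down sketch of the four-lookup split, assuming unsigned operands: a Python list stands in for the ROM, and 8-bit operands with a $2^4 \times 2^4$ table replace the slide's 32-bit operands and $2^{16} \times 2^{16}$ table so the table is small enough to build. The names are hypothetical.

```python
def build_product_rom(h: int) -> list[list[int]]:
    """Model a 2^h x 2^h ROM of 2h-bit products (h = 16 on the slide, smaller here)."""
    size = 1 << h
    return [[x * y for y in range(size)] for x in range(size)]


def split_multiply(a: int, b: int, n: int, rom: list[list[int]]) -> int:
    """Unsigned n-bit product from four (n/2)-bit table lookups and three additions."""
    h = n // 2
    mask = (1 << h) - 1
    a_hi, a_lo = a >> h, a & mask
    b_hi, b_lo = b >> h, b & mask
    # Four independent lookups; hardware could issue them in parallel from table copies.
    hh, lh = rom[a_hi][b_hi], rom[a_lo][b_hi]
    hl, ll = rom[a_hi][b_lo], rom[a_lo][b_lo]
    return (hh << (2 * h)) + (lh << h) + (hl << h) + ll


rom = build_product_rom(4)  # 16 x 16 entries, for 8-bit operands
print(split_multiply(0xB7, 0x5C, 8, rom))  # 16836
```

Because the three shifted terms contribute nothing to the low h bits, those bits of the product come straight from the `ll` lookup.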

8 ROM-Based Multipliers (cont.)
- Vinnakota's approach: use tables of squares
  - Let $x = \lfloor (A + B)/2 \rfloor$ and $y = \lfloor (A - B)/2 \rfloor$
  - If $A_0 \oplus B_0 = 0$: $A \cdot B = x^2 - y^2$
  - If $A_0 \oplus B_0 = 1$: $A \cdot B = x^2 - y^2 + B$
  - Reduces the table size to $2^{32} \times 64$ bits, further reducible with split tables (see Further Work), and requires 2 table lookups and 3 (or 4) additions
- Hybrid approach: use tables of squares to find the partial products for the Karatsuba-style split
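A sketch of the table-of-squares method, assuming unsigned n-bit operands and a Python list as the square ROM (8-bit here rather than 32-bit). Python's `>>` floors negative values, so `(a - b) >> 1` matches $\lfloor (A-B)/2 \rfloor$ even when $A < B$; since $y^2 = |y|^2$, the table is indexed by $|y|$, which never exceeds $2^{n-1}$. The names are hypothetical, and hardware might handle the sign of $A - B$ differently.

```python
def build_square_rom(n: int) -> list[int]:
    """Model a 2^n-entry ROM of 2n-bit squares (n = 32 on the slide, smaller here)."""
    return [v * v for v in range(1 << n)]


def square_table_multiply(a: int, b: int, squares: list[int]) -> int:
    """Unsigned product from two square lookups, following the slide's formulas."""
    x = (a + b) >> 1  # floor((A + B) / 2): fits in n bits
    y = (a - b) >> 1  # floor((A - B) / 2): Python's >> floors negatives too
    product = squares[x] - squares[abs(y)]  # y^2 == |y|^2, so index by |y|
    if (a ^ b) & 1:  # A_0 xor B_0 == 1: A + B is odd, so add the correction term B
        product += b
    return product


squares = build_square_rom(8)
print(square_table_multiply(201, 54, squares))  # 10854
```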

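9 Proposed Implementation
[Figure: split $A = A_1 A_0$ and $B = B_1 B_0$ into 16-bit halves; form $x_{11}, y_{11}, \ldots$ for each pair of halves; look up $x_{11}^2, y_{11}^2, \ldots$ in $2^{16} \times 32$-bit ROMs; combine into the partial products $A_1 B_1, A_1 B_0, \ldots$]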
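Combining the operand split with tables of squares gives a sketch of the hybrid scheme, under the same assumptions as before: unsigned operands, scaled down from 32-bit operands with a $2^{16} \times 32$-bit square ROM to 8-bit operands with a 16-entry square table. Each of the four half-size products is computed with two square lookups. The names are hypothetical, and the sketch does not model how the hardware shares or replicates ROMs.

```python
def build_square_rom(h: int) -> list[int]:
    """2^h entries of 2h-bit squares (the slide's 2^16 x 32-bit ROM has h = 16)."""
    return [v * v for v in range(1 << h)]


def square_product(p: int, q: int, squares: list[int]) -> int:
    """p * q for h-bit p, q via the table-of-squares identity."""
    x, y = (p + q) >> 1, (p - q) >> 1
    return squares[x] - squares[abs(y)] + (q if (p ^ q) & 1 else 0)


def hybrid_multiply(a: int, b: int, n: int, squares: list[int]) -> int:
    """Unsigned n-bit product: operand split, each half-product from square lookups."""
    h = n // 2
    mask = (1 << h) - 1
    a1, a0 = a >> h, a & mask
    b1, b0 = b >> h, b & mask
    p11 = square_product(a1, b1, squares)  # uses x_11, y_11
    p01 = square_product(a0, b1, squares)
    p10 = square_product(a1, b0, squares)
    p00 = square_product(a0, b0, squares)
    return (p11 << (2 * h)) + ((p01 + p10) << h) + p00


squares = build_square_rom(4)  # 16 entries, enough for 8-bit operands
print(hybrid_multiply(250, 99, 8, squares))  # 24750
```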

10 Results
- Most of the SPEC benchmarks (a mix of SPEC2000 and SPEC95 programs such as go, ijpeg, li and m88ksim) exhibited little or no performance loss (< 0.5%) from extra multiplier cycles: art, bzip*, gcc, gzip*, ijpeg, li, mcf, mesa, parser*, vpr (* = possibly significant)
- Applications that did experience a drop in performance (extra cycles):
  - go.outorder (6.41%): go-playing program
  - m88ksim (5.39%): chip simulator
  - perl (0.72%): Perl interpreter
  - vortex (2.33%): object-oriented database

11 Further Work
Measurements:
- Accurate power measurements
- More specific benchmarks, targeting multimedia
Optimizations:
- Tables: Vinnakota's split-table work. If $A$ and $B$ share their lower $k$ bits ($k \ge 1$), then $A^2$ and $B^2$ share their lower $k+1$ bits. For $N$-bit inputs, whose squares are $2N$ bits wide, a $2^N \times 2N$ table can therefore be replaced by a $2^N \times (2N-(k+1))$ table and a $2^k \times (k+1)$ table, giving somewhat faster lookups and lower memory requirements.
- Adders: the adders can be optimized; the final 64-bit additions are effectively 48-bit additions, since the low 16 bits of the product come directly from $A_{15:0} B_{15:0}$.
- Pipelining: multiplication can be pipelined in up to 3 stages.
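A sketch of the split-table property, assuming $N$-bit inputs and Python lists as ROMs ($N = 8$, $k = 3$ here): the lowest $k+1$ bits of $v^2$ depend only on the lowest $k$ bits of $v$, so they can come from a small table while the large table stores only the upper bits. The names are hypothetical, and this illustrates the property rather than Vinnakota's exact table organization.

```python
def build_split_square_roms(n: int, k: int) -> tuple[list[int], list[int]]:
    """Split a 2^n-entry square ROM (2n-bit entries) into two narrower tables.

    high: 2^n entries holding the upper 2n - (k + 1) bits of v^2
    low:  2^k entries holding the lower k + 1 bits, indexed by v's lower k bits
    """
    if k < 1:
        raise ValueError("the shared-low-bits property needs k >= 1")
    high = [(v * v) >> (k + 1) for v in range(1 << n)]
    low = [(r * r) & ((1 << (k + 1)) - 1) for r in range(1 << k)]
    return high, low


def split_square(v: int, high: list[int], low: list[int], k: int) -> int:
    """Reassemble v^2 from the two tables."""
    return (high[v] << (k + 1)) | low[v & ((1 << k) - 1)]


high, low = build_split_square_roms(8, 3)
print(split_square(203, high, low, 3))  # 41209
```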

