Optimizing Multipliers for the CPU: A ROM-Based Approach
Michael Moeng, Jason Wei
Electrical Engineering and Computer Science
University of California, Berkeley

Problem
- Many CPU applications are power-limited:
  - Media/graphics
  - Portable applications
- Investigate the impact of different multiplier designs on CPU power and performance:
  - Use SimpleScalar to model the CPU and run benchmarks
  - Modify SimpleScalar's multiplier cycle times to model different multiplier architectures

Array Multipliers
- AND function multiplies individual bits to form the partial products
- Critical path lies in the carry chain

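The structure named on this slide can be sketched in software. The following is a minimal behavioral model, not a gate-level design: the function name `array_multiply` and the `width` parameter are illustrative assumptions, and the Python `+=` stands in for the rows of adders whose carry chain forms the critical path.

```python
def array_multiply(a, b, width=8):
    """Behavioral model of an unsigned array multiplier (illustrative sketch)."""
    a_bits = [(a >> i) & 1 for i in range(width)]
    b_bits = [(b >> j) & 1 for j in range(width)]
    product = 0
    for j in range(width):
        # Row j: every partial-product bit is one AND of a multiplicand bit
        # with multiplier bit j.
        row = 0
        for i in range(width):
            row |= (a_bits[i] & b_bits[j]) << i
        # Each row is shifted by its weight and accumulated. In hardware this
        # accumulation is the adder array where carries ripple.
        product += row << j
    return product
```

For example, `array_multiply(13, 11)` forms eight AND rows and accumulates them into 143.
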
Wallace Multipliers
- Critical path is shortened
- A final adder is still needed to combine the partial products
- Power consumption is approximately the same as the array multiplier

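The reduction idea behind this slide can be illustrated with carry-save (3:2) compression, which is how Wallace-style trees are commonly described. This is a hedged sketch operating on whole integers rather than individual bit columns; the names `carry_save_add` and `wallace_multiply` are assumptions made for illustration.

```python
def carry_save_add(x, y, z):
    """3:2 compressor: returns (sum, carry) with x + y + z == sum + carry."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c


def wallace_multiply(a, b, width=8):
    """Reduce partial products with layers of 3:2 compressors (sketch)."""
    # One shifted partial product per multiplier bit.
    rows = [(a << j) if (b >> j) & 1 else 0 for j in range(width)]
    while len(rows) > 2:
        full = len(rows) - len(rows) % 3
        next_rows = []
        # Every group of three rows is compressed to two without
        # propagating carries along the word.
        for i in range(0, full, 3):
            next_rows.extend(carry_save_add(rows[i], rows[i + 1], rows[i + 2]))
        next_rows.extend(rows[full:])  # leftover rows pass through unchanged
        rows = next_rows
    # The final carry-propagating adder combines the last two rows.
    return sum(rows)
```

Each pass through the loop corresponds to one layer of the tree, so the number of layers grows roughly logarithmically with the number of partial products.
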
Modified Booth Representation
- 3 bits of the multiplier are examined at a time; only even values of $i$ are traversed
- Reduces the number of partial products by half
- However, overhead is required to generate the select signals and MUXes
- $Y_{-1} = 0$
- Examples: [0] [0] 2 -2

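The recoding described on this slide can be sketched as follows, assuming the standard radix-4 rule in which each overlapping triplet $(Y_{i+1}, Y_i, Y_{i-1})$ maps to the digit $-2Y_{i+1} + Y_i + Y_{i-1}$. The function names and the requirement that `width` be even are assumptions for this illustration.

```python
# Digit selected by each triplet (Y[i+1], Y[i], Y[i-1]).
BOOTH_TABLE = {
    0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
    0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0,
}


def booth_radix4_digits(y, width):
    """Recode a width-bit two's-complement multiplier (width even) into
    width/2 digits from {-2, -1, 0, 1, 2}, least significant first."""
    y_ext = (y & ((1 << width) - 1)) << 1  # append Y[-1] = 0 below bit 0
    digits = []
    for i in range(0, width, 2):           # only even i is visited
        triplet = (y_ext >> i) & 0b111
        digits.append(BOOTH_TABLE[triplet])
    return digits


def booth_multiply(x, y, width=8):
    """Sum one partial product per recoded digit (half as many as bits)."""
    return sum((d * x) << (2 * k)
               for k, d in enumerate(booth_radix4_digits(y, width)))
```

In hardware, each digit would select 0, ±X, or ±2X through a MUX, which is the overhead the slide mentions.
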
Read-Only Memory
- Desirable because of low power requirements
- Drawbacks: read delay and size
- 240 MHz -> 4.2 ns delay
- Consumes 3.24 mW at 100 MHz (10 ns delay)

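The delay figures on this slide follow from the clock period, $t = 1/f$. The sketch below reproduces that arithmetic; the energy-per-cycle helper is an added illustration that assumes the quoted power is an average with one read per cycle, which the slide does not state.

```python
def cycle_time_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1e3 / freq_mhz


def energy_per_cycle_pj(power_mw, freq_mhz):
    """Energy per cycle in picojoules (mW * ns = pJ).

    Assumes the power figure is an average at one read per cycle.
    """
    return power_mw * cycle_time_ns(freq_mhz)


print(round(cycle_time_ns(240), 1))   # period at 240 MHz, about 4.2 ns
print(cycle_time_ns(100))             # period at 100 MHz, 10 ns
```
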
ROM-based multipliers
- ROM-based multipliers are attractive
- Issue of space: a 32-bit multiplier requires $2^{32} \times 2^{32} \times 64$ bits, which is unrealistic
- Techniques to reduce table sizes:
  - Karatsuba Algorithm:
    - $A = A_{31-16}\,A_{15-0}$, $B = B_{31-16}\,B_{15-0}$
    - $A \cdot B = (A_{31-16} B_{31-16}) \ll 32 + (A_{15-0} B_{31-16}) \ll 16 + (A_{31-16} B_{15-0}) \ll 16 + A_{15-0} B_{15-0}$
    - Reduces table size to $2^{16} \times 2^{16} \times 32$ bits, but requires 4 lookups and 3 additions
    - Using multiple, parallel lookups still uses fewer bits than a regular table lookup

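The decomposition on this slide can be sketched as below. The 16-bit ROM is replaced by a stand-in function, `rom_lookup_16x16`, which is a hypothetical name; a real design would index a table instead. Note that this is the four-product split as written on the slide, not the three-multiplication variant usually associated with Karatsuba's name.

```python
MASK16 = 0xFFFF


def rom_lookup_16x16(a, b):
    """Stand-in for one access to a 2^16 x 2^16 x 32-bit product ROM."""
    return a * b


def split_multiply(a, b, lookup=rom_lookup_16x16):
    """32-bit x 32-bit multiply from four 16-bit lookups and three additions."""
    a_hi, a_lo = a >> 16, a & MASK16
    b_hi, b_lo = b >> 16, b & MASK16
    return ((lookup(a_hi, b_hi) << 32)
            + (lookup(a_lo, b_hi) << 16)
            + (lookup(a_hi, b_lo) << 16)
            + lookup(a_lo, b_lo))


# Table sizes in bits: the direct table versus four parallel reduced tables.
FULL_TABLE_BITS = (2 ** 32) * (2 ** 32) * 64
FOUR_SPLIT_TABLES_BITS = 4 * (2 ** 16) * (2 ** 16) * 32
```

Even with four copies of the reduced table for parallel lookups, `FOUR_SPLIT_TABLES_BITS` is $2^{39}$ bits against $2^{70}$ for the direct table.
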
ROM-based multipliers (cont.)
- Vinnakota's approach: use tables of squares
  - Let $x = \lfloor (A + B)/2 \rfloor$ and $y = \lfloor (A - B)/2 \rfloor$
  - If $A_0 \oplus B_0 = 0$: $A \cdot B = x^2 - y^2$
  - If $A_0 \oplus B_0 = 1$: $A \cdot B = x^2 - y^2 + B$
  - Reduces table size to $2^{32} \times 64$ bits, further reducible with split tables (introduced later)
  - Requires 2 table lookups and 3 (or 4) additions
- Hybrid approach: use tables of squares to find the partial products for the Karatsuba algorithm

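The squares identity above can be checked with a small sketch. It uses an 8-bit table so that an exhaustive check is cheap; the names `make_square_table` and `square_multiply` are illustrative assumptions. Because $y^2 = |y|^2$, the table is indexed with the absolute value of $y$.

```python
def make_square_table(bits):
    """Table of squares for every index from 0 to 2^bits - 1."""
    return [v * v for v in range(1 << bits)]


def square_multiply(a, b, squares):
    """Multiply unsigned a and b using two lookups in a table of squares."""
    x = (a + b) >> 1           # floor((A + B) / 2)
    y = (a - b) >> 1           # floor((A - B) / 2); Python's >> floors negatives
    product = squares[x] - squares[abs(y)]
    if (a ^ b) & 1:            # A0 xor B0 = 1: low bits differ
        product += b           # the optional fourth addition
    return product


SQUARES_8 = make_square_table(8)
print(square_multiply(7, 4, SQUARES_8))   # x = 5, y = 1: 25 - 1 + 4 = 28
```

The two additions forming $x$ and $y$, the subtraction of the squares, and the conditional `+ b` account for the 3 (or 4) additions on the slide.
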
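Proposed Implementation
[Block diagram: inputs $A = A_1 A_0$ and $B = B_1 B_0$ produce $x_{11}, y_{11}, \dots$; a $2^{16} \times 32$-bit ROM returns $x_{11}^2, y_{11}^2, \dots$, which give the partial products $A_1 B_1, A_1 B_0, \dots$; two $2^{16} \times 32$-bit ROMs are shown.]
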
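A behavioral sketch of this hybrid datapath, under stated assumptions: the list `SQUARES` stands in for the contents of a $2^{16} \times 32$-bit ROM of squares, the function names are hypothetical, and the ROM accesses are modeled sequentially rather than in parallel hardware.

```python
MASK16 = 0xFFFF
# Stand-in for a 2^16-entry ROM of squares; every entry fits in 32 bits.
SQUARES = [v * v for v in range(1 << 16)]


def mul16_by_squares(a, b):
    """16-bit x 16-bit partial product from two lookups in the squares ROM."""
    x = (a + b) >> 1
    y = abs((a - b) >> 1)      # y^2 == |y|^2, so index with the magnitude
    product = SQUARES[x] - SQUARES[y]
    if (a ^ b) & 1:            # operands differ in their lowest bit
        product += b
    return product


def hybrid_multiply(a, b):
    """32-bit x 32-bit multiply: split into halves, then use squares lookups."""
    a1, a0 = a >> 16, a & MASK16
    b1, b0 = b >> 16, b & MASK16
    return ((mul16_by_squares(a1, b1) << 32)
            + (mul16_by_squares(a0, b1) << 16)
            + (mul16_by_squares(a1, b0) << 16)
            + mul16_by_squares(a0, b0))
```

Each of the four partial products needs two square lookups, so a full multiply in this sketch performs eight ROM reads.
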
Results
- Most of the SPEC2000 benchmarks exhibited little or no performance loss (< 0.5%) from extra multiplier cycles:
  - art, bzip*, gcc, gzip*, ijpeg, li, mcf, mesa, parser*, vpr
    : Significant   * : Possibly significant
- Applications that did experience a drop in performance from the extra cycles:
  - go.outorder (6.41%): Go-playing program
  - m88ksim (5.39%): chip simulator
  - perl (0.72%): Perl interpreter
  - vortex (2.33%): object-oriented database

Further Work
- Measurements:
  - Accurate power measurements
  - More specific benchmarks targeting multimedia
- Optimizations:
  - Tables: Vinnakota's split-table work
    - If $A$ and $B$ share their lower $k$ bits, then $A^2$ and $B^2$ share their lower $k+1$ bits
    - Can change a $2^N \times N$ table to a $2^N \times (N - [k+1])$ table and a $2^k \times (k+1)$ table
    - Gives somewhat faster lookups and lower memory requirements
  - Adders: adders can be optimized; the final 64-bit additions are more like 48-bit additions
  - Pipelining: multiplication operations can occur in up to 3 stages

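The split-table property listed under Optimizations can be sketched as follows. This is an illustration under assumptions, not Vinnakota's exact construction: the function names are hypothetical, the sketch requires $k \ge 1$, and it stores whole integers without modeling the exact entry widths quoted on the slide.

```python
def build_split_tables(n, k):
    """Split a table of squares for n-bit inputs into two smaller tables.

    high[v] keeps the square with its lowest k+1 bits removed (2^n entries);
    low[w] keeps only the lowest k+1 bits and is indexed by the lowest k
    bits of the input (2^k entries). Requires k >= 1.
    """
    low_mask = (1 << (k + 1)) - 1
    high = [(v * v) >> (k + 1) for v in range(1 << n)]
    low = [(w * w) & low_mask for w in range(1 << k)]
    return high, low


def split_square_lookup(v, high, low, k):
    """Reassemble v^2 from one read of each table."""
    return (high[v] << (k + 1)) | low[v & ((1 << k) - 1)]


HIGH, LOW = build_split_tables(8, 3)
print(split_square_lookup(13, HIGH, LOW, 3))   # 169
```

The reassembly works because any two inputs that agree in their lowest $k$ bits have squares that agree in their lowest $k+1$ bits, so those bits need to be stored only once per low-bit pattern.
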