Published by Blake Hensley. Modified over 9 years ago.
1
Optimizing Multipliers for the CPU: A ROM-Based Approach
Michael Moeng, Jason Wei
Electrical Engineering and Computer Science
University of California, Berkeley
2
Problem
- Many CPU applications are power-limited:
  - Media/graphics
  - Portable applications
- Investigating the impact of different multiplier designs on the power and performance of the CPU:
  - SimpleScalar is used to model the CPU and run the benchmarks
  - SimpleScalar multiplier cycle times are modified to model different multiplier architectures
3
Array Multipliers
- An AND function multiplies individual bits
- The critical path lies in the carry chain
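The AND-based partial products described on this slide can be sketched in software. This is a minimal behavioural model, not a gate-level design: the function name `array_multiply` and the 8-bit default width are assumptions made for illustration.

```python
def array_multiply(a, b, width=8):
    """Multiply two unsigned width-bit integers row by row, as an array multiplier does."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    total = 0
    for i in range(width):
        b_i = (b >> i) & 1
        # Row i of the array: every bit of a ANDed with bit i of b.
        row = 0
        for j in range(width):
            a_j = (a >> j) & 1
            row |= (a_j & b_i) << j
        # Each row is weighted by 2**i and accumulated. In hardware this
        # accumulation is the chain of adders that forms the critical path.
        total += row << i
    return total


if __name__ == "__main__":
    print(array_multiply(13, 11))
```

The sequential accumulation in the loop mirrors why the carry chain dominates the delay: each row has to wait for the sum of the rows before it.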
4
Wallace Multipliers
- Critical path is shortened
- A final adder is still needed to combine the partial products
- Power consumption is approximately the same as for the array multiplier
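A rough software sketch of the idea: partial-product rows are reduced three at a time with carry-save (3:2) compression until two rows remain, and only then is a carry-propagating addition performed. The grouping used here is a simple assumed scheme, and `carry_save_add` and `wallace_multiply` are illustrative names, not taken from the slides.

```python
def carry_save_add(x, y, z):
    """3:2 compressor: returns (sum, carry) with sum + carry == x + y + z."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c


def wallace_multiply(a, b, width=8):
    """Return (product, reduction_levels) for unsigned width-bit operands."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    # One partial-product row per bit of b, already shifted to its weight.
    rows = [(a << i) if (b >> i) & 1 else 0 for i in range(width)]
    levels = 0
    while len(rows) > 2:
        full = len(rows) - len(rows) % 3
        next_rows = []
        # All groups of three on a level are independent, so in hardware
        # they are compressed in parallel with no carry propagation.
        for k in range(0, full, 3):
            s, c = carry_save_add(rows[k], rows[k + 1], rows[k + 2])
            next_rows += [s, c]
        next_rows += rows[full:]  # leftover rows pass through unchanged
        rows = next_rows
        levels += 1
    # The final adder is the only step that propagates carries.
    return sum(rows), levels


if __name__ == "__main__":
    print(wallace_multiply(13, 11))
```

The number of levels grows roughly logarithmically with the number of rows, which is where the shortened critical path comes from; the closing `sum(rows)` corresponds to the final adder the slide mentions.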
5
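Modified Booth Representation
- Three bits are examined at a time, traversing even values of i
- Reduces the number of partial products by half
- However, overhead is required to generate the select signals and MUXes
- $Y_{-1} = 0$
- Examples (the bracketed bit is $Y_{-1}$; recoded digits are listed most significant first):
  - 1 1 1 1 [0] → 0, -1
  - 0 1 1 0 [0] → 2, -2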
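The recoding on this slide can be reproduced with a small lookup on overlapping bit triplets. The sketch below assumes the standard radix-4 Booth digit table and a two's-complement multiplier of even width; the function names are illustrative.

```python
# Standard radix-4 Booth table: (y[i+1], y[i], y[i-1]) -> digit in {-2, ..., 2}.
BOOTH_DIGIT = {
    0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
    0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0,
}


def booth_radix4_digits(y, width):
    """Return the Booth digits of a two's-complement width-bit y, least significant first."""
    if width % 2:
        raise ValueError("width must be even")
    y &= (1 << width) - 1
    padded = y << 1  # appends the implicit bit Y[-1] = 0
    digits = []
    for i in range(0, width, 2):  # even values of i only
        triplet = (padded >> i) & 0b111  # bits y[i+1], y[i], y[i-1]
        digits.append(BOOTH_DIGIT[triplet])
    return digits


def booth_multiply(x, y, width):
    """Multiply x by the signed value of y using width/2 partial products."""
    return sum((d * x) << (2 * i) for i, d in enumerate(booth_radix4_digits(y, width)))


if __name__ == "__main__":
    print(booth_radix4_digits(0b1111, 4))
    print(booth_radix4_digits(0b0110, 4))
```

Each digit selects one of 0, ±x, or ±2x, which is what the MUXes on the slide choose between; a 4-bit multiplier therefore needs two partial products instead of four.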
6
Read-Only Memory
- Desirable because of low power requirements
- Drawbacks stem from read delay and size
- 240 MHz -> 4.2 ns delay
- Consumes 3.24 mW at 100 MHz (10 ns delay)
7
ROM-Based Multipliers
- ROM-based multipliers are attractive
- Issue of space: a 32-bit multiplier requires $2^{32} \times 2^{32} \times 64$ bits, which is unrealistic
- Techniques exist to reduce table sizes
- Karatsuba algorithm:
  - $A = A_{31:16}\,A_{15:0}$, $B = B_{31:16}\,B_{15:0}$
  - $A \times B = (A_{31:16} B_{31:16} \ll 32) + (A_{15:0} B_{31:16} \ll 16) + (A_{31:16} B_{15:0} \ll 16) + A_{15:0} B_{15:0}$
  - Reduces table size to $2^{16} \times 2^{16} \times 32$ bits, but requires 4 lookups and 3 additions
  - Using multiple parallel lookups still uses fewer bits than a regular table lookup
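The four-lookup decomposition on this slide can be sketched at a reduced scale. The code below assumes 8-bit operands split into 4-bit halves so that the product table fits in memory; `PRODUCT_ROM`, `split_multiply`, and `table_bits` are illustrative names. Note that it follows the four-product split exactly as the slide writes it (the textbook Karatsuba method goes one step further and uses three products).

```python
HALF = 4  # bits per operand half in this sketch; the slides use 16
MASK = (1 << HALF) - 1

# Stand-in for the ROM: the product of every pair of HALF-bit values.
PRODUCT_ROM = [[p * q for q in range(1 << HALF)] for p in range(1 << HALF)]


def split_multiply(a, b):
    """Multiply two unsigned 2*HALF-bit values with 4 lookups and 3 additions."""
    a_hi, a_lo = a >> HALF, a & MASK
    b_hi, b_lo = b >> HALF, b & MASK
    hh = PRODUCT_ROM[a_hi][b_hi]
    lh = PRODUCT_ROM[a_lo][b_hi]
    hl = PRODUCT_ROM[a_hi][b_lo]
    ll = PRODUCT_ROM[a_lo][b_lo]
    return (hh << (2 * HALF)) + (lh << HALF) + (hl << HALF) + ll


def table_bits(operand_bits):
    """Bits in a full product table for two operand_bits-wide inputs."""
    return (1 << operand_bits) * (1 << operand_bits) * (2 * operand_bits)


if __name__ == "__main__":
    print(split_multiply(200, 123) == 200 * 123)
    # Four parallel half-width tables versus one full-width table.
    print(4 * table_bits(16) < table_bits(32))
```

`table_bits(32)` and `table_bits(16)` reproduce the two sizes quoted on the slide, and the final comparison illustrates the claim that even four parallel copies of the smaller table use fewer bits than the single full table.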
8
ROM-Based Multipliers (cont.)
- Vinnakota's approach: use tables of squares
  - Let $x = \lfloor (A + B)/2 \rfloor$ and $y = \lfloor (A - B)/2 \rfloor$
  - If $A_0 \oplus B_0 = 0$: $A \times B = x^2 - y^2$
  - If $A_0 \oplus B_0 = 1$: $A \times B = x^2 - y^2 + B$
  - Reduces table size to $2^{32} \times 64$ bits, further reducible with split tables (introduced later)
  - Requires 2 table lookups and 3 (or 4) additions
- Hybrid approach: use tables of squares to find the partial products for the Karatsuba algorithm
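The squares-table identity can be checked with a short sketch. It assumes 8-bit unsigned operands so the table stays small, and it relies on Python's `>>` rounding toward negative infinity, which matches the floor in the definition of $y$. The names `SQUARES` and `square_table_multiply` are illustrative.

```python
N = 8  # operand width in this sketch; the slides discuss 32 bits

# Stand-in for the ROM of squares. For N-bit operands x never exceeds
# 2**N - 1 and |y| never exceeds 2**(N - 1), so 2**N entries suffice.
SQUARES = [v * v for v in range(1 << N)]


def square_table_multiply(a, b):
    """Multiply unsigned N-bit a and b with 2 lookups and 3 (or 4) additions."""
    x = (a + b) >> 1  # floor((A + B) / 2)
    y = (a - b) >> 1  # floor((A - B) / 2), may be negative
    result = SQUARES[x] - SQUARES[abs(y)]  # y**2 == abs(y)**2
    if (a ^ b) & 1:  # A0 xor B0 = 1: both floors discarded a half
        result += b
    return result


if __name__ == "__main__":
    print(square_table_multiply(2, 3), square_table_multiply(200, 123))
```

The correction term follows from the algebra: when $A + B$ is odd, $A + B = 2x + 1$ and $A - B = 2y + 1$, so $AB = x^2 - y^2 + (x - y)$ and $x - y = B$.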
9
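Proposed Implementation
[Block diagram. Labels, in order of appearance: $A = A_1 A_0$, $B = B_1 B_0$; $x_{11}, y_{11}$ …; $2^{16} \times 32$-bit ROM; $x_{11}^2, y_{11}^2$ …; $A_1 \times B_1$, $A_1 \times B_0$ …; $2^{16} \times 32$-bit ROM]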
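One plausible reading of this diagram is that each half-word partial product is obtained from a ROM of squares and the four partial products are then combined by shifting and adding. The sketch below is a hypothetical, scaled-down model of that reading (4-bit halves instead of 16-bit ones); the names `SQUARE_ROM`, `partial_product`, and `hybrid_multiply` are assumptions, and the actual datapath in the presentation may differ.

```python
HALF = 4  # bits per operand half in this sketch; the diagram uses 16
MASK = (1 << HALF) - 1

# Stand-in for one ROM of squares with 2**HALF entries.
SQUARE_ROM = [v * v for v in range(1 << HALF)]


def partial_product(p, q):
    """Product of two HALF-bit values via two square lookups."""
    x = (p + q) >> 1
    y = abs((p - q) >> 1)  # floor division, then magnitude for the lookup
    prod = SQUARE_ROM[x] - SQUARE_ROM[y]
    return prod + q if (p ^ q) & 1 else prod


def hybrid_multiply(a, b):
    """Multiply unsigned 2*HALF-bit a and b from four table-based partial products."""
    a1, a0 = a >> HALF, a & MASK
    b1, b0 = b >> HALF, b & MASK
    return (
        (partial_product(a1, b1) << (2 * HALF))
        + ((partial_product(a1, b0) + partial_product(a0, b1)) << HALF)
        + partial_product(a0, b0)
    )


if __name__ == "__main__":
    print(hybrid_multiply(200, 123) == 200 * 123)
```

In this model every lookup goes to a table with only $2^{\text{HALF}}$ entries, which corresponds to the $2^{16}$-entry ROMs in the diagram once the halves are 16 bits wide.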
10
Results
- Most of the SPEC2000 benchmarks exhibited little or no performance loss (< 0.5%) from extra multiplier cycles: art, bzip*, gcc, gzip*, ijpeg, li, mcf, mesa, parser*, vpr
- Legend: [marker lost] : Significant; * : Possibly significant
- Applications that did experience a drop in performance from the extra cycles:
  - go.outorder (6.41%): Go-playing program
  - m88ksim (5.39%): chip simulator
  - perl (0.72%): Perl interpreter
  - vortex (2.33%): object-oriented database
11
Further Work
- Measurements:
  - Accurate power measurements
  - More specific benchmarks targeting multimedia
- Optimizations:
  - Tables: Vinnakota's split-table work
    - If $A$ and $B$ share their lower $k$ bits, $A^2$ and $B^2$ share their lower $k+1$ bits
    - Can change a $2^N \times N$ table into $2^N \times (N - [k+1])$ and $2^k \times (k+1)$ tables
    - Gives somewhat faster lookups and lower memory requirements
  - Adders: adders can be optimized; the final 64-bit additions are more like 48-bit additions
  - Pipelining: multiplication operations can occur in up to 3 stages
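The split-table property can be illustrated with a small sketch: because the lower $k+1$ bits of a square depend only on the lower $k$ bits of the input (for $k \ge 1$), they can be stored once in a small table and dropped from every entry of the large one. The widths `N = 8` and `K = 3` and all names below are assumptions chosen for illustration.

```python
N = 8  # input width in this sketch
K = 3  # number of shared low input bits; must be at least 1

LOW_MASK = (1 << (K + 1)) - 1

# Small table: low K+1 bits of the square, indexed by the low K input bits.
LOW_TABLE = [(v * v) & LOW_MASK for v in range(1 << K)]
# Large table: remaining high bits of the square, indexed by the full input.
HIGH_TABLE = [(v * v) >> (K + 1) for v in range(1 << N)]


def split_table_square(v):
    """Rebuild v**2 from the narrow large table and the small low-bits table."""
    low = LOW_TABLE[v & ((1 << K) - 1)]
    high = HIGH_TABLE[v]
    return (high << (K + 1)) | low


def squares_share_low_bits(a, b, k):
    """True if a**2 and b**2 agree in their lower k+1 bits."""
    return (a * a - b * b) % (1 << (k + 1)) == 0


if __name__ == "__main__":
    print(split_table_square(201) == 201 * 201)
    print(squares_share_low_bits(0b10101, 0b01101, 3))
```

The second check uses two inputs that agree in their lower three bits (`101`), so their squares agree in the lower four; this is what allows each entry of the large table to be $k+1$ bits narrower.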