Published by Myles Harvey. Modified over 9 years ago.
1
200/MAPLD 2004 Craven1
Super-Sized Multiplies: How Do FPGAs Fare in Extended Digit Multipliers?
Stephen Craven, Cameron Patterson, Peter Athanas
Configurable Computing Lab, Virginia Tech
2
Outline
- Background
  - Large Integer Multiplication
  - GIMPS
- Algorithm Comparison
  - Floating-point FFT
  - All-integer FFT
  - Fast Galois Transform
- Accelerator Design
  - System Design
  - Operation
  - Performance
- Improvements & Future Work
3
Large Integer Multiplication
- Complexity
  - Grade School: O(N^2)
  - Fourier Transform: ~O(N log N)
- Efficient FFT-Based Multiplication
  - Divide integers into sequences of smaller digits: 867530924601 -> 86, 75, 30, 92, 46, 01
  - Convolution of two sequences is equivalent to multiplication.
  - Element-wise multiplication in the frequency domain is equivalent to convolution in the time domain.
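A minimal Python sketch of the digit-sequence idea above, assuming base-100 digits as in the slide's example. It uses a direct O(N^2) convolution rather than an FFT so that the equivalence between convolution and multiplication is easy to see. The helper names (`to_digits`, `convolve`, `release_carries`) are illustrative and do not come from the presentation.

```python
def to_digits(n, base=100):
    """Split a non-negative integer into little-endian digits in the given base."""
    digits = []
    while n:
        n, d = divmod(n, base)
        digits.append(d)
    return digits or [0]


def convolve(a, b):
    """Acyclic convolution: out[k] is the sum of a[i] * b[j] over all i + j == k."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def release_carries(coeffs, base=100):
    """Horner evaluation at the base, which has the same effect as propagating carries."""
    total = 0
    for c in reversed(coeffs):
        total = total * base + c
    return total


x = 867530924601
digits = to_digits(x)                      # little-endian: [1, 46, 92, 30, 75, 86]
square = release_carries(convolve(digits, digits))
assert square == x * x
```

An FFT-based multiplier replaces only `convolve`; the digit splitting and the carry step stay the same.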
4
GIMPS
- Why multiply big numbers?
- Great Internet Mersenne Prime Search (GIMPS)
  - Primality testing algorithm for Mersenne numbers (2^q - 1) requires squaring of multi-million-digit numbers.
  - Mersenne primes are the largest primes known; used in cryptography.
- Large integer convolution
  - Performance comparison of Pentiums and FPGAs in traditional floating-point domains.
- Lucas-Lehmer Primality Test
    Mq = 2^q - 1;
    v = 4;
    for i = 1:q-2,
        v = v^2 - 2 (mod Mq);
    if v == 0, Mq is prime
    else, Mq is composite
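The Lucas-Lehmer pseudocode above translates almost line for line into Python. This is a sketch for small exponents only: it relies on Python's built-in big-integer squaring, not on the FFT-based squaring the slides are about, and it assumes `q` is an odd prime (the function name `lucas_lehmer` is illustrative).

```python
def lucas_lehmer(q):
    """Return True if the Mersenne number 2**q - 1 is prime, for an odd prime q."""
    m = (1 << q) - 1          # Mq = 2^q - 1
    v = 4
    for _ in range(q - 2):    # q - 2 squarings, as in the slide
        v = (v * v - 2) % m
    return v == 0


exponents = [3, 5, 7, 11, 13, 17, 19, 23, 31]
print([q for q in exponents if lucas_lehmer(q)])   # [3, 5, 7, 13, 17, 19, 31]
```

Each loop iteration is one modular squaring, which is the operation the accelerator targets.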
5
Discrete Weighted Transform
- Discrete Weighted Transform (DWT)
  - Variable base: each sequence digit can contain a differing number of bits.
  - Creates power-of-two sequence needed by FFT.
  - Eliminates need to zero pad to convert cyclic, FFT-based convolution into acyclic convolution needed for squaring.
- Steps:
  - Number to be multiplied divided into variable-length digits.
  - Sequence multiplied by a weight sequence.
  - FFT performed on new, power-of-two length weighted sequence.
- Example for Mq = 2^37 - 1 with FFT length of 4:
  - Bits / digit = { 10, 9, 9, 9 }
  - To square 78,314,567,209 (mod Mq), our sequence would be: { 553, 93, 381, 291 }
  - 553 + 93 * 2^10 + 381 * 2^19 + 291 * 2^28 = 78,314,567,209
  - Multiply sequence by weights then FFT.
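The example above can be reproduced with a short Python sketch. Two assumptions are not stated on the slide: the digit boundaries are taken as ceil(q*j/N), and the weights as 2^(ceil(q*j/N) - q*j/N), which is the usual Crandall-Fagin choice. A direct cyclic convolution stands in for the forward FFT, squaring, and inverse FFT; all function names are illustrative.

```python
def digit_boundaries(q, n):
    """Bit offsets ceil(q*j/n) for j = 0..n; consecutive differences are bits per digit."""
    return [-(-q * j // n) for j in range(n + 1)]


def variable_base_digits(x, q, n):
    """Split x (0 <= x < 2**q) into n little-endian variable-length digits."""
    e = digit_boundaries(q, n)
    return [(x >> e[j]) & ((1 << (e[j + 1] - e[j])) - 1) for j in range(n)]


def dwt_square(x, q, n):
    """Square x modulo 2**q - 1 with a weighted cyclic convolution of length n."""
    e = digit_boundaries(q, n)
    # Assumed weights: 2 ** (ceil(q*j/n) - q*j/n), one per digit.
    weights = [2.0 ** (e[j] - q * j / n) for j in range(n)]
    y = [d * w for d, w in zip(variable_base_digits(x, q, n), weights)]
    # Cyclic convolution of the weighted sequence with itself (no zero padding).
    cyc = [sum(y[i] * y[(k - i) % n] for i in range(n)) for k in range(n)]
    # Removing the weights leaves integers, up to floating-point round-off.
    z = [round(c / w) for c, w in zip(cyc, weights)]
    # Carry release and final reduction modulo the Mersenne number.
    return sum(zk << e[k] for k, zk in enumerate(z)) % ((1 << q) - 1)


x, q, n = 78_314_567_209, 37, 4
print(variable_base_digits(x, q, n))                # [553, 93, 381, 291]
assert dwt_square(x, q, n) == pow(x, 2, (1 << q) - 1)
```

The `round` call is where a floating-point implementation needs its round-off error checks; with four short digits the error is far below 0.5.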
6
Objective
- Compare performance of Pentium processors to FPGAs.
  - GIMPS chosen because highly optimized code exists.
  - GIMPS utilizes fast floating-point performance of Pentiums.
- Xilinx Virtex-II Pro 100 (2VP100) chosen as target device.
  - Largest available 2VP device.
  - Contains 444 17x17 unsigned multipliers and 888 kB of embedded Block RAM.
- Target 12 million digit numbers.
  - Reward for first prime above 10 million digits.
7
Floating-point FFT
- GIMPS implementation uses floating-point; requires round-off error checks.
- Using near double-precision floating-point (51-bit mantissa):
  - 49 real multipliers can be placed on 2VP100
  - 12 complex multipliers
- 12 million digit number -> 2 million point FFT
- 44 million complex multiplies -> 3.7 million cycles
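The cycle count above appears consistent with a simple model, sketched here in Python. The model is an assumption rather than something stated on the slide: a radix-2 FFT with one complex multiply per butterfly, a forward plus an inverse transform per iteration, and every multiplier accepting new operands each cycle.

```python
def butterflies(fft_len):
    """Radix-2 butterfly count for one power-of-two FFT: (N/2) * log2(N)."""
    stages = fft_len.bit_length() - 1       # exact log2 for a power of two
    return (fft_len // 2) * stages


def cycles_per_iteration(fft_len, multipliers):
    """Forward plus inverse FFT, multiplies spread evenly over the multipliers."""
    total_multiplies = 2 * butterflies(fft_len)
    return total_multiplies / multipliers


n = 2 ** 21                                  # "2 million point" FFT
print(2 * butterflies(n))                    # 44040192, about 44 million
print(round(cycles_per_iteration(n, 12)))    # 3670016, about 3.7 million
```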
8
All-integer FFT
- Perform FFT modulo special prime.
  - Prime must have nice roots of one & two.
  - Reductions modulo prime should be simple.
  - Primes of the form 2^k - 2^m + 1 meet requirements.

Prime                # Multipliers   FFT Length   Iteration time
2^47 - 2^24 + 1      49              4M           1.9M cycles
2^64 - 2^32 + 1      26              2M           1.7M cycles
2^73 - 2^37 + 1      17              2M           2.6M cycles
2^113 - 2^57 + 1     9               1M           2.3M cycles
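A Python sketch of why such primes are convenient, using 2^64 - 2^32 + 1 from the table. Because 2^64 = 2^32 - 1 and 2^96 = -1 modulo this prime, a 128-bit product reduces with shifts, adds, and subtracts, and powers of 2 serve as roots of unity. The recursive transform and the names `reduce128`, `mulmod`, `ntt`, and `square_via_ntt` are illustrative assumptions, not the accelerator's actual structure.

```python
P = 2**64 - 2**32 + 1
MASK32 = 2**32 - 1
MASK64 = 2**64 - 1


def reduce128(x):
    """Reduce 0 <= x < 2**128 modulo P using 2**64 = 2**32 - 1 and 2**96 = -1 (mod P)."""
    lo = x & MASK64
    mid = (x >> 64) & MASK32
    hi = x >> 96
    r = lo + (mid << 32) - mid - hi       # shifts, adds, and subtracts only
    if r < 0:
        r += P
    if r >= P:
        r -= P
    return r


def mulmod(a, b):
    """Multiply two residues (each below P) and reduce the 128-bit product."""
    return reduce128(a * b)


def ntt(a, w):
    """Recursive radix-2 transform of a power-of-two-length list with root of unity w."""
    n = len(a)
    if n == 1:
        return list(a)
    w2 = mulmod(w, w)
    even, odd = ntt(a[0::2], w2), ntt(a[1::2], w2)
    out, t = [0] * n, 1
    for k in range(n // 2):
        v = mulmod(t, odd[k])             # butterfly: one modular multiply
        out[k] = (even[k] + v) % P
        out[k + n // 2] = (even[k] - v) % P
        t = mulmod(t, w)
    return out


def square_via_ntt(digits, n=16, base=100):
    """Square a little-endian digit list using a zero-padded length-n transform (n <= 64)."""
    w = pow(2, 192 // n, P)               # 2 has order 192 mod P, so w has order n
    a = list(digits) + [0] * (n - len(digits))
    spectrum = [mulmod(v, v) for v in ntt(a, w)]
    n_inv, w_inv = pow(n, P - 2, P), pow(w, P - 2, P)
    coeffs = [mulmod(c, n_inv) for c in ntt(spectrum, w_inv)]
    return sum(c * base**k for k, c in enumerate(coeffs))


assert square_via_ntt([1, 46, 92, 30, 75, 86]) == 867530924601 ** 2
```

With `n = 16` the root of unity is 2^12, so every twiddle factor is a power of two and the twiddle multiplications could in principle be done with shifts. Unlike the floating-point FFT, the result is exact and needs no round-off checks.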
9
Fast Galois Transform
- All-integer transform using complex numbers modulo a Mersenne prime: a + b*i (mod Mp)
- Real input sequence folded into complex input with half the length.
- Modular reductions via Mersenne primes are simple addition.

Prime        # Multipliers   FFT Length   Iteration Time
2^61 - 1     6 (complex)     1M           3.5M cycles
2^89 - 1     3 (complex)     512K         3.3M cycles
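A Python sketch of the arithmetic behind the Galois transform, using the prime 2^61 - 1 from the table. Since 2^61 = 1 modulo this prime, reduction folds the high bits onto the low bits with an addition. The names `mersenne_reduce` and `gauss_mul` are illustrative, and the four-multiply complex product shown is only one possible arrangement, not necessarily the one used in the design.

```python
K = 61
M = (1 << K) - 1               # Mersenne prime 2**61 - 1


def mersenne_reduce(x):
    """Reduce x >= 0 modulo M by folding high bits onto low bits (2**K = 1 mod M)."""
    while x > M:
        x = (x & M) + (x >> K)
    return 0 if x == M else x


def gauss_mul(u, v):
    """Multiply a + b*i by c + d*i, with both components reduced modulo M."""
    (a, b), (c, d) = u, v
    # Real part a*c - b*d, kept non-negative by adding M - (b*d mod M).
    re = mersenne_reduce(a * c + (M - mersenne_reduce(b * d)))
    im = mersenne_reduce(a * d + b * c)
    return re, im


print(gauss_mul((0, 1), (0, 1)) == (M - 1, 0))   # True: i * i = -1 (mod M)
```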
10
Algorithm Selection
Considered algorithms:

Algorithm                    Result
Floating-point FFT           3.7M cycles / iteration
All-integer FFT              1.7M cycles / iteration
Galois Transform             3.3M cycles / iteration
Winograd Transform           no acceptable run lengths
Chinese Remainder Theorem    added complexity
11
FFT Design
- Multipliers and adder generated by CoreGen.
- 10-cycle butterfly latency.
12
Complete Design
- 8-point FFTs lower cache throughput.
- Multiple caches allow for overlapping computation with memory reads and writes.
13
Performance Estimates
- XC2VP100-6ff1696
- ISE version 6.2i
- Iteration time: 34 milliseconds
- FFT Engine frequency: 80 MHz
- 2VP100 utilization:
  - 70% slices
  - 24% BRAMs
  - 86% multipliers

Iteration Stage                Time (us)
Weighted sequence creation*    250
Forward FFT                    11,500
DFT coefficient squaring       250
Inverse FFT                    11,500
Weight removal*                250
Carry releasing*               5,000
Mersenne mod reduction*        5,000

* Not Implemented
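As a quick check, the per-stage estimates above add up to the quoted iteration time. The Python snippet below only sums the table; the dictionary name is illustrative.

```python
stage_times_us = {
    "Weighted sequence creation": 250,
    "Forward FFT": 11_500,
    "DFT coefficient squaring": 250,
    "Inverse FFT": 11_500,
    "Weight removal": 250,
    "Carry releasing": 5_000,
    "Mersenne mod reduction": 5_000,
}

total_us = sum(stage_times_us.values())
print(total_us)             # 33750
print(total_us / 1000)      # 33.75, i.e. about 34 ms per iteration
```

The two transforms account for 23,000 of the 33,750 microseconds, roughly two thirds of an iteration.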
14
Performance Comparison
- Pentium 4 Performance:
  - Non-SIMD (64-bit multiplies)
  - 6.4 GFLOPs
- All-Integer transform leverages FPGA strengths:
  - 1.9 billion integer multiplies / sec
  - Transform performance exceeds P4.
- FPGA vs. Pentium 4:
  - 34 ms vs. 60 ms => 1.76x speed-up!
  - $10,000† vs. $500 => 20x more costly.
  - 600 sq mm* vs. 146 sq mm‡ => 4.1x more die area.

† FPGAs would likely be less costly if volume equaled the P4.
‡ The P4 area estimate does not include the area required by all of the support chips.
* 2VP100 die area extrapolated from 2VP20 data supplied by Semiconductor Insights (www.semiconductor.com).
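The three ratios above follow directly from the quoted figures, as this small Python check shows (variable names are illustrative). Note that a 1.76x speed-up means the FPGA iteration takes about 57% of the Pentium 4 time.

```python
fpga_ms, p4_ms = 34, 60
fpga_cost, p4_cost = 10_000, 500
fpga_area, p4_area = 600, 146        # sq mm

speedup = p4_ms / fpga_ms
cost_ratio = fpga_cost / p4_cost
area_ratio = fpga_area / p4_area

print(round(speedup, 2))             # 1.76
print(cost_ratio)                    # 20.0
print(round(area_ratio, 1))          # 4.1
```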
15
Improvements & Future Work
- Pentium assembly code is highly optimized, while the HW accelerator is a first draft.
- Algorithm exploration
  - Nussbaumer's method using 17-bit primes
  - Utilize "nice" form of prime to implement shift-only multiply for first two FFT stages.
- Cluster Implementation
  - Configurable Computing Lab is constructing a 16-node 2VP cluster with gigabit transceivers as interconnect.
- Alternative reduced-multiplier butterfly structures
- Floorplanning
16
Conclusions
- All-integer FFTs are attractive for hardware implementations of filters / convolutions.
- GIMPS accelerator designed:
  - Operates at 80 MHz
  - 1.76x faster than a 3.2 GHz Pentium 4 (34 ms vs. 60 ms per iteration)
- Cost of accelerator outweighs benefit in this application.