
1  Large Multicore FFTs: Approaches to Optimization
Sharon Sacco and James Geraci
MIT Lincoln Laboratory
24 September 2008
This work is sponsored by the Department of the Air Force under Air Force contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government.

2  Outline
Introduction
– 1D Fourier Transform
– Mapping 1D FFTs onto Cell
– 1D as 2D Traditional Approach
Technical Challenges
Design
Performance
Summary

3  1D Fourier Transform
g_j = \sum_{k=0}^{N-1} f_k \, e^{-2\pi i j k / N}
This is a simple equation
A few people spend a lot of their careers trying to make it run fast
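For reference, a minimal sketch of evaluating this sum directly in C, with an illustrative function name and a straightforward O(N^2) loop; the whole point of the FFT work in this talk is to avoid computing it this way.

```c
#include <complex.h>
#include <math.h>

/* Direct O(N^2) evaluation of g[j] = sum_k f[k] * exp(-2*pi*i*j*k/N).
 * Illustrative only -- an FFT computes the same result in O(N log N). */
void dft(const float complex *f, float complex *g, int n)
{
    for (int j = 0; j < n; j++) {
        float complex acc = 0.0f;
        for (int k = 0; k < n; k++) {
            double angle = -2.0 * M_PI * (double)j * (double)k / (double)n;
            acc += f[k] * (float complex)(cos(angle) + I * sin(angle));
        }
        g[j] = acc;
    }
}
```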

4  Mapping 1D FFT onto Cell
Cell FFTs can be classified by memory requirements
– Small FFTs fit in a single SPE local store (LS); 4096 points is the largest size
– Medium FFTs fit across multiple local stores; 65536 points is the largest size
– Large FFTs must use XDR memory as well as local store
Medium and large FFTs require careful memory transfers

5  1D as 2D Traditional Approach
(Figure: 4 x 4 example showing the data layout after each corner turn and the central twiddle matrix w^(j*k).)
1. Corner turn to compact columns
2. FFT on columns
3. Corner turn to original orientation
4. Multiply (elementwise) by central twiddles
5. FFT on rows
6. Corner turn to correct data order
1D as 2D FFT reorganizes data a lot
– Timing jumps when it is used
Can reduce memory for twiddle tables
Only one FFT size is needed
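A rough sketch of the six steps above in plain C. The helper fft_rows() is hypothetical (standing in for any row-FFT kernel), the corner turns are written as ordinary transposes rather than the DMA-based versions a Cell implementation would use, and the twiddle sign and layout follow the usual forward-transform convention rather than anything stated on the slide.

```c
#include <complex.h>
#include <math.h>

/* Hypothetical 1D FFT applied to each row of a rows x cols array. */
void fft_rows(float complex *a, int rows, int cols);

/* Out-of-place corner turn (transpose) of a rows x cols array. */
static void corner_turn(const float complex *in, float complex *out, int rows, int cols)
{
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++)
            out[c * rows + r] = in[r * cols + c];
}

/* 1D FFT of length n = r * c treated as a 2D r x c array (steps 1-6 above).
 * The reordered result ends up in tmp after the final corner turn. */
void fft_1d_as_2d(float complex *x, float complex *tmp, int r, int c)
{
    corner_turn(x, tmp, r, c);                /* 1. corner turn to compact columns   */
    fft_rows(tmp, c, r);                      /* 2. FFT on the (former) columns       */
    corner_turn(tmp, x, c, r);                /* 3. corner turn to original layout    */
    for (int j = 0; j < r; j++)               /* 4. multiply by central twiddles      */
        for (int k = 0; k < c; k++)
            x[j * c + k] *= cexpf(-2.0f * (float)M_PI * I * (float)j * (float)k / (float)(r * c));
    fft_rows(x, r, c);                        /* 5. FFT on rows                       */
    corner_turn(x, tmp, r, c);                /* 6. corner turn to correct data order */
}
```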

6  Outline
Introduction
Technical Challenges
– Communications
– Memory
– Cell Rounding
Design
Performance
Summary

7  Communications
Bandwidth to XDR memory is 25.3 GB/s
SPE connection to the EIB is 50 GB/s
EIB bandwidth is 96 bytes / cycle
Minimizing XDR memory accesses is critical
Leverage the EIB
Coordinating SPE communication is desirable
– Need to know the SPEs' relative geometry

8  Memory
XDR memory is much larger than the 1M point FFT requirements
Each SPE has 256 KB of local store memory
Each Cell has 2 MB of local store memory in total
Need to rethink algorithms to leverage the memory
– Consider local store from both the individual and the collective SPE point of view

9  Cell Rounding
(Figure: effect of a 1-bit difference under IEEE 754 round to nearest versus Cell single-precision truncation; on average truncation costs about 0.5 bit of accuracy per operation while round to nearest costs essentially none.)
The cost to correct the basic binary operations (add, multiply, and subtract) is prohibitive
Accuracy should instead be improved by minimizing the number of steps the algorithm takes to produce a result
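Not from the talk: a small host-side experiment, under the simplifying assumption that truncation just drops low-order bits, showing why truncation carries roughly a half-bit average bias per operation while round-to-nearest carries almost none.

```c
#include <stdio.h>
#include <stdlib.h>

/* Quantize v to a grid of spacing ulp, either by truncation (round toward zero,
 * as the Cell SPE single-precision pipeline does) or by round-to-nearest (IEEE 754). */
static double quantize(double v, double ulp, int truncate)
{
    return truncate ? ulp * (long)(v / ulp)
                    : ulp * (long)(v / ulp + 0.5);
}

int main(void)
{
    const double ulp = 1.0 / 256.0;   /* pretend only 8 fractional bits survive */
    const int trials = 1000000;
    double err_trunc = 0.0, err_rtn = 0.0;
    for (int i = 0; i < trials; i++) {
        double v = (double)rand() / RAND_MAX;          /* random value in [0,1) */
        err_trunc += v - quantize(v, ulp, 1);
        err_rtn   += v - quantize(v, ulp, 0);
    }
    /* Truncation averages ~0.5 ulp of bias; round-to-nearest averages ~0 ulp. */
    printf("avg bias: truncation %.3f ulp, round-to-nearest %.3f ulp\n",
           err_trunc / trials / ulp, err_rtn / trials / ulp);
    return 0;
}
```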

10  Outline
Introduction
Technical Challenges
Design
– Using Memory Well
  – Reducing Memory Accesses
  – Distributing on SPEs
  – Bit Reversal
  – Complex Format
– Computational Considerations
Performance
Summary

11  FFT Signal Flow Diagram and Terminology
(Figure: size-16 decimation-in-frequency signal flow diagram, inputs in natural order and outputs in bit-reversed order, with labels marking a butterfly, a block, and a radix-2 stage.)
Size 16 can illustrate concepts for large FFTs
– Ideas scale well and it is "drawable"
This is the "decimation in frequency" data flow
Where the weights are applied determines the algorithm
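To make the terminology concrete, here is one common way to write a single radix-2 decimation-in-frequency stage over one block in plain C (a sketch, not the SPE SIMD code); each loop iteration is one butterfly, and the weight is applied on the way out of the butterfly.

```c
#include <complex.h>

/* One radix-2 decimation-in-frequency stage over a contiguous block.
 * 'block' points to block_len elements; w holds block_len/2 twiddle factors.
 * Each iteration is one butterfly; the whole loop is one stage for this block. */
static void dif_stage(float complex *block, const float complex *w, int block_len)
{
    int half = block_len / 2;
    for (int i = 0; i < half; i++) {
        float complex a = block[i];
        float complex b = block[i + half];
        block[i]        = a + b;            /* top output of the butterfly        */
        block[i + half] = (a - b) * w[i];   /* bottom output, weight applied here */
    }
}
```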

12  Reducing Memory Accesses
(Figure: 1024 x 1024 data array with a column strip highlighted; dimensions labeled 1024, 64, and 4.)
Columns will be loaded in strips that fit in the total Cell local store
The column FFT processes 4 columns at a time to leverage the SIMD registers
– Requires separate code from the row FFTs
Data reorganization requires SPE-to-SPE DMAs
No bit reversal

13  1D FFT Distribution with Single Reorganization
(Figure: size-16 signal flow diagram with a single reorganization point marked between the early and late stages.)
One approach is to load everything onto a single SPE to do the first part of the computation
After a single reorganization each SPE owns an entire block and can complete the computations on its points

14  1D FFT Distribution with Multiple Reorganizations
(Figure: size-16 signal flow diagram with a reorganization marked after each of the early stages.)
A second approach is to divide groups of contiguous butterflies among the SPEs and reorganize after each stage until the SPEs own a full block

15  Selecting the Preferred Reorganization
                   Single Reorganization              Multiple Reorganizations
Number of SPEs     Exchanges   Data Moved in 1 DMA    Exchanges   Data Moved in 1 DMA
2                  2           N / 4                  2           N / 4
4                  12          N / 16                 8           N / 8
8                  56          N / 64                 24          N / 16
Single reorganization: number of exchanges = P * (P - 1); number of elements exchanged = N * (P - 1) / P
Multiple reorganizations: number of exchanges = P * log2(P); number of elements exchanged = (N / 2) * log2(P)
N is the number of elements in SPE memory, P is the number of SPEs; typical N is 32k complex elements
Evaluation favors multiple reorganizations
– Fewer DMAs have less bus contention; a single reorganization exceeds the number of busses
– DMA overhead (~0.3 µs) is minimized
– Programming is simpler for multiple reorganizations
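A small throwaway snippet, not from the talk, that evaluates the formulas above so the table rows can be reproduced for any P.

```c
#include <stdio.h>
#include <math.h>

/* Exchange counts for N elements per SPE and P SPEs, using the slide's formulas. */
int main(void)
{
    for (int p = 2; p <= 8; p *= 2) {
        int lg = (int)round(log2((double)p));
        int single_exchanges   = p * (p - 1);     /* single reorganization    */
        int multiple_exchanges = p * lg;          /* multiple reorganizations */
        /* total data / number of exchanges gives the per-DMA size:
         * single:   N*(P-1)/P over P*(P-1) exchanges = N / P^2
         * multiple: (N/2)*log2(P) over P*log2(P) exchanges = N / (2P) */
        printf("P=%d: single %d exchanges (N/%d per DMA), multiple %d exchanges (N/%d per DMA)\n",
               p, single_exchanges, p * p, multiple_exchanges, 2 * p);
    }
    return 0;
}
```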

16  Column Bit Reversal
(Figure: binary row numbers, e.g. 000000001 and 100000000, showing a bit-reversed row pair.)
Bit reversal of columns can be implemented through the order in which rows are processed, together with double buffering
Reversed row pairs are both read into local store and then written to each other's memory location
Exchanging rows for bit reversal has a low cost
DMA addresses are table driven
The bit reversal table can be very small
Row FFTs are conventional 1D FFTs
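A minimal host-side sketch of the idea, assuming 10-bit row numbers (1024 rows); the bit-reversal table stays tiny, and each reversed pair of rows is simply swapped, which a Cell implementation would do with table-driven, double-buffered DMAs rather than memcpy.

```c
#include <string.h>
#include <complex.h>

/* Reverse the low 'bits' bits of x (e.g. bits = 10 for 1024 rows). */
static unsigned bit_reverse(unsigned x, int bits)
{
    unsigned r = 0;
    for (int i = 0; i < bits; i++)
        r = (r << 1) | ((x >> i) & 1u);
    return r;
}

/* Exchange bit-reversed row pairs of a rows x cols complex matrix.
 * Only pairs with r < bit_reverse(r) are touched, so each pair is swapped once. */
static void bit_reverse_rows(float complex *a, int rows, int cols, int bits)
{
    float complex tmp[1024];                     /* assumes cols <= 1024 */
    for (unsigned r = 0; r < (unsigned)rows; r++) {
        unsigned s = bit_reverse(r, bits);
        if (s > r) {                             /* swap rows r and s */
            memcpy(tmp,          &a[r * cols], cols * sizeof *tmp);
            memcpy(&a[r * cols], &a[s * cols], cols * sizeof *tmp);
            memcpy(&a[s * cols], tmp,          cols * sizeof *tmp);
        }
    }
}
```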

17  Complex Format
Two common formats for complex data
– interleaved: real 0, imag 0, real 1, imag 1, ...
– split: real 0, real 1, ..., imag 0, imag 1, ...
Interleaved complex format reduces the number of DMAs
SIMD units need split format for complex arithmetic
Complex format for the user should be standard
Internal format conversion is lightweight
Internal format should benefit the algorithm
– Internal format is opaque to the user
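A plain-C sketch of the lightweight format conversion; an SPE version would do this with shuffle instructions on SIMD vectors, and the function names here are illustrative.

```c
/* Convert n interleaved complex values (re0, im0, re1, im1, ...) into
 * split format (all reals, then all imaginaries), and back. */
static void interleaved_to_split(const float *in, float *re, float *im, int n)
{
    for (int i = 0; i < n; i++) {
        re[i] = in[2 * i];
        im[i] = in[2 * i + 1];
    }
}

static void split_to_interleaved(const float *re, const float *im, float *out, int n)
{
    for (int i = 0; i < n; i++) {
        out[2 * i]     = re[i];
        out[2 * i + 1] = im[i];
    }
}
```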

18  Outline
Introduction
Technical Challenges
Design
– Using Memory Well
– Computational Considerations
  – Central Twiddles
  – Algorithm Choice
Performance
Summary

19  Central Twiddles
(Figure: central twiddle matrix for the 1M FFT, w^0 through w^(1023 * 1023).)
Central twiddles can take as much memory as the input data
Reading them from memory could increase FFT time by up to 20%
For 32-bit FFTs central twiddles can be computed as needed
– Trigonometric identity methods require double precision; the next generation Cell should make this the method of choice
– Direct sine and cosine algorithms are long
Central twiddles are a significant part of the design
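A sketch of computing a central twiddle on demand instead of storing a table the size of the data; it uses double-precision libm sine and cosine for clarity, whereas the talk discusses trigonometric-identity and direct polynomial methods, so treat the details as an assumption.

```c
#include <math.h>
#include <complex.h>

/* Central twiddle w^(j*k) for an n-point FFT factored as 1024 x 1024,
 * computed as needed rather than read from a table the size of the data. */
static float complex central_twiddle(int j, int k, long n)
{
    double angle = -2.0 * M_PI * ((double)j * (double)k) / (double)n;
    return (float complex)(cos(angle) + I * sin(angle));
}
```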

20  Algorithm Choice
Computational butterflies
– Cooley-Tukey: inputs a, b; outputs a + b * w^k and a - b * w^k
– Gentleman-Sande: inputs a, b; outputs a + b and (a - b) * w^k
Cooley-Tukey has a constant operation count
– 2 muladds to compute each result for each stage
Gentleman-Sande varies widely in operation count
– 1 to 3 operations for each result
The DC term has the same accuracy in both
The Gentleman-Sande worst-case term has 50% more roundoff error when fused multiply-add is available
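The two butterflies written out as scalar C on split (real, imaginary) data, using fmaf so the operation counts quoted above are visible; this is an accounting sketch, not SPE code.

```c
#include <math.h>

/* Cooley-Tukey: y0 = a + b*w, y1 = a - b*w.
 * Every real result costs exactly 2 fused multiply-adds, independent of k. */
static void ct_butterfly(float ar, float ai, float br, float bi,
                         float wr, float wi,
                         float *y0r, float *y0i, float *y1r, float *y1i)
{
    *y0r = fmaf(br, wr, fmaf(-bi, wi, ar));    /* ar + br*wr - bi*wi */
    *y0i = fmaf(br, wi, fmaf( bi, wr, ai));    /* ai + br*wi + bi*wr */
    *y1r = fmaf(-br, wr, fmaf( bi, wi, ar));   /* ar - br*wr + bi*wi */
    *y1i = fmaf(-br, wi, fmaf(-bi, wr, ai));   /* ai - br*wi - bi*wr */
}

/* Gentleman-Sande: y0 = a + b, y1 = (a - b)*w.
 * y0 costs 1 add per result; y1 costs a subtract plus a complex multiply,
 * i.e. up to 3 operations per result. */
static void gs_butterfly(float ar, float ai, float br, float bi,
                         float wr, float wi,
                         float *y0r, float *y0i, float *y1r, float *y1i)
{
    float dr = ar - br, di = ai - bi;
    *y0r = ar + br;
    *y0i = ai + bi;
    *y1r = fmaf(dr, wr, -di * wi);
    *y1i = fmaf(dr, wi,  di * wr);
}
```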

21  Radix Choice
What counts for accuracy is how many operations lie between the input and a particular result
Cooley-Tukey radix 4:
  t1 = x_l * w^z
  t2 = x_k * w^(z/2)
  t3 = x_m * w^(3z/2)
  s0 = x_j + t1    a0 = x_j - t1
  s1 = t2 + t3     a1 = t2 - t3
  y_j = s0 + s1    y_k = s0 - s1
  y_l = a0 - i * a1    y_m = a0 + i * a1
Number of operations for 1 radix-4 stage (real or imaginary): 9 (3 mul, 3 muladd, 3 add)
Number of operations for 2 radix-2 stages (real or imaginary): 6 (6 muladd)
Higher radices reuse computations but do not reduce the amount of arithmetic needed
Fused multiply-add instructions are more accurate than a multiply followed by an add
Radix 2 will give the best accuracy

22  Outline
Introduction
Technical Challenges
Design
Performance
Summary

23  Estimating Performance
Timing estimates typically cannot include all factors
– Computation time is estimated from the minimum number of instructions
– I/O timings are based on bus speed and the amount of data
– Experience is a guide for the efficiency of I/O and computation
I/O estimate:
  Number of bytes transferred to/from XDR memory: 33560192 (minimum)
  Bus speed to XDR: 25.3 GB/s
  Estimated efficiency: 80%
  Minimum I/O time: 1.7 ms
Computation estimate:
  Number of operations: 104857600
  Maximum GFLOPS: 205
  Estimated efficiency: 85%
  Minimum computation time: 0.6 ms (8 SPEs)
1M FFT estimate (without full ordering): 2 ms
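Not from the talk: the same arithmetic reproduced as a few lines of C so the 1.7 ms and 0.6 ms figures can be checked; the byte and operation counts are taken from the slide, and the slide's 2 ms total assumes some overlap of I/O and computation.

```c
#include <stdio.h>

int main(void)
{
    const double bytes    = 33560192.0;    /* minimum XDR traffic (from the slide)       */
    const double xdr_bw   = 25.3e9;        /* XDR bandwidth, bytes/s                     */
    const double io_eff   = 0.80;
    const double flops    = 104857600.0;   /* ~5 N log2(N) operations for N = 1M points  */
    const double peak     = 205e9;         /* single-precision peak, 8 SPEs, FLOP/s      */
    const double comp_eff = 0.85;

    double io_ms   = bytes / (xdr_bw * io_eff) * 1e3;    /* ~1.7 ms */
    double comp_ms = flops / (peak * comp_eff) * 1e3;    /* ~0.6 ms */
    /* The slide's ~2 ms total assumes I/O and computation partially overlap. */
    printf("minimum I/O %.2f ms, minimum computation %.2f ms\n", io_ms, comp_ms);
    return 0;
}
```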

24  Preliminary Timing Results
Timing results are close to predictions
– 4 SPEs are about a factor of 4 from prediction
– 8 and 16 SPEs are closer to prediction
Timings were performed on Mercury CTES
– QS21 dual Cell blades @ 3.2 GHz

25  Summary
A good FFT design must consider the hardware features
– Optimize memory accesses
– Understand how different algorithms map to the hardware
The design needs to be flexible in its approach
– "One size fits all" isn't always the best choice
– Size will be a factor
Estimates of minimum time should be based on the hardware characteristics
A 1M point FFT is difficult to write, but possible

