High Performance Discrete Fourier Transforms on Graphics Processors
Naga K. Govindaraju, Brandon Lloyd, Yuri Dotsenko, Burton Smith, John Manferdelli
Microsoft Corporation
Discrete Fourier Transforms (DFTs)
– Given an input signal of N values f(n), project it onto a basis of complex exponentials (written out below)
  – Often computed with Fast Fourier Transforms (FFTs) for efficiency
– A fundamental primitive for signal processing
  – Convolutions, cryptography, computational fluid dynamics, large polynomial multiplications, image and audio processing, etc.
– A popular HPC benchmark
  – HPC Challenge benchmark
  – NAS parallel benchmarks
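Written out, the forward transform is the standard projection (the inverse flips the sign of the exponent and scales by 1/N); fast algorithms factor this O(N^2) sum into O(N log N) operations:

```latex
F(k) = \sum_{n=0}^{N-1} f(n)\, e^{-2\pi i\, nk / N}, \qquad k = 0, 1, \ldots, N-1
```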
DFT: Challenges
– HPC Challenge 2008
  – DFT on a Cray XT3: 0.9 TFLOPS
  – HPL on the same machine: 17 TFLOPS
– Complex memory access patterns
  – Limited data reuse
  – For a balanced system, if the compute-to-memory ratio doubles, the cache size must be squared for the system to remain balanced [Kung86]
– Architectural issues
  – Cache associativity, memory banks
GPU: Commodity Processor
– Found in cell phones, consoles, handhelds (PSP), and desktops
Parallelism in GPUs
[Diagram: GPU memory (DRAM) feeds multiple thread processing clusters (TPCs); each TPC contains streaming processors (SPs) with their own local memory.]
GPU Programmability
[Diagram: many TPCs, each with SPs and local memory, attached to GPU DRAM; a thread execution manager schedules thread blocks, each with registers and block-local memory.]
High-level programming abstractions: Microsoft DirectX11, OpenCL, NVIDIA CUDA, AMD CAL, etc.
Discrete Fourier Transforms on GPUs
Objectives:
– Efficiency: achieve high performance by exploiting the memory hierarchy and high parallelism
– Accuracy: design algorithms with numerical accuracy comparable to CPU libraries
– Scalability: demonstrate performance that scales with the underlying hardware capabilities
Focus: single-precision DFTs that fit in GPU memory
– Demonstrate DFT performance of 100-300 GFLOPS per GPU for typical large sizes
– The concepts carry over to double-precision algorithms
FFT Overview
[Diagram: a large FFT decomposed into FFTs along columns, a transpose, and FFTs along rows.]
GPU memory hierarchy: registers (16K per multiprocessor), shared memory (16 KB per multiprocessor), and global memory (1 GB).
There is significant literature on FFT algorithms; see the detailed survey in [Van Loan 92].
DFTs on GPUs: Challenges
– Coalescing: must access contiguous blocks of data to achieve high DRAM bandwidth
– Bank conflicts: affine access patterns can map to the same banks
– Transpose overheads: must reduce memory access overheads
– Occupancy: need many threads in flight to hide memory latency
Outline
– FFT Algorithms
  – Global memory
  – Shared memory
  – Hierarchical
  – Other FFT algorithms
– Experimental Results
– Conclusions and Future Work
Overview
– Global memory algorithm: large N; uses the GPU's high memory bandwidth
– Shared memory algorithm: small N; reuses data in the shared memory of the GPU's multiprocessors
– Hierarchical algorithm: intermediate sizes; combines data transposes with the shared memory algorithm
(A dispatch sketch follows this list.)
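A minimal host-side sketch of how such a selection might look. The function names, signatures, and size cutoffs below are illustrative assumptions, not the library's actual API; in practice the thresholds would come from auto-tuning.

```cuda
#include <cuda_runtime.h>

// Hypothetical entry points for the three algorithms; names and
// signatures are illustrative only.
void fft_shared(float2* d_data, int N, int M);        // whole FFT in shared memory
void fft_hierarchical(float2* d_data, int N, int M);  // N = H*W split, fused transposes
void fft_global(float2* d_data, int N, int M);        // radix-R passes through global memory

// Pick an algorithm by transform size. The cutoffs are made-up
// placeholders; a real library would tune them per GPU (shared memory
// capacity, coalesce width, occupancy).
void fft_dispatch(float2* d_data, int N, int M) {
    const int kSharedMax = 1024;           // largest FFT fitting in one MP's 16 KB
    const int kHierarchicalMax = 1 << 18;  // beyond this, stream through global memory
    if (N <= kSharedMax)
        fft_shared(d_data, N, M);
    else if (N <= kHierarchicalMax)
        fft_hierarchical(d_data, N, M);
    else
        fft_global(d_data, N, M);
}
```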
Global Memory Algorithm
Proceeds in log_R N steps (radix R). Decompose N into B blocks of T threads such that B*T = N/R. Each thread:
– reads R values from global memory
– multiplies them by twiddle factors
– performs an R-point FFT
– writes the R results back to global memory
(A one-pass kernel sketch follows this list.)
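A minimal CUDA sketch of one such pass at radix R = 2, in the Stockham style. A full transform runs log2(N) passes with Ns = 1, 2, 4, ..., N/2, ping-ponging src and dst; the tuned library kernels use higher radices (e.g. R = 8) and more careful indexing, so treat this as an illustration of the structure only.

```cuda
#include <cuda_runtime.h>

// One radix-2 pass of the global memory algorithm: each thread reads
// R = 2 values, twiddles, does a 2-point FFT, and scatters the results.
// Launch with N/2 total threads; Ns is the current subtransform size.
__global__ void fft_radix2_pass(const float2* src, float2* dst,
                                int N, int Ns) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;    // j in [0, N/2)
    if (j >= N / 2) return;

    // Strided reads: stride N/2 exceeds the coalesce width, so these
    // are coalesced.
    float2 a = src[j];
    float2 b = src[j + N / 2];

    // Twiddle the second value: angle = -2*pi*(j mod Ns) / (2*Ns).
    float angle = -2.0f * 3.14159265f * (j % Ns) / (2.0f * Ns);
    float c = cosf(angle), s = sinf(angle);
    float2 bt = make_float2(b.x * c - b.y * s, b.x * s + b.y * c);

    // 2-point FFT (butterfly), then scatter with the Stockham
    // "expand" index: (j / Ns) * 2 * Ns + (j mod Ns).
    int idx = (j / Ns) * Ns * 2 + (j % Ns);
    dst[idx]      = make_float2(a.x + bt.x, a.y + bt.y);
    dst[idx + Ns] = make_float2(a.x - bt.x, a.y - bt.y);
}
```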
Global Memory Algorithm: Coalescing
[Diagram: at step j = 1 with R = 4, threads 0-3 read with stride N/R and write with stride R^j.]
– If N/R > the coalesce width (CW), reads pose no coalescing issues
– If R^j > CW, writes pose no coalescing issues
– If R^j <= CW, first write to shared memory, rearrange the data across threads, then write to global memory with full coalescing (see the sketch below)
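A sketch of that rearrangement for the radix-2 pass above, assuming Ns <= blockDim.x (both powers of two) so that each block's outputs form one contiguous tile; launch with 2 * blockDim.x * sizeof(float2) bytes of dynamic shared memory. Again illustrative, not the library's kernel.

```cuda
#include <cuda_runtime.h>

// Early-pass variant (R^j <= CW): do the scattered butterfly writes in
// shared memory, then flush the block's contiguous tile to global
// memory so every DRAM store is coalesced.
__global__ void fft_radix2_pass_exchange(const float2* src, float2* dst,
                                         int N, int Ns) {
    extern __shared__ float2 tile[];              // 2 * blockDim.x entries
    int t = threadIdx.x;
    int j = blockIdx.x * blockDim.x + t;          // j in [0, N/2)

    float2 a = src[j];
    float2 b = src[j + N / 2];
    float angle = -2.0f * 3.14159265f * (j % Ns) / (2.0f * Ns);
    float c = cosf(angle), s = sinf(angle);
    float2 bt = make_float2(b.x * c - b.y * s, b.x * s + b.y * c);

    // All outputs of this block land in one tile of 2*blockDim.x
    // elements starting at 'base', so the scatter stays on-chip.
    int base = 2 * blockIdx.x * blockDim.x;
    int idx  = (j / Ns) * Ns * 2 + (j % Ns) - base;
    tile[idx]      = make_float2(a.x + bt.x, a.y + bt.y);
    tile[idx + Ns] = make_float2(a.x - bt.x, a.y - bt.y);
    __syncthreads();

    // Contiguous, fully coalesced stores.
    dst[base + t]              = tile[t];
    dst[base + t + blockDim.x] = tile[t + blockDim.x];
}
```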
Shared Memory Algorithm
Applied when the FFT is computed on data resident in a multiprocessor's shared memory. Each block has N*M/R threads:
– M is the number of FFTs performed together in one block
– Each multiprocessor performs M FFTs at a time
Similar to the global memory algorithm, but uses the Stockham formulation to reduce compute overheads.
Shared Memory Algorithm: Bank Conflicts
[Diagram: at step j = 1 with R = 4, threads 0-3 read with stride N/R and write with stride R^j.]
– If N/R > numbanks, reads cause no bank conflicts
– If R^j > numbanks, writes cause no bank conflicts
Shared Memory Algorithm: Padding
If R^j <= numbanks, add padding to avoid bank conflicts.
[Diagram: threads 0-7 write with a small stride; without padding, their accesses land on banks 0, 4, 8, 12, 0, 4, 8, ... and collide.]
Shared Memory Algorithm: Accounting for Bank Conflicts
– Needed for radices that are not relatively prime with numbanks
– Add padding for steps j <= log_R(numbanks)
– Requires more indexing arithmetic and slightly more shared memory, but improves performance for higher radices
Need a sufficient number of threads to hide latency:
– Perform multiple FFTs together when the FFTs are small
(A padding sketch follows.)
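One common padding scheme, shown as a sketch: insert one unused slot after every numbanks entries, so addresses that previously collided on a bank spread across banks. The constant and helper below are illustrative; the paper's exact padding arithmetic may differ.

```cuda
// 16 banks on G80/GT200-class hardware; the shared array then needs
// FFT_SIZE + FFT_SIZE / NUM_BANKS entries instead of FFT_SIZE.
#define NUM_BANKS 16

__device__ __forceinline__ int padded(int i) {
    return i + i / NUM_BANKS;   // one extra slot per NUM_BANKS entries
}

// Usage inside a shared memory FFT step (sketch): apply the padding
// only for the early steps j <= log_R(NUM_BANKS), where the write
// stride R^j would otherwise map several values to the same bank.
//   __shared__ float2 buf[FFT_SIZE + FFT_SIZE / NUM_BANKS];
//   buf[padded(out_index + r * Ns)] = v[r];    // conflict-free writes
//   __syncthreads();
//   v[r] = buf[padded(in_index + r * stride)]; // conflict-free reads
```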
Hierarchical FFT
Decompose the FFT into smaller FFTs, which
– are evaluated efficiently with the shared memory algorithm,
– combine transposes with the FFT computation, and
– achieve memory coalescing.
Hierarchical FFT: First Stage
[Diagram: the signal is viewed as a matrix with W = N/H columns; each multiprocessor streams a CW-wide tile from DRAM into shared memory and performs CW FFTs of size H.]
Hierarchical FFT: Second Stage
[Diagram: with W = N/H, perform the remaining H FFTs of size W recursively, then transpose.]
– In-place algorithm
– The final set of transposes can also be combined with the FFT computation
(A host-side reference of the decomposition follows.)
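A host-side reference of the decomposition, written as plain unoptimized loops so the structure is easy to check: with N = H * W, stage one runs W size-H DFTs over strided columns and applies the inter-stage twiddle factors; stage two runs H size-W DFTs and stores its output transposed. Naive O(N(H+W)) DFT loops stand in here for the shared memory kernels; the fused-transpose GPU version is what the library actually runs.

```cuda
#include <cuda_runtime.h>
#include <math.h>
#include <stdlib.h>

static float2 cadd(float2 a, float2 b) { return make_float2(a.x + b.x, a.y + b.y); }
static float2 cmul(float2 a, float2 b) {
    return make_float2(a.x * b.x - a.y * b.y, a.x * b.y + a.y * b.x);
}
static float2 unit(float angle) { return make_float2(cosf(angle), sinf(angle)); }

// Reference decomposition: y[k] = DFT_N(x)[k] with N = H * W,
// input index n = W*h + w and output index k = hp + H*wp.
void fft_hierarchical_reference(const float2* x, float2* y, int H, int W) {
    const float PI = 3.14159265358979f;
    int N = H * W;
    float2* tmp = (float2*)malloc(N * sizeof(float2));
    // Stage 1: W column DFTs of size H, then twiddle by w_N^(w*hp).
    for (int w = 0; w < W; w++)
        for (int hp = 0; hp < H; hp++) {
            float2 acc = make_float2(0.0f, 0.0f);
            for (int h = 0; h < H; h++)
                acc = cadd(acc, cmul(x[W * h + w],
                                     unit(-2.0f * PI * ((h * hp) % H) / H)));
            tmp[W * hp + w] = cmul(acc, unit(-2.0f * PI * (w * hp) / N));
        }
    // Stage 2: H row DFTs of size W; the transposed store (hp + H*wp)
    // is the final transpose, fused here into the write.
    for (int hp = 0; hp < H; hp++)
        for (int wp = 0; wp < W; wp++) {
            float2 acc = make_float2(0.0f, 0.0f);
            for (int w = 0; w < W; w++)
                acc = cadd(acc, cmul(tmp[W * hp + w],
                                     unit(-2.0f * PI * ((w * wp) % W) / W)));
            y[hp + H * wp] = acc;
        }
    free(tmp);
}
```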
Other FFTs
– Non-power-of-two sizes
  – Mixed radix, for sizes built from powers of 2, 3, 5, etc.
  – Bluestein's FFT, for sizes with large prime factors (see the identity below)
– Multi-dimensional FFTs: perform FFTs independently along each dimension
– Real FFTs: exploit symmetry to improve performance; transformed into a complex FFT problem
– DCTs: computed via a transformation to a complex FFT problem
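For the large-prime-factor sizes, Bluestein's algorithm rests on the identity nk = (n^2 + k^2 - (k - n)^2) / 2, which turns the DFT into a convolution; the convolution can be zero-padded to a convenient power-of-two length and evaluated with the fast algorithms above:

```latex
F(k) = e^{-\pi i k^2 / N} \sum_{n=0}^{N-1} \left[ f(n)\, e^{-\pi i n^2 / N} \right] e^{\pi i (k - n)^2 / N}
```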
Microsoft DFT Library
Key features supported in our GPU DFT library:
  Dimensions:  1D, 2D, 3D
  Algorithms:  shared memory, global memory, hierarchical
  Data types:  single precision; real and complex
  Sizes:       2^a, 3^b, etc.; mixed radix; large prime factors; multiple transforms
  Runtime:     auto-tuning, virtualization
Outline
– FFT Algorithms
  – Global memory
  – Shared memory
  – Hierarchical
  – Other FFT algorithms
– Experimental Results
– Conclusions and Future Work
Experimental Methodology: Hardware
– Intel QX9650 3.0 GHz quad-core processor (two dual-core dies; each pair of cores shares a 6 MB L2 cache)
– NVIDIA GTX280 GPU (driver version 177.41)

  Name      Multiprocessors  Shader clock (MHz)  Peak perf. (GFLOPS)  Memory (MB)  Peak bandwidth (GiB/s)
  GTX280    30               1300                930                  1024         140
  8800 GTX  16               1300                520                  768          80
  8800 GTS  16               1625                620                  512          60
Experimental Methodology: Libraries
– Our FFT library, written in CUDA; tested on various GPUs
– NVIDIA's CUFFT library (v. 1.1); results for the GTX280 only
– The DX9FFT library [Lloyd et al. 2007]; results for the GTX280 only
– Intel's MKL (v. 10.0.2); run on the CPU with 4 threads
Experimental Methodology: Metrics
Notation: N is the size of the FFT and M is the number of FFTs.
Performance: GFLOPS = M * 5N log2(N) / time, using the minimum time over multiple runs (with warm caches on the CPU).
Accuracy: perform a forward transform followed by the inverse, compare the result with the original input, and report the root mean square error (RMSE) divided by 2.
1D Single FFT (M = 1)
[Performance chart.]
1D Multi-FFT (M = 2^23 / N)
[Performance chart: GFLOPS vs. log2 N for our library on the GTX280 with driver 177.11, our library on the GTX280, MKL, our library on the 8800GTS, CUFFT, and DX9FFT; annotation: "Entire FFT in shared memory kernel".]
1D Multi-FFT (M = 2^23 / N)
[Performance chart with speedup callouts of 40x, 20x, and 5x.]
1D Mixed Radix (N = 2^a 3^b 5^c, M = 2^23 / N)
[Performance chart.]
1D Primes (M = 2^20 / N)
[Performance chart.]
1D Large Primes (M = 2^22 / N)
[Performance chart.]
RMSE (N = 2^a)
[Accuracy chart.]
RMSE (mixed radix)
[Accuracy chart.]
RMSE (primes)
[Accuracy chart.]
Limitations
Current implementation:
– Works only on data already resident in GPU memory
– No multi-GPU support
– No double-precision support
Hardware issues:
– Large data sizes are needed to fully utilize the GPU
– Data transfer between the GPU and system memory is slow
– High-accuracy twiddle factors are slow to compute; use a table instead (especially for double precision)
– Block indices must be virtualized by hand (fixed in Microsoft DirectX11)
Outline
– FFT Algorithms
  – Global memory
  – Shared memory
  – Hierarchical
  – Other FFT algorithms
– Experimental Results
– Conclusions and Future Work
Conclusions
– Several algorithms for performing FFTs on GPUs
  – Different sizes are handled efficiently; the library chooses the appropriate algorithm for a given size and hardware configuration
  – Optimized for memory performance: transposes are combined with the FFT computation
  – Numerical accuracy issues are addressed
– High performance
  – Up to 300 GFLOPS on current high-end GPUs
  – Significantly faster than existing GPU-based and CPU-based libraries for typical large sizes
Future Work
– More sophisticated auto-tuning
– Additional functionality: double precision, multi-GPU support, out-of-core support for very large FFTs
– Port to DirectX11 using compute shaders
Future of GPGPU
GPUs are becoming more general purpose
– Fewer limitations in the Microsoft DirectX11 API: IEEE floating-point support with optional double precision, integer instruction support, more programmable stages, etc.
– Significant advances in performance
– Higher-level programming languages
– A uniform abstraction layer over different hardware vendors
Future of GPGPU
– Widespread adoption of GPUs in commercial applications: image and media processing, signal processing, finance, etc.
– High-performance computing can benefit from data-parallel programming; many opportunities
– Microsoft GPU Station at booth number 1309
Acknowledgments
Microsoft: Chas Boyd, Craig Mundie, Ken Oien
NVIDIA: Henry Moreton, Sumit Gupta, and David
Peter-Pike Sloan
Vasily Volkov
Questions?
Contact: nagag@microsoft.com, brandon.lloyd@microsoft.com