Accelerating Spherical Harmonic Transforms on the NVIDIA® GPGPU
ECE 734 Project
Vikrant Soman
Agenda
  Problem statement
  Motivation
  Introduction to SHTs: analysis and synthesis
  Overview of GPU architecture
  CPU-GPU implementation
  Results
  Conclusions and future work
  References and acknowledgements
Problem Statement
  Spherical harmonic transforms (SHTs) are a critical computational kernel in numerical weather prediction, climate modeling, and other global geopotential applications.
  Satellite resolution is improving, so enormous global datasets of very high degree and order are becoming available.
Motivation
  The computational aspects of SHTs have become challenging and time consuming.
  Growing datasets make SHTs more DATA INTENSIVE and SLOWER!
  No one appears to have tried using a GPU for SHTs before: try a Google search for “Spherical Harmonic Transforms on GPU”!
Spherical Harmonic Transforms
  Spherical harmonic transforms (SHTs) are essentially Fourier transforms on the sphere.
  An SHT consists of an analysis step and a synthesis step.
  Analysis: projects grid point data on the sphere onto the spectral modes.
  Synthesis: the inverse transform, which reconstructs grid point data from the spectral information.
Analysis
  FFT of grid point data along longitudes (F) * Gaussian weights (G) -> Legendre polynomial functions -> spectral values (S)
Synthesis
  Spectral values (X) -> Legendre polynomial functions -> compute IFFT and normalize results
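For reference, the two flows above correspond to the usual Gaussian-grid formulation, sketched here in LaTeX. F, G, S and X follow the labels on the slide; the symbols f, \lambda_i, \mu_j, I, J, N and the 1/I scaling are notational assumptions not taken from the slides, P_n^m stands for the normalized associated Legendre function, and normalization conventions vary between implementations.

```latex
% Analysis: Fourier transform along longitude, then Gaussian quadrature in latitude
F^m(\mu_j) = \frac{1}{I} \sum_{i=0}^{I-1} f(\lambda_i, \mu_j)\, e^{-\mathrm{i} m \lambda_i},
\qquad
S_n^m = \sum_{j=1}^{J} G_j \, F^m(\mu_j)\, P_n^m(\mu_j)

% Synthesis: Legendre sum for each wavenumber m, then inverse Fourier transform
F^m(\mu_j) = \sum_{n=|m|}^{N} X_n^m \, P_n^m(\mu_j),
\qquad
f(\lambda_i, \mu_j) = \sum_{m=-N}^{N} F^m(\mu_j)\, e^{\mathrm{i} m \lambda_i}
```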
GPU architecture - Overview
  The GPU provides 4 types of memory:
    Global (device)
    Shared
    Constant
    Texture
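CUDA
  CUDA extends C by allowing the programmer to define C functions, called kernels.
  When called, a kernel is executed N times in parallel by N different CUDA threads, as opposed to only once like a regular C function.
  // Kernel definition
  __global__ void vecAdd(float* A, float* B, float* C)
  {
      int i = threadIdx.x;   // index of this thread within the block
      C[i] = A[i] + B[i];    // each thread adds one pair of elements
  }
  int main()
  {
      // Kernel invocation: 1 block of N threads
      // (A, B, C are device pointers and N the vector length, set up elsewhere)
      vecAdd<<<1, N>>>(A, B, C);
  }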
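To make the execution model concrete, here is a minimal CPU-side emulation in Python (not CUDA, so it runs without a GPU). The names `launch_1d` and `vec_add` are hypothetical helpers invented for this sketch. A real launch runs the threads in parallel, whereas this loop runs them one after another.

```python
def vec_add(thread_idx, a, b, c):
    """Body of the kernel: each thread handles exactly one element."""
    c[thread_idx] = a[thread_idx] + b[thread_idx]


def launch_1d(kernel, n_threads, *args):
    """Emulate kernel<<<1, n_threads>>>(*args) sequentially.

    Hypothetical helper: on a GPU the n_threads calls would run in
    parallel; here they run in a plain loop, which gives the same result
    because no thread depends on another.
    """
    for thread_idx in range(n_threads):
        kernel(thread_idx, *args)


a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
c = [0.0] * len(a)
launch_1d(vec_add, len(a), a, b, c)
print(c)
```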
One of the best parts of the GPGPU: heterogeneous programming
  Accelerates BLAS operations.
  Allows the CPU-GPU architecture used in this project.
Implementation Details
  Exploit the heterogeneous programming model.
  CPU code is implemented in MATLAB.
  Identify the data-intensive loops in the code.
  Map the loop indexing to the GPGPU architecture to exploit parallelism.
  Offload the computation to the GPU and retrieve the results back to the CPU.
Part of the kernel program
  Kernel fragment (CUDA):
    AS(ty, tx) = A[k*wA*wA + aBegin + wA * ty + tx];
    BS(ty, tx) = B[bBegin_x + wB * ty + tx];
    Csub(ty, tx) = 0;
    // Synchronize to make sure the matrices are loaded
    __syncthreads();
    Csub(ty, tx) = AS(ty, tx) * BS(ty, tx);
    int c = bx*BLOCK_SIZE + by*BLOCK_SIZE*BLOCK_SIZE*(wA/BLOCK_SIZE);
    A[k*wA*wA + c + tx + ty*wA] = Csub(ty, tx);
  Loop mapped to GPU (MATLAB):
    for n = 0:nn
      Pn = (legendre(n,yg))'; % Note error in Matlab normalization
      for m = 0:n
        Nmn = (-1)^m * sqrt((2*n+1)/2 * factorial(n-m)/factorial(n+m));
        P(1:njo2,n+1,m+1) = Nmn*Pn(1:njo2,m+1);
      end
    end
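The index arithmetic in the kernel fragment can be checked on the CPU. The Python sketch below assumes a row-major matrix whose width wA and height hA are exact multiples of BLOCK_SIZE, and it leaves out the `k*wA*wA` slab offset. `tile_origin`, `element_index`, `covered_indices` and the value of BLOCK_SIZE are assumptions made for this example. Under those assumptions `c` is the linear index of the top-left element of tile (bx, by), and every (block, thread) pair writes a distinct element.

```python
BLOCK_SIZE = 4  # assumed tile edge; the slides do not state the value used


def tile_origin(bx, by, wA):
    """Mirror of: c = bx*BLOCK_SIZE + by*BLOCK_SIZE*BLOCK_SIZE*(wA/BLOCK_SIZE)."""
    return bx * BLOCK_SIZE + by * BLOCK_SIZE * BLOCK_SIZE * (wA // BLOCK_SIZE)


def element_index(bx, by, tx, ty, wA):
    """Mirror of the store index c + tx + ty*wA (without the k*wA*wA offset)."""
    return tile_origin(bx, by, wA) + tx + ty * wA


def covered_indices(wA, hA):
    """Collect the index written by every thread of every block."""
    indices = []
    for by in range(hA // BLOCK_SIZE):
        for bx in range(wA // BLOCK_SIZE):
            for ty in range(BLOCK_SIZE):
                for tx in range(BLOCK_SIZE):
                    indices.append(element_index(bx, by, tx, ty, wA))
    return indices


wA, hA = 8, 12
indices = covered_indices(wA, hA)
# True when each of the wA*hA elements is written by exactly one thread.
print(sorted(indices) == list(range(wA * hA)))
```

Because no two threads write the same element, the loop iterations are independent, which is what makes this kind of loop a candidate for offloading.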
Legendre polynomial calculation
  Offload the data-intensive operation to the GPU.
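As an illustration of the work being offloaded, the normalization factor Nmn from the MATLAB loop can be written as a small standalone function. This is a Python sketch for checking values, not the project's GPU code, and `legendre_norm` is a hypothetical name.

```python
import math


def legendre_norm(n, m):
    """Return Nmn = (-1)^m * sqrt((2n+1)/2 * (n-m)!/(n+m)!) for 0 <= m <= n."""
    if not 0 <= m <= n:
        raise ValueError("require 0 <= m <= n")
    ratio = math.factorial(n - m) / math.factorial(n + m)
    sign = -1.0 if m % 2 else 1.0
    return sign * math.sqrt((2 * n + 1) / 2 * ratio)


# Triangular table of factors for degrees 0..2, mirroring the nested n/m loops.
for n in range(3):
    print([round(legendre_norm(n, m), 4) for m in range(n + 1)])
```

The factorial ratio shrinks very quickly as n and m grow, so for high degrees a recurrence is usually preferred over evaluating factorials directly.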
Analysis step
  Compute the FFT on the CPU side.
  MATLAB has a highly optimized FFT operation.
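The Fourier step of the analysis acts independently on each latitude ring. As a rough illustration, the Python sketch below applies a naive O(N^2) DFT to one ring. It is a stand-in for MATLAB's optimized `fft`, and the function name `ring_dft` and the 1/N scaling are assumptions made for this example.

```python
import cmath
import math


def ring_dft(samples):
    """Naive DFT of equally spaced longitude samples, scaled by 1/N."""
    n = len(samples)
    coeffs = []
    for m in range(n):
        total = 0j
        for i, value in enumerate(samples):
            total += value * cmath.exp(-2j * math.pi * m * i / n)
        coeffs.append(total / n)
    return coeffs


# One latitude ring holding a pure wavenumber-2 signal, cos(2*lambda).
n_lon = 8
ring = [math.cos(2 * 2 * math.pi * i / n_lon) for i in range(n_lon)]
coeffs = ring_dft(ring)
print([round(abs(c), 6) for c in coeffs])
```

For a real-valued ring the energy of wavenumber m appears in both coefficient m and coefficient N - m, so only the first half of the spectrum needs to be carried into the Legendre step.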
Synthesis step
  The IFFT is again computed on the CPU.
  A GPU FFT pays off only for very large transforms (more than about 10,000 points).
CPU side: Dell, Intel Quad Core @ 2.5 GHz, 2.5 GB RAM
GPU: NVIDIA® 8800 GT
CPU-side code written in MATLAB
GPU code written with NVMEX, the MATLAB extension provided by NVIDIA®
CPU-GPU interfacing via a plug-in for MATLAB
Results
  For a grid size of 512, a speedup of almost 42x!
  The speedup trends upward for larger grid sizes.
  Not much speedup for the analysis kernel; the values are comparable, though.
Conclusions and Future work
  Improves the on-the-fly Legendre polynomial calculation.
  Good speedup overall.
  Errors are low (less than 1e-10 on average).
  Need to look into performance for larger grid sizes.
  Complete the synthesis step results.
  Possible exchange of ideas with a PhD student at SMU, Dallas.
References
  1. Drake, J. B., Worley, P., and D’Azevedo, E. 2008. Algorithm 888: Spherical harmonic transform algorithms. ACM Trans. Math. Softw. 35, 3, Article 23 (October 2008).
  2. Akshara Kaginalkar, Sharad Purohit. Benchmarking of Medium Range Weather Forecasting Model on PARAM - A parallel machine. Center for Development of Advanced Computing (C-DAC), Pune University Campus, Pune 411007, India.
  3. Martin J. Mohlenkamp. A Fast Transform for Spherical Harmonics. The Journal of Fourier Analysis and Applications, 1999.
  4. Huadong Xiao, Yang Lu. Parallel computation for spherical harmonic synthesis and analysis. Computers & Geosciences, Volume 33, Issue 3, March 2007.
  5. NVIDIA CUDA Programming Guide 2.0.
Acknowledgements
  Special thanks to Prof. Dan Negrut and Makarand Datar, UW Mech department, for access to their GPU machines.