Computing Spherical Harmonic Transforms on CUDA-Compatible GPUs Wangqun Lin, Fengshun Lu College of Computer National University of Defense Technology.

1 Computing Spherical Harmonic Transforms on CUDA-Compatible GPUs
Wangqun Lin, Fengshun Lu
College of Computer, National University of Defense Technology
CACHES 2011, Tucson, Arizona, June 4th, 2011

2 Outline
• Motivation
• Spherical Harmonic Transforms (SHT)
• Methods
  • Direct Method
  • Efficiency of Thread Utilization
  • Reshaped Method
  • Concurrent Kernel Execution
• Experiments

3 Motivation
• Computing the S.H.T. with GPUs
  • S.H.T. is widely used
  • But it has a complexity of O(N^3)
  • GPUs are powerful
• Performance metrics at the SM level
  • Existing metrics emphasize only OCCUPANCY
  • Goal: find another metric that measures how efficiently the launched threads are used

4 Spherical Harmonic Transforms (1/2)
ξ: state variable
ξ_n^m: spectral coefficients of the state variable ξ
μ: Gaussian latitude
λ: longitude
M: model truncation wavenumber
N(m): highest degree of the associated Legendre function for wavenumber m
P_n^m(μ) e^{imλ}: spherical harmonic functions, where P_n^m(μ) is the associated Legendre function

5 Spherical Harmonic Transforms (2/2)
• Forward Fourier
• Forward Legendre
• Inverse Legendre
• Inverse Fourier
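The equations for these four stages did not survive in the transcript. As a reference sketch, the block below writes them in the form commonly used for spectral shallow water models, using the symbols defined on the previous slide. The grid sizes I (longitudes) and J (latitudes), the grid points λ_i and μ_j, and the Gaussian weights w_j are assumptions introduced here, and the normalization may differ from the one used on the original slides.

```latex
% Spectral representation of the state variable
\xi(\lambda,\mu) = \sum_{m=-M}^{M} \sum_{n=|m|}^{N(m)} \xi_n^m \, P_n^m(\mu) \, e^{im\lambda}

% Forward Fourier: grid point values -> Fourier coefficients at latitude \mu_j
\xi^m(\mu_j) = \frac{1}{I} \sum_{i=1}^{I} \xi(\lambda_i,\mu_j) \, e^{-im\lambda_i}

% Forward Legendre: Fourier coefficients -> spectral coefficients (Gaussian quadrature)
\xi_n^m = \sum_{j=1}^{J} \xi^m(\mu_j) \, P_n^m(\mu_j) \, w_j

% Inverse Legendre: spectral coefficients -> Fourier coefficients
\xi^m(\mu_j) = \sum_{n=|m|}^{N(m)} \xi_n^m \, P_n^m(\mu_j)

% Inverse Fourier: Fourier coefficients -> grid point values
\xi(\lambda_i,\mu_j) = \sum_{m=-M}^{M} \xi^m(\mu_j) \, e^{im\lambda_i}
```

The two Legendre stages are the ones the remaining slides map onto CUDA threads; the constraint m ≤ n in those slides corresponds to the lower summation limit n = |m| for non-negative m.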

6 Methods – Direct (1/9)
• Forward Legendre
• m ≤ n
[Figure: mapping of the computation onto CUDA threads and thread blocks]

7 Methods – Direct (2/9)
• Inverse Legendre
• m ≤ n
[Figure: CUDA threads of block j]

8 Methods – ETU Metric (3/9)
• Efficiency of Thread Utilization (ETU)
  • Measures the proportion of launched threads doing useful work during the entire execution interval
  • Mainly used as an algorithm design guideline
• Assumption
  • Algorithms consist of many micro steps
• tu(t, s) function
  • t: thread
  • s: micro step

9 Methods – ETU (4/9)
Algorithm 2: Direct Inverse Legendre Transform (DILT)
Input: ξ_n^m, P_n^m, J, M
Output: ξ^m
Execution configuration: (J, M+1)
Declaration: tid, bid, fc_sh(M+1)            // fc_sh: shared memory
  initialize fc_sh(tid) to null;             // 1 m_s (micro step)
  for n = 0 to M do                          // M+1 m_s
    if tid ≤ n then
      fc_sh(tid) += ξ_n^tid × P_n^tid(μ_bid);
    end if
  end for
  ξ^tid(μ_bid) = fc_sh(tid);                 // 1 m_s
• ETU Metric
• Example
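The micro-step counts annotated in Algorithm 2 can be turned into an ETU value. The following is a minimal CPU-side sketch, assuming ETU is the average of tu(t, s) over all launched threads and micro steps of one block, with tu(t, s) = 1 when thread t does useful work in micro step s and 0 otherwise. This formalization and the names `tu_dilt` and `etu_dilt` are assumptions made for illustration; the slide's own ETU formula and example were not preserved in the transcript.

```python
from fractions import Fraction


def tu_dilt(tid, step, M):
    """Hypothetical tu(t, s) for one block of Algorithm 2.

    Steps are numbered 0 .. M+2:
      step 0        initialise shared memory (every thread works)
      steps 1..M+1  loop iteration n = step - 1 (only tid <= n works)
      step M+2      write the result back (every thread works)
    """
    if step == 0 or step == M + 2:
        return 1
    n = step - 1
    return 1 if tid <= n else 0


def etu_dilt(M):
    """Fraction of (thread, micro step) slots doing useful work in one block."""
    threads = M + 1          # block size from the execution configuration
    steps = M + 3            # 1 + (M + 1) + 1 micro steps
    useful = sum(tu_dilt(t, s, M) for t in range(threads) for s in range(steps))
    return Fraction(useful, threads * steps)


if __name__ == "__main__":
    for M in (213, 341):
        value = etu_dilt(M)
        print(f"M={M}: ETU = {value} = {float(value):.4f}")
```

Under these assumptions the count simplifies to ETU = (M+6) / (2(M+3)), which tends to 1/2 as M grows. That is in line with the ETU ≈ 1/2 quoted later in the slides for the triangular layout before reshaping.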

10 Methods – Reshaped (5/9)
• Forward Legendre
[Figure: reshaping the thread layout; ETU ≈ 1/2 before the reshape, ETU ≈ 1 after]
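The reshaping itself is shown only as a figure, so the exact layout used by the authors is not recoverable from the transcript. As an illustration of why a reshape can lift ETU from about 1/2 to about 1, the sketch below folds the triangular index set m ≤ n into a rectangle by pairing the long column m = t with the short column m = M - t. This pairing, the restriction to odd M, and the names `triangle` and `folded_schedule` are assumptions for illustration and may differ from the authors' scheme.

```python
def triangle(M):
    """All (m, n) index pairs with 0 <= m <= n <= M."""
    return {(m, n) for m in range(M + 1) for n in range(m, M + 1)}


def folded_schedule(M):
    """Assign the triangle to (M + 1) / 2 threads with equal work.

    rows[t] lists the (m, n) pairs handled by folded thread t, one per
    micro step. Thread t takes column m = t (M + 1 - t items) followed by
    column m = M - t (t + 1 items), so every thread gets M + 2 items.
    """
    if M % 2 == 0:
        raise ValueError("this sketch assumes an odd M, so columns pair up evenly")
    rows = []
    for t in range((M + 1) // 2):
        long_part = [(t, n) for n in range(t, M + 1)]
        short_part = [(M - t, n) for n in range(M - t, M + 1)]
        rows.append(long_part + short_part)
    return rows


def loop_etu(rows):
    """Useful slots divided by launched slots when all threads run the longest row."""
    steps = max(len(r) for r in rows)
    return sum(len(r) for r in rows) / (len(rows) * steps)


if __name__ == "__main__":
    M = 213  # T213 truncation
    direct = [[(m, n) for n in range(m, M + 1)] for m in range(M + 1)]
    folded = folded_schedule(M)
    print("direct loop ETU:", round(loop_etu(direct), 4))
    print("folded loop ETU:", loop_etu(folded))
```

In a real kernel each folded thread would need separate accumulators for its two wavenumbers; the sketch only tracks which thread touches which (m, n) pair.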

11 Methods – Reshaped (6/9)
• Inverse Legendre
• T213 model
[Figure: reshape]

12 Methods – Reshaped (7/9)
• Inverse Legendre
• T213 model
[Figure: reconstruct]

13 Methods – Reshaped (8/9)
• Inverse Legendre
• T213 model
• Computation for trapezia α and β

14 Methods – Concurrent Kernel (9/9)
• Concurrent Kernel Execution
  • Supported by Fermi and later architectures
  • Programs with many small kernels can be executed efficiently on GPUs
  • Takes future software scalability into consideration
• T213 model

         Concurrent Forward Legendre          Concurrent Inverse Legendre
Kernel   n            Grid size  Block size   m            Grid size  Block size
1        [0, 53]      54         64           [0, 53]      320        64
2        [54, 117]    64         128          [54, 117]    320        64
3        [118, 213]   96         224          [118, 213]   320        96
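The launch configurations in the table follow a simple pattern, which the sketch below reproduces. It assumes that block sizes are rounded up to a multiple of the warp size (32), that the forward kernels use one block per degree n with threads covering m = 0..n_max, and that the inverse kernels use one block per latitude (J = 320, taken from the table's grid size) with one thread per wavenumber in the range. These rules are inferred from the numbers in the table and are not stated on the slide; the function names are hypothetical.

```python
WARP = 32
RANGES = [(0, 53), (54, 117), (118, 213)]   # index ranges from the table (T213)


def round_up(x, multiple=WARP):
    """Smallest multiple of `multiple` that is >= x."""
    return -(-x // multiple) * multiple


def forward_config(n_lo, n_hi):
    """(grid size, block size): one block per n, threads for m = 0..n_hi."""
    return (n_hi - n_lo + 1, round_up(n_hi + 1))


def inverse_config(m_lo, m_hi, J=320):
    """(grid size, block size): one block per latitude, one thread per m in range."""
    return (J, round_up(m_hi - m_lo + 1))


def launched_threads(configs):
    """Total number of threads launched over a list of (grid, block) pairs."""
    return sum(grid * block for grid, block in configs)


if __name__ == "__main__":
    split = [forward_config(lo, hi) for lo, hi in RANGES]
    single = [forward_config(0, 213)]
    useful = sum(n + 1 for n in range(214))   # threads with m <= n
    print("forward kernels:", split)
    print("inverse kernels:", [inverse_config(lo, hi) for lo, hi in RANGES])
    print("useful / launched, single kernel:", round(useful / launched_threads(single), 3))
    print("useful / launched, three kernels:", round(useful / launched_threads(split), 3))
```

Under these assumptions, splitting the forward transform into three kernels cuts the number of launched threads from 47936 to 33152 for the same 23005 useful ones, which suggests one reason the split can help beyond concurrency alone.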

15 Experiments (1/4)
• Validation of ETU metric
  • T341 model
  • Variable block size
• Observations
  • In general, a larger ETU indicates better performance
  • No direct relationship appears between OCCUPANCY and performance
  • The same OCCUPANCY does not imply equal performance
  • At the same OCCUPANCY, a larger ETU gives better performance

BS (block size)   ETU      OCCUPANCY   Time (ms)
96                0.8039   0.312       1.975
128               0.7480   0.417       2.239
160               0.7831   0.417       2.038
192               0.6519   0.625       2.198
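The observations can be checked directly against the four measurements in the table. The short sketch below does so; the data rows are copied from the slide, while the helper names are hypothetical.

```python
# Rows copied from the slide: (block size, ETU, OCCUPANCY, time in ms)
ROWS = [
    (96, 0.8039, 0.312, 1.975),
    (128, 0.7480, 0.417, 2.239),
    (160, 0.7831, 0.417, 2.038),
    (192, 0.6519, 0.625, 2.198),
]


def fastest(rows):
    """Row with the smallest execution time."""
    return min(rows, key=lambda r: r[3])


def same_occupancy_pairs(rows):
    """All pairs of rows that share the same OCCUPANCY value."""
    pairs = []
    for i, a in enumerate(rows):
        for b in rows[i + 1:]:
            if a[2] == b[2]:
                pairs.append((a, b))
    return pairs


def larger_etu_is_faster(a, b):
    """True if the row with the larger ETU also has the smaller time."""
    hi, lo = (a, b) if a[1] > b[1] else (b, a)
    return hi[3] < lo[3]


if __name__ == "__main__":
    print("fastest block size:", fastest(ROWS)[0])
    for a, b in same_occupancy_pairs(ROWS):
        print(f"BS {a[0]} vs BS {b[0]}: larger ETU faster?", larger_etu_is_faster(a, b))
```

The fastest configuration (block size 96) has the highest ETU and the lowest OCCUPANCY, while the highest OCCUPANCY (block size 192) is not the fastest. The ordering by ETU is not perfectly monotone across all four rows, which matches the slide's hedged wording.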

16 Experiments (2/4)
• Performance
[Figure: performance results, panels "Forward Legendre" and "Inverse Legendre"]

17 Experiments (3/4)
• Case Study: STSWM
  • A global shallow water model based on S.H.T.
  • Exhibits many mathematical and computational properties of more complete models
  • Used to investigate and compare numerical methods for simulating atmospheric models
• T213 truncation
  • Forward Legendre: ftrnve, ftrndi and ftrnpi
  • Inverse Legendre: shtrns

18 Experiments (4/4)
• Case Study: STSWM

19 Review
• Motivation
• Spherical Harmonic Transforms
• Methods
  • Direct Method
  • Efficiency of Thread Utilization
  • Reshaped Method
  • Concurrent Kernel Execution
• Experiments
