1
GPU Programming Techniques for Plasma Simulations
Peng Wang, Ph.D., HPC Developer Technology, NVIDIA
2
Overview
- GPU computing introduction
- GPU programming basics
- GTC GPU porting
3
Overview
- GPU computing introduction
- GPU programming basics
- GTC GPU porting
4
New Era of Computing: Parallel Computing
- The general-purpose single processor has hit the power wall: single-core performance has stopped its steady growth since about 2004. The free lunch is over.
- To get higher performance, one has to do parallel computing.
- In 2007, NVIDIA began releasing a new brand of GPU called "Tesla" specifically for parallel computing. It is still named a GPU only for historical reasons.
5
Why Tesla GPU: The Performance Gap Widens Further
[Chart: peak GFLOPS and memory bandwidth over time, NVIDIA GPU vs. x86 CPU. Fermi-generation Tesla: 8x double precision, ECC, L1/L2 caches, 1 TF single precision, 4 GB memory.]
The performance gap keeps increasing, in both peak GFLOPS and memory bandwidth. If you had bet on GPUs in 2006, your application performance would have been 2x, and with Fermi it will be almost 2x again. With CPUs you would have seen very modest performance increases. Also, as Nehalem goes from 4 cores to 6 cores, the memory bandwidth per core decreases, choking the Nehalem cores and reducing application performance.
6
GPU Computing Concept
- Plug a GPU into a PC: a personal supercomputer with 1 teraflop of computing power on your desk. Requirements: a PCIe slot on your motherboard and a 300 W power supply.
- Plug GPUs into a cluster: a GPU supercomputer with petaflop power. 3 of the top 5 supercomputers in the world are GPU clusters: Tianhe-1A, Nebulae, Tsubame2.
7
Widespread Use of GPUs in HPC Software
- Oil and Gas: Seismic Processing, Reservoir Simulation
- Edu/Research: Astrophysics, Molecular Dynamics, Weather/Climate, Signal Processing
- Government: Satellite Imaging, Video Analytics
- Life Sciences: Bio-chemistry, Bio-informatics, Material Science, Genomics
- Finance: Risk Analytics, Monte Carlo Options Pricing, Insurance
- Manufacturing: Structural Mechanics, Comp. Fluid Dynamics, Electromagnetics
8
MHD on GPU
Numerical space weather modeling: Xueshang Feng & Dingkun Zhong, Space Weather Center, CAS.
A grid system from the Sun to the Earth: an enormous computing requirement!
5X speedup.
9
Biocomputing on GPU
QTLnetwork: Jun Zhu, Futao Zhang, Zhihong Zhu, Zhejiang University.
Searching a multi-dimensional gene space: an enormous computing requirement!
> 20X speedup.
10
Overview
- GPU computing introduction
- GPU programming basics
- GTC GPU porting
13
OpenACC: New Open Standard for GPU Computing
Faster, easier, portable.
To accelerate the adoption of directives, the three major providers of parallelizing compilers for GPUs (Cray, PGI, and CAPS) have come together to propose a new open standard for directives that they will all support, called OpenACC. OpenACC is a huge step forward for developers, giving them a common approach to using accelerators. The OpenACC 1.0 specification and a quick reference card are available now for download.
14
Calculating Pi: CPU Code
program picalc
  implicit none
  integer, parameter :: n = 1000000  ! problem size (value not preserved in the source; 10**6 assumed)
  integer :: i
  real(kind=8) :: t, pi
  pi = 0.0
  do i = 0, n-1
     t = (i+0.5)/n
     pi = pi + 4.0/(1.0 + t*t)
  end do
  print *, 'pi=', pi/n
end program picalc
15
Calculating Pi: Adding OpenACC
program picalc
  implicit none
  integer, parameter :: n = 1000000  ! problem size (value assumed, as before)
  integer :: i
  real(kind=8) :: t, pi
  pi = 0.0
!$acc parallel loop reduction(+:pi)
  do i = 0, n-1
     t = (i+0.5)/n
     pi = pi + 4.0/(1.0 + t*t)
  end do
!$acc end parallel loop
  print *, 'pi=', pi/n
end program picalc
Just two lines of modification to the CPU code. The reduction clause tells the compiler to combine each iteration's contribution to pi safely in parallel.
16
Calculating Pi Using OpenACC: Result
$ pgfortran compute_pi.F
$ time ./a.out
 pi= 3.14159…
real 0m16.049s
user 0m16.030s
sys  0m0.010s

$ pgfortran -acc compute_pi.F
$ time ./a.out
 pi= 3.14159…
real 0m0.398s
user 0m0.150s
sys  0m0.250s

40x speedup!
Platform: CPU: Intel Core i GHz; GPU: NVIDIA Tesla C2070; OS: Ubuntu 10.04; DRAM: 4 GB.
17
Mature CUDA Development Ecosystem
- Debuggers & Profilers: cuda-gdb, NV Visual Profiler, Parallel Nsight (Visual Studio), Allinea, TotalView
- GPU Compilers: C, C++, Fortran, OpenCL, DirectCompute, Java, Python
- Parallelizing Compilers: PGI Accelerator, CAPS HMPP, mCUDA, OpenMP
- Numerical Packages: MATLAB, Mathematica, NI LabVIEW, pyCUDA
- Libraries: BLAS, FFT, LAPACK, NPP, Video Imaging, GPULib
- GPGPU Consultants & Training: ANEO, GPU Tech
- OEM Solution Providers
18
CUDA Code Example
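As an illustrative sketch (the slide's own listing is not reproduced here; all names below are ours, not the original's): a SAXPY kernel computing y = a*x + y with one CUDA thread per element, plus the host-side allocation, copies, and launch.

// saxpy.cu -- minimal illustrative CUDA example, not the original slide's code
#include <cstdio>
#include <cuda_runtime.h>

// kernel: one thread per element computes y[i] = a*x[i] + y[i]
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // host data
    float *x = (float*)malloc(bytes), *y = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // device data
    float *d_x, *d_y;
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_y, bytes);
    cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, bytes, cudaMemcpyHostToDevice);

    // launch: enough 256-thread blocks to cover all n elements
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);

    cudaMemcpy(y, d_y, bytes, cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", y[0]);   // expect 4.0

    cudaFree(d_x); cudaFree(d_y);
    free(x); free(y);
    return 0;
}

The <<<blocks, threads>>> launch configuration is the only non-C syntax: it maps the data-parallel loop onto a grid of threads.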
19
Overview
- GPU computing introduction
- GPU programming basics
- GTC GPU porting
20
GTC
- Gyrokinetic simulation code to study turbulence in magnetically confined plasmas
- Developed at UC-Irvine, ZJU, USTC, PKU, etc.
- Fortran, MPI/OpenMP
- Scales to thousands of nodes
21
Benchmark Problem
- Physics parameters: nonlinear=1, magnetic=0
- Grid parameters: mpsi=400, mthetamax=1568, mtoroidal=32
- Particle parameters: micell=mfcell=mecell=100, nhybrid=2, ncycle=8
- For the 128-MPI-process run: 15.7M ions and 6.5M electrons per MPI process
22
Profile Benchmark: 128 MPI processes on 128 nodes, with 6 OpenMP threads 128 CPU % loop 521.43 field 0.63 0.12% ion 54.4 10.43% shifte 84.6 16.22% pushe 365.4 70.08% poisson 4.4 0.84% electron other 12 2.30% Platform: Tianhe-1A Compute node: 2 Intel Xeon 5670 (6c, 2.93 GHz) 1 NVIDIA Tesla M2050 24 GB
23
Profile Benchmark: 128 MPI processes on 128 nodes, with 6 OpenMP threads 128 CPU % loop 521.43 field 0.63 0.12% ion 54.4 10.43% shifte 84.6 16.22% pushe 365.4 70.08% poisson 4.4 0.84% electron other 12 2.30% Platform: Tianhe-1A Compute node: 2 Intel Xeon 5670 (6c, 2.93 GHz) 1 NVIDIA Tesla M2050 24 GB DRAM Focus on pushe+shifte first
24
Result

Routine   128 CPU (s)      %    128 GPU (s)      %    Speedup
loop         521.43             167.46                 3.1
field          0.63   0.12%       0.66     0.39%
ion           54.4   10.43%      54.6     32.60%
shifte        84.6   16.22%      51.8     30.93%       1.6
pushe        365.4   70.08%      44       26.27%       8.3
poisson        4.4    0.84%                2.63%
other         12      2.30%                7.17%

(poisson and "other" were not ported; their GPU percentages correspond to the same absolute times, 4.4 s and 12 s of the 167.46 s loop.)
25
Weak Scaling on Tianhe-1A
26
Pushe
Two major kernels:
- Gather fields (gather_fields)
- Update guiding-center position (update_gcpos)
One thread per particle replaces the particle loop "do m=1,me" (see the sketch below).
Key optimization technique: the GPU's texture cache is ideal for the data locality of the field data. This technique gives a 3X kernel speedup.
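A minimal sketch of the thread-per-particle mapping; the kernel and variable names (push_electrons, z_in, z_out) are hypothetical stand-ins for GTC's actual arrays:

// One CUDA thread per particle replaces the Fortran loop "do m=1,me".
__global__ void push_electrons(int me, const float *z_in, float *z_out)
{
    int m = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index = particle index
    if (m >= me) return;                            // guard the last, partially full block

    // per-particle work goes here: gather the fields at this particle's
    // guiding center (gather_fields), then update its position (update_gcpos)
    z_out[m] = z_in[m];                             // placeholder for the real update
}

// Launch with enough blocks to cover all particles:
//   int threads = 256;
//   push_electrons<<<(me + threads - 1) / threads, threads>>>(me, d_z_in, d_z_out);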
27
Conclusions
- All future computing chips will be more and more parallel.
- All codes need to be massively parallel to utilize the next generation of chips.
- GPU programming is mature and easy.
- GPU = Great Plasma Unit.
28
Backup Slides
29
GPU Programming Based on Directives
- Suitable for manifestly parallel algorithms; one may need to rewrite the CPU code to expose parallelism.
- Fast prototyping.
- Mixing CUDA and OpenACC gives the best of both worlds: CUDA for the most important hotspot (best performance), OpenACC for the other parts (fast development); see the sketch below.
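A hedged sketch of the mixed pattern, with all names (saxpy_cuda, scale_then_shift) hypothetical: an OpenACC data region owns the device arrays, and host_data use_device hands the raw device pointers to a hand-written CUDA kernel for the hotspot. Built as two files, e.g. nvcc -c saxpy_cuda.cu and pgcc -acc main.c saxpy_cuda.o.

// --- saxpy_cuda.cu: the hotspot as a hand-written CUDA kernel ---
__global__ void saxpy_kernel(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}
extern "C" void saxpy_cuda(int n, float a, const float *x, float *y)
{
    saxpy_kernel<<<(n + 255) / 256, 256>>>(n, a, x, y);
}

// --- main.c: everything else in OpenACC ---
extern void saxpy_cuda(int n, float a, const float *x, float *y);

void scale_then_shift(int n, float *x, float *y)
{
    #pragma acc data copyin(x[0:n]) copy(y[0:n])
    {
        // hand the device pointers to the CUDA hotspot
        #pragma acc host_data use_device(x, y)
        saxpy_cuda(n, 2.0f, x, y);

        // the non-critical part stays in easy OpenACC
        #pragma acc parallel loop present(y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] += 1.0f;
    }
}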
30
Bottleneck of pushe

e1=e1+wt00*(wz0*gradphi(1,0,ij)+wz1*gradphi(1,1,ij))
e2=e2+wt00*(wz0*gradphi(2,0,ij)+wz1*gradphi(2,1,ij))
e3=e3+wt00*(wz0*gradphi(3,0,ij)+wz1*gradphi(3,1,ij))
e4=e4+wt00*(wz0*phit(0,ij)+wz1*phit(1,ij))

This field gather has not very good cache behavior.
31
Texture Cache
The texture cache is optimized for 2D spatial locality.
32
Texture Prefetch

Before (Fortran field gather):
e1=e1+wt00*(wz0*gradphi(1,0,ij)+wz1*gradphi(1,1,ij))
e2=e2+wt00*(wz0*gradphi(2,0,ij)+wz1*gradphi(2,1,ij))
e3=e3+wt00*(wz0*gradphi(3,0,ij)+wz1*gradphi(3,1,ij))
e4=e4+wt00*(wz0*phit(0,ij)+wz1*phit(1,ij))

After (CUDA, gradphi and phit packed into pairs of float4 and fetched through the texture cache):
float4 tmp1 = tex1Dfetch(texGradphiPhit, ij*2);
float4 tmp2 = tex1Dfetch(texGradphiPhit, ij*2+1);
e1 += wt00*(wz0*tmp1.x + wz1*tmp1.w);
e2 += wt00*(wz0*tmp1.y + wz1*tmp2.x);
e3 += wt00*(wz0*tmp1.z + wz1*tmp2.y);
e4 += wt00*(wz0*tmp2.z + wz1*tmp2.w);

This gives a 3X kernel-time speedup!
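For completeness, a sketch of how texGradphiPhit could be declared and bound on the host with the legacy texture-reference API of that CUDA generation; the layout (two consecutive float4s per grid point ij) follows the fetch code above, while the pointer and size names are assumptions:

#include <cuda_runtime.h>

// file-scope texture reference; each grid point ij owns two consecutive float4s
texture<float4, 1, cudaReadModeElementType> texGradphiPhit;

// d_gradphi_phit: device array of 2*mgrid float4s holding the packed
// gradphi/phit field data (names assumed for illustration)
void bind_field_texture(const float4 *d_gradphi_phit, int mgrid)
{
    cudaBindTexture(NULL, texGradphiPhit, d_gradphi_phit,
                    2 * mgrid * sizeof(float4));
}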
33
Future Plans
- Continue optimizing pushe and shifte.
- The ion routines are now ~30% of the GPU loop time and worth porting; this work is ongoing.