GAIN: GPU Accelerated Intensities
Ahmed F. Al-Refaie, S. N. Yurchenko, J. Tennyson
Department of Physics and Astronomy, University College London, Gower Street, London, WC1E 6BT
Computing Intensities
- The linestrength expression involves three-j symbols (precomputed)
- Evaluating it for every transition is time-consuming
TROVE
- Doing this for each transition is tough!
- However, we can split it into two parts:
  - A half-linestrength for a particular initial state
  - A simple dot product to complete it
TROVE
- Relegate the majority of the computation to each initial state
- Each transition therefore reduces to a simple dot product (sketched below)
- However, the half-linestrength can still take a long time
- ExoMol line lists can have billions of transitions as well
- This situation is common for particularly dense J: roughly a thousand hours (≈1.5 months) for one J′–J′′ pair!
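As a sketch of that split, in generic eigenvector notation (the coefficient and dipole symbols here are illustrative, not TROVE's exact ones): expanding both eigenstates in their basis sets factorises the transition moment so that the inner sum depends only on the initial state.

```latex
\[
\langle \Psi_f | \bar{\mu} | \Psi_i \rangle
  = \sum_{f'} c^{(f)}_{f'} \sum_{i'} c^{(i)}_{i'} \, \bar{\mu}_{f'i'}
  = \sum_{f'} c^{(f)}_{f'} \, h^{(i)}_{f'}
  = \mathbf{c}^{(f)} \cdot \mathbf{h}^{(i)},
\qquad
h^{(i)}_{f'} \equiv \sum_{i'} c^{(i)}_{i'} \, \bar{\mu}_{f'i'}
\]
```

The half-linestrength vector h⁽ⁱ⁾ is built once per initial state; every transition out of that state then costs only a dot product with the relevant final-state eigenvector.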
Life is too short to wait around for transitions
Question: how can you complete a line list quickly?
(1) Reduce the quality of the line lists
(2) Make it faster
Hint: the answer is not (1)
The half-linestrength
- The focus of this talk will be here
- High-J times:
  - H₂CO: 30 seconds
  - PH₃: 1 minute
  - SO₃: 7–8 mins!
- Tens of thousands of initial states!!
Half line strength
[Diagram: OpenMP threads T:0, T:1, T:2, …, T:9 each take elements of the initial basis set, paired against every element of the final basis set]
- Those hours-long high-J times were with 16 cores!
Enter the GPU
- Graphics Processing Units can have around 2000 cores
- Highly parallel in nature, with lots of arithmetic capability
Half line strength (one OpenMP thread)
for all elements in the J′′ basis set:
    get K_f, τ_f
    for all elements in the J′ basis set:
        get K_i, τ_i, c_i
        get dipole
        do maths
        accumulate half-ls vector
(see the CUDA sketch below)
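A minimal CUDA sketch of such a baseline kernel, assuming one thread per final-basis element; the array layout, the folded-in dipole matrix, and the phase line standing in for the real three-j maths are illustrative assumptions, not GAIN's actual code:

```cuda
// Baseline: one GPU thread per element of the J'' (final) basis set.
// Every thread walks the entire J' (initial) basis set, pulling quanta,
// coefficients and dipole elements straight from global memory.
__global__ void half_ls_baseline(int nf, int ni,
                                 const int*    tauf,    // final-state tau quanta
                                 const int*    ki,      // initial-state K quanta
                                 const int*    taui,    // initial-state tau quanta
                                 const double* ci,      // initial-state coefficients
                                 const double* dipole,  // nf x ni dipole elements
                                 double*       half_ls) // output, length nf
{
    int f = blockIdx.x * blockDim.x + threadIdx.x;
    if (f >= nf) return;

    double acc = 0.0;
    for (int i = 0; i < ni; ++i) {
        int    k   = ki[i];                      // global read 1: quanta
        int    tau = taui[i];                    // global read 2
        double c   = ci[i];                      // global read 3: coefficient
        double d   = dipole[(size_t)f * ni + i]; // global read 4: dipole
        // Illustrative placeholder for the real phase/three-j maths:
        double sign = ((tau + tauf[f] + k) & 1) ? -1.0 : 1.0;
        acc += sign * c * d;                     // do maths and accumulate
    }
    half_ls[f] = acc;
}
```

Note the four global-memory reads per inner iteration: the next slides diagnose exactly this.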
Baseline Kernel
[Performance plot]
- Why?
Optimising
- But we have so many cores! Why is it slow?
- Each thread must:
  1. Read J_i, K_i, τ_i
  2. Read the dipole matrix
  3. Read the coefficients
  4. Do the maths and accumulate
- It turns out memory operations are fairly slow, and we are doing a lot of them
- CPUs have really large and multiple caches; GPUs have very simple caches
Optimising
- We are provided a user-managed cache called shared memory
- It's a small chunk of memory that's REALLY fast
- A lot of the global memory reads are redundant
Optimising
[Diagram: initial basis set vs. final basis set]
- Each thread is reading the same J_i, K_i, τ_i and coefficients
Optimising
- Why not have the threads cache it instead?
[Diagram: threads cooperatively cache quanta and coefficients from the initial basis set, do the maths, and repeat]
- This is the Cache and Reduce (CR) kernel
Cache and Reduce (one GPU thread; block of 256 threads)
for all elements in the J′′ basis set:
    get K_f, τ_f
    for all elements in the J′ basis set, step 256:
        get K_i, τ_i, c_i at the thread's point
        store in shared memory
        for all elements in shared memory:
            get K_i, τ_i, c_i
            get dipole
            do maths
            accumulate half-ls vector
(see the CUDA sketch below)
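A CUDA sketch of the cache-and-reduce idea, under the same illustrative assumptions as the baseline sketch above (256-thread blocks; the phase line again stands in for the real three-j maths):

```cuda
#define TILE 256  // one block = 256 threads, as on the slide

__global__ void half_ls_cache_reduce(int nf, int ni,
                                     const int*    tauf,
                                     const int*    ki,
                                     const int*    taui,
                                     const double* ci,
                                     const double* dipole,  // nf x ni
                                     double*       half_ls)
{
    // User-managed cache: each thread deposits one initial-basis element.
    __shared__ int    s_k[TILE];
    __shared__ int    s_tau[TILE];
    __shared__ double s_c[TILE];

    int f = blockIdx.x * blockDim.x + threadIdx.x;
    double acc = 0.0;

    for (int base = 0; base < ni; base += TILE) {
        int i = base + threadIdx.x;
        if (i < ni) {                     // cooperative fill of the tile
            s_k[threadIdx.x]   = ki[i];
            s_tau[threadIdx.x] = taui[i];
            s_c[threadIdx.x]   = ci[i];
        }
        __syncthreads();                  // tile is ready for everyone

        if (f < nf) {
            int tile = min(TILE, ni - base);
            for (int j = 0; j < tile; ++j) {
                double d    = dipole[(size_t)f * ni + (base + j)];
                double sign = ((s_tau[j] + tauf[f] + s_k[j]) & 1) ? -1.0 : 1.0;
                acc += sign * s_c[j] * d; // do maths and accumulate
            }
        }
        __syncthreads();                  // don't refill while others read
    }
    if (f < nf) half_ls[f] = acc;
}
```

Each quanta/coefficient read from global memory is now shared by all 256 threads in the block instead of being repeated per thread.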
Optimising
- Have each thread cache a part of the initial basis set
[Diagram: final vs. initial basis set; cache quanta and coefficients]
Optimising
SO₃ molecule:
[Performance plot]
Porting to the GPU
- Half line strength
- Line strength completion
Line strength completion
- Simple dot product: replace with the cuBLAS version (~5× faster for H₂CO; sketch below)
- However, we have lots of final-state eigenvectors
- Strategy is to get lots done in 'parallel':
  - Use stream execution
  - Use multiple GPUs
  - Why not both?
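The cuBLAS replacement in sketch form (double precision, device-resident vectors, error handling trimmed; the function and variable names are mine, not GAIN's):

```cuda
#include <cublas_v2.h>

// Complete one linestrength: dot the half-ls vector with one
// final-state eigenvector, both already resident on the GPU.
double complete_one(cublasHandle_t handle, int n,
                    const double* d_half_ls, const double* d_cf)
{
    double result = 0.0;
    cublasDdot(handle, n, d_half_ls, 1, d_cf, 1, &result);
    return result;  // squared and weighted later to form the linestrength
}
```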
Stream execution
- Run multiple independent kernels simultaneously
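A sketch of stream execution for the completion step, assuming the final-state eigenvectors are already on the device (NSTREAMS and the pointer layout are illustrative choices):

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Overlap many independent dot products by rotating them over streams.
// d_cf[v] is the device-resident eigenvector of final state v; results
// go to device memory so the cuBLAS calls stay asynchronous.
void complete_streamed(cublasHandle_t handle, int n, int n_final,
                       const double* d_half_ls,
                       const double* const* d_cf,  // host array of device ptrs
                       double* d_results)          // device array, length n_final
{
    const int NSTREAMS = 4;
    cudaStream_t streams[NSTREAMS];
    for (int s = 0; s < NSTREAMS; ++s) cudaStreamCreate(&streams[s]);

    // Device pointer mode: cublasDdot writes its result on the GPU
    // instead of blocking to return it to the host.
    cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);

    for (int v = 0; v < n_final; ++v) {
        cublasSetStream(handle, streams[v % NSTREAMS]);
        cublasDdot(handle, n, d_half_ls, 1, d_cf[v], 1, &d_results[v]);
    }
    cudaDeviceSynchronize();
    for (int s = 0; s < NSTREAMS; ++s) cudaStreamDestroy(streams[s]);
}
```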
Multiple GPUs
- Run multiple initial states on multiple GPUs
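And a sketch of the multi-GPU side, with one OpenMP host thread driving each device; process_initial_state() is a hypothetical stand-in for the half-ls kernel plus completion for one initial state:

```cuda
#include <cuda_runtime.h>

void process_initial_state(int state);  // hypothetical per-state pipeline

// Distribute initial states round-robin over all visible GPUs.
void run_all_states(int n_initial)
{
    int n_gpus = 0;
    cudaGetDeviceCount(&n_gpus);

    #pragma omp parallel for num_threads(n_gpus)
    for (int i = 0; i < n_initial; ++i) {
        cudaSetDevice(i % n_gpus);   // subsequent CUDA calls target this GPU
        process_initial_state(i);
    }
}
```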
Line strength completion
Porting to the GPU
- Half line strength
- Line strength completion
Result:
[Performance plot]
Future Work
- Port the code to DVR3D
- Remove the dot product and switch to DGEMM (see the sketch below)
- Integrate fully into TROVE
- Finish my PhD
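The DGEMM idea in sketch form: stacking eigenvectors and half-ls vectors as matrix columns turns many dot products into a single matrix-matrix product (an illustrative cuBLAS call, not the actual planned code):

```cuda
#include <cublas_v2.h>

// C: n x n_final   final-state eigenvectors, one per column (column-major)
// H: n x n_init    half-ls vectors, one column per initial state
// R = C^T * H      n_final x n_init block of completed linestrength sums
void complete_with_dgemm(cublasHandle_t handle, int n, int n_final, int n_init,
                         const double* d_C, const double* d_H, double* d_R)
{
    const double one = 1.0, zero = 0.0;
    cublasDgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                n_final, n_init, n,
                &one,  d_C, n,
                       d_H, n,
                &zero, d_R, n_final);
}
```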
Thanks