1
GAIN: GPU Accelerated Intensities Ahmed F. Al-Refaie, S. N. Yurchenko, J. Tennyson Department of Physics Astronomy - University College London - Gower Street - London - WC1E 6BT ahmed.al-refaie.12@ucl.ac.uk
2
Computing Intensities The three-j symbols are precomputed, but the calculation is still time-consuming
3
TROVE Doing this for each transition is tough! However, we can split it into two parts: a half-linestrength for a particular initial state, and a simple dot product to complete it
4
TROVE Relegate the majority of the computation to a once-per-initial-state step Each transition then reduces to a simple dot product However, the half-linestrength can still take a long time, and ExoMol line lists can have billions of transitions This situation is common for particularly dense J: 1043.19 hours = 1.5 months for one J’ J’’ pair!
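The split described above can be sketched in a few lines of numpy. This is a minimal illustration, not the TROVE implementation: the function names and the random dipole matrix are invented for the example, and the "linestrength" here is just the squared amplitude with no degeneracy or frequency factors.

```python
import numpy as np

def half_linestrength(dipole, c_i):
    """The expensive step, done once per initial state: contract the
    dipole matrix with the initial-state eigenvector."""
    # dipole: (n_final_basis, n_initial_basis), c_i: (n_initial_basis,)
    return dipole @ c_i

def linestrength(half_ls, c_f):
    """The cheap step, done once per transition: a single dot product
    with the final-state eigenvector."""
    return np.dot(c_f, half_ls) ** 2

# Tiny demo with random stand-in data (sizes are illustrative only).
rng = np.random.default_rng(0)
dipole = rng.standard_normal((6, 4))
c_i = rng.standard_normal(4)
c_f = rng.standard_normal(6)

half_ls = half_linestrength(dipole, c_i)
# Completing via the half-linestrength matches the direct double contraction.
direct = (c_f @ dipole @ c_i) ** 2
assert np.isclose(linestrength(half_ls, c_f), direct)
```

The payoff is that the half-linestrength is reused across every final state sharing that initial state, so its cost is amortised over all of those transitions.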
5
Life is too short to wait around for transitions Question: how can you complete a line list quickly? (1) Reduce the quality of the line list (2) Make it faster Hint: the answer is not (1)
6
The half-linestrength Focus of the talk will be here: tens of thousands of initial states! High-J times per initial state: H2CO: 30 seconds PH3: 1 minute SO3: 7-8 minutes!
8
Half line strength Initial basis-set Final basis-set
9
Half line strength Initial basis-set T:0 T:1 T:2 ….. T:9
11
Half line strength Initial basis-set T:0 T:1 T:2 … T:9 1043.19 hours was with 16 cores!
12
Enter the GPU Graphics Processing Units can have around 2000 cores Highly parallel in nature, with lots of arithmetic capability
13
Half line strength (one OpenMP thread per final-basis element):
For all elements in the J’’ basis-set:
    Get K_f, tau_f
    For all elements in the J’ basis-set:
        Get K_i, tau_i, c_i
        Get dipole
        Do maths
        Accumulate the half-ls vector
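The pseudocode above can be written out as an explicit double loop. This is a sketch of the per-thread work only: `dipole_element` is a hypothetical stand-in for however the dipole matrix element is looked up or computed from the quantum numbers, and in the real OpenMP code the outer loop is what gets parallelised.

```python
import numpy as np

def half_ls_loop(K_f, tau_f, K_i, tau_i, c_i, dipole_element):
    """For each final basis function, accumulate dipole * coefficient
    over the whole initial basis (the baseline, uncached algorithm)."""
    half_ls = np.zeros(len(K_f))
    for f in range(len(K_f)):          # one OpenMP thread handles each f
        acc = 0.0
        for i in range(len(K_i)):
            acc += dipole_element(K_f[f], tau_f[f], K_i[i], tau_i[i]) * c_i[i]
        half_ls[f] = acc
    return half_ls
```

Note that every iteration of the inner loop touches `K_i`, `tau_i`, and `c_i`; this redundancy is exactly what the GPU optimisation later in the talk targets.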
14
Baseline Kernel Why is it slow?
15
Optimising But we have so many cores! Why is it slow? Each thread must: 1 - read J_i, K_i, tau_i 2 - read the dipole matrix 3 - read the coefficients 4 - do the maths and accumulate It turns out memory operations are fairly slow, and we are doing a lot of them. CPUs have large, multi-level caches; GPUs have very simple caches.
16
Optimising We are provided with a user-managed cache called shared memory It's a small chunk of memory that's REALLY fast A lot of the global memory reads are redundant
17
Optimising Initial basis-set Final basis-set Each thread is reading the same J_i, K_i, tau_i and coefficients
18
Optimising Why not have the threads cache it instead? Final Initial Cache quanta and coefficients
19
Optimising Do the maths and repeat Final Initial
20
Optimising Final Initial This is the Cache and Reduce (CR) Kernel
21
Cache and Reduce (one GPU thread; a block of 256 threads):
For all elements in the J’’ basis-set:
    Get K_f, tau_f
    For all elements in the J’ basis-set, in steps of 256:
        Get K_i, tau_i, c_i at the thread's point
        Store in shared memory
        For all elements in shared memory:
            Get K_i, tau_i, c_i
            Get dipole
            Do maths
            Accumulate the half-ls vector
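The tiling idea behind Cache and Reduce can be mimicked on the CPU with numpy: load the initial basis in block-sized chunks (standing in for shared memory), then let every final-basis "thread" consume the cached chunk before the next one is loaded. A sketch, with a tiny tile size in place of the real 256-thread block:

```python
import numpy as np

TILE = 4  # stands in for the 256-thread block of the real CUDA kernel

def half_ls_tiled(dipole, c_i):
    """Tiled accumulation mimicking the Cache-and-Reduce kernel: each
    TILE-sized slice of coefficients is 'cached' once, then reused by
    every final-basis element, instead of being re-read per element."""
    n_f, n_i = dipole.shape
    half_ls = np.zeros(n_f)
    for start in range(0, n_i, TILE):               # cooperative cache load
        tile = slice(start, min(start + TILE, n_i))
        c_cached = c_i[tile].copy()                 # the shared-memory copy
        half_ls += dipole[:, tile] @ c_cached       # all threads hit the cache
    return half_ls
```

On the GPU the win is that each coefficient and set of quantum numbers is read from global memory once per block rather than once per thread, turning redundant global reads into fast shared-memory reads.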
22
Optimising Have each thread cache a part of the initial basis-set Final Initial Cache quanta and coefficients
23
Optimising
24
SO3 molecule:
25
Porting to the GPU Half line strength Line strength completion
26
Simple dot product: replace it with the cuBLAS version, ~5x faster for H2CO However, we have lots of final-state eigenvectors The strategy is to get lots done in parallel: Use stream execution Use multiple GPUs Why not both?
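Because there are many final-state eigenvectors, the completion step can be batched: stacking the eigenvectors as the rows of a matrix turns N separate dot products into one matrix-vector product, which is what a BLAS library (cuBLAS on the GPU) is built to do well. A hedged numpy sketch, with invented names and no physical prefactors:

```python
import numpy as np

def complete_linestrengths(half_ls, C_f):
    """Complete many transitions at once: C_f holds one final-state
    eigenvector per row, so a single GEMV replaces N dot products."""
    return (C_f @ half_ls) ** 2

# Demo: batched completion agrees with the one-at-a-time dot products.
rng = np.random.default_rng(2)
half_ls = rng.standard_normal(5)
C_f = rng.standard_normal((8, 5))       # 8 final states, 5 basis functions
batched = complete_linestrengths(half_ls, C_f)
one_at_a_time = np.array([np.dot(C_f[k], half_ls) ** 2 for k in range(8)])
assert np.allclose(batched, one_at_a_time)
```

Batching also exposes enough work to keep a GPU busy, which individual small dot products cannot.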
27
Stream execution Run multiple independent kernels simultaneously
28
Multiple GPUs Run multiple initial states on multiple GPUs
29
Line strength completion
30
Porting to the GPU Half line strength Line strength completion
31
Result:
32
Future Work Port the code to DVR3D Remove the dot product and switch to DGEMM Integrate fully into TROVE Finish my PhD
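The "switch to DGEMM" item generalises the batched completion one step further: collect many half-linestrength vectors (one per initial state) as the columns of a matrix, so a single matrix-matrix multiply produces every final-by-initial amplitude at once. A minimal numpy sketch of that idea, with illustrative names and sizes:

```python
import numpy as np

def complete_all(C_f, H):
    """Sketch of the DGEMM completion: C_f holds final-state eigenvectors
    as rows, H holds half-linestrength vectors as columns, so one GEMM
    yields the amplitude for every (final state, initial state) pair."""
    return (C_f @ H) ** 2

# Demo: the GEMM result matches the pairwise dot products.
rng = np.random.default_rng(3)
C_f = rng.standard_normal((8, 5))   # 8 final states
H = rng.standard_normal((5, 3))     # 3 initial states' half-ls vectors
S = complete_all(C_f, H)            # shape (8, 3)
pairwise = np.array([[np.dot(C_f[k], H[:, j]) ** 2 for j in range(3)]
                     for k in range(8)])
assert np.allclose(S, pairwise)
```

GEMM has far higher arithmetic intensity than repeated dot products, which is why it maps so well onto GPU hardware.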
33
Thanks