GAIN: GPU Accelerated Intensities. Ahmed F. Al-Refaie, S. N. Yurchenko, J. Tennyson. Department of Physics and Astronomy - University College London - Gower.


1 GAIN: GPU Accelerated Intensities. Ahmed F. Al-Refaie, S. N. Yurchenko, J. Tennyson. Department of Physics and Astronomy - University College London - Gower Street - London - WC1E 6BT. ahmed.al-refaie.12@ucl.ac.uk

2 Computing Intensities [equation not recovered; annotations on it: three-j symbols, precomputed, time-consuming]

3 TROVE: Doing this for each transition is tough!! However, we can split it into two parts: (1) a half-linestrength for a particular initial state; (2) a simple dot product to complete it.
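The split on this slide can be sketched numerically. This is a minimal illustration, assuming the line strength reduces to a bilinear form (sum over i and f of c_i * mu_if * c_f); the real TROVE expression also carries three-j symbols and other factors that are not shown. The function names `half_linestrength`, `complete` and `direct`, and all the numbers, are hypothetical.

```python
def half_linestrength(c_init, dipole):
    """Contract the initial-state coefficients with the dipole matrix once."""
    n_final = len(dipole[0])
    return [sum(c_init[i] * dipole[i][f] for i in range(len(c_init)))
            for f in range(n_final)]


def complete(half_ls, c_final):
    """Finish one transition with a plain dot product."""
    return sum(h * c for h, c in zip(half_ls, c_final))


def direct(c_init, dipole, c_final):
    """Reference: the full double sum, redone for every transition."""
    return sum(c_init[i] * dipole[i][f] * c_final[f]
               for i in range(len(c_init)) for f in range(len(c_final)))


# Toy data (made up): 3 initial basis functions, 2 final basis functions.
c_init = [0.5, -1.0, 2.0]
dipole = [[1.0, 0.0], [2.0, 1.0], [0.5, -1.0]]
final_states = [[1.0, 0.0], [0.5, 0.5], [-1.0, 2.0]]

half = half_linestrength(c_init, dipole)                  # once per initial state
strengths = [complete(half, cf) for cf in final_states]   # cheap, once per transition
```

The expensive contraction runs once per initial state, and each extra final state only costs one dot product of the basis-set length.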

4 TROVE: Relegate the majority of the computation to a per-initial-state step. Each transition therefore reduces to a simple dot product. However, the half-linestrength can still take a long time, and ExoMol line-lists can have billions of transitions as well. This sight is common for particularly dense J: 1043.19 hours = 1.5 months for one J’ J’’ !!

5 Life is too short to wait around for transitions. Question: How can you complete a line-list quickly? (1) Reduce quality of the line-lists. (2) Make it faster. Hint: The answer is not (1).

6 The half-linestrength: Focus of the talk will be here. High J times: H₂CO: 30 seconds; PH₃: 1 minute; SO₃: 7-8 mins! Tens of thousands of initial states!!

7

8 Half line strength Initial basis-set Final basis-set

9 Half line strength Initial basis-set T:0 T:1 T:2 ….. T:9

10 Half line strength Initial basis-set T:0 T:1 T:2 ….. T:9

11 Half line strength Initial basis-set T:0 T:1 T:2 ….. T:9. The 1043.19 hours was with 16 cores!

12 Enter the GPU: Graphics Processing Units can have around 2000 cores. Highly parallel nature with lots of arithmetic capabilities.

13 Half line strength (OpenMP threads)
    For all elements in the J’’ basis-set:
        Get K_f, tau_f
        For all elements in the J’ basis-set:
            Get K_i, tau_i, c_i
            Get dipole
            Do maths and accumulate into the half-ls vector
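The loop structure on this slide can be written out as a short sketch. `toy_dipole` is a hypothetical stand-in for both the dipole lookup and the "do maths" step, and its selection rule is invented for illustration rather than taken from TROVE. The comment about OpenMP reflects an assumption about how the outer loop is shared among threads.

```python
def toy_dipole(k_f, tau_f, k_i, tau_i):
    """Hypothetical stand-in for the dipole lookup plus the 'maths' step."""
    if abs(k_f - k_i) > 1:          # toy selection rule, not TROVE's
        return 0.0
    return 1.0 if tau_f == tau_i else -0.5


def baseline_half_ls(final_basis, initial_basis):
    """final_basis: list of (K, tau); initial_basis: list of (K, tau, c)."""
    half_ls = [0.0] * len(final_basis)
    # Assumption: iterations of this outer loop are what the OpenMP threads share.
    for f, (k_f, tau_f) in enumerate(final_basis):
        # Every final element re-reads the whole initial basis set.
        for k_i, tau_i, c_i in initial_basis:
            half_ls[f] += c_i * toy_dipole(k_f, tau_f, k_i, tau_i)
    return half_ls


# Made-up quanta and coefficients.
final_basis = [(0, 0), (1, 1), (3, 0)]
initial_basis = [(0, 0, 0.5), (1, 0, 1.0), (2, 1, -2.0)]
result = baseline_half_ls(final_basis, initial_basis)
```

Each entry of `half_ls` depends on only one final basis element, so the outer iterations are independent and can run in parallel.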

14 Baseline Kernel: Why?

15 Optimising: But we have so many cores!!! WHY!?!?! The kernel does: 1 - Read Ji, Ki, taui; 2 - Read dipole matrix; 3 - Read coefficients; 4 - Do math and accumulate. Turns out memory operations are fairly slow, and we are doing a lot of memory operations. CPUs have really large and multiple caches. GPUs have very simple caches...

16 Optimising: We are provided a user-managed cache called shared memory. It’s a small chunk of memory that’s REALLY fast. A lot of the global memory reads are redundant.

17 Optimising Initial basis-set Final basis-set: Each thread is reading the same Ji, Ki, taui and coeffs

18 Optimizing Why not have the threads cache it instead? Final Initial Cache quanta and coefficients

19 Optimizing Do math and repeat Final Initial

20 Optimizing Final Initial: This is the Cache and Reduce (CR) Kernel

21 Cache and Reduce (GPU threads; block: 256 threads)
    For all elements in the J’’ basis-set:
        Get K_f, tau_f
        For all elements in the J’ basis-set, step 256:
            Get K_i, tau_i, c_i at thread point
            Store in shared memory
            For all elements in shared memory:
                Get K_i, tau_i, c_i
                Get dipole
                Do maths and accumulate into the half-ls vector
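The Cache and Reduce pattern can be emulated on the CPU to see why it cuts global memory traffic. This is a sequential sketch, not the GAIN kernel: the `shared` list plays the role of shared memory and `global_reads` counts reads of the initial basis set from "global" memory. `toy_dipole`, the block size of 4 and the data are all hypothetical.

```python
def toy_dipole(k_f, tau_f, k_i, tau_i):
    """Hypothetical stand-in for the dipole lookup plus the 'maths' step."""
    if abs(k_f - k_i) > 1:
        return 0.0
    return 1.0 if tau_f == tau_i else -0.5


def baseline(final_basis, initial_basis):
    """Every 'thread' (final element) reads every initial element itself."""
    half_ls, reads = [0.0] * len(final_basis), 0
    for f, (k_f, tau_f) in enumerate(final_basis):
        for k_i, tau_i, c_i in initial_basis:
            reads += 1
            half_ls[f] += c_i * toy_dipole(k_f, tau_f, k_i, tau_i)
    return half_ls, reads


def cache_and_reduce(final_basis, initial_basis, block):
    half_ls, reads = [0.0] * len(final_basis), 0
    # One emulated thread block owns `block` final basis elements.
    for block_start in range(0, len(final_basis), block):
        threads = range(block_start, min(block_start + block, len(final_basis)))
        for tile_start in range(0, len(initial_basis), block):
            # Stage 1: each thread copies ONE initial element into the tile.
            shared = []
            for t in range(block):
                idx = tile_start + t
                if idx < len(initial_basis):
                    shared.append(initial_basis[idx])
                    reads += 1
            # A real CUDA kernel needs a barrier here (e.g. __syncthreads()).
            # Stage 2: every thread loops over the cached tile.
            for f in threads:
                k_f, tau_f = final_basis[f]
                for k_i, tau_i, c_i in shared:
                    half_ls[f] += c_i * toy_dipole(k_f, tau_f, k_i, tau_i)
    return half_ls, reads


# Made-up basis sets; the slide uses 256 threads per block, 4 keeps the toy small.
final_basis = [(k, k % 2) for k in range(8)]
initial_basis = [(k, k % 2, 0.25 * (k + 1)) for k in range(10)]

ref, ref_reads = baseline(final_basis, initial_basis)
cr, cr_reads = cache_and_reduce(final_basis, initial_basis, block=4)
```

In this toy the result is identical while the count of global reads of the initial basis set drops from 80 to 20, a factor equal to the block size.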

22 Optimizing Have each thread cache a part of the initial basis-set Final Initial Cache quanta and coefficients

23 Optimizing

24 SO₃ molecule:

25 Porting to the GPU Half line strength Line strength completion

26 Simple dot product, replace with cuBLAS version: ~5x faster for H₂CO. However, we have lots of final state eigenvectors. Strategy is to get lots done in ‘parallel’: use stream execution; use multiple GPUs. Why not both?

27 Stream execution: Run multiple independent kernels simultaneously

28 Multiple GPUs Run multiple initial states on multiple GPUs

29 Line strength completion

30 Porting to the GPU Half line strength Line strength completion

31 Result:

32 Future Work: Port code to DVR3D; remove dot product and switch to DGEMM; integrate fully into TROVE; finish my PhD.
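One reading of the DGEMM item, offered here as an assumption rather than the authors' stated plan: if the half-linestrength vectors for several initial states form the rows of one matrix and the final-state eigenvectors form the columns of another, all the completing dot products become a single matrix-matrix product, which is what the BLAS routine DGEMM computes in double precision. The sketch below uses a plain Python `matmul` as a stand-in; the names and numbers are hypothetical.

```python
def dot(a, b):
    """One line-strength completion: half-ls vector times final eigenvector."""
    return sum(x * y for x, y in zip(a, b))


def matmul(a, b):
    """Plain-loop stand-in for the product a DGEMM call would compute (A times B)."""
    cols = list(zip(*b))
    return [[dot(row, col) for col in cols] for row in a]


# Made-up data: 2 initial states and 4 final states over a 3-function basis.
half_ls = [[1.0, 2.0, 0.0],
           [0.5, -1.0, 3.0]]
final_vecs = [[1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0],
              [2.0, 2.0, -1.0],
              [0.5, 0.5, 0.5]]

# Column j of c_matrix is final eigenvector j (shape 3 x 4).
c_matrix = [list(col) for col in zip(*final_vecs)]

one_by_one = [[dot(h, v) for v in final_vecs] for h in half_ls]  # 8 dot products
batched = matmul(half_ls, c_matrix)                              # 1 matrix product
```

Both routes give the same 2 x 4 table of values; batching does not reduce the arithmetic, it only hands it to the library as one large operation instead of many small ones.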

33 Thanks

