1
GAIN: GPU Accelerated Intensities Ahmed F. Al-Refaie, S. N. Yurchenko, J. Tennyson Department of Physics Astronomy - University College London - Gower Street - London - WC1E 6BT ahmed.al-refaie.12@ucl.ac.uk
2
Computing Intensities The three-j symbols are precomputed, but the calculation is still time-consuming
3
TROVE Doing this for each transition is tough! However, we can split it into two parts: a half-linestrength for a particular initial state, and a simple dot product to complete it
4
TROVE Relegate the majority of the computation to a once-per-initial-state step Each transition then reduces to a simple dot product However, the half-linestrength can still take a long time, and ExoMol line lists can have billions of transitions This situation is common for particularly dense J: 1043.19 hours = 1.5 months for one J’ J’’ pair!
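The split described above can be sketched in a few lines of numpy. This is a minimal illustration, not the TROVE implementation: the function names and the random dipole matrix are invented for the example, and the "linestrength" here is just the squared amplitude with no degeneracy or frequency factors.

```python
import numpy as np

def half_linestrength(dipole, c_i):
    """The expensive step, done once per initial state: contract the
    dipole matrix with the initial-state eigenvector."""
    # dipole: (n_final_basis, n_initial_basis), c_i: (n_initial_basis,)
    return dipole @ c_i

def linestrength(half_ls, c_f):
    """The cheap step, done once per transition: a single dot product
    with the final-state eigenvector."""
    return np.dot(c_f, half_ls) ** 2

# Tiny demo with random stand-in data (sizes are illustrative only).
rng = np.random.default_rng(0)
dipole = rng.standard_normal((6, 4))
c_i = rng.standard_normal(4)
c_f = rng.standard_normal(6)

half_ls = half_linestrength(dipole, c_i)
# Completing via the half-linestrength matches the direct double contraction.
direct = (c_f @ dipole @ c_i) ** 2
assert np.isclose(linestrength(half_ls, c_f), direct)
```

The payoff is that the half-linestrength is reused across every final state sharing that initial state, so its cost is amortised over all of those transitions.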
5
Life is too short to wait around for transitions Question: how can you complete a line list quickly? (1) Reduce the quality of the line list (2) Make it faster Hint: the answer is not (1)
6
The half-linestrength Focus of the talk will be here: tens of thousands of initial states! High-J times per initial state: H2CO: 30 seconds PH3: 1 minute SO3: 7-8 minutes!
8
Half line strength Initial basis-set Final basis-set
9
Half line strength Initial basis-set T:0 T:1 T:2 ….. T:9
11
Half line strength Initial basis-set T:0 T:1 T:2 … T:9 1043.19 hours was with 16 cores!
12
Enter the GPU Graphics Processing Units can have around 2000 cores Highly parallel in nature, with lots of arithmetic capability
13
Half line strength (one OpenMP thread per final-basis element):
For all elements in the J’’ basis-set:
    Get K_f, tau_f
    For all elements in the J’ basis-set:
        Get K_i, tau_i, c_i
        Get dipole
        Do maths
        Accumulate the half-ls vector
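The pseudocode above can be written out as an explicit double loop. This is a sketch of the per-thread work only: `dipole_element` is a hypothetical stand-in for however the dipole matrix element is looked up or computed from the quantum numbers, and in the real OpenMP code the outer loop is what gets parallelised.

```python
import numpy as np

def half_ls_loop(K_f, tau_f, K_i, tau_i, c_i, dipole_element):
    """For each final basis function, accumulate dipole * coefficient
    over the whole initial basis (the baseline, uncached algorithm)."""
    half_ls = np.zeros(len(K_f))
    for f in range(len(K_f)):          # one OpenMP thread handles each f
        acc = 0.0
        for i in range(len(K_i)):
            acc += dipole_element(K_f[f], tau_f[f], K_i[i], tau_i[i]) * c_i[i]
        half_ls[f] = acc
    return half_ls
```

Note that every iteration of the inner loop touches `K_i`, `tau_i`, and `c_i`; this redundancy is exactly what the GPU optimisation later in the talk targets.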
14
Baseline Kernel Why is it slow?
15
Optimising But we have so many cores! Why is it slow? Each thread must: 1 - read J_i, K_i, tau_i 2 - read the dipole matrix 3 - read the coefficients 4 - do the maths and accumulate It turns out memory operations are fairly slow, and we are doing a lot of them. CPUs have large, multi-level caches; GPUs have very simple caches.
16
Optimising We are provided with a user-managed cache called shared memory It's a small chunk of memory that's REALLY fast A lot of the global memory reads are redundant
17
Optimising Initial basis-set Final basis-set Each thread is reading the same J_i, K_i, tau_i and coefficients
18
Optimising Why not have the threads cache it instead? Final Initial Cache quanta and coefficients
19
Optimising Do the maths and repeat Final Initial
20
Optimising Final Initial This is the Cache and Reduce (CR) Kernel
21
Cache and Reduce (one GPU thread; a block of 256 threads):
For all elements in the J’’ basis-set:
    Get K_f, tau_f
    For all elements in the J’ basis-set, in steps of 256:
        Get K_i, tau_i, c_i at the thread's point
        Store in shared memory
        For all elements in shared memory:
            Get K_i, tau_i, c_i
            Get dipole
            Do maths
            Accumulate the half-ls vector
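The tiling idea behind Cache and Reduce can be mimicked on the CPU with numpy: load the initial basis in block-sized chunks (standing in for shared memory), then let every final-basis "thread" consume the cached chunk before the next one is loaded. A sketch, with a tiny tile size in place of the real 256-thread block:

```python
import numpy as np

TILE = 4  # stands in for the 256-thread block of the real CUDA kernel

def half_ls_tiled(dipole, c_i):
    """Tiled accumulation mimicking the Cache-and-Reduce kernel: each
    TILE-sized slice of coefficients is 'cached' once, then reused by
    every final-basis element, instead of being re-read per element."""
    n_f, n_i = dipole.shape
    half_ls = np.zeros(n_f)
    for start in range(0, n_i, TILE):               # cooperative cache load
        tile = slice(start, min(start + TILE, n_i))
        c_cached = c_i[tile].copy()                 # the shared-memory copy
        half_ls += dipole[:, tile] @ c_cached       # all threads hit the cache
    return half_ls
```

On the GPU the win is that each coefficient and set of quantum numbers is read from global memory once per block rather than once per thread, turning redundant global reads into fast shared-memory reads.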
22
Optimising Have each thread cache a part of the initial basis-set Final Initial Cache quanta and coefficients
23
Optimising
24
SO3 molecule:
25
Porting to the GPU Half line strength Line strength completion
26
Simple dot product: replace it with the cuBLAS version, ~5x faster for H2CO However, we have lots of final-state eigenvectors The strategy is to get lots done in parallel: Use stream execution Use multiple GPUs Why not both?
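Because there are many final-state eigenvectors, the completion step can be batched: stacking the eigenvectors as the rows of a matrix turns N separate dot products into one matrix-vector product, which is what a BLAS library (cuBLAS on the GPU) is built to do well. A hedged numpy sketch, with invented names and no physical prefactors:

```python
import numpy as np

def complete_linestrengths(half_ls, C_f):
    """Complete many transitions at once: C_f holds one final-state
    eigenvector per row, so a single GEMV replaces N dot products."""
    return (C_f @ half_ls) ** 2

# Demo: batched completion agrees with the one-at-a-time dot products.
rng = np.random.default_rng(2)
half_ls = rng.standard_normal(5)
C_f = rng.standard_normal((8, 5))       # 8 final states, 5 basis functions
batched = complete_linestrengths(half_ls, C_f)
one_at_a_time = np.array([np.dot(C_f[k], half_ls) ** 2 for k in range(8)])
assert np.allclose(batched, one_at_a_time)
```

Batching also exposes enough work to keep a GPU busy, which individual small dot products cannot.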
27
Stream execution Run multiple independent kernels simultaneously
28
Multiple GPUs Run multiple initial states on multiple GPUs
29
Line strength completion
30
Porting to the GPU Half line strength Line strength completion
31
Result:
32
Future Work Port the code to DVR3D Remove the dot product and switch to DGEMM Integrate fully into TROVE Finish my PhD
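The "switch to DGEMM" item generalises the batched completion one step further: collect many half-linestrength vectors (one per initial state) as the columns of a matrix, so a single matrix-matrix multiply produces every final-by-initial amplitude at once. A minimal numpy sketch of that idea, with illustrative names and sizes:

```python
import numpy as np

def complete_all(C_f, H):
    """Sketch of the DGEMM completion: C_f holds final-state eigenvectors
    as rows, H holds half-linestrength vectors as columns, so one GEMM
    yields the amplitude for every (final state, initial state) pair."""
    return (C_f @ H) ** 2

# Demo: the GEMM result matches the pairwise dot products.
rng = np.random.default_rng(3)
C_f = rng.standard_normal((8, 5))   # 8 final states
H = rng.standard_normal((5, 3))     # 3 initial states' half-ls vectors
S = complete_all(C_f, H)            # shape (8, 3)
pairwise = np.array([[np.dot(C_f[k], H[:, j]) ** 2 for j in range(3)]
                     for k in range(8)])
assert np.allclose(S, pairwise)
```

GEMM has far higher arithmetic intensity than repeated dot products, which is why it maps so well onto GPU hardware.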
33
Thanks