Quantum Chemistry and First Principles Molecular Dynamics on GPUs
Todd J. Martinez, Dept. of Chemistry and PULSE Institute, Stanford University; SLAC National Accelerator Laboratory

Presentation transcript:

1 Quantum Chemistry and First Principles Molecular Dynamics on GPUs
Todd J. Martinez, Dept. of Chemistry and PULSE Institute, Stanford University; SLAC National Accelerator Laboratory
TeraChem team: Ivan Ufimtsev (core SCF and gradient routines), Alexey Titov (meta-programming strategies), Nathan Luehr (DFT quadrature and dynamic precision), Jeffrey Gour (TDDFT gradients and nonadiabatic couplings), Christine Isborn (TDDFT and CIS), Heather Kulik (ab initio protein structure optimization)
Funding: NSF, AFOSR-STTR, DoD, DOE
www.petachem.com

2 Two Pillars of Chemical Simulation
Quantum chemistry: where are the electrons?
Molecular dynamics: where are the atoms?
The two are linked through the potential energy surface.

3 Hartree-Fock Approximation
The electronic wavefunction Ψ_electronic is an antisymmetrized product of N one-electron orbitals ψ.
Each ψ is expanded over a predefined basis set φ, often taken to be atom-centered Gaussian functions in the molecular (non-periodic) case.
Density functional theory includes K_xc[ρ(ψ)] in the mean-field Hamiltonian. This adds the need for numerical quadrature, but quadrature is one of the most straightforward tasks to implement efficiently on a GPU (1-2 days of work) and is not a major bottleneck.
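As a toy illustration of the fixed Gaussian contractions mentioned above, the sketch below evaluates an s-type contracted basis function at a distance r from its center. The exponents and coefficients are the standard STO-3G values for a hydrogen 1s orbital, used here only as an example.

```python
import numpy as np

# Standard STO-3G contraction for a hydrogen 1s orbital
# (three primitive Gaussians with fixed coefficients).
exps  = np.array([3.42525091, 0.62391373, 0.16885540])
coefs = np.array([0.15432897, 0.53532814, 0.44463454])

def contracted_s(r, exps, coefs):
    """phi(r) = sum_k c_k N_k exp(-a_k r^2): an s-type contraction."""
    norms = (2.0 * exps / np.pi) ** 0.75   # normalization of each primitive
    return float(np.sum(coefs * norms * np.exp(-exps * r**2)))
```

The contraction coefficients never change during the calculation, which is what makes the evaluation so regular and stream-friendly.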

4 Hartree-Fock Equations
All matrices are of size N x N (N ~ 1,000-10,000).
Solving the HF equations (diagonalization of F) costs N^3 operations; building the Fock matrix F costs N^4 operations (formally).
Fock matrix construction is usually the bottleneck.
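The N^3 diagonalization step can be sketched as solving the generalized eigenvalue problem F C = S C eps by symmetric orthogonalization. This is the generic textbook recipe, not TeraChem's actual solver, and the 2x2 matrices below are hypothetical numbers for a quick check.

```python
import numpy as np

def solve_roothaan(F, S):
    """Solve F C = S C diag(eps) via X = S^{-1/2} (the N^3 step)."""
    s, U = np.linalg.eigh(S)
    X = U @ np.diag(s**-0.5) @ U.T   # symmetric orthogonalizer S^{-1/2}
    eps, Cp = np.linalg.eigh(X @ F @ X)
    return eps, X @ Cp               # back-transform to the AO basis

# Toy 2x2 problem (hypothetical overlap and Fock matrices)
S = np.array([[1.0, 0.4], [0.4, 1.0]])
F = np.array([[-1.0, -0.5], [-0.5, -0.6]])
eps, C = solve_roothaan(F, S)
# The orbitals come out S-orthonormal: C.T @ S @ C = I
```

This step is cubic but dense linear algebra, so it maps straightforwardly onto GPU BLAS; the quartic Fock build is the part that needs custom kernels.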

5 What Are GPUs Good For?
"Stream processing": massively parallel problems with millions of threads running concurrently (more than 10,000 threads can run in parallel).
Simple kernels: launch as many threads as possible to hide memory latency.
High computational load (large FLOP-to-memory-operation ratio).
Little or no communication between threads, except within small thread blocks (up to 512 threads) that can share data very efficiently.

6 Quantum Chemistry: The Never-Ending Contraction
Every atomic orbital is a fixed contraction of Gaussians.
Molecular orbitals are orthogonal contractions of AOs.
Antisymmetrized products (APs) are built from MOs.
The total electronic wavefunction is a contraction of APs.
Large amounts of data management and linear algebra: quantum chemistry is "stream-friendly"!

7 Computation of ERIs on the GPU
"Pair quantities" for bra and ket: N^2 of them.
[bra|ket] integrals: N^4 of them.
One thread for each primitive integral!
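For illustration, the work assigned to one such thread, a primitive [ss|ss] integral, can be written in terms of bra and ket pair quantities and the Boys function. This is the standard closed-form expression for unnormalized s Gaussians, shown as a plain Python sketch rather than the actual CUDA kernel.

```python
import numpy as np
from math import erf, sqrt, pi, exp

def F0(t):
    """Boys function of order zero."""
    return 1.0 if t < 1e-12 else 0.5 * sqrt(pi / t) * erf(sqrt(t))

def eri_ssss(a, A, b, B, c, C, d, D):
    """Primitive [ss|ss] integral over unnormalized Gaussians.
    The pair quantities (p, P, Kab) depend only on the bra and
    (q, Q, Kcd) only on the ket, so they are precomputed once
    for N^2 pairs and reused across N^4 integrals."""
    A, B, C, D = map(np.asarray, (A, B, C, D))
    p, q = a + b, c + d
    P = (a * A + b * B) / p
    Q = (c * C + d * D) / q
    Kab = exp(-a * b / p * np.dot(A - B, A - B))
    Kcd = exp(-c * d / q * np.dot(C - D, C - D))
    T = p * q / (p + q) * np.dot(P - Q, P - Q)
    return 2 * pi**2.5 / (p * q * sqrt(p + q)) * Kab * Kcd * F0(T)
```

Each thread's arithmetic is identical and data-independent, which is exactly the "simple kernel, many threads" pattern from the previous slide.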

8 Arrangement of Two-Electron Integrals
Different code for each integral type, e.g. [ss|ss] vs. [sp|ss].
Rearrange the computation to optimize kernel execution.

9 Further Arrangement of Integrals
Sorting by magnitude leaves only ~N^2 of the N^4 integrals. Without sorting, a SIMD warp spends most of its time calculating negligibly small integrals; with sorting, each warp calculates only significant integrals. (The slide shows the [ij||kl] magnitude matrix over the ss/sp/pp classes, with entries ranging from 1.0 down to 1.0x10^-9.)
Note: "pre"-computation to put the problem in a form suitable for the GPU is essential and, in our experience, always the key to good performance. This is likely also better for modern multi-core CPUs, but CPUs are more forgiving.
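A minimal sketch of the idea, assuming Schwarz-type bounds Q_ij = sqrt([ij|ij]) for the pair quantities (the slide does not name the exact bound used): sort pairs by their bound and screen products against a threshold, so the surviving, significant integrals are contiguous and whole warps stay busy.

```python
import numpy as np

def screen_pairs(bra_bounds, ket_bounds, thresh=1e-9):
    """Keep only (bra, ket) pairs whose Schwarz-style product bound
    Q_ij * Q_kl exceeds thresh.  Sorting largest-first keeps the
    surviving work contiguous for SIMD execution."""
    order_b = np.argsort(bra_bounds)[::-1]
    order_k = np.argsort(ket_bounds)[::-1]
    return [(i, k) for i in order_b for k in order_k
            if bra_bounds[i] * ket_bounds[k] > thresh]

# Hypothetical pair bounds spanning many orders of magnitude,
# like the 1.0 ... 1.0e-9 entries on the slide.
bra = np.array([1.0, 1e-3, 1e-6])
ket = np.array([1.0, 1e-3, 1e-6])
pairs = screen_pairs(bra, ket)   # products at or below 1e-9 are dropped
```

With the threshold above, 6 of the 9 candidate quartets survive; for real molecules the fraction kept falls toward N^2/N^4 as the system grows.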

10 J-Matrix Implementation
The next critical step is ordering threads into blocks and specifying how to traverse the problem.
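The contraction the J-matrix kernel performs is simple to state; here is a dense NumPy sketch with hypothetical random data, ignoring the thread-blocking and traversal details the slide is about.

```python
import numpy as np

def coulomb_matrix(eri, D):
    """J_ij = sum_kl (ij|kl) D_kl.  Deliberately ignores the
    (ij|kl) = (kl|ij) symmetry: redundant flops, but every output
    element is an independent reduction, which suits the GPU."""
    return np.einsum('ijkl,kl->ij', eri, D)

# Toy check on a random ERI-like tensor (hypothetical data, not a molecule)
rng = np.random.default_rng(0)
n = 4
eri = rng.random((n, n, n, n))
D = rng.random((n, n))
D = D + D.T                      # density matrix is symmetric
J = coulomb_matrix(eri, D)
```

In the real code, each thread block owns a tile of J and streams through the screened, sorted pair list from the previous slide.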

11 How Well Does This Work in Practice? RHF/3-21G

12 Riding the Wave of Faster GPUs...
GT200 → Fermi with no code modification.

13 TeraChem / GAMESS Benchmarks
The speedups are bigger for larger molecules. For the BPTI protein:
GAMESS, 32 Intel Clovertown cores: 12,954 s
TeraChem, 4 GeForce GTX 295 GPUs: 407 s
127x speedup, GPU vs. CPU core.
1 TeraStation ($10K) ≈ 128 dual quad-core Intel nodes ($512K-$1M).

14 Numerical Precision on GPU
Take advantage of faster single-precision arithmetic. One can never use only single precision, but mixed precision works:
Diagonalize in double precision.
Accumulate contributions to the Fock matrix in double precision (MP1).
Use double precision for the largest integrals, single precision for the smallest (MP2).
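A toy sketch of the MP2-style splitting, with a hypothetical magnitude cutoff: contributions above the cutoff are summed in double precision, the rest in single precision, and the two partial sums are combined in double.

```python
import numpy as np

def mixed_precision_sum(values, cutoff=1e-4):
    """Accumulate large contributions in fp64 and small ones in fp32.
    The cutoff here is an illustrative choice, not TeraChem's."""
    values = np.asarray(values, dtype=np.float64)
    big   = values[np.abs(values) >= cutoff]
    small = values[np.abs(values) <  cutoff].astype(np.float32)
    return big.sum(dtype=np.float64) + np.float64(small.sum(dtype=np.float32))
```

Because the small terms carry little weight in the total, the fp32 rounding they pick up perturbs the final energy far less than rounding the large terms would, which is why the MP1/MP2 energies in the next slide track the double-precision results so closely.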

15 Numerical Precision on GPU
Final energies (hartree):

              Ascorbic Acid   Lactose          Cyano Toxin
GAMESS        -680.5828947    -1289.6666250    -2491.2058893
TeraChem DP   -680.5828947    -1289.6666250    -2491.2058890
TeraChem MP1  -680.5828071    -1289.6664266    -2491.2053660

              Neurokinin A    5x6 Nanotube
GAMESS        -4089.6883770   -13790.1415171
TeraChem DP   -4089.6883762   -13790.1415176
TeraChem MP1  -4089.6879824   -13790.1389987

16 Beyond Mixed Precision
Change precision dynamically: 2-4x speedup over pure double precision (2x for GTX480, 4x for GTX295).

System                   Conv Thres  DP Final Energy     Iter  Dyn Final Energy    Iter  Precision Error
Ascorbic Acid (1000K)    10^-7       -680.5828947151     17    -680.5828947066     17    8.50E-09
Ascorbic Acid (1000K)    10^-5       -680.5828947060     12    -680.5828947665     12    6.05E-08
Lactose (Minimum)        10^-7       -1290.0883460632    14    -1290.0883460414    14    2.18E-08
Lactose (Minimum)        10^-5       -1290.0883460086    10    -1290.0883459660    10    4.26E-08
Cyano Toxin (2000K)      10^-7       -2491.2058890235    21    -2491.2058890017    21    2.18E-08
Cyano Toxin (2000K)      10^-5       -2491.2058889916    13    -2491.2058886707    13    3.21E-07
Neurokinin A (Minimum)   10^-7       -4091.3672645555    19    -4091.3672645944    20    3.89E-08
Neurokinin A (Minimum)   10^-5       -4091.3672645489    14    -4091.3672644494    14    9.95E-08
Nanotube (Minimum)       10^-7       -13793.7293925221   24    -13793.7293925323   23    1.02E-08
Nanotube (Minimum)       10^-5       -13793.7293924922   15    -13793.7293928287   15    3.37E-07
Nanotube (2000K)         10^-7       -13790.1415175662   29    -13790.1415175584   27    7.80E-09
Nanotube (2000K)         10^-5       -13790.1415175332   18    -13790.1415191026   18    1.57E-06
Crambin                  10^-7       -17996.6562925538   18    -17996.6562926036   18    4.98E-08
Crambin                  10^-5       -17996.6562925535   12    -17996.6562927894   12    2.36E-07
Ubiquitin                10^-7       -29616.4426376594   24    -29616.4426376596   24    2.00E-10
Ubiquitin                10^-5       -29616.4426376302   18    -29616.4426376655   18    3.53E-08
T-Cadherin EC1           10^-7       -36975.6726049407   21    -36975.6726049394   21    1.30E-09
T-Cadherin EC1           10^-5       -36975.6726049265   16    -36975.6726048777   16    4.88E-08
Ribonuclease A           10^-7       -50813.1471248227   19    -50813.1471248179   19    4.80E-09
Ribonuclease A           10^-5       -50813.1471247051   12    -50813.1471250247   12    3.20E-07
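The dynamic-precision idea can be sketched as a per-iteration precision choice keyed to the current SCF error: cheap single precision while the error is large, double precision for the final tight iterations. The 10^-3 switch point below is an illustrative assumption, not TeraChem's actual criterion.

```python
import numpy as np

def fock_dtype(scf_error, switch=1e-3):
    """Choose the floating-point type for this iteration's Fock build.
    Early, inaccurate iterations tolerate fp32; the final iterations
    that set the converged energy run in fp64."""
    return np.float32 if scf_error > switch else np.float64

# Early iterations -> fast single precision; late ones -> double.
assert fock_dtype(1e-1) is np.float32
assert fock_dtype(1e-6) is np.float64
```

Since most SCF iterations happen while the error is still large, most of the work runs at single-precision speed, yet the converged energies in the table above match pure double precision to roughly the convergence threshold.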

17 Summary
Some key issues:
Compute up front on the CPU to optimize computation on the GPU.
You need many lightweight threads; look for the data that scales worst to determine the parallelization direction.
Often you want to waste compute cycles to maximize throughput, e.g. do not take advantage of integral index symmetries (ij|kl) = (kl|ij) in the J-matrix build.
Consider writing codes (or using a combination of Maple/Python/C) to write codes: a forerunner to an ATLAS-like strategy for tasks more complex than linear algebra.
Many of these ideas are also useful on CPUs. The primary difference is that bad GPU code is 100x slower than good GPU code, while bad CPU code is only 5x slower than good CPU code...
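The "write codes to write codes" point can be sketched as a tiny Python generator that emits a specialized kernel stub per integral class. The kernel body below is a hypothetical placeholder, not real integral code; the real generators unroll the class-specific recurrence relations into the body.

```python
def generate_kernel(bra, ket):
    """Emit a CUDA-style kernel stub specialized for one integral
    class, e.g. [ss|sp].  Placeholder body for illustration only."""
    name = f"eri_{bra}{ket}"
    lines = [
        f"__global__ void {name}(const Pair* bras, const Pair* kets, double* out) {{",
        "    int tid = blockIdx.x * blockDim.x + threadIdx.x;",
        f"    // unrolled code specific to the [{bra}|{ket}] class goes here",
        "}",
    ]
    return "\n".join(lines)

print(generate_kernel("ss", "sp"))
```

Generating one tuned kernel per class keeps each kernel simple and branch-free, which is exactly what the warp-execution model rewards.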


