Quantum Chemistry and First Principles Molecular Dynamics on GPUs
Todd J. Martinez, Dept. of Chemistry and PULSE Institute, Stanford University / SLAC National Accelerator Laboratory

TeraChem Team:
- Ivan Ufimtsev (Core SCF and Gradient Routines)
- Alexey Titov (Meta-Programming Strategies)
- Nathan Luehr (DFT Quadrature and Dynamic Precision)
- Jeffrey Gour (TDDFT Gradients and Nonadiabatic Couplings)
- Christine Isborn (TDDFT and CIS)
- Heather Kulik (Ab Initio Protein Structure Optimization)

Funding: NSF, AFOSR-STTR, DoD, DOE

Two Pillars of Chemical Simulation
Quantum Chemistry: where are the electrons? Molecular Dynamics: where are the atoms? The two are connected by the potential energy surface, which quantum chemistry provides and on which the dynamics run.

Hartree-Fock Approximation
Ψ_electronic is an antisymmetrized product of N one-electron orbitals ψ_i. Each ψ_i is expanded over a predefined basis set {φ_μ}, often taken to be atom-centered Gaussian functions in the molecular (non-periodic) case.
Density Functional Theory includes K_xc[ρ(ψ)] in the mean-field Hamiltonian. This adds the need for numerical quadrature, but that is one of the most straightforward tasks to implement efficiently on a GPU (1-2 days of work) and it is not a major bottleneck.
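Written out explicitly (standard notation, added here for reference rather than taken from the slide):

```latex
% Antisymmetrized product (Slater determinant) of N occupied orbitals
\Psi_{\mathrm{electronic}}(\mathbf{r}_1,\dots,\mathbf{r}_N)
  = \hat{\mathcal{A}}\,\bigl[\psi_1(\mathbf{r}_1)\,\psi_2(\mathbf{r}_2)\cdots\psi_N(\mathbf{r}_N)\bigr]

% LCAO expansion of each molecular orbital over the atom-centered basis
\psi_i(\mathbf{r}) = \sum_{\mu} C_{\mu i}\,\varphi_\mu(\mathbf{r})
```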

Hartree-Fock Equations
All matrices are of size N×N (N ~ 1,000 … 10,000). Solving the HF equations costs N³ operations (diagonalization of F); getting F costs N⁴ operations (formally). Fock matrix construction is usually the bottleneck.
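The standard Roothaan-Hall form of these equations (the transcript dropped the equation images, so this is a reconstruction from the surrounding text) makes both scalings visible:

```latex
% Roothaan-Hall equations: a generalized eigenvalue problem, O(N^3) to solve
F C = S C \varepsilon

% Fock matrix build: the four-index electron repulsion integrals (ERIs)
% make this step formally O(N^4)
F_{\mu\nu} = H^{\mathrm{core}}_{\mu\nu}
  + \sum_{\lambda\sigma} P_{\lambda\sigma}
    \Bigl[(\mu\nu|\lambda\sigma) - \tfrac{1}{2}(\mu\lambda|\nu\sigma)\Bigr]
```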

What are GPUs good for? "Stream processing":
- Massively parallel problems, with millions of threads running concurrently (more than 10,000 threads can run in parallel)
- Simple kernels, so that as many threads as possible can be launched to hide memory latency
- High computational load (a large FLOPS/MOPS ratio)
- Little or no communication between threads, except within small blocks of threads (up to 512-thread blocks) that can share data very efficiently
A minimal kernel in this style is sketched below.
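The sketch below (not from the talk; all names are illustrative) shows the pattern: a trivially simple kernel, one lightweight thread per element, and enough blocks in flight for the hardware to hide memory latency. Managed memory (cudaMallocManaged) is a modern convenience, used here only to keep the example short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One lightweight thread per element: out[i] = a*x[i] + y[i]
__global__ void scale_add(const float* x, const float* y, float a,
                          float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) out[i] = a * x[i] + y[i];            // one element per thread
}

int main() {
    const int n = 1 << 20;                 // ~1M elements -> ~1M threads
    float *x, *y, *out;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    int block = 256;                       // threads per block
    int grid  = (n + block - 1) / block;   // enough blocks to cover all elements
    scale_add<<<grid, block>>>(x, y, 3.0f, out, n);
    cudaDeviceSynchronize();

    printf("out[0] = %f\n", out[0]);       // expect 5.0
    cudaFree(x); cudaFree(y); cudaFree(out);
    return 0;
}
```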

Quantum Chemistry - The Never-Ending Contraction
Every atomic orbital is a fixed contraction of Gaussians; molecular orbitals are orthogonal contractions of AOs; antisymmetrized products (APs) are built from MOs; and the total electronic wavefunction is a contraction of APs. The result is a large amount of data management and linear algebra: quantum chemistry is "stream-friendly!"
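The bottom layer of this hierarchy, written out (a standard contracted Cartesian Gaussian; the explicit form is my addition, not the slide's):

```latex
% A contracted atom-centered Gaussian basis function: fixed contraction
% coefficients d_k and exponents alpha_k, centered on nucleus A
\varphi_\mu(\mathbf{r}) = \sum_{k} d_{k\mu}\,
  (x - A_x)^{l}\,(y - A_y)^{m}\,(z - A_z)^{n}\,
  e^{-\alpha_{k\mu}\,|\mathbf{r} - \mathbf{A}|^{2}}
```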

Computation of ERIs on GPU
[Figure contrasting the CPU and GPU sides of the computation.] "Pair quantities" for bra and ket: N² of them. [bra|ket] integrals: N⁴ of them. One thread for each primitive integral!
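A hedged sketch of what such a kernel can look like for the simplest, [ss|ss], class (illustrative only, not TeraChem's code; the closed-form primitive integral is the textbook one for unnormalized s Gaussians, with normalization folded into the contraction coefficients in practice):

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Precomputed "pair quantities" for a primitive pair (exponents a,b at
// centers A,B): p = a+b, P = (aA+bB)/p, K = exp(-(a*b/p)|A-B|^2)
struct PairQ { float p; float3 P; float K; };   // illustrative layout

__device__ float boys_f0(float T) {
    // F0(T) = (1/2) sqrt(pi/T) erf(sqrt(T)), with F0(0) = 1
    if (T < 1e-12f) return 1.0f;
    float s = sqrtf(T);
    return 0.886226925f * erff(s) / s;          // 0.88622... = sqrt(pi)/2
}

// One thread per primitive [ss|ss] integral: thread (i,j) pairs bra i
// with ket j and evaluates the closed-form expression
__global__ void ssss_eri(const PairQ* bra, const PairQ* ket,
                         int nbra, int nket, float* out) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;   // bra pair index
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // ket pair index
    if (i >= nbra || j >= nket) return;

    float p = bra[i].p, q = ket[j].p;
    float dx = bra[i].P.x - ket[j].P.x;
    float dy = bra[i].P.y - ket[j].P.y;
    float dz = bra[i].P.z - ket[j].P.z;
    float T  = p * q / (p + q) * (dx*dx + dy*dy + dz*dz);

    // [ss|ss] = 2 pi^(5/2) / (p q sqrt(p+q)) * K_bra * K_ket * F0(T)
    float pref = 34.986836f / (p * q * sqrtf(p + q)); // 34.98... = 2 pi^(5/2)
    out[i * nket + j] = pref * bra[i].K * ket[j].K * boys_f0(T);
}
```

One would launch this with a 2-D grid covering the nbra × nket pair lists, e.g. dim3 block(16, 16); dim3 grid((nket + 15) / 16, (nbra + 15) / 16);.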

Arrangement of Two-Electron Integrals
A different code path is needed for each integral type, e.g. [ss|ss] vs. [sp|ss]; the computation is rearranged to optimize kernel execution.

Further Arrangement of Integrals
Screening leaves only N² out of the N⁴ integrals. If the surviving integrals are mapped to threads in their natural order, a SIMD warp ends up calculating mostly negligibly small integrals; after sorting the work by magnitude, a SIMD warp calculates only significant integrals. [Figure: the matrix of |[ij|kl]| values, blocked by ss/sp/pp integral classes, with magnitudes ranging down to ~10⁻⁹.]
Note: "pre"-computation to put the problem in a form suitable for the GPU is essential and, in our experience, always the key to good performance. It is likely also better for modern CPUs and multi-core (but CPUs are more forgiving).
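The screening step is usually the Cauchy-Schwarz bound (the slide does not name the estimator, so this is the standard assumption):

```latex
% Cauchy-Schwarz upper bound on an ERI: if the bound falls below a drop
% threshold, the integral (ij|kl) need not be computed at all
\bigl|(ij|kl)\bigr| \le \sqrt{(ij|ij)}\,\sqrt{(kl|kl)}
```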

J-matrix Implementation
The next critical step is ordering threads into blocks and specifying how to traverse the problem.
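For reference, the matrix being assembled here is the Coulomb matrix (standard definition, not spelled out in the transcript):

```latex
% Coulomb (J) matrix: contraction of the ERIs with the density matrix.
% Note the (ij|kl) = (kl|ij) symmetry that the GPU build deliberately
% ignores to keep threads independent (see Summary)
J_{\mu\nu} = \sum_{\lambda\sigma} (\mu\nu|\lambda\sigma)\,P_{\lambda\sigma}
```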

How Well Does this Work in Practice?
[Benchmark figure at the RHF/3-21G level; the timing data were not captured in the transcript.]

Riding the Wave of Faster GPUs…
GT200 → Fermi, with no code modification.

TeraChem / GAMESS Benchmarks
The speedups are bigger for larger molecules. GAMESS, 32 Intel Clovertown cores: … s; TeraChem, 4 GeForce 295 GTX GPUs: 407 s, a 127x GPU vs. CPU-core speedup. 1 TeraStation ($10K) = 128 dual quad-core Intel nodes ($512K-$1M). [Figure: the BPTI protein.]

Numerical Precision on GPU
Take advantage of faster single-precision arithmetic. One can never use only single precision, but mixed precision works:
- Diagonalize in double precision
- Accumulate contributions to the Fock matrix in double precision (MP1)
- Use DP for the largest integrals, SP for the smallest (MP2)
A minimal sketch of the MP1 accumulation pattern follows.
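Illustrative only (hypothetical names, not TeraChem's code): each contribution is computed in fast single precision, while the running accumulator is kept in double precision so that millions of small rounding errors do not build up.

```cuda
// Mixed-precision (MP1-style) accumulation: per-thread work is done in
// float, the Fock-matrix accumulator lives in double.
__global__ void accumulate_fock_mp1(const float* __restrict__ contrib,
                                    double* __restrict__ fock,
                                    int n_contrib_per_elem, int n_elems) {
    int e = blockIdx.x * blockDim.x + threadIdx.x;   // one matrix element
    if (e >= n_elems) return;

    double acc = 0.0;                                // DP accumulator
    for (int k = 0; k < n_contrib_per_elem; ++k) {
        float c = contrib[e * n_contrib_per_elem + k];  // SP contribution
        acc += (double)c;                            // widen once per term
    }
    fock[e] += acc;
}
```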

Numerical Precision on GPU
[Table: GAMESS vs. TeraChem DP vs. TeraChem MP results for Ascorbic Acid, Lactose, Cyano Toxin, Neurokinin A, and a 5x6 Nanotube; the numerical entries were not captured in the transcript.]

Beyond Mixed Precision
Change precision dynamically…
[Table: Double Precision vs. Dynamic Precision, listing precision error, convergence threshold, final energy, and iteration count for Ascorbic Acid (1000K), Lactose (minimum), Cyano Toxin (2000K), Neurokinin A (minimum), Nanotube (minimum), Nanotube (2000K), Crambin, Ubiquitin, T-Cadherin EC, and Ribonuclease A; the numerical entries were not captured in the transcript.]
This yields a 2-4x speedup over pure double precision (2x for GTX480, 4x for GTX295). A sketch of one possible switching scheme follows.
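The transcript does not say how the precision switch is triggered, so the following host-side control loop is only a plausible reading (hypothetical helpers, stubbed SCF step): run early iterations through the single-precision path and switch to double precision once the SCF error falls below a crossover threshold.

```cuda
#include <cstdio>

// Stub standing in for one SCF iteration: a real code would build the
// Fock matrix with SP or DP kernels and diagonalize in DP, returning
// the SCF error. Here the error just shrinks geometrically.
static double scf_step(bool use_dp, int iter) {
    (void)use_dp;
    double err = 1.0;
    for (int k = 0; k <= iter; ++k) err *= 0.2;
    return err;
}

int main() {
    const double conv_threshold      = 1e-8;  // final SCF convergence
    const double precision_crossover = 1e-4;  // switch SP -> DP here
    bool use_dp = false;
    double err = 1.0;

    for (int iter = 0; iter < 100 && err > conv_threshold; ++iter) {
        if (!use_dp && err < precision_crossover) {
            use_dp = true;    // density nearly converged: tighten precision
            printf("iter %d: switching to double precision\n", iter);
        }
        err = scf_step(use_dp, iter);
        printf("iter %d (%s): error %.2e\n", iter, use_dp ? "DP" : "SP", err);
    }
    return 0;
}
```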

Summary
Some key issues:
- Compute upfront on the CPU to optimize the computation on the GPU.
- You need many lightweight threads: look for the data that scales worst in order to choose the parallelization direction.
- Often it pays to "waste" compute cycles to maximize throughput, e.g. by not exploiting the integral index symmetry (ij|kl) = (kl|ij) in the J-matrix build.
- Consider writing codes (or using a combination of Maple/Python/C) that write codes: a forerunner of an ATLAS-like strategy for tasks more complex than linear algebra.
- Many of these ideas are also useful on CPUs. The primary difference is that bad GPU code is 100x slower than good GPU code, while bad CPU code is only 5x slower than good CPU code…