May 8th, 2012 Higher Ed & Research
Molecular Dynamics Applications Overview
  Sections included: AMBER, NAMD, GROMACS, LAMMPS
Molecular Dynamics (MD) Applications
  AMBER
    Features supported: PMEMD explicit solvent & GB implicit solvent
    GPU perf: ns/day, JAC NVE on 16x 2090s
    Release status: Released; multi-GPU, multi-node
    Notes/benchmarks: AMBER 12; htm#Benchmarks
  NAMD
    Features supported: Full electrostatics with PME and most simulation features
    GPU perf: 6.44 ns/day, STMV on 585x 2050s
    Release status: Released; 100M atom capable; multi-GPU, multi-node
    Notes/benchmarks: NAMD version April 12; md_ bench.html
  GROMACS
    Features supported: Implicit (5x), explicit (2x) solvent via OpenMM
    GPU perf: 165 ns/day, DHFR on 4x C2075s
    Release status: 4.5 single GPU released; 4.6 multi-GPU released
    Notes/benchmarks: gpu.html
  LAMMPS
    Features supported: Lennard-Jones, Gay-Berne, Tersoff
    GPU perf: x
    Release status: Released; multi-GPU, multi-node
    Notes/benchmarks: 1 billion atom on Lincoln: # machine
  GPU perf compared against multi-core x86 CPU socket. GPU perf benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.
New/Additional MD Applications Ramping
  Abalone
    Features supported: TBD
    GPU perf: Simulations 4-29x (on 1060 GPU)
    Release status: Released
    Notes: Single GPU. Agile Molecule, Inc.
  ACEMD
    Features supported: Written for use on GPUs
    GPU perf: 160 ns/day
    Release status: Released
    Notes: Production bio-molecular dynamics (MD) software specially optimized to run on single and multiple GPUs
  DL_POLY
    Features supported: Two-body forces, link-cell pairs, Ewald SPME forces, Shake VV
    GPU perf: 4x
    Release status: V 4.0 source only; results published
    Notes: Multi-GPU, multi-node supported
  HOOMD-Blue
    Features supported: Written for use on GPUs
    GPU perf: 2x (32 CPU cores vs. 2 10XX GPUs)
    Release status: Released, Version
    Notes: Single and multi-GPU
  GPU perf compared against multi-core x86 CPU socket. GPU perf benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.
GPU Value to Molecular Dynamics
  What: Study disease & discover drugs. Predict drug and protein interactions.
  Why: Speed of simulations is critical. Enables study of longer timeframes, larger systems, and more simulations.
  How: GPUs increase throughput & accelerate simulations.
  AMBER 11 application: 4.6x performance increase with 2 GPUs with only a 54% added cost*
  * AMBER 11 Cellulose NPT on 2x E5670 CPUs + 2x Tesla C2090s (per node) vs. 2x E5670 CPUs (per node). Cost of CPU node assumed to be $9333. Cost of adding two (2) 2090s to a single node is assumed to be $5333.
  GPU Test Drive pre-configured applications: AMBER 11, NAMD 2.8
  GPU Ready applications: Abalone, ACEMD, AMBER, DL_POLY, GAMESS, GROMACS, LAMMPS, NAMD
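The price/performance claim on this slide can be checked with a little arithmetic. The sketch below uses only the figures quoted on the slide (a $9333 CPU node, $5333 for two GPUs, a 4.6x speedup); the function names are illustrative and not part of any vendor tool. Under these assumed costs the added-cost fraction works out to roughly 57%, slightly above the 54% quoted.

```python
def added_cost_fraction(base_cost, gpu_cost):
    """Extra spend for the GPUs as a fraction of the CPU-only node cost."""
    return gpu_cost / base_cost


def perf_per_dollar_gain(speedup, base_cost, gpu_cost):
    """How much more performance each dollar buys once GPUs are added."""
    return speedup / (1.0 + added_cost_fraction(base_cost, gpu_cost))


# Figures quoted on the slide (assumed costs, AMBER 11 Cellulose NPT).
CPU_NODE_COST = 9333.0
TWO_GPU_COST = 5333.0
SPEEDUP = 4.6

if __name__ == "__main__":
    frac = added_cost_fraction(CPU_NODE_COST, TWO_GPU_COST)
    gain = perf_per_dollar_gain(SPEEDUP, CPU_NODE_COST, TWO_GPU_COST)
    print(f"added cost: {frac:.1%}")
    print(f"performance per dollar vs. CPU-only node: {gain:.2f}x")
```

Dividing the speedup by the cost multiplier gives a single figure of merit, which makes it easy to compare configurations with different GPU counts or prices.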
All Key MD Codes are GPU Ready
  AMBER, NAMD, GROMACS, LAMMPS
  Life and material sciences
  Great multi-GPU performance
  Additional MD GPU codes: Abalone, ACEMD, HOOMD-Blue
  Focus: scaling to large numbers of GPUs
Outstanding AMBER Results with GPUs
Run AMBER Faster: Up to 5x Speed-Up With GPUs
  Benchmark: DHFR (NVE), 23,558 atoms
  “…with two GPUs we can run a single simulation as fast as on 128 CPUs of a Cray XT3 or on 1024 CPUs of an IBM BlueGene/L machine. We can try things that were undoable before. It still blows my mind.” Axel Kohlmeyer, Temple University
  Chart label: CPU supercomputer
AMBER: Make Research More Productive with GPUs
  Adding two 2090 GPUs to a node yields a > 4x performance increase
  Base node configuration: dual Xeon X5670s and dual Tesla M2090 GPUs per node
  With GPU vs. no GPU: 318% higher performance, 54% additional expense
Run NAMD Faster: Up to 7x Speed-Up With GPUs
  Benchmarks: ApoA-1 (92,224 atoms), STMV (1,066,628 atoms)
  Test platform: 1 node, dual Tesla M2090 GPU (6GB), dual Intel 4-core Xeon (2.4 GHz), NAMD 2.8, CUDA 4.0, ECC on.
  Visit for more information on speed up results, configuration and test models.
  NAMD 2.8 B1 + unreleased patch, STMV benchmark. A node is dual-socket, quad-core X5650 with 2 Tesla M2070 GPUs. Performance numbers for 2 M cores (GPU+CPU) vs. 8 cores (CPU).
Make Research More Productive with GPUs: Get up to a 250% Performance Increase (STMV, 1,066,628 atoms)
  With GPU vs. no GPU: 250% higher performance, 54% additional expense
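Figures such as "318% higher" (AMBER) and "250% higher" (NAMD STMV) are easy to misread as 3.18x and 2.5x. A minimal sketch of the conversion follows, assuming "N% higher" means N% above the no-GPU baseline and that the 54% additional expense applies to both cases as the slides state. The function names are illustrative.

```python
def percent_higher_to_factor(percent_higher):
    """'N% higher' than a baseline equals (1 + N/100) times the baseline."""
    return 1.0 + percent_higher / 100.0


def value_ratio(percent_higher, percent_extra_cost):
    """Performance multiplier divided by cost multiplier."""
    perf = percent_higher_to_factor(percent_higher)
    cost = 1.0 + percent_extra_cost / 100.0
    return perf / cost


if __name__ == "__main__":
    # Percentages quoted on the AMBER and NAMD slides.
    for name, pct in (("AMBER", 318.0), ("NAMD STMV", 250.0)):
        factor = percent_higher_to_factor(pct)
        ratio = value_ratio(pct, 54.0)
        print(f"{name}: {factor:.2f}x performance, {ratio:.2f}x performance per dollar")
```

Read this way, 318% higher is a 4.18x multiplier, which matches the "> 4x" wording on the AMBER slide, and 250% higher is a 3.5x multiplier.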
GROMACS Partnership Overview
  Erik Lindahl, David van der Spoel, and Berk Hess are head authors and project leaders. Szilárd Páll is a key GPU developer.
  2010: single-GPU support (OpenMM library in GROMACS 4.5)
  NVIDIA Dev Tech resources allocated to GROMACS code
  2012: GROMACS 4.6 will support multi-GPU nodes as well as GPU clusters
GROMACS 4.6 Release Features
  Multi-GPU support
  GPU acceleration is one of the main focuses: the majority of features will be accelerated in 4.6 in a transparent fashion
  PME simulations get special attention, and most of the effort will go into making these algorithms well load-balanced
  Reaction-field and cut-off simulations also run accelerated
  The list of features not supported by GPU acceleration will be quite short
  GROMACS multi-GPU expected in April 2012
GROMACS 4.6 Alpha Release: Absolute Performance
  Absolute performance of GROMACS running CUDA- and SSE-accelerated non-bonded kernels with PME on 3-12 CPU cores and 1-4 GPUs. Simulations with cubic and truncated dodecahedron cells, pressure coupling, and virtual interaction sites enabling 5 fs time steps are shown.
  Benchmark systems: RNAse in water with atoms in cubic and atoms in truncated dodecahedron box
  Settings: electrostatics cut-off auto-tuned >0.9 nm, LJ cut-off 0.9 nm, 2 fs and 5 fs (with vsites) time steps
  Hardware: workstation with 2x Intel Xeon X5650 (6C), 4x NVIDIA Tesla C2075
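Absolute MD performance is usually quoted in ns/day, which depends on both the step rate and the time step. The sketch below shows the unit conversion and why moving from 2 fs to 5 fs time steps matters. The step rate used is a hypothetical placeholder, not a measured GROMACS figure, and the example assumes the per-step cost is unchanged by virtual sites, which may not hold in practice.

```python
SECONDS_PER_DAY = 86_400
FS_PER_NS = 1_000_000


def ns_per_day(steps_per_second, timestep_fs):
    """Simulated nanoseconds per wall-clock day."""
    return steps_per_second * timestep_fs * SECONDS_PER_DAY / FS_PER_NS


if __name__ == "__main__":
    HYPOTHETICAL_STEPS_PER_SECOND = 500.0  # placeholder, not a benchmark result
    for dt_fs in (2.0, 5.0):
        rate = ns_per_day(HYPOTHETICAL_STEPS_PER_SECOND, dt_fs)
        print(f"{dt_fs:.0f} fs time step: {rate:.1f} ns/day")
```

At a fixed step rate, throughput scales linearly with the time step, so a 5 fs step yields 2.5 times the ns/day of a 2 fs step.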
GROMACS 4.6 Alpha Release: Strong Scaling
  Strong scaling of GPU-accelerated GROMACS with PME and reaction-field on up to 40 cluster nodes with 80 GPUs
  Benchmark system: water box with 1.5M particles
  Settings: electrostatics cut-off auto-tuned >0.9 nm for PME and 0.9 nm for reaction-field, LJ cut-off 0.9 nm, 2 fs time steps
  Hardware: Bullx cluster nodes with 2x Intel Xeon E5649 (6C), 2x NVIDIA Tesla M2090, 2x QDR Infiniband 40 Gb/s
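Strong scaling keeps the problem size fixed (here, one 1.5M-particle water box) and asks how run time falls as nodes are added. A minimal sketch of the usual speedup and efficiency calculation follows. The timings are hypothetical placeholders chosen for round numbers and are not taken from the GROMACS benchmark; the function name is illustrative.

```python
def strong_scaling(timings):
    """Return {nodes: (speedup, efficiency)} relative to the smallest node count.

    timings maps node count -> wall-clock time for the SAME fixed-size problem.
    """
    base_nodes = min(timings)
    base_time = timings[base_nodes]
    results = {}
    for nodes, t in sorted(timings.items()):
        speedup = base_time / t            # how much faster than the baseline
        ideal = nodes / base_nodes         # perfect linear scaling
        results[nodes] = (speedup, speedup / ideal)
    return results


if __name__ == "__main__":
    # HYPOTHETICAL wall-clock seconds for a fixed number of steps.
    hypothetical_timings = {1: 100.0, 10: 12.5, 40: 5.0}
    for nodes, (speedup, eff) in strong_scaling(hypothetical_timings).items():
        print(f"{nodes:>3} nodes: speedup {speedup:5.1f}x, efficiency {eff:.0%}")
```

An efficiency well below 100% at high node counts typically indicates that communication, rather than the GPU kernels, has become the limiting cost.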
GROMACS 4.6 Alpha Release: PME Weak Scaling
  Weak scaling of the GPU-accelerated GROMACS with reaction-field and PME on 3-12 CPU cores and 1-4 GPUs. The gradient background indicates the range of system sizes which fall beyond the typical single-node production size.
  Benchmark systems: water boxes ranging in size from 1.5k to 3M particles
  Settings: electrostatics cut-off auto-tuned >0.9 nm for PME and 0.9 nm for reaction-field, LJ cut-off 0.9 nm, 2 fs time steps
  Hardware: workstation with 2x Intel Xeon X5650 (6C), 4x NVIDIA Tesla C2075
GROMACS 4.6 Alpha Release: Rxn-Field Weak Scaling
  Weak scaling of the GPU-accelerated GROMACS with reaction-field and PME on 3-12 CPU cores and 1-4 GPUs. The gradient background indicates the range of system sizes which fall beyond the typical single-node production size.
  Benchmark systems: water boxes ranging in size from 1.5k to 3M particles
  Settings: electrostatics cut-off auto-tuned >0.9 nm for PME and 0.9 nm for reaction-field, LJ cut-off 0.9 nm, 2 fs time steps
  Hardware: workstation with 2x Intel Xeon X5650 (6C), 4x NVIDIA Tesla C2075
GROMACS 4.6 Alpha Release: Weak Scaling
  Weak scaling of the CUDA non-bonded force kernel on GeForce and Tesla GPUs. Perfect weak scaling, challenges for strong scaling.
  Benchmark systems: water boxes ranging in size from 1.5k to 3M particles
  Settings: electrostatics & LJ cut-off 1.0 nm, 2 fs time steps
  Hardware: workstation with 2x Intel Xeon X5650 (6C) CPUs, 4x NVIDIA Tesla C2075
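One way to read weak-scaling data like this, where the system size varies on fixed hardware, is to normalize step time by particle count: with perfect weak scaling the cost per particle stays constant as the box grows. The sketch below illustrates that normalization. The sample timings are hypothetical and chosen only to show the shape of the calculation; they are not GROMACS measurements.

```python
def time_per_particle_ns(step_time_ms, n_particles):
    """Wall-clock nanoseconds spent per particle per MD step."""
    return step_time_ms * 1e6 / n_particles


def weak_scaling_efficiency(samples):
    """Efficiency of each (n_particles, step_time_ms) sample.

    Efficiency is measured relative to the lowest per-particle cost observed,
    so 1.0 means the sample matches the best case.
    """
    costs = [time_per_particle_ns(t, n) for n, t in samples]
    best = min(costs)
    return [best / c for c in costs]


if __name__ == "__main__":
    # HYPOTHETICAL (particles, ms per step) pairs spanning 1.5k to 3M particles.
    hypothetical = [(1_500, 0.15), (96_000, 2.4), (3_000_000, 75.0)]
    for (n, t), eff in zip(hypothetical, weak_scaling_efficiency(hypothetical)):
        print(f"{n:>9} particles: {t:6.2f} ms/step, efficiency {eff:.0%}")
```

In this made-up data the smallest box shows low efficiency because fixed per-step overheads dominate, which is the usual reason small systems underuse a GPU.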
LAMMPS Released GPU Features and Future Plans
  LAMMPS August 2009: first GPU-accelerated support
  LAMMPS Aug. 22, 2011:
    Selected accelerated non-bonded short-range potentials (SP, MP, DP support):
      Lennard-Jones (several variants with & without coulombic interactions)
      Morse
      Buckingham
      CHARMM
      Tabulated
      Coarse-grain SDK
      Anisotropic (Gay-Berne, RE-squared)
      “Hybrid” combinations (GPU accel & no GPU accel)
    Particle-Particle Particle-Mesh (SP or DP)
    Neighbor list builds
  Longer term*:
    Improve performance on smaller particle counts (neighbor list is the problem)
    Improve long-range performance (MPI/Poisson solve is the problem)
    Additional pair potential support (including expensive advanced force fields); see “Tremendous Opportunity for GPUs” slide*
    Performance improvements focused on specific science problems
  * Courtesy of Michael Brown at ORNL and Paul Crozier at Sandia Labs
LAMMPS: 8.6x Speed-Up with GPUs
  Source: W.M. Brown, “GPU Acceleration in LAMMPS”, 2011 LAMMPS Workshop
LAMMPS: 4x Faster on Billion Atoms
  Billion-atom Lennard-Jones benchmark
  Test platform: NCSA Lincoln cluster with S1070 1U GPU servers attached
  CPU-only cluster: Cray XT5
  Chart labels: 103 seconds; 288 GPUs + CPUs; 1920 x86 CPUs
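The headline comparison mixes two effects: the GPU cluster is faster, and it also uses fewer devices. A rough sketch of the per-device equivalence follows, assuming "4x faster" is the ratio of wall-clock times and taking the device counts as labeled on the slide (288 GPUs vs. 1920 x86 CPUs). It ignores the host CPUs on the GPU nodes and the differences between the two machines, so treat the result as an order-of-magnitude illustration only.

```python
def per_device_equivalence(speedup, n_accelerated, n_baseline):
    """Baseline devices matched by one accelerated device at equal run time.

    If n_accelerated devices finish `speedup` times faster than n_baseline
    devices, each accelerated device does the work of this many baseline ones.
    """
    return speedup * n_baseline / n_accelerated


if __name__ == "__main__":
    # Counts and speedup as quoted on the slide.
    ratio = per_device_equivalence(4.0, 288, 1920)
    print(f"one GPU (plus host) is roughly equivalent to {ratio:.1f} x86 CPUs")
```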
LAMMPS: 4x-15x Speedups
  Potentials: Gay-Berne, RE-squared
  From August 2011 LAMMPS Workshop. Courtesy of W. Michael Brown, ORNL.
LAMMPS Conclusions
  Runs on individual multi-GPU nodes as well as GPU clusters
  Outstanding raw performance: 3x-40x higher than equivalent CPU code
  Impressive linear strong scaling
  Good weak scaling; scales to a billion particles
  Tremendous opportunity to GPU-accelerate other force fields