
ALGORITHM IMPROVEMENTS AND HPC

Basic MD algorithm
1. Initialize the atoms' state and the LJ potential table; set the parameters controlling the simulation; O(N)
2. For all time steps do
3. Update the positions of all atoms; O(N)
4. If any atoms have moved too far, update the neighbour list, including all atom pairs that are within a distance range (half distance-matrix computation); O(N²); End if
5. Use the neighbour list to compute the forces acting on all atoms; O(N)
6. Update the velocities of all atoms; O(N)
7. Update the displacement list, which contains the displacements of the atoms; O(N)
8. Accumulate and output the target statistics of each time step; O(N)
9. End for
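A minimal, self-contained sketch of this loop (a toy Lennard-Jones system in reduced units, no PBC or thermostat, all helper names hypothetical) showing where the displacement-triggered neighbour-list rebuild fits:

```cpp
// Minimal sketch of the basic MD loop above (hypothetical helper names).
// Velocity Verlet integration, LJ forces from a Verlet neighbour list that is
// rebuilt only when some atom has moved more than half the list "skin".
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <utility>
#include <vector>

struct Vec3 { double x, y, z; };

int main() {
    const int    N      = 64;        // number of atoms
    const double dt     = 0.005;     // time step (reduced units)
    const double rcut   = 2.5;       // interaction cut-off
    const double skin   = 0.3;       // extra margin stored in the neighbour list
    const int    nsteps = 1000;

    std::vector<Vec3> r(N), v(N, {0, 0, 0}), f(N), r_last(N);
    // Step 1: initialise positions on a simple cubic lattice.
    for (int i = 0; i < N; ++i)
        r[i] = { 1.2 * (i % 4), 1.2 * ((i / 4) % 4), 1.2 * (i / 16) };

    std::vector<std::pair<int, int>> nlist;   // half neighbour list (i < j)

    auto build_list = [&]() {                 // step 4: O(N^2) rebuild
        nlist.clear();
        const double rl = rcut + skin;
        for (int i = 0; i < N; ++i)
            for (int j = i + 1; j < N; ++j) {
                double dx = r[i].x - r[j].x, dy = r[i].y - r[j].y, dz = r[i].z - r[j].z;
                if (dx * dx + dy * dy + dz * dz < rl * rl) nlist.push_back({i, j});
            }
        r_last = r;                           // reset the displacement reference
    };

    auto compute_forces = [&]() {             // step 5: O(N) with the list
        for (auto& fi : f) fi = {0, 0, 0};
        for (auto [i, j] : nlist) {
            double dx = r[i].x - r[j].x, dy = r[i].y - r[j].y, dz = r[i].z - r[j].z;
            double r2 = dx * dx + dy * dy + dz * dz;
            if (r2 > rcut * rcut) continue;
            double ir2 = 1.0 / r2, ir6 = ir2 * ir2 * ir2;
            double fr = 24.0 * ir6 * (2.0 * ir6 - 1.0) * ir2;   // LJ, eps = sigma = 1
            f[i].x += fr * dx; f[i].y += fr * dy; f[i].z += fr * dz;
            f[j].x -= fr * dx; f[j].y -= fr * dy; f[j].z -= fr * dz;
        }
    };

    build_list();
    compute_forces();
    for (int step = 0; step < nsteps; ++step) {          // step 2
        for (int i = 0; i < N; ++i) {                     // step 3: positions
            v[i].x += 0.5 * dt * f[i].x; v[i].y += 0.5 * dt * f[i].y; v[i].z += 0.5 * dt * f[i].z;
            r[i].x += dt * v[i].x;       r[i].y += dt * v[i].y;       r[i].z += dt * v[i].z;
        }
        double maxdisp2 = 0.0;                            // step 7: displacement list
        for (int i = 0; i < N; ++i) {
            double dx = r[i].x - r_last[i].x, dy = r[i].y - r_last[i].y, dz = r[i].z - r_last[i].z;
            maxdisp2 = std::max(maxdisp2, dx * dx + dy * dy + dz * dz);
        }
        if (maxdisp2 > 0.25 * skin * skin) build_list();  // step 4: rebuild only if needed
        compute_forces();                                 // step 5
        for (int i = 0; i < N; ++i) {                     // step 6: velocities
            v[i].x += 0.5 * dt * f[i].x; v[i].y += 0.5 * dt * f[i].y; v[i].z += 0.5 * dt * f[i].z;
        }
        if (step % 100 == 0) std::printf("step %d done\n", step);  // step 8 (placeholder)
    }
}
```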

Some general ideas
The basic MD algorithm has not changed in the last 25 years
Most major MD codes use similar approaches for optimization
All the trivial tricks have already been applied!!
No major breakthrough, except perhaps GPUs and Anton
– MD improvement is largely tied to the improvement of hardware resources
Can we use a present-day supercomputer efficiently for MD?

Improvement directions
Floating-point calculations – Precision, custom maths
Non-bonded interactions – PBC, common interactions
Stream processing & GPUs
Integration steps – Multiple time steps, constraints, virtual interaction sites
Parallelization. Spatial decomposition
Asynchronous parallelization – Replica exchange, Markov state models
P. Larsson, B. Hess & E. Lindahl. WIREs Comput. Mol. Sci. 2011, 1

Floating-point calculations
The MD limiting step is the calculation of non-bonded interactions
– Floating-point operations
– Random memory access
Move to single precision
– Problems with accumulated numerical error or with too-small position displacements
– However, force calculations tolerate a comparatively large relative error
– Small effect on the calculation itself on modern processors
– Improves cache and SIMD register use
– Math routines in single precision are more efficient
Custom math routines
– MD does not require the maths to be IEEE compliant!!
– Some internal checks can be avoided (e.g. the positivity check on the argument of sqrt)
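As an illustration of a custom, non-IEEE math routine (a generic example, not taken from any particular MD code): the classic bit-trick reciprocal square root with one Newton-Raphson step skips the argument checks a library sqrt performs and stays entirely in single precision, with a relative error of roughly 10⁻³:

```cpp
// Illustration of a "custom maths" routine: approximate 1/sqrt(x) in single
// precision, skipping the checks an IEEE-compliant sqrt would perform.
// One Newton-Raphson step gives a relative error around 1e-3.
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <cmath>

static inline float fast_rsqrt(float x) {
    float    y = x;
    uint32_t i;
    std::memcpy(&i, &y, sizeof i);      // reinterpret the float bits
    i = 0x5f3759dfU - (i >> 1);         // initial guess from the exponent bits
    std::memcpy(&y, &i, sizeof y);
    y = y * (1.5f - 0.5f * x * y * y);  // one Newton-Raphson refinement
    return y;
}

int main() {
    for (float r2 : {0.81f, 1.0f, 6.25f, 144.0f}) {
        float approx = fast_rsqrt(r2);
        float exact  = 1.0f / std::sqrt(r2);
        std::printf("r2=%8.2f  approx=%.6f  exact=%.6f  rel.err=%.2e\n",
                    r2, approx, exact, std::fabs(approx - exact) / exact);
    }
}
```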

Non-bonded interactions
Avoid O(N²)
– Maintain a list of NB pairs within a distance cut-off, updated at larger time-step intervals (caution: a too-short cut-off distance gives wrong behaviour)
– Domain decomposition (more later)
– PBC requires checking for the closest image; the simple approach with 6 conditionals kills modern processors
Cut-off sphere partitioning method
– Maintains information about all 26 periodic copies of the particle, updated with the NB list
– Grid-based NB pair list
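One common branch-free way to find the closest image in a cubic box (a generic sketch; this is not the cut-off sphere partitioning method mentioned above) is to round the scaled displacement to the nearest box vector:

```cpp
// Branch-free minimum-image convention for a cubic box: instead of six
// conditionals per pair, round the scaled displacement to the nearest integer
// number of box lengths and subtract it.
#include <cmath>
#include <cstdio>

struct Vec3 { double x, y, z; };

static inline Vec3 minimum_image(Vec3 d, double box) {
    d.x -= box * std::nearbyint(d.x / box);
    d.y -= box * std::nearbyint(d.y / box);
    d.z -= box * std::nearbyint(d.z / box);
    return d;
}

int main() {
    const double box = 10.0;
    Vec3 ri = {0.5, 9.8, 4.0}, rj = {9.7, 0.2, 4.5};
    Vec3 d  = {ri.x - rj.x, ri.y - rj.y, ri.z - rj.z};
    d = minimum_image(d, box);
    std::printf("minimum-image displacement: (%.2f, %.2f, %.2f)\n", d.x, d.y, d.z);
    // Expected: (0.80, -0.40, -0.50) -- the closest periodic copy of j.
}
```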

Specialized interactions
Some interactions are more abundant than others
– WAT–WAT pairs dominate the NB interaction list
– Water models SPC, TIP3P, TIP4P: rigid, VdW only on the O, reduced set of charges
– The number of interactions to check is largely reduced
– Only the oxygens need to be considered in the NB lists
A recent CHARMM implementation uses a combined Coulomb + Lennard-Jones potential for WAT–WAT interactions
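As a rough illustration of why water-water pairs are cheap in 3-site models: the neighbour list needs only the oxygens, and each O-O entry then expands into one LJ term plus a 3×3 Coulomb sum. The sketch below uses TIP3P-like placeholder parameters and an arbitrary geometry; it is not the combined potential of the CHARMM implementation mentioned above:

```cpp
// Sketch of a water-water pair evaluation driven by an oxygen-only neighbour
// list: the Lennard-Jones term involves only the two oxygens, while Coulomb
// runs over the 3x3 site charges of a rigid 3-site model. Parameters are
// TIP3P-like placeholders for illustration, not authoritative values.
#include <cmath>
#include <cstdio>

struct Site { double x, y, z, q; };           // position (nm) and charge (e)

// One rigid water: site 0 is the oxygen, sites 1 and 2 are the hydrogens.
using Water = Site[3];

double water_water_energy(const Site* a, const Site* b) {
    const double sigma = 0.315, eps = 0.636;  // O-O LJ (placeholder; nm, kJ/mol)
    const double ke    = 138.935;             // Coulomb constant, kJ mol^-1 nm e^-2

    // LJ: only the oxygen-oxygen distance matters.
    double dx = a[0].x - b[0].x, dy = a[0].y - b[0].y, dz = a[0].z - b[0].z;
    double r2 = dx * dx + dy * dy + dz * dz;
    double s6 = std::pow(sigma * sigma / r2, 3.0);
    double e  = 4.0 * eps * (s6 * s6 - s6);

    // Coulomb: all 3x3 site-site charge pairs.
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j) {
            double cx = a[i].x - b[j].x, cy = a[i].y - b[j].y, cz = a[i].z - b[j].z;
            e += ke * a[i].q * b[j].q / std::sqrt(cx * cx + cy * cy + cz * cz);
        }
    return e;
}

int main() {
    // Two waters ~0.3 nm apart (geometry is only illustrative).
    Water w1 = {{0.00, 0.0, 0.0, -0.834}, {0.096, 0.0, 0.0, 0.417}, {-0.024, 0.093, 0.0, 0.417}};
    Water w2 = {{0.30, 0.0, 0.0, -0.834}, {0.396, 0.0, 0.0, 0.417}, {0.276, 0.093, 0.0, 0.417}};
    std::printf("E(w1,w2) = %.2f kJ/mol\n", water_water_energy(w1, w2));
}
```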

Streaming & GPUs
GPUs have become popular in the simulation field
– One order of magnitude faster than comparable CPUs
– Require reformulating the algorithms
– Usually the approach is not compatible with other optimizations
– SIMD approach
– No random access to memory, no persistent data
The CPU is responsible for the main algorithm; the GPU performs selected parts
– Updating the NB list, evaluating interactions, updating coordinates & velocities
A GPU can be viewed as a many-core device executing a large number of threads in parallel
– Threads are grouped in blocks that share memory
– Memory distribution can be a limitation
– Atom-based decomposition is preferred
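A minimal CUDA sketch (hypothetical, not code from any MD package) of the atom-based decomposition mentioned above: one thread per atom accumulates that atom's non-bonded force, so no two threads write to the same output location:

```cuda
// Minimal CUDA sketch of atom-based decomposition: one thread per atom sums
// the Lennard-Jones force on that atom over all others (O(N^2), no neighbour
// list, no PBC). Hypothetical illustration, not code from any MD package.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void lj_forces(const float4* pos, float4* force, int n, float rcut2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float4 ri = pos[i];
    float fx = 0.f, fy = 0.f, fz = 0.f;
    for (int j = 0; j < n; ++j) {           // every thread owns one atom i
        if (j == i) continue;
        float dx = ri.x - pos[j].x, dy = ri.y - pos[j].y, dz = ri.z - pos[j].z;
        float r2 = dx * dx + dy * dy + dz * dz;
        if (r2 > rcut2) continue;
        float ir2 = 1.f / r2, ir6 = ir2 * ir2 * ir2;
        float fr  = 24.f * ir6 * (2.f * ir6 - 1.f) * ir2;   // eps = sigma = 1
        fx += fr * dx; fy += fr * dy; fz += fr * dz;
    }
    force[i] = make_float4(fx, fy, fz, 0.f);
}

int main() {
    const int n = 1024;
    float4 *pos, *force;
    cudaMallocManaged(&pos, n * sizeof(float4));   // unified memory for brevity
    cudaMallocManaged(&force, n * sizeof(float4));
    for (int i = 0; i < n; ++i)                    // atoms on a line, 1.5 apart
        pos[i] = make_float4(1.5f * i, 0.f, 0.f, 0.f);

    lj_forces<<<(n + 127) / 128, 128>>>(pos, force, n, 2.5f * 2.5f);
    cudaDeviceSynchronize();
    std::printf("force on atom 0: %g %g %g\n", force[0].x, force[0].y, force[0].z);
    cudaFree(pos); cudaFree(force);
}
```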

Use of GPUs

Integration steps
Multiple time steps
– Numerical integration requires the time step to be kept below the period of the fastest motion in the system; normally 1 femtosecond (10⁹ iterations to reach biologically meaningful time scales)
– Most parts of the calculation do not change significantly in 1 fs
GROMACS style:
– 1 fs bond vibrations, 2 fs NB interactions, 4 fs PME (long-range interactions)
– Tens of fs: NB list update, temperature and pressure update
Increasing the granularity of the minimal calculation (e.g. in coarse-grained models) allows longer time steps
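A schematic sketch of such a multiple-time-step loop (the force splitting and the stub functions below are placeholders; the intervals only mirror the GROMACS-style example schematically):

```cpp
// Schematic multiple-time-step (RESPA-like) loop: expensive, slowly varying
// force classes are recomputed only every nstnb / nstlong inner steps and
// reused in between. Function bodies are stubs; only the scheduling matters.
#include <cstdio>

void compute_bonded(double*)    { /* fast: bond/angle terms, every step    */ }
void compute_nonbonded(double*) { /* medium: short-range NB interactions   */ }
void compute_longrange(double*) { /* slow: PME / long-range electrostatics */ }
void integrate(const double*, const double*, const double*) { /* ... */ }

int main() {
    const int nsteps = 16, nstnb = 2, nstlong = 4;   // inner steps of 1 fs
    double fb[1] = {0}, fnb[1] = {0}, flr[1] = {0};  // placeholder force buffers

    for (int step = 0; step < nsteps; ++step) {
        compute_bonded(fb);                                 // every 1 fs
        if (step % nstnb == 0)   compute_nonbonded(fnb);    // every 2 fs (reused otherwise)
        if (step % nstlong == 0) compute_longrange(flr);    // every 4 fs
        integrate(fb, fnb, flr);
        std::printf("step %2d: bonded%s%s\n", step,
                    step % nstnb == 0 ? " + nonbonded" : "",
                    step % nstlong == 0 ? " + long-range" : "");
    }
}
```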

Integration steps. Constraints
Constraints
– Some movements are not relevant for global flexibility
– Algorithms like SHAKE, RATTLE, M-SHAKE and LINCS keep bond distances fixed by direct minimization
– Constraints are very effective but limit parallelization
– Usually restricted to bonds involving hydrogen atoms
Virtual interaction sites
– Hydrogen atoms are replaced by virtual sites
– Possible to reach 4-5 fs time steps
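For intuition, a minimal SHAKE-style correction for a single bond constraint between two atoms (a toy sketch; production codes handle many coupled constraints, iterate over all of them, and also correct velocities):

```cpp
// Minimal SHAKE-style correction for one bond constraint: after an
// unconstrained position update, the positions are moved along the old bond
// vector until the bond length equals d again.
#include <cmath>
#include <cstdio>

struct Vec3 { double x, y, z; };
static Vec3   sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static double dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

int main() {
    const double d = 0.1, m1 = 12.0, m2 = 1.0;   // bond length, masses (C-H like)
    Vec3 r1_old = {0.0, 0.0, 0.0}, r2_old = {0.1, 0.0, 0.0};    // satisfies |r12| = d
    Vec3 r1 = {0.001, 0.0, 0.0},  r2 = {0.115, 0.004, 0.0};     // after the free update

    Vec3 r_old = sub(r1_old, r2_old);            // reference bond vector
    for (int it = 0; it < 50; ++it) {            // iterate until converged
        Vec3 s = sub(r1, r2);
        double diff = dot(s, s) - d * d;
        if (std::fabs(diff) < 1e-12) break;
        double g = diff / (2.0 * (1.0 / m1 + 1.0 / m2) * dot(s, r_old));
        r1.x -= g / m1 * r_old.x; r1.y -= g / m1 * r_old.y; r1.z -= g / m1 * r_old.z;
        r2.x += g / m2 * r_old.x; r2.y += g / m2 * r_old.y; r2.z += g / m2 * r_old.z;
    }
    Vec3 s = sub(r1, r2);
    std::printf("constrained bond length = %.6f (target %.6f)\n", std::sqrt(dot(s, s)), d);
}
```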

MD Parallelization
MD parallelization uses MPI
– Heavily limited by communication bandwidth
– Modern multicore processors allow combining it with OpenMP
The simplest strategy is atom/force decomposition
– Atoms are distributed among the processing units
– Bad and unpredictable load balance
– Requires an all-to-all update of coordinates
– Partially improved by distributing the NB list instead of the atoms
– Scales only to 8-16 processors
The distribution should be dynamic: spatial decomposition
– Space regions (not atoms) are distributed among the processing units
– The assignment of atoms is dynamic, depending on their position in space
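A toy, serial sketch of the core idea of spatial decomposition (no MPI communication is shown; the grid layout and names are made up): processing units own fixed regions of space, and atom ownership is recomputed from the current coordinates:

```cpp
// Toy illustration of spatial (domain) decomposition: the box is split into a
// px x py x pz grid of cells, one per processing unit, and atom ownership is
// recomputed from current coordinates each time atoms move. Serial sketch only;
// the halo exchange / neighbour communication is not shown.
#include <cstdio>
#include <vector>

struct Vec3 { double x, y, z; };

// Map a position inside a [0,box)^3 domain onto the rank that owns that region.
int owner_rank(Vec3 r, double box, int px, int py, int pz) {
    int cx = static_cast<int>(r.x / box * px);
    int cy = static_cast<int>(r.y / box * py);
    int cz = static_cast<int>(r.z / box * pz);
    return (cz * py + cy) * px + cx;          // linearised cell index = rank
}

int main() {
    const double box = 8.0;
    const int px = 2, py = 2, pz = 2;         // 8 "ranks", one per sub-domain
    std::vector<Vec3> atoms = {{0.5, 0.5, 0.5}, {7.5, 0.5, 0.5},
                               {3.9, 4.1, 7.0}, {4.1, 4.1, 7.0}};

    // Atom -> rank assignment is dynamic: rerun this after every position update.
    std::vector<std::vector<int>> owned(px * py * pz);
    for (size_t i = 0; i < atoms.size(); ++i)
        owned[owner_rank(atoms[i], box, px, py, pz)].push_back(static_cast<int>(i));

    for (int rk = 0; rk < px * py * pz; ++rk) {
        std::printf("rank %d owns:", rk);
        for (int i : owned[rk]) std::printf(" atom %d", i);
        std::printf("\n");
    }
}
```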

Spatial decomposition
The model to use depends on the hardware communication model
– Eighth-shell for a single network (classical Linux cluster)
– Midpoint for full-duplex connections (Cray, BlueGene)
Pulses from one processor to the next in each geometrical direction reduce the number of communication calls for long-range interactions

Spatial decomposition
Load balancing is the major efficiency problem
– Cells need not be equal in size, and can be updated dynamically

Asynchronous parallelization
Any parallelization method becomes useless with fewer than ca. 100 atoms per processing unit
– Present computers can have thousands of cores, but most routine simulations have around 100,000 atoms!!!
A single long simulation can be replaced by a series of shorter ones
– Same statistics
– Sampling is improved, as the initial structures can be different
– In most cases the simulations are completely independent (embarrassingly parallel)
– A single supercomputer is not really needed: distributed computing, Grid

Replica Exchange

Markov State Model
Kinetic clustering of states
– The trajectory defines different ensembles separated by energy barriers
Markov models are used to predict transitions
– Trained with a large number of short simulations
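As a rough illustration of the Markov-state idea (not any particular MSM package): transitions between discretised states are counted at a fixed lag over many short trajectories and row-normalised into a transition probability matrix:

```cpp
// Toy Markov state model estimation: several short, discretised trajectories
// (each frame already assigned to a cluster/state) are pooled into a matrix of
// transition counts at lag 1, which is row-normalised into transition
// probabilities. Illustrative only; real MSM pipelines also choose the lag
// time, cluster the frames, and test Markovianity.
#include <cstdio>
#include <vector>

int main() {
    const int nstates = 3;
    // Three short "trajectories" of state indices (as if from clustering).
    std::vector<std::vector<int>> trajs = {
        {0, 0, 1, 1, 1, 2, 2},
        {1, 1, 0, 0, 1, 2},
        {2, 2, 1, 1, 0},
    };

    std::vector<std::vector<double>> C(nstates, std::vector<double>(nstates, 0.0));
    for (const auto& t : trajs)                       // pool counts over all runs
        for (size_t k = 0; k + 1 < t.size(); ++k)
            C[t[k]][t[k + 1]] += 1.0;

    for (int i = 0; i < nstates; ++i) {               // row-normalise: T[i][j] = P(j | i)
        double rowsum = 0.0;
        for (int j = 0; j < nstates; ++j) rowsum += C[i][j];
        std::printf("T[%d] =", i);
        for (int j = 0; j < nstates; ++j)
            std::printf(" %.2f", rowsum > 0.0 ? C[i][j] / rowsum : 0.0);
        std::printf("\n");
    }
}
```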

Conclusions
Major MD codes use a combination of optimization methods
Current routine practice tends towards GPU-based calculation, but MPI + OpenMP combinations are being considered
– GPU + MPI + … approaches are being considered but are highly dependent on hardware evolution
No parallelization scheme will survive the next generation of supercomputers

MD & HPC future? "An ideal approach would for instance be to run a simulation of an ion channel system over 1000 cores in parallel, employing, e.g., 1000 independent such simulations for kinetic clustering, and finally perhaps use this to screen thousands of ligands or mutations in parallel (each setup using 1,000,000 cores)" P. Larsson, B. Hess & E. Lindahl. WIREs Comput. Mol. Sci. 2011, 1