Parallelization in Molecular Dynamics
By Aditya Mittal, for CME346A by Professor Eric Darve, Stanford University

Parallelization Overview
In MD we would like to compute the velocities and positions of particles from the interaction forces between millions of particles. This requires a great deal of computation, but each force evaluation is largely independent of the others. The problem, however, is that interprocessor communication is slow, so combining the per-particle results across processors becomes the bottleneck.
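To ground the discussion, here is a minimal serial sketch (an illustration with hypothetical names, not production code) of the pairwise force computation that each technique below parallelizes:

```python
import numpy as np

def lj_pair_force(r_vec, d2, eps=1.0, sigma=1.0):
    """Lennard-Jones force on atom i from atom j, with r_vec = x_i - x_j."""
    inv6 = (sigma * sigma / d2) ** 3
    return 24.0 * eps * (2.0 * inv6 * inv6 - inv6) / d2 * r_vec

def compute_forces(x, cutoff=2.5):
    """O(N^2) double loop: the computation the three decompositions divide up."""
    f = np.zeros_like(x)
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            r_vec = x[i] - x[j]
            d2 = float(r_vec @ r_vec)
            if d2 < cutoff * cutoff:      # ignore pairs beyond the cutoff
                fij = lj_pair_force(r_vec, d2)
                f[i] += fij
                f[j] -= fij               # Newton's third law: F_ji = -F_ij
    return f

# x = np.random.rand(100, 3) * 10.0;  f = compute_forces(x)
```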

Parallelization Techniques
There are three different parallelization techniques that can be adapted to the type of problem we are dealing with: atom decomposition, force decomposition, and space decomposition. By "type of problem" I mean aspects such as the number of particles in the simulation, the number of processors available, the type of potential to simulate, and so on.

Atom Decomposition (AD)
Assign P processors to N atoms so each processor gets N/P atoms; load balancing is achieved automatically. Each processor computes a sub-block of N/P rows of an N×N force matrix F throughout the simulation, where the (i,j) entry is the force on the ith atom due to the jth atom. F is sparse because we only consider short-range forces, and skew-symmetric because of Newton's third law. The forces are used to update the position vector x and velocity vector v of the atoms. In this method each processor must import the positions of many other atoms, maintained in "neighbor lists", to compute F, so communication costs are high: all processors communicate with all other processors at each time step. Thus this method quickly becomes ineffective as we increase the number of processors.
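A sketch of the AD ownership rule (hypothetical names; a contiguous block assignment is assumed):

```python
def ad_rows(n_atoms, n_procs, rank):
    """Atom decomposition: rank owns a contiguous block of ~N/P atoms, i.e.
    N/P rows of the force matrix F. To evaluate those rows it still needs
    the positions of *all* atoms, so every time step starts with an
    all-to-all position exchange -- the cost that kills AD at large P."""
    lo = rank * n_atoms // n_procs
    hi = (rank + 1) * n_atoms // n_procs
    return range(lo, hi)

# With N = 1000 and P = 8, rank 3 computes rows 375..499 of F.
```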

Force Decomposition (FD)
Notice that the inefficiency in AD came from the row-wise decomposition of the force matrix F. In FD we instead do a block decomposition of F' (a permutation of F that preserves the ordering of the position vector x). As with AD we create and keep neighbor lists; for large numbers of atoms we generally maintain bins of neighboring atoms and check the 27 bins surrounding each atom (the bin containing the atom plus the 26 bins around it).
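A minimal sketch of the binning idea (hypothetical names; assumes a cubic box of side `box` and, for brevity, ignores the minimum-image convention for distances):

```python
import numpy as np
from collections import defaultdict
from itertools import product

def build_bins(x, box, cutoff):
    """Hash each atom into a cubic bin of side >= cutoff, so every neighbor
    within the cutoff lies in the atom's own bin or one of the 26 around it."""
    nbins = max(1, int(box / cutoff))
    side = box / nbins
    bins = defaultdict(list)
    for i, xi in enumerate(x):
        bins[tuple(int(c) % nbins for c in np.floor(xi / side))].append(i)
    return bins, nbins, side

def neighbors_of(i, x, bins, nbins, side, cutoff):
    """Scan the 27 candidate bins (the atom's own plus its 26 surrounders)."""
    ci = [int(c) % nbins for c in np.floor(x[i] / side)]
    found = []
    for off in product((-1, 0, 1), repeat=3):
        for j in bins[tuple((ci[k] + off[k]) % nbins for k in range(3))]:
            if j != i and float((x[i] - x[j]) @ (x[i] - x[j])) < cutoff * cutoff:
                found.append(j)
    return found
```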

Expanding and Folding 1
We use expand and fold operations for the actual interprocessor communication. As an example, an eight-processor expand distributes each processor's positions to all the other processors in log2(8) = 3 pairwise exchange steps.

Expanding and Folding 2
Folding is the inverse operation of expand. Suppose each processor now knows the new partial force values for its atoms, and a processor needs the total force on an atom whose contributions live on other processors. It can obtain it in log(P) steps by splitting its force vector f in half at each step and sending the required half to its partner processor. This is the fold operation.
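Both patterns can be simulated in a few lines (a sketch assuming P is a power of two; a real code performs these as pairwise message exchanges):

```python
import numpy as np

def expand(slices):
    """Recursive-doubling all-gather ('expand'): after log2(P) pairwise
    exchanges, every processor holds the full concatenated vector."""
    P = len(slices)                       # assumed a power of two
    win = [np.asarray(s, dtype=float) for s in slices]
    d = 1
    while d < P:
        # lower-ranked partner's block goes first in the merged window
        win = [np.concatenate([win[min(r, r ^ d)], win[max(r, r ^ d)]])
               for r in range(P)]
        d *= 2
    return win

def fold(partials):
    """Recursive-halving reduce-scatter ('fold'), the inverse pattern: each
    processor halves its working vector every step, sending one half to its
    partner and summing the half it receives; after log2(P) steps it holds
    the fully summed 1/P slice it owns."""
    P = len(partials)                     # assumed a power of two
    win = [np.asarray(f, dtype=float).copy() for f in partials]
    base = [0] * P                        # offset of each rank's window
    d = P // 2
    while d >= 1:
        new = [None] * P
        for r in range(P):
            p, h = r ^ d, len(win[r]) // 2
            if r < p:                     # keep low half, sum partner's low half
                new[r] = win[r][:h] + win[p][:h]
            else:                         # keep high half, sum partner's high half
                new[r] = win[r][h:] + win[p][h:]
                base[r] += h
        win, d = new, d // 2
    return base, win                      # rank r owns win[r] at offset base[r]

# expand([np.ones(2)] * 4) -> four copies of the length-8 gathered vector
# fold([np.ones(8)] * 4)   -> each rank owns a fully summed slice [4., 4.]
```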

FD Continued
Given the expand and fold operations for communicating positions and forces, we can now write the force decomposition algorithm; a schematic of one time step follows.
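The algorithm itself appeared as a figure in the original slides. The outline below is a schematic reconstruction following Plimpton's description; the √P × √P processor grid and the exact step ordering are assumptions:

```python
import math

def fd_timestep_outline(rank, n_procs):
    """Schematic of one FD time step on a sqrt(P) x sqrt(P) processor grid;
    each processor owns one block of the permuted force matrix F'."""
    s = int(math.isqrt(n_procs))
    row, col = divmod(rank, s)
    # 1. Expand along processor row `row` to collect x_alpha, the positions
    #    of the N/sqrt(P) atoms indexing this block's rows.
    # 2. Expand along processor column `col` to collect x_beta, the positions
    #    of the atoms indexing this block's columns.
    # 3. Compute this block of F': partial forces of the beta atoms on the
    #    alpha atoms, using the bin-based neighbor lists.
    # 4. Fold along the processor row to sum the total force on the N/P atoms
    #    this processor owns, then integrate their positions and velocities.
    # Each step communicates only within one row or column of sqrt(P)
    # processors, which is where the O(N/sqrt(P)) cost on the next slide
    # comes from.
    return row, col
```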

FD Continued
Having seen the FD algorithm, it is clear that AD's O(N) communication cost per processor improves to O(N/√P) in FD, since each processor exchanges data only with the √P processors sharing its row or column of the block decomposition.

Basic Technique for Space Decomposition
The basic idea here is to discretize the space of particles into cubes and let each processor do the computation for one cube. The cube in which a particle resides is called its "home box". To perform the total computation for a particle, the processor imports data for all particles within a certain interaction radius; this surrounding region is known as the "import region". Particles outside the home box and import region are generally excluded from the calculation, or handled with some other, less expensive method. SD is well suited to short-range forces since it is local, as opposed to the global AD and FD.
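A sketch of the two geometric tests SD rests on (hypothetical names; periodic boundaries are ignored for brevity):

```python
import numpy as np

def home_box(xi, box_side, boxes_per_dim):
    """The cube a particle falls in is its 'home box'."""
    side = box_side / boxes_per_dim
    return tuple(int(c) % boxes_per_dim for c in np.floor(xi / side))

def in_import_region(xi, my_box, box_side, boxes_per_dim, cutoff):
    """True if a remote particle is within the interaction radius of my box,
    i.e. inside the region this processor must import each time step."""
    side = box_side / boxes_per_dim
    lo = np.array(my_box) * side
    hi = lo + side
    # componentwise distance from the point to the box (0 inside the box)
    gap = np.maximum(lo - xi, 0.0) + np.maximum(xi - hi, 0.0)
    return float(gap @ gap) < cutoff * cutoff
```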

Limitations of the Traditional SD Technique
In the basic technique just discussed, the number of processors we can use is significantly limited by the dependence on the interaction radius. In the limit as the number of processors grows large, each processor's box shrinks relative to the interaction radius, so each processor must import a lot of data; since interprocessor communication is slow, this is inefficient.

D. E. Shaw's Neutral Territory (NT) Method
Shaw's idea is that instead of tying a whole home box to one processor and then importing its import region, it is better to decouple the processor from the home box. Consider the 2D case where each box is a square on a grid: Shaw's NT method assigns each box on the grid to a processor, and then computes the interactions between the particles in the ith row and the jth column on the (i, j) processor. In general this means both particles of a pair may have to be imported, but they are imported to a nearby processor in the grid, and the number of processors becomes independent of the interaction radius.
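A sketch of the 2D assignment rule (illustrative only; Shaw's actual method adds careful rounding and tie-breaking):

```python
def nt_import_region(i, j, grid):
    """In the 2D NT scheme, processor (i, j) imports the particles of every
    box in row i and every box in column j; any pair with one particle in
    that row and the other in that column is computed locally."""
    return {(i, c) for c in range(grid)} | {(r, j) for r in range(grid)}

# Example: particles in boxes (2, 5) and (7, 1) meet at processor (2, 1) --
# the row of one crossed with the column of the other -- which is generally
# *neither* particle's home box, hence "neutral territory".
```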

Cutoff
The "cutoff" radius defines the region within which we need to consider particle interactions. This works well for the Lennard-Jones potential because it decays quickly with distance. The Coulomb potential is generally split into a real-space part and a k-space part by assuming a periodic structure and using the Ewald method: the real-space part is handled with the cutoff, while the k-space part, being long-range, is handled in transform space.
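A sketch of a truncated Lennard-Jones potential, illustrating why the cutoff is safe for LJ (terms decay like r^-6 and faster) but not for the r^-1 Coulomb potential:

```python
def lj_truncated(r, eps=1.0, sigma=1.0, cutoff=2.5):
    """Lennard-Jones energy, truncated at cutoff (given in units of sigma).
    At r = 2.5*sigma the magnitude is already ~0.016*eps, so the neglected
    tail is small; a 1/r Coulomb term at the same distance would still be
    ~0.4 of its value at r = sigma."""
    if r >= cutoff * sigma:
        return 0.0
    sr6 = (sigma / r) ** 6
    return 4.0 * eps * (sr6 * sr6 - sr6)
```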

Snir's Hybrid (SH) Method
Space decomposition has less communication when the cutoff c is small, because it relies on locality. Force decomposition avoids the expense of a direct broadcast to a large number of processors. SH optimizes between the two decompositions, using the spatial approach when the cutoff c is small relative to d, the side of the cubes forming the discretization, and the force approach otherwise. NT and SH scale asymptotically as O(c^{3/2}/√p) in communication volume (p being the number of processors), as opposed to the traditional spatial decomposition method, which scales as O(c³).

Using Multiple Zones
So far the methods discussed have two zones, corresponding to the two particles whose interaction we consider, but it is often advantageous to compute interactions using multiple zones, as in the "eighth shell" method. The eighth shell method divides the full import region around the home box into eight octant zones, and the home box interacts with each of the eight zones in turn. This method is worse than NT in the limit of a large number of processors, but it has a smaller import region than any of the other known methods in the low-parallelism limit. By using multiple zones we can work out in more detail exactly which zones really need to interact with which, often reducing the total amount of interaction computed.

Benchmarking and Conclusion
Of the three types of parallelization algorithms (atom decomposition, force decomposition, and space decomposition), the optimal choice varies from problem to problem based on the costs of things like:
- calculating the number of neighbors included within the cutoff distance;
- calculating the forces, depending on the locality of the potential;
- building neighbor lists as functions of temperature, density, cutoff, etc.;
- the amount of load balancing that is necessary (for example, no load balancing is needed with the atom decomposition technique).
LAMMPS, from Sandia National Labs, is one example of parallel MD software optimized for massive parallelization; it works by importing neighbor lists.

References
[1] Kevin J. Bowers, Ron O. Dror, and David E. Shaw. Overview of neutral territory methods for the parallel evaluation of pairwise particle interactions.
[2] Marc Snir. A note on N-body computations with cutoffs.
[3] Steve Plimpton. Fast parallel algorithms for short-range molecular dynamics.

Thank You.