MSc in High Performance Computing
Computational Chemistry Module
Parallel Molecular Dynamics (i)
Bill Smith, CCLRC Daresbury Laboratory

Parallel Computers: Shared Memory
[Diagram: processors p0–p3 all connected to a single shared memory M]

Parallel Computers: Distributed Memory
[Diagram: processors p0–p8, each with its own local memory m0–m8, connected by a network]

Parallel Computers: Virtual Shared Memory
[Diagram: physically distributed memories m0–m8 attached to processors p0–p8, presented as a single shared address space]

Parallel Computers: Beowulf Clusters
[Diagram: nodes P0–P3, each with its own memory M, connected by Ethernet/FDDI]

Important Issues in Parallel Processing
● Load Balancing:
  – Sharing work equally between processors
  – Sharing memory requirement equally
  – Maximum concurrent use of each processor
● Communication:
  – Maximum size of messages passed
  – Minimum number of messages passed
  – Local versus global communications
  – Asynchronous communication

Scaling in Parallel Processing
● Type 1 scaling
  – Fixed number of processors
  – Scaling of elapsed time with total workload
  – Ideal: elapsed time directly proportional to workload
● Type 2 scaling (strong scaling)
  – Fixed total workload
  – Performance scaling with number of processors
  – Ideal: double processor count - double performance
● Type 3 scaling (weak scaling)
  – Fixed workload per processor
  – Scaling of elapsed time with number of processors
  – Ideal: double processor count - constant elapsed time
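
The three ideals can be written compactly (a restatement of the bullets above, not a formula taken from the slides), with W the total workload, P the processor count and T the elapsed time:

  \text{Type 1 (fixed } P\text{):}\ \ T \propto W, \qquad
  \text{Type 2 (strong, fixed } W\text{):}\ \ T(P) = T(1)/P, \qquad
  \text{Type 3 (weak, } W \propto P\text{):}\ \ T(P) = T(1).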

Performance Analysis (i)
Time required per step: T_s = T_p + T_c, where
  – T_s is the time per step
  – T_p is the processing (computation) time per step
  – T_c is the communication time per step

Performance Analysis (ii)
Can also write: T_s = T_p (1 + R_cp), where R_cp = T_c / T_p
R_cp is the Fundamental Ratio
NB: Assume synchronous communications, without overlap of communication and computation
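
One immediate consequence of these definitions (not spelled out on the slide): the fraction of each step spent doing useful computation is

  \frac{T_p}{T_s} = \frac{1}{1 + R_{cp}},

so keeping the fundamental ratio R_cp small is equivalent to keeping the parallel overhead small.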

Molecular Dynamics Basics
[Flow diagram: Initialize → Forces → Motion → Properties → (repeat) → Summarize]
● Key stages in MD simulation:
  – Set up initial system
  – Calculate atomic forces
  – Calculate atomic motion
  – Calculate physical properties
  – Repeat !
  – Produce final summary
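
The cycle above maps directly onto a time-stepping loop. A minimal illustrative sketch in C (a single 1-D harmonic oscillator with velocity-Verlet integration stands in for a real many-atom force field; this is not code from the lecture):

  #include <stdio.h>

  /* Minimal sketch of the MD cycle: Initialize, then loop over
     Forces -> Motion -> Properties, then Summarize. */
  int main(void)
  {
      const double k = 1.0, m = 1.0, dt = 0.01;   /* toy parameters */
      const int nsteps = 1000;

      /* Initialize: set up the system */
      double x = 1.0, v = 0.0;
      double f = -k * x;                          /* Forces */

      for (int step = 0; step < nsteps; step++) {
          /* Motion: velocity-Verlet update */
          x += v * dt + 0.5 * (f / m) * dt * dt;
          double f_new = -k * x;                  /* Forces at new positions */
          v += 0.5 * (f + f_new) / m * dt;
          f = f_new;

          /* Properties: sample the total energy occasionally */
          if (step % 100 == 0) {
              double e = 0.5 * m * v * v + 0.5 * k * x * x;
              printf("step %4d  E = %.6f\n", step, e);
          }
      }

      /* Summarize */
      printf("final x = %.6f, v = %.6f\n", x, v);
      return 0;
  }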

Basic MD Parallelization Strategies
● This Lecture
  – Computing Ensemble
  – Hierarchical Control
  – Replicated Data
● Next Lecture
  – Systolic Loops
  – Domain Decomposition

Parallel MD Algorithms: Computing Ensemble
[Diagram: Proc 0, Proc 1, Proc 2 and Proc 3 each run the complete Setup → Forces → Motion → Stats. → Results pipeline independently]

Computing Ensemble
● Advantages:
  – Simple to implement - no comms!
  – Maximum parallel efficiency – excellent throughput
  – Perfect load balancing
  – Good scaling behaviour (types 1 and 3)
  – Suitable method for Monte Carlo
  – Suitable for parallel replica applications (e.g. hyperdynamics)
● Disadvantages:
  – Limited to current physical systems
  – Limited to short timescale dynamics
  – Algorithmic sterility
  – Offers no new intellectual challenges
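
In MPI terms the computing ensemble is almost trivially simple. A minimal sketch (an assumed illustration, not the lecture's code; run_simulation() is a hypothetical stand-in for a complete serial MD or Monte Carlo run):

  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  /* Computing-ensemble sketch: every rank runs its own complete
     simulation with a different random seed, and no communication is
     needed until the results are gathered at the end. */
  double run_simulation(unsigned seed)
  {
      srand(seed);
      double acc = 0.0;
      for (int i = 0; i < 1000000; i++)        /* placeholder workload */
          acc += rand() / (double)RAND_MAX;
      return acc / 1000000.0;                  /* a "measured" property */
  }

  int main(int argc, char **argv)
  {
      int rank, nprocs;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

      double result = run_simulation(12345u + (unsigned)rank);

      /* The only communication: average the independent results */
      double mean = 0.0;
      MPI_Reduce(&result, &mean, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
      if (rank == 0)
          printf("ensemble average = %f over %d replicas\n",
                 mean / nprocs, nprocs);

      MPI_Finalize();
      return 0;
  }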

Parallel Replica Dynamics (i)
[Diagram: the original configuration is replicated across processors p0 … pM; each replica runs an equilibration period followed by production periods of length Δt_block, each ending in a minimization (m0, m1, m2, …), until one replica detects a transition (X) and then runs a decorrelation period of length Δt_corr]

Parallel Replica Dynamics (ii)
Procedure:
1. Replicate the system on M processors. Minimize to get the 'initial' state.
2. Equilibrate the system using different {v_i} on all M processors (check each replica is still in the same state). Accumulated time: t_sum = 0.
3. Run for time Δt_block, then minimize and check for a transition. Accumulated time: t_sum = t_sum + M Δt_block.
4. If no transition, repeat step 3.
5. If a transition is found on processor i, continue that run for time Δt_corr. Accumulated time: t_sum = t_sum + Δt_corr.
6. Take the configuration on processor i as the new state and proceed to step 1.
Accuracy on the time of the transition: ± Δt_block
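
A sketch of the time bookkeeping in steps 3–5, under the assumption that the replicas synchronise after every block (run_block, minimize and transition_detected are hypothetical stubs; a "transition" here is a rare random event purely so that the program runs, where a real code would compare the minimized structure with the initial state):

  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  static void run_block(double t) { (void)t;  /* MD for time t would go here */ }
  static void minimize(void)      {           /* quench to local minimum */ }
  static int  transition_detected(void) { return (rand() % 500) == 0; }

  int main(int argc, char **argv)
  {
      int rank, M;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &M);
      srand(1234u + (unsigned)rank);           /* independent replicas */

      const double dt_block = 1.0, dt_corr = 0.5;
      double t_sum = 0.0;
      int local_hit = 0, any_hit = 0;

      while (!any_hit) {
          run_block(dt_block);                 /* step 3: run a block ...   */
          minimize();                          /* ... minimize and test     */
          local_hit = transition_detected();
          t_sum += M * dt_block;               /* all M replicas add time   */
          /* every replica must learn whether anyone saw a transition */
          MPI_Allreduce(&local_hit, &any_hit, 1, MPI_INT, MPI_MAX,
                        MPI_COMM_WORLD);
      }

      if (local_hit)                           /* step 5: only the replica   */
          run_block(dt_corr);                  /* that hit decorrelates ...  */
      t_sum += dt_corr;                        /* ... and the clock gains    */
                                               /* dt_corr just once          */
      if (rank == 0)
          printf("transition after t_sum ~ %.1f (+/- %.1f)\n", t_sum, dt_block);
      MPI_Finalize();
      return 0;
  }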

Parallel Tempering (i)
[Diagram: the original configuration is replicated across processors p0 … pM and equilibrated at temperatures T0 … TM; during the production period Monte Carlo trials swap configurations between temperatures, after which the runs close down]

Parallel Tempering (ii)
Procedure:
● Start M simulations (n = 0 to M-1) of the model system at different temperatures T = n ΔT + T_0
● Equilibrate the systems for N_equil steps.
● At intervals of N_sample steps, attempt a Monte Carlo controlled swap of the configurations of two processors chosen at random.
● Continue the simulation until the distribution of configuration energies in the lowest temperature system follows Boltzmann.
● Calculate physical properties of the low temperature system(s).
● Save all replica configurations for possible restart.
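
The "Monte Carlo controlled swap" is conventionally the standard replica-exchange acceptance rule (quoted here for completeness; the slide does not state it explicitly). For replicas i and j with energies E_i, E_j and inverse temperatures β_n = 1/(k_B T_n):

  P_{\mathrm{swap}}(i \leftrightarrow j) \;=\; \min\Big\{ 1,\ \exp\big[(\beta_i - \beta_j)(E_i - E_j)\big] \Big\}.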

Parallel MD Algorithms: Hierarchical Control – Task Farming
[Diagram: Proc 0 runs Setup → Allocate → Motion → Stats. → Results, farming the Forces calculation out to Proc 1 … Proc n]

Parallel MD Algorithms: Task Farming
● Advantages
  – Can work with heterogeneous computers
  – Historical precedence
● Disadvantages
  – Poor comms hierarchy
  – Hard to load balance
  – Poor scaling (types 1 & 2)
  – Danger of deadlock
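
A minimal sketch of the pattern in the previous slide's diagram (an assumed illustration, not the lecture's code): rank 0 keeps the coordinates and does Setup, Motion and Stats, while the force evaluation is farmed out in blocks of atoms to all ranks (the master also takes a block here, purely for brevity). The "force" is a toy harmonic pull toward the origin so that the program runs.

  #include <mpi.h>
  #include <stdio.h>

  #define NATOMS 1024

  int main(int argc, char **argv)
  {
      int rank, nprocs;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

      double x[NATOMS], v[NATOMS], f[NATOMS], fsum[NATOMS];
      if (rank == 0)                               /* Setup on the master only */
          for (int i = 0; i < NATOMS; i++) { x[i] = (double)i; v[i] = 0.0; }

      const double dt = 0.01;
      for (int step = 0; step < 100; step++) {
          /* master distributes the current coordinates */
          MPI_Bcast(x, NATOMS, MPI_DOUBLE, 0, MPI_COMM_WORLD);

          /* each worker computes forces for its block of atoms */
          int lo = rank * NATOMS / nprocs;
          int hi = (rank + 1) * NATOMS / nprocs;
          for (int i = 0; i < NATOMS; i++) f[i] = 0.0;
          for (int i = lo; i < hi; i++) f[i] = -x[i];   /* toy force */

          /* partial forces are returned to the master only */
          MPI_Reduce(f, fsum, NATOMS, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

          if (rank == 0)                           /* Motion + Stats on master */
              for (int i = 0; i < NATOMS; i++) {
                  v[i] += fsum[i] * dt;
                  x[i] += v[i] * dt;
              }
      }
      if (rank == 0) printf("done: x[0] = %f\n", x[0]);
      MPI_Finalize();
      return 0;
  }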

Hierarchical Control – Master-Slave
[Diagram: a hierarchy of processors with Proc 0 as master over Procs 1–6]

Master-Slave MD Algorithm
● Advantages:
  – Can work on heterogeneous computers
  – Better comms strategy than task farming
● Disadvantages:
  – Poor load balancing characteristics
  – Difficult to scale with system size and processor count

Parallel MD Algorithms: Replicated Data
[Diagram: every processor, Proc 0 … Proc N-1, runs the full Initialize → Forces → Motion → Statistics → Summary cycle on its own copy of the data]

Replicated Data MD Algorithm
● Features:
  – Each node has a copy of all atomic coordinates (R_i, V_i, F_i)
  – Force calculations shared equally between nodes (i.e. N(N-1)/2P pair forces per node)
  – Atomic forces summed globally over all nodes
  – Motion integrated for all or some atoms on each node
  – Updated atom positions circulated to all nodes
  – Example of Algorithmic Decomposition
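
These features translate almost directly into code. A minimal replicated-data sketch (an assumed illustration, not DL_POLY source), with a toy repulsive force standing in for a real force field:

  #include <mpi.h>
  #include <stdio.h>
  #include <math.h>

  #define N 512          /* number of atoms (toy size) */

  /* Every rank holds ALL positions, velocities and forces; each rank
     computes only a strided subset of the i-j pair interactions; the
     partial force arrays are then summed globally so that every rank
     can integrate every atom identically. */
  int main(int argc, char **argv)
  {
      int rank, P;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &P);

      static double x[N], v[N], f[N], ftot[N];
      for (int i = 0; i < N; i++) { x[i] = (double)i; v[i] = 0.0; }

      const double dt = 0.001;
      for (int step = 0; step < 10; step++) {
          for (int i = 0; i < N; i++) f[i] = 0.0;

          /* share the N(N-1)/2 pairs: this rank takes outer index
             i = rank, rank+P, rank+2P, ... (simple round-robin) */
          for (int i = rank; i < N; i += P)
              for (int j = i + 1; j < N; j++) {
                  double dx  = x[i] - x[j];
                  double fij = dx / (fabs(dx * dx * dx) + 1e-12); /* toy force */
                  f[i] += fij;
                  f[j] -= fij;                    /* Newton's third law */
              }

          /* global sum: afterwards every rank has the complete force array */
          MPI_Allreduce(f, ftot, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

          /* every rank integrates every atom, so the replicas stay in step */
          for (int i = 0; i < N; i++) {
              v[i] += ftot[i] * dt;
              x[i] += v[i] * dt;
          }
      }
      if (rank == 0) printf("x[0] = %f, x[%d] = %f\n", x[0], N - 1, x[N - 1]);
      MPI_Finalize();
      return 0;
  }

Note that the simple round-robin split of the outer i loop used here leaves the per-rank pair counts unbalanced, which is exactly the problem the Brode-Ahlrichs, atom and force decompositions on the later slides address.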

Replicated Data Performance Analysis
Processing time: [equation not reproduced in transcript]
Communications time (hypercube comms): [equation not reproduced in transcript]
NB: O(N²) algorithm

Fundamental Ratio: [equation not reproduced in transcript]
Large N (N >> P): [limit not reproduced]
Small N (N ~ P): [limit not reproduced]
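
The expressions themselves were lost in transcription. For an O(N²) replicated-data algorithm with a hypercube global sum, a cost model of the kind usually quoted would be (a reconstruction under those assumptions, with t_p and t_c machine-dependent time constants; these are not the slide's own formulas):

  T_p \approx t_p \,\frac{N(N-1)}{2P}, \qquad
  T_c \approx t_c \, N \log_2 P, \qquad
  R_{cp} = \frac{T_c}{T_p} \approx \frac{2\, t_c\, P \log_2 P}{t_p (N-1)},

so for large N (N >> P) the ratio falls off roughly as P log_2 P / N, while for small N (N ~ P) it grows as log_2 P and communication dominates.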

Replicated Data MD Algorithm
● Advantages:
  – Simple to implement
  – Good load balancing
  – Highly portable programs
  – Suitable for complex force fields
  – Good scaling with system size (Type 1)
  – Dynamic load balancing possible
● Disadvantages:
  – High communication overhead
  – Sub-optimal scaling with processor count (Type 2)
  – Large memory requirement
  – Unsuitable for massive parallelism

RD Load Balancing: DL_POLY Brode-Ahlrichs decomposition
[Diagram not reproduced in transcript]

RD Load Balancing: Atom Decomposition
[Diagram: the pair-force matrix F_ij, with atoms i on one axis and atoms j on the other, divided along the atoms i axis into strips assigned to processors p0 … p4]

Force Decomposition
[Diagram: the pair-force matrix F_ij divided over both the atoms i and atoms j axes into blocks assigned to processors p0 … p14]
Note: Need not be confined to the replicated data approach!

Replicated Data: Intramolecular Forces (i)

Replicated Data: Intramolecular Forces (ii)
[Diagram: the molecular force field definition yields a Global Force Field; its terms are divided among the processors P0, P1, P2 as local force terms]
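
A sketch of how such a division is commonly implemented (an assumed illustration, not the lecture's code): the global list of bonded terms is dealt out round-robin, each rank evaluates only its own share, and a global sum rebuilds the full force array on every rank. The example uses harmonic bonds along a toy linear chain.

  #include <mpi.h>
  #include <stdio.h>
  #include <math.h>

  #define NATOMS 100
  #define NBONDS 99       /* toy linear chain: bond n joins atoms n and n+1 */

  int main(int argc, char **argv)
  {
      int rank, P;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &P);

      double x[NATOMS], f[NATOMS] = {0.0}, ftot[NATOMS];
      for (int i = 0; i < NATOMS; i++) x[i] = 1.1 * i;    /* stretched chain */

      const double k = 100.0, r0 = 1.0;      /* bond force constant, length */
      for (int n = rank; n < NBONDS; n += P) {            /* my share of bonds */
          int i = n, j = n + 1;
          double dx = x[j] - x[i];
          double fb = -k * (fabs(dx) - r0) * (dx > 0 ? 1.0 : -1.0);
          f[j] += fb;                        /* equal and opposite forces */
          f[i] -= fb;
      }

      /* every rank ends up with the full set of intramolecular forces */
      MPI_Allreduce(f, ftot, NATOMS, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

      if (rank == 0)
          printf("f[0] = %.2f, f[%d] = %.2f\n", ftot[0], NATOMS - 1,
                 ftot[NATOMS - 1]);
      MPI_Finalize();
      return 0;
  }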

Long Ranged Forces: The Ewald Summation
[Equations not reproduced in transcript]
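
The slide's equations were not preserved. For reference, the standard Ewald decomposition of the Coulomb energy (Gaussian units, tinfoil boundary conditions, surface term omitted; quoted from the standard literature, not from the slide) is

  U = U_{\mathrm{real}} + U_{\mathrm{recip}} + U_{\mathrm{self}},
  \qquad
  U_{\mathrm{real}} = \tfrac{1}{2}\sum_{i \ne j} q_i q_j \,
      \frac{\mathrm{erfc}(\alpha r_{ij})}{r_{ij}}
      \ \ \text{(pairs within the real-space cutoff)},
  \qquad
  U_{\mathrm{recip}} = \frac{2\pi}{V}\sum_{\mathbf{k} \ne 0}
      \frac{e^{-k^2/4\alpha^2}}{k^2}
      \Big|\sum_j q_j\, e^{i\mathbf{k}\cdot\mathbf{r}_j}\Big|^2,
  \qquad
  U_{\mathrm{self}} = -\frac{\alpha}{\sqrt{\pi}}\sum_j q_j^2,

which is the split that the next slides parallelize term by term.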

Parallel Ewald Summation (i)
● Self interaction correction - as is.
● Real Space terms:
  – Handle as for short ranged forces
  – For excluded atom pairs replace erfc by -erf
● Reciprocal Space terms:
  – Distribute over atoms
  – Distribute over k-vectors

Parallel Ewald Summation (ii)
Partition over atoms:
[Diagram: processors p0 … pP each compute the contribution of their own atoms; a Global Sum adds these to the Ewald sum on all processors; repeat for each k-vector]
Note: Stack sums for efficiency

Parallel Ewald Summation (iii)
Partition over k-vectors:
[Diagram: a different set of k-vectors on each processor p0 … pP]
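
A sketch of the k-vector partitioning (an assumed illustration): with replicated data each rank holds all coordinates, so it can form the complete structure factor S(k) for its own k-vectors without communication, and a single global sum combines the partial reciprocal-space energies at the end (contrast the per-k global sum needed when partitioning over atoms, hence the previous slide's advice to stack sums). Only k = (0, 0, n) vectors and a schematic prefactor are used here; the point is the communication pattern, not Ewald accuracy.

  #include <mpi.h>
  #include <stdio.h>
  #include <math.h>

  #define NATOMS 64
  #define KMAX   200
  #define TWO_PI 6.283185307179586

  int main(int argc, char **argv)
  {
      int rank, P;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &P);

      const double L = 10.0, V = L * L * L, alpha = 1.0;
      double z[NATOMS], q[NATOMS];
      for (int j = 0; j < NATOMS; j++) {          /* replicated on every rank */
          z[j] = (L * j) / NATOMS;
          q[j] = (j % 2) ? 1.0 : -1.0;            /* alternating +/- charges */
      }

      double e_local = 0.0;
      for (int n = 1 + rank; n <= KMAX; n += P) { /* my k-vectors only */
          double k = TWO_PI * n / L;
          double s_re = 0.0, s_im = 0.0;
          for (int j = 0; j < NATOMS; j++) {      /* full S(k): all atoms */
              s_re += q[j] * cos(k * z[j]);
              s_im += q[j] * sin(k * z[j]);
          }
          e_local += (TWO_PI / V) * exp(-k * k / (4.0 * alpha * alpha))
                     / (k * k) * (s_re * s_re + s_im * s_im);
      }

      double e_recip = 0.0;                       /* single global sum */
      MPI_Allreduce(&e_local, &e_recip, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
      if (rank == 0)
          printf("reciprocal-space energy (schematic) = %g\n", e_recip);
      MPI_Finalize();
      return 0;
  }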

The End