MSc in High Performance Computing: Computational Chemistry Module
Parallel Molecular Dynamics (ii)
Bill Smith, CCLRC Daresbury Laboratory


Basic MD Parallelization Strategies: Recap
● Last Lecture
  – Computing Ensemble
  – Hierarchical Control
  – Replicated Data
● This Lecture
  – Systolic Loops
  – Domain Decomposition

Systolic Loops: SLS-G Algorithm
[Figure: P processors (Proc 0, Proc 1, ..., Proc P-1) arranged in a loop, each holding two of the 2P data packets (1 and 2P, 2 and 2P-1, ..., P and P+1)]
● Systolic Loop algorithms
  – Compute the interactions between (and within) `data packets'
  – Data packets are then transferred between nodes to permit calculation of all possible pair interactions

Systolic Loop (SLS-G) Algorithm
● Systolic Loop Single-Group
● Features:
  – P processing nodes, N molecules
  – 2P groups (`packets') of n molecules (N = 2Pn)
  – For each time step (sketched in code below):
    (a) calculate intra-group forces
    (b) calculate inter-group forces
    (c) move data packets one `pulse'
    (d) repeat (b)-(c) 2P-1 times
    (e) integrate equations of motion
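A minimal sketch of the pulse structure, assuming mpi4py and a toy pair force. For brevity it circulates a single travelling packet per node (P packets, P-1 pulses) rather than the two-packet, 2P-1-pulse SLS-G schedule above, but the compute/shift/repeat pattern is the same idea; all names, sizes and the force law are illustrative.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
P, rank = comm.Get_size(), comm.Get_rank()
n = 8                                     # molecules per packet

rng = np.random.default_rng(rank)
resident = rng.random((n, 3))             # packet that stays on this node
visitor = resident.copy()                 # copy that travels round the loop

def pair_forces(X, Y, same_packet=False):
    """Toy inverse-square forces of packet Y acting on packet X."""
    F = np.zeros_like(X)
    for i, xi in enumerate(X):
        d = xi - Y
        r2 = (d * d).sum(axis=1)
        if same_packet:
            r2[i] = np.inf                # exclude self-interaction
        F[i] = (d / r2[:, None] ** 1.5).sum(axis=0)
    return F

f = pair_forces(resident, visitor, same_packet=True)   # (a) intra-packet forces
left, right = (rank - 1) % P, (rank + 1) % P

for pulse in range(P - 1):                # every packet visits every node once
    # (c) move the travelling packet one `pulse' around the loop
    visitor = comm.sendrecv(visitor, dest=right, source=left)
    f += pair_forces(resident, visitor)   # (b) inter-packet forces
# (e) ...integrate the equations of motion using the accumulated forces f
```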

SLS-G Communications Pattern
[Figure: packet movement between processors on successive pulses]

Systolic Loop Performance Analysis (i)
● Processing Time: (equation not captured in transcript)
● Communications Time: (equation not captured in transcript)
● with (parameter definitions not captured in transcript)
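The processing and communication time formulas on this slide were equation images that did not survive extraction. A plausible cost model for SLS-G, under assumed notation not taken from the slide (t_f = time per pair force evaluation, t_s = message latency, t_w = transfer time per molecule), is

T_{\mathrm{proc}} \approx t_f\,\frac{N(N-1)}{2P}, \qquad
T_{\mathrm{comm}} \approx (2P-1)\left(t_s + \frac{N}{2P}\,t_w\right),

since the systolic loop evaluates all N(N-1)/2 pairs shared evenly over P nodes, and each node forwards a packet of n = N/(2P) molecules on each of the 2P-1 pulses.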

Systolic Loop Performance Analysis (ii)
● Fundamental Ratio: (equation not captured in transcript)
● Large N (N >> P): (result not captured in transcript)
● Small N (N ~ 2P): (result not captured in transcript)
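Continuing the assumed cost model above, the fundamental ratio would be

R = \frac{T_{\mathrm{comm}}}{T_{\mathrm{proc}}}
  \approx \frac{2P(2P-1)}{N(N-1)}\,\frac{t_s}{t_f} + \frac{2P-1}{N-1}\,\frac{t_w}{t_f}.

For large N (N >> P) this tends to roughly (2P/N)(t_w/t_f), so communication overhead becomes negligible as the system grows; for small N (N ~ 2P, one molecule per packet) it tends to (t_s + t_w)/t_f, a constant fixed by the machine's communication-to-computation speed ratio.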

Systolic Loop Algorithms
● Advantages
  – Good load balancing
  – Portable between parallel machines
  – Good type 1 scaling with system size and processor count
  – Memory requirement fully distributed
  – Asynchronous communications
● Disadvantages
  – Complicated communications strategy
  – Complicated force fields difficult

Domain Decomposition (Scalar - 2D)

Domain Decomposition (Parallel - 2D)
[Figure: 2D system divided into four processor domains A, B, C, D]

Domain Decomposition (Parallel - 3D)
[Figure: panels (a) and (b) showing a 3D system divided into processor domains]

Domain Decomposition MD
● Features:
  – Short range potential cut off (r_cut << L_cell)
  – Spatial decomposition of atoms into domains
  – Map domains onto processors
  – Use link cells in each domain
  – Pass border link cells to adjacent processors (sketched in code below)
  – Calculate forces, solve equations of motion
  – Re-allocate atoms leaving domains
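A minimal sketch of the border exchange step, assuming mpi4py and a 1D slab decomposition along x (production codes use a full 3D decomposition with link cells); the box size, cutoff and atom counts are illustrative only.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
P, rank = comm.Get_size(), comm.Get_rank()
box_x, rcut = 10.0, 1.0                              # global x length and cutoff
lo, hi = rank * box_x / P, (rank + 1) * box_x / P    # this rank's x-domain

rng = np.random.default_rng(rank)
pos = rng.uniform([lo, 0.0, 0.0], [hi, 10.0, 10.0], (500, 3))   # local atoms

# Border `link cells': atoms within one cutoff of either face of the domain
to_left = pos[pos[:, 0] < lo + rcut]
to_right = pos[pos[:, 0] > hi - rcut]
left, right = (rank - 1) % P, (rank + 1) % P         # periodic in x

# Exchange border atoms with the two neighbouring domains
halo_from_right = comm.sendrecv(to_left, dest=left, source=right)
halo_from_left = comm.sendrecv(to_right, dest=right, source=left)
halo = np.vstack([halo_from_left, halo_from_right])

# Forces are then computed from local atoms plus the halo atoms; after
# integration, atoms with x outside [lo, hi) are re-allocated (migrated)
# to the neighbouring domain by the same kind of exchange.
```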

Domain Decomposition Performance Analysis (i)
● Processing Time: (equation not captured in transcript)
● Communications Time: (equation not captured in transcript)
● with (definition not captured in transcript)
● and (symbol not captured) is the number of link cells per node
● NB: O(N) Algorithm
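The equations here were also lost in extraction. A plausible cost model, assuming a fixed cutoff and uniform density, with n_c atoms per link cell, N_c link cells per node, and t_f, t_w as before (all notation assumed, not taken from the slide), is

T_{\mathrm{proc}} \approx t_f\,\frac{N}{P}\,\frac{27\,n_c}{2}, \qquad
T_{\mathrm{comm}} \approx 6\,t_w\,n_c\,N_c^{2/3} \;(\text{plus latency}), \qquad
\frac{N}{P} = n_c N_c,

since each atom interacts only with atoms in its own and the 26 neighbouring link cells, while only the surface layer of link cells (of order 6 N_c^{2/3} of them) has to be communicated. The work per node is proportional to N/P with a fixed prefactor, hence the O(N) label.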

Domain Decomposition Performance Analysis (ii)
● Fundamental Ratio: (equation not captured in transcript)
● Large N Case 1 (N >> P and ... fixed): (result not captured in transcript)
● Large N Case 2 (N >> P and ..., i.e. ... fixed): (result not captured in transcript)
● Small N (N = P and ...): (result not captured in transcript)
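Under the same assumed cost model the ratio scales as

R = \frac{T_{\mathrm{comm}}}{T_{\mathrm{proc}}} \;\propto\; \frac{t_w}{t_f}\,N_c^{-1/3},

so, plausibly matching the three cases on the slide: for N >> P with P fixed, N_c grows with N and R tends to zero; for N >> P with N/P (and hence N_c) fixed, R is constant, so efficiency is maintained as system and machine grow together; for N = P, with roughly one link cell (or less) per node, R is at its largest, of order t_w/t_f, and communication dominates.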

Domain Decomposition MD
● Advantages:
  – Predominantly local communications
  – Good load balancing (if the system is isotropic!)
  – Good type 1 scaling
  – Ideal for huge systems (10^5 atoms or more)
  – Simple communication structure
  – Fully distributed memory requirement
  – Dynamic load balancing possible
● Disadvantages:
  – Problems with mapping/portability
  – Sub-optimal type 2 scaling
  – Requires a short potential cut off
  – Complex force fields tricky

Domain Decomposition: Intramolecular Forces
[Figure: the force field is defined in terms of global atomic indices, which must be mapped onto the local atomic indices held on each processor domain (P0, P1, P2). Difficult!]

Coulombic Forces: Smoothed Particle-Mesh Ewald
The crucial part of the SPME method is the conversion of the reciprocal-space component of the Ewald sum into a form suitable for Fast Fourier Transforms (FFT). Thus:
(equation not captured in transcript)
becomes:
(equation not captured in transcript)
where G and Q are 3D grid arrays (see later).
Ref: Essmann et al., J. Chem. Phys. (1995)
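The two equations were images and were not captured. In the notation of Essmann et al. (1995), which the slide cites, the reciprocal-space Ewald energy and its SPME approximation take the standard forms below (a reconstruction of the published formulas, not a transcription of the slide; the slide's G array is taken here to be the convolution kernel):

E_{\mathrm{rec}} = \frac{1}{2\pi V}\sum_{\mathbf{m}\neq 0}
\frac{\exp(-\pi^2 m^2/\beta^2)}{m^2}\,S(\mathbf{m})\,S(-\mathbf{m}),
\qquad S(\mathbf{m}) = \sum_j q_j \exp(2\pi i\,\mathbf{m}\cdot\mathbf{r}_j)

becomes, after interpolating the charges onto a grid array Q with B-splines,

E_{\mathrm{rec}} \approx \frac{1}{2\pi V}\sum_{\mathbf{m}\neq 0}
\frac{\exp(-\pi^2 m^2/\beta^2)}{m^2}\,B(m_1,m_2,m_3)\,
F(Q)(m_1,m_2,m_3)\,F(Q)(-m_1,-m_2,-m_3)
= \tfrac{1}{2}\sum_{\mathbf{k}} Q(\mathbf{k})\,(G \star Q)(\mathbf{k}),

where F(Q) is the 3D discrete Fourier transform of Q and the convolution with G is evaluated with forward and inverse FFTs.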

SPME: Spline Scheme
● Central idea: share discrete charges on a 3D grid
● Cardinal B-Splines M_n(u) in 1D: (equation not captured in transcript)
● Recursion relation: (equation not captured in transcript)
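The spline formulas were not captured; the standard cardinal B-spline definitions used by SPME (Essmann et al., 1995) are

M_2(u) = 1 - |u-1| \quad \text{for } 0 \le u \le 2, \qquad M_2(u) = 0 \text{ otherwise},

with the recursion relation

M_n(u) = \frac{u}{n-1}\,M_{n-1}(u) + \frac{n-u}{n-1}\,M_{n-1}(u-1),

so that each point charge is spread smoothly over n grid points per dimension.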

SPME: Building the Arrays
● Q(k_1,k_2,k_3) (definition not captured in transcript) is the charge array and Q^T(k_1,k_2,k_3) its discrete Fourier transform
● G^T(k_1,k_2,k_3) is the discrete Fourier transform of the function: (equation not captured in transcript)
● with (definitions not captured in transcript)
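The definitions on this slide were equation images. In the Essmann et al. (1995) formulation, assumed here to match the slide's intent, the charge array and the spline coefficients are

Q(k_1,k_2,k_3) = \sum_i q_i \sum_{n_1,n_2,n_3}
M_n(u_{1i}-k_1-n_1K_1)\,M_n(u_{2i}-k_2-n_2K_2)\,M_n(u_{3i}-k_3-n_3K_3),

B(m_1,m_2,m_3) = |b_1(m_1)|^2\,|b_2(m_2)|^2\,|b_3(m_3)|^2, \qquad
b_i(m_i) = \frac{\exp(2\pi i (n-1)m_i/K_i)}
{\sum_{k=0}^{n-2} M_n(k+1)\exp(2\pi i\,m_i k/K_i)},

where u_{\alpha i} are scaled fractional coordinates and K_1 x K_2 x K_3 is the grid. Plausibly, G is then the array whose Fourier transform is B(m_1,m_2,m_3)\exp(-\pi^2 m^2/\beta^2)/(2\pi V m^2).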

SPME Parallelisation
● Handle real space terms using short range force methods
● Reciprocal space term options:
  – Fully replicated Q array construction and FFT (Replicated Data)
  – Atomic partition of Q array, replicated FFT (Replicated Data)
      Easily done, acceptable for few processors
      Limits imposed by RAM, global sum required
  – Domain decomposition of Q array, distributed FFT
      Required for large Q array and many processors
      Atoms `shared' between domains - potentially awkward
      Requires distributed FFT - implies comms dependence

SPME: Parallel Approaches
● SPME is generally faster than the conventional Ewald sum; the algorithm scales as O(N log N)
  – In Replicated Data: build the FFT array in pieces on each processor and make it whole by a global sum before the FFT operation
  – In Domain Decomposition: build the FFT array in pieces on each processor and keep it that way for the distributed FFT operation (the FFT `hides' all the implicit communications)
● Characteristics of FFTs
  – Fast (!): O(M log M) operations, where M is the number of points in the grid
  – Global operations: to perform an FFT you need all the points
  – This makes it difficult to write an efficient, well-scaling parallel FFT

Traditional Parallel FFTs
● Strategy (sketched in code below)
  – Distribute the data by planes
  – Each processor has a complete set of points in the x and y directions, so it can do those Fourier transforms
  – Redistribute the data so that a processor holds all the points in z
  – Do the z transforms
● Characteristics
  – Allows efficient implementation of the serial FFTs (use a library routine)
  – In practice, large enough 3D FFTs can scale reasonably
  – However, the distribution does not usually map onto the domain decomposition of the simulation - implies large amounts of data redistribution
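A minimal sketch of the plane (slab) strategy, assuming mpi4py and NumPy, with a grid whose x and z extents divide evenly by the processor count; the redistribution step is the all-to-all transpose referred to above.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
P, rank = comm.Get_size(), comm.Get_rank()
NX = NY = NZ = 16                      # global grid (assumed divisible by P)
nx, nz = NX // P, NZ // P              # local slab thicknesses

# Each rank owns a slab of x-planes: shape (nx, NY, NZ)
local = np.random.default_rng(rank).random((nx, NY, NZ)).astype(complex)

# 1. Transform along y and z, which are complete on every rank
local = np.fft.fftn(local, axes=(1, 2))

# 2. Redistribute so each rank owns complete x-columns for a block of z:
#    split the z axis into P pieces and exchange them all-to-all
send = np.ascontiguousarray(
    local.reshape(nx, NY, P, nz).transpose(2, 0, 1, 3))   # (P, nx, NY, nz)
recv = np.empty_like(send)
comm.Alltoall(send, recv)
slab_z = recv.reshape(P * nx, NY, nz)  # full x extent, partial z extent

# 3. Transform along x, which is now complete on every rank
slab_z = np.fft.fft(slab_z, axis=0)
```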

Daresbury Advanced 3-D FFT (DAFT)
● Takes data distributed as in the MD domain decomposition
● So do a distributed-data FFT in the x direction
  – Then the y
  – And finally the z
● Disadvantage: cannot simply use a library routine for the 1D FFTs (not quite true - sub-FFTs can still be done on each domain)
● Scales quite well - e.g. on 512 procs (an 8x8x8 proc grid) a 1D FFT need only scale to 8 procs
● Totally avoids data redistribution costs
● Communication is by rows/columns
● In practice DAFT wins (on the machines we have compared) and the coding is simpler too!

Domain Decomposition: Load Balancing Issues
● Decomposition into spatial domains sometimes presents severe load balancing problems
  – Material can be inhomogeneous
  – Some parts may require different amounts of computation, e.g. an enzyme in a large bath of water
● Strategies can include
  – Dynamic load balancing: re-distribution (migration) of atoms from one processor to another
      Need to carry around associated data on bonds, angles, constraints, ...
  – Redistribution of parts of the force calculation, e.g. NAMD

Domain Decomposition: Dynamic Load Balancing
[Figure: domain boundaries adjusted dynamically to even out the load]
Can be applied in 3D (but not easily!)
Ref: Boillat, Bruge, Kropf, J. Comput. Phys., 96, 1 (1991)

NAMD: Dynamic Load Balancing
● NAMD exploits MD as a tool to understand the structure and function of biomolecules
  – proteins, DNA, membranes
● NAMD is a production quality MD program
  – Active use by biophysicists (science publications)
  – 50,000+ lines of C++ code
  – 1000+ registered users
  – Features and "accessories" such as
      VMD: visualization and analysis
      BioCoRE: collaboratory
      Steered and Interactive Molecular Dynamics
● Load balancing ref: L.V. Kale, M. Bhandarkar and R. Brunner, Lecture Notes in Computer Science, 1998, 1457

NAMD: Initial Static Balancing
● Allocate patches (link cells) to processors so that
  – Each processor has (approximately) the same number of atoms
  – Neighbouring patches share the same processor if possible
● Weighting the workload on each processor
  – Calculate forces internal to each patch (weight ~ n_p^2/2)
  – Calculate forces between patches (i.e. one compute object) on the same processor (weight ~ w*n_1*n_2); the factor w depends on the connection (face-face > edge-edge > corner-corner)
  – If two patches are on different processors, send a proxy patch to the less loaded processor
● Dynamic load balancing is used during the simulation run

NAMD: Dynamic Load Balancing (i)
● Balance is maintained by a Distributed Load Balance Coordinator, which monitors on each processor:
  – Background load (non-migratable work)
  – Idle time
  – Migratable compute objects and their associated compute load
  – The patches that compute objects depend upon
  – The home processor of each patch
  – The proxy patches required by each processor
● The monitored data is used to determine load balancing

NAMD: Dynamic Load Balancing (ii)
● Greedy load balancing strategy (sketched in code below):
  – Sort migratable compute objects in order of heaviest load
  – Sort processors in order of `hungriest'
  – Share out compute objects so the hungriest-ranked processor gets the largest compute object available
  – BUT: this does not take communication cost into account
● Modification:
  – Identify the least loaded processors with:
      Both patches or proxies needed for a compute object (no comms)
      One patch necessary for a compute object (moderate comms)
      No patches for a compute object (high comms)
  – Allocate the compute object to the processor giving the best compromise in cost (compute plus communication)
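A minimal sketch of a greedy assignment of the kind described above, assuming each compute object carries a load estimate and the set of patches it needs; the communication penalty is an invented weighting for illustration, not NAMD's actual cost model.

```python
import heapq

def greedy_balance(objects, n_procs, patch_home, comm_penalty=0.2):
    """objects: list of (obj_id, load, needed_patches); returns {obj_id: proc}."""
    # Heaviest compute objects first; `hungriest' (least loaded) processors first.
    objects = sorted(objects, key=lambda o: -o[1])
    heap = [(0.0, p) for p in range(n_procs)]        # (current load, processor)
    heapq.heapify(heap)
    placement = {}
    for obj_id, load, patches in objects:
        candidates = heapq.nsmallest(3, heap)        # a few least-loaded processors
        best = None
        for cur_load, proc in candidates:
            # Patches the processor does not already hold imply proxy traffic.
            missing = sum(1 for pt in patches if patch_home.get(pt) != proc)
            cost = cur_load + load + comm_penalty * load * missing
            if best is None or cost < best[0]:
                best = (cost, cur_load, proc)
        _, cur_load, proc = best
        heap.remove((cur_load, proc))                # update that processor's load
        heapq.heapify(heap)
        heapq.heappush(heap, (cur_load + load, proc))
        placement[obj_id] = proc
    return placement

# Example: three patches homed on two processors, four compute objects.
homes = {"A": 0, "B": 0, "C": 1}
objs = [(0, 5.0, {"A", "B"}), (1, 4.0, {"B", "C"}),
        (2, 3.0, {"C"}), (3, 1.0, {"A"})]
print(greedy_balance(objs, 2, homes))
```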

Impact of Measurement-based Load Balancing

The End