1 MSc in High Performance Computing, Computational Chemistry Module: Parallel Molecular Dynamics (ii). Bill Smith, CCLRC Daresbury Laboratory, w.smith@daresbury.ac.uk

2 Basic MD Parallelization Strategies ● Recap: Last Lecture – Computing Ensemble – Hierarchical Control – Replicated Data ● This Lecture – Systolic Loops – Domain Decomposition

3 Systolic Loops: SLS-G Algorithm [Diagram: the 2P data packets arranged in a loop over the P processors, with Proc 0 holding packets 1 and 2P, Proc 1 holding 2 and 2P-1, ..., Proc (P-2) holding P-1 and P+2, and Proc (P-1) holding P and P+1] ● Systolic Loop algorithms – Compute the interactions between (and within) `data packets' – Data packets are then transferred between nodes to permit calculation of all possible pair interactions

4 Systolic Loop (SLS-G) Algorithm ● Systolic Loop Single-Group ● Features: – P processing nodes, N molecules – 2P groups (`packets') of n molecules (N = 2Pn) – For each time step: (a) calculate intra-group forces (b) calculate inter-group forces (c) move data packets one `pulse' (d) repeat (b)-(c) 2P-1 times (e) integrate equations of motion (a sketch of the pulse schedule follows below)
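Below is a minimal plain-Python sketch (illustrative names, not the lecture's code, which would use message passing) of the pulse schedule that steps (b)-(d) generate: each of the P processors holds two of the 2P packets, computes their mutual interactions, and the packets are then shifted one place around the loop; after 2P-1 pulses every pair of packets has met exactly once.

def sls_g_schedule(P):
    """Packet pairings handled at each of the 2P-1 systolic pulses (round-robin)."""
    packets = list(range(2 * P))                     # packet labels 0 .. 2P-1
    schedule = []
    for pulse in range(2 * P - 1):
        # processor p currently holds the packets sitting in slots p and 2P-1-p
        schedule.append([(packets[p], packets[2 * P - 1 - p]) for p in range(P)])
        # one systolic pulse: hold slot 0 fixed, rotate the remaining slots by one
        packets = [packets[0], packets[-1]] + packets[1:-1]
    return schedule

if __name__ == "__main__":
    P = 4
    sched = sls_g_schedule(P)
    pairs_seen = {frozenset(pair) for pulse in sched for pair in pulse}
    assert len(pairs_seen) == P * (2 * P - 1)        # every packet pair met exactly once
    for t, pairs in enumerate(sched):
        print("pulse", t, pairs)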

5 SLS-G Communications Pattern

6 Systolic Loop Performance Analysis (i) ● Processing Time: [equation on slide] ● Communications Time: [equation on slide] ● with [definitions on slide] (a reconstruction follows below)
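The timing expressions on this slide were embedded as images, so the following is a plausible reconstruction under stated assumptions rather than the original formulas. With 2P packets of n = N/(2P) molecules on P processors, t_f the cost of one pair force evaluation and t_c the cost of communicating one molecule's data per pulse (latency neglected):

\[
T_{\mathrm{proc}} \;\approx\; t_f\bigl[\,n(n-1) + (2P-1)\,n^2\,\bigr] \;\approx\; t_f\,\frac{N^2}{2P},
\qquad
T_{\mathrm{comm}} \;\approx\; t_c\,(2P-1)\,n \;\approx\; t_c\,N ,
\]

where the first term of T_proc is the intra-packet work done once per step and the second is the inter-packet work accumulated over the 2P-1 pulses.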

7 Systolic Loop Performance Analysis (ii) ● Fundamental Ratio: [equation on slide] ● Large N (N>>P): [limit on slide] ● Small N (N~2P): [limit on slide] (a reconstruction follows below)
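Again as a hedged reconstruction (the slide's own expressions were images), the ratio of the two times above and its limits would be

\[
R \;=\; \frac{T_{\mathrm{comm}}}{T_{\mathrm{proc}}}
\;\approx\; \frac{t_c\,N}{t_f\,N^2/(2P)}
\;=\; \frac{2P}{N}\,\frac{t_c}{t_f},
\qquad
N \gg P:\;\; R \to 0,
\qquad
N \sim 2P\;(n=1):\;\; R \approx \frac{t_c}{t_f},
\]

i.e. communication is negligible for large systems but becomes comparable to, or larger than, the computation when each packet holds only one or two molecules.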

8 Systolic Loop Algorithms ● Advantages – Good load balancing – Portable between parallel machines – Good type 1 scaling with system size and processor count – Memory requirement fully distributed – Asynchronous communications ● Disadvantages – Complicated communications strategy – Complicated force fields difficult

9 Domain Decomposition (Scalar - 2D)

10 Domain Decomposition (Parallel - 2D) [Diagram: the 2D cell divided into four domains A, B, C, D]

11 Domain Decomposition (Parallel - 3D) [Diagrams (a) and (b)]

12 Domain Decomposition MD ● Features: – Short range potential cut off (rcut << Lcell) – Spatial decomposition of atoms into domains – Map domains onto processors – Use link cells in each domain – Pass border link cells to adjacent processors – Calculate forces, solve equations of motion – Re-allocate atoms leaving domains (a link-cell sketch follows below)
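A minimal sketch (illustrative, not the lecture's code) of the link-cell step inside one domain: atoms are binned into cells of side >= rcut, candidate force pairs are taken only from the same or neighbouring cells, and the border layer of cells is what would be exchanged with adjacent processors as a halo.

import numpy as np

def build_link_cells(pos, box, rcut):
    """Bin atomic positions (n,3) in a periodic box into link cells of side >= rcut."""
    ncell = np.maximum((box // rcut).astype(int), 1)     # cells in each direction
    cell_len = box / ncell
    idx = np.floor(pos / cell_len).astype(int) % ncell   # 3D cell index of each atom
    cells = {}
    for i, trip in enumerate(map(tuple, idx)):
        cells.setdefault(trip, []).append(i)
    return cells, ncell

def neighbour_pairs(cells, ncell):
    """Yield candidate atom pairs from the same or adjacent (periodic) link cells."""
    shifts = [(a, b, c) for a in (-1, 0, 1) for b in (-1, 0, 1) for c in (-1, 0, 1)]
    for (cx, cy, cz), atoms in cells.items():
        for sx, sy, sz in shifts:
            key = ((cx + sx) % ncell[0], (cy + sy) % ncell[1], (cz + sz) % ncell[2])
            for i in atoms:
                for j in cells.get(key, []):
                    if i < j:                            # count each pair once
                        yield i, j

box = np.array([10.0, 10.0, 10.0])
pos = np.random.rand(200, 3) * box
cells, ncell = build_link_cells(pos, box, rcut=2.5)
print(sum(1 for _ in neighbour_pairs(cells, ncell)), "candidate pairs")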

13 Domain Decomposition Performance Analysis (i) ● Processing Time: [equation on slide] ● Communications Time: [equation on slide] ● with [definitions on slide], where one of the symbols is the number of link cells per node (a reconstruction follows below) ● NB: O(N) Algorithm
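The slide's expressions were embedded as images; a plausible reconstruction, assuming n = N/P atoms per cubic domain of side L_d = (n/ρ)^{1/3} at number density ρ, with t_f and t_c the per-pair compute and per-atom communication costs:

\[
T_{\mathrm{proc}} \;\approx\; t_f\,\frac{N}{P}\,\Bigl(\tfrac{4}{3}\pi\,r_{\mathrm{cut}}^{3}\,\rho\Bigr),
\qquad
T_{\mathrm{comm}} \;\approx\; 6\,t_c\,\rho\,r_{\mathrm{cut}}\,L_d^{2}
\;=\; 6\,t_c\,\rho\,r_{\mathrm{cut}}\Bigl(\frac{N}{\rho P}\Bigr)^{2/3}.
\]

Each atom interacts with a fixed number of neighbours, hence the O(N) label, while communication scales with the surface of the domain (the border layer of link cells) rather than its volume.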

14 Domain Decomposition Performance Analysis (ii) ● Fundamental Ratio: [equation on slide] ● Large N Case 1 (N>>P and [quantity] fixed): [limit on slide] ● Large N Case 2 (N>>P and [quantity] fixed): [limit on slide] ● Small N (N=P and [condition]): [limit on slide] (a reconstruction follows below)
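A hedged reconstruction of the ratio and its limits, using the expressions sketched above:

\[
R \;=\; \frac{T_{\mathrm{comm}}}{T_{\mathrm{proc}}}
\;\propto\; \frac{(N/P)^{2/3}}{N/P}
\;=\; \Bigl(\frac{P}{N}\Bigr)^{1/3},
\]

so for large N with P fixed the ratio falls off as (P/N)^{1/3}; for large N with N/P fixed the ratio is a constant independent of system size; and for small N with N of order P (roughly one link cell per node) the ratio is of order t_c/t_f and communication dominates.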

15 Domain Decomposition MD ● Advantages: – Predominantly local communications – Good load balancing (if system is isotropic!) – Good type 1 scaling – Ideal for huge systems (10^5 atoms and more) – Simple communication structure – Fully distributed memory requirement – Dynamic load balancing possible ● Disadvantages – Problems with mapping/portability – Sub-optimal type 2 scaling – Requires short potential cut off – Complex force fields tricky

16 Domain Decomposition: Intramolecular Forces [Diagram: the force field definition is written in terms of global atomic indices, but each processor domain (P0, P1, P2) holds its atoms under local atomic indices - mapping between the two is difficult!] (a bookkeeping sketch follows below)
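A minimal bookkeeping sketch (illustrative names, not the lecture's code): each domain keeps a global-to-local map for the atoms it owns and for its halo copies, and a bonded term written in global indices is evaluated by the domain that owns its first atom, provided the partner atom is visible locally.

def local_bond_list(bonds, owned, halo):
    """bonds: list of (i_global, j_global); owned/halo: dicts mapping global -> local index."""
    visible = {**owned, **halo}
    local_bonds, unresolved = [], []
    for i, j in bonds:
        if i in owned:                        # this domain is responsible for the term
            if j in visible:
                local_bonds.append((owned[i], visible[j]))
            else:
                unresolved.append((i, j))     # partner outside domain + halo: extra comms needed
    return local_bonds, unresolved

# toy usage: domain P0 owns global atoms 0-3 and sees 4 and 5 as halo copies
owned = {0: 0, 1: 1, 2: 2, 3: 3}
halo = {4: 4, 5: 5}
bonds = [(0, 1), (3, 4), (2, 7)]
print(local_bond_list(bonds, owned, halo))    # ([(0, 1), (3, 4)], [(2, 7)])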

17 Coulombic Forces: Smoothed Particle-Mesh Ewald The crucial part of the SPME method is the conversion of the Reciprocal Space component of the Ewald sum into a form suitable for Fast Fourier Transforms (FFT). Thus [the conventional reciprocal space sum, shown as an image] becomes [a sum over a 3D grid, shown as an image], where G and Q are 3D grid arrays (see later). Ref: Essmann et al., J. Chem. Phys. 103, 8577 (1995). (A schematic form of the two equations follows below.)
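The two equations were embedded as images; in schematic form (prefactors as in Essmann et al., β the Ewald parameter) the conversion referred to is

\[
E_{\mathrm{recip}} \;=\; \frac{1}{2\pi V}\sum_{\mathbf{m}\neq 0}
\frac{\exp(-\pi^{2}\mathbf{m}^{2}/\beta^{2})}{\mathbf{m}^{2}}\,S(\mathbf{m})\,S(-\mathbf{m}),
\qquad
S(\mathbf{m}) \;=\; \sum_{j} q_{j}\,\exp(2\pi i\,\mathbf{m}\cdot\mathbf{r}_{j}),
\]

which, once the point charges are interpolated onto a regular grid, becomes (schematically)

\[
E_{\mathrm{recip}} \;\approx\; \tfrac{1}{2}\sum_{k_{1},k_{2},k_{3}}
G_{T}(k_{1},k_{2},k_{3})\,Q_{T}(k_{1},k_{2},k_{3})\,Q_{T}(-k_{1},-k_{2},-k_{3}),
\]

so that the reciprocal space energy (and, by differentiation, the forces) can be evaluated with FFTs of the grid arrays.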

18 SPME: Spline Scheme ● Central idea - share the discrete charges on a 3D grid ● Cardinal B-Splines M_n(u) - in 1D: [equation on slide] ● Recursion relation: [equation on slide] (the standard forms follow below)
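The recursion on the slide is presumably the standard cardinal B-spline relation of Essmann et al.:

\[
M_{2}(u) \;=\; 1 - |u-1| \quad (0 \le u \le 2), \qquad M_{2}(u) = 0 \;\text{otherwise},
\]
\[
M_{n}(u) \;=\; \frac{u}{n-1}\,M_{n-1}(u) \;+\; \frac{n-u}{n-1}\,M_{n-1}(u-1),
\]

so splines of any order are generated from M_2, and M_n(u) is non-zero only for 0 < u < n, which keeps the charge spreading local to a few grid points.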

19 SPME: Building the Arrays ● Q(k1,k2,k3) is the charge array and Q_T(k1,k2,k3) its discrete Fourier transform ● G_T(k1,k2,k3) is the discrete Fourier transform of the function [shown as an image on the slide] with [definitions shown as an image] (a reconstruction follows below)
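The array definitions were images on the slide; a plausible reconstruction following Essmann et al. (K_1, K_2, K_3 are the grid dimensions, u_{αi} the scaled fractional coordinates of atom i, and the b_α the B-spline Fourier coefficients):

\[
Q(k_{1},k_{2},k_{3}) \;=\; \sum_{i} q_{i} \sum_{n_{1},n_{2},n_{3}}
M_{n}(u_{1i}-k_{1}-n_{1}K_{1})\,M_{n}(u_{2i}-k_{2}-n_{2}K_{2})\,M_{n}(u_{3i}-k_{3}-n_{3}K_{3}),
\]
\[
G_{T}(m_{1},m_{2},m_{3}) \;\propto\; B(m_{1},m_{2},m_{3})\,
\frac{\exp(-\pi^{2}\mathbf{m}^{2}/\beta^{2})}{\mathbf{m}^{2}},
\qquad
B(m_{1},m_{2},m_{3}) \;=\; |b_{1}(m_{1})|^{2}\,|b_{2}(m_{2})|^{2}\,|b_{3}(m_{3})|^{2}.
\]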

20 SPME Parallelisation ● Handle real space terms using short range force methods ● Reciprocal space term options: – Fully replicated Q array construction and FFT (R. Data) – Atomic partition of Q array, replicated FFT (R. Data): easily done, acceptable for few processors; limits imposed by RAM, global sum required – Domain decomposition of Q array, distributed FFT: required for large Q array and many processors; atoms `shared' between domains - potentially awkward; requires distributed FFT - implies comms dependence

21 SPME: Parallel Approaches ● SPME is generally faster than the conventional Ewald sum in most applications; the algorithm scales as O(N log N) – In Replicated Data: build the FFT array in pieces on each processor and make it whole by a global sum for the FFT operation – In Domain Decomposition: build the FFT array in pieces on each processor and keep it that way for the distributed FFT operation (the FFT `hides' all the implicit communications) ● Characteristics of FFTs – Fast (!) - O(M log M) operations, where M is the number of points in the grid – Global operations - to perform an FFT you need all the points – This makes it difficult to write an efficient, well-scaling FFT

22 Traditional Parallel FFTs ● Strategy – Distribute the data by planes – Each processor has a complete set of points in the x and y directions, so it can do those Fourier transforms – Redistribute the data so that a processor holds all the points in z – Do the z transforms ● Characteristics – Allows efficient implementation of the serial FFTs (use a library routine) – In practice, for large enough 3D FFTs, can scale reasonably – However, the distribution does not usually map onto the domain decomposition of the simulation - implies large amounts of data redistribution (a sketch of the two-stage strategy follows below)
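A minimal NumPy sketch (serial emulation, illustrative names) of this two-stage strategy: the grid is split into z-slabs that are transformed in x and y, the data are then "redistributed" so that each block holds every z point, and the remaining 1D transforms are done along z; the result matches a full 3D FFT because the transform is separable.

import numpy as np

def slab_fft3(a, nproc):
    """3D FFT computed in two stages, mimicking the plane-distributed strategy."""
    nx, ny, nz = a.shape
    assert nx % nproc == 0 and nz % nproc == 0
    # stage 1: each "processor" owns nz/nproc z-planes and transforms them in x and y
    slabs = np.split(a, nproc, axis=2)
    partial = np.concatenate([np.fft.fft2(s, axes=(0, 1)) for s in slabs], axis=2)
    # stage 2: redistribute so each "processor" holds all z points for a block of x values,
    # then finish with the 1D transforms along z
    blocks = np.split(partial, nproc, axis=0)
    return np.concatenate([np.fft.fft(b, axis=2) for b in blocks], axis=0)

rng = np.random.default_rng(0)
a = rng.standard_normal((8, 8, 8)) + 1j * rng.standard_normal((8, 8, 8))
assert np.allclose(slab_fft3(a, nproc=4), np.fft.fftn(a))    # matches the direct 3D FFT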

23 Daresbury Advanced 3-D FFT (DAFT) ● Takes data distributed as the MD domain decomposition ● So do a distributed-data FFT in the x direction – Then the y – And finally the z ● Disadvantage is that it cannot use a library routine for the 1D FFT (not quite true - can do sub-FFTs on each domain) ● Scales quite well - e.g. on 512 procs, an 8x8x8 proc grid, a 1D FFT need only scale to 8 procs ● Totally avoids data redistribution costs ● Communication is by rows/columns ● In practice DAFT wins (on the machines we have compared) and the coding is simpler too!

24 Domain Decomposition: Load Balancing Issues ● Domain decomposition according to spatial domains sometimes presents severe load balancing problems – Material can be inhomogeneous – Some parts may require different amounts of computation, e.g. an enzyme in a large bath of water ● Strategies can include – Dynamic load balancing: re-distribution (migration) of atoms from one processor to another; need to carry around associated data on bonds, angles, constraints... – Redistribution of parts of the force calculation, e.g. NAMD

25 Domain Decomposition: Dynamic Load Balancing Can be applied in 3D (but not easily!) Ref: Boillat, Bruge, Kropf, J. Comput. Phys. 96, 1 (1991)

26 NAMD: Dynamic Load Balancing ● NAMD exploits MD as a tool to understand the structure and function of biomolecules – proteins, DNA, membranes ● NAMD is a production quality MD program – Active use by biophysicists (science publications) – 50,000+ lines of C++ code – 1000+ registered users – Features and "accessories" such as: VMD (visualization and analysis), BioCoRE (collaboratory), Steered and Interactive Molecular Dynamics ● Load balancing ref: L.V. Kale, M. Bhandarkar and R. Brunner, Lecture Notes in Computer Science 1998, 1457, 251-261

27 NAMD: Initial Static Balancing ● Allocate patches (link cells) to processors so that – Each processor has (approximately) the same number of atoms – Neighbouring patches share the same processor if possible ● Weighting the workload on each processor – Calculate forces internal to each patch (weight ~ n_p^2/2) – Calculate forces between patches (i.e. one compute object) on the same processor (weight ~ w*n_1*n_2); the factor w depends on the connection (face-face > edge-edge > corner-corner) – If two patches are on different processors, send a proxy patch to the lesser loaded processor ● Dynamic load balancing is used during the simulation run

28 NAMD: Dynamic Load Balancing (i) ● Balance is maintained by a Distributed Load Balance Coordinator, which monitors on each processor: – Background load (non-migratable work) – Idle time – Migratable compute objects and their associated compute load – The patches that compute objects depend upon – The home processor of each patch – The proxy patches required by each processor ● The monitored data is used to determine load balancing

29 NAMD: Dynamic Load Balancing (ii) ● Greedy load balancing strategy: – Sort migratable compute objects in order of heaviest load – Sort processors in order of `hungriest' – Share out compute objects so the hungriest-ranked processor gets the largest compute object available – BUT: this does not take into account the communication cost ● Modification: – Identify the least loaded processors that have: both patches or proxies needed to complete a compute object (no comms); one of the patches necessary for a compute object (moderate comms); no patches for a compute object (high comms) – Allocate the compute object to the processor giving the best compromise in cost (compute plus communication) (a sketch of such a greedy pass follows below)
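A minimal sketch of a greedy pass of the kind described above (illustrative data structures and penalty values, not NAMD's actual code): compute objects are taken heaviest first, and each is placed on the processor whose current load plus an added communication penalty, which grows with the number of required patches the processor does not hold, is smallest.

def greedy_balance(objects, procs, comm_penalty=(0.0, 0.1, 0.3)):
    """objects: list of (load, frozenset of the two patches needed);
    procs: list of dicts {'name': ..., 'load': float, 'patches': set of patch ids}."""
    for load, patches in sorted(objects, key=lambda o: o[0], reverse=True):
        best, best_cost = None, None
        for p in procs:
            missing = len(patches - p["patches"])          # 0, 1 or 2 patches absent
            cost = p["load"] + load * (1.0 + comm_penalty[missing])
            if best_cost is None or cost < best_cost:
                best, best_cost = p, cost
        best["load"] += load                               # least-cost processor takes the object
        best["patches"] |= set(patches)                    # it now holds proxies for these patches
    return procs

# toy usage: three processors with home patches, four compute objects (patch pairs)
procs = [{"name": i, "load": 0.0, "patches": set(h)}
         for i, h in enumerate([{"A", "B"}, {"C"}, {"D", "E"}])]
objects = [(4.0, frozenset({"A", "B"})), (3.0, frozenset({"B", "C"})),
           (2.0, frozenset({"C", "D"})), (1.0, frozenset({"E", "A"}))]
for p in greedy_balance(objects, procs):
    print(p["name"], round(p["load"], 2), sorted(p["patches"]))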

30 Impact of Measurement-based Load Balancing

31 The End

