Published by Tamsyn Williams. Modified over 6 years ago.
1
Scalable Molecular Dynamics for Large Biomolecular Systems
Robert Brunner, James C. Phillips, Laxmikant Kale
2
Overview
- Context: approach and methodology
- Molecular dynamics for biomolecules
- Our program: NAMD
- Basic parallelization strategy
- NAMD performance optimizations
  - Techniques
  - Results
- Conclusions: summary, lessons, and future work
3
The context
Objective: enhance performance and productivity in parallel programming
- For complex, dynamic applications
- Scalable to thousands of processors
Theme: adaptive techniques for handling dynamic behavior
- Look for the optimal division of labor between the human programmer and the “system”
- Let the programmer specify what to do in parallel
- Let the system decide when and where to run the subcomputations
- Data-driven objects as the substrate
5
Data-driven execution
[Figure: two processors, each with its own scheduler and message queue]
6
Charm++
Parallel C++ with data-driven objects
- Object arrays and collections
- Asynchronous method invocation
- Object groups: a global object with a “representative” on each PE
- Prioritized scheduling
- Mature, robust, portable
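The combination of asynchronous method invocation and prioritized scheduling can be illustrated with a toy, single-process sketch. This is not the Charm++ API: the `Scheduler` and `Patch` classes and the `receive_forces` method are hypothetical names chosen for illustration, and a lower number is assumed to mean higher priority.

```python
import heapq
import itertools


class Scheduler:
    """Toy per-processor scheduler with a prioritized message queue."""

    def __init__(self):
        self._queue = []
        self._order = itertools.count()  # tie-breaker: FIFO within a priority

    def send(self, obj, method, *args, priority=0):
        # Asynchronous invocation: enqueue the message and return at once.
        heapq.heappush(self._queue, (priority, next(self._order), obj, method, args))

    def run(self):
        # Data-driven execution: whichever message is pending decides what runs.
        while self._queue:
            _, _, obj, method, args = heapq.heappop(self._queue)
            getattr(obj, method)(*args)


class Patch:
    """Hypothetical data-driven object that records the messages it handles."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def receive_forces(self, source):
        self.log.append((self.name, source))


log = []
scheduler = Scheduler()
a, b = Patch("A", log), Patch("B", log)
scheduler.send(a, "receive_forces", "low", priority=5)
scheduler.send(b, "receive_forces", "urgent", priority=1)
scheduler.send(a, "receive_forces", "normal", priority=5)
scheduler.run()
```

The sender never waits for the receiver; the order of execution is decided by the queue, not by the order of the `send` calls.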
7
Multi-partition decomposition
8
Load balancing
Based on migratable objects:
- Collect timing data for several cycles
- Run a heuristic load balancer
  - Several alternative ones
- Re-map and migrate objects accordingly
  - Registration mechanisms facilitate migration
9
Measurement-based load balancing
- Application-induced imbalances:
  - Abrupt, but infrequent, or
  - Slow, cumulative
  - Rarely: frequent, large changes
- Principle of persistence
  - Extension of the principle of locality
  - The behavior of objects, including computational load and communication patterns, tends to persist over time
- We have implemented strategies that exploit this automatically
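A minimal sketch of how the principle of persistence can be used: record how long each object took in recent cycles and use the recent average as the predicted load for the next cycles. The `LoadDatabase` class, its method names, and the averaging window are assumptions made for illustration, not the actual framework.

```python
from collections import defaultdict


class LoadDatabase:
    """Hypothetical store of measured per-object execution times."""

    def __init__(self):
        self._samples = defaultdict(list)

    def record(self, obj_id, seconds):
        # Called once per cycle with the measured time for one object.
        self._samples[obj_id].append(seconds)

    def predicted_load(self, obj_id, window=3):
        # Persistence: the recent past is taken as the estimate of the near future.
        recent = self._samples[obj_id][-window:]
        return sum(recent) / len(recent)


db = LoadDatabase()
for measured in (2.0, 4.0, 3.0, 5.0):
    db.record("compute-1", measured)
estimate = db.predicted_load("compute-1")
```

A load balancer would consume such estimates instead of static, a priori cost models.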
10
Molecular Dynamics
11
Molecular dynamics and NAMD
- MD to understand the structure and function of biomolecules
  - proteins, DNA, membranes
- NAMD is a production-quality MD program
  - Active use by biophysicists (science publications)
  - 50,000+ lines of C++ code
  - 1000+ registered users
  - Features and “accessories” such as
    - VMD: visualization
    - BioCoRE: collaboratory
    - Steered and Interactive Molecular Dynamics
12
NAMD Contributors
PIs: Laxmikant Kale, Klaus Schulten, Robert Skeel
NAMD 1: Robert Brunner, Andrew Dalke, Attila Gursoy, Bill Humphrey, Mark Nelson
NAMD 2: M. Bhandarkar, R. Brunner, A. Gursoy, J. Phillips, N. Krawetz, A. Shinozaki, K. Varadarajan, Gengbin Zheng, ...
13
Molecular Dynamics
- Collection of [charged] atoms, with bonds
- Newtonian mechanics
- At each time-step:
  - Calculate forces on each atom
    - bonded forces
    - non-bonded forces: electrostatic and van der Waals
  - Calculate velocities and advance positions
- 1 femtosecond time-step, millions needed!
- Thousands of atoms (1, ,000)
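The per-time-step loop can be sketched on a deliberately tiny system. The example below assumes two atoms in one dimension joined by a single harmonic bond, with no non-bonded forces, and updates velocities first and positions second. The function names, units, and constants are illustrative assumptions and are not taken from NAMD.

```python
def bond_forces(positions, k=1.0, rest_length=1.0):
    """Force on each of two atoms from one hypothetical harmonic bond (1D)."""
    stretch = (positions[1] - positions[0]) - rest_length
    # Equal and opposite forces (Newton's third law).
    return [k * stretch, -k * stretch]


def time_step(positions, velocities, masses, dt):
    """One step: compute forces, update velocities, advance positions."""
    forces = bond_forces(positions)
    velocities = [v + f / m * dt for v, f, m in zip(velocities, forces, masses)]
    positions = [x + v * dt for x, v in zip(positions, velocities)]
    return positions, velocities


positions, velocities, masses = [0.0, 1.5], [0.0, 0.0], [1.0, 2.0]
for _ in range(1000):
    positions, velocities = time_step(positions, velocities, masses, dt=0.01)

momentum = sum(m * v for m, v in zip(masses, velocities))
separation = positions[1] - positions[0]
```

Because the two bond forces cancel, total momentum stays at zero up to rounding, which is a cheap sanity check for any force routine.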
14
Cut-off radius
- Use of a cut-off radius (in Å) to reduce work
  - Faraway charges ignored!
- 80-95% of the work is non-bonded force computation
- Some simulations need faraway contributions
  - Periodic systems: Ewald, Particle-Mesh Ewald (PME)
  - Aperiodic systems: FMA
- Even so, cut-off based computations are important:
  - near-atom calculations are part of the above
  - multiple time-stepping is used: k cut-off steps, 1 PME/FMA
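As a minimal sketch of what the cut-off does, the function below lists only those atom pairs that are close enough to interact. It is the naive all-pairs version, written for clarity; the coordinates and the cut-off value are made-up illustrative numbers.

```python
import math


def pairs_within_cutoff(positions, cutoff):
    """Return index pairs (i, j), i < j, whose distance is within the cut-off."""
    pairs = []
    for i in range(len(positions)):
        # j > i: each pair is visited once, since the force on j is minus the force on i.
        for j in range(i + 1, len(positions)):
            if math.dist(positions[i], positions[j]) <= cutoff:
                pairs.append((i, j))
    return pairs


atoms = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0), (20.0, 0.0, 0.0)]
near = pairs_within_cutoff(atoms, cutoff=4.5)
```

The atom at x = 20 contributes no pairs at all: its charge is simply ignored, which is where the saving in work comes from.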
15
Scalability
- The program should scale up to use a large number of processors.
  - But what does that mean?
- An individual simulation isn’t truly scalable
- Better definition of scalability:
  - If I double the number of processors, I should be able to retain parallel efficiency by increasing the problem size
16
Isoefficiency
Quantify scalability (work of Vipin Kumar, U. Minnesota)
- How much increase in problem size is needed to retain the same efficiency on a larger machine?
- Efficiency: Seq. Time / (P · Parallel Time)
  - parallel time = computation + communication + idle
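Written out as formulas, with symbol names that are our own notation rather than the slide's, the efficiency definition reads:

```latex
E(P) = \frac{T_{\text{seq}}}{P \cdot T_{\text{par}}(P)},
\qquad
T_{\text{par}}(P) = T_{\text{comp}} + T_{\text{comm}} + T_{\text{idle}}
```

The isoefficiency question is then how fast the problem size must grow with $P$ so that $E(P)$ stays constant.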
17
Atom decomposition
- Partition the atoms array across processors
  - Nearby atoms may not be on the same processor
  - Communication: O(N) per processor
  - Communication/Computation: O(N)/(N/P) = O(P)
- Again, not scalable by our definition
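A small cost model makes the "not scalable" conclusion concrete. It assumes computation proportional to N/P and communication proportional to N per processor, as stated above. The constant `comm_cost` and the unit costs are illustrative assumptions, and idle time is ignored.

```python
def efficiency(n_atoms, n_procs, comm_cost=0.01):
    """Parallel efficiency under an idealized atom-decomposition cost model."""
    sequential_time = n_atoms                # work proportional to N
    compute_time = n_atoms / n_procs         # N/P per processor
    communicate_time = comm_cost * n_atoms   # O(N) per processor
    parallel_time = compute_time + communicate_time
    return sequential_time / (n_procs * parallel_time)


small = efficiency(1_000, 64)
large = efficiency(1_000_000, 64)
more_procs = efficiency(1_000, 128)
```

In this model the efficiency simplifies to 1 / (1 + comm_cost · P): N cancels out, so growing the problem cannot restore the efficiency lost by adding processors.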
18
Force Decomposition
- Distribute the force matrix to processors
  - Matrix is sparse, non-uniform
  - Each processor has one block
  - Communication:
  - Ratio:
- Better scalability in practice (can use 100+ processors)
  - Plimpton:
  - Hwang, Saltz, et al.: 6% on 32 PEs, % on 128 processors
- Yet not scalable in the sense defined here!
19
Spatial Decomposition
Allocate close-by atoms to the same processor Three variations possible: Partitioning into P boxes, 1 per processor Good scalability, but hard to implement Partitioning into fixed size boxes, each a little larger than the cutoff distance Partitioning into smaller boxes Communication: O(N/P): so, scalable in principle
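The second variation, fixed-size boxes slightly larger than the cutoff, can be sketched as a simple binning step. The function name, the `margin` parameter, and the sample coordinates are assumptions for illustration only.

```python
import math
from collections import defaultdict


def assign_to_patches(positions, cutoff, margin=1.0):
    """Bin atoms into cubic boxes whose edge is a little larger than the cutoff."""
    edge = cutoff + margin
    patches = defaultdict(list)
    for index, (x, y, z) in enumerate(positions):
        key = (math.floor(x / edge), math.floor(y / edge), math.floor(z / edge))
        patches[key].append(index)
    return dict(patches)


atoms = [(0.5, 0.5, 0.5), (12.5, 0.5, 0.5), (13.5, 0.5, 0.5), (30.0, 30.0, 30.0)]
patches = assign_to_patches(atoms, cutoff=12.0)
```

Since the box edge exceeds the cutoff, any atom within the cutoff of a given atom lies in the same box or in one of its 26 neighbors, so each box only ever needs to talk to adjacent boxes.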
20
Spatial Decomposition in NAMD
- NAMD 1 used spatial decomposition
- Good theoretical isoefficiency, but load balancing problems for a fixed-size system
- For midsize systems, got good speedups up to 16 processors...
- Use the symmetry of Newton’s 3rd law to facilitate load balancing
21
Spatial Decomposition
But the load balancing problems are still severe:
23
FD + SD
- Now, we have many more objects to load balance:
  - Each diamond can be assigned to any processor
  - Number of diamonds (3D): 14 · number of patches
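The factor of 14 can be checked by counting. A patch interacts with itself and with its 26 neighbors; by Newton's third law each neighbor pair needs only one compute object, so each patch "owns" half of its neighbor pairs plus its self-interaction. The sketch below picks the owned half by a lexicographic rule, which is one convenient convention and an assumption here, and it ignores boundary patches that have fewer neighbors.

```python
from itertools import product


def owned_offsets():
    """Neighbor offsets for which a patch owns the pairwise compute object."""
    # Keep (0, 0, 0) for the self-interaction, plus one offset out of each
    # symmetric pair (d, -d): the lexicographically positive one.
    return [offset for offset in product((-1, 0, 1), repeat=3)
            if offset >= (0, 0, 0)]


offsets = owned_offsets()
computes_per_patch = len(offsets)
```

This gives 1 self-interaction plus 26 / 2 = 13 neighbor pairs per patch.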
24
Bond Forces
- Multiple types of forces:
  - Bonds (2), angles (3), dihedrals (4), ...
  - Luckily, each involves atoms in neighboring patches only
- Straightforward implementation:
  - Send message to all neighbors, receive forces from them
  - 26*2 messages per patch!
25
Bonded Forces:
Assume one patch per processor:
- an angle force involving atoms in patches (x1,y1,z1), (x2,y2,z2), (x3,y3,z3) is calculated in patch (max{xi}, max{yi}, max{zi})
[Figure: patches labeled A, B, and C]
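The rule for choosing where a bonded force is computed is a componentwise maximum over the patch coordinates of the atoms involved. A minimal sketch, with a hypothetical function name and made-up patch coordinates:

```python
def home_patch(patch_coords):
    """Patch that computes a bonded term: componentwise max of the atoms' patches."""
    xs, ys, zs = zip(*patch_coords)
    return (max(xs), max(ys), max(zs))


# An angle term whose three atoms live in three mutually neighboring patches.
angle_patches = [(1, 2, 3), (2, 2, 2), (1, 3, 3)]
home = home_patch(angle_patches)
offsets = [tuple(h - c for h, c in zip(home, patch)) for patch in angle_patches]
```

Note that the chosen patch need not hold any of the atoms itself. Because the atoms' patches are mutual neighbors, every offset component is 0 or 1, so a patch only ever sends its atoms in the non-negative direction along each axis.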
26
Implementation
- Multiple objects per processor
  - Different types: patches, pairwise forces, bonded forces
  - Each may have its data ready at different times
  - Need ability to map and remap them
  - Need prioritized scheduling
- Charm++ supports all of these
27
Load Balancing
- Is a major challenge for this application, especially for a large number of processors
- Unpredictable workloads
  - Each diamond (force object) and patch encapsulates a variable amount of work
  - Static estimates are inaccurate
- Measurement-based load balancing framework
  - Robert Brunner’s recent Ph.D. thesis
  - Very slow variations across timesteps
28
Bipartite graph balancing
Background load: Patches (integration, ..) and bond-related forces: Migratable load: Non-bonded forces Bipartite communication graph between migratable and non-migratable objects Challenge: Balance Load while minimizing communication
29
Load balancing strategy
Greedy variant (simplified):
  Sort compute objects (diamonds)
  Repeat (until all assigned)
    S = set of all processors that:
      -- are not overloaded
      -- generate least new communication
    P = least loaded processor in S
    Assign heaviest compute to P
Refinement:
  Repeat
    - Pick a compute from the most overloaded PE
    - Assign it to a suitable underloaded PE
  Until (no movement)
[Figure: a compute object between two cells]
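The greedy pass can be sketched in a few lines. This is an illustration under stated assumptions, not the NAMD implementation: "overloaded" is taken to mean exceeding `overload_factor` times the average load, "new communication" is counted as the number of patches a processor does not already hold, and the refinement pass is omitted. All names and the sample data are hypothetical.

```python
def greedy_assign(computes, background, patch_home, overload_factor=1.1):
    """Greedy pass: place the heaviest compute first, on the least loaded
    processor among those that stay under the limit and need the fewest
    new patch copies.

    computes:   {name: (cost, (patch, patch))}  migratable load
    background: {processor: load}               non-migratable load
    patch_home: {patch: processor}
    """
    total = sum(background.values()) + sum(cost for cost, _ in computes.values())
    limit = overload_factor * total / len(background)
    load = dict(background)
    # Patches whose data is already present on each processor.
    present = {pe: {p for p, home in patch_home.items() if home == pe}
               for pe in background}
    assignment = {}
    heaviest_first = sorted(computes.items(), key=lambda item: -item[1][0])
    for name, (cost, patches) in heaviest_first:
        # Fall back to every processor if all of them would be overloaded.
        candidates = [pe for pe in load if load[pe] + cost <= limit] or list(load)
        new_comm = {pe: len(set(patches) - present[pe]) for pe in candidates}
        least = min(new_comm.values())
        target = min((pe for pe in candidates if new_comm[pe] == least),
                     key=lambda pe: load[pe])
        assignment[name] = target
        load[target] += cost
        present[target].update(patches)
    return assignment, load


computes = {"AA": (4, ("A", "A")), "AB": (3, ("A", "B")), "BB": (2, ("B", "B"))}
assignment, load = greedy_assign(computes, background={0: 1, 1: 1},
                                 patch_home={"A": 0, "B": 1})
```

In this tiny example the self-interaction of patch A stays with A's processor, and the remaining two computes go to the other processor because the first one would exceed the overload limit.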
31
Initial Speedup Results: ASCI Red
32
BC1 complex: 200k atoms
33
Optimizations
Series of optimizations. Examples to be covered here:
- Grainsize distributions (bimodal)
- Integration: message sending overheads
34
Grainsize and Amdahl’s law
- A variant of Amdahl’s law, for objects, would be:
  - The fastest time can be no shorter than the time for the biggest single object!
- How did it apply to us?
  - Sequential step time was 57 seconds
  - To run on 2k processors, no object should take more than 28 msecs.
    - Should be even shorter
  - Grainsize analysis via Projections showed that was not so.
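The arithmetic behind the 28 msec figure, and the object-level bound itself, can be sketched as follows. The sketch assumes "2k" means 2048 processors and uses a made-up list of object times; only the 57-second step time comes from the text above.

```python
def step_time_lower_bound(object_times, n_procs):
    """A step can finish no faster than perfect division of the work,
    nor faster than its single biggest object."""
    return max(sum(object_times) / n_procs, max(object_times))


# Perfect division of the 57 s sequential step over 2048 processors (assumed).
ideal_seconds = 57.0 / 2048
ideal_msec = round(ideal_seconds * 1000)

# Hypothetical grainsize distribution with one 100 msec object among small ones.
object_times = [0.1] + [0.01] * 5690
bound = step_time_lower_bound(object_times, 2048)
```

With a single 100 msec object present, the step cannot take less than 100 msec no matter how many processors are added, which is well above the ideal of roughly 28 msec.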
35
Grainsize analysis
Problem: [Figure: grainsize distribution]
Solution: split compute objects that may have too much work, using a heuristic based on the number of interacting atoms
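One way such a splitting heuristic could look is sketched below. The threshold `max_pairs` and the use of a pair count as the work estimate are assumptions for illustration; the heuristic actually used in NAMD is not spelled out here.

```python
import math


def split_compute(n_pairs, max_pairs):
    """Split a compute with n_pairs interacting atom pairs into near-equal
    pieces, none larger than max_pairs."""
    if n_pairs <= max_pairs:
        return [n_pairs]
    pieces = math.ceil(n_pairs / max_pairs)
    base, extra = divmod(n_pairs, pieces)
    # The first `extra` pieces take one additional pair each.
    return [base + 1] * extra + [base] * (pieces - extra)


even_split = split_compute(1000, 300)
uneven_split = split_compute(10, 4)
small_compute = split_compute(3, 4)
```

Splitting into near-equal pieces, rather than peeling off fixed-size chunks, avoids creating one tiny leftover object per split.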
36
Grainsize reduced
37
Performance audit
38
Performance audit
Through the optimization process, an audit was kept to decide where to look to improve performance.
[Table: Ideal vs. Actual time for Total, nonBonded, Bonds, Integration, Overhead, Imbalance, Idle, Receives]
Integration time doubled
39
Integration overhead analysis
Problem: integration time had doubled from sequential run
40
Integration overhead example:
- The Projections pictures showed the overhead was associated with sending messages.
- Many cells were sending messages. The overhead was still too much compared with the cost of messages.
- Code analysis: memory allocations!
- An identical message was being sent to 30+ processors.
- Simple multicast support was added to Charm++
  - Mainly eliminates memory allocations (and some copying)
41
Integration overhead: After multicast
42
Improved Performance Data
43
Results on Linux Cluster
44
Performance of Apo-A1 on ASCI Red
45
Performance of Apo-A1 on O2k and T3E
46
Lessons learned
- Need to downsize objects!
  - Choose the smallest possible grainsize that amortizes overhead
- One of the biggest challenges was getting time for performance tuning runs on parallel machines
47
Future and Planned work
- Speedup on small molecules!
  - Interactive molecular dynamics
- Increased speedups on 2k-10k processors
  - Smaller grainsizes
  - New algorithms for reducing communication impact
  - New load balancing strategies
- Further performance improvements for PME/FMA
  - With multiple timestepping
  - Needs multi-phase load balancing
48
Steered MD: example picture
Image and simulation by the Theoretical Biophysics Group, Beckman Institute, UIUC
49
More information
- Charm++ and associated framework:
- NAMD and associated biophysics tools:
- Both include downloadable software
50
Performance: size of system
Performance data on Cray T3E
51
Performance: various machines