Scalability and interoperable libraries in NAMD Laxmikant (Sanjay) Kale Theoretical Biophysics group and Department of Computer Science University of Illinois.

Scalability and interoperable libraries in NAMD Laxmikant (Sanjay) Kale Theoretical Biophysics group and Department of Computer Science University of Illinois at Urbana-Champaign

Contributors
PIs:
– Laxmikant Kale, Klaus Schulten, Robert Skeel
NAMD 1:
– Robert Brunner, Andrew Dalke, Attila Gursoy, Bill Humphrey, Mark Nelson
NAMD 2:
– M. Bhandarkar, R. Brunner, A. Gursoy, J. Phillips, N. Krawetz, A. Shinozaki, K. Varadarajan, Gengbin Zheng, ...

Middle layers
[Diagram labels: Applications; Parallel Machines; “Middle Layers”: Languages, Tools, Libraries]

Molecular Dynamics
Collection of [charged] atoms, with bonds
Newtonian mechanics
At each time-step:
– Calculate forces on each atom: bonds; non-bonded (electrostatic and van der Waals)
– Calculate velocities and advance positions
1 femtosecond time-step, millions needed!
Thousands of atoms (1,000 - 100,000)
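The per-time-step loop above can be sketched in a few lines. This is only an illustration, assuming a simple semi-implicit Euler update (velocities first, then positions); it is not claimed to be the integrator NAMD uses, and `md_step`, `force_fn` and `constant_force` are hypothetical names.

```python
def md_step(positions, velocities, masses, force_fn, dt):
    """Advance all atoms by one time step (semi-implicit Euler sketch)."""
    # 1. Calculate the force on each atom from the current positions.
    forces = force_fn(positions)
    # 2. Update velocities: v <- v + (F / m) * dt.
    new_velocities = [
        tuple(v_k + (f_k / m) * dt for v_k, f_k in zip(v, f))
        for v, f, m in zip(velocities, forces, masses)
    ]
    # 3. Advance positions with the updated velocities: x <- x + v * dt.
    new_positions = [
        tuple(x_k + v_k * dt for x_k, v_k in zip(x, v))
        for x, v in zip(positions, new_velocities)
    ]
    return new_positions, new_velocities


def constant_force(positions):
    """Hypothetical force field: the same unit force along x on every atom."""
    return [(1.0, 0.0, 0.0) for _ in positions]


# One atom of mass 2 starting at rest at the origin (arbitrary units).
x1, v1 = md_step([(0.0, 0.0, 0.0)], [(0.0, 0.0, 0.0)], [2.0], constant_force, dt=1.0)
```

In a real simulation `force_fn` is where nearly all the time goes, which is why the rest of the talk concentrates on parallelizing the force computation.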

Cut-off radius
Use of cut-off radius to reduce work
– 8 - 14 Å
– Faraway charges ignored!
80-95% of the work is non-bonded force computations
Some simulations need faraway contributions
– Periodic systems: Ewald, Particle-Mesh Ewald (PME)
– Aperiodic systems: fast multipole algorithm (FMA)
Even so, cut-off based computations are important:
– near-atom calculations are part of the above
– multiple time-stepping is used: k cut-off steps, 1 PME/FMA
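To make the cut-off idea concrete, here is a minimal sketch that selects the atom pairs close enough to interact. It uses a brute-force all-pairs scan for clarity, and the coordinates are hypothetical values in Å chosen only for illustration.

```python
import itertools
import math


def pairs_within_cutoff(positions, cutoff):
    """Return index pairs (i, j), i < j, whose separation is at most `cutoff`."""
    pairs = []
    for (i, a), (j, b) in itertools.combinations(enumerate(positions), 2):
        if math.dist(a, b) <= cutoff:
            pairs.append((i, j))
    return pairs


# Hypothetical coordinates in angstroms.
atoms = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (0.0, 11.0, 0.0), (30.0, 30.0, 30.0)]
near = pairs_within_cutoff(atoms, cutoff=12.0)
```

With a 12 Å cut-off only two of the six possible pairs remain. The scan itself still visits every pair, so a production code would combine the cut-off with a spatial binning of the atoms.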

Scalability
The program should scale up to use a large number of processors.
– But what does that mean?
An individual simulation isn’t truly scalable
Better definition of scalability:
– If I double the number of processors, I should be able to retain parallel efficiency by increasing the problem size

Isoefficiency
Quantify scalability
– (Work of Vipin Kumar, U. Minnesota)
How much increase in problem size is needed to retain the same efficiency on a larger machine?
Efficiency: Seq. Time / (P · Parallel Time)
– Parallel time = computation + communication + idle
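The efficiency definition on this slide translates directly into code. The sketch below uses hypothetical timings in seconds; the function names are illustrative only.

```python
def parallel_time(computation, communication, idle):
    """Per-step parallel time as the sum of its three components."""
    return computation + communication + idle


def parallel_efficiency(seq_time, num_procs, par_time):
    """Efficiency = sequential time / (P * parallel time)."""
    return seq_time / (num_procs * par_time)


# Hypothetical timings: 100 s sequentially, 8 processors.
t_par = parallel_time(computation=12.5, communication=2.0, idle=0.5)
eff = parallel_efficiency(seq_time=100.0, num_procs=8, par_time=t_par)
```

Here the parallel time is 15 s, so the efficiency is 100 / 120, about 0.83. Isoefficiency asks how fast the problem size must grow for this number to stay fixed as P increases.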

Traditional Approaches
Replicated Data:
– All atom coordinates stored on each processor
– Non-bonded forces distributed evenly
– Analysis: assume N atoms, P processors
   Computation: O(N/P)
   Communication: O(N log P)
   Communication/Computation ratio: O(P log P)
   Fraction of communication increases with number of processors, independent of problem size!
– So, not scalable by this definition

Atom decomposition
Partition the atoms array across processors
– Nearby atoms may not be on the same processor
– Communication: O(N) per processor
– Communication/Computation: O(N) / O(N/P) = O(P)
– Again, not scalable by our definition

Force Decomposition
Distribute force matrix to processors
– Matrix is sparse, non-uniform
– Each processor has one block
– Communication: O(N/√P)
– Ratio: O(√P)
Better scalability in practice
– (can use 100+ processors)
– Plimpton:
– Hwang, Saltz, et al.: 6% on 32 PEs, 36% on 128 processors
Yet not scalable in the sense defined here!

Spatial Decomposition
Allocate close-by atoms to the same processor
Three variations possible:
– Partitioning into P boxes, 1 per processor
   Good scalability, but hard to implement
– Partitioning into fixed-size boxes, each a little larger than the cut-off distance
– Partitioning into smaller boxes
Communication: O(N/P)
– so, scalable in principle
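The asymptotic communication-to-computation ratios of the four decomposition schemes can be compared side by side. The sketch below drops all constant factors, so only the growth trend with P is meaningful. The replicated-data, atom and spatial entries follow the slides; the √P entry for force decomposition is taken from the standard analysis of that method rather than from this transcript, and should be read as an assumption.

```python
import math


def comm_to_comp_ratio(scheme, num_procs):
    """Asymptotic communication/computation ratio with unit constants."""
    p = num_procs
    if scheme == "replicated":
        return p * math.log2(p)   # O(N log P) / O(N/P)
    if scheme == "atom":
        return float(p)           # O(N) / O(N/P)
    if scheme == "force":
        return math.sqrt(p)       # assumed: O(N/sqrt(P)) / O(N/P)
    if scheme == "spatial":
        return 1.0                # O(N/P) / O(N/P), independent of P
    raise ValueError(f"unknown scheme: {scheme}")


schemes = ("replicated", "atom", "force", "spatial")
ratios_64 = {s: comm_to_comp_ratio(s, 64) for s in schemes}
```

Only the spatial scheme keeps the ratio constant as P grows, which is what "scalable in principle" means under the isoefficiency definition.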

Spatial Decomposition in NAMD
NAMD 1 used spatial decomposition
Good theoretical isoefficiency, but load balancing problems for a fixed-size system
For midsize systems, got good speedups up to 16 processors...
Use the symmetry of Newton’s 3rd law to facilitate load balancing
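A minimal sketch of the fixed-size-box variant: each atom is binned into a cubic box (a "patch") whose side is a little larger than the cut-off, so that every interaction partner lies in the same or an adjacent box. The function name, the 12.5 Å box size and the coordinates are hypothetical.

```python
import math
from collections import defaultdict


def assign_to_patches(positions, patch_size):
    """Map each atom index to the integer grid cell that contains it."""
    patches = defaultdict(list)
    for idx, (x, y, z) in enumerate(positions):
        key = (
            math.floor(x / patch_size),
            math.floor(y / patch_size),
            math.floor(z / patch_size),
        )
        patches[key].append(idx)
    return dict(patches)


# Hypothetical coordinates in angstroms; box side slightly above a 12 A cut-off.
coords = [(1.0, 1.0, 1.0), (13.0, 1.0, 1.0), (2.0, 2.0, 2.0), (-1.0, 0.0, 0.0)]
patches = assign_to_patches(coords, patch_size=12.5)
```

The number of atoms per box depends on local density, which is one source of the load imbalance mentioned above.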

Spatial Decomposition But the load balancing problems are still severe:

FD + SD (force decomposition + spatial decomposition)
Now, we have many more objects to load balance:
– Each diamond can be assigned to any processor
– Number of diamonds (3D): 14 · number of patches
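The factor of 14 can be checked by enumeration. A patch has 26 neighbours in 3D; by Newton's third law each neighbour pair needs only one force object, so a patch owns half of them (13) plus one object for its own atoms. The sketch below assumes an interior patch or periodic boundaries and uses a lexicographic tie-break to decide which patch of a pair owns the object; that tie-break is an illustrative choice, not necessarily the one NAMD makes.

```python
import itertools


def owned_pair_offsets():
    """Offsets (dx, dy, dz) of the patches a given patch owns a pair object with.

    (0, 0, 0) is the self-interaction; of every opposite pair of neighbour
    offsets, only the lexicographically positive one is kept.
    """
    return [
        off
        for off in itertools.product((-1, 0, 1), repeat=3)
        if off >= (0, 0, 0)
    ]


offsets = owned_pair_offsets()
objects_per_patch = len(offsets)
```

Because these objects are much more numerous than the patches and can be placed on any processor, they give the load balancer finer-grained units to move.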

Bond Forces
Multiple types of forces:
– Bonds (2), Angles (3), Dihedrals (4), ...
– Luckily, each involves atoms in neighboring patches only
Straightforward implementation:
– Send message to all neighbors,
– receive forces from them
– 26*2 messages per patch!

Bonded Forces:
Assume one patch per processor:
– an angle force involving atoms in patches (x1,y1,z1), (x2,y2,z2), (x3,y3,z3) is calculated in patch (max{xi}, max{yi}, max{zi})
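The ownership rule is a componentwise maximum over the patch coordinates of the atoms involved. A small sketch, with a hypothetical function name and example coordinates:

```python
def owning_patch(patch_coords):
    """Patch that computes a bonded term: componentwise max of the atoms' patches."""
    return tuple(max(axis) for axis in zip(*patch_coords))


# An angle whose three atoms live in three different, mutually adjacent patches.
owner = owning_patch([(1, 2, 3), (2, 2, 2), (1, 3, 2)])
```

Since the atoms of one term sit in neighbouring patches, the owner's index is equal to or one larger than each atom's patch index on every axis. A patch therefore needs to send its coordinates to at most 7 such neighbours rather than all 26.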

Implementation
Multiple objects per processor
– Different types: patches, pairwise forces, bonded forces
– Each may have its data ready at different times
– Need ability to map and remap them
– Need prioritized scheduling
Charm++ supports all of these

Charm++
Parallel C++ with data-driven objects
Object groups:
– global object with a “representative” on each PE
Asynchronous method invocation
Prioritized scheduling
Mature, robust, portable
http://charm.cs.uiuc.edu

Data driven execution
[Figure labels: Scheduler, Message Q]
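The idea of data-driven execution with prioritized scheduling can be illustrated with a toy single-process model: method invocations are queued as messages, and a scheduler repeatedly picks the most urgent one and runs the target object's method. This is a Python analogy only, not Charm++ code or its API; the class names and the convention that a smaller number means higher priority are assumptions made for the sketch.

```python
import heapq
import itertools


class Scheduler:
    """Toy per-processor scheduler with a prioritized message queue."""

    def __init__(self):
        self._queue = []
        self._order = itertools.count()  # tie-breaker keeps FIFO order

    def send(self, priority, target, method, *args):
        """Asynchronous invocation: enqueue the call instead of running it."""
        heapq.heappush(self._queue, (priority, next(self._order), target, method, args))

    def run(self):
        """Deliver queued messages, most urgent (smallest number) first."""
        while self._queue:
            _, _, target, method, args = heapq.heappop(self._queue)
            getattr(target, method)(*args)


class Recorder:
    """Hypothetical object that just logs the messages it receives."""

    def __init__(self):
        self.log = []

    def receive(self, tag):
        self.log.append(tag)


sched = Scheduler()
obj = Recorder()
sched.send(2, obj, "receive", "low")
sched.send(0, obj, "receive", "urgent")
sched.send(1, obj, "receive", "normal")
sched.run()
```

After `run()` the log reads urgent, normal, low: execution order is driven by message availability and priority, not by the order in which the calls were issued.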

Load Balancing
Is a major challenge for this application
– especially for a large number of processors
Unpredictable workloads
– Each diamond (force object) and patch encapsulates a variable amount of work
– Static estimates are inaccurate
Measurement-based load balancing framework
– Robert Brunner’s recent Ph.D. thesis
– Very slow variations across timesteps

Bipartite graph balancing
Background load:
– Patches (integration, ...) and bond-related forces
Migratable load:
– Non-bonded forces
– Bond-related forces involving atoms of the same patch
Bipartite communication graph
– between migratable and non-migratable objects
Challenge:
– Balance load while minimizing communication

Load balancing
Collect timing data for several cycles
Run heuristic load balancer
– Several alternative ones
Re-map and migrate objects accordingly
– Registration mechanisms facilitate migration
Needs a separate talk!
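As one possible shape for the "heuristic load balancer" step, the sketch below places measured migratable objects on processors greedily: heaviest object first, always onto the currently least-loaded processor, starting from each processor's non-migratable background load. This is a generic greedy heuristic chosen for illustration, not necessarily one of the strategies used in NAMD, and it ignores the communication cost that the bipartite-graph formulation is meant to minimize. All names and load values are hypothetical.

```python
import heapq


def greedy_balance(background_load, object_loads):
    """Assign migratable objects to processors, heaviest first.

    background_load: measured non-migratable load per processor.
    object_loads: dict of object id -> measured load.
    Returns (assignment dict, final load per processor).
    """
    heap = [(load, proc) for proc, load in enumerate(background_load)]
    heapq.heapify(heap)
    assignment = {}
    by_weight = sorted(object_loads.items(), key=lambda kv: kv[1], reverse=True)
    for obj, load in by_weight:
        proc_load, proc = heapq.heappop(heap)  # least-loaded processor
        assignment[obj] = proc
        heapq.heappush(heap, (proc_load + load, proc))
    final = [0.0] * len(background_load)
    for load, proc in heap:
        final[proc] = load
    return assignment, final


# Hypothetical measurements: two processors, four force objects.
assignment, final_load = greedy_balance(
    background_load=[4.0, 1.0],
    object_loads={"c0": 3.0, "c1": 2.0, "c2": 2.0, "c3": 1.0},
)
```

In this example the final loads are 7.0 and 6.0. Because workloads vary only slowly across timesteps, a placement computed from a few measured cycles stays useful for many subsequent steps.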

Performance: size of system
Performance data on Cray T3E

Performance: various machines

Speedup

Recent Speedup Results: ASCI Red
