
What's New With NAMD: Triumph and Torture with New Platforms


1 What's New With NAMD: Triumph and Torture with New Platforms
Jim Phillips and Chee Wai Lee, Theoretical and Computational Biophysics Group

2 What is NAMD?
Molecular dynamics and related algorithms: e.g., minimization, steering, locally enhanced sampling, alchemical and conformational free energy perturbation
Efficient algorithms for full electrostatics
Effective on affordable commodity hardware
Reads file formats from standard packages: X-PLOR (NAMD 1.0), CHARMM (NAMD 2.0), Amber (NAMD 2.3), GROMACS (NAMD 2.4)
Building a complete modeling environment

3 Towards Understanding Membrane Channels
The versatile, highly selective, and efficient aquaporin
Deposited at the web site of the Nobel Museum

4 Protein Redesign Seeks a Photosynthetic Source for Hydrogen Gas
Algal hydrogenase
57,000 atoms, periodic boundary conditions
CHARMM27 force field; NVT: constant volume and temperature
PME full electrostatics
TeraGrid benchmark: 0.24 day/ns on 64 Itanium 1.5 GHz processors
Collaboration with DOE National Renewable Energy Lab, Golden, CO

5 ATP-Synthase: One Shaft, Two Motors
Soluble part, F1-ATPase (330,000-atom simulation): synthesizes ATP when torque is applied to it (main function of this unit); produces torque when it hydrolyzes ATP (not main function)
Membrane-bound part, F0 complex (130,000-atom simulation): produces torque when a positive proton gradient exists across the membrane (main function of this unit); pumps protons when torque is applied (not main function)
Torque is transmitted between the motors via the central stalk.
[Figure: structure labeled with approximate dimensions of ~60 Å, ~80 Å, ~100 Å, and ~200 Å]

6 Molecular Mechanics Force Field
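
For orientation, the CHARMM-style potential energy function that NAMD evaluates has the standard textbook form (written here from general knowledge, not copied from the slide's figure):

    U = \sum_{\mathrm{bonds}} k_b (b - b_0)^2
      + \sum_{\mathrm{angles}} k_\theta (\theta - \theta_0)^2
      + \sum_{\mathrm{dihedrals}} k_\phi \bigl[ 1 + \cos(n\phi - \delta) \bigr]
      + \sum_{i<j} \left\{ \varepsilon_{ij} \left[ \left( \frac{R_{\min,ij}}{r_{ij}} \right)^{12}
          - 2 \left( \frac{R_{\min,ij}}{r_{ij}} \right)^{6} \right]
          + \frac{q_i q_j}{4 \pi \varepsilon_0 r_{ij}} \right\}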

7 Biomolecular Time Scales
Max Timestep: 1 fs

8 Example Simulation: GlpF
NAMD with PME, periodic boundary conditions, NPT ensemble at 310 K
Protein: ~15,000 atoms; lipids: ~40,000 atoms; water: ~51,000 atoms; total: ~106,000 atoms
PSC TCS CPUs: 4 hours per ns
M. Jensen, E. Tajkhorshid, K. Schulten, Structure 9, 1083 (2001)
E. Tajkhorshid et al., Science 296 (2002)

9 Typical Simulation Statistics
~100,000 atoms (including water and lipid)
10-20 MB of data for the entire system
100 Å per side periodic cell
12 Å cutoff for short-range nonbonded terms
10,000,000 timesteps (10 ns)
4 s/step on one processor (1.3 years total!)
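
The 1.3-year figure is just the product of the last two items:

    10^7 \ \mathrm{steps} \times 4 \ \mathrm{s/step} = 4 \times 10^7 \ \mathrm{s} \approx 463 \ \mathrm{days} \approx 1.3 \ \mathrm{years}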

10 Parallel MD: Easy or Hard?
Easy:
Tiny working data
Spatial locality
Uniform atom density
Persistent repetition
Multiple timestepping
Hard:
Sequential timesteps
Short iteration time
Full electrostatics
Fixed problem size

11 Poorly Scaling Approaches
Replicated data: all atom coordinates stored on each processor; communication/computation ratio O(P log P)
Partition the atom array across processors: nearby atoms may not be on the same processor; C/C ratio O(P)
Distribute the force matrix across processors: the matrix is sparse and non-uniform; C/C ratio O(sqrt(P))

12 Spatial Decomposition: NAMD 1
Atoms spatially distributed into cubes (cells, cubes, or "patches"); see the sketch below
Size of each cube: just slightly larger than the cutoff radius
Communicate only with neighbors
Work for each pair of neighbors
C/C ratio: O(1)
However: load imbalance and limited parallelism
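
The cutoff-sized binning can be pictured with a short C++ sketch (a minimal illustration, not NAMD source, ignoring periodic wrapping): atoms go into cubes at least one cutoff wide, so every interacting pair is found by checking a cube against itself and its nearby neighbors, and the work per cube stays constant as the number of cubes grows.

    // Hedged sketch of cutoff-sized spatial binning (not NAMD source).
    #include <algorithm>
    #include <vector>

    struct Atom { double x, y, z; };   // coordinates assumed in [0, box) per axis

    struct PatchGrid {
        int nx, ny, nz;                        // cubes per box dimension
        double lx, ly, lz;                     // box lengths (orthorhombic)
        std::vector<std::vector<int>> patch;   // atom indices in each cube

        PatchGrid(double bx, double by, double bz, double cutoff)
            : nx(std::max(1, int(bx / cutoff))),
              ny(std::max(1, int(by / cutoff))),
              nz(std::max(1, int(bz / cutoff))),
              lx(bx), ly(by), lz(bz),
              patch(size_t(nx) * ny * nz) {}

        int id(int ix, int iy, int iz) const { return (iz * ny + iy) * nx + ix; }

        void bin(const std::vector<Atom>& atoms) {
            for (int i = 0; i < (int)atoms.size(); ++i) {
                int ix = std::min(nx - 1, int(atoms[i].x / lx * nx));
                int iy = std::min(ny - 1, int(atoms[i].y / ly * ny));
                int iz = std::min(nz - 1, int(atoms[i].z / lz * nz));
                patch[id(ix, iy, iz)].push_back(i);
            }
        }

        // Visit each cube with itself and its "forward" neighbors: the only
        // cube pairs that can contain atoms within the cutoff.  Work per cube
        // is O(1), which is the C/C argument on the slide.
        template <class F>
        void forNeighborPairs(F&& f) const {
            for (int iz = 0; iz < nz; ++iz)
              for (int iy = 0; iy < ny; ++iy)
                for (int ix = 0; ix < nx; ++ix)
                  for (int dz = -1; dz <= 1; ++dz)
                    for (int dy = -1; dy <= 1; ++dy)
                      for (int dx = -1; dx <= 1; ++dx) {
                          // Half-shell: visit each unordered cube pair once.
                          if (dz < 0 || (dz == 0 && (dy < 0 || (dy == 0 && dx < 0))))
                              continue;
                          int jx = ix + dx, jy = iy + dy, jz = iz + dz;
                          if (jx < 0 || jy < 0 || jz < 0 ||
                              jx >= nx || jy >= ny || jz >= nz) continue;
                          f(patch[id(ix, iy, iz)], patch[id(jx, jy, jz)]);
                      }
        }
    };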

13 Hybrid Decomposition: NAMD 2
Spatially decompose data and communication.
Separate but related work decomposition.
"Compute objects" facilitate an iterative, measurement-based load balancing system (see the sketch below).
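
The measurement-based load balancing can be sketched as a greedy pass over measured compute-object times. This is a simplified illustration only, not Charm++'s actual balancer (which also weighs communication locality and refines placements): heaviest objects are placed first, each on the currently least-loaded processor.

    // Hedged sketch of measurement-based greedy load balancing.
    #include <algorithm>
    #include <functional>
    #include <queue>
    #include <utility>
    #include <vector>

    struct ComputeObject { int id; double measuredMs; };  // ids assumed 0..N-1

    std::vector<int> assignComputes(std::vector<ComputeObject> objs, int numPes) {
        std::sort(objs.begin(), objs.end(),
                  [](const ComputeObject& a, const ComputeObject& b) {
                      return a.measuredMs > b.measuredMs;   // heaviest first
                  });
        using Load = std::pair<double, int>;                // (load, processor)
        std::priority_queue<Load, std::vector<Load>, std::greater<Load>> pes;
        for (int p = 0; p < numPes; ++p) pes.push({0.0, p});

        std::vector<int> placement(objs.size());            // object id -> PE
        for (const ComputeObject& obj : objs) {
            auto [load, pe] = pes.top();                     // least-loaded PE
            pes.pop();
            placement[obj.id] = pe;
            pes.push({load + obj.measuredMs, pe});           // PE absorbs its cost
        }
        return placement;
    }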

14 Particle Mesh Ewald
The PME calculation adds:
A global grid of modest size (e.g., 192x144x144)
Distribution of each atom's charge onto a 4x4x4 sub-grid (see the sketch below)
A 3D FFT over the grid, hence O(N log N) performance
Strategy:
Use a smaller subset of processors for PME
Overlap PME with the cutoff computation
Use the same processors for both PME and cutoff
Multiple time-stepping reduces the scaling impact
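
The charge-spreading step can be sketched in a few lines (a minimal illustration, not NAMD's PME code): each atom's charge is deposited onto a 4x4x4 block of grid points with fourth-order cardinal B-spline weights, in the style of smooth PME (Essmann et al.); a 3D FFT of the resulting grid then gives the O(N log N) reciprocal-space sum.

    // Hedged sketch of PME charge spreading with 4th-order B-spline weights.
    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Cardinal B-spline M_n evaluated by its standard recursion.
    double Mspline(int n, double u) {
        if (u <= 0.0 || u >= n) return 0.0;
        if (n == 2) return 1.0 - std::fabs(u - 1.0);
        return (u * Mspline(n - 1, u) + (n - u) * Mspline(n - 1, u - 1.0)) / (n - 1);
    }

    struct ChargeGrid {
        int K1, K2, K3;                 // e.g., 192 x 144 x 144 as on the slide
        std::vector<double> q;          // accumulated charge per grid point
        ChargeGrid(int k1, int k2, int k3)
            : K1(k1), K2(k2), K3(k3), q(std::size_t(k1) * k2 * k3, 0.0) {}

        double& at(int i, int j, int k) {                    // periodic wrap
            i = (i % K1 + K1) % K1; j = (j % K2 + K2) % K2; k = (k % K3 + K3) % K3;
            return q[(std::size_t(k) * K2 + j) * K1 + i];
        }

        // s1, s2, s3 are the atom's fractional coordinates in [0, 1).
        void spread(double charge, double s1, double s2, double s3) {
            double u1 = s1 * K1, u2 = s2 * K2, u3 = s3 * K3;
            int i0 = int(std::floor(u1)), j0 = int(std::floor(u2)), k0 = int(std::floor(u3));
            for (int a = 0; a < 4; ++a)           // 4x4x4 neighborhood of points
              for (int b = 0; b < 4; ++b)
                for (int c = 0; c < 4; ++c) {
                    int i = i0 - a, j = j0 - b, k = k0 - c;
                    at(i, j, k) += charge * Mspline(4, u1 - i)
                                          * Mspline(4, u2 - j)
                                          * Mspline(4, u3 - k);
                }
        }
    };
    // After all atoms are spread, a 3D FFT of q gives the reciprocal-space sum.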

15 NAMD 2 with PME: Parallelization Using Charm++
[Figure: parallel decomposition diagram; labels include 700, 30,000, 144, and 192]

16 Avoiding Barriers
In NAMD:
The energy reductions were made asynchronous (sketched below).
No other global barriers are used in cutoff simulations.
This came in handy when running on Pittsburgh's Lemieux (3000 processors):
The machine (and how Converse uses the network) produced unpredictable, random communication delays.
A send call would remain stuck for 20 ms, for example.
Each timestep, ideally, took only milliseconds.
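
NAMD implements this with Charm++ reductions; as a hedged illustration of the same pattern in MPI-3 terms (MPI_Iallreduce, not what NAMD actually calls), the reduction started for one step is only harvested after the next step's work is done, so no timestep ever sits at a global barrier and printed energies lag the simulation by a step.

    // Hedged MPI-3 sketch of an asynchronous energy reduction.
    #include <mpi.h>
    #include <cstdio>

    void run_steps(int nsteps, MPI_Comm comm) {
        int rank;
        MPI_Comm_rank(comm, &rank);
        double localEnergy = 0.0, sendCopy = 0.0, totalEnergy = 0.0;
        MPI_Request pending = MPI_REQUEST_NULL;
        for (int step = 0; step < nsteps; ++step) {
            // ... integrate one timestep here, updating localEnergy ...
            if (pending != MPI_REQUEST_NULL) {
                MPI_Wait(&pending, MPI_STATUS_IGNORE);   // harvest previous sum
                if (rank == 0)
                    std::printf("step %d energy %g\n", step - 1, totalEnergy);
            }
            sendCopy = localEnergy;     // snapshot; untouched while in flight
            MPI_Iallreduce(&sendCopy, &totalEnergy, 1, MPI_DOUBLE,
                           MPI_SUM, comm, &pending);
        }
        if (pending != MPI_REQUEST_NULL) {
            MPI_Wait(&pending, MPI_STATUS_IGNORE);
            if (rank == 0)
                std::printf("step %d energy %g\n", nsteps - 1, totalEnergy);
        }
    }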

17 Handling Network Delays

18 SC2002 Gordon Bell Award
Lemieux (PSC), 327K atoms with PME
[Plot: time per step vs. number of processors, showing linear scaling from 28 s per step down to 36 ms per step at 76% efficiency]

19 Major New Platforms
SGI Altix
Cray XT3 "Red Storm"
IBM BG/L "Blue Gene"

20 SGI Altix 3000
Itanium-based successor to the Origin series
1.6 GHz Itanium 2 CPUs with 9 MB cache
Cache-coherent NUMA shared memory
Runs Linux (with some SGI modifications)
NCSA has two 512-processor machines

21 Porting NAMD to the Altix
A normal Itanium binary just works.
Best serial performance ever; better than other Itanium platforms (TeraGrid) at the same clock speed.
Building with SGI MPI just works; setenv MPI_DSM_DISTRIBUTE is needed.
Superlinear speedups from 16 to 64 processors (good network, running mostly in cache at 64).
Decent scaling to 256 processors (for the ApoA1 benchmark).
Intel 8.1 and later compilers have performance issues.

22 NAMD on New Platforms
[Plot: 92K-atom PME benchmark; time per step vs. number of processors (perfect scaling is a horizontal line), comparing NCSA 3.06 GHz Xeon, PSC Cray XT3, TeraGrid 1.5 GHz Itanium 2, and NCSA Altix 1.6 GHz Itanium 2; best point 21 ms/step (4.1 ns/day)]

23 Altix Conclusions
Nice machine, easy to port to
Code must run well on Itanium
Perfect for the typical NAMD user
Fastest serial performance
Scales well to typical processor counts
Full environment, no surprises
TCBG's favorite platform for the past year

24 Altix Transforms Interactive NAMD
GlpF IMD benchmark: 4210 atoms, 3295 fixed atoms, 10 Å cutoff, no PME
8-fold performance growth: 2001 to 2003, 72% faster; 2003 to 2004, 39% faster; 2004 to 2005, 239% faster
At a 2 fs timestep this reaches 1 ps of simulation per second of wall clock.
[Plot: steps per second vs. processors for a 1.33 GHz Athlon (2001), a 2.13 GHz system (2003), a 3.06 GHz Xeon (2004), and a 1.6 GHz Altix (2005); photo of a VMD user (HHS Secretary Thompson)]

25 Cray XT3 (Red Storm)
Each node:
A single AMD Opteron 100-series processor
57 ns memory latency
6.4 GB/s memory bandwidth
6.4 GB/s HyperTransport to the Seastar network
Seastar router chip:
6 ports (3D torus topology)
7.6 GB/s per port (in the fixed Seastar 2)
Poor latency (vs. the XD1, according to Cray)

26 Cray XT3 (Red Storm)
4 nodes per blade, 8 blades per chassis, 3 chassis per cabinet (plus one big fan)
PSC's machine (Big Ben) has 22 chassis: 2068 compute processors
A performance boost over PSC's TCS system (Lemieux)

27 Cray XT3 (Red Storm)
Service and I/O nodes run Linux; normal x86-64 binaries just work on them.
Compute nodes run the Catamount kernel:
No OS interference for fine-grained parallelism
No time sharing: one process at a time
No sockets
No interrupts
No virtual memory
System calls forwarded to the head node (slow!)

28 Cray XT3 Porting
Initial compile mostly straightforward: disable Tcl, sockets, hostname, and username code.
Initial runs were horribly slow on startup, almost as if memory allocation were O(n^2).
Found the docs: "simple implementation of malloc(), optimized for the lightweight kernel and large memory allocations."
Sounds like they assume a stack-like allocation pattern.
Linking with -lgmalloc restores sane performance (a micro-benchmark of the symptom is sketched below).
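
A micro-benchmark of the kind that exposes this behavior (a hedged sketch, not the actual diagnosis code): time batches of small allocations; with a sane allocator the time grows roughly linearly in the number of blocks, while an allocator that scans all prior allocations grows roughly quadratically.

    // Hedged sketch: timing many small allocations, as NAMD does at startup.
    #include <chrono>
    #include <cstdio>
    #include <cstdlib>
    #include <vector>

    int main() {
        for (int n = 100000; n <= 1600000; n *= 2) {
            std::vector<void*> blocks;
            blocks.reserve(n);
            auto t0 = std::chrono::steady_clock::now();
            for (int i = 0; i < n; ++i)
                blocks.push_back(std::malloc(64));   // many small allocations
            auto t1 = std::chrono::steady_clock::now();
            std::printf("%8d allocations: %.3f s\n", n,
                        std::chrono::duration<double>(t1 - t0).count());
            for (void* p : blocks) std::free(p);
        }
        return 0;   // time should grow about linearly with a sane malloc
    }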

29 Cray XT3 Porting
Still somewhat slow on startup:
Need to do all I/O to Lustre scratch space.
May be better when the head node isn't overloaded.
Tried a SHMEM port (old T3E layer):
The new library doesn't support locks yet.
SHMEM was optimized for the T3E, not the XT3.
Need Tcl for a fully functional NAMD:
#ifdef out all socket and user-info code (illustrated below).
The same approach should work on BG/L.
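
The "#ifdef out" approach looks roughly like this (an illustration only; the guard macro NAMD_NO_SOCKETS is hypothetical, not NAMD's actual build flag):

    // Hedged illustration of compiling out socket and user-info code.
    #ifndef NAMD_NO_SOCKETS
    #include <pwd.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #endif

    const char* currentUserName() {
    #ifdef NAMD_NO_SOCKETS
        // Catamount compute nodes: no sockets, no user database; fall back.
        return "unknown";
    #else
        if (passwd* pw = getpwuid(getuid())) return pw->pw_name;
        return "unknown";
    #endif
    }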

30 Cray XT3 Porting
Random crashes, even on short benchmarks:
Same NAMD code as elsewhere, same MPI layer as on other platforms.
Tried the debugger (TotalView): still buggy, won't attach to running jobs.
Managed to load a core file and found a pcqueue with an item count of -1.
Checking the item count apparently fixes the problem (sketched below).
Probably a compiler bug: the code looks fine.
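
The fix amounts to a defensive check before popping (a hedged sketch, not Charm++'s actual PCQueue code): a non-positive item count is treated as an empty queue instead of being trusted.

    // Hedged sketch of the defensive item-count check.
    struct PCQueueSketch {
        void** items;      // ring buffer of queued messages
        int    capacity;
        int    head;       // index of the next item to pop
        int    count;      // number of queued items

        void* pop() {
            if (count <= 0) return nullptr;   // the added guard: treat a bad
                                              // (e.g., -1) count as "empty"
            void* msg = items[head];
            head = (head + 1) % capacity;
            --count;
            return msg;
        }
    };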

31 Cray XT3 Porting
Performance limited (on 256 CPUs), but only when printing energies every step.
NAMD streams better than direct CmiPrintf().
I/O is unbuffered by default: 20 ms per write.
Create a large buffer and remove NAMD's flushes (sketched below); this fixes the performance problem.
Can hit 6 ms/step on 1024 CPUs: very good.
But there is no output until the end of the job, so a crash may lose it all.
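
The buffering fix can be sketched with standard C stdio calls (an illustration; the 4 MB buffer size is an assumption, not the value actually used): make stdout fully buffered with setvbuf and stop flushing after every energy line.

    // Hedged sketch of fully buffered output without per-line flushes.
    #include <cstdio>
    #include <vector>

    static std::vector<char> outBuffer(4 * 1024 * 1024);  // 4 MB, for example

    void configureOutput() {
        // Must run before the first write to stdout.  _IOFBF = full buffering:
        // data goes out only when the buffer fills (or at exit), instead of
        // paying ~20 ms of write cost per energy line.
        std::setvbuf(stdout, outBuffer.data(), _IOFBF, outBuffer.size());
    }

    void printEnergies(long step, double total) {
        std::printf("ENERGY: %ld %f\n", step, total);
        // Deliberately no fflush(stdout) here; the trade-off on the slide is
        // that nothing appears until the job ends, and a crash loses it all.
    }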

32 NAMD on New Platforms
[Plot (same benchmark as slide 22): 92K-atom PME system; time per step vs. number of processors (perfect scaling is a horizontal line), comparing NCSA 3.06 GHz Xeon, PSC Cray XT3, TeraGrid 1.5 GHz Itanium 2, and NCSA Altix 1.6 GHz Itanium 2; best point 21 ms/step (4.1 ns/day)]

33 Cray XT3 Conclusions
Serial performance is reasonable: Itanium is faster for NAMD, but the Opteron requires less tuning work.
Scaling is outstanding (eventually): low system noise allows 6 ms timesteps, and NAMD's latency tolerance may help.
Lack of OS features is annoying, but workable.
TCBG's main allocation for this year.

