What's New With NAMD: Triumph and Torture with New Platforms

What's New With NAMD: Triumph and Torture with New Platforms
Jim Phillips and Chee Wai Lee
Theoretical and Computational Biophysics Group
http://www.ks.uiuc.edu/Research/namd/

What is NAMD?
Molecular dynamics and related algorithms, e.g., minimization, steering, locally enhanced sampling, alchemical and conformational free energy perturbation
Efficient algorithms for full electrostatics
Effective on affordable commodity hardware
Reads file formats from standard packages: X-PLOR (NAMD 1.0), CHARMM (NAMD 2.0), Amber (NAMD 2.3), GROMACS (NAMD 2.4)
Building a complete modeling environment

Towards Understanding Membrane Channels
The versatile, highly selective, and efficient aquaporin
Deposited at the web site of the Nobel Museum

Protein Redesign Seeks a Photosynthetic Source for Hydrogen Gas
Algal hydrogenase: 57,000 atoms
Periodic boundary conditions
CHARMM27 force field, NVT: constant volume and temperature
PME full electrostatics
TeraGrid benchmark: 0.24 day/ns on 64 Itanium 1.5 GHz processors
Collaboration with DOE National Renewable Energy Lab, Golden, CO

ATP-Synthase: One Shaft, Two Motors (figure; dimension labels: ~100 Å, ~80 Å, ~200 Å, ~60 Å)
Soluble part, F1-ATPase (330,000-atom model): synthesizes ATP when torque is applied to it (main function of this unit); produces torque when it hydrolyzes ATP (not main function)
Membrane-bound part, F0 complex (130,000-atom model): produces torque when there is a positive proton gradient across the membrane (main function of this unit); pumps protons when torque is applied (not main function)
Torque is transmitted between the motors via the central stalk.

Molecular Mechanics Force Field

Biomolecular Time Scales (figure)
Max timestep: 1 fs

Example Simulation: GlpF
NAMD with PME, periodic boundary conditions, NPT ensemble at 310 K
Protein: ~15,000 atoms; lipids: ~40,000 atoms; water: ~51,000 atoms; total: ~106,000 atoms
PSC TCS CPUs, 4 hours per ns
M. Jensen, E. Tajkhorshid, K. Schulten, Structure 9, 1083 (2001)
E. Tajkhorshid et al., Science 296, 525-530 (2002)

Typical Simulation Statistics
100,000 atoms (including water, lipid)
10-20 MB of data for the entire system
100 Å per side periodic cell
12 Å cutoff for short-range nonbonded terms
10,000,000 timesteps (10 ns)
4 s/step on one processor (1.3 years total!)
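As a quick check on the "1.3 years" figure, here is a minimal back-of-the-envelope calculation; the step count and per-step time are taken from the slide above, everything else is plain arithmetic:

```cpp
#include <cstdio>

int main() {
    // Figures from the slide: 10,000,000 timesteps at ~4 s/step on one processor.
    const double steps = 1.0e7;
    const double secondsPerStep = 4.0;
    const double secondsPerYear = 365.0 * 24.0 * 3600.0;

    double totalSeconds = steps * secondsPerStep;
    printf("serial wall time: %.2e s = %.2f years\n",
           totalSeconds, totalSeconds / secondsPerYear);   // ~1.27 years
    return 0;
}
```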

Parallel MD: Easy or Hard?
Easy: tiny working data, spatial locality, uniform atom density, persistent repetition, multiple timestepping
Hard: sequential timesteps, short iteration time, full electrostatics, fixed problem size

Poorly Scaling Approaches
Replicated data: all atom coordinates stored on each processor; communication/computation ratio O(P log P)
Partition the atom array across processors: nearby atoms may not be on the same processor; C/C ratio O(P)
Distribute the force matrix to processors: matrix is sparse and non-uniform; C/C ratio O(sqrt(P))

Spatial Decomposition: NAMD 1
Atoms spatially distributed to cubes: cells, cubes, or "patches"
Size of each cube: just a bit larger than the cutoff radius
Communicate only with neighbors; work for each pair of neighbors
C/C ratio: O(1)
However: load imbalance, limited parallelism
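A minimal sketch of the patch idea under the assumptions above (cubic box, cell side at least the cutoff): bin atoms into cells so that every interacting pair lies within a cell or between neighboring cells. The names and data layout here are hypothetical, not NAMD's actual patch code:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

struct Atom { double x, y, z; };

// Bin atoms into cubic "patches" of side >= cutoff inside a box of size L.
// With this choice, all pairs within the cutoff lie in the same or adjacent
// cells, so each cell only communicates with its (up to) 26 neighbors.
std::vector<std::vector<int>> binAtoms(const std::vector<Atom>& atoms,
                                       double L, double cutoff, int& cellsPerSide) {
    cellsPerSide = std::max(1, (int)std::floor(L / cutoff));   // cell side = L/cells >= cutoff
    double cellSide = L / cellsPerSide;
    std::vector<std::vector<int>> cells(cellsPerSide * cellsPerSide * cellsPerSide);
    for (int i = 0; i < (int)atoms.size(); ++i) {
        int cx = std::min(cellsPerSide - 1, (int)(atoms[i].x / cellSide));
        int cy = std::min(cellsPerSide - 1, (int)(atoms[i].y / cellSide));
        int cz = std::min(cellsPerSide - 1, (int)(atoms[i].z / cellSide));
        cells[(cx * cellsPerSide + cy) * cellsPerSide + cz].push_back(i);
    }
    return cells;
}

int main() {
    std::vector<Atom> atoms = {{1.0, 2.0, 3.0}, {50.0, 60.0, 70.0}, {99.0, 1.0, 50.0}};
    int n;
    auto cells = binAtoms(atoms, 100.0, 12.0, n);   // 100 A box, 12 A cutoff -> 8x8x8 cells
    printf("%d cells per side, %zu cells total\n", n, cells.size());
    return 0;
}
```

Because each cell only needs its 26 neighbors, the communication per cell stays constant as the machine grows, which is the O(1) communication-to-computation ratio claimed above.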

Hybrid Decomposition: NAMD 2
Spatially decompose data and communication.
Separate but related work decomposition.
"Compute objects" facilitate an iterative, measurement-based load balancing system.

Particle Mesh Ewald
The Particle Mesh Ewald (PME) calculation adds:
A global grid of modest size (e.g., 192x144x144)
Distributing charge from each atom to a 4x4x4 sub-grid
A 3D FFT over the grid, hence O(N log N) performance
Strategy:
Use a smaller subset of processors for PME
Overlap PME with the cutoff computation
Use the same processors for both PME and cutoff
Multiple timestepping reduces the scaling impact
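As a rough illustration of the charge-spreading step only (the 3D FFT and reciprocal-space work are omitted), the sketch below spreads each charge onto a 4x4x4 block of grid points with order-4 (cubic) B-spline weights. It is not NAMD's PME code; the class names and periodic-wrap convention are assumptions, and only the 192x144x144 grid size comes from the slide:

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Regular 3D charge grid with periodic wrapping (illustrative only).
struct Grid {
    int nx, ny, nz;
    std::vector<double> q;                    // charge density, nx*ny*nz values
    Grid(int x, int y, int z) : nx(x), ny(y), nz(z), q(x * y * z, 0.0) {}
    double& at(int i, int j, int k) {
        auto wrap = [](int a, int n) { return ((a % n) + n) % n; };   // periodic cell
        return q[(wrap(i, nx) * ny + wrap(j, ny)) * nz + wrap(k, nz)];
    }
};

// Cubic B-spline weights for a fractional offset t in [0,1); they sum to 1.
static void bspline4(double t, double w[4]) {
    w[0] = (1 - t) * (1 - t) * (1 - t) / 6.0;
    w[1] = (3 * t * t * t - 6 * t * t + 4) / 6.0;
    w[2] = (-3 * t * t * t + 3 * t * t + 3 * t + 1) / 6.0;
    w[3] = t * t * t / 6.0;
}

// gx, gy, gz: the atom's position in grid units. Each charge touches a
// 4x4x4 sub-grid centered on its home cell, as described on the slide.
void spreadCharge(Grid& g, double charge, double gx, double gy, double gz) {
    int bx = (int)std::floor(gx), by = (int)std::floor(gy), bz = (int)std::floor(gz);
    double wx[4], wy[4], wz[4];
    bspline4(gx - bx, wx); bspline4(gy - by, wy); bspline4(gz - bz, wz);
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                g.at(bx - 1 + i, by - 1 + j, bz - 1 + k) += charge * wx[i] * wy[j] * wz[k];
}

int main() {
    Grid grid(192, 144, 144);                 // grid size taken from the slide
    spreadCharge(grid, 1.0, 95.3, 70.1, 12.8);
    // After spreading, the grid would be handed to a 3D FFT (hence O(N log N)).
    double total = 0;
    for (double v : grid.q) total += v;
    printf("total spread charge = %.6f\n", total);   // ~1.0, since the weights sum to 1
    return 0;
}
```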

NAMD 2 with PME: parallelization using Charm++ (diagram)

Avoiding Barriers
In NAMD, the energy reductions were made asynchronous; no other global barriers are used in cutoff simulations.
This came in handy when running on Pittsburgh's Lemieux (3000 processors): the machine (and how Converse uses the network) produced unpredictable, random communication delays. A send call could remain stuck for 20 ms, for example, while each timestep was ideally 12-14 ms.
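NAMD does this with Charm++'s asynchronous reductions; the sketch below conveys the same idea in plain MPI instead, overlapping each step's energy reduction with the next step's local work. Note that MPI_Iallreduce is an MPI-3 feature that postdates this talk, and all names here are invented for illustration:

```cpp
#include <mpi.h>
#include <cstdio>

// Placeholder for one timestep's force evaluation and integration on local atoms.
static double computeLocalEnergyAndIntegrate(int step) { return 1.0 + 0.001 * step; }

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Request req = MPI_REQUEST_NULL;
    double localE = 0.0, totalE = 0.0;

    for (int step = 0; step < 100; ++step) {
        // Do this step's local work while the previous reduction is still in flight.
        double stepEnergy = computeLocalEnergyAndIntegrate(step);

        // Only now finish the previous step's energy reduction (no global barrier).
        if (req != MPI_REQUEST_NULL) {
            MPI_Wait(&req, MPI_STATUS_IGNORE);
            if (rank == 0 && (step - 1) % 20 == 0)
                printf("step %d: total energy %.3f\n", step - 1, totalE);
        }

        localE = stepEnergy;   // safe to reuse the send buffer only after the wait
        MPI_Iallreduce(&localE, &totalE, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);
    }
    if (req != MPI_REQUEST_NULL) MPI_Wait(&req, MPI_STATUS_IGNORE);
    MPI_Finalize();
    return 0;
}
```

The point is that no processor ever blocks at a global barrier: a delayed message only delays the consumer of that particular reduction, not the whole timestep.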

Handling Network Delays

SC2002 Gordon Bell Award
Lemieux (PSC), 327K atoms with PME: 28 s per step reduced to 36 ms per step (76% efficiency), linear scaling (plot: time per step vs. number of processors)

Major New Platforms
SGI Altix
Cray XT3 "Red Storm"
IBM BG/L "Blue Gene"

SGI Altix 3000
Itanium-based successor to the Origin series
1.6 GHz Itanium 2 CPUs with 9 MB cache
Cache-coherent NUMA shared memory
Runs Linux (with some SGI modifications)
NCSA has two 512-processor machines

Porting NAMD to the Altix
A normal Itanium binary just works.
Best serial performance ever, better than other Itanium platforms (TeraGrid) at the same clock speed.
Building with SGI MPI just works; setenv MPI_DSM_DISTRIBUTE is needed.
Superlinear speedups from 16 to 64 processors (good network, running mostly in cache at 64).
Decent scaling to 256 processors (for the ApoA1 benchmark).
Intel 8.1 and later compilers have performance issues.

NAMD on New Platforms 92K atoms, PME NCSA 3.06 GHz Xeon PSC Cray XT3 (perfect scaling is a horizontal line) 92K atoms, PME NCSA 3.06 GHz Xeon PSC Cray XT3 TeraGrid 1.5 GHz Itanium 2 NCSA Altix 1.6 GHz Itanium 2 21 ms/step 4.1 ns/day number of processors

Altix Conclusions
Nice machine, easy to port to; code must run well on Itanium.
Perfect for the typical NAMD user: fastest serial performance, scales well to typical processor counts, full environment, no surprises.
TCBG's favorite platform for the past year.

Altix Transforms Interactive NAMD
GlpF IMD benchmark: 4210 atoms, 3295 fixed atoms, 10 Å cutoff, no PME; VMD user shown: HHS Secretary Thompson; target rate: 2 fs step = 1 ps/s.
8-fold performance growth from 2001 to 2005 (chart: steps per second vs. processors): 1.33 GHz Athlon (2001), 2.13 GHz (2003), 3.06 GHz Xeon (2004), 1.6 GHz Altix (2005).
2001 to 2003: 72% faster; 2003 to 2004: 39% faster; 2004 to 2005: 239% faster.

Cray XT3 (Red Storm)
Each node: a single AMD Opteron 100-series processor, 57 ns memory latency, 6.4 GB/s memory bandwidth, 6.4 GB/s HyperTransport to the Seastar network
Seastar router chip: 6 ports (3D torus topology), 7.6 GB/s per port (in the fixed Seastar 2)
Poor latency (vs. the XD1, according to Cray)

Cray XT3 (Red Storm)
4 nodes per blade, 8 blades per chassis, 3 chassis per cabinet (plus one big fan)
PSC machine (Big Ben) has 22 chassis and 2068 compute processors
A performance boost over the TCS system (Lemieux)

Cray XT3 (Red Storm)
Service and I/O nodes run Linux; normal x86-64 binaries just work on them.
Compute nodes run the Catamount kernel:
No OS interference for fine-grained parallelism
No time sharing: one process at a time
No sockets, no interrupts, no virtual memory
System calls are forwarded to the head node (slow!)

Cray XT3 Porting
The initial compile was mostly straightforward: disable Tcl, sockets, hostname, and username code.
Initial runs were horribly slow on startup, almost as if memory allocation were O(n²).
Found the docs: a "simple implementation of malloc(), optimized for the lightweight kernel and large memory allocations"; it sounds like they assume a stack-based allocation pattern.
Using -lgmalloc restores sane performance.

Cray XT3 Porting
Still somewhat slow on startup: need to do all I/O to Lustre scratch space; may be better when the head node isn't overloaded.
Tried a SHMEM port (the old T3E layer): the new library doesn't support locks yet, and SHMEM was optimized for the T3E, not the XT3.
Need Tcl for a fully functional NAMD: #ifdef out all socket and user-info code. The same approach should work on BG/L.

Cray XT3 Porting
Random crashes even on short benchmarks, with the same NAMD code and the same MPI layer as other platforms.
Tried the debugger (TotalView): still buggy, won't attach to running jobs, but managed to load a core file.
Found a pcqueue with an item count of -1; checking the item count apparently fixes the problem. Probably a compiler bug: the code looks fine.
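The fix described above amounts to refusing to pop from a queue whose item count is not positive; a tiny single-threaded ring-buffer sketch of that defensive check (purely illustrative, not the actual Converse pcqueue code):

```cpp
#include <cstdio>

// A tiny fixed-size ring buffer with an explicit item count.
// Checking `count > 0` before popping guards against the kind of corrupted,
// negative count seen in the debugger, instead of blindly decrementing it
// and handing back garbage.
struct PCQueue {
    static const int kCap = 1024;
    void* items[kCap];
    int head = 0, tail = 0, count = 0;

    bool push(void* msg) {
        if (count >= kCap) return false;          // full
        items[tail] = msg;
        tail = (tail + 1) % kCap;
        ++count;
        return true;
    }
    void* pop() {
        if (count <= 0) return nullptr;           // the defensive check
        void* msg = items[head];
        head = (head + 1) % kCap;
        --count;
        return msg;
    }
};

int main() {
    PCQueue q;
    int payload = 42;
    q.push(&payload);
    printf("popped %p, then %p\n", q.pop(), q.pop());   // second pop safely returns null
    return 0;
}
```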

Cray XT3 Porting
Performance was limited (on 256 CPUs), but only when printing energies every step.
NAMD streams better than direct CmiPrintf(), but I/O is unbuffered by default, at 20 ms per write.
Creating a large buffer and removing NAMD's flushes fixes the performance problem.
Can hit 6 ms/step on 1024 CPUs: very good.
Downside: no output until the end of the job, so a crash may lose it all.
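Conceptually, the buffering fix is just giving stdout a large, fully buffered stream so that per-step energy lines do not each become a slow forwarded write. A minimal sketch with standard C I/O; the buffer size and output format are arbitrary, not the actual NAMD change:

```cpp
#include <cstdio>
#include <cstdlib>

int main() {
    // Give stdout a large, fully buffered stream so each energy line does not
    // turn into its own (slow, forwarded) write system call on the compute node.
    static char bigBuffer[1 << 20];                       // 1 MB output buffer
    if (setvbuf(stdout, bigBuffer, _IOFBF, sizeof(bigBuffer)) != 0) {
        perror("setvbuf");
        return EXIT_FAILURE;
    }

    for (int step = 0; step < 1000; ++step)
        printf("ENERGY: %8d %14.4f\n", step, -12345.6789); // no flush per step

    // Trade-off noted on the slide: nothing reaches the output until the buffer
    // fills or the program exits, so a crash can lose the buffered output.
    fflush(stdout);
    return 0;
}
```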

NAMD on New Platforms 92K atoms, PME NCSA 3.06 GHz Xeon PSC Cray XT3 (perfect scaling is a horizontal line) 92K atoms, PME NCSA 3.06 GHz Xeon PSC Cray XT3 TeraGrid 1.5 GHz Itanium 2 NCSA Altix 1.6 GHz Itanium 2 21 ms/step 4.1 ns/day number of processors

Cray XT3 Conclusions
Serial performance is reasonable: Itanium is faster for NAMD, but the Opteron requires less tuning work.
Scaling is outstanding (eventually): low system noise allows 6 ms timesteps, and NAMD's latency tolerance may help.
The lack of OS features is annoying, but workable.
TCBG's main allocation for this year.