Parallel Molecular Dynamics: Application-Oriented Computer Science Research
Laxmikant Kale

Outline
What is needed for HPC to succeed?
Parallelization of molecular dynamics
– Aggressive parallel decomposition
– Load balancing and performance
– Multi-paradigm programming
Collaborative interdisciplinary research
– Comments and lessons

Contributors
PIs:
– Laxmikant Kale, Klaus Schulten, Robert Skeel
NAMD 1:
– Robert Brunner, Andrew Dalke, Attila Gursoy, Bill Humphrey, Mark Nelson
NAMD 2:
– M. Bhandarkar, R. Brunner, A. Gursoy, J. Phillips, N. Krawetz, A. Shinozaki, K. Varadarajan, …

Parallel Computing Research
Trends:
– Application-centered CS research
– Isolated CS research
Both have drawbacks
Needed:
– Computer-science-centered, yet application-oriented research

Middle Layers
Applications
"Middle layers": languages, tools, libraries
Parallel machines

Molecular Dynamics
Collection of [charged] atoms, with bonds
Newtonian mechanics
At each time-step:
– Calculate forces on each atom: bonded; non-bonded (electrostatic and van der Waals)
– Calculate velocities and advance positions
1 femtosecond time-step, millions needed!
Thousands of atoms (roughly 1,000–100,000)
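As a rough illustration of the per-timestep structure described above, here is a toy sketch (not NAMD's code): an all-pairs Coulomb-only force loop followed by a simple integration step. `Atom`, `pairForce`, and the omission of cutoffs, constants, and bonded terms are all simplifications for illustration only.

```cpp
#include <cmath>
#include <vector>

struct Vec3 { double x, y, z; };
struct Atom { Vec3 pos, vel, force; double mass, charge; };

// Toy pairwise force: Coulomb only (no Lennard-Jones, no cutoff, k_e omitted).
Vec3 pairForce(const Atom& a, const Atom& b) {
    Vec3 d{a.pos.x - b.pos.x, a.pos.y - b.pos.y, a.pos.z - b.pos.z};
    double r2 = d.x * d.x + d.y * d.y + d.z * d.z + 1e-12;
    double s = a.charge * b.charge / (r2 * std::sqrt(r2));
    return {s * d.x, s * d.y, s * d.z};
}

// One femtosecond-scale timestep: compute forces, then advance velocities
// and positions. Real codes use cutoffs / neighbor lists instead of the
// O(N^2) double loop shown here, and add bonded terms.
void timestep(std::vector<Atom>& atoms, double dt) {
    for (Atom& a : atoms) a.force = {0, 0, 0};

    // Non-bonded forces, all pairs; each pair is computed once and applied
    // to both atoms (Newton's 3rd law).
    for (size_t i = 0; i < atoms.size(); ++i)
        for (size_t j = i + 1; j < atoms.size(); ++j) {
            Vec3 f = pairForce(atoms[i], atoms[j]);
            atoms[i].force.x += f.x;  atoms[j].force.x -= f.x;
            atoms[i].force.y += f.y;  atoms[j].force.y -= f.y;
            atoms[i].force.z += f.z;  atoms[j].force.z -= f.z;
        }
    // (Bonded forces -- bonds, angles, dihedrals -- would be added here.)

    // Integrate: advance velocities and positions.
    for (Atom& a : atoms) {
        a.vel.x += dt * a.force.x / a.mass;
        a.vel.y += dt * a.force.y / a.mass;
        a.vel.z += dt * a.force.z / a.mass;
        a.pos.x += dt * a.vel.x;
        a.pos.y += dt * a.vel.y;
        a.pos.z += dt * a.vel.z;
    }
}
```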

Further MD
Use of a cut-off radius to reduce work
– Charges beyond the cut-off are ignored!
The vast majority of the work is non-bonded force computation
Some simulations need the faraway contributions

Scalability
The program should scale up to use a large number of processors
– But what does that mean?
An individual simulation isn't truly scalable
Better definition of scalability:
– If I double the number of processors, I should be able to retain parallel efficiency by increasing the problem size

Isoefficiency
Quantifies scalability
How much increase in problem size is needed to retain the same efficiency on a larger machine?
Efficiency: sequential time / (P · parallel time)
– Parallel time = computation + communication + idle
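Written out, the definition above (using $T_{\mathrm{seq}}$ for sequential time and $T_{\mathrm{par}}$ for parallel time on $P$ processors, notation introduced here for clarity):

```latex
E(N,P) \;=\; \frac{T_{\mathrm{seq}}(N)}{P \cdot T_{\mathrm{par}}(N,P)},
\qquad
T_{\mathrm{par}} \;=\; t_{\mathrm{comp}} + t_{\mathrm{comm}} + t_{\mathrm{idle}}
```

Isoefficiency is then the rate at which the problem size $N$ must grow with $P$ to hold $E$ constant.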

Traditional Approaches
Replicated data:
– All atom coordinates stored on each processor
– Non-bonded forces distributed evenly
– Analysis (assume N atoms, P processors):
  Computation: O(N/P)
  Communication: O(N log P)
  Communication/computation ratio: O(P log P)
Fraction of communication increases with the number of processors, independent of problem size!
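A quick check of the ratio claimed above, using the two terms from the analysis:

```latex
\frac{t_{\mathrm{comm}}}{t_{\mathrm{comp}}}
  \;=\; \frac{O(N \log P)}{O(N/P)}
  \;=\; O(P \log P)
```

The problem size $N$ cancels, which is exactly why the communication fraction grows with $P$ no matter how large the problem is.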

Atom Decomposition
Partition the atoms array across processors
– Nearby atoms may not be on the same processor
– Communication: O(N) per processor
– Communication/computation ratio: O(P)

Force Decomposition
Distribute the force matrix across processors
– Matrix is sparse, non-uniform
– Each processor has one block
– Communication: O(N/√P)
– Ratio: O(√P)
Better scalability (can use 100+ processors)
– Hwang, Saltz, et al.: 6% on 32 PEs, 36% on 128 processors
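The same arithmetic for force decomposition, alongside the spatial-decomposition case from the next slide:

```latex
\frac{t_{\mathrm{comm}}}{t_{\mathrm{comp}}}
  \;=\; \frac{O(N/\sqrt{P})}{O(N/P)} \;=\; O(\sqrt{P}),
\qquad
\text{spatial: } \frac{O(N/P)}{O(N/P)} \;=\; O(1)
```

So the ratio still grows with $P$, but far more slowly than the $O(P \log P)$ of replicated data; spatial decomposition removes the growth altogether.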

Spatial Decomposition
Allocate nearby atoms to the same processor
Three variations possible:
– Partitioning into P boxes, one per processor
  Good scalability, but hard to implement
– Partitioning into fixed-size boxes, each a little larger than the cut-off distance
– Partitioning into smaller boxes
Communication: O(N/P)

Spatial Decomposition in NAMD
NAMD 1 used spatial decomposition
Good theoretical isoefficiency, but load-balancing problems for a fixed-size system
For mid-size systems, got good speedups up to 16 processors…
Use the symmetry of Newton's 3rd law to facilitate load balancing

Spatial Decomposition

FD + SD
Now we have many more objects to load balance (see the sketch below):
– Each diamond can be assigned to any processor
– Number of diamonds (3D): 14 · number of patches
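A minimal sketch (hypothetical helper, not NAMD code) of where the factor of 14 comes from: a patch has 26 neighboring boxes in 3D, only half of which need their own pairwise force object once Newton's 3rd law symmetry is exploited, plus one object for interactions within the patch itself.

```cpp
#include <array>
#include <iostream>
#include <vector>

// Enumerate the unique neighbor offsets a patch needs for pairwise
// (non-bonded) interactions in 3D. Of the 26 neighbors of a box, only
// half are kept (the other half are covered from the neighbor's side);
// adding the patch itself gives 13 + 1 = 14 force objects per patch.
std::vector<std::array<int, 3>> uniquePairOffsets() {
    std::vector<std::array<int, 3>> offsets;
    for (int dx = -1; dx <= 1; ++dx)
        for (int dy = -1; dy <= 1; ++dy)
            for (int dz = -1; dz <= 1; ++dz) {
                // Keep (0,0,0) and exactly one of each {+d, -d} pair,
                // chosen lexicographically.
                bool keep = (dx > 0) ||
                            (dx == 0 && dy > 0) ||
                            (dx == 0 && dy == 0 && dz >= 0);
                if (keep) offsets.push_back({dx, dy, dz});
            }
    return offsets;  // 14 entries
}

int main() {
    std::cout << uniquePairOffsets().size() << "\n";  // prints 14
}
```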

Bond Forces
Multiple types of forces:
– Bonds (2 atoms), angles (3), dihedrals (4), …
– Luckily, each involves atoms in neighboring patches only
Straightforward implementation:
– Send a message to all neighbors, receive forces from them
– 26 × 2 messages per patch!

Bonded Forces: Assume one patch per processor
[figure: neighboring patches A, B, C]

Implementation
Multiple objects per processor
– Different types: patches, pairwise forces, bonded forces, …
– Each may have its data ready at different times
– Need the ability to map and remap them
– Need prioritized scheduling
Charm++ supports all of these

Charm++
Data-driven objects
Object groups:
– A global object with a "representative" on each PE
Asynchronous method invocation
Prioritized scheduling
Mature, robust, portable
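For readers unfamiliar with Charm++, here is a rough sketch of what a data-driven object looks like in it. The names (`Patch`, `recvForces`) are hypothetical, the snippet follows the interface-file/translator conventions of the Charm++ manual only approximately, and it is not NAMD's actual code (the generated `.decl.h`/`.def.h` headers come from the Charm++ translator).

```cpp
// patch.ci -- Charm++ interface file (processed by the Charm++ translator);
// declares a 3D chare array whose entry methods are invoked asynchronously:
//
//   module patch {
//     array [3D] Patch {
//       entry Patch();
//       entry void recvForces(int n, double f[3*n]);
//     };
//   };

// patch.C -- CBase_Patch and CProxy_Patch are generated from the file above.
#include <vector>
#include "patch.decl.h"

class Patch : public CBase_Patch {
    std::vector<double> forces;        // accumulated per-atom forces
public:
    Patch() {}
    Patch(CkMigrateMessage*) {}        // constructor used when the object migrates

    // Entry method: force objects call this asynchronously when their
    // contribution is ready; the runtime schedules it when the message arrives.
    void recvForces(int n, double* f) {
        forces.assign(f, f + 3 * n);   // simplified: overwrite rather than sum
    }
};

#include "patch.def.h"

// Usage from another object (asynchronous -- the caller does not block):
//   CProxy_Patch patches = CProxy_Patch::ckNew(nx, ny, nz);
//   patches(x, y, z).recvForces(n, data);
```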

Data-Driven Execution
[figure: per-processor scheduler picking messages off a prioritized message queue and invoking the corresponding object's method]
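A toy sketch of the idea behind the figure (not Charm++'s actual runtime code): each message carries a priority and a handler bound to its target object, and the per-processor scheduler repeatedly runs the highest-priority message to completion.

```cpp
#include <functional>
#include <queue>
#include <vector>

// Message-driven execution: work runs only when its data has arrived.
struct Message {
    int priority;                    // smaller value = more urgent
    std::function<void()> handler;   // bound entry method + unpacked data
};

struct ByPriority {
    bool operator()(const Message& a, const Message& b) const {
        return a.priority > b.priority;   // behaves as a min-heap on priority
    }
};

class Scheduler {
    std::priority_queue<Message, std::vector<Message>, ByPriority> q;
public:
    void enqueue(Message m) { q.push(std::move(m)); }
    void run() {
        while (!q.empty()) {          // a real runtime also polls the network here
            Message m = q.top();
            q.pop();
            m.handler();              // invoke the object's method for this message
        }
    }
};
```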

Load Balancing
A major challenge for this application
– Especially for a large number of processors
Unpredictable workloads
– Each diamond (force object) and patch encapsulates a variable amount of work
– Static estimates are inaccurate
Measurement-based load balancing
– Very slow variations across timesteps

Bipartite Graph Balancing
Background load:
– Patches and angle forces
Migratable load:
– Non-bonded forces
Bipartite communication graph
– Between migratable and non-migratable objects
Challenge:
– Balance load while minimizing communication

Load Balancing
Collect timing data for several cycles
Run a heuristic load balancer
– Several alternative ones
Re-map and migrate objects accordingly
– Registration mechanisms facilitate migration
Needs a separate talk!
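One of the simplest heuristics of this kind is a greedy assignment, sketched below under stated assumptions (object ids run 0..n-1, measured loads come from instrumenting the previous cycles). This is only an illustration of the measurement-based idea, not the actual Charm++ balancer code, and it ignores the communication term the real strategies also weigh.

```cpp
#include <algorithm>
#include <queue>
#include <utility>
#include <vector>

struct Obj { int id; double measuredLoad; };   // load measured over recent cycles

// Greedy rebalance: heaviest migratable object goes to the currently
// least-loaded processor; background (non-migratable) load is fixed.
std::vector<int> greedyAssign(std::vector<Obj> objs, int numProcs,
                              const std::vector<double>& backgroundLoad) {
    std::sort(objs.begin(), objs.end(),
              [](const Obj& a, const Obj& b) { return a.measuredLoad > b.measuredLoad; });

    using ProcLoad = std::pair<double, int>;   // (current load, processor)
    std::priority_queue<ProcLoad, std::vector<ProcLoad>, std::greater<>> procs;
    for (int p = 0; p < numProcs; ++p) procs.push({backgroundLoad[p], p});

    std::vector<int> assignment(objs.size());
    for (const Obj& o : objs) {
        auto [load, p] = procs.top();
        procs.pop();
        assignment[o.id] = p;                  // object o migrates to processor p
        procs.push({load + o.measuredLoad, p});
    }
    return assignment;                         // new object-to-processor map
}
```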

Before and After

Performance: size of system

Performance: various machines

Speedup

Multi-Paradigm Programming
Long-range electrostatic interactions
– Some simulations require this
– Contributions of faraway atoms can be calculated infrequently
– PVM-based library, DPMTA, developed at Duke by John Board et al.
Patch life cycle
– Better expressed as a thread
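A toy illustration (not NAMD code) of why the patch life cycle reads more naturally as a thread: the per-timestep flow "send coordinates, wait for all force contributions, integrate, repeat" is written top to bottom instead of being split across asynchronous callbacks. The helper names are hypothetical placeholders.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

class PatchThread {
    std::mutex m;
    std::condition_variable cv;
    int pendingForces = 0;                    // force objects still outstanding
public:
    // Called from the message layer when one force contribution arrives.
    void forceArrived() {
        std::lock_guard<std::mutex> lk(m);
        if (--pendingForces == 0) cv.notify_one();
    }

    void run(int numSteps, int numForceObjects) {
        for (int step = 0; step < numSteps; ++step) {
            { std::lock_guard<std::mutex> lk(m); pendingForces = numForceObjects; }
            sendCoordinatesToForceObjects();              // hypothetical helper
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [this] { return pendingForces == 0; });  // "await forces"
            integrateVelocitiesAndPositions();            // hypothetical helper
        }
    }
private:
    void sendCoordinatesToForceObjects() { /* ... */ }
    void integrateVelocitiesAndPositions() { /* ... */ }
};

// Usage sketch: std::thread t(&PatchThread::run, &patch, steps, nForceObjects);
```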

Converse
Supports multi-paradigm programming
Provides portability
Makes it easy to implement runtime systems for new paradigms
Several languages/libraries:
– Charm++, threaded MPI, PVM, Java, md-perl, pc++, Nexus, Path, Cid, CC++, DP, Agents, …

NAMD2 with Converse

NAMD2
In production use
– Internally for about a year
– Several simulations completed/published
Fastest MD program? We think so
Modifiable/extensible
– Steered MD
– Free energy calculations

Lessons for CSE
Technical lessons:
– Multiple-domain (patch) decomposition provides necessary flexibility
– Data-driven objects and threads are a great combination
– Measurement-based load balancing is better
– Multi-paradigm parallel programming works!
  Integrate independently developed libraries
  Use the appropriate paradigm for each component

Real Application?
Drawbacks:
– Need to spend effort on mundane details not germane to CS research
– Being a production program complicates the structure

Real Application for CS Research?
Benefits:
– Subtle and complex research problems are uncovered only with a real application
– Satisfaction of a "real," concrete contribution
– With careful planning, you can truly enrich the "middle layers"
– Bring back a rich variety of relevant CS problems
– Apply them to other domains: rockets? casting?

Collaboration Lessons
Use conservative methods…
– C++: fashionable vs. conservative
– Aggressive methods where they matter
Account for differing priorities and objectives