Presentation transcript:

Next-generation scalable applications: When MPI-only is not enough
June 3, 2008. Paul S. Crozier, Sandia National Labs.
How we use MPI to perform scalable parallel molecular dynamics.
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy under contract DE-AC04-94AL.

Molecular dynamics on parallel supercomputers

Why MD begs to be run on parallel supercomputers:
- Very popular method.
- Lots of useful information produced.
- Fundamentally limited by force field accuracy and by compute power (hardware/software).

Where's the expensive part?
- Force calculations (sketched below).
- Neighbor list creation.
- Long-range electrostatics (if needed).
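For concreteness, here is a minimal serial sketch of the dominant cost, the pairwise Lennard-Jones force kernel. This is not LAMMPS source: the O(N^2) double loop stands in for a real neighbor-list traversal, and epsilon, sigma, and cutoff are illustrative parameter names.

```cpp
// Minimal serial sketch of the Lennard-Jones force kernel (illustrative, not LAMMPS).
#include <cstddef>
#include <cstdio>
#include <vector>

struct Atom { double x, y, z, fx, fy, fz; };

// epsilon, sigma, cutoff are illustrative parameters for a single atom type
void lj_forces(std::vector<Atom> &atoms, double epsilon, double sigma, double cutoff) {
  const double rc2 = cutoff * cutoff;
  for (auto &a : atoms) a.fx = a.fy = a.fz = 0.0;
  for (std::size_t i = 0; i < atoms.size(); ++i) {
    for (std::size_t j = i + 1; j < atoms.size(); ++j) {   // stand-in for a neighbor list
      double dx = atoms[i].x - atoms[j].x;
      double dy = atoms[i].y - atoms[j].y;
      double dz = atoms[i].z - atoms[j].z;
      double r2 = dx * dx + dy * dy + dz * dz;
      if (r2 > rc2) continue;                              // short-range cutoff
      double sr2 = sigma * sigma / r2;
      double sr6 = sr2 * sr2 * sr2;
      // |F|/r = 24*eps*(2*(sigma/r)^12 - (sigma/r)^6) / r^2 for the 12-6 potential
      double fpair = 24.0 * epsilon * (2.0 * sr6 * sr6 - sr6) / r2;
      atoms[i].fx += fpair * dx;  atoms[j].fx -= fpair * dx;  // Newton's third law
      atoms[i].fy += fpair * dy;  atoms[j].fy -= fpair * dy;
      atoms[i].fz += fpair * dz;  atoms[j].fz -= fpair * dz;
    }
  }
}

int main() {
  std::vector<Atom> atoms = { {0, 0, 0, 0, 0, 0}, {1.2, 0, 0, 0, 0, 0} };
  lj_forces(atoms, 1.0, 1.0, 2.5);   // reduced LJ units
  std::printf("fx on atom 0 = %g\n", atoms[0].fx);
  return 0;
}
```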

Our MD codes

miniMD:
- Part of the Mantevo project; currently in the licensing process (LGPL).
- Intended to be an extremely simple parallel MD code that enables rapid portability and rewriting for use on novel architectures.

LAMMPS: Large-scale Atomic/Molecular Massively Parallel Simulator.

LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator)
Steve Plimpton, Aidan Thompson, Paul Crozier (lammps.sandia.gov)
- Classical MD code.
- Open source, highly portable C++.
- Freely available for download under the GPL.
- Easy to download, install, and run.
- Well documented.
- Easy to modify or extend with new features and functionality.
- Active users' list with over 300 subscribers.
- Since Sept. 2004: over 20k downloads; grown from 53 to 125 kloc.
- Spatial decomposition of the simulation domain for parallelism.
- Energy minimization via conjugate-gradient relaxation.
- Radiation damage and two-temperature model (TTM) simulations.
- Atomistic, mesoscale, and coarse-grain simulations.
- Variety of potentials (including many-body and coarse-grain).
- Variety of boundary conditions, constraints, etc.

LAMMPS development areas

Timescale & spatial scale:
- Faster MD: on a single processor; in parallel, or with load balancing; novel architectures, or N/P < 1.
- Accelerated MD: temperature accelerated dynamics, parallel replica dynamics, forward flux sampling.
- Multiscale simulation: couple to quantum, couple to fluid solvers, couple to KMC.
- Coarse graining: aggregation, rigidification, peridynamics.

Force fields:
- Auto generation, FF parameter database.
- Biological & organics, solid materials, electrons & plasmas.
- Chemical reactions: ReaxFF, bond making/swapping/breaking.
- Functional forms: charge equilibration, long-range dipole-dipole interactions, allow user-defined FF formulas, aspherical particles.

Features:
- Informatics / trajectory analysis.
- NPT for non-orthogonal boxes.
- User-requested features.

Possible parallel decompositions
- Atom: each processor owns a fixed subset of atoms; requires all-to-all communication.
- Force: each processor owns a fixed subset of interatomic forces; scales as O(N/sqrt(P)).
- Spatial: each processor owns a fixed spatial region; scales as O(N/P).
- Hybrid: typically a combination of spatial and force decompositions.
- Other?

Reference: S. J. Plimpton, "Fast Parallel Algorithms for Short-Range Molecular Dynamics," J. Comp. Phys. 117, 1-19 (1995).

Parallelism via spatial decomposition
- Physical domain divided into 3d boxes, one per processor.
- Each processor computes forces on atoms in its box, using info from nearby procs.
- Atoms "carry along" molecular topology as they migrate to new procs.
- Communication via a nearest-neighbor 6-way stencil (see the sketch below).
- Optimal N/P scaling for MD, so long as load-balanced:
  - Computation scales as N/P.
  - Communication scales as N/P for large problems (or better or worse).
  - Memory scales as N/P.
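A minimal sketch of the bookkeeping behind this decomposition, assuming a periodic cubic box of side L and letting MPI choose the 3d process grid. It builds the Cartesian communicator, queries the 6-way stencil neighbors, and maps an atom's coordinates to the owning rank; owner() and the box length are illustrative, not LAMMPS code.

```cpp
// Spatial-decomposition bookkeeping sketch: 3d process grid, 6-way stencil, atom owner.
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int nprocs, rank;
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // Let MPI pick a 3d process grid, then build a periodic Cartesian communicator
  int dims[3] = {0, 0, 0}, periods[3] = {1, 1, 1};
  MPI_Dims_create(nprocs, 3, dims);
  MPI_Comm cart;
  MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 0, &cart);

  // The 6-way stencil: +/- neighbor in each dimension, used for ghost-atom
  // exchange and for atoms migrating to a new owner after they move
  int nbr_lo[3], nbr_hi[3];
  for (int d = 0; d < 3; ++d)
    MPI_Cart_shift(cart, d, 1, &nbr_lo[d], &nbr_hi[d]);

  // Owner of an atom at (x, y, z) in a box spanning [0, L) in each dimension
  const double L = 100.0;                          // illustrative box length
  auto owner = [&](double x, double y, double z) {
    int c[3] = { int(x / L * dims[0]), int(y / L * dims[1]), int(z / L * dims[2]) };
    int who;
    MPI_Cart_rank(cart, c, &who);
    return who;
  };

  if (rank == 0)
    std::printf("grid %dx%dx%d, atom at (1,2,3) owned by rank %d\n",
                dims[0], dims[1], dims[2], owner(1.0, 2.0, 3.0));

  MPI_Comm_free(&cart);
  MPI_Finalize();
  return 0;
}
```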

Parallel performance
- Fixed-size (32K atoms) and scaled-size (32K atoms/proc) parallel efficiencies.
- Protein (rhodopsin) in a solvated lipid bilayer.
- Billions of atoms on 64K procs of Blue Gene or Red Storm.
- Opteron processor speed: 4.5E-5 sec/atom/step (12x for metal, 25x for LJ).

Particle-mesh methods for Coulombics
- Coulomb interactions fall off as 1/r, so they require long-range treatment for accuracy.
- Particle-mesh methods partition the interaction into short-range and long-range contributions (see the splitting below):
  - Short-range: handled via direct pairwise interactions.
  - Long-range: interpolate atomic charges to a 3d mesh, solve Poisson's equation on the mesh (4 FFTs), interpolate E-fields back to the atoms.
- FFTs scale as N log N if the cutoff is held fixed.
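The partition referred to above is the standard Ewald-style splitting of each 1/r term by the error function, with alpha the splitting parameter:

```latex
\frac{1}{r}
  \;=\; \underbrace{\frac{\operatorname{erfc}(\alpha r)}{r}}_{\text{short range: direct pairwise sum}}
  \;+\; \underbrace{\frac{\operatorname{erf}(\alpha r)}{r}}_{\text{long range: smooth, solved on the mesh}}
```

Larger alpha shifts work from the pairwise sum to the mesh; the cutoff and mesh spacing are chosen together to hit a target accuracy.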

Parallel FFTs
- A 3d FFT is 3 sets of 1d FFTs in parallel; the 3d grid is distributed across procs.
- 1d FFTs are performed on-processor, via a native library or FFTW.
- Sequence: 1d FFTs, transpose, 1d FFTs, transpose, ...
- "Transpose" = data transfer; transferring the entire grid is costly (see the sketch below).
- FFTs for PPPM can scale poorly on large numbers of procs and on clusters.
- Good news: the cost of PPPM is only ~2x more than an 8-10 Angstrom cutoff.
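A minimal sketch of that transpose step, assuming an nx by ny by nz complex grid stored as x-slabs with nx and ny divisible by the number of ranks; the grid sizes and packing layout here are illustrative, not the LAMMPS/FFTW implementation. The MPI_Alltoall block exchange is the communication that limits scaling at large processor counts.

```cpp
// Transpose between 1d FFT passes: pack per destination, all-to-all, then local FFTs.
#include <mpi.h>
#include <complex>
#include <vector>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, nprocs;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  const int nx = 64, ny = 64, nz = 64;     // illustrative PPPM grid (divisible by nprocs)
  const int nxslab = nx / nprocs;          // x-planes owned before the transpose
  const int nycol  = ny / nprocs;          // y-columns owned after the transpose

  std::vector<std::complex<double>> grid(nxslab * ny * nz);   // local slab, [x][y][z]
  std::vector<std::complex<double>> packed(grid.size()), recv(grid.size());

  // pack the slab so data headed to the same destination rank is contiguous
  for (int p = 0; p < nprocs; ++p)
    for (int i = 0; i < nxslab; ++i)
      for (int j = 0; j < nycol; ++j)
        for (int k = 0; k < nz; ++k)
          packed[((p * nxslab + i) * nycol + j) * nz + k] =
              grid[(i * ny + (p * nycol + j)) * nz + k];

  // every rank exchanges one block with every other rank (complex sent as double pairs)
  MPI_Alltoall(packed.data(), nxslab * nycol * nz * 2, MPI_DOUBLE,
               recv.data(),   nxslab * nycol * nz * 2, MPI_DOUBLE, MPI_COMM_WORLD);

  // recv now holds the full x extent for this rank's block of y-columns; after a
  // local reshuffle, the 1d FFTs along x run on-processor (e.g., with FFTW)
  MPI_Finalize();
  return 0;
}
```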

Could we do better with non-traditional hardware, or something other than MPI?
- Alternative hardware: multicore, GPGPUs, vector machines.
- Non-MPI alternatives: OpenMP (see the sketch below), special-purpose message passing.
- Other suggestions?
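As one point of reference for the OpenMP option, a minimal hybrid MPI + OpenMP sketch: MPI ranks keep their spatial subdomains while OpenMP threads share the rank-local force loop. forces_on_local_atoms() is a hypothetical stand-in for the per-rank work, not a LAMMPS routine.

```cpp
// Hybrid MPI + OpenMP sketch: ranks across nodes, threads sharing the local force loop.
#include <mpi.h>
#include <omp.h>
#include <cstdio>

static void forces_on_local_atoms(int nlocal) {
  // threads split the rank's local atoms; each iteration is independent here
  #pragma omp parallel for
  for (int i = 0; i < nlocal; ++i) {
    // ... compute forces on atom i from its neighbor list ...
  }
}

int main(int argc, char **argv) {
  int provided;
  // request threaded MPI: FUNNELED means only the main thread makes MPI calls
  MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  forces_on_local_atoms(/*nlocal=*/10000);

  if (rank == 0)
    std::printf("rank 0 ran with %d OpenMP threads\n", omp_get_max_threads());

  MPI_Finalize();
  return 0;
}
```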

Comments on the morning discussion
1. We really want strong scaling: get out longer on the time axis. We would rather simulate a membrane protein for a ms than a whole cell for a ns. We want an MD algorithm that performs well when N < P; LAMMPS's spatial decomposition breaks down for N < P. We'll need a new parallel MD algorithm to get speedup along the time axis for problems where N < P.
2. We've looked at OpenMP vs. MPI and see no advantage for LAMMPS as it stands right now.
3. Many in the MD community are now looking at doing MD on "novel architectures": GPUs, multicore machines, etc.
4. "Canned" particle-pusher middleware seems difficult; it needs many different options and flavors.

MPI Applications: How We Use MPI
Tuesday, June 3, 1:45-3:00 pm. Paul S. Crozier, Sandia National Labs.
How we use MPI to perform scalable parallel molecular dynamics.
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy under contract DE-AC04-94AL.