Topology Aware Mapping for Performance Optimization of Science Applications Abhinav S Bhatele Parallel Programming Lab, UIUC.

Slides:

Advertisements

Similar presentations

The Charm++ Programming Model and NAMD Abhinav S Bhatele Department of Computer Science University of Illinois at Urbana-Champaign

Advertisements

1 NAMD - Scalable Molecular Dynamics Gengbin Zheng 9/1/01.

Abhinav Bhatele, Laxmikant V. Kale University of Illinois at Urbana-Champaign Sameer Kumar IBM T. J. Watson Research Center.

Abhinav Bhatele, Laxmikant V. Kale University of Illinois at Urbana-Champaign Sameer Kumar IBM T. J. Watson Research Center.

CISC October Goals for today: Foster’s parallel algorithm design –Partitioning –Task dependency graph Granularity Concurrency Collective communication.

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display. Parallel Programming in C with MPI and OpenMP Michael J. Quinn.

A CASE STUDY OF COMMUNICATION OPTIMIZATIONS ON 3D MESH INTERCONNECTS University of Illinois at Urbana-Champaign Abhinav Bhatele, Eric Bohm, Laxmikant V.

Adaptive MPI Chao Huang, Orion Lawlor, L. V. Kalé Parallel Programming Lab Department of Computer Science University of Illinois at Urbana-Champaign.

Automating Topology Aware Task Mapping on Large Parallel Machines Abhinav S Bhatele Advisor: Laxmikant V. Kale University of Illinois at Urbana-Champaign.

Load Balancing and Topology Aware Mapping for Petascale Machines Abhinav S Bhatele Parallel Programming Laboratory, Illinois.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Programming Massively Parallel Processors.

A Framework for Collective Personalized Communication Laxmikant V. Kale, Sameer Kumar, Krishnan Varadarajan.

Charm++ Load Balancing Framework Gengbin Zheng Parallel Programming Laboratory Department of Computer Science University of Illinois at.

Interconnect Networks

1 First-Principles Molecular Dynamics for Petascale Computers François Gygi Dept of Applied Science, UC Davis

Application-specific Topology-aware Mapping for Three Dimensional Topologies Abhinav Bhatelé Laxmikant V. Kalé.

Designing and Evaluating Parallel Programs Anda Iamnitchi Federated Distributed Systems Fall 2006 Textbook (on line): Designing and Building Parallel Programs.

Grid Computing With Charm++ And Adaptive MPI Gregory A. Koenig Department of Computer Science University of Illinois.

1 Scalable Molecular Dynamics for Large Biomolecular Systems Robert Brunner James C Phillips Laxmikant Kale.

Department of Computer Science at Florida State LFTI: A Performance Metric for Assessing Interconnect topology and routing design Background ‒ Innovations.

Chapter 3 Parallel Algorithm Design. Outline Task/channel model Task/channel model Algorithm design methodology Algorithm design methodology Case studies.

Molecular Dynamics Collection of [charged] atoms, with bonds – Newtonian mechanics – Relatively small #of atoms (100K – 10M) At each time-step – Calculate.

Performance Model & Tools Summary Hung-Hsun Su UPC Group, HCS lab 2/5/2004.

Dynamic Load Balancing in Charm++ Abhinav S Bhatele Parallel Programming Lab, UIUC.

NIH Resource for Biomolecular Modeling and Bioinformatics Beckman Institute, UIUC NAMD Development Goals L.V. (Sanjay) Kale Professor.

NIH Resource for Biomolecular Modeling and Bioinformatics Beckman Institute, UIUC NAMD Development Goals L.V. (Sanjay) Kale Professor.

Overcoming Scaling Challenges in Bio-molecular Simulations Abhinav Bhatelé Sameer Kumar Chao Mei James C. Phillips Gengbin Zheng Laxmikant V. Kalé.

InterConnection Network Topologies to Minimize graph diameter: Low Diameter Regular graphs and Physical Wire Length Constrained networks Nilesh Choudhury.

Workshop BigSim Large Parallel Machine Simulation Presented by Eric Bohm PPL Charm Workshop 2004.

1 Scaling Applications to Massively Parallel Machines using Projections Performance Analysis Tool Presented by Chee Wai Lee Authors: L. V. Kale, Gengbin.

Lecture 4 TTH 03:30AM-04:45PM Dr. Jianjun Hu CSCE569 Parallel Computing University of South Carolina Department of.

Scalability and interoperable libraries in NAMD Laxmikant (Sanjay) Kale Theoretical Biophysics group and Department of Computer Science University of Illinois.

1 ©2004 Board of Trustees of the University of Illinois Computer Science Overview Laxmikant (Sanjay) Kale ©

CS 484 Load Balancing. Goal: All processors working all the time Efficiency of 1 Distribute the load (work) to meet the goal Two types of load balancing.

Parallel Molecular Dynamics Application Oriented Computer Science Research Laxmikant Kale

An Evaluation of Partitioners for Parallel SAMR Applications Sumir Chandra & Manish Parashar ECE Dept., Rutgers University Submitted to: Euro-Par 2001.

Using Charm++ to Mask Latency in Grid Computing Applications Gregory A. Koenig Parallel Programming Laboratory Department.

IPDPS Workshop: Apr 2002PPL-Dept of Computer Science, UIUC A Parallel-Object Programming Model for PetaFLOPS Machines and BlueGene/Cyclops Gengbin Zheng,

Scalable and Topology-Aware Load Balancers in Charm++ Amit Sharma Parallel Programming Lab, UIUC.

Memory-Aware Scheduling for LU in Charm++ Isaac Dooley, Chao Mei, Jonathan Lifflander, Laxmikant V. Kale.

NGS/IBM: April2002PPL-Dept of Computer Science, UIUC Programming Environment and Performance Modeling for million-processor machines Laxmikant (Sanjay)

A uGNI-Based Asynchronous Message- driven Runtime System for Cray Supercomputers with Gemini Interconnect Yanhua Sun, Gengbin Zheng, Laximant(Sanjay) Kale.

NGS Workshop: Feb 2002PPL-Dept of Computer Science, UIUC Programming Environment and Performance Modeling for million-processor machines Laxmikant (Sanjay)

Integrated Performance Views in Charm++: Projections meets TAU Scott Biersdorff Allen D. Malony Department Computer and Information Science University.

1 Opportunities and Challenges of Modern Communication Architectures: Case Study with QsNet CAC Workshop Santa Fe, NM, 2004 Sameer Kumar* and Laxmikant.

1 Rocket Science using Charm++ at CSAR Orion Sky Lawlor 2003/10/21.

Effects of contention on message latencies in large supercomputers Abhinav S Bhatele and Laxmikant V Kale Parallel Programming Laboratory, UIUC IS TOPOLOGY.

NIH Resource for Macromolecular Modeling and Bioinformatics Beckman Institute, UIUC Scaling NAMD to 100 Million Atoms on Petascale.

Motivation: dynamic apps Rocket center applications: –exhibit irregular structure, dynamic behavior, and need adaptive control strategies. Geometries are.

Lecture 3: Designing Parallel Programs. Methodological Design Designing and Building Parallel Programs by Ian Foster www-unix.mcs.anl.gov/dbpp.

Hierarchical Load Balancing for Large Scale Supercomputers Gengbin Zheng Charm++ Workshop 2010 Parallel Programming Lab, UIUC 1Charm++ Workshop 2010.

Effects of contention on message latencies in large supercomputers Abhinav S Bhatele and Laxmikant V Kale Parallel Programming Laboratory, UIUC IS TOPOLOGY.

COMP7330/7336 Advanced Parallel and Distributed Computing Task Partitioning Dr. Xiao Qin Auburn University

COMP7330/7336 Advanced Parallel and Distributed Computing Task Partitioning Dynamic Mapping Dr. Xiao Qin Auburn University

Parallel Molecular Dynamics A case study : Programming for performance Laxmikant Kale

Massively Parallel Cosmological Simulations with ChaNGa Pritish Jetley, Filippo Gioachin, Celso Mendes, Laxmikant V. Kale and Thomas Quinn.

Flexibility and Interoperability in a Parallel MD code Robert Brunner, Laxmikant Kale, Jim Phillips University of Illinois at Urbana-Champaign.

Auburn University

Effects of Contention on Message Latencies in Large Supercomputers

Auburn University COMP7330/7336 Advanced Parallel and Distributed Computing Mapping Techniques Dr. Xiao Qin Auburn University.

Parallel Algorithm Design

Performance Evaluation of Adaptive MPI

Implementing Simplified Molecular Dynamics Simulation in Different Parallel Paradigms Chao Mei April 27th, 2006 CS498LVK.

Parallel Programming in C with MPI and OpenMP

Component Frameworks:

Scalable Molecular Dynamics for Large Biomolecular Systems

Title Meta-Balancer: Automated Selection of Load Balancing Strategies

Gengbin Zheng, Esteban Meneses, Abhinav Bhatele and Laxmikant V. Kale

IXPUG, SC’16 Lightning Talk Kavitha Chandrasekar*, Laxmikant V. Kale

Parallel Programming in C with MPI and OpenMP

Presentation transcript:

Topology Aware Mapping for Performance Optimization of Science Applications Abhinav S Bhatele Parallel Programming Lab, UIUC

Outline The Mapping Problem Motivation for Topology Aware Mapping NAMD: Dynamic Unstructured Graph OpenAtom: Static Structured Graph Other Applications Automated Mapping Framework October 8th, 20082IACAT Seminar

Outline The Mapping Problem Motivation for Topology Aware Mapping NAMD: Dynamic Unstructured Graph OpenAtom: Static Structured Graph Other Applications Automated Mapping Framework October 8th, 20083IACAT Seminar

The Mapping Problem Given a set of communicating parallel “entities”, map them onto physical processors Entities – Ranks in case of an MPI program – Objects in case of a Charm++ program Aim – Balance load – Minimize communication traffic October 8th, 20084IACAT Seminar

Target Machines 3D torus/mesh interconnects In a torus, diameter is reduced by half – More paths for adaptive routing Unlike fat-trees – Not asymptotically scalable October 8th, 2008IACAT Seminar5

Scale of Parallel Machines Blue Gene/P at ANL: 40,960 nodes, torus dimensions 32 x 32 x 40 (?) Jaguar (XT4) at ORNL: 8,064 nodes, torus dimensions 21 x 16 x 24 October 8th, 2008IACAT Seminar6

Outline The Mapping Problem Motivation for Topology Aware Mapping NAMD: Dynamic Unstructured Graph OpenAtom: Static Structured Graph Other Applications Automated Mapping Framework October 8th, 20087IACAT Seminar

Motivation Consider a 3D mesh/torus interconnect Message latencies can be modeled by (L f /B) x D + L/B October 8th, 20088IACAT Seminar

MPI Tests on Cray XT3 October 8th, 20089IACAT Seminar

OpenAtom Performance on Blue Gene/L October 8th, IACAT Seminar

OpenAtom Performance on Blue Gene/L October 8th, IACAT Seminar

Outline The Mapping Problem Motivation for Topology Aware Mapping NAMD: Dynamic Unstructured Graph OpenAtom: Static Structured Graph Other Applications Automated Mapping Framework October 8th, IACAT Seminar

Molecular Dynamics A system of [charged] atoms with bonds Use Newtonian Mechanics to find the positions and velocities of atoms Each time-step is typically in femto-seconds At each time step – calculate the forces on all atoms – calculate the velocities and move atoms around October 8th, 2008IACAT Seminar13

NAMD: NAnoscale Molecular Dynamics Naïve force calculation is O(N 2 ) Reduced to O(N logN) by calculating – Bonded forces – Non-bonded: using a cutoff radius Short-range: calculated every time step Long-range: calculated every fourth time-step (PME) October 8th, 2008IACAT Seminar14

Hybrid of spatial and force decomposition NAMD’s Parallel Design October 8th, 2008IACAT Seminar15

Parallelization using Charm++ October 8th, 2008IACAT Seminar16 Static Mapping Load Balancing Bhatele, A., Kumar, S., Mei, C., Phillips, J. C., Zheng, G. & Kale, L. V Overcoming Scaling Challenges in Biomolecular Simulations across Multiple Platforms. In Proceedings of IEEE International Parallel and Distributed Processing Symposium, Miami, FL, USA, April 2008.

Communication in NAMD Each patch multicasts its information to many computes Each compute is a target of two multicasts only Use ‘Proxies’ to send data to different computes on the same processor October 8th, 2008IACAT Seminar17

Load Balancing in Charm++ Principle of Persistence – Object communication patterns and computational loads tend to persist over time Measurement-based Load Balancing – Instrument computation time and communication volume at runtime – Use the database to make new load balancing decisions October 8th, 2008IACAT Seminar18

NAMD’s Load Balancing Strategy NAMD uses a dynamic centralized greedy strategy There are two schemes in play: – A comprehensive strategy (called once) – A refinement scheme (called several times during a run) Algorithm: – Pick a compute and find a “suitable” processor to place it on October 8th, 2008IACAT Seminar19

Choice of a suitable processor Among underloaded processors, try to: – Find a processor with the two patches or their proxies – Find a processor with one patch or a proxy – Pick any underloaded processor October 8th, 2008IACAT Seminar20 Highest Priority Lowest Priority

Load Balancing Metrics Load Balance: Bring Max-to-Avg Ratio close to 1 Communication Volume: Minimize the number of proxies Communication Traffic: Minimize hop bytes Hop-bytes = Message size X Distance traveled by message October 8th, 2008IACAT Seminar21 Agarwal, T., Sharma, A., Kale, L.V Topology-aware task mapping for reducing communication contention on large parallel machines, In Proceedings of IEEE International Parallel and Distributed Processing Symposium, Rhodes Island, Greece, April 2006.

Topology Aware Techniques Static Placement of Patches October 8th, 2008IACAT Seminar22

Topology Aware Techniques (contd.) Placement of computes October 8th, 2008IACAT Seminar23

Results: Hop-bytes October 8th, 2008IACAT Seminar24

Results: Performance October 8th, 2008IACAT Seminar25

Outline The Mapping Problem Motivation for Topology Aware Mapping NAMD: Dynamic Unstructured Graph OpenAtom: Static Structured Graph Other Applications Automated Mapping Framework October 8th, IACAT Seminar

OpenAtom Ab-Initio Molecular Dynamics code Communication is static and structured Challenge: Multiple groups of objects with conflicting communication patterns October 8th, IACAT Seminar

Parallelization using Charm++ October 8th, IACAT Seminar

Topology Mapping of Chare Arrays October 8th, IACAT Seminar

Visualization of Mapping Solutions Paraview October 8th, 2008IACAT Seminar30

Results on Blue Gene/P (ANL) October 8th, IACAT Seminar

Results on XT3 (BigBen) October 8th, IACAT Seminar

Outline The Mapping Problem Motivation for Topology Aware Mapping NAMD: Dynamic Unstructured Graph OpenAtom: Static Structured Graph Other Applications Automated Mapping Framework October 8th, IACAT Seminar

Other Applications Lattice QCD (Quantum Chromodynamics) Collaboration – Near-neighbor communication MILC (MIMD Lattice Computation) – Static Structured communication – Map 4D grid to 3D mesh October 8th, IACAT Seminar

Outline The Mapping Problem Motivation for Topology Aware Mapping NAMD: Dynamic Unstructured Graph OpenAtom: Static Structured Graph Other Applications Automated Mapping Framework October 8th, IACAT Seminar

Requirements Processor topology graph Communication Graph Efficient and scalable mapping algorithms – Heuristics in many cases October 8th, 2008IACAT Seminar36

Topology Manager API The application needs information such as – Dimensions of the partition – Rank to physical co-ordinates and vice-versa TopoManager API – On BG/L and BG/P: provides a wrapper for system calls – On XT3 and XT4, there are no such system calls – Extract this information in various ways and provide a clean interface to the application October 8th, IACAT Seminar

Cray XT3 (BigBen) October 8th, IACAT Seminar

Cray XT4 (Jaguar) October 8th, 2008IACAT Seminar39

Communication Scenarios Static Regular Graph: 3D Stencil, Matrix Multiplication, Structured Mesh applications such as MILC Static Irregular Graph: Unstructured Mesh applications Dynamic Communication Graph: MD codes, Unstructured Mesh codes with AMR October 8th, IACAT Seminar

Automatic Mapping Framework Obtain the communication graph (previous run, at runtime, compile time) Apply heuristic techniques for different scenarios to arrive at a mapping solution automatically October 8th, 2008IACAT Seminar41

Thanks! October 8th, 2008IACAT Seminar42