Adaptive MPI
Chao Huang, Orion Lawlor, L. V. Kalé
Parallel Programming Lab, Department of Computer Science, University of Illinois at Urbana-Champaign

Motivation
- Challenges:
  - New-generation parallel applications are dynamically varying: load shifting, adaptive refinement
  - Typical MPI implementations are not naturally suitable for dynamic applications
  - The set of available processors may not match the number required by the algorithm
- Adaptive MPI:
  - Virtual MPI Processors (VPs)
  - Solutions and optimizations built on them

Outline
- Motivation
- Implementation
- Features and Experiments
- Current Status
- Future Work

Processor Virtualization
- Basic idea of processor virtualization:
  - The user specifies the interaction between objects (VPs)
  - The runtime system (RTS) maps VPs onto physical processors
  - Typically, # virtual processors > # physical processors
(Figure: user view vs. system implementation.)

AMPI: MPI with Virtualization
- Each MPI "process" is implemented as a virtual process: a user-level, migratable thread embedded in a Charm++ object
(Figure: MPI processes mapped onto real processors.)
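A minimal sketch of what this looks like from the user's side: an ordinary MPI program in C needs no source changes to run under AMPI; it is built with AMPI's compiler wrapper and launched with more virtual processors than physical ones. The build/run commands in the comment follow the AMPI manual (exact option names may vary by version); the program itself is a generic example, not taken from the slides.

```c
/* hello_ampi.c -- an unmodified MPI program; under AMPI each "rank"
 * becomes a user-level migratable thread inside a Charm++ object.
 *
 * Build and run (per the AMPI manual; options may vary by version):
 *   ampicc -o hello hello_ampi.c
 *   ./charmrun ./hello +p8 +vp64   # 8 physical processors, 64 virtual ones
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* rank of this virtual processor */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of VPs, not PEs   */
    printf("Hello from VP %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
```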

Adaptive Overlap
- Problem setup: 3D stencil calculation run on Lemieux, with virtualization ratios 1 and 8
(Figure: timelines for p = 8, vp = 8 and p = 8, vp = 64.)

Benefit of Adaptive Overlap
- Problem setup: the same 3D stencil calculation run on Lemieux; the figure compares AMPI with virtualization ratios of 1 and 8
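To make the overlap concrete, here is a hypothetical per-VP halo-exchange routine of the kind such a stencil benchmark would use (the actual benchmark code is not shown on the slides). It uses only standard MPI calls; with a virtualization ratio of 8, whenever one VP blocks in MPI_Waitall, the AMPI runtime schedules another VP on the same physical processor, so one VP's communication overlaps with its neighbors' computation.

```c
/* Hypothetical per-step exchange for a 1D-decomposed stencil.
 * Standard MPI only; AMPI supplies the overlap by switching to another
 * user-level thread (VP) while this one waits. */
#include <mpi.h>

void exchange_and_compute(double *ghost_lo, double *ghost_hi,
                          double *bound_lo, double *bound_hi, int n,
                          int lo_nbr, int hi_nbr, MPI_Comm comm)
{
    /* At domain boundaries, lo_nbr/hi_nbr can be MPI_PROC_NULL. */
    MPI_Request reqs[4];
    MPI_Irecv(ghost_lo, n, MPI_DOUBLE, lo_nbr, 0, comm, &reqs[0]);
    MPI_Irecv(ghost_hi, n, MPI_DOUBLE, hi_nbr, 1, comm, &reqs[1]);
    MPI_Isend(bound_lo, n, MPI_DOUBLE, lo_nbr, 1, comm, &reqs[2]);
    MPI_Isend(bound_hi, n, MPI_DOUBLE, hi_nbr, 0, comm, &reqs[3]);

    /* While this VP blocks here, the runtime runs other VPs on the
     * same processor -- this is the "adaptive overlap". */
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);

    /* ... update interior and boundary cells for the next step ... */
}
```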

Comparison with Native MPI
- Performance: slightly worse without optimization; being improved
- Flexibility:
  - AMPI runs on any number of PEs (e.g. 19, 33, 105); the native MPI version of this stencil needs a cube number of processors
  - Big runs fit on a few processors
  - Fits the nature of the algorithm
(Figure: 3D stencil calculation run on Lemieux.)

Automatic Load Balancing
- Problems:
  - Dynamically varying applications: load imbalance impacts overall performance
  - Jobs are difficult to move between processors
- Implementation: load balancing by migrating objects (VPs)
  - The RTS collects CPU and network usage of the VPs
  - A new mapping is computed based on the collected information
  - Threads are packed up and shipped as needed
  - Different variations of the strategy are available
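From the application's point of view, load balancing is typically triggered at iteration boundaries by a migration call. The sketch below is illustrative only: the call's name and signature have varied across AMPI releases (MPI_Migrate() with no arguments in older versions, AMPI_Migrate() with an MPI_Info hint in newer ones), and timestep() is a placeholder for the application's own work.

```c
/* Hypothetical main loop of an AMPI application that lets the runtime
 * rebalance load periodically. AMPI_Migrate and AMPI_INFO_LB_SYNC follow
 * the current AMPI manual; older releases used MPI_Migrate(). */
#include <mpi.h>

void timestep(int step);  /* application-specific work (assumed) */

void run(int nsteps)
{
    for (int step = 1; step <= nsteps; step++) {
        timestep(step);
        if (step % 20 == 0) {
            /* Collective hint: "now is a good time to migrate VPs".
             * The RTS may pack up this thread and ship it elsewhere. */
            AMPI_Migrate(AMPI_INFO_LB_SYNC);
        }
    }
}
```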

Load Balancing Example
- AMR application; refinement happens at step 25
- The load balancer is activated at time steps 20, 40, 60, and 80

Collective Operations
- Problem with collective operations:
  - Complex: they involve many processors
  - Time-consuming: designed as blocking calls in MPI
(Figure: time breakdown of a 2D FFT benchmark, in ms.)

Collective Communication Optimization
- Computation is only a small proportion of the elapsed time
- A number of optimization techniques have been developed to improve collective communication performance
(Figure: time breakdown of an all-to-all operation using the Mesh library.)

Asynchronous Collectives
- VPs are implemented as user-level threads
- Computation is overlapped with the waiting time of collective operations
- Total completion time is reduced
(Figure: time breakdown of the 2D FFT benchmark, in ms.)
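AMPI offered non-blocking collectives (an immediate all-to-all returning a request) before they entered the MPI standard; MPI-3 now provides the same pattern as MPI_Ialltoall. A minimal sketch of the idea using the standard MPI-3 call, with compute_independent_work() standing in for whatever work does not depend on the transpose:

```c
/* Overlapping a 2D-FFT-style transpose (all-to-all) with independent work.
 * MPI_Ialltoall is the MPI-3 standard form of the asynchronous collective
 * that AMPI originally provided as an extension. */
#include <mpi.h>

void compute_independent_work(void);  /* hypothetical work not needing the transpose */

void transpose_with_overlap(const double *sendbuf, double *recvbuf,
                            int block, MPI_Comm comm)
{
    MPI_Request req;
    MPI_Ialltoall(sendbuf, block, MPI_DOUBLE,
                  recvbuf, block, MPI_DOUBLE, comm, &req);

    compute_independent_work();          /* runs while the all-to-all progresses */

    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* transpose complete; recvbuf is ready */
}
```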

Shrink/Expand
- Problem: the availability of the computing platform may change during a run
- The application is fitted onto the changed platform by object migration
(Figure: time per step for the million-row CG solver on a 16-node cluster; an additional 16 nodes become available at step 600.)

Summary of Advantages
- Run MPI programs on any number of PEs
- Overlap of computation with communication
- Automatic load balancing
- Asynchronous collective operations
- Shrink/expand
- Automatic checkpoint/restart mechanism
- Cross-communicators
- Interoperability

Application Example: GEN2
(Component diagram of the GEN2 rocket simulation code built on MPI/AMPI: Rocpanda, Rocblas, Rocface, Rocman, Roccom, Rocflo-MP, Rocflu-MP, Rocsolid, Rocfrac, and Rocburn2D (ZN, APN, PY), plus the mesh and partitioning tools Truegrid, Tetmesh, Metis, Gridgen, and Makeflo.)

Future Work
- Performance prediction via direct simulation: performance tuning without continuous access to a large machine
- Support for visualization: facilitating debugging and performance tuning
- Support for the MPI-2 standard: ROMIO as the parallel I/O library; one-sided communication

Thank You!
AMPI is available for free download from the Parallel Programming Lab at the University of Illinois.