Slide 1: NSF/DARPA OPAAL: Adaptive Parallelization Strategies using Data-driven Objects (CPSD)
Laxmikant Kale
First Annual Review, 27-28 October 1999, Iowa City

Slide 2: Outline
- Quench and solidification codes
- Coarse-grain parallelization of the quench code
- Adaptive parallelization techniques
- Dynamic variations
- Adaptive load balancing
- Finite element framework with adaptivity
- Preliminary results

Slide 3: Coarse-grain parallelization
Structure of the current sequential quench code:
- 2-D array of elements (each independently refined)
- Dependence within each row
- Independent rows, but they share global variables
Parallelization using Charm++:
- About 3 hours of effort (after a false start)
- Roughly 20 lines of change to the F90 code
- A 100-line Charm++ wrapper
Observations:
- Global variables that are defined and used within inner-loop iterations are easily dealt with in Charm++, in contrast to OpenMP
- Dynamic load balancing is possible, but was unnecessary
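The wrapper itself is not shown in the slides. The following is a minimal sketch, under stated assumptions, of how such a wrapper could look: a 1D chare array with one object per row (typically far more rows than processors) that calls into the existing F90 kernel through a hypothetical routine process_row_. The sizes, names, and entry methods are illustrative, and a matching interface (.ci) file declaring Main, RowWorker, their entry methods, and the readonly mainProxy is assumed.

    // quench_wrap.C -- illustrative sketch only, not the project's code.
    // Assumes a quench.ci file declaring mainchare Main (entries Main,
    // rowDone), array [1D] RowWorker (entry compute), and readonly mainProxy.
    #include "quench.decl.h"

    /* readonly */ CProxy_Main mainProxy;

    // Hypothetical entry point into the existing F90 quench kernel: processes
    // all elements of one row. The trailing underscore follows the usual
    // Fortran name-mangling convention.
    extern "C" void process_row_(int *row, int *nelems);

    class Main : public CBase_Main {
      int pending;                       // rows still to finish
    public:
      Main(CkArgMsg *msg) {
        const int nrows = 64;            // assumed problem size, > number of PEs
        pending = nrows;
        mainProxy = thisProxy;
        CProxy_RowWorker rows = CProxy_RowWorker::ckNew(nrows);
        rows.compute();                  // broadcast: every row starts its work
        delete msg;
      }
      void rowDone() {                   // called by each RowWorker when finished
        if (--pending == 0) CkExit();
      }
    };

    class RowWorker : public CBase_RowWorker {
    public:
      RowWorker() {}
      RowWorker(CkMigrateMessage *) {}
      void compute() {
        int row = thisIndex;             // this chare owns one row
        int nelems = 128;                // assumed elements per row
        process_row_(&row, &nelems);     // reuse the sequential F90 code as-is
        mainProxy.rowDone();
      }
    };

    #include "quench.def.h"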

Slide 4: Performance results
Contributors:
- Engineering: N. Sobh, R. Haber
- Computer Science: M. Bhandarkar, R. Liu, L. Kale

Slide 5: OpenMP experience
Work by J. Hoeflinger and D. Padua, with N. Sobh, R. Haber, J. Dantzig, and N. Provatas.
Solidification code:
- Parallelized using OpenMP
- Relatively straightforward, after a key decision: parallelize by rows only

Slide 6: OpenMP experience (continued)
Quench code on the Origin 2000:
- Privatization of variables is needed, because the outer loop was parallelized
- Unexpected initial difficulties with OpenMP:
  - Led initially to a large slowdown in the parallelized code
  - Traced to unnecessary locking in the MATMUL intrinsic
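A minimal C++ sketch of the privatization issue, not taken from the quench code: when the outer (row) loop is the one parallelized, per-iteration work arrays must be given a private copy per thread, otherwise all threads race on the shared buffer. Array shapes and names here are illustrative.

    #include <omp.h>
    #include <vector>

    // Smooth each row of a 2-D field independently. 'scratch' stands in for a
    // module-level work array in the original F90 code; once the outer (row)
    // loop is parallelized, each thread needs its own copy, hence firstprivate.
    void smoothRows(std::vector<std::vector<double>> &field) {
      const int nrows = static_cast<int>(field.size());
      const int ncols = static_cast<int>(field[0].size());
      std::vector<double> scratch(ncols);

      #pragma omp parallel for firstprivate(scratch)
      for (int i = 0; i < nrows; ++i) {            // rows are independent
        for (int j = 1; j < ncols - 1; ++j)        // within-row dependence only
          scratch[j] = (field[i][j-1] + field[i][j] + field[i][j+1]) / 3.0;
        for (int j = 1; j < ncols - 1; ++j)
          field[i][j] = scratch[j];
      }
    }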

Slide 7: Adaptive strategies
Advanced codes model dynamic and irregular behavior:
- Solidification: adaptive grid refinement
- Quench: complex dependencies, parallelization within elements
To parallelize these effectively, adaptive runtime strategies are necessary.

Slide 8: Multi-partition decomposition
Idea: decompose the problem into a number of partitions, independent of the number of processors.
- Number of partitions > number of processors
- The system maps partitions to processors
- The system should be able to map and re-map objects as needed

Slide 9: Charm++
- A parallel C++ library
- Supports data-driven objects: singleton objects, object arrays, groups
- Many objects per processor, with method execution scheduled by the availability of data
- The system supports automatic instrumentation and object migration
- Works with other paradigms: MPI, OpenMP, ...
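For concreteness, here is a minimal sketch of a Charm++ interface (.ci) file showing the three kinds of parallel objects listed above: a singleton (main) chare, a 1D chare array, which is the natural fit for the multi-partition decomposition of the previous slide since it can have many more elements than processors, and a group, with exactly one object per processor. The module and entry names are illustrative, not from the OPAAL codes.

    // example.ci -- illustrative interface file
    mainmodule example {
      readonly CProxy_Main mainProxy;

      mainchare Main {                  // singleton chare: one instance
        entry Main(CkArgMsg *msg);
        entry void partitionDone();
      };

      array [1D] Partition {            // object array: one element per partition,
        entry Partition();              // usually many more partitions than PEs
        entry void compute();
      };

      group Instrument {                // group: exactly one object per processor
        entry Instrument();
        entry void report();
      };
    };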

Slide 10: Data-driven execution in Charm++ (diagram: per-processor scheduler driven by a message queue)
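The diagram itself is not reproduced in the transcript; the C++ sketch below conveys the idea it illustrates: each processor runs a scheduler that repeatedly picks a message from its queue and invokes the corresponding method on the addressed object, so execution order is driven by data availability rather than by a fixed program order. This is an illustration of the model only, not the actual Charm++ runtime code, and all names are invented.

    #include <map>
    #include <queue>

    // Illustrative only: the message-driven execution model in miniature.
    struct Message { int objectId; int methodId; /* payload ... */ };

    struct Object { virtual void invoke(int methodId, const Message &m) = 0; };

    void schedulerLoop(std::queue<Message> &msgQ,
                       std::map<int, Object*> &localObjects) {
      while (!msgQ.empty()) {            // the real scheduler also waits on the network
        Message m = msgQ.front();
        msgQ.pop();
        // The addressed object's method runs only once its data has arrived.
        localObjects.at(m.objectId)->invoke(m.methodId, m);
      }
    }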

Slide 11: Load balancing framework
Aimed at handling:
- Continuous (slow) load variation
- Abrupt load variation (e.g., on refinement)
- Workstation clusters in multi-user mode
Measurement based:
- Exploits temporal persistence of computation and communication structures
- Very accurate instrumentation (compared with estimation) is possible via Charm++/Converse
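The slides describe the framework at a high level. As a concrete illustration, reusing the illustrative Partition array from the interface sketch above, a Charm++ array element can participate in measurement-based load balancing roughly as sketched below: it opts in via usesAtSync, calls AtSync() at iteration boundaries, is resumed via ResumeFromSync() after any migrations, and provides a pup() routine so the runtime can move it. This is a sketch against the present-day Charm++ API with invented data members; the 1999 interface may have differed, and the iterate() entry is assumed to be declared in the .ci file.

    class Partition : public CBase_Partition {
      std::vector<double> state;           // illustrative per-partition data
    public:
      Partition() { usesAtSync = true; }   // opt in to AtSync-based balancing
      Partition(CkMigrateMessage *) {}

      void iterate() {
        // ... one step of local work, boundary exchange with neighbors ...
        AtSync();                          // hand measured loads to the balancer
      }
      void ResumeFromSync() {              // called after (possible) migration
        thisProxy[thisIndex].iterate();    // continue on whichever PE we are on
      }
      void pup(PUP::er &p) {               // lets the runtime migrate the object
        CBase_Partition::pup(p);
        p | state;                         // pup_stl.h handles std::vector
      }
    };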

Slide 12: Object balancing framework

Slide 13: Utility of the framework: workstation clusters
Scenario: a cluster of 8 machines; one machine gets another job, and the parallel job slows down on all machines.
Using the framework:
- Detection mechanism
- Migrate objects away from the overloaded processor
- Restored almost the original throughput

Slide 14: Performance on timeshared clusters
Another user logged in about 28 seconds into a parallel run on 8 workstations, and throughput dipped from 10 steps per second to 7. The load balancer intervened at 35 seconds and restored throughput to almost its initial value.

Slide 15: Utility of the framework: intrinsic load imbalance
To test the abilities of the framework:
- A simple problem: Gauss-Jacobi iterations
- Refine selected sub-domains
ConSpector, a web-based tool, is used to:
- Submit parallel jobs
- Monitor performance and application behavior
- Interact with running jobs via GUI interfaces
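For readers unfamiliar with the benchmark: a Gauss-Jacobi relaxation step replaces each interior grid point with the average of its four neighbors, iterating until the maximum change falls below a tolerance. The sketch below is a generic serial kernel of this kind, not the parallel benchmark code used in the study.

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // One Jacobi sweep over a 2-D grid: write new values into 'next' from 'cur'.
    // Returns the maximum pointwise change, used as the convergence measure.
    double jacobiSweep(const std::vector<std::vector<double>> &cur,
                       std::vector<std::vector<double>> &next) {
      double maxDiff = 0.0;
      for (size_t i = 1; i + 1 < cur.size(); ++i)
        for (size_t j = 1; j + 1 < cur[i].size(); ++j) {
          next[i][j] = 0.25 * (cur[i-1][j] + cur[i+1][j] +
                               cur[i][j-1] + cur[i][j+1]);
          maxDiff = std::max(maxDiff, std::fabs(next[i][j] - cur[i][j]));
        }
      return maxDiff;
    }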

Slide 16: AppSpector view of the load balancer on the synthetic Jacobi relaxation benchmark
Imbalance is introduced by interactively refining a subset of cells around t = 9 seconds. The resulting load imbalance brings utilization down to 80% from the peak of 96%. The load balancer kicks in around t = 16 seconds and restores utilization to around 94%.

Slide 17: Using the load balancing framework (architecture diagram)
Components shown in the diagram: Converse; Charm++; load database + balancer; MPI-on-Charm / Irecv+; automatic conversion from MPI; FEM and Structured frameworks; cross-module interpolation; the migration path and the framework path.

Slide 18: Example application: crack propagation (P. Geubelle et al.)
- Similar in structure to the quench components
- Original code: 1900 lines of F90
- Rewritten in C++ using the FEM framework
- The framework itself: 500 lines of code, reused by all applications
- Parallelization is handled completely by the framework

Slide 19: Crack propagation
Decomposition into 16 chunks (left) and 128 chunks, 8 per PE (right). The middle area contains cohesive elements. Both decompositions were obtained using Metis.
Pictures: S. Breitenfeld and P. Geubelle.

Slide 20: “Overhead” of the multi-partition method

Slide 21: Overhead study on 8 processors
When running on 8 processors, using multiple partitions per processor is also beneficial, due to better cache behavior.

Slide 22: Cross-approach comparison
Versions compared: the original MPI + F90 code, the Charm++ framework version (all C++), and F90 + Charm++ library.

Slide 23: Load balancer in action

Slide 24: Summary and planned research
Use the adaptive FEM framework:
- To parallelize the quench code further
- For the quadtree-based solidification code:
  - First phase: parallelize each phase separately
  - Then parallelize across refinement phases
Refine the FEM framework:
- Use feedback from applications
- Add support for implicit solvers and multigrid