Parallelization Strategies
Laxmikant Kale

Overview
– OpenMP strategies
– Need for adaptive strategies
  – Object-migration-based dynamic load balancing
  – Minimal-modification strategies
– Thread-based techniques: ROCFLO, ...
– Some future plans

OpenMP
Motivation:
– The shared-memory model is often easy to program
– Incremental optimization is possible
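
For example, a serial loop can often be parallelized with a single directive, leaving the surrounding code untouched. This is a generic illustration (not taken from ROCFLO); the function name is ours:

    #include <omp.h>
    #include <vector>

    // The serial code is unchanged except for one OpenMP directive,
    // which is the sense in which optimization is "incremental".
    double dot(const std::vector<double>& a, const std::vector<double>& b) {
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (long i = 0; i < (long)a.size(); ++i)
            sum += a[i] * b[i];
        return sum;
    }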

ROCFLO via OpenMP
Parallelization of ROCFLO using a loop-parallel paradigm via OpenMP:
– Poor speedup compared with the MPI version
– Was locality the culprit?
Study conducted by Jay Hoeflinger
– In collaboration with Fady Najjar

ROCFLO with MPI (figure)

The Methodology
Do OpenMP/MPI comparison experiments:
– Write an OpenMP version of ROCFLO
  – Start with the MPI version of ROCFLO
  – Duplicate the structure of the MPI code exactly (including the message-passing calls)
  – This removes locality as a problem
– Measure performance
  – If any parts do not scale well, determine why
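
The structural duplication can be pictured as follows. This is a minimal sketch of the idea, not the actual ROCFLO code: each OpenMP thread plays the role of an MPI rank, and "messages" travel through shared mailboxes, so send/recv call sites in the MPI source map one-to-one onto the OpenMP version. All names here (Mailbox, shm_send, shm_recv) are illustrative:

    #include <omp.h>
    #include <cstring>
    #include <vector>

    struct Mailbox { std::vector<double> buf; bool full = false; omp_lock_t lock; };

    static std::vector<Mailbox> boxes;  // one slot per (sender, receiver) pair
    static int nth;

    void shm_init(int nthreads) {
        nth = nthreads;
        boxes.resize(nth * nth);
        for (auto& m : boxes) omp_init_lock(&m.lock);
    }

    void shm_send(int src, int dst, const double* data, int n) {
        Mailbox& m = boxes[src * nth + dst];
        for (;;) {                              // wait until the slot is free
            omp_set_lock(&m.lock);
            if (!m.full) break;
            omp_unset_lock(&m.lock);
        }
        m.buf.assign(data, data + n);
        m.full = true;
        omp_unset_lock(&m.lock);
    }

    void shm_recv(int src, int dst, double* out, int n) {
        Mailbox& m = boxes[src * nth + dst];
        for (;;) {                              // wait until a message arrives
            omp_set_lock(&m.lock);
            if (m.full) break;
            omp_unset_lock(&m.lock);
        }
        std::memcpy(out, m.buf.data(), n * sizeof(double));
        m.full = false;
        omp_unset_lock(&m.lock);
    }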

Barrier Cost: MPI vs OpenMP (Origin 2000) (figure)

So locality was not the whole problem!
The other problems turned out to be:
– I/O, which doesn't scale
– ALLOCATE, which doesn't scale
– our non-scaling reduction implementation
– our first-cut messaging infrastructure, which could be improved
Conclusion:
– An efficient loop-parallel version may be feasible, avoiding ALLOCATEs and using scalable I/O

Need for adaptive strategies
Computation structure changes over time:
– Combustion
Adaptive techniques in application codes:
– Adaptive refinement in structures or even fluids
– Other codes such as crack propagation
These can affect the load balance dramatically:
– One can go from 90% efficiency to less than 25%

Multi-partition decompositions
Idea: decompose the problem into a number of partitions,
– independent of the number of processors
– # partitions > # processors
The system maps partitions to processors
– The system should be able to map and re-map objects as needed
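
In code, over-decomposition just means the mapping from work to processors becomes explicit data that the runtime owns. A minimal sketch, with illustrative names:

    #include <vector>

    // map[i] = processor that currently owns partition i. The application
    // chooses nparts (e.g. 64) independently of nprocs (e.g. 8); the
    // runtime is free to change this table later.
    std::vector<int> initial_map(int nparts, int nprocs) {
        std::vector<int> map(nparts);
        for (int i = 0; i < nparts; ++i)
            map[i] = i % nprocs;      // simple round-robin placement
        return map;
    }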

Load Balancing Framework
Aimed at handling:
– Continuous (slow) load variation
– Abrupt load variation (refinement)
– Workstation clusters in multi-user mode
Measurement based:
– Exploits temporal persistence of computation and communication structures
– Very accurate (compared with estimation)
– Automatic instrumentation possible via Charm++/Converse
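
A measurement-based rebalancer can be sketched as follows: because of temporal persistence, the load each partition imposed during the last measurement interval is a good predictor for the next one, so a greedy pass over the measured loads suffices. This is an illustrative sketch, not the framework's actual strategy code:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    void remap(std::vector<int>& map,                // partition -> processor
               const std::vector<double>& measured,  // measured load per partition
               int nprocs) {
        std::vector<double> procLoad(nprocs, 0.0);
        for (std::size_t i = 0; i < map.size(); ++i)
            procLoad[map[i]] += measured[i];

        int heavy = (int)(std::max_element(procLoad.begin(), procLoad.end()) - procLoad.begin());
        int light = (int)(std::min_element(procLoad.begin(), procLoad.end()) - procLoad.begin());

        for (std::size_t i = 0; i < map.size(); ++i) {
            if (map[i] != heavy) continue;
            // Migrate partition i only if doing so reduces the imbalance.
            if (procLoad[heavy] - measured[i] >= procLoad[light] + measured[i]) {
                map[i] = light;
                break;               // one migration per call; iterate to converge
            }
        }
    }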

Charm++
A parallel C++ library:
– Supports data-driven objects
– Many objects per processor, with method execution scheduled by the availability of data
– The system supports automatic instrumentation and object migration
– Works with other paradigms: MPI, OpenMP, ...
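
A flavor of the model, as a minimal sketch (the module name, worker count, and entry methods are ours, not from any production code). Chares are declared in an interface (.ci) file and implemented in C++; the runtime schedules each work() invocation when its message arrives, on whichever processor currently holds that object:

    // hello.ci -- interface file
    mainmodule hello {
      readonly CProxy_Main mainProxy;
      mainchare Main {
        entry Main(CkArgMsg* m);
        entry void done();
      };
      array [1D] Worker {
        entry Worker();
        entry void work(int step);
      };
    };

    // hello.cpp -- implementation
    #include "hello.decl.h"

    CProxy_Main mainProxy;

    class Main : public CBase_Main {
      int count = 0;
    public:
      Main(CkArgMsg* m) {
        delete m;
        mainProxy = thisProxy;
        CProxy_Worker workers = CProxy_Worker::ckNew(8);  // 8 objects, any #PEs
        workers.work(0);                                  // broadcast
      }
      void done() { if (++count == 8) CkExit(); }
    };

    class Worker : public CBase_Worker {
    public:
      Worker() {}
      Worker(CkMigrateMessage*) {}   // constructor invoked after migration
      void work(int step) {
        CkPrintf("Worker %d on PE %d\n", thisIndex, CkMyPe());
        mainProxy.done();
      }
    };

    #include "hello.def.h"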

Load balancing framework (figure)

Load balancing demonstration
To test the abilities of the framework:
– A simple problem: Gauss-Jacobi iterations
– Refine selected sub-domains
AppSpector: a web-based tool
– Submit parallel jobs
– Monitor performance and application behavior
– Interact with running jobs via GUI interfaces
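
The demonstration kernel itself is simple. Here is a serial sketch of one Gauss-Jacobi sweep over a 2D grid (in the demonstration, each sub-domain is one migratable object, and refined sub-domains simply take longer per sweep):

    #include <algorithm>
    #include <cmath>
    #include <vector>

    using Grid = std::vector<std::vector<double>>;

    // Average each interior point's four neighbors; the return value is
    // the largest change, usable as a convergence measure.
    double jacobi_sweep(const Grid& in, Grid& out) {
        double maxDiff = 0.0;
        for (std::size_t i = 1; i + 1 < in.size(); ++i)
            for (std::size_t j = 1; j + 1 < in[i].size(); ++j) {
                out[i][j] = 0.25 * (in[i-1][j] + in[i+1][j] +
                                    in[i][j-1] + in[i][j+1]);
                maxDiff = std::max(maxDiff, std::abs(out[i][j] - in[i][j]));
            }
        return maxDiff;
    }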

Adaptivity with minimal modification
The current code base is parallel (MPI):
– But it doesn't support adaptivity directly
– Rewrite the code with objects? ...
Idea: support adaptivity with minimal changes to F90/MPI codes
Work by:
– Milind Bhandarkar, Jay Hoeflinger, Eric de Sturler

Migratable threads approach
Change required:
– Encapsulate global variables in modules
  – Dynamically allocatable
Intercept MPI calls:
– Implement them in a multithreaded layer
Run each original MPI process as a thread:
– A user-level thread
Migrate threads as needed by load balancing:
– A trickier problem than object migration
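
The interception idea can be sketched as follows. This is not the actual layer's code: the Scheduler type and its operations are hypothetical stand-ins for a user-level thread package, though the MPI_Send/MPI_Recv signatures are the standard ones. Because a blocked receive suspends only the calling user-level thread, other threads on the same processor keep running:

    #include <cstddef>

    typedef int MPI_Datatype;   // stand-ins; the real layer supplies mpi.h
    typedef int MPI_Comm;
    struct MPI_Status { int MPI_SOURCE, MPI_TAG, MPI_ERROR; };

    struct Scheduler {          // hypothetical user-level thread runtime
        void deliver(int dst, const void* buf, std::size_t bytes, int tag);
        bool probe(int src, int tag);
        void copy_out(int src, int tag, void* buf, std::size_t bytes);
        void yield();           // switch to another user-level thread
        std::size_t type_size(MPI_Datatype t);
    };
    extern Scheduler sched;

    // Intercepted MPI_Send: enqueue the message for the destination thread.
    extern "C" int MPI_Send(const void* buf, int count, MPI_Datatype type,
                            int dest, int tag, MPI_Comm comm) {
        sched.deliver(dest, buf, count * sched.type_size(type), tag);
        return 0;
    }

    // Intercepted MPI_Recv: suspend this thread (not the processor)
    // until a matching message arrives, then copy it out.
    extern "C" int MPI_Recv(void* buf, int count, MPI_Datatype type,
                            int source, int tag, MPI_Comm comm,
                            MPI_Status* status) {
        while (!sched.probe(source, tag))
            sched.yield();
        sched.copy_out(source, tag, buf, count * sched.type_size(type));
        if (status) { status->MPI_SOURCE = source; status->MPI_TAG = tag; }
        return 0;
    }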

Progress:
– Test Fortran 90 / C++ interface
– Encapsulation feasibility
– Thread migration mechanics
– ROCFLO study: test code implementation
– ROCFLO implementation

Another approach to adaptivity
Cleanly separate parallel and sequential code:
– All parallel code in C++
– All application code in Fortran 90 sequential subroutines
Needs more restructuring of application codes:
– But is feasible, especially for new codes
– Much easier to migrate
– Improves modularity
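
The separation can be sketched like this; the subroutine name and argument list are hypothetical, and the trailing underscore follows a common Fortran name-mangling convention:

    #include <vector>

    // The Fortran 90 side: a purely sequential numerical subroutine.
    extern "C" void compute_block_(double* u, int* nx, int* ny, double* dt);

    // The C++ side: a (migratable) parallel object that owns one block.
    class BlockSolver {
        std::vector<double> u_;   // this block's field data
        int nx_, ny_;
    public:
        BlockSolver(int nx, int ny) : u_(nx * ny, 0.0), nx_(nx), ny_(ny) {}

        void step(double dt) {
            // Parallel work (ghost exchange with neighbor objects) stays
            // in C++; the local numerics are handed to Fortran.
            compute_block_(u_.data(), &nx_, &ny_, &dt);
        }
    };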