Adaptive MPI Tutorial
Chao Huang
Parallel Programming Laboratory, University of Illinois

Motivation

Challenges:
- New-generation parallel applications are dynamically varying: load shifting, adaptive refinement
- Typical MPI implementations are not naturally suited to dynamic applications
- The set of available processors may not match the natural expression of the algorithm

AMPI: MPI with virtualization (virtual processors, VPs)

Notes: Computational hotspots move over time, causing load imbalance. Virtual processors wrap around if there are not enough PEs.

Outline
- MPI basics
- Charm++/AMPI introduction
- How to write an AMPI program
  - Running with virtualization
  - How to convert an MPI program
- Using AMPI features
  - Automatic load balancing
  - Checkpoint/restart mechanism
  - Interoperability with Charm++
  - ELF and global variables

MPI Basics
- Standardized message-passing interface: passing messages between processes
- The standard contains the technical features proposed for the interface
- Minimally, 6 basic routines:

    int MPI_Init(int *argc, char ***argv)
    int MPI_Finalize(void)
    int MPI_Comm_size(MPI_Comm comm, int *size)
    int MPI_Comm_rank(MPI_Comm comm, int *rank)
    int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
    int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)

Notes: The MPI Forum was founded in 1993 and released Standard 1.0 in 1994, 1.1 in 1995, and 2.0 in 1997, by vote. Our department head Marc Snir is a major contributor to the standard. The status argument carries error information for (potentially) multiple requests.

MPI Basics
- MPI-1.1 contains 128 functions in 6 categories:
  - Point-to-Point Communication
  - Collective Communication
  - Groups, Contexts, and Communicators
  - Process Topologies
  - MPI Environmental Management
  - Profiling Interface
- Language bindings: Fortran, C
- 20+ implementations reported

Notes: A C++ binding does not appear until MPI-2.0.

MPI Basics
- The MPI-2 standard contains:
  - Further corrections and clarifications of the MPI-1 document
  - Completely new types of functionality:
    - Dynamic processes
    - One-sided communication
    - Parallel I/O
  - Added bindings for Fortran 90 and C++
  - Many new functions: 188 for the C binding

Example: Hello World!

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int size, myrank;
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
        printf("[%d] Hello, parallel world!\n", myrank);
        MPI_Finalize();
        return 0;
    }

Example: Send/Recv

    ...
    double a[2] = {0.3, 0.5};
    double b[2] = {0.7, 0.9};
    MPI_Status sts;

    if (myrank == 0) {
        MPI_Send(a, 2, MPI_DOUBLE, 1, 17, MPI_COMM_WORLD);
    } else if (myrank == 1) {
        MPI_Recv(b, 2, MPI_DOUBLE, 0, 17, MPI_COMM_WORLD, &sts);
    }
    ...

[Demo…]

Charm++
- Basic idea: processor virtualization
  - The user specifies the interaction between objects (VPs)
  - The RTS maps VPs onto physical processors
  - Typically, # of virtual processors > # of processors
- (Figure: system implementation vs. user view)
- Number of chares:
  - Independent of the number of processors
  - Typically larger than the number of processors
- Seeking the optimal division of labor between the "system" and the programmer: decomposition is done by the programmer; everything else, such as mapping and scheduling, is automated

Charm++
- Charm++ features:
  - Data-driven objects
  - Asynchronous method invocation
  - Mapping of multiple objects per processor
  - Load balancing, both static and at run time
  - Portability
- TCharm features:
  - User-level threads that do not block the CPU
  - Lightweight: context-switch time ~1 μs
  - Migratable threads

AMPI: MPI with Virtualization
- Each virtual process is implemented as a user-level thread embedded in a Charm++ object
- MPI "processes" are implemented as virtual processors (user-level migratable threads) mapped onto the real processors

Comparison with Native MPI
- Performance
  - Slightly worse without optimization; being improved
- Flexibility
  - Big runs on a few processors
  - Fits the nature of the algorithm

Notes: Two points in the comparison. (1) Performance is slightly worse because AMPI here runs on top of MPI, and the additional advantages of AMPI are not exercised in this benchmark; part of the overhead is offset by caching effects. (2) Flexibility in the number of processors: the problem is a 3D stencil calculation of size 240^3 run on Lemieux. AMPI runs on any number of PEs (e.g., 19, 33, 105), while native MPI needs a cube number (27, 64, 125, 216, 512).

Build Charm++/AMPI
- Download: http://charm.cs.uiuc.edu/download.html (please register for better support)
- Build Charm++/AMPI:
    > ./build <target> <version> <options> [charmc-options]
- To build AMPI:
    > ./build AMPI net-linux -g
- The version names your communication network and OS; charmc-options are compiler flags such as -g

How to write an AMPI program (1)
- Write your normal MPI program, and then link and run it with Charm++:
  - Build Charm++ with the AMPI target
  - Compile and link with charmc (or the wrappers charm/bin/mpicc|mpiCC|mpif77|mpif90):
      > charmc -o hello hello.c -language ampi
  - Run with charmrun:
      > charmrun hello

How to write an AMPI program (1)
- Now we can run an MPI program with Charm++
- Demo: Hello World!

How to write an AMPI program (2)
- Do not use global variables
- Global variables are dangerous in multithreaded programs: they are shared by all the threads on a processor and can be changed by another thread

    Thread 1              Thread 2
    count = 1
    block in MPI_Recv
                          count = 2
    b = count

Notes: Example of two threads sharing one counter: Thread 1 sets it, blocks in MPI_Recv, Thread 2 overwrites it, and Thread 1 then reads the incorrect count. A minimal sketch of this hazard follows.
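A minimal C sketch of the hazard above, under stated assumptions: the variable name count, the ranks, and the message used to force the block are illustrative, not taken from the tutorial. Two AMPI ranks co-located on one processor share the process-wide global, so the value rank 0 reads after its receive may have been overwritten by rank 1.

    /* Hazard sketch: a process-wide global shared by co-located AMPI ranks. */
    #include <stdio.h>
    #include <mpi.h>

    int count = 0;                /* global: shared by all VPs on a processor */

    int main(int argc, char *argv[])
    {
        int rank, dummy = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            count = 1;            /* rank 0 writes the global ...             */
            MPI_Recv(&dummy, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);  /* ... then blocks; the scheduler may
                                             run rank 1 on the same processor */
            printf("rank 0 sees count = %d (expected 1)\n", count);
        } else if (rank == 1) {
            count = 2;            /* rank 1 silently overwrites rank 0's value */
            MPI_Send(&dummy, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }

With regular MPI processes each rank has its own copy of count; under AMPI virtualization the two ranks may be threads in the same process, which is exactly why globals must be removed.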

How to write an AMPI program (2)
- Now we can run multiple threads on one processor
- Run with many virtual processors using the +vp command-line option:
    > charmrun +p3 hello +vp8
- Demo: Hello Parallel World!
- Demo: 2D Jacobi Relaxation

How to convert an MPI program
- Remove global variables: pack them into a struct/TYPE or class, allocated on the heap or the stack

  Original Code:
    MODULE shareddata
      INTEGER :: myrank
      DOUBLE PRECISION :: xyz(100)
    END MODULE

  AMPI Code:
    MODULE shareddata
      TYPE chunk
        INTEGER :: myrank
        DOUBLE PRECISION :: xyz(100)
      END TYPE
    END MODULE

Notes: Avoid large stack data allocations, because the stack size in AMPI is fixed and does not grow at run time; it can be specified with a command-line option.

How to convert an MPI program

  Original Code:
    PROGRAM MAIN
    USE shareddata
    include 'mpif.h'
    INTEGER :: i, ierr
    CALL MPI_Init(ierr)
    CALL MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)
    DO i = 1, 100
      xyz(i) = i + myrank
    END DO
    CALL subA
    CALL MPI_Finalize(ierr)
    END PROGRAM

  AMPI Code:
    SUBROUTINE MPI_Main
    USE shareddata
    USE AMPI
    INTEGER :: i, ierr
    TYPE(chunk), pointer :: c
    CALL MPI_Init(ierr)
    ALLOCATE(c)
    CALL MPI_Comm_rank(MPI_COMM_WORLD, c%myrank, ierr)
    DO i = 1, 100
      c%xyz(i) = i + c%myrank
    END DO
    CALL subA(c)
    CALL MPI_Finalize(ierr)
    END SUBROUTINE

Notes: The AMPI version could use stack space by removing the pointer and the ALLOCATE, but the stack size is limited.

How to convert an MPI program

  Original Code:
    SUBROUTINE subA
    USE shareddata
    INTEGER :: i
    DO i = 1, 100
      xyz(i) = xyz(i) + 1.0
    END DO
    END SUBROUTINE

  AMPI Code:
    SUBROUTINE subA(c)
    USE shareddata
    TYPE(chunk) :: c
    INTEGER :: i
    DO i = 1, 100
      c%xyz(i) = c%xyz(i) + 1.0
    END DO
    END SUBROUTINE

C examples can be found in the manual; a hedged C sketch of the same conversion follows.
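The following C sketch mirrors the Fortran conversion above. It is a sketch only: the names chunk and subA are carried over for readability, and the exact C example in the AMPI manual may differ.

    /* Illustrative C analogue of the Fortran conversion above. */
    #include <stdlib.h>
    #include <mpi.h>

    typedef struct chunk {            /* former globals packed into one object */
        int myrank;
        double xyz[100];
    } chunk;

    static void subA(chunk *c)        /* routines now take the chunk explicitly */
    {
        for (int i = 0; i < 100; i++)
            c->xyz[i] += 1.0;
    }

    int main(int argc, char *argv[])
    {
        MPI_Init(&argc, &argv);

        chunk *c = malloc(sizeof(chunk));   /* heap allocation keeps the
                                               fixed-size thread stack small */
        MPI_Comm_rank(MPI_COMM_WORLD, &c->myrank);
        for (int i = 0; i < 100; i++)
            c->xyz[i] = (i + 1) + c->myrank;

        subA(c);

        free(c);
        MPI_Finalize();
        return 0;
    }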

How to run an AMPI program
- Use virtual processors: run with the +vp option
- Specify the stack size with the +tcharm_stacksize option (see the example run line below)
- Demo: big stack
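For example, a run line combining these options might look like the following; the processor count, VP count, and stack size (assumed to be given in bytes) are illustrative values, not recommendations:

    > ./charmrun pgm +p2 +vp8 +tcharm_stacksize 100000000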

How to convert an MPI program
- Fortran program entry point: the main program becomes MPI_Main

    program pgm        →        subroutine MPI_Main
    ...                         ...

AMPI Extensions
- Automatic load balancing
- Non-blocking collectives
- Checkpoint/restart mechanism
- Multi-module programming
- ELF and global variables

Automatic Load Balancing
- Load imbalance in dynamic applications hurts performance
- Automatic load balancing: MPI_Migrate()
  - A collective call informing the load balancer that the thread is ready to be migrated, if needed
  - If a load balancer is present: first sizing, then packing on the source processor; the stack and pupped data are sent to the destination; the data is unpacked on the destination processor

Notes: MPI_Migrate() informs the load-balancing system that you are ready to be migrated, if needed. If the system decides to migrate you, the pup function passed to TCharm Register is first called with a sizing pupper, then with a packing and deleting pupper. Your stack and pupped data are then sent to the destination machine, where your pup function is called with an unpacking pupper, and TCharm Migrate returns. It can only be called from the parallel context. A usage sketch follows.
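A minimal sketch of where MPI_Migrate() is typically placed; the iteration structure, LB_PERIOD, and compute_step are illustrative, and the call is assumed to take no arguments as in the AMPI interface of this era (later AMPI versions rename such extensions with an AMPI_ prefix).

    /* Sketch: periodic load-balancing call inside an iterative AMPI solver. */
    #include <mpi.h>

    #define LB_PERIOD 20              /* illustrative: balance every 20 steps */

    void compute_step(int iter);      /* hypothetical application kernel      */

    void run_solver(int niter)
    {
        for (int iter = 1; iter <= niter; iter++) {
            compute_step(iter);       /* application work                     */

            if (iter % LB_PERIOD == 0)
                MPI_Migrate();        /* collective: this thread may be moved */
        }
    }

Every rank must reach the MPI_Migrate() call, since it is collective.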

Automatic Load Balancing
- To use the automatic load-balancing module:
  - Link with the LB modules:
      > charmc pgm hello.o -language ampi -module EveryLB
  - Run with the +balancer option:
      > ./charmrun pgm +p4 +vp16 +balancer GreedyCommLB

Automatic Load Balancing
- The link-time flag -memory isomalloc makes heap-data migration transparent
  - A special memory allocation mode that gives allocated memory the same virtual address on all processors
  - Ideal on 64-bit machines
  - Fits most cases and is highly recommended (an example link line is sketched below)
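For example, a link line using isomalloc might look like this (file names are illustrative; -memory isomalloc is the flag named above):

    > charmc -o pgm pgm.o -language ampi -memory isomalloc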

Automatic Load Balancing
- Limitations of isomalloc:
  - Memory waste: 4 KB minimum granularity, so avoid small allocations
  - Limited address space on 32-bit machines
- Alternative: a PUPer

Notes: On a 32-bit machine, of the 4 GB address space, after the OS and conventional heap allocation roughly ~1 GB of virtual address space remains; divided over 20 PEs, that is about 50 MB per PE.

Automatic Load Balancing
- Group your global variables into a data structure
- Write a Pack/UnPack routine (a.k.a. a PUPer):
    heap data –(pack)–> network message –(unpack)–> heap data
- Demo: load balancing (a hedged PUPer sketch follows)
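A hedged sketch of a PUPer for the chunk structure used earlier, assuming the classic AMPI C interface (MPI_Register and the pup_c.h routines); exact names and signatures may differ between Charm++/AMPI versions.

    /* Sketch: pack/unpack routine the runtime calls around a migration. */
    #include <stdlib.h>
    #include "pup_c.h"

    typedef struct chunk {
        int myrank;
        double xyz[100];
    } chunk;

    /* Called with a sizing, packing (and deleting), or unpacking pupper. */
    void chunk_pup(pup_er p, void *data)
    {
        chunk **cpp = (chunk **)data;
        if (pup_isUnpacking(p))              /* destination: allocate first  */
            *cpp = (chunk *)malloc(sizeof(chunk));
        pup_int(p, &(*cpp)->myrank);
        pup_doubles(p, (*cpp)->xyz, 100);
        if (pup_isDeleting(p))               /* source, after packing: free  */
            free(*cpp);
    }

    /* Registration, e.g. right after MPI_Init (assumed interface):
     *   chunk *c = malloc(sizeof(chunk));
     *   MPI_Register(&c, chunk_pup);
     */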

Collective Operations
- Problems with collective operations:
  - Complex: they involve many processors
  - Time consuming: designed as blocking calls in MPI
- (Figure: time breakdown of the 2D FFT benchmark, in ms)

Notes: Non-blocking collectives are a powerful idea used in Charm++, now applied to AMPI too. The idea was first implemented in IBM MPI, but because AMPI is based on threads and capable of migration, it has more flexibility to exploit the overlap. A specially designed 2D FFT is used as the benchmark.

Collective Communication Optimization
- (Figure: time breakdown of an all-to-all operation using the Mesh library)
- Computation is only a small proportion of the elapsed time
- A number of optimization techniques have been developed to improve collective communication performance

Asynchronous Collectives
- Our implementation is asynchronous:
  - The collective operation is posted
  - Test/wait for its completion
  - Meanwhile, useful computation can utilize the CPU

    MPI_Ialltoall( … , &req);
    /* other computation */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

A fuller hedged example follows.
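A slightly fuller sketch of the overlap, assuming the MPI-3-style MPI_Ialltoall argument list (AMPI provided this call as an extension before it was standardized); the buffer sizes and do_independent_work are illustrative.

    /* Sketch: overlap an all-to-all with computation that does not need it. */
    #include <stdlib.h>
    #include <mpi.h>

    void do_independent_work(void);   /* hypothetical: work not touching recvbuf */

    void exchange_and_compute(void)
    {
        int nprocs;
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        double *sendbuf = malloc(nprocs * sizeof(double));
        double *recvbuf = malloc(nprocs * sizeof(double));
        for (int i = 0; i < nprocs; i++)
            sendbuf[i] = i;

        MPI_Request req;
        MPI_Ialltoall(sendbuf, 1, MPI_DOUBLE,
                      recvbuf, 1, MPI_DOUBLE, MPI_COMM_WORLD, &req);

        do_independent_work();             /* CPU stays busy while the        */
                                           /* all-to-all progresses           */

        MPI_Wait(&req, MPI_STATUS_IGNORE); /* recvbuf is valid only after this */

        free(sendbuf);
        free(recvbuf);
    }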

Asynchronous Collectives
- (Figure: time breakdown of the 2D FFT benchmark, in ms)
- VPs are implemented as threads
- Computation is overlapped with the waiting time of collective operations
- Total completion time is reduced

Notes: The reduction in completion time is not necessarily equal to the amount of time overlapped.

Checkpoint/Restart Mechanism
- Large-scale machines suffer from failures
- Checkpoint/restart mechanism:
  - The state of the application is checkpointed to disk files
  - Capable of restarting on a different number of PEs
  - Facilitates future work on fault tolerance

Checkpoint/Restart Mechanism
- Checkpoint with a collective call: MPI_Checkpoint(DIRNAME)
  - Synchronous checkpoint (a usage sketch follows)
- Restart with a run-time option:
    > ./charmrun pgm +p4 +vp16 +restart DIRNAME
- Demo: checkpoint/restart an AMPI program
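A minimal sketch of periodic checkpointing, assuming MPI_Checkpoint takes the directory name as a string; the interval, directory name, and compute_step are illustrative.

    /* Sketch: synchronous checkpoint every CKPT_PERIOD iterations. */
    #include <mpi.h>

    #define CKPT_PERIOD 100               /* illustrative checkpoint interval */

    void compute_step(int iter);          /* hypothetical application kernel  */

    void run_with_checkpoints(int niter)
    {
        for (int iter = 1; iter <= niter; iter++) {
            compute_step(iter);

            if (iter % CKPT_PERIOD == 0)
                MPI_Checkpoint("ckpt_dir");  /* collective: every VP calls it */
        }
    }

The matching restart uses the +restart option shown above with the same directory.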

Interoperability with Charm++
- Charm++ has a collection of supporting libraries
- You can make use of them by running Charm++ code inside AMPI programs
- Conversely, MPI code can be run inside Charm++ programs
- Demo: interoperability with Charm++

ELF and Global Variables
- Global variables are not thread-safe; can we switch the global variables when we switch threads?
- The Executable and Linking Format (ELF):
  - The executable has a Global Offset Table (GOT) containing global data
  - The GOT pointer is stored in the %ebx register
  - Switch this pointer when switching between threads
  - Supported on Linux, Solaris 2.x, and more
- Integrated in Charm++/AMPI; invoked with the compile-time option -swapglobals (see the example build line below)
- Demo: thread-safe global variables
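For example, a build line enabling the GOT-swapping mechanism might look like this (file names are illustrative; -swapglobals is the option named above):

    > charmc -o pgm pgm.c -language ampi -swapglobals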

Future Work
- Performance prediction via direct simulation
  - Performance tuning without continuous access to a large machine
- Support for visualization
  - Facilitating debugging and performance tuning
- Support for the MPI-2 standard
  - ROMIO as the parallel I/O library
  - One-sided communication

Thank You!

Free download and manual available at: http://charm.cs.uiuc.edu/
Parallel Programming Lab at the University of Illinois