MPI3 Hybrid Proposal Description www.parallel.illinois.edu.

Slides:

Advertisements

Similar presentations

Construction process lasts until coding and testing is completed consists of design and implementation reasons for this phase –analysis model is not sufficiently.

Advertisements

Enabling MPI Interoperability Through Flexible Communication Endpoints

U NIVERSITY OF D ELAWARE C OMPUTER & I NFORMATION S CIENCES D EPARTMENT Optimizing Compilers CISC 673 Spring 2009 Potential Languages of the Future Chapel,

Class CS 775/875, Spring 2011 Amit H. Kumar, OCCS Old Dominion University.

NewsFlash!! Earth Simulator no longer #1. In slightly less earthshaking news… Homework #1 due date postponed to 10/11.

CS 140: Models of parallel programming: Distributed memory and MPI.

Scientific Programming OpenM ulti- P rocessing M essage P assing I nterface.

Silberschatz, Galvin and Gagne ©2013 Operating System Concepts Essentials – 2 nd Edition Chapter 4: Threads.

DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING CHAPTER 7: SHARED MEMORY PARALLEL PROGRAMMING.

Computer Architecture II 1 Computer architecture II Programming: POSIX Threads OpenMP.

A Message Passing Standard for MPP and Workstations Communications of the ACM, July 1996 J.J. Dongarra, S.W. Otto, M. Snir, and D.W. Walker.

Software Group © 2006 IBM Corporation Compiler Technology Task, thread and processor — OpenMP 3.0 and beyond Guansong Zhang, IBM Toronto Lab.

CS 240A: Models of parallel programming: Distributed memory and MPI.

Message-Passing Programming and MPI CS 524 – High-Performance Computing.

3.5 Interprocess Communication Many operating systems provide mechanisms for interprocess communication (IPC) –Processes must communicate with one another.

3.5 Interprocess Communication

1 ITCS4145/5145, Parallel Programming B. Wilkinson Feb 21, 2012 Programming with Shared Memory Introduction to OpenMP.

OpenMPI Majdi Baddourah

1 MPI-2 and Threads. 2 What are Threads? l Executing program (process) is defined by »Address space »Program Counter l Threads are multiple program counters.

Programming with Shared Memory Introduction to OpenMP

CS470/570 Lecture 5 Introduction to OpenMP Compute Pi example OpenMP directives and options.

1 Copyright © 2010, Elsevier Inc. All rights Reserved Chapter 5 Shared Memory Programming with OpenMP An Introduction to Parallel Programming Peter Pacheco.

Shared Memory Parallelism - OpenMP Sathish Vadhiyar Credits/Sources: OpenMP C/C++ standard (openmp.org) OpenMP tutorial (

Parallel Programming in Java with Shared Memory Directives.

Lecture 4: Parallel Programming Models. Parallel Programming Models Parallel Programming Models: Data parallelism / Task parallelism Explicit parallelism.

ICOM 5995: Performance Instrumentation and Visualization for High Performance Computer Systems Lecture 7 October 16, 2002 Nayda G. Santiago.

1 OpenMP Writing programs that use OpenMP. Using OpenMP to parallelize many serial for loops with only small changes to the source code. Task parallelism.

1 MPI: Message-Passing Interface Chapter 2. 2 MPI - (Message Passing Interface) Message passing library standard (MPI) is developed by group of academics.

Compilation Technology SCINET compiler workshop | February 17-18, 2009 © 2009 IBM Corporation Software Group Coarray: a parallel extension to Fortran Jim.

Lecture 8: OpenMP. Parallel Programming Models Parallel Programming Models: Data parallelism / Task parallelism Explicit parallelism / Implicit parallelism.

Part I MPI from scratch. Part I By: Camilo A. SilvaBIOinformatics Summer 2008 PIRE :: REU :: Cyberbridges.

CS 240A Models of parallel programming: Distributed memory and MPI.

Parallel Computing A task is broken down into tasks, performed by separate workers or processes Processes interact by exchanging information What do we.

CS 838: Pervasive Parallelism Introduction to OpenMP Copyright 2005 Mark D. Hill University of Wisconsin-Madison Slides are derived from online references.

Hybrid MPI and OpenMP Parallel Programming

Work Replication with Parallel Region #pragma omp parallel { for ( j=0; j

1 The Message-Passing Model l A process is (traditionally) a program counter and address space. l Processes may have multiple threads (program counters.

Share Memory Program Example int array_size=1000 int global_array[array_size] main(argc, argv) { int nprocs=4; m_set_procs(nprocs); /* prepare to launch.

Copyright © 2009 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Principles of Parallel Programming First Edition by Calvin Lin Lawrence Snyder.

CSCI-455/522 Introduction to High Performance Computing Lecture 4.

Introduction to OpenMP Eric Aubanel Advanced Computational Research Laboratory Faculty of Computer Science, UNB Fredericton, New Brunswick.

Chapter 4 Message-Passing Programming. The Message-Passing Model.

A new thread support level for hybrid programming with MPI endpoints EASC 2015 Dan Holmes, Mark Bull, Jim Dinan

Shared Memory Parallelism - OpenMP Sathish Vadhiyar Credits/Sources: OpenMP C/C++ standard (openmp.org) OpenMP tutorial (

Introduction to MPI CDP 1. Shared Memory vs. Message Passing Shared Memory Implicit communication via memory operations (load/store/lock) Global address.

MPI and OpenMP.

Programming distributed memory systems: Message Passing Interface (MPI) Distributed memory systems: multiple processing units working on one task (e.g.

3/12/2013Computer Engg, IIT(BHU)1 MPI-1. MESSAGE PASSING INTERFACE A message passing library specification Extended message-passing model Not a language.

3/12/2013Computer Engg, IIT(BHU)1 OpenMP-1. OpenMP is a portable, multiprocessing API for shared memory computers OpenMP is not a “language” Instead,

Message Passing Interface Using resources from

OpenMP Runtime Extensions Many core Massively parallel environment Intel® Xeon Phi co-processor Blue Gene/Q MPI Internal Parallelism Optimizing MPI Implementation.

CPE779: Shared Memory and OpenMP Based on slides by Laxmikant V. Kale and David Padua of the University of Illinois.

Chapter 4: Threads.

Introduction to OpenMP

Shared Memory Parallelism - OpenMP

SHARED MEMORY PROGRAMMING WITH OpenMP

CS427 Multicore Architecture and Parallel Computing

Chapter 2 Processes and Threads Today 2.1 Processes 2.2 Threads

Introduction to MPI.

Computer Engg, IIT(BHU)

MPI Message Passing Interface

Introduction to OpenMP

Computer Science Department

Multi-core CPU Computing Straightforward with OpenMP

Chapter 4: Threads.

MPI-Message Passing Interface

Introduction to parallelism and the Message Passing Interface

Programming with Shared Memory

Introduction to OpenMP

MPI Message Passing Interface

Presentation transcript:

MPI3 Hybrid Proposal Description

2 Goals Allowing more than one “MPI process” per node, without requiring multiple address spaces Having a clean binding to OpenMP and other thread-based parallel programming models Having a clean binding to UPC, CAF and similar PGAS languages

3 Process Current Design Single-threaded Multi-threaded – Requires reentrant MPI library (hard) or mutual exclusion Process appl thread NI appl thread NI Process appl thread NI appl thread NI appl thread Network Interfaces (NI): physical or virtual, disjoint interfaces Node

4 Proposed Design Single-threaded – Each NI (or virtual NI) is accessed by only one thread Multi-threaded – Some NIs are accessed by multiple threads Process appl thread NI appl thread NI Process appl thread NI appl thread NI appl thread Node Same as before, except multiple NIs are mapped in same address space

5 MPI_COMM_EWORLD Each MPI Endpoint has unique rank in MPI_COMM_EWORLD – rank in derived communicators computed using MPI rules MPI code executed by thread(s) attached to endpoint – Including collectives – thread is attached to at most one endpoint Process appl thread NI appl thread NI Process appl thread NI appl thread NI appl thread 0123

6 Binding Time Binding of MPI Endpoints to processes – done at MPI initialization time, or before Attachment of threads to MPI endpoints – done during execution; can change but expected to rarely change – threads can be created after endpoints

7 Endpoint Creation (1) MPI_ENDPOINT_INIT(required, provided) IN required desired level of thread support (integer) OUT provided provided level of thread support (integer) MPI_THREAD_FUNNELED – each endpoint will have only one attached thread MPI_THREAD_SERIALIZED – multiple threads can attach to same endpoint, but cannot invoke MPI concurrently MPI_THREAD_MULTIPLE – multiple threads can attach to same endpoint and can invoke MPI concurrently Executed by one thread on each process; initializes MPI_COMM_WORLD

8 After MPI_ENDPOINT_INIT Attribute MPI_ENDPOINTS (on MPI_COMM_WORLD) indicates maximum number of endpoints at each process. – Can be different at each process (can depend on mpiexec arguments) Process appl thread Process appl thread EP appl thread 01 EP MPI_COMM_WORLD

9 Endpoint Creation (2) MPI_ENDPOINT_CREATE (num_endpoints, array_of_endpoints) IN num_endpoints number of endpoints (integer) OUT array_of_endpoints array of endpoint handles (array of handles) Executed by one thread on each process; initializes MPI_COMM_EWORLD

10 After MPI_ENDPOINT_CREATE Process appl thread Process appl thread EP appl thread 01 EP MPI_COMM_WORLD 02 MPI_COMM_EWORLD MPI_COMM_PROCESS

11 Thread Registration MPI_THREAD_ATTACH (endpoint) IN endpoints endpoint handle (handle) MPI_THREAD_DETACH (endpoint) IN endpoints endpoint handle (handle) Executed by each thread

12 After All Threads Attached Process appl thread Process appl thread EP appl thread 01 EP MPI_COMM_WORLD 02 MPI_COMM_EWORLD MPI_COMM_PROCESS

13 Remarks MPI calls invoked by thread pertain to endpoint the thread is attached to – E.g., sender rank is rank of endpoint MPI call may block if participation of endpoint is needed and no thread is attached to endpoint – E.g., collective or send to “orphan endpoint” Normally, endpoint will have attached thread after initialization

14 Missing Items MPI_FINALIZE() mpiexec Dynamic processes One-sided I/O Error codes...

15 MPI + Shared Memory Language/Library Goals – Allow more than one MPI endpoint per shared memory program – Simple use (but not for naïve users!) – Simple implementation

16 Shared Memory Programming Models Number of threads can vary Tasks can be migrated from one thread to another Work can be shared dynamically Process appl thread task Process appl thread work pool

17 General Binding MPI calls can be executed anywhere in the shared memory program, with “natural semantics” – blocking MPI call blocks task/iterate – nonblocking call can be completed in any place in the program execution that causally follows the initial nonblocking call – registrations are per task, and are inherited when control forks #pragma omp for for(i=0; i<N; i++) { MPI_Irecv(...,req[i]); } MPI_Waitall(N,req,stat);

18 Issues Blocking MPI call blocks thread; this may block work sharing OpenMP/TBB/... scheduler does not know about MPI; communication performance may be affected OpenMP (TBB, etc.) runtime needs to migrate implicit MPI state (e.g., identity of associated MPI endpoint) when task migrates or forks Need MPI_THREAD_MULTIPLE Recommend to specify but not to require – Detailed specification missing in current document

19 Restricted Binding MPI endpoint is associated with thread User has to make sure that MPI calls execute on “right” thread

20 OpenMP Rules MPI_ENDPOINT_CREATE is invoked in the initial, sequential part of the program. Each endpoint is attached by one OpenMP thread Only OpenMP threads attached to an endpoint invoke MPI. Each handle should be used in MPI calls by the same thread through program execution. – Handle is threadprivate – Handle is shared and is not reused in different parallel regions unless they are consecutive, neither is nested inside another parallel region, both regions use the same number of threads, and dyn-var = false – An MPI handle should not be used within a work sharing construct. – An MPI handle should not be used within an explicit task.

21 Example (1)... // omit declarations MPI_Endpoint_Init(argc, argv, MPI_THREAD_SINGLE, &provided); MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_ENDPOINTS, &pnum, &flag); #pragma omp parallel private(myid) { #pragma omp master { /* find number of threads in current team */ Nthreads = omp_get_num_threads(); if (Nthreads != *pnum) abort(); /* create endpoints */ MPI_Endpoint_create(Nthreads, *endpoints); }

22 Example (2) /* associate each thread with an endpoint */ myid = omp_get_thread_num(); MPI_Endpoint_register(myid, *endpoints); /* MPI communication involving all threads */ MPI_Comm_rank(MPI_COMM_EWORLD, &myrank); MPI_Comm_size(MPI_COMM_EWORLD, &size); MPI_Comm_rank(MPI_COMM_PROCESS, &mythreadid); if (myid != mythreadid) abort(); if (myrank > 0) MPI_Isend(buff, count, MPI_INT, myrank-1, 3, MPI_COM_EWORLD, &req[mythreadid); if (myrank < size) MPI_Recv(buff1, count, MPI_INT, myrank+1, 3, MPI_COMM_EWORLD); MPI_Wait(&req[mythreadid], &status[mythreadid]); }...

23 TBB et al Need TBB to provide a construct that binds a task to a specific thread. – Similar construct may be needed for GPU tasks and other similar messes

24 PGAS (UPC, Fortran 2008) One thread of control per locale (+ work sharing, in UPC) private data shared data thread locale (thread/image) private data shared data thread private data shared data thread local reference global reference local/global reference

25 PGAS Implementations (1) private data shared data thread private data shared data thread private data shared data thread private data shared data thread process node

26 PGAS Implementations (2) private data shared data thread private data shared data thread private data shared data thread private data shared data thread process

27 PGAS Invokes MPI private data shared data thread private data shared data thread private data shared data thread private data shared data thread EP

28 PGAS Invokes MPI (cont.) Each locale (thread/image) has MPI_COMM_EWORLD rank corresponding to its PGAS rank Each locale can invoke MPI (one endpoint per locale) – all arguments must be local – cannot be done within work sharing construct Model does not depend on chosen PGAS implementation Requires joint initialization of PGAS and of MPI – Both are initialized before user code starts running

29 MPI Invokes PGAS Need “awareness” of PGAS language within MPI code Link and invoke UPC routine from (SPMD) C – Collective call over all MPI processes (endpoints) in communicator Create shared array from (SPMD) C – collective malloc; shared array descriptor Pass shared array as argument Query shared arrays Problems mostly are on PGAS side; should we address them?