MPI-3 Hybrid Working Group Status

MPI interoperability with Shared Memory

Motivation: sharing data between processes on a node without using threads.

Sandia applications motivation:
– Dump data into a “common” memory region
– Synchronize
– All processes just use this memory region in read-only mode
– And they want this in a portable manner

MPI interoperability with Shared Memory: Plan A

The original plan was to provide:
– Routines to allocate/deallocate shared memory
– Routines to synchronize operations, but not to operate on memory (similar to MPI_Win_sync)
  – Operations on shared memory are done using load/store operations (unlike RMA)
  – Synchronization would be done with something similar to MPI_Win_sync
  – There was a suggestion to provide operations to operate on the data as well, but there was no consensus on that in the working group
– A routine to create a communicator on which shared memory can be created

The Forum’s feedback was that this is not possible to do in MPI unless the implementation knows the compiler’s capabilities.

Example synchronization pattern (two processes):
  Process 0: store(X); MPI_Shm_sync(); MPI_Barrier()
  Process 1: MPI_Barrier(); MPI_Shm_sync(); load(X)
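A minimal sketch of how this Plan A pattern might look in application code follows. MPI_Shm_sync comes from this (never standardized) proposal, and shm_comm and shm_buf stand in for a communicator and a buffer obtained from the proposed creation/allocation routines; none of these are standard MPI.

  /* Sketch only: MPI_Shm_sync, shm_comm and shm_buf come from the Plan A
     proposal, not from standard MPI; error handling is omitted. */
  int rank;
  MPI_Comm_rank(shm_comm, &rank);

  if (rank == 0) {
      shm_buf[0] = 42.0;          /* plain store into the shared region */
      MPI_Shm_sync();             /* make the store visible (like MPI_Win_sync) */
      MPI_Barrier(shm_comm);      /* order the two processes */
  } else if (rank == 1) {
      MPI_Barrier(shm_comm);
      MPI_Shm_sync();             /* refresh the local view before loading */
      double x = shm_buf[0];      /* plain load from the shared region */
  }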

MPI interoperability with Shared Memory: Plan B

The second plan was to remove the shared memory synchronization routines; we still have:
– Routines to allocate/deallocate shared memory
– A routine to create a communicator on which shared memory can be created

The Forum’s feedback was that allocation/deallocation might not be useful without operations to synchronize data:
– The use case of only creating shared memory and expecting the user to handle memory barriers was fairly minimal
– Another piece of feedback was to do this as an external library

MPI interoperability with Shared Memory: Plan C

The third plan is to also remove the shared memory allocation routines; we still have:
– A routine to create a communicator on which shared memory can be created:
  MPI_Shm_comm_create(old_comm, info, &new_comm)
  – The info argument can provide implementation-specific information, such as “within the socket”, “shared by a subset of processes”, or whatever else
  – No predefined info keys
– There has been a suggestion to generalize this to a communicator creation function that provides “locality” information
  – It would create an array of communicators, where the lower-index communicators “might contain” closer processes (best effort from the MPI implementation)
  – An attribute on the communicator would tell whether shared memory can be created on it
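A hypothetical usage sketch of the proposed routine is shown below. MPI_Shm_comm_create is the Plan C proposal (not standard MPI), and the “scope”/“socket” info pair is invented purely for illustration, since the proposal defines no predefined keys.

  /* Sketch only: MPI_Shm_comm_create is the proposed Plan C routine, not
     standard MPI; the "scope"/"socket" info pair is a made-up example of an
     implementation-specific hint. */
  MPI_Info info;
  MPI_Comm shm_comm;

  MPI_Info_create(&info);
  MPI_Info_set(info, "scope", "socket");   /* hypothetical hint: share only within a socket */

  MPI_Shm_comm_create(MPI_COMM_WORLD, info, &shm_comm);

  /* Every process in shm_comm can create shared memory with the others,
     e.g. through an external mechanism such as POSIX shm. */

  MPI_Info_free(&info);
  MPI_Comm_free(&shm_comm);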

MPI Interoperability with Thread Teams

Motivation: allow the MPI implementation to coordinate with the application to get access to its threads, instead of the application and the MPI implementation each maintaining a separate pool of threads.

Proposal in a nutshell:
– Thread teams are created
– The user can lend the threads in a team to the MPI implementation to help it out
– A user-defined team gives the user control over locality and blocking semantics for threads (i.e., which threads block waiting to help)

Point-to-point Communication on InfiniBand (performance plot; roughly a five-fold performance difference)

Allreduce on BG/P (performance plot)

Thread Teams Models

Two models are proposed:
– Synchronous model: all threads in the team join and leave the team synchronously; more restrictive for the application, but more optimization opportunity for MPI
– Asynchronous model: threads join the team asynchronously; threads can leave synchronously or asynchronously break out

Both models are essentially performance hints; an optimized MPI implementation can use this information.

Thread Teams: Synchronous Model

In the synchronous model, the Join and Leave calls can be synchronizing between the threads:
– The MPI implementation can assume that all threads will be available to help
– The programmer should keep the threads synchronized for good performance
– The MPI implementation knows how many threads are going to help, so it can, for example, statically partition the available work
– This model does not allow threads to “escape” without doing a synchronous leave

(Diagram of a possible implementation, using an Allgather as the example: threads Join, MPI can hold them to help with the work, and the Leave call returns once the work is done.)

Thread Teams: Asynchronous Model

In the asynchronous model, the Join call is not synchronizing between the threads:
– The Leave call is still synchronizing, but threads are allowed either to help by calling “leave” or to “break out” without helping
– The MPI implementation does not know how many threads are going to help
– This model allows threads to “break out” without doing a synchronous leave

(Diagram of a possible implementation, using an Allgather and a work pool as the example: threads Join, MPI can hold them to help, some threads Break out while others Leave; the Leave call is still synchronous and returns when the work is done.)

Thread Teams Proposed API

Team creation/freeing:
– MPI_Team_create(team_size, info, &team)
  – One predefined info key for “synchronous”; the default is “asynchronous”
  – Similar to the MPI RMA chapter, the info arguments are true assertions: if the user says “synchronous” and then tries to break out, that is an erroneous program
– MPI_Team_free(team)

Team join/leave functionality:
– MPI_Team_join(team)
  – A thread can only join one team at a time
– MPI_Team_leave(team)
– MPI_Team_break(team)
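To make the proposed API concrete, here is a minimal sketch of the synchronous model combined with OpenMP, in the spirit of the Hartree-Fock example on the next slide. MPI_Team and the MPI_Team_* routines are only the proposal above (not standard MPI), and the exact spelling of the “synchronous” info key and value is assumed for illustration.

  /* Sketch only: MPI_Team and the MPI_Team_* routines are the proposed API,
     not standard MPI; the "synchronous"/"true" info pair is an assumed
     spelling of the predefined key mentioned on the slide. */
  #include <mpi.h>
  #include <omp.h>

  void allreduce_with_team(double *local, double *global, int n)
  {
      MPI_Info info;
      MPI_Team team;

      MPI_Info_create(&info);
      MPI_Info_set(info, "synchronous", "true");            /* assert the synchronous model */
      MPI_Team_create(omp_get_max_threads(), info, &team);
      MPI_Info_free(&info);

      #pragma omp parallel
      {
          /* ... per-thread compute phase that fills "local" ... */
          #pragma omp barrier

          MPI_Team_join(team);               /* lend this thread to MPI */
          #pragma omp master
          {
              /* One thread issues the call; the joined threads may help drive it. */
              MPI_Allreduce(local, global, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
          }
          MPI_Team_leave(team);              /* synchronous model: every thread leaves,
                                                no break-out allowed */
      }

      MPI_Team_free(&team);
  }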

Hartree-Fock Example

  do {
      One_Electron_Contrib(Density, Fock)

      while (task = next_task()) {
          {i, j, k} = task.dims
          X = Get(Density, {i,j,k} .. {i+C,j+C,k+C})

          #pragma omp parallel
          {
              Y = Work({i,j,k}, X)                                   ; <----- compute intensive
              omp_barrier()
              MPI_Team_join(team)
              if (omp_master) {
                  Accumulate(SUM, Y, Fock, {i,j,k}, {i+C,j+C,k+C})   ; <----- communication intensive
              }                                                      ; OMP master
              MPI_Team_leave(team)
          }                                                          ; OMP – end parallel block
      }

      #pragma omp parallel
      {
          Update_Density(Density, Fock)                              ; <----- communication intensive
          my_energy = omp_gather()
          MPI_Team_join(team)
          if (omp_master) {
              new_energy = MPI_Allgather(my_energy)                  ; <----- moderately communication intensive
          }                                                          ; OMP master
          MPI_Team_leave(team)
      }                                                              ; OMP – end parallel block

      delta = abs(new_energy - energy)                               ; compare with previous iteration's energy
      energy = new_energy
  } while (delta > tolerance)