Efficient Multithreaded Context ID Allocation in MPI
James Dinan, David Goodell, William Gropp, Rajeev Thakur, and Pavan Balaji


Multithreading and MPI Communicators

MPI_Init_thread(…, MPI_THREAD_MULTIPLE, …)

• MPI-2 defined MPI+Threads semantics
  – One collective per communicator at a time
  – Programmer must coordinate across threads
  – Multiple collectives may run concurrently on different communicators
• Communicator creation:
  – Collective operation
  – Multiple creations can occur concurrently on different parent communicators
  – Requires allocation of a context ID
    · Unique integer, uniform across processes
    · Matches messages to communicators
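The rule above (one collective per communicator, but concurrent collectives on different communicators are allowed) can be exercised with the minimal sketch below. It is illustrative only; the helper name dup_comm and the use of two dups of MPI_COMM_WORLD as distinct parents are assumptions, not part of the talk.

  /* Sketch: concurrent communicator creation on *different* parent
   * communicators under MPI_THREAD_MULTIPLE (illustrative names). */
  #include <mpi.h>
  #include <pthread.h>

  static void *dup_comm(void *arg)
  {
      MPI_Comm parent = *(MPI_Comm *)arg, dup;
      MPI_Comm_dup(parent, &dup);   /* collective: allocates a new context ID */
      MPI_Comm_free(&dup);
      return NULL;
  }

  int main(int argc, char **argv)
  {
      int provided;
      MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
      if (provided < MPI_THREAD_MULTIPLE)
          MPI_Abort(MPI_COMM_WORLD, 1);

      /* Two distinct parent communicators, so the two concurrent
       * MPI_Comm_dup calls in the threads are legal under MPI-2. */
      MPI_Comm parents[2];
      MPI_Comm_dup(MPI_COMM_WORLD, &parents[0]);
      MPI_Comm_dup(MPI_COMM_WORLD, &parents[1]);

      pthread_t t[2];
      for (int i = 0; i < 2; i++)
          pthread_create(&t[i], NULL, dup_comm, &parents[i]);
      for (int i = 0; i < 2; i++)
          pthread_join(t[i], NULL);

      MPI_Comm_free(&parents[0]);
      MPI_Comm_free(&parents[1]);
      MPI_Finalize();
      return 0;
  }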

MPI-3: Non-Collective Communicator Creation

• Communicator creation is collective only on the new members; useful for:
  1. Reduced overhead – small communicators when the parent is large
  2. Fault tolerance – not all ranks in the parent can participate
  3. Flexibility / load balancing – resource-sharing barriers [IPDPS ’12], DNTMC application study; asynchronous re-grouping in multi-level parallel computations
• Implementable on top of MPI, but performance is poor
  – Recursive intercommunicator creation/merging algorithm [IMUDI ’12]
  – O(log G) create/merge steps – total O(log² G) cost

MPI-3: MPI_COMM_CREATE_GROUP

MPI_COMM_CREATE_GROUP(comm, group, tag, newcomm)
  IN  comm      intracommunicator (handle)
  IN  group     group, which is a subset of the group of comm (handle)
  IN  tag       “safe” tag (integer)
  OUT newcomm   new communicator (handle)

• “Tagged” collective
  – Multiple threads can call concurrently on the same parent communicator
  – Calls are distinguished via the tag argument
• Requires efficient, thread-safe context ID allocation
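A minimal usage sketch of this interface follows. It is process-based rather than threaded, and the parity-based grouping and tag choice are illustrative assumptions, not from the talk: each process joins the group of ranks sharing its parity, and the two disjoint groups create communicators concurrently on the same parent, distinguished by the tag.

  /* Sketch: two disjoint groups concurrently creating communicators from
   * MPI_COMM_WORLD with MPI_Comm_create_group (illustrative grouping). */
  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      int provided, rank, size;
      MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      /* Build the group of ranks with the same parity as this rank. */
      MPI_Group world_group, my_group;
      MPI_Comm_group(MPI_COMM_WORLD, &world_group);
      int *members = malloc(size * sizeof(int)), nmembers = 0;
      for (int i = rank % 2; i < size; i += 2)
          members[nmembers++] = i;
      MPI_Group_incl(world_group, nmembers, members, &my_group);

      /* Only members of my_group participate; the tag distinguishes the
       * two concurrent creations on the same parent communicator. */
      MPI_Comm newcomm;
      MPI_Comm_create_group(MPI_COMM_WORLD, my_group, /* tag = */ rank % 2,
                            &newcomm);

      int newrank;
      MPI_Comm_rank(newcomm, &newrank);
      printf("world rank %d -> new rank %d\n", rank, newrank);

      MPI_Comm_free(&newcomm);
      MPI_Group_free(&my_group);
      MPI_Group_free(&world_group);
      free(members);
      MPI_Finalize();
      return 0;
  }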

High-Level Context ID Allocation Algorithm

• Extending to support MPI_Comm_create_group
  – Use a “tagged,” group-collective allreduce
  – The tag is shifted into the tagged-collectives tag space by setting a high bit
  – Avoids conflicts with point-to-point messages

  ctxid_mask[MAX_CTXID] = { 1, 1, … }
  1. my_cid_avail = reserve( ctxid_mask )
  2. cid_avail = Allreduce( my_cid_avail, parent_comm )
  3. my_cid = select( cid_avail )

(Figure: Rank 0's and Rank 1's available masks are combined with a bitwise AND to give the allocation result.)
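A single-threaded C sketch of steps 1–3 follows; the names free_mask and allocate_context_id, the 64-entry ID space, and the word-packed mask layout are assumptions for illustration, not MPICH's actual data structures.

  /* Sketch: agree on a context ID by AND-reducing a bitmask of locally
   * free IDs and taking the lowest bit that is set everywhere. */
  #include <mpi.h>
  #include <stdint.h>
  #include <string.h>

  #define MAX_CTXID 64                      /* illustrative ID-space size */

  static uint32_t free_mask[MAX_CTXID / 32] = { 0xFFFFFFFFu, 0xFFFFFFFFu };

  int allocate_context_id(MPI_Comm parent)
  {
      uint32_t local[MAX_CTXID / 32], avail[MAX_CTXID / 32];

      /* reserve(): snapshot the locally free IDs. */
      memcpy(local, free_mask, sizeof local);

      /* A bit survives the reduction only if it is free on every process. */
      MPI_Allreduce(local, avail, MAX_CTXID / 32, MPI_UINT32_T, MPI_BAND,
                    parent);

      /* select(): take the lowest commonly free ID and mark it used. */
      for (int id = 0; id < MAX_CTXID; id++) {
          if (avail[id / 32] & (1u << (id % 32))) {
              free_mask[id / 32] &= ~(1u << (id % 32));
              return id;
          }
      }
      return -1;                            /* no ID available */
  }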

Ensuring Successful Allocation

• Deadlock avoidance:
  – reserve( ) must be non-blocking; if the mask is unavailable, get a dummy value
  – Avoid blocking indefinitely in the Allreduce; may require multiple attempts
• Livelock avoidance:
  – All threads in the group must acquire the mask to allocate – data race
  – MPI_Comm_create: prioritize based on the parent communicator's context ID
  – MPI_Comm_create_group: prioritize based on the (parent context ID, tag) pair

  ctxid_mask[MAX_CTXID] = { 1, 1, … }
  while (my_cid == 0)
    1. my_cid_avail = reserve( ctxid_mask )
    2. cid_avail = Allreduce( my_cid_avail, parent_comm )
    3. my_cid = select( cid_avail )

(Figure: Rank 0's and Rank 1's available masks are combined with a bitwise AND to give the allocation result.)

Full Context ID Allocation Algorithm (MPICH Var.)

  /* Input: my_comm, my_group, my_tag.  Output: integer context ID */

  /* Shared variables (shared by threads at each process) */
  mask[MAX_CTXID] = { 1 }       /* Bit array: indicates whether a ctx ID is free */
  mask_in_use = 0               /* Flag: indicates whether the mask is in use */
  lowest_ctx_id = MAXINT, lowest_tag   /* Indicate which thread has priority */

  /* Private variables (not shared across threads) */
  local_mask[MAX_CTXID]         /* Thread-private copy of the mask */
  i_own_the_mask = 0            /* Flag indicating whether this thread holds the mask */
  context_id = 0                /* Output context ID */

  /* Allocation loop */
  while ( context_id == 0 ) {
      reserve_mask( )
      MPIR_Allreduce_group( local_mask, MPI_BAND, my_comm, my_group, my_tag )
      select_ctx_id( )
  }

(Figure: Rank 0's and Rank 1's masks are AND-reduced to produce the allocation result.)

Full Context ID Allocation Algorithm, reserve

  reserve_mask( ) {
      Mutex_lock( )
      if ( have_higher_priority( ) ) {
          lowest_ctx_id = my_comm->context_id
          lowest_tag = my_tag
      }
      if ( !mask_in_use && have_priority( ) ) {
          local_mask = mask,  mask_in_use = 1,  i_own_the_mask = 1
      } else {
          local_mask = 0,  i_own_the_mask = 0
      }
      Mutex_unlock( )
  }
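The priority predicates are not spelled out on the slide; a plausible reading, consistent with the lowest_ctx_id/lowest_tag bookkeeping above and the (parent context ID, tag) ordering mentioned earlier, is sketched below. These are reconstructions, not MPICH's actual helpers, and they use the slide's shared variables directly.

  /* Sketch: lexicographic priority on (parent context ID, tag);
   * the lowest pair wins (assumed semantics). */
  static int have_higher_priority(void)
  {
      /* Strictly better than the best request recorded so far. */
      return my_comm->context_id < lowest_ctx_id ||
             (my_comm->context_id == lowest_ctx_id && my_tag < lowest_tag);
  }

  static int have_priority(void)
  {
      /* At least as good as the best request recorded so far. */
      return my_comm->context_id < lowest_ctx_id ||
             (my_comm->context_id == lowest_ctx_id && my_tag <= lowest_tag);
  }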

Full Context ID Allocation Algorithm, Allreduce

  ctx_id = MPIR_Allreduce_group( local_mask, MPI_BAND, my_comm, my_group, my_tag )

(Figure: Rank 0's and Rank 1's masks are AND-reduced to produce the allocation result.)

Full Context ID Allocation Algorithm, Select

  select_ctx_id( ) {
      if ( i_own_the_mask ) {
          Mutex_lock( )
          if ( local_mask != 0 ) {
              context_id = location of first set bit in local_mask
              mask[ context_id ] = 0
              if ( have_priority( ) ) lowest_ctx_id = MAXINT
          }
          mask_in_use = 0
          Mutex_unlock( )
      }
  }

(Figure: ctx_id = select( ) picks the first set bit of the reduced mask.)

Deadlock Scenario

  if ( thread_id == mpi_rank )
      MPI_Comm_dup( MPI_COMM_SELF, &self_dup );
  MPI_Comm_dup( thread_comm, &thread_comm_dup );

• Necessary and sufficient conditions
  – Hold: a thread acquires the mask at a particular process
  – Wait: the thread enters the Allreduce and waits for others to make matching calls
• Meanwhile, the matching calls can’t be made
  – A context ID allocation must succeed first, but the mask is unavailable

Deadlock Avoidance

• Basic idea: prevent threads from reserving a context ID until all threads are ready to perform the operation.
• Simple approach: an initial barrier (see the sketch after this list)
  – MPIR_Barrier_group( my_comm, my_group, my_tag )
• Eliminates the wait condition and breaks the deadlock
  – Threads can’t enter the Allreduce until all group members have arrived
  – Threads can’t update priorities until all group members have arrived
  – Ensures that thread groups that are ready will eventually be able to acquire the highest priority and succeed
• Cost: an additional collective
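In the pseudocode of the earlier slides, the barrier sits immediately before the allocation loop, as in the sketch below (same pseudo-level names as the slides; the placement is a paraphrase of the slide, not verbatim MPICH code).

  /* Sketch: synchronize the whole group before any thread may reserve
   * the shared mask, eliminating the wait condition. */
  MPIR_Barrier_group( my_comm, my_group, my_tag )   /* all members arrive */
  while ( context_id == 0 ) {
      reserve_mask( )
      MPIR_Allreduce_group( local_mask, MPI_BAND, my_comm, my_group, my_tag )
      select_ctx_id( )
  }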

Eager Context ID Allocation

• Basic idea: do useful work during the deadlock-avoiding synchronization.
• Split the context ID space into Eager and Base parts
  – Eager: used on the first attempt (threads may hold-and-wait)
  – Base: used on remaining attempts (threads can’t hold-and-wait)
• If the eager mask is not available, allocate on the base mask
  – Allocations using the base mask are deadlock free
  – Threads synchronize in the initial eager Allreduce
    · All threads are present during base allocation
    · Eliminates the wait condition

(Figure: the context ID mask split into an Eager region and a Base region.)

Eager Context ID Allocation Algorithm

• No priority in eager mode
• Threads holding the eager space, blocked in the Allreduce, don’t prevent others from entering base allocation
• Deadlock is avoided (detailed proof in the paper)

  ctxid_mask[MAX_CTXID] = { 1, 1, … }
  1. my_cid_avail = reserve_no_pri( ctxid_mask[0..SPLIT-1] )
  2. cid_avail = Allreduce( my_cid_avail, parent_comm )
  3. my_cid = select_no_pri( cid_avail )
  while (my_cid == 0)
    1. my_cid_avail = reserve( ctxid_mask[SPLIT..] )
    2. cid_avail = Allreduce( my_cid_avail, parent_comm )
    3. my_cid = select( cid_avail )

Is OpenMPI Affected?

• Open MPI uses a similar algorithm
  – MPICH reserves the full mask
  – Open MPI reserves one context ID at a time
  – Requires a second allreduce to check for success
• Hold-and-wait can still occur
  – When the number of threads at a process approaches the number of free context IDs
  – Less likely than in MPICH
  – The same deadlock avoidance technique can be applied

  ctxid_mask[MAX_CTXID] = { 1, 1, … }
  while (my_cid == 0)
    1. my_cid_avail = reserve_one( ctxid_mask )
    2. cid_avail = Allreduce( my_cid_avail, parent_comm, MPI_MAX )
    3. success = Allreduce( cid_avail == my_cid_avail, MPI_LAND )
    4. if (success) my_cid = cid_avail
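The single-ID, two-allreduce idea can be reconstructed as below. This is an illustrative, single-threaded-per-process reconstruction of the scheme in the slide, not Open MPI's actual code; it omits the intra-process locking where the hold-and-wait issue arises, and names such as id_in_use and the retry rule (restart the search at the failed candidate) are assumptions.

  /* Sketch: agree on one context ID at a time using MPI_MAX over proposals,
   * then a second reduction to check that the candidate is free everywhere. */
  #include <mpi.h>

  #define MAX_CTXID 64
  static char id_in_use[MAX_CTXID];       /* zero-initialized: all IDs free */

  int allocate_context_id(MPI_Comm parent)
  {
      int low = 0;
      while (1) {
          /* Propose the lowest locally free ID at or above 'low'. */
          int my_prop = MAX_CTXID;
          for (int id = low; id < MAX_CTXID; id++)
              if (!id_in_use[id]) { my_prop = id; break; }

          int candidate;
          MPI_Allreduce(&my_prop, &candidate, 1, MPI_INT, MPI_MAX, parent);
          if (candidate >= MAX_CTXID)
              return -1;                  /* ID space exhausted somewhere */

          /* Second allreduce: is the candidate free on every process? */
          int ok_local = !id_in_use[candidate], ok;
          MPI_Allreduce(&ok_local, &ok, 1, MPI_INT, MPI_LAND, parent);
          if (ok) {
              id_in_use[candidate] = 1;
              return candidate;
          }
          low = candidate;                /* candidate is taken somewhere */
      }
  }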

Comparison: Base vs. Eager, CC vs. CCG

• Parent communicator is MPI_COMM_WORLD (size = 1024)
• Eager improves over Base by a factor of two
  – One Allreduce versus Barrier + Allreduce
• MPI_Comm_create_group( ) versus MPI_Comm_create( )
  – Communicator creation cost is proportional to the output group size

Comparison With User-Level CCG

• User-level approach [IMUDI ’11]: log(p) intercommunicator create/merge steps
  – Total communication cost is log²(p)
• Direct: one communicator creation step
  – Eliminates a factor of log(p)
• At p = 512 and 1024, the user-level approach was more expensive than MPI_Comm_create

Conclusions

• Extended context ID allocation to support multithreaded allocation on the same parent communicator
  – Support for the MPI-3 MPI_Comm_create_group routine
• Identified a subtle deadlock issue
• Deadlock avoidance
  – Break hold-and-wait through initial synchronization
  – Eager context ID allocation eliminates the deadlock-avoidance cost in the common case

Thanks!