1
Approximating the Buffer Allocation Problem Using Epochs
Jan Bækgaard Pedersen (U. of Nevada, Las Vegas) and Alex Brodsky (U. of Winnipeg, Canada)
2
Parallel Applications
Goal: overlap communication and computation. Parallel applications consist of a number of asynchronous, interconnected hosts that run at different speeds with no access to a global clock, and that communicate by sending messages to each other over a local area network. To be efficient, applications typically overlap communication and computation.
3
Parallel Applications (cont.)
Today, parallel applications typically run on heterogeneous platforms, are ported to systems with varying resources, and run on hardware platforms that differ from those they were designed on. Goals: ensure that ported applications do not deadlock, and determine the applications' resource requirements. Focus: determining an application's message-buffer needs. Note that a ported application can still be run on its original hardware to compute its communication graph.
4
Message Buffering
Unbuffered (synchronous operation): if a message send is unbuffered, the sending process performs MPI_Send and must then wait for the receiver to perform a matching MPI_Recv before the send completes and the sender can resume its own computation.
Buffered (asynchronous operation): with buffered message passing, the sender stores the message in a buffer and continues computing; the receiver retrieves the message from the buffer at its convenience, allowing computation and communication to overlap.
Buffer filled (synchronous operation): if the buffer fills up, the send blocks until the receiver performs a receive. If the application assumes a buffer is available, this can be a problem.
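The buffered-send semantics above can be modeled with a toy channel. This is a minimal sketch, not MPI's actual API (the `Channel` class and its methods are our own names); a send that would block is modeled by returning False instead of blocking:

```python
from collections import deque

class Channel:
    """Toy model of a message channel with a fixed number of buffer slots.

    A send succeeds immediately while a slot is free (the sender may then
    continue computing); once the buffer is full, a further send would
    block until a matching receive frees a slot.
    """
    def __init__(self, buffers):
        self.buffers = buffers
        self.queue = deque()

    def try_send(self, msg):
        if len(self.queue) < self.buffers:
            self.queue.append(msg)   # buffered: sender is not blocked
            return True
        return False                 # buffer full: the send would block

    def recv(self):
        return self.queue.popleft()  # receiving frees a buffer slot
```

With one slot, a second send fails until the first message is received, mirroring the "buffer filled" case on the slide.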
5
The Problem
Consider a communication pattern in which each of three processes performs a send (MPI_Send), presumably performs some computation, and then performs a receive (MPI_Recv). The pattern succeeds as long as one process has a free buffer: if the first process has a buffer, then when the third process performs its send, the message is buffered and does not block the sender, which can then perform its receive and allow the remaining sends to complete. If all buffers are exhausted, however, none of the sends can proceed and deadlock occurs. Solution: ensure the application has a sufficient number of buffers.
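The three-process pattern above can be simulated to confirm that zero buffers deadlock while a single buffer suffices. This is a sketch under our own simplified semantics (a send completes by rendezvous when its receiver is already waiting, or by parking the message in a free buffer slot); the function name and model are assumptions, not the paper's:

```python
def deadlocks(n, buffers):
    """Each process i sends to (i+1) % n, then receives from (i-1) % n.
    Returns True iff the pattern deadlocks with the given buffer count."""
    state = ['send'] * n        # per-process program counter: send, recv, done
    buffered = [False] * n      # buffered[i]: message for process i is parked
    free = buffers
    progress = True
    while progress:
        progress = False
        for i in range(n):
            if state[i] == 'send':
                t = (i + 1) % n
                if state[t] == 'recv':            # rendezvous: receiver waits
                    state[i], state[t] = 'recv', 'done'
                    progress = True
                elif free > 0:                    # park message in a buffer
                    free -= 1
                    buffered[t] = True
                    state[i] = 'recv'
                    progress = True
            elif state[i] == 'recv' and buffered[i]:
                buffered[i] = False
                free += 1                         # receive frees the slot
                state[i] = 'done'
                progress = True
    return any(s != 'done' for s in state)
```

With no buffers every process is stuck in its send; with one buffer the single parked message breaks the cycle and the sends complete in cascade.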
6
Complications
Deadlock can result from untimely use of a buffer: by performing a send, a process can steal a buffer meant for another message, creating deadlock because the other sends cannot proceed. Such problems arise even if the communication is oblivious (independent of application input), is point-to-point (no broadcast or multicast), and uses no wild-cards (each receiver must specify its sender). Moreover, the buffer allocation problem is NP-hard to solve optimally [BPW'05], it is coNP-complete to verify that a given allocation is deadlock-free, and hence it needs to be approximated.
7
The Model (Communication Graphs)
We model the application and its communication pattern as a communication graph: each send and each receive is a vertex, computation arcs connect consecutive events on the same process, and communication arcs connect each send vertex to its matching receive vertex. Key point: since we assume an oblivious communication pattern, all application behaviour of interest (i.e., the communication) is represented by the communication graph.
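The three-process ring from the earlier slide can be written down concretely. This is a sketch with our own vertex labels (`s1`/`r1` for process 1's send and receive, and so on), not notation from the paper:

```python
# Communication graph for the 3-process ring: each process performs a
# send, then a receive; process i's message goes to process i+1 (mod 3).
comm_graph = {
    "vertices": {"s1", "s2", "s3", "r1", "r2", "r3"},
    "comp_arcs": [("s1", "r1"), ("s2", "r2"), ("s3", "r3")],  # program order
    "comm_arcs": [("s1", "r2"), ("s2", "r3"), ("s3", "r1")],  # send -> recv
}

# sanity check: every send has exactly one matching receive
assert sorted(u for u, _ in comm_graph["comm_arcs"]) == ["s1", "s2", "s3"]
assert sorted(v for _, v in comm_graph["comm_arcs"]) == ["r1", "r2", "r3"]
```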
8
Basic Approach
Treat the communication graph as a dependency graph: reverse the computation arcs and make the communication arcs bidirectional (with no buffers, a receive and its matching send depend on each other). A cycle in this graph indicates the need for a buffer: if there are no buffers, the sends on the cycle will never complete. Adding a single buffer is not enough, however, since buffers may be stolen by other processes that perform untimely sends. Our initial approximation therefore assigns a buffer to every receive. Although this guarantees that no process ever blocks (delay-freedom), it is clearly overkill in the number of buffers; a better approach is needed.
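The cycle test on the induced dependency graph can be sketched with a standard depth-first search. One simplification is assumed here: communication arcs are followed only in their forward (send-to-receive) direction, which avoids the trivial two-vertex loops that a fully bidirectional treatment would introduce, yet still detects the ring example; the function name is our own:

```python
def cycle_exists(comp_arcs, comm_arcs):
    """Build a simplified dependency graph (computation arcs reversed,
    communication arcs kept forward) and report whether it has a cycle,
    i.e., whether at least one buffer is needed."""
    succ, verts = {}, set()
    for u, v in comp_arcs:
        succ.setdefault(v, []).append(u)   # reversed computation arc
        verts |= {u, v}
    for u, v in comm_arcs:
        succ.setdefault(u, []).append(v)   # forward communication arc
        verts |= {u, v}

    WHITE, GREY, BLACK = 0, 1, 2
    color = dict.fromkeys(verts, WHITE)

    def dfs(v):
        color[v] = GREY
        for w in succ.get(v, []):
            if color[w] == GREY or (color[w] == WHITE and dfs(w)):
                return True                # back edge: cycle found
        color[v] = BLACK
        return False

    return any(color[v] == WHITE and dfs(v) for v in verts)

# 3-process ring (send, then receive): cyclic, so a buffer is needed
ring_comp = [("s1", "r1"), ("s2", "r2"), ("s3", "r3")]
ring_comm = [("s1", "r2"), ("s2", "r3"), ("s3", "r1")]
assert cycle_exists(ring_comp, ring_comm)
# a lone send/receive pair can rendezvous without a buffer: no cycle
assert not cycle_exists([], [("s1", "r1")])
```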
9
A Better Approximation
Observe that if send (2) always occurs before send (3), the buffer will not be stolen, deadlock is prevented, and only one buffer is needed. The key point is that sends (1) and (2) must be ordered before send (3). This raises several implementation issues: how do we order the sends? how do we enforce that order? what kind of order is it? and how many buffers are needed?
10
Our Approach
Partition the communication graph into epochs: epochs are well ordered, although sends within an epoch may not be. Determine the per-epoch buffer allocations, then compute the maximum allocation over all epochs; this maximum is the buffer allocation for the whole application. Finally, group consecutive epochs into super-epochs and use a barrier primitive to enforce the execution order of the super-epochs.
11
Epochs
Epochs are maximal connected subgraphs of the communication graph and represent phases of the application: there is no communication between epochs, and epochs execute in serial order. All epochs are strictly ordered; each epoch begins executing only when the preceding epoch has finished and all of its buffers have been freed. This ordering is enforced via a barrier synchronization primitive.
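Since epochs are maximal connected subgraphs, they can be found with any connected-components algorithm. A minimal sketch using union-find (our own function name; ordering the epochs by program order is a separate step, omitted here):

```python
def partition_into_epochs(vertices, arcs):
    """Return the epochs of a communication graph: its maximal connected
    subgraphs, with arcs treated as undirected for connectivity."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    for u, w in arcs:                       # union the two endpoints
        parent[find(u)] = find(w)

    groups = {}
    for v in vertices:
        groups.setdefault(find(v), set()).add(v)
    return list(groups.values())
```

For example, a graph whose arcs join only {a, b} and only {c, d} yields two epochs.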
12
Number of Buffers Needed
We compute the number of buffers needed for each epoch. Epochs with cycles need more than zero buffers; for these we use the approximation of [BPW'05] (the first epoch in the example needs 1 buffer). Epochs with no cycles need 0 buffers: no cycles means no deadlock can occur. At the end of each epoch all buffers are freed and become available to the next epoch. Consequently, the total number of buffers needed is simply the maximum over all epochs; in the example, max{1, 0} = 1 buffer.
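The rule above reduces to a one-line computation. A sketch (the function name and input convention are our own): given the per-epoch counts, the application-wide requirement is their maximum, because the barrier at each epoch boundary frees every buffer before the next epoch starts:

```python
def application_buffers(per_epoch):
    """per_epoch lists the buffer count computed for each epoch: 0 for
    acyclic epochs, >0 (via the [BPW'05] approximation) for cyclic ones.
    Epochs never hold buffers simultaneously, so the application needs
    only the maximum per-epoch allocation."""
    return max(per_epoch, default=0)

# the slide's example: one cyclic epoch (1 buffer), one acyclic epoch (0)
assert application_buffers([1, 0]) == 1
```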
13
Avoiding Unnecessary Barriers
Barriers are expensive operations, and in many cases they are not needed: in the following example each send is in its own epoch, so a barrier occurs after every send even though no buffers are needed. Idea: combine consecutive epochs into a super-epoch. This eliminates unneeded barriers and increases parallelism without increasing the number of buffers. Super-epochs are treated like epochs: they are serially ordered and separated by barriers, there is no explicit synchronization within a super-epoch, and a super-epoch uses at most the maximum number of buffers over its constituent epochs.
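One way to sketch the grouping is a greedy pass over the epoch sequence. The merging rule below is our own plausible conservative criterion, not the paper's exact construction: consecutive epochs are merged while the group's total buffer demand stays within the application-wide maximum, so a super-epoch never needs more buffers than the worst single epoch, and runs of zero-buffer epochs (as in the slide's example) collapse into one barrier-free super-epoch:

```python
def form_super_epochs(per_epoch_buffers):
    """Greedily merge consecutive epochs into super-epochs.  Barriers are
    placed only between super-epochs, so within a group buffers are not
    guaranteed free; the group's summed demand must stay within the
    application-wide budget (the max per-epoch count)."""
    budget = max(per_epoch_buffers, default=0)
    groups, cur, used = [], [], 0
    for b in per_epoch_buffers:
        if cur and used + b > budget:
            groups.append(cur)          # close the current super-epoch
            cur, used = [], 0
        cur.append(b)
        used += b
    if cur:
        groups.append(cur)
    return groups
```

For all-acyclic epochs (`[0, 0, 0]`) every barrier is eliminated; for `[1, 0, 1]` a barrier remains only where a second buffer would otherwise be needed.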
14
Proposed Implementation
Each host runs the stack: application, MPI, O/S. We propose the following implementation. First, compute the barrier and buffer allocations: generate the communication graph from the application, compute the epochs and the per-epoch buffer allocations, then compute the super-epochs and the barrier locations. Next, upload the barrier locations and buffer allocations into a modified MPI library. At run time, before each send, the MPI layer checks whether a barrier is needed.
15
Technical Contributions
Algorithms for: partitioning the communication graph into epochs; computing the buffer allocation for each epoch; composing super-epochs from epochs.
Descriptions of: a possible implementation of the mechanism within existing parallel systems; possible extensions of the mechanism.
Analysis: a comparison of our approach to existing approaches; the complexity of the proposed algorithms.