Approximating the Buffer Allocation Problem Using Epochs

Presentation transcript:

Approximating the Buffer Allocation Problem Using Epochs
Jan Bækgaard Pedersen (U. of Nevada, Las Vegas)
Alex Brodsky (U. of Winnipeg, Canada)

Parallel Applications

Goal: overlap communication and computation.

Parallel applications consist of a number of asynchronous, interconnected hosts that run at different speeds, with no access to a global clock, and that communicate by sending messages to each other over a local area network. To be efficient, such applications typically overlap communication and computation.

Parallel Applications (cont.)

Today, parallel applications typically run on heterogeneous platforms, are ported to systems with varying resources, and run on hardware platforms that differ from those they were designed on.

Goals: ensure that ported applications do not deadlock, and determine application resource requirements.

Focus: determine an application's message buffer needs.

(A ported application can still be run on its original hardware to compute the communication graph introduced later.)

Message Buffering

The slide contrasts three cases: an unbuffered (synchronous) send, a buffered (asynchronous) send, and a send when the buffer is already full.

If a message send is unbuffered, the sending process performs the send and must then wait for the receiver to perform the matching receive before it can complete the send and carry on with its own computation. With buffered message passing, the sender can send the message, have it stored in a buffer, and continue computing; the receiver retrieves the message from the buffer at its convenience, allowing computation and communication to overlap. If, however, the buffer is full, the send blocks until the receiver performs a receive. If the application assumes that a buffer is available, this can be a problem.
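
To make the two completion semantics concrete, here is a minimal MPI sketch in C. It is illustrative only: the talk's figures use plain MPI_Send, whose buffering is implementation-dependent, whereas this sketch uses MPI_Bsend with an explicitly attached buffer to force buffered behaviour; swapping in MPI_Ssend instead would force the synchronous behaviour, blocking the sender until the receive is posted.

    /* Minimal sketch (run with at least two ranks): rank 0 sends to rank 1,
     * and rank 1 deliberately delays its receive.  MPI_Bsend copies the
     * message into the user-attached buffer and returns immediately, so
     * rank 0 can compute while the message is in transit. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        int rank, data = 42;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Attach a buffer large enough for one int plus MPI's bookkeeping. */
        int bufsize = sizeof(int) + MPI_BSEND_OVERHEAD;
        void *buf = malloc(bufsize);
        MPI_Buffer_attach(buf, bufsize);

        if (rank == 0) {
            MPI_Bsend(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD); /* returns at once */
            /* ... computation overlapped with message delivery ... */
        } else if (rank == 1) {
            sleep(1);                                            /* slow receiver */
            MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", data);
        }

        MPI_Buffer_detach(&buf, &bufsize);
        free(buf);
        MPI_Finalize();
        return 0;
    }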

The Problem

Consider a communication pattern in which each process performs a send, then presumably some computation, and then a receive. The pattern succeeds as long as one process has a free buffer: if the first process has a buffer, then when the third process performs its send, the message is buffered and does not block the sender, which can then perform its receive and in turn allow the remaining processes to complete their sends. However, if no buffers are available when the processes perform their sends, none of the sends can proceed and deadlock ensues: if all buffers are exhausted, deadlock occurs!

Solution: ensure that the application has a sufficient number of buffers.
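
The deadlock-prone pattern on this slide can be written as a short MPI program; the ring of processes, message contents, and tags below are illustrative. Whether it runs to completion depends entirely on whether the MPI implementation can buffer at least one of the standard-mode sends.

    /* Sketch of the slide's pattern: every rank sends to its right
     * neighbour and then receives from its left neighbour.  If the MPI
     * implementation can buffer at least one message, the cycle is broken
     * and every rank completes; if no buffer is available, every
     * standard-mode MPI_Send blocks waiting for a matching receive and
     * the ring deadlocks. */
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, size, out = 0, in = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int right = (rank + 1) % size;
        int left  = (rank + size - 1) % size;

        MPI_Send(&out, 1, MPI_INT, right, 0, MPI_COMM_WORLD);  /* may block */
        /* ... computation that should overlap with communication ... */
        MPI_Recv(&in, 1, MPI_INT, left, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        MPI_Finalize();
        return 0;
    }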

Complications

Deadlock can also result from an untimely use of a buffer: by performing a send, a process can steal a buffer meant for another message, creating deadlock because the other sends can no longer proceed.

Such problems arise even if the communication is oblivious (independent of application input), is point-to-point (no broadcast or multicast), and does not use wild-cards (the receiver must specify the sender).

Interestingly, the problem of allocating buffers for an application is NP-hard to solve optimally [BPW'05], it is coNP-complete to verify whether a given allocation is deadlock-free, and it therefore needs to be approximated.

The Model (Communication Graphs)

We model this problem using communication graphs. Given an application and its communication pattern, we model the application as a graph: each send and each receive is a vertex, computation arcs connect consecutive events on the same process, and communication arcs connect each send to its matching receive.

Key point: since we assume an oblivious communication pattern, all application behaviour of interest (i.e., the communication) is captured by this graph; an application is represented by its communication graph.
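
For illustration, a communication graph could be encoded in C roughly as follows; this is only a sketch of one possible representation (the type and field names are assumptions, not the paper's data structure).

    /* Illustrative only: one possible C encoding of a communication graph.
     * Vertices are send/receive events; computation arcs link consecutive
     * events on one process, communication arcs link a send to its
     * matching receive. */
    typedef enum { VERT_SEND, VERT_RECV } vert_kind;
    typedef enum { ARC_COMPUTATION, ARC_COMMUNICATION } arc_kind;

    typedef struct {
        vert_kind kind;   /* send or receive event           */
        int process;      /* rank on which the event occurs  */
        int step;         /* position in that rank's program */
    } vertex;

    typedef struct {
        int from, to;     /* indices into the vertex array   */
        arc_kind kind;    /* computation or communication    */
    } arc;

    typedef struct {
        vertex *verts; int nverts;
        arc    *arcs;  int narcs;
    } comm_graph;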

Basic Approach

Our basic approach is to induce a dependency graph from the communication graph by reversing the computation arcs and making the communication arcs bidirectional: if there are no buffers, a receive and its matching send depend on each other. A cycle in the dependency graph indicates the need for a buffer; if there are no buffers, the sends on the cycle will never complete.

As we have seen, adding a buffer is not enough, since buffers may be stolen by other processes that perform untimely sends. Thus, our initial approximation is to assign a buffer to every receive. Although this ensures delay-freedom, so that no process ever blocks, it is clearly overkill in the number of buffers needed. A better approach is needed.

A Better Approximation

Observe that if send (2) always occurs before send (3), the buffer will not be stolen, deadlock will be prevented, and only one buffer is needed: send (2) grabs the buffer, and no deadlock ensues. The key point is that sends (1) and (2) must be ordered before send (3).

How do we do this? Several implementation issues must be addressed: how do we order the sends, how do we enforce that order, what kind of order is it, and how many buffers are needed?

Our Approach

1. Partition the communication graph into epochs. Epochs are well ordered, although sends within an epoch may not be.
2. Determine per-epoch buffer allocations.
3. Compute the maximum buffer allocation over all per-epoch allocations; this is the buffer allocation for the application.
4. Group consecutive epochs into super-epochs.
5. Use a barrier primitive to enforce the execution order of the super-epochs.

Epochs

Epochs are maximal connected subgraphs of the communication graph and represent phases of the application: there is no communication between epochs, and epochs execute in serial order. All epochs are strictly ordered; each epoch is executed once the preceding epoch has finished and all of its buffers have been freed. This ordering is enforced with a barrier synchronization primitive.
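
As an illustration of the barrier-enforced ordering, the following C/MPI sketch separates two communication phases by a barrier; the phases, partners, and use of MPI_Sendrecv are illustrative choices, not taken from the paper.

    /* Sketch: enforcing epoch ordering with a barrier.  The point is that
     * no process starts the second epoch until every process has finished
     * the first, so all buffers from the first epoch are free again. */
    #include <mpi.h>

    void run_two_epochs(int partner_a, int partner_b, MPI_Comm comm) {
        int rank, x = 0, y = 0;
        MPI_Comm_rank(comm, &rank);

        /* Epoch 1: exchange with partner_a. */
        MPI_Sendrecv(&rank, 1, MPI_INT, partner_a, 0,
                     &x,    1, MPI_INT, partner_a, 0, comm, MPI_STATUS_IGNORE);

        MPI_Barrier(comm);   /* epoch boundary: all buffers have been freed */

        /* Epoch 2: exchange with partner_b. */
        MPI_Sendrecv(&rank, 1, MPI_INT, partner_b, 0,
                     &y,    1, MPI_INT, partner_b, 0, comm, MPI_STATUS_IGNORE);
    }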

Number of Buffers Needed

We compute the number of buffers for each epoch. An epoch that contains a cycle needs more than zero buffers; for such epochs we use the approximation of [BPW'05]. An epoch with no cycles needs zero buffers: no cycles means no deadlock. At the end of each epoch all buffers are freed and become available to the next epoch, so the total number of buffers needed is simply the maximum over all epochs. In the slide's example the first epoch needs 1 buffer and the second needs 0, so the total is max{1, 0} = 1 buffer.
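
Because buffers are recycled at every epoch boundary, combining the per-epoch figures is just a maximum. A tiny sketch, assuming the per-epoch counts have already been obtained (e.g., with the [BPW'05] approximation):

    /* Sketch: the application-wide requirement is the maximum of the
     * per-epoch requirements, since all buffers are freed at each epoch
     * boundary.  per_epoch_buffers holds the per-epoch counts. */
    int total_buffers(int num_epochs, const int *per_epoch_buffers) {
        int max = 0;
        for (int e = 0; e < num_epochs; e++)
            if (per_epoch_buffers[e] > max)
                max = per_epoch_buffers[e];
        return max;
    }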

Avoiding Unnecessary Barriers

Barriers are expensive operations, and in many cases they are not needed. In the slide's example, each send is in its own epoch, so a barrier occurs after every send even though no buffers are needed. The idea is to combine consecutive epochs into a super-epoch: this eliminates unneeded barriers and increases parallelism without increasing the number of buffers. Super-epochs are treated like epochs: they are serially ordered and separated by barriers, there is no explicit synchronization within a super-epoch, and a super-epoch uses at most the maximum number of buffers over all epochs.
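
The talk does not spell out how consecutive epochs are grouped, so the following is only a plausible greedy sketch, not the paper's algorithm: consecutive epochs are appended to the current super-epoch as long as the sum of their buffer requirements stays within the application-wide maximum, so merging never increases the total allocation.

    /* Hypothetical greedy grouping (not the paper's algorithm): append
     * consecutive epochs to the current super-epoch while the sum of
     * their buffer needs stays within max_buffers.  super_of[e] receives
     * the super-epoch id of epoch e; a barrier is placed wherever the
     * id changes. */
    void group_into_super_epochs(int num_epochs, const int *need,
                                 int max_buffers, int *super_of) {
        int current = 0, used = 0;
        for (int e = 0; e < num_epochs; e++) {
            if (used > 0 && used + need[e] > max_buffers) {
                current++;   /* start a new super-epoch: barrier goes here */
                used = 0;
            }
            used += need[e];
            super_of[e] = current;
        }
    }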

Proposed Implementation

(The slide depicts several hosts, each running the application on top of MPI and the O/S.)

We propose the following implementation. First, we compute the barrier and buffer allocations: we generate the communication graph from the application, compute the epochs, and perform the per-epoch buffer allocations; we then combine epochs into super-epochs and determine the locations of the barriers. Next, we use a modified MPI library into which we upload the barrier locations and buffer allocations. Before each send, MPI checks whether a barrier is needed.
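
One way such a per-send check could be realised is through the MPI profiling interface; the sketch below is hypothetical (the barrier table, its size, and the per-rank send counter are invented for illustration and are not the paper's mechanism).

    /* Hypothetical sketch of the per-send check, written as a wrapper via
     * the MPI profiling interface.  barrier_before[] stands in for the
     * barrier locations uploaded by the offline analysis. */
    #include <mpi.h>

    #define MAX_SENDS 1024
    static int barrier_before[MAX_SENDS];  /* filled from the offline analysis */
    static int send_count = 0;             /* index of this rank's next send   */

    int MPI_Send(const void *buf, int count, MPI_Datatype type,
                 int dest, int tag, MPI_Comm comm) {
        if (send_count < MAX_SENDS && barrier_before[send_count])
            MPI_Barrier(comm);             /* super-epoch boundary reached     */
        send_count++;
        return PMPI_Send(buf, count, type, dest, tag, comm);  /* the real send */
    }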

Technical Contributions

Algorithms for:
- partitioning the communication graph into epochs
- computing the buffer allocation for each epoch
- composing super-epochs from epochs

Descriptions of:
- a possible implementation of the mechanism within existing parallel systems
- possible extensions of the mechanism

Analysis of:
- how our approach compares to existing approaches
- the complexity of the proposed algorithms