Revoke / Incarnation #s / Matching
Discussion around how to reclaim context IDs (resources that are a part of message matching) after an MPI_Comm_revoke.

Basic problem: revoke is one-sided and can be called by multiple processes in the communicator
– There is a race between calling revoke and when all correct processes update their local state to revoked
– Need to ensure that all processes have revoked the communicator before the context ID can be reused

Scenario:
– Communicator with correct processes A, B, and C is revoked
– A and B free the revoked communicator and create a new communicator using the old context ID
– C calls revoke on the old communicator -- what happens at A and B?
– OR -- C sends a message to A/B, who has posted an ANY_SOURCE receive -- does it match?

Several solutions were discussed:
1. Incarnation number -- An additional number on each context ID that becomes a part of the matching
2. Group guards -- Check incoming messages to ensure that the sender is in the group of the communicator
3. Fault tolerant MPI_Comm_free/create -- Enhance create/free algorithms to quiesce context IDs before they are used
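
The scenario and the first two proposed solutions can be illustrated with a small matching model. This is only a sketch under stated assumptions, not how any MPI implementation performs matching: `Envelope`, `PostedRecv`, `matches`, and the `ANY_SOURCE` stand-in are hypothetical names invented for illustration, and the context ID value 7 is arbitrary. It models a late message from C on the old communicator arriving at A, which has posted an ANY_SOURCE receive on a new communicator that reuses the same context ID.

```python
from dataclasses import dataclass
from typing import FrozenSet, Union

ANY_SOURCE = -1  # hypothetical stand-in for MPI_ANY_SOURCE


@dataclass(frozen=True)
class Envelope:
    """Matching data carried by an incoming message (illustrative only)."""
    context_id: int
    incarnation: int  # assumed extra field from solution 1
    source: str
    tag: int


@dataclass(frozen=True)
class PostedRecv:
    """A receive posted on a communicator (illustrative only)."""
    context_id: int
    incarnation: int
    source: Union[str, int]  # a process name, or ANY_SOURCE
    tag: int
    group: FrozenSet[str]    # members of the receiving communicator


def matches(recv, msg, use_incarnation=False, use_group_guard=False):
    """Return True if msg would match recv under the selected checks."""
    if recv.context_id != msg.context_id or recv.tag != msg.tag:
        return False
    if recv.source != ANY_SOURCE and recv.source != msg.source:
        return False
    # Solution 1: the incarnation number becomes part of the matching.
    if use_incarnation and recv.incarnation != msg.incarnation:
        return False
    # Solution 2: reject senders outside the communicator's group.
    if use_group_guard and msg.source not in recv.group:
        return False
    return True


# Old communicator {A, B, C}: context ID 7, incarnation 0 (now revoked).
# New communicator {A, B}: reuses context ID 7, incarnation 1.
stale_from_c = Envelope(context_id=7, incarnation=0, source="C", tag=0)
fresh_from_b = Envelope(context_id=7, incarnation=1, source="B", tag=0)
recv_at_a = PostedRecv(context_id=7, incarnation=1, source=ANY_SOURCE,
                       tag=0, group=frozenset({"A", "B"}))

plain = matches(recv_at_a, stale_from_c)
with_incarnation = matches(recv_at_a, stale_from_c, use_incarnation=True)
with_group_guard = matches(recv_at_a, stale_from_c, use_group_guard=True)
fresh_ok = matches(recv_at_a, fresh_from_b,
                   use_incarnation=True, use_group_guard=True)
```

In this model the stale message matches when only the context ID and tag are compared, and is rejected by either extra check, while a message from B on the new communicator still matches. The group guard only helps here because C is not a member of the new communicator; the third solution (quiescing context IDs in create/free) is not modeled.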

RMA Semantics
– Pavan raised a concern about the definition of RMA window memory in the context of shared memory windows
– It may be impossible to guarantee that only locations updated in the window are invalid
– Suggested weakening the semantic to the entire window being undefined
– Requires further discussion

Shared Memory
What happens if a process with shared memory goes down and another process has posted messages using its shared memory?
– Yes, this is an implementation issue, but is it possible to do anything?