The Totem Single-Ring Ordering and Membership Protocol
Y. Amir, L. E. Moser, P. M. Melliar-Smith, D. A. Agarwal, P. Ciarfella

Introduction
The Totem single-ring protocol:
- Supports high-performance fault-tolerant distributed systems that continue to operate despite network partitioning and remerging, and despite processor failure and restart.
- Provides totally ordered message delivery with low overhead, high throughput, and low latency using a logical token-passing ring.
- Provides rapid detection of network partitioning and processor failure, together with reconfiguration and membership services.

Model
- The distributed system is built on a broadcast domain consisting of a finite number of processors that communicate by broadcasting messages.
- Originate: to broadcast a message for the first time, when it is generated by the application. To achieve reliability, a message may later be retransmitted.
- A processor receives all of its own broadcast messages.
- The broadcast domain can be partitioned into components.
- Each processor has its own identifier and stable storage. If a processor fails and restarts, its identifier does not change, and all or part of its state may have been retained in stable storage.

Model (cont.)
Processors are logically arranged into a ring:
- Each ring has a representative and an identifier that consists of a ring sequence number and the identifier of the representative.
- The token controls access to the ring; only the processor that holds the token can broadcast a message.
The ring is the infrastructure of Totem; the configuration is the view provided to the application:
- The membership of a configuration is a set of processor identifiers.
- A regular configuration has the same membership and identifier as its corresponding ring.
- A transitional configuration consists of the processors that are members of a new ring and come directly from the same old ring.

Services
- Membership Services
- Reliable Ordered Delivery Services

Membership Services
- Uniqueness of Configuration
- Consensus
- Termination
- Configuration Change Consistency

Reliable Ordered Delivery Services
Reliable Delivery for Configuration C:
- Each message has a unique identifier.
- If processor p delivers message m, then p delivers m only once. If p delivers two different messages, then p delivers one of those messages strictly before it delivers the other.
- If p originates message m, then p will deliver m, or will fail before delivering a Configuration Change message that installs a new regular configuration.
- If p is a member of regular configuration C and no configuration change occurs, then p will deliver in C all the messages originated in C.
- If p delivers message m originated in C, then p is a member of C.
- If p and q are both members of configurations C1 and C2, then p and q deliver the same set of messages in C1 before delivering the Configuration Change message that terminates C1 and starts C2.

Reliable Ordered Delivery Services (cont.)
Delivery in Causal Order for Configuration C:
- The delivery order respects Lamport causality within a configuration.
Delivery in Agreed Order:
- Guarantees that processors deliver messages in a consistent total order. When a processor delivers a message, it has already delivered all preceding messages in the same total order.
Delivery in Safe Order:
- When a processor delivers a message, it has determined that every processor in the current configuration has received the message and will deliver it unless that processor fails.
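The difference between agreed and safe delivery can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Message`, `deliverable`, and `safe_up_to` are hypothetical, and `safe_up_to` stands in for whatever mechanism tells a processor that every member of the configuration has received messages up to that sequence number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    seq: int       # position in the total order
    safe: bool     # True if the originator requested safe delivery
    payload: str


def deliverable(received, last_delivered, safe_up_to):
    """Return the messages that can be delivered now, in total order.

    received       -- dict mapping sequence number -> Message
    last_delivered -- highest sequence number already delivered
    safe_up_to     -- assumed: highest sequence number known to have been
                      received by every processor in the configuration
    """
    out = []
    seq = last_delivered + 1
    # Agreed order: never skip a gap in the sequence numbers.
    while seq in received:
        msg = received[seq]
        # Safe order: additionally wait until all members have the message.
        if msg.safe and seq > safe_up_to:
            break
        out.append(msg)
        seq += 1
    return out
```

With messages 1, 2 (safe), 3, and 5 received, only message 1 is deliverable while `safe_up_to` is 1; once it reaches 2, messages 1 through 3 become deliverable and message 5 still waits for the missing message 4.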

Virtual Synchrony
- All processors agree on group membership.
- A configuration change occurs at the same point in the message delivery history for all operational processors.
- Processors that are members of two successive configurations must deliver the same set of messages in the first configuration.
- Failures do not result in incomplete delivery of messages.
- If the system partitions, only processors in the primary component continue to operate.

Extended Virtual Synchrony
Totem extends the virtual synchrony model to systems in which:
- all components of a partitioned system continue to operate and may subsequently remerge
- a failed processor can be repaired and can rejoin the system with its stable storage intact
Two processors may deliver different sets of messages, but they must not deliver messages in inconsistent orders: if p delivers m1 before m2, then q must not deliver m2 before m1.
Messages are delivered in agreed order or safe order, as requested by the originator of the message:
- If processor p delivers message m as safe in configuration C, then every processor in C has received m and will deliver m before it installs a new configuration, unless that processor fails.
- This is achieved by installing a transitional configuration in which any remaining messages from the prior configuration are delivered.
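The "no inconsistent order" rule can be checked mechanically. The sketch below is illustrative only; `consistent_order` is a hypothetical helper that assumes each processor's delivery history is a list of unique message identifiers.

```python
def consistent_order(history_p, history_q):
    """True if p and q deliver the messages they have in common in the same order.

    The two histories may contain different sets of messages; only the
    relative order of the shared messages is compared.
    """
    common = set(history_p) & set(history_q)
    p_common = [m for m in history_p if m in common]
    q_common = [m for m in history_q if m in common]
    return p_common == q_common
```

For example, `consistent_order(["m1", "m2", "m3"], ["m1", "m3"])` holds even though q never delivered m2, whereas `consistent_order(["m1", "m2"], ["m2", "m1"])` does not.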

Atomic Multicast
- Guarantees that a message is delivered either to all the processes or to none at all.
- Requires that all messages are delivered in the same order to all the processes.

Total Ordering Protocol
- When a processor gets hold of the token, it can broadcast one or more messages.
- Each message header contains a sequence number derived from a field of the token. Hence, there is a single sequence of message sequence numbers for all the processors on the ring.
- Delivery of messages in sequence-number order is agreed delivery.
- The token has an additional field, aru (all-received-up-to), which is used in safe delivery to determine when all the processors on the ring have received a message.
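A minimal sketch of how a single token field yields one ring-wide sequence of message numbers. The `Token` class and `broadcast_with_token` function are hypothetical names for illustration; the handling of `aru` is deliberately left out because the slides do not spell out its update rule.

```python
from dataclasses import dataclass


@dataclass
class Token:
    seq: int = 0   # highest sequence number assigned so far on the ring
    aru: int = 0   # all-received-up-to (not updated in this sketch)


def broadcast_with_token(token, payloads):
    """Stamp each payload with the next sequence number taken from the token.

    Only the token holder calls this, so the numbers are unique and
    gap-free across all processors on the ring.
    """
    stamped = []
    for payload in payloads:
        token.seq += 1
        stamped.append((token.seq, payload))
    return stamped
```

If one processor stamps two messages and then passes the token on, the next processor's first message receives sequence number 3, so every receiver can sort by sequence number to obtain the agreed order.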

Total Ordering Protocol (cont.)
- The total ordering protocol cannot continue when the token is lost.
- A Token Retransmission timeout is used to detect such a loss. On a timeout, the processor retransmits the token to the next processor.
- There is a mechanism to detect redundant tokens.
- The membership protocol handles the loss of all copies of the token.

Membership Protocol
- Resolves processor failure, network partitioning, and loss of all copies of the token.
- Detects such failures and constructs a new ring on which the total ordering protocol can resume operation.
- Ensures consensus.
- Generates a new token and recovers messages that had not been received by some of the processors when the failure occurred.

Membership Protocol (cont.)
Join message:
- Contains the set of identifiers of the processors that the sender is considering for membership in a new ring (proc_set).
- Contains the set of identifiers of processors the sender considers failed (fail_set).
- Processors broadcast Join messages to achieve consensus.
Configuration Change message:
- A processor delivers this message directly to the application.
- Describes a change from an old configuration to a transitional configuration, or from a transitional configuration to a new configuration.

Membership Protocol (cont.)
Commit Token:
- Sent by the representative of the ring and circulated around the new ring.
- Has a member_list field containing the membership of the new ring.
- Each processor's entry in member_list has a highest_delivered field, which gives the largest sequence number of a message that the processor has delivered on the old ring.
- The first rotation of the Commit Token collects the information needed to determine the correct handling of messages from the old ring.

Operations of the Membership Protocol: Handling Failure
- The mechanism for detecting failures is the Token Loss timeout.
- When the Token Loss timeout expires, or a Join message is received, the processor invokes the protocol for forming a new ring.
- The processor collects information about operational and failed processors and broadcasts it in the proc_set and fail_set fields of a Join message.
- Upon receiving a Join message, a processor updates its my_proc_set and my_fail_set. If either has changed, it broadcasts a Join message with the updated sets.
- Consensus is reached when my_proc_set and my_fail_set are equal to the proc_set and fail_set of the Join messages received from every processor.
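The Join exchange above can be sketched as two small functions. This is an illustration under assumptions, not the protocol's definitive rules: merging sets by union and requiring matching Join messages only from processors in my_proc_set that are not in my_fail_set are both assumptions made here, and `merge_join` and `consensus_reached` are hypothetical names.

```python
def merge_join(my_proc_set, my_fail_set, proc_set, fail_set):
    """Fold a received Join message into the local sets.

    Assumption: the sets are merged by union. Returns the new sets and a
    flag telling the caller whether it must broadcast an updated Join.
    """
    new_proc = my_proc_set | proc_set
    new_fail = my_fail_set | fail_set
    changed = new_proc != my_proc_set or new_fail != my_fail_set
    return new_proc, new_fail, changed


def consensus_reached(my_proc_set, my_fail_set, joins):
    """joins maps processor id -> (proc_set, fail_set) from its latest Join.

    Assumption: agreement is required from every processor that is in
    my_proc_set and not in my_fail_set (including this processor itself).
    """
    expected = my_proc_set - my_fail_set
    return all(
        p in joins and joins[p] == (my_proc_set, my_fail_set)
        for p in expected
    )
```

Using `frozenset` values for the sets keeps the comparison between local state and received Join messages a plain equality test.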

Operations of the Membership Protocol (cont.)
- The representative of the ring is the processor with the lowest identifier.
- The representative sends a Commit Token around the new ring to collect the information needed to determine the correct handling of messages from the old ring.
- When the Commit Token has circulated twice, a new ring is formed but not yet installed.
- The recovery protocol is then executed to retransmit the messages from the old ring that must be exchanged to maintain agreed and safe delivery.
- In one atomic action, each processor delivers the exchanged messages to the application along with the Configuration Change message.
- The new ring is installed, and message delivery resumes.
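Choosing the representative and building the ring identifier can be sketched as follows. The function names are hypothetical, processor identifiers are assumed to be comparable values such as integers, and the ring sequence number is passed in because the slides do not say how it is chosen.

```python
def choose_representative(members):
    """The representative is the member with the lowest identifier."""
    return min(members)


def ring_identifier(ring_seq, members):
    """A ring identifier pairs a ring sequence number with the representative's id."""
    return (ring_seq, choose_representative(members))
```

For a new ring with members 7, 3, and 12 and ring sequence number 8, the representative is processor 3 and the ring identifier is `(8, 3)`.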

Questions?