6/21/2015Page 1 This presentation is based on WS-Membership: Failure Management in Web Services World B. Ramamurthy Based on Paper by Werner Vogels and.

Presentation transcript:

WS-Membership: Failure Management in Web Services World
B. Ramamurthy, 6/21/2015
This presentation is based on the paper by Werner Vogels and Chris Re.

Figure 12.1: A network partition
Instructor’s Guide for Coulouris, Dollimore and Kindberg, Distributed Systems: Concepts and Design, Edn. 4 © Pearson Education 2005

Agreement in Pepperland (p.51)
Consider an army in Pepperland with two divisions, Apple and Orange, located on the hills.
The Blue Meanies are the invaders, now located in the valley.
Apple and Orange have to decide when to attack.
They exchange messages about their strength: the number of attack items (personnel, machinery, etc.).
They reach a consensus on who will attack first based on strength. The attack message is then sent from the stronger team to the weaker team.
The message delay lies in the range {min … max} minutes.
Apple (say) sends the attack message, waits min minutes, then starts its attack. The other team waits 1 minute after it receives the attack message, then attacks.
Ideal guarantee: Orange starts its attack after Apple, and no more than (max - min + 1) minutes after Apple does.
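The timing guarantee can be checked with a little arithmetic. The sketch below is only an illustration: the function name `charge_gap_bounds` and the sample delays are assumptions, not part of the slides. It measures time from the moment the attack message is sent.

```python
def charge_gap_bounds(min_delay, max_delay, wait_after_receipt=1):
    """Return (smallest, largest) gap in minutes between the two attacks.

    The leader sends the attack message at time 0, waits min_delay minutes,
    then attacks. The message takes between min_delay and max_delay minutes
    to arrive, and the follower attacks wait_after_receipt minutes later.
    """
    leader_attack = min_delay
    earliest_follower = min_delay + wait_after_receipt
    latest_follower = max_delay + wait_after_receipt
    return earliest_follower - leader_attack, latest_follower - leader_attack


# Hypothetical delays: messages take between 2 and 10 minutes.
low, high = charge_gap_bounds(min_delay=2, max_delay=10)
print(low, high)  # 1 9, and 9 equals max - min + 1
```

The lower bound shows that the follower never attacks before the leader. The upper bound is the (max - min + 1) figure from the slide.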

Consensus in the presence of failure (p.54)
How do you detect failure?
What if the messenger from Apple to Orange is captured?
How does Orange know if Apple has been defeated?
Reaching agreement in the presence of failures can be impossible: should the divisions surrender or attack?
If the messenger is captured, there is no way to achieve agreement.

Consensus in Space Mission Launches
Mission-critical applications.
A set of divisions, each in charge of a module of the (shuttle) mission.
They have to agree, or come to a consensus, to launch (attack in Pepperland) or abort (surrender in Pepperland).
If a message fails, the consensus is to abort the mission.

Shuttle Story
“I had a friend who was one of the programmers who worked on the Shuttle guidance system, which used three computers in parallel, and operated under a consensus model. If two computers agreed on a decision, the third would remain quiescent. If they disagreed, however, the third computer would jump in to cast a tie-breaking vote. I asked what would happen if the three computers came up with three different answers, and he shook his head. I asked what would happen if one of those computers dropped out, and he said the other two would end up dead-locked.”

Introduction to WS-Membership
An important factor in the successful deployment of federated web-services-based business activities will be the ability to guarantee reliable distributed operation and execution.
Failure management is essential for systems constructed out of web services on the network.
WS-Membership is:
– a coordination service
– a generic web-service interface for tracking registered web-services and
– for providing membership monitoring information.
A prototype membership service based on epidemic protocol techniques has been implemented.
Context: the Obduro project, which focuses on globally scalable distributed systems based on web-service technologies.

Obduro Project
Development of advanced distributed services in the context of the WS-Coordination framework.
Development of high-performance server technology for web-services routing.
Integration of reliability and other distributed services into coordination and choreography engines.
Development of a framework for global event management.

Failure Management
Failure management is essential for building reliable distributed systems.
Tracking which services are participating in an activity, and what their status is, drives the progress of the activity.
WS-Membership is developed in the context of the WS-Coordination specification.
Membership can be realized by a simple heartbeat (as in Hadoop).
Failure detection can also be used as a building block to simplify the implementation of consensus.
Consensus is used when a set of processes has to agree upon the outcome of an operation.
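A simple heartbeat-based membership check can be sketched as follows. This is a minimal illustration, not the WS-Membership prototype or Hadoop's implementation. The class name `HeartbeatDetector`, the timeout value, and the member names are all assumptions.

```python
class HeartbeatDetector:
    """Marks a member as failed when no heartbeat arrives within `timeout`."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # member name -> time of the last heartbeat

    def heartbeat(self, member, now):
        # Record that `member` reported in at time `now`.
        self.last_seen[member] = now

    def failed(self, now):
        # Members whose last heartbeat is older than the timeout.
        return sorted(m for m, t in self.last_seen.items()
                      if now - t > self.timeout)


detector = HeartbeatDetector(timeout=3.0)
detector.heartbeat("app1", now=0.0)
detector.heartbeat("app2", now=2.0)
print(detector.failed(now=4.0))  # ['app1']
```

Time is passed in explicitly so the sketch stays deterministic. A real detector would read a clock and would have to weigh a short timeout (fast detection) against false suspicions of slow members.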

WS-Membership
The membership service is about service availability.
Coordination protocol.
Tracks registered members.
Presents membership updates to monitors.
Two components of WS-Membership: failure detection and membership dissemination.

Component services
Epidemic communication
State management
Development of advanced distributed services in the context of the web-services Coordination framework.
– These services will include a failure management service, a consensus service and a lightweight distributed state-sharing engine.

The Membership Framework: Five Roles Modeled
Coordination Service – Receives activation and membership requests and routes them to the Membership Service.
Membership Service – Provides failure detection of registered web-services and disseminates membership information.

Five roles (contd.)
Member Service – A software component that has registered itself for failure detection, either directly with a Membership Service or through a Membership Proxy.
Membership Proxy – A software component that is interposed between a Member Service and the Membership Service for reasons of efficiency or accuracy.
Membership Monitor – This service registers itself with the Membership Service to receive changes to the membership state.

Activation & Registration
The Coordination Service provides 2-step access to the Membership Service:
Activation at a URI: the Membership Service is created
– createCoordinationContext (CoordinationType) returns coordinationContext
Registration: proxies and services register with the Membership Service
– requestMembership (serviceURI, coordContext, port for probe)
Other methods:
– memberProbe, memberAlive, memberLeaves
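The two-step access pattern can be sketched with an in-memory stand-in. The method names come from the slide. Everything else is an assumption made for illustration: the class, the context string format, the `members` helper, and the example URIs. A real deployment would exchange web-service messages rather than make local calls.

```python
import itertools


class CoordinationService:
    """In-memory stand-in for the activation and registration steps."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._members = {}  # coordination context -> {service URI: probe port}

    def createCoordinationContext(self, coordination_type):
        # Step 1 (activation): create a membership service instance
        # and hand back a context that identifies it.
        context = f"{coordination_type}#{next(self._ids)}"
        self._members[context] = {}
        return context

    def requestMembership(self, service_uri, coord_context, probe_port):
        # Step 2 (registration): a proxy or service joins an activated context.
        if coord_context not in self._members:
            raise KeyError("unknown coordination context: activate first")
        self._members[coord_context][service_uri] = probe_port

    def memberLeaves(self, service_uri, coord_context):
        # Graceful exit of a registered member.
        self._members[coord_context].pop(service_uri, None)

    def members(self, coord_context):
        # Hypothetical helper for inspecting the registered members.
        return sorted(self._members[coord_context])


service = CoordinationService()
ctx = service.createCoordinationContext("membership")
service.requestMembership("http://example.org/app1", ctx, 9001)
service.requestMembership("http://example.org/app2", ctx, 9002)
service.memberLeaves("http://example.org/app1", ctx)
print(service.members(ctx))  # ['http://example.org/app2']
```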

See Fig. 1 for the activation & registration sequence.
Correction: change App3 on the right end to App2.

The Epidemic Membership
Based on gossip-style failure detection.
This is a probabilistic approach to failure detection.
It is based on epidemic state maintenance techniques, which provide an excellent foundation for constructing loosely coupled, asynchronous, autonomous distributed components.
Eventual consistency is reached (somewhat like the Chandy & Lamport algorithm).

Epidemic Membership Service (EMS)
Each participant holds a list of known peers.
Eventual consistency.
Best for loosely coupled, asynchronous systems.
Operational details (Fig. 2, 3): gossip received + local membership state → new membership state.
Gossip: if the Membership Service fails, all members are marked failed.
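The rule "gossip received + local membership state → new membership state" can be sketched as a merge of two tables. This assumes, as in gossip-style failure detection generally, that each entry carries a heartbeat counter and that the higher counter wins. The function name `merge` and the sample tables are illustrative and are not taken from Fig. 2 or Fig. 3.

```python
def merge(local, received):
    """Return the new membership state: per member, keep the highest heartbeat."""
    merged = dict(local)
    for member, beat in received.items():
        if beat > merged.get(member, -1):
            merged[member] = beat
    return merged


local = {"A": 5, "B": 2}
gossip = {"B": 4, "C": 1}
print(merge(local, gossip))  # {'A': 5, 'B': 4, 'C': 1}
```

Because the merge only ever moves counters upward, applying the same gossip twice, or in a different order, gives the same result. This is what lets the participants converge without coordination.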

Features of EMS
A strong mathematical underpinning allows us to compute the probability of mistakes.
The communication techniques used to exchange messages are highly robust.
Membership exchanges between members are asynchronous.
Participants are able to make decisions autonomously about failures of other participants.

Types of information exchanged through gossip
Members. This is the list of the Member Service URIs that are registered and are active. This information set includes a logical timestamp of when it was last updated.
Joined. A list of Member Services that have recently registered, each with the logical timestamp of the moment of registration.
Left. When a Member Service gracefully exits, it should send a MemberLeaves indication to the Membership Service it has registered with. This will remove the member from the Members list and place it in the Left set, annotated with the logical timestamp.
Failed. After a member has been detected as failed, it is removed from the Members set and placed in this set, annotated with the logical timestamp.
Suspected. An option at Activation time is to specify a threshold that would mark a member as suspected before it is marked failed.
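The five information sets and the transitions between them can be sketched as a small data structure. This is an illustration under stated assumptions: the class `MembershipState`, its method names, and the single shared logical clock are hypothetical, and a suspected member is assumed to stay in the Members set until it is marked failed.

```python
from dataclasses import dataclass, field


@dataclass
class MembershipState:
    clock: int = 0                                  # logical timestamp source
    members: dict = field(default_factory=dict)     # URI -> timestamp
    joined: dict = field(default_factory=dict)
    left: dict = field(default_factory=dict)
    failed: dict = field(default_factory=dict)
    suspected: dict = field(default_factory=dict)

    def _tick(self):
        self.clock += 1
        return self.clock

    def register(self, uri):
        stamp = self._tick()
        self.members[uri] = stamp
        self.joined[uri] = stamp

    def member_leaves(self, uri):
        # Graceful exit: Members -> Left.
        if self.members.pop(uri, None) is not None:
            self.suspected.pop(uri, None)
            self.left[uri] = self._tick()

    def suspect(self, uri):
        # Threshold crossed, but not yet declared failed.
        if uri in self.members:
            self.suspected[uri] = self._tick()

    def fail(self, uri):
        # Detected as failed: Members -> Failed.
        if self.members.pop(uri, None) is not None:
            self.suspected.pop(uri, None)
            self.failed[uri] = self._tick()


state = MembershipState()
state.register("http://example.org/app1")
state.register("http://example.org/app2")
state.suspect("http://example.org/app1")
state.fail("http://example.org/app1")
state.member_leaves("http://example.org/app2")
print(state.failed, state.left)
```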

Operational Details
EMS was developed in the context of the Xerox Clearinghouse project.
Each participant maintains a list of known peers.
Periodically, participants update a heartbeat counter and send a message to their peers.
A push-pull model is used instead of a push-only gossip model.
Study Fig. 3 to understand the operational details.
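One round of push-pull gossip with heartbeat counters might look like the sketch below. It is a simplified illustration and not the protocol of Fig. 3. The names `Participant`, `absorb`, `gossip_with`, and `run_round` are assumptions, and peers are chosen uniformly at random.

```python
import random


class Participant:
    def __init__(self, name):
        self.name = name
        self.heartbeats = {name: 0}  # known peers -> highest heartbeat seen

    def absorb(self, table):
        # Keep the highest heartbeat counter seen for each peer.
        for peer, beat in table.items():
            self.heartbeats[peer] = max(beat, self.heartbeats.get(peer, -1))

    def gossip_with(self, other):
        self.heartbeats[self.name] += 1   # update own heartbeat counter
        other.absorb(self.heartbeats)     # push: send our table to the peer
        self.absorb(other.heartbeats)     # pull: take the peer's table back


def run_round(participants, rng):
    # Each participant gossips with one randomly chosen peer.
    for p in participants:
        peer = rng.choice([q for q in participants if q is not p])
        p.gossip_with(peer)


nodes = [Participant(n) for n in ("A", "B", "C", "D")]
rng = random.Random(0)
for _ in range(5):
    run_round(nodes, rng)
print(nodes[0].heartbeats)
```

With push only, a participant learns news just when someone happens to pick it. The pull half lets it also learn from whichever peer it contacts, which generally speeds up dissemination.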

Fault model?
How would you use EMS to realize a fault model for your system?