A Progressive Fault Detection and Service Recovery Mechanism in Mobile Agent Systems Wong Tsz Yeung Aug 26, 2002

Outline
Introduction of the problem
How to Solve the Problem
– Server failure detection and recovery
– Agent failure detection and recovery
– Link failure
Failure Detection and Recovery Mechanism Analysis
– Liveness proof
– Mechanism simplification analysis

Outline
Reliability Evaluations
– Using an agent implementation
– Using Stochastic Petri Net simulation

Introduction of the problem
Focus on designing a fault-tolerant mobile agent system
The challenges are:
– Guarantee the availability of the servers.
– Guarantee the availability of the agents.
– Preserve data consistency in both agents and servers.
– Preserve the exactly-once property.
– Guarantee that the agent can eventually finish its tasks.

Introduction of the problem
Fault-tolerance is classified into levels
– Level 0: No tolerance to faults
– Level 1: Server failure detection and recovery
– Level 2: Agent failure detection and recovery
– Level 3: Link failure

Level 0
No tolerance to faults
– When an agent dies, because of a server failure or because of faults inside the agent:
– the application has to be restarted manually.
– the affected server may be left in an inconsistent state after recovery.

Level 1
Server failure detection and recovery
– Have a failure detection program running.
– When a server restarts, abort all uncommitted transactions in the server. This preserves data consistency.
– When the agent re-executes from its initial state, visited servers will be visited again. This violates the exactly-once execution property.

Level 2
Agent failure detection and recovery
– When a server fails, the agents residing on it are lost. We aim to recover such losses at this level.
– By using checkpointing: we checkpoint the agent's internal data and use the checkpointed data to recover lost agents.

Level 2
– Since we use checkpointed agent data, agent data consistency is preserved.
– Recovery of the agent happens on the failed server. This preserves the exactly-once execution property.

Level 3
Link failure
– We assume the agent is ready to migrate from server u to server v, but a link failure happens.
– 3 scenarios: before the agent leaves u; while the agent is traveling to v; after the agent has reached v.
– Different scenarios have different problems and corresponding solutions.

Design of Level 1 FT
We have a global daemon which monitors all the servers.
– This introduces a single-point-of-failure problem.
(Diagram: the monitoring daemon overseeing the server pool.)

Design of Level 1 FT
When the daemon recovers a server:
– it aborts all the uncommitted transactions performed by the lost agents.
– This preserves data consistency in the server.
This technique:
– is easy to implement, and
– can be deployed on every existing mobile agent platform without modifying the platform.
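To make the design concrete, here is a minimal sketch of how such a monitoring daemon could be structured, assuming a simple heartbeat-style probe over TCP; the class and method names (ServerMonitor, isAlive, recover) are illustrative and not taken from the actual implementation.

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.List;

// Illustrative Level 1 monitoring daemon: it periodically probes every server
// in the pool and triggers recovery when a probe fails.
public class ServerMonitor implements Runnable {
    private static final int PROBE_TIMEOUT_MS = 2000;
    private static final int PROBE_PERIOD_MS = 5000;
    private final List<InetSocketAddress> serverPool;

    public ServerMonitor(List<InetSocketAddress> serverPool) {
        this.serverPool = serverPool;
    }

    private boolean isAlive(InetSocketAddress server) {
        try (Socket socket = new Socket()) {
            socket.connect(server, PROBE_TIMEOUT_MS);  // heartbeat by TCP connect
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    private void recover(InetSocketAddress server) {
        // Restart the server (platform-specific, omitted here) and abort all
        // uncommitted transactions left behind by the lost agents, which
        // preserves data consistency in the server.
        System.out.println("Recovering " + server + ": abort uncommitted transactions");
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            for (InetSocketAddress server : serverPool) {
                if (!isAlive(server)) {
                    recover(server);
                }
            }
            try {
                Thread.sleep(PROBE_PERIOD_MS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}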

Design of Level 2 FT
We use cooperative agents:
– the actual agent
– the witness agent
The actual agent performs the actual computation for the user.
The witness agent monitors the availability of the actual agent.
– It lags behind the actual agent.

Design of Level 2 FT
In our protocol, the actual agent is able to communicate with the witness agent.
– The message is not a broadcast one, but a peer-to-peer one.
– The actual agent can assume that the witness agent is in the previous server.
– The actual agent must know the address of the previous server.

Protocol of Level 2 FT
(Diagram: the actual agent on server i writes "arrive at i" and "leave i" entries to the server log and sends the corresponding arrive/leave messages to the witness's message box on server i-1; checkpointing happens on server i.)

Protocol of Level 2 FT
(Diagram continued: after receiving "arrive at i" and "leave i", the witness on server i-1 spawns a new witness on server i; the actual agent then sends "arrive at i+1" and "leave i+1" to the witness on server i.)
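The per-server behaviour of the actual agent in this protocol can also be summarized in code. The following sketch is purely illustrative: the interfaces (ServerLog, WitnessChannel, Checkpointer, AgentTask) are placeholders rather than the Concordia API, migration itself is elided, and the ordering (checkpoint before leaving) follows the diagram above.

// Illustrative outline of the actual agent's actions on server i under the
// Level 2 protocol. All types below are placeholders for whatever the
// underlying mobile agent platform actually provides.
public class ActualAgentStep {

    interface ServerLog      { void append(String entry); }
    interface WitnessChannel { void send(String message); } // peer-to-peer channel to the witness on server i-1
    interface Checkpointer   { void checkpoint(AgentTask task); }
    interface AgentTask      { void executeOn(String server); }

    void runOnServer(String serverI, AgentTask task, ServerLog log,
                     WitnessChannel witnessAtPreviousServer, Checkpointer cp) {
        // 1. Record the arrival in the server log and notify the witness on i-1.
        log.append("arrive at " + serverI);
        witnessAtPreviousServer.send("arrive at " + serverI);

        // 2. Perform the actual computation for the user.
        task.executeOn(serverI);

        // 3. Checkpoint the agent's internal data before leaving, so that a
        //    lost agent can later be recovered from the checkpointed data.
        cp.checkpoint(task);

        // 4. Record the departure, notify the witness, then migrate to i+1
        //    (the migration call is omitted). On receiving these messages the
        //    witness spawns a new witness on server i, as in the diagram.
        log.append("leave " + serverI);
        witnessAtPreviousServer.send("leave " + serverI);
    }
}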

Failure and Recovery Scenarios
We cover only stopping failures (Byzantine failures are assumed not to exist).
We handle most kinds of failures:
– The witness agent fails to receive the "arrive at i" message.
– The witness agent fails to receive the "leave i" message.
– Witness agent failures.

Missing arrive message
The reason may be:
1. the message is lost
2. the message arrives after the timeout period
3. the actual agent dies when it is ready to leave server i-1
4. the actual agent dies when it has just arrived at server i, without logging
5. the actual agent dies when it has just arrived at server i, with logging

Missing arrive message
The 1st and 2nd cases are simple.
(Diagram: the witness sends a probe to server i, the "arrive at i" entry is found in the server log, and the message is retransmitted.)

Missing arrive message
For the 3rd and 4th cases, recovery takes place.
(Diagram: the probe finds no "arrive at i" log entry on server i, so the witness recovers the actual agent from the checkpointed data.)

Missing arrive message
For the 5th case, it results in missing detection:
– since the log appears in the server,
– the consequence is that the "leave i" message never arrives.
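A sketch of how the witness could handle the "arrive at i" timeout, following the five cases above; the types and method names here are illustrative assumptions, not the original implementation.

// Illustrative witness-side handling when "arrive at i" does not arrive
// within the timeout. The witness probes server i's log to tell the cases apart.
public class WitnessArrivalCheck {

    enum ProbeResult { LOG_FOUND, NO_LOG }

    interface ServerProbe { ProbeResult probeLog(String server, String entry); }
    interface Recovery    { void recoverActualAgentOn(String server); } // re-create the agent from checkpointed data

    void onArriveTimeout(String serverI, ServerProbe probe, Recovery recovery) {
        ProbeResult result = probe.probeLog(serverI, "arrive at " + serverI);
        if (result == ProbeResult.LOG_FOUND) {
            // Cases 1, 2 and 5: the log exists, so the server retransmits the
            // "arrive at i" message. Case 5 (agent died after logging) is only
            // caught later, when the "leave i" message never arrives.
        } else {
            // Cases 3 and 4: the agent died before its arrival was logged;
            // recover the actual agent from the checkpointed data.
            recovery.recoverActualAgentOn(serverI);
        }
    }
}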

Missing leave message
The reason may be:
1. the message is lost
2. the message arrives after the timeout period
3. the actual agent dies when it has just sent the "arrive at i" message
4. the actual agent dies when it has just logged the "leave i" message

Missing leave message
The 3rd case is the same as the previous missing detection case.
(Diagram: the witness probes server i, finds no "leave i" log entry, recovery from the checkpointed data takes place, and the "arrive at i" message is retransmitted.)

Missing leave message
In this case, the recovery action is the same as in the previous section.
– When the failure happens, the agent should be performing its computation.
– So, when the server recovers, the agent's computation has been aborted.

Missing leave message
This results in missing detection again.
– This can be compensated by the 3rd case in the previous discussion.
– It is because the witness will never receive "arrive at i+1".

Witness Failure Scenarios
There is a chain of witness agents left on the itinerary of the agent.
– The latest witness monitors the actual agent.
– Every other witness monitors the witness that is before it.
(Diagram: witnessing dependency along servers ..., i-1, i.)

Witness Failure Scenarios
(Diagram: the witness on server i-2 recovers the failed witness on server i-1.)
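A sketch of the witness-chain monitoring suggested by the diagram, assuming each witness expects periodic alive messages (the T_alive timeout defined in the analysis below) from the witness it watches; the interfaces here are illustrative placeholders.

// Illustrative witness-chain monitoring: a witness waits for periodic "alive"
// messages from the witness it watches; if none arrives within the timeout,
// it re-creates that witness (e.g. the witness on server i-2 recovering the
// witness on server i-1, as in the diagram above).
public class WitnessChainMonitor {

    interface AliveChannel   { boolean waitForAlive(long timeoutMs); } // true if an "alive" message arrived in time
    interface WitnessFactory { void respawnWitnessOn(String server); }

    void monitor(String watchedServer, AliveChannel channel,
                 WitnessFactory factory, long tAliveMs) {
        while (!Thread.currentThread().isInterrupted()) {
            if (!channel.waitForAlive(tAliveMs)) {
                factory.respawnWitnessOn(watchedServer);
            }
        }
    }
}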

Simplification
Assume that a 2-server failure (two servers failing within a short period) would not happen.
– We can simplify our witnessing dependency.
(Diagram: simplified dependency over servers i-2, i-1, i.)

Simplification
If a failure strikes server i-1:
– the witness on server i-2 can recover the witness on server i-1.
If a failure strikes server i-2:
– we will not recover it,
– because, within a short period, no second failure would happen.

Analysis – Liveness proof
Notations
– We define several timeouts:
T_recover: the timeout of waiting for a server to be recovered.
T_arrive: the timeout of waiting for the arrive message.
T_leave: the timeout of waiting for the leave message.
T_alive: the timeout of waiting for the alive message.
– Also, we define several constants:
r_s: the maximum time for a server to be recovered when detected.
r_a: the maximum time for an actual agent to be recovered.
a: the maximum agent traveling time between 2 servers.
m: the maximum message traveling time between 2 servers.
e: the maximum execution time for an agent.

Analysis – Liveness proof
If the system is blocked forever, one of the three timeouts will reach infinity.
The outline of the proof:
– derive the lower and the upper bounds of the timeouts;
– given that the itinerary of the agent is of finite length, even with an infinite number of failures, if none of the timeouts approaches infinity, the system is blocking-free.

Analysis – Liveness proof
Level 1 FT analysis
– A failed server will eventually be recovered within a bounded time.
– In the worst case, all servers are stopped.
– We need to recover n servers.
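The slide's own equation for this bound is not preserved in the transcript. As an inferred illustration only (not the original formula), if the daemon recovers stopped servers one after another, a worst-case bound would be:

\[
  T_{\text{recover}} \;\le\; n \cdot r_s
\]

where n is the number of servers and r_s is the maximum time to recover one server once its failure is detected.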

Analysis – Liveness proof
Level 2 FT analysis
– We derive the lower bounds for the timeouts.

(Timeline diagram; the surviving annotations show the message time m, T_arrive = 0, the execution time e, and T_leave = e.)

Analysis – Liveness proof
We define the failure inter-arrival time.
If the system is not blocked forever, 2 cases need to be considered:
– Does the actual agent have enough time to migrate from one host to another?
– Does the witness agent have enough time to migrate?

Analysis – Liveness proof
Assume that all failures are happening in S_i.
– While the actual agent is migrating, there should be no failures.
– So, the required time is a + e.
(Diagram: servers S_i-1, S_i, S_i+1, with S_i error-prone.)

Analysis – Liveness proof
Again, assume that all failures are happening in S_i.
– The required time = a + min(T_arrive) + min(T_leave) = a + e.
(Diagram: servers S_i-1, S_i, S_i+1.)
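Both cases therefore impose the same requirement. Written out (this only restates the a + e figure from the slides; the symbol t_f for the failure inter-arrival time on S_i is introduced here for illustration):

\[
  t_f \;\ge\; a + \min(T_{\text{arrive}}) + \min(T_{\text{leave}}) \;=\; a + e
\]

where a is the maximum agent traveling time between two servers and e is the maximum execution time for the agent.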

Analysis – Liveness proof
Useful results (the equations on this slide are not preserved), where k is the number of failures.

Analysis – Liveness proof
By the above results, we conclude that:
– the system is blocked iff all failures are happening on one server, and
– this follows from the upper and lower bounds of T_arrive, T_leave, and T_alive.

Simplification Analysis
We define the following notation:
– the inter-arrival time of the failures throughout the system.

Link Failure
Link failure is beyond the control of the mobile agent system.
Assume that the actual agent is ready to leave server u and migrate to v. Then, a link failure happens:
– before the agent leaves u, or
– while the agent is traveling to v, or
– after the agent has reached v.
We propose solutions to remedy these problems.

Link Failure
Failure happens before the agent leaves u:
– Problem: the agent cannot proceed; it waits in server u until the link is recovered.
– Solution: travel to server v' instead of v, based on the number of migration trials (see the sketch below).
– Technical problem: knowledge of the locations of the unvisited servers.
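A sketch of this fallback behaviour, assuming the agent carries a list of unvisited servers and a bound on migration trials; the names used here (Migrator, tryMigrate, maxTrials) are illustrative, not part of the original system.

import java.util.List;

// Illustrative handling of a link failure before the agent leaves u: retry the
// migration to v a bounded number of times, then fall back to another
// unvisited server v'. This relies on knowing the locations of the unvisited
// servers, which is the technical problem noted on the slide.
public class LinkFailureMigration {

    interface Migrator { boolean tryMigrate(String destination); } // true if the hop succeeded

    String migrateWithFallback(String v, List<String> unvisitedServers,
                               Migrator migrator, int maxTrials) {
        for (int trial = 0; trial < maxTrials; trial++) {
            if (migrator.tryMigrate(v)) {
                return v;                    // the link recovered; keep the original itinerary
            }
        }
        for (String vPrime : unvisitedServers) {
            if (!vPrime.equals(v) && migrator.tryMigrate(vPrime)) {
                return vPrime;               // travel to v' instead of v
            }
        }
        return null;                         // nothing reachable (e.g. a network partition); wait in u
    }
}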

Link Failure
If network partitioning happens:
(Diagram: servers u, v, and v' separated by a network partition.)

Link Failure
Failure happens while the agent is traveling to v:
– Problem: the agent is lost. Recovery is required. However, the witness agent cannot proceed to server v.
– Solution: the witness agent cannot recover the actual agent in another server, say v'. It has to wait until the link is recovered.

Link Failure
Failure happens after the agent has arrived at v:
– Problem: the actual agent survives, but messages between u and v cannot reach their destinations, and the witness agent cannot follow the actual agent.

Link Failure
Solution:
– The actual agent keeps on advancing until either:
it is lost in one of the servers – after the link failure is recovered, the probe can eventually find such a failure; or
it has reached the destination – the probe can finally catch up.

Reliability Evaluation
The results are obtained by
– an agent system implementation using Concordia, and
– simulation using a Stochastic Petri Net.
Aim: to measure the percentage of successful round-trip travel.

Reliability Evaluation
(Results charts; the only surviving annotations are "about 60%" and "about 5%".)

Reliability Evaluation
(Results chart; the only surviving annotation is "about 800%".)

Reliability Evaluation
For agent failure detection only

Reliability Evaluation
(Results charts; the surviving annotations are "100%" and "about 60%".)

Reliability Evaluation
(Results chart; the surviving annotation is "about 70%".)

Reliability Evaluation
(Results chart; the surviving annotations are "about 140%", "# of original agents", and "# of recovered agents".)

Conclusion
Categorized the fault tolerance of mobile agent systems.
Designed a scheme for both server and agent failure detection and recovery.
Analyzed most failure scenarios in mobile agent systems.
Conducted performance evaluations, which show:
– our scheme is a promising technique, and
– a trade-off between cost and levels of reliability.

Future Work
Model and perform simulations on the Level 3 fault-tolerant mechanism.
More detailed analysis is required.
Extend stopping failures to Byzantine failures.

THE END Q & A Session