Fault Tolerance CSCI 4780/6780

RPC Semantics in Presence of Failures
5 types of exceptions:
  – Client cannot locate the server
  – Request to the server is lost
  – Server crashes after receiving a request
  – Reply message from the server is lost
  – Client crashes after sending a request

Not Locating Server
Causes:
  – Server might be down
  – Version mismatch between client and server stubs
Possible solutions:
  – Raising an exception
      – Relies on the programming language to solve a systems problem
      – Not all languages have exceptions
      – Transparency is compromised
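The "raising an exception" option can be sketched in Python as follows. This is only an illustration: `ServerNotFound`, `REGISTRY`, `bind`, and `read_remote` are hypothetical names, and the dictionary stands in for whatever directory or binder a real RPC system would consult.

```python
class ServerNotFound(Exception):
    """Raised by the (hypothetical) client stub when binding fails."""


# Hypothetical directory: service name -> interface version the server exports.
REGISTRY = {"file_server": 2}


def bind(service, client_version):
    """Locate a server, or raise ServerNotFound instead of returning a handle."""
    if service not in REGISTRY:
        raise ServerNotFound(f"{service}: server is down or not registered")
    if REGISTRY[service] != client_version:
        raise ServerNotFound(
            f"{service}: client stub v{client_version}, "
            f"server stub v{REGISTRY[service]}"
        )
    return (service, client_version)  # stand-in for a real binding handle


def read_remote(service, client_version):
    # The caller must handle a failure that a local call never produces,
    # which is exactly where transparency is lost.
    try:
        bind(service, client_version)
        return "ok"
    except ServerNotFound as err:
        return f"failed: {err}"
```

The `try`/`except` in `read_remote` is the cost named on the slide: the caller can no longer treat the remote call like a local one.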

Lost Request Messages
  – Easiest to handle
  – Use timers
  – Retransmission on timeout
  – Duplicate detection at server end
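A minimal sketch of timer-driven retransmission with duplicate detection, using a simulated network rather than real sockets. `LossyServer`, `call_with_retransmission`, and the request-id scheme are assumptions made for illustration; a `None` reply models the client's timer expiring.

```python
class LossyServer:
    """Simulated server whose incoming requests can be lost in transit."""

    def __init__(self, drop_first_n):
        self.drops_left = drop_first_n  # how many requests the "network" loses
        self.seen = {}                  # request id -> cached reply
        self.executions = 0             # how often the operation really ran

    def deliver(self, request_id, payload):
        """Return a reply, or None to model a request that never arrived."""
        if self.drops_left > 0:
            self.drops_left -= 1
            return None
        if request_id in self.seen:     # duplicate detection at the server
            return self.seen[request_id]
        self.executions += 1
        reply = payload.upper()
        self.seen[request_id] = reply
        return reply


def call_with_retransmission(server, request_id, payload, max_tries=5):
    """Resend the same request id each time the (simulated) timer expires."""
    for attempt in range(1, max_tries + 1):
        reply = server.deliver(request_id, payload)
        if reply is not None:
            return reply, attempt
        # A real client would reach this point when its timer fires.
    raise TimeoutError(f"no reply after {max_tries} tries")
```

Reusing the same request id on every retransmission is what lets the server recognise a duplicate instead of executing the request again.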

Server Crashes
  – The server can crash either before executing the request, or after executing it but before sending the reply
  – A crash after execution needs to be reported to the client
  – A crash before execution can be handled by retransmission
  – The client's OS cannot distinguish between the two cases

Server Crashes
Figure: A server in client-server communication. (a) Normal case. (b) Crash after execution. (c) Crash before execution.

Handling Server Crashes
  – Wait until the server reboots and try again
      – At-least-once semantics
  – Give up immediately and report failure
      – At-most-once semantics
  – Guarantee nothing
  – What is actually needed is exactly-once semantics

Server and Client Strategies
Server strategies:
  – Send completion message before the operation
  – Send completion message after the operation
Client strategies:
  – Never reissue a request
  – Always reissue a request
  – Only reissue a request if the acknowledgement is not received
  – Only reissue a request if the acknowledgement is received
The client never knows the exact sequence of events that preceded the crash
Server failures change RPC fundamentally

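Server Crash Scenarios
(M = send completion message, P = perform the operation, C = crash; events in parentheses never happen because of the crash)
Server strategy M -> P:
  – M -> P -> C
  – M -> C -> (P)
  – C -> (M -> P)
Server strategy P -> M:
  – P -> M -> C
  – P -> C -> (M)
  – C -> (P -> M)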

Server Crashes
Different combinations of client and server strategies in the presence of server crashes.

Client                  Server
                        Strategy M -> P           Strategy P -> M
Reissue strategy        MPC    MC(P)   C(MP)      PMC    PC(M)   C(PM)
Always                  DUP    OK      OK         DUP    DUP     OK
Never                   OK     ZERO    ZERO       OK     OK      ZERO
Only when ACKed         DUP    OK      ZERO       DUP    OK      ZERO
Only when not ACKed     OK     ZERO    OK         OK     DUP     OK

OK = operation performed once, DUP = operation performed twice, ZERO = operation not performed at all.
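The entries in the table above can be derived mechanically. The sketch below enumerates every combination; the function names and the policy labels (`"always"`, `"never"`, `"acked"`, `"not_acked"`) are made up for illustration, and it assumes that a reissued request is performed exactly once by the rebooted server.

```python
def outcome(server_order, crash_after, client_policy):
    """Return 'OK', 'DUP', or 'ZERO' for one crash scenario.

    server_order : "MP" (completion message first) or "PM" (operation first)
    crash_after  : number of the two events completed before the crash (0-2)
    client_policy: "always", "never", "acked", or "not_acked"
    """
    done = server_order[:crash_after]
    performed = "P" in done   # operation ran before the crash
    acked = "M" in done       # completion message reached the client
    reissue = {
        "always": True,
        "never": False,
        "acked": acked,
        "not_acked": not acked,
    }[client_policy]
    # Assumption: the rebooted server performs a reissued request exactly once.
    count = int(performed) + int(reissue)
    return {0: "ZERO", 1: "OK", 2: "DUP"}[count]


def table():
    """Rows keyed by client policy; columns MPC, MC(P), C(MP), PMC, PC(M), C(PM)."""
    return {
        policy: [
            outcome(order, crash_after, policy)
            for order in ("MP", "PM")
            for crash_after in (2, 1, 0)
        ]
        for policy in ("always", "never", "acked", "not_acked")
    }
```

No row of `table()` consists only of `OK`, so no pairing of client and server strategy yields exactly-once behaviour under every crash scenario.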

Lost Reply Messages
  – Timer at client
      – Client is not sure whether the reply is lost or the server is slow
  – Idempotent operations
      – Can all operations be made idempotent?
  – Sequence numbers in requests
      – Server refuses to perform a duplicate request
      – Server must maintain state for each client
  – A bit in the request to distinguish retransmissions from originals
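The sequence-number approach can be sketched as follows for a deliberately non-idempotent operation. `DedupServer` and `deposit` are hypothetical names, and the sketch assumes each client has at most one outstanding request at a time, so remembering only the latest sequence number per client is enough.

```python
class DedupServer:
    """Server that refuses to re-execute a request it has already handled."""

    def __init__(self):
        self.balance = 0
        self.last_seq = {}    # client id -> last sequence number handled
        self.last_reply = {}  # client id -> reply sent for that request

    def deposit(self, client_id, seq, amount):
        """Non-idempotent operation: running it twice would change the result."""
        if self.last_seq.get(client_id) == seq:
            # Retransmission of a request whose reply was lost:
            # resend the cached reply without touching the balance.
            return self.last_reply[client_id]
        self.balance += amount
        self.last_seq[client_id] = seq
        self.last_reply[client_id] = self.balance
        return self.balance
```

The two dictionaries are the per-client state the slide refers to; they are the price of making a non-idempotent operation safe to retransmit.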

Client Crashes
  – Can lead to orphans
      – Waste of resources
      – Confusion when the client reboots
  – Extermination with logging
      – Maintain logs of RPC calls
      – Explicit termination of orphans
      – Logging is expensive
      – Grand-orphans

Client Crashes
  – Reincarnation with epochs
      – Time is divided into epochs
      – Broadcast a new epoch on client reboot
      – Orphans are killed when a server receives a new epoch announcement
  – Gentler reincarnation
      – Kill only computations whose owners cannot be located
  – Expiration
      – Time window for completion, with explicit extension
      – Client waits out the time window before rebooting
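Reincarnation with epochs might be sketched like this. `EpochServer` and its methods are hypothetical, and the sketch assumes the epoch number is tracked per client and only ever increases; a real system could use a single global epoch instead.

```python
class EpochServer:
    """Tracks running computations by (client, epoch) and kills stale ones."""

    def __init__(self):
        self.current_epoch = {}  # client id -> latest epoch announced
        self.running = []        # list of (client id, epoch, job name)

    def start(self, client_id, epoch, job):
        """Accept a request only if it carries the client's latest epoch."""
        if epoch < self.current_epoch.get(client_id, 0):
            return False         # request from a previous incarnation
        self.current_epoch[client_id] = epoch
        self.running.append((client_id, epoch, job))
        return True

    def on_epoch_broadcast(self, client_id, new_epoch):
        """Handle a rebooted client's announcement: kill its older jobs."""
        self.current_epoch[client_id] = new_epoch
        killed = [
            job for (cid, ep, job) in self.running
            if cid == client_id and ep < new_epoch
        ]
        self.running = [
            (cid, ep, job) for (cid, ep, job) in self.running
            if not (cid == client_id and ep < new_epoch)
        ]
        return killed
```

Killing every job from an older epoch matches the first bullet group; the gentler variant would first try to locate each job's owner and kill only those whose owner cannot be found.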