Distributed Systems - Comp 655

1 Distributed Systems - Comp 655
Fault Tolerance
- Fault tolerance concepts
- Implementation – distributed agreement
- Distributed agreement meets transaction processing: 2- and 3-phase commit
Bonus material
- Implementation – reliable point-to-point communication
- Implementation – process groups
- Implementation – reliable multicast
- Recovery
- Sparing
5/14/2019 Distributed Systems - Comp 655

2 Fault tolerance concepts
- Availability – can I use it now? Usually quantified as a percentage
- Reliability – can I use it for a certain period of time? Usually quantified as MTBF (mean time between failures)
- Safety – will anything really bad happen if it does fail?
- Maintainability – how hard is it to fix when it fails? Usually quantified as MTTR (mean time to repair)

3 Comparing nines
1 year = 8760 hr
Availability levels
- 90% = 876 hr downtime/yr
- 99% = 87.6 hr downtime/yr
- 99.9% = 8.76 hr downtime/yr
- 99.99% = 52.56 min downtime/yr
- 99.999% = 5.256 min downtime/yr
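The downtime figures follow directly from the 8760-hour year. A minimal sketch of the arithmetic; the function name `downtime_hours_per_year` is an illustrative choice, not something from the slides:

```python
HOURS_PER_YEAR = 8760  # 365 days * 24 hr, the figure used on the slide


def downtime_hours_per_year(availability_percent: float) -> float:
    """Expected downtime per year, in hours, for a given availability level."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for level in (90, 99, 99.9, 99.99, 99.999):
        hours = downtime_hours_per_year(level)
        print(f"{level}% -> {hours:.4f} hr/yr = {hours * 60:.3f} min/yr")
```

Five nines works out to a little over 5 minutes per year, which is where the "25 min in five years" target in the next exercise comes from.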

4 Exercise: how to get five nines
Brainstorm what you would have to deal with to build a single-machine system that could run for five years with 25 min downtime.
Consider:
- Hardware failures, especially disks
- Power failures
- Network outages
- Software installation
- What else?
Come up with some ideas about how to solve the problems you identify.

5 Multiple machines at 99%
Assuming independent failures

6 Multiple machines at 95%
Assuming independent failures

7 Multiple machines at 80%
Assuming independent failures

8 1,000 components
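A minimal sketch of the arithmetic behind slides like these, under the stated assumption of independent failures. Which combination the original charts plotted is not recoverable from the text, so both are shown: the chance that every machine is up (a system that needs all of its components) and the chance that at least one is up (a fully redundant group). The function names are illustrative.

```python
def all_up(availability: float, n: int) -> float:
    """Probability that all n independent components are up at once."""
    return availability ** n


def at_least_one_up(availability: float, n: int) -> float:
    """Probability that at least one of n independent replicas is up."""
    return 1 - (1 - availability) ** n


if __name__ == "__main__":
    for a in (0.99, 0.95, 0.80):
        for n in (1, 2, 3, 5):
            print(f"a={a:.2f} n={n}: all up {all_up(a, n):.5f}, "
                  f"at least one up {at_least_one_up(a, n):.5f}")
    # 1,000 components that must all work, each at four nines:
    print(f"1000 components at 99.99%: {all_up(0.9999, 1000):.4f}")
```

Two replicas at 99% already give four nines if only one needs to be up, while a system that depends on 1,000 components at 99.99% each is up only about 90% of the time.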

9 Things to watch out for in availability requirements
What constitutes an outage?
- A client PC going down?
- A client applet going into an infinite loop?
- A server crashing?
- A network outage?
- Reports unavailable?
- If a transaction times out?
- If 100 transactions time out in a 10 min period?
- etc.

10 More to watch out for
- What constitutes being back up after an outage?
- When does an outage start?
- When does it end?
- Are there outages that don’t count?
  - Natural disasters?
  - Outages due to operator errors?
- What about MTBF?

11 Ways to get 99% availability
- MTBF = 99 hr, MTTR = 1 hr
- MTBF = 99 min, MTTR = 1 min
- MTBF = 99 sec, MTTR = 1 sec
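All three combinations give the same availability because only the ratio matters. A minimal sketch using the standard steady-state formula, availability = MTBF / (MTBF + MTTR); the function name is illustrative:

```python
def availability(mtbf: float, mttr: float) -> float:
    """Steady-state availability; mtbf and mttr must use the same time unit."""
    return mtbf / (mtbf + mttr)


if __name__ == "__main__":
    for unit in ("hr", "min", "sec"):
        print(f"MTBF = 99 {unit}, MTTR = 1 {unit}: {availability(99, 1):.2%}")
```

The number alone hides very different user experiences: one hour-long outage every four days versus a one-second glitch every 100 seconds.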

12 More definitions
- A fault causes an error; an error may cause a failure
- Types of faults: transient, intermittent, permanent
- Fault tolerance is continuing to work correctly in the presence of faults.

13 Types of failures

14 If you remember one thing
Components fail in distributed systems on a regular basis.
Distributed systems have to be designed to deal with the failure of individual components so that the system as a whole
- is available and/or
- is reliable and/or
- is safe and/or
- is maintainable
depending on the problem it is trying to solve and the resources available …

15 Fault Tolerance
- Fault tolerance concepts
- Implementation – distributed agreement
- Distributed agreement meets transaction processing: 2- and 3-phase commit

16 Two-army problem
- Red army has 5,000 troops
- Blue army and White army have 3,000 troops each
- Attack together and win
- Attack separately and lose in serial
- Communication is by messenger, who might be captured
- Blue and White generals have no way to know when a messenger is captured

17 Activity: outsmart the generals
- Take your best shot at designing a protocol that can solve the two-army problem
- Spend ten minutes
- Did you think of anything promising?

18 Conclusion: go home
“agreement between even two processes is not possible in the face of unreliable communication” - p 372

19 Byzantine generals
- Assume perfect communication
- Assume n generals, m of whom should not be trusted
- The problem is to reach agreement on troop strength among the non-faulty generals

20 Byzantine generals - example
n = 4, m = 1 (units are K-troops)
- Multicast troop-strength messages
- Construct troop-strength vectors
- Compare notes: majority rules in each component
Result: 1, 2, and 4 agree on (1,2,unknown,4)

21 Doesn’t work with n=3, m=1
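The n = 4 and n = 3 cases can be replayed in a few lines. This is a minimal sketch of the multicast / build vectors / compare notes procedure, assuming general 3 is the traitor and that it tells a different made-up number to every listener. The `lie` function and all names are hypothetical, chosen only to make the run deterministic.

```python
from collections import Counter

UNKNOWN = None  # stands for "unknown" in the result vector


def lie(sender, receiver, slot):
    """Hypothetical traitor behaviour: a different bogus value per listener."""
    return 100 * receiver + slot


def agree(strengths, faulty):
    """Return the vector each loyal general ends up with."""
    generals = sorted(strengths)
    # Steps 1-2: every loyal general records what each general told it.
    heard = {
        g: {s: lie(s, g, s) if s in faulty else strengths[s] for s in generals}
        for g in generals if g not in faulty
    }
    # Steps 3-4: exchange vectors, then take the majority in each component.
    result = {}
    for g in heard:
        vector = []
        for slot in generals:
            reports = [
                lie(s, g, slot) if s in faulty else heard[s][slot]
                for s in generals if s != g
            ]
            value, count = Counter(reports).most_common(1)[0]
            vector.append(value if count > len(reports) / 2 else UNKNOWN)
        result[g] = tuple(vector)
    return result


if __name__ == "__main__":
    print(agree({1: 1, 2: 2, 3: 3, 4: 4}, faulty={3}))
    print(agree({1: 1, 2: 2, 3: 3}, faulty={3}))
```

With four generals, 1, 2, and 4 each end up with (1, 2, unknown, 4). With three, each loyal general has only two reports per component, one of them from the traitor, so no value reaches a majority and every component stays unknown.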

22 Fault Tolerance
- Fault tolerance concepts
- Implementation – distributed agreement
- Distributed agreement meets transaction processing: 2- and 3-phase commit

23 Distributed commit protocols
What is the problem they are trying to solve?
- Ensure that a group of processes all do something, or none of them do
- Example: in a distributed transaction that involves updates to data on three different servers, ensure that all three commit or none of them do

24 2-phase commit
What to do when P, in READY state, contacts Q
Figure: Coordinator, Participant

25 If coordinator crashes
- Participants could wait until the coordinator recovers
- Or, they could try to figure out what to do among themselves
- Example: if P contacts Q, and Q is in the COMMIT state, P should COMMIT as well

26 2-phase commit
What to do when P, in READY state, contacts Q
If all surviving participants are in READY state:
- Wait for coordinator to recover
- Elect a new coordinator (?)
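A minimal sketch of how a participant P stuck in READY might decide after asking its peers. The COMMIT case is the one stated on the earlier slide; the ABORT and INIT rules are the usual textbook ones and are an assumption here (a peer still in INIT has not voted yet, so the coordinator cannot have decided to commit). All names are illustrative.

```python
from enum import Enum


class State(Enum):
    INIT = "INIT"
    READY = "READY"
    COMMIT = "COMMIT"
    ABORT = "ABORT"


def action_given_peer(q_state):
    """What P (in READY) can conclude from one peer Q's state, or None."""
    if q_state is State.COMMIT:
        return State.COMMIT          # the decision was commit
    if q_state in (State.ABORT, State.INIT):
        return State.ABORT           # commit cannot have been decided
    return None                      # Q is also READY: no information


def resolve_without_coordinator(peer_states):
    """Try each surviving peer in turn; None means P stays blocked."""
    for q_state in peer_states:
        decision = action_given_peer(q_state)
        if decision is not None:
            return decision
    return None


if __name__ == "__main__":
    print(resolve_without_coordinator([State.READY, State.COMMIT]))
    print(resolve_without_coordinator([State.READY, State.READY]))
```

The second call returns `None`: when every surviving participant is in READY, nobody knows the outcome, which is the blocking case 3-phase commit is meant to avoid.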

27 3-phase commit
- Problem addressed: non-blocking distributed commit in the presence of failures
- Interesting theoretically, but rarely used in practice

28 3-phase commit
Figure: Coordinator, Participant

29 Bonus material
- Implementation – reliable point-to-point communication
- Implementation – process groups
- Implementation – reliable multicast
- Recovery
- Sparing

30 RPC, RMI crash & omission failures
- Client can’t locate server
- Request lost
- Server crashes after receipt of request
- Response lost
- Client crashes after sending request

31 Can’t locate server
- Raise an exception, or
- Send a signal, or
- Log an error and return an error code
Note: hard to mask distribution in this case

32 Request lost
- Timeout and retry
- Back off to “cannot locate server” if too many timeouts occur
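A minimal sketch of timeout-and-retry with the fallback to "cannot locate server". The callable `send_request`, the exception `CannotLocateServer`, and the default of three attempts are all hypothetical; a real RPC layer would supply its own timeout mechanism.

```python
class CannotLocateServer(Exception):
    """Raised when the client gives up after too many timeouts."""


def call_with_retry(send_request, max_attempts=3):
    """Call send_request(); retry on TimeoutError up to max_attempts times."""
    for _ in range(max_attempts):
        try:
            return send_request()
        except TimeoutError:
            continue  # request (or reply) presumed lost: try again
    raise CannotLocateServer(f"no reply after {max_attempts} attempts")


if __name__ == "__main__":
    attempts = []

    def flaky():
        # Stand-in for a real call: times out twice, then succeeds.
        attempts.append(1)
        if len(attempts) < 3:
            raise TimeoutError
        return "ok"

    print(call_with_retry(flaky))
```

Note that the client cannot tell a lost request from a lost reply, so blind retries are only safe if the operation can tolerate being executed more than once.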

33 Server crashes after receipt of request
Possible semantic commitments
- Exactly once
- At least once
- At most once
Normal; Work done; Work not done

34 Behavioral possibilities
Server events
- Process (P)
- Send completion message (M)
- Crash (C)
Server order
- P then M
- M then P
Client strategies
- Retry every message
- Retry no messages
- Retry if unacknowledged
- Retry if acknowledged

35 Combining the options
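The combinations can be enumerated mechanically. This is a minimal sketch, assuming the server crashes exactly once (after two, one, or none of its two events), that the client learns of the crash, and that a retried request is processed exactly once after recovery. The outcome labels OK, DUP, and ZERO (processed once, twice, or never) are names chosen here for illustration.

```python
CLIENT_STRATEGIES = {
    "retry every message": lambda acked: True,
    "retry no messages": lambda acked: False,
    "retry if acknowledged": lambda acked: acked,
    "retry if unacknowledged": lambda acked: not acked,
}


def crash_scenarios(order):
    """Yield (label, processed, acked) for a crash after 2, 1, or 0 events."""
    for done in (2, 1, 0):
        completed = order[:done]
        pending = f"({order[done:]})" if done < 2 else ""
        yield completed + "C" + pending, "P" in completed, "M" in completed


def build_table():
    """Map (server order, scenario, client strategy) to OK / DUP / ZERO."""
    table = {}
    for order in ("MP", "PM"):  # M then P, P then M
        for label, processed, acked in crash_scenarios(order):
            for name, should_retry in CLIENT_STRATEGIES.items():
                executions = int(processed) + int(should_retry(acked))
                table[(order, label, name)] = ("ZERO", "OK", "DUP")[executions]
    return table


if __name__ == "__main__":
    for key, result in build_table().items():
        print(key, "->", result)
```

Under these assumptions no client strategy yields OK in every scenario, which is the point of the exercise: neither side can guarantee exactly-once execution on its own.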

36 Lost replies
- Make server operations idempotent whenever possible
- Structure requests so that server can distinguish retries from the original
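One common way to let the server tell retries from originals is to have the client attach a request ID and have the server remember the reply it sent for each ID. A minimal sketch under that assumption; the class, the deposit operation, and the ID scheme are hypothetical, and the reply cache is kept in memory with no expiry:

```python
class DedupingServer:
    """Toy account server that executes each (client, request) pair once."""

    def __init__(self):
        self.balance = 0
        self._replies = {}  # (client_id, request_id) -> reply already sent

    def deposit(self, client_id, request_id, amount):
        key = (client_id, request_id)
        if key in self._replies:
            # A retry of a request we already executed: resend the old reply.
            return self._replies[key]
        self.balance += amount  # not idempotent by itself
        self._replies[key] = self.balance
        return self.balance


if __name__ == "__main__":
    server = DedupingServer()
    print(server.deposit("c1", 1, 50))  # original request
    print(server.deposit("c1", 1, 50))  # retry after a lost reply
    print(server.deposit("c1", 2, 50))  # a genuinely new request
```

The retry returns the same reply as the original and leaves the balance unchanged, so a non-idempotent operation behaves as if it were idempotent from the client's point of view.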

37 Client crashes
- The server-side activity is called an orphan computation
- Orphans can tie up resources, hold locks, etc.
- Four strategies (at least)
- Extermination, based on client-side logs
  - Client writes a log record before and after each call
  - When client restarts after a crash, it checks the log and kills outstanding orphan computations
  - Problems include:
    - Lots of disk activity
    - Grand-orphans

38 Client crashes, continued
More approaches for handling orphans
- Re-incarnation, based on client-defined epochs
  - When client restarts after a crash, it broadcasts a start-of-epoch message
  - On receipt of a start-of-epoch message, each server kills any computation for that client
- “Gentle” re-incarnation
  - Similar, but server tries to verify that a computation is really an orphan before killing it
Of course, all of these have some trouble with network partitions

39 Yet more client-crash strategies
One more strategy
- Expiration
  - Each computation has a lease on life
  - If not complete when the lease expires, a computation must obtain another lease from its owner
  - Clients wait one lease period before restarting after a crash (so any orphans will be gone)
  - Problem: what’s a reasonable lease period?

40 Common problems with client-crash strategies
- Crashes that involve network partition (communication between partitions will not work at all)
- Killed orphans may leave persistent traces behind, for example
  - Locks
  - Requests in message queues

41 Bonus material
- Implementation – reliable point-to-point communication
- Implementation – process groups
- Implementation – reliable multicast
- Recovery
- Sparing

42 How to do it?
Redundancy applied
- In the appropriate places
- In the appropriate ways
Types of redundancy
- Data (e.g. error correcting codes, replicated data)
- Time (e.g. retry)
- Physical (e.g. replicated hardware, backup systems)

43 Triple Modular Redundancy
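The core of triple modular redundancy is a voter that passes on whatever value at least two of three replicated modules agree on. A minimal software sketch of such a voter (the name `vote` and the choice to raise on total disagreement are illustrative):

```python
from collections import Counter


def vote(a, b, c):
    """Return the value produced by at least two of the three modules."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count >= 2:
        return value
    raise ValueError("no two modules agree")


if __name__ == "__main__":
    print(vote(7, 7, 7))   # all modules healthy
    print(vote(7, 99, 7))  # one faulty module is outvoted
```

One faulty module is masked; two faulty modules that happen to agree with each other would win the vote, which is why TMR only tolerates a single fault per stage.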

44 Tandem Computers
- TMR on
  - CPUs
  - Memory
- Duplicated
  - Buses
  - Disks
  - Power supplies
- A big hit in operations systems for a while

45 Replicated processing
Based on process groups
A process group consists of one or more identical processes
Key events
- Message sent to one member of a group
- Process joins group
- Process leaves group
- Process crashes
Key requirements
- Messages must be received by all members
- All members must agree on group membership

46 Flat or non-flat?

47 Effective process groups require
Distributed agreement
- On group membership
- On coordinator elections
- On whether or not to commit a transaction
Effective communication
- Reliable enough
- Scalable enough
- Often, multicast
- Typically looking for atomic multicast

48 Process groups also require
Ability to tolerate crash failures and omission failures
- Need k+1 processes to deal with up to k silent failures
Ability to tolerate performance, response, and arbitrary failures
- Need 3k+1 processes to reach agreement with up to k Byzantine failures
- Need 2k+1 processes to ensure that a majority of the system produces the correct results with up to k Byzantine failures
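The three sizing rules can be collected in one small helper. A minimal sketch; the function name and the failure-model labels are illustrative:

```python
def min_group_size(k: int, failure_model: str) -> int:
    """Smallest process group that tolerates up to k failures of this kind."""
    if k < 0:
        raise ValueError("k must be non-negative")
    if failure_model == "silent":               # crash / omission failures
        return k + 1
    if failure_model == "byzantine_majority":   # correct results by majority
        return 2 * k + 1
    if failure_model == "byzantine_agreement":  # agreement among the group
        return 3 * k + 1
    raise ValueError(f"unknown failure model: {failure_model}")


if __name__ == "__main__":
    for model in ("silent", "byzantine_majority", "byzantine_agreement"):
        print(model, [min_group_size(k, model) for k in range(4)])
```

For k = 1 the agreement rule gives 4, which matches the Byzantine generals example earlier: it works with n = 4, m = 1 and fails with n = 3, m = 1.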

49 Bonus material
- Implementation – reliable point-to-point communication
- Implementation – process groups
- Implementation – reliable multicast
- Recovery
- Sparing

50 Reliable multicasting

51 Scalability problem
- Too many acknowledgements
  - One from each receiver
  - Can be a huge number in some systems
- Also known as “feedback implosion”

52 Basic feedback suppression in scalable reliable multicast
- If a receiver decides it has missed a message, it waits a random time, then multicasts a retransmission request
- While waiting, if it sees a sufficient request from another receiver, it does not send its own request
- Server multicasts all retransmissions
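A minimal sketch of why random waits suppress most feedback. It assumes every receiver that missed the message has already drawn its random timer, and that a multicast request takes a fixed `latency` to reach everyone; both the model and the names are simplifying assumptions, not the protocol's actual timer rules.

```python
def requests_sent(timers, latency):
    """Return the receivers that actually multicast a retransmission request.

    timers  -- receiver name -> random wait drawn after detecting the loss
    latency -- time for a multicast request to reach the other receivers
    """
    first = min(timers.values())
    # A receiver stays quiet only if the earliest request reaches it
    # before its own timer expires.
    return sorted(r for r, t in timers.items()
                  if t == first or t < first + latency)


if __name__ == "__main__":
    timers = {"r1": 0.30, "r2": 0.10, "r3": 0.12, "r4": 0.50}
    print(requests_sent(timers, latency=0.05))
```

Here r2 fires first, r3 fires before r2's request can reach it, and r1 and r4 are suppressed. Spreading the timers over a wider interval reduces duplicate requests at the cost of slower recovery.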

53 Hierarchical feedback suppression for scalable reliable multicast
- Messages flow from root toward leaves
- Acks and retransmit requests flow toward root from coordinators
- Each group can use any reliable small-group multicast scheme
The authors point out that scalable reliable multicast is an area that still needs significant research

54 Atomic multicast
- Often, in a distributed system, reliable multicast is a step toward atomic multicast
- Atomic multicast is atomicity applied to communications:
  - Either all members of a process group receive a message, OR
  - No members receive it
- Often requires some form of order agreement as well

55 How atomic multicast helps
Assume we have atomic multicast, among a group of processes, each of which owns a replica of a database
- One replica goes down
- Database activity continues
- The process comes back up
- Atomic multicast allows us to figure out exactly which transactions have to be re-played (see pp )

56 More concepts
- Group view
- View change
- Virtually synchronous
  - Each message is received by all non-faulty processes, or
  - If sender crashes during multicast, message could be ignored by all processes

57 Virtual synchrony picture
Basic idea: in virtual synchrony, a multicast cannot cross a view-change

58 Receipt vs Delivery
Remember totally-ordered multicast …

59 What about multicast message order?
Two aspects:
- Relationship between sending order and delivery order
- Agreement on delivery order
Send/delivery ordering relationships
- Unordered
- FIFO-ordered
- Causally-ordered
If receivers agree on delivery order, it’s called totally-ordered multicast

60 Unordered
Process P1   Process P2    Process P3
sends m1     delivers m1   delivers m2
sends m2     delivers m2   delivers m1

61 FIFO-ordered
Process P1   Process P2    Process P3    Process P4
sends m1     delivers m1   delivers m3   sends m3
sends m2     delivers m3   delivers m1   sends m4
             delivers m2   delivers m2
             delivers m4   delivers m4
Agreement on: m1 before m2, m3 before m4
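FIFO ordering only constrains messages from the same sender, so a delivery sequence can be checked one sender at a time. A minimal sketch of such a check, run against the example above; the function name is illustrative, and it assumes every delivered message appears in `sent`:

```python
def is_fifo_ordered(sent, delivered):
    """True if each sender's messages are delivered in the order they were sent.

    sent      -- sender name -> list of its messages in sending order
    delivered -- one receiver's delivery sequence
    """
    sender_of = {m: s for s, msgs in sent.items() for m in msgs}
    next_index = {s: 0 for s in sent}
    for message in delivered:
        sender = sender_of[message]
        if sent[sender][next_index[sender]] != message:
            return False  # a later message from this sender jumped ahead
        next_index[sender] += 1
    return True


if __name__ == "__main__":
    sent = {"P1": ["m1", "m2"], "P4": ["m3", "m4"]}
    print(is_fifo_ordered(sent, ["m1", "m3", "m2", "m4"]))  # P2's view
    print(is_fifo_ordered(sent, ["m3", "m1", "m2", "m4"]))  # P3's view
    print(is_fifo_ordered(sent, ["m2", "m1", "m3", "m4"]))  # violates FIFO
```

P2 and P3 both pass even though they disagree about whether m1 or m3 comes first; agreeing on that as well is what total ordering adds.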

62 Six types of virtually synchronous reliable multicast
- Relationship between sending order and delivery order
- Agreement on delivery order

63 Implementing virtual synchrony
- Don’t deliver a message until it’s been received everywhere, but “everywhere” can change
- 7’s crash is detected by 4, which sends a view-change message
- Processes forward unstable messages, followed by flush
- When a process has a flush from all processes in the new view, it installs the new view

64 Bonus material
- Implementation – reliable point-to-point communication
- Implementation – process groups
- Implementation – reliable multicast
- Recovery
- Sparing

65 Recovery from error
Two main types:
- Backward recovery to a checkpoint (assumed to be error-free)
- Forward recovery (infer a correct state from available data)

66 More about checkpoints
- They are expensive
- Usually combined with a message log
- Message logs are cleared at checkpoints
Recovering a crashed process:
- Restart it
- Restore its state to the most recent checkpoint
- Replay the message log
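The restart / restore / replay sequence can be sketched with a toy process whose state is a single balance. This is a minimal illustration, not a real recovery mechanism: the checkpoint and the log live in memory here, whereas a real system would keep both on stable storage, and the class and method names are hypothetical.

```python
import copy


class CheckpointedProcess:
    """Toy process with a checkpoint plus a log of messages since then."""

    def __init__(self):
        self.state = {"balance": 0}
        self._checkpoint = copy.deepcopy(self.state)
        self._log = []

    def handle(self, amount):
        self._log.append(amount)        # log the message, then apply it
        self.state["balance"] += amount

    def checkpoint(self):
        self._checkpoint = copy.deepcopy(self.state)
        self._log.clear()               # the log is cleared at a checkpoint

    def recover(self):
        self.state = copy.deepcopy(self._checkpoint)  # restore
        for amount in self._log:                       # replay
            self.state["balance"] += amount


if __name__ == "__main__":
    p = CheckpointedProcess()
    p.handle(10)
    p.checkpoint()
    p.handle(5)
    p.handle(-3)
    p.state = {"balance": None}  # simulate losing the in-memory state
    p.recover()
    print(p.state)
```

Replay only reproduces the lost state if message handling is deterministic, which is one reason checkpoints of communicating processes have to be coordinated.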

67 Recovery line == most recent distributed snapshot

68 Domino effect

69 Bonus material
- Implementation – reliable point-to-point communication
- Implementation – process groups
- Implementation – reliable multicast
- Recovery
- Sparing

70 Sparing
- Not really fault tolerance
- But it can be cheaper, and provide fast restoration time after a failure
- Types of spares
  - Cold
  - Hot
  - Warm
- The spare may or may not also have regular responsibilities in the system

71 Switchover
Repair is accomplished by switching processing away from a failed server to a spare

72 Questions on switchover
- Has the failed system really failed?
- Is the spare operational?
- Can the spare handle the load?
  - May need a way to block medium to low priority work during switchovers
- How will the spare get access to the failed server’s data?
- What client session data will be preserved, and how?

73 More switchover questions
- What about configuration files?
- What about network addressing?
- What about switching back after the failed server has been repaired?
  - Partial shutdown of the spare
  - Updating directories to redirect part of the load
  - Making up for lost medium-to-low priority work

