Abstractions for Fault-Tolerant Distributed Computing Idit Keidar MIT LCS.


The Big Question Q: How can we make it easier to build good* distributed systems? *good = efficient; fault-tolerant; correct; flexible; extensible; … A: We need good abstractions, implemented as generic services (building blocks)

In This Talk Abstraction: Group Communication Application: VoD Algorithm: Moshe Implementation, Performance Other work, new directions

Abstraction: Group Communication (GC) –Send(Grp, Msg) –Receive(Msg) –Join/Leave(Grp) –View(Members, Id)
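The four GC primitives above can be sketched as a toy, single-process interface. This is an illustrative model only (all names are ours, not from any real toolkit): a real GC system such as Transis or Spread runs across processes and detects failures itself.

```python
class Member:
    """Application endpoint receiving the two GC upcalls."""
    def __init__(self):
        self.msgs, self.views = [], []

    def on_receive(self, msg):
        self.msgs.append(msg)

    def on_view(self, members, view_id):
        self.views.append((members, view_id))


class LocalGC:
    """Toy, in-process model of the GC interface: Join/Leave(Grp),
    Send(Grp, Msg), with Receive and View upcalls to members."""
    def __init__(self):
        self.groups = {}   # group name -> {member name: endpoint}
        self.view_id = 0

    def join(self, grp, name, endpoint):
        self.groups.setdefault(grp, {})[name] = endpoint
        self._install_view(grp)

    def leave(self, grp, name):
        self.groups[grp].pop(name, None)
        self._install_view(grp)

    def send(self, grp, msg):
        # Multicast: deliver to every current member of the group.
        for endpoint in self.groups.get(grp, {}).values():
            endpoint.on_receive(msg)

    def _install_view(self, grp):
        # Membership changed: install a new view (members + id) at all members.
        self.view_id += 1
        members = sorted(self.groups[grp])
        for endpoint in self.groups[grp].values():
            endpoint.on_view(members, self.view_id)
```

For example, after two members join a group, a single `send` reaches both, and each leave/join is reported to the survivors as a new view.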

Example: Highly Available Video-On-Demand (VoD) [ Anker, Dolev, Keidar ICDCS 99] True VoD: clients make online requests Dynamic set of loosely-coupled servers –Fault-tolerance, dynamic load balancing –Clients talk to an "abstract" VoD service

Abstraction: Group Addressing (Dynamic) [Figure: per-movie groups (Chocolat, Gladiator, Spy Kids), a service group for control and "Movies?" queries, and a session group; labels: start, update]

Abstraction: Virtual Synchrony Connected group members receive the same sequence of events – messages and views Abstraction: state-machine replication –VoD servers in a movie group share info about clients using messages and views –Make load-balancing decisions based on the local copy Upon a start message When a view reports a server failure Joining servers get a state transfer
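Virtual synchrony underpins state-machine replication: because connected members see the same sequence of messages and views, deterministic replicas that apply that sequence stay identical. A toy illustration (event names and structure are ours, not the VoD system's):

```python
class Replica:
    """Deterministic state machine: feeding every replica the same
    sequence of messages and views keeps their states identical."""
    def __init__(self):
        self.clients = {}   # client -> server handling its session

    def apply(self, event):
        if event[0] == "start":            # message: client session assigned
            _, client, server = event
            self.clients[client] = server
        elif event[0] == "view":           # view: reports who is alive;
            _, alive = event               # drop sessions of failed servers
            self.clients = {c: s for c, s in self.clients.items()
                            if s in alive}


# Virtual synchrony delivers the same event sequence to every member
events = [("start", "c1", "s1"), ("start", "c2", "s2"), ("view", {"s1"})]
r1, r2 = Replica(), Replica()
for e in events:
    r1.apply(e)
    r2.apply(e)
```

Each replica can now make load-balancing decisions purely from its local copy, since every other replica's copy is the same.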

The VoD server was implemented in ~2500 lines of C++, including all fault-tolerance logic, using a GC library and commodity hardware

General Lessons Learned GC saves a lot of work –Especially for replication with dynamic groups and “local” consistency –E.g., VoD servers, shared white-board,... Good performance but… only on LAN –Next generation will be on WANs (geoplexes)

WAN: the Challenge Message latency is large and unpredictable Frequent message loss ⇒ Time-out failure detection is inaccurate ⇒ The number of inter-LAN messages matters ⇒ Algorithms may change views frequently, and view changes require communication, e.g., state transfer, costly in a WAN

New Architecture: "Divide and Conquer" [ Anker, Chockler, Dolev, Keidar DIMACS 98 ] [Figure: GC = multicast & membership; virtual-synchrony multicast [ Keidar, Khazan ] layered alongside the Moshe membership service [ Keidar et al. ], both over a Notification Service (NS)]

New Architecture Benefits Fewer inter-LAN messages Fewer remote time-outs Membership out of the way of regular multicast Two semantics: –Notification Service – "who is around" –Group membership – views for virtual synchrony

Moshe: A Group Membership Algorithm for WAN [Keidar, Sussman, Marzullo, Dolev ICDCS 00 ] Designed for WANs from the ground up Avoids delivery of "obsolete" views –Views that are known to be changing –Not always terminating –Avoids excessive load during unstable periods Runs in 1 round, optimistically –All previous algorithms ran in 2

New Membership Spec Conditional liveness: if the situation eventually stops changing and NS_Set is eventually accurate, then all NS_Set members have the same last view Composable –Can prove application liveness Termination not required – no obsolete views Temporary disagreement allowed – optimism

Feedback Cycle: Breaking Conceptions [Figure: cycle among application, abstraction specs, algorithms, and implementation; insights fed back: "no obsolete views", "optimism allowed"]

The Model Asynchronous – no bound on latency A local NS module at every server –Failure detection, join/leave propagation –Output: NS_Set Reliable communication –A message is received or a failure is detected –E.g., TCP

Algorithm – Take 1 Upon NS_Set, send a prop to the other servers with the NS_Set and current view id Store incoming props When props for the NS_Set have been received from all servers, deliver a new view: –Members – the NS_Set –Id higher than all previous
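Take 1's rules at a single server can be sketched as follows. This is a simplified model (class and field names are ours): NS_Sets are frozen sets of member names, props are plain tuples placed in an outbox, and the real Moshe protocol carries more state.

```python
class MosheTake1:
    """Sketch of Take 1: on an NS_Set, propose it; deliver a view once
    props for that NS_Set have been received from all its members."""

    def __init__(self, me, deliver):
        self.me = me
        self.deliver = deliver   # callback: install (members, view id)
        self.view_id = 0         # id of our current view
        self.ns_set = None       # NS_Set we are currently proposing for
        self.props = {}          # sender -> (proposed NS_Set, its view id)
        self.outbox = []         # props to send to the other servers

    def on_ns_set(self, ns_set):
        # Upon NS_Set: send a prop with the NS_Set and current view id.
        self.ns_set = frozenset(ns_set)
        self.outbox.append(("prop", self.me, self.ns_set, self.view_id))
        self._try_deliver()

    def on_prop(self, sender, ns_set, view_id):
        # Store incoming props.
        self.props[sender] = (frozenset(ns_set), view_id)
        self._try_deliver()

    def _try_deliver(self):
        if self.ns_set is None:
            return
        got = {s for s, (ns, _) in self.props.items() if ns == self.ns_set}
        if got >= self.ns_set:   # props for our NS_Set from all members
            # New view id must be higher than all previously seen ids.
            self.view_id = 1 + max(v for _, v in self.props.values())
            self.deliver(sorted(self.ns_set), self.view_id)
            self.ns_set = None
```

In the optimistic case every server gets the same last NS_Set, all props cross, and all deliver the same last view in one round.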

Optimistic Case Once all servers get same last NS_Set: –All send props for this NS_Set –All props reach all servers –All servers use props to deliver same last view

Out-of-Sync Case: Unexpected Proposal [Figure: servers A, B, C; +C and -C NS_Sets and props cross in flight, so A receives a prop for an NS_Set it has not seen] To avoid deadlock, A must respond – but how?

Algorithm – Take 2 Upon an unexpected prop for an NS_Set, join in: –Send a prop to the other servers with that NS_Set and the current view id
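Take 2's join-in rule, in isolation (a sketch with hypothetical names; the "unexpected" test here is simply that the prop's NS_Set differs from the one this server last proposed for):

```python
class Take2Server:
    """Sketch of Take 2's rule: upon an unexpected prop, join in by
    sending a prop of our own, so other servers are not left waiting."""

    def __init__(self, me, send):
        self.me, self.send = me, send
        self.ns_set = None       # NS_Set this server last proposed for
        self.view_id = 0
        self.props = {}          # sender -> proposed NS_Set

    def on_prop(self, sender, ns_set, view_id):
        ns_set = frozenset(ns_set)
        if ns_set != self.ns_set:
            # Unexpected proposal: adopt the NS_Set and echo a prop for it
            # with our current view id, to avoid deadlocking the sender.
            self.ns_set = ns_set
            self.send(("prop", self.me, ns_set, self.view_id))
        self.props[sender] = ns_set
```

One echoed prop per unexpected NS_Set suffices; further props for the same NS_Set are simply stored.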

Does this Work? [Figure: servers A, B, C; interleaved +C and -C NS_Sets make the servers chase each other's proposals without ever delivering a view] Live-lock!

Q: Can all deadlocks be detected by extra proposals? …Turns out, no [Figure: verification feeds back into the abstraction specs and algorithms] Solution: add deadlock detection, with no extra messages

Algorithm – Take 3 [Figure: state machine with a quiescent state, the optimistic algorithm (entered upon NS_Set, delivering once all opt props arrive), and the conservative algorithm (entered upon deadlock detection or an unexpected prop, delivering once all C props arrive)] Props have increasing numbers

The Conservative Algorithm Upon deadlock detection –Send a C prop for the latest NS_Set with number = max(last_received, last_sent + 1) –Update last_sent Upon receipt of a C prop for the NS_Set with a number higher than last_sent –Send a C prop with this number; update last_sent Upon receipt of C props for the NS_Set with the same number from all, deliver a new view
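The numbering rules above can be sketched as a standalone fragment (field names are ours; real Moshe couples this with the deadlock detector and the optimistic algorithm):

```python
class Conservative:
    """Sketch of the conservative algorithm's proposal numbering:
    numbers only grow, so servers converge on one maximum number."""

    def __init__(self, send):
        self.send = send         # callback: transmit a C prop
        self.last_sent = 0
        self.last_received = 0

    def on_deadlock(self, ns_set):
        # Upon deadlock detection: send a C prop numbered
        # max(last_received, last_sent + 1), and update last_sent.
        n = max(self.last_received, self.last_sent + 1)
        self.last_sent = n
        self.send(("c_prop", ns_set, n))

    def on_c_prop(self, ns_set, n):
        self.last_received = max(self.last_received, n)
        if n > self.last_sent:
            # Match the higher number, so all servers can agree on it.
            self.last_sent = n
            self.send(("c_prop", ns_set, n))
```

Because a server never decreases its number and matches any higher one it receives, once all servers are in the conservative algorithm they all end up proposing with the same maximum number, and the view can be delivered.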

Rationale for Termination All deadlock cases are detected (see paper) The conservative algorithm is invoked upon detection Once all servers are in the conservative algorithm (without exiting), the number does not increase –Exit only upon NS_Set Servers match the highest number received Eventually, all send props with the max number

How Typical is the “typical” Case? Depends on the notification service (NS) –Classify NS good behaviors: symmetric and transitive perception of failures Typical case should be very common Need to measure

Implementation Use CONGRESS [ Anker et al. ] –Overlay Network and NS for WAN –Always symmetric, can be non-transitive –Logical topology can be configured

The Experiment Run over the Internet –US: MIT, Cornell (CU), UCSD –Taiwan: NTU –Israel: HUJI 10 clients at each location (50 total) –Continuously join/leave 10 groups Run 10 days in one configuration, 2.5 days in another

Two Experiment Configurations

Percentage of "Typical" Cases Configuration 1: –10,786 views, 10,661 in one round (98.8%) Configuration 2: –2,559 views, 2,555 in one round (99.8%) An overwhelming majority in one round! Depends on topology; good for sparse overlays

Performance [Figure: histogram of Moshe duration (number of cases vs. milliseconds), MIT, configuration 1; 97% of runs take up to 4 seconds]

Performance: Configuration II [Figure: histogram of Moshe duration (number of cases vs. milliseconds), MIT, configuration 2; 99.7% of runs take up to 3 seconds]

Performance over the Internet: What's Going On? Without message loss, running time is close to the biggest round-trip time, ~650 ms –As expected Message loss has a big impact Configuration 2 has much less loss –More cases of good performance

Observation: Triangle Inequality does not Hold over the Internet Concurrently observed by Detour, RON projects

Conclusion: Moshe Features Scalable divide and conquer architecture –Less WAN communication Avoids obsolete views –Less load at unstable times Usually one round (optimism) Uses NS for WAN –Good abstraction –Flexibility to configure multiple ways

The Bigger Picture [Figure: work spanning applications, abstraction specs, algorithms, and implementation] Optimistic [Sussman, Keidar, Marzullo 00] Survey [Chockler, Keidar, Vitenberg 01] VS algorithm, formal study [Keidar, Khazan 00] Moshe [Keidar, Sussman, Marzullo, Dolev 00] Replication [Keidar, Dolev 96] CSCW [Anker, Chockler, Dolev, Keidar 97] VoD [Anker, Dolev, Keidar 99]

Other Abstractions Atomic Commit [Keidar, Dolev 98] Atomic Broadcast [Keidar, Dolev 96] Consensus [Keidar, Rajsbaum 01] Dynamic Voting [Yeger-Lotem, Keidar, Dolev 97], [Ingols, Keidar 01] Failure Detectors [Dolev, Friedman, Keidar, Malkhi 97], [Chockler, Keidar, Vitenberg 01]

New Directions Performance study: practice → theory → practice –Measure different parameters on WANs, etc. –Find good models & metrics for performance study –Find the best solutions –Adapt to varying situations Other abstractions –User-centric: improve ease-of-use –Framework for policy adaptation, e.g., for collaborative computing –Real-time, mobile, etc.

Bon Appétit

Group Communication in the Real World Isis is used in the NY Stock Exchange, the Swiss stock exchange, French air traffic control, a Navy radar system... An enabling technology for –Fault-tolerant cluster computing, e.g., IBM, Windows 2000 Cluster –SANs, e.g., IBM, North Folk Networks –The Navy battleship DD-21 OMG standard for fault-tolerant CORBA Emerging: Sun Jini group-RMI Freeware, e.g., mod_log_spread for Apache Research projects at Nortel, BBN, the military,... *LAN only; WANs should come next