Scalable Group Communication for the Internet Idit Keidar MIT Lab for Computer Science Theory of Distributed Systems Group
Modern Distributed Applications (in WANs) Highly available servers –Web –Video-on-Demand Collaborative computing –Shared white-board, shared editor, etc. –Military command and control –On-line strategy games Stock market
Important Issues in Building Distributed Applications Consistency of view –Same picture of game, same shared file Fault tolerance, high availability Performance –Conflicts with consistency? Scalability –Topology - WAN, long unpredictable delays –Number of participants
Generic Primitives - Middleware, “Building Blocks” E.g., total order, group communication Abstract away difficulties, e.g., –Total order - a basis for replication –Mask failures Important issues: –Well specified semantics - complete –Performance
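To illustrate why total order is a basis for replication, here is a minimal sketch: if every replica is a deterministic state machine and applies the same operations in the same order, all replicas end in the same state. The `Sequencer` class is a hypothetical stand-in for a total-order multicast service and is not taken from any of the systems named in this talk.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """A deterministic state machine: here, a tiny key-value store."""
    state: dict = field(default_factory=dict)

    def apply(self, op):
        key, value = op
        self.state[key] = value


class Sequencer:
    """Hypothetical total-order service: one global order for all operations."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.log = []  # the agreed order of operations

    def multicast(self, op):
        self.log.append(op)
        for replica in self.replicas:
            replica.apply(op)  # every replica sees ops in log order


replicas = [Replica() for _ in range(3)]
total_order = Sequencer(replicas)
total_order.multicast(("x", 1))
total_order.multicast(("x", 2))
total_order.multicast(("y", 7))
```

A real service would also have to mask failures of the sequencer itself, which this sketch does not attempt.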
Research Approach Rigorous modeling, specification, proofs, performance analysis Implementation and performance tuning Services Applications Specific examples General observations
Group Communication [Figure: Send(G) addressed to group G] Group abstraction - a group of processes is one logical entity Dynamic Groups (join, leave, crash) Systems: Ensemble, Horus, ISIS, Newtop, Psync, Sphynx, Relacs, RMP, Totem, Transis
Example: Highly Available VoD [Anker, Dolev, Keidar, ICDCS 1999] Dynamic set of servers Clients talk to “abstract” service Server can crash, client shouldn’t know
VoD Service: Exploiting Group Communication Group abstraction for connection establishment and transparent migration (with simple clients) Membership services detect conditions for migration - fault tolerance and load balancing Reliable group multicast among servers for consistently sharing information Reliable messages for control Server: ~2500 C++ lines –All fault tolerance logic at server
A Scalable Architecture for Group Membership in WANs [Anker, Chockler, Dolev, Keidar] Dedicated distributed membership servers “divide and conquer” –Servers involved only in membership changes –Members communicate with each other directly (implement “virtual synchrony”) Two levels of membership –Notification Service NSView - “who is around” –Agreed membership views
Architecture NSView: "who is around" (failure/join/leave) Agreed view: member set and identifier [Figure labels: Notification Service (NS); Membership; {A,B,C,D,E},7]
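The two levels of membership can be sketched as two small data types. This is an illustrative sketch only: the class names, the `ns_event` helper, and the staleness check are assumptions made for the example, not the interface of the actual system. The sample view {A,B,C,D,E} with identifier 7 is the one shown on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NSView:
    """Notification-service level: who is currently believed to be around."""
    members: frozenset


@dataclass(frozen=True)
class AgreedView:
    """Agreed membership level: a member set plus a view identifier."""
    members: frozenset
    view_id: int


def ns_event(view, joined=(), left=()):
    """Hypothetical helper: apply failure/join/leave events to an NSView."""
    return NSView((view.members | frozenset(joined)) - frozenset(left))


ns = NSView(frozenset("ABCDE"))
agreed = AgreedView(ns.members, 7)

# E fails or leaves: the NSView changes at once, while the agreed view
# stays at identifier 7 until the membership algorithm installs a new one.
ns_after = ns_event(ns, left=["E"])
needs_new_view = agreed.members != ns_after.members
```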
Moshe: A Group Membership Algorithm for WANs Idit Keidar, Jeremy Sussman Keith Marzullo, Danny Dolev ICDCS 2000
Membership in WAN: the Challenge Message latency is large and unpredictable Frequent message loss → Time-out failure detection is inaccurate → We use a notification service (NS) for WANs → Number of communication rounds matters → Algorithms may change views frequently → View changes require communication for state transfer, which is costly in WAN
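A small sketch of why time-out failure detection is inaccurate when latency is unpredictable: a live process whose heartbeat is merely delayed gets suspected. The arrival times below are made-up illustrative numbers, not measurements from the experiments, and `suspicions` is a hypothetical helper.

```python
def suspicions(arrival_times, timeout):
    """Return arrival times of heartbeats that showed up only after the
    sender had already been suspected, i.e. false suspicions of a live
    process."""
    false_alarms = []
    last = arrival_times[0]
    for t in arrival_times[1:]:
        if t - last > timeout:
            # The gap exceeded the timeout, yet the heartbeat did arrive:
            # the sender was alive and was suspected wrongly.
            false_alarms.append(t)
        last = t
    return false_alarms


# Heartbeats sent once per second; one is held up by a delay spike.
arrivals = [0.1, 1.1, 2.2, 5.9, 6.0, 7.1]

short_timeout = suspicions(arrivals, timeout=2.0)  # one false suspicion
long_timeout = suspicions(arrivals, timeout=4.0)   # none, but slower to detect real crashes
```

The trade-off is the usual one: a longer timeout gives fewer false suspicions but detects real failures more slowly.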
Moshe Designed for WANs from the ground up –Previous systems emerged from LAN Avoids delivery of “obsolete” views –Views that are known to be changing –Not always terminating (but NS is) Runs in a single round (“typically”)
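The single-round idea can be sketched as follows. This is a heavily simplified illustration and not the published Moshe algorithm: the `Proposal` type, the `try_deliver` function, and the rule for choosing the new identifier are all assumptions made for the example. Each member sends one proposal after an NS notification; a view is delivered only if proposals from every member of the current NSView have arrived and all name the same member set. Otherwise nothing is delivered, so a view known to be changing is never installed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    sender: str
    members: frozenset   # the member set this sender proposes
    last_view_id: int    # identifier of the sender's previous view


def try_deliver(ns_members, proposals):
    """Return (members, view_id) if one round suffices, else None."""
    if not ns_members:
        return None
    by_sender = {p.sender: p for p in proposals}
    if set(by_sender) != set(ns_members):
        return None  # still waiting for some member's proposal
    if any(p.members != ns_members for p in by_sender.values()):
        return None  # proposals disagree: the view would be obsolete
    new_id = max(p.last_view_id for p in by_sender.values()) + 1
    return ns_members, new_id


members = frozenset("ABCD")
agreeing = [Proposal(s, members, i) for s, i in zip("ABCD", (7, 7, 6, 7))]
delivered = try_deliver(members, agreeing)

# D still believes E is around, so its proposal names a different set.
stale = agreeing[:3] + [Proposal("D", frozenset("ABCDE"), 7)]
not_delivered = try_deliver(members, stale)
```

The sketch covers only the common, agreeing case; what happens when proposals disagree is left out.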
Experimenting with Moshe Run over the Internet –In the US: MIT, Cornell (CU), UCSD –In Taiwan: NTU –In Israel: HUJI Run for 10 days (~10,000 times) in one configuration, 2.5 days in another 10 clients at each location –continuously join/leave 10 groups
Moshe Features Avoiding obsolete views A single round –98% of the time in one configuration –99.8% of the time in another Using a notification service for WANs –Good abstraction –Flexibility to configure multiple ways –Future work: configure more ways Scalable “divide and conquer” architecture
Retrospective: Role of Theory Specification –Possible to implement –Useful for applications (composable) Specification can be met in one round “typically” (unlike Consensus) Correctness proof exposes subtleties –Need to avoid live-lock –Two types of detection mechanisms needed
Future Work: The QoS Challenge Some distributed applications require QoS –Guaranteed available bandwidth –Bounded delay, bounded jitter Membership algorithm terminates in one round under certain circumstances –Can we leverage that to guarantee QoS under certain assumptions? Can other primitives guarantee QoS?