Published by Aubrie Houtchens. Modified over 10 years ago.
1
Reliable Multicast for Time-Critical Systems Mahesh Balakrishnan Ken Birman Cornell University
2
Mission-Critical Datacenters COTS Datacenters Online e-tailers, search engines, corporate applications Web-services Mission-Critical Apps Need: Scalability, Availability, Fault-Tolerance … Timeliness!
3
The Time-Critical Datacenter Migrating time-critical applications to commodity datacenters… … conversely, providing datacenter web services with time-critical performance.
4
What's a Time-Critical System? Not real time, but real fast! Financial calculators, military command and control… air traffic control (ATC) … foobooks.com! Technology Gap: Real-Time focuses on determinism, scale-up architectures
5
The French ATC System Mid to Late 90s Teams of 3-5 air traffic controllers on a cluster of desktop consoles 50-200 of these console clusters in an air traffic control center Why study the French ATC?
6
ATC Subsystems Radar Image Weather Alert Track Updates Updates to Flight Plans Console to Console State Updates System Management and Monitoring ATC center to center Updates Multicast ubiquitous…
7
Two Kinds of Multicast Virtually Synchronous Multicast: very reliable, not particularly fast Unreliable Multicast: very fast, not particularly reliable Nothing in between!
8
Two Kinds of Subsystems Category 1: Complete reliability (virtual synchrony), e.g. routing decisions Category 2: Careful application design + natural hardware properties + management policies, e.g. radar
9
Multicast in the French ATC Engineering Lessons: Structure application to tolerate partial failures Exploit natural hardware properties Can we generalize to modern systems? Research Direction: Time-Critical Reliability Can we design communication primitives that encapsulate these lessons?
10
Anatomy of a Cloned Service
11
Services An Amazon web-page is constructed by 100s of co-operating services* Multicast is used for: Updating Cloned Services Publish-Subscribe / Eventing Datacenter Management/Monitoring * Werner Vogels, CTO of amazon.com, at SOSP 2005
12
Multicast in the Datacenter A node is in many multicast groups: One for each service it hosts One for each topic it subscribes to One or more administration groups Large Numbers of Overlapping Groups!
13
Service Semantics Data Store Services: stale data can result in overselling / underselling and the loss of real-world dollars Cache Services: updated periodically by back-end data stores
14
The Challenge Datacenter Blades are failure-prone: Crash failures Byzantine behavior Bursty Packet Loss: End-host kernels drop packets when subjected to traffic spikes.
15
A New Reliability Model Rapid delivery is more important than perfect reliability Probabilistic Timeliness Graceful Degradation
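One way to read "probabilistic timeliness" is as a statement about the distribution of delivery latencies rather than a hard bound. The sketch below is only an illustration of that reading: the function name `fraction_within` and the sample latencies are made up for the example and are not measurements from the talk.

```python
import math

def fraction_within(latencies_ms, deadline_ms):
    """Return the fraction of packets delivered within deadline_ms.

    A packet that was never delivered is represented by math.inf,
    so it counts against every finite deadline.
    """
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    on_time = sum(1 for x in latencies_ms if x <= deadline_ms)
    return on_time / len(latencies_ms)

# Hypothetical samples (milliseconds); math.inf marks a packet given up on.
samples = [5, 8, 12, 40, 65, 70, 90, math.inf]

for deadline in (10, 70, 100):
    print(deadline, fraction_within(samples, deadline))
```

Graceful degradation then means this curve shifts gradually as loss grows, instead of collapsing once a fixed bound is exceeded.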
16
Wanted: a multicast primitive that 1. Scales to large numbers of arbitrarily overlapping multicast groups 2. Delivers multicasts quickly 3. Tolerates datacenter failure modes – bursty packet loss, node failures 4. Offers probabilistic properties 5. Gives up on lost data after a threshold period
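Requirement 5 can be pictured as a small bookkeeping structure on the receiver. This is a minimal sketch under assumptions: the class `LossTracker`, its method names, and the 100 ms threshold in the usage lines are hypothetical and do not come from the talk or from the Ricochet code.

```python
class LossTracker:
    """Track missing sequence numbers and abandon them after a threshold."""

    def __init__(self, give_up_after: float):
        self.give_up_after = give_up_after  # seconds (assumed unit)
        self.pending = {}                   # seq -> time the loss was first noticed

    def note_loss(self, seq: int, now: float) -> None:
        # Keep the earliest detection time if the loss is reported twice.
        self.pending.setdefault(seq, now)

    def note_recovered(self, seq: int) -> None:
        self.pending.pop(seq, None)

    def expire(self, now: float) -> list[int]:
        """Drop and return every loss older than the threshold."""
        expired = sorted(s for s, t in self.pending.items()
                         if now - t >= self.give_up_after)
        for s in expired:
            del self.pending[s]
        return expired

tracker = LossTracker(give_up_after=0.1)
tracker.note_loss(7, now=0.0)
tracker.note_loss(8, now=0.05)
tracker.note_recovered(8)           # packet 8 was repaired in time
print(tracker.expire(now=0.2))      # packet 7 is abandoned
```

Passing `now` explicitly keeps the sketch deterministic; a real receiver would read a monotonic clock instead.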
17
Ricochet: Lateral Error Correction Receivers exchange error correction XORs of multicast traffic Works very well with multiple groups – scales up to a thousand groups per node Probabilistic Timeliness: probability distribution of delivery latencies
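The XOR idea on this slide can be sketched in a few lines. This is a toy illustration of XOR-based repair between two receivers, not Ricochet's actual protocol or wire format: the function names, the packet IDs, and the assumption of equal-length payloads are all introduced here for the example.

```python
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two byte strings, zero-padding the shorter one."""
    n = max(len(a), len(b))
    a, b = a.ljust(n, b"\0"), b.ljust(n, b"\0")
    return bytes(x ^ y for x, y in zip(a, b))

def make_repair(packets: dict, ids: list) -> tuple:
    """Build a repair packet: the covered IDs plus the XOR of their payloads."""
    return ids, reduce(xor_bytes, (packets[i] for i in ids))

def recover(received: dict, repair: tuple):
    """Recover the single missing packet covered by the repair, if any.

    Returns (id, payload) when exactly one covered packet is missing,
    otherwise None (nothing to fix, or too many losses for one XOR).
    """
    ids, payload = repair
    missing = [i for i in ids if i not in received]
    if len(missing) != 1:
        return None
    for i in ids:
        if i in received:
            payload = xor_bytes(payload, received[i])
    return missing[0], payload

# Receiver A got everything; receiver B lost packet 2 (equal-length payloads assumed).
receiver_a = {1: b"alpha", 2: b"bravo", 3: b"delta"}
receiver_b = {1: b"alpha", 3: b"delta"}

repair = make_repair(receiver_a, [1, 2, 3])
print(recover(receiver_b, repair))
```

A single XOR repairs at most one loss among the packets it covers, which is why recovery here is probabilistic rather than guaranteed.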
18
Predictive Total Ordering (Plato) Delivers messages to applications with no ordering delay in most cases Orders messages only if there is a high probability of out-of-order delivery across different nodes Probabilistic Timeliness: probability distribution of ordered delivery latency
19
Performance SRM (Scalable Reliable Multicast) takes seconds to recover lost packets Ricochet recovers almost all packets within ~70 milliseconds
20
Conclusion Move from R/T to T/C yields huge benefits! Ricochet is faster… slashes latency… scalable… Clean delivery delay curve is a powerful design tool that replaces traditional hard (but conservative) limits We're open for business: Software and detailed paper available for download Give it a try… tell us what you think! www.cs.cornell.edu/projects/quicksilver/ricochet.html