Published by Bryce Reynolds. Modified over 9 years ago.
1
Slingshot: Time-Critical Multicast for Clustered Applications
  Mahesh Balakrishnan, Stefan Pleisch, Ken Birman
  Cornell University
2
The Contemporary Datacenter
  Building-wide super-clusters: 1000s of commodity blade servers
  Typically used as commercial website back ends: Amazon, etc.
  Software paradigms: SOA, eventing, publish/subscribe…
  … many-to-many communication: multicast!
3
Multicast in the Datacenter
  IP Multicast is available; adding reliability to it is a well-researched problem…
  Scalability dimensions
    Number of receivers
    Number of senders?
    Number of groups?
  Metrics
    Throughput
    Timeliness?
4
Time-Critical Applications
  … dealing in perishable data: stock quotes, location updates
  … willing to trade complete reliability for timeliness
  … requiring tunable reliability/timeliness/overhead tradeoffs
  Probabilistic guarantee of timeliness?
    For x% overhead, y% of lost packets are recovered in time t.
    The remainder can optionally be recovered in time t’.
5
Design Space
  Reactive vs. proactive
  Reactive: loss discovery
    ACK
    Sender-based sequencing: if the multicast rate in a group is constant, the inter-multicast time at any sender goes up linearly with the number of senders, so a gap in one sender's sequence numbers takes correspondingly longer to show up
    Gossip: scalable
  Proactive: FEC, tunable
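The sequencing point can be checked with a little arithmetic. The sketch below assumes the group rate is split evenly across senders; the function name and the sample values (1000 packets per second, up to 64 senders, borrowed from the evaluation setup later in the deck) are illustrative only.

```python
def inter_multicast_time_ms(group_rate_pps: float, num_senders: int) -> float:
    """Mean gap between two multicasts from the same sender, in milliseconds.

    Assumes the group rate is shared evenly by all senders.
    """
    if group_rate_pps <= 0 or num_senders <= 0:
        raise ValueError("rate and sender count must be positive")
    per_sender_rate_pps = group_rate_pps / num_senders
    return 1000.0 / per_sender_rate_pps


# With a fixed group rate, the per-sender gap grows linearly with the sender count.
for n in (1, 8, 64):
    print(n, "senders:", inter_multicast_time_ms(1000, n), "ms between packets")
```

A receiver that detects loss only from a gap in one sender's sequence numbers has to wait roughly this long for the next packet from that sender.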
6
Slingshot Overview
  Receiver-based FEC: senders send initially via unreliable IP Multicast
  Phase 1: receivers repair losses by proactively sending each other FEC repair packets
    Each receiver sends an error-correction packet, the XOR of the last r data packets it received, to c randomly selected receivers
    Rate-of-fire parameter (r, c): allows tuning of the overhead-timeliness tradeoff
  Phase 2: remaining losses are recovered from the sender
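As a rough way to see how (r, c) drives overhead: a receiver emits c repair packets for every r data packets it receives, so the repair-to-data packet ratio is c/r. The sketch below is a back-of-the-envelope model under that reading, not the authors' overhead metric; it counts packets rather than bytes and ignores Phase 2 traffic. The function name is hypothetical.

```python
def repair_packets_per_data_packet(r: int, c: int) -> float:
    """Repair packets a receiver sends per data packet it receives.

    Simplified packet-count model: one XOR repair is built per r data
    packets and sent to c receivers.
    """
    if r <= 0 or c < 0:
        raise ValueError("r must be positive and c non-negative")
    return c / r


# Larger r or smaller c lowers overhead; the deck trades this against timeliness.
for r, c in [(4, 1), (8, 2), (8, 4)]:
    print(f"(r={r}, c={c}) -> {repair_packets_per_data_packet(r, c):.2f}")
```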
7
Protocol Details 0
  Two packet types:
    Data packet: Packet ID (Sender, SeqNo) | Application payload
    Repair packet: XOR of data packets | List of data packet IDs: (sender1, seqno1), (sender2, seqno2), ...
  Application MTU: 1024, less than the network MTU
  Terminology: the data packets XORed into a repair packet are "included" in it
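The two packet layouts might be modeled as follows. This is a minimal sketch: the class and helper names are hypothetical, the MTU is assumed to be in bytes (the slide gives no unit), and zero-padding shorter payloads before XORing is an assumption the slide does not address.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

PacketId = Tuple[str, int]   # (sender, seqno)
APP_MTU = 1024               # assumed to be bytes; the slide does not state a unit


@dataclass(frozen=True)
class DataPacket:
    sender: str
    seqno: int
    payload: bytes

    @property
    def packet_id(self) -> PacketId:
        return (self.sender, self.seqno)


@dataclass(frozen=True)
class RepairPacket:
    xor: bytes                       # XOR of the included payloads
    included: Tuple[PacketId, ...]   # IDs of the included data packets


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two byte strings, zero-padding the shorter one (an assumption)."""
    n = max(len(a), len(b))
    return bytes(x ^ y for x, y in zip(a.ljust(n, b"\0"), b.ljust(n, b"\0")))


def make_repair(packets: Sequence[DataPacket]) -> RepairPacket:
    """Build one repair packet that includes every packet in `packets`."""
    acc = b""
    for p in packets:
        if len(p.payload) > APP_MTU:
            raise ValueError("payload exceeds application MTU")
        acc = xor_bytes(acc, p.payload)
    return RepairPacket(acc, tuple(p.packet_id for p in packets))
```

Keeping the application MTU below the network MTU leaves room in a repair packet for the list of included IDs.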
8
Protocol Details 1
  Data structures:
    Data buffer: received data packets
    Repair bin: pointers to the last < r data packets
  Arrival of data packet dp at a receiver:
    dp is added to the data buffer
    &dp is added to the repair bin
    If the repair bin size equals r, a repair packet rp is created from its contents and the repair bin is cleared
    rp is dispatched to c random receivers
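The data-arrival steps can be sketched as a small class. Everything here is illustrative: the class name, the `send` callback, and the representation of a repair packet as an `(xor, ids)` tuple are assumptions, and payloads are assumed to have equal length so that a plain byte-wise XOR suffices.

```python
import random
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    # Assumes equal-length payloads.
    return bytes(x ^ y for x, y in zip(a, b))


class Receiver:
    def __init__(self, peers, r, c, send):
        self.peers = list(peers)   # other receivers in the group
        self.r = r                 # data packets per repair packet
        self.c = c                 # receivers each repair packet is sent to
        self.send = send           # hypothetical callback: send(peer, repair)
        self.data_buffer = {}      # packet id -> payload
        self.repair_bin = []       # ids of the last < r data packets

    def on_data(self, packet_id, payload):
        self.data_buffer[packet_id] = payload
        self.repair_bin.append(packet_id)
        if len(self.repair_bin) == self.r:
            ids = tuple(self.repair_bin)
            xor = reduce(xor_bytes, (self.data_buffer[i] for i in ids))
            self.repair_bin.clear()
            # Dispatch the repair packet to c distinct, randomly chosen peers.
            for peer in random.sample(self.peers, min(self.c, len(self.peers))):
                self.send(peer, (xor, ids))
```

Storing only packet IDs in the repair bin mirrors the slide's "pointers": the payloads live once, in the data buffer.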
9
Protocol Details 2
  Arrival of repair packet rp at a receiver: if #(missing included data packets) is
    0: rp is discarded
    1: the missing packet is recovered by XORing rp with the other r-1 data packets
    >1: rp is stored in a special buffer, in case future data packet arrivals and recoveries make it usable
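The three cases for an arriving repair packet might look like this in code. It is a sketch under assumptions: the class and method names are hypothetical, a repair packet is passed as an XOR byte string plus a tuple of included IDs, payloads have equal length, and stored repairs are retried after every arrival or recovery.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    # Assumes equal-length payloads.
    return bytes(x ^ y for x, y in zip(a, b))


class RepairHandler:
    def __init__(self):
        self.data_buffer = {}   # packet id -> payload
        self.pending = []       # repairs with more than one missing packet

    def on_data(self, packet_id, payload):
        self.data_buffer[packet_id] = payload
        self._retry_pending()

    def on_repair(self, xor, ids):
        """Return the id of a recovered packet, or None."""
        missing = [i for i in ids if i not in self.data_buffer]
        if not missing:
            return None                       # 0 missing: discard
        if len(missing) > 1:
            self.pending.append((xor, ids))   # >1 missing: keep for later
            return None
        payload = xor                         # 1 missing: XOR out the rest
        for i in ids:
            if i != missing[0]:
                payload = xor_bytes(payload, self.data_buffer[i])
        self.data_buffer[missing[0]] = payload
        self._retry_pending()
        return missing[0]

    def _retry_pending(self):
        # New data may have made stored repairs usable (or redundant).
        pending, self.pending = self.pending, []
        for xor, ids in pending:
            self.on_repair(xor, ids)
```

Each successful recovery adds a packet to the data buffer, so the retry loop always terminates.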
10
Evaluation Setup
  64-node rack-style cluster at Cornell
  Loss rate fixed at 1%: packets dropped at end buffers
  All nodes send and receive
  Inter-node latencies: 50-100 microseconds
  Group data rate: 1000 packets per second
  Each of the 64 nodes multicasts about 15.6 packets per second (1000/64), i.e. one packet every 64 milliseconds
11
Slingshot Tunability
  For 27% overhead, 93.5% of lost packets are recovered in an average of 3.5 milliseconds
  [Figure: example tradeoff points between overhead, timeliness, and reliability. Overhead and recovered packets are plotted on the left y-axis, recovery time on the right.]
12
Slingshot vs. SRM
  Slingshot recovers 93% of lost packets in 10 ms, 97% in 25 ms
  SRM: fastest packet recovery is 2.2 seconds; 93% in 4.85 seconds, 97% in 5.1 seconds
  Slingshot is 2-3 orders of magnitude faster
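The "orders of magnitude" claim follows directly from the figures on this slide. The short check below uses only those four numbers; the variable names are illustrative.

```python
import math

# (Slingshot recovery time, SRM recovery time) in seconds, from the slide.
recovery_times_s = {
    "93% recovered": (10e-3, 4.85),
    "97% recovered": (25e-3, 5.1),
}

orders = {}
for label, (slingshot_s, srm_s) in recovery_times_s.items():
    speedup = srm_s / slingshot_s
    orders[label] = math.log10(speedup)
    print(f"{label}: {speedup:.0f}x faster, {orders[label]:.2f} orders of magnitude")
```

Both ratios fall between 100x and 1000x, which is where the 2-3 orders of magnitude comes from.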
13
Slingshot Scalability: Group Size
  Gossip-style scalability: insensitive to scale beyond a certain size
  Simulation results: [figure]
14
Conclusion
  Slingshot provides a tunable, probabilistic guarantee of timeliness
  Outperforms SRM by 2 orders of magnitude in a 64-node system
  Insensitive to the number of senders
  Future work:
    Achieve scalability in other dimensions (number of groups)
    Build a time-critical middleware layer that uses Slingshot as a generic primitive