CS514: Intermediate Course in Operating Systems Professor Ken Birman Ben Atkin: TA Lecture 20 Nov. 2
Scalable Reliable Multicast Became a hot topic with the introduction of Berkeley’s SRM Idea is to provide a primitive that –Has best-effort reliability (tries to repair problems… doesn’t promise) –Is receiver driven Sender doesn’t know who is in the group Receiver must solicit repairs –Has localized costs
SRM goal compared to virtual synchrony Virtual synchrony provides a strong guarantee –Either all processes that stay operational receive the message –Or none does But this seems to limit scalability Virtual synchrony can be fast but apparently is limited to groups of about 50 processes or fewer
Stock Exchange Problem: Vsync. multicast is too “fragile” Most members are healthy…. … but one is slow
Figure 1: Multicast throughput in an 8-member group perturbed by transient failures Ideal Actual
The problem gets worse as the system scales up Virtually synchronous Ensemble multicast protocols perturb rate average throughput on nonperturbed members group size: 32 group size: 64 group size: 96
SRM Unlike virtual synchrony the membership of the group will be loosely defined –Do it in the style of IP multicast –Receivers join at will Receiver has a way to request missing data
Application Level Framing Really, a separate idea But was bundled with SRM: –Application is assumed to be doing video multicast streaming –Perhaps it has its own way to recover lost data So instead of retransmitting precisely the lost data… … we solicit retransmission but let the application decide what to send in response
Comparison Virtual synchrony: –Strong membership –Each multicast has its own identity –Strong guarantees of ordering, synchronization, atomicity SRM: –Weak membership –Multicast has a sequence number but retransmission determined separately –Weak guarantees
Flow control Virtual synchrony does flow control –Sender has some amount of space –If it runs low, new multicasts are stopped SRM lacks any form of flow control Sender just runs at whatever rate it may select, oblivious to network In fact, SRM doesn’t even use acknowledgements from receiver to sender!
How SRM works Assumes that network has an IP multicast routing tree within it So, IP multicast is reasonably cheap SRM is built using broadcasts within this tree –One exception: so called “scalable session messages” –We’ll return to this later
Basic algorithm Application generates and numbers a packet Sends it using IP multicast Receiver receives it, puts it in order, delivers it … cost is one packet per link in the IP multicast tree, one send and one receive per process
Basic algorithm: application perspective S R R R R R R S: Sender R: Receiver : Network Router
Basic algorithm: network perspective S R R R R R R S: Sender R: Receiver : Network Router
Things to notice IP multicast lives “in the network” SRM protocol lives in the application There may be many more network nodes involves than application nodes Similar observation applies to any protocol unless implemented by the router itself
SRM error recovery Suppose a packet is lost… This will either happen –In the O/S near the receiver Affects just one process –Or in the network or the O/S of the sender Affects a subtree In SRM, subtree might be very large
Basic algorithm: network perspective S R R R R R R S: Sender R: Receiver : Network Router
Receiver notices Either it receives the next packet in the stream –In this case it notices a gap Or it receives a “session message” –Again, this causes it to see a gap –Sent periodically
A wave of discovery Receivers have some distance from the sender Can measure it in milliseconds Close receivers notice the problem first.
Basic algorithm: network perspective S R R R R R R S: Sender R: Receiver : Network Router
NAK Basic algorithm: network perspective S R R R R R R S: Sender R: Receiver : Network Router
NAK Basic algorithm: network perspective S R R R R R R S: Sender R: Receiver : Network Router
NAK spreads like a wave Other receivers see the NAK –It inhibits them from sending their retransmission request Some upstream sender sees it too –It generates a repair
NAK or Repair Flood But why won’t all receivers at a given distance NAK simultaneously? Or all senders retransmit? –In fact this would be a problem as protocol was described
SRM borrows an idea from the Ethernet Introduce a few milliseconds of randomness Instead of sending the NAK after timeout … pick a delay and send after a randomly selected delay in time interval [ … + ] With any luck at all, first to NAK inhibits the others Similarly for retransmissions
Why will this work? Ideally, packet loss is low –Hence zero or one losses in the entire tree –But many impacted receivers With this scheme –Ideally, one NAK inhibits others –One retransmission, inhibits others
Scalable Session Messages In a large tree may see losses in different subtrees Want to limit cost of –NAK –Repair Do this using the regional TTL setting for IP multicast Limits the propagation of the various messages, except for initial multicast of the message itself
Experience? SRM works well in networks with low loss rates But if packet loss becomes higher –SRM overheads begin to soar –May see many solicitations, repairs on each network link –Eventually, this overloads the network Example: with 100 receivers in a large network we easily provoked scenarios where each packet triggers 5 or 10 NAKs and a similar number of repairs
Why is this bad? All processes see these NAK and repair messages Hence costs are global
A quadratic cost! Risk of a “multi-loss” rises with network size, roughly linearly And cost of a NAK or repair is also linear Could argue that SRM has an O(n 2 ) mechanism buried within In practice, dies around 50 processes… similar to vsync!
Status? In the Internet today, IP multicast is mostly disabled If used at all, limited to individual companies –But even then is usually disabled Internet mbone set aside for multicast But has a rigid structure, used for a single Internet “video feed”
Althernatives to SRM RMTP was developed at Lucent –Reliable Multicast Transport Protocol Supports 1-many file transfer Uses notion of regional repair servers –Each region has a repair server that handles processes in its vicinity –Hierarchical repair architecture –Also has flow-control built in
RMTP status Experience suggests it works well –But placement of repair servers is very important –Also, not very helpful for arbitrary uses; most experience is with file transfer applications An IETF standard as of 1999
PGM A Cisco proposal Idea is to put a form of reliable multicast right into the routers –Routers talk to each other –Recovery of lost packets occurs locally –But mechanism is very simple and offers no guarantees
PGM Cisco promoting it as a standard But seems not to be taking off To be useful, needs to work on all routers in a network –Cisco has a tunneling scheme to leapfrog small numbers of non- compliant routers
LGM Lightweight group multicast: another network-level proposal Getting a lot of attention now But no implementations in the field And there are three or four other proposals
Forward Error Correction Can be used with any protocol Idea is: –Instead of just sending data: m 0.. m 1 … m 2 … –Send an error correction code from time to time: m 0.. m 1 … m 2 … ECC012… m 3 … Receiver can use code to correct for packet loss
FEC continued Example? –Suppose that ECC simply is XOR of prior k packets –Than with any k-1 of them and ECC we can reconstruct the missing packet –Hence would need to lose 2 to have an unrecoverable gap Can use more powerful codes but they tend to be slower to compute
FEC continued With genuinely random packet loss, FEC works really well No need to send a NAK or a repair Good match to Internet router behavior –RED drops `em –FEC fixes em! One idea: FEC plus congestion control equals SRM!
Issues? Dependency on a single route or single IP multicast forwarding tree –Routers rarely fail… –But if router becomes heavily loaded, may drop many packets –Similarly if routes change – c.f. Jahanian study
Wishful thinking? Recall our idea of a network routing scheme that would offer failure- independent redundant routes –(A,B) and (A’,B’) “independent” If we had such a scheme for IP multicast… … we could retransmit over it and would have a good chance of recovery even if network becomes very overloaded!
IBM Gryphon Routers Gryphon: high speed publish- subscribe project at IBM –Idea was to build hardware –Started with software on UNIX boxes… highly optimized Special purpose “network” just for publish-subscribe applications –Targets stock markets, financial systems –Likely to have a substantial market –But expensive: dedicated infrastructure
What will future bring? Topic of much debate Two broad schools of thought –Mine (pessimistic) –Theirs (optimistic)
Mine? Perhaps I’m too negative But I suspect: –The Internet is a mess –QoS is not coming any time soon –Multicast isn’t about to start working –Situation broken and we need a whole new “web” design But perhaps can reuse many pieces of existing architecture and technology Much early evidence suggests this can work
Theirs The prevailing academic research perspective They see –Diffsrv, multilevel queuing… –Various multicast mechanisms And they expect –Internet telephony –Publish-subscribe hardware –Streaming media over IP Vast riches for all!