Part III: Overlays, peer-to-peer Jinyang Li In addition to my own contributions, many of the slides are borrowed liberally from networking class notes from Robert Morris, Hari Balakrishnan, David Andersen and Nick Feamster
Overlays are everywhere Internet is an overlay on top of telephone networks Overlays: a network on top of Internet Endpoints (instead of routers) are nodes Multi-hop paths among routers are links Instant deployment!
What can overlays do? Routing: Improve routing robustness (e.g. convergence speed), Multicast, Anonymous communication New applications: Peer-to-peer file sharing and lookup, Content distribution networks, Peer-to-peer live streaming Your imagination is the limit
Why overlays? Internet is ossified IPv6 proposed in 1992, still not widely deployed Multicast (1988), QoS (early 90s) etc. Avoid burdening routers with new features End hosts are cheap and capable Copy and store files Perform expensive cryptographic operations Perform expensive coding/decoding operations …
Today’s class Overlays that take over routers’ jobs Resilient Overlay Networks (RON) Application-level multicast (NICE)
RON’s motivation Internet routing is not reliable Paxson 95-97 3.3% of all routes had serious problems Labovitz 97-00 10% of routes available < 95% of the time 65% of routes available < 99.9% of the time 3-min minimum detection+recovery time; often 15 mins 40% of outages took 30+ mins to repair Chandra 01 5% of faults last more than 2.75 hours Paxson’s study measures 40,000 end-to-end routes
Internet routing is unsatisfactory Slow in detecting outage and recovery Unable to use multiple redundant paths Unable to detect badly performing paths Applications have no control of paths BGP must be scalable Topology information is highly summarized (due to policy requirements and scalability requirements) Routing updates must be damped to prevent oscillation Do not respond to traffic conditions (to prevent oscillation) Multihome only recovers slowly Q: Why can’t we fix BGP? Q2: Hasn’t multi-homing already solved the fault tolerance problem?
BGP converges slowly Given a failure, it can take up to 15 minutes for BGP to converge. Sometimes it does not converge at all. [Feamster]
RON in a nutshell A small set of (<100) nodes Scalable BGP-based IP routing substrate What failures? Outages: configuration/software error, broken links Performance failures: severe congestion, DoS attacks
RON’s goals Fast failure detection and recovery Detect & fail-over within seconds Applications influence path selection Applications define failures Applications define path metrics Expressive and fine-grained policies Who and what applications are allowed to use what paths
Why would RON work? RON testbed study (2003): About 60% of failures within two hops of the edge RON routes around many link “failures” If there exists a node whose paths to S and D do not contain the failed link RON cannot route around access link failure
RON Design Nodes in Different ASes RON library Forwarder Conduit Performance Database Prober Router Link-state routing protocol, disseminates info using RON! Application-specific routing tables Policy routing module
RON reduces loss rate 30-min avg loss rate on Internet 30-min avg loss rate with RON RON loss rate is never more than 30%
RON routes around failures 30-minute average loss rates
Loss Rate   RON Better   No Change   RON Worse
10%         479          57          47
20%         127          4           15
30%         32
50%         20
80%         14
100%        10
Table entries count 30-minute samples, not hours 6,825 “path hours” represented here 5 “path hours” of 100% loss (complete outage) 38 “path hours” of TCP outage (>= 30% loss) RON routed around all of these! One indirection hop provides almost all the benefit!
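The observation that one indirection hop provides almost all the benefit can be illustrated with a minimal sketch of path selection restricted to one-hop detours. This is not RON's actual code: the names `path_loss` and `best_path`, the loss-rate metric, the assumption that per-hop losses are independent, and the sample measurements are all hypothetical.

```python
from typing import Dict, List, Tuple

# (from, to) -> measured loss rate in [0, 1]; values below are made up
LossTable = Dict[Tuple[str, str], float]


def path_loss(path: List[str], loss: LossTable) -> float:
    """Loss rate of an overlay path, assuming independent per-hop losses."""
    delivered = 1.0
    for a, b in zip(path, path[1:]):
        delivered *= 1.0 - loss[(a, b)]
    return 1.0 - delivered


def best_path(src: str, dst: str, nodes: List[str], loss: LossTable) -> List[str]:
    """Pick the lowest-loss path among the direct path and all one-hop detours."""
    candidates = [[src, dst]]
    candidates += [[src, mid, dst] for mid in nodes if mid not in (src, dst)]
    return min(candidates, key=lambda p: path_loss(p, loss))


# Hypothetical measurements: the direct path S->D is suffering a complete outage.
loss = {
    ("S", "D"): 1.00,
    ("S", "A"): 0.01, ("A", "D"): 0.02,
    ("S", "B"): 0.10, ("B", "D"): 0.05,
}
print(best_path("S", "D", ["S", "A", "B", "D"], loss))  # ['S', 'A', 'D']
```

With N nodes there are only N-2 one-hop candidates per destination, so the search stays cheap even though every pair of nodes must be measured.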
Resilience Against DoS Attacks
Throughput Improvement 5%
Lessons of RON End hosts know better about performance and outages than routers Internet routing trades off scalability for performance and fast failover A small amount of redundancy goes a long way
RON’s tradeoff [Figure: design space with axes Scalability, Performance (fast convergence etc.) and Flexibility (application specific metric & policy); labeled points: BGP, Routing overlays (e.g., RON), and an open “???”]
Open Questions Efficiency: generates redundant traffic on access links Scaling: Probing traffic is O(N^2) Can a RON be made to scale to > 50 nodes? Is a 1000-node RON much better than a 50-node one? Interaction of overlays and IP network Interaction of multiple overlays
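To make the O(N^2) probing cost concrete, here is a small back-of-the-envelope sketch. The probe size (64 bytes) and probing interval (10 seconds) are assumed values chosen only for illustration, not figures from the RON study.

```python
def probed_paths(n: int) -> int:
    """Number of directed overlay paths when every node probes every other node."""
    return n * (n - 1)


def probe_bandwidth_bps(n: int, probe_bytes: int, interval_s: float) -> float:
    """Aggregate probe traffic in bits/s; probe size and interval are assumptions."""
    return probed_paths(n) * probe_bytes * 8 / interval_s


for n in (10, 50, 1000):
    print(n, probed_paths(n), probe_bandwidth_bps(n, probe_bytes=64, interval_s=10.0))
```

Going from 50 to 1000 nodes multiplies the number of probed paths by roughly 400, which is why the scaling question matters.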
Application level multicast A.k.a. overlay multicast End host multicast
Why multicast? Send the same stream of data to many hosts Internet radio/TV/conference Stock quote dissemination Multiplayer network games An efficient way to send data to many hosts Multicast is at packet granularity
Naïve approach is wasteful Sender’s outgoing link carries n copies of data 128Kbps mp3 stream, 10,000 listeners = 1.28Gbps
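The arithmetic on this slide can be checked in a few lines. The function name `unicast_sender_load_bps` is hypothetical; the stream rate and listener count are the ones given above.

```python
def unicast_sender_load_bps(stream_bps: int, listeners: int) -> int:
    """Bits/s on the sender's outgoing link when it sends one copy per listener."""
    return stream_bps * listeners


load = unicast_sender_load_bps(stream_bps=128_000, listeners=10_000)
print(load / 1e9)  # 1.28 (Gbps)
```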
IP multicast service model Mimic LAN broadcast Anyone can send, everyone hears Use multicast address 224.0.0.0 -- 239.255.255.255 (2^28 addresses) Each address is called a “group” End hosts register with routers to receive packets
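The multicast address range above is the 224.0.0.0/4 block, which can be verified with Python's standard `ipaddress` module. The two sample addresses are arbitrary.

```python
import ipaddress

MULTICAST_BLOCK = ipaddress.ip_network("224.0.0.0/4")

print(MULTICAST_BLOCK[0], MULTICAST_BLOCK[-1])          # 224.0.0.0 239.255.255.255
print(MULTICAST_BLOCK.num_addresses == 2 ** 28)         # True
print(ipaddress.ip_address("239.1.2.3").is_multicast)   # True
print(ipaddress.ip_address("192.0.2.1").is_multicast)   # False
```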
Basic multicast techniques Construct trees Why trees? (why not meshes?) How many trees? Shared vs. source specific trees Criteria of a “good” tree? Who builds trees? Routers vs. end hosts
IP multicast Routers construct multicast trees for packet replication and forwarding Efficient (low latency, no dup pkts on links)
IP multicast: Augmenting DV How to broadcast using DV routing tables without loops? Idea: shortest paths from S to all nodes form a tree RPF protocol: A router duplicates and forwards all packets if they arrive via the shortest path to S
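The RPF rule can be sketched as a single check against the router's distance-vector table. This is a minimal illustration, not a router implementation: the function `rpf_forward`, the table layout, and the three-router topology are all assumptions made for the example.

```python
from typing import Dict, List

# next_hop[router][dest] = neighbor on router's shortest path to dest (from its DV table)
NextHop = Dict[str, Dict[str, str]]
Neighbors = Dict[str, List[str]]


def rpf_forward(router: str, source: str, arrived_from: str,
                neighbors: Neighbors, next_hop: NextHop) -> List[str]:
    """Return the neighbors to which `router` re-sends a packet originated by `source`."""
    if next_hop[router][source] != arrived_from:
        return []  # did not arrive via the shortest path to the source: drop it
    return [n for n in neighbors[router] if n != arrived_from]


# Hypothetical triangle: a-b costs 1, b-c costs 1, a-c costs 10.
neighbors = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
next_hop = {
    "b": {"a": "a", "c": "c"},
    "c": {"a": "b", "b": "b"},  # c reaches a via b (cost 2), not directly (cost 10)
}

print(rpf_forward("b", "a", "a", neighbors, next_hop))  # ['c']
print(rpf_forward("c", "a", "a", neighbors, next_hop))  # []
print(rpf_forward("c", "a", "b", neighbors, next_hop))  # ['a']
```

The second call shows how the check breaks loops: c drops the copy that arrived directly from a. The third call shows that c still sends a copy back over the a-c link, so a link can carry the same packet twice under plain RPF.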
Reverse path flooding (RPF) [Figure: four-router topology (a, b, c, d) with link costs, and each router's distance-vector table] C does not forward packets from A and vice versa However, link a <--> c sees two packets
Reverse path broadcast (RPB) RPF causes every ‘upstream’ router on a LAN (link) to send a copy RPB: only one router sends a copy Routers listen to each other’s DV advertisements Only the one with lowest hopcount sends
IP multicast: augmenting DV Requires symmetric paths Needs to prune unnecessary broadcast packets to achieve multicast [Deering et al., SIGCOMM 1988, TOCS 1990]
IP multicast: augmenting LS Basic LS: each router floods with changes in link state LS w/ multicast: routers monitor local multicast group membership and changes result in flooding Routers use Dijkstra to compute SP trees How expensive to compute trees for N nodes, E edges, G groups?
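As a reference point for the cost question, here is a minimal sketch of one shortest-path-tree computation using a binary-heap Dijkstra. The graph and the function name `shortest_path_tree` are hypothetical. One run costs O(E log N), so the total work grows with the number of trees a router has to compute.

```python
import heapq
from typing import Dict, Optional

Graph = Dict[str, Dict[str, float]]  # node -> {neighbor: link cost}


def shortest_path_tree(graph: Graph, source: str) -> Dict[str, Optional[str]]:
    """Return each node's parent in the shortest-path tree rooted at `source`."""
    dist = {source: 0.0}
    parent: Dict[str, Optional[str]] = {source: None}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue  # stale heap entry for an already finalized node
        done.add(u)
        for v, w in graph[u].items():
            nd = d + w
            if v not in dist or nd < dist[v]:
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    return parent


# Hypothetical topology: a-b 1, b-c 1, a-c 10, c-d 1.
graph = {
    "a": {"b": 1, "c": 10},
    "b": {"a": 1, "c": 1},
    "c": {"a": 10, "b": 1, "d": 1},
    "d": {"c": 1},
}
print(shortest_path_tree(graph, "a"))
```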
IP multicast has not taken off Requires support from routers Do ISPs have incentives to support multicast? Not scalable Routers keep state for every active group! Multicast group addresses cannot be aggregated Group membership changes much more frequently than links going up and down Difficult to provide congestion/flow control, reliability and security
Overlay multicast No change to IP infrastructure needed Multicast code runs on end hosts End hosts can copy & store data Easy to implement complex functionalities: flow control, security, layered multicast etc. Less efficient: higher delay, duplicate pkts per link
Overlay multicast challenge How can hosts form an efficient tree? Hosts do not know all that routers know What’s wrong with a random tree? Stretch: packets travel farther than they have to Stress: packets traverse links multiple times A particular concern with access links and cross country links
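Stretch and stress can be computed directly once each overlay edge is mapped to the physical path it uses. The sketch below is illustrative only: the tiny topology (hosts S and X behind router R1, host Y behind a distant router R2), the hop-count metric, and the function names are all assumptions.

```python
from collections import Counter
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def links_of(path: List[str]) -> List[Edge]:
    """Undirected physical links along an underlay path."""
    return [tuple(sorted(hop)) for hop in zip(path, path[1:])]


def stress(overlay_edges: List[Edge], underlay_path: Dict[Edge, List[str]]) -> Counter:
    """Number of copies of each packet that cross each physical link."""
    count: Counter = Counter()
    for edge in overlay_edges:
        count.update(links_of(underlay_path[edge]))
    return count


def stretch(source: str, receiver: str, overlay_route: List[Edge],
            underlay_path: Dict[Edge, List[str]]) -> float:
    """Physical hops travelled via the overlay divided by hops on the direct path."""
    overlay_hops = sum(len(underlay_path[e]) - 1 for e in overlay_route)
    direct_hops = len(underlay_path[(source, receiver)]) - 1
    return overlay_hops / direct_hops


# Hypothetical underlay routes; R1-R2 plays the role of a cross-country link.
underlay_path = {
    ("S", "X"): ["S", "R1", "X"],
    ("S", "Y"): ["S", "R1", "R2", "Y"],
    ("Y", "X"): ["Y", "R2", "R1", "X"],
}
bad_tree = [("S", "Y"), ("Y", "X")]   # X is fed by the far-away host Y
good_tree = [("S", "X"), ("S", "Y")]  # X is fed directly by S

print(stress(bad_tree, underlay_path)[("R1", "R2")])    # 2
print(stress(good_tree, underlay_path)[("R1", "R2")])   # 1
print(stretch("S", "X", bad_tree, underlay_path))       # 3.0
print(stretch("S", "X", [("S", "X")], underlay_path))   # 1.0
```

In the bad tree the cross-country link carries every packet twice and X receives data over a path three times longer than necessary.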
Bad tree vs good tree
Cluster-based trees (NICE) A hierarchy of clusters Cluster consists of [k,3k-1] members Log N depth [Figure legend: hosts that reside in 1 cluster, in 2 clusters, in 3 clusters]
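The size bound and logarithmic depth can be illustrated with a toy hierarchy builder. This is a deliberately simplified sketch and not the NICE protocol: it groups hosts in list order rather than by latency, and it assumes the first member of each cluster is its leader. The names `make_clusters` and `build_layers` are hypothetical.

```python
from typing import List


def make_clusters(members: List[str], k: int) -> List[List[str]]:
    """Partition members into clusters of size within [k, 3k-1] (list order, not latency)."""
    if len(members) < k:
        return [list(members)]  # the topmost cluster may hold fewer than k members
    clusters = [members[i:i + k] for i in range(0, len(members), k)]
    if len(clusters) > 1 and len(clusters[-1]) < k:
        tail = clusters.pop()
        clusters[-1].extend(tail)  # merged size is at most 2k-1, within the bound
    return clusters


def build_layers(hosts: List[str], k: int) -> List[List[List[str]]]:
    """Layer 0 holds all hosts; each cluster's leader also joins the next layer up."""
    if k < 2:
        raise ValueError("k must be at least 2 for the hierarchy to shrink")
    layers = []
    members = list(hosts)
    while True:
        clusters = make_clusters(members, k)
        layers.append(clusters)
        if len(clusters) == 1:
            return layers
        members = [c[0] for c in clusters]  # assumed leader: first member


hosts = [f"h{i}" for i in range(27)]
layers = build_layers(hosts, k=3)
print(len(layers))                      # 3
print([len(layer) for layer in layers]) # [9, 3, 1]
```

In this example h0 leads its cluster at every layer, so it resides in three clusters, while most hosts reside in only one.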
Cluster-based trees (NICE) Each node knows all members of its cluster(s)
Cluster-based trees Cluster nodes according to latency Packets do not travel too far out of the way Not perfect: packets are sent to cluster heads (who are in the middle), so they might overshoot
NICE in action How to join a hierarchy? Which is the right cluster? How long does join take? How to split/merge clusters? What if a cluster head fails?
When does clustering not work well? [Figure labels: Cogent, MCI, MIT, Harvard, Boston U] MIT & Harvard peer with each other Key assumption: low latency is transitive As a node descends the tree to join, it assumes children of a close-by cluster head are also close-by
What did you learn today?
Lessons Where should a functionality reside? Routers vs. end hosts Scalability vs. performance End hosts: flexibility, instant deployment! Routers: efficiency
Project draft report You should be able to reuse your draft for the final report You should have complete related work by now You should have a complete plan Most of the system design Most of the experiment designs If you have preliminary graphs, use them, try to explain them
The sandwich method for explanation An easy example illustrating the basic idea Detailed explanations of challenges and how your system addresses them Does it work in general environments?