RON: Resilient Overlay Networks David Andersen, Hari Balakrishnan, Frans Kaashoek, Robert Morris MIT Laboratory for Computer Science
Fault-tolerant Networking Network Any-to-any communication, routing around failures A B C D
The Internet [Diagram labels: Mom-and-pop ISP, Big ISP, Really-big ISP everyone’s afraid of; Transit; Peering; Autonomous System (AS); BGP4] Scalability via aggressive aggregation and information hiding Commercial reality via peering & transit relationships
How Robust is Internet Routing? 1. Slow outage detection and recovery 2. Inability to detect badly performing paths 3. Inability to efficiently leverage redundant paths 4. Inability to perform application-specific routing 5. Inability to express sophisticated routing policy Paxson % of all routes had serious problems Labovitz % of routes available < 95% of the time 65% of routes available < 99.9% of the time 3-min minimum detection+recovery time; often 15 mins 40% of outages took 30+ mins to repair Chandra 01 5% of faults last more than 2.75 hours
Our Goal To improve communication availability for small groups by at least a factor of 10 Many applications –Collaboration and conferencing –Virtual Private Networks (VPNs) across public Internet –Overlay Internet Service
RON: Routing Using Overlays Cooperating end-systems in different routing domains can conspire to do better than scalable wide-area protocols Types of failures –Outages: Configuration/operational errors, backhoes, etc. –Performance failures: Severe congestion, denial-of-service attacks, etc. Scalable BGP-based IP routing substrate Reliability via path monitoring and re-routing
RON Design Prober Router Forwarder Conduit Link-state routing protocol, disseminates info using RON! Performance Database Application-specific routing tables Policy routing module RON library Nodes in different routing domains (ASes)
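The design slide pairs a performance database with application-specific routing tables. A minimal sketch of how a router might consult such a database is shown here. Everything in it is an assumption for illustration, not the RON library's API: the names `LinkStats`, `path_cost`, and `best_path`, the dictionary layout of the database, the treatment of losses on successive links as independent, and the restriction to at most one intermediate node (motivated by the talk's own observation that one indirection hop provides almost all the benefit).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkStats:
    """Hypothetical performance-database entry for one overlay link."""
    latency_ms: float
    loss_rate: float  # fraction of probes lost, 0.0 to 1.0


def path_cost(links, metric):
    """Cost of a path made of overlay links, under the chosen metric."""
    if metric == "latency":
        return sum(link.latency_ms for link in links)
    if metric == "loss":
        # Assumes losses on successive links are independent.
        delivered = 1.0
        for link in links:
            delivered *= 1.0 - link.loss_rate
        return 1.0 - delivered
    raise ValueError(f"unknown metric: {metric}")


def best_path(perf_db, nodes, src, dst, metric):
    """Return (path, cost) for the cheapest direct or one-hop overlay path."""
    candidates = []
    if (src, dst) in perf_db:
        candidates.append(([src, dst], path_cost([perf_db[(src, dst)]], metric)))
    for mid in nodes:
        if mid in (src, dst):
            continue
        if (src, mid) in perf_db and (mid, dst) in perf_db:
            links = [perf_db[(src, mid)], perf_db[(mid, dst)]]
            candidates.append(([src, mid, dst], path_cost(links, metric)))
    if not candidates:
        return None
    return min(candidates, key=lambda c: c[1])


# Made-up measurements: the direct A->B link is fast but lossy.
perf_db = {
    ("A", "B"): LinkStats(latency_ms=40.0, loss_rate=0.30),
    ("A", "C"): LinkStats(latency_ms=30.0, loss_rate=0.01),
    ("C", "B"): LinkStats(latency_ms=30.0, loss_rate=0.01),
}
nodes = ["A", "B", "C"]
latency_route = best_path(perf_db, nodes, "A", "B", "latency")
loss_route = best_path(perf_db, nodes, "A", "B", "loss")
```

With these made-up numbers, a latency-sensitive application stays on the direct path while a loss-sensitive one is routed through C, which is the sense in which the routing tables are application-specific.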
Many Research Questions Does the RON approach work at all? Each RON is small in size, no more than 50 or 100 nodes –How fast can failure detection & recovery happen? Policy routing –Doesn’t RON violate AUPs and other policies? Routing behavior –Can stable routing be achieved? –Implementing efficient multi-criteria routing Is it safe to deploy a large number of (small) interacting RONs on the Internet?
RON Deployment (19 sites) .com (ca), .com (ca), dsl (or), cci (ut), aros (ut), utah.edu, .com (tx), cmu (pa), dsl (nc), nyu, cornell, cable (ma), cisco (ma), mit, vu.nl, lulea.se, ucl.uk, kaist.kr, univ-in-venezuela [Map labels: To vu.nl, lulea.se, ucl.uk; To kaist.kr, .ve]
RON Experiments Measure loss, latency, and throughput with and without RON 13 hosts in the US and Europe 3 days of measurements from data collected in March 30-minute average loss rates –A 30-minute outage is very serious! Note: Experiments done with “No-Internet2-for-commercial-use” policy
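The loss results in this talk are reported as 30-minute average loss rates. A minimal sketch of how such averages could be computed from raw probe records follows. The input format (a list of `(timestamp_seconds, received)` pairs for one path), the function name `windowed_loss_rates`, and the use of fixed, non-overlapping windows are assumptions for illustration; the talk does not describe the actual measurement pipeline.

```python
from collections import defaultdict

WINDOW_SECONDS = 30 * 60  # 30-minute averaging window


def windowed_loss_rates(samples, window=WINDOW_SECONDS):
    """Map each window's start time to the fraction of probes lost in it.

    `samples` is an iterable of (timestamp_seconds, received) pairs,
    where `received` is True if the probe arrived.
    """
    sent = defaultdict(int)
    lost = defaultdict(int)
    for timestamp, received in samples:
        index = int(timestamp // window)  # which window this probe falls in
        sent[index] += 1
        if not received:
            lost[index] += 1
    return {index * window: lost[index] / sent[index] for index in sorted(sent)}


# Made-up probe records: four probes in the first window, two in the second.
samples = [
    (0, True), (600, False), (1200, True), (1799, False),
    (1800, True), (2400, True),
]
rates = windowed_loss_rates(samples)
```

Under this definition, a window in which every probe is lost has a loss rate of 1.0, which corresponds to the 30-minute outage the slide calls very serious.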
RON greatly improves loss-rate 30-min average loss rate with RON 30-min average loss rate on Internet 13,000 samples RON loss rate never more than 30%
An order-of-magnitude fewer failures [Table of 30-minute average loss rates; columns: Loss Rate, RON Better, No Change, RON Worse] 6,825 “path hours” represented here 12 “path hours” of essentially complete outage 76 “path hours” of TCP outage RON routed around all of these! One indirection hop provides almost all the benefit!
Resilience Against DoS Attacks
Conclusion Improved availability of Internet communication paths using small overlays –Layered above scalable IP substrate –RON provides a set of libraries and programs to facilitate this application-specific routing Experimental data suggest that this approach works –Over 10X availability –Outage detection and recovery in about 15 seconds –Able to route around certain denial-of-service attacks Many interesting questions remain…
Policy Routing Today, wide-area policy expression is a sledgehammer Policy control is important –From talking to some providers –E.g., rate control policy; Internet2, etc. True, RONs could violate AUPs But, the RON approach enables more flexible policies –More complex routing decisions; rate-based too –Multiple routing tables –Deeper packet inspection, etc.
Example
Throughput Improvement