Towards an Internet that “Never Fails” Hari Balakrishnan MIT Joint work with Nick Feamster, Scott Shenker, Mythili Vutukuru.


1 Towards an Internet that “Never Fails” Hari Balakrishnan MIT Joint work with Nick Feamster, Scott Shenker, Mythili Vutukuru

2 What We Should Aim Toward
- Carrier airlines (2002 FAA Fact Book) → 41 accidents, 6.7 million flights (five “nines” availability)
- 911 phone service (1993 NRIC report) → 29 minutes downtime per year per line (four “nines” availability)
- Standard phone service (various sources) → 53 minutes downtime per year per line (four “nines” availability)
- The Internet? → One to two “nines”
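
The "nines" figures on this slide can be reproduced with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes a 365-day year and treats each airline accident as one failed flight.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored (assumption)

def availability_from_downtime(minutes_down_per_year: float) -> float:
    """Fraction of the year a service is up, given yearly downtime in minutes."""
    return 1.0 - minutes_down_per_year / MINUTES_PER_YEAR

def nines(availability: float) -> float:
    """Number of 'nines' in an availability figure, e.g. 0.999 gives about 3."""
    return -math.log10(1.0 - availability)

# Figures taken from the slide.
airline = 1.0 - 41 / 6.7e6               # 41 accidents in 6.7 million flights
e911 = availability_from_downtime(29)    # 29 minutes of downtime per year
phone = availability_from_downtime(53)   # 53 minutes of downtime per year

for name, a in [("airline", airline), ("911", e911), ("phone", phone)]:
    print(f"{name}: {a:.6%} available, {nines(a):.2f} nines")
```

By this measure the airline figure lands between five and six nines, 911 service between four and five, and standard phone service almost exactly at four.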

3 Example Catastrophic Failures
“…a glitch at a small ISP… triggered a major outage in Internet access across the country. The problem started when MAI Network Services...passed bad router information from one of its customers onto Sprint.” -- news.com, April 25, 1997
“Microsoft's websites were offline for up to 23 hours...because of a [router] misconfiguration…it took nearly a day to determine what was wrong and undo the changes.” -- wired.com, January 25, 2001
“WorldCom Inc…suffered a widespread outage on its Internet backbone that affected roughly 20 percent of its U.S. customer base. The network problems…affected millions of computer users worldwide. A spokeswoman attributed the outage to ‘a route table issue.’” -- cnn.com, October 3, 2002
“A number of Covad customers went out from 5pm today due to, supposedly, a DDOS (distributed denial of service attack) on a key Level3 data center, which later was described as a route leak (misconfiguration).” -- dslreports.com, February 23, 2004

4 NANOG List Failure “Analysis”
- Note: Only includes problems openly discussed on this list.
- More than 70% of the threads discussing failures relate to router configuration or route announcement problems.

5 Faults and Failures
- Fault = Underlying defect in a component that causes it to violate a specification
  - Latent or active (i.e., causing errors)
- Unmasked faults (errors) cause failures
  - Failure of a subsystem (spec violation) causes a fault in the system
- Internet faults occur for complex reasons
  - Hardware, software, protocol, design, implementation, operational faults: could be triggered by malice
- Internet failure: A cannot communicate with B

6 Three Directions
- Configuration as programming
  - Defines BGP behavior
  - Tools to cope with routing complexity
- Coping with protocol faults: failure-atomic interdomain routing
  - Prefix-based routing considered harmful
- End-to-end routing
  - Exposing multiple paths to end systems (and stubs)

7 Configuration Defines BGP Behavior
- Which neighboring networks can send traffic
- Where traffic enters and leaves the network
- How routers within the network learn routes to external destinations
- Flexibility for realizing goals in complex business landscape
[Figure labels: Flexibility, Complexity; Traffic, Route, No Route]

8 Today: Reactive Operation
- Problems cause downtime
- Problems often not immediately apparent
- What happens if I tweak this policy…?

9 Coping with Complexity
- View configuration as (distributed) programming
  - Large-scale: over 1M lines of code in some networks
- Programming tools to reduce fault frequency
  - Static analysis can detect many faults [rcc]
  - Sandboxing to overcome current “stimulus-response” reasoning [FR03]
- Centralize configuration platform
  - More “intentional” config specs
  - Push configs to routers
  - Push routes to routers [RCP:F+04]
  - Use static analysis and sandboxing tools

10 Proactive Operation with rcc
http://nms.csail.mit.edu/rcc
- Represent complex, distributed configuration
- Define a correctness specification
- Map specification to constraints
[Figure labels: Configure, Detect Faults, Deploy; Distributed router configurations (Single AS), Normalized Representation, Correctness Specification, Constraints, rcc, Faults]
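
As a rough illustration of checking constraints against a normalized representation, the sketch below runs two simple static checks on a toy description of iBGP sessions. It is not rcc's actual representation or rule set: the data layout, router names, and function names are all hypothetical, and the full-mesh check assumes a network with no route reflectors or confederations.

```python
from itertools import combinations

# Hypothetical normalized representation: router -> configured iBGP neighbors.
ibgp_sessions = {
    "r1": {"r2", "r3"},
    "r2": {"r1"},
    "r3": {"r1", "r2"},  # r3 names r2, but r2 does not name r3
}

def half_configured_sessions(sessions):
    """Return (a, b) pairs where a names b as a neighbor but b does not name a."""
    return sorted(
        (a, b)
        for a, neighbors in sessions.items()
        for b in neighbors
        if a not in sessions.get(b, set())
    )

def missing_full_mesh_sessions(sessions):
    """Return router pairs that lack a session configured on both ends."""
    return [
        (a, b)
        for a, b in combinations(sorted(sessions), 2)
        if not (b in sessions[a] and a in sessions[b])
    ]

print(half_configured_sessions(ibgp_sessions))
print(missing_full_mesh_sessions(ibgp_sessions))
```

Both checks run on the configuration alone, before anything is deployed, which is the sense in which the operation is proactive rather than reactive.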

11 Factoring Routing Configuration
- Ranking: route selection
- Dissemination: internal route advertisement
- Filtering: route advertisement
- Hundreds of thousands of lines of configuration in hundreds of routers.
[Figure labels: Customer, Competitor, Primary, Backup]

12 Correctness Specification
- Path Visibility: Every destination with a usable path has a route advertisement
  - If there exists a path, then there exists a route
  - Example violation: Signaling partition
- Route Validity: Every route advertisement corresponds to a usable path
  - If there exists a route, then there exists a path
  - Example violation: Routing loop
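
The two properties are converse implications, which a toy model makes concrete. In the sketch below, the per-router sets of destinations, the router and destination labels, and the function names are all hypothetical; a real checker would have to derive these sets from configurations and topology.

```python
# Hypothetical toy model: per-router sets of destination labels.
usable_paths = {
    "r1": {"d1", "d2"},
    "r2": {"d1"},
}
advertised_routes = {
    "r1": {"d1"},        # d2 has a usable path but no route advertisement
    "r2": {"d1", "d3"},  # d3 has a route advertisement but no usable path
}

def path_visibility_violations(usable, advertised):
    """Path exists but route does not: destinations reachable yet unadvertised."""
    violations = {}
    for router, dests in usable.items():
        missing = dests - advertised.get(router, set())
        if missing:
            violations[router] = sorted(missing)
    return violations

def route_validity_violations(usable, advertised):
    """Route exists but path does not: advertisements with no usable path."""
    violations = {}
    for router, dests in advertised.items():
        bogus = dests - usable.get(router, set())
        if bogus:
            violations[router] = sorted(bogus)
    return violations

print(path_visibility_violations(usable_paths, advertised_routes))
print(route_validity_violations(usable_paths, advertised_routes))
```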

13 Results: Faults across 17 ASes
[Figure panels: Route Validity, Path Visibility]
- Every AS had faults, regardless of network size
- Most faults can be attributed to distributed configuration

14 Web-based & Command Line Interfaces http://nms.csail.mit.edu/rcc

15 Three Directions
- Configuration as programming
  - Tools to cope with routing complexity
- Coping with protocol faults: failure-atomic interdomain routing
  - Prefix-based routing considered harmful
- End-to-end routing
  - Exposing multiple paths to end systems

16 Prefixes are too coarse-grained
- Validity: If a failure occurs that makes a network unreachable via a given path, then the route corresponding to that path must be withdrawn
- 70% of intra-AS failures not visible in BGP [FABK03]

17 …but they are also too fine-grained!
- ~70% of discontiguous prefix pairs from the same AS are announced from the same location
- Allocation explains about 60% of these cases:
  - Registries often allocate discontiguous address blocks to a single AS on the same day
- Routes for these prefixes will “flap” together.
  - 135.36.0.0/16 (Agere) and 135.12.0.0/14 (Lucent)
- Route objects should correspond to an “atom” of hosts that share fate
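
The sketch below first confirms that the two example prefixes cannot be aggregated into one covering prefix, and then groups prefixes that share an announcement location into one "atom". The location labels ("site-A", "site-B"), the extra 192.0.2.0/24 documentation prefix, and the function names are hypothetical, added only for illustration.

```python
import ipaddress
from collections import defaultdict

agere = ipaddress.ip_network("135.36.0.0/16")
lucent = ipaddress.ip_network("135.12.0.0/14")

def can_aggregate(a, b):
    """True if the two prefixes collapse into a single covering prefix."""
    return len(list(ipaddress.collapse_addresses([a, b]))) == 1

def group_into_atoms(announcements):
    """Group prefixes by announcement location, a stand-in for shared fate."""
    atoms = defaultdict(list)
    for prefix, location in announcements:
        atoms[location].append(prefix)
    return dict(atoms)

# Hypothetical announcements: (prefix, location label).
announcements = [
    (agere, "site-A"),
    (lucent, "site-A"),
    (ipaddress.ip_network("192.0.2.0/24"), "site-B"),
]

print(can_aggregate(agere, lucent))  # the two prefixes are discontiguous
for location, prefixes in group_into_atoms(announcements).items():
    print(location, [str(p) for p in prefixes])
```

Prefix-based routing has to carry the two "site-A" prefixes as separate route objects even though they fail together; an atom-based scheme would carry one.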

18 Proposal: Atomic Interdomain Protocol (AIP)
- Exterminate prefixes
- Name “atomic domains” (AD) directly
  - Addressing, forwarding and routing on ADs
  - Like current AS numbers, but finer-grained
  - Example: MIT, Microsoft Redmond, one PoP of a large ISP, …
- Flat AD IDs can carry cryptographic meaning
  - Self-certifying (hash of public key)
- End-system addresses have the form [AD : LocalID]
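
A self-certifying identifier can be sketched in a few lines. The slide does not specify the hash function, the identifier length, or the key format, so the choices below (SHA-256 truncated to 160 bits, opaque key bytes, the textual address layout) are assumptions made purely for illustration, as are the function names.

```python
import hashlib
import hmac

def ad_id(public_key: bytes) -> str:
    """Derive a flat AD identifier as a hash of the domain's public key.

    SHA-256 truncated to 160 bits (40 hex digits) is an assumption here.
    """
    return hashlib.sha256(public_key).hexdigest()[:40]

def make_address(public_key: bytes, local_id: str) -> str:
    """Build an end-system address of the form [AD : LocalID]."""
    return f"[{ad_id(public_key)} : {local_id}]"

def verify_ad(claimed_ad: str, public_key: bytes) -> bool:
    """Check that a claimed AD identifier really is the hash of the given key."""
    return hmac.compare_digest(claimed_ad, ad_id(public_key))

# Placeholder bytes standing in for a real encoded public key.
example_key = b"placeholder public key bytes for an atomic domain"
print(make_address(example_key, "host-17"))
print(verify_ad(ad_id(example_key), example_key))
```

Because the identifier is derived from the key, anyone holding the public key can check the binding without consulting a separate registry, which is what "self-certifying" refers to.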

19 Exposing Paths to End Systems
- Ultimately, failure recovery is an end-to-end function
- Current architecture doesn’t expose multiple paths to end systems and stubs
- Result: Various hacks to “discover” distinct paths across overlays and underlays…

20 Summary
- It’s worth shooting for a two or three order-of-magnitude improvement in Internet availability
- It’s possible to get four or five nines of Internet availability, if we:
  - Develop tools to cope with configuration complexity
  - Develop a failure-atomic routing system
  - Expose multiple IP-layer paths to higher layers

