Towards an Internet that “Never Fails” Hari Balakrishnan MIT Joint work with Nick Feamster, Scott Shenker, Mythili Vutukuru.

Presentation transcript:

What We Should Aim Toward
Carrier airlines (2002 FAA Fact Book)
  - 41 accidents, 6.7 million flights (five “nines” availability)
911 phone service (1993 NRIC report)
  - 29 minutes downtime per year per line (four “nines” availability)
Standard phone service (various sources)
  - 53 minutes downtime per year per line (four “nines” availability)
The Internet?
  - One to two “nines”
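The "nines" figures above can be checked with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it treats "nines" as the negative base-10 logarithm of the failure (or downtime) fraction. Under that definition, 53 minutes per year comes out just under four nines, since exactly four nines corresponds to about 52.6 minutes per year.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 (ignores leap years)

def nines(unavailability: float) -> float:
    """Return the number of 'nines' for a given failure/downtime fraction."""
    if not 0.0 < unavailability < 1.0:
        raise ValueError("unavailability must be strictly between 0 and 1")
    return -math.log10(unavailability)

def downtime_minutes_per_year(n: float) -> float:
    """Return the yearly downtime budget, in minutes, for n nines."""
    return MINUTES_PER_YEAR * 10 ** (-n)

if __name__ == "__main__":
    # Figures quoted on the slide.
    print(f"airlines: {nines(41 / 6.7e6):.2f} nines")
    print(f"911:      {nines(29 / MINUTES_PER_YEAR):.2f} nines")
    print(f"phone:    {nines(53 / MINUTES_PER_YEAR):.2f} nines")
    # One to two nines means days of downtime per year, not minutes.
    for n in (1, 2, 4, 5):
        print(f"{n} nines allows {downtime_minutes_per_year(n):.1f} min/year")
```

One nine allows roughly 36.5 days of downtime per year and two nines roughly 3.65 days, which is the gap the talk proposes to close.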

Example Catastrophic Failures
“…a glitch at a small ISP… triggered a major outage in Internet access across the country. The problem started when MAI Network Services...passed bad router information from one of its customers onto Sprint.” -- news.com, April 25, 1997
“Microsoft's websites were offline for up to 23 hours...because of a [router] misconfiguration…it took nearly a day to determine what was wrong and undo the changes.” -- wired.com, January 25, 2001
“WorldCom Inc…suffered a widespread outage on its Internet backbone that affected roughly 20 percent of its U.S. customer base. The network problems…affected millions of computer users worldwide. A spokeswoman attributed the outage to ‘a route table issue.’” -- cnn.com, October 3, 2002
“A number of Covad customers went out from 5pm today due to, supposedly, a DDOS (distributed denial of service attack) on a key Level3 data center, which later was described as a route leak (misconfiguration).” -- dslreports.com, February 23, 2004

NANOG List Failure “Analysis”
Note: Only includes problems openly discussed on this list.
More than 70% of the threads that discuss failures concern router configuration or route announcement problems.

Faults and Failures
Fault = underlying defect in a component that causes it to violate a specification
  - Latent or active (i.e., causing errors)
Unmasked faults (errors) cause failures
  - Failure of a subsystem (spec violation) causes a fault in the system
Internet faults occur for complex reasons
  - Hardware, software, protocol, design, implementation, and operational faults; any of these could be triggered by malice
Internet failure: A cannot communicate with B

Three Directions
Configuration as programming
  - Defines BGP behavior
  - Tools to cope with routing complexity
Coping with protocol faults: failure-atomic interdomain routing
  - Prefix-based routing considered harmful
End-to-end routing
  - Exposing multiple paths to end systems (and stubs)

Configuration Defines BGP Behavior
Which neighboring networks can send traffic
Where traffic enters and leaves the network
How routers within the network learn routes to external destinations
Flexibility for realizing goals in a complex business landscape
[Figure labels: Flexibility, Complexity; Traffic, Route, No Route]

Today: Reactive Operation
Problems cause downtime
Problems are often not immediately apparent
“What happens if I tweak this policy…?”

Coping with Complexity
View configuration as (distributed) programming
  - Large-scale: over 1M lines of code in some networks
Programming tools to reduce fault frequency
  - Static analysis can detect many faults [rcc]
  - Sandboxing to overcome current “stimulus-response” reasoning [FR03]
Centralize configuration platform
  - More “intentional” config specs
  - Push configs to routers
  - Push routes to routers [RCP:F+04]
  - Use static analysis and sandboxing tools

Proactive Operation with rcc
Represent complex, distributed configuration
Define a correctness specification
Map specification to constraints
Workflow: Configure, Detect Faults, Deploy
[Figure labels: Distributed router configurations (Single AS), Normalized Representation, Correctness Specification, Constraints, rcc, Faults]

Factoring Routing Configuration
Ranking: route selection
Dissemination: internal route advertisement
Filtering: route advertisement
[Figure labels: Customer, Competitor, Primary, Backup]
Hundreds of thousands of lines of configuration in hundreds of routers.

Correctness Specification
Path Visibility: every destination with a usable path has a route advertisement
  - If there exists a path, then there exists a route
  - Example violation: signaling partition
Route Validity: every route advertisement corresponds to a usable path
  - If there exists a route, then there exists a path
  - Example violation: routing loop
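The two properties can be illustrated on a toy network. This is a minimal sketch under strong assumptions, not how rcc works: rcc analyzes router configurations statically, whereas the code below checks a hypothetical, already-computed forwarding state. The names `links`, `next_hop`, and `check_spec` are invented for the example, and next hops are not checked against the link set.

```python
from collections import deque

def reachable(links, src, dst):
    """Return True if an undirected path from src to dst exists (BFS)."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def route_delivers(next_hop, src, dst):
    """Follow next hops from src; True only if dst is reached without a loop."""
    node, visited = src, set()
    while node != dst:
        if node in visited or (node, dst) not in next_hop:
            return False  # routing loop or dead end
        visited.add(node)
        node = next_hop[(node, dst)]
    return True

def check_spec(links, next_hop, routers, dst):
    """Return (path-visibility violators, route-validity violators)."""
    visibility, validity = [], []
    for r in routers:
        has_path = reachable(links, r, dst)
        has_route = (r, dst) in next_hop
        if has_path and not has_route:
            visibility.append(r)  # a path exists, but no route was learned
        if has_route and not route_delivers(next_hop, r, dst):
            validity.append(r)    # a route exists, but it leads nowhere
    return visibility, validity

# Toy state: A and B point at each other (loop); E never learned a route.
links = [("A", "B"), ("B", "C"), ("C", "D"), ("C", "E")]
next_hop = {("A", "D"): "B", ("B", "D"): "A", ("C", "D"): "D"}
print(check_spec(links, next_hop, ["A", "B", "C", "E"], "D"))
```

In this toy state, router E violates path visibility and routers A and B violate route validity.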

Results: Faults across 17 ASes
[Figure columns: Route Validity, Path Visibility]
Every AS had faults, regardless of network size
Most faults can be attributed to distributed configuration

Web-based & Command Line Interfaces

Three Directions
Configuration as programming
  - Tools to cope with routing complexity
Coping with protocol faults: failure-atomic interdomain routing
  - Prefix-based routing considered harmful
End-to-end routing
  - Exposing multiple paths to end systems

Prefixes are too coarse-grained
Validity: if a failure occurs that makes a network unreachable via a given path, then the route corresponding to that path must be withdrawn
70% of intra-AS failures are not visible in BGP [FABK03]

…but they are also too fine-grained!
~70% of discontiguous prefix pairs from the same AS are announced from the same location
Allocation explains about 60% of these cases:
  - Registries often allocate discontiguous address blocks to a single AS on the same day
Routes for these prefixes will “flap” together.
  - /16 (Agere) and /14 (Lucent)
Route objects should correspond to an “atom” of hosts that share fate
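One crude way to picture "sharing fate" is to group prefixes whose routes flap at the same times. The sketch below is purely illustrative: the function name, the input format, and the use of identical flap timestamps as a proxy for shared fate are all assumptions, and the prefixes are documentation addresses rather than measured data.

```python
from collections import defaultdict

def group_into_atoms(flap_times):
    """Group prefixes whose observed flap timestamps are identical.

    flap_times maps a prefix string to an iterable of timestamps
    (hypothetical input; a real study would need tolerance windows).
    """
    groups = defaultdict(list)
    for prefix, times in flap_times.items():
        groups[frozenset(times)].append(prefix)
    # Sort for a stable, readable result.
    return sorted(sorted(group) for group in groups.values())

observed = {
    "192.0.2.0/24": [10, 42],
    "198.51.100.0/24": [42, 10],  # same events, so same candidate atom
    "203.0.113.0/24": [7],
}
print(group_into_atoms(observed))
```

Here the first two prefixes land in one candidate atom and the third stands alone.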

Proposal: Atomic Interdomain Protocol (AIP)
Exterminate prefixes
Name “atomic domains” (AD) directly
  - Addressing, forwarding and routing on ADs
  - Like current AS numbers, but finer-grained
  - Example: MIT, Microsoft Redmond, one PoP of a large ISP, …
Flat AD IDs can carry cryptographic meaning
  - Self-certifying (hash of public key)
End-system addresses have the form [AD : LocalID]
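A self-certifying identifier can be sketched in a few lines. The slide says only that the AD ID is a hash of a public key; the choice of SHA-256, the hex encoding, the function names, and the placeholder key bytes below are assumptions made for illustration.

```python
import hashlib
import hmac

def ad_id(public_key: bytes) -> str:
    """Derive a flat AD identifier as the hash of a public key (SHA-256 assumed)."""
    return hashlib.sha256(public_key).hexdigest()

def key_matches_ad(ad: str, presented_key: bytes) -> bool:
    """Check that a presented public key hashes to the claimed AD identifier."""
    return hmac.compare_digest(ad, ad_id(presented_key))

def address(ad: str, local_id: str) -> str:
    """Format an end-system address in the [AD : LocalID] form from the slide."""
    return f"[{ad} : {local_id}]"

# Placeholder bytes stand in for a real encoded public key.
example_key = b"placeholder-public-key-for-illustration"
example_ad = ad_id(example_key)
print(address(example_ad, "host-17"))
print(key_matches_ad(example_ad, example_key))     # the key owner can prove the name
print(key_matches_ad(example_ad, b"another-key"))  # any other key does not match
```

Because the name is derived from the key, anyone holding the identifier can check a presented key against it without consulting a separate registry. Proving possession of the matching private key would still require a signature step that this sketch omits.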

Exposing Paths to End Systems
Ultimately, failure recovery is an end-to-end function
The current architecture doesn’t expose multiple paths to end systems and stubs
Result: various hacks to “discover” distinct paths across overlays and underlays…

Summary
It’s worth shooting for a two or three order-of-magnitude improvement in Internet availability
It’s possible to get four or five nines of Internet availability, if we:
  - Develop tools to cope with configuration complexity
  - Develop a failure-atomic routing system
  - Expose multiple IP-layer paths to higher layers