Virtually Eliminating Router Bugs Minlan Yu Princeton University Joint work with Eric Keller (Princeton), Matt Caesar (UIUC),

Virtually Eliminating Router Bugs
Minlan Yu, Princeton University
Joint work with Eric Keller (Princeton), Matt Caesar (UIUC), Jennifer Rexford (Princeton)
CoNEXT’09

Router Bugs in the News

Example of Router Bugs
1 misconfiguration tickled 2 bugs (2 vendors)
  – Real bugs on Feb 16, 2009
  – Huge increase in the global rate of updates
  – 10x increase in global instability for an hour
Misconfiguration: as-path prepend
MikroTik bug: no range check, prepended 252 times
Did not filter
Cisco bug: long AS paths
AS path prepending; after: len > 255; Notification
[Diagram labels: AS47878, AS; chart: Global Instability by Country]

Router Bugs
Router bugs are a serious problem
  – Routers are getting more complicated
      Quagga 220K lines, XORP 826K lines
  – Vendors are allowing third-party software
  – Other outages are becoming less common
Router bugs are hard to detect and fix
  – Byzantine failures don’t simply crash the router
  – Violate protocol, can cause cascading outages
  – Often discovered after serious outage
How to detect bugs and stop their effects before they spread?

Avoiding Bugs via Diversity
Run multiple, diverse routing instances
  – Use voting to select majority result
  – Software and Data Diversity (SDD) ensures correctness
      E.g., XORP and Quagga, different update timing
  – Similar approach applied in other fields
  – But new challenges and opportunities in routing
[Figure label: Vote]
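One way to picture the "different update timing" form of data diversity is to hand each routing instance the same messages in a different interleaving. The sketch below is only an illustration under assumptions not stated on the slide: the function name `interleave`, the per-peer queues, and the use of a per-replica seed are all hypothetical, and order within a single peer's stream is deliberately preserved.

```python
import random
from typing import Dict, List, Tuple


def interleave(queues: Dict[str, List[bytes]], seed: int) -> List[Tuple[str, bytes]]:
    """Merge per-peer message queues into one delivery order.

    Order within each peer's queue is preserved; only the interleaving
    across peers changes with the seed (hypothetical: one seed per replica).
    """
    rng = random.Random(seed)
    # Copy the queues so the caller's data is left untouched.
    pending = {peer: list(msgs) for peer, msgs in queues.items() if msgs}
    order: List[Tuple[str, bytes]] = []
    while pending:
        peer = rng.choice(sorted(pending))      # pick which peer is served next
        order.append((peer, pending[peer].pop(0)))
        if not pending[peer]:
            del pending[peer]
    return order


# Usage: each replica (seed) sees the same messages, possibly in a different order.
updates = {"peerA": [b"a1", b"a2"], "peerB": [b"b1", b"b2", b"b3"]}
per_replica_orders = [interleave(updates, seed) for seed in range(3)]
```

Because every replica still receives every message, the instances should converge to the same result; a bug that depends on one particular ordering is then less likely to hit all of them at once.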

SDD Challenges in Routers
Making replication transparent
  – Interoperate with existing routers
  – Duplicate network state to routing instances
  – Present a common configuration interface
Handling transient, real-time nature of routers
  – React quickly to network events
      E.g., buggy behaviors, link failures
  – But not over-react to transient inconsistency
[Figure: outputs of Routing Instance I and Routing Instance II over time]

SDD Opportunities in Routers
Easy to vote on standardized output
  – Control plane: IETF-standardized routing protocols
  – Data plane: forwarding-table entries
Easy to recover from errors via bootstrap
  – Routing has limited dependency on history
  – Don’t need much information to bootstrap instance
Diversity is effective in avoiding router bugs
  – Based on our studies on router bugs and code

Outline
Exploiting software and data diversity (SDD)
  – Effective in avoiding bugs
  – Enough hardware resources to support diversity
Bug-tolerant router (BTR) architecture
  – Make replication transparent with low overhead
  – React quickly and handle transient inconsistency
Prototype and evaluation
  – Small, trusted code base
  – Low processing overhead

Why Diversity Works?
Enough diversity in routers
  – Software: Quagga, XORP, BIRD
  – Protocols: OSPF and IS-IS
  – Environment: timing, ordering, memory
Enough resources for diversity
  – Extra processor blades for hardware reliability
  – Multi-core processors, separate route servers
Effective in avoiding bugs

Evaluate Diversity Effect
Most bugs can be avoided by diversity
  – Reproduce and avoid real bugs
  – ... in XORP and Quagga Bugzilla database
Diversity on execution environment

Diversity mechanism                  Bugs in database avoided
Timing/order of messages             39%
Configuration                        25%
Timing/order of connections          12%
Combining all execution diversity    88%

Effect of Software Diversity
Sanity check on implementation diversity
  – Picked 10 bugs from XORP, 10 bugs from Quagga
  – None were present in the other implementation
Static code analysis on version diversity
  – Overlap decreases quickly between versions
      75% of bugs in Quagga are fixed in Quagga
      % of bugs in Quagga are newly introduced
Vendors can also achieve software diversity
  – Different code versions, different code trains
  – Code from acquired companies, open-source

Bug-tolerant Router Architecture
[Diagram components: three routing instances, each a protocol daemon with its own routing table; a hypervisor containing the UPDATE VOTER, FIB VOTER, and REPLICA MANAGER; the forwarding table (FIB); Interface 1 and Interface 2]

Replicating Incoming Routing Messages
No need for protocol parsing – operates at socket level
[Diagram: same components as the architecture diagram, with an incoming "/8 Update" message]
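A minimal sketch of socket-level replication, assuming the hypervisor holds one stream socket per routing instance: the same bytes are copied to every replica without being parsed. The names `replicate` and `demo`, and the use of `socket.socketpair()` as a stand-in for the hypervisor-to-daemon channel, are illustrative assumptions and not taken from the prototype.

```python
import socket
from typing import List


def replicate(message: bytes, replica_socks: List[socket.socket]) -> None:
    """Send the same opaque bytes to every replica; the payload is never parsed."""
    for sock in replica_socks:
        sock.sendall(message)


def demo() -> List[bytes]:
    # One socketpair per replica stands in for the real channel (an assumption).
    pairs = [socket.socketpair() for _ in range(3)]
    try:
        replicate(b"opaque-update", [hyp_end for hyp_end, _ in pairs])
        # Each "daemon" end reads what the hypervisor end wrote.
        return [daemon_end.recv(1024) for _, daemon_end in pairs]
    finally:
        for hyp_end, daemon_end in pairs:
            hyp_end.close()
            daemon_end.close()


received = demo()
```

Treating the message as bytes is what keeps this step independent of the routing protocol in use.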

Voting: Updates to Forwarding Table
Transparent by intercepting calls to “Netlink”
[Diagram: same components as the architecture diagram, with labels "/8  IF" and "/8 Update"]

Voting: Control-Plane Messages
Transparent by intercepting socket system calls
[Diagram: same components as the architecture diagram, with labels "/8  IF" and "/8 Update"]

Simple Voting Mechanisms
Tolerate transient periods of disagreement
  – Different replicas can have different outputs
  – … during routing-protocol convergence
Several different voting mechanisms
  – Master-slave: speeding reaction time
  – Continuous majority: handling transience
[Figure: outputs of Routing Instances I, II, and III over time, with the master's output highlighted]
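The slide names master-slave voting and its goal (speeding reaction time) but not its exact rules. The sketch below is one plausible reading, stated as an assumption: the master's output is used immediately, and a later audit switches the master if a strict majority of replicas disagrees with it. The class and method names are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Optional


class MasterSlaveVoter:
    """Fast path follows the master; a background audit can replace the master."""

    def __init__(self, num_replicas: int, master: int = 0) -> None:
        self.num_replicas = num_replicas
        self.master = master
        self.latest: Dict[int, Hashable] = {}   # replica id -> most recent output
        self.output: Optional[Hashable] = None  # what the router currently uses

    def on_update(self, replica: int, value: Hashable) -> Optional[Hashable]:
        self.latest[replica] = value
        if replica == self.master:
            self.output = value                 # react at once to the master
        return self.output

    def audit(self) -> Optional[Hashable]:
        """Run after a delay (assumed): switch master if a majority disagrees."""
        if not self.latest:
            return self.output
        value, count = Counter(self.latest.values()).most_common(1)[0]
        if count > self.num_replicas / 2 and self.latest.get(self.master) != value:
            # Pick the lowest-numbered replica that agrees with the majority.
            self.master = min(r for r, v in self.latest.items() if v == value)
            self.output = value
        return self.output


# Usage: the master (replica 0) says "A"; the other two later agree on "B".
voter = MasterSlaveVoter(num_replicas=3)
voter.on_update(0, "A")
voter.on_update(1, "B")
voter.on_update(2, "B")
voter.audit()
```

The trade-off in this reading is that a buggy master's output is used until the audit runs, in exchange for not waiting on the slower replicas.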

Simple Voting Mechanisms (continued)
[Figure: outputs of Routing Instances I, II, and III over time, with the continuous-majority result shown alongside]
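Continuous-majority voting can be sketched as a vote that is recomputed every time any replica changes its output. The rule for what to do when no majority exists is not given on the slide; keeping the previous output, as below, is an assumption, and the class name is hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Optional


class ContinuousMajorityVoter:
    """Recompute the majority on every replica update."""

    def __init__(self, num_replicas: int) -> None:
        self.num_replicas = num_replicas
        self.latest: Dict[int, Hashable] = {}   # replica id -> most recent output
        self.output: Optional[Hashable] = None  # last value that won a majority

    def on_update(self, replica: int, value: Hashable) -> Optional[Hashable]:
        self.latest[replica] = value
        top, count = Counter(self.latest.values()).most_common(1)[0]
        if count > self.num_replicas / 2:
            self.output = top
        # With no strict majority, the previous output is kept (assumption).
        return self.output


# Usage: three replicas pass through a transient disagreement.
voter = ContinuousMajorityVoter(num_replicas=3)
history = [
    voter.on_update(0, "A"),
    voter.on_update(1, "B"),
    voter.on_update(2, "A"),
    voter.on_update(0, "C"),
    voter.on_update(1, "C"),
]
```

Holding the last agreed value during disagreement is what lets the voter ride out convergence without flapping.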

Simple Voting and Recovery
Recovery
  – Hiding replica failure from neighboring routers
  – Hypervisor kills faulty instance, invokes new one
Small, trusted software component
  – No parsing, treats data as opaque strings
  – Just 514 lines of code in voter implementation
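The recovery step could look roughly like the following sketch. The slide says only that the hypervisor kills a faulty instance and invokes a new one; how a replica is judged faulty is not stated, so the strike counter, the `threshold` parameter, and the `restart` callback here are hypothetical choices made for illustration.

```python
from collections import Counter
from typing import Callable, Dict, Hashable, List


class ReplicaManager:
    """Restart a replica that keeps disagreeing with the majority."""

    def __init__(self, restart: Callable[[int], None], threshold: int = 3) -> None:
        self.restart = restart            # hypothetical hook: kill and re-invoke
        self.threshold = threshold        # consecutive disagreements tolerated
        self.strikes: Dict[int, int] = {}

    def observe(self, outputs: Dict[int, Hashable]) -> List[int]:
        """Record one voting round; return the replicas restarted in this round."""
        if not outputs:
            return []
        value, count = Counter(outputs.values()).most_common(1)[0]
        if count <= len(outputs) / 2:
            return []                     # no majority, so nobody can be blamed
        restarted: List[int] = []
        for replica, out in outputs.items():
            if out == value:
                self.strikes[replica] = 0
                continue
            self.strikes[replica] = self.strikes.get(replica, 0) + 1
            if self.strikes[replica] >= self.threshold:
                self.restart(replica)
                self.strikes[replica] = 0
                restarted.append(replica)
        return restarted


# Usage: replica 2 disagrees twice in a row and is restarted on the second round.
restart_log: List[int] = []
manager = ReplicaManager(restart=restart_log.append, threshold=2)
manager.observe({0: "A", 1: "A", 2: "B"})
manager.observe({0: "A", 1: "A", 2: "B"})
```

Outputs are compared only for equality, in line with treating data as opaque strings.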

Prototype
Prototype implementation
  – No modification of routing software
  – Simple, trusted hypervisor
  – Built on Linux with XORP and Quagga
Evaluation environment
  – Evaluated on a 3 GHz Intel Xeon
  – BGP trace from Route Views, March 2007
Evaluation metrics
  – Voting delay and fault rate of different voting algorithms
  – Delay of hypervisor

Effectiveness of Voting
Setup
  – 3 XORP and 3 Quagga routing instances
  – Inject bugs of realistic frequency and duration

Voting algorithm       Avg voting delay (sec)    Fault rate
Single router          -                         0.066%
Master-slave                                     %
Continuous-majority                              %

Small Overhead
Small increase in FIB pass-through time
  – Time between receiving an update and the resulting FIB change
  – Delay overhead of just hypervisor is 0.1% (0.06sec)
  – Delay overhead of 5 routing instances is 4.6%
Little effect on network-wide convergence
  – ISP networks from Rocketfuel, and cliques
  – Found no significant change in convergence (beyond the pass-through time)

Conclusion
Seriousness of routing software bugs
  – Cause outages, misbehaviors, vulnerabilities
  – Violate protocol semantics, so not handled by traditional failure detection and recovery
Software and data diversity (SDD)
  – Effective, has reasonable overhead
Design and prototype of bug-tolerant router
  – Works with Quagga and XORP software
  – Low overhead, and small trusted code base

More information at
Thanks! Questions?