Management of Routing Protocols in IP Networks


Management of Routing Protocols in IP Networks
Ph.D. Defense
Aman Shaikh
Computer Engineering, UCSC
November 18, 2003

Introduction
- Internet connects millions of computers
  - Internet is packet-switched: each packet travels independently of the rest
- Routers provide connectivity
  - Routers forward packets so that they reach their ultimate destination
- Forwarding is destination-based and hop-by-hop
  - Router decides next hop (i.e., neighbor router) for each packet based on its destination address
- Routing protocols allow routers to determine next hop(s) for every destination

Management of Routing Infrastructure
- Management of routing infrastructure is a nightmare
- "Simple core (= routing infrastructure), smart edge (= end hosts)" design paradigm
  - Internet only provides a best-effort, connectionless, unreliable service
  - Routing is not designed with manageability in mind
- Large distributed system
  - Hundreds of routers and thousands of links in big service provider networks
  - Variety of routing protocols
- The infrastructure is evolving
  - New services require new protocols and devices

Dissertation Contribution
- Focuses on management of the Open Shortest Path First (OSPF) protocol
  - OSPF is widely used to control routing within service provider and enterprise networks
- Three areas of focus
  - Monitoring
  - Characterization
  - Maintenance

Monitoring
- Motivation:
  - Effective management requires sound monitoring systems
- Contribution:
  - Design and implementation of an OSPF monitor
  - Deployment in two commercial networks
    - Has proved valuable for troubleshooting and for identifying impending problems at an early stage
    - Collection and archiving of OSPF data that is used for performance improvement, post-mortem analysis and further research

Characterization
- Motivation:
  - Need sound simulation and analytical models for scalability studies, addition of new features, etc.
    - How do we parameterize these models?
  - Need vendor-independent benchmarking methods
- Contribution:
  - Black-box techniques for estimating OSPF processing delays within a router
    - Has become the basis for OSPF benchmarking standardization efforts
  - Case study of OSPF dynamics in an enterprise network

Maintenance
- Motivation:
  - Maintenance of routers occurs fairly frequently
    - Protocol enhancements, bug fixes, hardware/software upgrades
  - During maintenance, operators have to withdraw the router undergoing maintenance
    - Leads to route flapping and instability
  - How to perform seamless maintenance?
- Contribution:
  - I'll Be Back (IBB) capability for OSPF
    - Allows "router-under-maintenance" to be used for forwarding

Outline
- Background
  - Routing and OSPF overview
  - Design of an IP router
- Monitoring
  - OSPF Monitor
- Characterization
  - Black-box measurements for OSPF
  - Case study of OSPF dynamics
- Maintenance
  - I'll Be Back (IBB) capability for OSPF
- Conclusions and future work

Routing in the Internet
- Internet is a collection of Autonomous Systems (ASes)
- Two classes of routing protocols
  - IGP (Interior Gateway Protocols)
    - Used within an AS
    - Examples: OSPF, IS-IS, RIP, EIGRP
  - EGP (Exterior Gateway Protocols)
    - Used across ASes
    - Example: BGP
[Figure: five ASes (AS1 to AS5) interconnected by BGP; inside each AS an IGP such as OSPF, IS-IS or RIP is used]

Overview of OSPF
- OSPF is a link-state protocol
  - Every router learns the entire network topology
  - Topology is represented as a graph
    - Routers are vertices, links are edges
    - Every link is assigned a weight through configuration
- Every router uses Dijkstra's single-source shortest path algorithm to build its forwarding table
  - Router builds a Shortest Path Tree (SPT) with itself as root
  - This computation is referred to as the SPF (Shortest Path First) calculation
  - Packets are forwarded along shortest paths defined by link weights
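As an illustration of the shortest-path computation this slide describes, here is a minimal Python sketch. It is not code from the dissertation: the function name `build_forwarding_table`, the router names and the link weights are all hypothetical, and a real OSPF implementation handles many details (multiple areas, equal-cost paths, LSA types) that are ignored here.

```python
import heapq

def build_forwarding_table(graph, root):
    """Run Dijkstra from `root`; return {destination: (cost, next_hop)}.

    `graph` maps each router to {neighbor: link_weight}; weights are positive.
    """
    dist = {}       # settled routers -> cost of shortest path from root
    next_hop = {}   # destination -> neighbor of root that starts the path
    heap = [(0, root, None)]  # (cost so far, router, first hop from root)
    while heap:
        cost, node, first_hop = heapq.heappop(heap)
        if node in dist:
            continue  # already settled with a cost that is no larger
        dist[node] = cost
        if node != root:
            next_hop[node] = first_hop
        for neighbor, weight in graph[node].items():
            if neighbor not in dist:
                # Leaving the root, the first hop is the neighbor itself.
                hop = neighbor if node == root else first_hop
                heapq.heappush(heap, (cost + weight, neighbor, hop))
    return {dest: (dist[dest], next_hop[dest]) for dest in next_hop}

# Hypothetical four-router topology with configured link weights.
graph = {
    "R1": {"R2": 10, "R3": 5},
    "R2": {"R1": 10, "R3": 3, "R4": 1},
    "R3": {"R1": 5, "R2": 3, "R4": 9},
    "R4": {"R2": 1, "R3": 9},
}
table = build_forwarding_table(graph, "R1")
```

In this made-up topology R1 reaches every destination through R3, because the path R1, R3, R2 (cost 8) is cheaper than the direct R1-R2 link (cost 10).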

Areas in OSPF
- OSPF allows a domain to be divided into areas for scalability
  - Areas are numbered 0, 1, 2, ...
  - Hub-and-spoke with area 0 as hub
- Every link is assigned to exactly one area
- Routers with links in multiple areas are called border routers
[Figure: Area 1 and Area 2 attached to Area 0 through border routers]

Summarization with Areas
- Each router learns
  - Entire topology of its attached areas
  - Information about subnets in remote areas and their distance from the border routers
    - Distance = sum of link costs from border router to subnet
[Figure: an OSPF domain with Area 0 (routers R1, R2, R3) and Area 1 (routers C1, C2; subnets 10.10.4.0/24 and 10.10.5.0/24) joined by border routers B1 and B2, with link costs on every link. A second panel, "R1's view", shows the full Area 0 topology but only the distances from B1 and B2 to the two Area 1 subnets.]
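The summarization rule on this slide implies that a router picks, for each remote subnet, the border router that minimizes its own distance to the border router plus the distance that border router advertises. The sketch below illustrates that selection in Python. The function name and all costs are hypothetical and are not read off the slide's figure.

```python
def best_inter_area_routes(dist_to_border, summaries):
    """Return {subnet: (total_cost, border_router)} for remote-area subnets.

    dist_to_border: {border_router: cost from this router to it}
    summaries: {border_router: {subnet: cost advertised by that border router}}
    """
    routes = {}
    for border, advertised in summaries.items():
        for subnet, remote_cost in advertised.items():
            total = dist_to_border[border] + remote_cost
            # Keep the cheapest border router seen so far for this subnet.
            if subnet not in routes or total < routes[subnet][0]:
                routes[subnet] = (total, border)
    return routes

# Hypothetical costs as seen from one Area 0 router.
dist_to_border = {"B1": 100, "B2": 200}
summaries = {
    "B1": {"10.10.4.0/24": 20, "10.10.5.0/24": 70},
    "B2": {"10.10.4.0/24": 10, "10.10.5.0/24": 60},
}
routes = best_inter_area_routes(dist_to_border, summaries)
```

With these assumed numbers B2 is closer to both subnets, but B1 still wins because it is much closer to the router doing the calculation.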

Link State Advertisements (LSAs)
- Every router describes its local connectivity in Link State Advertisements (LSAs)
- Router originates an LSA due to...
  - Change in network topology
    - Example: link goes down or comes up
  - Periodic soft-state refresh
    - Recommended value of the interval is 30 minutes
- LSA is flooded to other routers in the domain
  - Flooding is reliable and hop-by-hop
  - Includes change and refresh LSAs
  - Flooding leads to duplicate copies of LSAs being received
- Every router stores LSAs (self-originated + received) in its link-state database (= topology graph)

Adjacency
- Neighbor routers (i.e., routers connected by a physical link) form an adjacency
- The purpose is to make sure that
  - Link is operational and routers can communicate with each other
  - Neighbor routers have a consistent view of network topology
    - To avoid loops and black holes
- Link gets used for data forwarding only after adjacency is established
- Periodic Hellos are used to monitor the status of link and adjacency

Design of an IP Router
[Figure: router architecture. Control plane: a route processor (CPU) runs the OSPF, BGP and RIP processes, each performing its own routing calculation and feeding a route manager. Data plane: the Forwarding Information Base (FIB), interface cards that forward data packets, and the switching fabric that connects them.]

Outline
- Background
- Monitoring
  - Motivation: Effective management requires sound monitoring systems
  - Contribution: OSPF monitor
    - Design
      - Three components and their functionality
    - Deployment in two commercial networks
      - How OSPF Monitor is being used
      - Lessons learnt through deployment
- Characterization
- Maintenance
- Conclusions and future work

OSPF Monitor: Objectives
- Real-time analysis of OSPF behavior
  - Troubleshooting, alerting
  - Real-time snapshots of OSPF network topology
- Off-line analysis
  - Post-mortem analysis of recurring problems
  - Identify anomaly signatures and use them to predict impending problems
  - Allow operators to tune configurable parameters
  - Improve maintenance procedures
  - Analyze OSPF behavior in commercial networks

Related Work
- Route monitoring
  - Commercial IP monitors
    - Route Dynamics (IPSUM), Route Explorer (PacketDesign)
  - IPMON project at Sprint
    - IS-IS and BGP listeners
  - RouteViews and RIPE
    - Collect BGP updates from several networks
- Topology tracking
  - OSPF topology server [shaikh:jsac02]
    - Evaluation and comparison of LSA-based versus SNMP-based approaches
  - Rocketfuel project at UW Seattle
    - Inference of intra-domain topologies from end-to-end measurements

Components
- Data collection: LSA Reflector (LSAR)
  - Passively collects OSPF LSAs from the network
  - "Reflects" streams of LSAs to LSAG
  - Archives LSAs for analysis by OSPFScan
- Real-time analysis: LSA aGgregator (LSAG)
  - Monitors network for topology changes, LSA storms, node flaps and anomalies
- Off-line analysis: OSPFScan
  - Tools for analysis of LSA archives
  - Post-mortem analysis of recurring problems, performance improvement, what-if analysis, OSPF dynamics

Example
[Figure: an OSPF network with Area 0, Area 1 and Area 2. LSAR 1 and LSAR 2 receive LSAs from the areas, store them in LSA archives, and "reflect" them to LSAG for real-time monitoring. The LSA archives are replicated to a further archive that OSPFScan uses for off-line analysis.]

How LSAR attaches to Network
- Host mode
  - Join multicast group
  - Advantage: completely passive
  - Disadvantage: not reliable, delayed initialization of LSDB
- Full adjacency mode
  - Form full adjacency with a router
  - Advantage: reliable, immediate initialization of LSDB
  - Disadvantage: LSAR's instability can impact entire network
- Partial adjacency mode
  - Keep adjacency in a state that allows LSAR to receive LSAs, but does not allow data forwarding over the link
  - Advantage: reliable, LSAR's instability does not impact entire network, immediate initialization of LSDB
  - Disadvantage: can raise alarms on the router

LSA aGgregator (LSAG)
- Analyzes "reflected" LSAs from LSARs over TCP connections in real time
- Generates console messages:
  - Changes in OSPF network topology
      ADJACENY COST CHANGE: rtr 10.0.0.1 (intf 10.0.0.2) -> rtr 10.0.0.5 old_cost 1000 new_cost 50000 area 0.0.0.0
  - Node flaps
      RTR FLAP: rtr 10.0.0.12 no_flaps 7 flap_window 570 sec
  - LSA storms
      LSA STORM: lstype 3 lsid 10.1.0.0 advrt 10.0.0.3 area 0.0.0.0 no_lsas 7 storm_window 470 sec
  - Anomalous behavior
      TYPE-3 ROUTE FROM NON-BORDER RTR: ntw 10.3.0.0/24 rtr 10.0.0.6 area 0.0.0.0
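The slides do not say how LSAG decides that an LSA storm is in progress. One plausible approach is a sliding-window count per LSA, sketched below in Python. The class name `StormDetector`, the thresholds and the exact message layout are assumptions made for illustration; they are not LSAG's actual implementation or defaults.

```python
from collections import defaultdict, deque

class StormDetector:
    """Flag an LSA as a storm when it arrives too often within a time window."""

    def __init__(self, max_lsas=5, window_sec=600):
        # Both thresholds are hypothetical, chosen only for the example.
        self.max_lsas = max_lsas
        self.window_sec = window_sec
        self._arrivals = defaultdict(deque)  # LSA key -> recent arrival times

    def observe(self, lsa_key, timestamp):
        """Record one arrival; return an alert string or None.

        lsa_key is (lstype, lsid, advrt); timestamp is in seconds.
        """
        times = self._arrivals[lsa_key]
        times.append(timestamp)
        # Drop arrivals that have fallen out of the window.
        while timestamp - times[0] > self.window_sec:
            times.popleft()
        if len(times) > self.max_lsas:
            lstype, lsid, advrt = lsa_key
            span = int(timestamp - times[0])
            return (f"LSA STORM: lstype {lstype} lsid {lsid} advrt {advrt} "
                    f"no_lsas {len(times)} storm_window {span} sec")
        return None

detector = StormDetector(max_lsas=3, window_sec=100)
key = (3, "10.1.0.0", "10.0.0.3")
alerts = [detector.observe(key, t) for t in (0, 10, 20, 30)]
```

Only the fourth arrival exceeds the assumed limit of three LSAs in 100 seconds, so only it yields an alert.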

OSPFScan
- Tools for off-line analysis of LSA archives
  - Parse, select (based on queries), and analyze
- Derivation and analysis of auxiliary information from LSA archives
  - LSAs indicating network topology changes
  - Routing table entries
    - How OSPF routing tables evolved in response to network changes
    - What an end-to-end path within the OSPF domain looked like at any instant
  - Topology changes as graph-based abstraction
    - Vertex addition/deletion and link addition/deletion/change_weight
- Playback of topology change events
  - Essentially an LSAG playback

Deployment
- Deployed in two commercial networks
- Enterprise network
  - 15 areas, 500+ routers; Ethernet-based LANs
  - Deployed since February 2002
  - LSA archive size: 10 MB/day
  - LSAR connection: host mode
- ISP network
  - Area 0, 100+ routers; point-to-point links
  - Deployed since January 2003
  - LSA archive size: 8 MB/day
  - LSAR connection: partial adjacency mode

LSAG in Day-to-day Operations
- Generation of alarms by feeding messages into higher-layer network management systems
  - Correlation and grouping of messages into a single alarm
  - Prioritization of messages
- Validation of maintenance steps and monitoring the impact of these steps on network-wide OSPF behavior
  - Example:
    - Operators change link weights to carry out maintenance activities
    - A "link-audit" web page allows operators to keep track of link weights in real time

Problems Caught by LSAG
- Equipment problem
  - Detected internal problems in a crucial router in the enterprise network
    - Problem manifested as episodes of OSPF adjacency flapping
- Configuration problem
  - Identified assignment of the same router-id to two routers in the enterprise network
- OSPF implementation bug
  - Caught a bug in the refresh algorithm of routers from a particular vendor in the ISP network
    - Bug resulted in a much faster refresh of LSAs than the standards-mandated rate

Long Term Analysis by OSPFScan
- LSA traffic analysis
  - Identified excessive duplicate LSA traffic in some areas of the enterprise network
    - Led to root-cause analysis and preventative steps
- Generation of statistics
  - Inter-arrival time of change LSAs in the ISP network
    - Fine-tuning configurable timers related to SPF calculation
  - Mean down-time and up-time for links and routers in the ISP network
    - Assessment of reliability and availability as the ISP network gears up for deployment of new services

Lessons Learnt through Deployment
- New tools reveal new failure modes
- Real networks exhibit significant activity
  - Maintenance and genuine problems
- Archive all LSAs
  - LSA volume is manageable
- Stability and reliability of the monitor is extremely important
  - Keep data collection separate from its analysis
  - Keep data collector as simple as possible
- Add functionality incrementally and through interaction with users

Summary
- Three-component architecture
  - LSAR: LSA capture from the network
  - LSAG: real-time analysis of LSA stream
    - Detection and troubleshooting of problems
  - OSPFScan: off-line analysis tools for LSA archives
    - Post-mortem analysis of recurring problems, performance improvement, what-if analysis, OSPF dynamics
- Deployed in two commercial networks
  - Has proven a valuable network management tool
  - "OSPF Monitor was a lifesaver" (VP of Networking, enterprise network)
    - Said when the monitor caught an impending failure at an early stage

Outline
- Background
- Monitoring
- Characterization
  - Motivation: Simulation and analytical models, benchmarking
  - Contributions:
    - Black-box techniques for estimating OSPF processing delays on a router
      - Tasks we measure, methodology, results for Cisco and GateD
    - Case study of OSPF dynamics in an enterprise network
- Maintenance
- Conclusions and future work

Black-box Measurements for OSPF
- OSPF processing delays within a router matter!
  - Add up to impact convergence and stability
  - Guidance in tuning configurable parameters, head-to-head vendor comparisons, simulation models
- Instrumenting routing code for measuring delays is challenging
  - Commercial implementations are proprietary
  - May involve grappling with numerous code versions, hardware platforms, and developers
- Use black-box measurements
  - Measure the timing delays using external observations
  - Applied to Cisco and GateD OSPF implementations

Related Work
- White-box measurements for IS-IS [alaettinoglu]
  - SPF delays reported are comparable to the results we obtained
- Empirical analysis of router behavior under large BGP routing tables [chang:imw02]
  - Cisco and Juniper routers
- Benchmarking Methodology working group (bmwg) at IETF
  - Drafts related to OSPF benchmarking
  - Our black-box methods are the basis for some benchmark tests

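What tasks did we measure?
- LSA processing
- LSA flooding
- SPF calculation
- FIB update
[Figure: path of an LSA through the router. The LSA arrives on an interface card and goes to the OSPF process on the route processor (CPU), which processes it, updates its topology view, sends an LS Ack, floods the LSA onward, runs the SPF calculation, and updates the FIB used by the interface cards and switching fabric to forward data packets.]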

Methodology
- Testbed: TopTracker connected to the target router
- Load emulated topology on target router (TopTracker sends LSAs)
- Initiate task of interest
- Measure the time for task

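Measuring Task Time
- Use a black-box method to bracket task start and finish times
- Subtract out intervals that precede and exceed these times
- X = A - (B + C)
[Figure: timeline. A runs from the top bracket event to the bottom bracket event, B from the top bracket event to the task start time, X from the task start time to the task finish time, and C from the task finish time to the bottom bracket event.]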

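Measuring SPF Calculation
- TopTracker: load desired topology
- TopTracker: send initiator LSA, then send duplicate LSA
- Target router: initiator LSA arrives; SPF calculation starts; SPF calculation ends; send ack for duplicate LSA
- TopTracker: ack for duplicate LSA arrives
- X = A - (B + C + D + E), where X is the SPF calculation time and A is the interval from sending the initiator LSA to the arrival of the ack for the duplicate LSA
- Estimate the overhead = B + C + D + E
[Figure: time-sequence diagram between TopTracker and the target router marking the intervals A, B, C, X, D and E]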

Estimating the Overhead
- Remove SPF calculation from the bracket
  - spf_delay = 60 seconds
- TopTracker: send initiator LSA, then send duplicate LSA
- Target router: initiator LSA arrives; initiator LSA processing done; duplicate LSA arrives; duplicate LSA processing done; send ack
- TopTracker: ack for duplicate LSA arrives
- Target router: SPF calculation starts only afterwards
- overhead = B + C + D + E
[Figure: time-sequence diagram between TopTracker and the target router marking the intervals B, C, D and E that make up the overhead]
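The arithmetic behind the two bracketing experiments can be sketched in a few lines of Python: one experiment measures the whole bracket A with the SPF calculation inside it, the other measures only the overhead B + C + D + E, and the difference estimates X. The function name, the sample values and the use of a mean over repeated trials are assumptions for illustration; the slides do not state how trials were aggregated.

```python
from statistics import mean

def estimate_task_time(bracket_with_task, bracket_overhead_only):
    """Estimate X = A - (B + C + D + E) from repeated bracket measurements.

    bracket_with_task: samples of A, the full bracket including the task.
    bracket_overhead_only: samples of the bracket with the task removed
        (for SPF, obtained by delaying the calculation past the bracket).
    """
    return mean(bracket_with_task) - mean(bracket_overhead_only)

# Hypothetical samples in milliseconds.
with_spf = [12.0, 13.0, 11.0]
overhead_only = [2.0, 2.5, 1.5]
spf_time_ms = estimate_task_time(with_spf, overhead_only)
```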

Results
- Results for Cisco GSR, 7513 and GateD
  - For GateD, comparison of black-box results with those obtained using instrumentation (white-box)
- Route processors
  - Cisco: 200 MHz R5000 processor
  - GateD: 500 MHz AMD-K6 processor
- Topology: full n x n mesh with random OSPF edge weights
  - n = 10, 20, ..., 100

Results for Cisco Routers
- Observations
  - Similar results for the two models
  - SPF calculation time is O(n^2)

Results for GateD
- Observations:
  - Black-box over-estimates white-box measurement
  - Black-box captures the characteristics very well

Summary
- Black-box methods for estimating OSPF processing delays
  - Work across a wide range of time delays
  - Work for pure CPU-bound tasks
  - Effective in capturing scaling
  - Match with white-box measurements
- Applied methods to Cisco GSR and 7513
  - LSA processing: 100-800 microseconds
  - LSA flooding: 30-40 milliseconds
    - Pacing timer is the determining factor
  - SPF calculation: 1-40 milliseconds
    - O(n^2) behavior for full n x n mesh
  - FIB update time: 100-300 milliseconds
    - No dependence on topology size

Outline
- Background
- Monitoring
- Characterization
  - Motivation: Simulation and analytical models, benchmarking
  - Contributions:
    - Black-box techniques for estimating OSPF processing delays on a router
    - Case study of OSPF dynamics in an enterprise network
      - Enterprise network topology, categorization of LSA traffic, results
- Maintenance
- Conclusions and future work

Case Study of OSPF Dynamics
- OSPF behavior in commercial networks is not well understood
- Understanding dynamics of LSA traffic is key to a better understanding of OSPF
  - Bulk of OSPF processing is due to LSAs
  - Big impact on OSPF convergence and (in)stability
- Analysis of LSA archives collected by the OSPF monitor in the enterprise network
  - Focus on April 2002 data

Related Work
- Several studies focusing on BGP dynamics in the Internet
  - Relatively easy to collect BGP data
  - BGP is more complicated
- OSPF dynamics in a regional service provider network (MichNet) [watson:icdcs03]
  - One year's worth of data
  - Several findings are similar to our observations
- Analysis of OSPF stability through simulations [basu:sigcomm01]

Enterprise Network
- Provides customers with connectivity to applications and databases residing in a data center
- OSPF network
  - 15 areas, 500 routers
  - This case study covers 8 areas, 250 routers
    - One month: April 2002
  - Ethernet-based LANs
- Customers are connected via leased lines
  - Customer routes are injected via EIGRP into OSPF
  - The routes are propagated via external LSAs

Enterprise Network Topology
- Monitor uses host mode to receive LSAs
[Figure: OSPF domain with Area 0 at the center, where the servers (database, applications) reside, surrounded by Areas A, B and C. A detail of Area A shows customers attached over external EIGRP links, two LANs (LAN 1 and LAN 2), the monitor on a LAN, and border routers B1 and B2 connecting Area A to Area 0.]

Categorizing LSA Traffic
- Refresh LSA traffic
  - Originated due to periodic soft-state refresh
  - Forms baseline LSA traffic
    - Can be predicted using configuration information
- Change LSA traffic
  - Originated due to changes in network topology
    - E.g., link goes down/comes up
  - Allows detection of anomalies and problems
- Duplicate LSA traffic
  - Received due to redundancy in flooding
  - Overhead: wastes resources
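One way to implement this three-way categorization is to compare each received LSA with the copy already held in the link-state database. The Python sketch below is a simplification under stated assumptions: it is not OSPFScan's code, the dictionary fields (`seq`, `body`) are hypothetical, freshness is judged by sequence number alone (OSPF also considers checksum and age), and the first copy of an LSA ever seen is counted as a change.

```python
def classify_lsa(lsdb, lsa):
    """Classify one received LSA as 'refresh', 'change' or 'duplicate'.

    lsdb maps (type, lsid, adv_router) to the most recent LSA seen so far
    and is updated in place.
    """
    key = (lsa["type"], lsa["lsid"], lsa["adv_router"])
    stored = lsdb.get(key)
    if stored is not None and lsa["seq"] <= stored["seq"]:
        # Not newer than the stored copy: a redundant copy from flooding.
        return "duplicate"
    lsdb[key] = lsa
    if stored is not None and lsa["body"] == stored["body"]:
        # Newer instance with identical contents: periodic soft-state refresh.
        return "refresh"
    # New LSA, or newer instance whose contents differ: topology change.
    return "change"

def make_lsa(seq, body):
    # Hypothetical router LSA originated by 10.0.0.1.
    return {"type": 1, "lsid": "10.0.0.1", "adv_router": "10.0.0.1",
            "seq": seq, "body": body}

lsdb = {}
results = [
    classify_lsa(lsdb, make_lsa(1, ("link-A up",))),   # first sighting
    classify_lsa(lsdb, make_lsa(1, ("link-A up",))),   # same instance again
    classify_lsa(lsdb, make_lsa(2, ("link-A up",))),   # newer, same contents
    classify_lsa(lsdb, make_lsa(3, ("link-A down",))), # newer, contents differ
]
```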

LSA Traffic in Different Areas
[Figure: LSA traffic per day in Area 2, Area 3 and Area 4 (x-axis: days), broken down into refresh LSAs, change LSAs and duplicate LSAs. Annotations mark a "genuine anomaly" and an "artifact: 23 hr day (Apr 7)".]

Baseline LSA Traffic: Refresh LSAs
- Refresh LSA traffic can be reliably predicted using router configuration files
  - Important for workload generation
[Figure: refresh LSA traffic per day for Area 2 and Area 3 (x-axis: days)]
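A rough version of that prediction needs only the number of LSAs in an area (derivable from configuration) and the refresh interval. The Python sketch below assumes every LSA is refreshed exactly once per interval and uses the recommended 30-minute interval by default. The function name and the LSA count are hypothetical, and real routers may refresh at a different rate.

```python
def expected_refresh_lsas(num_lsas, period_sec=24 * 3600, refresh_interval_sec=30 * 60):
    """Expected number of refresh LSAs in a period, one refresh per LSA per interval."""
    return num_lsas * period_sec / refresh_interval_sec

# Hypothetical area whose routers originate 100 LSAs in total.
per_day = expected_refresh_lsas(100)
per_short_day = expected_refresh_lsas(100, period_sec=23 * 3600)
```

Under these assumptions each LSA contributes 48 refreshes on a 24-hour day and 46 on a 23-hour day, so a shorter calendar day shows up as a dip in the daily count.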

Refresh process is not synchronized
- No evidence of synchronization
  - Contrary to simulation-based study [basu:sigcomm01]
- Reasons
  - Changes in the topology help break synchronization
  - LSA refresh at one router is not coupled with LSA refresh at other routers
  - Drift in the refresh interval of different routers

Change LSAs
- Internal to OSPF domain versus external
- Change LSAs due to external events dominated
  - Not surprising given the large number of leased lines and the import of customer routes into OSPF
  - Customer volatility → network volatility
[Figure: change LSAs per day, internal versus external (x-axis: days)]

Root Causes of Change LSAs
- Persistent problem → flapping → numerous change LSAs
- Internal LSA spikes → hardware router problems
  - OSPF monitor identified a problem (not visible to other network management tools) early and led to preventive maintenance
- External LSA spikes → customer route volatility
  - Overload of an external link to a customer between 9 PM and 3 AM caused the EIGRP session to flap
- Link flaps

Overhead: Duplicate LSAs
- Why do some areas witness substantial duplicate LSA traffic, while other areas do not witness any?
  - OSPF flooding over LANs leads to control plane asymmetries and to imbalances in duplicate LSA traffic
[Figure: duplicate LSA traffic per day (x-axis: days)]

Summary
- Refresh LSAs: constituted the bulk of overall LSA traffic
  - No evidence of synchronization between different routers
  - Refresh LSA traffic predictable from configuration information
- Change LSAs: mostly indicated persistent yet partial failure modes
  - Internal LSA spikes → hardware router problems → preventive router maintenance
  - External LSA spikes → customer congestion problems → "preventive" customer care
- Duplicate LSAs: arose from control plane asymmetries
  - Simple configuration changes could eliminate duplicate LSAs and improve performance

Outline
- Background
- Monitoring
- Characterization
- Maintenance
  - Motivation: Seamless maintenance and upgrades of routers
    - Minimal instability and flaps
  - Contribution: I'll Be Back (IBB) capability for OSPF
    - What the IBB capability provides, how the capability is implemented, performance analysis
- Conclusions and future work

Maintenance is a Pain
- Maintenance of routers is a way of life in commercial networks
  - Extensions to routing protocols, new functionality, hardware and software upgrades, bug fixes
- Maintenance is a painful exercise
  - During maintenance, operators withdraw "router-under-maintenance" from forwarding service
    - Leads to route flaps, traffic disruption and instability
  - Operators have to carefully schedule maintenance
    - Schedule it during the night when load is moderate
    - Stagger maintenance of different routers across time

We can do better
- Observation: a router can continue forwarding even while its routing process is inactive, at least for a while
  - Current routers have separate routing and forwarding paths
    - Routing in software (CPU)
    - Forwarding in hardware (switching)
- Need to extend routing protocols since they always try to route around an inactive router
- Our proposal: IBB (I'll Be Back) extensions to OSPF

IBB Proposal in a Nutshell
- OSPF process on router R needs to be shut down
- Before shutdown, R informs other routers that it is going to be inactive for a while
  - R specifies a time period (IBB Timeout) by which it expects to become operational again
- Other routers continue using R for forwarding during the IBB Timeout period
- If R comes back within the IBB Timeout period, there is no routing instability and there are no flaps
  - Else other routers start forwarding packets around R

Related Work
- Graceful restart proposals for various routing protocols at IETF
  - Graceful restart proposal for OSPF by John Moy
- Alex Zinin's proposal to avoid flaps upon restart of the OSPF process
  - Process has to come up before other routers notice it was shut down
  - Provides a small window of opportunity
- Use of redundant route processors and seamless transfer of control
  - NSR (Avici), High Availability Initiative (Cisco)

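What if topology changes
- R cannot update its forwarding table to reflect the change
  - Can lead to loops or black holes
[Figure: routers A, B and R connected by weighted links. (a) Topology when R went down: link weights 3, 2 and 6. (b) Topology changes while R is inactive: the weight-3 link now has weight 10; the other weights remain 2 and 6.]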

Handling Changes: Three Options
- Don't do anything
- Stop using R: John Moy's proposal
  - Inadvertent changes during upgrade are likely
    - Example: flapping due to a bad interface somewhere
  - But not all changes are bad
    - They do not always lead to loops or black holes
- Stop using R only when a loop or black hole gets formed
  - And only for destinations for which there is a problem
  - Our approach

Roadmap of Algorithm
- Single area, single inactive router case
  - Loop formation
  - Black hole formation
- Single area, multiple inactive routers case
- Multiple areas
  - Black hole formation and area partitions

Single Area, Single Inactive Router: Problem Formulation
- Inactive router = R
- All routers other than R have the same image of the topology graph
  - R's image is that of the past, i.e., of the time at which it went down
- Source = S, Destination = D
- Next hop(R, D) = Y
- Actual path a packet takes from S to D = P(SD)

Loop Detection P(SD) has a loop iff S and Y have R on their paths to D in their SPTs D R 3 2 6 Topology when R went down S 1 Y 20 D R 10 2 6 S 1 Y Topology changes while R is inactive 20 Y R D 2 6 S and Y have R on their paths to D in their SPT S 1 If there is a loop, neighbor can always detect it Ph.D. Defense

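Loop Prevention
- Every router needs to calculate a path to D such that R does not appear on it
[Figure: two panels. "Changed topology while R is inactive" shows routers S, R, Y and D with link weights 1, 2, 6, 10 and 20. The second panel shows the paths S and Y calculate to D without R on them: S to D with cost 20 and Y to D with cost 10.]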

Loop Avoidance Procedure
- R sends its forwarding table to neighbors before shutdown
  - Thus, Y knows that next hop(R, D) is Y
- Detection: during SPF calculation, neighbors detect loops
  - Y checks whether R exists on its path to D
- Upon detection, neighbors send avoid messages to other routers in the domain
  - avoid(R, D) = avoid using R for reaching D
- Prevention: upon receiving an avoid(R, D) message, other routers calculate a new path to D without R on it
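The prevention step amounts to a shortest-path calculation with the inactive router excluded for the affected destination. A minimal Python sketch follows; the function name `shortest_path_avoiding` and the example topology are hypothetical (the weights are consistent with the numbers on the preceding slides, with an assumed assignment to links), and the real IBB extensions in GateD are certainly structured differently.

```python
import heapq

def shortest_path_avoiding(graph, src, dst, avoid=()):
    """Dijkstra that never visits routers in `avoid`.

    Returns (cost, path) or None if dst is unreachable without them.
    """
    done = set(avoid)  # pre-marking routers as visited removes them from the search
    heap = [(0, src, [src])]
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == dst:
            return cost, path
        if node in done:
            continue
        done.add(node)
        for nbr, w in graph[node].items():
            if nbr not in done:
                heapq.heappush(heap, (cost + w, nbr, path + [nbr]))
    return None

# Assumed topology after the change, while R is inactive.
topology = {
    "S": {"R": 1, "D": 20},
    "R": {"S": 1, "Y": 2, "D": 6},
    "Y": {"R": 2, "D": 10},
    "D": {"R": 6, "Y": 10, "S": 20},
}

normal = shortest_path_avoiding(topology, "S", "D")
s_after_avoid = shortest_path_avoiding(topology, "S", "D", avoid={"R"})
y_after_avoid = shortest_path_avoiding(topology, "Y", "D", avoid={"R"})
```

Because avoid(R, D) is per destination, a router would apply the exclusion only when computing its route to D and keep using R for destinations that are unaffected.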

Performance
- Maximum effect on SPF calculation
  - Quantify overhead
  - Impact of topology size
- Prototype implementation
  - IBB extensions incorporated into GateD 4.0.7

Testbed Setup
- SUT (System Under Test) = where IBB overhead is measured
[Figure, physical topology: TopTracker and the SUT attached to the same LAN; TopTracker sends LSAs to the SUT.]
[Figure, SUT's view of the topology: the SUT connects to the router under maintenance, R, which attaches to an emulated topology (a complete graph with n nodes, including X and M1); link weights 1 and 20 are shown.]

Experiment Sequence
- Two runs: GateD on SUT and IBB-GateD on SUT
- T = 0 min: bring R down (GateD) / bring R down in IBB mode (IBB-GateD)
  - Case A: inactive router
- T = 4 min: send avoid(R, Mj) messages to SUT (1 ≤ j ≤ n)
  - Case B: inactive router, avoid it
- T = 8 min: bring R up
- Overhead = (mean SPF time in Case B) / (mean SPF time in Case A)

Result
- Overhead remains constant at roughly 2.0 as n increases
- Sources of overhead:
  - Second SPF calculation
  - Graph in Case B is larger than graph in Case A

Summary
- IBB proposal: extend OSPF so that a router can be used for forwarding even while its OSPF process is inactive
- Main contribution: algorithm that gracefully handles topology changes
  - Stops using the inactive router for a destination if using the router can lead to loops or black holes
- Overhead of the algorithm is modest
  - Shows good scaling behavior in terms of topology size

Outline
- Background
- Monitoring
- Characterization
- Maintenance
- Conclusions and future work

Conclusions
- Monitoring
  - Design and implementation of an OSPF monitor
  - Deployment in two commercial networks
- Characterization
  - Black-box techniques for estimating OSPF processing delays within a router
  - Case study of OSPF dynamics in an enterprise network
- Maintenance
  - I'll Be Back (IBB) capability for OSPF that allows a "router-under-maintenance" to be used for forwarding

Future Work
Three principal directions for future work:
- Application of this work to other routing protocols
  - IS-IS is very similar to OSPF
  - EIGRP, RIP and BGP bring their own set of challenges
    - Distance-vector nature of the protocols
    - BGP also brings scalability issues
- Other areas related to routing and network management
  - Security, network design, configuration management, simulation & modeling
  - How performance of routing infrastructure affects user-perceived performance
- More work in each of the three focus areas

Future Work for Monitoring
- Real-time analysis
  - More meaningful alerting
    - Correlation with other fault and performance data
    - Learn from past events
    - Prioritization of alerts
- Off-line analysis
  - Correlation with other data sources
    - Work already underway: BGP, fault, performance
  - Identification of problem signatures and feeding them into the real-time component for problem prediction

Future Work for Characterization
- Expand measurements to cover other router vendors and commercial networks
- Use results to build simulation and analytical models
  - Validation of models

Future Work for Maintenance
- Improvements to IBB scheme
  - Incremental deployment
  - Reduction in overhead
- How to use IBB-like schemes in conjunction with other approaches
  - Routing software that can be upgraded without bringing the process down
  - Use of redundant route processors and seamless transfer of control
- Scheduling maintenance tasks such that they have minimal impact

Holy Grail
- Networks that manage themselves!

Q and A
- Grill me ... Probably your last chance… :-)

Backups

Partial Adjacency for LSAR
[Figure: message exchange between R and LSAR with labels "I have LSA L", "I need LSA L from LSAR", and a repeated "Please send me LSA L"; the adjacency is marked "Partial state".]
- Router R does not advertise a link to LSAR
  - Routers (except R) are not aware of the presence of LSAR
  - Does not trigger SPF calculations in the network
- LSAR’s going up/down does not impact the network
- LSAR does not originate any LSAs
- LSAR-R link is not used for data forwarding
  - LSAR does not install any routes in its forwarding table

Multiple Inactive Routers for IBB
- Loop avoidance
  - Change in loop detection conditions
  - Simplification for loop prevention
- No change in black-hole detection

Loop Avoidance
- Set of inactive routers: R1, R2, …, Rn
- Loop avoidance procedure applies for each inactive router
- Detection
  - Router detects loops for all its inactive neighbors
- Prevention
  - A router can get avoid(Ri, D) messages for j inactive routers (j <= n)
  - The router avoids these j forbidden routers on its path to D
- Problem: the set of forbidden routers can be different for different destinations
  - O(n) shortest path calculations (n = number of vertices)
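
To see why per-destination forbidden sets are costly, the sketch below groups received avoid(Ri, D) messages by destination and counts how many distinct forbidden sets result; each distinct set would need its own shortest path calculation. The message format (a list of `(router, destination)` tuples) and the function names are assumptions made for illustration.

```python
from collections import defaultdict

def forbidden_sets(avoid_messages):
    """Map each destination D to the set of routers named in avoid(Ri, D) messages."""
    forbidden = defaultdict(set)
    for router, dest in avoid_messages:
        forbidden[dest].add(router)
    return {dest: frozenset(routers) for dest, routers in forbidden.items()}

def distinct_forbidden_sets(avoid_messages):
    """Number of different forbidden sets, i.e. of separate SPF runs that a
    naive per-destination prevention step would need."""
    return len(set(forbidden_sets(avoid_messages).values()))

# Hypothetical avoid messages received by one router.
messages = [("R1", "D1"), ("R2", "D1"), ("R1", "D2"), ("R2", "D3"), ("R1", "D4")]
runs = distinct_forbidden_sets(messages)
```

Here D2 and D4 share the forbidden set {R1}, so three distinct sets remain. In the worst case every destination has its own set, which is where the O(n) bound comes from.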

Simplification
- Router avoids all inactive routers if it has some forbidden routers on its path to D
- Calculate two SPTs:
  - SPT with all inactive routers on it
  - SPT w/o any inactive router on it
- If the path to D does not contain any forbidden routers, pick next hop for D from the first SPT
- Else, pick next hop for D from the second SPT
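
The two-SPT rule can be sketched as follows. This is an illustrative reading of the slide, not the thesis implementation: the graph layout, the names `shortest_paths` and `next_hops`, and the example topology with inactive routers R1 and R2 are hypothetical.

```python
import heapq

def shortest_paths(graph, source, excluded=frozenset()):
    """Dijkstra: return {dest: [source, ..., dest]}, never visiting excluded nodes."""
    dist, prev, done = {source: 0}, {}, set()
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        for v, cost in graph.get(u, {}).items():
            if v in excluded or v in done:
                continue
            if v not in dist or d + cost < dist[v]:
                dist[v] = d + cost
                prev[v] = u
                heapq.heappush(heap, (dist[v], v))
    paths = {}
    for dest in done:
        path = [dest]
        while path[-1] != source:
            path.append(prev[path[-1]])
        paths[dest] = path[::-1]
    return paths

def next_hops(graph, me, inactive, forbidden):
    """forbidden maps a destination D to the routers named in avoid(Ri, D) messages."""
    first = shortest_paths(graph, me)                      # SPT with inactive routers
    second = shortest_paths(graph, me, excluded=inactive)  # SPT without any of them
    table = {}
    for dest, path in first.items():
        if dest == me:
            continue
        if forbidden.get(dest, frozenset()).isdisjoint(path):
            table[dest] = path[1]            # no forbidden router on the path
        elif dest in second:
            table[dest] = second[dest][1]    # avoid all inactive routers
        # otherwise D is unreachable without inactive routers: no entry
    return table

# Hypothetical topology: R1 and R2 are inactive, avoid(R1, D1) was received.
graph = {
    "A": {"R1": 1, "R2": 1, "B": 3},
    "R1": {"A": 1, "D1": 1},
    "R2": {"A": 1, "D2": 1},
    "B": {"A": 3, "D1": 3, "D2": 3},
    "D1": {"R1": 1, "B": 3},
    "D2": {"R2": 1, "B": 3},
}
table = next_hops(graph, "A", inactive={"R1", "R2"}, forbidden={"D1": {"R1"}})
```

Only two SPF runs are needed regardless of how many destinations carry forbidden routers. The price is that a destination with any forbidden router on its path gives up all inactive routers, not just the forbidden ones.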

Multiple Inactive Routers: Loop Detection
- The loop detection condition for a single inactive router cannot detect all loops when multiple routers are inactive
- Two new conditions for loop detection by neighbors
  - Generalization of loop detection for a single inactive router
  - Conditions can result in false positives
- Evaluation using realistic OSPF topology graphs with two inactive routers
  - Using the two conditions together eliminates most false positives (90% hit-rate), but not all...

Publications
- Aman Shaikh, Mukul Goyal, Albert Greenberg, Raju Rajan and K.K. Ramakrishnan, An OSPF Topology Server: Design and Evaluation, IEEE J-SAC, 20(4), May 2002.
- Aman Shaikh and Albert Greenberg, OSPF Monitoring: Architecture, Design, and Deployment Experience, submitted to NSDI, 2004.
- Aman Shaikh and Albert Greenberg, Experience in Black-box OSPF Measurement, In Proc. ACM SIGCOMM IMW, pp. 113-125, November 2001.
- Aman Shaikh, Chris Isett, Albert Greenberg, Matthew Roughan and Joel Gottlieb, A Case Study of OSPF Behavior in a Large Enterprise Network, In Proc. ACM SIGCOMM IMW, pp. 217-230, November 2002.
- Aman Shaikh, Rohit Dube and Anujan Varma, Avoiding Instability during Graceful Shutdown of OSPF, In Proc. IEEE INFOCOM, June 2002.
- Aman Shaikh, Rohit Dube and Anujan Varma, Avoiding Instability during Graceful Shutdown of Multiple OSPF Routers, submitted to IEEE/ACM Transactions on Networking (ToN).