Congestion Avoidance & Control for OSPF Networks (draft-ash-manral-ospf-congestion-control-00.txt)
Jerry Ash, AT&T; Gagan Choudhury, AT&T


1 Congestion Avoidance & Control for OSPF Networks (draft-ash-manral-ospf-congestion-control-00.txt)
Jerry Ash, AT&T
Gagan Choudhury, AT&T
Vishwas Manral, NetPlane Systems
Anurag Maunder, Sanera Systems
Vera Sapozhnikova, AT&T
Mostafa Hashem Sherif, AT&T

2 Outline (draft-ash-manral-ospf-congestion-control-00.txt)
- problem:
  - concerns over scalability of IGP link-state protocols (e.g., OSPF)
  - much evidence that LS protocols cannot recover from large failures & widespread loss of topology database information
    - failure experience
    - vendor analysis
    - simulation & modeling
- propose protocol mechanisms to address problem
  - throttle LSA updates/retransmissions
    - detect & notify congestion state
    - neighbor nodes throttle LSA updates/retransmissions
  - keep adjacencies up
  - database backup & resynchronization
  - proprietary implementations of these mechanisms have improved scalability/stability
    - need standard features for uniform implementation & interoperability
- issues discussed on list

3 Background & Motivation
- failure experience
  - LS routing protocols cannot recover from large ‘flooding storms’
    - triggered by wide range of causes: network failures, bugs, operational errors, etc.
    - flooding storm overwhelms processors, causes database asynchrony & incorrect shortest path calculation, etc.
  - AT&T has experienced several very large LS protocol failures (4/13/1998, 7/2000, 2/20/2001, described in I-D)
- vendor analysis of LS protocol recovery from total network failure (loss of all database information in the specified scenario, 400 nodes, etc.)
  - recovery time estimates up to 5.5 hours
  - expectation is that vendor equipment recovery not adequate under large failure scenario
- network-wide event simulation model [choudhury]
  - medium to large flooding storms cause network to recover with difficulty and/or not recover at all
  - model validated -- results match actual network experience

4 Failure Experience: AT&T Frame Relay Network, 4/13/98
- cause & effect
  - administrative error coupled with a software bug
  - result was the loss of all topology database information
  - the link-state protocol then attempted to recover the database with the usual Hello & topology state updates (TSUs)
  - huge overload of control messages kept network down for very long time
- several problems occurred to prevent the network from recovering properly (based on root-cause analysis)
  - very large number of TSUs being sent to every node to process, causing general processor overload
  - route computation based on incomplete topology recovery; routes generated based on transient, asynchronous topology information & then in need of frequent re-computation
  - inadequate work queue management to allow processes to complete before more work is put into the process queue
  - inability to access node processors with network management commands due to lack of necessary priority of these messages
- worked with vendor to make protocol fixes to address problems
  - along the lines suggested in the I-D

5 Proposed Protocol Mechanisms: Throttle LSA Updates/Retransmissions
- detect node congestion by
  - length of internal work queues
  - high processor occupancy & long CPU busy times
- notify congestion state to other nodes
  - use TBD packet to convey congestion signal
- when a node detects congestion from a neighbor
  - progressively decrease flooding rate (see the sketch after this slide), e.g.
    - double LSA_RETRANSMIT_INTERVAL for low congestion
    - quadruple LSA_RETRANSMIT_INTERVAL for high congestion
- simulation analysis shows proposed mechanisms perform effectively (Choudhury)
- deals better with non-linear failure modes than statistical detection/notification methods
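A minimal sketch of the back-off behavior described on this slide, not the draft's normative procedure: it shows one way a neighbor could scale its LSA retransmission interval from a received congestion signal, and how a node might classify its own congestion from local measures. The threshold values and the CongestionLevel names are assumptions made for the example.

```python
from enum import Enum

LSA_RETRANSMIT_INTERVAL = 5.0  # seconds; nominal retransmission interval


class CongestionLevel(Enum):
    NONE = 0
    LOW = 1
    HIGH = 2


def retransmit_interval(neighbor_level: CongestionLevel) -> float:
    """Progressively back off LSA retransmissions toward a congested neighbor."""
    if neighbor_level is CongestionLevel.LOW:
        return 2 * LSA_RETRANSMIT_INTERVAL   # double under low congestion
    if neighbor_level is CongestionLevel.HIGH:
        return 4 * LSA_RETRANSMIT_INTERVAL   # quadruple under high congestion
    return LSA_RETRANSMIT_INTERVAL


def local_congestion(work_queue_len: int, cpu_busy_fraction: float) -> CongestionLevel:
    """Detect congestion from internal work-queue length and CPU occupancy
    (made-up thresholds); the result would be carried in the TBD packet."""
    if work_queue_len > 10_000 or cpu_busy_fraction > 0.95:
        return CongestionLevel.HIGH
    if work_queue_len > 1_000 or cpu_busy_fraction > 0.80:
        return CongestionLevel.LOW
    return CongestionLevel.NONE
```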

6 Issues Discussed on List
- is there a problem (need to prevent catastrophic network collapse)?
  - most seem to agree there is a problem
  - several have observed ‘LSA storms’ & their ill effects
    - storms triggered by hardware failure, software bug, faulty operational practice, etc., many different events
    - sometimes network cannot recover
    - unacceptable to operators
  - vendors invited to analyze failure scenario given in draft
    - no response yet
- how to solve problem
  - better/smarter implementation/coding of protocol within current specification
    - e.g., ‘never losing an adjacency solves problem’
    - these are proprietary, single-vendor implementation extensions
  - standard protocol extensions
    - for uniform implementation
    - for multi-vendor interoperability
    - already demonstrated with proprietary, single-vendor implementations

7 Issues Discussed on List
- what protocol extensions?
  - not just ‘signaling congestion message on the wire’ but also the response
    - need uniform response to the congestion signal (‘slow down by this much’) to be effective
    - rather than an ‘implementation dependent’ response
    - like helper router response to ‘grace LSA’ from congested router in hitless restart
- how to evaluate effectiveness of proposals
  - expert analysis based on experience
  - simulation
    - a couple of ‘academic’ & ‘shaky simulation’ comments
    - validated simulations used widely for network design of routing features, network management features, congestion control, etc. for many years; many large-scale network design examples (e.g., ‘Dynamic Routing in Telecommunications Networks’, McGraw Hill)
  - ‘white-box’ approach
    - implement & test in the lab
  - expert analysis, simulation, white-box all useful

8 Issues Discussed at IETF-55 Routing Area Meeting & MPLS WG Meeting
- box builders' view:
  - ‘stop intruding into our box’
  - design choices should be made by box builders
  - nothing wrong with current way of building boxes
- box users' view:
  - still observe major failures
    - most agree there is a problem (from list discussion)
  - box-builder/vendor analysis shows unacceptable failure response (in draft)
    - box-builders/vendors invited to analyze scenario in draft
  - box-builders' approach doesn't work to prevent failures
  - boxes need a few, critical, standard protocol mechanisms to address problem
  - have gotten vendors to make proprietary changes to fix problem
  - require standard protocol extensions
    - for uniform implementation
    - for multi-vendor interoperability
- user requirements need to drive solution to problem

9 Conclusions
- problem:
  - concerns over scalability of IGP link-state protocols
  - evidence that LS routing protocols (e.g., OSPF) currently cannot recover from large failures & widespread loss of topology database information
  - problem is flooding, database asynchrony, shortest path calculation, etc.
  - evidence based on failure experience, vendor analysis, simulation & modeling
- propose protocol mechanisms to address problem, e.g.
  - throttle LSA updates/retransmissions
    - detect & notify congestion state
    - neighbor nodes throttle LSA updates/retransmissions
- simulation analysis shows effectiveness of proposed changes (Choudhury)
- propose draft as an OSPF WG document
  - refine/evolve proposed protocol extensions

10 Backup Slides

11 Proposed Congestion Control Mechanisms
- throttle LSA updates/retransmissions
  - detect & notify congestion state
  - congested node signals other nodes to limit rate of LSA messages sent to it
  - neighbor nodes throttle LSA updates/retransmissions
    - automatically reduce rate under congestion
- keep adjacencies up
- database backup & resynchronization
  - topology database automatically recovered from loss based on local backup mechanisms
  - allows a node to recover gracefully from local faults on the node
- prioritized processing of Hello & LSA Ack messages (Choudhury draft)

12 Keep Adjacencies Up
- increase adjacency break interval under congestion
  - goal is to avoid breaking adjacencies by increasing the wait interval for non-receipt of Hello messages
    - if a node detects congestion from a neighbor & no packet is received within NODE_DEAD_INTERVAL
    - wait an additional ADJACENCY_BREAK_INTERVAL before declaring the adjacency down
- throttle setups of link adjacencies
  - define MAX_ADJACENCY_BUILD_COUNT = maximum number of adjacencies a node can bring up at one time (see the sketch after this slide)
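A minimal sketch of the two checks on this slide, not draft text: the dead interval extended by an extra grace period when the neighbor has signalled congestion, and a cap on concurrent adjacency setups. The timer and count values are assumptions for the example.

```python
NODE_DEAD_INTERVAL = 40.0          # seconds of Hello silence before the adjacency is suspect
ADJACENCY_BREAK_INTERVAL = 120.0   # extra grace period while the neighbor is congested
MAX_ADJACENCY_BUILD_COUNT = 4      # adjacencies a node may bring up concurrently


def adjacency_should_drop(seconds_since_last_packet: float, neighbor_congested: bool) -> bool:
    """Declare the adjacency down only after the dead interval, extended by
    ADJACENCY_BREAK_INTERVAL when the neighbor has signalled congestion."""
    limit = NODE_DEAD_INTERVAL
    if neighbor_congested:
        limit += ADJACENCY_BREAK_INTERVAL
    return seconds_since_last_packet > limit


def may_start_adjacency(adjacencies_currently_forming: int) -> bool:
    """Throttle adjacency setups so a recovering node is not swamped."""
    return adjacencies_currently_forming < MAX_ADJACENCY_BUILD_COUNT
```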

13 Database Backup & Resynchronization
- database backup
  - node should provide a local, primary, nonvolatile memory backup [GR-472-CORE]
  - node should back up all non-self-originated LSAs, routing tables, & states of interfaces
  - database should be backed up at least every 5 minutes
  - restoration of data should be completed within 5 minutes of initiation [GR-472-CORE] (see the sketch after this slide)
- nodes signal neighbors when ‘safe’ to perform resynchronization procedures
  - based on TBD packet format
- under resynchronization, node
  - should generate all its own LSAs
  - should receive only LSAs that have changed between the time it failed & the current time
  - should base its routing on the current database, derived as above
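A minimal sketch of the periodic local backup described above, not an implementation of [GR-472-CORE]: a snapshot of the non-self-originated LSAs, routing table, and interface states written to local storage at least every 5 minutes. The file format, field names, and data layout are assumptions for the example.

```python
import json
import time

BACKUP_INTERVAL = 300  # seconds ("at least every 5 minutes")


def backup_database(lsdb: dict, routing_table: dict, interfaces: dict,
                    path: str = "ospf_db_backup.json") -> None:
    """Write a snapshot to local nonvolatile storage. Self-originated LSAs are
    excluded because the node regenerates them itself after recovery."""
    snapshot = {
        "timestamp": time.time(),
        "lsas": {lsid: lsa for lsid, lsa in lsdb.items()
                 if not lsa.get("self_originated", False)},
        "routing_table": routing_table,
        "interfaces": interfaces,
    }
    with open(path, "w") as f:
        json.dump(snapshot, f)
```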

14 Database Backup & Resynchronization
- database resynchronization
  - propose changes to receiving/transmitting Database Summary & LSA Request packets
  - when in Full state
    - node sends & receives Database Summary & LSA Request packets as if performing database synchronization, i.e., as when the peer data structure is in the Negotiating, Exchanging, & Loading states
  - node informs neighbor when to use resync procedures
  - node supports a resync request from a neighbor by receiving/transmitting Database Summary & LSA Request packets (see the sketch after this slide)
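An illustrative sketch only: a guard for accepting Database Summary / LSA Request packets while an adjacency is already in Full state, as the resynchronization described above would require. The state names follow the slide; the function and flag names are assumptions, not part of the draft.

```python
from enum import Enum, auto


class NeighborState(Enum):
    DOWN = auto()
    NEGOTIATING = auto()
    EXCHANGING = auto()
    LOADING = auto()
    FULL = auto()


def may_exchange_db_packets(state: NeighborState, resync_signalled: bool) -> bool:
    """During normal bring-up, Database Summary / LSA Request packets are
    accepted in the Negotiating, Exchanging, and Loading states; with resync
    signalled by the neighbor, they are also accepted in Full state."""
    if state in (NeighborState.NEGOTIATING, NeighborState.EXCHANGING, NeighborState.LOADING):
        return True
    return state is NeighborState.FULL and resync_signalled
```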

15 Failure Experience
- other failures have occurred with similar consequences
- moderate TSU storm following ATM node upgrades, 7/2000
  - network recovered, with difficulty
- large TSU storm in ATM network, 2/20/2001 [pappalardo1, pappalardo2]
  - manual procedures required to reduce TSU flooding & stabilize network
  - desirable to automate procedures for TSU flooding reduction under overload
- worked with vendor to make protocol fixes to address problems
  - along the lines suggested in the I-D
- other relevant LS-network failures have been reported [cholewka, jander]
- conclusions
  - LS protocols vulnerable to loss of database information, control overload to re-sync databases, & other failure/overload scenarios
  - networks more vulnerable in absence of adequate protection mechanisms
  - generic problem of LS protocols
    - across a variety of implementations
    - across FR, ATM, & IP-based technologies

16 Vendor Analysis
- vendors & service providers asked to analyze LS protocol recovery from total network failure (loss of all database information in the specified scenario)
- network scenario
  - 400-node network
    - 100 backbone nodes
    - 3 edge nodes per backbone node (edge nodes single-homed)
  - backbone nodes connected to a max of 10 other backbone nodes
    - max node adjacency is 13
    - sparse network
  - 101 peer groups
    - 1 backbone peer group with 100 backbone nodes
    - 100 edge peer groups, each with 3 nodes, all homed on the backbone peer group
  - 1,000,000 addresses advertised

17 Vendor Analysis
- projected recovery times
  - Recovery Time Estimate A: 3.5 hours
  - Recovery Time Estimate B: 5-15 minutes
  - Recovery Time Estimate C: 5.5 hours
- expectation is that vendor equipment recovery not adequate under large failure scenario

18 Analysis Modeling
- various studies published [atmf, maunder, choudhury]
- [choudhury] reports a network-wide event simulation model (a toy illustration follows this slide)
  - study impact of a TSU storm
  - captures
    - node congestion
    - propagation delay between nodes
    - retransmissions if TSU not acknowledged within 5 seconds
    - link declared down if Hello delayed beyond “node-dead interval” (aka “inactivity timer” in PNNI, “router-dead interval” in OSPF)
    - link recovery following database synchronization
  - approximates real network behavior & processing times
  - results show
    - dispersion -- number of control packets generated but not processed in at least one node
    - medium to large TSU storms cause network to recover with difficulty and/or not recover at all
    - results match actual network experience
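A toy illustration of this kind of model, not the [choudhury] simulator: a single-node queueing sketch showing how a TSU storm can push processing latency past the 5-second acknowledgement timeout, so that retransmissions add still more work to an already congested node. The per-TSU processing time and storm sizes are made-up parameters.

```python
import heapq

TSU_PROC_TIME = 0.002       # seconds of processor time per TSU (assumed)
RETRANSMIT_TIMEOUT = 5.0    # retransmit if a TSU is not acknowledged within 5 s


def simulate_storm(storm_size: int, sim_time: float = 60.0) -> int:
    """Feed a burst of TSUs into one node's FIFO work queue and count the
    retransmissions generated because acknowledgements come back too late."""
    events = [(0.0, i, "tsu") for i in range(storm_size)]  # whole storm arrives at t = 0
    heapq.heapify(events)
    cpu_free_at = 0.0
    retransmissions = 0
    next_id = storm_size
    while events:
        arrival, _ident, kind = heapq.heappop(events)
        if arrival > sim_time:
            break
        start = max(arrival, cpu_free_at)       # waits in the FIFO work queue
        cpu_free_at = start + TSU_PROC_TIME     # then consumes processor time
        if kind == "tsu" and cpu_free_at - arrival > RETRANSMIT_TIMEOUT:
            # sender times out before the ack arrives and retransmits,
            # adding still more work to the congested node
            retransmissions += 1
            heapq.heappush(events, (arrival + RETRANSMIT_TIMEOUT, next_id, "retx"))
            next_id += 1
    return retransmissions


if __name__ == "__main__":
    for size in (1_000, 5_000, 20_000):
        print(size, "TSUs ->", simulate_storm(size), "retransmissions")
```

Even this toy model shows the non-linear behavior the slide describes: below a certain storm size no retransmissions occur, while beyond it the retransmission load grows quickly with storm size.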

19 Impact of TSU Storm on Network Stability