1 Congestion Avoidance & Control for OSPF Networks (draft-ash-manral-ospf-congestion-control-00.txt)
Jerry Ash, AT&T, gash@att.com
Gagan Choudhury, AT&T, choudhury@att.com
Vishwas Manral, NetPlane Systems, vishwasm@netplane.com
Anurag Maunder, Sanera Systems, amaunder@sanera.net
Vera Sapozhnikova, AT&T, sapozhnikova@att.com
Mostafa Hashem Sherif, AT&T, mhs@att.com

2 Outline (draft-ash-manral-ospf-congestion-control-00.txt)
- problem:
  - concerns over scalability of IGP link-state protocols (e.g., OSPF)
  - much evidence that LS protocols cannot recover from large failures & widespread loss of topology database information
    - failure experience
    - vendor analysis
    - simulation & modeling
- proposed protocol mechanisms to address the problem:
  - throttle LSA updates/retransmissions
    - detect & notify congestion state
    - neighbor nodes throttle LSA updates/retransmissions
  - keep adjacencies up
  - database backup & resynchronization
  - proprietary implementations of these mechanisms have improved scalability/stability
    - standard features are needed for uniform implementation & interoperability
- issues discussed on the list

3 Background & Motivation
- failure experience
  - LS routing protocols cannot recover from large 'flooding storms'
    - triggered by a wide range of causes: network failures, bugs, operational errors, etc.
    - a flooding storm overwhelms processors, causes database asynchrony & incorrect shortest-path calculation, etc.
  - AT&T has experienced several very large LS protocol failures (4/13/1998, 7/2000, 2/20/2001; described in the I-D)
- vendor analysis of LS protocol recovery from total network failure (loss of all database information in the specified scenario: 400 nodes, etc.)
  - recovery time estimates up to 5.5 hours
  - expectation is that vendor equipment recovery is not adequate under a large failure scenario
- network-wide event simulation model [choudhury]
  - medium to large flooding storms cause the network to recover with difficulty and/or not recover at all
  - model validated -- results match actual network experience

4 Failure Experience: AT&T Frame Relay Network, 4/13/98
- cause & effect
  - administrative error coupled with a software bug
  - result was the loss of all topology database information
  - the link-state protocol then attempted to recover the database with the usual Hello & topology state updates (TSUs)
  - a huge overload of control messages kept the network down for a very long time
- several problems prevented the network from recovering properly (based on root-cause analysis)
  - a very large number of TSUs sent to every node for processing, causing general processor overload
  - route computation based on incomplete topology recovery; routes generated from transient, asynchronous topology information & then in need of frequent re-computation
  - inadequate work-queue management to allow processes to complete before more work is put into the process queue
  - inability to access node processors with network management commands, due to lack of necessary priority for these messages
- worked with the vendor to make protocol fixes addressing these problems
  - along the lines suggested in the I-D

5 Proposed Protocol Mechanisms: Throttle LSA Updates/Retransmissions
- detect node congestion by
  - length of internal work queues
  - high processor occupancy & long CPU busy times
- notify congestion state to other nodes
  - use a TBD packet to convey the congestion signal
- when a node detects congestion from a neighbor, progressively decrease the flooding rate, e.g.
  - double LSA_RETRANSMIT_INTERVAL for low congestion
  - quadruple LSA_RETRANSMIT_INTERVAL for high congestion
  (a minimal sketch of this backoff appears below)
- simulation analysis shows the proposed mechanisms perform effectively (Choudhury)
- deals better with non-linear failure modes than statistical detection/notification methods
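As a concrete illustration of the throttling rule above, here is a minimal Python sketch. The detection thresholds and the 5-second base interval are assumptions: the draft names LSA_RETRANSMIT_INTERVAL and the low/high congestion levels, but leaves the values and the signaling packet encoding TBD.

```python
from enum import Enum

LSA_RETRANSMIT_INTERVAL = 5.0  # seconds; base value is an assumption here

class Congestion(Enum):
    NONE = 0
    LOW = 1
    HIGH = 2

def detect_congestion(work_queue_len: int, cpu_busy_frac: float) -> Congestion:
    """Detect local congestion from internal work-queue length & CPU
    occupancy, per the draft; the thresholds below are assumptions."""
    if work_queue_len > 1000 or cpu_busy_frac > 0.95:
        return Congestion.HIGH
    if work_queue_len > 200 or cpu_busy_frac > 0.80:
        return Congestion.LOW
    return Congestion.NONE

def retransmit_interval(neighbor_congestion: Congestion) -> float:
    """Interval to use toward a neighbor, backed off per its signal:
    double for low congestion, quadruple for high (from the draft)."""
    multiplier = {Congestion.NONE: 1, Congestion.LOW: 2, Congestion.HIGH: 4}
    return LSA_RETRANSMIT_INTERVAL * multiplier[neighbor_congestion]

# Example: a neighbor signaling HIGH is retried every 20 s instead of 5 s,
# reducing the flooding load the congested node must absorb.
assert retransmit_interval(Congestion.HIGH) == 20.0
```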

6 Issues Discussed on List
- is there a problem (i.e., a need to prevent catastrophic network collapse)?
  - most seem to agree there is a problem
  - several have observed 'LSA storms' & their ill effects
    - storms triggered by hardware failure, software bug, faulty operational practice, etc. -- many different events
    - sometimes the network cannot recover
    - unacceptable to operators
  - vendors invited to analyze the failure scenario given in the draft
    - no response yet
- how to solve the problem?
  - better/smarter implementation/coding of the protocol within the current specification
    - e.g., 'never losing an adjacency solves the problem'
    - these are proprietary, single-vendor implementation extensions
  - standard protocol extensions
    - for uniform implementation
    - for multi-vendor interoperability
    - effectiveness already demonstrated with proprietary, single-vendor implementations

7 Issues Discussed on List
- what protocol extensions?
  - not just 'signaling a congestion message on the wire' but also the response
    - need a uniform response to the congestion signal ('slow down by this much') to be effective
    - rather than an 'implementation dependent' response
    - like the helper-router response to a 'grace LSA' from a congested router in hitless restart
- how to evaluate the effectiveness of proposals?
  - expert analysis based on experience
  - simulation
    - a couple of 'academic' & 'shaky simulation' comments
    - validated simulations have been used widely for network design of routing features, NM features, congestion control, etc. for many years; many large-scale network design examples (e.g., 'Dynamic Routing in Telecommunications Networks', McGraw Hill)
  - 'white-box' approach
    - implement & test in the lab
  - expert analysis, simulation, & white-box testing are all useful

8 Issues Discussed at IETF-55 Routing Area Meeting & MPLS WG Meeting
- box builders' view:
  - 'stop intruding into our box'
  - design choices should be made by box builders
  - nothing wrong with the current way of building boxes
- box users' view:
  - still observe major failures
    - most agree there is a problem (from list discussion)
  - box-builder/vendor analysis shows unacceptable failure response (in draft)
    - box-builders/vendors invited to analyze the scenario in the draft
  - the box-builders' approach doesn't work to prevent failures
  - boxes need a few critical, standard protocol mechanisms to address the problem
  - have gotten vendors to make proprietary changes to fix the problem
  - standard protocol extensions are required
    - for uniform implementation
    - for multi-vendor interoperability
- user requirements need to drive the solution to the problem

9 Conclusions
- problem:
  - concerns over scalability of IGP link-state protocols
  - evidence that LS routing protocols (e.g., OSPF) currently cannot recover from large failures & widespread loss of topology database information
  - the problem is flooding, database asynchrony, shortest-path calculation, etc.
  - evidence based on failure experience, vendor analysis, simulation & modeling
- proposed protocol mechanisms to address the problem, e.g.
  - throttle LSA updates/retransmissions
    - detect & notify congestion state
    - neighbor nodes throttle LSA updates/retransmissions
- simulation analysis shows the effectiveness of the proposed changes (Choudhury)
- propose the draft as an OSPF WG document
  - refine/evolve the proposed protocol extensions

10 Backup Slides

11 Proposed Congestion Control Mechanisms
- throttle LSA updates/retransmissions
  - detect & notify congestion state
  - a congested node signals other nodes to limit the rate of LSA messages sent to it
  - neighbor nodes throttle LSA updates/retransmissions
    - automatically reduce the rate under congestion
- keep adjacencies up
- database backup & resynchronization
  - topology database automatically recovered from loss based on local backup mechanisms
  - allows a node to recover gracefully from local faults on the node
- prioritized processing of Hello & LSA Ack messages (Choudhury draft)

12 Keep Adjacencies Up
- increase the adjacency break interval under congestion
  - goal is to avoid breaking adjacencies by increasing the wait interval for non-receipt of Hello messages
    - if a node detects congestion from a neighbor & no packet is received within NODE_DEAD_INTERVAL
    - wait an additional time = ADJACENCY_BREAK_INTERVAL before declaring the adjacency down
- throttle setup of link adjacencies
  - define MAX_ADJACENCY_BUILD_COUNT = the maximum number of adjacencies a node can bring up at one time
  (a minimal sketch of both mechanisms follows)
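A minimal sketch of both mechanisms, assuming illustrative values for NODE_DEAD_INTERVAL, ADJACENCY_BREAK_INTERVAL, and MAX_ADJACENCY_BUILD_COUNT; the draft names these constants but does not fix their values here.

```python
import time

# Hypothetical values for illustration only.
NODE_DEAD_INTERVAL = 40.0         # seconds without any packet from neighbor
ADJACENCY_BREAK_INTERVAL = 40.0   # extra grace period under congestion
MAX_ADJACENCY_BUILD_COUNT = 5     # max adjacencies brought up at one time

def adjacency_is_down(last_packet_time: float,
                      neighbor_congested: bool,
                      now: float | None = None) -> bool:
    """Declare the adjacency down only after NODE_DEAD_INTERVAL, plus an
    extra ADJACENCY_BREAK_INTERVAL if the neighbor signaled congestion."""
    now = time.monotonic() if now is None else now
    deadline = NODE_DEAD_INTERVAL
    if neighbor_congested:
        deadline += ADJACENCY_BREAK_INTERVAL  # give the congested neighbor more time
    return (now - last_packet_time) > deadline

def adjacencies_to_start(pending: list[str], building: int) -> list[str]:
    """Throttle adjacency bring-up: start only as many new adjacencies as
    fit under MAX_ADJACENCY_BUILD_COUNT concurrently."""
    room = max(0, MAX_ADJACENCY_BUILD_COUNT - building)
    return pending[:room]
```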

13 Database Backup & Resynchronization
- database backup
  - node should provide a local, primary, nonvolatile memory backup [GR-472-CORE]
  - node should back up all non-self-originated LSAs, routing tables, & states of interfaces
  - database should be backed up at least every 5 minutes
  - restoration of data should be completed within 5 minutes of initiation [GR-472-CORE]
  (a minimal sketch of the backup schedule appears below)
- nodes signal neighbors when it is 'safe' to perform resynchronization procedures
  - based on a TBD packet format
- during resynchronization, a node
  - should generate all its own LSAs
  - should receive only LSAs that have changed between the time it failed & the current time
  - should base its routing on the current database, derived as above
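A minimal sketch of the backup schedule, assuming hypothetical `db` and `store` interfaces; only the 5-minute period and the list of backed-up items come from the slide.

```python
import threading

BACKUP_PERIOD = 300.0  # "backed up at least every 5 minutes" [GR-472-CORE]

def backup_database(db, store) -> None:
    """Snapshot what the draft says must survive a restart: all
    non-self-originated LSAs, routing tables, & interface states.
    `db` and `store` are hypothetical interfaces for illustration."""
    snapshot = {
        "lsas": [lsa for lsa in db.lsas() if not lsa.self_originated],
        "routing_tables": db.routing_tables(),
        "interface_states": db.interface_states(),
    }
    store.write(snapshot)  # local, primary, nonvolatile memory backup

def backup_loop(db, store, stop: threading.Event) -> None:
    """Run periodic backups until `stop` is set."""
    while not stop.wait(BACKUP_PERIOD):
        backup_database(db, store)
```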

14 Database Backup & Resynchronization
- database resynchronization
  - propose changes to receiving/transmitting Database Summary & LSA Request packets
  - when in Full state
    - node sends & receives Database Summary & LSA Request packets as if performing database synchronization when the peer data structure is in the Negotiating, Exchanging, & Loading states
  - node informs a neighbor when to use the resync procedures
  - node supports a neighbor's resync request by receiving/transmitting Database Summary & LSA Request packets

15 Failure Experience
- other failures have occurred with similar consequences
- moderate TSU storm following an ATM node upgrade, 7/2000
  - network recovered, with difficulty
- large TSU storm in an ATM network, 2/20/2001 [pappalardo1, pappalardo2]
  - manual procedures were required to reduce TSU flooding & stabilize the network
  - desirable to automate procedures for TSU flooding reduction under overload
- worked with the vendor to make protocol fixes addressing these problems
  - along the lines suggested in the I-D
- other relevant LS-network failures have been reported [cholewka, jander]
- conclusions
  - LS protocols are vulnerable to loss of database information, control overload to re-sync databases, & other failure/overload scenarios
  - networks are more vulnerable in the absence of adequate protection mechanisms
  - a generic problem of LS protocols
    - across a variety of implementations
    - across FR, ATM, & IP-based technologies

16 Vendor Analysis
- vendors & service providers were asked to analyze LS protocol recovery from total network failure (loss of all database information in the specified scenario)
- network scenario
  - 400-node network
    - 100 backbone nodes
    - 3 edge nodes per backbone node (edge nodes single-homed)
  - backbone nodes connected to a maximum of 10 other backbone nodes
    - max node adjacency is 13
    - sparse network
  - 101 peer groups
    - 1 backbone peer group with 100 backbone nodes
    - 100 edge peer groups, each with 3 nodes, all homed on the backbone peer group
  - 1,000,000 addresses advertised
  (a small sketch constructing this topology follows)
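As a worked example, the scenario's topology constraints can be checked with a small sketch. The random sparse wiring is an assumption; the draft fixes only the node counts and the adjacency cap.

```python
import random

# Construct the 400-node analysis scenario: 100 backbone nodes, each
# homing 3 single-homed edge nodes, with at most 10 backbone-to-backbone
# links per node (so max adjacency = 10 + 3 = 13).
random.seed(0)
backbone = [f"bb{i}" for i in range(100)]
adjacency = {n: set() for n in backbone}

# Ring links guarantee connectivity; extra random links keep it sparse.
for i, node in enumerate(backbone):
    adjacency[node].add(backbone[(i + 1) % 100])
    adjacency[backbone[(i + 1) % 100]].add(node)
for node in backbone:
    while len(adjacency[node]) < 4:  # modest degree, well under the cap
        peer = random.choice(backbone)
        if peer != node and len(adjacency[peer]) < 10:
            adjacency[node].add(peer)
            adjacency[peer].add(node)

# Attach 3 single-homed edge nodes to every backbone node.
for node in backbone:
    for j in range(3):
        edge = f"{node}-edge{j}"
        adjacency[edge] = {node}
        adjacency[node].add(edge)

assert len(adjacency) == 400                              # 400-node network
assert all(len(peers) <= 13 for peers in adjacency.values())  # max adjacency 13
```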

17 Vendor Analysis
- projected recovery times
  - Recovery Time Estimate A: 3.5 hours
  - Recovery Time Estimate B: 5-15 minutes
  - Recovery Time Estimate C: 5.5 hours
- expectation is that vendor equipment recovery is not adequate under a large failure scenario

18 Analysis Modeling
- various studies published [atmf00-0249, maunder, choudhury]
- [choudhury] reports a network-wide event simulation model
  - studies the impact of a TSU storm
  - captures
    - node congestion
    - propagation delay between nodes
    - retransmissions if a TSU is not acknowledged within 5 seconds
    - link declared down if a Hello is delayed beyond the "node-dead interval" (aka "inactivity timer" in PNNI, "router-dead interval" in OSPF)
    - link recovery following database synchronization
  - approximates real network behavior & processing times
  - results show
    - dispersion -- the number of control packets generated but not processed in at least one node
    - medium to large TSU storms cause the network to recover with difficulty and/or not recover at all
    - results match actual network experience
  (a minimal event-simulation skeleton follows)
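A minimal event-driven skeleton in the spirit of this model (not [choudhury]'s actual simulator): a TSU is flooded, queued at each node, and delivered after a propagation delay. The propagation and processing times are assumptions; the 5-second retransmission timer comes from the slide.

```python
import heapq

PROP_DELAY = 0.01         # propagation delay between nodes (assumed)
PROCESS_TIME = 0.002      # per-TSU processing time (assumed)
RETRANSMIT_TIMEOUT = 5.0  # from the model: retransmit if not acked in 5 s

def simulate_flood(adjacency: dict[str, set[str]], origin: str, horizon: float) -> int:
    """Flood one TSU from `origin`; return how many nodes processed it
    before `horizon` (the shortfall is the model's 'dispersion')."""
    events = [(0.0, origin)]          # min-heap of (arrival time, node)
    reached, processed = {origin}, 0
    while events:
        t, node = heapq.heappop(events)
        if t > horizon:
            break                     # remaining events = unprocessed backlog
        processed += 1
        for peer in adjacency[node]:  # flood onward to unreached neighbors
            if peer not in reached:
                reached.add(peer)
                heapq.heappush(events, (t + PROCESS_TIME + PROP_DELAY, peer))
                # A fuller model would also schedule a retransmission at
                # t + RETRANSMIT_TIMEOUT, cancelled when the ack arrives.
    return processed
```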

19 Impact of TSU Storm on Network Stability (chart slide; figure not reproduced in transcript)

