BGP Scalability
2  Introduction
- Will discuss the various bugs we have fixed to improve BGP scalability
- Will talk about configuration changes you can make to improve convergence
- Software improvements for faster convergence
3  Before we begin…
- What does this graph show?
- The number of peers we can converge in 10 minutes (y-axis) given a certain number of routes (x-axis) to advertise to those peers
- Example: we can advertise 100k routes to 50 peers with 12.0(12)S, or to 110 peers with 12.0(13)S
4  Old Improvements
- CSCdr50217 – “BGP: Sending updates slow”
- Fixed in 12.0(13)S
- Description: fixed a problem in bgp_io that allows BGP to send data to TCP more aggressively
5  Old Improvements
- What does CSCdr50217 mean in terms of scalability? Almost 100% improvement!!
6  Old Improvements – Peer Groups
- Advertising 100,000+ routes to hundreds of peers is a big challenge from a scalability point of view: BGP will need to send a few hundred megabytes of data in order to converge all peers
- The challenge has two parts: generating the hundreds of megabytes of data, and advertising this data to the BGP peers
- Peer-groups make it easier for BGP to advertise routes to large numbers of peers by addressing these two problems (sample configuration below)
- Using peer-groups will reduce BGP convergence times and make BGP much more scalable
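A minimal peer-group configuration sketch for a route reflector; the AS number, peer-group name, and client addresses are illustrative, not taken from this deck:

    router bgp 65000
     ! Define the peer-group once; all members share its outbound policy,
     ! so the BGP table is walked and UPDATEs are built only for the leader
     neighbor RRC peer-group
     neighbor RRC remote-as 65000
     neighbor RRC route-reflector-client
     ! Each client simply joins the peer-group
     neighbor 10.1.1.1 peer-group RRC
     neighbor 10.1.1.2 peer-group RRC
     neighbor 10.1.1.3 peer-group RRC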
7  Peer Groups
- UPDATE generation without peer-groups: the BGP table is walked for every peer, prefixes are filtered through outbound policies, and UPDATEs are generated and sent to that one peer
- UPDATE generation with peer-groups: a peer-group leader is elected for each peer-group; the BGP table is walked for the leader only, prefixes are filtered through outbound policies, and UPDATEs are generated, sent to the peer-group leader, and replicated for peer-group members that are synchronized with the leader
- If we generate an UPDATE for the peer-group leader and replicate it to all peer-group members, we are achieving 100% replication
8  Peer Groups
- A peer-group member is “synchronized” with the leader if all UPDATEs sent to the leader have also been sent to that member
- The more peer-group members stay in sync, the more UPDATEs BGP can replicate; replicating an UPDATE is much easier/faster than formatting one, since formatting requires a table walk and policy evaluation while replication does not
- A peer-group member can fall out of sync for several reasons:
  - Slow TCP throughput
  - A rush of TCP ACKs fills input queues, resulting in drops
  - The peer is busy doing other tasks
  - The peer has a slower CPU than the peer-group leader
9  Old Improvements
- Peer-groups give a 35%–50% increase in scalability
10  Larger Input Queues
- In a nutshell: if a BGP speaker is pushing a full Internet table to a large number of peers, convergence is degraded by enormous numbers of drops (100k+) on the interface input queue. ISP foo gets ~½ million drops in 15 minutes on their typical route reflector
- With the default interface input queue depth of 75, it takes us ~19 minutes to advertise 75k real-world routes to 500 clients; the router drops ~225,000 packets (mostly TCP ACKs) in this period
- Using brute force and setting the interface input queue depth to 4096, it takes us ~10 minutes to send the same number of routes to the same number of clients; the router drops ~20,000 packets in this period
11  Larger Input Queues (chart)
12  Larger Input Queues
- A rush of TCP ACKs from peers can quickly fill the 75 slots in process-level input queues
- Increasing queue depths (to 4096) improves BGP scalability (example below)
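A hedged example of raising the depth; hold-queue is the standard IOS interface command for this, though the interface name here is illustrative:

    interface POS0/0
     ! Raise the input hold queue from the default of 75 packets to 4096
     hold-queue 4096 in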
13  Larger Input Queues
- Why not change the default input queue size? It may happen someday, but people are nervous; CSCdu69558 has been filed for this issue
- Even with 4096 slots in the input queue we can still see drops given enough routes/peers
- We need to determine “how big is too big”: how large can an input queue be before we are processing the same data multiple times?
14  MTU Discovery
- The default MSS (Maximum Segment Size) is 536 bytes, which is inefficient for today’s POS/Ethernet networks
- Enabling “ip tcp path-mtu-discovery” improves convergence (example below)
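Enabling path MTU discovery is a one-line global configuration. One assumption worth noting, not stated in the deck: established TCP sessions keep their negotiated MSS, so existing BGP sessions typically need to be reset to benefit:

    ! Global configuration; BGP sessions then negotiate an MSS
    ! based on the path MTU instead of the 536-byte default
    ip tcp path-mtu-discovery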
15  MTU Discovery and Larger Input Queues
- Simple config changes can give a 3x improvement
16  UPDATE Packing
- Quick review of BGP UPDATEs. An UPDATE contains:

    | Withdrawn Routes Length (2 octets)                 |
    | Withdrawn Routes (variable)                        |
    | Total Path Attribute Length (2 octets)             |
    | Path Attributes (variable)                         |
    | Network Layer Reachability Information (variable)  |

- At the top you list a combination of attributes (MED = 50, Local Pref = 200, etc.)
- Then you list all of the NLRI (prefixes) that share this combination of attributes (worked example below)
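As an illustration (the attribute values and prefixes here are made up), a single fully packed UPDATE carries one attribute combination and every prefix that shares it:

    Path Attributes:  ORIGIN=IGP, AS_PATH=100 200, MED=50, LOCAL_PREF=200
    NLRI:             10.1.0.0/16, 10.2.0.0/16, 10.3.0.0/16, ...
                      (every prefix sharing this exact attribute combination)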
17  Update Packing
- If your BGP table contains 100k routes and 15k attribute combinations, you can advertise all the routes with 15k UPDATEs if you pack the prefixes 100%; if it takes you 100k UPDATEs, you are achieving 0% update packing
- Convergence times vary greatly depending on the number of attribute combinations used in the table and on how well BGP packs UPDATEs
- Ideal table: routem-generated BGP table of 75k routes; all paths have the same attribute combination
- Real table: 75k-route feed from Digex (replayed via routem); ~12,000 different attribute combinations
18  Update Packing (chart)
19  Update Packing
- With the ideal table we are able to pack the maximum number of prefixes into each UPDATE because all prefixes share a common set of attributes
- With the real-world table we send UPDATEs that are not fully packed: we walk the table by prefix, but prefixes that sit side by side may have different attributes. We can only walk the table for a finite amount of time before we have to release the CPU, so we may not find all the NLRI for a given attribute combination before sending the UPDATEs we have built and suspending
- With 500 RRCs the ideal table takes ~4 minutes to converge, where a real-world table takes ~19 minutes!!
20  UPDATE Packing
- UPDATE packing bug: BGP would pack one NLRI per UPDATE unless “set metric” was configured in an outbound route-map
- CSCdt81280 “BGP: Misc fixes for update-generation” – fixed in 12.0(16.6)S
- CSCdv52271 “BGP update packing suffers with confederation peers” – fixed in 12.0(19.5)S
- Same fix, but CSCdt81280 is for regular iBGP and CSCdv52271 is for confederation peers
21  UPDATE Packing
- Example of CSCdt81280 from a customer router: BGP has 132k routes and 26k attribute combinations
- It took 130k messages to advertise 132k routes
- (show ip bgp summary excerpt: network/path and path-attribute memory usage, plus the per-neighbor MsgSent counters)
22  UPDATE Packing
- CSCdt34187 introduces an improved update-generation algorithm:
  - 100% update packing – attribute distribution no longer makes a significant impact
  - 100% peer-group replication – no longer have to worry about peers staying “in sync”
23  UPDATE Packing
- 4x – 6x improvement!!
24  UPDATE Packing
- 12.0(19)S + MTU discovery + larger input queues = 14x improvement
25  READ_ONLY Mode
- READ_ONLY mode: BGP only accepts routing updates; it does not compute bestpaths or advertise routes for any prefixes
- When the BGP process starts (i.e. after a router reboot), BGP goes into READ_ONLY mode for a maximum of two minutes
- RO mode forces a BGP speaker to sit still for a few minutes, giving its peers a chance to send their initial set of updates. The more routes/paths BGP has, the more stable the network will be, because we avoid the scenario where BGP sends an update for a prefix and then learns about a better path for that prefix a few seconds later; in that case BGP sent two updates for a single prefix, which is very inefficient
- READ_ONLY mode increases the chances of BGP learning the bestpath for a prefix before sending out any advertisements for that prefix
- BGP will transition from RO mode to RW mode once all of its peers have sent their initial set of updates or the two-minute RO timer expires
- READ_WRITE mode: the normal mode of operation for BGP. While in READ_WRITE mode, BGP installs routes in the routing table and advertises those routes to its peers
26  READ_ONLY Mode
- RO and RW modes were introduced via CSCdm56595
- The RO timer (120 seconds) started when the BGP process started
- This never worked on the GSR because it takes more than 120 seconds for linecards to boot, the IGP to converge, etc.
27  READ_ONLY Mode
- CSCds66429 corrects oversights made by CSCdm56595
- The RO timer now starts when the first peer comes up, so linecard boot times and IGP convergence are accounted for automatically (a timer-tuning sketch follows)
- BGP will transition to RW mode when one of the following happens:
  - All peers have sent us a KEEPALIVE
  - All peers that were up within 60 seconds of the first peer have sent us a KEEPALIVE; this way we do not wait 120s for a peer that is mis-configured
  - The 120s timer pops
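The length of the read-only window is tunable via the update-delay timer referenced later in this deck; a sketch, with an illustrative AS number and value:

    router bgp 65000
     ! Lengthen the read-only window beyond the 120-second default
     bgp update-delay 300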
28  What happened to 12.0(21)S?
29  Introduction
- Customer demand for faster BGP convergence: BGP could take over 60 minutes to converge 100+ peers
- CSCdt34187 “BGP should optimize update advertisement”
- Committed to 12.0(18.6)S and 12.0(18)S1
- Dramatically reduced convergence times and improved scalability
- Known as the “Init” mode convergence algorithm; the pre-CSCdt34187 method is known as “Normal” mode
30  How does it work?
- CSCdt34187 improves convergence by achieving 100% update packing and 100% update replication
- A new algorithm is used to efficiently pack UPDATEs and replicate them to all peer-group members
- BGP converges much faster, but uses large amounts of transient memory to do so
31  Oops
- When memory is low, BGP will throttle itself to avoid running out of memory
- The problem: BGP has no low watermark for how much memory it is allowed to use. It can use the majority of memory (though not all of it), and other processes then need more memory than BGP is leaving available
- The result: customers running 12.0(18)S1 or 12.0(19)S saw extremely low watermarks in free memory (see the show memory sketch below), and upgrading to 12.0(21)S almost always resulted in a malloc failure on the GSR
- 12.0(21)S was deferred
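The watermark is visible with plain “show memory”, whose Lowest(b) column reports the smallest amount of free processor memory seen since boot (the figures below are illustrative, not from the deck):

    Router# show memory
               Head     Total(b)     Used(b)    Free(b)  Lowest(b)  Largest(b)
    Processor   ...    262144000  241000000   21144000    1205000   18000000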
32  What is happening?
- Any event that causes another process to use large amounts of transient memory while BGP is converging can result in a malloc failure
- CEF XDR messages are the most common problem: XDRs are used to update linecards with information about the RIB/FIB, and XDRs can consume a lot of memory
33  XDR Triggers
- When a linecard boots, XDRs are used to send it the RIB/FIB, so linecards booting while BGP is trying to converge can result in a malloc failure
- Upgrading from 12.0(19)S to 12.0(21)S will cause the linecards to boot one at a time, because various software components on the linecards must be upgraded
- If it takes more than 2 minutes (the default update-delay timer) for all linecards to boot, then cards will be coming up while BGP is converging
34  XDR Triggers
- Any significant routing change can trigger a wave of XDRs
- Example: a new peer comes up whose paths are better than the ones BGP currently has installed; BGP must re-install the new bestpaths, which causes XDRs to be sent to all linecards
35  XDR Triggers
- Double recursive lookups almost always trigger a significant routing change

      A           [AS 100, advertises a /8]
      |
      B ----- C   [B and C are in AS 200]

- B does not do next-hop-self on the session to C; instead B does “redistribute connected” and “redistribute static” into BGP
- C will know about A’s next-hop, but will know about it via BGP
36  XDR Triggers
- Same topology: A (AS 100, advertising a /8) attached to B, with B peered to C (B and C in AS 200)
- Step 1: C transitions from RO mode to RW mode
- Step 2: C has no route to A, because C only knows about A via BGP and we haven’t installed any BGP routes yet
- Step 3: C selects some other route as best and installs it; other BGP routes, including the route to A, are installed at this point
- Step 4: BGP begins converging peers, which uses most of the memory on the box
- Step 5: bgp_scanner runs on C, but now A is reachable, so C’s bestpath for the /8 changes
- Do this 100k times and you have a lot of XDR messages (the conventional fix is sketched below)
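A hedged sketch of the conventional fix on B: set B itself as the next hop on the iBGP session to C instead of redistributing connected/static routes. The neighbor address is illustrative; the deck only gives AS 200 for B and C:

    router bgp 200
     ! Advertise B itself as the next hop toward C, instead of relying on
     ! "redistribute connected" / "redistribute static" for next-hop reachability
     neighbor 192.0.2.3 remote-as 200
     neighbor 192.0.2.3 next-hop-self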
37  The Solution
- We must take multiple steps to avoid malloc failures
- #1: BGP has a RIB throttle mechanism that allows us to delay installing a route in the RIB if memory is low. This avoids malloc failures during large routing changes like the double-recursive scenario
- #2: CEF will wait for all linecards to boot before enabling CEF on any linecard. This avoids the problem of sending XDRs to slow-booting linecards while BGP is trying to converge
38  The Solution
- #3: If a linecard crashes/reboots while BGP is trying to converge, CEF will signal BGP that it needs more transient memory to bring the linecard up. BGP will finish converging the current peer-group and will signal CEF that memory is available
- #4: “Init” mode in BGP will always try to leave 20 MB free for CEF (distributed platforms only). An additional 1/32 of total memory on the box will be left free for other processes (e.g., on a 256 MB box that is 20 MB + 8 MB = 28 MB left free)
- #5: BGP will fall back to Normal mode if we can’t converge while leaving the required amounts of memory free