A Case Study in Understanding OSPFv2 and BGP4 Interactions Using Efficient Experiment Design
David Bauer†, Murat Yuksel‡, Christopher Carothers† and Shivkumar Kalyanaraman‡
†Department of Computer Science
‡Department of Electrical, Computer and Systems Engineering
Rensselaer Polytechnic Institute

Problem Statement
Computational complexity: highly detailed models (BGP4, OSPFv2, TCP-Reno, IPv4)
Design complexity: parameter space (fixed inputs, protocol timers, decision algorithm)
ROSS.Net built and utilized to address both parts of the problem
Goal: “good results fast” leading to an understanding of the system under test (make sense of the results)

Response Surface (BGP4 and OSPFv2)
Understand protocol interactions through UPDATE messages generated by and between protocols
–OO: OSPF-caused OSPF updates
–BO: BGP-caused OSPF updates (interaction)
–BB: BGP-caused BGP updates
–OB: OSPF-caused BGP updates (interaction)

Why Are Feature Interactions Harmful?
Network protocol weaknesses are not fully understood until the protocols are implemented or simulated at large scale
Are decisions made to route data efficiently within a domain adversely affecting our ability to route data efficiently across domains?
Hot-potato routing: a small amount of unstable information affects a large portion of traffic
Cold-potato routing
Figure (AS 0, AS 1, AS 2):
–Local policy: optimize routing within an AS (OSPFv2)
–Local policy: optimize routing between ASes (BGP4)
–Global policy: optimize routing within and between ASes

Large-scale Simulation
Topology from Rocketfuel data
Network hierarchy:
–Level 0 routers: 9.92 Gb/sec and 1 ms delay
–Level 1 routers: 2.48 Gb/sec and 2 ms delay
–Level 2 routers: 620 Mb/sec and 3 ms delay
–Level 3 routers: 155 Mb/sec and 50 ms delay
–Level 4 routers: 45 Mb/sec and 50 ms delay
–Level 5 routers and below: 1.55 Mb/sec and 50 ms delay
Simulated ASes:
–LEVEL 3 (AS 3356): iBGP: 7,921; eBGP: 210; OSPFv2: Routers: 2,064, Links: 8,669
–Tiscali (AS 3257): iBGP: 441; eBGP: ; OSPFv2: Routers: 618, Links: 839
–EBONE (AS 1755): iBGP: 16,384; OSPFv2: Routers: 438, Links: 1,192
–EXODUS (AS 3967): iBGP: 50,176; eBGP: 53; OSPFv2: Routers: 688, Links: 2,166
–ABOVENET (AS 6461): iBGP: 2,500; eBGP: 199; OSPFv2: Routers: 843, Links: 2,

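The link hierarchy above can be written down as a small lookup table. The sketch below is only illustrative: the bandwidth and delay values are copied from the slide, while the function names and the idea of keying a link purely on one router level are assumptions, not the ROSS.Net configuration format.

```python
# Bandwidth (bits/sec) and propagation delay (sec) per router level, from the slide.
LINK_CLASSES = {
    0: (9.92e9, 0.001),
    1: (2.48e9, 0.002),
    2: (620e6, 0.003),
    3: (155e6, 0.050),
    4: (45e6, 0.050),
    5: (1.55e6, 0.050),
}


def link_class(level):
    """Return (bandwidth, delay); level 5 and below share the slowest class."""
    return LINK_CLASSES[min(level, 5)]


def one_hop_latency(level, packet_bytes):
    """Propagation delay plus serialization time for one packet (hypothetical helper)."""
    bandwidth, delay = link_class(level)
    return delay + packet_bytes * 8 / bandwidth
```

For example, `one_hop_latency(0, 1240)` adds one microsecond of serialization time to the 1 ms propagation delay of a level 0 link.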
Experiment Design and Analysis
Three classes of protocol parameters:
–OSPF timers, BGP timers, BGP decision
RRS was allowed 200 trials to optimize (minimize) the response surface
–Heuristic search algorithm
Applied multiple linear regression analysis on the results

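To make the "200 trials" budget concrete, here is a much simplified random search in the spirit of a recursive random search: sample the whole parameter box, then keep sampling a shrinking box around the best point found. This is a sketch under stated assumptions, not the RRS implementation used in the study: the function name, the `explore`, `shrink` and `patience` parameters, and the toy objective are all hypothetical, and the restart step of a full recursive search is omitted.

```python
import random


def recursive_random_search(objective, bounds, budget=200, explore=40,
                            shrink=0.5, patience=5, seed=0):
    """Minimize `objective` over a box using exactly `budget` evaluations."""
    if not 1 <= explore <= budget:
        raise ValueError("need 1 <= explore <= budget")
    rng = random.Random(seed)

    def sample(box):
        return [rng.uniform(lo, hi) for lo, hi in box]

    def box_around(center, scale):
        # Box centred on `center`, `scale` times the full range, clipped to bounds.
        box = []
        for c, (lo, hi) in zip(center, bounds):
            half = (hi - lo) * scale / 2
            box.append((max(lo, c - half), min(hi, c + half)))
        return box

    # Exploration: uniform samples over the whole parameter space.
    best_x, best_y = None, float("inf")
    for _ in range(explore):
        x = sample(bounds)
        y = objective(x)
        if y < best_y:
            best_x, best_y = x, y

    # Exploitation: sample near the best point, shrinking after repeated misses.
    scale, misses = shrink, 0
    for _ in range(budget - explore):
        x = sample(box_around(best_x, scale))
        y = objective(x)
        if y < best_y:
            best_x, best_y, misses = x, y, 0
        else:
            misses += 1
            if misses >= patience:
                scale, misses = scale * shrink, 0
    return best_x, best_y


# Toy stand-in for one simulation run; purely synthetic, not a protocol model.
def toy_response(x):
    return (x[0] - 3.0) ** 2 + (x[1] + 1.0) ** 2


best_params, best_value = recursive_random_search(
    toy_response, bounds=[(-10.0, 10.0), (-10.0, 10.0)])
```

In the study each objective evaluation would be a full simulation run, which is why a fixed, small trial budget matters.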
Response Plane
Intra-domain routing decisions can affect inter-domain behavior, and vice versa.
All updates belong to one of four categories:
–OSPF-caused OSPF (OO) update
–OSPF-caused BGP (OB) update: interaction
–BGP-caused OSPF (BO) update: interaction
–BGP-caused BGP (BB) update
Figure: an OB update toward a destination, triggered by a link failure or cost increase (e.g. maintenance)

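The OB case can be illustrated with a hot-potato tie-break: when two egress routers advertise otherwise equal BGP routes, the one with the lowest intra-domain (IGP) cost wins, so an OSPF cost change alone can move the BGP best route. The sketch below is a toy illustration; the router names and cost values are hypothetical.

```python
def best_egress(igp_cost):
    """Hot-potato tie-break: pick the egress with the lowest IGP cost.

    Ties are broken by router name so the result is deterministic.
    """
    return min(sorted(igp_cost), key=igp_cost.get)


# Hypothetical IGP costs from one router to two candidate egress routers.
costs = {"egress_A": 8, "egress_B": 10}
before = best_egress(costs)

# Link cost increase inside the AS, e.g. for maintenance.
costs["egress_A"] = 12
after = best_egress(costs)

# The BGP best route changed although no BGP attribute did: an OB update.
ob_update = before != after
```

Here `before` is `egress_A` and `after` is `egress_B`, so the OSPF change triggers a BGP update.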
Response Plane (continued)
Figure: a BO update toward a destination, triggered when eBGP connectivity becomes available
These interactions cause route changes to thousands of IP prefixes, i.e. huge traffic shifts!

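The four update categories and a response built from them can be sketched as follows. The log format, a list of (causing protocol, updating protocol) pairs, and the function names are assumptions for illustration; the counts are made up and are not results from the study.

```python
from collections import Counter

LETTER = {"OSPF": "O", "BGP": "B"}


def classify(cause, update):
    """First letter: protocol that caused the update; second: protocol that sent it."""
    return LETTER[cause] + LETTER[update]


def response(counts, categories=("OB", "BO")):
    """Response value to minimize; defaults to the interaction updates OB + BO."""
    return sum(counts[c] for c in categories)


# Hypothetical update log: (cause, update) pairs.
log = [("OSPF", "OSPF"), ("OSPF", "BGP"), ("BGP", "BGP"),
       ("BGP", "OSPF"), ("OSPF", "BGP")]
counts = Counter(classify(cause, update) for cause, update in log)

interactions = response(counts)                              # OB + BO
total_updates = response(counts, ("OO", "OB", "BO", "BB"))   # all four categories
```

Choosing a different `categories` tuple gives a different response surface to optimize.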
High Level Characterization
Optimized with respect to the OB+BO response surface.
BGP timers play the major role: ~15% improvement in the optimal response when BGP timers are included in the search space.
–The BGP KeepAlive timer seems to be the dominant parameter, in contrast to the expectation that MRAI would dominate!
OSPF timers have little effect: at most 5%.
–Low time-scale OSPF updates do not affect BGP.

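One way to estimate how strongly each parameter drives a response is the multiple linear regression named in the experiment design. The sketch below fits ordinary least squares with an intercept through the normal equations, in plain Python. The data are synthetic and noise-free, generated from known coefficients so the fit can be checked; they are not the study's trials, and `fit_linear` is a hypothetical helper.

```python
def fit_linear(rows, y):
    """Ordinary least squares with intercept via (X^T X) b = X^T y."""
    X = [[1.0] + list(r) for r in rows]
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(n)]
         + [sum(x[i] * t for x, t in zip(X, y))]
         for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        if abs(A[col][col]) < 1e-12:
            raise ValueError("singular design matrix")
        for r in range(n):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


# Synthetic trials: response = 3 + 2*p1 - 0.5*p2 (made-up coefficients).
trials = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
responses = [3 + 2 * p1 - 0.5 * p2 for p1, p2 in trials]
intercept, coef_p1, coef_p2 = fit_linear(trials, responses)
```

Coefficients are only comparable across parameters when the inputs are on a common scale, so real parameter ranges would normally be standardized before fitting.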
Design 1: Management Perspectives
Varied the response surface: each response surface is equivalent to a particular management approach.
The importance of parameters differs for each metric.
For minimal total updates:
–Local perspectives are 20-25% worse than the global perspective.
For minimal total interactions:
–Other metrics can be 15-25% worse than minimizing total BO+OB.
OB updates are more important than BO updates: OB is ~50% of total updates, BO is ~0.1%.
–Important to optimize OSPF.

Design 2: Hot- vs. Cold-Potato Routing
Q: Can we use this approach to provide guidance for network routing policies?
Performed a full factorial of RRS searches, turning hot- and cold-potato routing ON/OFF
Provides quantitative results from which qualitative statements can be made
Verified AT&T and Sprint measurements
No major impact regardless of the search performed
The majority of UPDATEs were generated by LOCAL-Pref and AS Path length
–MED was << 1% of UPDATEs
–Hot Potato was 0.8%
Larger question: which steps in the BGP decision-making algorithm are most important?

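A full factorial over two ON/OFF factors is just the Cartesian product of their levels. The sketch below enumerates the four hot-/cold-potato combinations; the factor names and the `full_factorial` helper are assumptions for illustration, and in the study each combination would be the setting for one RRS search.

```python
from itertools import product

# Two binary factors, as on the slide: each routing policy is ON or OFF.
FACTORS = {"hot_potato": (False, True), "cold_potato": (False, True)}


def full_factorial(factors):
    """Return one dict per combination of factor levels."""
    names = list(factors)
    return [dict(zip(names, levels))
            for levels in product(*(factors[name] for name in names))]


designs = full_factorial(FACTORS)

# The number of combinations grows as 2**k for k binary factors.
ten_factor_count = len(full_factorial({f"p{i}": (0, 1) for i in range(10)}))
```

With only two binary factors the full factorial has four cells, which is why it is affordable here while a full factorial over every protocol parameter is not.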
Design 3: Network Robustness
Q: Can we use this approach to provide network admins with guidance for network configurations?
Link status varied with uniform random probability over simulation runtime
Link weights varied with uniform random probability over simulation runtime
Response: BO + OB, Global Perspective, and Default network settings
Search consistently provides better results
Response tied to link stability
BGP parameters had the greatest impact
UPDATEs were most effectively minimized by maximizing link failure detection times

Conclusions
–The number of experiments was reduced by many orders of magnitude in comparison to full factorial
–Experiment design and statistical analysis enabled rapid elimination of insignificant parameters
–Several qualitative statements and system characterizations could be obtained with few experiments
–Provided validation of network measurement community results, and called into question the importance of premises
–Search algorithms do not always find the desired behaviour!
Allowed me to complete my thesis and graduate!