Proactive Network Configuration Validation with Batfish Meg Walraed-Sullivan Ratul Mahajan Jitu Padhye I’m going to be discussing Batfish, a tool we developed to find bugs in network configurations offline. Currently available, open source Has been used to find bugs in real networks Purpose: get feedback on improving technology This is joint work.. Ari Fogel Todd Millstein Luis Pedrosa Ramesh Govindan
Misconfigurations are common are expensive Last year bug caused outage for all time warner customers for multiple hours These misconfigurations are common and expensive
ospf interface int3_1 metric 1 ospf redistribute static metric 10 Configuration is Hard Low-Level Directives interface-level metrics protocol metrics per-network policy Multiple Protocols: BGP IS-IS OSPF Protocol Interactions: Route Redistribution Protocol Preference Re-advertisement ospf interface int3_1 metric 1 ospf redistribute static metric 10 bgp neighbor p1 AS P Accept ALL static route 10.0.0.0/24 drop, log Why do these misconfigurations happen? configurations contain low-level directives that we hope map to high-level intent …per-peer policy Correctly specify parameters for multiple protocols …static, connected Interact in complicated ways Diverse configuration languages in typical enterprise network Arista Cisco IOS Cisco NX-OS Juniper Quanta
Example Customer 10.0.0.0/24 n10 C c1 n2 Provider c2 P n1 n3 N p1 n4 10.0.0.0/24 should be: Reachable from C Unreachable from P, n4 Customer 10.0.0.0/24 n10 C c1 n2 Provider c2 Example is idealized version of one of the networks we analyzed. 10.0.0.0/24 should be: Reachable from C, n1, n2 Unreachable from P, n3, n4 P n1 n3 N p1 n4
4 ospf redistribute connected metric 10 10.0.0.0/24 should be: Reachable from C Unreachable from P, n4 Customer 10.0.0.0/24 n10 C c1 n2 Provider c2 P n1 n3 N p1 n4 3 interface int2_10 ip 10.0.0.1/24 4 ospf redistribute connected metric 10 4 static route 10.0.0.0/24 drop 5 ospf redistribute static metric 10 //----------------Configuration of n2---------------- 1 ospf interface int2_1 metric 1 2 ospf interface int2_3 metric 1 3 interface int2_10 ip 10.0.0.1/24 4 ospf redistribute connected metric 10 5 prefix-list PL_C 10.0.0.0/24 6 bgp neighbor c1 AS C apply PL_C out //----------Configuration of n3---------- 1 ospf interface int3_1 metric 1 2 ospf interface int3_2 metric 1 3 ospf interface int3_4 metric 1 4 static route 10.0.0.0/24 drop 5 ospf redistribute static metric 10 6 bgp neighbor p1 AS P Accept ALL - Simple intent with relatively simple configuration - Bug is hard-to-find until deployment Explain highlighted lines Traffic from c2 to 10.0.0.0/24 intermittently dropped Bug is violation of what we call multipath consistency Bug is based on real bug we found Buried in configs thousands of lines big Each line on its own looks reasonable. Need deep automated checker that understands forwarding state
Batfish Offline configuration safety checker Available at http://www.batfish.org Has found real bugs in real networks 4 stages: Configuration processing Configuration analysis Forwarding table generation Forwarding table analysis Enter batfish Offline Don’t need to run equipment Enables checking current network, proposed changes
Stage 1: Process router configurations //----------Configuration of n3---------- 1 ospf interface int3_1 metric 1 2 ospf interface int3_2 metric 1 3 ospf interface int3_4 metric 1 4 static route 10.0.0.0/24 drop 5 ospf redistribute static metric 10 6 bgp neighbor p1 AS P Accept ALL n1 n2 n3 n4 N Our declarative model provides a set of relations, here are a couple Some simple relations come directly from configuration Fact about OSPF interface costs OspfCost( node:n3, interface:int3_1, cost:1). Fact about topology LanNeighbors( node1:n3 interface1:int3_1, node2:n1, interface2:int1_3).
Stage 2: Analyze configurations //----------Parsing---------- No parsing errors //----------Basic checks---------- Undefined reference to route-map ‘loch_ness_policy’ //----------Custom checks---------- // No IP reuse IP ‘192.168.1.13’ assigned to both rtr1:int5 and rtr3:int6 // All loopback networks exported into OSPF rtr5:loopback0 neither active nor passive for any OSPF process Can ask custom questions using our query language
Stage 3: Compute forwarding tables OspfExport( node=n2, network=10.0.0.0/24, cost=10, type=ospfE2). InstalledRoute(route={ node=n1, network=10.0.0.0/24, nextHop=n2 administrativeCost=110, protocolCost=10, protocol=ospfE2}). Fib( node=n1, network=10.0.0.0/24, egressInterface=int1_2). TODO: Mention somehow that we understand intricacies of protocols TODO: merge installedroute and fib into table, remove ospfexport WAIT TO REVEAL - In addition to basic facts we also have derived relations - The data plane generator contains rules for deriving these relations - This stage is key contribution REVEAL Ospf computation e.g. is modeled in part by OspfExport
Stage 4a: Identify forwarding violations Counterexample of multipath consistency { IngressNode=n1, SrcIp=0.0.0.0, DstIp=10.0.0.2, IpProtocol=0 } Now that you have data plane you can check any forwarding property using your favorite dp analyzer There are multiple data planes if you want to check properties expressible as difference between data planes
Stage 4b: Explain forwarding violations Counterexample packet traces ViolationTraceRoute( flow={ node=n1, … ,dstIp=10.0.0.2 }, 1st hop:[ n1:int1_2 -> n2:int2_1 ] 2nd hop:[ n2:int2_10 -> n10:int10_2 ] fate=accepted). -------------------------------------------------------- 1st hop:[ n1:int1_3 -> n3:int3_1 ] fate=nullRouted by n3). - Traces are automatically produced
New Consistency Properties Multipath – disposition consistent on all paths 10.0.0.0/24 n10 n2 Recall that since we produced data plane, you can ask any question expressible as a forwarding invariant. Part of our contribution is the introduction of three novel invariants The first is multipath n1 n3
New Consistency Properties Multipath – disposition consistent on all paths Differential reachability – reachability unaffected by change 2.2.2.0/24 C n2 c1 Reachability matrix should be unchanged after failure N1 accepts both N2 accepts only one After failure, 3.3.3.0/24 is no longer reachable c2 n3 n1 N 2.2.2.0/24 3.3.3.0/24
New Consistency Properties Multipath – disposition consistent on all paths Differential reachability – reachability unaffected by change Destination – at most one customer per delegated address 10.0.1.0/24 AS length=1 10.0.1.0/24 AS length=2 TODO: decide how to change destination consistency mention/name/results Property intended for networks that delegate unique portion of address space to customers - Packets for any given destination address should only be sent to one customer under any network conditions CA CB B ca1 n1 cb1 b1 N
Implementation Support multiple configuration languages IOS, NX-OS, Juniper, Arista, … Broad feature support Route redistribution, OSPF internal/external, BGP communities… Unified, vendor-neutral intermediate representation Languages: IOS, NX-OS, Juniper, Arista Features OSPF: external type-1&2, inter-/intra-area, route-reflection, redistribution, aggregation; community-, as-path matching; route-maps, policy-statements, interface ACLs, route-tagging Sufficient feature coverage to model several large networks In the face of broad feature support, we make analysis tractable by filtering down to unified, vendor-neutral format. As we’ve added more features and built out a sufficient set of primitives, adding new directives often only requires a conversion to unified format.
Demo Simplified version of Net1 Cisco configuration files Multiple seeded bugs
Evaluation Two large university networks Net1 – 21 core routers ISP1 Dept1 ISPm Deptn IGP,VLAN Net1 Core ISP1 Dept1 ISPm Deptn BGP Two large university networks Net1 – 21 core routers Federated network Each department is own AS Heavy use of BGP Net2 – 17 core routers Centrally controlled Heavy use of VLANs Single AS BGP communication only with ISPs Breather, transition
Results “P.S. WRT the prefix that was dual assigned from yesterday, one of my NOC [network operations center] guys stopped by today to ask what voodoo I was using to find such things :)” [emphasis added] – email from the head of the Net1 NOC We had some good feedback from the network operators of net1 (Read it)
Violations Confirmed By Operators Violations Fixed by Operators Results Invariant Total Violations Violations Confirmed By Operators Violations Fixed by Operators Net1 Multipath 32 32(4) 21(3) Diff.Reach. 16 3(2) 0(0) Destination 55 55(6) 1(1) Net2 11 11(3) 77 18(7) No destination for net2 because no customer ASes Difference in confirmed violations failure consistency is due primarily to conscious decision not to deploy additional resources Difference in fixed violations is due to various factors: reluctance to disturb working system, need to coordinate with other operators, or need to install new equipment, which cannot be solved with configuration changes Failure and destination cannot be checked with dp analysis
Selected Violations (Multipath) Black-hole route cost too low (equal) (Diff.Reach.) Only one interface underlying VLAN (Destination) Prefix assigned to multiple deptartments
Send feedback/questions to: arifogel@ucla.edu Conclusion Take survey so we can support your network features and requirements in forthcoming versions: http://www.batfish.org/survey Send feedback/questions to: arifogel@ucla.edu