Transient BGP Loops Do they matter, and what can be done about them?


1 Transient BGP Loops Do they matter, and what can be done about them?
Nate Kushman, MIT/Akamai
Srikanth Kandula, Dina Katabi and John Wroclawski

2 What causes: “Transient BGP Loops”
[Diagram labels: Sprint, AT&T, Joe, Bob, MIT; "Maintenance"; "Withdraw MIT"]
Speaker notes: So what exactly am I talking about when I say “Transient BGP Loop”? Let’s take an example.

3 What causes: “Transient BGP Loops”
[Diagram labels: Sprint, AT&T, Joe, Bob, MIT; "Maintenance"]

4 What causes: “Transient BGP Loops”
[Diagram labels: Sprint, AT&T, Joe, Bob, MIT; "Maintenance"; "Withdraw MIT"]

5 What causes: “Transient BGP Loops”
[Diagram labels: Sprint, AT&T, Joe, Bob, MIT; "Maintenance"; "Routing Loop"]

6 What causes: “Transient BGP Loops”
[Diagram labels: Sprint, AT&T, Joe, Bob, MIT; "Maintenance"; "Withdraw MIT"]

7 What causes: “Transient BGP Loops”
[Diagram labels: Sprint, AT&T, Joe, Bob, MIT; "Maintenance"]

8 What causes: “Transient BGP Loops”
[Diagram labels: Sprint, AT&T, Joe, Bob, MIT; "Maintenance"]

9 How common are: “Transient Inter-domain Routing Loops”
Sprint study (IMC 2003, IMW 2002):
Looked at packet traces from the Sprint backbone
Up to 90% of the observed packet loss was caused by routing loops
60-100% of the loops were attributable to BGP

10 Routing Loop Damage
Our study:
20 vantage points with BGP feeds
2 months
70,000 unique prefixes
Pinged once every 2 minutes
Tracerouted once every 30 minutes
TTL Exceeded responses used to detect loops
Additional pings and traceroutes when loops were detected
Speaker notes: This study looks at the entire Internet, not just Sprint. (Present the setup as a laundry list.) What we did was look at the percentage of looping packets near BGP updates.
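
The slide says loops were detected from TTL Exceeded responses but does not give the detection logic. One common way to flag a loop in a single traceroute is to look for a router that reappears after other hops have answered in between. The sketch below assumes that approach; the function name `find_loop` and the sample addresses are hypothetical and are not taken from the study.

```python
from typing import Dict, List, Optional, Tuple


def find_loop(hops: List[Optional[str]]) -> Optional[Tuple[int, int]]:
    """Return (earlier_index, later_index) for the first router that
    reappears non-consecutively in a traceroute, or None if no router does.

    `hops` holds the address that sent the TTL Exceeded reply for each
    TTL value, in order; None marks a hop that did not answer.
    """
    last_seen: Dict[str, int] = {}
    for i, hop in enumerate(hops):
        if hop is None:
            continue
        prev = last_seen.get(hop)
        # The same router answering two adjacent TTLs is not treated as a
        # loop here; a gap of at least one hop is required.
        if prev is not None and i - prev > 1:
            return prev, i
        last_seen[hop] = i
    return None


# Hypothetical trace: the packet bounces between two routers.
looping_trace = ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.2", "192.0.2.3"]
clean_trace = ["192.0.2.1", "192.0.2.2", "192.0.2.2", "192.0.2.9"]
```

A real measurement pipeline would also have to cope with load balancing and address aliasing, which this sketch ignores.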

11 Routing Loop Damage
10-15% of updates cause routing loops
[Graph not preserved in transcript]
Speaker notes: Upon an update, the prefix loops 10-15% of the time. (To do: regenerate the graph; add more text making this point.)

12 Collateral Damage
[Diagram labels: AS A, AS B, AS C, AS D, AS E, AS F]

13 Collateral Damage
[Diagram labels: AS A, AS B, AS C, AS D, AS E, AS F; "X"; "Collateral Damage"]

14 Collateral Damage
Prefixes sharing a loopy link see 19% loss
Speaker notes: You might think you’re safe if your prefixes don’t see loops, but in fact you are not. It does not happen all the time, but when it happens it’s bad, and it is an undebuggable problem. When other prefixes on your path loop, you see on average about 20% packet loss. This is your problem even if it’s not your prefixes that loop. (To do: put more emphasis here on how bad things are; make it personal.)

15 What should be done?
We should prevent forwarding loops
Even if the protocol loops, that does not mean that my data should loop
BGP can take 15 minutes to converge
We don’t need to see loops during this time

16 A loop occurs because:
One AS pushes a route update to the data plane, but other ASes, not yet aware of the move, still send packets on the old route
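
A minimal sketch of this failure mode, assuming a topology loosely based on the running example: Joe normally reaches MIT through Bob, AT&T reaches MIT through Joe, and both AT&T and Joe have an alternate path (via Sprint and via AT&T respectively). The exact links, the tables `BEFORE`, `STALE` and `AFTER`, and the helper names are assumptions made for illustration; the slides do not spell out the topology.

```python
from typing import Dict, List

DEST = "MIT"

# Next hop toward MIT at each AS (assumed topology).
BEFORE: Dict[str, str] = {"AT&T": "Joe", "Joe": "Bob", "Bob": DEST, "Sprint": DEST}

# Joe has already moved its data plane to the AT&T path,
# but AT&T has not yet heard about the change.
STALE: Dict[str, str] = {**BEFORE, "Joe": "AT&T"}

# AT&T has processed the update and moved to Sprint.
AFTER: Dict[str, str] = {**STALE, "AT&T": "Sprint"}


def forward(fib: Dict[str, str], src: str, ttl: int = 6) -> List[str]:
    """Follow next hops from src until the packet reaches DEST or TTL hits 0."""
    path = [src]
    node = src
    while node != DEST and ttl > 0:
        node = fib[node]
        path.append(node)
        ttl -= 1
    return path


def is_looping(path: List[str]) -> bool:
    """A packet that never reached DEST before its TTL expired was looping."""
    return path[-1] != DEST


for name, fib in (("before", BEFORE), ("stale", STALE), ("after", AFTER)):
    path = forward(fib, "AT&T")
    print(name, " -> ".join(path), "LOOP" if is_looping(path) else "ok")
```

Only the middle state loops: the two ASes point at each other until AT&T's forwarding table catches up with Joe's.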

17 How can we avoid Routing Loops?
[Diagram labels: Sprint, AT&T, Joe, Bob, MIT; "Maintenance"; "Withdraw MIT"]
Speaker notes: Let’s go back to the same simple example. (To do: put preferences on the respective sides, with the graph in the middle.)

18 How can we avoid Routing Loops?
[Diagram labels: Sprint, AT&T, Joe, Bob, MIT; "Maintenance"; "Withdraw MIT"]
AT&T still thinks Joe is routing through Bob
(To do: drop the customer/provider labels; put preferences on the respective sides, with the graph in the middle.)

19 How can we avoid Routing Loops?
[Diagram labels: Sprint, AT&T, Joe, Bob, MIT; "Maintenance"]
What if: AT&T knew about Joe’s change before making its own?
(To do: drop the customer/provider labels; put preferences on the respective sides, with the graph in the middle.)

20 Suspension
Continue to route traffic
Tell control system not to propagate the route

21 How can we avoid Routing Loops?
[Diagram labels: Sprint, AT&T, Joe, Bob, MIT; "Maintenance"; "Withdraw MIT"]
Speaker notes: Let’s rewind a bit and see how this would work. Joe sends the update but continues to send packets down the maintenance link.

22 How can we avoid Routing Loops?
[Diagram labels: Sprint, AT&T, Joe, Bob, MIT; "Maintenance"; "Withdraw MIT"]
What if: Joe sends its update before changing its forwarding table?

23 How can we avoid Routing Loops?
[Diagram labels: Sprint, AT&T, Joe, Bob, MIT; "Maintenance"]
Speaker notes: Joe is still sending traffic over the old link, which is OK because it’s a maintenance event.

24 How can we avoid Routing Loops?
[Diagram labels: Sprint, AT&T, Joe, Bob, MIT; "Maintenance"]
And also waits for an Ack from AT&T before updating its forwarding table?
Speaker notes: When will Joe know that it can change? Only once it receives the ack from AT&T indicating that AT&T has both received the update and made the relevant forwarding changes to ensure that Joe’s update won’t cause a loop.

25 How can we avoid Routing Loops?
[Diagram labels: Sprint, AT&T, Joe, Bob, MIT; "Maintenance"]
Then we can be sure that AT&T knows about the path change before it happens and will not use the path

26 How can we avoid Routing Loops?
[Diagram labels: Sprint, AT&T, Joe, Bob, MIT; "Maintenance"]
Instead, AT&T will move immediately to the Sprint path and the loop is avoided.
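
The ordering walked through on slides 21-26 can be sketched as a small step-by-step simulation. It compares plain behaviour (Joe changes its forwarding table as soon as it sends the update) with the proposed behaviour (Joe keeps forwarding over the maintenance link until AT&T acknowledges). The topology, the three-step timeline, and the names `simulate` and `delivered` are assumptions for illustration only; this is not the authors' implementation.

```python
from typing import Dict, List

DEST = "MIT"


def delivered(fib: Dict[str, str], src: str, ttl: int = 8) -> bool:
    """True if a packet from src reaches DEST before its TTL runs out."""
    node = src
    while node != DEST and ttl > 0:
        node = fib[node]
        ttl -= 1
    return node == DEST


def simulate(wait_for_ack: bool) -> List[bool]:
    """Return, for each step, whether traffic from AT&T and Joe is delivered."""
    # Assumed starting state: AT&T routes via Joe, Joe via Bob.
    fib = {"AT&T": "Joe", "Joe": "Bob", "Bob": DEST, "Sprint": DEST}
    snapshots: List[bool] = []

    def snapshot() -> None:
        snapshots.append(all(delivered(fib, s) for s in ("AT&T", "Joe")))

    # Step 1: Joe sends its update. Without the ack rule it also switches
    # its forwarding table right away; with it, Joe keeps using Bob.
    if not wait_for_ack:
        fib["Joe"] = "AT&T"
    snapshot()

    # Step 2: AT&T processes the update, moves to Sprint, and acks.
    fib["AT&T"] = "Sprint"
    snapshot()

    # Step 3: having received the ack, Joe finally switches.
    if wait_for_ack:
        fib["Joe"] = "AT&T"
    snapshot()
    return snapshots


print("plain BGP ordering:", simulate(wait_for_ack=False))
print("wait for ack:      ", simulate(wait_for_ack=True))
```

Waiting for the ack only works while the old link still carries traffic, which is the case in the maintenance scenario of the example.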

27 More Generally
We have proven:
Loops are prevented in the general case
Convergence properties similar to normal BGP

28 Your feedback
Clearly:
Planned maintenance events (20% of update events are caused by planned maintenance)
Link up events
What about?
Unplanned link down events
Trade-off between loss on current path and collateral damage
Speaker notes: You say, well, great. This is all nice when I’m doing maintenance and I can afford to wait to update my forwarding table, but is this always the right thing to do? Clearly, for maintenance events this is a good thing, and it’s worth noting that 20% of all update events are caused by planned maintenance. Additionally, it’s laughable that on a link up event, when the network is getting more redundant, we still can’t avoid routing loops! However, in the case of link down events we have a trade-off. I’d like to get a feeling from the operators for how you think about this trade-off. Is it worth slowing down failover in order to ensure that collateral damage isn’t caused? Would you be willing to slow down failover on less important prefixes in order to keep those prefixes from causing collateral damage to the more important prefixes?

29 In Short
Routing loops cause significant performance problems
Even prefixes with no BGP updates are significantly affected by loops
A simple change to BGP can avoid all routing loops

30 Questions?

