Effective Diagnosis of Routing Disruptions from End Systems
Ying Zhang, Z. Morley Mao, Ming Zhang

Routing disruptions impact application performance
More applications today have high QoS requirements
Routing events can cause high loss and long delays
[Figure: traffic from Src to Dst crossing AS A through AS E over the Internet]

Existing approaches to diagnosing routing disruptions are ISP-centric
Require routing data from many routers in ISPs [Feldmann04, Teixeira04, Wu05]
Passive and accurate
[Figure: BGP collectors inside the ISP receiving routes from AS A, AS B, AS C, and AS D]

Limitations of ISP-centric approaches
Difficult to gain access to data from many ISPs
BGP data reflects "expected" data-plane paths, which may differ from the paths packets actually take
[Figure: end systems outside the ISP; the ISP's internal paths are invisible to them]

Can we diagnose entirely from end systems?
Goal: infer the data-plane paths through many routers
[Figure: a probing host tracing the path through ISP A and AS B, AS C, AS D to Dst]

Our approach: end-system-based monitoring
Only requires probing from end hosts
Covers all the PoPs (points of presence) of a target ISP
[Figure: probing hosts reaching Dst through the target ISP's PoPs]

Our approach: end-system-based monitoring
Covers most of the destinations on the Internet
[Figure: probes fanning out through ISP A toward many destinations]

Our approach: end-system-based monitoring
Identifies routing changes by comparing paths measured consecutively, as sketched below
[Figure: the probed path through ISP A to Dst differs between consecutive measurements]

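To make the comparison step concrete, here is a minimal Python sketch of detecting a routing event from two consecutively measured paths. Everything in it (the `ip_to_pop` mapping, function names, sample hops) is an illustrative assumption, not the system's actual code.

```python
def pop_path(traceroute_hops, ip_to_pop):
    """Collapse a traceroute's router-level hops into a PoP-level path.

    ip_to_pop is an assumed mapping from router IP to PoP name;
    unmapped hops are dropped and consecutive duplicates are merged.
    """
    path = []
    for hop in traceroute_hops:
        pop = ip_to_pop.get(hop)
        if pop is not None and (not path or path[-1] != pop):
            path.append(pop)
    return path

def is_routing_event(prev_path, curr_path):
    """A routing event is a difference between two consecutively
    measured PoP-level paths toward the same destination."""
    return prev_path != curr_path

# Hypothetical example: the path switches from NYC-CHI to NYC-SEA.
ip_to_pop = {"10.0.0.1": "NYC", "10.0.0.2": "CHI", "10.0.0.3": "SEA"}
old = pop_path(["10.0.0.1", "10.0.0.2"], ip_to_pop)
new = pop_path(["10.0.0.1", "10.0.0.3"], ip_to_pop)
assert is_routing_event(old, new)
```
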
Advantages and challenges
Advantages:
No need for access to ISP-proprietary data
Identify actual data-plane paths
Monitor data-plane performance
Challenges:
Limited resources to probe
Coverage of probed paths
Timing granularity
Measurement noise

System architecture
Collaborative probing → event identification and classification → event correlation and inference → event impact analysis → reports, all driven by measurements of the target ISP

Outline
Collaborative probing
Event identification and classification
Event correlation and inference
Results and validation

Collaborative probing
Uses a set of hosts:
To learn the routing state
To improve coverage
To reduce overhead
A sketch of splitting the probing work across hosts follows.
[Figure: multiple probing hosts sharing the probing of ISP A's paths]

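As a rough illustration of how probing work can be shared, the sketch below splits a destination list across hosts round-robin so each destination is probed once per round. This is a simplified stand-in for the paper's actual assignment strategy; all names are hypothetical.

```python
from itertools import cycle

def assign_probing_tasks(hosts, destinations):
    """Assign each destination to exactly one probing host per round,
    spreading the load evenly (simplified round-robin assignment)."""
    tasks = {host: [] for host in hosts}
    for host, dst in zip(cycle(hosts), destinations):
        tasks[host].append(dst)
    return tasks

hosts = ["host-a", "host-b", "host-c"]     # hypothetical probers
dests = [f"prefix-{i}" for i in range(9)]  # hypothetical targets
tasks = assign_probing_tasks(hosts, dests)
# Each host probes len(dests)/len(hosts) destinations per round, so
# coverage is preserved while per-host probing overhead shrinks.
assert all(len(v) == 3 for v in tasks.values())
```
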
Outline
Collaborative probing
Event identification and classification
Event correlation and inference
Results and validation

Event classification
Classify events according to ingress/egress changes (see the sketch below):
Type 1: ingress PoP changes
Type 2: ingress PoP same, egress PoP different
Type 3: ingress PoP same, egress PoP same
[Figure: a probing host entering the target ISP and exiting toward destination prefix P]

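The three types map directly onto a comparison of ingress/egress PoPs. A minimal sketch, with made-up PoP names:

```python
def classify_event(old_path, new_path):
    """Classify a routing event from the (ingress, egress) PoP pairs of
    the old and new paths through the target ISP."""
    (old_in, old_eg), (new_in, new_eg) = old_path, new_path
    if old_in != new_in:
        return "Type 1: ingress PoP changes"
    if old_eg != new_eg:
        return "Type 2: ingress PoP same, egress PoP different"
    return "Type 3: ingress PoP same, egress PoP same"

# Hypothetical (ingress, egress) PoP pairs:
print(classify_event(("NYC", "SEA"), ("CHI", "SEA")))  # Type 1
print(classify_event(("NYC", "SEA"), ("NYC", "LAX")))  # Type 2
print(classify_event(("NYC", "SEA"), ("NYC", "SEA")))  # Type 3
```
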
Outline
Collaborative probing
Event identification and classification
Event correlation and inference
Results and validation

Likely causes: link failures
[Figure: a link failure forces traffic from the probing host to destination prefix P off the old path through the old egress PoP and onto a new path through a new egress PoP toward the neighbor AS]

Likely causes: internal distance changes
Hot-potato changes:
Cost of the old internal path increases (e.g., distance 100 → 120)
Cost of the new internal path decreases (e.g., distance 120 → 80)
A sketch of hot-potato egress selection follows.
[Figure: traffic from the probing host shifts from the old egress PoP to the new egress PoP toward the neighbor AS as internal distances change]

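Hot-potato routing reduces to a couple of lines: among equally good BGP egress points, a router exits at the one with the smallest internal (IGP) distance. The sketch below reproduces the slide's numbers; the dictionary layout is an assumption for illustration.

```python
def pick_egress(igp_distance):
    """Hot-potato selection: exit at the egress PoP with the smallest
    internal (IGP) distance among equally preferred BGP routes."""
    return min(igp_distance, key=igp_distance.get)

# Before the change the old egress is closer (100 vs 120); afterwards
# the costs move to 120 vs 80, so traffic shifts to the new egress.
before = {"old_egress": 100, "new_egress": 120}
after = {"old_egress": 120, "new_egress": 80}
assert pick_egress(before) == "old_egress"
assert pick_egress(after) == "new_egress"  # hot-potato change
```
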
Event correlation
Spatial correlation: a single network failure often affects multiple routers
Temporal correlation: routing events occurring close together in time are likely due to only a few causes (a clustering sketch follows)

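One simple way to exploit temporal correlation is to cluster events whose timestamps fall within a fixed window of each other. The window size and event representation below are illustrative assumptions.

```python
def cluster_by_time(events, window=60.0):
    """Group events (timestamp, description) so that consecutive events
    within `window` seconds land in the same cluster; nearby events are
    likely to share a root cause."""
    clusters, current = [], []
    for ts, desc in sorted(events):
        if current and ts - current[-1][0] > window:
            clusters.append(current)
            current = []
        current.append((ts, desc))
    if current:
        clusters.append(current)
    return clusters

events = [(0.0, "e1"), (12.5, "e2"), (300.0, "e3")]  # hypothetical
assert len(cluster_by_time(events)) == 2  # {e1, e2} together, e3 alone
```
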
Inference methodology
Evidence: an event that supports a hypothesized cause
[Figure: under the cause "link L is down", both probing hosts' new paths to destination prefix P exit at a new egress, supporting the hypothesis]

Inference methodology
Conflict: a measurement trace that contradicts a hypothesized cause
[Figure: a trace to destination prefix P still traverses link L, contradicting the cause "link L is down"]

Inference methodology
Evidence node: path change [1,2,3] → [1,2,4]
Candidate causes: link 2-3 down; node 3 withdraws the route
[Figure: topology of AS 1, AS 2, AS 3, AS 4, with a withdrawal at AS 3]

Inference methodology
Evidence graph:
Evidence node: [1,2,3] → [1,2,4]
Evidence node: [0,2,3] → [0,2,4]
Candidate causes: link 2-3 down; node 3 withdraws the route
[Figure: evidence graph linking both path changes to both candidate causes; topology now includes AS 0]

Inference methodology
Conflict graph:
Conflict node: [1,2,3,6]
Conflict node: [0,2,3,6]
Conflict node: [0,2,3]
These traces still traverse link 2-3, contradicting the cause "link 2-3 down"
[Figure: conflict graph linking the unchanged traces to the candidate causes; topology includes AS 6]

Inference methodology
Evidence graph: [1,2,3] → [1,2,4]; [0,2,3] → [0,2,4]
Conflict graph: [1,2,3,6]; [0,2,3,6]; [0,2,3]
Cause "link 2-3 down": evidence 2, conflicts 3
Cause "node 3 withdraws the route": evidence 2, conflicts 0
Greedy algorithm: select the minimum set of causes that can explain all the evidence while minimizing conflicts; here the withdrawal wins (a sketch follows)

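The greedy step can be sketched as a set-cover-style loop over the two graphs: repeatedly pick the cause that explains the most still-unexplained evidence, breaking ties in favor of fewer conflicts. The data layout and tie-breaking details below are illustrative assumptions, not the paper's exact algorithm.

```python
def greedy_infer(evidence, conflicts):
    """Pick a small set of causes explaining all evidence while
    minimizing conflicts.

    evidence:  dict cause -> set of events the cause would explain
    conflicts: dict cause -> set of traces contradicting the cause
    """
    unexplained = set().union(*evidence.values())
    chosen = []
    while unexplained:
        best = max(evidence, key=lambda c: (len(evidence[c] & unexplained),
                                            -len(conflicts[c])))
        if not evidence[best] & unexplained:
            break  # remaining evidence cannot be explained
        chosen.append(best)
        unexplained -= evidence[best]
    return chosen

# The slide's example: both causes explain both events, but the
# withdrawal has 0 conflicts versus 3 for the link failure.
evidence = {"link 2-3 down": {"e1", "e2"},
            "node 3 withdraws route": {"e1", "e2"}}
conflicts = {"link 2-3 down": {"c1", "c2", "c3"},
             "node 3 withdraws route": set()}
print(greedy_infer(evidence, conflicts))  # ['node 3 withdraws route']
```
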
Outline
Collaborative probing
Event identification and classification
Event correlation and inference
Results and validation

ISPs studied

AS Name | ASN | Periods | # of Src | # of PoPs | # of Probes | Probe Gap
AT&T | 7018 | 3/23-4/9 | 230 | 111 | 61453 | 18.3 min
Verio | 2914 | 4/10-4/22, 9/13-9/22 | 218 | 46 | 81024 | 19.3 min
Deutsche Telekom | 3320 | 4/23-5/22 | 149 | 64 | 27958 | 17.5 min
Savvis | 3561 | 5/23-6/24 | 178 | 39 | 40989 | 17.4 min
Abilene | 11537 | 9/23-9/30, 2/3-2/17 | 113 | 11 | 51037 | 18.4 min

Results of event classification
Many events are internal changes; Abilene has many ingress changes

Target AS | Total events (% of all traces) | Diff egress | Same ingress & egress: internal PoP path | Same ingress & egress: external AS path | Diff ingress
AT&T | 0.35% | 12.1% | 51% | 35% | 11%
Verio | 0.31% | 27.3% | 48% | 19% | 9.8%
Deutsche Telekom | 0.66% | 4.9% | 8.5% | 80.7% | 7.2%
Savvis | 0.35% | 11% | 45% | 31% | 14%
Abilene | 0.24% | 13.6% | 37% | 40% | 17%

Validation with the BGP-based approach [Wu05]
Hot-potato changes: egress-point changes due to internal distance changes

Hot-potato changes | BGP-based | Our method | Both
Tier-1 AS | 147 | 185 | 101 (31%, 45%)
Abilene network | 79 | 88 | 60 (24%, 31%)

Columns give the number of incidents identified by the BGP-based method, by our method, and by both; the parenthesized percentages are the resulting false-negative and false-positive rates.

Validation with the BGP-based approach
Session resets: peering link up/down

Session resets | BGP-based | Our method | Both
Tier-1 AS | 9 | 15 | 6 (33%, 50%)
Abilene network | 7 | 11 | 7 (0%, 36%)

Reasons for inaccuracy:
Limited coverage
Coarse-grained probing
Measurement noise

System performance
Can keep up with the rate at which routing state is generated
Applicable to real-time diagnosis and mitigation:
Reactive: construct alternate paths to bypass the problem
Proactive: avoid paths with many historical routing disruptions

Conclusion
Developed the first system to diagnose routing disruptions purely from end systems
Used a simple greedy algorithm on two bipartite graphs to infer causes
Comprehensively validated the accuracy

Thank you! Questions?

Performance impact analysis
End-to-end latency changes caused by different types of routing events

Validation with BGP data
BGP feeds from RouteViews, RIPE, Abilene, and 29 BGP feeds from a Tier-1 ISP
Destination prefix coverage and routing event detection rate:

Target AS | Dst. prefix coverage | Dst. prefixes traversing PoPs with BGP feeds | Detected events (AS change, next-hop change) | Missed events (short duration, filtering, other)
AT&T | 15% | 1.5% | 11% (10.3%, 3.2%) | 89% (75%, 13%, 1%)
Verio | 18.6% | 18.1% | 23% (19.1%, 8.6%) | 77% (73%, 4%, 0%)
Savvis | 7.8% | 1.1% | 6% (5.8%, 0.5%) | 94% (80%, 9%, 5%)
Abilene | 6% | 6% | 21% (17.3%, 5.8%) | 79% (61%, 15%, 3%)

Event classification: same ingress PoP, different egress PoP
Policy changes:
Local preference of the old route decreases (e.g., 100 → 50)
Local preference of the new route increases (e.g., 60 → 110)
A sketch of the local-preference decision follows.
[Figure: traffic shifts from the old egress PoP to the new egress PoP toward the neighbor AS as local preferences change]

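A two-attribute slice of the BGP decision process is enough to show why these local-preference moves shift the egress: routes are compared first on local preference (higher wins), then on AS-path length (shorter wins). The route encoding below is an assumption for illustration.

```python
def best_route(routes):
    """Simplified BGP decision: highest local preference first, then
    shortest AS path (further tie-breakers omitted)."""
    return max(routes, key=lambda r: (r["local_pref"], -len(r["as_path"])))

old_routes = [{"egress": "old", "local_pref": 100, "as_path": ["A", "B"]},
              {"egress": "new", "local_pref": 60, "as_path": ["C", "D"]}]
new_routes = [{"egress": "old", "local_pref": 50, "as_path": ["A", "B"]},
              {"egress": "new", "local_pref": 110, "as_path": ["C", "D"]}]
assert best_route(old_routes)["egress"] == "old"
assert best_route(new_routes)["egress"] == "new"  # egress PoP shifts
```
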
Event classification: same ingress PoP, different egress PoP
External routing changes:
Old route worsens due to external factors (withdrawal, longer AS path, e.g., ABCD → ABEFD)
New route improves due to external factors (e.g., BCEFD → BEFD)
[Figure: AS-path changes at AS A and AS B move the egress PoP]

Event classification: same ingress PoP, same egress PoP
Internal PoP path changes:
Cost of the old internal path increases
Cost of the new internal path decreases
External AS path changes
[Figure: old and new paths from the probing host through the target ISP to destination prefix P]

Results of cause inference
Effectiveness of the inference algorithm
Cluster: a group of events with the same root cause
[Figure: effectiveness of the inference algorithm across clusters]

Event identification
A routing event: a path change
Event identification: comparing consecutive routing snapshots
[Figure: a probing host detects a path change through ISP A between snapshots]