PlanetSeer: Internet Path Failure Monitoring and Characterization in Wide-Area Services Ming Zhang, Chi Zhang Vivek Pai, Larry Peterson, Randy Wang Princeton.

1 PlanetSeer: Internet Path Failure Monitoring and Characterization in Wide-Area Services
Ming Zhang, Chi Zhang, Vivek Pai, Larry Peterson, Randy Wang
Princeton University

2 Motivation
Routing anomalies are common on the Internet
- Maintenance
- Power outage
- Fiber cut
- Misconfiguration
- …
Anomalies can affect end-to-end performance
- Packet losses
- Packet delays
- Loss of connectivity

3 Background
Anomaly detection and diagnosis are nontrivial
- Asymmetric paths
- Failure information propagation
- Highly varied durations
- Limited coverage

4 Contributions
New techniques for
- Anomaly detection
- Anomaly isolation
- Anomaly classification
Large-scale study of anomalies
- Broad coverage
- High detection rate, low overhead
- Characterization of anomalies
- End-to-end effects
- Benefits to host service

5 Outline
State of the Art
PlanetSeer Components
- MonD: passive monitoring
- ProbeD: active probing
Anomaly Analysis
- Loop-based anomaly
- Non-loop anomaly
Bypassing Anomalies
Summary

6 State of the Art
Routing messages
- BGP: AS-level diagnosis
- IS-IS, OSPF: within a single ISP
Router/link traffic statistics
- SNMP, NetFlow: proprietary
End-to-end measurement
- Ping, traceroute

7 End-to-End Probing
All-pairs probes among n nodes
- O(n^2) measurement cost
- Not scalable as n grows
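
The quadratic cost can be made concrete with a small calculation. This is only an illustrative sketch: the function name `all_pairs_probe_count` is hypothetical, and it assumes one probe per ordered (source, destination) pair per measurement round. The 353-node figure is the ProbeD node count given later in the talk.

```python
def all_pairs_probe_count(n: int) -> int:
    """Number of ordered (source, destination) pairs among n nodes.

    Assumes each node probes every other node once per round,
    which gives n * (n - 1) probes, i.e. O(n^2).
    """
    return n * (n - 1)


if __name__ == "__main__":
    for n in (10, 100, 353):
        print(n, all_pairs_probe_count(n))
```

Doubling the number of nodes roughly quadruples the number of probes per round, which is why the talk looks for an alternative to all-pairs probing.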

8 Key Observation
Combine passive monitoring with active probing
Peer-to-Peer (P2P), Content Distribution Network (CDN)
- Large client population
- Geographically distributed nodes
- Large traffic volume
- Highly diverse paths
The traffic generated by the services reveals information about the network.

9 Our Approach
Host service
- CDN
Components
- Passive monitoring
- Active probing
Advantages
- Low overhead
- Wide coverage
[Figure: a client, nodes A, B, and C, and routers R1 and R2]

10 MonD: Anomaly Detection
Anomaly indicators
- Time-to-live (TTL) change: indicates a routing change
- n consecutive timeouts (n = 4 in the current system): an idle period of 3 to 16 seconds, whereas most congestion periods last less than 220 ms
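
The two indicators can be sketched as a small per-flow state machine. This is a minimal illustration under stated assumptions, not MonD's actual implementation: the class `FlowMonitor` and its method names are hypothetical, and the caller is assumed to report each observed packet TTL and each retransmission timeout for a flow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FlowMonitor:
    """Hypothetical per-flow tracker for the two anomaly indicators."""

    timeout_threshold: int = 4        # n = 4, as stated on the slide
    last_ttl: Optional[int] = None    # TTL of the most recent packet seen
    consecutive_timeouts: int = 0

    def on_packet(self, ttl: int) -> bool:
        """Record a received packet; return True if its TTL changed."""
        changed = self.last_ttl is not None and ttl != self.last_ttl
        self.last_ttl = ttl
        self.consecutive_timeouts = 0  # a received packet ends the timeout run
        return changed

    def on_timeout(self) -> bool:
        """Record a timeout; return True exactly when the n-th in a row occurs."""
        self.consecutive_timeouts += 1
        return self.consecutive_timeouts == self.timeout_threshold


if __name__ == "__main__":
    flow = FlowMonitor()
    print(flow.on_packet(55), flow.on_packet(53))   # second call reports a change
    print([flow.on_timeout() for _ in range(4)])    # fourth timeout triggers
```

Returning True only on the n-th consecutive timeout means one stalled flow raises a single alarm rather than one per additional timeout; whether the real system behaves this way is an assumption of the sketch.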

11 ProbeD Operation
Baseline probes
- When a new IP appears
- From the local node
Forward probes
- When a possible anomaly is detected
- From multiple nodes (including the local node)
Reprobes
- At 0.5, 1.5, 3.5, and 7.5 hours later
- From the local node
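
The reprobe schedule is easy to express in code. The sketch below only turns the four offsets from the slide into timestamps; the function name `reprobe_times` and the constant name are hypothetical, and how ProbeD actually schedules its timers is not described here.

```python
from datetime import datetime, timedelta
from typing import List

# Offsets from the slide, in hours after the anomaly is detected.
REPROBE_OFFSETS_HOURS = (0.5, 1.5, 3.5, 7.5)


def reprobe_times(detected_at: datetime) -> List[datetime]:
    """Return the four reprobe timestamps for an anomaly detected at detected_at."""
    return [detected_at + timedelta(hours=h) for h in REPROBE_OFFSETS_HOURS]


if __name__ == "__main__":
    for t in reprobe_times(datetime(2004, 2, 1, 12, 0)):
        print(t.isoformat())
```

The gaps between successive reprobes are 1, 2, and 4 hours, so the interval doubles each time: short-lived anomalies are sampled densely while long-lived ones cost few extra probes.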

12 ProbeD Groups
353 nodes, 145 sites, 30 groups
- Grouped according to geographic location
- One traceroute per group

13 Estimating Scope
Which routers might be affected?
- Routers that may have changed their next hops
- Traceroutes from multiple locations can narrow the scope
[Figure: routers r_a, r_b, r_c, r_d; client; local ProbeD; remote ProbeD]
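
One way to read the scope estimate is in terms of router-to-next-hop edges. The sketch below is an assumption-laden illustration, not the authors' algorithm: it treats a router on the baseline path as a candidate if its baseline next hop is not observed in any current traceroute, so every extra vantage point that still sees an old edge removes a candidate. The function names and the router labels are hypothetical.

```python
from typing import Iterable, List, Sequence, Set, Tuple


def edges(path: Sequence[str]) -> Set[Tuple[str, str]]:
    """Set of (router, next hop) pairs along a traceroute path."""
    return set(zip(path, path[1:]))


def estimate_scope(baseline: Sequence[str],
                   current_paths: Iterable[Sequence[str]]) -> List[str]:
    """Routers on the baseline path whose baseline next hop is no longer observed.

    baseline      -- path recorded before the anomaly (local node to client)
    current_paths -- traceroutes taken after the anomaly, from any vantage point
    """
    seen: Set[Tuple[str, str]] = set()
    for path in current_paths:
        seen |= edges(path)
    return [r for r, nxt in zip(baseline, baseline[1:]) if (r, nxt) not in seen]


if __name__ == "__main__":
    baseline = ["ra", "rb", "rc", "rd", "client"]
    local_now = ["ra", "rb", "x", "y"]            # local probe diverges after rb
    remote_now = ["z", "rc", "rd", "client"]      # remote probe still uses rc -> rd
    print(estimate_scope(baseline, [local_now]))
    print(estimate_scope(baseline, [local_now, remote_now]))
```

With only the local traceroute, every router from the divergence point onward is a candidate; adding the remote traceroute clears the routers whose old next hops are still in use.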

14 Path Diversity
Monitoring period: 02/2004 – 05/2004
Unique IPs: 887,521
Traversed ASes: 10,090
[Figure: AS tiers from core to edge, labeled 22 ASes, 215 ASes, 1392 ASes, 1420 ASes, and 13872 ASes]

15 Confirming Anomalies
Reported anomalies
- 2,259,588
Conditions
- Loops
- Route change
- Partial unreachability
- ICMP unreachable
Very conservative confirmation
[Chart: Anomaly 12%, Undecided 22%, Non-anomaly 66%]

16 Confirmed Anomaly Breakdown
Confirmed anomalies
- 271,898
- 2 per minute
- 100x more
Temp anomalies
- Inconsistent probes
[Chart: Path change 44%, Other outage 23%, Temp anomalies 16%, Fwd outage 9%, Persist loop 7%, Temp loop 1%]
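
The headline numbers can be cross-checked with a few lines of arithmetic. The counts come from the slides; the 90-day monitoring window is an assumption (the talk gives the period as 02/2004 – 05/2004 and later says "3 months"), and the variable names are illustrative.

```python
REPORTED = 2_259_588      # reported anomalies (slide 15)
CONFIRMED = 271_898       # confirmed anomalies (slide 16)
ASSUMED_DAYS = 90         # assumption: roughly three months of monitoring

minutes = ASSUMED_DAYS * 24 * 60
confirmed_fraction = CONFIRMED / REPORTED
rate_per_minute = CONFIRMED / minutes

if __name__ == "__main__":
    print(f"confirmed fraction: {confirmed_fraction:.1%}")
    print(f"confirmed anomalies per minute: {rate_per_minute:.2f}")
```

Under that assumption the confirmed share is about 12% of reports and the rate is about 2 per minute, consistent with the figures on the slides.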

17 Scope of Loops
How many routers or ASes are involved?
- Temporary loops involve more routers than persistent loops
- 97% of persistent loops and 51% of temporary loops contain 2 hops
- 1% of persistent loops cross ASes
- 15% of temporary loops cross ASes

18 Distribution of Loops
Many persistent loops in tier-3, few in tier-1
Worst 10% of tier-1 ASes (implications for the largest ISPs)
- 20% of traffic
- 35% of persistent loops

19 Duration of Persistent Loops
How long do persistent loops last?
- They either resolve quickly or last for an extended period

20 Scope of Forward Anomalies
How many routers or ASes are affected?
- 60% of outages are within 1 hop
- 75% of outages and 68% of changes are within 4 hops
- 78% of outages are within 2 ASes
- 57% of changes are within 2 ASes

21 Location of Forward Anomalies
How close are the anomalies to the edges of the network?
- 44% of outages occur at the last hop
- 72% of outages and 40% of changes occur within 4 hops

22 Distribution of Forward Anomalies
Which ASes are affected?
- Tier-1 ASes are the most stable
- Tier-3 ASes are the most likely to be affected

23 Overlay Routing
Use an alternate path when the default path fails
[Figure: source, intermediate, destination]

24 Bypassing Anomalies
How useful is overlay routing for bypassing failures?
- Effective in 43% of 62,815 failures, lower than in previous studies
- 32% of bypass paths inflate RTTs by more than a factor of two
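
A one-hop overlay bypass and its RTT inflation can be sketched as follows. This is an illustration under assumptions, not the study's methodology: the function `best_bypass` is hypothetical, each intermediate node is assumed to report either the RTTs of its two legs (source to intermediate, intermediate to destination) or `None` if it cannot reach the destination, and inflation is taken as the bypass RTT divided by the RTT of the original direct path.

```python
from typing import Dict, Optional, Tuple

Legs = Optional[Tuple[float, float]]  # (src->mid RTT, mid->dst RTT) in ms, or None


def best_bypass(direct_rtt_ms: float,
                legs: Dict[str, Legs]) -> Optional[Tuple[str, float, float]]:
    """Pick the lowest-RTT one-hop bypass.

    Returns (intermediate, bypass RTT in ms, inflation factor),
    or None when no intermediate can reach the destination.
    """
    best: Optional[Tuple[str, float]] = None
    for node, leg in legs.items():
        if leg is None:               # this intermediate cannot bypass the failure
            continue
        total = leg[0] + leg[1]
        if best is None or total < best[1]:
            best = (node, total)
    if best is None:
        return None
    node, total = best
    return node, total, total / direct_rtt_ms


if __name__ == "__main__":
    candidates = {"A": (40.0, 50.0), "B": None, "C": (100.0, 120.0)}
    print(best_bypass(40.0, candidates))
```

In this made-up example the best bypass goes through A but more than doubles the RTT, the kind of latency inflation the slide reports for 32% of bypass paths.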

25 Summary
Confirmed 272,000 anomalies in 3 months
Persistent and temporary loops
- Persistent loops have a narrower scope and either resolve quickly or last for a long time
Path outages and changes
- Outages are closer to the edge and have a narrower scope
Anomaly distribution
- Skewed: tier-1 is the most stable, tier-3 the most problematic
Overlay routing
- Bypasses 43% of failures, with latency inflation

26 More Information
In the paper
- More details about anomaly characteristics
- End-to-end impacts
- Classification methodology
- Optimizations to reduce overheads and improve confirmation rate
mzhang@cs.princeton.edu
http://www.cs.princeton.edu/nsg/infoplane

27 Classifying Anomalies
Temporary vs. persistent loops
- Whether the probe exits the loop before reaching the maximum hop
Path changes vs. outages
- Changes: probes follow different paths to clients
- Outages: probes stop at intermediate hops
[Figure: ProbeD, client]
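
The classification rules can be sketched as a single function over traceroute output. This is a simplified illustration, not the paper's classification methodology: the function names, the `max_hops` default of 30, and the representation of a traceroute as a list of router identifiers (with `None` for unanswered hops) are all assumptions.

```python
from typing import List, Optional, Sequence

Hop = Optional[str]  # router identifier, or None for an unanswered hop


def has_loop(hops: Sequence[Hop]) -> bool:
    """True if any responding router appears more than once."""
    routers = [h for h in hops if h is not None]
    return len(set(routers)) < len(routers)


def classify(baseline: List[Hop], probe: List[Hop],
             reached_client: bool, max_hops: int = 30) -> str:
    """Classify one forward probe against the baseline path."""
    if has_loop(probe):
        # Persistent: still looping when the maximum hop count is reached.
        if len(probe) >= max_hops and not reached_client:
            return "persistent loop"
        return "temporary loop"       # the probe exited the loop
    if reached_client:
        return "path change" if probe != baseline else "no anomaly"
    return "outage"                   # the probe stopped at an intermediate hop


if __name__ == "__main__":
    base = ["a", "b", "c"]
    print(classify(base, ["a", "b", "a", "b"], False, max_hops=4))
    print(classify(base, ["a", "x", "c"], True))
    print(classify(base, ["a", "b", None], False))
```

The ordering matters in this sketch: loops are checked first, so a probe that loops and then reaches the client is reported as a temporary loop rather than as a path change.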

28 Non-anomalies
- Ultrashort anomalies
- Path-based TTL
- Aggressive timeout

29 Identifying Forward Outages
Forward outages
- Route change
- ICMP destination unreachable
- Forward timeout

30 Loop Effect on RTT
How do loops affect RTTs?
- Loops can incur high latency inflation

31 Loop Effect on Loss Rate
How do loops affect loss rates?
- 65% of temporary loops and 55% of persistent loops are preceded by loss rates exceeding 30%

32 Forward Anomaly Effect on RTT
How do forward anomalies affect RTTs?
- Outages and changes can incur latency inflation
- Outages have a more negative effect on RTTs

33 Forward Anomaly Effect on Loss Rate
How do forward anomalies affect loss rates?
- 45% of outages and 40% of changes are preceded by loss rates exceeding 30%

34 Reducing Measurement Overhead
Can we reduce the number of probes?
- 15 probes can achieve the same accuracy in 80% of cases
- Flow-based TTL

35 Traffic Breakdown by Tiers
