PlanetSeer: Internet Path Failure Monitoring and Characterization in Wide-Area Services. Ming Zhang, Chi Zhang, Vivek Pai, Larry Peterson, Randy Wang. Princeton University.
2 Motivation: Routing anomalies are common on the Internet (maintenance, power outage, fiber cut, misconfiguration, …). Anomalies can affect end-to-end performance (packet losses, packet delays, disconnectivities).
3 Background: Anomaly detection and diagnosis are nontrivial: asymmetric paths; failure information propagation; highly varied durations; limited coverage.
4 Contributions: New techniques for anomaly detection, anomaly isolation, and anomaly classification. Large-scale study of anomalies: broad coverage; high detection rate, low overhead. Characterization of anomalies: end-to-end effects; benefits to host service.
5 Outline: State of the Art; PlanetSeer Components (MonD – passive monitoring, ProbeD – active probing); Anomaly Analysis (loop-based anomaly, non-loop anomaly); Bypassing Anomalies; Summary.
6 State of the Art: Routing messages (BGP: AS-level diagnosis; IS-IS, OSPF: within a single ISP). Router/link traffic statistics (SNMP, NetFlow: proprietary). End-to-end measurement (ping, traceroute).
7 End-to-End Probing: All-pairs probes among n nodes; O(n^2) measurement cost; not scalable as n grows.
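The quadratic cost can be made concrete with a quick count. The sketch below assumes directed probes (every node probes every other node once); the function name is illustrative and not taken from PlanetSeer.

```python
def all_pairs_probe_count(n: int) -> int:
    """Number of directed probes when each of n nodes probes every other node."""
    return n * (n - 1)


if __name__ == "__main__":
    # 353 is the node count quoted on the ProbeD Groups slide.
    for n in (10, 100, 353):
        print(n, all_pairs_probe_count(n))
```

Going from 100 to 353 nodes multiplies the probe count by more than twelve, which is why the deck looks for something cheaper than all-pairs probing.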
8 Key Observation: Combine passive monitoring with active probing. Peer-to-Peer (P2P), Content Distribution Network (CDN): large client population; geographically distributed nodes; large traffic volume; highly diverse paths. The traffic generated by the services reveals information about the network.
9 Our Approach: Host service: CDN. Components: passive monitoring; active probing. Advantages: low overhead; wide coverage. [Figure labels: Client, A, B, C, R1, R2]
10 MonD: Anomaly Detection. Anomaly indicators: time-to-live (TTL) change (routing change); n consecutive timeouts (n = 4 in current system): idling period of 3 to 16 seconds; most congestion periods < 220 ms.
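The two indicators above can be sketched as per-flow bookkeeping. This is a minimal illustration, not MonD's code: the class and function names are hypothetical, and resetting the timeout counter whenever a packet arrives is an assumption about how "consecutive" is counted.

```python
from dataclasses import dataclass
from typing import Optional

TIMEOUT_THRESHOLD = 4  # n = 4 consecutive timeouts, as stated on the slide


@dataclass
class FlowState:
    """Per-flow state for a hypothetical passive monitor."""
    last_ttl: Optional[int] = None
    consecutive_timeouts: int = 0


def on_packet(state: FlowState, ttl: int) -> bool:
    """Record a received packet; return True if its TTL differs from the last one seen."""
    changed = state.last_ttl is not None and ttl != state.last_ttl
    state.last_ttl = ttl
    # Assumption: any received packet breaks a run of consecutive timeouts.
    state.consecutive_timeouts = 0
    return changed


def on_timeout(state: FlowState) -> bool:
    """Record one timeout; return True once the threshold is reached."""
    state.consecutive_timeouts += 1
    return state.consecutive_timeouts >= TIMEOUT_THRESHOLD
```

A `True` from either function would only mark a possible anomaly; the deck leaves confirmation to active probing.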
11 ProbeD Operation: Baseline probes: when a new IP appears; from local node. Forward probes: when a possible anomaly is detected; from multiple nodes (including local node). Reprobes: at 0.5, 1.5, 3.5 and 7.5 hours later; from local node.
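The reprobe offsets follow a simple pattern: the gap between successive reprobes doubles each time. A small sketch, with hypothetical function names, that turns the offsets from the slide into concrete times and gaps:

```python
from datetime import datetime, timedelta
from typing import List, Sequence

REPROBE_OFFSETS_HOURS = (0.5, 1.5, 3.5, 7.5)  # offsets quoted on the slide


def reprobe_times(detected_at: datetime) -> List[datetime]:
    """Absolute reprobe times for an anomaly detected at `detected_at`."""
    return [detected_at + timedelta(hours=h) for h in REPROBE_OFFSETS_HOURS]


def gaps_hours(offsets: Sequence[float] = REPROBE_OFFSETS_HOURS) -> List[float]:
    """Hours between consecutive probes, starting from detection time."""
    previous = (0.0,) + tuple(offsets[:-1])
    return [later - earlier for earlier, later in zip(previous, offsets)]
```

`gaps_hours()` yields 0.5, 1.0, 2.0 and 4.0 hours, so a short anomaly is resampled quickly while a long-lived one costs only four extra probes.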
12 ProbeD Groups: 353 nodes, 145 sites, 30 groups; grouped according to geographic location; one traceroute per group.
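Issuing one traceroute per group means choosing a single prober from each geographic group. The slide does not say how that node is chosen; the sketch below assumes a uniform random pick, and the node and group names in the usage are made up.

```python
import random
from collections import defaultdict
from typing import Dict


def pick_probers(node_groups: Dict[str, str], rng: random.Random) -> Dict[str, str]:
    """Map each group to one member node (assumption: chosen uniformly at random)."""
    by_group = defaultdict(list)
    for node, group in node_groups.items():
        by_group[group].append(node)
    # Sort members so the result depends only on the RNG, not on dict order.
    return {group: rng.choice(sorted(nodes)) for group, nodes in by_group.items()}


if __name__ == "__main__":
    nodes = {"n1": "us-east", "n2": "us-east", "n3": "europe"}
    print(pick_probers(nodes, random.Random(0)))
```

With 30 groups, a forward probe costs at most 30 traceroutes instead of 353.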
13 Estimating Scope: Which routers might be affected? Routers which possibly change their next hops. Traceroutes from multiple locations can narrow the scope. [Figure labels: r_a, r_b, r_c, r_d, Client, Local ProbeD, Remote ProbeD]
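One way to read this slide: a router is a candidate if its next hop on the baseline path is no longer observed, and a traceroute from another location clears a candidate when it still sees that router forwarding to its old next hop. The sketch below is a simplified illustration under that reading, not the paper's exact algorithm; it assumes loop-free paths given as lists of router names, and all names are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def next_hops(path: List[str]) -> Dict[str, str]:
    """Map each router on a loop-free path to the hop that follows it."""
    return {a: b for a, b in zip(path, path[1:])}


def candidate_routers(baseline: List[str], current: List[str]) -> Set[str]:
    """Baseline routers whose old next hop is changed or unobserved on the current path."""
    old, new = next_hops(baseline), next_hops(current)
    return {router for router, hop in old.items() if new.get(router) != hop}


def narrow(candidates: Set[str], baseline: List[str],
           other_paths: Iterable[List[str]]) -> Set[str]:
    """Drop candidates that another vantage point still sees using the old next hop."""
    old = next_hops(baseline)
    cleared: Set[str] = set()
    for path in other_paths:
        seen = next_hops(path)
        cleared |= {r for r in candidates if seen.get(r) == old[r]}
    return candidates - cleared
```

For example, with baseline `local, ra, rb, rc, rd, client` and a new local path that detours after `ra`, the candidates are `ra`, `rb` and `rc`; a remote traceroute that still traverses `rb, rc, rd` leaves only `ra`.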
14 Path Diversity: Monitoring period: 02/2004 – 05/2004. Unique IPs: 887,521. Traversed ASes: 10, [Figure: AS counts by tier, from core to edge; surviving labels: 215 ASes, 1392 ASes, 1420 ASes]
15 Confirming Anomalies: Reported anomalies: 2,259,588. Conditions: loops; route change; partial unreachability; ICMP unreachable. Very conservative confirmation. Outcome: anomaly 12%, non-anomaly 66%, undecided 22%.
16 Confirmed Anomaly Breakdown: Confirmed anomalies: 271,898 (2 per minute; 100x more). Temp anomalies: inconsistent probes. Breakdown: path change 44%, other outage 23%, temp anomalies 16%, fwd outage 9%, persist loop 7%, temp loop 1%.
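The headline figures can be cross-checked with a few lines of arithmetic. The counts and percentages come from the slides; treating the monitoring period as 90 days is an assumption made only to estimate the per-minute rate.

```python
REPORTED = 2_259_588    # reported anomalies (Confirming Anomalies slide)
CONFIRMED = 271_898     # confirmed anomalies (this slide)
ASSUMED_DAYS = 90       # assumption: "3 months" taken as 90 days

BREAKDOWN_PERCENT = {
    "path change": 44,
    "other outage": 23,
    "temp anomalies": 16,
    "fwd outage": 9,
    "persist loop": 7,
    "temp loop": 1,
}

confirmed_share = CONFIRMED / REPORTED
per_minute = CONFIRMED / (ASSUMED_DAYS * 24 * 60)

if __name__ == "__main__":
    print(f"confirmed share: {confirmed_share:.1%}")
    print(f"confirmed per minute: {per_minute:.2f}")
    print(f"breakdown total: {sum(BREAKDOWN_PERCENT.values())}%")
```

Under that assumption the numbers agree with the slides: roughly 12% of reports are confirmed, about 2 anomalies per minute, and the categories sum to 100%.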
17 Scope of Loops: How many routers or ASes are involved? Temp loops involve more routers than persistent loops: 97% of persistent loops and 51% of temp loops contain 2 hops. 1% of persistent loops cross ASes; 15% of temp loops cross ASes.
18 Distribution of Loops: Many persistent loops in tier-3, few in tier-1. Worst 10% of tier-1 ASes – implications for largest ISPs. 20% traffic; 35% persistent loops.
19 Duration of Persistent Loops: How long do persistent loops last? They either resolve quickly or last for an extended period.
20 Scope of Forward Anomalies: How many routers or ASes are affected? 60% of outages within 1 hop; 75% of outages and 68% of changes within 4 hops; 78% of outages within 2 ASes; 57% of changes within 2 ASes.
21 Location of Forward Anomalies: How close are the anomalies to the edges of the network? 44% of outages at the last hop; 72% of outages and 40% of changes within 4 hops.
22 Distribution of Forward Anomalies: Which ASes are affected? Tier-1 ASes most stable; tier-3 ASes most likely to be affected.
23 Overlay Routing: Use an alternate path when the default path fails. [Figure labels: source, destination, intermediate]
24 Bypassing Anomalies: How useful is overlay routing for bypassing failures? Effective in 43% of 62,815 failures, lower than previous studies. 32% of bypass paths inflate RTTs by more than a factor of two.
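A one-hop bypass and its RTT inflation can be sketched as follows. This is an illustration only: the RTT table, node names, and the choice of the lowest-RTT intermediate are assumptions, not the selection rule used in the study.

```python
from typing import Dict, Iterable, Optional, Tuple

RttTable = Dict[Tuple[str, str], float]  # (from, to) -> RTT in ms; missing = unreachable


def best_bypass(rtt_ms: RttTable, src: str, dst: str,
                intermediates: Iterable[str]) -> Optional[Tuple[str, float]]:
    """Return (intermediate, total RTT) for the fastest working detour, or None."""
    options = []
    for mid in intermediates:
        first = rtt_ms.get((src, mid))
        second = rtt_ms.get((mid, dst))
        if first is not None and second is not None:
            options.append((first + second, mid))
    if not options:
        return None
    total, mid = min(options)
    return mid, total


def inflation(bypass_rtt_ms: float, default_rtt_ms: float) -> float:
    """Ratio of bypass RTT to the RTT of the default path before it failed."""
    return bypass_rtt_ms / default_rtt_ms
```

A bypass counts toward the 32% figure when `inflation(...)` exceeds 2.0, for instance a 90 ms detour replacing a 40 ms default path.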
25 Summary: Confirmed 272,000 anomalies in 3 months. Persistent and temporary loops: persistent loops have narrower scope and either resolve quickly or last for a long time. Path outages and changes: outages are closer to the edge, with narrower scope. Anomaly distribution: skewed; tier-1 most stable, tier-3 most problematic. Overlay routing: bypasses 43% of failures, with latency inflation.
26 More Information: In the paper: more details about anomaly characteristics; end-to-end impacts; classification methodology; optimizations to reduce overheads & improve confirmation rate.
27 Classifying Anomalies: Temporary vs. persistent loops: whether the probe exits the loop at maximum hop. Path changes vs. outages: changes follow different paths to clients; outages stop at intermediate hops. [Figure labels: ProbeD, Client]
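The loop distinction can be sketched on a single traceroute: a repeated router signals a loop, and the loop is treated as persistent if the trace is still inside it when the hop limit is reached. This is a simplified reading of the slide, not the paper's classifier. It assumes every hop responds (no `*` entries) and a hop limit of 30, the usual traceroute default; the function name is hypothetical.

```python
from collections import Counter
from typing import List


def classify_loop(hops: List[str], max_hops: int = 30) -> str:
    """Return 'none', 'temporary', or 'persistent' for one traceroute."""
    counts = Counter(hops)
    repeated = {hop for hop, count in counts.items() if count > 1}
    if not repeated:
        return "none"
    # Still inside the loop when the probe ran out of hops.
    still_looping = len(hops) >= max_hops and hops[-1] in repeated
    return "persistent" if still_looping else "temporary"


if __name__ == "__main__":
    print(classify_loop(["a", "b", "c", "client"]))
    print(classify_loop(["a", "b", "c", "b", "c", "d", "client"]))
    print(classify_loop(["a", "b"] * 15))
```

The three calls cover a clean path, a trace that escapes the loop, and a trace that bounces between two routers until the hop limit.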
28 Non-anomalies: Ultrashort anomalies; path-based TTL; aggressive timeout.
29 Identifying Forward Outages: Forward outages: route change; ICMP dest unreachable; forward timeout.
30 Loop Effect on RTT: How do loops affect RTTs? Loops can incur high latency inflation.
31 Loop Effect on Loss Rate: How do loops affect loss rates? 65% of temporary and 55% of persistent loops are preceded by loss rates exceeding 30%.
32 Forward Anomaly Effect on RTT: How do forward anomalies affect RTTs? Outages and changes can incur latency inflation; outages have a more negative effect on RTTs.
33 Forward Anomaly Effect on Loss Rate: How do forward anomalies affect loss rates? 45% of outages and 40% of changes are preceded by loss rates exceeding 30%.
34 Reducing Measurement Overhead: Can we reduce the number of probes? 15 probes can achieve the same accuracy in 80% of cases. Flow-based TTL.
35 Traffic Breakdown By Tiers