Network Tomography for Fault Diagnosis Renata Teixeira LIP6 Computer Laboratory CNRS and UPMC Paris Universitas
1 The Internet is great, but problems happen LIP6 network Net1 Net2 Net3 How to automatically detect and identify problems? Is my connection ok? Is it google? Is the problem in one of the networks in path?
2 Current alarms are not enough Network equipment already has many alarms – SNMP traps – Anomaly detection systems But alarms may not reflect the user’s experience – Hard to map users’ complaints to alarms – The user’s problem may not appear as an alarm Network admins often resort to active measurements – Active monitoring servers inside their network – Subscribe to third-party monitoring services, e.g., Keynote or RIPE TTM
3 End-hosts can collaborate to troubleshoot problems LIP6 network Net1 Net2 Net3 Detection: continuous path monitoring Identification: tomography
4 End-host troubleshooting in two different contexts Network admins deploy monitoring services – Verify the performance of their networks – Assist in troubleshooting End-users can collaborate – Identify and bypass problems – Rank providers
5 Detection techniques For network admins – Deploy dedicated monitors – Need to inject probes to measure paths For end-users – Monitoring at end-users’ machines – Tapping users’ traffic is promising Challenge: cannot continuously overload the network or the end-user’s machine to detect faults
Minimizing probing cost for detecting interface failures: Algorithms and scalability analysis with Hung X. Nguyen (Univ. of Adelaide) Patrick Thiran (EPFL) Christophe Diot (Thomson)
7 Active monitoring system to detect faults M1 M2 T3 T1 T2 A C B D target hosts monitors Goal: detect failures of any of the interfaces in the subscriber’s network with minimum probing overhead target network
8 Simple solution: Coverage problem M1 M2 T3 T1 T2 A C B D Instead of probing all paths, select the minimum set of paths that covers all interfaces in the subscriber’s network Coverage problem is NP-hard – Solution: greedy set-cover heuristic
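The greedy set-cover heuristic named on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the path names and the path-to-interface mapping are hypothetical, and ties are broken alphabetically only to keep the output deterministic.

```python
def greedy_path_cover(paths):
    """Select paths until every interface is covered.

    paths: dict mapping a path name to the set of interfaces it traverses.
    Returns the list of selected path names, in selection order.
    """
    uncovered = set().union(*paths.values())
    selected = []
    while uncovered:
        # Pick the path that covers the most still-uncovered interfaces.
        # Iterating over sorted names makes tie-breaking deterministic.
        best = max(sorted(paths), key=lambda p: len(paths[p] & uncovered))
        selected.append(best)
        uncovered -= paths[best]
    return selected


# Hypothetical topology: two monitors, three targets, interfaces A-D.
example_paths = {
    "M1->T1": {"A", "B"},
    "M1->T2": {"A", "C"},
    "M2->T2": {"D", "C"},
    "M2->T3": {"D"},
}
cover = greedy_path_cover(example_paths)
print(cover)
```

In this toy input two of the four paths are enough to touch all four interfaces. The greedy choice is not guaranteed to be minimal in general; it is the standard approximation for set cover.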
9 Coverage solution doesn’t detect all types of failures Detects fail-stop failures – Failures that affect all packets that traverse the faulty interface E.g., interface or router crashes, fiber cuts, bugs But not path-specific failures – Failures that affect only a subset of paths that cross the faulty interface E.g., router misconfigurations
10 New formulation of failure detection problem Select the frequency to probe each path – Lower frequency per-path probing can achieve a high frequency probing of each interface M1 M2 T3 T1 T2 A C B D 1 every 9 mins 1 every 3 mins
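The arithmetic behind the two figures on this slide can be checked with a short sketch. It assumes (this is not stated on the slide) that three paths share one interface and that the monitors stagger their probes evenly, so the per-path rates simply add up at the shared interface. The path and interface names are hypothetical.

```python
from fractions import Fraction


def interface_probe_rates(path_rates, paths):
    """Sum the per-path probing rates over every path crossing each interface.

    path_rates: dict path name -> probes per minute on that path.
    paths: dict path name -> set of interfaces the path traverses.
    """
    rates = {}
    for name, interfaces in paths.items():
        for iface in interfaces:
            rates[iface] = rates.get(iface, Fraction(0)) + path_rates[name]
    return rates


# Hypothetical example: three paths, all crossing interface A.
paths = {"p1": {"A", "B"}, "p2": {"A", "C"}, "p3": {"A", "D"}}
path_rates = {name: Fraction(1, 9) for name in paths}  # 1 probe every 9 mins

rates = interface_probe_rates(path_rates, paths)
print(rates["A"])  # probes per minute seen by interface A
```

Under these assumptions interface A is probed at 3 × 1/9 = 1/3 probes per minute, that is, once every 3 minutes, while each individual path is probed only once every 9 minutes.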
11 Properties of solution Failure detection problem is no longer NP-hard – Can find optimal solution using linear programming – Parameters: Duration of path-specific and fail-stop failures Needs synchronization among monitors – Monitors need to collaborate to probe an interface – Alternative probabilistic solution avoids synchronization overhead Probing cost scales almost linearly with the size of the target network – In random power-law graphs like inferred Internet graphs
12 Evaluation Paths obtained using traceroutes – From 750 PlanetLab nodes to 3,000 DNS servers – From 12 RON nodes to 60,000 targets Target networks are probed ASes – Map IPs to ASes using Mao et al.’s technique – 1,366 ASes in PlanetLab – 6,517 ASes in RON Compute probing costs varying parameters – Set of paths, failure durations, target network
13 Probing costs varying size of subscriber network in PlanetLab Duration Path-specific = 1000 sec Fail-stop = 1 sec
14 Summary Practical formulation of failure detection problem – Incorporates both fail-stop and path-specific failures Solution minimizes probing cost – Using linear programming Inferred internet graphs are among the most expensive to probe – Probing scales almost linearly with network size Next step – Deploy a system based on these probing techniques
ConnectionWatch: Passive monitoring of round-trip times at end-hosts with Diana Zeaiter Joumblatt (LIP6) Nina Taft (Intel)
16 Goal Automatic detection of performance degradations – Only care about problems that impact applications – Focus on detecting “large” round-trip times (RTT) – Detection should be fast and lightweight
17 ConnectionWatch Sniffer Extract flow ID RTT estimation High RTT detector TCP packets Upload to central server Packet Trace Ping Daemon Flow statistics Alarms
18 Insights from preliminary experiments Datasets from five students during three days – 44,715 TCP connections over 3,584 paths to 2,242 IPs Some observations – More complete measurements than ping 16.5% of 1,072 addresses don’t reply to pings – Transfer of traces to server is main bottleneck Hurdles – Portability of system to other OSes – Privacy concerns with capturing user’s traffic – Incentives for large-scale deployment
19 Which RTT variations correspond to performance degradations? Our datasets are still too small to answer – Performance degradations are rare events Simple technique based on outlier threshold – What is a good threshold? – Should the threshold be for all users, per user, per path, per app? Do outliers correspond to real performance degradations? – ConnectionWatch should get the user’s feedback: an “I’m annoyed” button
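One possible shape for an outlier threshold is sketched below. It is an assumption for illustration, not the ConnectionWatch detector: it flags RTT samples above the median plus `k` times the median absolute deviation (MAD), computed over whatever group of samples the caller passes in (all users, one user, one path, or one application). The function name, the choice of MAD, and the default `k` are all hypothetical.

```python
import statistics


def rtt_outliers(samples_ms, k=3.0):
    """Return (threshold, outliers) for a list of RTT samples in milliseconds.

    The threshold is median + k * MAD, a robust stand-in for
    "unusually large RTT" on this group of samples.
    """
    med = statistics.median(samples_ms)
    mad = statistics.median([abs(x - med) for x in samples_ms])
    threshold = med + k * mad
    outliers = [x for x in samples_ms if x > threshold]
    return threshold, outliers


# Hypothetical RTT samples for one path, in milliseconds.
samples = [20, 22, 21, 23, 20, 22, 250]
threshold, outliers = rtt_outliers(samples)
print(threshold, outliers)
```

Grouping the samples per path rather than per user changes only the input list, which makes it easy to compare the candidate granularities listed on the slide against user feedback.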
Practical issues with using network tomography for fault diagnosis with Italo Scota Cunha (LIP6, Thomson) Amogh Dhamdhere, Yiyi Huang, Nick Feamster, Constantine Dovrolis (Georgia Tech) Christophe Diot (Thomson)
21 The binary tomography solution by Duffield Given – Complete network topology – End-to-end reachability measurements Find the smallest set of links that explain observations – Assumes single-source tree, access to targets m t1 t2
22 Extending binary tomography Multi-network setting: topology not known – Periodic traceroutes determine topologies Extension to multiple-sources, multiple-targets – Minimum hitting set problem (NP-hard) Tomo: Iterative poly-time greedy heuristic – Intuition: Iteratively choose link that explains the max number of failures
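The greedy intuition stated for Tomo can be sketched as a hitting-set heuristic. This is a simplified illustration under assumptions, not the authors' code: links that appear on any working path are treated as good, and each iteration picks the remaining link that lies on the most still-unexplained failed paths. Path names and link names are hypothetical.

```python
def greedy_hitting_set(failed_paths, working_paths):
    """Return a list of links that together explain all failed paths.

    failed_paths, working_paths: dict path name -> set of links on the path.
    """
    # Any link on a working path is assumed to be working.
    good_links = set().union(*working_paths.values())
    unexplained = {name: links - good_links
                   for name, links in failed_paths.items()}
    hypothesis = []
    while unexplained:
        # Count how many unexplained failed paths each candidate link is on.
        counts = {}
        for links in unexplained.values():
            for link in links:
                counts[link] = counts.get(link, 0) + 1
        if not counts:
            break  # remaining failures have no candidate link left
        best = max(sorted(counts), key=counts.get)
        hypothesis.append(best)
        unexplained = {name: links for name, links in unexplained.items()
                       if best not in links}
    return hypothesis


# Hypothetical measurements: both paths to T1 fail, both paths to T2 work.
failed = {
    "M1->T1": {"M1-A", "A-B", "B-T1"},
    "M2->T1": {"M2-C", "C-B", "B-T1"},
}
working = {
    "M1->T2": {"M1-A", "A-B", "B-T2"},
    "M2->T2": {"M2-C", "C-B", "B-T2"},
}
suspects = greedy_hitting_set(failed, working)
print(suspects)
```

Here the only link shared by the failed paths and absent from every working path is the one in front of T1, so a single link explains both failures.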
23 Some problems Dynamics – Loss can be transient, topology can change Ambiguity – Losses are one-way, but monitors don’t always have access to both ends of the path Lack of synchronization – Different monitors see different conditions
24 Approach Transient packet loss – Triggered confirmation of failed paths Dynamic routing – Periodic snapshots of the network topology One-way losses – Algorithm based on IP spoofing Lack of synchronization – Correlation of probes from different monitors
25 Failure confirmation [Figure: packets on a path over time, showing a loss burst and a false positive] Upon detection of a failure, trigger extra probes Number of probes – Confirm failures with a target false positive rate – Assume independence and a given loss rate Time between probes – Reduce chance that probes fall on the same loss burst – Assume link losses follow a Gilbert process
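The "number of probes" reasoning can be sketched under the independence assumption named on the slide. If each probe on a working path is lost independently with probability `loss_rate`, then all `n` confirmation probes are lost with probability `loss_rate ** n`, and the smallest `n` that brings this below the target false positive rate is the number of probes to send. The function name is hypothetical, and the sketch does not model the Gilbert loss process or the spacing between probes.

```python
def probes_needed(loss_rate, target_fp):
    """Smallest n such that loss_rate ** n <= target_fp.

    Assumes independent losses: a working path is wrongly confirmed as
    failed only if every one of the n probes is lost.
    """
    if not (0 < loss_rate < 1 and 0 < target_fp < 1):
        raise ValueError("loss_rate and target_fp must be in (0, 1)")
    n = 0
    p_all_lost = 1.0
    while p_all_lost > target_fp:
        p_all_lost *= loss_rate
        n += 1
    return n


print(probes_needed(0.2, 0.01))
print(probes_needed(0.5, 0.1))
```

Bursty losses violate the independence assumption, which is why the slide pairs this count with a minimum time between probes.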
26 Disambiguating one-way losses: Spoofing Monitor sends request to spoofer to send probe Probe has IP address of the monitor If reply reaches the monitor, reverse path is working M Spoofer: Send spoofed packet with source address of M T
27 Evaluation Evaluation is challenging – Need ground truth and realistic environment Controlled experiments on the VINI testbed – Allow us to inject failures – Problem: hard to argue about false positives Experiments on Emulab – More control: dedicated nodes and links – Emulate the Abilene network – Selected LA and NY as monitors
28 Failure confirmation reduces false positives Emulab experiment setup – 10% loss rates in each direction – No persistent failures Both schemes use three probes to confirm a failure
Confirmation interval | Burst factor 90% | Burst factor 96%
Back-to-back          | 15%              | 25%
0.2 secs              | 0.8%             |
Low false positives, because an interval of 0.2 secs guarantees a small probability of probes being correlated
29 Correlation is important to get a consistent view Emulab and VINI experiments with short failures – More false positives – Lower detection rate In real deployments, can we get a consistent view? – More noise because of losses and routing dynamics – Monitors are less synchronized – Monitors may not be able to reach the coordinator Next steps – Online correlation – Minimize communication with coordinator
30 Summary Continuous monitoring for detection – At management hosts: active measurements Reduce probing overhead, still detect failures – At end-users: passive measurements Lightweight detection of problems that affect apps Network tomography for identification – Many challenges to get consistent inputs for tomography Network dynamics and transient losses Ambiguity of forward and reverse failures Monitors may observe different conditions