User-level Internet Path Diagnosis R. Mahajan, N. Spring, D. Wetherall and T. Anderson
The network is a black box… so what can I do? 1. We want users to be able to diagnose their paths. 2. Communicate information to the ISP or NOC to improve the network.
TULIP: User-level path diagnosis. Objectives: detect performance faults that affect a user’s flows. This involves measuring the magnitude of the fault (queuing delay, loss) and localizing the faulty link.
How TULIP does it. Ideal architecture (packet-based solution): each router the packet traverses adds some information to the packet: a timestamp and the global address of the router’s input interface. Issues: packet size increases at each hop. Losing a packet loses all the information it carried. Corruption of a packet might lead to incorrect diagnosis data (although most corruption is treated as loss).
Because things are never ideal. A basic architecture is sufficient for data collection. Assets: fixed packet size and sufficient information… Assumption: stationarity of paths (paths between source and destination don’t change too often).
Diagnosis tools in use in TULIP. Out-of-band measurement probes (or TTL-based search): obtain the sample TTL and interface ID. ICMP: router timestamp. IP identifiers: approximation of the per-flow counter.
How to detect path loss/reordering. Two probes are sent to determine the behavior of the remote router.
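The two-probe idea can be illustrated for reordering with a minimal sketch. It assumes the remote router stamps its replies with a 16-bit IP identifier that increases with each packet it generates, so the identifiers reveal the order in which the probes reached the router. The function names and the exact decision rule are illustrative assumptions, not taken from the slides.

```python
def ipid_after(a: int, b: int) -> bool:
    """True if 16-bit IP-ID b was generated after a, allowing wraparound."""
    return 0 < ((b - a) & 0xFFFF) < 0x8000


def classify_reordering(reply1_ipid: int, reply2_ipid: int,
                        reply1_arrived_first: bool) -> tuple:
    """Return (forward_reordered, reverse_reordered) for two probes.

    Assumption: probe 1 was sent before probe 2, and the router's IP-ID
    counter increases with every reply it generates.
    """
    # Forward path: the reply to probe 2 was generated before the reply
    # to probe 1, so the probes swapped places on the way to the router.
    forward = ipid_after(reply2_ipid, reply1_ipid)
    # Reverse path: the replies reached us in a different order than the
    # order in which the router generated them.
    router_sent_reply1_first = not forward
    reverse = reply1_arrived_first != router_sent_reply1_first
    return forward, reverse
```

For example, replies with IP-IDs 101 and 100 (for probes 1 and 2) that arrive with reply 2 first would be classified as forward reordering only.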
Packet queuing. An ICMP timestamp is used to determine the queuing delay within a router (median).
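A minimal sketch of how a median queuing delay could be derived from ICMP timestamp samples. It assumes that the smallest observed delay corresponds to an empty queue, so subtracting it cancels both the propagation delay and any constant clock offset between sender and router. The function name and the baseline choice are assumptions made for illustration.

```python
from statistics import median


def queuing_delay_ms(send_ms, router_ms):
    """Median queuing delay in ms from paired send and router timestamps.

    Assumption: the minimum sample saw no queuing, so it serves as the
    baseline (propagation delay plus a constant clock offset).
    """
    one_way = [r - s for s, r in zip(send_ms, router_ms)]
    baseline = min(one_way)
    return median(d - baseline for d in one_way)
```

With send times `[0, 1000, 2000]` and router timestamps `[5040, 6042, 7050]`, the per-sample delays above the baseline are 0, 2 and 10 ms, giving a median of 2 ms.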
The TULIP methods. To perform the measurement, TULIP uses two “scanning” methods. Binary search (reduces diagnostic traffic, but at the cost of diagnosis time). Parallel search (interleaves measurements to different routers by cycling through them in rounds).
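The binary search strategy can be sketched as follows. The sketch assumes a persistent fault and a hypothetical callback `fault_seen_up_to(hop)` that measures the path from the source to the given hop and reports whether the fault is visible; neither name comes from the slides.

```python
def locate_fault(num_hops, fault_seen_up_to):
    """Narrow a fault down to a single link by binary search over hops.

    Assumptions: the fault is visible on the full path (hop num_hops),
    absent at the source (hop 0), and visible on every prefix of the
    path that contains the faulty link.
    Returns (lo, hi): the fault lies between hop lo and hop hi.
    """
    lo, hi = 0, num_hops
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fault_seen_up_to(mid):
            hi = mid   # fault is at or before hop mid
        else:
            lo = mid   # fault is after hop mid
    return lo, hi
```

Each loop iteration costs one measurement run, which is why this strategy sends little traffic but takes several sequential runs to finish.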
Network Load and Diagnosis Time. Because routers behave in a relatively stationary way, TULIP can provide accurate results with an approximate diagnosis time of 10/30 min. The load is B/W for binary search and LB/W for parallel search (lower bound), where L is the number of measurable routers, B the bandwidth cost of the probes, and W the wait time (usually 1 s).
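As a worked example of the two load formulas, the sketch below plugs in illustrative numbers. The values B = 1000 bytes per measurement and L = 15 routers are hypothetical, chosen only to show the arithmetic; W = 1 s follows the slide.

```python
def probe_load(bandwidth_cost, wait_s, routers=1):
    """Probe load in bytes per second: routers * B / W.

    routers=1 gives the binary-search load B/W; routers=L gives the
    parallel-search load L*B/W.
    """
    return routers * bandwidth_cost / wait_s


# Hypothetical values: B = 1000 bytes per measurement, W = 1 s, L = 15.
binary_load = probe_load(1000, 1.0)
parallel_load = probe_load(1000, 1.0, routers=15)
```

Under these assumed numbers, parallel search sends L times as much traffic per second as binary search, which is the trade-off against its shorter diagnosis time.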
Diagnosing granularity. The granularity of a path is the weighted average of the lengths of its diagnosable segments. [Figure: example path, Rank(G)=2]
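A minimal sketch of the granularity computation. The slide does not state the weights, so this assumes each segment is weighted by its own length (longer segments count for more); treat that choice, and the function name, as assumptions.

```python
def granularity(segment_lengths):
    """Length-weighted average of diagnosable segment lengths, in hops.

    Assumption: each segment's weight is its own length, so the result
    is sum(len^2) / sum(len).
    """
    total = sum(segment_lengths)
    if total == 0:
        raise ValueError("path has no diagnosable segments")
    return sum(n * n for n in segment_lengths) / total
```

Under this assumption, a 10-hop path split into diagnosable segments of 4 and 6 hops has granularity (16 + 36) / 10 = 5.2 hops, while a path where every hop is diagnosable has granularity 1.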
Various granularity for different measurements. 50% of the paths have a granularity of less than 3 hops (75% less than 4). TULIP matches an ideal tomography implementation.
Validation. Results were compared with PlanetLab coupled with a tomography system. The measure used is the “rate delta”: the rate at the far end of a segment minus the rate at the near end. Negative values imply a lack of consistency (values span too large a range).
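The rate-delta check can be sketched as below. The function names and the sample rates are illustrative assumptions; only the definition (far-end rate minus near-end rate, with negative values counted as inconsistent) comes from the slide.

```python
def rate_deltas(near_rates, far_rates):
    """Rate delta per segment: far-end rate minus near-end rate."""
    return [far - near for near, far in zip(near_rates, far_rates)]


def consistent_fraction(deltas):
    """Fraction of segments whose delta is non-negative (consistent)."""
    return sum(1 for d in deltas if d >= 0) / len(deltas)


# Hypothetical loss rates measured at the near and far end of 3 segments.
deltas = rate_deltas([0.01, 0.02, 0.05], [0.03, 0.02, 0.01])
fraction = consistent_fraction(deltas)  # two of three are non-negative
```

A fault can only accumulate along a path, so a far-end rate lower than the near-end rate signals a measurement inconsistency rather than a real improvement.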
Reordering results. 85% of the results are consistent for the forward path and 75% for the round trip (due to the asymmetric nature of some paths).
Loss results. Again, 85% of the deltas are non-negative. The round-trip counterpart is less affected by asymmetry than in the reordering diagnosis (because loss usually occurs close to the destination).
Queuing results. ICMP message generation has poor timestamp resolution (the two medians are within 2 ms of each other: one from tcpdump on PlanetLab and one from TULIP). The forward path shows that queuing delay is consistent (very few negative values). The round trip reflects the variability in the return path.
The last mile… The first hops from the user are the bottleneck.
Persistence of a fault. We check for how many iterations TULIP yields similar results. On 80% of the paths, faults persist long enough for TULIP to diagnose them (a binary search typically takes 6 runs to locate a fault).
Conclusions. Network operators would be able to diagnose links efficiently. And a user too … if the world were populated entirely by computer nerds.
Issues… Multiple TULIP users could reduce the accuracy of the probing method, the per-flow counter. An application doesn’t experience the network the same way an active measurement does (TCP and application dependent, as well as flags).
…and possible improvements. A per-flow counter at the router level (unrealistic). Hash the source address and IPID (for flow). ICMP timestamps have a reception time as well as a transmission time (this allows calculating how long the packet is processed at the router).
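The last point can be illustrated with a minimal sketch that parses an ICMP Timestamp Reply (type 14) and takes the difference between its receive and transmit fields. The field layout follows the standard ICMP timestamp message; the function name is hypothetical, and the sketch assumes the router fills the two fields with distinct, meaningful values.

```python
import struct

MS_PER_DAY = 86_400_000  # ICMP timestamps count milliseconds since midnight UT


def processing_delay_ms(icmp_message: bytes) -> int:
    """Router processing delay from an ICMP Timestamp Reply, in ms.

    Layout: type, code, checksum, identifier, sequence number, then the
    originate, receive and transmit timestamps (32 bits each).
    """
    (icmp_type, _code, _checksum, _ident, _seq,
     _originate, receive, transmit) = struct.unpack("!BBHHHIII", icmp_message[:20])
    if icmp_type != 14:
        raise ValueError("not an ICMP Timestamp Reply")
    # The modulo handles a reply that straddles midnight UT.
    return (transmit - receive) % MS_PER_DAY
```

The message passed in is the ICMP portion only, with the IP header already stripped.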