Internet monitoring is essential

Making Network Tomography Practical
Renata Teixeira, Laboratoire LIP6, CNRS and UPMC Paris Universitas

Internet monitoring is essential
For network operators: monitor service-level agreements, troubleshoot failures, and diagnose anomalous behavior. For users or content/application providers: verify network performance.

Challenge 1: Nobody controls end-to-end path
Network operators only have data from one AS. End-hosts can only monitor end-to-end paths. [Figure: four interconnected ASes, labeled AS1 to AS4.]

Challenge 2: Available data not direct
Network operators ask: Is my network performance good? Is there a problem? Where? They only have per-link counts or active probes, and there may be no alarm. Users and applications ask: Is my provider’s performance good? They only have end-to-end delay and loss.

Network tomography to the rescue
Network tomography is the inference of unknown network properties from measurable ones. It relies on sophisticated inference algorithms: given a model and the available measurements, apply statistical inference (maximum likelihood estimation, Bayesian inference) to estimate properties. Unfortunately, practical deployment is limited because measuring the required inputs is difficult.

This tutorial
Monitoring techniques to make network tomography practical.

Outline
Examples of network tomography problems. Case study: fault diagnosis. Fault detection: continuous path monitoring. Fault identification: binary tomography, correlated path reachability, topology measurements. Open issues.

Network tomography problems
Estimation of a network’s traffic matrix: given the total traffic on network links, what is the traffic between the network’s entry and exit points? Inference of link performance: given end-to-end probes, what is the loss rate or delay of a link? Inference of network topology: given end-to-end loss measurements, what is the logical network topology?

Inference of link performance
What are the properties of network links (loss rate, delay, bandwidth, connectivity), given only end-to-end measurements and no access to routers? [Figure: nodes A to F attached to AS 1 and AS 2.]

Multicast-based Inference of Network-internal Characteristics
Measurements: multicast probes, with traces collected at receivers. Inference: exploit correlation in the traces to estimate link properties. Introduced by the MINC project. [Figure: a probe sender multicasts to several probe collectors.]

Inferring link loss rates
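Assumptions: known, logical-tree topology; losses are independent; multicast probes. Methodology: maximum likelihood estimates for the link success probabilities αk. [Figure: sender m at the root of a logical tree with link success probabilities α1, α2, α3 and receivers t1 and t2; the traces of received probes (1 = received) yield estimates of α1, α2 and α3.]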
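
As an illustration of how correlation in receiver traces yields per-link estimates, here is a minimal sketch for the two-receiver tree. It assumes that α1 is the shared link from m to the branching point, that α2 and α3 are the links to t1 and t2, and that losses are independent; the function name and the trace format are hypothetical. Under those assumptions, the fraction of probes seen by t1 estimates α1·α2, the fraction seen by t2 estimates α1·α3, and the fraction seen by both estimates α1·α2·α3, so the three link probabilities follow by division.

```python
from typing import Sequence, Tuple


def estimate_success_probs(
    recv1: Sequence[int], recv2: Sequence[int]
) -> Tuple[float, float, float]:
    """Estimate (alpha1, alpha2, alpha3) on a two-leaf logical tree.

    recv1[i] and recv2[i] are 1 if multicast probe i reached t1 / t2, else 0.
    Assumed labeling: alpha1 = shared link, alpha2 = link to t1,
    alpha3 = link to t2.
    """
    if len(recv1) != len(recv2) or not recv1:
        raise ValueError("traces must be non-empty and of equal length")
    n = len(recv1)
    g1 = sum(recv1) / n                                   # ~ alpha1 * alpha2
    g2 = sum(recv2) / n                                   # ~ alpha1 * alpha3
    g12 = sum(a & b for a, b in zip(recv1, recv2)) / n    # ~ alpha1 * alpha2 * alpha3
    if g12 == 0:
        raise ValueError("no probe reached both receivers; cannot estimate")
    alpha1 = g1 * g2 / g12
    alpha2 = g12 / g2
    alpha3 = g12 / g1
    return alpha1, alpha2, alpha3


# Ten probes: t1 receives eight of them, t2 receives six, both receive six.
trace_t1 = [1] * 8 + [0] * 2
trace_t2 = [1] * 6 + [0] * 4
print(estimate_success_probs(trace_t1, trace_t2))
```

With real traces the estimates are noisy, and an estimate can exceed 1 when few probes are sent; a practical estimator would clamp or use more probes.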

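Binary tomography
Labels links as good or bad. Loss rate estimation requires tight correlation; instead, binary tomography only separates good from bad performance. If a link is bad, all paths that cross the link are bad. [Figure: tree rooted at m with link success probabilities α1, α2, α3 and receivers t1 and t2; one path is marked bad and the other good.]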

Single-source tree
“Smallest Consistent Failure Set” algorithm. Assumes a single-source tree and a known topology. Finds the smallest set of links that explains the bad paths, given that bad links are uncommon: a bad link is the root of a maximal bad subtree. [Figure: tree rooted at m with receivers t1 and t2; one path is marked bad and the other good.]
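
The "root of a maximal bad subtree" rule can be sketched as a top-down walk of the tree. This is a minimal illustration, not the published algorithm's code: the tree is given as a hypothetical child-to-parent map, each link is named after its child endpoint, and the input is the set of targets whose paths are bad.

```python
from typing import Dict, List, Set


def smallest_consistent_failure_set(
    parent: Dict[str, str], bad_leaves: Set[str]
) -> Set[str]:
    """Return the links (named by their child node) that root maximal bad subtrees."""
    children: Dict[str, List[str]] = {}
    for child, par in parent.items():
        children.setdefault(par, []).append(child)

    def all_bad(node: str) -> bool:
        # A subtree is bad when every target (leaf) below it is bad.
        kids = children.get(node, [])
        if not kids:
            return node in bad_leaves
        return all(all_bad(k) for k in kids)

    # The root is the only node that never appears as a child.
    root = (set(parent.values()) - set(parent.keys())).pop()
    failed: Set[str] = set()
    stack = list(children.get(root, []))
    while stack:
        node = stack.pop()
        if all_bad(node):
            failed.add(node)          # blame the link entering this subtree
        else:
            stack.extend(children.get(node, []))  # look further down
    return failed


tree = {"r": "m", "t1": "r", "t2": "r"}   # m -> r -> {t1, t2}
print(smallest_consistent_failure_set(tree, {"t1"}))        # only the link to t1
print(smallest_consistent_failure_set(tree, {"t1", "t2"}))  # the shared link m -> r
```

When both targets are bad, the sketch blames the single shared link rather than the two leaf links, which is the "smallest set" preference stated on the slide.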

Binary tomography with multiple sources and targets
The problem becomes NP-hard: it is a minimum hitting set problem, where the hitting set of a link is the set of paths that traverse the link. Iterative greedy heuristic: given the set of links in bad paths, iteratively choose the link that explains the maximum number of bad paths. Promising for fault identification. [Figure: monitors m1 and m2 probing targets t1 and t2.]
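
The greedy heuristic can be sketched as follows. The path names and link names are hypothetical, and one step is an added assumption that the slide does not state: links that also appear on a good path are excluded from the candidates, which only makes sense when a bad link breaks every path that crosses it.

```python
from typing import Dict, List, Set


def greedy_bad_links(paths: Dict[str, Set[str]], bad_paths: Set[str]) -> List[str]:
    """Greedily pick links until every bad path is explained (or no candidate helps)."""
    # Assumption: a link seen on a good path cannot be the faulty one.
    good_links: Set[str] = set()
    for name, links in paths.items():
        if name not in bad_paths:
            good_links |= links
    candidates: Set[str] = set()
    for name in bad_paths:
        candidates |= paths[name] - good_links

    unexplained = set(bad_paths)
    chosen: List[str] = []
    while unexplained and candidates:
        def explained(link: str) -> int:
            return sum(1 for name in unexplained if link in paths[name])

        # sorted() makes tie-breaking deterministic (alphabetical).
        best = max(sorted(candidates), key=explained)
        if explained(best) == 0:
            break
        chosen.append(best)
        unexplained = {name for name in unexplained if best not in paths[name]}
        candidates.discard(best)
    return chosen


paths = {
    "m1-t1": {"a", "c", "d"},
    "m1-t2": {"a", "c", "e"},
    "m2-t1": {"b", "c", "d"},
    "m2-t2": {"b", "c", "e"},
}
print(greedy_bad_links(paths, {"m1-t1", "m2-t1"}))  # ['d']
```

Like any greedy hitting-set heuristic, this gives no optimality guarantee; it only returns a small set of links that is consistent with the observations.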

Practical issues
Topology is often unknown: need to measure an accurate topology. Limited deployment of multicast: need to extract correlation from unicast probes, even using probes from different monitors. Control of targets is not always practical: need one-way performance from round-trip probes. Links can fail for some paths, but not all: need to extend tomography algorithms.

Steps of fault diagnosis
Detection: continuous path monitoring. Identification: binary tomography. [Figure: four interconnected ASes, labeled AS1 to AS4.]

Fault detection

Detection techniques
Active probing with ping: send a probe and collect the response; requires no control of targets. Passive analysis of user’s traffic: tcpdump taps all incoming and outgoing packets, which allows monitoring of TCP connections.

Detection with ping
Monitor m sends a probe (ICMP echo request) to target t, which answers with a reply (ICMP echo reply). If m receives the reply, then the path is good. If there is no reply before the timeout, then the path is bad.

Persistent failure or measurement noise?
There are many reasons to lose a probe or a reply: the timeout may be too short, routers may apply rate limiting, some end-hosts don’t respond to ICMP requests, congestion may be transient, or the route may change. Need to confirm that the failure is persistent; otherwise, detection may trigger false alarms.

Failure confirmation
Upon detection of a failure, trigger extra probes. Goal: minimize detection errors, by sending more probes or by waiting longer between probes. Tradeoff: detection error versus detection time. [Figure: timeline of packets on a path with a loss burst, illustrating a detection error.]
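
A minimal sketch of failure confirmation, assuming a hypothetical `probe` callable that returns True when a reply arrives before the timeout (for example, a wrapper around ping). The number of extra probes and the interval are the two knobs of the tradeoff above: raising either lowers the chance of a false alarm but delays detection.

```python
import time
from typing import Callable


def confirm_failure(
    probe: Callable[[], bool],
    extra_probes: int = 3,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True only if every extra probe also fails (failure is persistent).

    Called after a first probe has already timed out. `sleep` is injectable
    so that the logic can be exercised without waiting.
    """
    for _ in range(extra_probes):
        sleep(interval_s)      # space the probes to ride out short loss bursts
        if probe():
            return False       # a reply arrived: treat the loss as noise
    return True


# Simulated path that recovers on the third extra probe.
replies = iter([False, False, True])
print(confirm_failure(lambda: next(replies), extra_probes=3, interval_s=0.0))  # False
```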

Passive detection
tcpdump captures all packets. Track the status of each TCP connection: RTTs, timeouts, retransmissions. Multiple timeouts indicate that the path is bad. If the current sequence number > last sequence number seen, the path is good. If the current sequence number = last sequence number seen, a timeout has occurred. After four timeouts, declare the path as bad.
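
The sequence-number rule can be sketched as a small per-connection state machine. The class name and state labels are hypothetical. Two simplifications are assumptions not stated on the slide: the timeout counter is reset whenever the connection makes progress, and sequence-number wraparound is ignored.

```python
from typing import Optional


class ConnectionMonitor:
    """Tracks one TCP connection from the sequence numbers of its captured segments."""

    def __init__(self, max_timeouts: int = 4) -> None:
        self.last_seq: Optional[int] = None
        self.timeouts = 0
        self.max_timeouts = max_timeouts

    def observe(self, seq: int) -> str:
        """Feed the sequence number of the next outgoing segment; return path status."""
        if self.last_seq is None or seq > self.last_seq:
            # New data is flowing: the path is good.
            self.last_seq = seq
            self.timeouts = 0      # assumption: progress resets the counter
            return "good"
        if seq == self.last_seq:
            # Same sequence number again: a retransmission after a timeout.
            self.timeouts += 1
        return "bad" if self.timeouts >= self.max_timeouts else "suspect"


monitor = ConnectionMonitor()
print([monitor.observe(s) for s in [100, 200, 200, 200, 200, 200]])
# ['good', 'good', 'suspect', 'suspect', 'suspect', 'bad']
```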

Passive vs. active detection
Passive detection, advantages: no need to inject traffic; detects all failures that affect user’s traffic; gets responses from targets that don’t respond to ping. Passive detection, drawbacks: not always possible to tap user’s traffic; only detects failures in paths with traffic. Active detection, advantages: no need to tap user’s traffic; detects failures in any desired path. Active detection, drawbacks: probing overhead, which grows with the need to cover a large number of paths and to detect failures fast.

Active monitoring: reducing probing overhead
Goal: detect failures of any of the interfaces in the target network with minimum probing overhead. [Figure: monitors M1 and M2 probe target hosts, including T3, across A, B, C and D in the target network.]

Simple solution: Coverage problem
Instead of probing all paths, select the minimum set of paths that covers all interfaces in the subscriber’s network. This coverage problem is NP-hard. Solution: greedy set-cover heuristic.
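
A minimal sketch of the greedy set-cover heuristic for path selection. The path names and the interfaces each path crosses are hypothetical; the sketch repeatedly picks the path that covers the most interfaces not yet covered.

```python
from typing import Dict, List, Set


def select_paths(paths: Dict[str, Set[str]]) -> List[str]:
    """Greedy set cover: choose paths until every interface is on a chosen path."""
    uncovered: Set[str] = set().union(*paths.values())
    selected: List[str] = []
    while uncovered:
        # sorted() makes tie-breaking deterministic (alphabetical).
        best = max(sorted(paths), key=lambda p: len(paths[p] & uncovered))
        selected.append(best)
        uncovered -= paths[best]
    return selected


paths = {
    "M1-T1": {"A", "C"},
    "M1-T3": {"A", "D"},
    "M2-T1": {"B", "C"},
    "M2-T3": {"B", "D"},
}
print(select_paths(paths))  # ['M1-T1', 'M2-T3']
```

In this toy topology two of the four paths are enough to cross every interface, so half of the probes can be saved.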

Coverage solution doesn’t detect all types of failures
Detects fail-stop failures: failures that affect all packets that traverse the faulty interface, e.g., interface or router crashes, fiber cuts, bugs. But does not detect path-specific failures: failures that affect only a subset of the paths that cross the faulty interface, e.g., router misconfigurations.

New formulation of failure detection problem
Select the frequency at which to probe each path. Lower-frequency per-path probing can still achieve high-frequency probing of each interface. [Figure: monitors M1 and M2 probe targets T1, T2 and T3 across A, B, C and D; the labels read “1 every 9 mins” and “1 every 3 mins”.]
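
The arithmetic behind this formulation can be sketched as follows: an interface is probed each time any path that crosses it is probed, so its probing rate is the sum of the rates of those paths. The topology below is hypothetical and chosen only to reproduce the two figures on the slide; three paths probed once every 9 minutes that share an interface probe that interface once every 3 minutes on average.

```python
from fractions import Fraction
from typing import Dict, Set


def interface_probe_rates(
    path_interfaces: Dict[str, Set[str]], path_rate: Dict[str, Fraction]
) -> Dict[str, Fraction]:
    """Sum, for each interface, the probing rates of the paths that cross it."""
    rates: Dict[str, Fraction] = {}
    for path, interfaces in path_interfaces.items():
        for iface in interfaces:
            rates[iface] = rates.get(iface, Fraction(0)) + path_rate[path]
    return rates


# Hypothetical paths; rates are in probes per minute.
path_interfaces = {
    "M1-T1": {"A", "D"},
    "M1-T2": {"A", "D"},
    "M2-T3": {"B", "D"},
}
path_rate = {name: Fraction(1, 9) for name in path_interfaces}
rates = interface_probe_rates(path_interfaces, path_rate)
print(rates["D"])  # 1/3, i.e. one probe every 3 minutes
```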

Is failure in forward or reverse path?
Paths can be asymmetric, for example because of load balancing or hot-potato routing. [Figure: the probe sent by m and the reply returning to m follow different paths.]

Disambiguating one-way losses: Spoofing
The monitor m asks a spoofer to send a probe. The spoofer sends a spoofed packet whose source address is the IP address of m. If the reply reaches the monitor, the reverse path is good.

Summary: Fault detection
Techniques to measure path reachability: active probing (ping + failure confirmation) and passive analysis of TCP connections. Reducing the overhead of active monitoring: select the set of paths to probe; trade-off between the set of paths and the probing frequency. With no control of targets, only round-trip measurements are available; spoofing differentiates forward from reverse failures.

Fault identification: correlated path reachability

Uncorrelated measurements lead to errors
Lack of synchronization leads to inconsistencies: probes cross links at different times, and a path may change between probes. [Figure: m probes t1 and t2 at different times, which leads to a mistakenly inferred failure.]

Sources of inconsistencies
In measurements from a single monitor: probing all targets can take time. In measurements from multiple monitors: it is hard to synchronize monitors so that all probes reach a link at the same time, and impossible to generalize this to all links.

Inconsistent measurements with multiple monitors
[Figure: monitors m1 … mK probe targets t1 … tN.] Path reachability from m1: (m1, t1) good, …, (m1, tN) good. Path reachability from mK: (mK, t1) good, …, (mK, tN) bad. The measurements to tN are inconsistent.

Solution: Reprobe paths after failure
[Figure: monitors m1 … mK probe targets t1 … tN.] Path reachability after reprobing, from m1: (m1, t1) good, …, (m1, tN) bad. From mK: (mK, t1) good, …, (mK, tN) bad. Consistency has a cost: it delays fault identification, and short failures cannot be identified.

Summary: Correlated measurements
Correlation is essential to tomography: lack of correlation leads to false alarms. Correlation is hard with unicast probes: probing multiple targets takes time, and multiple monitors cannot probe a link simultaneously. Solution: probe paths again after fault detection. Trade-off: consistency vs. detection speed.

Fault identification: accurate topology

Measuring router topology
With access to routers (“from inside”): topology of one network, obtained with routing monitors (OSPF or IS-IS). With no access to routers (“from outside”): multi-AS topology or topology seen from end-hosts, obtained by monitors that issue active probes with traceroute.

Topology from inside
Routing protocols flood the state of each link: they periodically refresh link state and report any changes (link down, link up, cost change). A monitor listens to link-state messages and acts as a regular router, e.g., AT&T’s OSPFmon or Sprint’s PyRT for IS-IS. Combining link states gives the topology, which is easy to maintain because messages report any changes.

Inferring a path from outside: traceroute
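Actual path: m, router A (interfaces A.1 and A.2), router B (interfaces B.1 and B.2), t. The probe with TTL = 1 triggers a “TTL exceeded” message from A.1, and the probe with TTL = 2 triggers one from B.1. Inferred path: m, A.1, B.1, t.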
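
The TTL mechanism can be sketched with a small simulation instead of real packets (sending real probes needs raw sockets and privileges). The function names are hypothetical; each router is represented by the interface that receives the probe, which is the address that appears in the “TTL exceeded” message.

```python
from typing import List


def send_probe(ingress_interfaces: List[str], target: str, ttl: int) -> str:
    """Forward one probe hop by hop and return the address that answers it."""
    for interface in ingress_interfaces:
        ttl -= 1                 # each router decrements the TTL
        if ttl == 0:
            return interface     # "TTL exceeded" sent from the receiving interface
    return target                # the probe reached the target


def traceroute(ingress_interfaces: List[str], target: str, max_ttl: int = 30) -> List[str]:
    """Send probes with TTL = 1, 2, ... until the target itself answers."""
    inferred: List[str] = []
    for ttl in range(1, max_ttl + 1):
        responder = send_probe(ingress_interfaces, target, ttl)
        inferred.append(responder)
        if responder == target:
            break
    return inferred


print(traceroute(["A.1", "B.1"], "t"))  # ['A.1', 'B.1', 't']
```

The output lists A.1 and B.1 but never A.2 or B.2: traceroute only reveals the interface that faces the monitor.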

A traceroute path can be incomplete
Load balancing is widely used, but traceroute only probes one path. Sometimes traceroute has no answer (stars), because of ICMP rate limiting or anonymous routers. Tunnelling (e.g., MPLS) may hide routers: routers inside the tunnel may not decrement the TTL.

Traceroute under load balancing
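Actual path: from m, load balancer L forwards packets either through A and C or through B and D; both branches merge at E, which leads to t. The probes with TTL = 2 and TTL = 3 may take different branches. Inferred path: it misses nodes and links and contains a false link between nodes that are not adjacent.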

Errors happen even under per-flow load balancing
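Traceroute uses the destination port as the probe identifier, and per-flow load balancers use the destination port as part of the flow identifier, so probes with different TTLs can belong to different flows. [Figure: the probe with TTL = 2 uses port 2 and the probe with TTL = 3 uses port 3; they take different branches between L and E.]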

Paris traceroute
Solves the problem with per-flow load balancing. Probes to a destination belong to the same flow. Changes the location of the probe identifier: uses the UDP checksum instead of the destination port. [Figure: the probes with TTL = 2 and TTL = 3 both use port 1 and carry checksums 2 and 3; they follow the same branch between L and E.]
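
The difference between the two probing strategies can be sketched with a toy simulation. Everything here is an illustrative assumption: the two branches behind load balancer L, the "hash" (parity of the destination port), and the port numbers. Classic traceroute changes the destination port with every probe, while the Paris-style trace keeps it constant and would carry the probe identifier in the UDP checksum, which the load balancer does not hash.

```python
from typing import Callable, List

UPPER = ["L", "A", "C", "E", "t"]   # one branch behind load balancer L
LOWER = ["L", "B", "D", "E", "t"]   # the other branch


def route(dest_port: int) -> List[str]:
    """Per-flow load balancing: toy hash that uses only the destination port."""
    return UPPER if dest_port % 2 == 0 else LOWER


def trace(port_for_ttl: Callable[[int], int]) -> List[str]:
    """Return the node that answers each probe, for TTL = 1 .. path length."""
    hops: List[str] = []
    for ttl in range(1, len(UPPER) + 1):
        path = route(port_for_ttl(ttl))
        hops.append(path[ttl - 1])
    return hops


classic = trace(lambda ttl: 33434 + ttl)   # destination port changes per probe
paris = trace(lambda ttl: 33434)           # destination port stays constant
print(classic)  # ['L', 'A', 'D', 'E', 't']  -> false link A-D
print(paris)    # ['L', 'A', 'C', 'E', 't']  -> one real path
```

The constant-port trace reports a real path, but still only one of the two; discovering all branches needs additional probing with other flow identifiers.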

Topology from traceroutes
Inferred nodes are interfaces, not routers. Coverage depends on the monitors and targets: the inferred topology misses links and routers, and some links and routers appear multiple times. [Figure: actual topology with monitors m1 and m2, routers A, B, C and D, and targets t1 and t2, next to the inferred topology made of interfaces A.1, B.3, C.1, C.2 and D.1.]

Alias resolution: Map interfaces to routers
Direct probing: probe an interface, which may elicit a response from another interface. Responses from the same router will have close IP identifiers and the same TTL. Record-route IP option: records up to nine IP addresses of routers in the path. [Figure: in the inferred topology (D.1, A.1, C.1, C.2, B.3), interfaces that belong to the same router are grouped.]
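
The "close IP identifiers" test can be sketched as follows. This is a simplified illustration, not a specific tool's implementation: it assumes the two interfaces were probed alternately (a, b, a, b, …), that a router stamps all its responses from one shared 16-bit counter, and that a gap of at most 200 counts as "close". The function name and the threshold are hypothetical, and the TTL comparison mentioned on the slide is left out.

```python
from typing import List


def likely_aliases(ids_a: List[int], ids_b: List[int], max_gap: int = 200) -> bool:
    """Check whether interleaved IP IDs look like they come from one shared counter."""
    # Rebuild the order in which the responses were elicited: a0, b0, a1, b1, ...
    sequence = [ip_id for pair in zip(ids_a, ids_b) for ip_id in pair]
    if len(sequence) < 2:
        return False
    for previous, current in zip(sequence, sequence[1:]):
        gap = (current - previous) % 65536   # the IP ID field is 16 bits wide
        if gap == 0 or gap > max_gap:
            return False
    return True


print(likely_aliases([100, 104, 109], [102, 106, 111]))   # True: one counter
print(likely_aliases([100, 104], [40000, 40003]))          # False: two counters
```

Routers that use random or per-interface IP IDs defeat this test, so a negative answer does not prove that two interfaces are on different routers.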

Large-scale topology measurements
Probing a large topology takes time. E.g., probing 1200 targets from PlanetLab nodes takes 5 minutes on average (using 30 threads). Probing more targets covers more links, but getting a topology snapshot takes longer. A snapshot may be inaccurate because paths may change during the snapshot. It is hard to keep the topology up to date: to know that a path changed, the monitor needs to re-probe.

Faster topology snapshots
Probing redundancy: intra-monitor and inter-monitor. Doubletree combines backward and forward probing to eliminate redundancy. [Figure: monitors m1 and m2, routers A, B, C and D, targets t1 and t2.]

Summary of techniques to measure topology
Routing messages: complete and accurate, but need access to routers. Combining traceroutes: anyone can use it, with no privileged access to routers, but it yields false or missing links and nodes. Topologies for tomography carry some uncertainties: multiple topologies close to the time of an event, and multiple paths between a monitor and a target.

Open issues
Fault detection: How to detect faults or performance degradations that impact end-users? What is the overhead and speed of large-scale deployments? Will spoofing work in large-scale deployments? Fault identification: How to keep the topology up-to-date for fast identification? Do we need new tomography techniques to cope with partial failures? Could inference be easier with cooperation from routers?

References

Network tomography theory
Survey on network tomography: R. Castro, M. Coates, G. Liang, R. Nowak, and B. Yu, “Network Tomography: Recent Developments”, Statistical Science, Vol. 19, No. 3, 2004. Traffic matrix estimation: Y. Vardi, “Network Tomography: Estimating Source-Destination Traffic Intensities from Link Data”, Journal of the American Statistical Association, Vol. 91, 1996. Inference of link performance/connectivity (MINC project): A. Adams et al., “The Use of End-to-end Multicast Measurements for Characterizing Internal Network Behavior”, IEEE Communications Magazine, May 2000.

Binary tomography
Single-source tree algorithm: N. Duffield, “Network Tomography of Binary Network Performance Characteristics”, IEEE Transactions on Information Theory, 2006. Applying tomography in one network: R. R. Kompella, J. Yates, A. Greenberg, A. C. Snoeren, “Detection and Localization of Network Blackholes”, IEEE INFOCOM, 2007. Applying tomography in a multiple-network topology: A. Dhamdhere, R. Teixeira, C. Dovrolis, and C. Diot, “NetDiagnoser: Troubleshooting network unreachabilities using end-to-end probes and routing data”, CoNEXT, 2007.

Topology from inside
IS-IS monitoring: R. Mortier, “Python Routeing Toolkit (`PyRT')”. OSPF monitoring: A. Shaikh and A. Greenberg, “OSPF Monitoring: Architecture, Design and Deployment Experience”, NSDI, 2004. Commercial products: Packet Design.

Topology with traceroute
Tracing accurate paths under load balancing: B. Augustin et al., “Avoiding traceroute anomalies with Paris traceroute”, IMC, 2006. Reducing overhead to trace the topology of a network, and alias resolution with direct probing: N. Spring, R. Mahajan, and D. Wetherall, “Measuring ISP Topologies with Rocketfuel”, SIGCOMM, 2002. Use of record route to obtain more accurate topologies: R. Sherwood, A. Bender, N. Spring, “DisCarte: A Disjunctive Internet Cartographer”, SIGCOMM, 2008. Reducing overhead to trace a multi-network topology: B. Donnet, P. Raoult, T. Friedman, and M. Crovella, “Efficient Algorithms for Large-Scale Topology Discovery”, SIGMETRICS, 2005.

Reducing overhead of active fault detection
Selection of paths to probe: H. Nguyen and P. Thiran, “Active measurement for multiple link failures diagnosis in IP networks”, PAM, 2004; Y. Bejerano and R. Rastogi, “Robust monitoring of link delays and faults in IP networks”, INFOCOM, 2003. Selection of the frequency to probe paths: H. X. Nguyen, R. Teixeira, P. Thiran, and C. Diot, “Minimizing Probing Cost for Detecting Interface Failures: Algorithms and Scalability Analysis”, INFOCOM, 2009.

Internet-wide fault detection systems
Detection with BGP monitoring plus continuous pings, spoofing to disambiguate one-way failures, traceroute to locate faults: E. Katz-Bassett, H. V. Madhyastha, J. P. John, A. Krishnamurthy, D. Wetherall, T. Anderson, “Studying Black Holes in the Internet with Hubble”, NSDI, 2008. Detection with passive monitoring of traffic of peer-to-peer systems or content distribution networks, traceroutes to locate faults: M. Zhang, C. Zhang, V. Pai, L. Peterson, and R. Wang, “PlanetSeer: Internet Path Failure Monitoring and Characterization in Wide-Area Services”, OSDI, 2004.
