Network Tomography for Fault Diagnosis Renata Teixeira LIP6 Computer Laboratory CNRS and UPMC Paris Universitas.

Presentation transcript:

Network Tomography for Fault Diagnosis Renata Teixeira LIP6 Computer Laboratory CNRS and UPMC Paris Universitas

1 The Internet is great, but problems happen
[Figure: LIP6 network, Net1, Net2, Net3]
How to automatically detect and identify problems?
Is my connection OK? Is it Google? Is the problem in one of the networks on the path?

2 Current alarms are not enough
• Network equipment already has many alarms
  – SNMP traps
  – Anomaly detection systems
• But alarms may not reflect the user’s experience
  – Hard to map users’ complaints to alarms
  – The user’s problem may not appear as an alarm
• Network admins often resort to active measurements
  – Active monitoring servers inside their network
  – Subscription to third-party monitoring services, e.g., Keynote or RIPE TTM

3 End-hosts can collaborate to troubleshoot problems
[Figure: LIP6 network, Net1, Net2, Net3]
Detection: continuous path monitoring
Identification: tomography

4 End-host troubleshooting in two different contexts
• Network admins deploy monitoring services
  – Verify the performance of their networks
  – Assist in troubleshooting
• End-users can collaborate
  – Identify and bypass problems
  – Rank providers

5 Detection techniques
• For network admins
  – Deploy dedicated monitors
  – Need to inject probes to measure paths
• For end-users
  – Monitoring at end-users’ machines
  – Tapping users’ traffic is promising
Challenge: cannot continuously overload the network or the end-user’s machine to detect faults

Minimizing probing cost for detecting interface failures: Algorithms and scalability analysis
with Hung X. Nguyen (Univ. of Adelaide), Patrick Thiran (EPFL), Christophe Diot (Thomson)

7 Active monitoring system to detect faults
[Figure: monitors M1, M2; target hosts T1, T2, T3; target network with A, B, C, D]
Goal: detect failures of any of the interfaces in the subscriber’s network with minimum probing overhead

8 Simple solution: Coverage problem
[Figure: M1, M2, T1, T2, T3, A, B, C, D]
• Instead of probing all paths, select the minimum set of paths that covers all interfaces in the subscriber’s network
• Coverage problem is NP-hard
  – Solution: greedy set-cover heuristic
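
The greedy set-cover heuristic named on this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `greedy_path_cover` is hypothetical, and the interface sets assigned to each path are invented for the example (only the labels M1, M2, T1 to T3, and A to D appear in the slide's figure).

```python
def greedy_path_cover(paths):
    """Pick paths until every interface seen on any path is covered.

    paths: dict mapping a path name to the set of interfaces it traverses.
    Returns the list of selected path names, in selection order.
    """
    uncovered = set().union(*paths.values())
    selected = []
    while uncovered:
        # Choose the path covering the most still-uncovered interfaces;
        # iterating in sorted order makes tie-breaking deterministic.
        best = max(sorted(paths), key=lambda p: len(paths[p] & uncovered))
        selected.append(best)
        uncovered -= paths[best]
    return selected


# Hypothetical topology: these interface sets are assumptions for illustration.
example_paths = {
    "M1->T1": {"A", "B"},
    "M1->T2": {"A", "C"},
    "M2->T2": {"C", "D"},
    "M2->T3": {"D"},
}
chosen = greedy_path_cover(example_paths)
print(chosen)  # ['M1->T1', 'M2->T2']
```

Here two of the four candidate paths are enough to touch every interface, so only those two would need to be probed to catch fail-stop failures.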

9 Coverage solution doesn’t detect all types of failures
• Detects fail-stop failures
  – Failures that affect all packets that traverse the faulty interface, e.g., interface or router crashes, fiber cuts, bugs
• But not path-specific failures
  – Failures that affect only a subset of the paths that cross the faulty interface, e.g., router misconfigurations

10 New formulation of failure detection problem
• Select the frequency to probe each path
  – Lower-frequency per-path probing can achieve high-frequency probing of each interface
[Figure: M1, M2, T1, T2, T3, A, B, C, D; labels “1 every 9 mins” and “1 every 3 mins”]
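
The arithmetic behind the figure labels can be illustrated with a short sketch. It assumes (this is not stated on the slide) that three paths cross the same interface and that the aggregate probing rate of an interface is the sum of the rates of the paths crossing it; the path names and interface sets are hypothetical. The probes arrive evenly spaced only if the monitors stagger them.

```python
from fractions import Fraction


def interface_probe_rates(path_rates, path_interfaces):
    """Sum per-path probing rates over the paths that cross each interface."""
    rates = {}
    for path, rate in path_rates.items():
        for iface in path_interfaces[path]:
            rates[iface] = rates.get(iface, Fraction(0)) + rate
    return rates


# Hypothetical: three paths share interface "A"; each is probed once per 9 min.
path_interfaces = {"p1": {"A", "B"}, "p2": {"A", "C"}, "p3": {"A", "D"}}
path_rates = {p: Fraction(1, 9) for p in path_interfaces}

rates = interface_probe_rates(path_rates, path_interfaces)
print(rates["A"])      # 1/3 probes per minute
print(1 / rates["A"])  # 3, i.e. one probe every 3 minutes on interface A
```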

11 Properties of solution
• Failure detection problem is no longer NP-hard
  – Can find optimal solution using linear programming
  – Parameters: duration of path-specific and fail-stop failures
• Needs synchronization among monitors
  – Monitors need to collaborate to probe an interface
  – Alternative probabilistic solution avoids synchronization overhead
• Probing cost scales almost linearly with the size of the target network
  – In random power-law graphs like inferred Internet graphs

12 Evaluation
• Paths obtained using traceroutes
  – From 750 PlanetLab nodes to 3,000 DNS servers
  – From 12 RON nodes to 60,000 targets
• Target networks are probed ASes
  – Map IPs to ASes using Mao et al.’s technique
  – 1,366 ASes in PlanetLab
  – 6,517 ASes in RON
• Compute probing costs varying parameters
  – Set of paths, failure durations, target network

13 Probing costs varying size of subscriber network in PlanetLab
Failure durations: path-specific = 1000 sec, fail-stop = 1 sec

14 Summary
• Practical formulation of failure detection problem
  – Incorporates both fail-stop and path-specific failures
• Solution minimizes probing cost
  – Using linear programming
• Inferred Internet graphs are among the most expensive to probe
  – Probing scales almost linearly with network size
• Next step
  – Deploy a system based on these probing techniques

ConnectionWatch: Passive monitoring of round-trip times at end-hosts
with Diana Zeaiter Joumblatt (LIP6), Nina Taft (Intel)

16 Goal
• Automatic detection of performance degradations
  – Only care about problems that impact applications
  – Focus on detecting “large” round-trip times (RTT)
  – Detection should be fast and lightweight

17 ConnectionWatch
[Figure: block diagram with labels Sniffer, Extract flow ID, RTT estimation, High RTT detector, TCP packets, Upload to central server, Packet Trace, Ping Daemon, Flow statistics, Alarms]

18 Insights from preliminary experiments
• Datasets from five students during three days
  – 44,715 TCP connections over 3,584 paths to 2,242 IPs
• Some observations
  – More complete measurements than ping: 16.5% of 1,072 addresses don’t reply to pings
  – Transfer of traces to server is main bottleneck
• Hurdles
  – Portability of system to other OSes
  – Privacy concerns with capturing user’s traffic
  – Incentives for large-scale deployment

19 Which RTT variations correspond to performance degradations?
• Our datasets are still too small to answer
  – Performance degradations are rare events
• Simple technique based on outlier threshold
  – What is a good threshold?
  – Should the threshold be for all users, per user, per path, per app?
• Do outliers correspond to real performance degradations?
  – ConnectionWatch should get user’s feedback: “I’m annoyed” button
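
One possible shape for an outlier threshold is sketched below. The slide leaves the choice of threshold open, so everything here is an assumption: the rule (median plus `k` times the median absolute deviation), the default `k = 3.0`, the function name `rtt_outliers`, and the sample RTT values are all hypothetical. The same function could be applied per user, per path, or per application simply by changing which samples are passed in.

```python
from statistics import median


def rtt_outliers(rtts_ms, k=3.0):
    """Return (outliers, threshold) using median + k * MAD as the cutoff.

    MAD is the median absolute deviation from the median. This rule is an
    illustrative assumption, not the detector used by ConnectionWatch.
    """
    med = median(rtts_ms)
    mad = median(abs(x - med) for x in rtts_ms)
    threshold = med + k * mad
    return [x for x in rtts_ms if x > threshold], threshold


# Made-up RTT samples (milliseconds) for a single path.
samples = [20, 21, 19, 22, 20, 250, 21, 20]
outliers, threshold = rtt_outliers(samples)
print(threshold)  # 22.0
print(outliers)   # [250]
```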

Practical issues with using network tomography for fault diagnosis
with Italo Scota Cunha (LIP6, Thomson), Amogh Dhamdhere, Yiyi Huang, Nick Feamster, Constantine Dovrolis (Georgia Tech), Christophe Diot (Thomson)

21 The binary tomography solution by Duffield
• Given
  – Complete network topology
  – End-to-end reachability measurements
• Find the smallest set of links that explains the observations
  – Assumes single-source tree, access to targets
[Figure: m, t1, t2]

22 Extending binary tomography
• Multi-network setting: topology not known
  – Periodic traceroutes determine topologies
• Extension to multiple sources, multiple targets
  – Minimum hitting set problem (NP-hard)
• Tomo: iterative poly-time greedy heuristic
  – Intuition: iteratively choose the link that explains the max number of failures
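
The greedy intuition on this slide can be sketched as a hitting-set heuristic. This is a simplified illustration, not the Tomo implementation: the function name `greedy_failed_links` and the example paths are hypothetical, and the sketch assumes that any link appearing on a working path can be ruled out as a failure candidate.

```python
def greedy_failed_links(failed_paths, working_paths):
    """Greedy hitting-set sketch for multi-source binary tomography.

    failed_paths / working_paths: dict mapping a path name to its set of links.
    Returns the links blamed for the failures, in the order they were chosen.
    """
    # Assumption: a link seen on any working path is itself working.
    good = set().union(*working_paths.values())
    unexplained = {p: links - good for p, links in failed_paths.items()}
    # Drop failed paths with no remaining candidate; they cannot be explained.
    unexplained = {p: c for p, c in unexplained.items() if c}

    hypothesis = []
    while unexplained:
        # Count how many still-unexplained failed paths each link lies on.
        counts = {}
        for candidates in unexplained.values():
            for link in candidates:
                counts[link] = counts.get(link, 0) + 1
        # Blame the link explaining the most failures (ties: alphabetical).
        best = max(sorted(counts), key=counts.get)
        hypothesis.append(best)
        unexplained = {p: c for p, c in unexplained.items() if best not in c}
    return hypothesis


# Hypothetical measurements; link names are single letters for brevity.
failed = {"m1->t1": {"a", "b", "c"}, "m1->t2": {"a", "b", "d"}, "m2->t1": {"e", "c"}}
working = {"m2->t2": {"e", "d"}, "m1->t3": {"a", "f"}}
hypothesis = greedy_failed_links(failed, working)
print(hypothesis)  # ['b', 'c']
```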

23 Some problems
• Dynamics
  – Loss can be transient, topology can change
• Ambiguity
  – Losses are one-way, but we don’t always have access to both ends of the path
• Lack of synchronization
  – Different monitors see different conditions

24 Approach
• Transient packet loss
  – Triggered confirmation of failed paths
• Dynamic routing
  – Periodic snapshots of the network topology
• One-way losses
  – Algorithm based on IP spoofing
• Lack of synchronization
  – Correlation of probes from different monitors

25 Failure confirmation
[Figure: timeline of packets on a path, showing a loss burst and a false positive]
• Upon detection of a failure, trigger extra probes
• Number of probes
  – Confirm failures with a target false positive rate
  – Assume independence and a given loss rate
• Time between probes
  – Reduce chance that probes fall on the same loss burst
  – Assume link losses follow a Gilbert process
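
Under the independence assumption on this slide, the number of probes follows from a short calculation: if each probe on a working path is lost with probability p, then n probes are all lost with probability p^n, so n is the smallest integer with p^n at or below the target false positive rate. The sketch below illustrates that calculation only; the function name `probes_needed` and the example rates are hypothetical, and it does not model the Gilbert process used to choose the time between probes.

```python
def probes_needed(loss_rate, target_fp):
    """Smallest n such that loss_rate ** n <= target_fp.

    Assumes probe losses are independent with a fixed loss rate, so a
    working path loses all n probes with probability loss_rate ** n.
    """
    if not (0 < loss_rate < 1 and 0 < target_fp < 1):
        raise ValueError("loss_rate and target_fp must be in (0, 1)")
    n = 1
    while loss_rate ** n > target_fp:
        n += 1
    return n


print(probes_needed(0.1, 0.005))  # 3
print(probes_needed(0.5, 0.01))   # 7
```

When losses are bursty rather than independent, back-to-back probes are likely to fall in the same burst, which is why the slide treats the spacing between probes as a separate parameter.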

26 Disambiguating one-way losses: Spoofing
• Monitor sends request to spoofer to send probe
• Probe carries the monitor’s IP address as its source address
• If reply reaches the monitor, reverse path is working
[Figure: monitor M, target T, and spoofer; spoofer label: “Send spoofed packet with source address of M”]
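
The inference made from a spoofed probe can be written as a small decision function. This is a sketch of the reasoning only, with a hypothetical function name and return strings; it does not send packets. It assumes the monitor M has already failed to get a reply to its own probe of target T, and it treats a missing spoofed reply as inconclusive, since the spoofer's own path to T might also be broken.

```python
def classify_failure(direct_reply_seen, spoofed_reply_seen):
    """Interpret probe outcomes observed at monitor M for target T.

    direct_reply_seen: M probed T itself and received a reply.
    spoofed_reply_seen: the spoofer probed T with M's source address and
    T's reply arrived at M.
    """
    if direct_reply_seen:
        return "no failure"
    if spoofed_reply_seen:
        # T's reply reached M, so the reverse path T -> M works;
        # the loss is therefore attributed to the forward path M -> T.
        return "forward path suspected"
    # No reply either way: the reverse path may be down, or the spoofer
    # may be unable to reach T. This case stays ambiguous.
    return "inconclusive"


print(classify_failure(False, True))   # forward path suspected
print(classify_failure(False, False))  # inconclusive
```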

27 Evaluation
• Evaluation is challenging
  – Need ground truth and realistic environment
• Controlled experiments on the VINI testbed
  – Allow us to inject failures
  – Problem: hard to argue about false positives
• Experiments on Emulab
  – More control: dedicated nodes and links
  – Emulate the Abilene network
  – Selected LA and NY as monitors

28 Failure confirmation reduces false positives
• Emulab experiment setup
  – 10% loss rates in each direction
  – No persistent failures
• Both schemes use three probes to confirm a failure
• Results by confirmation interval (burst factor 90% / 96%)
  – Back-to-back: 15% / 25%
  – 0.2 secs: 0.8%
Low false positives, because an interval of 0.2 secs guarantees a small probability of probes being correlated

29 Correlation is important to get a consistent view
• Emulab and VINI experiments with short failures
  – More false positives
  – Lower detection rate
• In real deployments, can we get a consistent view?
  – More noise because of losses and routing dynamics
  – Monitors are less synchronized
  – Monitors may not be able to reach the coordinator
• Next steps
  – Online correlation
  – Minimize communication with coordinator

30 Summary
• Continuous monitoring for detection
  – At management hosts: active measurements. Reduce probing overhead, still detect failures
  – At end-users: passive measurements. Lightweight detection of problems that affect apps
• Network tomography for identification
  – Many challenges to get consistent inputs for tomography: network dynamics and transient losses; ambiguity of forward and reverse failures; monitors may observe different conditions