1 Troubleshooting Chronic Conditions in Large IP Networks Ajay Mahimkar, Jennifer Yates, Yin Zhang, Aman Shaikh, Jia Wang, Zihui Ge, Cheng Tien Ee UT-Austin.

Slides:



Advertisements
Similar presentations
Toward Interactive Debugging for ISP Networks Chia-Chi Lin, Matthew Caesar, Jacobus Van der Merwe § University of Illinois at Urbana-Champaign § AT&T Labs.
Advertisements

Jennifer Rexford Princeton University MW 11:00am-12:20pm Logically-Centralized Control COS 597E: Software Defined Networking.
Towards Automated Performance Diagnosis in a Large IPTV Network Ajay Mahimkar, Zihui Ge, Aman Shaikh, Jia Wang, Jennifer Yates, Yin Zhang, Qi Zhao UT-Austin.
Deployment of MPLS VPN in Large ISP Networks
1 Aman Shaikh UCSC SHS IMW A Case-study of OSPF Behavior in a Large Enterprise Network Aman Shaikh, UCSC Chris Isett, Siemens Health Services Albert.
G-RCA: A Generic Root Cause Analysis Platform for Service Quality Management in Large IP Networks He Yan, Lee Breslau, Zihui Ge, Dan Massey, Dan Pei, Jennifer.
Detectability of Traffic Anomalies in Two Adjacent Networks Augustin Soule, Haakon Ringberg, Fernando Silveira, Jennifer Rexford, Christophe Diot.
4.1.5 System Management Background What is in System Management Resource control and scheduling Booting, reconfiguration, defining limits for resource.
1 BGP Anomaly Detection in an ISP Jian Wu (U. Michigan) Z. Morley Mao (U. Michigan) Jennifer Rexford (Princeton) Jia Wang (AT&T Labs)
Internet Measurement Initiatives in the Wisconsin Advanced Internet Lab Paul Barford Computer Science Department University of Wisconsin – Madison Spring,
Network Architecture for Joint Failure Recovery and Traffic Engineering Martin Suchara in collaboration with: D. Xu, R. Doverspike, D. Johnson and J. Rexford.
Trajectory Sampling for Direct Traffic Observation Matthias Grossglauser joint work with Nick Duffield AT&T Labs – Research.
1 Finding a Needle in a Haystack: Pinpointing Significant BGP Routing Changes in an IP Network Jian Wu (University of Michigan) Z. Morley Mao (University.
Traffic Engineering With Traditional IP Routing Protocols
Shadow Configurations: A Network Management Primitive Richard Alimi, Ye Wang, Y. Richard Yang Laboratory of Networked Systems Yale University.
1 Traffic Engineering for ISP Networks Jennifer Rexford IP Network Management and Performance AT&T Labs - Research; Florham Park, NJ
DFence: Transparent Network-based Denial of Service Mitigation CSC7221 Advanced Topics in Internet Technology Presented by To Siu Sang Eric ( )
Traffic Engineering in IP Networks Jennifer Rexford Computer Science Department Princeton University; Princeton, NJ
Dynamics of Hot-Potato Routing in IP Networks Renata Teixeira (UC San Diego) with Aman Shaikh (AT&T), Tim Griffin(Intel),
A Routing Control Platform for Managing IP Networks Jennifer Rexford Princeton University
Measurement and Monitoring Nick Feamster Georgia Tech.
On Multi-Path Routing Aditya Akella 03/25/02. What is Multi-Path Routing?  Dynamically route traffic Multiple paths to a destination Path taken dependant.
Network Monitoring for Internet Traffic Engineering Jennifer Rexford AT&T Labs – Research Florham Park, NJ 07932
A victim-centric peer-assisted framework for monitoring and troubleshooting routing problems.
© 2010 AT&T Intellectual Property. All rights reserved. AT&T, the AT&T logo and all other AT&T marks contained herein are trademarks of AT&T Intellectual.
Hot Potatoes Heat Up BGP Routing Jennifer Rexford AT&T Labs—Research Joint work with Renata Teixeira, Aman Shaikh, and.
Understanding Network Failures in Data Centers: Measurement, Analysis and Implications Phillipa Gill University of Toronto Navendu Jain & Nachiappan Nagappan.
Performance Management (Best Practices) REF: Document ID
New Challenges in Cloud Datacenter Monitoring and Management
© 2009 Cisco Systems, Inc. All rights reserved. SWITCH v1.0—5-1 Implementing a Highly Available Network Understanding High Availability.
A Signal Analysis of Network Traffic Anomalies Paul Barford with Jeffery Kline, David Plonka, Amos Ron University of Wisconsin – Madison Summer, 2002.
Tomo-gravity Yin ZhangMatthew Roughan Nick DuffieldAlbert Greenberg “A Northern NJ Research Lab” ACM.
1 Proposed Additional Use Cases for Congestion Exposure draft-mcdysan-conex-other-usecases-00.txt Dave McDysan.
1 Root-Cause Network Troubleshooting Optimizing the Process Tim Titus CTO, PathSolutions.
1 Meeyoung Cha, Sue Moon, Chong-Dae Park Aman Shaikh Placing Relay Nodes for Intra-Domain Path Diversity To appear in IEEE INFOCOM 2006.
Authors Renata Teixeira, Aman Shaikh and Jennifer Rexford(AT&T), Tim Griffin(Intel) Presenter : Farrukh Shahzad.
Fault Management * * Mani Subramanian “Network Management: Principles and practice”, Addison-Wesley, 2000.
© 2011 Cisco and/or its affiliates. All rights reserved. 1 High Performance Network Analysis Enterprise Operate Practice Cisco Services Andrew Wojtkowiak.
Software Metrics - Data Collection What is good data? Are they correct? Are they accurate? Are they appropriately precise? Are they consist? Are they associated.
Computer Science Open Research Questions Adversary models –Define/Formalize adversary models Need to incorporate characteristics of new technologies and.
Happy Network Administrators  Happy Packets  Happy Users WIRED Position Statement Aman Shaikh AT&T Labs – Research October 16,
Wireless Mesh Network 指導教授:吳和庭教授、柯開維教授 報告:江昀庭 Source reference: Akyildiz, I.F. and Xudong Wang “A survey on wireless mesh networks” IEEE Communications.
Measurement and Modeling of Packet Loss in the Internet Maya Yajnik.
Basic component of Network Management Woraphon Lilakiatsakun.
Network Anomography Yin Zhang – University of Texas at Austin Zihui Ge and Albert Greenberg – AT&T Labs Matthew Roughan – University of Adelaide IMC 2005.
A Firewall for Routers: Protecting Against Routing Misbehavior1 June 26, A Firewall for Routers: Protecting Against Routing Misbehavior Jia Wang.
BGP topics to be discussed in the next few weeks: –Excessive route update –Routing instability –BGP policy issues –BGP route slow convergence problem –Interaction.
A Snapshot on MPLS Reliability Features Ping Pan March, 2002.
NetSearch: Googling Large-scale Network Management Data GROUP 2 MEMBERS SAMUEL LAWER WENBO HAN HUAN YAN PEI YAN SHREY YADAV SHUAI YU SHINE PANDITA.
A Light-Weight Distributed Scheme for Detecting IP Prefix Hijacks in Real-Time Lusheng Ji†, Joint work with Changxi Zheng‡, Dan Pei†, Jia Wang†, Paul Francis‡
1 Theophilus Benson*, Aditya Akella*, Aman Shaikh + *University of Wisconsin, Madison + ATT Labs Research.
Towards a Well-Managed Next Generation Internet! Hot Research Topics in Next Generation Internet Panel NY Systems/Networking Summit, NYU Aman Shaikh AT&T.
Intradomain Traffic Engineering By Behzad Akbari These slides are based in part upon slides of J. Rexford (Princeton university)
1 Root-Cause VoIP Troubleshooting Optimizing the Process Tim Titus CTO, PathSolutions.
On Selfish Routing In Internet-like Environments Lili Qiu (Microsoft Research) Yang Richard Yang (Yale University) Yin Zhang (AT&T Labs – Research) Scott.
Yaping Zhu with: Jennifer Rexford (Princeton University) Aman Shaikh and Subhabrata Sen (ATT Research) Route Oracle: Where Have.
An Efficient Gigabit Ethernet Switch Model for Large-Scale Simulation Dong (Kevin) Jin.
An Efficient Gigabit Ethernet Switch Model for Large-Scale Simulation Dong (Kevin) Jin.
NetQuest: A Flexible Framework for Large-Scale Network Measurement Lili Qiu University of Texas at Austin Joint work with Han Hee Song.
Company LOGO Network Management Architecture By Dr. Shadi Masadeh 1.
© 2014 Level 3 Communications, LLC. All Rights Reserved. Proprietary and Confidential. Simple, End-to-End Performance Management Application Performance.
1 Root-Cause Network Troubleshooting Optimizing the Process Tim Titus CTO PathSolutions.
SDN challenges Deployment challenges
Network Fault Analysis based on Machine Learning
Redcell™ Management Essentials, Juniper Networks Enterprise Edition
Jian Wu (University of Michigan)
ETHANE: TAKING CONTROL OF THE ENTERPRISE
Networking for the Future of Science
Backbone Traffic Engineering
Scrumium NetBrain Thursday, May 09, 2019.
Presentation transcript:

1 Troubleshooting Chronic Conditions in Large IP Networks Ajay Mahimkar, Jennifer Yates, Yin Zhang, Aman Shaikh, Jia Wang, Zihui Ge, Cheng Tien Ee UT-Austin and AT&T Labs-Research ACM CoNEXT 2008

2 Network Reliability Applications demand high reliability and performance –VoIP, IPTV, Gaming, … –Best-effort service is no longer acceptable Accurate and timely troubleshooting of network outages required –Outages can occur due to mis-configurations, software bugs, malicious attacks Can cause significant performance impact Can incur huge losses

3 Hard Failures Traditionally, troubleshooting focused on hard failures –E.g., fiber cuts, line card failures, router failures –Relatively easy to detect –Quickly fix the problem and get resource up and running Lots of other network events flying under the radar, and potentially impacting performance Link failure

4 Chronic Conditions Individual events disappear before an operator can react to them Keep re-occurring Can cause significant performance degradation –Can turn into hard failure Examples –Chronic link flaps –Chronic router CPU utilization anomalies Router Router CPU Spikes Chronic link flaps

5 Troubleshooting Chronic Conditions Detect and troubleshoot before customer complains State of art –Manual troubleshooting Network-wide Information Correlation and Exploration (NICE) –First infrastructure for automated, scalable and flexible troubleshooting of chronic conditions –Becoming a powerful tool inside AT&T Used to troubleshoot production network issues Discovered anomalous chronic network conditions

6 Outline Troubleshooting ChallengesTroubleshooting Challenges NICE ApproachNICE Approach NICE ValidationNICE Validation Deployment ExperienceDeployment Experience ConclusionConclusion

7 Troubleshooting Chronic Conditions is hard Routing reports Syslogs Layer-1 Performance reports Workflow Traffic 1. Collect network measurements 2. Mine data to find chronic patterns 3. Reproduce patterns in lab settings (if needed) 4. Perform software and hardware analysis (if needed) Effectively mining measurement data for troubleshooting is the contribution of this paper

8 Troubleshooting Challenges Massive Scale –Potential root-causes hidden in thousands of event-series –E.g., root-causes for packet loss include link congestion (SNMP), protocol down (Route data), software errors (syslogs) Complex spatial and topology models –Cross-layer dependency –Causal impact scope Local versus global (propagation through protocols) Imperfect timing information –Propagation (events take time to show impact – timers) –Measurement granularity (point versus range events)

9 NICE Statistical correlation analysis across multiple data –Chronic condition manifests in many measurements Blind mining leads to information snow of results –NICE starts with symptom and identifies correlated events Chronic Symptom Spatial Proximity model Unified Data Model Statistical Correlation Other Network Events Statistically Correlated Events NICE

10 Spatial Proximity Model Select events in close proximity Hierarchical structure –Capture event location Proximity distance –Capture impact scope of event Examples –Path packet loss - events on routers and links on same path –Router CPU anomalies - events on same router and interfaces OSPF area Router Path Logical link Physical link Layer-1 device Interface Network operators find it flexible and convenient to express the impact scope of network events Hierarchical Structure Path Physical link Layer-1 device Interface Logical link Router Path Router Logical link Physical link Layer-1 device Interface

11 Unified Data Model Facilitate easy cross-event correlations Padding time-margins to handle diverse data –Convert any event-series to range series Common time-bin to simplify correlations –Convert range-series to binary time-series Range Event Series A Point Event Series B Padding margin Overlapping range Convert to binary Merge Overlapping range Auto-correlation

12 Statistical Correlation Testing Co-occurrence is not sufficient Measure statistical time co-occurrence –Pair-wise Pearson’s correlation coefficient Unfortunately, cannot apply the classic significance test –Due to auto-correlation Samples within an event-series are not independent Over-estimates the correlation confidence: high false alarms We propose a novel circular permutation test –Key Idea: Keep one series fixed and shift another Preserve auto-correlation Establishes baseline for null hypothesis that two series are independent

13 NICE Validation Goal: Test if NICE correlation output matches networking domain knowledge –Validation using 6 months of data from AT&T backbone Pairs for correlation testing Expected not to correlate Expected to correlate Matched outputs Unexpected Correlations Missed Correlations NICE Correlation Results Results Expected by Network operators For 97% pairs, NICE correlation output agreed with domain knowledge For remaining 3% mismatch, their causes fell into three categories –Imperfect domain knowledge –Measurement data artifacts –Anomalous network behavior Expected to not correlate, NICE marked correlated Expected to correlate, NICE marked uncorrelated

14 Anomalous Network Behavior Example – Cross-layer Failure interactions –Modern ISPs use failure recovery at layer-1 to rapidly recover from faults without inducing re-convergence at layer-3 i.e., if layer-1 has protection mechanism invoked successfully, then layer-3 should not see a link failure Expectation: Layer-3 link down events should not correlate with layer-1 automated failure recovery –Spatial proximity model: SAME LINK Result: NICE identified strong statistical correlation –Router feature bugs identified as root cause –Problem has been mitigated

15 Troubleshooting Case Studies AT&T Backbone Network Uplink packet loss on an access router Packet loss observed by active measurement between a router pair CPU anomalies on routers Data SourceNumber of Event types Layer-1 Alarms130 SNMP4 Router Syslogs937 Command Logs839 OSPF Events25 Total1935 All three case studies uncover interesting correlations with new insights

16 Chronic Uplink Packet loss Problem: Identify strongly correlated event-series with chronic packet drops on router uplinks –Significantly impacting customers NICE Input: Customer interface packet drops (SNMP) and router syslogs Customer interfaces Which customer interface events correlate? ISP Network Uplinks to backbone Packet drops Access Router..

17 Chronic Uplink Packet loss High co-occurrence, but no statistical correlation NICE identifies strong statistical correlation

18 Chronic Uplink Packet loss NICE Findings: Strong Correlations with –Packet drops on four customer-facing interfaces (out of 150+ with packet drops) All four interfaces from SAME CUSTOMER –Short-term traffic bursts appear to cause internal router limits to be reached Impacts traffic flowing out of router Impacting other customers –Mitigation Action: Re-home customer interface to another access router

19 Conclusions Important to detect and troubleshoot chronic network conditions before customer complains NICE – First scalable, automated and flexible infrastructure for troubleshooting chronic network conditions –Statistical correlation testing –Incorporates topology and routing model Operational experience is very positive –Becoming a powerful tool inside AT&T Future Work –Network behavior change monitoring using correlations –Multi-way correlations

20 Thank You !

21 Backup Slides …

22 Router CPU Utilization Anomalies Problem: Identify strongly correlated event-series with chronic CPU anomalies as input symptom NICE Input: Router syslogs, routing events, command logs and layer-1 alarms NICE Findings: Strong Correlations with –Control-plane activities –Commands such as viewing routing protocol states –Customer-provisioning –SNMP polling Mitigation Action: Operators are working with router polling systems to refine their polling mechanisms Consistent with earlier operations findings New

23 Auto-correlation About 30% of event-series have significant auto-correlation at lag 100 or higher

24 Circular Permutation Test Auto-correlation 1 Series A Series B Permutation provides correlation baseline to test hypothesis of independence

25 Imperfect Domain Knowledge Example – one of router commands used to view routing state is considered highly CPU intensive We did not find significant correlation between the command and CPU value as low as 50% –Correlation became significant only with CPU above 40% –Conclusion: The command does cause CPU spikes, but not as high as we had expected Domain knowledge updated !