
Presentation transcript:

© 2010 AT&T Intellectual Property. All rights reserved. AT&T, the AT&T logo and all other AT&T marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies. All other marks contained herein are the property of their respective owners.

Mercury: Detecting the Performance Impact of Network Upgrades
Ajay Mahimkar, Han Hee Song*, Zihui Ge, Aman Shaikh, Jia Wang, Jennifer Yates, Yin Zhang*, Joanne Emmons
AT&T Labs – Research, * UT-Austin
ACM SIGCOMM 2010, New Delhi, India

Increasing Network Complexity
- Massive scale: 100s of offices, 1000s of routers, 10,000s of interfaces, millions of consumers
- Immense software complexity: scale, bugs, interactions
- Diverse technologies and vendors: Layer-1, Layer-2, switches, routers, IP, multicast, MPLS, wireless access points
- Continuous evolution: upgrades, installations
- Applications: scale, sensitivity

What are Network Upgrades?
- Fundamental changes to the network
  - Router software or hardware upgrades
  - Configuration and policy changes
- Goals
  - Introduce new service features
  - Reduce operational cost
  - Improve performance
- Upgrades can have unpredictable performance impacts
  - Impacts might fly under the radar
[Figure labels: Enterprise System Servers Operator packet loss End Users]

Monitoring Impact of Upgrades
- One aspect: extensive lab testing before deployment
  - Software engineering principles and certification process
  - Goal is to prevent bugs from reaching the network
- Problems with lab testing
  - Cannot replicate the scale and complexity of operational networks
  - Cannot enumerate all test cases
- Important to monitor upgrades in the field
  - Manual investigation: critical issues are caught only after a long time
  - Operations challenge: large number of devices and performance event-series
- Innovative solutions required to monitor at scale

Mercury
- Detects the performance impact of upgrades in operational networks
  - Automated data mining to extract trends
  - Scalable across a large number of measurements
  - Flexible to work across a diverse set of data sources
  - Ease of interpretation for network operations
- Challenges
  - How to extract upgrades?
  - Do upgrades induce behavior changes in performance?
  - Is there commonality in configuration across devices?
  - Is the change observed network-wide?

Extracting Upgrades
- Minimize dependency on domain expert input
  - Human information can be unreliable, incomplete, or outdated
  - Our approach is data-driven: mine configuration and workflow logs
- Operating system upgrades
  - Track OS version and upgrades using polling
- Firmware upgrades
  - Detect differences in hardware configuration across days
- Upgrade-related configuration changes
  - Lots of configuration changes
  - Frequent changes, such as provisioning customers, are not upgrades
  - Heuristic: look for "out of the ordinary"
  - Two metrics: high coverage (skewness) and rareness
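The "out of the ordinary" heuristic can be illustrated with a small sketch. The slides name the two metrics (coverage and rareness) but do not define them, so the definitions here are assumptions made for illustration only: a change type counts as rare if no router sees it more than `max_per_router` times, and as high-coverage if it touches at least `min_coverage` of all routers. The function name, the log format, and both thresholds are hypothetical.

```python
from collections import defaultdict


def candidate_upgrades(change_log, all_routers, max_per_router=4, min_coverage=0.5):
    """Flag configuration change types that look like upgrades.

    change_log: iterable of (router, day, change_type) tuples (assumed format).
    A change type is flagged when it is rare on every router it touches
    and yet covers a large fraction of the routers (assumed definitions).
    """
    # change_type -> router -> number of times the change was seen
    per_type = defaultdict(lambda: defaultdict(int))
    for router, _day, change_type in change_log:
        per_type[change_type][router] += 1

    flagged = []
    for change_type, counts in per_type.items():
        rare = max(counts.values()) <= max_per_router
        coverage = len(counts) / len(all_routers)
        if rare and coverage >= min_coverage:
            flagged.append(change_type)
    return sorted(flagged)


routers = ["r1", "r2", "r3", "r4"]
log = [("r1", day, "provision-customer") for day in range(6)]
log += [(r, 10, "bgp-fast-fallover") for r in ("r1", "r2", "r3")]
flagged = candidate_upgrades(log, routers)
```

In this toy log, frequent customer provisioning on one router is filtered out, while the policy change applied once on three of four routers is kept as a candidate upgrade.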

Detecting Upgrade Induced Changes
- Performance event-series creation
  - Divide each series into equal time-bins
  - For example, daily counts or averages
- Behavior change detection
  - E.g., a persistent level-shift
  - Changes in means, medians, standard deviations or distributions
  - Our approach: recursive rank-based cumulative sums (CUSUM)
    - $S_0 = 0$, $S_i = S_{i-1} + (r_i - \bar{r})$, where $r_i$ is the rank of the $i$-th time-bin and $\bar{r}$ is the mean rank
    - Outputs significant changes along with magnitude (positive versus negative)
- Associating changes to upgrades
  - Proximity model: same location and close in time
[Figure: event-series with upgrades U1 and U2 marked]
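The CUSUM recurrence on this slide can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it computes the rank-based sums for one series and picks the bin where |S_i| peaks as the most likely change point. The recursive splitting and the significance test that the slide implies are omitted, and the function names are hypothetical.

```python
def rank_cusum(values):
    """Return [S_0, ..., S_n] with S_0 = 0 and S_i = S_{i-1} + (r_i - mean rank)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Tied values share the average of the 1-based ranks they occupy.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    mean_rank = (n + 1) / 2
    sums = [0.0]
    for r in ranks:
        sums.append(sums[-1] + (r - mean_rank))
    return sums


def most_likely_change_point(values):
    """Return (number of bins before the change, direction of the level shift)."""
    sums = rank_cusum(values)
    idx = max(range(1, len(sums)), key=lambda i: abs(sums[i]))
    # Low ranks first drive the sum negative, so a negative extreme
    # means the series shifted upward after that bin.
    direction = "positive" if sums[idx] < 0 else "negative"
    return idx, direction


daily_cpu = [1, 2, 1, 2, 1, 9, 8, 9, 8, 9]
change = most_likely_change_point(daily_cpu)
```

Working on ranks rather than raw values makes the sums insensitive to outliers, which is presumably why a rank-based variant was chosen; that motivation is an inference, not something the slide states.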

Identifying Commonality
- Extracting common attributes helps drill down into changes
  - Software configuration
  - Example attributes are OS version, number of BGP peers, re-routing policies
  - Device location, role, model, vendor
- Problem: identifying common attributes is a search in a multi-dimensional space
  - Classical machine learning problem
- Solution: RIPPER rule learner
  - Outputs rules of the form A => B
  - E.g., if (upgrade = OS change) and (router role = border) => positive level-shift in CPU
[Figure: an upgrade and attributes A1, A2, ..., An-1, An mapped to a change]
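To make the idea of a rule "A => B" concrete, the sketch below scans single attribute-value pairs and reports the one that best predicts a change. It is deliberately not RIPPER: RIPPER grows and prunes multi-condition rules, whereas this is a much simpler single-condition scan used only to illustrate the search over attributes. The function name, the data layout, and the scoring (precision first, then number of changed devices covered) are all assumptions.

```python
def best_single_attribute_rule(devices, changed):
    """Find the (attribute, value) pair most associated with a change.

    devices: dict mapping device name -> dict of attribute -> value.
    changed: set of device names on which the behavior change was seen.
    Returns (attribute, value, changed devices matching, devices matching).
    """
    stats = {}  # (attribute, value) -> (devices matching, of which changed)
    for name, attrs in devices.items():
        for key, value in attrs.items():
            total, hits = stats.get((key, value), (0, 0))
            stats[(key, value)] = (total + 1, hits + (name in changed))

    def score(item):
        _condition, (total, hits) = item
        return (hits / total, hits)  # precision, then coverage

    (key, value), (total, hits) = max(stats.items(), key=score)
    return key, value, hits, total


devices = {
    "a": {"role": "border", "os": "v2"},
    "b": {"role": "border", "os": "v2"},
    "c": {"role": "core", "os": "v2"},
    "d": {"role": "core", "os": "v1"},
}
rule = best_single_attribute_rule(devices, changed={"a", "b"})
```

Here the scan prefers "role = border" over "os = v2": both cover the two changed devices, but the OS condition also matches an unchanged core router.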

Detecting Network-wide Changes
- Why network-wide change detection?
  - Changes might be missed for rare events at each device
  - Aggregation across devices increases the change significance
- How to aggregate event-series for each upgrade type?
  - For each event-series, identify devices that are upgraded
  - Simple aggregation is not enough: each upgrade is applied over several days
- Solution: time alignment for each upgrade
  - Align event-series such that the upgrade falls on the same date
[Figure: event-series for routers R1, R2, R3 aligned on the upgrade date; significant change after aggregation]
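The time-alignment step can be sketched as follows, assuming each router has a daily event-count list and a known upgrade day (an index into that list). Days that fall outside a router's series simply contribute nothing, which is a simplification; the function name and data layout are hypothetical.

```python
def align_and_aggregate(series_by_router, upgrade_day_by_router, window):
    """Sum daily counts across routers at offsets -window..+window from each upgrade.

    The returned list has 2 * window + 1 entries; index `window` is the upgrade day.
    """
    totals = [0] * (2 * window + 1)
    for router, series in series_by_router.items():
        upgrade_day = upgrade_day_by_router[router]
        for offset in range(-window, window + 1):
            day = upgrade_day + offset
            if 0 <= day < len(series):
                totals[offset + window] += series[day]
    return totals


# Rare events per router; each router was upgraded on a different day.
series = {
    "R1": [0, 0, 0, 1, 0, 1, 0],
    "R2": [0, 0, 0, 0, 1, 1, 0],
    "R3": [0, 0, 0, 0, 0, 1, 1],
}
upgrade_days = {"R1": 2, "R2": 3, "R3": 4}
aligned = align_and_aggregate(series, upgrade_days, window=2)
```

A single extra event per router is hard to distinguish from noise, but after alignment the three events line up on the day following the upgrade, and the aggregated series can then be passed to the change detector.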

MERCURY Evaluation
- Evaluation using real network data is challenging
  - Lack of ground truth information
  - Close interaction with network operations
- Data sets
  - Upgrades: router configuration, workflow logs
  - Performance event-series: SNMP (CPU, memory) and syslogs
  - Collected from a tier-1 ISP backbone over 6 months
  - Number of routers = 988
  - Router categories: core, aggregate, access, route reflector, hub

Extracting Upgrades
- Compare Mercury output with labels from operations
  - False positive: falsely detected by Mercury
  - False negative: missed by Mercury
  - Vary the threshold for detecting rare upgrade-related configuration changes

Upgrade labels from operations | Count | MERCURY labels | Count
Interesting | 13 | False negative | 1
Non-interesting | 19 | False positive | 11

[Figure: MERCURY output, and output filtered after applying behavior change detection, for r = 2, 4, 6, 8, 10; r = 4 is labeled separately]

Upgrade Induced Behavior Changes
- MERCURY output: significant reduction
- MERCURY not only confirmed earlier findings, but also revealed previously unknown network behaviors

Columns: Performance event-series | Count | Upgrades | Upgrade/event-series pairs | Upgrade-induced change-points | Unique cases
Rows: CPU, Memory, Syslogs (cell values not recoverable; surviving fragments: "288," and ",295,")

Router role | Performance series
Core routers | 103,112
Aggregate routers | 43,226
Access routers | 113,079
Route reflectors | 6,548
Hub routers | 24,095
Total | 290,060

Mercury Findings Summary
- Operating system upgrades
  - Downticks in CPU utilization on access routers
  - Upticks in memory utilization on aggregate routers
  - Varying behaviors in layer-1 link flaps across different OS versions on access routers
  - Upticks in the number of protection switching events on access routers
- Firmware upgrades
  - Downticks in CPU utilization on central CPU and customer-facing
  - Upticks on optical carrier line cards
- BGP fast external fall-over policy changes
  - Upticks in the number of "down interface flaps"
  - Downticks in the number of BGP hold timer and peer closed session events

Case Study: Protection Switching
- Line card protection in access routers
  - To protect customers from line card failures
  - On failure, customers are switched to backup
  - Switching is called Automated Protection Switching (APS)
- MERCURY validated a known issue
  - Small increase in the frequency of APS failure events
  - Critical issue impacting customers
  - Run across all the syslog messages
  - APS failure events are rare per router
  - Statistically indistinguishable at the individual router level
  - Change detected when aggregated across all upgraded access routers
- Mercury was used by Ops to track improvements as the fix was deployed
[Figure: OS upgrade. Dates normalized across all upgraded routers. The upgrade happened on day 84.]

Conclusions
- Mercury detects persistent changes in performance induced by upgrades
  - Automated detection with minimal domain knowledge
  - Scalable to a large number of measurements
  - Flexible to be applied across diverse data sources
- Operational experiences
  - Confirmed earlier findings as well as discovered previously unknown behaviors
  - Is becoming a powerful tool inside AT&T
- Future Work – Lots!!!
  - Apply Mercury to new domains such as data centers, VoIP, IPTV, Mobility
  - Behavior changes induced by chronic events
  - Real-time capabilities

Thank You!