Fault Localization via Analysis of Network Dependency Victor Bahl, Ranveer Chandra, Albert Greenberg, Dave Maltz, Ming Zhang (MSR Redmond)

Presentation transcript:

Failure of Management Systems
- 10% of requests to internal servers take 10x longer than normal
- Persistent user frustration and high care costs
- Invisible to current management systems
[Figure: response time of 1 web server; response time of 17 servers; ~10% of requests marked]

Challenges
A typical large enterprise:
- ~100,000 client desktops
- ~10,000 servers
- ~10,000 apps/services
- ~10,000 network devices
Service alerts for 10 days:
- 120,000 "housekeeping"
- 2,000 missed heartbeats from 160 servers
- 18,000 alerts from 194 categories and 877 hosts
[Figure: management tools and the staff who use them. Server Management (MOM, MAM, SMS, scripts, ...): Application Support Staff. Network Management (SMARTS, SNMP, NetFlow, scripts, ...): Network Support Staff. Desktop Management (RemoteDsktp, SMS, ...): Help Desk Support Staff. Components shown: SQL, WebSvr, Active Directory, DNS, proxy, client machines]

Mission
What we have today:
- Interdependent distributed systems with hidden and unknown dependencies
- Plethora of tools for graphing SNMP values, paucity of tools for tracking relationships
- Little visibility into effect of network on applications
What we want:
- Method to map the IT infrastructure, determining which components affect a given client activity
- Method to localize problems that affect users

On-Going Work
- Read/Write SML models of applications
  - Automatically generate SML for legacy apps
  - Complement expert-generated SML
- Explore other applications of Inference Graph
  - Upgrade management (who will be affected)
  - Availability analysis (who is being impacted)

State of the Art
- Management systems do not provide a "big picture"
- Tools are box-centric, not service-centric
- Relationships among servers often undocumented
- Fragmentation results in more mistakes & outages
- Tools do not directly measure user experience

Automatically Creating Models of Dependencies / Automatically Localizing Faults
[Figure: two-stage pipeline]
- Identify Service Dependencies: packet traces at individual agents / vantage-point routers are turned into an Inference Graph
- Fault Localization: the Inference Engine takes the Inference Graph plus topology and other network information
- Observations: client-server interaction logs, trouble tickets, etc.
- Actions: run TraceRoute x->y
- Fault Suspects: links, routers, servers, clients

Example Extracted Dependencies
[Figure: hosts DNS Server, SQL Server, web front 1, client A, client B]
[Figure: inference graph. Root causes: client A, DNS Server, web front 1 (WF1), SQL Server, File Server, and the paths A → DNS, A → WF1, WF1 → SQL, A → FS. User experience: A → WF1, A → FS. One edge is labeled 0.3]
- Nodes can be up, down, or troubled
- State of each node: (P_up, P_troubled, P_down), where P_up + P_troubled + P_down = 1
- Example states shown in the figure: (0, 0, 1) and (1, 0, 0)

Results
- Algorithm for extraction of dependency models
  - Sniffs and correlates packets between hosts
- Algorithm for flexible & accurate fault localization
  - Scalable to size of large enterprises
  - Localizes both hard and performance faults
  - Finds problems in network, even without data from network routers
- Deployed and evaluated on testbed and several MSIT applications (e.g., msw, itweb)
- Model is probabilistic to cope with caching, load balancing and failover techniques
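
The three-state node model can be illustrated with a small sketch. The transcript does not say how a node's state is computed from the nodes it depends on, so the combination rule below (a noisy-max style rule: a node is up only if every dependency it actually uses is up) is an assumption, as are the function names `effective_state` and `combine` and the choice of dependency weights. Only the state tuple (P_up, P_troubled, P_down), the node names, and the 0.3 edge label come from the poster.

```python
from math import prod
from typing import Dict, Tuple

# A node state is (P_up, P_troubled, P_down); the three values sum to 1.
State = Tuple[float, float, float]


def effective_state(parent: State, weight: float) -> State:
    """State of a parent as seen through a dependency edge.

    Assumption: with probability `weight` the child really uses this
    parent; otherwise the parent cannot hurt it and counts as up.
    """
    p_up, p_troubled, p_down = parent
    return (1.0 - weight + weight * p_up, weight * p_troubled, weight * p_down)


def combine(parents: Dict[str, Tuple[State, float]]) -> State:
    """Noisy-max style combination (an assumed rule, not from the poster).

    The child is up only if every effective parent is up, and down if
    any effective parent is down; the remaining mass is "troubled".
    """
    eff = [effective_state(state, w) for state, w in parents.values()]
    p_up = prod(e[0] for e in eff)
    p_not_down = prod(1.0 - e[2] for e in eff)
    p_down = 1.0 - p_not_down
    p_troubled = max(0.0, p_not_down - p_up)  # guard against rounding
    return (p_up, p_troubled, p_down)


# Hypothetical scenario: client A is up, the File Server is down, and the
# observation "A -> FS" depends on the File Server with weight 0.3.
a_to_fs = combine({
    "client A": ((1.0, 0.0, 0.0), 1.0),
    "File Server": ((0.0, 0.0, 1.0), 0.3),
})
print(a_to_fs)
```

With these illustrative inputs the observation node is mostly up, with the rest of its probability mass on down, because the failed File Server is only used 30% of the time. A fault localizer in this style would compare such predicted states with what clients actually observed and rank the root-cause assignments that explain the observations best.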