Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang.

Presentation transcript:


Enterprise Management: Between a Rock and a Hard Place
Manageability: stick with tried software, never change infrastructure
– Cheap
– Upgrades are hard, forget about innovation!
Usability: keep pace with technology
– Expensive: IT staff in the 1000s; 72% of the MS IT budget is staff
– Reliability issues: cost of down-time

Well-Managed Enterprises Still Unreliable
[Figure: fraction of requests vs. response time of a Web server (ms); 85% normal, 10% troubled, 0.7% down]
10% of responses take up to 10x longer than normal
How do we manage evolving enterprise networks?

Current Tools Miss the Forest for the Trees
– Monitor individual boxes or protocols
– Flood the admin with alerts
– Don't convey the end-to-end picture
[Figure: Client, DNS, Authentication Server, Web Server, SQL Backend]
But the primary goal of enterprise management is to diagnose user-perceived problems!

Sherlock: instead of looking at the nitty-gritty of individual components, use an end-to-end approach that focuses on user problems.

Challenges for the End-to-End Approach
Don't know what a user's performance depends on
– Dependencies are distributed
– Dependencies are non-deterministic
Don't know which dependency is causing the problem
– Server CPU at 70%, link dropped 10 packets, but which one affected the user?
[Figure: e.g., a Web connection involving Client, DNS, Auth. Server, Web Server, SQL Backend]

Sherlock's Contributions
– Passively infers dependencies from logs
– Builds a unified dependency graph incorporating network, server and application dependencies
– Diagnoses user problems in the enterprise
– Deployed in a part of the Microsoft enterprise

Sherlock's Architecture
[Figure: the Network Dependency Graph (servers, clients) plus User Observations (Web1: 1000 ms, Web2: 30 ms, File1: timeout) feed the Inference Engine, which outputs a list of troubled components]
Sherlock works for various client-server applications

[Figure: Video Server, Data Store, DNS]
How do you automatically learn such distributed dependencies?

Strawman: instrument all applications and libraries. Not practical.
Sherlock exploits timing info
– Dependent connections: if my client talks to B whenever it talks to C, the two connections are dependent
– False dependence: when B is accessed very often, an access to B lands near an access to C by coincidence
– Dependent iff t << inter-access time, where t is the gap between the access to B and the access to C
– As long as this occurs with probability higher than chance
[Figure: timeline of the client's accesses to B and C, showing the gap t and the inter-access time]
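The timing idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Sherlock's actual implementation: the trace format (timestamp, service), the function names `cooccurrence` and `infer_dependencies`, the `window` and `margin` parameters, and the crude chance estimate are all hypothetical.

```python
from collections import defaultdict

def cooccurrence(trace, target, window):
    """Fraction of accesses to `target` that are preceded, within `window`
    seconds, by an access to each other service."""
    target_times = [t for t, svc in trace if svc == target]
    counts = defaultdict(int)
    for t0 in target_times:
        nearby = {svc for t, svc in trace
                  if svc != target and 0 <= t0 - t <= window}
        for svc in nearby:
            counts[svc] += 1
    return {svc: n / len(target_times) for svc, n in counts.items()}

def infer_dependencies(trace, target, window, margin=0.2):
    """Keep a service only if it co-occurs with `target` clearly more often
    than its access rate alone would explain (a crude chance estimate)."""
    duration = max(t for t, _ in trace) - min(t for t, _ in trace)
    deps = {}
    for svc, p in cooccurrence(trace, target, window).items():
        accesses = sum(1 for _, s in trace if s == svc)
        chance = min(1.0, accesses * window / duration) if duration > 0 else 1.0
        if p - chance > margin:
            deps[svc] = p
    return deps

# Hypothetical client trace: DNS is looked up just before each Video access,
# while a chatty service "B" is contacted every 10 seconds regardless.
trace = [(0.0, "DNS"), (0.05, "Video"),
         (100.0, "DNS"), (100.04, "Video"),
         (200.0, "DNS"), (200.06, "Video")]
trace += [(float(t), "B") for t in range(5, 300, 10)]

print(infer_dependencies(trace, "Video", window=0.1))
print(cooccurrence(trace, "Video", window=20.0))
print(infer_dependencies(trace, "Video", window=20.0))
```

With a window much smaller than the inter-access time, only DNS co-occurs with Video. With a 20-second window, B also shows up next to most Video accesses, but no more often than chance predicts, so it is filtered out as a false dependence.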

Sherlock's Algorithm to Infer Dependencies
– Infer dependent connections from timing
– Infer topology from traceroutes and configurations
[Figure: dependency graph for "Bill watches Video" with nodes Bill's Client, DNS, Video and Store, and edges Bill to DNS, Bill to Video, Video to Store]
Works with legacy applications
Adapts to changing conditions

But hard dependencies are not enough…
Need probabilities
– If Bill caches the server's IP, DNS can be down but Bill still gets the video
– Sherlock uses the frequency with which a dependence occurs in logs as its edge probability
[Figure: dependency graph for "Bill watches Video" with edge probabilities p1, p2, p3; p1 = 10%, p2 = 100%]
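To make the role of edge probabilities concrete, here is a minimal Python sketch. It assumes a noisy-OR combination rule, which is an assumption of this example and not necessarily the model Sherlock uses; the function name `failure_probability` is hypothetical, and assigning the 10% edge to DNS and the 100% edge to the video server is a guess at the figure's labeling.

```python
def failure_probability(edge_probs, failed):
    """Noisy-OR (assumed): the request succeeds only if every failed
    dependency happens not to be needed this time."""
    ok = 1.0
    for component, p in edge_probs.items():
        if component in failed:
            ok *= 1.0 - p
    return 1.0 - ok

# Hypothetical edge probabilities for "Bill watches Video".
bill_watches_video = {"DNS": 0.10, "Video": 1.00}

print(failure_probability(bill_watches_video, {"DNS"}))    # DNS down: Bill usually still gets the video
print(failure_probability(bill_watches_video, {"Video"}))  # video server down: the request always fails
```

A hard (0/1) dependency graph would blame DNS as readily as the video server; the weighted edge captures that a DNS outage rarely affects a client that has the server's IP cached.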

How do we use the dependency graph to diagnose user problems?

Diagnosing User Problems
Which components caused the problem? Need to disambiguate!
Use correlation to disambiguate
– Across logs from the same client
– Across clients
Prefer simpler explanations
[Figure: dependency graphs for "Bill watches Video", "Bill sees Sales" and "Paul watches Video2"; Video and Video2 both depend on Store]
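Correlating observations across clients can be sketched as scoring each candidate fault by how well it explains everything that was observed. The Python below is an illustration only: the noisy-OR model, the `BACKGROUND` failure rate, the edge probabilities and which requests were troubled are all assumptions of this example, not values from the talk.

```python
from math import prod

BACKGROUND = 0.01  # assumed small chance that a request is troubled for an unmodeled reason

def p_troubled(deps, failed):
    """Probability that a request is troubled given the failed components (noisy-OR, assumed)."""
    return 1.0 - (1.0 - BACKGROUND) * prod(
        1.0 - p for c, p in deps.items() if c in failed)

def likelihood(observations, failed):
    """How well the hypothesis `failed` explains all observations together."""
    score = 1.0
    for deps, troubled in observations:
        p = p_troubled(deps, failed)
        score *= p if troubled else 1.0 - p
    return score

def rank_single_faults(observations):
    """Score every single-component fault hypothesis, best explanation first."""
    components = sorted({c for deps, _ in observations for c in deps})
    scored = [(likelihood(observations, {c}), c) for c in components]
    return sorted(scored, reverse=True)

# Hypothetical observations: (dependencies with edge probabilities, was the request troubled?)
observations = [
    ({"DNS": 0.1, "Video": 1.0, "Store": 1.0}, True),    # Bill watches Video: troubled
    ({"DNS": 0.1, "Sales": 1.0}, False),                 # Bill sees Sales: fine
    ({"DNS": 0.1, "Video2": 1.0, "Store": 1.0}, True),   # Paul watches Video2: troubled
]

ranking = rank_single_faults(observations)
for score, component in ranking:
    print(f"{component:7s} {score:.4f}")
```

In this example Store comes out on top: it is the only single fault that explains both troubled requests while leaving Bill's Sales request unaffected, which is the kind of disambiguation the slide describes.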

Will Correlation Scale?
Microsoft internal network
– O(100,000) client desktops
– O(10,000) servers
– O(10,000) apps/services
– O(10,000) network devices
[Figure: Building Network, Campus Core, Corporate Core, Data Center]
The dependency graph is huge
Can we evaluate all combinations of component failures?
– The number of fault combinations is exponential! Impossible to compute!

Scalable Algorithm to Correlate (from exponential to polynomial)
Only a few faults happen concurrently
– But how many is few? Evaluate enough to cover 99.9% of faults
– For the MS network, considering at most 2 concurrent faults is 99.9% accurate
Only a few nodes change state
– Re-evaluate a node only if an ancestor changes state
– Reduces the cost of evaluating a case by 30x-70x
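The effect of capping the number of concurrent faults can be checked with a short Python sketch. The helper names `fault_hypotheses` and `hypothesis_count` are hypothetical; the figure of 358 components that can fail is taken from the results later in the talk.

```python
from itertools import combinations
from math import comb

def fault_hypotheses(components, max_faults):
    """Yield every set of at most `max_faults` simultaneously failed components."""
    for k in range(max_faults + 1):
        yield from combinations(components, k)

def hypothesis_count(n, max_faults):
    """Number of hypotheses for n components: polynomial in n for a fixed cap."""
    return sum(comb(n, k) for k in range(max_faults + 1))

print(hypothesis_count(358, 2))   # 64262 hypotheses with at most 2 concurrent faults
print(len(str(2 ** 358)))         # number of decimal digits in the count of all subsets
print(list(fault_hypotheses(["DNS", "Video", "Store"], 2)))
```

With the cap at 2, the search space grows roughly as n squared instead of doubling with every added component.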

Results

Experimental Setup
Evaluated on the Microsoft enterprise network
Monitored 23 clients and 40 production servers for 3 weeks
– Clients are at MSR Redmond
– An extra host on each server's Ethernet logs packets
Busy, operational network
– Main intranet Web site and software distribution file server
– Load-balancing front-ends
– Many paths to the data center

What Do Web Dependencies in the MS Enterprise Look Like?
[Figure: dependency graphs for "Client accesses Portal" and "Client accesses Sales", both including the Auth. Server]
Sherlock discovers complex dependencies of real apps.

What Do File-Server Dependencies Look Like?
[Figure: dependency graph for "Client accesses Software Distribution Server" with nodes Auth. Server, WINS, DNS, Proxy, File Server and Backend Servers 1-4; edge probabilities shown: 100%, 10%, 6%, 5%, 2%, 8%, 5%, 1%, .3%]
Sherlock works for many client-server applications

Sherlock Identifies Causes of Poor Performance
Dependency (inference) graph: 2565 nodes; 358 components that can fail
[Figure: component index vs. time (days)]
– 87% of problems localized to 16 components
– Corroborated the three significant faults

Sherlock Goes Beyond Traditional Tools
[Figure: SNMP-reported utilization on a link flagged by Sherlock]
Problems coincide with spikes
Sherlock identifies the troubled link but SNMP cannot!

Comparing with Alternatives
Dataset of known (fault, observations) pairs
Accuracy = 1 - (Prob. False Positives + Prob. False Negatives)
– SCORE (non-probabilistic): 53%
– Shrink (probabilistic): 59%
– Sherlock: 91%
Sherlock outperforms existing tools!

Conclusions
Sherlock passively infers network-wide dependencies from logs and traceroutes
It diagnoses faults by correlating user observations
It works at scale!
Experiments in Microsoft's network show that Sherlock
– Finds faults missed by existing tools like SNMP
– Is more accurate than prior techniques
Steps towards a Microsoft product