Detailed diagnosis in enterprise networks Srikanth Kandula, Ratul Mahajan, Patrick Verkaik (UCSD), Sharad Agarwal, Jitu Padhye, Victor Bahl.

Network diagnosis Explaining faulty behavior ratul | sigcomm | '09

Current landscape of network diagnosis systems. [Figure: axis "Network size"; labels: big enterprises, large ISPs, small enterprises (marked "?")]

Why study small enterprise networks separately? Compared with big enterprises and large ISPs, small enterprises have: less sophisticated admins; less rich connectivity; many shared components (IIS, SQL, Exchange, …).

Our work: 1. Shows that small enterprises need “detailed diagnosis”, which is not enabled by current systems that focus on scale. 2. Develops NetMedic for detailed diagnosis; it diagnoses application faults without application knowledge.

Understanding problems in small enterprises: cases, symptoms, root causes.

And the survey says …

Symptom: App-specific 60%; Failed initialization 13%; Poor performance 10%; Hang or crash 10%; Unreachability 7%.

Identified cause: Non-app config (e.g., firewall) 30%; Software/driver bug 21%; App config 19%; Overload 4%; Hardware fault 2%; Unknown 25%.

Detailed diagnosis: handle app-specific as well as generic faults; identify culprits at a fine granularity.

Example problem 1: Server misconfig. [Figure: browser, web server, server config]

Example problem 2: Buggy client. [Figure: SQL server, SQL client C1, SQL client C2, requests]

Current formulations sacrifice detail (to scale). Dependency-graph-based formulations (e.g., Sherlock [SIGCOMM 2007]) model the network as a dependency graph at a coarse level and use a simple dependency model.

Example problem 1: Server misconfig. [Figure: browser, web server, server config] The network model is too coarse in current formulations.

Example problem 2: Buggy client. [Figure: SQL server, SQL client C1, SQL client C2, requests] The dependency model is too simple in current formulations.

A formulation for detailed diagnosis: a dependency graph of fine-grained components, in which component state is a multi-dimensional vector. [Figure: components SQL svr, Exch. svr, IIS svr, IIS config, process, OS, config, SQL client C1, SQL client C2; example state variables: % CPU time, IO bytes/sec, connections/sec, 404 errors/sec]
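One way to picture this formulation is as a small data structure: each component keeps a history of state vectors, and directed edges record which component may affect which. The sketch below is illustrative only and is not NetMedic's implementation; the class names, the `record` and `sources_of` helpers, the "Requests/sec" variable, and the sample values are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Component:
    """A fine-grained component whose state is a vector of named variables."""
    name: str
    variables: List[str]
    history: List[List[float]] = field(default_factory=list)

    def record(self, sample: Dict[str, float]) -> None:
        # Store one state vector, ordered like self.variables.
        self.history.append([sample[v] for v in self.variables])


@dataclass
class DependencyGraph:
    """Directed graph; an edge (a, b) means a's state may affect b's."""
    components: Dict[str, Component] = field(default_factory=dict)
    edges: Set[Tuple[str, str]] = field(default_factory=set)

    def add_component(self, component: Component) -> None:
        self.components[component.name] = component

    def add_edge(self, source: str, dependent: str) -> None:
        if source not in self.components or dependent not in self.components:
            raise KeyError("both endpoints must be added as components first")
        self.edges.add((source, dependent))

    def sources_of(self, name: str) -> List[str]:
        # Components with an edge pointing at `name`.
        return sorted(s for s, d in self.edges if d == name)


# Hypothetical example using component names from the slide.
graph = DependencyGraph()
sql = Component("SQL svr", ["% CPU time", "IO bytes/sec", "Connections/sec"])
sql.record({"% CPU time": 35.0, "IO bytes/sec": 1.2e6, "Connections/sec": 40.0})
graph.add_component(sql)
graph.add_component(Component("SQL client C1", ["Requests/sec"]))
graph.add_edge("SQL client C1", "SQL svr")
graph.add_edge("SQL svr", "SQL client C1")
```

Keeping the state as an ordered vector rather than a dictionary makes it straightforward to compare two states variable by variable without knowing what the variables mean.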

The goal of diagnosis: identify likely culprits for components of interest, without using the semantics of state variables (no application knowledge). [Figure: Svr, C1, C2, process, OS, config]

Using joint historical behavior to estimate impact: identify time periods when the state of S was “similar”, then measure how “similar”, on average, the states of D are at those times. [Figure: history of state vectors for D (variables a, b, c) and S (variables a, b, c, d) at times 0, 1, …, n. Example with Svr, C1, C2: request rate (low), response time (high); request rate (high), response time (high); request rate (high); edge labels H, H, L]
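The two steps on this slide can be sketched in a few lines. This is a minimal illustration rather than NetMedic's actual estimator: the function names, the root-mean-square distance, the choice of the `k` nearest historical times, and the assumption that all variables are already scaled to [0, 1] are introduced here for the example.

```python
import math
from typing import List, Sequence


def distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Root-mean-square difference; stays in [0, 1] for inputs in [0, 1]."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)) / len(u))


def estimate_impact(
    s_history: List[Sequence[float]],
    d_history: List[Sequence[float]],
    s_now: Sequence[float],
    d_now: Sequence[float],
    k: int = 3,
) -> float:
    """Estimate how strongly S's current state explains D's current state.

    The two histories are aligned by time index.
    """
    # Step 1: find the k past times when S's state was most similar to now.
    by_similarity = sorted(
        range(len(s_history)), key=lambda t: distance(s_history[t], s_now)
    )
    nearest = by_similarity[:k]
    # Step 2: average how similar D's state at those times is to D's state now.
    similarities = [1.0 - distance(d_history[t], d_now) for t in nearest]
    return sum(similarities) / len(similarities)


# Hypothetical history: S is a client's request rate, D a server's response
# time. Historically, a high request rate coincided with a high response time.
s_hist = [[0.1], [0.1], [0.9], [0.9], [0.1], [0.9]]
d_hist = [[0.2], [0.2], [0.8], [0.8], [0.2], [0.8]]

# Response time is high now. A client that is currently sending at a high
# rate gets a high impact; a client currently sending at a low rate does not.
impact_busy_client = estimate_impact(s_hist, d_hist, s_now=[0.9], d_now=[0.8])
impact_idle_client = estimate_impact(s_hist, d_hist, s_now=[0.1], d_now=[0.8])
```

Nothing in the sketch refers to what the variables mean, which mirrors the slide's point that impact is inferred from joint history alone.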

Robust implementation of impact estimation: ignore state variables that represent redundant information; place higher weight on state variables likely related to the faults being diagnosed; ignore state variables irrelevant to the interaction with a neighbor; account for aggregate relationships among state variables of neighboring components; account for the disparate ranges of state variables.
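Two of these points, disparate ranges and redundant variables, lend themselves to a short sketch. The code below is an assumption-laden illustration, not the paper's procedure: min-max scaling, the Pearson-correlation test for redundancy, the 0.95 threshold, and the function names are all choices made here.

```python
import math
from typing import List, Sequence


def normalize(history: List[Sequence[float]]) -> List[List[float]]:
    """Min-max scale each variable (column) to [0, 1] over the history.

    A variable that never changes is mapped to 0.0 everywhere.
    """
    scaled_columns = []
    for column in zip(*history):
        low, high = min(column), max(column)
        span = high - low
        scaled_columns.append(
            [(x - low) / span if span else 0.0 for x in column]
        )
    return [list(row) for row in zip(*scaled_columns)]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def non_redundant(history: List[Sequence[float]], threshold: float = 0.95) -> List[int]:
    """Indices of variables to keep, dropping any variable that is highly
    correlated with one already kept."""
    columns = list(zip(*history))
    kept: List[int] = []
    for i, column in enumerate(columns):
        if all(abs(pearson(column, columns[j])) < threshold for j in kept):
            kept.append(i)
    return kept


# Hypothetical samples with three variables on very different scales; the
# second variable is an exact multiple of the first, so it adds no information.
samples = [[10.0, 1000.0, 5.0], [20.0, 2000.0, 3.0], [30.0, 3000.0, 9.0]]
scaled = normalize(samples)
keep = non_redundant(samples)
```

Scaling first means that a variable measured in bytes per second cannot dominate a distance computation simply because its raw numbers are larger.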

Implementation of NetMedic. [Figure: Monitor components, producing component states. Diagnose: (a) edge impact, (b) path impact. Inputs: target components, diagnosis time, reference time. Output: ranked list of likely culprits]
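To show how edge impacts might be turned into a ranked list, the sketch below combines them along paths in the dependency graph. The combination rule is an assumption for illustration, not the rule from the paper: path impact is taken as the product of edge impacts along the best simple path to the target, and the function names and edge values are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

EdgeImpacts = Dict[Tuple[str, str], float]


def path_impact(
    edges: EdgeImpacts, source: str, target: str, seen: FrozenSet[str] = frozenset()
) -> float:
    """Largest product of edge impacts over simple paths from source to target.

    Returns 0.0 if target cannot be reached from source.
    """
    if source == target:
        return 1.0
    seen = seen | {source}
    best = 0.0
    for (u, v), weight in edges.items():
        if u == source and v not in seen:
            best = max(best, weight * path_impact(edges, v, target, seen))
    return best


def rank_culprits(edges: EdgeImpacts, target: str) -> List[str]:
    """Order all other components by their path impact on the target,
    highest first (ties broken by name)."""
    nodes = {n for edge in edges for n in edge} - {target}
    scored = [(path_impact(edges, n, target), n) for n in nodes]
    return [name for _, name in sorted(scored, key=lambda p: (-p[0], p[1]))]


# Hypothetical edge impacts for the server-misconfiguration example.
edge_impacts = {
    ("Web server", "Browser"): 0.9,
    ("Server config", "Web server"): 0.9,
    ("OS", "Web server"): 0.1,
}
ranking = rank_culprits(edge_impacts, target="Browser")
```

With a plain product, a component can never outrank the neighbor it acts through, so a real system would need additional signals to separate a root cause from the intermediate component that merely relays its effect.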

Evaluation setup: IIS, SQL, Exchange, …; actively used desktops; diverse set of faults observed in the logs. Number of components: ~1000. Dimensions per component (avg): 35.

NetMedic assigns low ranks to actual culprits (a low rank places the culprit near the top of the ranked list).

NetMedic handles concurrent faults well (2 simultaneous faults).

Other results in the paper: NetMedic needs a modest amount (~60 mins) of history. It compares favorably with a method that understands variable semantics.

Conclusions: NetMedic enables detailed diagnosis in enterprise networks without application knowledge. Think small: small enterprise networks deserve more attention.