Detailed and understandable network diagnosis
Ratul Mahajan
With Srikanth Kandula, Bongshin Lee, Zhicheng Liu (GaTech), Patrick Verkaik (UCSD), Sharad Agarwal, Jitu Padhye, Victor Bahl


1 Detailed and understandable network diagnosis
Ratul Mahajan
With Srikanth Kandula, Bongshin Lee, Zhicheng Liu (GaTech), Patrick Verkaik (UCSD), Sharad Agarwal, Jitu Padhye, Victor Bahl

2 Network diagnosis explains faulty behavior
ratul | gatech | '09
- Starts with problem symptoms and ends at likely culprits
- Example symptom: user cannot access a remote folder
- Example culprit: configuration change denies permission
[Figure labels: Configuration, File server, Photo viewer]

3 Current landscape of network diagnosis systems
[Figure: axis "Network size"; labels: Big enterprises, Large ISPs, Small enterprises (marked "?")]

4 Why study small enterprise networks separately?
Compared with big enterprises and large ISPs, small enterprises have:
- Less sophisticated admins
- Less rich connectivity
- Many shared components
[Figure labels: IIS, SQL, Exchange, …]

5 Our work
1. Uncovers the need for detailed and understandable diagnosis
2. Develops NetMedic for detailed diagnosis
   - Diagnoses application faults without application knowledge
3. Develops NetClinic for explaining diagnostic analysis

6 Understanding problems in small enterprises
- 100+ cases
- Symptoms, root causes

7 And the survey says …

Symptom                            Share
App-specific                       60%
Failed initialization              13%
Poor performance                   10%
Hang or crash                      10%
Unreachability                      7%

Identified cause                   Share
Non-app config (e.g., firewall)    30%
Software/driver bug                21%
App config                         19%
Overload                            4%
Hardware fault                      2%
Unknown                            25%

Detailed diagnosis:
- Handle app-specific as well as generic faults
- Identify culprits at a fine granularity

8 Example problem 1: Server misconfig
[Figure labels: Web server, Browser, Server config]

9 Example problem 2: Buggy client
[Figure labels: SQL server, SQL client C1, SQL client C2, Requests]

10 Example problem 3: Client misconfig
ratul | sigcomm | '09
[Figure labels: Exchange server, Outlook, Outlook config]

11 Current formulations sacrifice detail (to scale)
Dependency graph based formulations (e.g., Sherlock [SIGCOMM 2007]):
- Model the network as a dependency graph at a coarse level
- Simple dependency model

12 Example problem 1: Server misconfig
[Figure labels: Web server, Browser, Server config]
The network model is too coarse in current formulations

13 Example problem 2: Buggy client
[Figure labels: SQL server, SQL client C1, SQL client C2, Requests]
The dependency model is too simple in current formulations

14 Example problem 3: Client misconfig
[Figure labels: Exchange server, Outlook, Outlook config]
The failure model is too simple in current formulations

15 A formulation for detailed diagnosis
- Dependency graph of fine-grained components
- Component state is a multi-dimensional vector
[Figure labels: SQL svr, Exch. svr, IIS svr, IIS config, Process, OS, Config, SQL client C1, SQL client C2]
[Example state variables: % CPU time, IO bytes/sec, Connections/sec, 404 errors/sec]
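The formulation on this slide can be pictured as a small data structure. The following is only a sketch under stated assumptions: the class names, method names, variable names (`cpu_pct`, `conn_per_sec`, `err404_per_sec`), the sample values, and the edge directions are all hypothetical and are not taken from NetMedic itself.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """A fine-grained component, e.g. a process, an OS, or a config."""
    name: str
    # One state vector per time bin: variable name -> observed value.
    history: list = field(default_factory=list)

    def record(self, **variables: float) -> None:
        """Append the state vector observed in the next time bin."""
        self.history.append(dict(variables))


@dataclass
class DependencyGraph:
    components: dict = field(default_factory=dict)
    # edges[src] holds the names of components that src may directly affect.
    edges: dict = field(default_factory=dict)

    def add(self, name: str) -> Component:
        """Return the named component, creating it on first use."""
        return self.components.setdefault(name, Component(name))

    def depend(self, src: str, dst: str) -> None:
        """Record that the state of src can influence the state of dst."""
        self.add(src)
        self.add(dst)
        self.edges.setdefault(src, set()).add(dst)


# Hypothetical example: edge directions and values are made up.
graph = DependencyGraph()
graph.depend("IIS config", "IIS svr")
graph.depend("SQL svr", "SQL client C1")
graph.add("IIS svr").record(cpu_pct=12.0, conn_per_sec=40.0, err404_per_sec=0.0)
```

Keeping the state as an untyped bag of named numbers mirrors the slide's point that the variables need no application-specific interpretation.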

16 The goal of diagnosis
- Identify likely culprits for components of interest
- Without using semantics of state variables, i.e., no application knowledge
[Figure labels: Svr, C1, C2, Process, OS, Config]

17 Using joint historical behavior to estimate impact
- Identify time periods when state of S was “similar”
- How “similar” on average states of D are at those times
[Figure: per-time-bin state vectors of D (d_0 … d_n, variables a to c) and of S (s_0 … s_n, variables a to d)]
[Figure: Svr, C1, C2 example; labels: Request rate (low), Request rate (high), Response time (high), H, H, L]
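The two steps on this slide can be sketched roughly as follows. This is a minimal illustration, not NetMedic's actual computation: it assumes that "similar" means similar to the state at diagnosis time, that similarity is Euclidean distance on values already scaled to [0, 1], and that the `k` closest time bins are used. The function names and the sample numbers are hypothetical.

```python
import math


def distance(u, v):
    """Euclidean distance between two equal-length state vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def edge_impact(s_hist, d_hist, s_now, d_now, k=2):
    """Estimate how strongly S is affecting D at diagnosis time.

    Step 1: find the k historical time bins in which the state of S was
            closest to s_now.
    Step 2: measure how close, on average, the state of D in those bins
            is to d_now, and map that to a score in [0, 1].
    """
    order = sorted(range(len(s_hist)), key=lambda t: distance(s_hist[t], s_now))
    similar = order[:k]
    avg = sum(distance(d_hist[t], d_now) for t in similar) / len(similar)
    max_dist = math.sqrt(len(d_now))  # largest distance inside the unit cube
    return 1.0 - avg / max_dist


# Made-up history: D's response time tracks the request rate of one neighbor.
request_rate = [[0.1], [0.9], [0.85], [0.15]]
response_time = [[0.1], [0.9], [0.8], [0.2]]
unrelated = [[0.5], [0.5], [0.5], [0.5]]

related_impact = edge_impact(request_rate, response_time, [0.9], [0.9])
unrelated_impact = edge_impact(unrelated, response_time, [0.5], [0.9])
```

With these made-up numbers the neighbor whose history explains the current state of D receives the higher score, which is the intuition the slide describes.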

18 Robust impact estimation
- Ignore state variables that represent redundant info
- Place higher weight on state variables likely related to fault being diagnosed
- Ignore state variables irrelevant to interaction with neighbor
- Account for aggregate relationships among state variables of neighboring components
- Account for disparate ranges of state variables
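As one illustration of the last point, variables with very different ranges can be brought onto a common scale before states are compared. The sketch below uses plain min-max scaling over the observed history; that choice is an assumption made here for illustration, since the slide does not say which normalization NetMedic applies. The function name and sample values are hypothetical.

```python
def normalize(history):
    """Min-max scale each state variable to [0, 1] over its observed history.

    history is a list of state vectors (one list of numbers per time bin).
    A variable that never changes carries no information and maps to 0.0.
    """
    dims = range(len(history[0]))
    lo = [min(row[i] for row in history) for i in dims]
    hi = [max(row[i] for row in history) for i in dims]

    def scale(row):
        return [
            (row[i] - lo[i]) / (hi[i] - lo[i]) if hi[i] > lo[i] else 0.0
            for i in dims
        ]

    return [scale(row) for row in history]


# Made-up samples: % CPU time, IO bytes/sec, and a constant variable.
raw = [[10.0, 1000.0, 7.0], [20.0, 3000.0, 7.0], [30.0, 2000.0, 7.0]]
scaled = normalize(raw)
```

Without such a step, a variable measured in thousands (IO bytes/sec) would dominate any distance computed alongside one measured in tens (% CPU time).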

19 Ranking likely culprits
[Figure: example dependency graph over components A, B, C, D with edge weights such as 0.8 and 0.2; table columns: Path weight, Global impact]
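The slide names two quantities, path weight and global impact, without defining them in the transcript. The sketch below shows one way such a ranking could be computed, under assumptions that are not confirmed by the slide: path weight is taken as the best geometric mean of edge weights along any simple path from a candidate to the affected component, global impact is the sum of a candidate's path weights to all components, and candidates are ranked by the product of the two. The graph and its weights are made up and do not reproduce the slide's figure.

```python
import math


def best_path_weight(edges, src, dst, seen=None, weights=()):
    """Largest geometric mean of edge weights over simple paths src -> dst.

    edges maps a component to {neighbor: edge weight}. Returns 0.0 when dst
    is unreachable and 1.0 when src is dst (assumed convention).
    """
    if src == dst:
        if not weights:
            return 1.0
        return math.prod(weights) ** (1.0 / len(weights))
    seen = (seen or set()) | {src}
    best = 0.0
    for nxt, w in edges.get(src, {}).items():
        if nxt not in seen:
            best = max(best, best_path_weight(edges, nxt, dst, seen, weights + (w,)))
    return best


def rank_culprits(edges, components, target):
    """Order components from most to least likely culprit for target."""
    scores = {}
    for c in components:
        path = best_path_weight(edges, c, target)
        global_impact = sum(best_path_weight(edges, c, o) for o in components)
        scores[c] = path * global_impact
    return sorted(components, key=lambda c: scores[c], reverse=True)


# Hypothetical graph: C and B affect A directly, D affects A through B.
edges = {"C": {"A": 0.8}, "B": {"A": 0.2}, "D": {"B": 0.8}}
ranking = rank_culprits(edges, ["A", "B", "C", "D"], target="A")
```

In this toy graph the affected component A is itself a candidate, so it appears in the ranking alongside its neighbors.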

20 Implementation of NetMedic
- Monitor components; output: component states
- Diagnose: (a) edge impact, (b) path impact; inputs: target components, diagnosis time, reference time
- Output: ranked list of likely culprits

21 Evaluation setup
- 10 actively used desktops
- Diverse set of faults observed in the logs
- #components: ~1000
- #dimensions per component (avg): 35
[Figure labels: IIS, SQL, Exchange, …]

22 NetMedic assigns low ranks to actual culprits

23 NetMedic handles concurrent faults well
- 2 simultaneous faults

24 Other empirical results
- NetMedic needs a modest amount (~60 mins) of history
- The key to effectiveness is correctly identifying many low-impact edges
- It compares favorably with a method that understands variable semantics

25 Unleashing (systems like) NetMedic on admins
- How to present the analysis results?
- Need human verification
- (Fundamental?) trade-off between coverage and accuracy
[Figure: axes Accuracy and Fault coverage; labels: Rule based, Inference based, State of the practice, Research activity]

26 The understandability challenge
- Admins should be able to verify the correctness of the analysis
- Identify culprits themselves if analysis is incorrect
Two sub-problems at the intersection with HCI:
- Visualizing complex analysis (NetClinic)
- Intuitiveness of analysis (ongoing work)

27 NetClinic: Visualizing diagnostic analysis
- Underlying assumption: admins can verify analysis if information is presented appropriately
  - They have expert, out-of-band information
- Views diagnosis as multi-level analysis
- Makes results at all levels accessible on top of a semantic graph layout
- Allows top-down and bottom-up navigation across levels while retaining context


32 NetClinic user study
- 11 participants with knowledge of computer networks but not of NetMedic
- Given 3 diagnostic tasks each after training
- 88% task completion rate
- Uncovered a rich mix of user strategies that the visualization must support

33 Intuitiveness of analysis
- What if you could modify the analysis itself to make it more accessible to humans?
- Counters the tendency to “optimize” for incremental gains in accuracy
[Figure: axes Accuracy and Understandability]

34 Intuitiveness of analysis (2)
- Goal: go from mechanical measures to more human-centric measures
  - Example: MoS measure for VoIP
- Factors to consider:
  - What information is used? E.g., local vs. global
  - What operations are used? E.g., arithmetic vs. geometric means
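To make the last factor concrete, the snippet below compares the two means on a made-up list of per-edge impact values. It is only an illustration of how the choice of operation changes the number an admin would see; the values are hypothetical and the snippet says nothing about which mean NetMedic or NetClinic uses.

```python
from statistics import fmean, geometric_mean

# Made-up impacts along one path: two strong edges and one weak edge.
edge_weights = [0.9, 0.9, 0.1]

arithmetic = fmean(edge_weights)          # treats the weak edge as one vote in three
geometric = geometric_mean(edge_weights)  # a single weak edge pulls the result down more

print(f"arithmetic mean: {arithmetic:.3f}")
print(f"geometric mean:  {geometric:.3f}")
```

The arithmetic mean is easier to reason about by hand, while the geometric mean reacts more strongly to a single weak link, which is the kind of trade-off the slide raises.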

35 Conclusions
- NetClinic enables admins to understand and verify complex diagnostic analyses
- NetMedic enables detailed diagnosis in enterprise networks w/o application knowledge
- Thinking small (networks) can provide new perspectives
[Figure labels: Accuracy, Detail, Understandability, Coverage]

