Published by Sandra Hubbard. Modified over 9 years ago.
1
Problem Diagnosis
Distributed Problem Diagnosis
–Sherlock
–X-Trace
2
Troubleshooting Networked Systems
Hard to develop, debug, deploy, troubleshoot
No standard way to integrate debugging, monitoring, diagnostics
3
Status quo: device centric
[Figure: Firewall, Load Balancer, Web 1, Web 2, and Database, each shown with an excerpt of its own log in its own format: firewall timestamps, load-balancer "Dispatch" and "Server s3 down" notices, web access-log lines, and database "LOG: statement" lines]
4
Status quo: device centric
Determining paths:
–Join logs on time and ad-hoc identifiers
Relies on
–well synchronized clocks
–extensive application knowledge
Requires all operations logged to guarantee complete paths
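A minimal sketch of what "join logs on time" amounts to, and why it relies on well synchronized clocks. The log layout (lists of `(timestamp_seconds, message)` tuples), the function name `join_on_time`, and the tolerance value are assumptions made for illustration; they are not taken from the slides.

```python
def join_on_time(web_log, db_log, tolerance):
    """Pair each web request with DB statements logged up to `tolerance` seconds after it."""
    pairs = []
    for w_time, w_msg in web_log:
        for d_time, d_msg in db_log:
            # A DB statement is attributed to a request only if it follows closely in time.
            if 0 <= d_time - w_time <= tolerance:
                pairs.append((w_msg, d_msg))
    return pairs


# Hypothetical log excerpts: (timestamp in seconds, message).
web_log = [(100.0, "GET /a"), (105.0, "GET /b")]
db_log = [(100.2, "SELECT a"), (105.1, "SELECT b")]

synced = join_on_time(web_log, db_log, tolerance=0.5)

# Same events, but the database clock runs 2 seconds behind: the join finds nothing.
skewed_db_log = [(t - 2.0, msg) for t, msg in db_log]
skewed = join_on_time(web_log, skewed_db_log, tolerance=0.5)
```

With synchronized clocks the join recovers both request/statement pairs; with the skewed clock it recovers none, which is the fragility the slide points at.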
5
Examples
[Figure: User, DNS Server, Proxy, Web Server]
9
Approaches to Diagnosis
Passively learn the relationships
–Infer problems as deviations from the norm
Actively instrument the stack to learn relationships
–Infer problems as deviations from the norm
10
Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula
11
Enterprise Management: Between a Rock and a Hard Place
Manageability: stick with tried software, never change infrastructure
–Cheap
–Upgrades are hard, forget about innovation!
Usability: keep pace with technology
–Expensive
–IT staff in 1000s
–72% of MS IT budget is staff
–Reliability issues
–Cost of down-time
12
Well-Managed Enterprises Still Unreliable
[Figure: fraction of requests against response time of a Web server (ms); 85% normal, 10% troubled, 0.7% down]
10% of responses take up to 10x longer than normal
How do we manage evolving enterprise networks?
13
Current Tools Miss the Forest for the Trees
Monitor individual boxes or protocols
Flood admin with alerts
Don’t convey the end-to-end picture
[Figure: Client, DNS, Authentication Server, Web Server, SQL Backend]
But the primary goal of enterprise management is to diagnose user-perceived problems!
14
Sherlock
Instead of looking at the nitty-gritty of individual components, use an end-to-end approach that focuses on user problems
15
Challenges for the End-to-End Approach
Don’t know what user’s performance depends on
16
Challenges for the End-to-End Approach
Don’t know what user’s performance depends on
–Dependencies are distributed
–Dependencies are non-deterministic
Don’t know which dependency is causing the problem
–Server CPU 70%, link dropped 10 packets, but which affected user?
[Figure: e.g., a Web connection involving Client, DNS, Auth. Server, Web Server, SQL Backend]
17
Sherlock’s Contributions
Passively infers dependencies from logs
Builds a unified dependency graph incorporating network, server and application dependencies
Diagnoses user problems in the enterprise
Deployed in a part of the Microsoft Enterprise
18
Sherlock’s Architecture
19
Sherlock’s Architecture
[Figure: User Observations from clients about servers (Web1 1000ms, Web2 30ms, File1 Timeout) + Network Dependency Graph, fed to the Inference Engine = List of Troubled Components]
Sherlock works for various client-server applications
20
[Figure: Video Server, Data Store, DNS]
How do you automatically learn such distributed dependencies?
21
Sherlock exploits timing info
Strawman: Instrument all applications and libraries
–Not practical
Dependent connections: if the client talks to B whenever it talks to C
[Figure: timeline on which my client talks to B and, a time t later, talks to C]
22
Sherlock exploits timing info
Strawman: Instrument all applications and libraries
–Not practical
Dependent connections: if the client talks to B whenever it talks to C
False dependence
[Figure: timeline with many accesses to B, one of which happens to fall a time t before the access to C]
23
Sherlock exploits timing info
Strawman: Instrument all applications and libraries
–Not practical
Dependent connections: if the client talks to B whenever it talks to C
Dependent iff t << inter-access time
–As long as this occurs with probability higher than chance
[Figure: timeline with accesses to B separated by the inter-access time, and an access to C a time t after B]
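The timing test on this slide can be sketched roughly as follows. This is an illustration under stated assumptions, not Sherlock's actual algorithm: the function names, the fixed `window` standing in for t, the crude "chance" estimate (number of B accesses times the window, divided by the trace duration), and the `margin` factor are all hypothetical choices.

```python
from bisect import bisect_left


def cooccurrence_probability(b_times, c_times, window):
    """Fraction of accesses to C that come at most `window` seconds after an access to B."""
    b_sorted = sorted(b_times)
    hits = 0
    for t_c in c_times:
        # First access to B at or after (t_c - window); it counts if it is not after t_c.
        i = bisect_left(b_sorted, t_c - window)
        if i < len(b_sorted) and b_sorted[i] <= t_c:
            hits += 1
    return hits / len(c_times) if c_times else 0.0


def chance_probability(b_times, window, duration):
    """Rough chance that an arbitrary window of this length contains an access to B."""
    return min(1.0, len(b_times) * window / duration)


def looks_dependent(b_times, c_times, window, duration, margin=2.0):
    """Flag a dependency only if co-occurrence clearly exceeds the chance estimate."""
    p = cooccurrence_probability(b_times, c_times, window)
    return p > margin * chance_probability(b_times, window, duration)


# B is accessed rarely, and always just before C: plausible dependency.
rare_b = [0.0, 100.0, 200.0]
c_after_b = [0.005, 100.004, 200.006]
real = looks_dependent(rare_b, c_after_b, window=0.01, duration=300.0)

# B is accessed every second, so some B always precedes C: false dependence.
busy_b = [float(i) for i in range(300)]
false_dep = looks_dependent(busy_b, [10.5, 50.5], window=1.0, duration=300.0)
```

In the second case every access to C follows an access to B within the window, yet the test rejects the dependency because the window is not small relative to B's inter-access time.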
24
Sherlock’s Algorithm to Infer Dependencies
Infer dependent connections from timing
[Figure: dependency graph with Video, DNS, Store]
25
Sherlock’s Algorithm to Infer Dependencies
Infer dependent connections from timing
Infer topology from traceroutes & configurations
Works with legacy applications
Adapts to changing conditions
[Figure: dependency graph for "Bill watches Video" with nodes Bill’s Client, DNS, Video, Store and dependencies Bill-DNS, Bill-Video, Video-Store]
26
But hard dependencies are not enough…
27
But hard dependencies are not enough…
Need probabilities
–If Bill caches server’s IP, DNS down but Bill gets video
Sherlock uses the frequency with which a dependence occurs in logs as its edge probability
[Figure: dependency graph for "Bill watches Video" (Bill’s Client, DNS, Video, Store) with edge probabilities p1, p2, p3; p1=10%, p2=100%]
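A small sketch of "frequency in the logs as edge probability". The input format (one set of co-observed dependencies per logged access) and the function name `edge_probabilities` are assumptions for illustration; the sample data is invented so that the resulting values echo the 10% and 100% shown on the slide.

```python
from collections import Counter


def edge_probabilities(accesses):
    """Map each dependency to the fraction of logged accesses in which it was observed."""
    counts = Counter()
    for deps in accesses:
        counts.update(deps)
    total = len(accesses)
    return {dep: n / total for dep, n in counts.items()}


# Hypothetical log of 10 "Bill watches Video" accesses: the video server is contacted
# every time, DNS only once because Bill usually has the server's IP cached.
accesses = [{"Video", "DNS"}] + [{"Video"}] * 9
probs = edge_probabilities(accesses)
```

Here `probs["Video"]` is 1.0 and `probs["DNS"]` is 0.1, so a DNS outage would rarely be blamed for Bill's video problems.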
28
How do we use the dependency graph to diagnose user problems?
29
Diagnosing User Problems
Which components caused the problem?
Need to disambiguate!!
[Figure: dependency graph for "Bill watches Video" with nodes Bill’s Client, DNS, Video, Store and dependencies Bill-DNS, Bill-Video, Video-Store]
30
Diagnosing User Problems
Which components caused the problem?
Use correlation to disambiguate!!
Disambiguate by correlating
–Across logs from same client
–Across clients
Prefer simpler explanations
[Figure: dependency graphs for "Bill watches Video" (Bill-DNS, Bill-Video, Video-Store), "Bill sees Sales" (Bill-Sales), and "Paul watches Video2" (Paul-Video2, Video2-Store)]
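The idea of correlating observations and preferring simpler explanations can be illustrated with a deliberately simplified, deterministic sketch. It is not Sherlock's probabilistic inference: here a hypothesis (a set of failed components) "explains" an observation if the observation is troubled exactly when it touches a failed component. The function name `diagnose`, the observation format, and which observations are troubled are assumptions for illustration.

```python
from itertools import combinations


def diagnose(observations, components, max_faults=2):
    """Return the smallest set of failed components that explains the most observations.

    observations: list of (set_of_components_used, is_troubled) pairs.
    """
    best = None
    for k in range(max_faults + 1):  # smaller hypotheses are tried first
        for hypothesis in combinations(sorted(components), k):
            failed = set(hypothesis)
            explained = sum(
                1 for used, troubled in observations
                if bool(used & failed) == troubled
            )
            # Strict '>' keeps the earlier, simpler hypothesis on ties.
            if best is None or explained > best[0]:
                best = (explained, hypothesis)
    return best[1]


# Hypothetical observations modeled on the slide's example.
observations = [
    ({"Bill", "DNS", "Video", "Store"}, True),   # Bill watches Video: troubled
    ({"Bill", "DNS", "Sales"}, False),           # Bill sees Sales: fine
    ({"Paul", "Video2", "Store"}, True),         # Paul watches Video2: troubled
]
components = {"Bill", "Paul", "DNS", "Video", "Video2", "Sales", "Store"}
culprit = diagnose(observations, components)
```

Bill's working Sales access clears Bill and DNS, and Paul's troubled Video2 access points at the shared Store, so the single-fault hypothesis `("Store",)` wins over two-fault alternatives such as Video plus Video2.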
31
Will Correlation Scale?
32
Will Correlation Scale?
Microsoft Internal Network
–O(100,000) client desktops
–O(10,000) servers
–O(10,000) apps/services
–O(10,000) network devices
Dependency graph is huge
[Figure: Building Network, Campus Core, Corporate Core, Data Center]
33
Will Correlation Scale?
Can we evaluate all combinations of component failures?
The number of fault combinations is exponential!
Impossible to compute!
34
Scalable Algorithm to Correlate
Exponential to polynomial
Only a few faults happen concurrently
–But how many is few? Evaluate enough to cover 99.9% of faults
–For MS network, at most 2 concurrent faults: 99.9% accurate
35
Scalable Algorithm to Correlate
Exponential to polynomial
Only a few faults happen concurrently
–But how many is few? Evaluate enough to cover 99.9% of faults
–For MS network, at most 2 concurrent faults: 99.9% accurate
Only few nodes change state
36
Scalable Algorithm to Correlate
Exponential to polynomial
Only a few faults happen concurrently
–But how many is few? Evaluate enough to cover 99.9% of faults
–For MS network, at most 2 concurrent faults: 99.9% accurate
Only few nodes change state
–Re-evaluate only if an ancestor changes state
–Reduces the cost of evaluating a case by 30x-70x
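The "exponential to polynomial" step is plain counting: with n components, all fault combinations number 2^n, while hypotheses with at most k concurrent faults number the sum of C(n, i) for i from 0 to k. A short sketch, using the figure of 358 failable components quoted later in the deck; the function name is an assumption for illustration.

```python
from math import comb


def hypotheses_with_at_most(n_components, max_faults):
    """Number of fault hypotheses containing at most `max_faults` failed components."""
    return sum(comb(n_components, k) for k in range(max_faults + 1))


n = 358                                   # components that can fail
bounded = hypotheses_with_at_most(n, 2)   # at most 2 concurrent faults
unbounded = 2 ** n                        # every subset of components
```

For n = 358 the bounded count is 1 + 358 + 63,903 = 64,262 hypotheses, against 2^358 for the unrestricted search.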
37
Results
38
Experimental Setup
Evaluated on the Microsoft enterprise network
Monitored 23 clients, 40 production servers for 3 weeks
–Clients are at MSR Redmond
–Extra host on server’s Ethernet logs packets
Busy, operational network
–Main Intranet Web site and software distribution file server
–Load-balancing front-ends
–Many paths to the data-center
39
What Do Web Dependencies in the MS Enterprise Look Like?
40
What Do Web Dependencies in the MS Enterprise Look Like?
[Figure: dependency graph for "Client Accesses Portal", including the Auth. Server]
42
What Do Web Dependencies in the MS Enterprise Look Like?
[Figure: dependency graphs for "Client Accesses Portal" and "Client Accesses Sales", including the Auth. Server]
Sherlock discovers complex dependencies of real apps.
43
What Do File-Server Dependencies Look Like?
[Figure: dependency graph for "Client Accesses Software Distribution Server" with nodes Auth. Server, WINS, DNS, Backend Servers 1-4, Proxy, File Server; edge weights shown: 100%, 10%, 6%, 5%, 2%, 8%, 5%, 1%, .3%]
Sherlock works for many client-server applications
44
Sherlock Identifies Causes of Poor Performance
Dependency Graph: 2565 nodes; 358 components that can fail
[Figure: Component Index against Time (days)]
87% of problems localized to 16 components
45
Sherlock Identifies Causes of Poor Performance
Inference Graph: 2565 nodes; 358 components that can fail
[Figure: Component Index against Time (days)]
Corroborated the three significant faults
46
Sherlock Goes Beyond Traditional Tools
[Figure: SNMP-reported utilization on a link flagged by Sherlock]
Problems coincide with spikes
Sherlock identifies the troubled link but SNMP cannot!
48
X-Trace
X-Trace records events in a distributed execution and their causal relationship
Events are grouped into tasks
–Well defined starting event and all that is causally related
Each event generates a report, binding it to one or more preceding events
Captures full happens-before relation
49
X-Trace Output
Task graph capturing task execution
–Nodes: events across layers, devices
–Edges: causal relations between events
[Figure: task graph spanning HTTP Client, HTTP Proxy, and HTTP Server, two TCP connections (TCP 1 Start/End, TCP 2 Start/End), and the IP hops and routers beneath them]
50
Basic Mechanism
Each event uniquely identified within a task: [TaskId, EventId]
[TaskId, EventId] propagated along execution path
For each event create and log an X-Trace report
–Enough info to reconstruct the task graph
[Figure: the task graph (HTTP Client, HTTP Proxy, HTTP Server, TCP 1, TCP 2, IP routers) with events labeled a to n and metadata [T, a], [T, g] carried between them]
X-Trace Report: TaskID: T; EventID: g; Edge: from a, f
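Since each report names its task, its event, and the preceding events it is bound to, the task graph can be rebuilt by grouping reports by task and turning each "Edge: from" entry into a causal edge. A minimal sketch follows; the dictionary keys (`task_id`, `event_id`, `edges_from`) and the function name are assumptions for illustration, not X-Trace's actual report format.

```python
def build_task_graphs(reports):
    """Group reports by task and return {task_id: {event_id: set_of_successor_events}}."""
    graphs = {}
    for report in reports:
        graph = graphs.setdefault(report["task_id"], {})
        graph.setdefault(report["event_id"], set())  # make sure the node exists
        for parent in report["edges_from"]:
            # An edge parent -> event records that `parent` causally precedes `event`.
            graph.setdefault(parent, set()).add(report["event_id"])
    return graphs


# Hypothetical reports for task T; the last one mirrors the report on the slide
# (TaskID: T, EventID: g, Edge: from a, f). The edges of a and f are invented.
reports = [
    {"task_id": "T", "event_id": "a", "edges_from": []},
    {"task_id": "T", "event_id": "f", "edges_from": ["a"]},
    {"task_id": "T", "event_id": "g", "edges_from": ["a", "f"]},
]
graphs = build_task_graphs(reports)
```

Reports may arrive in any order; because nodes are created on first mention, the result does not depend on arrival order.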
51
X-Trace Library API
Handles propagation within app
–Threads / event-based (e.g., libasync)
Akin to a logging API:
–Main call is logEvent(message)
Library takes care of event id creation, binding, reporting, etc.
Implementations in C++, Java, Ruby, JavaScript
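To make "akin to a logging API" concrete, here is a toy sketch of a library in which a single logging call creates the event id, binds the event to the preceding one, and emits a report. It is not the real X-Trace library: the class `TraceContext`, the method `log_event`, the report fields, and the in-memory report list are all hypothetical.

```python
import itertools
import uuid


class TraceContext:
    """Toy per-task context: one call per event, the bookkeeping stays inside the library."""

    def __init__(self, task_id=None, sink=None):
        self.task_id = task_id or uuid.uuid4().hex
        self._ids = itertools.count(1)   # event id creation
        self._last = []                  # events the next event will be bound to
        self.reports = sink if sink is not None else []

    def log_event(self, message):
        event_id = next(self._ids)
        self.reports.append({
            "task_id": self.task_id,
            "event_id": event_id,
            "edges_from": list(self._last),  # binding to preceding event(s)
            "message": message,
        })
        self._last = [event_id]
        return event_id


ctx = TraceContext(task_id="T")
first = ctx.log_event("request received")
second = ctx.log_event("request forwarded")
```

The application only supplies messages; the second report is automatically bound to the first, which is the property the slide attributes to the library.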
52
Task Tree
X-Trace tags all network operations resulting from a particular task with the same task identifier
Task tree is the set of network operations connected with an initial task
Task tree can be reconstructed after collecting trace data with reports
53
An example of the task tree
A simple HTTP request through a proxy
54
X-Trace Components
Data
–X-Trace metadata
Network path
–Task tree
Report
–Reconstruct task tree
55
Propagation of X-Trace Metadata
The propagation of X-Trace metadata through the task tree
57
The X-Trace metadata
Field: Usage
Flags: Bits that specify which of the three optional components are present
TaskID: A unique integer ID
TreeInfo: ParentID, OpID, EdgeType
Destination: Specifies the address that X-Trace reports should be sent to
Options: Accommodates future extension mechanisms
58
X-Trace Report Architecture
61
X-Trace-like in Google/Bing/Yahoo
Why?
–Own large portion of the ecosystem
–Use RPC for communication
–Need to understand
Time for user request
Resource utilization by request
62
Discussion
Report loss
Non-tree request structures
Partial deployment
Managing report traffic
Security considerations
63
Sherlock vs. X-Trace
Overhead vs. accuracy
Deployment issues
–Invasiveness
–Code modification
64
Conclusions
Sherlock passively infers network-wide dependencies from logs and traceroutes
It diagnoses faults by correlating user observations
X-Trace actively discovers network-wide dependencies