Published by Anthony Quinn. Modified over 9 years ago.
1
Distributed Self-Propelled Instrumentation
Alex Mirgorodskiy, VMware, Inc.
Barton P. Miller, University of Wisconsin-Madison
2
Motivation
Diagnosis of production systems is hard
Problems are difficult to reproduce
–Intermittent or environment-specific (anomalies)
–"Rare but dangerous"
Systems are large collections of black boxes
–Many distributed components, different vendors
–Little support for monitoring/debugging
Collected data are difficult to analyze
–High volume
–High concurrency
3
Common Environments
E-commerce systems
Multi-tier: clients, Web, DB servers, business logic
Hard to debug: vendors have SWAT teams to fix bugs
–Some companies get paid $1000/hour
[Figure: clients 1-4 reach the site through the Internet, a proxy server, a firewall, and a load balancer; behind it run Web server processes (W1-W3), App server processes (A1-A3), and Database server processes (D1, D2).]
4
Common Environments
Clusters and HPC systems
–Large-scale: failures happen often (MTTF: 30-150 hours)
–Complex: processing a Condor job involves 10+ processes
The Grid: beyond a single supercomputer
–Decentralized
–Heterogeneous: different schedulers, architectures
Hard to detect failures, let alone debug them
[Figure: Condor job processing. Submitting host: job description, submit, schedd, shadow. Central Manager: negotiator, collector. Execution Host: startd, starter, user job.]
5
Approach
User provides activation and deactivation events
Agent propagates through the system
–Collects distributed control-flow traces
Framework analyzes traces automatically
–Separates traces into flows (e.g., HTTP requests)
–Identifies anomalous flows and the causes of anomalies
[Figure: the agent moves over the network between Process P, Process Q, and Process R on Host A and Host B.]
6
Self-Propelled Instrumentation: Overview
The agent sits inside the process
–Agent = small code fragment
The agent propagates through the code
–Receives control
–Inserts calls to itself ahead of the control flow
–Crosses process, host, and kernel boundaries
–Returns control to the application
Key features
–On-demand distributed deployment
–Application-agnostic distributed deployment
7
Within-process Propagation
Dynamic, low-overhead control flow tracing
Steps: Inject, Activate, Propagate
Analyze: build call graph/CFG with Dyninst
[Figure: disassembly of a.out next to instrumenter.so. The direct call to foo at 0x8400 is replaced by a jmp to Patch1, which calls instrument(foo), calls foo, and jumps back to 0x8405. The indirect call *%eax at 0x8444 is replaced by a jmp to Patch2, which calls instrument(%eax), calls *%eax, and jumps back to 0x8446.]
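The propagation step can be pictured with a toy model. This is only a sketch under stated assumptions, not the authors' binary-patching implementation: the `calls` dictionary is a hypothetical stand-in for the call sites discovered in the binary, and "instrumenting" a function merely records it in a set.

```python
def activate(entry, calls):
    """Simulate one activation: trace from `entry`, instrumenting on demand.

    `calls` maps each function name to the functions it calls, in order
    (a hypothetical stand-in for call sites found by code analysis).
    """
    instrumented = set()   # functions the agent has patched so far
    trace = []             # recorded control-flow events

    def instrument(func):
        # In the real system this would patch the call sites inside func.
        instrumented.add(func)

    def run(func):
        trace.append(("enter", func))
        for callee in calls[func]:
            # Patched call site: the agent instruments the callee just
            # before the call, staying one step ahead of the control flow.
            instrument(callee)
            run(callee)
        trace.append(("exit", func))

    instrument(entry)
    run(entry)
    return trace, instrumented


# "baz" is never reached, so the agent never touches it.
calls = {"main": ["foo"], "foo": ["bar"], "bar": [], "baz": []}
trace, instrumented = activate("main", calls)
```

Only the functions that control actually reaches end up instrumented, which is the on-demand property the slide emphasizes.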
8
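Cross-process Propagation
On-demand distributed deployment
Process P (Host A): a.out calls send(msg) and control jumps into agent.so
–get peer
–inject
–send(mark)
–send(msg)
–jmp back
spDaemon (Host B)
–recv(portQ)
–pidQ=port2pid(portQ)
–hijack(pidQ, agent.so)
–send(done)
Process Q (Host B): a.out calls recv(msg) and control jumps into agent.so
–recv(msg)
–if at mark, propagate
–jmp back
[Figure: Process P and Process Q are connected by a socket that carries the mark followed by msg.]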
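The handshake above can be mimicked in memory. The sketch below is illustrative only: the `Process` class, the `MARK` value, and the function names are assumptions, the socket is replaced by a queue, and the spDaemon's port-to-pid lookup and hijack are collapsed into a single call.

```python
from collections import deque

MARK = b"\x00MARK"  # hypothetical marker; the real encoding is not given


class Process:
    def __init__(self, name):
        self.name = name
        self.has_agent = False   # set once the agent is injected
        self.tracing = False     # set once the agent starts propagating
        self.inbox = deque()     # stands in for the receiving socket


def sp_daemon_hijack(proc):
    """Stand-in for spDaemon: inject the agent into the receiving process."""
    proc.has_agent = True


def agent_send(peer, msg):
    """Runs in place of the application's send(msg) in the traced sender."""
    if not peer.has_agent:
        sp_daemon_hijack(peer)   # get peer, inject
    peer.inbox.append(MARK)      # send(mark)
    peer.inbox.append(msg)       # send(msg)


def agent_recv(proc):
    """Runs in place of the application's recv(msg) in the receiver."""
    data = proc.inbox.popleft()
    if data == MARK:             # at the mark: start tracing the receiver
        proc.tracing = True
        data = proc.inbox.popleft()
    return data


q = Process("Q")
agent_send(q, b"hello")
payload = agent_recv(q)          # the application still sees only its message
```

The receiver starts tracing only when it reaches the mark, so events that precede the traced message are not recorded.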
9
PDG for a Simple Socket Program
PDG: Parallel Dynamic Program Dependence Graph
–Nodes: observed events
–Intra-process edges: link consecutive events
–Cross-process edges: link sends with matching recvs
PDGs from real systems are more complex
[Figure: PDG with client events start, connect, send, recv, stop and server events accept, recv, send, close.]
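A small sketch of how such a graph could be assembled from a trace. The event tuples and the message identifiers used to match sends with recvs are assumptions made for illustration; the slides do not say how matching is performed.

```python
from collections import defaultdict

# Each event: (process, sequence number within the process, kind, message id).
# Message ids are hypothetical; they only serve to match a send to its recv.
events = [
    ("client", 0, "start", None),
    ("client", 1, "connect", None),
    ("client", 2, "send", "m1"),
    ("client", 3, "recv", "m2"),
    ("client", 4, "stop", None),
    ("server", 0, "accept", None),
    ("server", 1, "recv", "m1"),
    ("server", 2, "send", "m2"),
    ("server", 3, "close", None),
]


def build_pdg(events):
    """Return a set of (source event, target event, edge type) triples."""
    edges = set()
    by_proc = defaultdict(list)
    sends, recvs = {}, {}
    for ev in events:
        proc, _seq, kind, msg = ev
        by_proc[proc].append(ev)
        if kind == "send":
            sends[msg] = ev
        elif kind == "recv":
            recvs[msg] = ev
    # Intra-process edges link consecutive events of the same process.
    for evs in by_proc.values():
        evs.sort(key=lambda e: e[1])
        for a, b in zip(evs, evs[1:]):
            edges.add((a, b, "intra"))
    # Cross-process edges link each send with its matching recv.
    for msg, send_ev in sends.items():
        if msg in recvs:
            edges.add((send_ev, recvs[msg], "cross"))
    return edges


pdg = build_pdg(events)
```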
10
PDG for One Condor Job
11
Automated Diagnosis
Challenge for manual examination
–High volume of trace data
Automated approach: find anomalies
–Normal behavior often is repetitive
–Pathological problems often are easy to find
–Focus on anomalies: infrequent bad behavior
12
Overview of the Approach
Obtain a collection of control flows
–E.g., per-request traces in a Web server
Anomaly detection: find an unusual flow
–Summarize each flow as a profile
–Assign suspect scores to profiles
Root cause analysis: find why a profile is anomalous
–Function responsible for the anomaly
[Figure: flows Φ1, Φ2, Φ3, Φ4, with one flow singled out and traced to its cause.]
13
Anomaly Detection: Distributed Profiles
Example: for flow $h$, $p_t(h) = \langle t_{main}, t_{foo}, t_{bar} \rangle$
Time: $p_t = \langle t_i \rangle$, where $t_i$ = normalized time spent in function $f_i$
Communication: $p_s = \langle s_i \rangle$, where $s_i$ = normalized number of bytes sent by $f_i$
Composite: $p_c$ = concatenation of $p_t$ and $p_s$
Coverage: $p_v = \langle v_i \rangle$, where $v_i = 1$ if function $f_i$ was called and $v_i = 0$ otherwise
[Figure: timeline of flow h with the events main entry, foo entry, send, foo exit, main exit, recv, bar exit, main exit and interval labels 1s, 2s, 3s, 1s.]
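The four profile types can be computed as below. This is a sketch under assumptions: "normalized" is taken to mean division by the flow's total, and the function names, times, and byte counts are made-up inputs rather than values read off the slide.

```python
def normalize(values):
    """Scale a list of non-negative numbers so that it sums to 1."""
    total = sum(values)
    return [v / total if total else 0.0 for v in values]


def profiles(functions, seconds, bytes_sent, called):
    """Build time, communication, composite, and coverage profiles.

    functions:  fixed ordering f_1..f_n shared by all flows
    seconds:    dict, time spent in each function (missing = 0)
    bytes_sent: dict, bytes sent by each function (missing = 0)
    called:     set of functions that were called in this flow
    """
    p_t = normalize([seconds.get(f, 0.0) for f in functions])
    p_s = normalize([bytes_sent.get(f, 0) for f in functions])
    p_c = p_t + p_s                                   # concatenation
    p_v = [1 if f in called else 0 for f in functions]
    return p_t, p_s, p_c, p_v


# Hypothetical flow: three functions ran, "baz" was never called.
functions = ["main", "foo", "bar", "baz"]
p_t, p_s, p_c, p_v = profiles(
    functions,
    seconds={"main": 1.0, "foo": 2.0, "bar": 1.0},
    bytes_sent={"foo": 100},
    called={"main", "foo", "bar"},
)
```

Using one fixed function ordering for every flow keeps the vectors comparable, which matters once distances between profiles are computed.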
14
Anomaly Detection: Suspect Scores
$\sigma(g)$ = distance to a common or known-normal node
Can detect multiple anomalies
Does not require known examples of prior runs
–Unsupervised algorithm
Can use such examples for fewer false positives
–One-class ranking algorithm
[Figure: nodes $g$, $h$, and $g_k$, with the label $\sigma(g, g_k)$.]
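One way to realize such a score is sketched below. The details are assumptions, not taken from the slides: distance is the L1 (Manhattan) distance between profile vectors, a "common" node is approximated by the k-th nearest other profile, and known-normal profiles, when supplied, can only lower a score.

```python
def distance(p, q):
    """L1 distance between two equal-length profile vectors (an assumption)."""
    return sum(abs(a - b) for a, b in zip(p, q))


def suspect_scores(profiles, k=1, known_normal=()):
    """Score each profile; requires len(profiles) > k.

    Unsupervised part: distance to the k-th nearest other profile, so a
    profile with at least k close neighbors scores low.
    One-class part: a profile close to any known-normal profile also
    scores low, which filters rare but legitimate behavior.
    """
    scores = []
    for i, g in enumerate(profiles):
        others = sorted(distance(g, h)
                        for j, h in enumerate(profiles) if j != i)
        score = others[k - 1]
        for normal in known_normal:
            score = min(score, distance(g, normal))
        scores.append(score)
    return scores


flows = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
unsupervised = suspect_scores(flows, k=2)
one_class = suspect_scores(flows, k=2, known_normal=[[0.0, 1.0]])
```

In the unsupervised run the last flow stands out; once a matching known-normal profile is supplied, its score drops to zero.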
15
Finding the Cause: Coverage Analysis
Find call paths taken only in the anomalous flow
–Δ = Anom - Norm
–Δ = {main → A, main → A → B, main → A → C, main → D → E, main → D → C}
Correlated with the failure
Likely location of the problem
[Figure: call trees rooted at main for the anomalous flow (Anom), the normal flow (Norm), and their difference Δ, over the functions A, B, C, D, E.]
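The coverage difference is a set difference over call paths. In the sketch below the anomalous tree follows the paths listed in Δ; the normal tree (main calling only D) is an assumption chosen to be consistent with that Δ, since the figure itself did not survive.

```python
def call_paths(tree, prefix=()):
    """Enumerate every call path in a nested-dict call tree as a tuple."""
    paths = set()
    for func, children in tree.items():
        path = prefix + (func,)
        paths.add(path)
        paths |= call_paths(children, path)
    return paths


# Call trees: each key is a function, each value the subtree of its callees.
anom_tree = {"main": {"A": {"B": {}, "C": {}}, "D": {"E": {}, "C": {}}}}
norm_tree = {"main": {"D": {}}}   # assumed normal behavior

# Paths taken only in the anomalous flow.
delta = call_paths(anom_tree) - call_paths(norm_tree)
```

Because paths are compared rather than individual functions, a call to C under D is reported separately from a call to C under A.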
16
Finding the Cause: Coverage Analysis
Limitation of coverage analysis: too many reports
–Noise in the trace, different input, workload
Can eliminate effects of earlier differences
–Retain the shortest prefixes in Δ
–Merge leaves
Can rank paths by the time of occurrence or length
–Put the cause ahead of the symptoms or simplify manual examination
[Figure: Δ = Anom - Norm with paths main → A → B, main → A → C, main → D → E, main → D → C is reduced to Δ' with paths main → A and main → D → {E, C}.]
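The two filtering steps can be sketched as follows, using the Δ from the coverage-analysis slide as input. The function names and the dictionary output format are illustrative choices, not the authors' implementation.

```python
from collections import defaultdict


def shortest_prefixes(delta):
    """Drop every path that has a proper prefix already in delta.

    If main -> A is new, everything below A is new as well and adds no
    information, so only the shortest differing path is kept.
    """
    return {p for p in delta
            if not any(p[:n] in delta for n in range(1, len(p)))}


def merge_leaves(paths):
    """Group paths that differ only in their last function."""
    by_parent = defaultdict(list)
    for p in paths:
        by_parent[p[:-1]].append(p[-1])
    return {parent: sorted(leaves) for parent, leaves in by_parent.items()}


delta = {
    ("main", "A"),
    ("main", "A", "B"),
    ("main", "A", "C"),
    ("main", "D", "E"),
    ("main", "D", "C"),
}
kept = shortest_prefixes(delta)
delta_prime = merge_leaves(kept)
# Ranking by length: shorter call paths are reported first.
ranked = sorted(kept, key=lambda p: (len(p), p))
```

On this input five reports shrink to two: one for main → A and one merged report for the leaves E and C under main → D.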
17
PDG for One Condor Job
18
PDG for Two Condor Jobs
19
Separating Concurrent Flows
Concurrency produces interleaved traces
–Servers switch from one request to another
Analyzing interleaved traces is difficult
–Irrelevant details from other users
–High trace variability → everything is an anomaly
Solution: separate traces into flows
20
Flow-Separation Algorithm
Decide when two events are in the same flow
–(send → recv) and (local → non-recv)
Remove all other edges
Flow = events reachable from a start event
[Figure: browser 1 (click 1, connect, send URL, recv page, show page) and browser 2 (click 2, connect, send URL, recv page, show page) talk to a Web server (select, accept, select, accept, select, recv URL, send page, select, recv URL, send page).]
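The two rules and the reachability step can be sketched as below. The event format, the message ids that match a send to its recv, and the simplified two-browser trace are assumptions for illustration (connect, accept, and select events are left out).

```python
from collections import defaultdict


def separate_flows(events, starts):
    """events: list of (id, process, kind, msg) in per-process order.

    Keeps an edge a -> b only if (a is a send and b its matching recv) or
    (a and b are consecutive in one process and b is not a recv).
    Returns {start id: set of event ids reachable from it}.
    """
    sends = {msg: eid for eid, _p, kind, msg in events if kind == "send"}
    succ = defaultdict(list)
    last = {}
    for eid, proc, kind, msg in events:
        if kind == "recv":
            if msg in sends:
                succ[sends[msg]].append(eid)     # send -> recv
        elif proc in last:
            succ[last[proc]].append(eid)         # local -> non-recv
        last[proc] = eid
    flows = {}
    for start in starts:
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(succ[node])
        flows[start] = seen
    return flows


events = [
    ("click1", "browser1", "local", None),
    ("b1.send", "browser1", "send", "url1"),
    ("click2", "browser2", "local", None),
    ("b2.send", "browser2", "send", "url2"),
    ("s.recv1", "server", "recv", "url1"),
    ("s.send1", "server", "send", "page1"),
    ("s.recv2", "server", "recv", "url2"),
    ("s.send2", "server", "send", "page2"),
    ("b1.recv", "browser1", "recv", "page1"),
    ("b1.show", "browser1", "local", None),
    ("b2.recv", "browser2", "recv", "page2"),
    ("b2.show", "browser2", "local", None),
]
flows = separate_flows(events, starts=["click1", "click2"])
```

The server's switch from the first request to the second (send page followed by recv URL) is exactly the kind of edge the rules drop, so each click ends up in its own flow.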
21
Limitation
Rules violated for programs with queues
–enQ1 and deQ1 must belong to the same flow
–Assigned to different flows by our application-independent algorithm
[Figure: client 1, client 2, and a server whose events occur in the order enQ1, enQ2, deQ1, deQ2.]
22
Addressing the Limitation: Directives
Pair events using custom directives
–Evt: location in the code
–Joinattr: related events have equal attr values
[Figure: client 1, client 2, and a server whose events occur in the order enQ1, enQ2, deQ1, deQ2.]
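A directive of this kind could be applied as sketched below. The event dictionaries, the `loc` field, and the `req` attribute are hypothetical; the slides only say that a directive names code locations and an attribute whose equal values mark related events.

```python
def pair_events(events, evt_locations, joinattr):
    """Return extra same-flow edges between events matched by a directive.

    events:        list of dicts in trace order, each with "id" and "loc"
    evt_locations: code locations named by the directive (e.g. enQ, deQ)
    joinattr:      attribute that related events share
    """
    last_seen = {}   # attribute value -> id of the latest event carrying it
    edges = []
    for ev in events:
        if ev["loc"] not in evt_locations:
            continue
        key = ev[joinattr]
        if key in last_seen:
            edges.append((last_seen[key], ev["id"]))
        last_seen[key] = ev["id"]
    return edges


# Server trace: two requests are queued, then dequeued in the same order.
trace = [
    {"id": "enQ1", "loc": "enQ", "req": 0xA},
    {"id": "enQ2", "loc": "enQ", "req": 0xB},
    {"id": "deQ1", "loc": "deQ", "req": 0xA},
    {"id": "deQ2", "loc": "deQ", "req": 0xB},
]
edges = pair_events(trace, evt_locations={"enQ", "deQ"}, joinattr="req")
```

The resulting edges tie each dequeue back to the enqueue of the same request, which the application-independent rules alone would miss.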
23
Experimental Study: Condor
[Figure: Condor job processing. Submitting host: job description, condor_submit, schedd, shadow, job output. Central Manager: negotiator, collector. Execution Host: startd, starter, user job.]
24
Job-run-twice Problem
Fault handling in Condor
–Any component can fail
–Detect the failure
–Restart the component
Bug in the shadow daemon
–Symptoms: user job ran twice
–Cause: intermittent crash after shadow reported successful job completion
25
Debugging Approach
Insert an intermittent fault into shadow
Submit a cluster of several jobs
–Start tracing condor_submit
–Propagate into schedd, shadow, collector, negotiator, startd, starter, mail, the user job
Separate the trace into flows
–Processing each job is a separate flow
Identify the anomalous flow
–Use unsupervised and one-class algorithms
Find the cause of the anomaly
26
Finding Anomalous Flow
Suspect scores for composite profiles
Without prior knowledge, Flows 1 and 5 are unusual
–Infrequent but normal activities
–Use prior known-normal traces to filter them out
Flow 3 is a true anomaly
27
Finding the Cause
Computed coverage difference
–900+ call paths
Filtered the differences
–37 call paths left
Ranked the differences
–14th path by time / 1st by length, as called by schedd: main → DaemonCore::Driver → DaemonCore::HandleDC_SERVICEWAITPIDS → DaemonCore::HandleProcessExit → Scheduler::child_exit → DaemonCore::GetExceptionString
–Called when shadow terminates with a signal
Last function called by shadow = failure location
28
Conclusion
Self-propelled instrumentation
–On-demand, low-overhead control-flow tracing
–Across process and host boundaries
Automated root cause analysis
–Finds anomalous control flows
–Finds the causes of anomalies
Separation of concurrent flows
–Little application-specific knowledge
29
Related Publications
A.V. Mirgorodskiy and B.P. Miller, "Diagnosing Distributed Systems with Self-Propelled Instrumentation", under submission.
–ftp://ftp.cs.wisc.edu/paradyn/papers/Mirgorodskiy07DistDiagnosis.pdf
A.V. Mirgorodskiy, N. Maruyama, and B.P. Miller, "Problem Diagnosis in Large-Scale Computing Environments", SC'06, Tampa, FL, November 2006.
–ftp://ftp.cs.wisc.edu/paradyn/papers/Mirgorodskiy06ProblemDiagnosis.pdf
A.V. Mirgorodskiy and B.P. Miller, "Autonomous Analysis of Interactive Systems with Self-Propelled Instrumentation", 12th Multimedia Computing and Networking (MMCN 2005), San Jose, CA, January 2005.
–ftp://ftp.cs.wisc.edu/paradyn/papers/Mirgorodskiy04SelfProp.pdf