1 Root Cause Analysis of Failures in Large-Scale Computing Environments
Alex Mirgorodskiy, University of Wisconsin, mirg@cs.wisc.edu
Naoya Maruyama, Tokyo Institute of Technology, naoya.maruyama@is.titech.ac.jp
Barton P. Miller, University of Wisconsin, bart@cs.wisc.edu
http://www.paradyn.org/
2 Motivation
Systems are complex and non-transparent
–Many components, from different vendors
Anomalies are common
–Intermittent
–Environment-specific
Users have little debugging expertise
Finding the causes of bugs and performance problems in production systems is hard
3 Vision
Autonomous, detailed, low-overhead analysis:
–The user specifies a perceived problem cause
–The agent finds the actual cause
[Figure: an agent injected into Process P on Host A propagates to Process Q and, across the network, to Process R on Host B]
4 Applications
Diagnostics of e-commerce systems
–Trace the path each request takes through the system
–Identify unusual paths
–Find out why they differ from the norm
Diagnostics of cluster and Grid systems
–Monitor the behavior of different nodes in the system
–Identify nodes with unusual behavior
–Find out why they differ from the norm
–Example: found problems in the SCore middleware
Diagnostics of real-time and interactive systems
–Trace words through the phone network
–Find out why some words were dropped
5 Key Components
Data collection: self-propelled instrumentation
–Works for a single process
–Can cross the user-kernel boundary
–Can be deployed on multiple nodes at the same time
–Ongoing work: crossing process and host boundaries
Data analysis: use repetitiveness to find anomalies
–Repetitive execution of the same high-level action, OR
–Repetitiveness among identical processes (e.g., cluster-management tools, parallel codes, Web server farms)
6 Focus on Control-Flow Anomalies
Unusual statements executed
–Corner cases are more likely to have bugs
Statements executed in an unusual order
–Race conditions
Functions taking unusually long to complete
–Sporadic performance problems
–Deadlocks, livelocks
7 Current Framework
1. Traces the control flow of all processes
–Begins at process startup
–Stops upon a failure or performance degradation
2. Identifies anomalies: unusual traces
–Problems on a small number of nodes
–Both fail-stop and non-fail-stop
3. Identifies the causes of the anomalies
–The function responsible for the problem
[Figure: per-process traces of P1 through P4]
8 [Figure: annotated x86 disassembly of a.out and the OS kernel illustrating self-propelled instrumentation. Patches at call sites (call foo, call *%eax) jump into instrument() in the injected instrumenter.so and back, propagating instrumentation into each callee before it runs, and cross into the kernel's sys_call via /dev/instrumenter. Steps: Inject, Activate, Propagate, Analyze (build the call graph/CFG with Dyninst).]
9 Data Collection: Trace Management
The trace is kept in a fixed-size circular buffer (sketched below)
–New entries overwrite the oldest entries
–Retains the most recent events leading up to the problem
The buffer is located in a shared memory segment
–Does not disappear if the process crashes
[Figure: the tracer recording call foo / ret foo events from process P into the buffer]
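A minimal sketch of the buffer logic, in Python for brevity (names are hypothetical; the real tracer writes binary records into a shared memory segment so the buffer outlives a crash of the traced process):

    # Fixed-size circular trace buffer: new entries overwrite the oldest,
    # so the buffer always holds the most recent events before a failure.
    class TraceBuffer:
        def __init__(self, capacity):
            self.capacity = capacity
            self.entries = [None] * capacity   # preallocated, fixed size
            self.next = 0                      # next slot to overwrite

        def record(self, event):               # e.g., ("call", "foo", timestamp)
            self.entries[self.next] = event
            self.next = (self.next + 1) % self.capacity

        def recent(self):
            """Events in order, oldest first (None = slot never written)."""
            tail = self.entries[self.next:] + self.entries[:self.next]
            return [e for e in tail if e is not None]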
10 Data Analysis: Find the Anomalous Host
Check whether the anomaly was fail-stop (see the sketch below):
–One of the traces ends substantially earlier than the others -> fail-stop; the corresponding host is the anomaly
–Traces end at similar times -> non-fail-stop; look at differences in behavior across the traces
[Figure: traces of P1 through P4 plotted against trace end time]
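A sketch of the fail-stop check, assuming each trace has been reduced to the timestamp of its last event (the "substantially earlier" threshold is a hypothetical parameter):

    # Hosts whose traces end much earlier than the latest one are fail-stop
    # suspects; if none qualify, treat the run as non-fail-stop and compare
    # behavior across traces instead.
    def failstop_hosts(end_times, threshold):
        """end_times: dict host -> timestamp of the last trace event."""
        latest = max(end_times.values())
        return [h for h, t in end_times.items() if latest - t > threshold]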
11 Data Analysis: Non-fail-stop Host
Find outliers (traces different from the rest):
Define a distance metric between two traces
–d(g,h) = measure of how much traces g and h differ
Define a trace suspect score
–σ(h) = distance of h from the common behavior
Report traces with high suspect scores
–The most distant from the common behavior
12 Defining the Distance Metric
Compute the time profile of each host h:
–p(h) = (t_1, …, t_F)
–t_i = normalized time spent in function f_i on host h
–Profiles are less sensitive to noise than raw traces
Delta vector of two profiles: δ(g,h) = p(g) − p(h)
Distance metric: d(g,h) = Manhattan norm of δ(g,h) (see the sketch below)
[Figure: profiles p(g) and p(h) in the (t(foo), t(bar)) plane; δ(g,h) is the vector between them]
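A sketch of the profile and distance computation, assuming each trace has already been reduced to per-function time totals (function names and inputs are illustrative):

    # Time profile: fraction of execution time spent in each function.
    def profile(times):
        """times: dict function -> total time spent in it on this host."""
        total = sum(times.values())
        return {f: t / total for f, t in times.items()}

    # Delta vector of two profiles, over the union of their functions.
    def delta(p_g, p_h):
        funcs = set(p_g) | set(p_h)
        return {f: p_g.get(f, 0.0) - p_h.get(f, 0.0) for f in funcs}

    # Manhattan (L1) norm of the delta vector.
    def distance(p_g, p_h):
        return sum(abs(d) for d in delta(p_g, p_h).values())

For example, distance(profile({"foo": 3.0, "bar": 1.0}), profile({"foo": 1.0, "bar": 1.0})) is 0.5: the two hosts split their time between foo and bar in noticeably different proportions.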
13 Defining the Suspect Score
Common behavior = normal
Suspect score: σ(h) = distance to the nearest neighbor
–Report the host with the highest σ to the analyst
–h is in the big mass: σ(h) is low, h is normal
–g is a single outlier: σ(g) is high, g is an anomaly
What if there is more than one anomaly?
[Figure: outlier g and in-mass host h with their nearest-neighbor distances σ(g) and σ(h)]
14 Defining the Suspect Score
Suspect score: σ_k(h) = distance to the k-th neighbor (sketched below)
–Excludes the (k−1) closest neighbors
–Sensitivity study: k = NumHosts/4 works well
Represents the distance to the “big mass”:
–h is in the big mass: the k-th neighbor is close, σ_k(h) is low
–g is an outlier: the k-th neighbor is far, σ_k(g) is high
[Figure: computing the score with k = 2; σ_k(g) for outlier g reaches past its fellow outlier into the big mass]
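A sketch of the unsupervised score, reusing profile() and distance() from above (k = NumHosts/4, per the sensitivity study):

    # Suspect score: distance from host h to its k-th nearest neighbor,
    # i.e., ignore the (k-1) closest hosts so that a small clique of
    # anomalies cannot vouch for one another.
    def suspect_score(h, profiles, k):
        dists = sorted(distance(profiles[h], profiles[g])
                       for g in profiles if g != h)
        return dists[k - 1]                       # k-th smallest, 1-based

    # Rank all hosts, most suspect first; report the top one to the analyst.
    def rank_hosts(profiles):
        k = max(1, len(profiles) // 4)            # k = NumHosts/4
        return sorted(profiles, key=lambda h: -suspect_score(h, profiles, k))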
15 Defining the Suspect Score
Anomalous means unusual, but unusual does not always mean anomalous!
–E.g., an MPI master is different from all the workers
–It would be reported as an anomaly (a false positive)
Distinguishing false positives from true anomalies:
–With knowledge of system internals – manual effort
–With previous execution history – can be automated
16 Defining the Suspect Score
Add traces from a known-normal previous run
–One-class classification
Suspect score: σ_k(h) = distance to the k-th trial neighbor or to the 1st known-normal neighbor, whichever is closer (sketched below)
Distance to the big mass or to known-normal behavior:
–h is in the big mass: the k-th neighbor is close, σ_k(h) is low
–g is an outlier, but the normal node n is close: σ_k(g) is low
[Figure: outlier g lies near known-normal trace n, so g is not flagged]
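A sketch of the score with history, under the same assumptions as above (trial = profiles from the current run, normal = profiles from a known-correct previous run):

    # With history: h is suspect only if it is far from the other trial hosts
    # AND far from every known-normal trace; an unusual-but-legitimate host
    # (e.g., an MPI master) stays close to its counterpart in the normal run.
    def suspect_score_with_history(h, trial, normal, k):
        to_trial = sorted(distance(trial[h], trial[g])
                          for g in trial if g != h)
        to_normal = min(distance(trial[h], p) for p in normal.values())
        return min(to_trial[k - 1], to_normal)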
17 Finding the Anomalous Function
Fail-stop problems
–The failure is in the last function invoked
Non-fail-stop problems
–Find why host h was marked as an anomaly
–Report the function with the highest contribution to σ(h):
σ(h) = |δ(h,g)|, where g is the chosen neighbor
anomFn = arg max_i |δ_i| (see the sketch below)
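A sketch of this last step, reusing delta() from above (g is the neighbor that determined h's suspect score):

    # The anomalous function is the coordinate of the delta vector with the
    # largest absolute contribution to the distance between h and g.
    def anomalous_function(p_h, p_g):
        d = delta(p_h, p_g)
        return max(d, key=lambda f: abs(d[f]))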
18 Experimental Study: SCore
SCore: a cluster-management framework
–Job scheduling, checkpointing, migration
–Supports MPI, PVM, and Cluster-enabled OpenMP
Implemented as a ring of daemons, scored
–One daemon per host monitors jobs
–Daemons exchange keep-alive patrol messages
–If no patrol message traverses the ring in 10 minutes, sc_watch kills and restarts all daemons
[Figure: sc_watch monitoring the ring of scored daemons exchanging patrol messages]
19 Debugging SCore
Inject tracing agents into all scoreds
Instrument sc_watch to find when the daemons are being killed
Identify the anomalous trace
Identify the anomalous function/call path
[Figure: sc_watch and the ring of scored daemons, as on the previous slide]
20 Finding the Host
Host n129 is unusual – different from the others
Host n129 is anomalous – not present in previous known-normal runs
Host n129 is a new anomaly – not present in previous known-faulty runs
21 Finding the Cause
Call chain with the highest contribution to the suspect score: output_job_status -> score_write_short -> score_write -> __libc_write
–Tries to output a log message to the scbcast process
Writes to the scbcast process kept blocking for 10 minutes
–scbcast had stopped reading data from its socket – bug!
–scored did not handle this well (spun in an infinite loop) – bug!
22 Ongoing Work
Cross process and host boundaries
–Propagate upon communication
Reconstruct system-wide flows
Compare flows to identify anomalies
[Figure: processes P and Q on Host A communicating across the network with process R on Host B; the agent follows]
23 Ongoing Work
Propagate upon communication
–Notice the act of communication
–Identify the peer
–Inject the agent into the peer
–Trace the peer after it receives the data
Reconstruct system-wide flows
–Separate concurrent, interleaved flows
Compare flows
–Identify common flows and anomalies
24 Conclusion
Data collection: acquire call traces from all nodes
–Self-propelled instrumentation: autonomous, dynamic, and low-overhead
Data analysis: identify unusual traces and find what made them unusual
–Fine-grained: identifies individual suspect functions
–Highly accurate: reduces the rate of false positives using past history
Come see the demo!
25 Relevant Publications
A.V. Mirgorodskiy, N. Maruyama, and B.P. Miller, "Root Cause Analysis of Failures in Large-Scale Computing Environments", submitted for publication, ftp://ftp.cs.wisc.edu/paradyn/papers/Mirgorodskiy05Root.pdf
A.V. Mirgorodskiy and B.P. Miller, "Autonomous Analysis of Interactive Systems with Self-Propelled Instrumentation", 12th Multimedia Computing and Networking (MMCN 2005), San Jose, CA, January 2005, ftp://ftp.cs.wisc.edu/paradyn/papers/Mirgorodskiy04SelfProp.pdf