Automated Diagnosis of Chronic Problems in Production Systems Soila Kavulya Thesis Committee Christos Faloutsos, CMU Greg Ganger, CMU Matti Hiltunen, AT&T.


1 Automated Diagnosis of Chronic Problems in Production Systems
Soila Kavulya
Thesis Committee: Christos Faloutsos (CMU), Greg Ganger (CMU), Matti Hiltunen (AT&T), Priya Narasimhan (CMU, Advisor)

2 Outline
• Motivation
• Thesis Statement
• Approach
  • End-to-end trace construction
  • Anomaly detection
  • Localization
• Evaluation
  • VoIP
  • Hadoop
• Critique & Related Work
• Pending Work
Soila Kavulya @ March 2012

3 Motivation
• Chronics are problems that are
  • Not transient
  • Not resulting in system-wide outage
• Chronics occur in real production systems
  • VoIP
    • User’s calls fail due to version conflict between user and upgraded server
  • Hadoop (CMU’s OpenCloud)
    • A user job sporadically fails in map phase with cryptic block I/O error
    • User and admins spend 2 months troubleshooting
    • Traced to large heap size in tasktracker starving collocated datanodes
• Chronics are due to a variety of root-causes
  • Configuration problems, bad hardware, software bugs
• Thesis: Automate chronics diagnosis in production systems

4 Challenge for Diagnosis
Single manifestation, multiple possible causes:
• Due to a single node?
• Due to complex interactions between nodes?
• Due to multiple independent nodes?
[Figure: five nodes, Node1 through Node5]

5 Challenges in Production Systems
• Labeled failure-data is not always available
  • Difficult to diagnose problems not encountered before
• Sysadmins’ perspective may not correspond to users’
  • No access to user configurations, user behavior
  • No access to application semantics
  • First sign of trouble is often a customer complaint
  • Customer complaints can be cryptic
• Desired level of instrumentation may not be possible
  • As-is vendor instrumentation with limited control
  • Cost of added instrumentation may be high
  • Granularity of diagnosis consequently limited


7 Objectives
• “Is there a problem?” (anomaly detection)
  • Detect a problem despite potentially not having seen it before
  • Distinguish a genuine problem from a workload change
• “Where is the problem?” (localization)
  • Drill down by analyzing different instrumentation perspectives
• “What kind of problems?” (chronics)
  • Manifestation: exceptions, performance degradations
  • Root-cause: misconfiguration, bad hardware, bugs, contention
  • Origin: single/multiple independent sources, interacting sources
• “What kind of environments?” (production systems)
  • Production VoIP system at AT&T
  • Hadoop: Open-source implementation of MapReduce

8 Thesis Statement
Peer-comparison* enables anomaly detection in production systems despite workload changes, and the subsequent incremental fusion of different instrumentation sources enables localization of chronic problems.
*Comparison of some performance metric across similar (peer) system elements

9 What was our Inspiration?
rika (Swahili), noun. peer, contemporary, age-set, undergoing rites of passage (marriage) at similar times.

10 What is a Peer?
• Temporal similarity
  • Age-set: Born around the same time
  • Anomaly detection: Events within same time window
• Spatial similarity
  • Age-set: Live in same location
  • Anomaly detection: Run on same node
• Phase similarity
  • Age-set: (birth, initiation, marriage)
  • Anomaly detection: (map, shuffle, reduce)
• Contextual similarity
  • Age-set: Same gender, clan
  • Anomaly detection: Same workload, h/w

11 Target Systems for Validation
• VoIP system at large telecommunication provider
  • 10s of millions of calls per day, diverse workloads
  • 100s of network elements with heterogeneous hardware
  • 24x7 Ops team uses alarm correlation to diagnose outages
  • Separate team troubleshoots long-term chronics
  • Labeled traces available
• Hadoop: Open-source implementation of MapReduce
  • Diverse kinds of real workloads
    • Graph mining, language translation
  • Hadoop clusters with homogeneous hardware
    • Yahoo! M45 & OpenCloud production clusters
    • Controlled experiments in Amazon EC2 cluster
  • Long running jobs (> 100s): Hard to label failures

12 In Support of Thesis Statement
Objective | VoIP | Hadoop
Anomaly Detection | Heuristics-based; peer-comparison pending | Peer comparison without labeled data
Problem Localization | Localize to customer/network-element/resource/error-code | Localize to node/task/resource
Chronics | Exceptions, performance degradation, single/multiple-source | Exceptions, performance degradation, single-source; multiple-source pending
Production Systems | AT&T production system | EC2 test system; OpenCloud pending
Publications | OSR’11, DSN’12 | WASL’08, HotMetrics’09, ISSRE’09, ICDCS’10, NOMS’10, CCGRID’10


14 Goals & Non-Goals
• Goals
  • Anomaly detection in the absence of labeled failure-data
  • Diagnosis based on available instrumentation sources
  • Differentiation of workload changes from anomalies
• Non-goals
  • Diagnosis of system-wide outages
  • Diagnosis of value faults and transient faults
  • Root-cause analysis at code-level
  • Online/runtime diagnosis
  • Recovery based on diagnosis

15 Assumptions
• Majority of system is working correctly
• Problems manifest in observable behavioral changes
  • Exceptions or performance degradations
• All instrumentation is locally timestamped
• Clocks are synchronized to enable system-wide correlation of data
• Instrumentation faithfully captures system behavior

16 Overview of Approach
[Figure: pipeline with inputs Application Logs and Performance Counters; stages End-to-end Trace Construction, Anomaly Detection, Localization; output Ranked list of root-causes]

17 Target System #1: VoIP
[Figure: VoIP architecture showing PSTN Access, IP Access, Gateway Servers, IP Base Elements, Application Servers, Call Control Elements, and the ISP’s network]

18 Target System #2: Hadoop
[Figure: Hadoop architecture. Master Node: JobTracker, NameNode. Slave Nodes: TaskTracker (Map/Reduce tasks), DataNode (HDFS blocks). Data sources: Hadoop logs, OS data]

19 Performance Counters
• For both Hadoop and VoIP
  • Metrics collected periodically from /proc in OS
  • Monitoring interval varies from 1 sec to 15 min
• Examples of metrics collected
  • CPU utilization
  • CPU run-queue size
  • Pages in/out
  • Memory used/free
  • Context switches
  • Packets sent/received
  • Disk blocks read/written
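As a rough illustration of how one of these counters could be derived from /proc, the sketch below computes CPU utilization from two snapshots of the aggregate `cpu` line in `/proc/stat`. The function names and the sample snapshots are made up for the example; the slides do not describe the actual collector.

```python
def parse_cpu_line(stat_text):
    """Return (busy, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            # First eight fields: user, nice, system, idle, iowait, irq, softirq, steal.
            # Guest fields are skipped because they are already counted in user/nice.
            values = [int(v) for v in fields[1:9]]
            idle = values[3] + (values[4] if len(values) > 4 else 0)  # idle + iowait
            total = sum(values)
            return total - idle, total
    raise ValueError("no aggregate cpu line found")


def cpu_utilization(before, after):
    """Percentage of non-idle time between two /proc/stat snapshots."""
    busy0, total0 = parse_cpu_line(before)
    busy1, total1 = parse_cpu_line(after)
    elapsed = total1 - total0
    return 0.0 if elapsed == 0 else 100.0 * (busy1 - busy0) / elapsed


# Made-up snapshots taken one monitoring interval apart (illustration only).
SAMPLE_T0 = "cpu  1000 0 500 8000 500 0 0 0 0 0\ncpu0 1000 0 500 8000 500 0 0 0 0 0\n"
SAMPLE_T1 = "cpu  1300 0 600 8500 600 0 0 0 0 0\ncpu0 1300 0 600 8500 600 0 0 0 0 0\n"

if __name__ == "__main__":
    print(cpu_utilization(SAMPLE_T0, SAMPLE_T1))
```

On a live Linux host the two snapshots would come from reading `/proc/stat` at the start and end of the monitoring interval.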

20 End-to-End Trace Construction
[Figure: approach pipeline from slide 16, repeated]

21 Application Logs
• Each node logs each request that passes through it
  • Timestamp, IP address, request duration/size, phone no., …
• Log formats vary across components and systems
  • Application-specific parsers extract relevant attributes
• Construction of end-to-end traces
  • Pre-defined schema used to stitch requests across nodes
  • Match on key attributes
    • In Hadoop, match tasks with same task IDs
    • In VoIP, match calls with same sender/receiver phone no.
  • Incorporate time-based correlation
    • In Hadoop, consider block reads in same time interval as maps
    • In VoIP, consider calls with same phone no. within same time interval

22 Application Logs: VoIP
• Combine per-element logs to obtain per-call traces
  • Approximate match on key attributes
  • Timestamps, caller-callee numbers, IP, ports
• Determine call status from per-element codes
  • Zero talk-time, callback soon after call termination
[Figure: per-element log records from the Gateway Server, IP Base Elements, Call Control Element, and Application Server:]
10:03:59, START 973-123-8888 to 409-555-5555 192.156.1.2 to 11.22.34.1
10:03:59, STOP
10:03:59, ATTEMPT 973-123-8888 to 409-555-5555
10:04:01, ATTEMPT 973-123-xxxx to 409-555-xxxx 192.156.1.2 to 11.22.34.1
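A minimal sketch of the approximate-match idea, assuming that masked digits (`x`) match any digit and that records within a few seconds of each other belong to the same call. The record layout, the element names, the 5-second tolerance, and the third record are assumptions added for illustration; the slide does not give the actual matching rules.

```python
from datetime import datetime, timedelta


def numbers_match(a, b):
    """Compare phone numbers position by position; 'x' is a masked digit that matches anything."""
    return len(a) == len(b) and all(
        ca == cb or "x" in (ca, cb) for ca, cb in zip(a, b)
    )


def same_call(rec_a, rec_b, tolerance):
    """Two records describe the same call if numbers match and timestamps are close."""
    close_in_time = abs(rec_a["time"] - rec_b["time"]) <= tolerance
    return (
        close_in_time
        and numbers_match(rec_a["caller"], rec_b["caller"])
        and numbers_match(rec_a["callee"], rec_b["callee"])
    )


def stitch(records, tolerance=timedelta(seconds=5)):
    """Greedily group per-element records into per-call traces, in time order."""
    traces = []
    for rec in sorted(records, key=lambda r: r["time"]):
        for trace in traces:
            if same_call(trace[-1], rec, tolerance):
                trace.append(rec)
                break
        else:
            traces.append([rec])
    return traces


def t(hms):
    return datetime.strptime(hms, "%H:%M:%S")


# The first two records mirror the slide's example; the element names and the
# third record are hypothetical.
records = [
    {"element": "gateway", "time": t("10:03:59"),
     "caller": "973-123-8888", "callee": "409-555-5555"},
    {"element": "call-control", "time": t("10:04:01"),
     "caller": "973-123-xxxx", "callee": "409-555-xxxx"},
    {"element": "gateway", "time": t("10:20:00"),
     "caller": "973-123-8888", "callee": "409-555-5555"},
]

if __name__ == "__main__":
    for trace in stitch(records):
        print([r["element"] for r in trace])
```

The greedy grouping is only one possible choice; a production matcher would also need to handle overlapping calls between the same pair of numbers.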

23 Application Logs: Hadoop (1)
• Peer-comparable attributes extracted from logs
• Correlate traces using IDs and request schema
Example log lines:
2009-03-06 23:06:01,572 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200903062245_0051_r_000005_0 Scheduled 10 of 115 known outputs (0 slow hosts and 105 dup hosts)
2009-03-06 23:06:01,612 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 2 bytes (2 raw bytes) into RAM from attempt_200903062245_0051_m_000055_0 …from ip-10-250-90-207.ec2.internal
Annotations:
• Temporal similarity: Timestamps
• Spatial similarity: Hostnames
• Phase similarity: Map, Reduce
• Context similarity: TaskType
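The attribute extraction above can be sketched with two regular expressions applied to the first log line shown on the slide. The regular expressions and field names are assumptions written for this one line format, not the parsers used in the thesis.

```python
import re
from datetime import datetime

# Timestamp, log level, logging class, then free-form message.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]+) (?P<cls>[\w.$]+): (?P<msg>.*)$"
)
# Task attempt IDs of the form attempt_<job>_<m|r>_<task>_<attempt>.
ATTEMPT_RE = re.compile(
    r"attempt_(?P<job>\d+_\d+)_(?P<kind>[mr])_(?P<task>\d+)_(?P<attempt_no>\d+)"
)

LINE = (
    "2009-03-06 23:06:01,572 INFO org.apache.hadoop.mapred.ReduceTask: "
    "attempt_200903062245_0051_r_000005_0 Scheduled 10 of 115 known outputs "
    "(0 slow hosts and 105 dup hosts)"
)


def parse_line(line):
    """Extract peer-comparable attributes from one log line, or None if it does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    attempts = [a.groupdict() for a in ATTEMPT_RE.finditer(m.group("msg"))]
    kind = attempts[0]["kind"] if attempts else None
    return {
        "time": datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S,%f"),  # temporal
        "class": m.group("cls"),
        "phase": {"m": "map", "r": "reduce"}.get(kind),                    # phase
        "attempts": attempts,                                              # IDs for correlation
    }


if __name__ == "__main__":
    print(parse_line(LINE))
```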

24 Application Logs: Hadoop (2)
• No global IDs for correlating logs in Hadoop & VoIP
• Extract causal flows using predefined schemas
[Figure: application logs → extract events → NoSQL database; a flow schema (JSON) is applied to produce causal flows]
Example application log line:
2009-03-06 23:06:01,572 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200903062245_0051_r_000005_0 Scheduled 10 of 115 known outputs (0 slow hosts and 105 dup hosts)
Flow schema (JSON) fragment:
MapReduce: { "events" : { "Map" : { "primary-key" : "MapID", "join-key" : "MapID", "next-event" : "Shuffle"},…
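One way to read the flow schema is as a chain description: each event type names the key to join on and the event type that follows it. The sketch below follows that reading. Only the `Map` entry comes from the slide; the `Shuffle` entry, the event records, and the `build_flows` helper are hypothetical, and the code keeps only the first matching successor for brevity.

```python
SCHEMA = {
    "events": {
        # Entry taken from the slide's schema fragment.
        "Map": {"primary-key": "MapID", "join-key": "MapID", "next-event": "Shuffle"},
        # Hypothetical entry: the slide elides everything after "Map".
        "Shuffle": {"primary-key": "ShuffleID", "join-key": "MapID", "next-event": None},
    }
}


def build_flows(events, schema, start="Map"):
    """Chain events into flows by following next-event links and matching join keys."""
    spec = schema["events"]
    flows = []
    for first in (e for e in events if e["type"] == start):
        flow, current = [first], first
        while True:
            rule = spec[current["type"]]
            next_type, join_key = rule["next-event"], rule["join-key"]
            if next_type is None:
                break
            match = next(
                (e for e in events
                 if e["type"] == next_type and e.get(join_key) == current.get(join_key)),
                None,
            )
            if match is None:
                break
            flow.append(match)
            current = match
        flows.append(flow)
    return flows


# Made-up extracted events.
events = [
    {"type": "Map", "MapID": "m_000055", "node": "node1"},
    {"type": "Shuffle", "ShuffleID": "s_1", "MapID": "m_000055", "node": "node2"},
    {"type": "Map", "MapID": "m_000056", "node": "node3"},
]

if __name__ == "__main__":
    for flow in build_flows(events, SCHEMA):
        print([e["type"] for e in flow])
```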

25 Anomaly Detection
[Figure: approach pipeline from slide 16, repeated]

26 Anomaly Detection Overview
• Some systems have rules for anomaly detection
  • Redialing number immediately after disconnection
  • Server reported error codes and exceptions
• If no rules available, rely on peer-comparison
  • Identifies peers (nodes, flows) in distributed systems
  • Detect anomalies by identifying “odd-man-out”

27 Anomaly Detection (1)
• Empirically determine best peer groupings
  • Window size, request-flow types, job information
  • Best grouping minimizes false positives in fault-free runs
• Peer-comparison identifies “odd-man-out” behavior
  • Robust to workload changes
• Relies on histogram-comparison
  • Less sensitive to timing differences
• Multiple suspects might be identified
  • Due to propagating errors, multiple independent problems

28 Anomaly Detection (2)
• Histogram comparison identifies anomalous flows
  • Generate an aggregate histogram that represents majority behavior
  • Compare each node’s histogram against the aggregate histogram, O(n)
  • Compute anomaly score using Kullback-Leibler divergence
  • Detect anomaly if score exceeds pre-specified threshold
[Figure: histograms (distributions) of durations of flows for a normal node and a faulty node; y-axis: normalized counts (total 1.0)]
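The four steps above can be sketched as follows. This is an illustration under stated assumptions: the aggregate is the sum of all nodes' histograms, the score is the one-directional divergence KL(node || aggregate) with light smoothing, and the bin edges, threshold, and durations are made up. The slides do not specify these details.

```python
import bisect
import math


def histogram(values, edges):
    """Count values into half-open bins [edges[i], edges[i+1]); out-of-range values are dropped."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        idx = bisect.bisect_right(edges, v) - 1
        if 0 <= idx < len(counts):
            counts[idx] += 1
    return counts


def normalize(counts, smoothing=1e-3):
    """Turn counts into probabilities that sum to 1.0; smoothing avoids log(0)."""
    total = sum(counts) + smoothing * len(counts)
    return [(c + smoothing) / total for c in counts]


def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def anomaly_scores(durations_by_node, edges):
    """Score each node against the aggregate histogram: one comparison per node, O(n)."""
    hists = {node: histogram(v, edges) for node, v in durations_by_node.items()}
    aggregate = normalize([sum(col) for col in zip(*hists.values())])
    return {node: kl_divergence(normalize(h), aggregate) for node, h in hists.items()}


# Made-up flow durations (seconds) for four peer nodes; node4 is the odd one out.
durations = {
    "node1": [10, 12, 11, 13, 12, 11],
    "node2": [11, 12, 10, 13, 11, 12],
    "node3": [12, 11, 13, 10, 12, 11],
    "node4": [30, 35, 32, 31, 36, 33],
}
EDGES = [0, 10, 20, 30, 40]
THRESHOLD = 1.0  # arbitrary value for this example

scores = anomaly_scores(durations, EDGES)
flagged = sorted(node for node, s in scores.items() if s > THRESHOLD)

if __name__ == "__main__":
    print(scores)
    print(flagged)
```

Because the aggregate includes the faulty node's own data, this only works while the majority of peers behave normally, which matches the assumption stated on slide 15.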

29 Localization
[Figure: approach pipeline from slide 16, repeated]

30 “Truth table” Request Representation
Log snippet:
Req1: 20100901064914,SUCCESS,Node1,Map,ReadBlock
Req2: 20100901064930,FAIL,Node2,Map,ReadBlock
Truth table:
Request | Node1 | Node2 | Map | ReadBlock | Outcome
Req1 | 1 | 0 | 1 | 1 | SUCCESS
Req2 | 0 | 1 | 1 | 1 | FAIL
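A small sketch of how the log snippet could be turned into the truth table: every attribute seen in any request becomes a column, and each request gets a 1 or 0 per column. The parsing helpers are hypothetical, and columns come out in alphabetical order rather than the slide's order.

```python
LOG = """Req1: 20100901064914,SUCCESS,Node1,Map,ReadBlock
Req2: 20100901064930,FAIL,Node2,Map,ReadBlock"""


def parse(log):
    """Map request ID -> (outcome, set of attributes) from 'ReqN: ts,outcome,attr,...' lines."""
    rows = {}
    for line in log.splitlines():
        req_id, rest = line.split(": ", 1)
        _timestamp, outcome, *attrs = rest.split(",")
        rows[req_id] = (outcome, set(attrs))
    return rows


def truth_table(rows):
    """Return (columns, table) where table[req] = (list of 0/1 flags, outcome)."""
    columns = sorted(set().union(*(attrs for _, attrs in rows.values())))
    table = {
        req: ([1 if col in attrs else 0 for col in columns], outcome)
        for req, (outcome, attrs) in rows.items()
    }
    return columns, table


columns, table = truth_table(parse(LOG))

if __name__ == "__main__":
    print(columns)
    for req, (flags, outcome) in table.items():
        print(req, flags, outcome)
```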

31 Identify Suspect Attributes
• Assume each attribute represented as “coin toss”
• Estimate attribute distribution using Bayes
  • Success distribution: Prob(Attribute|Success)
  • Anomalous distribution: Prob(Attribute|Anomalous)
• Anomaly score: KL-divergence between the two distributions
• Indict attributes with highest divergence between distributions
[Figure: belief versus Probability(Node2=TRUE), with one curve for successful requests and one for anomalous requests]
http://www.pdl.cmu.edu/
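One way to make the "coin toss" model concrete is a Beta posterior per attribute: starting from a uniform prior, k occurrences in n requests give Beta(1 + k, 1 + n - k). The sketch below scores an attribute by the KL divergence between its posterior over anomalous requests and its posterior over successful requests, computed by numerical integration. The prior, the direction of the divergence, and the counts are assumptions for illustration.

```python
import math


def beta_log_pdf(x, a, b):
    """Log density of Beta(a, b) at x, for 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x)


def kl_beta(a1, b1, a2, b2, steps=5000):
    """KL(Beta(a1, b1) || Beta(a2, b2)) by midpoint-rule integration over (0, 1)."""
    width = 1.0 / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * width
        lp = beta_log_pdf(x, a1, b1)
        lq = beta_log_pdf(x, a2, b2)
        total += math.exp(lp) * (lp - lq) * width
    return total


def attribute_score(fail_with, fail_total, ok_with, ok_total):
    """Divergence between the anomalous and successful posteriors under a uniform prior."""
    return kl_beta(
        1 + fail_with, 1 + fail_total - fail_with,
        1 + ok_with, 1 + ok_total - ok_with,
    )


# Made-up counts: (anomalous requests containing the attribute, successful requests containing it).
N_FAIL, N_OK = 50, 100
counts = {"Node2": (40, 5), "Map": (45, 88)}

scores = {
    attr: attribute_score(f, N_FAIL, s, N_OK) for attr, (f, s) in counts.items()
}
suspect = max(scores, key=scores.get)

if __name__ == "__main__":
    print(scores, suspect)
```

Here `Node2` appears in most anomalous requests but few successful ones, so its two posteriors barely overlap and it is indicted ahead of `Map`, which is common in both groups.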

32 Rank Problems by Severity
• Step 1: All requests. Problem1: Node2, Map
• Step 2: Filter all requests except those matching Problem1. Problem2: Node3, Shuffle
• Indict path with highest anomaly score
[Figure: attribute trees over Node2, Node3, Map, Shuffle, ExceptionX, and ExceptionY, annotated with per-attribute anomaly scores]
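The two steps can be sketched as a greedy search. This reading assumes that the search first drills down into the requests matching the top-scoring attribute to build a conjunction such as Node2 + Map, then sets aside the requests explained by that problem and repeats to find the next one. The score below is a simplified stand-in (KL divergence between smoothed occurrence rates, counted only when the attribute is over-represented in failures), and the threshold and data are made up.

```python
import math


def score(requests, attr):
    """Simplified anomaly score for one attribute over (attrs, failed) pairs."""
    fails = [attrs for attrs, failed in requests if failed]
    oks = [attrs for attrs, failed in requests if not failed]
    if not fails:
        return 0.0
    p = (sum(attr in a for a in fails) + 1) / (len(fails) + 2)
    q = (sum(attr in a for a in oks) + 1) / (len(oks) + 2)
    if p <= q:  # only indict attributes over-represented in failures
        return 0.0
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def drill_down(requests, min_score=0.1):
    """Build a conjunction of attributes by repeatedly indicting the top-scoring one."""
    chain, subset = [], requests
    while True:
        candidates = sorted({a for attrs, _ in subset for a in attrs} - set(chain))
        if not candidates:
            return chain
        best = max(candidates, key=lambda a: score(subset, a))
        if score(subset, best) < min_score:
            return chain
        chain.append(best)
        subset = [(attrs, failed) for attrs, failed in subset if best in attrs]


def find_problems(requests, max_problems=5):
    """Find problems one at a time, removing explained requests between rounds."""
    problems, remaining = [], list(requests)
    while len(problems) < max_problems:
        chain = drill_down(remaining)
        if not chain:
            break
        problems.append(chain)
        remaining = [(a, f) for a, f in remaining if not set(chain) <= a]
    return problems


def make(node, phase, n_fail, n_ok):
    attrs = frozenset({node, phase})
    return [(attrs, True)] * n_fail + [(attrs, False)] * n_ok


# Made-up workload with two independent problems plus background failures.
requests = (
    make("Node2", "Map", 40, 10) + make("Node2", "Shuffle", 2, 48)
    + make("Node3", "Shuffle", 20, 30) + make("Node3", "Map", 1, 49)
    + make("Node1", "Map", 1, 49) + make("Node1", "Shuffle", 1, 49)
)

if __name__ == "__main__":
    print(find_problems(requests))
```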

33 Incorporate Performance Counters (1)
• Annotate requests on indicted nodes with performance counters based on timestamps
• Identify metrics most correlated with problem
  • Compare distribution of metrics in successful and failed requests
Requests on node2:
# Timestamp,CallNo,Status,Memory(%),CPU(%)
20100901064914, 1, SUCCESS, 54, 6
20100901065030, 2, SUCCESS, 54, 6
20100901065530, 3, SUCCESS, 56, 4
20100901070030, 4, FAIL, 52, 45
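Using the four annotated requests from the slide, the sketch below ranks metrics by how far the failed requests sit from the successful ones. A standardized difference of means is used here as a simple stand-in for comparing the two distributions; the slides do not say which comparison the approach actually uses.

```python
import statistics

RAW = """# Timestamp,CallNo,Status,Memory(%),CPU(%)
20100901064914, 1, SUCCESS, 54, 6
20100901065030, 2, SUCCESS, 54, 6
20100901065530, 3, SUCCESS, 56, 4
20100901070030, 4, FAIL, 52, 45"""


def load(raw):
    """Parse the annotated request table into a list of dicts keyed by column name."""
    lines = raw.strip().splitlines()
    header = [h.strip() for h in lines[0].lstrip("# ").split(",")]
    return [
        dict(zip(header, (cell.strip() for cell in line.split(","))))
        for line in lines[1:]
    ]


def rank_metrics(rows, metrics):
    """Rank metrics by |mean(failed) - mean(successful)| in units of the successful spread."""
    ranking = []
    for metric in metrics:
        ok = [float(r[metric]) for r in rows if r["Status"] == "SUCCESS"]
        bad = [float(r[metric]) for r in rows if r["Status"] == "FAIL"]
        spread = statistics.pstdev(ok) or 1.0  # avoid dividing by zero
        gap = abs(statistics.mean(bad) - statistics.mean(ok)) / spread
        ranking.append((metric, gap))
    return sorted(ranking, key=lambda pair: pair[1], reverse=True)


ranking = rank_metrics(load(RAW), ["Memory(%)", "CPU(%)"])

if __name__ == "__main__":
    print(ranking)
```

With this data CPU ranks first, which is consistent with the "High CPU" annotation on the following slide.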

34 Incorporate Performance Counters (2)
• Incorporate performance counters in diagnosis
[Figure: attribute tree over all requests (Node2, Node3, Map, Shuffle) with "High CPU" added as an attribute under Problem1: Node2, Map]

35 Why Does It Work?
• Real-world data backs up utility of peer-comparison
  • Task durations peer-comparable in >75% of jobs [CCGrid’10]
• Approach analyzes both successful and failed requests
  • Analyzing only failed requests might elevate common elements over causal elements
• Iterative approach discovers correlated attributes
  • Identifies problems due to conjunctions of attributes
  • Filtering step identifies multiple ongoing problems
• Handles unencountered problems
  • Does not rely on historical models of normal behavior
  • Does not rely on signatures of known defects


37 VoIP: Diagnosis of Real Incidents
Examples of real-world incidents | Diagnosed | Resource Indicted
Customers use wrong codec to send faxes | ✓ | NA
Customer problem causes blocked calls at IPBE | ✓ | NA
Blocked circuit identification codes on trunk group | ✓ | NA
Software bug at control server causes blocked calls | ✓ | NA
Problem with customer equipment leads to poor QoS | ✓ | NA
Debug tracing overloads servers during peak traffic | ✓ | CPU
Performance problem at application server | ✓ | CPU/Memory
Congestion at gateway servers due to high load | ✓ | CPU/Concurrent Sessions
Power outage causes brief outages | ✗ | NA
PSX not responding to invites from app. server | ✗ | Low responses at app. server
8 out of 10 real incidents diagnosed

38 VoIP: Case Studies
• Incident 1: Chronic due to unsupported fax codec
  [Plot: failed calls for two customers, Day1 to Day6; annotations: "Chronic nightly problem", "Customers stop using unsupported codec"]
• Incident 2: Chronic server problem
  [Plot: failed calls for server, Day1 to Day6; annotations: "Unrelated chronic server problem emerges", "Server reset"]

39 Implementation of Approach
Draco: Deployment in Production at AT&T
• ~8500 lines of C code
[Screenshot: Draco interface with Search and Filter controls, listing ranked problems:]
1. Problem1: STOP.IP-TO-PS.487.3 STOP.IP-TO-PSTN.41.0.-.- Chicago*GSXServers MemoryOverload
2. Problem2: STOP.IP-TO-PSTN.102.0.102.102 ServiceB CustomerAcme IP_w.x.y.z

40 VoIP: Ranking Multiple Problems
• Draco performs better at ranking multiple independent problems

41 VoIP: Performance of Algorithm
Offline Analysis | Avg. Log Size | Avg. Data Load Time | Avg. Diagnosis Time
Draco simulated-1hr (C++) | 271 MB | 8 s | 4 s
Draco real-1day (C++) | 2.4 GB | 7 min | 8 min
Running on 16-core Xeon (@ 2.4GHz), 24 GB Memory


43 Hadoop: Target Clusters
• 10 to 100-node Amazon EC2 cluster
  • Commercial, pay-as-you-use cloud-computing resource
  • Workloads under our control, problems injected by us
    • gridmix, nutch, sort, random writer
  • Can harvest logs and OS data of only our workloads
• 4000-processor M45 & 64-node OpenCloud cluster
  • Production environment
    • Offered to CMU as free cloud-computing resource
  • Diverse kinds of real workloads, problems in the wild
    • Massive machine-learning, language/machine-translation
  • Permission to harvest all logs and OS data

44 Hadoop: EC2 Fault Injection
Fault | Description
Resource contention:
CPU hog | External process uses 70% of CPU
Packet-loss | 5% or 50% of incoming packets dropped
Disk hog | 20GB file repeatedly written to
Application bugs (Source: Hadoop JIRA):
HADOOP-1036 | Maps hang due to unhandled exception
HADOOP-1152 | Reduces fail while copying map output
HADOOP-2080 | Reduces fail due to incorrect checksum
Injected fault on single node

45 Hadoop: Peer-comparison Results
Without Causal Flows
• Different metrics detect different problems
• Correlated problems (e.g., packet-loss) harder to localize
[Chart: true positive rates by metric]

46 Hadoop: Peer-comparison Results
With Causal Flows + Localization
Examples of real-world incidents | Diagnosed | Metrics Indicted
CPU hog | ✓ | Node
Packet-loss | ✓ | Node+Shuffle
Disk hog | ✓ | Node
HADOOP-1036 | ✓ | Node+Map
HADOOP-1152 | ✓ | Node+Shuffle
HADOOP-2080 | ✓ | Node+Shuffle
Correlated problems correctly identified


48 Critique of Approach
• Anomaly detection thresholds are fragile
  • Need to use statistical tests
• Anomaly detection does not address problems at master
• Peer-groups are defined statically
  • Assumes homogeneous clusters
  • Need to automate identification of peers
• False positives occur if root-cause not in logs
  • Algorithm tends to implicate adjacent network elements
  • Need to incorporate more data to improve visibility

49 Related Work
• Chronics fly under the radar
  • Undetected by alarm mining [Mahimkar09]
• Chronics can persist undetected for long periods of time
  • Hard to detect using change-points [Kandula09]
  • Hard to demarcate problem periods [Sambasivan11]
• Multiple ongoing problems at a time
  • Single fault assumption inadequate [Cohen05, Bodik10]
• Peer-comparison on its own inadequate
  • Hard to localize propagating problems [Kasick10, Tan10, Kang10]


51 Pending Work
Objective | VoIP | Hadoop
Anomaly Detection | Heuristics-based; peer-comparison pending | Peer comparison without labeled data
Problem Localization | Localize to customer/network-element/resource/error-code | Localize to node/task/resource
Chronics | Exceptions, performance degradation, single/multiple-source | Exceptions, performance degradation, single-source; multiple-source pending
Production Systems | AT&T production system | EC2 test system; OpenCloud pending
Publications | OSR’11, DSN’12 | WASL’08, HotMetrics’09, ISSRE’09, NOMS’10, CCGRID’10

52 Pending Work: Details
• OpenCloud production cluster & multiple-source problems [April-June 2012]
  • 64-node cluster housed at Carnegie Mellon
  • Obtained and parsed logs from 25 real OpenCloud incidents
  • Root-causes include misconfigurations, h/w issues, buggy apps
  • Yet to analyze logs
• Peer comparison in VoIP [June-July 2012]
  • Examining data that is not labeled, and identifying peers
  • Notion of a peer might be determined by function and location
  • Root-causes under investigation are as before
• Dissertation writing [June-August 2012]
• Defense [September 2012]

53 Collaborators & Thanks
• VoIP (AT&T)
  • Matti Hiltunen, Kaustubh Joshi, Scott Daniels
• Hadoop diagnosis
  • Jiaqi Tan, Xinghao Pan, Rajeev Gandhi, Keith Bare, Michael Kasick, Eugene Marinelli
• Hadoop visualization
  • Christos Faloutsos, U Kang, Elmer Garduno, Jason Campbell (Intel), HCI 05-610 team
• OpenCloud
  • Greg Ganger, Garth Gibson, Julio Lopez, Kai Ren, Mitch Franzos, Michael Stroucken

54 Summary
• Peer-comparison effective for anomaly detection
  • Robust to workload changes
  • Requires little training data
• Incremental fusion of different instrumentation sources enables localization of chronics
  • Starts with user-visible symptoms of a problem
  • Drills down to localize root-cause of problem
• Usefulness of approach in two production systems
  • VoIP system at large telecommunication provider (demonstrated)
  • Hadoop clusters (underway)

55 Questions?
Climbing Mt. Kilimanjaro comes a distant second to a thesis proposal!

56 Selected Publications (1)
Diagnosis in Production VoIP system
• DSN12: Draco: Statistical Diagnosis of Chronic Problems in Large Distributed Systems. S. P. Kavulya, S. Daniels, K. Joshi, M. Hiltunen, R. Gandhi, P. Narasimhan. To appear DSN 2012.
• OSR12: Practical Experiences with Chronics Discovery in Large Telecommunications Systems. S. P. Kavulya, K. Joshi, M. Hiltunen, S. Daniels, R. Gandhi, P. Narasimhan. Best Papers from SLAML 2011 in Operating Systems Review, 2011.
Survey Paper & Workload Analysis of Production Hadoop Cluster
• RAE12: Failure Diagnosis of Complex Systems. S. P. Kavulya, K. Joshi, F. Di Giandomenico, P. Narasimhan. To appear in Book on Resilience Assessment and Evaluation. Wolter, 2012.
• An analysis of traces from a production MapReduce cluster. S. Kavulya, J. Tan, R. Gandhi, P. Narasimhan. CCGrid 2010.

57 Selected Publications (2)
Visualization in Hadoop
• CHIMIT11: Understanding and improving the diagnostic workflow of MapReduce users. J. D. Campbell, A. B. Ganesan, B. Gotow, S. P. Kavulya, J. Mulholland, P. Narasimhan, S. Ramasubramanian, M. Shuster, J. Tan. CHIMIT 2011.
• ICDCS10: Visual, log-based causal tracing for performance debugging of MapReduce systems. J. Tan, S. Kavulya, R. Gandhi, P. Narasimhan. ICDCS 2010.
Diagnosis in Hadoop (Application logs + performance counters)
• NOMS10: Kahuna: Problem Diagnosis for MapReduce-Based Cloud Computing Environments. J. Tan, X. Pan, S. Kavulya, R. Gandhi, P. Narasimhan. NOMS 2010.
• ISSRE09: Blind Men and the Elephant (BLIMEy): Piecing together Hadoop for Diagnosis. X. Pan, J. Tan, S. Kavulya, R. Gandhi, P. Narasimhan. ISSRE 2009.

58 Selected Publications (3)
Diagnosis in Hadoop (Performance counters)
• HotMetrics09: Ganesha: Black-Box Fault Diagnosis for MapReduce Systems. X. Pan, J. Tan, S. Kavulya, R. Gandhi, P. Narasimhan. HotMetrics 2009.
Diagnosis in Hadoop (Application logs)
• WASL: SALSA: Analyzing Logs as StAte Machines. J. Tan, X. Pan, S. Kavulya, R. Gandhi, P. Narasimhan. WASL 2008.
Diagnosis in Group Communication Systems
• SRDS08: Gumshoe: Diagnosing Performance Problems in Replicated File-Systems. S. Kavulya, R. Gandhi, P. Narasimhan. SRDS 2008.
• SysML07: Fingerpointing Correlated Failures in Replicated Systems. S. Pertet, R. Gandhi, P. Narasimhan. SysML, April 2007.

59 Related Work (1)
• [Bodik10]: Fingerprinting the datacenter: automated classification of performance crises. Peter Bodík, Moisés Goldszmidt, Armando Fox, Dawn B. Woodard, Hans Andersen. EuroSys 2010.
• [Cohen05]: Capturing, indexing, clustering and retrieving system history. Ira Cohen, Steve Zhang, Moises Goldszmidt, Julie Symons, Terence Kelly, Armando Fox. SOSP 2005.
• [Kandula09]: Detailed diagnosis in enterprise networks. Srikanth Kandula, Ratul Mahajan, Patrick Verkaik, Sharad Agarwal, Jitendra Padhye, Paramvir Bahl. SIGCOMM 2009.
• [Kasick10]: Black-Box Problem Diagnosis in Parallel File Systems. Michael P. Kasick, Jiaqi Tan, Rajeev Gandhi, Priya Narasimhan. FAST 2010.
• [Kiciman05]: Detecting application-level failures in component-based Internet Services. Emre Kiciman, Armando Fox. IEEE Trans. on Neural Networks 2005.

60 Related Work (2)
• [Mahimkar09]: Towards automated performance diagnosis in a large IPTV network. Ajay Anil Mahimkar, Zihui Ge, Aman Shaikh, Jia Wang, Jennifer Yates, Yin Zhang, Qi Zhao. SIGCOMM 2009.
• [Sambasivan11]: Diagnosing Performance Changes by Comparing Request Flows. Raja R. Sambasivan, Alice X. Zheng, Michael De Rosa, Elie Krevat, Spencer Whitman, Michael Stroucken, William Wang, Lianghong Xu, Gregory R. Ganger. NSDI 2011.

