Automated Diagnosis of Chronic Problems in Production Systems
Soila Kavulya
Thesis Committee: Christos Faloutsos (CMU), Greg Ganger (CMU), Matti Hiltunen (AT&T), Priya Narasimhan (CMU, Advisor)

Outline
- Motivation
- Thesis Statement
- Approach: end-to-end trace construction, anomaly detection, localization
- Evaluation: VoIP, Hadoop
- Critique & Related Work
- Pending Work

Motivation
- Chronics are problems that are not transient, yet do not result in a system-wide outage
- Chronics occur in real production systems
  - VoIP: a user's calls fail due to a version conflict between the user and an upgraded server
  - Hadoop (CMU's OpenCloud): a user job sporadically fails in the map phase with a cryptic block I/O error; users and admins spend 2 months troubleshooting; traced to a large TaskTracker heap size starving collocated DataNodes
- Chronics are due to a variety of root-causes: configuration problems, bad hardware, software bugs
- Thesis: automate chronics diagnosis in production systems

Challenge for Diagnosis
- A single manifestation can have multiple possible causes
- Is the problem due to a single node, to complex interactions between nodes, or to multiple independent nodes?
(Figure: a request traversing Node1 through Node5, illustrating one manifestation with multiple possible causes.)

Challenges in Production Systems
- Labeled failure-data is not always available
  - Difficult to diagnose problems not encountered before
- Sysadmins' perspective may not correspond to users'
  - No access to user configurations, user behavior
  - No access to application semantics
  - First sign of trouble is often a customer complaint
  - Customer complaints can be cryptic
- Desired level of instrumentation may not be possible
  - As-is vendor instrumentation with limited control
  - Cost of added instrumentation may be high
  - Granularity of diagnosis is consequently limited

Objectives
- "Is there a problem?" (anomaly detection)
  - Detect a problem despite potentially not having seen it before
  - Distinguish a genuine problem from a workload change
- "Where is the problem?" (localization)
  - Drill down by analyzing different instrumentation perspectives
- "What kind of problems?" (chronics)
  - Manifestation: exceptions, performance degradations
  - Root-cause: misconfiguration, bad hardware, bugs, contention
  - Origin: single/multiple independent sources, interacting sources
- "What kind of environments?" (production systems)
  - Production VoIP system at AT&T
  - Hadoop: open-source implementation of MapReduce

Thesis Statement
Peer-comparison* enables anomaly detection in production systems despite workload changes, and the subsequent incremental fusion of different instrumentation sources enables localization of chronic problems.
*Comparison of some performance metric across similar (peer) system elements.

What was our Inspiration?
rika (Swahili), noun: peer, contemporary, age-set; undergoing rites of passage (e.g., marriage) at similar times.

What is a Peer?
- Temporal similarity. Age-set: born around the same time. Anomaly detection: events within the same time window.
- Spatial similarity. Age-set: live in the same location. Anomaly detection: run on the same node.
- Phase similarity. Age-set: (birth, initiation, marriage). Anomaly detection: (map, shuffle, reduce).
- Contextual similarity. Age-set: same gender, clan. Anomaly detection: same workload, hardware.

Target Systems for Validation
- VoIP system at a large telecommunication provider
  - 10s of millions of calls per day, diverse workloads
  - 100s of network elements with heterogeneous hardware
  - 24x7 Ops team uses alarm correlation to diagnose outages
  - Separate team troubleshoots long-term chronics
  - Labeled traces available
- Hadoop: open-source implementation of MapReduce
  - Diverse kinds of real workloads: graph mining, language translation
  - Hadoop clusters with homogeneous hardware
  - Yahoo! M45 & OpenCloud production clusters
  - Controlled experiments in an Amazon EC2 cluster
  - Long-running jobs (> 100s): hard to label failures

In Support of Thesis Statement

Objective | VoIP | Hadoop
Anomaly Detection | Heuristics-based; peer-comparison pending | Peer comparison without labeled data
Problem Localization | Localize to customer/network-element/resource/error-code | Localize to node/task/resource
Chronics | Exceptions, performance degradation, single/multiple-source | Exceptions, performance degradation, single-source; multiple-source pending
Production Systems | AT&T production system | EC2 test system; OpenCloud pending
Publications | OSR'11, DSN'12 | WASL'08, HotMetrics'09, ISSRE'09, ICDCS'10, NOMS'10, CCGRID'10

Goals & Non-Goals
- Goals
  - Anomaly detection in the absence of labeled failure-data
  - Diagnosis based on available instrumentation sources
  - Differentiation of workload changes from anomalies
- Non-goals
  - Diagnosis of system-wide outages
  - Diagnosis of value faults and transient faults
  - Root-cause analysis at code level
  - Online/runtime diagnosis
  - Recovery based on diagnosis

Assumptions
- Majority of the system is working correctly
- Problems manifest in observable behavioral changes: exceptions or performance degradations
- All instrumentation is locally timestamped
  - Clocks are synchronized to enable system-wide correlation of data
- Instrumentation faithfully captures system behavior

Overview of Approach
(Diagram: application logs and performance counters feed end-to-end trace construction; anomaly detection and localization then produce a ranked list of root-causes.)

Target System #1: VoIP
(Diagram: PSTN access and IP access connect through gateway servers, IP base elements, call control elements, and application servers within the ISP's network.)

Target System #2: Hadoop
(Diagram: the master node runs the JobTracker and NameNode; slave nodes run TaskTrackers hosting map/reduce tasks and DataNodes hosting HDFS blocks. Hadoop logs and OS data are collected from the nodes.)

Performance Counters
- For both Hadoop and VoIP
- Metrics collected periodically from /proc in the OS
- Monitoring interval varies from 1 sec to 15 min
- Examples of metrics collected: CPU utilization, CPU run-queue size, pages in/out, memory used/free, context switches, packets sent/received, disk blocks read/written (a small sampler sketch follows below)
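The code below is a minimal, hypothetical sampler for a few of the counters listed above, assuming the standard Linux /proc file layout; it is an illustrative sketch, not the collector used in this work.

```python
# Hypothetical /proc sampler (illustrative sketch, not the actual collector).
import time

def sample_proc():
    metrics = {}
    with open("/proc/stat") as f:
        for line in f:
            parts = line.split()
            if parts[0] == "cpu":            # aggregate CPU jiffies: user, nice, system, idle, ...
                metrics["cpu_user"] = int(parts[1])
                metrics["cpu_system"] = int(parts[3])
                metrics["cpu_idle"] = int(parts[4])
            elif parts[0] == "ctxt":         # context switches since boot
                metrics["context_switches"] = int(parts[1])
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":")
            if key in ("MemTotal", "MemFree"):
                metrics[key] = int(value.split()[0])    # value in kB
    with open("/proc/vmstat") as f:
        for line in f:
            key, value = line.split()
            if key in ("pgpgin", "pgpgout"):            # pages paged in/out since boot
                metrics[key] = int(value)
    return metrics

if __name__ == "__main__":
    prev = sample_proc()
    while True:
        time.sleep(1)        # monitoring interval (1 s here; up to 15 min in the deck)
        cur = sample_proc()
        deltas = {k: cur[k] - prev[k] for k in cur if k not in ("MemTotal", "MemFree")}
        print(deltas, "MemFree(kB):", cur.get("MemFree"))
        prev = cur
```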

End-to-End Trace Construction
(The approach pipeline, focusing on end-to-end trace construction: application logs and performance counters feed trace construction, which is followed by anomaly detection and localization, producing a ranked list of root-causes.)

Application Logs
- Each node logs each request that passes through it
  - Timestamp, IP address, request duration/size, phone no., ...
  - Log formats vary across components and systems
  - Application-specific parsers extract relevant attributes
- Construction of end-to-end traces (a stitching sketch follows below)
  - Pre-defined schema used to stitch requests across nodes
  - Match on key attributes
    - In Hadoop, match tasks with the same task IDs
    - In VoIP, match calls with the same sender/receiver phone numbers
  - Incorporate time-based correlation
    - In Hadoop, consider block reads in the same time interval as maps
    - In VoIP, consider calls with the same phone number within the same time interval
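As a concrete illustration of the stitching described above, the following sketch groups per-node log records by a key attribute and splits them into traces using a time window. The record fields and the 60-second window are assumptions made for illustration, not the exact schema used in the deck.

```python
# Hypothetical trace stitching: match on a key attribute, then correlate by time.
from collections import defaultdict

WINDOW_SECS = 60.0   # illustrative time window

def stitch(records, key_field="call_id"):
    """records: list of dicts with 'timestamp', 'node', and a key attribute
    (e.g., task ID in Hadoop, caller/callee phone numbers in VoIP)."""
    by_key = defaultdict(list)
    for rec in records:
        by_key[rec[key_field]].append(rec)

    traces = []
    for key, recs in by_key.items():
        recs.sort(key=lambda r: r["timestamp"])
        current = [recs[0]]
        for rec in recs[1:]:
            # time-based correlation: start a new trace if the gap is too large
            if rec["timestamp"] - current[-1]["timestamp"] > WINDOW_SECS:
                traces.append({"key": key, "events": current})
                current = []
            current.append(rec)
        traces.append({"key": key, "events": current})
    return traces
```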

Application Logs: VoIP
- Combine per-element logs to obtain per-call traces
- Approximate match on key attributes: timestamps, caller-callee numbers, IP addresses, ports
- Determine call status from per-element codes: zero talk-time, callback soon after call termination
(Figure: START/STOP/ATTEMPT log entries from the IP base elements, call control element, application server, and gateway server stitched into a single call trace; phone numbers elided.)

Application Logs: Hadoop (1)
- Peer-comparable attributes extracted from logs
- Correlate traces using IDs and the request schema
(Figure: annotated ReduceTask log lines, e.g., "... INFO org.apache.hadoop.mapred.ReduceTask: attempt_..._0051_r_000005_0 Scheduled 10 of 115 known outputs (0 slow hosts and 105 dup hosts)" and "... INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 2 bytes (2 raw bytes) into RAM from attempt_..._0051_m_000055_0 ... from ip ... ec2.internal". Timestamps give temporal similarity, hostnames give spatial similarity, map vs. reduce gives phase similarity, and task type gives contextual similarity.)

Application Logs: Hadoop (2)
- No global IDs for correlating logs in Hadoop and VoIP
- Extract causal flows using predefined schemas (see the sketch below)
(Figure: application logs are parsed into events, joined using a flow schema expressed in JSON, e.g., MapReduce: { "events" : { "Map" : { "primary-key" : "MapID", "join-key" : "MapID", "next-event" : "Shuffle" }, ... }, and the resulting causal flows are stored in a NoSQL database.)
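The sketch below shows how a flow schema of the kind shown above could drive causal-flow extraction. The Map entry mirrors the JSON fragment on the slide; the Shuffle and Reduce entries, and the event field names, are assumptions for illustration.

```python
# Hypothetical schema-driven causal-flow extraction (illustrative sketch).
FLOW_SCHEMA = {
    "Map":     {"join-key": "MapID",    "next-event": "Shuffle"},
    "Shuffle": {"join-key": "ReduceID", "next-event": "Reduce"},
    "Reduce":  {"join-key": None,       "next-event": None},
}

def extract_flows(events):
    """events: list of dicts like {"type": "Map", "MapID": "...", "ReduceID": "...",
    "timestamp": 12.3, "host": "..."}. Chains Map -> Shuffle -> Reduce events
    into causal flows by following the schema's join keys."""
    flows = []
    for start in (e for e in events if e["type"] == "Map"):   # flows begin at Map events
        flow, current = [start], start
        while FLOW_SCHEMA[current["type"]]["next-event"] is not None:
            next_type = FLOW_SCHEMA[current["type"]]["next-event"]
            join_key = FLOW_SCHEMA[current["type"]]["join-key"]
            candidates = [e for e in events
                          if e["type"] == next_type and e.get(join_key) == current.get(join_key)]
            if not candidates:
                break
            current = min(candidates, key=lambda e: e["timestamp"])  # earliest matching successor
            flow.append(current)
        flows.append(flow)
    return flows
```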

Anomaly Detection
(The approach pipeline, focusing on the anomaly detection stage.)

Anomaly Detection Overview
- Some systems have rules for anomaly detection
  - Redialing a number immediately after disconnection
  - Server-reported error codes and exceptions
- If no rules are available, rely on peer-comparison
  - Identifies peers (nodes, flows) in distributed systems
  - Detects anomalies by identifying the "odd-man-out"

Anomaly Detection (1)
- Empirically determine the best peer groupings
  - Window size, request-flow types, job information
  - Best grouping minimizes false positives in fault-free runs
- Peer-comparison identifies "odd-man-out" behavior
  - Robust to workload changes
  - Relies on histogram comparison, which is less sensitive to timing differences
- Multiple suspects might be identified
  - Due to propagating errors or multiple independent problems

Anomaly Detection (2)
- Histogram comparison identifies anomalous flows (see the sketch below)
  - Generate an aggregate histogram that represents majority behavior
  - Compare each node's histogram against the aggregate histogram, O(n)
  - Compute an anomaly score using Kullback-Leibler divergence
  - Detect an anomaly if the score exceeds a pre-specified threshold
(Figure: histograms (distributions) of flow durations, normalized counts totaling 1.0, for a normal node vs. a faulty node.)
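A compact sketch of this histogram-comparison step is shown below, under assumed duration bins and threshold (both illustrative): it builds per-node histograms of flow durations, compares each against the aggregate using KL divergence, and flags nodes whose score exceeds the threshold.

```python
# Peer-comparison via histogram + KL divergence (illustrative sketch).
import math
from collections import defaultdict

BINS = [0.1, 0.5, 1, 5, 10, 60, float("inf")]   # duration bins in seconds (assumed)
EPS = 1e-6                                      # smoothing to avoid log(0)

def histogram(durations):
    counts = [0] * len(BINS)
    for d in durations:
        counts[next(i for i, b in enumerate(BINS) if d <= b)] += 1
    total = sum(counts) or 1
    return [(c + EPS) / (total + EPS * len(BINS)) for c in counts]  # normalized to ~1.0

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def flag_anomalous_nodes(flow_durations_by_node, threshold=0.5):
    """flow_durations_by_node: {node: [flow durations]}. Returns {node: score}
    for nodes whose divergence from the aggregate exceeds the threshold."""
    all_durations = [d for ds in flow_durations_by_node.values() for d in ds]
    aggregate = histogram(all_durations)        # majority behavior
    scores = {node: kl_divergence(histogram(ds), aggregate)
              for node, ds in flow_durations_by_node.items()}
    return {node: s for node, s in scores.items() if s > threshold}
```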

Localization
(The approach pipeline, focusing on the localization stage.)

"Truth table" Request Representation

Request | Node1 | Node2 | Map | ReadBlock | Outcome
Req1 | 1 | 0 | 1 | 1 | SUCCESS
Req2 | 0 | 1 | 1 | 1 | FAIL

Log snippet:
Req1: ...,SUCCESS,Node1,Map,ReadBlock
Req2: ...,FAIL,Node2,Map,ReadBlock

Identify Suspect Attributes
- Assume each attribute is represented as a "coin toss"
- Estimate attribute distributions using Bayes
  - Success distribution: Prob(Attribute | Success)
  - Anomalous distribution: Prob(Attribute | Anomalous)
- Anomaly score: KL-divergence between the two distributions
- Indict attributes with the highest divergence between distributions (a scoring sketch follows below)
(Figure: belief vs. Probability(Node2 = TRUE) for successful and anomalous requests.)
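The following is one simple, hedged instantiation of this attribute scoring: each attribute from the truth-table representation is treated as a coin toss, P(attribute | success) and P(attribute | anomalous) are estimated with a Beta(1,1) prior (Laplace smoothing), and attributes over-represented in anomalous requests are ranked by the KL divergence between the two resulting Bernoulli distributions. It is illustrative, not the exact estimator used by Draco.

```python
# Attribute scoring over truth-table request vectors (illustrative sketch).
import math

def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def score_attributes(success_reqs, anomalous_reqs, attributes):
    """Each request is a dict of attribute -> 0/1 (the truth-table rows above)."""
    scores = {}
    for attr in attributes:
        # posterior mean of P(attr = 1 | outcome) under a Beta(1,1) prior
        p_succ = (sum(r[attr] for r in success_reqs) + 1) / (len(success_reqs) + 2)
        p_anom = (sum(r[attr] for r in anomalous_reqs) + 1) / (len(anomalous_reqs) + 2)
        if p_anom > p_succ:   # only indict attributes over-represented in failures
            scores[attr] = bernoulli_kl(p_anom, p_succ)
    # attributes with the highest divergence between the two distributions come first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Example using the truth-table rows from the slide above:
success_reqs = [{"Node1": 1, "Node2": 0, "Map": 1, "ReadBlock": 1}]
anomalous_reqs = [{"Node1": 0, "Node2": 1, "Map": 1, "ReadBlock": 1}]
print(score_attributes(success_reqs, anomalous_reqs, ["Node1", "Node2", "Map", "ReadBlock"]))
# -> [('Node2', ...)]: Node2 is the most suspect attribute
```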

Rank Problems by Severity
- Indict the path (conjunction of attributes) with the highest anomaly score
- Step 1: over all requests, indict Problem 1 (e.g., Node2 + Map)
- Step 2: filter out the requests matching Problem 1 and repeat, indicting Problem 2 (e.g., Node3 + Shuffle)
(Figure: attribute drill-down over Map/Shuffle, Node2/Node3, and ExceptionX/ExceptionY for the two steps; an iterative sketch follows below.)
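A hedged sketch of this iterative indict-and-filter loop appears below; it reuses the hypothetical score_attributes() helper from the previous sketch and, for brevity, indicts single attributes rather than conjunctions of attributes.

```python
# Iterative localization: indict, filter, repeat (illustrative sketch).
def localize(success_reqs, anomalous_reqs, attributes, max_problems=5):
    problems = []
    remaining = list(anomalous_reqs)
    for _ in range(max_problems):
        if not remaining:
            break
        ranked = score_attributes(success_reqs, remaining, attributes)
        if not ranked:
            break
        top_attr, score = ranked[0]                  # indict the top-scoring attribute
        coverage = sum(r[top_attr] for r in remaining)
        problems.append((top_attr, score, coverage))
        # filter out the anomalous requests explained by this problem, then repeat
        remaining = [r for r in remaining if not r[top_attr]]
    return problems  # ordered by discovery, i.e., roughly by severity/coverage
```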

Incorporate Performance Counters (1)
- Annotate requests on indicted nodes with performance counters, matched by timestamp
- Identify the metrics most correlated with the problem
- Compare the distribution of each metric across successful and failed requests (see the sketch below)

Requests on node2 (excerpt; timestamps and some values elided):
# Timestamp, CallNo, Status, Memory(%), CPU(%)
..., 1, SUCCESS, 54, ...
..., 2, SUCCESS, 54, ...
..., 3, SUCCESS, 56, ...
..., 4, FAIL, 52, 45
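The sketch below illustrates this step under assumed field names ("timestamp", "status", metric names): each request is annotated with the counter sample taken at or just after its timestamp, and metrics are then ranked by a simple standardized mean difference between failed and successful requests, standing in for the distribution comparison described above.

```python
# Hypothetical counter annotation and metric ranking (illustrative sketch).
import bisect
import statistics

def annotate(requests, counter_samples):
    """counter_samples: list of (timestamp, {metric: value}) sorted by timestamp."""
    times = [t for t, _ in counter_samples]
    for req in requests:
        i = bisect.bisect_left(times, req["timestamp"])   # first sample at/after the request
        i = min(i, len(counter_samples) - 1)               # clamp to the last sample
        req["counters"] = counter_samples[i][1]
    return requests

def rank_metrics(requests, metrics):
    """Rank metrics by how separated their values are in failed vs. successful requests."""
    ranked = {}
    for m in metrics:
        ok = [r["counters"][m] for r in requests if r["status"] == "SUCCESS"]
        bad = [r["counters"][m] for r in requests if r["status"] == "FAIL"]
        if len(ok) > 1 and len(bad) > 1:
            spread = statistics.pstdev(ok + bad) or 1e-9   # avoid division by zero
            ranked[m] = abs(statistics.mean(bad) - statistics.mean(ok)) / spread
    return sorted(ranked.items(), key=lambda kv: kv[1], reverse=True)
```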

Incorporate Performance Counters (2)
- Incorporate performance counters in the diagnosis
(Figure: the drill-down over all requests repeated, with Problem 1 (Node2 + Map) now annotated with the correlated counter, high CPU.)

Why Does It Work?
- Real-world data backs up the utility of peer-comparison
  - Task durations are peer-comparable in >75% of jobs [CCGrid'10]
- The approach analyzes both successful and failed requests
  - Analyzing only failed requests might elevate common elements over causal elements
- The iterative approach discovers correlated attributes
  - Identifies problems due to conjunctions of attributes
  - The filtering step identifies multiple ongoing problems
- Handles unencountered problems
  - Does not rely on historical models of normal behavior
  - Does not rely on signatures of known defects

VoIP: Diagnosis of Real Incidents

Example real-world incident | Diagnosed | Resource indicted
Customers use wrong codec to send faxes | ✓ | NA
Customer problem causes blocked calls at IPBE | ✓ | NA
Blocked circuit identification codes on trunk group | ✓ | NA
Software bug at control server causes blocked calls | ✓ | NA
Problem with customer equipment leads to poor QoS | ✓ | NA
Debug tracing overloads servers during peak traffic | ✓ | CPU
Performance problem at application server | ✓ | CPU/Memory
Congestion at gateway servers due to high load | ✓ | CPU/Concurrent sessions
Power outage causes brief outages | ✗ | NA
PSX not responding to invites from app. server | ✗ | Low responses at app. server

8 out of 10 real incidents diagnosed.

VoIP: Case Studies
(Figure: two six-day timelines of failed calls. Incident 1, a chronic nightly problem due to an unsupported fax codec, causes failed calls for two customers until the customers stop using the unsupported codec. Incident 2, an unrelated chronic server problem, emerges and causes failed calls at a server until the server is reset.)

Implementation of Approach
- Draco: deployed in production at AT&T (~8,500 lines of C code)
(Figure: screenshot of the Draco UI with search and filter controls listing ranked problems, e.g., Problem 1: STOP.IP-TO-PSTN, Chicago GSX servers, memory overload; Problem 2: STOP.IP-TO-PSTN, Service B, Customer Acme, IP_w.x.y.z.)

VoIP: Ranking Multiple Problems
(Figure: ranking accuracy comparison; Draco performs better at ranking multiple independent problems.)

VoIP: Performance of Algorithm

Offline analysis | Avg. log size | Avg. data load time | Avg. diagnosis time
Draco, simulated 1 hr (C++) | 271 MB | 8 s | 4 s
Draco, real 1 day (C++) | 2.4 GB | 7 min | 8 min

Running on a 16-core Xeon (2.4 GHz) with 24 GB of memory.

Hadoop: Target Clusters
- 10- to 100-node Amazon EC2 clusters
  - Commercial, pay-as-you-use cloud-computing resource
  - Workloads under our control, problems injected by us: gridmix, nutch, sort, random writer
  - Can harvest logs and OS data of only our workloads
- 4000-processor M45 & 64-node OpenCloud clusters
  - Production environments, offered to CMU as free cloud-computing resources
  - Diverse kinds of real workloads, problems in the wild: massive machine learning, language/machine translation
  - Permission to harvest all logs and OS data

Hadoop: EC2 Fault Injection (fault injected on a single node)

Fault | Description
Resource contention
CPU hog | External process uses 70% of CPU
Packet loss | 5% or 50% of incoming packets dropped
Disk hog | 20 GB file repeatedly written to
Application bugs (source: Hadoop JIRA)
HADOOP-1036 | Maps hang due to unhandled exception
HADOOP-1152 | Reduces fail while copying map output
HADOOP-2080 | Reduces fail due to incorrect checksum

Hadoop: Peer-comparison Results (Without Causal Flows)
(Figure: per-metric true positive rates for each injected fault; different metrics detect different problems. Correlated problems, e.g., packet loss, are harder to localize.)

Hadoop: Peer-comparison Results (With Causal Flows + Localization)

Injected fault | Diagnosed | Indicted
CPU hog | ✓ | Node
Packet loss | ✓ | Node + Shuffle
Disk hog | ✓ | Node
HADOOP-1036 | ✓ | Node + Map
HADOOP-1152 | ✓ | Node + Shuffle
HADOOP-2080 | ✓ | Node + Shuffle

Correlated problems are correctly identified.

Critique of Approach
- Anomaly detection thresholds are fragile
  - Need to use statistical tests
- Anomaly detection does not address problems at the master
- Peer-groups are defined statically
  - Assumes homogeneous clusters
  - Need to automate identification of peers
- False positives occur if the root-cause is not in the logs
  - Algorithm tends to implicate adjacent network elements
  - Need to incorporate more data to improve visibility

Related Work  Chronics fly under the radar  Undetected by alarm mining [Mahimkar09]  Chronics can persist undetected for long periods of time  Hard to detect using change-points [Kandula09]  Hard to demarcate problem periods [Sambasivan11]  Multiple ongoing problems at a time  Single fault assumption inadequate [Cohen05, Bodik10]  Peer-comparison on its own inadequate  Hard to localize propagating problems [Kasick10,Tan10,Kang10] Soila March

Pending Work

Objective | VoIP | Hadoop
Anomaly Detection | Heuristics-based; peer-comparison pending | Peer comparison without labeled data
Problem Localization | Localize to customer/network-element/resource/error-code | Localize to node/task/resource
Chronics | Exceptions, performance degradation, single/multiple-source | Exceptions, performance degradation, single-source; multiple-source pending
Production Systems | AT&T production system | EC2 test system; OpenCloud pending
Publications | OSR'11, DSN'12 | WASL'08, HotMetrics'09, ISSRE'09, NOMS'10, CCGRID'10

Pending Work: Details
- OpenCloud production cluster & multiple-source problems [April-June 2012]
  - 64-node cluster housed at Carnegie Mellon
  - Obtained and parsed logs from 25 real OpenCloud incidents
  - Root-causes include misconfigurations, h/w issues, buggy apps
  - Yet to analyze logs
- Peer comparison in VoIP [June-July 2012]
  - Examining data that is not labeled, and identifying peers
  - Notion of a peer might be determined by function and location
  - Root-causes under investigation are as before
- Dissertation writing [June-August 2012]
- Defense [September 2012]

Collaborators & Thanks
- VoIP (AT&T): Matti Hiltunen, Kaustubh Joshi, Scott Daniels
- Hadoop diagnosis: Jiaqi Tan, Xinghao Pan, Rajeev Gandhi, Keith Bare, Michael Kasick, Eugene Marinelli
- Hadoop visualization: Christos Faloutsos, U Kang, Elmer Garduno, Jason Campbell (Intel), HCI team
- OpenCloud: Greg Ganger, Garth Gibson, Julio Lopez, Kai Ren, Mitch Franzos, Michael Stroucken

Summary
- Peer-comparison is effective for anomaly detection
  - Robust to workload changes
  - Requires little training data
- Incremental fusion of different instrumentation sources enables localization of chronics
  - Starts with user-visible symptoms of a problem
  - Drills down to localize the root-cause of the problem
- Usefulness of the approach in two production systems
  - VoIP system at a large telecommunication provider (demonstrated)
  - Hadoop clusters (underway)

Questions?
Climbing Mt. Kilimanjaro comes a distant second to a thesis proposal!

Selected Publications (1)

Diagnosis in a production VoIP system:
- [DSN'12] Draco: Statistical Diagnosis of Chronic Problems in Large Distributed Systems. S. P. Kavulya, S. Daniels, K. Joshi, M. Hiltunen, R. Gandhi, P. Narasimhan. To appear in DSN 2012.
- [OSR'12] Practical Experiences with Chronics Discovery in Large Telecommunications Systems. S. P. Kavulya, K. Joshi, M. Hiltunen, S. Daniels, R. Gandhi, P. Narasimhan. Best papers from SLAML 2011, Operating Systems Review.

Survey paper & workload analysis of a production Hadoop cluster:
- [RAE'12] Failure Diagnosis of Complex Systems. S. P. Kavulya, K. Joshi, F. Di Giandomenico, P. Narasimhan. To appear in the book Resilience Assessment and Evaluation (Wolter et al., eds.).
- [CCGrid'10] An Analysis of Traces from a Production MapReduce Cluster. S. Kavulya, J. Tan, R. Gandhi, P. Narasimhan. CCGrid 2010.

Selected Publications (2)

Visualization in Hadoop:
- [CHIMIT'11] Understanding and Improving the Diagnostic Workflow of MapReduce Users. J. D. Campbell, A. B. Ganesan, B. Gotow, S. P. Kavulya, J. Mulholland, P. Narasimhan, S. Ramasubramanian, M. Shuster, J. Tan. CHIMIT 2011.
- [ICDCS'10] Visual, Log-Based Causal Tracing for Performance Debugging of MapReduce Systems. J. Tan, S. Kavulya, R. Gandhi, P. Narasimhan. ICDCS 2010.

Diagnosis in Hadoop (application logs + performance counters):
- [NOMS'10] Kahuna: Problem Diagnosis for MapReduce-Based Cloud Computing Environments. J. Tan, X. Pan, S. Kavulya, R. Gandhi, P. Narasimhan. NOMS 2010.
- [ISSRE'09] Blind Men and the Elephant (BLIMEy): Piecing Together Hadoop for Diagnosis. X. Pan, J. Tan, S. Kavulya, R. Gandhi, P. Narasimhan. ISSRE 2009.

Selected Publications (3)

Diagnosis in Hadoop (performance counters):
- [HotMetrics'09] Ganesha: Black-Box Fault Diagnosis for MapReduce Systems. X. Pan, J. Tan, S. Kavulya, R. Gandhi, P. Narasimhan. HotMetrics 2009.

Diagnosis in Hadoop (application logs):
- [WASL'08] SALSA: Analyzing Logs as StAte Machines. J. Tan, X. Pan, S. Kavulya, R. Gandhi, P. Narasimhan. WASL 2008.

Diagnosis in group communication systems:
- [SRDS'08] Gumshoe: Diagnosing Performance Problems in Replicated File-Systems. S. Kavulya, R. Gandhi, P. Narasimhan. SRDS 2008.
- [SysML'07] Fingerpointing Correlated Failures in Replicated Systems. S. Pertet, R. Gandhi, P. Narasimhan. SysML, April 2007.

Related Work (1)
- [Bodik10] Fingerprinting the Datacenter: Automated Classification of Performance Crises. Peter Bodík, Moisés Goldszmidt, Armando Fox, Dawn B. Woodard, Hans Andersen. EuroSys 2010.
- [Cohen05] Capturing, Indexing, Clustering and Retrieving System History. Ira Cohen, Steve Zhang, Moises Goldszmidt, Julie Symons, Terence Kelly, Armando Fox. SOSP 2005.
- [Kandula09] Detailed Diagnosis in Enterprise Networks. Srikanth Kandula, Ratul Mahajan, Patrick Verkaik, Sharad Agarwal, Jitendra Padhye, Paramvir Bahl. SIGCOMM 2009.
- [Kasick10] Black-Box Problem Diagnosis in Parallel File Systems. Michael P. Kasick, Jiaqi Tan, Rajeev Gandhi, Priya Narasimhan. FAST 2010.
- [Kiciman05] Detecting Application-Level Failures in Component-Based Internet Services. Emre Kiciman, Armando Fox. IEEE Transactions on Neural Networks, 2005.

Related Work (2)
- [Mahimkar09] Towards Automated Performance Diagnosis in a Large IPTV Network. Ajay Anil Mahimkar, Zihui Ge, Aman Shaikh, Jia Wang, Jennifer Yates, Yin Zhang, Qi Zhao. SIGCOMM 2009.
- [Sambasivan11] Diagnosing Performance Changes by Comparing Request Flows. Raja R. Sambasivan, Alice X. Zheng, Michael De Rosa, Elie Krevat, Spencer Whitman, Michael Stroucken, William Wang, Lianghong Xu, Gregory R. Ganger. NSDI 2011.