Automated Problem Diagnosis for Production Systems
Soila P. Kavulya, Scott Daniels (AT&T), Kaustubh Joshi (AT&T), Matti Hiltunen (AT&T), Rajeev Gandhi (CMU), Priya Narasimhan (CMU)
PARALLEL DATA LABORATORY, Carnegie Mellon University

Automated Problem Diagnosis
- Diagnosing problems creates major headaches for administrators, and worsens as scale and system complexity grow
- Goal: automate diagnosis and get proactive
  - Failure detection and prediction
  - Problem determination (or "fingerpointing")
  - Problem visualization
- How: instrumentation plus statistical analysis

Target Systems for Validation
- VoIP system at a large telecom provider
  - 10s of millions of calls per day, diverse workloads
  - 100s of heterogeneous network elements
  - Labeled traces available
- Hadoop: MapReduce implementation
  - Hadoop clusters with homogeneous hardware
  - Yahoo! M45 & OpenCloud production clusters
  - Controlled experiments in an Amazon EC2 cluster
  - Long-running jobs (> 100 s): hard to label failures

Assumptions of Approach
- The majority of the system is working correctly
- Problems manifest as observable behavioral changes: exceptions or performance degradations
- All instrumentation is locally timestamped
- Clocks are synchronized to enable system-wide correlation of data
- Instrumentation faithfully captures system behavior

Overview of Diagnostic Approach
- Inputs: performance counters and application logs
- End-to-end trace construction
- Anomaly detection
- Localization
- Output: ranked list of root causes

Anomaly Detection Overview
- Some systems have rules for anomaly detection, e.g.:
  - Redialing a number immediately after disconnection
  - Server-reported error codes and exceptions
- If no rules are available, rely on peer comparison
  - Identify peers (nodes, flows) in distributed systems
  - Detect anomalies by identifying the "odd man out"

Anomaly Detection Approach
- Histogram comparison identifies anomalous nodes
  - Build histograms (distributions) of flow durations per node, with normalized counts (total 1.0)
  - Compare node histograms pairwise
  - Detect an anomaly when the difference between histograms exceeds a pre-specified threshold: the faulty node's histogram stands apart from the normal nodes'
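The pairwise comparison above can be sketched in a few lines. This is a minimal illustration, not the system's implementation: the distance metric (half the L1 distance between normalized histograms), the bin count, and the 0.2 threshold are all illustrative assumptions.

```python
def peer_compare(samples_by_node, bins=10, threshold=0.2):
    """Flag the "odd man out" among peers by pairwise histogram comparison.

    samples_by_node maps a node name to its list of flow durations.
    The distance metric and threshold here are illustrative assumptions.
    """
    nodes = sorted(samples_by_node)
    lo = min(min(v) for v in samples_by_node.values())
    hi = max(max(v) for v in samples_by_node.values())
    width = (hi - lo) / bins or 1.0

    def histogram(values):
        # normalized counts (total 1.0), shared bin edges across all nodes
        counts = [0.0] * bins
        for v in values:
            counts[min(int((v - lo) / width), bins - 1)] += 1.0
        return [c / len(values) for c in counts]

    hists = {n: histogram(samples_by_node[n]) for n in nodes}

    def distance(h1, h2):
        # half the L1 distance: 0.0 for identical histograms, 1.0 for disjoint
        return 0.5 * sum(abs(a - b) for a, b in zip(h1, h2))

    anomalous = []
    for n in nodes:
        # a healthy node resembles the majority of its peers,
        # so its median distance to the others stays small
        dists = sorted(distance(hists[n], hists[m]) for m in nodes if m != n)
        if dists[len(dists) // 2] > threshold:
            anomalous.append(n)
    return anomalous
```

Using the median distance to peers (rather than the maximum) keeps one faulty node from dragging its healthy peers over the threshold.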

Localization Overview
1. Obtain labeled end-to-end traces (labels indicate failures and successes)
   - Telecom systems: use heuristics, e.g., redialing a number immediately after disconnection
   - Hadoop: use peer comparison for anomaly detection, since heuristics for detection are unavailable
2. Localize the source of problems
   - Score attributes based on how well they distinguish failed calls from successful ones

"Truth Table" Call Representation

Log snippet:
  Call1: 09:31am, SUCCESS, Server1, Server2, Phone1
  Call2: 09:32am, FAIL, Server1, Customer1, Phone1

         Server1  Server2  Customer1  Phone1  Outcome
  Call1     1        1         0        1     SUCCESS
  Call2     1        0         1        1     FAIL

Scale: 10s of thousands of attributes, 10s of millions of calls
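The truth table can be built mechanically from log lines shaped like the snippet above. A minimal sketch (the log format is taken from the slide; columns are ordered by first appearance in the logs, which differs from the slide's column ordering):

```python
def build_truth_table(log_lines):
    """Build the binary "truth table" call representation from log lines.

    Expects lines shaped like "09:31am,SUCCESS,Server1,Server2,Phone1":
    a timestamp, an outcome, then the attributes touched by the call.
    Returns (attrs, rows), where each row pairs a bit vector (1 if the
    attribute participated in the call) with the call's outcome.
    """
    attrs, records = [], []
    for line in log_lines:
        parts = [p.strip() for p in line.split(",")]
        outcome, call_attrs = parts[1], parts[2:]
        for a in call_attrs:
            if a not in attrs:       # register new attributes as they appear
                attrs.append(a)
        records.append((set(call_attrs), outcome))
    rows = [([int(a in present) for a in attrs], outcome)
            for present, outcome in records]
    return attrs, rows
```

At production scale (10s of millions of calls, 10s of thousands of attributes) the bit vectors would be stored sparsely, but the representation is the same.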

Identify Suspect Attributes
- Estimate conditional probability distributions: Prob(Success|Attribute) vs. Prob(Failure|Attribute)
- Update the belief on each distribution with every call seen (e.g., degree-of-belief curves for Success|Customer1 and Failure|Customer1)
- Anomaly score: distance between the two distributions
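One common way to realize this belief update is with Beta distributions, whose parameters can be incremented one call at a time. The sketch below is an assumption-laden stand-in for the slide's method: it tracks Beta beliefs over P(Failure | attribute present) and P(Failure | attribute absent), and uses the gap between posterior means as the "distance between distributions"; the actual system may use a different divergence between the belief curves.

```python
def suspect_score(rows, attr_idx):
    """Belief-update sketch for one attribute column of the truth table.

    Maintains Beta(alpha, beta) beliefs over P(FAIL | attribute present)
    and P(FAIL | attribute absent), updated with each call seen.  Scoring
    by the gap between posterior means is a simplifying assumption.
    """
    a1 = b1 = a0 = b0 = 1.0  # uniform Beta(1, 1) priors
    for bits, outcome in rows:
        failed = (outcome == "FAIL")
        if bits[attr_idx]:                    # attribute present on this call
            a1, b1 = a1 + failed, b1 + (not failed)
        else:                                 # attribute absent
            a0, b0 = a0 + failed, b0 + (not failed)

    def mean(a, b):
        return a / (a + b)                    # mean of a Beta(a, b) distribution

    return mean(a1, b1) - mean(a0, b0)
```

An attribute that appears only on failed calls scores near 1.0; an attribute equally common on failures and successes scores near 0.0.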

Find Multiple Ongoing Problems
- Search for the combination of attributes that maximizes the anomaly score, e.g., (Customer1 and ServerOS4)
- Greedy search limits the combinations explored
- Iterative search identifies multiple problems
- UI: ranked list of chronics, plotted as failed calls over time of day (GMT), e.g.:
  1. Chronic signature 1: Customer1, ServerOS4
  2. Chronic signature 2: PhoneType7
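The greedy-plus-iterative search can be sketched as below. The scoring function is not shown on the slide, so this sketch leaves it pluggable via a hypothetical `score_fn` callback; removing the failed calls explained by each found signature before the next pass is how the iteration surfaces multiple distinct problems.

```python
def find_chronics(attrs, rows, score_fn, max_signatures=5):
    """Greedy, iterative search for high-scoring attribute combinations.

    score_fn(rows, attr_set) is a caller-supplied callback returning how
    well the conjunction of attributes separates FAILs from SUCCESSes.
    Returns a ranked list of (signature, score) pairs.
    """
    signatures = []
    remaining = list(rows)
    for _ in range(max_signatures):
        chosen, best = set(), 0.0
        while True:
            # greedy step: add the single attribute improving the score most
            best_add, best_score = None, best
            for a in attrs:
                if a in chosen:
                    continue
                s = score_fn(remaining, chosen | {a})
                if s > best_score:
                    best_add, best_score = a, s
            if best_add is None:
                break
            chosen.add(best_add)
            best = best_score
        if not chosen:
            break
        signatures.append((sorted(chosen), best))
        # strip the failed calls explained by this signature, so the next
        # iteration can surface a different ongoing problem
        idx = [i for i, a in enumerate(attrs) if a in chosen]
        remaining = [r for r in remaining
                     if not (r[1] == "FAIL" and all(r[0][i] for i in idx))]
    return signatures
```

The greedy expansion explores at most len(attrs) candidates per step instead of all 2^n attribute subsets, which is what makes the search tractable over tens of thousands of attributes.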

Evaluation
- Prototype in use by the Ops team
  - Daily reports over the past 2 years
  - Helped Ops quickly discover new chronics
- Example: analyzing 25 million VoIP calls on 2 × 2.4 GHz Xeon cores used < 1 GB of memory
  - Data loading: 1.75 minutes for 6 GB of data
  - Diagnosis: ~4 seconds per signature (near-interactive)

Call Quality (QoS) Violations
- Message loss used as the event failure indicator (> 1%)
- Draco showed most QoS issues were tied to specific customers (customer name, IP) and not to ISP network elements, as was previously believed
- Chronic signatures from an incident at the ISP (plots: failed calls over time of day, GMT):
  1. Chronic signature 1: Service_A, Customer_A
  2. Chronic signature 2: Service_A, Customer_N, IP_Address_N

In Summary…
- Use peer comparison for anomaly detection
- Localize the source of problems using statistics
  - Applicable when end-to-end traces are available
  - E.g., customer, network element, version conflicts
- The approach used on Trone might vary
  - Depends on the instrumentation available
  - Also depends on the fault model