
Root Cause Analysis of Failures in Large-Scale Computing Environments
Alex Mirgorodskiy, University of Wisconsin
Naoya Maruyama, Tokyo Institute of Technology
Barton P. Miller, University of Wisconsin

Slide 2: Motivation
- Systems are complex and non-transparent
  – Many components, different vendors
- Anomalies are common
  – Intermittent
  – Environment-specific
- Users have little debugging expertise
Finding the causes of bugs and performance problems in production systems is hard.

Slide 3: Vision
Autonomous, detailed, low-overhead analysis:
- The user specifies a perceived problem cause
- The agent finds the actual cause
[Diagram: the agent travels over the network among Process P and Process Q on Host A and Process R on Host B]

Slide 4: Applications
- Diagnostics of e-commerce systems
  – Trace the path each request takes through a system
  – Identify unusual paths
  – Find out why they are different from the norm
- Diagnostics of cluster and Grid systems
  – Monitor the behavior of different nodes in the system
  – Identify nodes with unusual behavior
  – Find out why they are different from the norm
  – Example: found problems in the SCore middleware
- Diagnostics of real-time and interactive systems
  – Trace words through the phone network
  – Find out why some words were dropped

Slide 5: Key Components
- Data collection: self-propelled instrumentation
  – Works for a single process
  – Can cross the user-kernel boundary
  – Can be deployed on multiple nodes at the same time
  – Ongoing work: crossing process and host boundaries
- Data analysis: use repetitiveness to find anomalies
  – Repetitive execution of the same high-level action, OR
  – Repetitiveness among identical processes (e.g., cluster-management tools, parallel codes, web-server farms)

Slide 6: Focus on Control Flow Anomalies
- Unusual statements executed
  – Corner cases are more likely to have bugs
- Statements executed in unusual order
  – Race conditions
- A function taking unusually long to complete
  – Sporadic performance problems
  – Deadlocks, livelocks

Slide 7: Current Framework
1. Traces the control flow of all processes
   – Begins at process startup
   – Stops upon a failure or performance degradation
2. Identifies anomalies: unusual traces
   – Problems on a small number of nodes
   – Both fail-stop and non-fail-stop
3. Identifies the causes of the anomalies
   – The function responsible for the problem
[Diagram: traces of processes P1 through P4, with the trace of P1 labeled]

Slide 8
[Diagram: the agent (instrumenter.so, injected via /dev/instrumenter) patches the running binary a.out. Call instructions, including indirect calls through %eax and the system-call path into the OS kernel, are redirected so that each patch calls instrument() and then jumps to the original target (e.g., foo). Steps: Inject, Activate, Propagate; Analyze by building the call graph/CFG with Dyninst.]

Slide 9: Data Collection: Trace Management
[Diagram: the tracer appends events such as "call foo" and "ret foo" to the trace of Process P]
- The trace is kept in a fixed-size circular buffer (sketched below)
  – New entries overwrite the oldest entries
  – Retains the most recent events leading to the problem
- The buffer is located in a shared-memory segment
  – Does not disappear if the process crashes
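To make the mechanism concrete, here is a minimal sketch of such a crash-surviving ring buffer, assuming a simple record of event kind plus code address; the record layout, segment name, and class are illustrative, not the actual tracer's format.

```python
import struct
from multiprocessing import shared_memory

RECORD = struct.Struct("=BQ")    # 1-byte event kind, 8-byte code address
CAPACITY = 4096                  # number of records kept in the ring

class TraceRing:
    """Fixed-size circular trace buffer in a named shared-memory segment.
    The segment outlives the traced process, so the most recent events
    remain readable even after a crash."""

    def __init__(self, name="trace-ring", create=True):
        size = 8 + CAPACITY * RECORD.size            # 8-byte cursor + records
        self.shm = shared_memory.SharedMemory(name=name, create=create,
                                              size=size)

    def append(self, kind, addr):
        buf = self.shm.buf
        cursor, = struct.unpack_from("=Q", buf, 0)
        off = 8 + (cursor % CAPACITY) * RECORD.size
        RECORD.pack_into(buf, off, kind, addr)       # overwrite the oldest slot
        struct.pack_into("=Q", buf, 0, cursor + 1)

    def recent(self):
        """Return the buffered records, oldest to newest."""
        buf = self.shm.buf
        cursor, = struct.unpack_from("=Q", buf, 0)
        n = min(cursor, CAPACITY)
        return [RECORD.unpack_from(buf, 8 + (i % CAPACITY) * RECORD.size)
                for i in range(cursor - n, cursor)]
```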

Slide 10: Data Analysis: Find Anomalous Host
Check whether the anomaly was fail-stop or not (see the sketch below):
- One of the traces ends substantially earlier than the others -> fail-stop
  – The corresponding host is the anomaly
- Traces end at similar times -> non-fail-stop
  – Look at differences in behavior across traces
[Diagram: trace end times of processes P1 through P4]
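As a rough illustration of the decision, the check can be reduced to comparing trace end times; the gap threshold below is an arbitrary placeholder, not a value from the work.

```python
def find_fail_stop(end_times, gap=60.0):
    """end_times: {host: timestamp of the last event in its trace}.
    If one trace ends substantially earlier than the rest, report its host
    as the anomaly; otherwise fall through to the behavioral analysis."""
    ordered = sorted(end_times.items(), key=lambda kv: kv[1])
    (earliest_host, t0), (_, t1) = ordered[0], ordered[1]
    if t1 - t0 > gap:   # ends much earlier than every other trace: fail-stop
        return earliest_host
    return None         # similar end times: compare behaviors instead

# find_fail_stop({"n1": 1000.0, "n2": 998.5, "n129": 310.2})  ->  "n129"
```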

Slide 11: Data Analysis: Non-fail-stop Host
Find outliers (traces different from the rest):
- Define a distance metric between two traces
  – d(g,h) = measure of similarity of traces g and h
- Define a trace suspect score
  – σ(h) = similarity of h to the common behavior
- Report traces with high suspect scores
  – Most distant from the common behavior

Slide 12: Defining the Distance Metric
- Compute the time profile for each host h:
  – p(h) = (t_1, ..., t_F)
  – t_i = normalized time spent in function f_i on host h
  – Profiles are less sensitive to noise than raw traces
- Delta vector of two profiles: δ(g,h) = p(g) - p(h)
- Distance metric: d(g,h) = Manhattan norm of δ(g,h) (sketched below)
[Diagram: profiles p(g) and p(h) as points in the (t(foo), t(bar)) plane, with δ(g,h) the vector between them]
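A minimal sketch of these definitions, assuming each trace has already been reduced to per-function times (the function names are placeholders):

```python
def profile(times):
    """times: {function: time spent in it on this host}.
    Returns the normalized time profile p(h) = (t_1, ..., t_F)."""
    total = sum(times.values())
    return {f: t / total for f, t in times.items()}

def distance(g, h):
    """d(g,h): Manhattan norm of the delta vector delta(g,h) = p(g) - p(h)."""
    return sum(abs(g.get(f, 0.0) - h.get(f, 0.0)) for f in set(g) | set(h))

# Two hosts that split their time between foo and bar in opposite ways:
p_g = profile({"foo": 3.0, "bar": 1.0})   # (0.75, 0.25)
p_h = profile({"foo": 1.0, "bar": 3.0})   # (0.25, 0.75)
print(distance(p_g, p_h))                 # 1.0
```

Because each t_i is a share of total time, the metric compares where the time went rather than how long each host ran.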

Slide 13: Defining the Suspect Score
Common behavior = normal.
- Suspect score: σ(h) = distance to the nearest neighbor
  – Report the host with the highest σ to the analyst
  – h is in the big mass, σ(h) is low, h is normal
  – g is a single outlier, σ(g) is high, g is an anomaly
What if there is more than one anomaly?
[Diagram: hosts as points; σ(g) is large for the lone outlier g, σ(h) is small for h inside the big mass]

Slide 14: Defining the Suspect Score (continued)
- Suspect score: σ_k(h) = distance to the k-th nearest neighbor (sketched below)
  – Excludes the (k-1) closest neighbors
  – Sensitivity study: k = NumHosts/4 works well
- Represents the distance to the "big mass":
  – h is in the big mass, its k-th neighbor is close, σ_k(h) is low
  – g is an outlier, its k-th neighbor is far, σ_k(g) is high
[Diagram: computing the score using k = 2]
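Reusing distance() from the previous sketch, the score can be computed as follows; k defaults to NumHosts/4 as in the sensitivity study.

```python
def suspect_score(h, profiles, k=None):
    """sigma_k(h): distance from host h to its k-th nearest neighbor."""
    dists = sorted(distance(profiles[h], p)
                   for host, p in profiles.items() if host != h)
    if k is None:
        k = max(1, len(profiles) // 4)        # k = NumHosts/4
    return dists[min(k, len(dists)) - 1]      # skip the (k-1) closest neighbors

# The analyst is shown the host with the highest score:
# anomaly = max(profiles, key=lambda h: suspect_score(h, profiles))
```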

Slide 15: Defining the Suspect Score (continued)
Anomalous means unusual, but unusual does not always mean anomalous!
- E.g., an MPI master is different from all workers
  – It would be reported as an anomaly (a false positive)
Distinguish false positives from true anomalies:
- With knowledge of system internals: manual effort
- With previous execution history: can be automated
[Diagram: outlier g far from the big mass around h, so σ_k(g) is high]

Slide 16: Defining the Suspect Score (continued)
- Add traces from a known-normal previous run
  – One-class classification
- Suspect score σ_k(h) = distance to the k-th trial neighbor or the 1st known-normal neighbor, whichever is closer (sketched below)
- Represents the distance to the big mass or to known-normal behavior:
  – h is in the big mass, its k-th neighbor is close, σ_k(h) is low
  – g is an outlier, but a normal node n is close, so σ_k(g) is low
[Diagram: outlier g next to known-normal node n, away from the big mass around h]
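A sketch of the history-aware variant, under the assumption that "or" means taking whichever neighbor is closer, so that matching a known-normal trace can only lower a host's score:

```python
def suspect_score_with_history(h, trial, normal, k):
    """trial: {host: profile} from this run; normal: known-normal profiles.
    Score = min(distance to the k-th trial neighbor,
                distance to the nearest known-normal trace)."""
    trial_d = sorted(distance(trial[h], p)
                     for host, p in trial.items() if host != h)
    normal_d = min(distance(trial[h], p) for p in normal.values())
    return min(trial_d[min(k, len(trial_d)) - 1], normal_d)
```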

Slide 17: Finding the Anomalous Function
- Fail-stop problems
  – The failure is in the last function invoked
- Non-fail-stop problems
  – Find why host h was marked as an anomaly
  – Report the function with the highest contribution to σ(h) (sketched below):
    σ(h) = |δ(h,g)|, where g is the chosen neighbor
    anomFn = arg max_i |δ_i|
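For the non-fail-stop case the localization step is a single arg max over the delta vector; a sketch, reusing the profile dictionaries from above:

```python
def anomalous_function(p_h, p_g):
    """anomFn = arg max_i |delta_i|, where delta = p(h) - p(g) and g is the
    neighbor chosen when computing sigma(h)."""
    funcs = set(p_h) | set(p_g)
    return max(funcs, key=lambda f: abs(p_h.get(f, 0.0) - p_g.get(f, 0.0)))
```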

Slide 18: Experimental Study: SCore
- SCore: a cluster-management framework
  – Job scheduling, checkpointing, migration
  – Supports MPI, PVM, and Cluster-enabled OpenMP
- Implemented as a ring of daemons, scored
  – One daemon per host for monitoring jobs
  – Daemons exchange keep-alive patrol messages
  – If no patrol message traverses the ring in 10 minutes, sc_watch kills and restarts all daemons (sketched below)
[Diagram: sc_watch oversees the ring of scored daemons circulating patrol messages]
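An illustrative reconstruction of the sc_watch policy described above, not SCore's actual source; the polling interval and both callbacks are assumptions.

```python
import time

PATROL_TIMEOUT = 10 * 60    # the 10-minute patrol window, in seconds

def watch(last_patrol_lap, restart_all_scoreds, poll=30):
    """last_patrol_lap() -> timestamp of the most recent complete ring lap;
    restart_all_scoreds() kills and respawns every scored daemon."""
    while True:
        if time.time() - last_patrol_lap() > PATROL_TIMEOUT:
            restart_all_scoreds()   # hypothetical helper, per the slide
        time.sleep(poll)
```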

Slide 19: Debugging SCore
[Diagram: sc_watch and the ring of scored daemons exchanging patrol messages]
- Inject tracing agents into all scoreds
- Instrument sc_watch to find when the daemons are being killed
- Identify the anomalous trace
- Identify the anomalous function/call path

Slide 20: Finding the Host
- Host n129 is unusual: different from the others
- Host n129 is anomalous: not present in previous known-normal runs
- Host n129 is a new anomaly: not present in previous known-faulty runs

Slide 21: Finding the Cause
- Call chain with the highest contribution to the suspect score:
  output_job_status -> score_write_short -> score_write -> __libc_write
  – Tries to output a log message to the scbcast process
- Writes to the scbcast process kept blocking for 10 minutes
  – scbcast stopped reading data from its socket: bug!
  – scored did not handle it well (spun in an infinite loop): bug!

Slide 22: Ongoing work
- Cross process and host boundaries
  – Propagate upon communication
- Reconstruct system-wide flows
- Compare flows to identify anomalies
[Diagram: Process P and Process Q on Host A communicate over the network with Process R on Host B]

Slide 23: Ongoing work (continued)
- Propagate upon communication
  – Notice the act of communication
  – Identify the peer
  – Inject the agent into the peer
  – Trace the peer after it receives the data
- Reconstruct system-wide flows
  – Separate concurrent interleaved flows
- Compare flows
  – Identify common flows and anomalies

Slide 24: Conclusion
- Data collection: acquire call traces from all nodes
  – Self-propelled instrumentation: autonomous, dynamic, and low-overhead
- Data analysis: identify unusual traces and find what made them unusual
  – Fine-grained: identifies individual suspect functions
  – Highly accurate: reduces the rate of false positives using past history
Come see the demo!

Slide 25: Relevant Publications
- A.V. Mirgorodskiy, N. Maruyama, and B.P. Miller, "Root Cause Analysis of Failures in Large-Scale Computing Environments", submitted for publication. ftp://ftp.cs.wisc.edu/paradyn/papers/Mirgorodskiy05Root.pdf
- A.V. Mirgorodskiy and B.P. Miller, "Autonomous Analysis of Interactive Systems with Self-Propelled Instrumentation", 12th Multimedia Computing and Networking (MMCN 2005), San Jose, CA, January 2005. ftp://ftp.cs.wisc.edu/paradyn/papers/Mirgorodskiy04SelfProp.pdf