Pinpoint: Problem Determination in Large, Dynamic Internet Services Mike Chen, Emre Kıcıman, Eugene Fratkin {emrek,


Pinpoint: Problem Determination in Large, Dynamic Internet Services
Mike Chen, Emre Kıcıman, Eugene Fratkin
{emrek,
ROC Retreat, 2002/01

Motivation
• Systems are large and getting larger
  – 1000s of replicated HW/SW components used in different combinations
  – composable services further increase complexity
• Systems are dynamic
  – frequent changes: software releases, new machines
  – resources are allocated at runtime, e.g. load balancing, QoS, personalization
• Difficult to diagnose failures
  – what is really happening in the system?
  – how to tell what's different about failed requests?

Existing Techniques
• Dependency models/graphs
  – detect failures, then check all components that failed requests depend on
  – Problem: need to check all dependencies (large # of false positives); hard to generate and keep up-to-date
• Monitoring & alarm correlation
  – detect non-functioning components, which often generates alarm storms; filter alarms for root-cause analysis
  – Problem: need to instrument every component; hard to detect interaction faults and latent faults

The Pinpoint Approach
• Trace real client requests
  – record every component used in a request
  – record success/failure and performance of requests
  – can be used to build dynamic dependency graphs to visualize what is really going on
• Statistical analysis
  – search for components that "cause" failures
  – data mining techniques
• Built into middleware
  – requires no application code changes
  – application knowledge only for end-to-end failure detection
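The tracing step can be pictured with a small sketch. Everything in it is an assumption made for illustration: the `RequestTracer` class, its method names, and the idea of approximating caller-to-callee edges from consecutive components in a request path do not come from the slides, and performance recording is left out.

```python
from collections import defaultdict


class RequestTracer:
    """Hypothetical tracer; the class and method names are illustrative only."""

    def __init__(self):
        self.paths = defaultdict(list)  # request id -> components in order of use
        self.succeeded = {}             # request id -> True on success, False on failure

    def record_use(self, request_id, component):
        """Called by the middleware each time a request enters a component."""
        self.paths[request_id].append(component)

    def record_outcome(self, request_id, success):
        """Called once the end-to-end result of a request is known."""
        self.succeeded[request_id] = success

    def dependency_graph(self):
        """Approximate caller -> callee edges from consecutive components in each path."""
        edges = defaultdict(set)
        for path in self.paths.values():
            for caller, callee in zip(path, path[1:]):
                edges[caller].add(callee)
        return {caller: sorted(callees) for caller, callees in edges.items()}


# Two made-up requests: one succeeds via A1 -> B1 -> C1, one fails via A1 -> B2.
tracer = RequestTracer()
for component in ("A1", "B1", "C1"):
    tracer.record_use(1, component)
tracer.record_outcome(1, True)
for component in ("A1", "B2"):
    tracer.record_use(2, component)
tracer.record_outcome(2, False)

graph = tracer.dependency_graph()
print(graph)  # {'A1': ['B1', 'B2'], 'B1': ['C1']}
```

Because the graph is rebuilt from observed requests, it reflects whatever components were actually used at runtime rather than a hand-maintained model.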

Examples
• Identify faulty components
• Anomaly detection
[Figure: example request (Req) paths through components A1, A2, B1, B2 and C1, with X marks]

Framework
[Figure labels: Requests #1, #2, #3; Communications Layer (Tracing & Internal F/D); Components A, B, C; External F/D; Logs (trace entries "1,A 1,C 2,B .." and outcome entries "1, success 2, fail 3, ..."); Statistical Analysis; Predicted Faults]
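The two kinds of log entries on this slide (request/component pairs such as "1,A" and request/outcome pairs such as "1, success") have to be joined per request before any analysis can run. A minimal sketch of that join follows; the one-entry-per-line layout and the function name `join_logs` are assumptions, since the slide shows only the entries themselves.

```python
def join_logs(trace_lines, outcome_lines):
    """Combine per-component trace entries with per-request outcome entries.

    Returns {request_id: (set_of_components, failed_flag)} for every request
    that has a recorded outcome.
    """
    components = {}
    for line in trace_lines:
        request_id, component = (field.strip() for field in line.split(","))
        components.setdefault(int(request_id), set()).add(component)

    joined = {}
    for line in outcome_lines:
        request_id, status = (field.strip() for field in line.split(","))
        request_id = int(request_id)
        joined[request_id] = (components.get(request_id, set()), status == "fail")
    return joined


# Entries taken from the slide; request 3 is omitted because its outcome is elided there.
trace_log = ["1,A", "1,C", "2,B"]
outcome_log = ["1, success", "2, fail"]
table = join_logs(trace_log, outcome_log)
```

Here `table` maps request 1 to components A and C with no failure, and request 2 to component B with a failure, which is the per-request view a statistical analysis stage would consume.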

Prototype Implementation
• Built on top of J2EE platform
  – Sun J2EE 1.2 single-node reference implementation
  – added logging of Beans, JSP, & JSP tags
  – detect exceptions thrown out of components
  – required no application code changes
• Layer 7 network sniffer in Java
  – TCP timeouts, HTTP errors, malformed HTML
• PolyAnalyst statistical analysis
  – bucket analysis & dependency discovery
  – offline analysis

Experimental Setup
• Demo app: J2EE Pet Store
  – e-commerce site w/ ~30 components
• Load generator
  – replay trace of browsing
  – approx. TPC-W WIPSo load (~50% ordering)
• Fault injection
  – 6 components, 2 from each tier
  – single-component faults and interaction faults
  – includes exceptions, infinite loops, null calls
• 55 tests, 5 min runs
  – performance overhead of tracing/logging: 5%

Observations about the PetStore App
• large # of components used in a dynamic page request: median 14, min 6, max 23
• large sets of tightly coupled components that are always used together
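Components that always appear in exactly the same requests cannot be told apart from request-level traces alone, so it can help to find such sets up front. The sketch below shows one way to do that; the function name `coupled_groups` and the sample traces are made up for illustration and are not PetStore data.

```python
from collections import defaultdict


def coupled_groups(traces):
    """Group components that appear in exactly the same set of requests."""
    requests_using = defaultdict(set)  # component -> ids of requests that used it
    for request_id, components in traces.items():
        for component in components:
            requests_using[component].add(request_id)

    groups = defaultdict(list)  # identical request set -> components sharing it
    for component, request_ids in requests_using.items():
        groups[frozenset(request_ids)].append(component)

    # Only groups with more than one member are "always used together".
    return sorted(sorted(group) for group in groups.values() if len(group) > 1)


# Made-up traces: A and B are used by requests 1 and 2 and by no others.
traces = {1: {"A", "B", "C"}, 2: {"A", "B"}, 3: {"C", "D"}}
groups = coupled_groups(traces)
print(groups)  # [['A', 'B']]
```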

Metrics
• Precision: identified/predicted (C/P)
• Recall: identified/actual (C/A)
• Accuracy: whether all actual faults are correctly identified (recall == 100%)
  – boolean measure
[Figure labels: Predicted Faults (P), Actual Faults (A), Correctly Identified Faults (C)]
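These three definitions translate directly into set arithmetic. The sketch below assumes that faults are identified by component name and picks its own conventions for the empty cases (precision 0.0 when nothing is predicted, recall 1.0 when there are no actual faults); the slides do not say how those cases were handled.

```python
def evaluate(predicted, actual):
    """Return (precision, recall, accurate) for one test run."""
    predicted, actual = set(predicted), set(actual)
    correct = predicted & actual  # C: correctly identified faults

    # Assumed conventions for empty sets; not specified in the slides.
    precision = len(correct) / len(predicted) if predicted else 0.0  # C/P
    recall = len(correct) / len(actual) if actual else 1.0           # C/A
    accurate = recall == 1.0  # boolean: every actual fault was identified
    return precision, recall, accurate


# Made-up run: four components predicted faulty, two actually faulty.
precision, recall, accurate = evaluate({"A", "B", "C", "D"}, {"A", "B"})
print(precision, recall, accurate)  # 0.5 1.0 True
```

The example shows why accuracy and precision are reported separately: a technique that blames many components can be accurate while still leaving a lot of checking to a human.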

4 Analysis Techniques
• Pinpoint: clusters of components that statistically correlate with failures
• Detection: components where Java exceptions were detected
  – union across all failed requests
  – similar to what an event monitoring system outputs
• Intersection: intersection of components used in failed requests
• Union: union of all components used in failed requests
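The Union and Intersection baselines are plain set operations over the components of failed requests, sketched below. The third function is only a stand-in for the statistical step: it scores each component by the Jaccard similarity between the requests that used it and the requests that failed, which is an assumption chosen for illustration and not necessarily the analysis the prototype ran. Detection is left out because it needs per-component exception data. All names and sample data are hypothetical.

```python
def union_of_failed(traces, failed):
    """Union: every component used by at least one failed request."""
    return set().union(*(traces[request_id] for request_id in failed))


def intersection_of_failed(traces, failed):
    """Intersection: components used by every failed request."""
    sets = [traces[request_id] for request_id in failed]
    return set.intersection(*sets) if sets else set()


def jaccard_scores(traces, failed):
    """Stand-in correlation score per component (assumed, not from the slides)."""
    failed = set(failed)
    users = {}  # component -> ids of requests that used it
    for request_id, components in traces.items():
        for component in components:
            users.setdefault(component, set()).add(request_id)
    return {
        component: len(request_ids & failed) / len(request_ids | failed)
        for component, request_ids in users.items()
    }


# Made-up traces: requests 2 and 3 failed, and both used component B.
traces = {1: {"A", "C"}, 2: {"A", "B"}, 3: {"B", "C"}, 4: {"A", "C"}}
failed = {2, 3}

print(sorted(union_of_failed(traces, failed)))         # ['A', 'B', 'C']
print(sorted(intersection_of_failed(traces, failed)))  # ['B']
scores = jaccard_scores(traces, failed)
print(scores["B"], scores["A"])                        # 1.0 0.25
```

In this toy data Union blames every component, Intersection happens to isolate B, and the score ranks B well above A and C. Intersection stops working once two independent faults are present, because no single component then appears in every failed request.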

Results: Accuracy/Precision vs Technique
• Pinpoint has high accuracy with relatively high precision

Prototype Limitations
• Assumptions
  – client requests provide good coverage over components and combinations
  – requests are autonomous (don't corrupt state and cause later requests to fail); however, dependency graphs are useful to identify shared state
• Currently can't detect the following:
  – faults that only degrade performance
  – faults due to pathological inputs (help programmers debug by recording and replaying failed requests)

Future Work
• Visualization of dynamic dependency
  – at various granularity: components, machine, tier
• Capture additional differentiating factors
  – URLs, cookies, DB tables
  – helps to identify independent faults
• Study the effects of transient failures
• Performance analysis
• Online statistical analysis
• Looking for real systems with real applications
  – Oceanstore? WebLogic/WebSphere?

Conclusions
• Dynamic tracing and statistical analysis give improvements in accuracy & precision
  – handles dynamic configurations well
  – without requiring application code changes
  – reduces human work in large systems
  – but needs good coverage of combinations and autonomous requests

Thank you
• Acknowledgements: Aaron Brown, Eric Brewer, Armando Fox, George Candea, Kim Keeton, and Dave Patterson

Backup slides

Results: Accuracy under Interaction Faults

Problem Determination
• Analogy: trying to locate a car accident on Golden Gate Bridge
  – on a foggy day
  – using a toy model
  – on a clear day