Pinpoint: Problem Determination in Large, Dynamic Internet Services Mike Chen, Emre Kıcıman, Eugene Fratkin {emrek,

Slides:



Advertisements
Similar presentations
1 VLDB 2006, Seoul Mapping a Moving Landscape by Mining Mountains of Logs Automated Generation of a Dependency Model for HUG’s Clinical System Mirko Steinle,
Advertisements

Apache Struts Technology
Ira Cohen, Jeffrey S. Chase et al.
VanarSena: Automated App Testing. App Testing Test the app for – performance problems – crashes Testing app in the cloud – Upload app to a service – App.
Pinpoint: Problem Determination in Large, Dynamic Internet Services Mike Chen, Emre Kıcıman, Eugene Fratkin {emrek,
CSE 598B: Self-* Systems Path Based Failure and Evolution Management Mike Y. Chen, Anthony Accardi, Emre Kiciman, Jim Lloyd, Dave Patterson, Armando Fox,
Approaches to EJB Replication. Overview J2EE architecture –EJB, components, services Replication –Clustering, container, application Conclusions –Advantages.
ManageEngine TM Applications Manager 8 Monitoring Custom Applications.
Notes to the presenter. I would like to thank Jim Waldo, Jon Bostrom, and Dennis Govoni. They helped me put this presentation together for the field.
DT211/3 Internet Application Development JSP: Processing User input.
Self-Correlating Predictive Information Tracking for Large-Scale Production Systems Zhao, Tan, Gong, Gu, Wambolt Presented by: Andrew Hahn.
Presenter : Shih-Tung Huang Tsung-Cheng Lin Kuan-Fu Kuo 2015/6/15 EICE team Model-Level Debugging of Embedded Real-Time Systems Wolfgang Haberl, Markus.
An Overview of Database Access on the Web An Overview of Database Access on the Web Using ASP and Microsoft Database Technology Sheffield Hallam University.
1 Software Testing and Quality Assurance Lecture 30 - Introduction to Software Testing.
Usability Test by Knowing User’s Every Move - Bharat chaitanya.
Introspective Replica Management Yan Chen, Hakim Weatherspoon, and Dennis Geels Our project developed and evaluated a replica management algorithm suitable.
Leveraging User Interactions for In-Depth Testing of Web Applications Sean McAllister, Engin Kirda, and Christopher Kruegel RAID ’08 1 Seoyeon Kang November.
1 Automatic Request Categorization in Internet Services Abhishek B. Sharma (USC) Collaborators: Ranjita Bhagwan (MSR, India) Monojit Choudhury (MSR, India)
Winter Retreat Connecting the Dots: Using Runtime Paths for Macro Analysis Mike Chen, Emre Kıcıman, Anthony Accardi, Armando Fox, Eric Brewer
SIMULATING ERRORS IN WEB SERVICES International Journal of Simulation: Systems, Sciences and Technology 2004 Nik Looker, Malcolm Munro and Jie Xu.
Achieving self-healing in service delivery software systems by means of case- based reasoning Stefania Montani Cosimo Anglano Presented by Tony Schneider.
Towards Autonomic Hosting of Multi-tier Internet Services Swaminathan Sivasubramanian, Guillaume Pierre and Maarten van Steen Vrije Universiteit, Amsterdam,
Understanding and Managing WebSphere V5
Presenter: Chi-Hung Lu 1. Problems Distributed applications are hard to validate Distribution of application state across many distinct execution environments.
Microsoft ® Official Course Monitoring and Troubleshooting Custom SharePoint Solutions SharePoint Practice Microsoft SharePoint 2013.
UNIT-V The MVC architecture and Struts Framework.
Software Testing Verification and validation planning Software inspections Software Inspection vs. Testing Automated static analysis Cleanroom software.
CSCI 6962: Server-side Design and Programming Course Introduction and Overview.
Christopher Jeffers August 2012
CWIC Developers Meeting January 29 th 2014 Calin Duma Service Level Agreements High-Availability, Reliability and Performance.
IMPROUVEMENT OF COMPUTER NETWORKS SECURITY BY USING FAULT TOLERANT CLUSTERS Prof. S ERB AUREL Ph. D. Prof. PATRICIU VICTOR-VALERIU Ph. D. Military Technical.
ACME: a platform for benchmarking distributed applications David Oppenheimer, Vitaliy Vatkovskiy, and David Patterson ROC Retreat 12 Jan 2003.
Scalable Analysis of Distributed Workflow Traces Daniel K. Gunter and Brian Tierney Distributed Systems Department Lawrence Berkeley National Laboratory.
Bug Localization with Machine Learning Techniques Wujie Zheng
XMPP Concrete Implementation Updates: 1. Why XMPP 2 »XMPP protocol provides capabilities that allows realization of the NHIN Direct. Simple – Built on.
Contents 1.Introduction, architecture 2.Live demonstration 3.Extensibility.
1 Some initial Design suggestions… Getting started… where to begin? Find out whether your design architecture will work… as soon as possible. If you need.
OWASP Top Ten #1 Unvalidated Input. Agenda What is the OWASP Top 10? Where can I find it? What is Unvalidated Input? What environments are effected? How.
AjaxScope & Doloto: Towards Optimizing Client-side Web 2.0 App Performance Ben Livshits Microsoft Research (joint work with Emre.
©Ian Sommerville 2004Software Engineering, 7th edition. Chapter 22 Slide 1 Software Verification, Validation and Testing.
Software Testing Reference: Software Engineering, Ian Sommerville, 6 th edition, Chapter 20.
1 MSCS 237 Overview of web technologies (A specific type of distributed systems)
Java server pages. A JSP file basically contains HTML, but with embedded JSP tags with snippets of Java code inside them. A JSP file basically contains.
CSC 480 Software Engineering Lecture 15 Oct 21, 2002.
DynaRIA: a Tool for Ajax Web Application Comprehension Dipartimento di Informatica e Sistemistica University of Naples “Federico II”, Italy Domenico Amalfitano.
Accommodating Bursts in Distributed Stream Processing Systems Yannis Drougas, ESRI Vana Kalogeraki, AUEB
© 2006, National Research Council Canada © 2006, IBM Corporation Solving performance issues in OTS-based systems Erik Putrycz Software Engineering Group.
Recording Actor Provenance in Scientific Workflows Ian Wootten, Shrija Rajbhandari, Omer Rana Cardiff University, UK.
ROC Retreat 1/14/2003 Connecting the Dots: Using Runtime Paths for Macro Analysis Mike Chen
1 BBN Technologies Quality Objects (QuO): Adaptive Management and Control Middleware for End-to-End QoS Craig Rodrigues, Joseph P. Loyall, Richard E. Schantz.
© FPT SOFTWARE – TRAINING MATERIAL – Internal use 04e-BM/NS/HDCV/FSOFT v2/3 JSP Application Models.
Recovery-Oriented Computing Detecting and Diagnosing Application-Level Failures in Internet Services Emre Kıcıman and Armando Fox {emrek,
Progress Report Armando Fox with George Candea, James Cutler, Ben Ling, Andy Huang.
Development of a QoE Model Himadeepa Karlapudi 03/07/03.
A Binary Agent Technology for COTS Software Integrity Anant Agarwal Richard Schooler InCert Software.
Detecting, Managing, and Diagnosing Failures with FUSE John Dunagan, Juhan Lee (MSN), Alec Wolman WIP.
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
A Binary Agent Technology for COTS Software Integrity Anant Agarwal Richard Schooler.
Event Management. EMU Graham Heyes April Overview Background Requirements Solution Status.
Maikel Leemans Wil M.P. van der Aalst. Process Mining in Software Systems 2 System under Study (SUS) Functional perspective Focus: User requests Functional.
IV&VS Capabilities. 2 V IRTUAL USER GENERATOR 3 V IRTUAL U SER T ECHNOLOGY AND ADVANTAGES  Simulates a real user  Requires less resources – machines.
UC Marco Vieira University of Coimbra
SOFTWARE TESTING TRAINING TOOLS SUPPORT FOR SOFTWARE TESTING Chapter 6 immaculateres 1.
Web Application Vulnerabilities, Detection Mechanisms, and Defenses
Software Architecture in Practice
Noah Treuhaft UC Berkeley ROC Group ROC Retreat, January 2002
Chapter 18 Software Testing Strategies
Evaluating Transaction System Performance
RM3G: Next Generation Recovery Manager
Web Application Server 2001/3/27 Kang, Seungwoo. Web Application Server A class of middleware Speeding application development Strategic platform for.
Presentation transcript:

Pinpoint: Problem Determination in Large, Dynamic Internet Services Mike Chen, Emre Kıcıman, Eugene Fratkin {emrek, ROC Retreat, 2002/01

Motivation  Systems are large and getting larger –1000’s of replicated HW/SW components –used in different combinations  Systems are dynamic –resources are allocated at runtime –e.g. load balancing, personalization  Difficult to diagnose failures –how to tell what’s different about failed requests?

Current Techniques  Dependency Models –Detect failures, check all components that failed requests depend on –Problem: Need to check all dependencies Hard to generate and keep up-to-date  Monitoring & Alarm Correlation –Detect non-functioning components and often generates alarm storms filter alarms for root-cause analysis –Problem: need to instrument every component hard to detect interaction faults

The Pinpoint Approach  Trace many real client requests –Record every component used in a request –Detect success/failure of requests –Can be used as dynamic dependency graphs  Statistical Correlation –Search for components that “cause” failures  Built into middleware –Requires no application code changes –Application knowledge only for end-to-end failure detection

Framework Communications Layer (Tracing & Internal F/D) A B C Components #1 Requests External F/D #2 #3 Statistical Analysis Detected Faults 1,A 1,C 2,B.. 1, success 2, fail 3,... Logs

Prototype Implementation  Built on top of J2EE platform –Sun J2EE 1.2 single-node reference impl. –Added logging of Beans, JSP, & JSP tags –Detect exceptions thrown out of components –Required no application code changes  Layer 7 network sniffer –TCP timeouts, malformed HTML, app-level string searches  PolyAnalyst statistical analysis –Bucket analysis & dependency discovery

Experimental Setup  Demo app: J2EE Pet Store –e-commerce site w/~30 components  Load generator –replay trace of browsing –Approx. TPCW WIPSo load (~50% ordering)  Fault injection parameters –Trigger faults based on combinations of used components –Inject exceptions, infinite loops, null calls  55 tests with single-components faults and interaction faults –5-min runs of a single client (J2EE server limitation)

Application Observations  # of components used in a dynamic request: median 14, min 6, max 23  large number of tightly coupled components that are always used together

Metrics  Precision: C/P  Recall: C/A  Accuracy: whether all actual faults are correctly identified (recall == 100%) –boolean measure Predicted Faults (P) Actual Faults (A) Correctly Identified Faults (C)

4 Analysis Techniques  Pinpoint: clusters of components that statistically correlate with failures  Detection: components where Java exceptions were detected –union across all failed requests –similar to what an event monitoring system outputs  Intersection: intersection of components used in failed requests  Union: union of all components used in failed requests

Results  Pinpoint has high accuracy with relatively high precision

Pinpoint Prototype Limitations  Assumptions –client requests provide good coverage over components and combinations –requests are autonomous (don’t corrupt state and cause later requests to fail)  Currently can’t detect the following: –faults that only degrade performance –faults due to pathological inputs

Conclusions  Dynamic tracing and statistical analysis give improvements in accuracy & precision –Handles dynamic configurations well –Without requiring application code changes –Reduces human work in large systems –But, need good coverage of combinations and autonomous requests  Future Work –Test with real distributed systems with real applications Oceanstore? WebLogic/WebSphere? –Capture additional differentiating factors –Visualization of dynamic dependency –Performance analysis –Online data analysis