Connecting the Dots: Using Runtime Paths for Macro Analysis
Mike Chen
ROC Retreat, 1/14/2003


Motivation
• Divide and conquer, layering, and replication are fundamental design principles
  – e.g. Internet systems, P2P systems, and sensor networks
• Execution context is dispersed throughout the system => difficult to monitor and debug
• Many existing low-level tools help with debugging individual components, but not a collection of them
  – Much of the system is in how the components are put together
• Observation: a widening gap between the systems we are building and the tools we have

Current Approach
[Diagram: a clustered Internet service (Apache, Java Beans, database) shown alongside peer-to-peer and sensor networks]

Current Approach
• Micro analysis tools, like code-level debuggers (e.g. gdb) and application logs, offer details of each individual component
• Scenario:
  – A user reports that request A1 failed
  – You try the same request, A2, but it works fine
  – What do you do next?
[Diagram: requests A1 and A2 flowing through Apache, Java Beans, and a database, with gdb showing different per-component values of X and Y]

Macro Analysis
• Macro analysis exploits non-local context to improve reliability and performance
  – Performance examples: Scout, ILP, Magpie
• A statistical view is essential for large, complex systems
• Analogy: micro analysis lets you understand the details of an individual honeybee; macro analysis is needed to understand how the bees interact to keep the beehive functioning

Observation
• Systems have a single system-wide execution path associated with each request
  – e.g. request/response, one-way messages
• Scout, SEDA, and Ninja use paths to specify how to service requests
• Our philosophy:
  – Use only dynamic, observed behavior
  – Application-independent techniques

Our Approach
• Use runtime paths to connect the dots!
  – Dynamically capture the interactions and dependencies between components
  – Look across many requests to get the overall system behavior, which is more robust to noise
• Components are only partially known (“gray boxes”)
[Diagram: a path traced through Apache, Java Beans, and a database]

Our Approach
• Applicable to a wide range of systems
[Diagram: paths through a clustered Internet system (Apache, Java Beans, database), a peer-to-peer overlay (nodes A–G), and a sensor network]

Open Challenges in Systems Today
1. Deducing system structure
  – The manual approach is error-prone
  – Static analysis doesn’t consider resources
2. Detecting application-level failures
  – They often don’t exhibit lower-level symptoms
3. Diagnosing failures
  – Failures may manifest far from the actual faults
  – Multi-component faults
• Goal: reduce time to detection, recovery, diagnosis, and repair

Talk Outline
• Motivation
• Model and architecture
• Applying macro analysis
• Future directions

Runtime Paths
• Instrument code to dynamically trace requests through a system at the component level
  – Record the call path plus its runtime properties
  – e.g. the components, latency, success/failure, and resources used to service each request
• Use statistical analysis to detect and diagnose problems
  – e.g. data mining, machine learning, etc.
• Runtime analysis tells you how the system is actually being used, not how it might be used
• Complements existing micro analysis tools

Reusable Analysis Framework Architecture
• Tracer
  – Tags each request with a unique ID and carries it with the request throughout the system
  – Reports observations (component name + resources + performance properties) for each component
• Aggregator + Repository
  – Reconstructs paths and stores them
• Declarative Query Engine
  – Supports statistical queries on paths
  – Data mining and machine learning routines
• Visualization
[Diagram: a request flows through components A–F; the Tracer feeds the Aggregator, which fills the Path Repository queried by the Query Engine and Visualization used by developers/operators]

Request Tracing
• Challenge: maintaining an ID with each request throughout the system
• Tracing is platform-specific, but can be application-generic and reusable across applications
• Two classes of techniques (see the sketch after this list):
  – Intra-thread tracing
    Use per-thread context to store the request ID (e.g. ThreadLocal in Java)
    The ID is preserved as long as the same thread is used to service the request
  – Inter-thread tracing
    For extensible protocols like HTTP, inject new headers that will be preserved (e.g. REQ_ID: xx)
    Modify RPC to pass the request ID under the covers
    Piggyback the ID onto messages
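
As a rough illustration of both techniques, the sketch below stores the request ID in a Java ThreadLocal (intra-thread) and piggybacks it onto an outgoing HTTP call as a header (inter-thread). The class name RequestContext and the header name X-Request-Id are hypothetical, not part of the framework described here.

    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.UUID;

    /** Hypothetical per-thread request-ID context for request tracing. */
    public final class RequestContext {
        private static final ThreadLocal<String> REQUEST_ID = new ThreadLocal<>();

        /** Called once when a request enters the system (e.g. in the web tier). */
        public static void begin() {
            REQUEST_ID.set(UUID.randomUUID().toString());
        }

        public static String currentId() {
            return REQUEST_ID.get();
        }

        /** Inter-thread tracing: piggyback the ID onto an outgoing HTTP call. */
        public static void tag(HttpURLConnection conn) {
            conn.setRequestProperty("X-Request-Id", currentId());
        }

        public static void main(String[] args) throws Exception {
            begin();
            HttpURLConnection conn =
                (HttpURLConnection) new URL("http://example.com/").openConnection();
            tag(conn);  // a downstream component reads the header and re-enters its own context
            System.out.println("traced request " + currentId());
        }
    }

A downstream tier would read X-Request-Id from the incoming request and call its own begin()-style setter, so every observation it reports carries the same ID.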

Talk Outline
• Motivation
• Model and architecture
• Applying macro analysis
  – Inferring system structure
  – Detecting application-level failures
  – Diagnosing failures
• Future directions

Inferring System Structure
• Key idea: paths directly capture application structure
[Diagram: the component graph reconstructed from 2 requests]

Indirect Coupling of Requests
• Key idea: paths associate requests with internal state
• Trace requests from the web server to the database
  – Parse client-side SQL queries to find sharing of database tables (a sketch follows below)
  – Straightforward to extend to more fine-grained state (e.g. rows)
[Diagram: a bipartite mapping from request types to database tables]
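
To make the table-sharing idea concrete, here is a minimal sketch of client-side query parsing: a regular expression pulls table names out of simple SELECT/INSERT/UPDATE statements so that request types touching the same table can be linked. A real implementation would need to handle joins, subqueries, and aliases; this regex is an illustrative simplification.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    /** Illustrative extraction of table names from simple SQL statements. */
    public final class SqlTableExtractor {
        // Matches "FROM t", "UPDATE t", or "INSERT INTO t" in simple queries.
        private static final Pattern TABLE =
            Pattern.compile("(?i)\\b(?:from|update|insert\\s+into)\\s+(\\w+)");

        public static List<String> tables(String sql) {
            List<String> result = new ArrayList<>();
            Matcher m = TABLE.matcher(sql);
            while (m.find()) {
                result.add(m.group(1).toLowerCase());
            }
            return result;
        }

        public static void main(String[] args) {
            // Two request types indirectly coupled through the "orders" table.
            System.out.println(tables("SELECT * FROM orders WHERE id = 7"));
            System.out.println(tables("INSERT INTO orders VALUES (7, 'pending')"));
        }
    }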

Failure Detection and Diagnosis
• Detecting application-level failures
  – Key idea: paths change under failures => detect failures via path changes
• Diagnosing failures
  – Key idea: bad paths touch the root cause(s). Find their common features.

Future Directions
• Key idea: violations of macro invariants are signs of a buggy implementation or an intrusion
• Message paths in P2P and sensor networks
  – A general mechanism to provide visibility into the collective behavior of multiple nodes
  – Micro or static approaches by themselves don’t work well in dynamic, distributed settings
  – e.g. routing algorithms have upper bounds on the number of hops
    Although a hop-count violation can be detected locally, paths help identify the nodes that route messages incorrectly
  – e.g. detecting nodes that are slow or that corrupt messages

Conclusion
• Macro analysis fills the need when monitoring and debugging systems where local context alone is insufficient
• The runtime path-based approach dynamically traces request paths and statistically infers macro properties
• A shared analysis framework is reusable across many systems
  – Simplifies the construction of effective tools for other systems, and integration with recovery techniques like RR
• The paper includes a commercial example from Tellme! (thanks to Anthony Accardi and Mark Verber)

Backup Slides


Current Approach
• Micro analysis tools, like code-level debuggers (e.g. gdb) and application logs, offer details of each individual component
[Diagram: gdb attached to a single component, while each Java Bean holds its own X and Y values scattered across the system]

Related Work
• Commercial request-tracing systems
  – Announced in 2002, a few months after Pinpoint was developed
  – PerformaSure and AppAssure focus on performance problems
  – IntegriTea captures and plays back failure conditions
  – These focus on individual requests rather than overall behavior, and on recreating the failure condition
• Extensive work on event/alarm correlation, mostly in the context of network management (i.e. IP)
  – Doesn’t directly capture the relationships between events
  – Relies on human knowledge, or uses machine learning to suppress alarms
• Distributed debuggers
  – PDT, P2D2, TotalView, PRISM, pdbx
  – These aggregate views from multiple components, but do not capture the relationships and interactions between components
  – Comparative debuggers: Wizard, GUARD
• Dependency models
  – Most are statically generated and are likely to be inconsistent
  – Brown et al. take an active, black-box approach, but it is invasive; Candea et al. dynamically trace failure propagation

Detecting Failures using Anomaly Detection
• Key idea: paths change under failures => detect failures via path changes
• Anomalies:
  – Unusual paths
  – Changes in distribution
  – Changes in latency/response time
• Examples:
  – Error paths are shorter
  – User behavior changes under failures: retry a few times, then give up
• Implement as long-running queries (i.e. diff); see the sketch after this list
• Challenges:
  – Detecting application-level failures
  – Comparing sets of paths
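
A minimal sketch of the diff idea, assuming each path is summarized by its component sequence: compare the shape distribution of a baseline window against the current window and flag a large shift. The total-variation distance and the 0.3 threshold are illustrative choices, not the technique prescribed by this work.

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    /** Illustrative path-distribution diff for anomaly detection. */
    public final class PathDiff {
        /** Relative frequency of each path shape (component sequence). */
        static Map<String, Double> distribution(List<String> shapes) {
            Map<String, Double> freq = new HashMap<>();
            for (String s : shapes) {
                freq.merge(s, 1.0 / shapes.size(), Double::sum);
            }
            return freq;
        }

        /** Total-variation distance between two shape distributions. */
        static double distance(Map<String, Double> p, Map<String, Double> q) {
            Set<String> keys = new HashSet<>(p.keySet());
            keys.addAll(q.keySet());
            double d = 0;
            for (String k : keys) {
                d += Math.abs(p.getOrDefault(k, 0.0) - q.getOrDefault(k, 0.0));
            }
            return d / 2;
        }

        public static void main(String[] args) {
            List<String> baseline = List.of("A>B>C", "A>B>C", "A>B>D", "A>B>C");
            List<String> current  = List.of("A>B", "A>B", "A>B>C", "A>B");  // error paths are shorter
            double d = distance(distribution(baseline), distribution(current));
            System.out.println(d > 0.3 ? "anomaly: " + d : "normal: " + d);  // 0.3 is an arbitrary threshold
        }
    }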

Root-cause Analysis
• Key idea: all bad paths touch the root cause; find their common features
• Challenge: a small set of known-bad paths and a large set of maybes
• Ideally, we want to correlate and rank all combinations of feature sets
  – e.g. association-rule mining
  – May produce false alarms, because the root cause may not be one of the recorded features
• Automatic generation of dynamic functional and state dependency graphs
  – Helps developers and operators understand inter-component and inter-request dependencies
  – Input to recovery algorithms that use dependency graphs

Verifying Macro Invariants
• Key idea: violations of high-level invariants are signs of intrusion or bugs
• Example: peer auditing
  – Problem: a small number of faulty or malicious nodes can bring down the system
  – Corruption should be statistically visible in a peer’s behavior: look for nodes that delay or corrupt messages, or that route messages incorrectly
  – Apply root-cause analysis to locate the misbehaving peers; some distributed auditing is necessary
• Example: P2P implementation verification
  – Problem: are messages delivered as specified by the algorithms?
  – Detect extra hops and loops, and verify that the paths are correct
  – Can implement as a query:
    select length from paths where (length > log2(N))

Detecting Single Points of Failure
• Key idea: paths converge on a single point of failure
• Useful for finding out what to replicate to improve availability
• P2P example:
  – Many P2P systems rely on overlay networks, which typically are networks built on top of the IP infrastructure
  – It’s common for several overlay links to fail together when they depend on a shared physical IP link that failed
• Implement as a query:
  intersect edge.IP_links from paths
[Diagram: overlay paths through peers A–G converging on node D]

Monitoring of Sensor Networks
• An emerging area with primitive tools
• Key idea: use paths to reconstruct topology and membership
• Examples:
  – Membership: select unique node from paths
  – Network topology, for directed information dissemination
• Challenge: limited bandwidth
  – Can record a (random) subset of the nodes on each path, then statistically reconstruct the paths

Macro Analysis
• Look across many requests to get the overall system behavior
  – More robust to noise
[Table: which of four requests used each component, e.g. component A appears in three requests, component B in one, and component C in two]

Properties of Network Systems
• Web services, P2P systems, and sensor networks can have tens of thousands of nodes, each running many application components
• Continuous adaptation provides high availability, but also makes it difficult to reproduce and debug errors
• Constant evolution of software and hardware

Motivation
• Network systems are difficult to understand and debug
  – e.g. clustered Internet systems, P2P systems, and sensor networks
  – Composed of many components
  – Systems are becoming larger, more dynamic, and more distributed
• Workload is unpredictable and impractical to simulate
  – Unit testing is necessary but insufficient: components break when used together under real workload
• We don’t have tools that capture the interactions between components and the overall behavior
  – Existing debugging tools and application-level logs only do micro analysis

Macro vs Micro Analysis

                Macro Analysis                      Micro Analysis
  Resolution    Component; complements micro        Line or variable
                analysis tools
  Overhead      Low; can be used in actual          High; typically not used in
                deployment                          deployment, other than
                                                    application logs

What’s a dynamic path?
• A dynamic path is the control flow plus the runtime properties of a request
  – Think of it as a stack trace across process/machine boundaries, with runtime properties
  – Dynamically constructed by tracing requests through a system
• Runtime properties:
  – Resources (e.g. host, version)
  – Performance properties (e.g. latency)
  – Arguments (e.g. URL, args, SQL statement)
  – Success/failure
[Diagram: a request’s path through components A–F, with one observation record per hop: RequestID: 1, Seq Num: 1, Name: A, Host: xx, Latency: 10ms, Success: true, …]
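
The per-hop record on the slide maps naturally onto a small data type. The sketch below is one plausible shape for such an observation, with field names taken from the slide; the class itself is hypothetical. An aggregator can then reconstruct a path by grouping observations on requestId and sorting by seqNum.

    /** One per-component observation along a dynamic path (illustrative). */
    public final class PathObservation {
        final long requestId;      // unique ID carried with the request
        final int seqNum;          // position of this hop along the path
        final String component;    // e.g. "A", or a Java Bean name
        final String host;         // resource: which machine served this hop
        final long latencyMillis;  // performance property
        final boolean success;

        PathObservation(long requestId, int seqNum, String component,
                        String host, long latencyMillis, boolean success) {
            this.requestId = requestId;
            this.seqNum = seqNum;
            this.component = component;
            this.host = host;
            this.latencyMillis = latencyMillis;
            this.success = success;
        }
    }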

Related Work
• Micro debugging tools
  – RootCause provides extensible logging of method calls and arguments
  – DIDUCE looks for inconsistencies in variable usage
  – These complement macro analysis tools
• Languages for monitoring
  – InfoSpect looks for inconsistencies in system state using a logic language
• Network flow-based monitoring
  – RTFM and Cisco NetFlow classify and record network flows
• Statistical and data mining languages
  – S, DMQL, WebML

Visualization Techniques
• Tainted paths: mark all flows that have a certain property (e.g. failed or slow) with a distinct color and overlay them on the graph
• Detecting performance bottlenecks: look for replicated nodes that have different colors
• Detecting anomalies: look for missing edges and unknown paths

Pinpoint Framework
[Diagram: requests #1–#3 enter components A–C over a communications layer that performs tracing and internal failure detection; per-request component logs (e.g. request 1 used A and C) and external failure-detection verdicts (e.g. request 1 succeeded, request 2 failed) feed statistical analysis, which outputs the detected faults]

Experimental Setup
• Demo app: J2EE Pet Store
  – An e-commerce site with ~30 components
• Load generator
  – Replays a trace of browsing
  – Approximates the TPC-W WIPSo load (~50% ordering)
• Fault-injection parameters
  – Trigger faults based on combinations of used components
  – Inject exceptions, infinite loops, and null calls
• 55 tests with single-component faults and interaction faults
  – 5-minute runs of a single client (a J2EE server limitation)

Application Observations
• Number of components used in a dynamic web page request:
  – Median 14, min 6, max 23
• A large number of tightly coupled components are always used together

Metrics
• Precision: C/P
• Recall: C/A
• Accuracy: whether all actual faults are correctly identified (recall == 100%)
  – A boolean measure
[Diagram: Venn diagram of predicted faults (P), actual faults (A), and correctly identified faults (C)]
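
For concreteness, a minimal sketch of the three metrics over sets of component names (the fault sets and class name here are made up for illustration):

    import java.util.HashSet;
    import java.util.Set;

    /** Illustrative computation of precision, recall, and accuracy. */
    public final class Metrics {
        public static void main(String[] args) {
            Set<String> predicted = Set.of("A", "B", "C");     // P
            Set<String> actual    = Set.of("B", "C", "D");     // A
            Set<String> correct   = new HashSet<>(predicted);  // C = P intersect A
            correct.retainAll(actual);

            double precision = (double) correct.size() / predicted.size();  // C/P
            double recall    = (double) correct.size() / actual.size();     // C/A
            boolean accurate = recall == 1.0;  // all actual faults identified

            System.out.printf("precision=%.2f recall=%.2f accurate=%b%n",
                              precision, recall, accurate);
        }
    }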

Analysis Techniques
• Pinpoint: clusters of components that statistically correlate with failures
• Detection: components where Java exceptions were detected
  – Union across all failed requests
  – Similar to what an event-monitoring system outputs
• Intersection: intersection of the components used in failed requests
• Union: union of all components used in failed requests
(the intersection and union baselines are sketched below)
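
A minimal sketch of the two baseline techniques, assuming each failed request is summarized by the set of components it used (the component names here are made up):

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    /** Illustrative intersection/union baselines over failed requests. */
    public final class Baselines {
        public static void main(String[] args) {
            List<Set<String>> failed = List.of(
                Set.of("Apache", "CartBean", "DB"),
                Set.of("Apache", "CartBean", "CatalogBean", "DB"));

            Set<String> intersection = new HashSet<>(failed.get(0));
            Set<String> union = new HashSet<>();
            for (Set<String> components : failed) {
                intersection.retainAll(components);  // components common to all failures
                union.addAll(components);            // components in any failure
            }
            System.out.println("intersection: " + intersection);  // fewer candidates, may miss faults
            System.out.println("union: " + union);                // catches everything, many false positives
        }
    }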

Results
• Pinpoint has high accuracy with relatively high precision

Pinpoint Prototype Limitations
• Assumptions:
  – Client requests provide good coverage over components and combinations
  – Requests are autonomous (they don’t corrupt state and cause later requests to fail)
• Currently can’t detect the following:
  – Faults that only degrade performance
  – Faults due to pathological inputs
• Single-node only

Current Status
• Simple graph visualization

Proposed Research
• Three classes of large network systems:
  – Clustered Internet systems: tiered architecture, high bandwidth, many replicas
  – Peer-to-peer (P2P) systems, including sensor networks: widely distributed nodes, dynamic membership
  – Sensor networks: limited storage, processing, and bandwidth

P2P Systems: Tracing
• Trace messages by piggybacking the current node name onto each message
• Tracing overhead
  – Assume a 32-bit node name and a very conservative log2(N) hops per message
  – Data overhead is 40% for a 1500-byte message in an N-node system

P2P Systems: Implementation Verification
• Current debugging technique: lots of printf()’s on each node, then manually correlating the paths taken by messages
• How do you know messages are delivered as specified by the algorithms?
• Use message paths to check routing invariants
  – Detect extra hops and loops, and verify that the paths are correct
• Can implement as a query:
  select length from paths where (length > log2(N))
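
The same hop-count invariant, written directly against reconstructed paths rather than the query engine (a sketch; representing a path as a list of node names is my assumption):

    import java.util.List;

    /** Illustrative hop-count invariant check for a structured P2P overlay. */
    public final class RoutingInvariant {
        /** Flags paths longer than the algorithm's log2(N) upper bound. */
        static boolean violates(List<String> hops, int networkSize) {
            double bound = Math.log(networkSize) / Math.log(2);  // log2(N)
            return hops.size() > bound;
        }

        public static void main(String[] args) {
            List<String> path = List.of("n1", "n4", "n9", "n2", "n7");
            System.out.println(violates(path, 16));  // true: 5 hops > log2(16) = 4
        }
    }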