Pip: Detecting the Unexpected in Distributed Systems

Presentation transcript:

Pip: Detecting the Unexpected in Distributed Systems
Patrick Reynolds, Duke University; Charles Killian, Amin Vahdat, UC San Diego; Janet L. Wiener, Jeffrey C. Mogul, Mehul A. Shah, HP Labs, Palo Alto

Today I want to talk about the Pip toolkit for discovering structural and performance bugs in distributed systems. This work was published at NSDI just about a year ago. The main contribution of Pip is that it combines two existing debugging approaches, path analysis and automated expectation checking, into a single system, and shows that the combination works well on distributed systems and applications.

UWCS OS Seminar Discussion, Andy Pavlo, 07 May 2007

Problem Statement
Distributed systems exhibit complex behaviors that can be difficult to debug, often more difficult than in centralized systems. Parallel, inter-node activity is difficult to capture with serial, single-node tools; we need something more robust than traditional profilers and debuggers.

The main motivation behind this work is the premise that distributed systems are difficult to debug. This is because of the inherent parallelism of these systems and because they are susceptible to extra sources of failure: more components, more network communication, more messages. Applications may also cross administrative boundaries (AFS?). Thus, the authors argue that existing debugging tools are inadequate in this environment; we need something more robust than just debug messages.

Problem Statement
Once behavior is captured, how do you analyze it? Structural bugs: application processing and communication. Performance problems: throughput bottlenecks, consumption of resources, unexpected interdependencies.

But the problem doesn't stop there: once you actually have trace data, how do you decide whether your application is behaving as it should? This is especially hard in distributed systems, where the placement and order of events can differ from one operation to the next. So the goal is to be able to reason about your system in two ways. First, is the program behaving correctly? That is, is the system sending, processing, and transmitting messages in the proper order? Second, is the system consuming too many or too few resources during operations? Taking too long indicates a bottleneck; taking too little may indicate truncated processing. How can we express this in a debugging tool?

Pip Overview
Suite of programs to gather, check, and display the behavior of distributed systems. Uses explicit path identifiers and programmer-written expectations to check program behavior. Pip compares actual behavior to expected behavior.

The system they developed is called Pip. It automatically checks the actual behavior of a system against the programmers' expectations. It infers activity from traces of network, application, and OS events, which is useful for systems driven by user requests, and it captures the delays and resource consumption associated with each request. Path-based debuggers can help programmers find aberrant paths and assist in optimizing throughput and end-to-end latency. The authors argue that Pip is useful for three types of users: original developers, to verify and debug their own systems; secondary developers, to learn about existing systems; and system maintainers, to monitor a system for changes.

System Overview
Annotation Library, Declarative Expectations Language, Trace Checker, Behavior Explorer GUI.

The Pip system has four main components. In the workflow shown here, we start with an application that is augmented with annotations. When the program is executed, it produces trace files. These trace files are aggregated into a single path database by a reconciliation process. The paths are then checked against expectations provided by the programmer about how the system is supposed to run. The results of this process, and other information about the program's execution, can then be visualized in graphical interfaces.

[Figure: Application with Annotations → Trace Files → Reconciliation → Path Database → Checker + Explorer, driven by the programmer's Expectations. Icon source: Paul Davey (2007)]

Application Annotation
Pip constructs an application's behavior model from generated events: manual source code annotations or automatic middleware insertions. Execution paths are built from tasks, messages, and notices.

Annotations are used to generate events and resource measurements of a running application. Annotations may be added manually by the programmer into the source code, or programmers can link their programs against middleware that has been augmented with Pip's annotation library, generating them automatically. The behavior model is comprised of multiple execution paths in the system. An execution path in Pip is made up of three components. Tasks: an interval of processing with a beginning and an end. Messages: any communication between hosts, or between threads on the same host. Notices: a record that an event has occurred, equivalent to a log message with a timestamp and a path identifier for context.
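To make the annotation idea concrete, here is a small Python sketch of what an annotation layer might record. The class and method names (set_path_id, start_task, and so on) are invented for illustration; Pip's real annotation library is a C API, and this only mimics the shape of the events it emits.

```python
import time
from dataclasses import dataclass, field

@dataclass
class TraceLog:
    """Hypothetical stand-in for an annotation library: records timestamped
    events, each tagged with the current path identifier."""
    events: list = field(default_factory=list)
    path_id: str = None

    def _emit(self, kind, detail):
        self.events.append((kind, self.path_id, detail, time.time()))

    def set_path_id(self, path_id):
        # All subsequent events belong to this execution path.
        self.path_id = path_id

    def start_task(self, name): self._emit("START_TASK", name)
    def end_task(self, name):   self._emit("END_TASK", name)
    def send(self, msg_id):     self._emit("SEND", msg_id)
    def receive(self, msg_id):  self._emit("RECEIVE", msg_id)
    def notice(self, text):     self._emit("NOTICE", text)

# Annotating one HTTP request path on the web server:
log = TraceLog()
log.set_path_id("req-42")
log.start_task("Parse HTTP")
log.notice("Received Request: GET /index.html")
log.end_task("Parse HTTP")
log.send("msg-1")  # message carrying the path ID to the app server

print([e[0] for e in log.events])
# ['START_TASK', 'NOTICE', 'END_TASK', 'SEND']
```

The key design point this illustrates is that the path ID set once at the start is attached to every later event, which is what lets a reconciliation step stitch events from many hosts into a single causal path.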

Application Annotation
Set Path ID, Start/End Task, Send/Receive Message, Generate Notice.

I want to give a quick example of what I mean by an execution path: a simple HTTP request that triggers some processing, a database query, and a response. First, the application sets a unique identifier for this execution path; notice that this path ID will be used across the multiple hosts involved in responding to the request. We then define start and end markers for the various tasks in the path. We also tag the messages that are sent and received between the hosts. Lastly, once the operation is complete, the application publishes notices to Pip about what happened during the execution.

[Figure: timeline of one request path. The WWW server runs the Parse HTTP and Send Response tasks, the App Server runs Execute, and the Database runs Query; notices mark Received Request, Processed Request, and Sent Response.]

Expectations
Declarative language to describe application structure, timing, and resource consumption. Expresses parallelism. Accommodates variation in the order and number of events across multiple paths.

Expectations are external descriptions of the behavior of an application: the execution steps, the timing of certain events, and how many resources are used. This is an example of an expectation written for the web server in the previous example. We set a limit on how much CPU time the application may spend parsing the request. We define the communications between the web server and the app server. We also have a check that invalidates the execution path if the execution caused a database error. Normally one would write the expectations first, but it is easier to illustrate the concepts in reverse. One thing the Pip authors tout is the ability to express parallelism in expectations, so you can define certain actions to occur in any order.

validator CGIRequest
  task("Parse HTTP") limit(CPU_TIME, 100ms);
  notice(m/Received Request: .*/);
  send(AppServer);
  recv(AppServer);
invalidator DatabaseError
  notice(m/Database error: .*/);

Expectations Example: Quorum

validator Request
  recv(Client) limit(SIZE, {=44b});
  task("Read") {
    repeat 3 { send(Peer); }
    repeat 2 { recv(Peer); task("ReadReply"); }
  }
  future { send(Client); }

This recognizer describes a quorum read: the node receives a 44-byte request from a client, sends the read to three peers, waits for replies from two of them, and at some later point (the future block) sends the response back to the client.

Expectations
Recognizers: descriptions of structural and performance behavior. Paths can be matching, matching with performance violations, or non-matching. Aggregates: assertions about properties of sets of paths.

Expectations take two forms. Recognizers are used to validate or invalidate individual path instances; the example from the previous slide is a recognizer, with parts to both validate and invalidate the execution path. Aggregates are rules about the properties of multiple paths; for example, you can define rules about the number of path instances matched by a given recognizer.
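To make the recognizer idea concrete, here is a small Python sketch (not Pip's actual checker) that validates an ordered list of trace events against a CGIRequest-style recognizer. The event tuples and the pattern format are invented for illustration.

```python
import re

def matches(recognizer, path_events):
    """Return True if the path's events contain the recognizer's
    (event_type, pattern) pairs in order."""
    it = iter(path_events)
    for want_type, pattern in recognizer:
        for ev_type, detail in it:
            if ev_type == want_type and re.fullmatch(pattern, detail):
                break  # matched this step; continue with the next one
        else:
            return False  # ran out of events before matching every step
    return True

# A recognizer mirroring the CGIRequest validator shown earlier:
cgi_request = [
    ("TASK",   r"Parse HTTP"),
    ("NOTICE", r"Received Request: .*"),
    ("SEND",   r"AppServer"),
    ("RECV",   r"AppServer"),
]

good_path = [("TASK", "Parse HTTP"),
             ("NOTICE", "Received Request: GET /"),
             ("SEND", "AppServer"),
             ("RECV", "AppServer")]
bad_path  = [("TASK", "Parse HTTP"),
             ("NOTICE", "Database error: timeout")]

print(matches(cgi_request, good_path))  # True
print(matches(cgi_request, bad_path))   # False
```

Real recognizers are considerably richer than this sketch (repeat counts, futures, resource limits, parallelism), but the core operation is the same: match each recorded path against a declared structure.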

Trace Checker
Pip generates a search tree from the expectations. The trace checker matches paths from the path database against those expectations.
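A minimal sketch of how a checker might sort paths into the three classes Pip reports (matching, matching with a performance violation, and non-matching). The path representation, the expected task list, and the resource limits below are all hypothetical, not Pip's internal format.

```python
# Expected structure and limits for one recognizer (illustrative values;
# the 100 ms CPU limit echoes the CGIRequest example earlier).
EXPECTED_TASKS = ["Parse HTTP", "Send Response"]
LIMITS_MS = {"Parse HTTP": 100}

def classify(path):
    """path["tasks"] maps task name -> CPU time used, in milliseconds."""
    # Structural check: the right tasks, in the right order.
    if list(path["tasks"]) != EXPECTED_TASKS:
        return "non-matching"
    # Performance check: structure matched, now verify resource limits.
    for task, used_ms in path["tasks"].items():
        if used_ms > LIMITS_MS.get(task, float("inf")):
            return "performance violation"
    return "valid"

print(classify({"tasks": {"Parse HTTP": 40,  "Send Response": 10}}))  # valid
print(classify({"tasks": {"Parse HTTP": 250, "Send Response": 10}}))  # performance violation
print(classify({"tasks": {"Execute": 5}}))                            # non-matching
```

The two-stage order matters: a path that fails structurally is never checked for performance, which matches the distinction the slides draw between structural bugs and performance problems.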

Behavior Explorer
Interactive GUI displays: Causal Path Structure, Communication Structure, Valid/Invalid Paths, Resource Usage Graphs.

Behavior Explorer Source: Pip web page (2007) http://issg.cs.duke.edu/pip/

Behavior Explorer
Causal Path Viewer: executed tasks, messages, and notices; timing and resource properties. Source: Pip web page (2007) http://issg.cs.duke.edu/pip/

Pip vs. Paradyn
The Paradyn Configuration Language (PCL) allows programmers to describe expected characteristics of applications. However, "...PCL cannot express the causal path structure of threads, tasks, and messages in a program, nor does Paradyn reveal the program's structure."

Using Pip in Condor
No high-level debugging tool is currently used by Condor developers. Knowledge of the inner workings of daemon interactions is either scattered across source code documentation or held by a few developers.

One of the running themes in Condor is that whenever we get a support email from somebody, the standard response is always "send us your log files".

Discussion Questions?