
1 Pip: Detecting the Unexpected in Distributed Systems
Patrick Reynolds (reynolds@cs.duke.edu)
Charles Killian, Amin Vahdat (UCSD)
Collaborators: Janet Wiener, Jeffrey Mogul, Mehul Shah (HP Labs)
http://issg.cs.duke.edu/pip/

2 Introduction
Distributed systems are essential to the Internet
Distributed systems are complex
– Harder to debug than centralized systems
Pip uses causal paths to find bugs
– Characterize system behavior
– Deviations from expected behavior often indicate bugs
Experimental results show that Pip is effective
– Found 19 bugs in 6 test systems

3 Challenges of debugging distributed systems
Distributed systems have more bugs
– Parallelism is hard
– New sources of failure: more components = more faults, network errors, security breaches
Finding bugs is harder
– More nodes, events, and messages to keep track of
– Applications may cross administrative domains
– Bugs on one node may be caused by events on another

4 Causal path analysis
Programmers wish to examine and check system-wide behaviors
– Causal paths
– Components of end-to-end delay
– Attribution of resource consumption
Unexpected behavior might indicate a bug
– Pip compares actual behavior to expected behavior
[Diagram: a causal path through a web server, app server, and database, annotated with a 500 ms delay and 2000 page faults]

5 What kinds of bugs?
Structure bugs
– Incorrect placement/timing of processing and communication
Performance bugs
– Throughput bottlenecks
– Over- or under-consumption of resources
Bugs present in an actual run of the system
– Selecting input for path coverage is an orthogonal problem
[Diagram: the same web server / app server / database path, highlighting the 500 ms delay and 2000 page faults]

6 Three target audiences
Primary programmer
– Debugging or optimizing his/her own system
Secondary programmer
– Inheriting a project or joining a programming team
– Discovery: learning how the system behaves
Operator
– Monitoring a running system for unexpected behavior
– Performing regression tests after a change

7 Outline
Introduction
Pip
– Expressing expected behavior
– Exploring application behavior
– Results
Conclusions

8 Workflow
Pip:
1. Captures events from a running system
2. Reconstructs behavior from events
3. Checks behavior against expectations
4. Displays unexpected behavior: both structure and resource violations
Goal: help programmers locate and explain bugs
[Diagram: the application produces a behavior model; the Pip checker compares it against expectations, reporting unexpected structure and resource violations; the Pip explorer is a visualization GUI]

9 Describing application behavior
Application behavior consists of paths
– All events, on any node, related to one high-level operation
– The definition of a path is programmer-defined
– A path is often causal, related to a user request
[Diagram: a timeline across a web server, app server, and database, showing "Parse HTTP", "Query", "Run application", and "Send response" events]

10 Describing application behavior
Within paths are tasks, messages, and notices (sketched as types below)
– Tasks: processing with start and end points
– Messages: send and receive events for any communication, including network messages, synchronization (lock/unlock), and timers
– Notices: time-stamped strings; essentially log entries
[Diagram: the same timeline, annotated with notices such as "Request = /cgi/…", "2096 bytes in response", and "done with request 12"]
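To make the model concrete, here is a minimal sketch of these concepts as C++ types. The type and field names are ours for illustration; they are not Pip's actual internal representation.

    #include <cstdint>
    #include <string>
    #include <vector>

    // One recorded event inside a path. A task contributes matching
    // start/end events, a message contributes matching send/receive
    // events, and a notice is a single time-stamped string.
    enum class EventKind { TaskStart, TaskEnd, MessageSend, MessageRecv, Notice };

    struct Event {
        EventKind kind;
        uint64_t timestampUs;  // when the event occurred
        std::string host;      // node that recorded the event
        std::string thread;    // thread that recorded the event
        std::string name;      // task name, message id, or notice text
    };

    // A path groups all events, on any node, related to one
    // high-level operation (e.g., one user request).
    struct Path {
        std::string pathId;         // programmer-chosen identifier
        std::vector<Event> events;  // ordered causally
    };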

11 Outline
Introduction
Pip
– Expressing expected behavior
– Exploring application behavior
– Results
Conclusions

12 Expectations
Pip has two kinds of expectations
– Recognizers classify paths into sets
– Aggregates assert properties of those sets
[Diagram: paths being sorted into recognizer sets R1, R2, and R3, with unmatched paths flagged]

13 Expectations: recognizers
Application behavior consists of paths; each recognizer matches paths
– A path can match more than one recognizer
A recognizer can be a validator, an invalidator, or neither
Any path matching zero validators or at least one invalidator is unexpected behavior: a potential bug

    validator CGIRequest
      thread WebServer(*, 1)
        task("Parse HTTP") limit(CPU_TIME, 100ms);
        notice(m/Request URL:.*/);
        send(AppServer);
        recv(AppServer);

    invalidator DatabaseError
      notice(m/Database error:.*/);

14 Expectations: recognizer language
repeat: matches a ≤ n ≤ b copies of a block
    repeat between 1 and 3 { … }
xor: matches any one of several blocks
    xor { branch: … }
future: lets a block match now or later; done forces the named block to match
    future F1 { … } … done(F1);

15 Expectations: aggregate expectations
Recognizers categorize paths into sets; aggregates make assertions about those sets
– Instances, unique instances, resource constraints
– Simple math and set operators

    assert(instances(CGIRequest) > 4);
    assert(max(CPU_TIME, CGIRequest) < 500ms);
    assert(max(REAL_TIME, CGIRequest) <= 3*avg(REAL_TIME, CGIRequest));

16 Automating expectations and annotations
Expectations can be generated from the behavior model
– Create a recognizer for each actual path
– Eliminate repetition
– Strike a balance between over- and under-specification (see the sketch below)
Annotations can be generated by middleware
– As in several of our test systems
[Diagram: annotations flow from the application into a behavior model; generated expectations feed the Pip checker, which reports unexpected behavior]
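As a rough illustration of the generation step, here is one plausible approach (our sketch, not Pip's actual generator): reduce each observed path to a structural signature, deduplicate, and emit one recognizer per distinct signature. Eliminating repetition and tuning how much detail goes into the signature is where the over- vs. under-specification balance comes in.

    #include <map>
    #include <sstream>
    #include <string>
    #include <vector>

    // Assumes the Path/Event types sketched after slide 10.
    // A structural signature ignores timing and hosts, so paths with
    // the same shape collapse into a single recognizer.
    std::string signature(const Path& p) {
        std::ostringstream sig;
        for (const Event& e : p.events)
            sig << static_cast<int>(e.kind) << ':' << e.name << ';';
        return sig.str();
    }

    // One generated recognizer per distinct signature, with a count
    // of how many observed paths it covers.
    std::map<std::string, int> generateRecognizers(const std::vector<Path>& paths) {
        std::map<std::string, int> recognizers;
        for (const Path& p : paths)
            ++recognizers[signature(p)];
        return recognizers;
    }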

17 Exploring behavior
The expectations checker generates lists of valid and invalid paths
Explore both sets
– Why did invalid paths occur?
– Is any unexpected behavior misclassified as valid?
  Insufficiently constrained expectations
  Pip may be unable to express all expectations

18 Visualization: causal paths
[Screenshot: causal view of one path, with callouts for the timing and resource properties of one task and for the tasks, messages, and notices caused on that thread]

19 Visualization: timelines
View one path instance as a timeline
– Different colors = different hosts
– Each bar = a task
– Click a bar to see task details

20 Visualization: performance graphs
Plot per-task or per-path resource metrics
– Cumulative distribution (CDF)
– Probability density (PDF)
– Changing values over time
Click on a point to see its value and the task/path it represents
[Example graph: delay (ms) plotted against time (s)]
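For intuition about the CDF view: an empirical CDF is just the sorted metric values plotted against their cumulative fraction. A minimal sketch of that computation (our code, not the explorer GUI's):

    #include <algorithm>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Print the empirical CDF of per-path delays: after sorting, the
    // i-th smallest delay has cumulative probability (i + 1) / n.
    void printDelayCdf(std::vector<double> delaysMs) {
        std::sort(delaysMs.begin(), delaysMs.end());
        const double n = static_cast<double>(delaysMs.size());
        for (std::size_t i = 0; i < delaysMs.size(); ++i)
            std::printf("%8.2f ms -> %.3f\n", delaysMs[i], (i + 1) / n);
    }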

21 Results
We have applied Pip to several distributed systems:
– FAB: distributed block store
– SplitStream: DHT-based multicast protocol
– Others: RanSub, Bullet, SWORD, Oracle of Bacon
We have found unexpected behavior in each system
We have fixed bugs in some systems, and used Pip to verify that the behavior was fixed

22 Results: SplitStream
DHT-based multicast protocol
13 bugs found, 12 fixed
– 11 found using expectations, 2 found using the GUI
Structural bug: some nodes have up to 25 children when they should have at most 18
– This bug was fixed and later recurred
– Root cause #1: variable shadowing
– Root cause #2: failure to register a callback
– How discovered: first in the explorer GUI, then confirmed with automated checking

23 Results: FAB
Distributed block store
1 bug found and fixed
– Four protocols checked: read, write, Paxos, membership
Performance bug: quorum operations call nodes in arbitrary order
– Should overlap local computation with remote communication when possible

24 Results: FAB
Distributed block store
– For disk-bound workloads, call self last
– For cache-bound workloads, call self second (actually, last in the quorum)
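A sketch of the call-ordering idea for the disk-bound case, using stand-in types rather than FAB's actual code: issue the remote quorum requests first so their network round-trips overlap with the slow local read.

    #include <cstdio>
    #include <vector>

    // Minimal stand-ins for FAB's quorum machinery (illustrative only).
    struct Node {
        const char* name;
        bool isSelf;
        void read(int block) { std::printf("%s reads block %d\n", name, block); }
    };

    // "Call self last": start every remote request before touching the
    // local disk, so network latency and disk latency overlap.
    void quorumRead(std::vector<Node>& quorum, int block) {
        for (Node& n : quorum) if (!n.isSelf) n.read(block);
        for (Node& n : quorum) if (n.isSelf)  n.read(block);
    }

    int main() {
        std::vector<Node> q = {{"replica-a", false}, {"self", true}, {"replica-b", false}};
        quorumRead(q, 7);
    }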

25 Conclusions
Causal paths are a useful abstraction of distributed system behavior
Expectations serve as a high-level description
– Summary of inter-component behavior and timing
– Regression test for structure and performance
Finding unexpected behavior can help us find bugs
– Both structure and performance bugs
– Visualization can expose additional bugs
http://issg.cs.duke.edu/pip/

26 Extra slides
Related work
Pip results: RanSub
Pip vs. printf
Pip resource metrics
Building a behavior model
Annotations
Checking expectations
Visualization: communication graph

27 Related work
Causal paths
– Pinpoint [Chen, 2004]
– Magpie [Barham, 2004]
Black-box inference
– Finding stepping stones [Zhang, 1997]
– Wavelets to debug network performance [Huang, 2001]
Expectation-checking systems
– PSpec [Perl, 1993]
– Meta-level compilation [Engler, 2000]
– Paradyn [Miller, 1995]
Model checking
– MaceMC [Killian, 2006]
– VeriSoft [Godefroid, 2005]

28 Pip results: RanSub
2 bugs found, 1 fixed
Structural bug: during the first round of communication, parent nodes send summary messages before hearing from all children
– Root cause: uninitialized state variables
Performance bug: linear increase in end-to-end delay for the first ~2 minutes
– Suspected root cause: a data structure listing all discovered nodes

29 Pip vs. printf
Both record interesting events to check off-line
– Pip imposes structure and automates checking
– Generalizes ad hoc approaches

    Pip                                            | printf
    -----------------------------------------------|---------------------------------------------
    Nesting, causal order                          | Unstructured
    Time, path, and thread                         | No context
    CPU and I/O data                               | No resource information
    Automatic verification, declarative language   | Verification with ad hoc grep or expect scripts
    SQL queries                                    | "Queries" using Perl scripts
    Automatic generation for some middleware       | Manual placement

30 Pip resource metrics
Real time
User time, system time
– CPU time = user + system
– Busy time = CPU time / real time
Major and minor page faults (paging and allocation)
Voluntary and involuntary context switches
Message size and latency
Number of messages sent
Causal depth of path
Number of threads and hosts in path
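Most of these metrics map directly onto counters the OS already keeps per process or thread. A minimal sketch of collecting them on a POSIX system with getrusage; this illustrates where the numbers come from and is not Pip's actual instrumentation:

    #include <cstdio>
    #include <sys/resource.h>
    #include <sys/time.h>

    static double toSeconds(const timeval& tv) {
        return tv.tv_sec + tv.tv_usec / 1e6;
    }

    int main() {
        rusage ru;
        if (getrusage(RUSAGE_SELF, &ru) != 0) { std::perror("getrusage"); return 1; }

        const double user = toSeconds(ru.ru_utime);  // user time
        const double sys  = toSeconds(ru.ru_stime);  // system time
        const double cpu  = user + sys;              // CPU time = user + system

        std::printf("user=%.3fs sys=%.3fs cpu=%.3fs\n", user, sys, cpu);
        std::printf("page faults: major=%ld minor=%ld\n", ru.ru_majflt, ru.ru_minflt);
        std::printf("context switches: voluntary=%ld involuntary=%ld\n",
                    ru.ru_nvcsw, ru.ru_nivcsw);
        // Busy time = CPU time / real time, where real time comes from
        // timestamps taken at the start and end of the measured task.
        return 0;
    }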

31 Building a behavior model
Sources of events (any combination of the following):
– Annotations in source code: the programmer inserts statements manually
– Annotations in middleware: the middleware inserts annotations automatically; faster and less error-prone
– Passive tracing or interposition: easier, but yields less information
The model consists of paths constructed from events recorded by the running application

32 Annotations
– Set path ID
– Start/end task
– Send/receive message
– Notice
A sketch of their use follows.
[Diagram: the web server / app server / database timeline from slide 10, with notices such as "Request = /cgi/…", "2096 bytes in response", and "done with request 12"]
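Here is a sketch of what an annotated request handler might look like. The ANNOTATE_* names follow the style of Pip's annotation library, but the signatures are our approximation, with bodies that just log to stdout; treat them as illustrative, not as the library's verbatim API.

    #include <cstdarg>
    #include <cstdio>
    #include <string>

    // Hypothetical annotation calls in the style of Pip's library.
    void ANNOTATE_SET_PATH_ID(const std::string& id) { std::printf("path  %s\n", id.c_str()); }
    void ANNOTATE_START_TASK(const char* name)       { std::printf("start %s\n", name); }
    void ANNOTATE_END_TASK(const char* name)         { std::printf("end   %s\n", name); }
    void ANNOTATE_SEND(const char* msg, int bytes)   { std::printf("send  %s (%d B)\n", msg, bytes); }
    void ANNOTATE_NOTICE(const char* fmt, ...) {
        va_list ap; va_start(ap, fmt);
        std::vprintf(fmt, ap);
        va_end(ap); std::printf("\n");
    }

    // Every event recorded after SET_PATH_ID belongs to that path
    // until the path ID is changed again.
    void handleRequest(const std::string& requestId, const std::string& url) {
        ANNOTATE_SET_PATH_ID(requestId);

        ANNOTATE_START_TASK("Parse HTTP");
        // ... parse the request ...
        ANNOTATE_END_TASK("Parse HTTP");

        ANNOTATE_NOTICE("Request = %s", url.c_str());

        // A matching receive annotation on the app server lets Pip
        // stitch both hosts' events into one causal path.
        ANNOTATE_SEND("query-12", 2096);
    }

    int main() { handleRequest("req-12", "/cgi/example"); }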

33 Checking expectations
1. The application emits traces
2. Reconciliation matches start/end task and send/receive message events, producing an events database
3. Path construction organizes events into causal paths
4. Expectation checking: for each path P and each recognizer R, does R match P? Then check each aggregate
Result: categorized paths
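The heart of step 4 is a doubly nested loop over paths and recognizers. A minimal sketch, assuming recognizers expose a match predicate and validator/invalidator flags (our types, not Pip's):

    #include <string>
    #include <vector>

    // Assumes the Path type sketched after slide 10.
    struct Recognizer {
        std::string name;
        bool isValidator;              // neither flag set: plain classifier
        bool isInvalidator;
        bool (*matches)(const Path&);  // structural match predicate
    };

    // A path is unexpected behavior if it matches zero validators or
    // at least one invalidator (slide 13).
    bool isUnexpected(const Path& p, const std::vector<Recognizer>& rs) {
        int validatorsMatched = 0;
        for (const Recognizer& r : rs) {
            if (!r.matches(p)) continue;
            if (r.isInvalidator) return true;
            if (r.isValidator) ++validatorsMatched;
        }
        return validatorsMatched == 0;
    }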

34 Visualization: communication graph
Graph view of all host-to-host network traffic
– Nodes = hosts, edges = network links in use
– Directed edge = unidirectional communication

