Presenter: Chi-Hung Lu
Problems
- Distributed applications are hard to validate
- Application state is distributed across many distinct execution environments
- Protocols involve complex interactions among a collection of networked machines
- Need to handle failures ranging from network problems to crashing nodes
- Intricate sequences of events can trigger complex errors as a result of mishandled corner cases
Approaches
- Logging-based Debugging
  - X-Trace
  - Bi-directional Distributed BackTracker (BDB)
  - Pip
- Deterministic Replay
  - WiDS
  - Friday
  - Jockey
- Model Checking
  - MaceMC
R. Fonseca et al., NSDI '07
Problem Description
- It is difficult to diagnose the source of a problem in an Internet application
- Current network diagnostic tools focus on only one particular protocol
- They do not share information about the application among the user, the service, and the network operators
Examples
- traceroute
  - Can locate IP connectivity problems
  - Cannot reveal proxy or DNS failures
- HTTP monitoring suite
  - Can locate application problems
  - Cannot diagnose routing problems
Examples (figure: User, DNS Server, Proxy, Web Server)
X-Trace
- An integrated tracing framework
- Records the network paths that were taken
- Invoke X-Trace when initiating an application task
- Insert X-Trace metadata with a task identifier in the request
- Propagate the metadata down to lower layers through protocol interfaces
Task Tree
- X-Trace tags all network operations resulting from a particular task with the same task identifier
- The task tree is the set of network operations connected with an initial task
- The task tree can be reconstructed after collecting trace data from reports
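
A minimal sketch of how a task tree might be rebuilt from collected reports. The `Report` record, its field names, and the rule that a report whose parent is missing becomes its own root are assumptions made for illustration; they are not the X-Trace report format.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Report:
    """Hypothetical report record: one per network operation."""
    task_id: str
    op_id: str
    parent_id: Optional[str]  # None marks the operation that started the task
    label: str


def build_task_tree(reports, task_id):
    """Group the reports of one task into roots and a parent -> children map."""
    known = {r.op_id: r for r in reports if r.task_id == task_id}
    children = defaultdict(list)
    roots = []
    for r in known.values():
        if r.parent_id is None or r.parent_id not in known:
            # Either the true root, or an orphan whose parent report is missing.
            roots.append(r.op_id)
        else:
            children[r.parent_id].append(r.op_id)
    return roots, dict(children)


# Example: an HTTP request through a proxy, plus an unrelated task.
reports = [
    Report("t1", "a", None, "HTTP client"),
    Report("t1", "b", "a", "HTTP proxy"),
    Report("t1", "c", "b", "HTTP server"),
    Report("t1", "d", "a", "TCP client to proxy"),
    Report("t2", "x", None, "unrelated task"),
]
roots, children = build_task_tree(reports, "t1")
```

Because every operation carries the same task identifier, filtering on `task_id` is enough to separate concurrent tasks before linking parents to children.
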
An example of the task tree: a simple HTTP request through a proxy
X-Trace Components
- Data: X-Trace metadata
- Network path: Task tree
- Report: Reconstruct task tree
Propagation of X-Trace Metadata: propagation of the metadata through the task tree
The X-Trace metadata
Field        Usage
Flags        Bits that specify which of the three optional components are present
TaskID       A unique integer ID
TreeInfo     ParentID, OpID, EdgeType
Destination  Specifies the address that X-Trace reports should be sent to
Options      Mechanism to accommodate future extensions
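
The fields above can be pictured as a small record that is copied and re-linked each time a request crosses a layer or a hop. This is a sketch only: the Python field names, the random operation IDs, the helper names `push_down` and `push_next`, and the two edge labels are illustrative assumptions, not the wire format or the X-Trace library API.

```python
import secrets
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class XTraceMetadata:
    task_id: str                        # shared by every operation of the task
    op_id: str                          # identifies this operation
    parent_id: Optional[str] = None     # TreeInfo: operation that caused this one
    edge_type: Optional[str] = None     # TreeInfo: "down" or "next" (assumed labels)
    destination: Optional[str] = None   # where reports should be sent


def new_task(destination=None):
    """Create metadata for the operation that starts a task."""
    return XTraceMetadata(secrets.token_hex(4), secrets.token_hex(4),
                          destination=destination)


def _child(md, edge_type):
    # Same task, fresh operation ID, parent pointer back to the caller.
    return XTraceMetadata(md.task_id, secrets.token_hex(4),
                          parent_id=md.op_id, edge_type=edge_type,
                          destination=md.destination)


def push_down(md):
    """Metadata for the operation in the layer below (e.g. HTTP to TCP)."""
    return _child(md, "down")


def push_next(md):
    """Metadata for the next operation in the same layer (e.g. proxy to server)."""
    return _child(md, "next")
```

Keeping the record immutable means each layer derives a new value rather than editing the caller's copy, so the parent pointers always describe the path that was actually taken.
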
Operation of X-Trace Metadata
X-Trace Report Architecture
Usage Scenario (1): Web request and recursive DNS queries
Usage Scenario (2): A request fault annotated with user input
Usage Scenario (3): A client and a server communicate over an I3 overlay network
Usage Scenario (3): Internet Indirect Infrastructure (I3)
Usage Scenario (3): Tree for normal operation
Usage Scenario (3): The receiver host fails
Usage Scenario (3): The middlebox process crashes
Usage Scenario (3): The middlebox host fails
Discussion
- Report loss
- Non-tree request structures
- Partial deployment
- Managing report traffic
- Security considerations
X. Liu et al., NSDI '07
Problem Description
- Log mining is both labor-intensive and fragile
- Latent bugs are often distributed across multiple nodes
- Logs reflect incomplete information about an execution
- Distributed applications are non-deterministic
Goals
- Efficiently verify application properties
- Provide fairly complete information about an execution
- Reproduce the buggy runs deterministically and faithfully
Approach
- Log the actual execution of a distributed system
- Apply predicate checking in a centralized simulator over a run driven by testing scripts or replayed from logs
- Output a violation report along with message traces
- An execution is interpreted as a sequence of events, which are dispatched to the corresponding handling routines
Components
- A versatile scripting language
  - Allows a developer to refine system properties into straightforward assertions
- A checker
  - Inspects for violations
Architecture: Components of WiDS Checker
Architecture
- Reproduce real runs
  - Log all non-deterministic events using Lamport’s logical clock
- Check user-defined predicates
  - A versatile scripting language to specify the system states being observed and the predicates for invariants and correctness
- Screen out false alarms with auxiliary information
  - For liveness properties
- Trace root causes using a visualization tool
Programming with WiDS
- WiDS APIs are mostly member functions of the WiDSObject class
- The WiDS runtime maintains an event queue to buffer pending events and dispatches them to the corresponding handling routines
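
The event-queue model above can be illustrated with a small single-threaded loop. This is not the WiDS API: `EventLoop`, `on`, `post`, and `run` are hypothetical names, and ordering events by a virtual timestamp with a sequence-number tie-break is an assumption chosen to keep dispatch deterministic.

```python
import heapq
import itertools


class EventLoop:
    """Buffers pending events and dispatches each to its registered handler."""

    def __init__(self):
        self._queue = []                 # heap of (time, seq, kind, payload)
        self._seq = itertools.count()    # tie-break keeps posting order stable
        self._handlers = {}
        self.now = 0                     # virtual time of the last dispatched event

    def on(self, kind, handler):
        self._handlers[kind] = handler

    def post(self, kind, payload, delay=0):
        heapq.heappush(self._queue,
                       (self.now + delay, next(self._seq), kind, payload))

    def run(self):
        while self._queue:
            when, _, kind, payload = heapq.heappop(self._queue)
            self.now = when
            self._handlers[kind](payload)


# Example: two messages and one timer; handlers record what they saw.
seen = []
loop = EventLoop()
loop.on("msg", lambda p: seen.append(("msg", p)))
loop.on("timer", lambda p: seen.append(("timer", p)))
loop.post("timer", "t1", delay=5)
loop.post("msg", "hello")
loop.post("msg", "world")
loop.run()
```

Funnelling every message, timer, and I/O completion through one queue is what makes the same application code runnable both in a real deployment and inside a simulator that controls the order of events.
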
Enabling Replay
- Logging
  - Log all WiDS nondeterminism
  - Redirect OS calls and log the results
  - Embed a Lamport clock in each outgoing message
- Checkpoint
  - Support partial replay
  - Save the WiDS process context
- Replay
  - Start from the beginning or a checkpoint
  - Replay events in serialized Lamport order
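
A short sketch of the Lamport-clock bookkeeping behind the logging and replay steps above. The `LoggedEvent` record and the choice to break timestamp ties by node name are assumptions for illustration; the slides only state that events are replayed in serialized Lamport order.

```python
from dataclasses import dataclass, field


class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        """Advance for a local event and return the new timestamp."""
        self.time += 1
        return self.time

    def send(self):
        """Timestamp to embed in an outgoing message."""
        return self.tick()

    def receive(self, msg_time):
        """Merge the sender's timestamp so the receive orders after the send."""
        self.time = max(self.time, msg_time) + 1
        return self.time


@dataclass(order=True)
class LoggedEvent:
    lamport: int
    node: str
    description: str = field(compare=False)


def replay_order(logs):
    """Merge per-node logs into one total order: Lamport time, then node name."""
    return sorted(e for log in logs.values() for e in log)


# Example: node A sends a ping that node B receives.
a, b = LamportClock(), LamportClock()
logs = {"A": [], "B": []}
logs["A"].append(LoggedEvent(a.tick(), "A", "timer fires"))
stamp = a.send()
logs["A"].append(LoggedEvent(stamp, "A", "send ping"))
logs["B"].append(LoggedEvent(b.tick(), "B", "local input"))
logs["B"].append(LoggedEvent(b.receive(stamp), "B", "receive ping"))
order = [(e.lamport, e.node) for e in replay_order(logs)]
```

Any total order that extends the Lamport timestamps keeps every receive after its matching send, which is the property a centralized replay needs in order to feed events to each node in a causally consistent sequence.
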
Checker
- Observe memory state
- Define states and evaluate predicates
  - Refresh database for each event
  - Maintain history
  - Re-evaluate modified predicates
- Auxiliary information for violations
  - Liveness properties are only guaranteed to be true eventually
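
One way to picture the checking loop above: predicates are re-evaluated after every replayed event, safety predicates are reported as soon as they fail, and liveness predicates are reported only after they have stayed false for a while. The `Checker` class and the `grace` window (a count of events) are hypothetical stand-ins for the auxiliary information mentioned on the slide, not the WiDS Checker scripting language.

```python
class Checker:
    def __init__(self, grace=3):
        self.grace = grace
        self.predicates = []      # (name, function, is_liveness)
        self.failing_since = {}   # liveness name -> index of first failing event
        self.violations = []      # (event index, predicate name)

    def add(self, name, fn, liveness=False):
        self.predicates.append((name, fn, liveness))

    def after_event(self, index, state):
        """Re-evaluate every predicate against the state after one event."""
        for name, fn, liveness in self.predicates:
            if fn(state):
                self.failing_since.pop(name, None)
            elif not liveness:
                # Safety predicates must hold after every event.
                self.violations.append((index, name))
            else:
                # Liveness predicates may be false temporarily; report once,
                # when the failure has lasted exactly `grace` events.
                first = self.failing_since.setdefault(name, index)
                if index - first == self.grace:
                    self.violations.append((index, name))


# Example: leader count observed after each of seven replayed events.
checker = Checker(grace=2)
checker.add("at most one leader", lambda s: s["leaders"] <= 1)
checker.add("eventually a leader", lambda s: s["leaders"] >= 1, liveness=True)
for i, leaders in enumerate([0, 0, 1, 2, 0, 0, 0]):
    checker.after_event(i, {"leaders": leaders})
```

The brief leaderless period at the start is tolerated, while the later one that lasts past the grace window is reported, which is the kind of false-alarm screening a liveness property needs.
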
Visualization Tools: Message flow graph
Evaluation: Benchmark and result summary
Performance: Running time for evaluating predicates
Logging Overhead: Percentage of logging time
Discussion
- The system is debugged by those who developed it
- Bugs are hunted by those who are intimately familiar with the system