Published by Philomena Wright. Modified over 9 years ago.
1
PERFORMANCE DEBUGGING FOR DISTRIBUTED SYSTEMS OF BLACK BOXES
Marcos K. Aguilera, Jeffrey C. Mogul, Janet L. Wiener, Patrick Reynolds, Athicha Muthitacharoen
Presented by 김 운태
2
Outline
- Problem statement & goals
- Overview of our approach
- Algorithms
  - The nesting algorithm (RPC)
  - The convolution algorithm (RPC or free-form)
- Experimental results
- Conclusions
3
Motivation
- Complex distributed systems
  - Built from black-box components
  - Heavy communications traffic
  - Bottlenecks at some specific nodes
- These systems may have performance problems
  - High or erratic latency
  - Caused by complex system interactions
- Isolating performance bottlenecks is hard
  - We cannot always examine or modify system components
- We need tools to infer where bottlenecks are
  - Choose which black boxes to open
4
Goals
- Isolate performance bottlenecks
- Find high-impact causal path patterns
  - Causal path: a series of nodes that sent/received messages. Each message is caused by receipt of the previous message, and some causal paths occur many times
  - High-impact: occurs frequently and contributes significantly to overall latency
- Identify high-latency nodes on high-impact patterns
  - These nodes add significant latency to the patterns
- What do we need for this? A message trace is enough
5
Example multi-tier system
6
The Black Box
- Desired properties
  - Zero-knowledge, zero-instrumentation, zero-perturbation
  - Scalability
  - Accuracy
  - Efficiency (time and space)
[Figure: complex distributed system built from “black boxes”, with performance bottlenecks]
7
Outline
- Problem statement & goals
- Overview of our approach
- Algorithms
  - The nesting algorithm (RPC)
  - The convolution algorithm (RPC or free-form)
- Experimental results
- Conclusions
8
Overview of Approach
- Obtain traces of messages between components
  - Ethernet packets, middleware messages, etc.
- Collect traces as non-invasively as possible
- Require very little information:
  - [timestamp, source, destination, call/return, call-id]
- Analyze traces using our algorithms
  - Nesting: faster, more accurate, limited to RPC-style systems
  - Convolution: works for all message-based systems
- Visualize results and highlight high-impact paths
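As a rough illustration of how little information a trace record carries, here is a minimal Python sketch of the five fields listed above. The class name `TraceEntry`, the function `parse_trace`, and the whitespace-separated text format are assumptions made for this example; the slides do not specify a file format.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TraceEntry:
    timestamp: float   # when the message was observed
    source: str        # sending node
    destination: str   # receiving node
    kind: str          # "call" or "return"
    call_id: int       # links a return to its call


def parse_trace(lines: List[str]) -> List[TraceEntry]:
    """Parse hypothetical whitespace-separated records, sorted by timestamp."""
    entries = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        ts, src, dst, kind, call_id = line.split()
        entries.append(TraceEntry(float(ts), src, dst, kind, int(call_id)))
    return sorted(entries, key=lambda e: e.timestamp)


# Made-up sample: A calls B, B calls C, and both calls return.
sample = [
    "0.010 A B call 1",
    "0.025 B C call 2",
    "0.040 C B return 2",
    "0.055 B A return 1",
]
trace = parse_trace(sample)
```

Nothing in the record describes what a node does internally, which is what keeps the components black boxes.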
9
Challenges
- Traces contain interleaved messages from many causal paths
  - How do we identify causal paths?
  - Infer causality from timestamps
- We want only statistically significant causal paths
  - How do we tell which paths are significant?
  - Significant paths appear repeatedly
10
Outline
- Problem statement & goals
- Overview of our approach
- Algorithms
  - The nesting algorithm (RPC)
  - The convolution algorithm (RPC or free-form)
- Experimental results
- Conclusions
11
The nesting algorithm
- Depends on RPC-style communication
- Infers causality from “nesting” relationships between message timestamps
  - Suppose A calls B, and B calls C before returning to A
  - Then the B -> C call is “nested” in the A -> B call
- Uses statistical correlation
[Figure: timeline of nodes A, B, and C showing call and return messages]
12
Nesting: an example causal path
- Consider this system of 4 nodes: A, B, C, and D
- Looking for internal delays at each node
[Figure: timeline of nodes A, B, C, and D showing call and return messages]
13
Steps of the nesting algorithm
1. Pair call and return messages
   - (A -> B, B -> A), (B -> D, D -> B), (B -> C, C -> B)
2. Find and score all nesting relationships
   - B -> C nested in A -> B
   - B -> D also nested in A -> B
3. Pick best parents
   - Here: unambiguous
4. Reconstruct call paths
   - A -> B -> [C ; D]
Execution time: O(mp)
- m = number of messages
- p = mean per-node parallelism
[Figure: timeline of nodes A, B, C, and D showing call and return messages]
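The four steps can be sketched in Python for the unambiguous case shown on this slide. This is a simplified illustration, not the authors' implementation: the tuple layout, the names `CallPair`, `pair_calls`, `candidate_parents`, and `call_path`, and the timestamps are all assumptions, and scoring is skipped because every child has exactly one candidate parent.

```python
from collections import namedtuple

CallPair = namedtuple("CallPair", "caller callee t_call t_return")


def pair_calls(trace):
    """Step 1: match each return to its call via (caller, callee, call_id)."""
    open_calls, pairs = {}, []
    for t, src, dst, kind, call_id in sorted(trace):
        if kind == "call":
            open_calls[(src, dst, call_id)] = t
        else:  # a return travels from callee back to caller
            t_call = open_calls.pop((dst, src, call_id), None)
            if t_call is not None:
                pairs.append(CallPair(dst, src, t_call, t))
    return pairs


def candidate_parents(child, pairs):
    """Step 2: X -> Y may be the parent of Y -> Z if it encloses it in time."""
    return [p for p in pairs
            if p.callee == child.caller
            and p.t_call < child.t_call and child.t_return < p.t_return]


def call_path(pair, pairs):
    """Steps 3 and 4, unambiguous case: attach children to their only parent."""
    children = sorted((c for c in pairs if candidate_parents(c, pairs) == [pair]),
                      key=lambda c: c.t_call)
    if not children:
        return pair.callee
    inner = " ; ".join(call_path(c, pairs) for c in children)
    return pair.callee + " -> [" + inner + "]"


# Made-up trace entries: (timestamp, source, destination, kind, call_id).
trace = [
    (0, "A", "B", "call", 1),
    (1, "B", "C", "call", 2), (2, "C", "B", "return", 2),
    (3, "B", "D", "call", 3), (4, "D", "B", "return", 3),
    (5, "B", "A", "return", 1),
]
pairs = pair_calls(trace)
roots = [p for p in pairs if not candidate_parents(p, pairs)]
path = roots[0].caller + " -> " + call_path(roots[0], pairs)
print(path)
```

When several enclosing calls could be the parent, this sketch attaches the child to none of them; that is the case the scoring step is meant to resolve.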
14
Inferring nesting
- An example of parallel calls
  - Local information is not enough
  - Use aggregate information
  - Histograms keep track of possible latencies
  - The medium-length delay will be selected
  - Assign nesting
  - Heuristic methods
[Figure: timeline of nodes A, B, and C with timestamps t1, t2, t3, t4]
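To make the histogram idea concrete, the sketch below scores ambiguous parents for two parallel A -> B calls. Each candidate parent receives a vote of 1/(number of candidates) in a histogram keyed by the call delay, and each child then picks the candidate whose delay collected the most votes. The timestamps and helper names are invented for illustration, delays are assumed to be already quantized to integers, and the penalty factor from the deck's pseudo-code is omitted.

```python
from collections import defaultdict

# Call pairs as (caller, callee, t_call, t_return); all values are made up.
parents = [("A", "B", 0, 8), ("A", "B", 1, 9)]    # two parallel A -> B calls
children = [("B", "C", 3, 5), ("B", "C", 4, 6)]   # two B -> C calls


def enclosing(child, candidates):
    """Parents whose call/return interval contains the child's interval."""
    return [p for p in candidates
            if p[1] == child[0] and p[2] < child[2] and child[3] < p[3]]


# Aggregate step: histogram of possible delays, keyed by (A, B, C, delay).
scoreboard = defaultdict(float)
for child in children:
    cands = enclosing(child, parents)
    for p in cands:
        delay = child[2] - p[2]
        scoreboard[(p[0], p[1], child[1], delay)] += 1.0 / len(cands)


def best_parent(child):
    """Pick the candidate whose implied delay is most common overall."""
    cands = enclosing(child, parents)
    return max(cands,
               key=lambda p: scoreboard[(p[0], p[1], child[1], child[2] - p[2])])


assignment = {child: best_parent(child) for child in children}
```

In this example the possible delays are 2, 3, and 4 time units. Only the delay of 3 is consistent with both children, so it collects a score of 1.0 and each B -> C call is assigned to the A -> B call that started 3 units earlier.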
15
Outline
- Problem statement & goals
- Overview of our approach
- Algorithms
  - The nesting algorithm
  - The convolution algorithm
- Experimental results
- Conclusions
16
The convolution algorithm
- “Time signal” of messages for each edge
- Example: node A sent messages to node B at times 1, 2, 5, 6, 7
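A time signal can be pictured as a per-time-step message count for one edge. The sketch below builds one from the send times in the example; the function name `time_signal` and the choice of a 10-step trace with step size 1 are assumptions for illustration.

```python
def time_signal(send_times, step, duration):
    """Count the messages on one edge that fall into each time step."""
    n_steps = int(duration / step)
    signal = [0] * n_steps
    for t in send_times:
        index = int(t / step)
        if 0 <= index < n_steps:
            signal[index] += 1
    return signal


# Edge A -> B with messages sent at times 1, 2, 5, 6, 7.
signal_ab = time_signal([1, 2, 5, 6, 7], step=1, duration=10)
print(signal_ab)  # [0, 1, 1, 0, 0, 1, 1, 1, 0, 0]
```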
17
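The convolution algorithm
- Look for time-shifted similarities
  - Compute the cross-correlation C(u) = Σ_t S1(t) · S2(t + u), where S1 and S2 are the time signals of two edges and u is the delay
  - Find peaks in C(u)
  - The time shift of a peak indicates the delay
- Use Fast Fourier Transforms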
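The cross-correlation step can be sketched with a small pure-Python FFT. The code uses the standard definition C(u) = sum over t of s1[t] * s2[t + u], which is an assumption about the exact form intended on the slide. The function names and the two example signals are made up, and a real analysis would more likely use a numerical library than this recursive FFT.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def cross_correlation(s1, s2):
    """C(u) = sum_t s1[t] * s2[t + u] for u = 0 .. len(s1) - 1.

    Assumes both signals have the same length.
    """
    size = 1
    while size < 2 * len(s1):
        size *= 2  # zero-pad so that shifts do not wrap around
    f1 = fft([complex(v) for v in s1] + [0j] * (size - len(s1)))
    f2 = fft([complex(v) for v in s2] + [0j] * (size - len(s2)))
    product = [a.conjugate() * b for a, b in zip(f1, f2)]
    c = fft(product, inverse=True)
    return [round(v.real / size, 6) for v in c[:len(s1)]]


# Edge A -> B, and edge B -> C carrying the same pattern 2 steps later.
signal_ab = [0, 1, 1, 0, 0, 1, 1, 1, 0, 0]
signal_bc = [0, 0, 0, 1, 1, 0, 0, 1, 1, 1]
c = cross_correlation(signal_ab, signal_bc)
peak_delay = max(range(len(c)), key=lambda u: c[u])
print(peak_delay)  # 2
```

The peak at a delay of 2 steps is what would suggest that messages on B -> C are caused by messages on A -> B, with about 2 time steps spent in B.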
18
Convolution: an example causal path
- Peaks suggest causality between A -> B and B -> C
[Figure: cross-correlation of the A -> B and B -> C time signals, plotted against delay]
19
Convolution algorithm details
- Time complexity: O(em + eS log S)
  - m = number of messages
  - e = number of edges in the output graph
  - S = number of time steps in the trace
- Need to choose the time step size
  - Must be shorter than the delays of interest
  - Too coarse: poor accuracy
  - Too fine: long running time
- Robust to noise in the trace
20
Algorithm comparison
- Nesting
  - Looks at individual paths and then aggregates
  - Finds rare paths
  - Requires call/return style communication
  - Fast enough for real-time analysis
- Convolution
  - Applicable to a broader class of systems
  - Slower: more work with less information
  - May need to try different time steps to get good results
  - Reasonable for off-line analysis
21
Summary

                          | Nesting algorithm                           | Convolution algorithm
Communication style       | RPC only                                    | RPC or free-form messages
Rare events               | Yes, but hard                               | No
Level of trace detail     | + call/return tag                           |
Time and space complexity | Linear space, linear time                   | Linear space, polynomial time
Visualization             | RPC call and return combined; more compact  | Less compact
22
Outline
- Problem statement & goals
- Overview of our approach
- Algorithms
- Experimental results
  - Maketrace: a trace generator
  - Multi-tier configuration
  - Validation of accuracy
  - Execution costs
- Conclusions
23
Maketrace
- Synthetic trace generator
- Needed for testing
  - Validate output for known input
  - Check corner cases
- Uses a set of causal path templates
  - All call and return messages, with latencies
  - Delays are x ± y seconds, Gaussian (normal) distribution
- Recipe to combine paths
  - Parallelism, start/stop times for each path
  - Duration of trace
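The idea of a template-driven generator can be sketched as follows. This is not the real Maketrace tool: the function `make_trace`, the template layout, and every delay value are assumptions, and the recipe is reduced to "start one instance of a single path every `spacing` seconds".

```python
import random


def make_trace(template, n_instances, spacing, seed=0):
    """Emit (timestamp, source, destination, kind, call_id) tuples.

    template: list of (source, destination, kind, mean_delay, stddev) steps;
    each delay is the time since the previous message on the same path.
    Returns are matched to the most recent open call (properly nested paths).
    """
    rng = random.Random(seed)
    trace, next_id = [], 0
    for i in range(n_instances):
        t = i * spacing
        open_ids = []
        for src, dst, kind, mean, std in template:
            t += max(0.0, rng.gauss(mean, std))  # Gaussian delay, clamped at 0
            if kind == "call":
                next_id += 1
                open_ids.append(next_id)
                cid = next_id
            else:
                cid = open_ids.pop()
            trace.append((t, src, dst, kind, cid))
    return sorted(trace)


# Hypothetical template for the path A -> B -> C with x ± y second delays.
template = [
    ("A", "B", "call", 0.001, 0.0002),
    ("B", "C", "call", 0.010, 0.002),
    ("C", "B", "return", 0.020, 0.005),
    ("B", "A", "return", 0.005, 0.001),
]
trace = make_trace(template, n_instances=3, spacing=1.0)
```

Because the true paths and delays are known, the output of either algorithm on such a trace can be checked against the template. Choosing a `spacing` shorter than the path duration would make instances overlap, which is one simple way to introduce parallelism.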
24
Multi-tier configuration
25
Result of multi-tier trace and delay
- Using the nesting algorithm
- Infers the causal paths and the delay of each edge and node
26
Result of multi-tier trace and delay
- Using the convolution algorithm
27
Validation of accuracy
- Metrics for evaluating accuracy
  - False negative: the algorithm failed to find a path
  - False positive: the algorithm inferred a path that wasn’t there
- Example
  - Actual paths: ‘A -> B -> C -> D’ x 2
  - Inferred paths: ‘A -> B -> C -> D’, ‘A -> B’, ‘C -> D’
- Counting path instances
  - One false negative
  - Two false positives
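Counting path instances amounts to comparing two multisets of paths. Here is a minimal sketch using Python's `collections.Counter` that reproduces the numbers in the example; the function name `count_errors` and the representation of a path as a string are assumptions.

```python
from collections import Counter


def count_errors(actual, inferred):
    """Return (false_negatives, false_positives), counted per path instance."""
    a, i = Counter(actual), Counter(inferred)
    false_negatives = sum((a - i).values())  # actual paths that were missed
    false_positives = sum((i - a).values())  # inferred paths that do not exist
    return false_negatives, false_positives


actual = ["A -> B -> C -> D"] * 2
inferred = ["A -> B -> C -> D", "A -> B", "C -> D"]
print(count_errors(actual, inferred))  # (1, 2)
```

Counter subtraction drops non-positive counts, so the one correctly inferred instance cancels one actual instance and leaves one missed instance plus the two spurious fragments.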
28
Validation of accuracy
- False negative rate for the top N patterns
  - Bounded in most cases by 1/N
  - False negatives caused by large N
  - False negatives caused by near-ties in the ranking
29
Execution costs
(1) Nesting algorithm: O(mp)
(2) Convolution algorithm: O(em + eS log S) ≈ O(eS log S)
30
Conclusions
- Looking for bottlenecks in black-box systems
- Finding causal paths is enough to find bottlenecks
  - Algorithms to find paths in traces really work
  - We find correct latency distributions
- Two very different algorithms get similar results
- Passively collected traces have sufficient information
31
THANK YOU! ANY QUESTIONS?
32
Desired results for one trace
- Causal paths
  - How often
  - How much time spent
- Nodes
  - Host/component name
  - Time spent in node and all of the nodes it calls
- Edges
  - Time parent waits before calling child
33
Related work
- Systems that trace end-to-end causality via modified middleware, using modified JVM or J2EE layers
  - Magpie (Microsoft Research), aimed at performance debugging
  - Pinpoint (Stanford/Berkeley), aimed at locating faults
  - Products such as AppAssure, PerformaSure, OptiBench
- Systems that make inferences from traces
  - Intrusion detection (Zhang & Paxson, LBL) uses traces + statistics to find compromised systems
34
Future work
- Automate trace gathering and conversion
- Sliding-window versions of algorithms
  - Find phased behavior
  - Reduce memory usage of nesting algorithm
  - Improve speed of convolution algorithm
- Validate usefulness on more complicated systems
35
Visualization GUI
- Goal: highlight dominant paths
- Paths sorted
  - By frequency
  - By total time
- Red highlights
  - High-cost nodes
- Timeline
  - Nested calls
  - Dominant subcalls
- Time plots
  - Node time
  - Call delay
36
Pseudo-code for the nesting algorithm
- Detect call pairs and find all possible nestings of one call pair in another
- Pick the most likely candidate for the causing call for each call pair
- Derive call paths from the causal relationships

procedure FindCallPairs
  for each trace entry (t1, CALL/RET, sender A, receiver B, callid id)
    case CALL:
      store (t1, CALL, A, B, id) in Topencalls
    case RETURN:
      find matching entry (t2, CALL, B, A, id) in Topencalls
      if match is found then
        remove entry from Topencalls
        update entry with return message timestamp t2
        add entry to Tcallpairs
        entry.parents := {all callpairs (t3, CALL, X, A, id2) in Topencalls with t3 < t2}

procedure ScoreNestings
  for each child (B, C, t2, t3) in Tcallpairs
    for each parent (A, B, t1, t4) in child.parents
      scoreboard[A, B, C, t2-t1] += (1/|child.parents|)

procedure FindNestedPairs
  for each child (B, C, t2, t3) in Tcallpairs
    maxscore := 0
    for each p (A, B, t1, t4) in child.parents
      score[p] := scoreboard[A, B, C, t2-t1] * penalty
      if (score[p] > maxscore) then
        maxscore := score[p]
        parent := p
    parent.children := parent.children ∪ {child}

procedure FindCallPaths
  initialize hash table Tpaths
  for each callpair (A, B, t1, t2)
    if callpair.parents = null then
      root := { CreatePathNode(callpair, t1) }
      if root is in Tpaths then update its latencies
      else add root to Tpaths

function CreatePathNode(callpair (A, B, t1, t4), tp)
  node := new node with name B
  node.latency := t4 - t1
  node.call_delay := t1 - tp
  for each child in callpair.children
    node.edges := node.edges ∪ { CreatePathNode(child, t1) }
  return node