Performance Debugging for Distributed Systems of Black Boxes
  Marcos K. Aguilera, Jeffrey C. Mogul, Janet L. Wiener (HP Labs)
  Patrick Reynolds (Duke)
  Athicha Muthitacharoen (MIT)
  WISP, November 2004
page 2, 20 October 2003, Project5 - SOSP
Example multi-tier system
  [Figure: clients, web servers, application servers, database servers, and an authentication server]
Motivation
  Complex distributed systems are built from black box components
  These systems may have performance problems
    – High or erratic latency
  Locating the causes of these problems is hard
    – We can’t always examine or modify system components
  We need tools to infer where bottlenecks are
    – Choose which black boxes to open
Contributions of our work
  Tools to highlight which black boxes have problems
    – Require only passive information, such as packet traces
    – Infer where most of the time is spent from traces
    – A person can then use more invasive tools to examine those boxes
  Reduce time and cost to debug complex systems
  Improve quality of delivered systems
Example causal path
  [Figure: the example multi-tier system (clients, web servers, application servers, database servers, authentication server) with a causal path; label: 100ms]
Goals of our tools
  Find high-impact causal paths through a distributed system
  Causal path: series of nodes that sent/received messages
    – Each message is caused by receipt of previous message
    – Some causal paths occur many times
  High impact:
    – Occurs frequently
    – Contributes significantly to overall latency
  Without modifications or semantic knowledge
  Report per-node latencies on causal paths
Overview of our approach
  Obtain traces of messages between components
    – Ethernet packets, middleware messages, etc.
    – Collect traces as non-invasively as possible
  Analyze traces using algorithms
  Visualize results and highlight high-impact paths
  Requires very little information: [timestamp, source, destination]
Outline
  Problem statement & goals
  Overview of our approach
  Algorithm
  Experimental results
  Related work
  Conclusions
The convolution algorithm: input
  Time  From  To
  0.01  A     B
  0.02  A     B
  0.04  B     D
  0.05  C     F
  ...
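The input is nothing more than rows of (time, from, to). As a minimal sketch of loading such a trace in Python, assuming whitespace-separated columns and an optional header row (the names `Message` and `parse_trace` are illustrative, not part of the original tools):

```python
from typing import List, NamedTuple


class Message(NamedTuple):
    """One trace record: when a message was seen, and between which nodes."""
    time: float
    src: str
    dst: str


def parse_trace(text: str) -> List[Message]:
    """Parse 'time from to' rows, skipping the header, blank, and malformed lines."""
    records = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # blank line or elided row such as "..."
        try:
            timestamp = float(fields[0])
        except ValueError:
            continue  # header row such as "Time From To"
        records.append(Message(timestamp, fields[1], fields[2]))
    return sorted(records)  # NamedTuple ordering sorts by time first


trace = parse_trace("""
Time From To
0.01 A B
0.02 A B
0.04 B D
0.05 C F
""")
```

Sorting by timestamp up front keeps later per-edge processing simple, since every per-edge list of timestamps is then already in order.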
The convolution algorithm: output
  [Figure: output graph of causal paths over nodes A through G]
Basic idea
  Creates a “time signal” for messages from each node
  Given time signals S1(t) = (A→B msgs) and S2(t) = (B→X msgs)
  Computes S1 * S2, the convolution of S2(t) and S1(–t)
    (can be computed quickly using fast Fourier transforms)
  [Figure: the time signal S1(t) = (A→B msgs), plotted against time]
[Figure: plots of the following signals]
  S1(t) = (A→B msgs)
  S2(t) = (B→X msgs)
  S1 * S2 = conv(S2(t), S1(–t))
Spikes suggest causality between A→B and B→X msgs
Time shift of a spike indicates its characteristic delay
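Convolving S2(t) with S1(–t) amounts to counting, for each time shift, how many (A→B, B→X) message pairs are separated by that shift. The sketch below computes that count directly as a histogram of pairwise time differences. It is a simple quadratic-time stand-in for the FFT-based computation mentioned in the slides, not the authors' implementation; the millisecond timestamps, `bin_width`, and `max_delay` are assumptions chosen for illustration.

```python
from typing import List


def delay_histogram(s1: List[float], s2: List[float],
                    bin_width: float, max_delay: float) -> List[int]:
    """Count pairs (t1 in s1, t2 in s2) with 0 <= t2 - t1 < max_delay, binned by shift."""
    n_bins = int(round(max_delay / bin_width))
    hist = [0] * n_bins
    for t1 in s1:
        for t2 in s2:
            shift = t2 - t1
            if shift < 0:
                continue  # a message cannot be caused by a later one
            b = int(shift / bin_width)
            if b < n_bins:
                hist[b] += 1
    return hist


# Hypothetical timestamps in milliseconds: most B→X messages follow an A→B
# message by 3 ms; the one at 70 ms is unrelated traffic.
a_to_b = [10, 20, 50, 90]
b_to_x = [13, 23, 53, 70, 93]
hist = delay_histogram(a_to_b, b_to_x, bin_width=1, max_delay=20)
peak_shift = hist.index(max(hist))  # bin with the most message pairs
```

The bin holding the spike gives the characteristic delay. Unrelated pairs spread their counts thinly over many bins, which is why a causal relationship stands out.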
Details: first step
  Choose starting node A
  Use trace to add edges from it
  Time  From  To
  0.01  A     B
  0.02  A     B
  0.04  A     C
  0.05  A     B
  [Figure: graph with edges A→B and A→C]
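The first step is a plain scan of the trace for messages sent by the starting node. A minimal sketch, assuming records are (time, from, to) tuples; `edges_from` is an illustrative name:

```python
from collections import Counter
from typing import List, Tuple

Record = Tuple[float, str, str]  # (time, from, to)


def edges_from(trace: List[Record], start: str) -> Counter:
    """Count the messages sent by `start` to each destination node."""
    return Counter(dst for _, src, dst in trace if src == start)


trace = [(0.01, "A", "B"), (0.02, "A", "B"), (0.04, "A", "C"), (0.05, "A", "B")]
first_edges = edges_from(trace, "A")  # one outgoing edge per destination seen
```

Keeping the per-edge message counts, not just the set of destinations, makes it possible to tell later whether an edge carries enough messages to be worth analyzing further.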
Continuing
  Time  From  To
  …     B     D
  …     B     E
  …     B     F
  …     B     G
  [Figure: edges A→B and A→C so far, with "??" after B; plots of (A→B)*(B→E) and (A→B)*(B→D); label: d]
How
  Time  From  To
  t1    A     B
  t2    A     B
  t3    A     B
  t4    A     B

  Time  From  To
  …
  t1+d  B     D
  …
  t2+d  B     D
  …
  t3+d  B     D
  t3+d  B     E
  …
  t4+d  B     D
Heuristic to find spikes
  threshold 1: n1 standard deviations above the mean (n1 = 2)
  threshold 2: n2 standard deviations above the mean (n2 = 1.5)
Recursing to continue
  Observations:
    1. (B→D) are not all msgs from B to D (only those caused by A)
    2. Stop recursion when too few messages left or no more spikes found
  [Figure: path A→B→D with delay d, followed by "??"]
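Observation 1 implies a filtering step before recursing: only the B→D messages attributable to A should feed the next convolution. One way to approximate this, sketched below, is to keep the B→D messages that arrive roughly d after some A→B message. The `tolerance` window and the name `caused_messages` are assumptions for illustration; the slides do not say how the subset is chosen.

```python
from bisect import bisect_left
from typing import List


def caused_messages(parents: List[float], children: List[float],
                    delay: float, tolerance: float) -> List[float]:
    """Keep child timestamps within `tolerance` of (some parent timestamp + delay)."""
    expected = sorted(t + delay for t in parents)
    kept = []
    for t in children:
        # First expected arrival that is not earlier than t - tolerance.
        i = bisect_left(expected, t - tolerance)
        if i < len(expected) and expected[i] <= t + tolerance:
            kept.append(t)
    return kept


# Hypothetical timestamps in milliseconds, with characteristic delay d = 3.
a_to_b = [10, 20, 50, 90]
b_to_d = [13, 23, 40, 53, 70, 93]
caused = caused_messages(a_to_b, b_to_d, delay=3, tolerance=1)
```

The filtered list then plays the role of the parent signal at the next level. Following observation 2, the recursion would stop once that list becomes too short or the next convolution shows no spike.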
Outline
  Problem statement & goals
  Overview of our approach
  Algorithm
  Experimental results
  Conclusions
Results: service delays
  Jeff logged all headers for two months
  Parsed 80K Received headers in 12K messages
    Received: from cceexg11.americas.cpqcorp.net... by wera.hpl.hp.com... ; Fri, 4 Apr :35:
    – Yields (timestamp, sender, receiver) trace records
  Used the convolution algorithm to
    – Reconstruct message paths
    – Find typical delays
  Note: this is NOT the most direct way to use headers
    – We made the problem harder so as to test our algorithm
trace: output
  [Figure: output of the algorithm on the trace; labels not recoverable]
Results: Petstore
  Sun’s demo application for J2EE
  Stanford’s PinPoint project provided us with traces
    – One trace has a node that is artificially slowed down
Future work
  Automate trace gathering and conversion
  Sliding-window versions of algorithms
    – Find phased behavior
    – Reduce memory usage of nesting algorithm
    – Improve speed of convolution algorithm
  Validate usefulness on more complicated systems
  What are the limits of our approach?
Conclusion
  Looking for bottlenecks in black box systems
  Use signal processing techniques to find causal paths in the network and their delays
  For more information
    – Contact us if you have multi-hop message traces!