Performance Debugging for Distributed Systems of Black Boxes. Marcos K. Aguilera, Jeffrey C. Mogul, Janet L. Wiener (HP Labs); Patrick Reynolds (Duke); Athicha Muthitacharoen (MIT).

Presentation transcript:

Performance Debugging for Distributed Systems of Black Boxes
Marcos K. Aguilera, Jeffrey C. Mogul, Janet L. Wiener (HP Labs)
Patrick Reynolds (Duke)
Athicha Muthitacharoen (MIT)
WISP November 2004

Page 2 | 20 October 2003 | Project5 - SOSP
Example multi-tier system
[Figure: clients, web servers, application servers, database servers, and an authentication server]

Motivation
Complex distributed systems are built from black-box components
These systems may have performance problems
High or erratic latency
Locating the causes of these problems is hard
We can’t always examine or modify system components
We need tools to infer where the bottlenecks are
Choose which black boxes to open

Contributions of our work
Tools to highlight which black boxes have problems
Require only passive information, such as packet traces
Infer where most of the time is spent from traces
A person can then use more invasive tools to examine those boxes
Reduce time and cost to debug complex systems
Improve quality of delivered systems

Example causal path
[Figure: causal path through the multi-tier system (clients, web servers, application servers, database servers, authentication server), annotated with 100ms]

Goals of our tools
Find high-impact causal paths through a distributed system
Causal path: series of nodes that sent/received messages
  – Each message is caused by receipt of previous message
  – Some causal paths occur many times
High impact:
  – Occurs frequently
  – Contributes significantly to overall latency
Without modifications or semantic knowledge
Report per-node latencies on causal paths

Overview of our approach
Obtain traces of messages between components
  – Ethernet packets, middleware messages, etc.
  – Collect traces as non-invasively as possible
Analyze traces using algorithms
Visualize results and highlight high-impact paths
Requires very little information: [timestamp, source, destination]

Outline
Problem statement & goals
Overview of our approach
Algorithm
Experimental results
Related work
Conclusions

The convolution algorithm: input

Time   From  To
0.01   A     B
0.02   A     B
0.04   B     D
0.05   C     F
...
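
The input described on this slide can be sketched as a list of (time, from, to) records. The following is a minimal illustration, not the authors' code: the names `TraceRecord`, `parse_trace`, and `group_by_edge` are hypothetical, and the whitespace-separated text format is an assumption based on the table shown on the slide.

```python
from collections import defaultdict
from typing import NamedTuple


class TraceRecord(NamedTuple):
    """One trace line: a message sent from `src` to `dst` at `time` (hypothetical type)."""
    time: float
    src: str
    dst: str


def parse_trace(text):
    """Parse whitespace-separated 'time from to' lines into TraceRecord objects."""
    records = []
    for line in text.strip().splitlines():
        time, src, dst = line.split()
        records.append(TraceRecord(float(time), src, dst))
    return records


def group_by_edge(records):
    """Collect the message timestamps seen on each (src, dst) edge."""
    edges = defaultdict(list)
    for record in records:
        edges[(record.src, record.dst)].append(record.time)
    return dict(edges)


# The four rows shown on the slide.
SAMPLE = """
0.01 A B
0.02 A B
0.04 B D
0.05 C F
"""

records = parse_trace(SAMPLE)
edges = group_by_edge(records)
```

Grouping timestamps per edge is a convenient starting point because the later steps work on one edge's messages at a time.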

The convolution algorithm: output
[Figure: graph of causal paths over nodes A through G]

Basic idea
Creates a “time signal” for messages from each node
Given time signals S1(t) = (A → B) and S2(t) = (B → X)
Computes the convolution of S2(t) and S1(-t), written S1 * S2
(can be computed quickly using fast Fourier transforms)
[Figure: S1(t) = (A → B msgs) plotted against time]

[Figure: plots of S1(t) = (A → B msgs), S2(t) = (B → X msgs), and S1 * S2 = conv(S2(t), S1(-t))]
Spikes suggest causality between A → B and B → X msgs
Time shift of a spike indicates its characteristic delay
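
The idea of convolving S2(t) with S1(-t) can be sketched as follows. This is a small illustration under stated assumptions, not the authors' implementation: it computes the sum directly rather than with fast Fourier transforms, the function names are hypothetical, and the millisecond timestamps are made-up sample data in which every B → X message follows an A → B message by 3 ms.

```python
def time_signal(times, quantum, length):
    """Build a time signal: the number of messages falling in each bucket of width `quantum`."""
    signal = [0] * length
    for t in times:
        index = int(round(t / quantum))
        if 0 <= index < length:
            signal[index] += 1
    return signal


def cross_correlate(s1, s2, max_shift):
    """Return C where C[d] = sum over t of s1[t] * s2[t + d], for d = 0..max_shift.

    This equals the convolution of s2(t) with s1(-t), evaluated at non-negative shifts.
    Both signals must have the same length.
    """
    n = len(s1)
    return [
        sum(s1[t] * s2[t + d] for t in range(n - d))
        for d in range(max_shift + 1)
    ]


# Made-up timestamps in milliseconds.
a_to_b = [10, 20, 50, 90]
b_to_x = [13, 23, 53, 93]

s1 = time_signal(a_to_b, quantum=1, length=100)
s2 = time_signal(b_to_x, quantum=1, length=100)
corr = cross_correlate(s1, s2, max_shift=20)

# The shift with the largest correlation is the candidate characteristic delay.
best_shift = max(range(len(corr)), key=corr.__getitem__)
```

In this sample the spike sits at a shift of 3 ms, which matches the delay built into the data. The direct sum costs time proportional to signal length times the number of shifts, which is why the slide points to fast Fourier transforms for real traces.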

Details: first step
Choose starting node A
Use trace to add edges from it

Time   From  To
0.01   A     B
0.02   A     B
0.04   A     C
0.05   A     B

[Figure: node A with edges to B and C]

Continuing

Time   From  To
…      B     D
…      B     E
…      B     F
…      B     G

[Figure: nodes A, B, C; next edges out of B marked ??; plots of (A → B)*(B → E) and (A → B)*(B → D), with delay d marked]

How

Time   From  To
t1     A     B
t2     A     B
t3     A     B
t4     A     B

Time   From  To
…
t1+d   B     D
…
t2+d   B     D
…
t3+d   B     D
t3+d   B     E
…
t4+d   B     D

Heuristic to find spikes
threshold 1: n1 stddev over mean
threshold 2: n2 stddev over mean
n1 = 2
n2 = 1.5
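
A threshold of "n stddev over mean" can be sketched as below. This is an assumption-laden illustration: the function name `find_spikes` is hypothetical, the threshold is read as mean + n × standard deviation of the correlation values, and the slide does not say how the two thresholds are combined, so the sketch simply applies each one separately to made-up data.

```python
from statistics import mean, pstdev


def find_spikes(corr, n_stddev):
    """Return (shift, value) pairs whose value exceeds mean + n_stddev * stddev."""
    threshold = mean(corr) + n_stddev * pstdev(corr)
    return [(shift, value) for shift, value in enumerate(corr) if value > threshold]


# Made-up correlation values: a strong peak at shift 3 and a weaker one at shift 13.
corr = [0] * 20
corr[3] = 10
corr[13] = 5

strict = find_spikes(corr, n_stddev=2)     # n1 = 2 from the slide
loose = find_spikes(corr, n_stddev=1.5)    # n2 = 1.5 from the slide
```

With this sample the stricter threshold keeps only the peak at shift 3, while the looser one also admits the weaker peak at shift 13.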

Recursing to continue
Observations:
1. (B → D) are not all msgs from B to D (only those caused by A)
2. Stop recursion when too few messages left or no more spikes found
[Figure: path A → B → D with delay d; next hop marked ??]
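
The two observations can be illustrated with a simple timestamp filter. This is a swapped-in simplification and not the method from the talk: it keeps a B → D message only if it follows some A → B message by roughly the spike delay d. The names `caused_messages`, `tolerance`, and `MIN_MESSAGES`, and all sample values, are hypothetical.

```python
MIN_MESSAGES = 3  # hypothetical cutoff for "too few messages left"


def caused_messages(parent_times, child_times, delay, tolerance):
    """Keep child timestamps that follow some parent message by about `delay`."""
    kept = []
    for child in child_times:
        if any(abs((child - parent) - delay) <= tolerance for parent in parent_times):
            kept.append(child)
    return kept


# Made-up timestamps in milliseconds; 37 and 70 are B -> D messages not caused by A.
a_to_b = [10, 20, 50, 90]
b_to_d = [13, 23, 37, 53, 70, 93]

caused = caused_messages(a_to_b, b_to_d, delay=3, tolerance=1)

# Observation 2: only continue from D if enough caused messages remain.
should_recurse = len(caused) >= MIN_MESSAGES
```

The filtered list would then play the role of the parent messages when looking for spikes on edges leaving D.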

Outline
Problem statement & goals
Overview of our approach
Algorithm
Experimental results
Conclusions

Results: service delays
Jeff logged all headers for two months
Parsed 80K Received headers in 12K messages
  Received: from cceexg11.americas.cpqcorp.net... by wera.hpl.hp.com... ; Fri, 4 Apr :35:
  – Yields (timestamp, sender, receiver) trace records
Used Convolution Algorithm to
  – Reconstruct message paths
  – Find typical delays
Note: this is NOT the most direct way to use headers
  – We made the problem harder so as to test our algorithm

trace: output
[Figure]

Results: Petstore
Sun’s demo application for J2EE
Stanford’s PinPoint project provided us with traces
  – One trace has a node that is artificially slowed down

Future work
Automate trace gathering and conversion
Sliding-window versions of algorithms
  – Find phased behavior
  – Reduce memory usage of nesting algorithm
  – Improve speed of convolution algorithm
Validate usefulness on more complicated systems
What are the limits of our approach?

Conclusion
Looking for bottlenecks in black box systems
Use signal processing techniques to find causal paths in the network and their delays
For more information
  – Contact us if you have multi-hop message traces!