Comprehensive Depiction of Configuration-dependent Performance Anomalies in Distributed Server Systems. Christopher Stewart, Ming Zhong, Kai Shen, and Thomas O'Neill.

Presentation transcript:

Comprehensive Depiction of Configuration-dependent Performance Anomalies in Distributed Server Systems
Christopher Stewart, Ming Zhong, Kai Shen, and Thomas O'Neill
University of Rochester
Presented at the 2nd USENIX Workshop on Hot Topics in System Dependability

2 Context
Distributed server systems
- Example: J2EE application servers
- Many system configurations: switches that control runtime execution
- Wide range of workload conditions: exogenous demands for system resources

Example J2EE runtime conditions:
- System configurations: concurrency limit, component placement
- Workload conditions: request rate

3 Presumptions
Performance expectations based on knowledge of system design are reasonable
- Lead developers: high-level algorithms
- Administrators: day-to-day experience

Example expectation, Little's Law: the average number of requests in the system equals the average arrival rate times the average time a request spends in the system.
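As a rough, hypothetical illustration (the numbers below are assumed, not from the talk), Little's Law yields an expectation that can be checked against a measurement; a large gap flags a candidate anomaly:

```python
# Hypothetical numbers for illustration only.
arrival_rate = 50.0       # requests per second (assumed)
residence_time = 0.2      # average seconds a request spends in the system (assumed)

expected_in_system = arrival_rate * residence_time   # Little's Law: L = lambda * W = 10

observed_in_system = 14.0                            # measured average concurrency (assumed)
relative_error = abs(observed_in_system - expected_in_system) / expected_in_system
print(f"expected={expected_in_system:.1f} observed={observed_in_system} "
      f"error={relative_error:.0%}")                 # 40% error -> anomalous if above threshold
```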

4 Problem Statement
Dependable performance is important for system management
- QoS scheduling
- SLA negotiations
Performance anomalies (runtime conditions in which performance falls below expectations) are not uncommon
[Figure: throughput anomalies, actual vs. expected throughput across component placement strategies, showing real performance anomalies]

5 Goals
Previous work: anomaly characterization can aid the debugging process and guide online avoidance
- [AGU-SOSP99, QUI-SOSP05, CHE-NSDI04, COH-SOSP05, KEL-WORLDS05]
- Focused on specific runtime conditions (e.g., those encountered during a particular execution)
We wish to depict all anomalous conditions. Comprehensive depictions can:
- Aid the debugging of production systems before distribution
- Enable preemptive avoidance of anomalies in live systems

6 Approach
Our depictions are derived in a three-step process:
1. Generate performance expectations by building a comprehensive whole-system performance model
2. Search for anomalous runtime conditions
3. Extrapolate a comprehensive anomaly depiction
Challenges:
- The model must consider a wide range of system configurations
- A systematic method is needed to determine the anomaly error threshold
- An appropriate method is needed to detect correlations between runtime conditions and anomalies

7 Outline
- Performance expectations for a wide range of configuration settings
- Determination of the anomaly error threshold
- Decision-tree based anomaly depiction
- Preliminary results
- Discussion / conclusion

8 Comprehensive Performance Expectations
Modeling the configuration space is hard
- Configurations have complex effects on performance
- Considering a wide range of configurations increases model complexity
Our modeling methodology
- Build performance models as a hierarchy of sub-models
- Sub-models can be independently adjusted to consider new system configurations

9 Rules for Our Sub-Model Hierarchies
- The output of each sub-model is a workload property
  - Workload property: an internal demand for resources (e.g., CPU consumption)
- The inputs to each sub-model are either workload properties or system configuration settings
- Sub-models at the highest level produce performance expectations
- Workload properties at the lowest level, the canonical workload properties, can be measured independently of system configurations
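A minimal sketch of these rules, assuming made-up sub-models and parameters (the function names and numbers are illustrative, not the authors' implementation):

```python
# Canonical workload properties (per-request CPU seconds, component call pairs)
# are measured once, independent of configuration; configuration settings
# (component placement, concurrency limit) feed into the sub-models.

def remote_invocations(call_pairs, placement):
    """Derived workload property: component calls that cross machine boundaries."""
    return sum(1 for src, dst in call_pairs if placement[src] != placement[dst])

def cpu_per_request(canonical_cpu_sec, remote_calls, marshal_cost_sec=0.001):
    """Derived workload property: CPU demand per request, including RPC overhead."""
    return canonical_cpu_sec + remote_calls * marshal_cost_sec

def expected_throughput(cpu_sec, num_cpus, concurrency_limit):
    """Top-level sub-model: a rough CPU-bound throughput expectation."""
    return min(num_cpus, concurrency_limit) / cpu_sec

call_pairs = [("A", "B"), ("B", "C")]                   # assumed per-request call graph
placement = {"A": "node1", "B": "node1", "C": "node2"}  # a system configuration
cpu = cpu_per_request(0.004, remote_invocations(call_pairs, placement))
print(f"expected throughput: {expected_throughput(cpu, num_cpus=8, concurrency_limit=40):.0f} req/s")
```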

10 A Hierarchy of Sub-Models
We leverage the workload properties of earlier work [STE-NSDI05]
Advantages
- Sub-models have meaning
Limitations
- Configuration dependencies may make sub-models complex
[Figure: hierarchy of sub-models for J2EE application servers]

11 Outline
- Performance expectations for a wide range of configuration settings
- Determination of the anomaly error threshold
- Decision-tree based anomaly depiction
- Preliminary results
- Discussion / conclusion

12 Determination of the Anomaly Error Threshold
Sometimes slight discrepancies between actual and expected performance should be tolerated. The appropriate leniency depends on the end use of the depiction:
- For online avoidance, focus on error magnitude
  - Large errors may induce poor management decisions
  - Use sensitivity analysis of the system management functions
- For debugging, focus on targeted performance bugs
  - Noisy depictions will mislead debuggers
  - Group anomalies with the same root cause

13 Anomaly Error Threshold for Debugging
Observation: anomaly manifestations due to the same cause are more likely to share similar error magnitudes than unrelated anomaly manifestations.
Root causes can therefore be grouped by clustering runtime conditions on their expectation error.

14 Anomaly Error Threshold for Debugging
Knee points mark cluster boundaries
Knee-point selection
- A higher-magnitude knee emphasizes large anomalies
- A lower-magnitude knee captures more anomalies
Validation: knee points disappear when the underlying problems are resolved
[Figure: expectation-error clustering; response-time and throughput errors over sample runtime conditions sorted by expectation error, with the knee marked]
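One simple way to pick such a knee is to take the largest jump in the sorted expectation errors; this sketch is illustrative and assumes made-up error values, not the procedure or data from the talk:

```python
import numpy as np

def find_knee_threshold(errors):
    """Sort expectation errors and return the error value just past the largest gap.

    Runtime conditions whose error exceeds this threshold form the cluster of
    large (anomalous) deviations; the rest are treated as normal modeling noise.
    """
    e = np.sort(np.asarray(errors, dtype=float))
    gaps = np.diff(e)
    knee_index = int(np.argmax(gaps)) + 1   # first condition beyond the largest gap
    return e[knee_index]

# Assumed expectation errors (fraction below expected throughput) for sampled conditions.
errors = [0.02, 0.03, 0.05, 0.06, 0.31, 0.35, 0.38]
print(f"anomaly error threshold ~ {find_knee_threshold(errors):.0%}")   # ~31%
```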

15 Outline
- Performance expectations for a wide range of configuration settings
- Determination of the anomaly error threshold
- Decision-tree based anomaly depiction
- Preliminary results
- Discussion / conclusion

16 Decision-Tree Based Anomaly Depictions
Decision trees correlate anomalies with problematic runtime conditions
- Interpretable, unlike neural nets, SVMs, and perceptrons
- No prior knowledge required, unlike Bayesian trees [COH-OSDI04]
- Versatile: e.g., if a=0: anomaly; if a=1, b=0: normal; if a=1, b=1: anomaly
White-box usage for debugging hints: prefer shorter, easily interpreted trees
Black-box usage for avoidance: prefer larger, more precise trees
[Figure: example decision tree over conditions a, b, c, with leaves such as "anomaly, 80% probability" and "normal, 70% probability"]
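A hedged sketch of how such a depiction tree could be fit with an off-the-shelf learner; the features, data, and labels are invented for illustration and are not the authors' dataset or tooling:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Each row is a sampled runtime condition; the label records whether measured
# performance fell below the expectation by more than the anomaly error threshold.
features = ["request_rate", "concurrency_limit", "components_on_node2"]
conditions = [
    [100, 20, 1], [200, 20, 1], [400, 40, 1],
    [100, 20, 3], [200, 40, 3], [400, 40, 3],
]
labels = ["normal", "normal", "normal", "anomaly", "anomaly", "anomaly"]

# A shallow tree suits the white-box (debugging-hint) usage; removing the depth
# cap would give the larger, more precise tree preferred for black-box avoidance.
tree = DecisionTreeClassifier(max_depth=2).fit(conditions, labels)
print(export_text(tree, feature_names=features))
```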

17 Design Recap
We wish to depict performance anomalies across a wide range of system configurations and workload conditions
1. Derive performance expectations via a hierarchy of sub-models
2. Search for anomalous runtime conditions using a carefully selected anomaly error threshold
3. Use decision trees to extrapolate a comprehensive anomaly depiction

18 Outline
- Performance expectations for a wide range of configuration settings
- Determination of the anomaly error threshold
- Decision-tree based anomaly depiction
- Preliminary results
- Discussion / conclusion

19 Depiction-Assisted Debugging
System: JBoss
- 8 runtime conditions (including application type)
- 4-machine cluster, 2.66 GHz CPUs
Found and fixed 3 performance anomalies; one is shown in detail below
[Figure: depiction of a real performance anomaly, a misunderstood J2EE configuration that manifests when multiple components are placed on node 2]

20 Discovered Anomalies
1. A misunderstood J2EE configuration caused remote invocations to unintentionally execute locally
2. A mishandled out-of-memory error under high concurrency caused the Tomcat 5.0 servlet container to drop requests
3. A circular dependency in the component invocation sequences caused connection timeouts under certain component placement strategies

21 Outline
- Performance expectations for a wide range of configuration settings
- Determination of the anomaly error threshold
- Decision-tree based anomaly depiction
- Preliminary results
- Discussion / conclusion

22 Discussion
Limitations
- Cannot detect non-deterministic anomalies
- Is a given discrepancy model inaccuracy or a performance anomaly? This requires manual investigation, but the model is much less complex than the system
- Debugging is still a manual process
Future work
- Short term: investigate more system configurations
- Short term: depict anomalies in more systems
- Long term: more systematic depiction-assisted debugging methods

23 Take Away
Comprehensive depictions of performance anomalies over a wide range of runtime conditions can aid debugging and avoidance
We have designed and implemented an approach to:
- Model a wide range of system configurations
- Determine anomalous conditions
- Depict the anomalies in an easy-to-interpret fashion
We have already used our approach to find 3 performance bugs