Systems Support for End-to-End Performance Management Sandip Agarwala PhD Advisor: Karsten Schwan College of Computing Georgia Tech.

Complexity, complexity, complexity…
[Figure source: Gartner (December 2005)]

Reasons for Complexity
Application diversity
Interdependencies
Heterogeneous components
– Too many different technologies and platforms
Too few “hints” from the system to the administrators
– Legacy issues; application-specific solutions
Insufficient information about the system to drive self-management
→ Lack of automation

Online System Management
[Figure: cycle of Control, Execute, Monitor, Analyze]
Workload scheduling
Capacity and SLA management
Design evaluation and tuning
Bottleneck detection
Resource provisioning, accounting, etc.
Proposed approach: Service Path

Service Path
[Figure: Internet, proxy server, front-end web servers, middle-tier servlet server, application logic (EJBs, etc.), database back-end]
System abstractions that describe the dynamic dependencies between the different distributed application components
Service class: application-level request class, e.g. SLA class

Service Path Characteristics
End-to-end analysis
Online
Non-intrusive
Application-generic

Outline
Background
Motivation
Service path
– Discovery with E2EProf
– Refinement with SysProf
– Automated SLA enforcement
Related work
Future plans

E2EProf
Black-box approach
Monitor network packet traces (source, destination, timestamps)
Model traces as per-edge time-series signals or density functions
Correlate per-edge time-series signals
[Figure: topology with nodes A, X, B, C, D; time-series signals for edges (A → B) and (B → C) on a time axis, labeled D1 and D2]

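Basic Approach
Compute cross-correlation (D1, D2)
Spike → causality
Spike’s position → delay
[Figure: topology with nodes A, X, B, C, D; cross-correlation plots for the edge pairs (A → B), (B → C) and (A → B), (B → D), annotated “Delay at B” and “No spike”]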
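The cross-correlation idea on this slide can be illustrated with a small, self-contained sketch. It is not the E2EProf implementation: the bin width, the synthetic traces, the 7 ms forwarding delay, the spike threshold, and all function names (`to_time_series`, `cross_correlation`, `find_spike`) are assumptions made for illustration only.

```python
import random
import statistics


def to_time_series(timestamps, bin_width, n_bins):
    """Count packets per time bin to obtain a per-edge signal."""
    series = [0] * n_bins
    for t in timestamps:
        idx = int(t // bin_width)
        if 0 <= idx < n_bins:
            series[idx] += 1
    return series


def cross_correlation(x, y, max_lag):
    """Return c[lag] = sum over t of x'[t] * y'[t + lag] for lag = 0..max_lag,
    where x' and y' are the mean-removed signals."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    xs = [v - mean_x for v in x]
    ys = [v - mean_y for v in y]
    n = min(len(xs), len(ys))
    return [
        sum(xs[t] * ys[t + lag] for t in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def find_spike(corr, threshold=6.0):
    """Return the lag of the largest value if it exceeds the mean of the
    remaining values by `threshold` standard deviations, else None.
    The threshold is an arbitrary choice for this sketch."""
    peak = max(range(len(corr)), key=lambda lag: corr[lag])
    rest = corr[:peak] + corr[peak + 1:]
    mean, std = statistics.fmean(rest), statistics.pstdev(rest)
    if std > 0 and corr[peak] > mean + threshold * std:
        return peak
    return None


def demo():
    """Synthetic example: B forwards A's packets to C after 7 ms,
    while traffic on B -> D is unrelated to A -> B."""
    rng = random.Random(0)
    n_bins, bin_width, max_lag = 2000, 1, 50            # 1 ms bins (assumed)
    ab = [rng.randrange(0, 1900) for _ in range(300)]   # packets on A -> B
    bc = [t + 7 for t in ab]                            # forwarded on B -> C
    bd = [rng.randrange(0, 1900) for _ in range(300)]   # independent B -> D
    s_ab = to_time_series(ab, bin_width, n_bins)
    s_bc = to_time_series(bc, bin_width, n_bins)
    s_bd = to_time_series(bd, bin_width, n_bins)
    delay_bc = find_spike(cross_correlation(s_ab, s_bc, max_lag))
    delay_bd = find_spike(cross_correlation(s_ab, s_bd, max_lag))
    return delay_bc, delay_bd


if __name__ == "__main__":
    print(demo())
```

In this synthetic setup the (A → B), (B → C) pair shows a spike whose position matches the injected delay at B, while the (A → B), (B → D) pair is expected to show no spike. Real traces would need a threshold chosen from the data rather than the fixed value used here.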

Evaluation with 4-tier RUBiS
[Figure: testbed with clients, Apache Web Server, Tomcat Servers 1 and 2, EJB Servers 1 and 2, and MySQL Server; request types “comment” and “bidding”; labels “CPU bound” and “I/O bound”]

Service Path Detection in RUBiS
[Figure: detected service paths under static server assignment and under a round-robin load balancer, with the highest-delay nodes marked]

Change Detection in RUBiS
[Figure: plot annotated “Injected delay”]

Delta Air Lines’ Application
Revenue Pipeline
Total traffic: 1.34 million / day (56k / hour)
[Figure: TACS IN & TACS OUT, XIN & XOUT, APEX IN & APEX OUT, Error/Warning (Tivoli) logs]

Delta Air Lines’ Application
[Figure: latency (sec) versus time of the day; diagram with client requests, TACS, and servers S1, S2, S3, S7, S8; annotation “Huge request burst”]

Outline
Background
Motivation
Service path
– Discovery with E2EProf
– Refinement with SysProf
– Automated SLA enforcement
Related work
Future plans

Beyond dependency and latency…
[Figure: service path graph with clients C1, C2 and servers S1 to S6]
Solution: zoom into the service path with SysProf
No application hints or instrumentation
Monitor resource usage on a per-class basis

SysProf Methodology
[Figure: request path from client to client through the kernel (eth driver, network stack, system call layer, FS/VM/etc., BDD, scheduler) and user-level applications A1, A2, AN; instrumentation points: Init CID, context switches, net softirq, disk I/O, system call parameters, PID, app functions]
Track request context
– Work done for processing a request class
– May span user level or kernel level
– Executes in more than one context (e.g. processes, threads, softirqs)
– Happens in a system-visible event (e.g. system calls)

Class ID Propagation
[Figure: request path from client to client across the front tier, middle tier, and end tier, at user and kernel level; labels: Init CID, Process → CID, Msg → CID, Packet → CID, Inherits CID]
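A toy model may help make class ID propagation concrete. The sketch below is not SysProf code: it runs entirely in user space, and the `ClassTracker` class, its method names, the PIDs, the class names, and the CPU costs are all hypothetical. It only mirrors the idea of tagging a request with a CID at the front tier, letting messages carry that CID, and having the receiving process inherit it so that resource usage can be charged per class.

```python
from collections import defaultdict


class ClassTracker:
    """Toy model of class-ID (CID) propagation and per-class accounting."""

    def __init__(self):
        self.process_cid = {}            # PID -> CID the process is serving
        self.cpu_us = defaultdict(int)   # CID -> CPU time in microseconds

    def init_cid(self, pid, cid):
        """Front tier: assign a class ID when a client request arrives."""
        self.process_cid[pid] = cid

    def send(self, pid):
        """An outgoing message carries the sender's current CID (or None)."""
        return self.process_cid.get(pid)

    def receive(self, pid, msg_cid):
        """The receiving process inherits the CID carried by the message."""
        if msg_cid is not None:
            self.process_cid[pid] = msg_cid

    def charge_cpu(self, pid, micros):
        """Attribute CPU time to the class the process is serving."""
        cid = self.process_cid.get(pid)
        if cid is not None:
            self.cpu_us[cid] += micros


def demo():
    tracker = ClassTracker()
    web, app, db = 100, 200, 300   # hypothetical PIDs, one per tier
    workload = (("gold", (2000, 5000, 10000)), ("silver", (1000, 3000, 4000)))
    for cid, costs in workload:
        tracker.init_cid(web, cid)               # front tier tags the request
        tracker.charge_cpu(web, costs[0])
        tracker.receive(app, tracker.send(web))  # middle tier inherits the CID
        tracker.charge_cpu(app, costs[1])
        tracker.receive(db, tracker.send(app))   # end tier inherits the CID
        tracker.charge_cpu(db, costs[2])
    return dict(tracker.cpu_us)


if __name__ == "__main__":
    print(demo())
```

The same bookkeeping could be extended to other resources (disk I/O, network bytes) by adding one counter per resource; the sketch tracks only CPU time to stay short.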

Application of SysProf
Resource accounting
Utility billing
Bottleneck detection
Capacity estimation
Root-cause analysis
Black-box SLA management

Resource-Aware Adaptive Control
[Figure: Tomcat Servers 1 and 2, EJB Servers 1 and 2, and MySQL Server; request classes 1, 2, and 3]
Cluster workloads contending for the same resources
Separate queue/controller for each cluster
Front-end controller + scheduler

Resource-Aware Adaptive Control
[Figure: results with SysProf and with no SysProf; annotation “Capacity = 80 req/s per server”]

Summary
Service Path
– System abstractions to represent dependencies and request path
E2EProf and Pathmap
– Dependency and latency analysis
SysProf
– Service-based resource analysis
Aid human operator and automate end-to-end performance management

Thank You! Questions?

Extra Slides

Pathmap Optimizations
[Figure: packet timestamp trace, time-series signal or density function, and cross-correlation series over a time axis; labels: bursty traffic, sliding window (W), run-length compression, upper bound on latency]
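Two of the techniques named on this slide, run-length compression and a sliding window, can be sketched generically. The slide does not say how Pathmap applies them, so the code below is only a plain illustration of each technique on a sparse time series; the function names and the sample data are assumptions.

```python
def rle_encode(series):
    """Run-length encode a series into (value, run_length) pairs.
    Sparse signals with long runs of zeros compress well this way."""
    runs = []
    for value in series:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [(value, length) for value, length in runs]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original series."""
    series = []
    for value, length in runs:
        series.extend([value] * length)
    return series


def sliding_windows(series, width, step):
    """Yield (start, window) pairs so that analysis can run on one
    window of `width` samples at a time instead of the whole trace."""
    for start in range(0, len(series) - width + 1, step):
        yield start, series[start:start + width]


if __name__ == "__main__":
    signal = [0, 0, 0, 2, 0, 0, 0, 0, 1, 1, 0]
    encoded = rle_encode(signal)
    print(encoded)
    print(rle_decode(encoded) == signal)
    print(list(sliding_windows(signal, 4, 4)))
```

If an upper bound on latency is known, it could also cap the largest lag examined during cross-correlation, which limits the work done per window; that use is an assumption here, not something the slide states.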