Performance Debugging in Data Centers: Doing More with Less Prashant Shenoy, UMass Amherst. Joint work with Emmanuel Cecchet, Maitreya Natu, Vaishali Sadaphal, and Harrick Vin

Data Centers Today Large number of computing, communication, and storage systems Wide range of applications and services Rapidly increasing scale and complexity Limited understanding of, and control over, their operation

Equity Trade Plant Receives and processes 4-6 million equity orders (trade requests) and millions of market updates (news, stock-tick updates, etc.) The IT infrastructure for processing orders and updates consists of thousands of application components running on hundreds of servers A portion of the data center operated by an investment bank for processing trading orders; nodes represent application processes; edges indicate the flow of requests

Performance Debugging in Data Centers Low end-to-end latency for processing each request is a critical business requirement Increase in latency can be due to – Dynamic changes in workload – Slowing down of a processing node due to hardware or software errors Performance debugging involves detecting and localizing performance faults Longer localization time leads to greater business impact

Performance Debugging in Data Centers Four key steps – Build a model of normal operations of a system – Place probes to monitor the operational system – Detect performance faults in near-real-time – Localize faults by combining the knowledge derived from model and monitored data Effectiveness of these steps depends on the number and type of data collection probes available in the system. However, system administrators are reluctant to introduce probes into production environment, especially if the probes are intrusive (and can modify the system behavior)

Basic Practical Requirement Minimize the amount of instrumentation to gather real-time operational statistics Minimize the intrusiveness of the data gathering methods Much of the prior research ignores this requirement and demands: Significant instrumentation (e.g., requiring probes to be placed at each process/server) Significant intrusiveness (e.g., requiring each request to carry a request-ID to track request flows)

Characterizing State-of-the-art

Basic Practical Requirement Minimize the amount of instrumentation needed to gather real-time operational statistics Minimize the intrusiveness of the data-gathering methods Much of the prior research ignores this requirement and demands: Significant instrumentation (e.g., requiring probes to be placed at each process/server) Significant intrusiveness (e.g., requiring each request to carry a request-ID to track request flows) For automated performance debugging to become practical and effective, one needs to develop techniques that are more effective with less instrumentation and intrusiveness We raise several issues and challenges in designing these techniques

Instrumentation Vs. Intrusiveness The extent of instrumentation and the amount of intrusiveness complement each other – Example: collection of request-component dependencies – High instrumentation, low intrusiveness: each node monitors request-arrival events – Low instrumentation, high intrusiveness: each request stores information about the components it passes through Observation 3: It is possible to trade off the level of instrumentation against the level of intrusiveness needed by a technique Production systems place significant restrictions on which nodes can be instrumented as well as on the level of intrusiveness permitted Is it possible to achieve effective performance debugging using low instrumentation and low intrusiveness?

Doing More With Less: An Example

A Production Data Center: Characteristics and Constraints 469 nodes – Each node represents an application component that processes trading orders and forwards them to a downstream node 2,072 links 39,567 unique paths SLO: end-to-end latency for processing each equity trade should not exceed 7-10 ms The environment imposes severe restrictions on the permitted instrumentation and intrusiveness: no instrumentation of intermediate nodes purely for performance debugging; SLA compliance is monitored at exit nodes by time-stamping request entry and exit Available information: the per-hop graph and SLO-compliance information at the monitors at exit nodes; no additional information is available

Problem Definition Given: – System graph depicting application-component interactions – Instrumentation at the entry and exit nodes that timestamps requests Determine: – The root cause of SLO violations when one or more exit nodes observe such violations

Straw-man Approaches Signature-based localization Online signature matching via graph coloring

Signature-Based Localization Node signature: – Set of all monitors that are reachable from the node – A K-bit string where each bit represents the reachability of one monitor In the presence of a failure, some monitors will observe the SLO violation, thus creating a violation signature The fault-localization task is to determine the node that could have generated the violation signature
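The reachability signatures above can be sketched in a few lines of Python; the toy topology and component names below are illustrative, not taken from the actual trade plant:

```python
# Toy plant fragment (names are illustrative): edges point downstream.
GRAPH = {
    "gateway": ["parser"],
    "parser": ["risk", "router"],
    "risk": ["exit_a"],
    "router": ["exit_a", "exit_b"],
    "exit_a": [],
    "exit_b": [],
}
MONITORS = ["exit_a", "exit_b"]  # probes sit only at exit nodes

def signature(node):
    """K-bit string: bit i is '1' iff monitor i is reachable from node."""
    seen, stack = set(), [node]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(GRAPH[n])
    return "".join("1" if m in seen else "0" for m in MONITORS)

sigs = {n: signature(n) for n in GRAPH}
```

Here a fault in "risk" alarms only exit_a (signature "10"), while "gateway" and "parser" collide on "11": with probes only at exit nodes, some nodes are inherently indistinguishable, which is exactly the signature-collision issue quantified on the next slide.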

Signature-Based Localization Applying signature-based localization to the equity trade plant system: Monitors on 112 exit nodes generate 112-bit signatures 137 unique signatures for the 357 non-exit nodes (38%) 71 unique signatures for the 121 source nodes (58%)

Online Signature Matching A graph-coloring technique: on observing an SLA violation, mark all nodes that could have generated the violation signature as suspects; clear any suspect node that lies on the path of a request with a valid (SLA-compliant) execution; the suspects that remain are the candidate root causes of the SLA violation
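The mark-and-clear step can be sketched as follows; the topology, reachability sets, and the assumption that a node serving a compliant request is cleared are all illustrative simplifications:

```python
# reach[n] is the set of exit monitors reachable from node n (toy topology).
reach = {
    "gateway": {"exit_a", "exit_b"},
    "parser":  {"exit_a", "exit_b"},
    "risk":    {"exit_a"},
    "router":  {"exit_a", "exit_b"},
}

def localize(alarmed, compliant_paths):
    """alarmed: monitors reporting SLO violations.
    compliant_paths: node paths of requests known to have met their SLA."""
    # Mark: a node is suspect if it can reach every alarmed monitor.
    suspects = {n for n, monitors in reach.items() if monitors >= alarmed}
    # Clear: a node that just served a compliant request is unlikely faulty.
    for path in compliant_paths:
        suspects -= set(path)
    return suspects
```

If exit_a alarms and a request through gateway, parser, and router completed within its SLA, only "risk" survives the clearing step as the root-cause candidate.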

Opportunities and Challenges

Deriving a System Model Objective: – Automatic generation and maintenance of a system model: real production systems are too large and complex to derive a model manually Challenges: – Need for reasonably low instrumentation and intrusiveness – Several low-cost mechanisms can be considered here: Network packet sniffing to derive the component communication pattern Examining application logs – to derive the component communication pattern – to derive request flows
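One of the low-cost mechanisms above, mining application logs for the per-hop graph, might look like the following sketch; the log format and component names are hypothetical, not a format used by the trade plant:

```python
import re
from collections import defaultdict

# Hypothetical log format: "<timestamp> <component> forward <downstream>"
LOG = """\
1001 gateway forward parser
1002 parser forward risk
1003 parser forward router
1004 router forward exit_a
"""

edges = defaultdict(set)
for line in LOG.splitlines():
    m = re.match(r"\d+ (\w+) forward (\w+)", line)
    if m:
        edges[m.group(1)].add(m.group(2))
# edges now holds the per-hop component-interaction graph
```

The same pass over request-scoped log entries (if the application emits them) could recover request flows rather than just pairwise interactions.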

Monitor Placement Objective: – Place monitors at suitable locations to measure end-to-end performance metrics Challenges: – Deployment of monitors involves instrumentation overhead Need to minimize the number of monitors – Tradeoff between the number of monitors and the accuracy of fault detection and localization A smaller number of monitors increases the chance of signature collisions

Monitor Placement – The structure of the graph affects the distribution of signatures across nodes In the ideal case, n unique signatures can be generated using log(n) monitors (Figure: an example graph highlighting groups of nodes that share the same signature)
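The log(n) ideal can be made concrete with a toy reachability map: two well-chosen monitors give four nodes distinct signatures, while a poor choice causes collisions (all node and exit names below are illustrative):

```python
from itertools import combinations

reach = {  # node -> set of exit nodes reachable from it (toy example)
    "a": {"x", "y"}, "b": {"x"}, "c": {"y", "z"}, "d": {"z"},
}
exits = ["x", "y", "z"]

def distinct_signatures(monitors):
    """Number of distinct signatures the nodes produce for these monitors."""
    return len({tuple(e in reach[n] for e in monitors) for n in reach})

# Exhaustive search over monitor pairs: pick the pair with fewest collisions.
best = max(combinations(exits, 2), key=distinct_signatures)
```

Monitoring {x, y} yields four distinct signatures, i.e., log2(4) = 2 monitors distinguish all 4 nodes (the ideal case), whereas {x, z} yields only two signatures, leaving pairs of nodes indistinguishable.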

Real-Time Failure Detection Objective: – Quick and accurate detection of the presence of failures based on observations at the monitor nodes Challenges: – Differentiating between the effects of workload changes and failures – Dealing with scenarios where a node failure affects only a few of the requests passing through the node – Transient failures
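One simple way to hedge against transients is to alarm only on a sustained fraction of SLO breaches within a sliding window; a minimal sketch, with thresholds that are illustrative rather than tuned for the trade plant:

```python
from collections import deque

class SLODetector:
    """Alarm when >= frac of the last `window` requests breach the SLO."""
    def __init__(self, slo_ms=10.0, window=100, frac=0.3):
        self.slo_ms, self.frac = slo_ms, frac
        self.lat = deque(maxlen=window)

    def observe(self, latency_ms):
        self.lat.append(latency_ms)
        if len(self.lat) < self.lat.maxlen:   # wait for a full window
            return False
        breaches = sum(l > self.slo_ms for l in self.lat)
        return breaches / len(self.lat) >= self.frac

d = SLODetector(window=10, frac=0.3)
# One transient 25 ms spike does not alarm...
healthy = [d.observe(x) for x in [5, 6, 25, 5, 6, 5, 6, 5, 6, 5]]
# ...but a sustained slowdown does.
sustained = [d.observe(x) for x in [20, 20, 20, 20, 20]]
```

Distinguishing a genuine fault from a workload surge would additionally require workload counters at the monitors; this sketch only suppresses transient spikes.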

Fault Localization Objective: – Identification of the root cause of the problem after detecting a failure at one or more monitor nodes (SLO-violation signature) Challenges: – Presence of multiple failures leads to a composite signature – Edges from the failed node to the monitors are traversed in a non-uniform manner, leading to a partial signature – Transient failures – Inherent non-determinism in real systems (e.g., presence of load balancers)

Conclusions Detecting and localizing performance faults in data centers has become a pressing need and a challenge Performance debugging can become practical and effective only if it requires low levels of instrumentation and intrusiveness We proposed straw-man approaches for performance debugging and presented issues and challenges in building practical and effective solutions

Instrumentation and Intrusiveness

Instrumentation for Failure Detection Observation 1: The intrusiveness of the instrumentation is a direct function of the performance metric of interest – End-to-end latency: difference of the timestamps of the arrival and departure of each request High intrusiveness – Throughput: number of requests departing the system within a defined interval Low intrusiveness

Instrumentation for Fault Localization Simple solution: Measure performance metrics and resource utilization at all servers – High instrumentation – High overhead (monitoring and data management) Sophisticated solutions: Collect the operational semantics of the system (e.g., request-component dependencies) – Low instrumentation (not every node needs to be instrumented) – High intrusiveness (modifications at the system, middleware, or application level)

Instrumentation for Fault Localization Collecting different kinds of system information requires different levels of intrusiveness: Per-hop graph indicating component interactions: simple network sniffing Derivation of the flow of requests: application-aware monitoring (e.g., by inserting a transaction-id into each request)

Characterizing State-of-the-art Observation 2: Most techniques require high instrumentation or high intrusiveness or both