CLUE: SYSTEM TRACE ANALYTICS FOR CLOUD SERVICE PERFORMANCE DIAGNOSIS Hui Zhang 1, Junghwan Rhee 1, Nipun Arora 1, Sahan Gamage 2, Guofei Jiang 1, Kenji.

Presentation transcript:

CLUE: SYSTEM TRACE ANALYTICS FOR CLOUD SERVICE PERFORMANCE DIAGNOSIS
Hui Zhang (1), Junghwan Rhee (1), Nipun Arora (1), Sahan Gamage (2), Guofei Jiang (1), Kenji Yoshihira (1), Dongyan Xu

Cloud Service Performance Diagnosis
- Era of cloud computing: many vendors are providing cloud services.
- Our focus: how to diagnose performance problems of cloud service systems?

Background: Kernel Event-driven System Monitoring
- Kernel events represent an application's interaction with the host system.
  - Well-defined.
  - Independent of applications.
- An application performance anomaly may be associated with unusual kernel events.
- Localizing unusual events and making them comprehensible is an important step for performance diagnosis of cloud systems.
[Figure: Application, Libraries and Kernel layers on the Cloud Platform; Traces]

Research Challenges
- Massive traces in distributed systems
  - Thousands of processes, millions of kernel events in minute periods.
- Limited application information
  - Common event types for all processes.
  - Limited information for differentiating application behaviors.
- Tradeoff between run-time tracing overhead and diagnosis capability.
- Demand for a fast analytic tool for performance diagnosis using massive trace events.

Motivation Example
- Performance problem in an Internet gateway transaction application
  - Unexpectedly low transaction throughput in the deployment on an HP-UX high-end server with 16 cores.
- Manual problem diagnosis
  - Found nondeterministic scheduling delays.
  - Huge manual effort was needed to find the symptoms.
- Research question
  - How to describe and locate such symptoms in massive OS kernel events?
[Figure: visualized process activities. Many processes are forked from a common parent; children show idle time without execution.]

Overview of CLUE
- CLUE is a trace analytic tool for cloud service performance diagnosis using OS kernel event traces.
  - Event sketch modeling on massive kernel event traces.
  - Mining and performance analysis based on event sketches.
[Figure: Tracing, Analytics]

Event Sketch Modeling
- Extract event sketches: groups of kernel event sequences that have a causality relationship.
- Explicitly closed event slices
  - Event sequences formed on the basis of request-reply communication patterns.
- Implicitly closed event slices
  - Event sequences formed on the basis of general producer/consumer communication patterns such as IPCs.
- Explicitly and implicitly closed event slices are used to understand the behaviors of multi-stage services.
[Figure: Service Model]

Event Sketch Modeling
[Figure: Traces (httpd, java, mysql) -> Event Slicing (Markers) -> Event Slice Stitching (Causality Relationship) -> Event Sketches]

Kernel Event Record Definition
A kernel event is a 6-tuple record:
- Owner ID: the ID of the event owner (e.g., a process X in host Y).
- Time begin: the time when this kernel event starts.
- Time end: the time when this kernel event ends.
- CPU ID: the ID of the CPU processor/core where this event occurs.
- Event type: the kernel event type.
- Event data: the extra information associated with kernel event types (e.g., parameters).
[Figure: trace example from an Apache httpd server, with columns Owner ID, Time begin, Time end, CPU ID, Event type, Event data]
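The 6-tuple record above can be sketched as a small data type. This is only an illustration: the field names follow the slide, but the comma-separated text layout, the owner ID format, and the helper name parse_event are assumptions, not the format CLUE actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelEvent:
    """One kernel event, mirroring the 6-tuple record on the slide."""
    owner_id: str     # event owner, e.g. process X on host Y (format assumed)
    time_begin: int   # time when the event starts
    time_end: int     # time when the event ends
    cpu_id: int       # CPU/core on which the event occurred
    event_type: str   # kernel event type
    event_data: str   # extra, type-specific information (e.g. parameters)

    @property
    def duration(self) -> int:
        # Elapsed time covered by this event, in the trace's time unit.
        return self.time_end - self.time_begin


def parse_event(line: str) -> KernelEvent:
    """Parse a hypothetical comma-separated trace line into a KernelEvent."""
    owner, t0, t1, cpu, etype, data = [f.strip() for f in line.split(",", 5)]
    return KernelEvent(owner, int(t0), int(t1), int(cpu), etype, data)


# Hypothetical trace line; the values are made up for illustration.
event = parse_event("hostY:httpd:1234, 33324, 33390, 2, syscall, brk")
print(event.event_type, event.duration)
```

Keeping the record immutable (frozen=True) suits trace data, which is read and grouped many times but never modified after collection.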

Marking Event Definition
An event slice mark is a 4-tuple record:
- Begin event type: the event type that the first event of an event slice must exactly match.
- End event type: the event type that the last event of an event slice must exactly match.
- Owner filter: the owner ID that the first and last events of an event slice must (partially or exactly) match.
- Event data filter: the event data that the first and last events of an event slice must (partially or exactly) match.
[Figure: markers for implicitly closed event slices; markers for explicitly closed event slices]
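A minimal sketch of how such a 4-tuple mark could drive event slicing. It assumes a time-ordered trace from a single owner, treats "partial match" as a substring test, and uses hypothetical event type names (accept, read, write, close); none of these details are given on the slide.

```python
from typing import List, NamedTuple


class Event(NamedTuple):
    owner_id: str
    event_type: str
    event_data: str


class SliceMark(NamedTuple):
    begin_type: str    # first event of a slice must have exactly this type
    end_type: str      # last event of a slice must have exactly this type
    owner_filter: str  # substring the owner ID must contain (assumed semantics)
    data_filter: str   # substring the event data must contain (assumed semantics)


def matches(ev: Event, wanted_type: str, mark: SliceMark) -> bool:
    # Exact match on type, partial (substring) match on owner and data.
    return (ev.event_type == wanted_type
            and mark.owner_filter in ev.owner_id
            and mark.data_filter in ev.event_data)


def extract_slices(events: List[Event], mark: SliceMark) -> List[List[Event]]:
    """Cut a time-ordered, single-owner trace into slices delimited by the mark."""
    slices, current = [], None
    for ev in events:
        if current is None:
            if matches(ev, mark.begin_type, mark):
                current = [ev]           # open a new slice
        else:
            current.append(ev)
            if matches(ev, mark.end_type, mark):
                slices.append(current)   # close the slice
                current = None
    return slices


# Hypothetical httpd trace: request accepted, served, connection closed.
events = [
    Event("httpd:101", "accept", "fd=5"),
    Event("httpd:101", "read", "fd=5"),
    Event("httpd:101", "write", "fd=5"),
    Event("httpd:101", "close", "fd=5"),
    Event("httpd:101", "brk", ""),
]
mark = SliceMark("accept", "close", "httpd", "fd=5")
slices = extract_slices(events, mark)
print([e.event_type for e in slices[0]])
```

Events that fall outside any open slice (the trailing brk here) are simply skipped; a slice that is opened but never closed is dropped.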

An Event Slice of Apache
- In the event sequence of an Apache web server, one event slice is detected.
[Figure: event slice covering the user's web request, sending the reply back, and closing the connection]

Causality Relationship Definition
A causality relationship is represented as a 5-tuple record:
- Causing event type: a type of event that can cause the occurrence of other events.
- Caused event type: a type of event that is caused by other events.
- Time rule: the rule by which a causing event and a caused event can be associated based on their temporal relationship.
- Owner rule: the rule by which a causing event and a caused event can be associated based on their owner IDs.
- Event data rule: the rule by which a causing event and a caused event can be associated based on their event data.
[Figure: a Send event in the event slice of the web server (causing) is linked to a Receive event in the event slice of the application server (caused); match of src and dest ports?]
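The 5-tuple can be sketched as two event types plus three predicates. The concrete rules below (send begins no later than receive ends, owners differ, source and destination ports agree) are assumptions chosen to echo the port-matching hint on the slide; the event_data keys and all names are hypothetical.

```python
from typing import Callable, NamedTuple


class Event(NamedTuple):
    owner_id: str
    time_begin: int
    time_end: int
    event_type: str
    event_data: dict


Rule = Callable[[Event, Event], bool]


class CausalityRule(NamedTuple):
    causing_type: str
    caused_type: str
    time_rule: Rule    # temporal relationship between the two events
    owner_rule: Rule   # relationship between the two owner IDs
    data_rule: Rule    # relationship between the two event data records


def is_caused_by(rule: CausalityRule, causing: Event, caused: Event) -> bool:
    """True when every part of the 5-tuple accepts the (causing, caused) pair."""
    return (causing.event_type == rule.causing_type
            and caused.event_type == rule.caused_type
            and rule.time_rule(causing, caused)
            and rule.owner_rule(causing, caused)
            and rule.data_rule(causing, caused))


# Assumed rule: a send causes a receive in another process on the same connection.
send_recv = CausalityRule(
    causing_type="send",
    caused_type="receive",
    time_rule=lambda a, b: a.time_begin <= b.time_end,
    owner_rule=lambda a, b: a.owner_id != b.owner_id,
    data_rule=lambda a, b: (a.event_data["src_port"] == b.event_data["src_port"]
                            and a.event_data["dst_port"] == b.event_data["dst_port"]),
)

web_send = Event("httpd:101", 100, 105, "send", {"src_port": 40000, "dst_port": 8080})
app_recv = Event("java:202", 103, 110, "receive", {"src_port": 40000, "dst_port": 8080})
unrelated = Event("java:202", 103, 110, "receive", {"src_port": 40001, "dst_port": 8080})

print(is_caused_by(send_recv, web_send, app_recv))   # ports and timing agree
print(is_caused_by(send_recv, web_send, unrelated))  # source port differs
```

When a pair satisfies the rule, the two event slices containing those events would be stitched into the same event sketch.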

Event Sketch Analysis
- Kernel event feature generation
  - Event sketches still contain numerous events, so analyzing them event by event is costly.
  - We extract concise properties of event sketches that show the characteristics of their events for data analysis. (More details in the poster this afternoon.)
- Clustering and conditional data mining
  - Unsupervised learning to correlate similar event sketches.
  - Narrow down the focus of analysis by applying analysis conditions.
[Figure: Event Sketches -> Kernel Feature Generation -> Clustering, Conditional Data Mining -> Analysis Result]

Kernel Event Features
- We use two kernel event features to infer the characteristics of event sketches in a black-box way.
- Program Behavior Feature (PBF)
  - PBF is a system call distribution vector.
  - PBF is used to infer the application logic behind the kernel events.
- System Resource Feature (SRF)
  - SRF is a vector of resource descriptions of system calls, e.g., connect: network, stat: file.
[Figure: an event slice (columns: time, event, info) is mapped by system call categorization to Program Behavior Features (counts per system call such as socket, send, brk) and by resource categorization to System Resource Features (network, file, latency). Event slice excerpt:
  33324, syscall, brk
  35323, syscall, write
  35634, syscall, socket
  42345, interrupt
  51234, context switch
  88234, syscall, read
  92345, syscall, socket
  ...]
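Both features can be sketched as simple counting over the system calls of an event slice. The slide only states connect: network and stat: file; every other entry in the resource table below, the "other" bucket, and the function names are assumptions for illustration. The system call list is the syscall subset of the event slice excerpt on the slide.

```python
from collections import Counter
from typing import List

# Resource categories per system call. Only connect and stat come from the
# slide; the remaining mappings are assumed for the sake of the example.
RESOURCE_OF = {
    "connect": "network",
    "stat": "file",
    "socket": "network",  # assumed
    "send": "network",    # assumed
    "read": "file",       # assumed (could also be a socket read)
    "write": "file",      # assumed (could also be a socket write)
    "brk": "memory",      # assumed
}


def program_behavior_feature(syscalls: List[str], vocabulary: List[str]) -> List[int]:
    """PBF: how often each system call in the vocabulary occurs in the slice."""
    counts = Counter(syscalls)
    return [counts[name] for name in vocabulary]


def system_resource_feature(syscalls: List[str], resources: List[str]) -> List[int]:
    """SRF: how often each resource category is touched by the slice."""
    counts = Counter(RESOURCE_OF.get(name, "other") for name in syscalls)
    return [counts[res] for res in resources]


# System calls from the excerpt (interrupts and context switches are not syscalls).
slice_syscalls = ["brk", "write", "socket", "read", "socket"]

pbf = program_behavior_feature(slice_syscalls, ["brk", "read", "socket", "write"])
srf = system_resource_feature(slice_syscalls, ["network", "file", "memory"])
print(pbf, srf)
```

Fixing the vocabulary and resource order up front keeps the vectors of different event sketches comparable, which is what a clustering step needs.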

Conditional Data Mining
- For black-box trace analysis, it is important to narrow down the focus of analysis to a relevant set of event sketches to determine anomalies.
- Essentially this is an iterative filtering process with successive applications of filter conditions.
- We model it as a conditional probability P(C_2 | C_1), where C_1 and C_2 are conditions.
- Examples of conditions (performance, application context, etc.):
  - A cluster based on program behavior features
  - Event sketch marker type (e.g., Marker = TCP_ACCEPT)
  - Latency, idle time (e.g., Latency > mean value)
  - Process name (e.g., Process name = httpd.exe)
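As a rough illustration of the filtering view, P(C_2 | C_1) can be estimated as the fraction of event sketches satisfying C_1 that also satisfy C_2. The sketch records, their field names, and the latency values below are made up; only the example condition types come from the slide.

```python
from typing import Callable, Dict, List, Optional

Sketch = Dict[str, object]
Condition = Callable[[Sketch], bool]


def conditional_probability(sketches: List[Sketch],
                            c1: Condition,
                            c2: Condition) -> Optional[float]:
    """Empirical P(c2 | c1); None when no sketch satisfies c1."""
    given = [s for s in sketches if c1(s)]   # first filter: condition C1
    if not given:
        return None
    return sum(1 for s in given if c2(s)) / len(given)


# Hypothetical event sketch summaries.
sketches = [
    {"marker": "TCP_ACCEPT", "process": "httpd.exe", "latency": 12.0},
    {"marker": "TCP_ACCEPT", "process": "httpd.exe", "latency": 48.0},
    {"marker": "TCP_ACCEPT", "process": "java.exe", "latency": 9.0},
    {"marker": "ksleep", "process": "httpd.exe", "latency": 51.0},
]
mean_latency = sum(s["latency"] for s in sketches) / len(sketches)


def is_accept(s: Sketch) -> bool:
    return s["marker"] == "TCP_ACCEPT"       # Marker = TCP_ACCEPT


def is_slow(s: Sketch) -> bool:
    return s["latency"] > mean_latency       # Latency > mean value


p = conditional_probability(sketches, is_accept, is_slow)
print(p)
```

Successive filters can be chained by passing a condition that combines the earlier ones, which matches the iterative narrowing described above.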

Case Study: Inefficient Gateway Service
- Symptom
  - Internet gateway transaction application on an HP-UX server with 16 CPU cores.
  - Low transaction throughput.
- Black-box analysis
  - Direct access to the real machine or software is not available.
  - The traces were recorded by the owners.
- Trace analysis
  - kernel events, 82 event sketches
  - 78 sketches (over 95%) are constructed using implicitly closed event slices.
  - Markers: kwakeup and ksleep system calls, used for synchronization in the HP-UX operating system.
  - Clustering based on PBF (system call patterns) produced 7 clusters.

Clustering based on System Call Patterns
- Different clusters show distinct behavior in idle time and time stamp.
  - The application logic behind the kernel events is captured using system call patterns.
- 7 clusters are illustrated (X axis: time, Y axis: idle time).
  - 2 clusters have idleness below the mean and are spread over 0~6 seconds.
  - 5 clusters have higher idleness than the average, and their events occurred around 2.7 seconds.
[Figure: scatter plot of idle time against time stamp, with the mean of idle time marked]

Conditional Probability
- Clusters are further ranked by the mean and variance of idle time.
- Top clusters localize the problematic symptoms with high idleness in execution.
- Manual inspection confirmed correct detection of anomaly patterns in the traces.
[Figure: 1) Conditional Probability: P(PBF); 2) Conditional Probability: P(PBF| )]
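The slide does not give the exact ranking formula, so the following is only one plausible reading: order clusters by mean idle time, breaking ties by variance, both descending. The cluster labels and idle times are invented for the example.

```python
from statistics import mean, pvariance
from typing import Dict, List


def rank_clusters(idle_by_cluster: Dict[str, List[float]]) -> List[str]:
    """Order cluster IDs by (mean idle time, variance of idle time), highest first.

    This ordering is an assumption; the slides only say that clusters are
    ranked with the mean and variance of idle time.
    """
    stats = {cid: (mean(v), pvariance(v)) for cid, v in idle_by_cluster.items()}
    return sorted(stats, key=lambda cid: stats[cid], reverse=True)


# Hypothetical idle times (seconds) of the event sketches in three clusters.
idle_by_cluster = {
    "A": [0.1, 0.2, 0.3],
    "B": [2.0, 2.4, 2.2],
    "C": [1.0, 1.0, 1.0],
}
ranking = rank_clusters(idle_by_cluster)
print(ranking)
```

The clusters at the head of the ranking are the candidates to inspect manually for high idleness.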

Conclusion
- We present a black-box method (requiring no source code) to monitor cloud service environments and analyze performance problems.
- We have expanded the trace modeling of previous approaches by introducing implicitly closed event slices.
- We applied unsupervised learning with statistical analysis on the structured data to localize performance problems.

Thank you