Ranking the Importance of Alerts for Problem Determination in Large Computer System Guofei Jiang, Haifeng Chen, Kenji Yoshihira, Akhilesh Saxena NEC Laboratories.

Presentation transcript:

Ranking the Importance of Alerts for Problem Determination in Large Computer System Guofei Jiang, Haifeng Chen, Kenji Yoshihira, Akhilesh Saxena NEC Laboratories America, Princeton

Outline (ICAC 2009 : 6/16/2009)
Introduction
- Motivation & Goal
System Invariants
- Invariants extraction
- Value propagation
Collaborative peer review mechanism
- Rules & Fault model
- Ranking alerts
Experiment result
Conclusion

Motivation
Large & complex systems are deployed by integrating many heterogeneous components:
- Servers, routers, storage & software from multiple vendors.
- Hidden dependencies.
Log/performance data from components:
- Operators set many rules to check the data and trigger alerts. E.g. Web > 70%
- Rule setting is independent & isolated, based on each operator's own system knowledge.

Goal
Which alerts should we analyze first?
- Get more consensus from others
- Blend system management knowledge from multiple operators
We introduce a “peer-review” mechanism
- To rank the importance of alerts.
Operators can then prioritize the problem determination process.
[Figure: four alert rules (Alert 1: > 70%, Alert 2: > 150, Alert 3: > 60%, Alert 4: > 35k) ranked as Alert 3, Alert 1, Alert 2, Alert 4.]

Alerts Ranking Process (full automation)
Offline:
1. Extract invariants from the monitoring data of the large system to build the invariants model.
2. Define alert rules. Operators (w/ domain knowledge) supply domain information, e.g. Alert 1: > 70%, Alert 2: > 150, Alert 3: > 60%, Alert 4: > 35k.
3. Sort alert rules.
Online, at the time alerts are received:
4. Rank alerts (real alerts, e.g. Alert 1 and Alert 4).
References on slide: [ICAC 2006] [TDSC 2006] [TKDE 2007] [DSN 2006]

System Invariants
[Figure: user requests flow through the target system; time series of internal measurements m_1, m_2, m_3, m_4, ..., m_i, m_i+1, m_i+2, ..., m_n. Is there any constant relationship among them?]
Flow intensity: the intensity with which internal monitoring data reacts to the volume of user requests.
User requests flow through the system endlessly, and many internal monitoring data react to the volume of user requests accordingly.
We search for relationships among these internal measurements collected at various points.
If the modeled relationships continue to hold all the time, they can be regarded as invariants of the system.

Invariant Examples
Check the implicit relationships rather than the real values of flow intensities, which are always changing. Many of these relationships are constant.
- Example: x and y are changing, but the equation y = f(x) is constant.
[Figure: a load balancer with input I1 and outputs O1, O2, O3 satisfies the invariant I1 = O1 + O2 + O3. A database server with packet volume V1 and SQL query number N1 satisfies the invariant V1 = f(N1).]

Automated Invariants Search
[Flowchart, flattened in the transcript:]
1. Monitoring: collect observation data [t0-t1] from the target system.
2. Pick any two measurements i, j and learn f_ij from a template in the model library. The f_ij are invariant candidates.
3. Sequential validation: with new observation data [t1-t2], does f_ij hold? If no, drop the variants f_ij. If yes, keep f_ij.
4. Repeat with each new window of observation data [t_k - t_k+1]: does f_ij hold? If no, drop the variants f_ij. If yes, keep f_ij.
P_i: confidence score (P_0, P_1, ..., P_K over successive windows).
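The sequential validation loop above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the confidence score is assumed to be the fraction of windows in which the candidate's fitness reaches a threshold, and the names `validate_candidate`, `fit_threshold` and `min_confidence` are hypothetical.

```python
def validate_candidate(fitness_per_window, fit_threshold=50.0, min_confidence=0.9):
    """Sequentially validate one invariant candidate f_ij.

    fitness_per_window: fitness scores of f_ij on successive data windows
    [t1-t2], [t2-t3], ... (assumed to be given, in percent).
    Returns (kept, confidence_history).
    """
    passed = 0
    history = []
    for k, fitness in enumerate(fitness_per_window, start=1):
        if fitness >= fit_threshold:
            passed += 1  # f_ij held on this window
        confidence = passed / k  # assumed definition of the score P_k
        history.append(confidence)
        if confidence < min_confidence:
            return False, history  # drop the candidate
    return True, history


kept, scores = validate_candidate([80.0, 90.0, 85.0])
dropped, partial = validate_candidate([80.0, 20.0, 90.0])
```

In this sketch a candidate is dropped as soon as its running confidence falls below `min_confidence`, so later windows are never evaluated for it.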

One example in model library
We use an AutoRegressive model with eXogenous inputs (ARX) to learn the relationship between two flow intensity measurements.
Define: [equations not preserved in transcript]
Given a sequence of real observations, using LMS, we learn the model by minimizing the error.
A fitness function can be used to evaluate how well the learned model fits the real data.
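The slide's equations did not survive in the transcript. As a rough sketch under stated assumptions, a first-order ARX model of the form y(t) + a1*y(t-1) = b0*x(t) can be fitted by ordinary least squares. The model order, the closed-form 2x2 solve, and the normalized fitness formula below are assumptions chosen for illustration and are not taken from the slides.

```python
import math


def fit_arx_1_0(x, y):
    """Least-squares fit of y(t) + a1*y(t-1) = b0*x(t); returns (a1, b0)."""
    # Regressor phi(t) = [-y(t-1), x(t)], target y(t), for t = 1..N-1.
    s11 = s12 = s22 = r1 = r2 = 0.0
    for t in range(1, len(y)):
        p1, p2 = -y[t - 1], x[t]
        s11 += p1 * p1
        s12 += p1 * p2
        s22 += p2 * p2
        r1 += p1 * y[t]
        r2 += p2 * y[t]
    det = s11 * s22 - s12 * s12
    if abs(det) < 1e-12:
        raise ValueError("regressors are collinear; cannot fit")
    # Solve the 2x2 normal equations with Cramer's rule.
    a1 = (r1 * s22 - r2 * s12) / det
    b0 = (s11 * r2 - s12 * r1) / det
    return a1, b0


def fitness(x, y, a1, b0):
    """Assumed normalized fitness score in percent (100 = perfect fit)."""
    actual = y[1:]
    predicted = [-a1 * y[t - 1] + b0 * x[t] for t in range(1, len(y))]
    mean = sum(actual) / len(actual)
    err = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    spread = sum((a - mean) ** 2 for a in actual)
    if spread == 0.0:
        raise ValueError("constant output; fitness is undefined")
    return 100.0 * (1.0 - math.sqrt(err / spread))


# Synthetic data generated from y(t) = 0.5*y(t-1) + 2*x(t), i.e. a1 = -0.5, b0 = 2.
xs = [1.0, 2.0, 3.0, 2.0, 5.0, 4.0, 6.0, 3.0, 2.0, 7.0]
ys = [0.0]
for t in range(1, len(xs)):
    ys.append(0.5 * ys[t - 1] + 2.0 * xs[t])
a1, b0 = fit_arx_1_0(xs, ys)
score = fitness(xs, ys, a1, b0)
```

On noise-free synthetic data the fit recovers the generating coefficients, so the fitness is close to 100. On real monitoring data a lower score would suggest that the pair of measurements is not a good invariant candidate.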

Value Propagation with Invariants
[Figure: extracted invariants over measurements x, y, z, u, v, with y = f(x), z = g(y), u = h(x), v = s(u).]
With the ARX model, a value propagates over multiple hops to a converged set: z = g(f(x)), v = s(h(x)).
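Multi-hop propagation can be illustrated with a breadth-first walk over the invariant network. This sketch replaces the ARX models with plain functions, and the four linear invariants and the helper name `propagate` are hypothetical; only the topology (x to y to z, x to u to v) follows the slide.

```python
from collections import deque


def propagate(invariants, source, value):
    """Propagate `value` of `source` through the invariant network.

    invariants: dict mapping (src, dst) -> function computing dst from src.
    Returns a dict of every reachable measurement and its propagated value.
    """
    values = {source: value}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for (src, dst), f in invariants.items():
            if src == node and dst not in values:
                values[dst] = f(values[node])  # one hop along an invariant
                queue.append(dst)
    return values


# Hypothetical invariants; the coefficients are made up for illustration.
invariants = {
    ("x", "y"): lambda x: 2.0 * x,    # y = f(x)
    ("y", "z"): lambda y: y + 10.0,   # z = g(y)
    ("x", "u"): lambda x: 0.5 * x,    # u = h(x)
    ("u", "v"): lambda u: 3.0 * u,    # v = s(u)
}
result = propagate(invariants, "x", 70.0)
```

Each measurement is assigned once, on first visit, so the walk terminates even if the network contains cycles.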

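Rules and Fault Model
Rule: Predicate -> Action
[Figure: fault model for each rule, plotting the probability of fault occurrence (0 to 1) against the measurement value x with threshold x_T. The plot shows an ideal model and a realistic model, with false positive and false negative regions.]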

Probability of Reporting a True Positive Alert
Importance of an alert: the Probability of Reporting a True Positive (PRTP) generated by value x.
A very small false positive rate leads to a large number of false positive reports.
- Ex. One measurement is checked every minute and its FP rate is 0.1% => 60 x 24 x 365 x 0.1% = 526 FP reports per year. What if there are thousands of measurements?
- Ex. In a real operation support system, 80% of reports are FPs.
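The arithmetic in the first example can be checked with a few lines. The function name and the scaling to 1000 measurements are illustrative assumptions; the per-minute check and the 0.1% FP rate come from the slide.

```python
def expected_false_positives(checks_per_day, fp_rate, days=365, measurements=1):
    """Expected number of false positive reports over `days` days."""
    return checks_per_day * days * fp_rate * measurements


# One measurement checked every minute with a 0.1% false positive rate.
per_year = expected_false_positives(60 * 24, 0.001)

# Hypothetical scale-up: the same rate across 1000 measurements.
per_year_1000 = expected_false_positives(60 * 24, 0.001, measurements=1000)
```

The single-measurement result is 525.6, which the slide rounds to 526.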

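Local Context Mapping to Global Context
[Figure: alert rules on the Web, AP and DB components (Alert 1: > 70%, Alert 2: > 150, Alert 3: > 60%, Alert 4: > 35k) have different semantics. Each local threshold is mapped to its equivalent CPU%Web value in the global context (mapped values not preserved in transcript) and placed on the fault model of CPU%Web, which plots PRTP (0 to 1) against x with threshold x_T.]
Ordering by Prob(true|x) at the mapped values: Alert 3 > Alert 1 (x_T) > Alert 2 > Alert 4.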

Local Context Mapping to Global Context
[Figure: the same four alert rules (Alert 1: > 70%, Alert 2: > 150, Alert 3: > 60%, Alert 4: > 35k) mapped onto the fault model of Network%AP, which plots PRTP (0 to 1) against x with threshold x_T.]
Ordering by Prob(true|x) at the mapped values: Alert 3 > Alert 1 > Alert 2 > Alert 4 (x_T).
Alert ranking: No Change
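A small sketch of the mapping-and-sorting idea follows. It assumes that PRTP is non-decreasing in x, so a rule whose threshold maps to a larger global value ranks higher. The measurement names `m2`, `m3`, `m4`, the linear mapping functions, and every coefficient are hypothetical stand-ins for invariants that the transcript does not preserve.

```python
def rank_rules(rules, to_global):
    """Rank alert rules by their thresholds mapped into one global context.

    rules: {alert: (measurement, local_threshold)}
    to_global: {measurement: function mapping a local value to the
                equivalent value of the global-context measurement}
    Returns (order, mapped), most important alert first.
    """
    mapped = {
        alert: to_global[measurement](threshold)
        for alert, (measurement, threshold) in rules.items()
    }
    # Assumes PRTP grows with x: a larger mapped threshold means higher PRTP.
    order = sorted(mapped, key=mapped.get, reverse=True)
    return order, mapped


# Thresholds from the slide; measurement names other than CPU%Web are made up.
rules = {
    "Alert 1": ("CPU%Web", 70.0),
    "Alert 2": ("m2", 150.0),
    "Alert 3": ("m3", 60.0),
    "Alert 4": ("m4", 35000.0),
}
# Hypothetical mappings into the CPU%Web context.
to_global = {
    "CPU%Web": lambda v: v,
    "m2": lambda v: 0.4 * v,
    "m3": lambda v: v * 4.0 / 3.0,
    "m4": lambda v: 0.0015 * v,
}
order, mapped = rank_rules(rules, to_global)
```

With these made-up mappings the order reproduces the slide's ranking. Because the sort depends only on the relative order of the mapped thresholds, choosing a different global context related to this one by increasing invariants would leave the ranking unchanged.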

Alerts Ranking Process
Rank alerts (online): at the time alerts are received, rank the real alerts (e.g. Alert 1, Alert 4).

Ranking Alerts (Case I)
Case I: Receive ONLY ALERTS, no monitoring data from components.
[Figure: the sorted alert rules (Alert 6, Alert 2, Alert 3, Alert 7, Alert 5, Alert 9, Alert 1, Alert 8, Alert 4) are derived from operators' knowledge & configuration and the system invariants network. The alerts generated are ranked in that order: Alert 2, Alert 3, Alert 7, Alert 5, Alert 1.]
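Case I amounts to filtering the pre-sorted rule list by the alerts actually received. A minimal sketch, with the hypothetical helper name `rank_received` and the alert lists taken from the slide:

```python
def rank_received(sorted_rules, received):
    """Return the received alerts in the order of the pre-sorted rules."""
    received = set(received)
    return [alert for alert in sorted_rules if alert in received]


sorted_rules = [
    "Alert 6", "Alert 2", "Alert 3", "Alert 7", "Alert 5",
    "Alert 9", "Alert 1", "Alert 8", "Alert 4",
]
generated = ["Alert 1", "Alert 2", "Alert 3", "Alert 5", "Alert 7"]
ranking = rank_received(sorted_rules, generated)
```

No monitoring data is needed at ranking time, because the order was fixed offline when the rules were sorted.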

Ranking Alerts (Case II)
Case II: Receive both alerts and monitoring data from components.
[Figure: on the fault model of CPU%Web (PRTP vs. x, threshold x_T), the observed value X(CPU%Web) exceeds three mapped thresholds, so its Number of Threshold Violations (NTV) is 3. On the fault model of Network%AP, the observed value X(Network%AP) gives NTV = 2.]
The alert from CPU%Web is more important than the one from Network%AP.
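The NTV comparison can be sketched as a count of mapped thresholds that the observed value exceeds. All numeric values below are hypothetical, chosen only so that the counts match the slide (NTV = 3 and NTV = 2); the function name `ntv` is also an assumption.

```python
def ntv(observed, mapped_thresholds):
    """Number of Threshold Violations: mapped thresholds exceeded by `observed`."""
    return sum(1 for threshold in mapped_thresholds if observed > threshold)


# Hypothetical observed values and mapped thresholds per measurement context.
alerts = {
    "CPU%Web": (75.0, [52.5, 60.0, 70.0, 80.0]),
    "Network%AP": (25.0, [10.0, 20.0, 30.0, 40.0]),
}
counts = {name: ntv(value, thresholds) for name, (value, thresholds) in alerts.items()}

# A higher NTV means more rules agree, so the alert ranks higher.
ranked = sorted(counts, key=counts.get, reverse=True)
```

Here each violated threshold acts as one peer's vote that the observed value is abnormal, which is how the count expresses consensus among rules.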

Index
Introduction
- Motivation & Goal
System Invariants
- Invariants extraction
- Value propagation
Collaborative peer review mechanism
- Rules & Fault model
- Ranking alerts
Experiment result
Conclusion

Experimental system
[Figure: experimental system with components labeled A, B, C, D.]
Flow intensities (symbols not preserved in transcript):
- the number of EJBs created at time t.
- the JVM processing time at time t.
- the number of SQL queries at time t.
Invariant examples: [equations not preserved in transcript]

Extracted Invariants Network
[Figure: extracted invariants network over the measurements m_1, m_2, m_3, m_4, m_5, m_6.]

Thresholds of Measurements
[Table: measurements m_1 to m_6 and their thresholds m_1^T to m_6^T; values not preserved in transcript.]

Thresholds of Measurements
[Table: rows m_1 to m_6, columns m_1^T to m_6^T; values not preserved in transcript.]

Ranking Alerts with NTVs (1)
[Table: rows m_1 to m_6, columns m_1^T to m_6^T, plus the observed value and NTVs for each measurement; values not preserved in transcript.]

Ranking Alerts with NTVs (1)

Ranking Alerts with NTVs (2)
[Table: rows m_1 to m_6, columns m_1^T to m_6^T, plus the observed value and NTVs for each measurement; values not preserved in transcript.]

Ranking Alerts with NTVs (2)
Inject a problem (SCP copy) into the Web server.

Conclusion
We introduce a peer review mechanism to rank alerts from heterogeneous components
- By mapping local thresholds of various rules into their equivalent values in a global context
- Based on the system invariants network model
To support operators' consultation for prioritization of problem determination.

Thank You! Questions?