1 VLDB 2006, Seoul

Mapping a Moving Landscape by Mining Mountains of Logs
Automated Generation of a Dependency Model for HUG's Clinical System

Mirko Steinle, EPFL and HUG
Karl Aberer, EPFL
Sarunas Girdzijauskas, EPFL
Alexander Lamb, HUG

2 Overview
Background – Dependency Models
Approaches
  – L1: Analyzing general service activity
  – L2: Analyzing user sessions
  – L3: Analyzing textual content
Evaluation
Conclusion

3 Background – A Moving Landscape
Distributed clinical system of University Hospital Geneva (HUG)
  – 2000 beds, 4500 PCs, records accessed per day
Relevant features
  – Communication is web service based
      Service Directory: about 50 service groups
  – Centralized Logging System with a standard XML format
      10 Mio log messages/day, 1 TeraByte/year
  – Quite homogeneous infrastructure
Severe Availability Requirements (24 x 7 x 365)
➱ Need for automated support for problem diagnosis

4 Dependency Model
Service Orientation allows for easy reuse and integration, but has resulted in a complex dependency structure
Dependency model is not clear
  – DM difficult to obtain, impossible to keep up-to-date manually
  – Infrastructure for manual documentation of the dependency structure is available, but not used …

5 Goal – Automated Dependency Model
Goal: Automated creation of a model of the system's dependency structure (DM)
  – Non-intrusive and low-cost
  – Focus on invocation dependencies between high-level objects
Applications
  – Support for Fault Localization Algorithms
  – Prediction of Impact of Management Operations
  – Support for Architectural Decisions
  – Detection of Abnormal Behavior
"you don't want to interrupt a surgery because of DB maintenance"

6 Possible Approaches
Static approaches
  – Capture dependencies at "compile time" by scanning configuration files, code, etc.
Dynamic approaches
  – Capture dependencies at runtime
  – Approaches include:
      Code instrumentation (standards like JMX or ARM exist but are not yet applied broadly)
      Middleware instrumentation (e.g. request tagging)
      Active perturbation of system operation
      Time series analysis of activity measures, e.g. using Neural Networks (network communication, CPU usage, …) [Ensel02]
(Figure: trade-off axis between Generality and Accuracy & Precision)

7 State of the Art
Research
  – Focuses on how to exploit a dependency model, little work on how to obtain it
  – No generally applicable solution providing sufficiently correct dependency models seems to exist
Commercial Products
  – Most focus on low-level objects and visualization
  – (Few) existing dynamic approaches: high configuration effort!

8 Overview
Background – Dependency Models
Approaches
  – L1: Analyzing general service activity
  – L2: Analyzing user sessions
  – L3: Analyzing textual content
Evaluation
Conclusion

9 Technique L1: Logs as a General Activity Measure
Key idea
  – Activity of dependent objects is likely to be correlated in some sense
  – Use logs as an activity measure
Earlier work
  – Neural networks on CPU usage, traffic volume, … [Ensel02]
  – Drawback: supervised training
Our approach
  – Statistical approach (no training)
  – Inspired by [LM04] ("Mining Temporal Patterns without Predefined Time Windows")

10 Statistical Approach
Tests for association of spatial point processes
  – Compare the typical distance of a random point R in time to the closest timestamp of a log from B, to the one of a timestamp of a log from A
Approach
  – Obtain distances by sampling from R and A
  – Determine median for distances A–B and R–B
  – If median for A–B lower than for R–B → correlation/dependence
  – Use confidence intervals
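The sampling procedure above can be sketched as follows. This is a minimal illustration under assumed inputs (lists of log timestamps per service, a fixed sample size), not the authors' exact implementation; the confidence-interval comparison from the slide is omitted here and only the raw medians are compared.

```python
import bisect
import random
import statistics

def nearest_distance(t, sorted_ts):
    """Distance from time t to the closest timestamp in sorted_ts."""
    i = bisect.bisect_left(sorted_ts, t)
    candidates = []
    if i < len(sorted_ts):
        candidates.append(abs(sorted_ts[i] - t))
    if i > 0:
        candidates.append(abs(sorted_ts[i - 1] - t))
    return min(candidates)

def median_distance_test(ts_a, ts_b, window_start, window_end,
                         n_samples=1000, seed=0):
    """Compare the median nearest-B distance seen from A's log timestamps
    with the one seen from uniformly random points R in the time window.
    A clearly smaller A-B median hints that A's and B's activity is
    correlated (a candidate dependency)."""
    rng = random.Random(seed)
    ts_b = sorted(ts_b)
    sample_a = rng.sample(ts_a, min(n_samples, len(ts_a)))
    med_ab = statistics.median(nearest_distance(t, ts_b) for t in sample_a)
    med_rb = statistics.median(
        nearest_distance(rng.uniform(window_start, window_end), ts_b)
        for _ in range(n_samples))
    return med_ab, med_rb, med_ab < med_rb
```

In a real deployment the comparison would use the confidence intervals of both medians (next slide) rather than a plain `<`.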

11 Example
Confidence interval for the median of x_1, …, x_n: the median falls with probability 95% into the order-statistic interval [x_(j), x_(k)] s.t. B_{n,1/2}(k-1) - B_{n,1/2}(j-1) > 0.95, where B_{n,1/2} is the CDF of the Binomial(n, 1/2) distribution.
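Such an interval can be computed directly from the Binomial(n, 1/2) CDF. A small sketch; the widen-from-the-middle search is an assumption (any pair (j, k) satisfying the inequality would be a valid interval, this search just returns a short symmetric one):

```python
from math import comb

def binom_cdf_half(m, n):
    """B_{n,1/2}(m): P(X <= m) for X ~ Binomial(n, 1/2)."""
    if m < 0:
        return 0.0
    return sum(comb(n, i) for i in range(m + 1)) / 2 ** n

def median_ci(xs, level=0.95):
    """Distribution-free CI for the median: order-statistic interval
    [x_(j), x_(k)] with B_{n,1/2}(k-1) - B_{n,1/2}(j-1) > level."""
    xs = sorted(xs)
    n = len(xs)
    # start at the middle order statistics and widen symmetrically
    j, k = (n + 1) // 2, (n + 2) // 2   # 1-based indices
    while binom_cdf_half(k - 1, n) - binom_cdf_half(j - 1, n) <= level:
        j, k = max(1, j - 1), min(n, k + 1)
        if j == 1 and k == n:
            break
    return xs[j - 1], xs[k - 1]
```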

12 Observations for L1
Observations from preliminary experimental evaluation
  – True dependencies found, but clearly incomplete
  – Few "random" errors
  – However, correlation also appears where no invocation dependency exists
Limit analysis to shorter time windows
  – Eliminate common dependency on time
(Figures: transitive dependency; simultaneous use)

13 Technique L2: Logs in a User Session
One main difficulty is heavy parallelism in the system ➱ execution sequences get overshadowed
Reconstruct user sessions ➱ eliminates parallelism due to multiple users
Then, adapt a procedure from NLP [Evert04]
Two independent steps
  1. Extraction of consecutive log-source pairs [APP_i, APP_j] and creation of contingency tables
  2. Statistical test for association on these tables

14 Construction of Contingency Table
Session log bigrams (u, v): (A,B) – (B,C) – (C,B)

Contingency table for A–B:
           u = A   u ≠ A
  v = B      1       1
  v ≠ B      0       1
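Step 1 can be sketched as below; the function name and the list-of-sources session representation are illustrative assumptions, not taken from the paper:

```python
def contingency_table(session_sources, a, b):
    """Build the 2x2 contingency table (o11, o12, o21, o22) for the
    source pair (a, b) from consecutive log-source bigrams (u, v) of
    one reconstructed user session."""
    o11 = o12 = o21 = o22 = 0
    # consecutive pairs: (s[0], s[1]), (s[1], s[2]), ...
    for u, v in zip(session_sources, session_sources[1:]):
        if u == a and v == b:
            o11 += 1          # a directly followed by b
        elif u != a and v == b:
            o12 += 1          # b preceded by something else
        elif u == a and v != b:
            o21 += 1          # a followed by something else
        else:
            o22 += 1          # neither a nor b in the right position
    return o11, o12, o21, o22
```

For the slide's session (bigrams (A,B), (B,C), (C,B)) this reproduces the table above: row v = B is (1, 1), row v ≠ B is (0, 1).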

15 Expected vs. Observed Frequencies
Expected frequencies under the hypothesis that u and v are statistically independent: E_ij = (row total × column total) / N

16 Statistical Test for Association
Log-likelihood test (Dunning)
Works well for heavily skewed tables (O_11 ≪ N)
For an excellent discussion of statistical tests for correlation see [Evert04]
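A sketch of Dunning's log-likelihood score in its standard G² form for a 2x2 table, with expected counts E_ij = (row total × column total) / N as on the previous slide; the exact variant and significance threshold used in the paper may differ:

```python
from math import log

def log_likelihood_ratio(o11, o12, o21, o22):
    """G^2 = 2 * sum_ij O_ij * ln(O_ij / E_ij) over the 2x2 table,
    where E_ij is the expected count under independence. Cells with
    O_ij = 0 contribute nothing to the sum."""
    n = o11 + o12 + o21 + o22
    rows = (o11 + o12, o21 + o22)
    cols = (o11 + o21, o12 + o22)
    observed = ((o11, o12), (o21, o22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:
                e = rows[i] * cols[j] / n
                g2 += o * log(o / e)
    return 2 * g2
```

A large G² rejects independence of the two log sources, i.e. flags a candidate dependency; for a perfectly independent table the score is 0.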

17 Observations for L2
Observations from preliminary experimental evaluation
  – Many true dependencies found
  – Interestingly, a few similar errors as in L1: transitivity and simultaneous use
  – Main problem: only a small subset of logs can be assigned to a session, and many interactions can thus not be observed

18 Technique L3: Exploiting Textual Content in Logs
Observation
  – Invocation of a remote service is typically logged by the caller
  – One could identify such logs and process the log content to find the callee
The other way round
  – Find logs mentioning directory entry contents for a given service
  – Infer a dependency of the log's source, the caller, on the service
Example: service s calls notify on server myserver
  Possible content of free text in log entry:
    Invoke externalService [fct [notify] server [myserver.hguge:9999/myurl]]
  or
    (DPINOTIFICATION) notify ($myparams)
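One possible way to match log text against directory entries is sketched below. The directory contents, the service name `notification-service`, and the regular expression are hypothetical, modeled only on the two example strings above; HUG's actual log formats and directory schema are not described in this detail.

```python
import re

# Hypothetical directory: service name -> strings identifying it in log text
DIRECTORY = {
    "notification-service": ["myserver.hguge:9999/myurl", "notify"],
}

# Pattern for the structured "Invoke externalService" log line shown above
INVOKE_RE = re.compile(
    r"Invoke externalService \[fct \[(\w+)\] server \[([^\]]+)\]\]")

def find_callee(log_text):
    """Try to recover the called service from a log entry's free text:
    first via the structured invoke pattern, then by scanning for
    directory-entry substrings anywhere in the text."""
    m = INVOKE_RE.search(log_text)
    if m:
        fct, server = m.groups()
        for service, markers in DIRECTORY.items():
            if server in markers or fct in markers:
                return service
        return None
    for service, markers in DIRECTORY.items():
        if any(marker in log_text for marker in markers):
            return service
    return None
```

The dependency edge is then (log source → returned service); real log text is noisier, so matches would need disambiguation against the roughly 50 service groups in the directory.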

19 Overview
Background – Dependency Models
Approaches
  – L1: Analyzing general service activity
  – L2: Analyzing user sessions
  – L3: Analyzing textual content
Evaluation
Conclusion

20 Experiments on Logs: Setting
Test data: 56.8 Mio logs from 1 week
Reference model (RM)
  – Created with the help of more than a dozen system experts and developers
  – 178 dependencies out of 1431 possible dependencies (54 services)
Strategy
  1. Validate L1, L2 and L3 against the static reference model
  2. Validate L1 and L2 against L3 and study the influence of load

21 Experiment: Validation against RM
L1 (CI: [0.63, 0.73])
  – True Positives detected
  – Small classification error: about 2% in the negative case
  – False Positives (FP): transitive and simultaneous use (e.g. administrative patient data and laboratory results)
L2 (CI: [0.71, 0.78])
  – True Positives detected
  – FP: asynchronous communication
  – Sessions: only 10% of all logs can be assigned to a session
L3 (CI: [0.93, 0.96])
  – True Positives detected
  – 10 False Negatives on the whole week

22 Experiment: Influence of Load on Detection
Realizations of dependency relationships computed with L3
Percentage of False Positives is not influenced by load
CI for linear factors
  – L1: [-0.284, ]
  – L2: [-0.025, 0.002]

23 Overview
Background – Dependency Models
Approaches
  – L1: Analyzing general service activity
  – L2: Analyzing user sessions
  – L3: Analyzing textual content
Evaluation
Conclusion

24 Comparison of Log-based Approaches
Comparison table of L3 (Logs as Text), L2 (Logs in Sessions) and L1 (Logs as Activity Measure) along:
  – Accuracy and Precision of Result: concurrency, correlation
  – Implementation and Maintenance: parametrization, performance and security impact
  – Required Structure and Content of Logs (Scope): service directory (L3), session info (L2), only source and timestamp (L1)
All techniques can be implemented in linear complexity w.r.t. #logs
Invocation direction ≠ functional dependency direction
Solution for HUG
  – Centralized logging system ➱ little effort for log-based methods
  – L3 is a viable solution

25 Conclusion
Three new approaches to using logs for DM generation, with a large scope
  – All have been shown to discover useful dependency information in a real-world environment
  – Seems to be the first study on the use of logs, and the first real-world experiment, for DM generation
Sniffing
  – Applicable for web-service-oriented systems
Simple and efficient solution for HUG