LogTree: A Framework for Generating System Events from Raw Textual Logs Liang Tang and Tao Li School of Computing and Information Sciences Florida International.

Presentation transcript:

LogTree: A Framework for Generating System Events from Raw Textual Logs Liang Tang and Tao Li School of Computing and Information Sciences Florida International University Miami, 33199, USA

2 Outline 1. Problem Statement 2. Motivation 3. Semi-structural Log Message Clustering 4. Message Segment Table 5. Evaluation

3 Problem Statement (1) 1. System log analysis is widely used for anomaly detection and fault prevention. 2. Many systems generate only textual log messages, and raw textual log messages are difficult to analyze.

4 Problem Statement (2) Most temporal pattern mining algorithms operate on system events, so we try to generate events from system log messages.

5 Problem Statement (3)
1. Traditional solution: write a full log parser.
2. Weaknesses:
   1. Only well-known systems, such as Apache Web Server and Microsoft IIS, have well-developed log parsers.
   2. Reading the documentation and understanding each type of log message in order to write a parser on our own is time-consuming.
   3. Documentation is often incomplete or exists only in the developer's head.
   4. The system is constantly updated, so its logs keep changing as well.


Motivation (1)  Similar log messages describe the same event.  We can use data clustering algorithm on log messages.  However, how to define the similarity between two log messages? 7


9 Motivation (3) Similarity between two sequences of terms: 1. Cosine similarity on TF-IDF vectors. 2. Jaccard index similarity. 3. Word sequence matching. But what if two log messages have completely different sets of words (terms)?


11 Motivation (4) In PVFS2 log files, the following two log messages both belong to the status event, yet none of their terms are identical. However, they have a similar format. Format may be more useful than terms.
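The point of this slide can be illustrated with a small sketch. The two messages below are made up for illustration (the PVFS2 messages shown on the slide were not preserved in this transcript), and the `term_type` rules are an assumption, not the authors' type system. The sketch shows two token sequences whose Jaccard index is zero while their sequences of term types are identical.

```python
import re

def jaccard(a, b):
    """Jaccard index of two term collections, treated as sets."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def term_type(term):
    """Hypothetical, very coarse term typing (assumption for illustration)."""
    if re.fullmatch(r"\d+(\.\d+)?", term):
        return "number"
    if re.fullmatch(r"[A-Za-z_]+", term):
        return "word"
    return "symbol"

# Made-up messages: no term in common, same format.
msg1 = ["bytes", "read", "1024", "2048"]
msg2 = ["files", "open", "7", "13"]

term_overlap = jaccard(msg1, msg2)
format1 = [term_type(t) for t in msg1]
format2 = [term_type(t) for t in msg2]
```

Here `term_overlap` is zero because the two term sets are disjoint, while `format1` and `format2` are both word, word, number, number.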


13 Semi-structural Log Message Clustering (1) Step 1: Convert log messages into semi-structural log messages (log trees). Step 2: Compute the similarity between each pair of log trees. Step 3: Apply data clustering on the similarity matrix.
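Step 3 can be sketched as follows, assuming the similarity matrix from Step 2 is already available as a list of lists. The sketch uses a naive single-link agglomerative procedure, one of the two clustering algorithms named later in the evaluation. The function name and the stopping rule (a target number of clusters `k`) are assumptions made for illustration.

```python
def single_link_clusters(sim, k):
    """Naive single-link agglomerative clustering on a similarity matrix.

    sim[i][j] is the similarity of items i and j (higher means more alike).
    Merging stops when k clusters remain. Simple rather than efficient.
    """
    k = max(k, 1)
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single link: similarity of the closest pair across clusters.
                link = max(sim[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or link > best[0]:
                    best = (link, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical 4x4 similarity matrix with two obvious groups.
example_sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.15, 0.1],
    [0.1, 0.15, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
example_clusters = single_link_clusters(example_sim, 2)
```

With this matrix, items 0 and 1 are merged first, then items 2 and 3, leaving two clusters.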


15 Semi-structural Log Message Clustering (2) Step 1: Convert log messages into semi-structural log messages (log trees). This is accomplished by a simple log parser.
- It is only a context-free grammar parser.
- It separates a log message by comma, TAB, etc.
- It does NOT identify the meaning of terms (words).
- It can be created automatically with the JLex and JCup (or JavaCC) tools.
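A hand-written stand-in for such a parser might look like the sketch below. It is not the generated JLex/JCup parser from the slides. The `Node` class, the separator order in `SEPARATORS`, and the sample message are all assumptions chosen for illustration. The sketch only splits on separators and never interprets what a term means.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    segment: str                      # text of the message segment at this node
    children: list = field(default_factory=list)

# Assumed separator hierarchy: TAB splits the top layer, comma the next one.
SEPARATORS = ["\t", ","]

def build_log_tree(message, level=0):
    """Split a message into a tree of segments, one separator per layer."""
    node = Node(message.strip())
    for depth in range(level, len(SEPARATORS)):
        parts = [p for p in message.split(SEPARATORS[depth]) if p.strip()]
        if len(parts) > 1:
            node.children = [build_log_tree(p, depth + 1) for p in parts]
            break
    return node

# Made-up log line for illustration.
tree = build_log_tree("2010-01-01 10:00:00\thost1\tjob start, id=7, user=bob")
```

In this example the root has three children (timestamp, host, payload), and only the payload is split further, on commas.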

16-20 Semi-structural Log Message Clustering (3) Step 2: Compute the similarity between each pair of log trees. s1 and s2 are two log messages. The similarity is a recursive function with a weight w. Its parts are annotated on the slides as follows:
- the root node of s1 and the root node of s2;
- the message segment at node v1 and the message segment at node v2;
- the best matching between the nodes of subtree v1 and the nodes of subtree v2;
- a factor less than 1 that decreases the weight for lower layers;
- the distance of message segments.
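The formula itself did not survive in this transcript, so the following is only a sketch of a recursive function with the listed ingredients, not the authors' definition. It makes several assumptions: trees are `(segment, children)` tuples, the segment score `seg_sim` is passed in by the caller, the layer factor is a parameter named `decay`, and the "best matching" is approximated greedily (highest-scoring child pairs first) instead of being solved exactly.

```python
def tree_similarity(v1, v2, seg_sim, decay=0.5, weight=1.0):
    """Weighted recursive similarity of two log trees (illustrative sketch).

    v1, v2  : (segment, children) tuples
    seg_sim : function scoring two message segments
    decay   : factor < 1 applied to the weight at each lower layer
    """
    seg1, kids1 = v1
    seg2, kids2 = v2
    score = weight * seg_sim(seg1, seg2)
    # Score every pair of children one layer down, with the decayed weight.
    pairs = sorted(
        (
            (tree_similarity(a, b, seg_sim, decay, weight * decay), i, j)
            for i, a in enumerate(kids1)
            for j, b in enumerate(kids2)
        ),
        reverse=True,
    )
    # Greedy one-to-one matching: an approximation of the best matching.
    used1, used2 = set(), set()
    for s, i, j in pairs:
        if i not in used1 and j not in used2:
            used1.add(i)
            used2.add(j)
            score += s
    return score

def exact(a, b):
    """Placeholder segment score: 1 for identical segments, else 0."""
    return 1.0 if a == b else 0.0

t1 = ("root", [("a", []), ("b", [])])
t2 = ("root", [("b", []), ("a", [])])
example_score = tree_similarity(t1, t2, exact)
```

Because children are matched rather than compared position by position, `t1` and `t2` score the same as two identical trees would, even though their children appear in a different order. Scoring every child pair keeps the sketch short but is not efficient.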

21 Semi-structural Log Message Clustering (3) Distance of message segments m1 and m2. Two message segments: m1 = p1 … p_n1, m2 = q1 … q_n2. t(.) is the type of a term; types = {number, separator, word, hostname, …}.
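The exact segment formula is not preserved in this transcript. As a rough sketch of a type-based score, the code below compares the two segments position by position and counts how often t(p_i) equals t(q_i), normalized by the longer length. Both the normalization and the regular expressions in `term_type` are assumptions for illustration, and the hostname type is left out.

```python
import re

def term_type(term):
    """Hypothetical type function t(.) covering a few of the listed types."""
    if re.fullmatch(r"[+-]?\d+(\.\d+)?", term):
        return "number"
    if re.fullmatch(r"[A-Za-z]+", term):
        return "word"
    if re.fullmatch(r"[^\w\s]+", term):
        return "separator"
    return "other"

def segment_similarity(m1, m2):
    """Fraction of aligned positions whose term types agree (assumed form)."""
    if not m1 and not m2:
        return 1.0
    matches = sum(1 for p, q in zip(m1, m2) if term_type(p) == term_type(q))
    return matches / max(len(m1), len(m2))

same_format = segment_similarity(["read", ":", "1024"], ["open", ":", "7"])
different_format = segment_similarity(["read", "1024"], ["7", "read"])
```

Two segments with different words and numbers but the same word, separator, number layout get the maximum score, while swapping the order of types drops the score to zero.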

22 Semi-structural Log Message Clustering (4) Why is this similarity better? 1. It uses format information, taking format similarity into account. 2. Similarity is computed on the best-matched pairs of message segments. For example, two messages s1 and s2 both contain ,. It is not fair to compute similarity of s1's and s2's.

23 Semi-structural Log Message Clustering (5) Comparison with tree kernels: our similarity function is similar to a tree kernel. However:
- A tree kernel does not assign importance weights to different layers of nodes.
- A tree kernel computes every pair of nodes at each layer, which is very time-consuming.
For our clustering, the similarity function does not need to be a kernel function.


25 Message Segment Table (1) 1. Many message segments are duplicated. 2. Why recompute the similarity of two message segments that have been seen before? 3. Therefore, we build an in-memory data structure that maintains frequently appearing message segments.

26 Message Segment Table (2) The Message Segment Table is composed of a hash table and a similarity matrix. Figure labels: occurrences (for keeping track of the frequency), column index, similarity matrix.

27 Message Segment Table (3)
- Building the MST:
  1. Scan the data in one pass and pick the high-frequency message segments.
  2. Put them into the Column Hash Table and the similarity matrix.
  3. Compute the entries of the matrix.
- Looking up the MST:
  1. Search the Column Hash Table to find the column index.
  2. Extract the value from the similarity matrix by column index.
- Updating the MST:
  1. Search the Column Hash Table to find the occurrence count.
  2. Insert into or remove from the Column Hash Table according to frequencies.
  3. Then modify the similarity matrix.
See the paper for details.
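The build and lookup steps can be sketched as below. The class layout, the name `f_min` for the frequency threshold, and the fallback to direct computation for segments outside the table are assumptions made for illustration. The update procedure is omitted because the slides defer it to the paper.

```python
from collections import Counter

class MessageSegmentTable:
    """Cache of pairwise similarities for frequent message segments (sketch)."""

    def __init__(self, segments, f_min, sim):
        counts = Counter(segments)                  # one pass over the data
        frequent = [s for s, c in counts.items() if c >= f_min]
        self.sim = sim
        # Column hash table: segment -> occurrence count and column index.
        self.columns = {
            s: {"occurrences": counts[s], "index": i}
            for i, s in enumerate(frequent)
        }
        # Similarity matrix over the frequent segments, computed once.
        self.matrix = [[sim(a, b) for b in frequent] for a in frequent]

    def lookup(self, a, b):
        """Return the cached similarity, or compute it if either is uncached."""
        ca, cb = self.columns.get(a), self.columns.get(b)
        if ca is not None and cb is not None:
            return self.matrix[ca["index"]][cb["index"]]
        return self.sim(a, b)

calls = []

def counting_sim(a, b):
    """Placeholder similarity that records how often it is invoked."""
    calls.append((a, b))
    return 1.0 if a == b else 0.0

mst = MessageSegmentTable(["a", "a", "b", "b", "c"], f_min=2, sim=counting_sim)
calls_after_build = len(calls)
cached = mst.lookup("a", "b")
calls_after_cached_lookup = len(calls)
```

In the example, looking up the pair of frequent segments does not invoke `counting_sim` again, which is the saving the table is meant to provide.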


29 Evaluation (1) Experiment machines and data collection.

30 Evaluation (2)
- Comparative methods: two traditional clustering algorithms, k-means and single-link hierarchical clustering. We implemented all methods in Java 1.5.
- Comparison metric: F1-score.
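The slides do not say how the F1-score is aggregated over clusters, so the sketch below only shows the basic quantity: the harmonic mean of precision and recall for one predicted cluster against one ground-truth event class. The function name and the set-based inputs are assumptions for illustration.

```python
def f1_score(predicted, actual):
    """F1 of a predicted cluster against a ground-truth class (sets of ids)."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)

# Hypothetical message ids: two of three predicted members are correct.
example_f1 = f1_score({1, 2, 3}, {2, 3, 4})
```

In this example precision and recall are both 2/3, so the F1-score is also 2/3.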

31 Evaluation (3) Accuracy result: TF-IDF and Jaccard perform poorly. Tree kernel is sometimes better than LogTree, but it is much slower.

32 Evaluation (4) Efficiency result: note that the running time of LogTree includes the time for building the Message Segment Table. TF-IDF is the fastest, but its accuracy is very poor. Tree kernel and Jaccard are slow. LogTree is the second fastest.

33 Evaluation (5) Time scalability: this experiment was run on the second machine (a 64-bit Linux server) with up to 10K log messages.

34 Evaluation (6) Memory space scalability: number of entries in the Message Segment Table. f min =

35 Evaluation (7) A case study: detecting a configuration error in Apache Web Server. A configuration error will cause a series of consecutive errors.

36 The End Thank you! Authors’ contact information: Liang Tang: Tao Li: