Classification of Software Behaviors for Failure Detection: A Discriminative Pattern Mining Approach (presented at KDD’09)

Presentation transcript:

Classification of Software Behaviors for Failure Detection: A Discriminative Pattern Mining Approach
Presented at KDD’09. David Lo (Singapore Management University), Hong Cheng (Chinese University of Hong Kong), Jiawei Han (University of Illinois at Urbana-Champaign), Siau-Cheng Khoo and Chengnian Sun (National University of Singapore).

Software, Its Behaviors and Bugs
Software is ubiquitous in our daily life, and many activities depend on the correct working of software systems. Program behaviors can be collected as execution traces: an execution trace is a sequence of events, a path that the program takes when executed. A program contains many behaviors; some are good, others are bad (i.e., failures). Bugs have caused the loss of billions of dollars (NIST). (Overall question.)

Can Data Mining Help?
Two starting points: a pattern mining tool for program behaviors, and the recent development of the pattern-based classification approach. In this work, we extend both to: detect bad behaviors or failures in software traces; propose a new pattern definition that can be mined more efficiently (closed unique iterative patterns); and develop a new pattern-based classification method for sequential data (iterative pattern-based classification).

Our Goal
“Based on historical data of software and known failures, we construct a pattern-based classifier working on sequential software data to generalize the failures and to detect unknown failures.” Failure detection is an important step in the software quality assurance process. Focus: non-crashing failures. The approach could be chained or integrated with other work on bug localization, test case augmentation, and bug/malware signature generation. (Specific SE question.)

Usage Scenarios
[Diagram: an unknown trace or sequence of events <e1,e2,e3,e4,…,en> is passed to the trained sequential classifier, which labels it Normal or Failure. Other diagram labels: Discriminative Features, Failure Detector, Test Suite Augmentation Tool, Fault Localization.]

Related Studies
Lo et al. have proposed an approach to mine iterative patterns, which capture series of events appearing within a trace and across many traces (LKL-KDD’07). Cheng et al. and Yan et al. have proposed a pattern-based classification method for transaction and graph datasets (CYHH-ICDE’07, YCHY-SIGMOD’08). (Technical data mining question.)

Research Questions
How to build a pattern-based classifier on sequential data that contains many repetitions? How to ensure that the classification accuracy is good? How to improve the efficiency of the classifier building process?

Software Behaviors & Traces
Each trace can be viewed as a sequence of events, denoted as <e1,e2,e3,…,en>. An event is a unit behavior of interest: a method call, a statement execution, or a basic block execution in a Control Flow Graph (CFG). Input traces -> a sequence database.

Overall View of the Pattern-Based Classification Framework
Sequence Database -> Iterative Pattern Mining -> Feature Selection -> Classifier Building -> Classifier -> Failure Detection

Iterative Patterns
A series of events repeated (above a min_sup threshold) within a sequence and across multiple sequences. Based on Message Sequence Charts (MSC) and Live Sequence Charts (LSC), two software specification formalisms. Given a pattern P and a sequence database DB, the instances of P in DB can be computed. Given a pattern P (e1e2…en), a substring SB is an instance of P iff SB = e1;[-e1,…,en]*;e2;…;[-e1,…,en]*;en, where [-e1,…,en]* stands for any run of events other than e1,…,en.

Iterative Patterns
Consider the pattern P = <A,B>. The set of instances of P, written as (seq-id, start-pos, end-pos), is {(1,3,5), (1,6,8), (2,3,5), (2,8,9)}. The support of P is 4.
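The instance definition can be made concrete with a short sketch. The Python below is illustrative only, not the authors' implementation: `find_instances` and `support` are hypothetical names, and the two-sequence database is invented so that it reproduces the instance set listed for P = <A,B> (the database shown on the original slide is not part of this transcript).

```python
def find_instances(pattern, sequence):
    """Return (start, end) positions (1-based) of instances of `pattern`.

    Between two consecutive pattern events, only events that do not occur
    anywhere in the pattern may appear (the [-e1,...,en]* part).
    """
    alphabet = set(pattern)
    instances = []
    for start, event in enumerate(sequence):
        if event != pattern[0]:
            continue
        matched = 1          # number of pattern events matched so far
        pos = start
        while matched < len(pattern) and pos + 1 < len(sequence):
            pos += 1
            current = sequence[pos]
            if current == pattern[matched]:
                matched += 1
            elif current in alphabet:
                break        # a pattern event out of order: not an instance
        if matched == len(pattern):
            instances.append((start + 1, pos + 1))
    return instances


def support(pattern, database):
    """Collect (seq-id, start, end) triples over all sequences."""
    triples = []
    for seq_id, sequence in enumerate(database, start=1):
        for start, end in find_instances(pattern, sequence):
            triples.append((seq_id, start, end))
    return triples


# Hypothetical database, chosen to match the instance set on the slide.
db = [
    list("XYAZBAXB"),
    list("XYAZBCDAB"),
]
triples = support(["A", "B"], db)
print(triples, len(triples))
```

Under this definition each starting position yields at most one instance, which is why the support can be read off as the number of triples.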

Frequent vs. Closed Patterns

Closed Unique Iterative Patterns
The number of closed patterns can be too large because of “noise” in the dataset (e.g., the As in the DB). At min_sup = 2, the patterns <A,C>, <A,A,C>, <A,A,A,C> and <A,A,A,A,C> would all be reported. With random interleavings of different noise events, the number of closed patterns is at times too large.

<A,B> is a closed unique pattern. <C,D> is unique but not closed due to <C,E,D>

Mining Algorithm
Main Method; Recursive Pattern Growth; Closure & Uniqueness Checks; Pruning.

Patterns As Features
Software traces do not come with pre-defined feature vectors. One could take the occurrences of every event as a feature; however, this would not capture contextual relationships or temporal ordering. Instead, we can use the mined closed unique patterns as features.

Feature Selection
Select good features for classification purposes, based on the Fisher score, where ni = number of traces in class i (normal/failure), μi = average feature value in class i, and σi = variance of the feature value in class i. The value of a feature in a trace/sequence is its number of instances.
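The formula on this slide did not survive in the transcript. As a sketch, the Python below assumes the commonly used form of the Fisher score, F = Σ ni (μi − μ)² / Σ ni σi, with μ the mean feature value over all traces (μ and the function name `fisher_score` are assumptions, not taken from the slide) and σi the within-class variance as defined above.

```python
def fisher_score(values, labels):
    """Fisher score of one feature, assuming the standard formulation.

    values: the feature's value in each trace (e.g. number of instances)
    labels: the class of each trace (e.g. "normal" / "failure")
    """
    overall_mean = sum(values) / len(values)

    # Group feature values by class.
    by_class = {}
    for value, label in zip(values, labels):
        by_class.setdefault(label, []).append(value)

    between = 0.0  # sum of n_i * (mu_i - mu)^2
    within = 0.0   # sum of n_i * (variance in class i)
    for class_values in by_class.values():
        n = len(class_values)
        mean = sum(class_values) / n
        variance = sum((v - mean) ** 2 for v in class_values) / n
        between += n * (mean - overall_mean) ** 2
        within += n * variance

    if within == 0:
        # Zero spread inside every class: perfectly separating or constant.
        return float("inf") if between > 0 else 0.0
    return between / within


labels = ["normal", "normal", "failure", "failure"]
print(fisher_score([2, 4, 0, 0], labels))  # class means differ
print(fisher_score([1, 2, 1, 2], labels))  # class means are equal
```

A feature whose class means are far apart relative to the spread inside each class scores high; a feature distributed identically in both classes scores zero.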

Feature Selection
Strategy: select the top features so that all traces or sequences are covered at least δ times.
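One plausible reading of this strategy is a greedy pass over the features in decreasing score order. The sketch below is an assumption about how such a pass could look, not the authors' procedure: `select_features`, its parameters, and the toy scores are all hypothetical.

```python
def select_features(scores, occurrences, num_traces, delta):
    """Greedy coverage-based selection (illustrative sketch).

    scores:      dict mapping feature -> discriminative score
    occurrences: dict mapping feature -> set of trace indices containing it
    num_traces:  total number of traces
    delta:       required number of selected features covering each trace
    """
    coverage = [0] * num_traces
    selected = []
    for feature in sorted(scores, key=scores.get, reverse=True):
        if all(count >= delta for count in coverage):
            break  # every trace is already covered delta times
        # Keep the feature only if it helps some under-covered trace.
        if any(coverage[t] < delta for t in occurrences[feature]):
            selected.append(feature)
            for t in occurrences[feature]:
                coverage[t] += 1
    return selected


scores = {"p1": 0.9, "p2": 0.8, "p3": 0.5, "p4": 0.1}
occurrences = {"p1": {0, 1}, "p2": {0, 1}, "p3": {2}, "p4": {0, 1, 2}}
print(select_features(scores, occurrences, num_traces=3, delta=1))
print(select_features(scores, occurrences, num_traces=3, delta=2))
```

With δ = 1 the redundant pattern p2 is skipped because p1 already covers the same traces; raising δ admits more, lower-scoring patterns. Traces that contain no candidate feature can never be covered, so the loop simply ends when the features run out.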

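Classifier Building
Based on the selected discriminative features, each trace or sequence is represented as a feature vector (x1,x2,x3,…), with one component per selected iterative pattern. The value of xi is defined from the i-th selected pattern (as stated under Feature Selection, the value of a feature in a trace is its number of instances). An SVM model is then trained on the two contrasting sets of feature vectors.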

Experiment: Datasets
Synthetic datasets: generated with the trace generator QUARK [LK-WCRE’06]; the input is software models with injected errors, the output a set of traces with labels. Real traces (benchmark programs): the Siemens dataset (4 largest programs), used for test-adequacy studies, with a large number of test cases, injected bugs, and correct output available; we inject multiple bugs and collect labeled traces. Real traces (real program, real bug): a MySQL server data race bug.

Experiments: Evaluation Details
Compare event-based to pattern-based classification. Performance measures: classification accuracy and area under the ROC curve. 5-fold cross-validation, with mining, feature selection and model building done for each fold separately. Handling skewed distribution: failure training data is duplicated many times, while the test set distribution is retained. Three types of bugs: addition, omission and ordering.
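The handling of skewed data can be sketched as follows. This is a minimal illustration under stated assumptions: the slide does not say how many times failures are duplicated, so the code duplicates them until they roughly match the number of normal training traces; `folds_with_oversampling` and the label strings are hypothetical.

```python
import random


def folds_with_oversampling(labels, k=5, failure_label="failure", seed=0):
    """Yield (train_indices, test_indices) for k-fold cross-validation.

    Failure traces are duplicated in the training part only; the test
    part keeps the original class distribution.
    """
    indices = list(range(len(labels)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]

    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        failures = [idx for idx in train if labels[idx] == failure_label]
        normals = [idx for idx in train if labels[idx] != failure_label]
        if failures and len(failures) < len(normals):
            # Assumed duplication factor: enough copies to roughly balance.
            copies = len(normals) // len(failures)
            train = normals + failures * copies
        yield train, test


labels = ["normal"] * 90 + ["failure"] * 10
splits = list(folds_with_oversampling(labels))
for train, test in splits:
    n_fail = sum(labels[i] == "failure" for i in train)
    print(len(train), n_fail, len(test))
```

Because pattern mining and feature selection are done per fold, they would operate on the indices in `train` only, so nothing from the test traces leaks into the selected features.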

Experimental Results: Synthetic

Experimental Results: Siemens & MySQL

Experimental Results: Varying Min-Sup

Experimental Results: Mining Time
Compared: mining closed unique iterative patterns vs. mining closed patterns. Mining closed patterns cannot run at support 100% (out-of-memory exception; 1.7 GB memory, 4 hours).

Related Work
Pattern-based classification: itemsets, Cheng et al. [ICDE’07, ICDE’08]; connected graphs, Liu et al. [SDM’05], Yan et al. [SIGMOD’08]. Mining episodes & repetitive sub-sequences: Mannila et al. [DMKD’97], Ding et al. [ICDE’09]. Fault localization: Liu et al. [FSE’05]. Dickinson et al. [ICSE’01]: clustering program behaviors; detection of failures by looking for small clusters. Bowring et al. [ISSTA’04]: model failing traces and correct traces as first-order Markov models to detect failures.

Conclusion & Future Work
A new pattern-based classification approach that works on repetitive sequential data, applied to failure detection. Classification accuracy improved by 24.68%. Experiments on different datasets and different bug types: omission, addition, ordering. Future work: direct mining of discriminative iterative patterns; application of the classifier to other forms of sequential data (textual data, genomic & protein data, historical data); pipelining to SE tools such as fault localization tools and test suite augmentation tools.

Thank You. Questions, Comments, Advice?