1 Efficient Mining of Iterative Patterns for Software Specification Discovery David Lo † Joint work with: Siau-Cheng Khoo † and Chao Liu ‡ † Prog. Lang.


1 Efficient Mining of Iterative Patterns for Software Specification Discovery
David Lo †
Joint work with: Siau-Cheng Khoo † and Chao Liu ‡
† Prog. Lang. & Sys. Lab, Dept of Comp. Science, National Uni. of Singapore (Current: Sch. of Info. Systems, Singapore Management Uni.)
‡ Data Mining Group, Department of Computer Science, Uni. of Illinois at Urbana-Champaign (Current: Microsoft Research, Redmond)

2 Motivation
o Specification: a description of how software is supposed to behave
  – Example: locking protocol [YEBBD06]
o Existing problems with specifications: they are often missing, incomplete, or outdated [LK06, ABL02, YEBBD06, DSB04, etc.]
o This causes difficulty in understanding an existing system
o It contributes to high software cost
  – Prog. maintenance: 90% of soft. cost [E00, CC02]
  – Prog. understanding: 50% of maint. cost [S84, CC02]
  – US GDP software component: $214.4 billion [US BEA]
o Solution: Specification Discovery

3 Our Specification Discovery Approach
o Analyze program execution traces
o Discover patterns of program behavior, e.g.:
  – Locking protocol [YEBBD06]
  – Telecom. protocol [ITU], etc. (see paper)
o Address the unique nature of prog. traces:
  – A pattern is repeated across a trace
  – A program generates different traces
  – Interesting events might not occur close together

4 Need for a Novel Mining Strategy
o Sequential Pattern Mining [AS95, YHA03, WH04]: a series of events (itemsets) supported by (i.e. a sub-sequence of) a significant number of sequences.
  – Required extension: consider multiple occurrences of patterns in a sequence.
o Episode Mining [MTV97, G03]: a series of closely-occurring events recurring frequently within a sequence.
  – Required extension: consider multiple sequences; remove the restriction that events occur close together.

5 Iterative Patterns – Semantics
o A series of events supported by a significant number of instances:
  – Repeated within a sequence
  – Across multiple sequences
o Follow the semantics of Message Seq. Chart (MSC) [ITU] and Live Seq. Chart (LSC) [DH01].
o Describe constraints between a chart and a trace segment obeying it:
  – Ordering constraint [ITU, KHPLB05]
  – One-to-one correspondence [KHPLB05]

6 Iterative Patterns – Semantics
[Figure: MSC from [ITU] with lifelines Calling Party, Switching Sys, Called Party and messages off_hook, seizure, dial_tone_on, dial_tone_off, ack, ring_tone, answer, connection]
TS1: off_hook, seizure, ack, ring_tone, answer, ring_tone, connection_on
TS2: off_hook, seizure, ack, ring_tone, answer, answer, answer, connection_on
TS3: off_hook, seizure, ack, ev1, ring_tone, ev1, answer, connection_on

7 Iterative Patterns – Semantics
o Given a pattern $P = e_1 e_2 \ldots e_n$, a substring $SB$ is an instance of $P$ iff
  $SB = e_1; [-e_1,\ldots,e_n]^*; e_2; \ldots; [-e_1,\ldots,e_n]^*; e_n$
  where $[-e_1,\ldots,e_n]$ stands for any event other than $e_1,\ldots,e_n$. In other words, no event of the pattern may appear in the gaps between consecutive pattern events.
Example sequences:
S1: off_hook, ring_tone, seizure, answer, ring_tone, connection_on
S2: off_hook, seizure, ring_tone, answer, answer, answer, connection_on
S3: off_hook, seizure, ev1, ring_tone, ev1, answer, connection_on
S4: off_hook, seizure, ev1, ring_tone, ev1, answer, connection_on, off_hook, seizure_int, ev2, ring_tone, ev3, answer, connection_on
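
The instance definition on this slide can be checked mechanically. The following is a minimal Python sketch, not the authors' implementation: `find_instances` is a hypothetical helper, and the pattern used in the example is an assumption, because the slide showed its pattern only graphically.

```python
def find_instances(sequence, pattern):
    """Return (start, end) index pairs (0-based, inclusive) of instances.

    A substring is an instance if it starts with pattern[0], ends with
    pattern[-1], contains the pattern events in order, and every other
    event inside it lies outside the pattern's alphabet.
    """
    alphabet = set(pattern)
    instances = []
    for start in range(len(sequence)):
        if sequence[start] != pattern[0]:
            continue
        matched = 1          # number of pattern events matched so far
        pos = start
        while matched < len(pattern) and pos + 1 < len(sequence):
            pos += 1
            event = sequence[pos]
            if event not in alphabet:
                continue     # gap event outside the alphabet: allowed
            if event != pattern[matched]:
                break        # pattern event out of order: not an instance
            matched += 1
        if matched == len(pattern):
            instances.append((start, pos))
    return instances


# Assumed pattern for illustration only.
pattern = ["off_hook", "seizure", "ring_tone", "answer", "connection_on"]

s1 = ["off_hook", "ring_tone", "seizure", "answer", "ring_tone", "connection_on"]
s2 = ["off_hook", "seizure", "ring_tone", "answer", "answer", "answer", "connection_on"]
s3 = ["off_hook", "seizure", "ev1", "ring_tone", "ev1", "answer", "connection_on"]
s4 = s3 + ["off_hook", "seizure_int", "ev2", "ring_tone", "ev3", "answer", "connection_on"]

for name, seq in [("S1", s1), ("S2", s2), ("S3", s3), ("S4", s4)]:
    print(name, find_instances(seq, pattern))
```

Under this assumed pattern, S1 and S2 contain no instance (a pattern event appears out of order), while S3 and S4 each contain one instance spanning their first seven events. The second half of S4 does not match because it contains seizure_int rather than seizure.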

8 Mining Algorithm

9 Projected Database Operations
o Projected-all of SeqDB w.r.t. pattern P
  – Returns all suffixes of sequences in SeqDB such that the infix immediately preceding each suffix is an instance of P
  – Example entries, as (Seq, Start, End): (1,1,2), (1,4,5), (2,1,2)
o Support of a pattern = size of its proj. DB
o $SeqDB_{ev}$ is formed by considering occurrences of ev
o $SeqDB_{P++ev}$ can be formed from $SeqDB_P^{all}$
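
As a rough illustration of the projected-all operation, the Python sketch below records one (Seq, Start, End) entry and one suffix per instance, then takes the support as the number of entries. The function names and the two toy sequences are assumptions made for this example (the slide's own S1 and S2 were shown only as a figure). The sketch rescans every sequence and does not build $SeqDB_{P++ev}$ incrementally from $SeqDB_P^{all}$ as the slide describes.

```python
def find_instances(sequence, pattern):
    """Return 0-based inclusive (start, end) pairs of pattern instances."""
    alphabet = set(pattern)
    instances = []
    for start in range(len(sequence)):
        if sequence[start] != pattern[0]:
            continue
        matched, pos = 1, start
        while matched < len(pattern) and pos + 1 < len(sequence):
            pos += 1
            if sequence[pos] not in alphabet:
                continue      # events outside the alphabet may fill gaps
            if sequence[pos] != pattern[matched]:
                break         # pattern event out of order
            matched += 1
        if matched == len(pattern):
            instances.append((start, pos))
    return instances


def project_all(seq_db, pattern):
    """Return [((seq_id, start, end), suffix), ...] with 1-based positions."""
    entries = []
    for seq_id, sequence in enumerate(seq_db, start=1):
        for start, end in find_instances(sequence, pattern):
            suffix = sequence[end + 1:]   # what follows the instance
            entries.append(((seq_id, start + 1, end + 1), suffix))
    return entries


# Toy database, assumed for illustration.
seq_db = [["A", "B", "C", "A", "B"],
          ["A", "B", "D"]]

proj = project_all(seq_db, ["A", "B"])
support = len(proj)   # support of a pattern = size of its projected DB
for entry in proj:
    print(entry)
print("support:", support)
```

With these toy sequences the entries are (1,1,2), (1,4,5) and (2,1,2), so the pattern A, B has support 3.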

10 Pruning Strategies
o Apriori Property: if a pattern P is not frequent, P++evs cannot be frequent.
o Closed Pattern Definition: a frequent pattern P is closed if there exists no super-sequence pattern Q where P and Q have the same support and corresponding instances.
o Sketch of Mining Strategy:
  1. Depth-first search
  2. Cut the search space of non-frequent and non-closed patterns
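
The depth-first search with Apriori pruning can be sketched as follows. This is a deliberately naive Python illustration under stated assumptions: `count_instances` and `mine_frequent` are hypothetical names, support is recomputed by rescanning the database instead of using projected databases, and no closure-based pruning is applied, so it returns all frequent patterns rather than only the closed ones.

```python
def count_instances(seq_db, pattern):
    """Count instances of pattern across all sequences in seq_db."""
    alphabet = set(pattern)
    total = 0
    for sequence in seq_db:
        for start in range(len(sequence)):
            if sequence[start] != pattern[0]:
                continue
            matched, pos = 1, start
            while matched < len(pattern) and pos + 1 < len(sequence):
                pos += 1
                if sequence[pos] not in alphabet:
                    continue
                if sequence[pos] != pattern[matched]:
                    break
                matched += 1
            if matched == len(pattern):
                total += 1
    return total


def mine_frequent(seq_db, min_sup):
    """Depth-first enumeration of frequent patterns with Apriori pruning."""
    if min_sup < 1:
        raise ValueError("min_sup must be at least 1")
    events = sorted({ev for sequence in seq_db for ev in sequence})
    frequent = {}

    def grow(pattern):
        support = count_instances(seq_db, pattern)
        if support < min_sup:
            return            # Apriori: no extension P++evs can be frequent
        frequent[pattern] = support
        for ev in events:     # try every one-event suffix extension
            grow(pattern + (ev,))

    for ev in events:
        grow((ev,))
    return frequent


# Toy database, assumed for illustration.
seq_db = [["A", "B", "C", "A", "B"],
          ["A", "B", "D"]]
result = mine_frequent(seq_db, min_sup=2)
print(result)
```

The recursion terminates because a pattern longer than the longest sequence has no instances and is therefore pruned.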

11 Closure Checks and Pruning – Definitions
o Prefix Extension (PE), Suffix Extension (SE): an event that can be added as a prefix or suffix (of length 1) to a pattern, resulting in another pattern with the same support
o Infix Extension (IE): an event that can be inserted as an infix (one or more times) into a pattern, resulting in another pattern with the same support and corresponding instances
[Figure: example sequences S1, S2, S3 and a pattern with its prefix, suffix, and infix extensions]

12 Closure Checks and Pruning – Theorems
o Closure Checks: if a pattern P has no PE, IE, or SE, then it is closed; otherwise it is not closed.
o InfixScan Pruning Property: if a pattern P has an IE and IE ∉ $SeqDB_P$, then we can stop growing P.
[Figure: example sequences S1, S2, S3 and a pattern with its prefix, infix, and suffix extensions. The example pattern is not closed and we can stop growing it.]
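
Part of the closure check can be illustrated in Python. The sketch below covers only prefix and suffix extensions, which the definitions describe in terms of equal support. The infix-extension test, which also requires corresponding instances, is omitted, so an empty result here is necessary but not sufficient for a pattern to be closed. The names `count_instances` and `prefix_suffix_extensions` and the toy database are assumptions for illustration.

```python
def count_instances(seq_db, pattern):
    """Count instances of pattern across all sequences in seq_db."""
    alphabet = set(pattern)
    total = 0
    for sequence in seq_db:
        for start in range(len(sequence)):
            if sequence[start] != pattern[0]:
                continue
            matched, pos = 1, start
            while matched < len(pattern) and pos + 1 < len(sequence):
                pos += 1
                if sequence[pos] not in alphabet:
                    continue
                if sequence[pos] != pattern[matched]:
                    break
                matched += 1
            if matched == len(pattern):
                total += 1
    return total


def prefix_suffix_extensions(seq_db, pattern):
    """Return (prefix_exts, suffix_exts) for a frequent pattern (a tuple).

    An event is an extension if adding it to the front or the back of the
    pattern yields a pattern with the same support.
    """
    events = sorted({ev for sequence in seq_db for ev in sequence})
    support = count_instances(seq_db, pattern)
    prefix_exts = [ev for ev in events
                   if count_instances(seq_db, (ev,) + pattern) == support]
    suffix_exts = [ev for ev in events
                   if count_instances(seq_db, pattern + (ev,)) == support]
    return prefix_exts, suffix_exts


# Toy database, assumed for illustration.
seq_db = [["A", "B", "C", "A", "B"],
          ["A", "B", "D"]]

for p in [("A",), ("B",), ("A", "B")]:
    print(p, prefix_suffix_extensions(seq_db, p))
```

In this toy database, the pattern A has suffix extension B and the pattern B has prefix extension A, so neither is closed. The pattern A, B has no prefix or suffix extension.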

13 Mining Algorithm [Figure: algorithm listing with parts labeled Main Method, Recursive Pattern Growth, Closure Checks, InfixScan Pruning]

14 Performance & Case Studies

15 Performance Study
o Dataset: TCAS
  – Program traces from the Siemens dataset
  – Commonly used as a benchmark in error localization

16 Case Study
o JBoss App Server
  – Most widely used J2EE server
  – A large, industrial program: more than 100 KLOC
  – Analyze and mine the behavior of the transaction component of JBoss App Server
o Trace generation
  – Weave an instrumentation aspect using AOP
  – Run a set of test cases
  – Obtain 28 traces totaling 2551 events (91 events per trace on average)
o Mine with min_sup set at 65% of |SeqDB|: 29 s vs. >8 hrs

17 Case Study
o Post-processing & ranking: 44 patterns
o Top-ranked patterns correspond to interesting patterns of software behavior:
  – Top longest patterns
  – Most observed pattern

18 Longest Iter. Pattern from JBoss Transaction Component

19 Conclusion
o Addressing the specification problem reduces software cost and saves expensive resources.
o Novel formulation & technique to mine a closed set of iterative patterns:
  – Extends closed sequential pattern & episode mining
  – Based on the ordering and one-to-one correspondence requirements of MSC & LSC
Future Work
o Mining other forms of specification commonly used by software engineers: Live Sequence Charts [OOPSLA’07 (Poster), ASE’07], Linear Temporal Logic Rules [PLDI’07 (SRC)], etc.
o Case study, improve mining speed, constraints
o Other uses, post-mining step

20 Acknowledgement o Jiawei Han, UIUC o Shahar Maoz, Weizmann, Israel o Gazelle dataset, Blue Martini Software

21 Thank you for your attention. Questions? Advice? Comments?