1 Efficient Discovery of Frequent Approximate Sequential Patterns Feida Zhu, Xifeng Yan, Jiawei Han, Philip S. Yu ICDM 2007.

Slides:



Advertisements
Similar presentations
Lecture 7. Network Flows We consider a network with directed edges. Every edge has a capacity. If there is an edge from i to j, there is an edge from.
Advertisements

CS 336 March 19, 2012 Tandy Warnow.
Indexing DNA Sequences Using q-Grams
Graph Theory Aiding DNA Fragment Assembly Jonathan Kaptcianos advisor: Professor Jo Ellis-Monaghan Work.
Mining Compressed Frequent- Pattern Sets Dong Xin, Jiawei Han, Xifeng Yan, Hong Cheng Department of Computer Science University of Illinois at Urbana-Champaign.
Longest Common Subsequence
gSpan: Graph-based substructure pattern mining
Introduction to Algorithms Rabie A. Ramadan rabieramadan.org 2 Some of the sides are exported from different sources.
Greedy Algorithms CS 6030 by Savitha Parur Venkitachalam.
Bioinformatics III1 We will present an algorithm that originated by Ford and Fulkerson (1962). Idea: increase the flow in a network iteratively until it.
Mining Top-K Large Structural Patterns in a Massive Network Feida Zhu 1, Qiang Qu 2, David Lo 1, Xifeng Yan 3, Jiawei Han 4, and Philip S. Yu 5 1 Singapore.
IncSpan: Incremental Mining of Sequential Patterns in Large Databases Hong Cheng,Xifeng Yan,Jiawei Han University of Illinois at Urbana-Champaign.
1 IncSpan :Incremental Mining of Sequential Patterns in Large Database Hong Cheng, Xifeng Yan, Jiawei Han Proc Int. Conf. on Knowledge Discovery.
GNANA SUNDAR RAJENDIRAN JOYESH MISHRA RISHI MISHRA FALL 2008 BIOINFORMATICS Clustering Method for Repeat Analysis in DNA sequences.
Temporal Pattern Matching of Moving Objects for Location-Based Service GDM Ronald Treur14 October 2003.
Finding Compact Structural Motifs Presented By: Xin Gao Authors: Jianbo Qian, Shuai Cheng Li, Dongbo Bu, Ming Li, and Jinbo Xu University of Waterloo,
The max flow problem
Lecture 11. Matching A set of edges which do not share a vertex is a matching. Application: Wireless Networks may consist of nodes with single radios,
6/29/20151 Efficient Algorithms for Motif Search Sudha Balla Sanguthevar Rajasekaran University of Connecticut.
Mining Long Sequential Patterns in a Noisy Environment Jiong Yang, Wei Wang, Philip S. Yu, Jiawei Han SIGMOD 2002.
Mining Association Rules
33 rd International Conference on Very Large Data Bases, Sep. 2007, Vienna Towards Graph Containment Search and Indexing Chen Chen 1, Xifeng Yan 2, Philip.
Mining Long Sequential Patterns in a Noisy Environment Jiong Yang, Wei Wang, Philip S. Yu, and Jiawei Han SIGMOD 2002 Presented by: Eddie Date: 2002/12/23.
Copyright © Cengage Learning. All rights reserved. CHAPTER 11 ANALYSIS OF ALGORITHM EFFICIENCY ANALYSIS OF ALGORITHM EFFICIENCY.
Lecture 11. Matching A set of edges which do not share a vertex is a matching. Application: Wireless Networks may consist of nodes with single radios,
FAST FREQUENT FREE TREE MINING IN GRAPH DATABASES Marko Lazić 3335/2011 Department of Computer Engineering and Computer Science,
Space-Efficient Sequence Alignment Space-Efficient Sequence Alignment Bioinformatics 202 University of California, San Diego Lecture Notes No. 7 Dr. Pavel.
USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM.
Efficient Gathering of Correlated Data in Sensor Networks
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining By Tan, Steinbach, Kumar Lecture.
Efficient Data Mining for Calling Path Patterns in GSM Networks Information Systems, accepted 5 December 2002 SPEAKER: YAO-TE WANG ( 王耀德 )
MINING FREQUENT ITEMSETS IN A STREAM TOON CALDERS, NELE DEXTERS, BART GOETHALS ICDM2007 Date: 5 June 2008 Speaker: Li, Huei-Jyun Advisor: Dr. Koh, Jia-Ling.
Combinatorial Algorithms Reference Text: Kreher and Stinson.
The Examination of Residuals. Examination of Residuals The fitting of models to data is done using an iterative approach. The first step is to fit a simple.
An Implementation of The Teiresias Algorithm Na Zhao Chengjun Zhan.
An Efficient Polynomial Delay Algorithm for Pseudo Frequent Itemset Mining 2/Oct/2007 Discovery Science 2007 Takeaki Uno (National Institute of Informatics)
Mining Approximate Frequent Itemsets in the Presence of Noise By- J. Liu, S. Paulsen, X. Sun, W. Wang, A. Nobel and J. Prins Presentation by- Apurv Awasthi.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
Data Mining Association Rules: Advanced Concepts and Algorithms Lecture Notes Introduction to Data Mining by Tan, Steinbach, Kumar.
CSE 589 Part VI. Reading Skiena, Sections 5.5 and 6.8 CLR, chapter 37.
MINING COLOSSAL FREQUENT PATTERNS BY CORE PATTERN FUSION FEIDA ZHU, XIFENG YAN, JIAWEI HAN, PHILIP S. YU, HONG CHENG ICDE07 Advisor: Koh JiaLing Speaker:
Association Rule Mining
1 AC-Close: Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery Advisor : Dr. Koh Jia-Ling Speaker : Tu Yi-Lang Date : Hong.
UNIT 5.  The related activities of sorting, searching and merging are central to many computer applications.  Sorting and merging provide us with a.
Mining Top-K Large Structural Patterns in a Massive Network Feida Zhu 1, Qiang Qu 2, David Lo 1, Xifeng Yan 3, Jiawei Han 4, and Philip S. Yu 5 1 Singapore.
Relation. Combining Relations Because relations from A to B are subsets of A x B, two relations from A to B can be combined in any way two sets can be.
Mining Graph Patterns Efficiently via Randomized Summaries Chen Chen, Cindy X. Lin, Matt Fredrikson, Mihai Christodorescu, Xifeng Yan, Jiawei Han VLDB’09.
Predicting the Location and Time of Mobile Phone Users by Using Sequential Pattern Mining Techniques Mert Özer, Ilkcan Keles, Ismail Hakki Toroslu, Pinar.
By Pavan kumar V.V.N. Introduction  Brain’s has extraordinary computational power which is determined in large part by the topology and geometry of its.
Graph Indexing From managing and mining graph data.
Indexing and Mining Free Trees Yun Chi, Yirong Yang, Richard R. Muntz Department of Computer Science University of California, Los Angeles, CA {
Approximation Algorithms based on linear programming.
Computational Challenges in BIG DATA 28/Apr/2012 China-Korea-Japan Workshop Takeaki Uno National Institute of Informatics & Graduated School for Advanced.
April 21, 2016Introduction to Artificial Intelligence Lecture 22: Computer Vision II 1 Canny Edge Detector The Canny edge detector is a good approximation.
Cohesive Subgraph Computation over Large Graphs
Research in Computational Molecular Biology , Vol (2008)
Design and Analysis of Algorithm
The Taxi Scheduling Problem
On Efficient Graph Substructure Selection
Fast Sequence Alignments
Presented by: Chang Jia As for: Pattern Recognition
Efficient Subgraph Similarity All-Matching
Market Basket Analysis and Association Rules
SEG5010 Presentation Zhou Lanjun.
gApprox: Mining Frequent Approximate Patterns from a Massive Network
Approximate Graph Mining with Label Costs
Fragment Assembly 7/30/2019.
Clustering.
Dynamically Maintaining Frequent Items Over A Data Stream
Presentation transcript:

1 Efficient Discovery of Frequent Approximate Sequential Patterns Feida Zhu, Xifeng Yan, Jiawei Han, Philip S. Yu ICDM 2007

2 Introduction  Frequent sequential pattern mining remains one of the most important data mining tasks.  It has been shown that the capacity to accommodate approximation in the mining process has become critical due to inherent noise and imprecision in data.

3 Introduction  Mining consensus patterns provides one way to produce compact mining result under general distance measures.  It remains a challenge: how to efficiently mine the complete set of approximate sequential patterns under some distance measure.

4 Introduction  Break-down-and-build-up :  Although for an approximate pattern, the sequences may have different patterns of substitutions, they can in fact be classified into groups, which we call strands.  Each strand is a set of sequences sharing a unified pattern representation together with its support.

5 Introduction  Break-down-and-build-up :  These initial strands will then be iteratively assembled into longer strands in a local search fashion.  In the second “build-up” stage, different strands are then grouped based on their constituting sequences to form a support set so that the frequent approximate patterns would be identified.

6 Problem Formulation  In our model, two sequential patterns are considered approximately the same if and only if they are of equal length and their distance is within a user-specified error tolerance.

7 Problem Formulation  U is represented as a set of indices of S as all substrings in U share the same length as P.

8 Problem Formulation

9  A frequent approximate substring will be abbreviated as a FAS.  The frequent approximate substring mining problem is called FASM.

10 Algorithm Design  Pattern(S 1, S 2 ) = 〈 ATCCG, 1,ACAG, 1, TCAGTTGCA 〉.  If the error tolerance threshold δ = 0.1 and minimum frequency threshold θ = 4, then S 2 is a FAS since the other three substrings are within Hamming distance 2 from S 2.

11 Algorithm Design  All the substrings in a strand U share the same alternating sequence of maximal matching substrings and gaps of mismatches, which we call Pat(U).  We use |Pat(U)| to denote the length of the substrings in U. We use Gap(U) to denote the number of gaps in Pat(U) and Miss(U) to denote the number of total mismatches in Pat(U).

12  We call a strand U valid if the distance between any two substrings of U satisfy the user- specified error tolerance threshold, i.e., Miss(U) ≤ |Pat(U)|δ.  Find all the closed valid strands of P and let the union of them be X. P is a FAS if and only if the cardinality of X is at least θ.  Both strands {S 1, S 2 } and {S 2, S 3, S 4 } are valid. Suppose these two strands are also closed, then combining them we get a support set of size 4, satisfying the frequency requirement. As such, S 2 is a FAS. Algorithm Design

13  Growing Strand :  We scan the entire tape and, for each strand encountered, checks on both ends to see if the current strand can be grown by assembling neighboring strands. Let the result set be X. Algorithm Design

14  Grouping Strand :  Construct a substring relation graph G from X. The vertex set is all the substrings in the strands of X.  There is an edge between two substrings if and only if the Hamming distance between two substrings is within the error tolerance.  A substring is a frequent approximate substring if and only if the degree of the corresponding vertex is greater than or equal to the minimum frequency threshold. Algorithm Design

15 Local Search  |Pat(U 1 )| = 20,Miss(U 1 ) = 0 and |Pat(U 2 )| = 40,Miss(U 2 ) = 0. There is a gap of 7 mismatches between them.  Error tolerance is δ = 0.1. A valid strand U can accommodate further mismatches on either ends up to a distance of |Pat(U)|δ − Miss(U).  U 1 can accommodate d 1 = 2 extra mismatches and U 2 can accommodate d 2 = 4 extra mismatches.

16 Local Search  Any strand encountered within a distance of d = (|Pat(U)|δ − Miss(U))/(1− δ) can be assembled with the current strand to form a new valid strand.

17 Performance Study  Database : a real soybean genomic DNA sequence, CloughBAC : 103,334 base pairs in length  δ = 0.1, θ = 3

18 Performance Study  The approximate sequences are dense around the size of 10 and become sparse from size 15 to form a long tail.

19 Performance Study  It is compared against the one without the local search technique to demonstrate its importance in boosting the mining efficiency.

20 Performance Study  More lenient error tolerance results in more output sequences and consequently a longer running time.

21  As the minimum frequency threshold increases, the output size decreases sharply while the running time almost remains the same. Performance Study