Mining Long Sequential Patterns in a Noisy Environment
Jiong Yang, Wei Wang, Philip S. Yu, Jiawei Han
SIGMOD 2002

Outline
–Introduction
–Model
–Algorithm
–Evaluation
–Conclusion

Introduction
Pattern discovery in long sequences has many applications. The common metric used to qualify a significant pattern is support. Noise usually exists:
–A symbol is misrepresented by another symbol.
–An occurrence of a pattern then cannot be recognized. E.g., when the sequence d1d3d4d5 is observed as d1d2d4d5 (d3 misread as d2), the pattern d1d3 cannot be found.

Introduction
–The observed support of a pattern may therefore be less than its real support.
–Some frequent patterns cannot be discovered due to the noise.
Failing to find a frequent pattern is more critical when the pattern is long:
–Long patterns are more vulnerable to distortions.
If the database is noisy and contains long patterns, support is not a suitable measure of significant patterns.

Introduction
–E.g., in gene sequence analysis at the granularity of amino acids, the length of a gene expression is usually a few thousand symbols, and noise is common: mutations of amino acids occur with a non-negligible probability.
Compatibility matrix:
–a matrix whose entries give the probability that an observed symbol corresponds to each true underlying symbol;
–each observed symbol is thus interpreted as an occurrence of a set of symbols, with various probabilities.

Introduction
An example of the compatibility matrix (reading down the first column): Prob(d1|d1)=0.9, Prob(d2|d1)=0.05, Prob(d3|d1)=0.05, Prob(d4|d1)=0, Prob(d5|d1)=0.

             Observed value
True value   d1     d2     d3     d4    d5
d1           0.90   0.10   0.00   …     …
d2           0.05   0.80   0.05   …     …
d3           0.05   0.00   0.70   …     …
d4           0.00   0.10   0.10   …     …
d5           0.00   0.00   0.15   …     …

Based on the compatibility matrix, a new measure, called match, is proposed to qualify important patterns.
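As an illustration (not from the paper), the matrix can be held as a plain dict of dicts, C[true][observed]; this sketch fills in only the column entries that appear on these slides, and interpretations() is a hypothetical helper:

```python
# Compatibility matrix, C[true][observed] = Prob(true | observed).
# Only the observed-d1/d2/d3 columns given on these slides are filled in.
C = {
    "d1": {"d1": 0.90, "d2": 0.10, "d3": 0.00},
    "d2": {"d1": 0.05, "d2": 0.80, "d3": 0.05},
    "d3": {"d1": 0.05, "d2": 0.00, "d3": 0.70},
    "d4": {"d1": 0.00, "d2": 0.10, "d3": 0.10},
    "d5": {"d1": 0.00, "d2": 0.00, "d3": 0.15},
}

def interpretations(observed):
    """Read one observed symbol as a distribution over true symbols."""
    return {t: col[observed] for t, col in C.items() if col[observed] > 0}

print(interpretations("d1"))  # {'d1': 0.9, 'd2': 0.05, 'd3': 0.05}
```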

Model
I = {d1, d2, …, dm}. A sequence (pattern) of length n is an ordered list of n symbols from I.
–E.g., d1d2d1 is a sequence (pattern) of length 3.
Given a sequence S = s1 s2 … s_ls, a pattern P = d1 d2 … d_lp is a subsequence (subpattern) of S
–if there exists a list of integers 1 ≤ i1 < i2 < … < i_lp ≤ ls such that dj = s_ij for 1 ≤ j ≤ lp. S is also called a supersequence (superpattern) of P.
–E.g., d1d4d5 is a subpattern of d1d3d4d5.
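A minimal check of the subsequence definition, assuming patterns and sequences are lists of symbol strings (the iterator idiom enforces the increasing-index condition):

```python
def is_subpattern(P, S):
    """True iff P's symbols occur in S in order, gaps allowed."""
    it = iter(S)
    return all(symbol in it for symbol in P)

print(is_subpattern(["d1", "d4", "d5"], ["d1", "d3", "d4", "d5"]))  # True
print(is_subpattern(["d1", "d3"], ["d1", "d2", "d4", "d5"]))        # False
```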

Model
Given I = {d1, d2, …, dm}, the compatibility matrix is an m × m matrix.
–Its entry C(di, dj) = Prob(true_value = di | observed_value = dj), where 1 ≤ i, j ≤ m.
–The compatibility matrix is assumed to be provided by a domain expert.
Given a pattern P = d1 d2 … dl and a sequence s = d1' d2' … dl' of the same length, the match of P in s, denoted M(P, s),
–is defined as the conditional probability that s corresponds to an occurrence of P.

Model
If each observed symbol is generated independently, then M(P, s) = Prob(P | s) = ∏(1 ≤ i ≤ l) C(di, di').
–E.g., if P = d1d2 and s = d1d3, then M(P, s) = C(d1, d1) × C(d2, d3) = 0.9 × 0.05 = 0.045.
–Here P is not a subpattern of s, yet M(P, s) > 0.
Given a sequence S of length ls and a pattern P of length lp with ls ≥ lp, the match of P in S
–is defined as the maximal match of P over every distinct length-lp subsequence of S,
–i.e. M(P, S) = max over subsequences s of S of M(P, s).

Model
–There can be as many as (ls choose lp) distinct subsequences.
–Dynamic programming computes M(P, S) in O(lp × ls) time,
–with an optimization bringing this to nearly O(ls) time.
Given a pattern P and a database D of N sequences, the match of P in D is defined as M(P, D) = Σ(S ∈ D) M(P, S) / N.
A minimum match threshold min_match is specified by the user. All patterns whose match meets the min_match threshold are called frequent patterns.
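A sketch of the O(lp × ls) dynamic program, under the assumptions above (patterns as lists of symbols, C[true][observed] as a dict of dicts); match_in_db is a hypothetical name for the database-level average:

```python
def match(P, S, C):
    """M(P, S): maximal match of P over all length-|P| subsequences of S."""
    f = [1.0] + [0.0] * len(P)           # f[j] = best match of P[:j] so far
    for s in S:
        for j in range(len(P), 0, -1):   # descending j: use each s at most once
            f[j] = max(f[j], f[j - 1] * C[P[j - 1]][s])
    return f[len(P)]

def match_in_db(P, D, C):
    """M(P, D) = sum of M(P, S) over sequences S in D, divided by N."""
    return sum(match(P, S, C) for S in D) / len(D)

C = {"d1": {"d1": 0.90, "d3": 0.00},
     "d2": {"d1": 0.05, "d3": 0.05}}
print(match(["d1", "d2"], ["d1", "d3"], C))  # 0.045, the slide's example
```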

Model
The match model can accommodate misrepresentation of symbols due to noise.
The Apriori property also holds for the match metric:
–the match of a pattern P in a database D is ≤ the match of any subpattern of P.
In a noise-free environment, the match model subsumes the support model:
–let the compatibility matrix be the identity matrix (C(di, dj) = 1 if i = j and 0 otherwise);
–then the match of a pattern equals its support.
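A quick illustration of the noise-free reduction (aligned_match is a hypothetical helper computing the product for one equal-length alignment):

```python
symbols = ["d1", "d2", "d3", "d4", "d5"]
C_id = {a: {b: float(a == b) for b in symbols} for a in symbols}

def aligned_match(P, s, C):
    """Match of P against an equal-length subsequence s: product of C entries."""
    prod = 1.0
    for a, b in zip(P, s):
        prod *= C[a][b]
    return prod

# Under the identity matrix the product is 1 iff P matches s symbol-for-symbol,
# so M(P, S) is 1 exactly when P is a subpattern of S: match equals support.
print(aligned_match(["d1", "d3"], ["d1", "d3"], C_id))  # 1.0
print(aligned_match(["d1", "d3"], ["d1", "d2"], C_id))  # 0.0
```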

Algorithm
Problems to tackle:
–a large number of frequent patterns under the match metric;
–frequent patterns of long length.
Techniques used: sampling, Chernoff bound estimation, and border collapsing.
Three phases:
–Phase 1: finding the match of individual symbols, and sampling
–Phase 2: ambiguous pattern discovery on the sample
–Phase 3: border collapsing

Algorithm
Phase 1: finding the match of each symbol, and sampling
–For a sequence Di in the database, the match of a symbol d in Di is M(d, Di) = max over di ∈ Di of C(d, di).
–E.g., if Di = d2d3 (using the compatibility matrix above), then
M(d1, Di) = max{0.1, 0} = 0.1
M(d2, Di) = max{0.8, 0.05} = 0.8
M(d3, Di) = max{0, 0.7} = 0.7
M(d4, Di) = max{0.1, 0.1} = 0.1
M(d5, Di) = max{0, 0.15} = 0.15
–Draw a sample of the whole database and store it in memory.
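A sketch of the per-symbol match and the sampling step (draw_sample is a hypothetical name; the paper's sampling details may differ):

```python
import random

def symbol_match(d, Di, C):
    """M(d, Di) = max over symbols di in sequence Di of C[d][di]."""
    return max(C[d][di] for di in Di)

def draw_sample(database, n, seed=0):
    """Keep a uniform random sample of n sequences in memory."""
    return random.Random(seed).sample(database, n)

# With the matrix from the earlier sketch and Di = ["d2", "d3"]:
# symbol_match("d2", ["d2", "d3"], C) == max(0.8, 0.05) == 0.8
```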

Algorithm
Phase 2: ambiguous pattern discovery on the sample dataset
Chernoff bound estimation:
–If n is the sample size and μ is the match of a pattern P = d1 d2 … dl in the sample, then P is
frequent in the whole database with probability 1 − δ if μ > min_match + ε,
infrequent in the whole database with probability 1 − δ if μ < min_match − ε,
ambiguous if μ ∈ (min_match − ε, min_match + ε),
where ε is given by the Chernoff bound in terms of n, δ, and R, the spread of μ: R = min(1 ≤ i ≤ l) match[di].
–δ can be selected by the user, e.g. δ = 0.001, so 1 − δ = 99.9%.
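A sketch of the labeling rule. The exact form of ε was lost from the slide; the constant assumed here, ε = R·sqrt(ln(2/δ)/(2n)), is the standard Hoeffding/Chernoff form and may differ from the paper's:

```python
from math import log, sqrt

def label(mu, n, R, min_match, delta=0.001):
    """Classify a pattern from its sample match mu.
    Assumes the Hoeffding-style bound eps = R * sqrt(ln(2/delta) / (2n))."""
    eps = R * sqrt(log(2 / delta) / (2 * n))
    if mu > min_match + eps:
        return "frequent"     # frequent in the whole db with probability 1 - delta
    if mu < min_match - eps:
        return "infrequent"   # infrequent in the whole db with probability 1 - delta
    return "ambiguous"        # must be resolved against the full database
```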

Algorithm
Phase 2: ambiguous pattern discovery on the sample dataset
–Use an existing algorithm to mine the sample.
–Label each pattern discovered in the sample as frequent, ambiguous, or infrequent according to the Chernoff bound estimation.
–Find the border (denoted FQT) between the frequent and ambiguous patterns, and the border (denoted INFQT) between the ambiguous and infrequent patterns.

Algorithm
Phase 2: ambiguous pattern discovery on the sample dataset
FQT = {p | p is frequent ∧ every immediate superpattern of p is ambiguous or infrequent}
INFQT = {p | p is ambiguous ∧ all superpatterns of p are infrequent}
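A sketch of extracting the two borders from labeled sample patterns (patterns as tuples; the subpattern test repeats the iterator idiom from the Model section):

```python
def is_subpattern(P, S):
    it = iter(S)
    return all(symbol in it for symbol in P)

def borders(labeled):
    """labeled: {pattern_tuple: 'frequent' | 'ambiguous' | 'infrequent'}.
    Returns (FQT, INFQT) as defined on this slide."""
    def supers(p):
        return [q for q in labeled if len(q) == len(p) + 1 and is_subpattern(p, q)]
    fqt = [p for p, lab in labeled.items()
           if lab == "frequent" and all(labeled[q] != "frequent" for q in supers(p))]
    infqt = [p for p, lab in labeled.items()
             if lab == "ambiguous" and all(labeled[q] == "infrequent" for q in supers(p))]
    return fqt, infqt
```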

Algorithm
Phase 3: border collapsing
–Input: the frequent patterns above FQT, the ambiguous patterns between FQT and INFQT, and the infrequent patterns below INFQT.
–Output: a single border with the frequent patterns above it and the infrequent patterns below it.
–Scan the database to count the matches of the ambiguous patterns and determine whether each is frequent or infrequent.

Algorithm
Phase 3: border collapsing
–If memory can hold the counters for all ambiguous patterns, one database scan suffices.
–Sometimes there is a huge number of ambiguous patterns and the database has to be scanned several times:
select a set of ambiguous patterns until memory is filled by their counters, scan the database to get their matches, and collapse the border;
repeat this select-scan-collapse procedure until the two borders become one.
–To minimize the number of I/O passes, the ambiguous patterns with the highest border-collapsing power are selected.

Algorithm
Phase 3: border collapsing
–How to select patterns? Like a binary search: probe the middle levels of the ambiguous region first.
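A sketch of the select-scan-collapse loop with binary-search-style level selection. count_matches stands in for one counting pass over the full database and counter_budget for the number of counters memory can hold; neither name is from the paper:

```python
def collapse_border(ambiguous, counter_budget, count_matches, min_match):
    """ambiguous: set of pattern tuples between FQT and INFQT.
    Repeatedly count a middle level of the ambiguous band (binary-search
    style), then propagate labels through the lattice via Apriori."""
    def is_sub(p, q):
        it = iter(q)
        return all(s in it for s in p)

    resolved = {}
    while ambiguous:
        levels = sorted({len(p) for p in ambiguous})
        mid = levels[len(levels) // 2]                    # probe the middle level
        batch = [p for p in ambiguous if len(p) == mid][:counter_budget]
        for p, m in count_matches(batch).items():         # one database scan
            lab = "frequent" if m >= min_match else "infrequent"
            resolved[p] = lab
            ambiguous.discard(p)
            for q in list(ambiguous):                     # collapse the border:
                if lab == "frequent" and is_sub(q, p):    # subpatterns inherit
                    resolved[q] = "frequent"; ambiguous.discard(q)
                if lab == "infrequent" and is_sub(p, q):  # superpatterns pruned
                    resolved[q] = "infrequent"; ambiguous.discard(q)
    return resolved
```

Collapsed patterns receive a label without ever being counted, which is why their exact matches remain unknown (see the next slide).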

Algorithm
Phase 3: border collapsing [figure]

Algorithm
Phase 3: border collapsing
–If there are x levels of ambiguous patterns, a level-wise method needs to scan the database O(x) times, while the border collapsing method needs only O(log x) scans.
–For some previously ambiguous patterns, the labels (frequent or infrequent) become known, but their exact matches remain unknown after this step.

Evaluation
Databases
–Standard database: a protein database consisting of 600K sequences of amino acids; the average length of a sequence is around …; 20 different symbols.
–Test databases are generated from the standard database by injecting random noise. A parameter α controls the degree of noise: a symbol d in the standard database remains the same in the test database with probability 1 − α, and changes to any one of the other 19 symbols with probability α/19.
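A sketch of the test-database generator, with α standing in for the slide's (garbled) noise parameter; names are illustrative:

```python
import random

def make_test_db(standard_db, alpha, symbols, seed=0):
    """Each symbol survives with probability 1 - alpha; otherwise it is
    replaced uniformly by one of the other len(symbols) - 1 symbols
    (probability alpha/19 each for the 20 amino acids)."""
    rng = random.Random(seed)
    def perturb(d):
        if rng.random() < alpha:
            return rng.choice([s for s in symbols if s != d])
        return d
    return [[perturb(d) for d in seq] for seq in standard_db]
```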

Evaluation
Robustness of the match model
–Mine the standard database:
R_M = {frequent patterns found by the match model}
R_S = {frequent patterns found by the support model}
R_M = R_S on the noise-free standard database.
–Mine the test database to obtain R_M' and R_S'.
Accuracy: |R_M' ∩ R_M| / |R_M'| and |R_S' ∩ R_S| / |R_S'|
Completeness: |R_M' ∩ R_M| / |R_M| and |R_S' ∩ R_S| / |R_S|
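The two measures as set operations (found and truth as Python sets of pattern tuples; an illustrative sketch):

```python
def accuracy_completeness(found, truth):
    """accuracy = |found ∩ truth| / |found|; completeness = |found ∩ truth| / |truth|."""
    hit = len(found & truth)
    return hit / len(found), hit / len(truth)

acc, comp = accuracy_completeness({("d1", "d2"), ("d1",)}, {("d1",)})
print(acc, comp)  # 0.5 1.0
```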

Evaluation
Robustness of the match model under different noise degrees:
–Match model: accuracy and completeness stay above 95%.
–Support model: vulnerable to the noise.

Evaluation
Robustness of the match model across pattern lengths:
–Match model: unaffected by the pattern length.
–Support model: degrades as the patterns become long.

Evaluation
Robustness of the match model when the compatibility matrix itself contains errors:
–with a 10% error in the matrix, the match model still achieves 88% accuracy and 85% completeness.

Evaluation
Sample size
–Patterns whose match falls in the range (min_match − ε, min_match + ε) are ambiguous.
–A larger sample size gives a smaller ε and hence fewer ambiguous patterns.

Evaluation
Spread of match, R
–R(P) = the minimum match of the symbols involved in P.
–The longer the pattern, the tighter (smaller) R.
–The higher the degree of noise, the smaller R.

Evaluation
Effects of the confidence 1 − δ
–The previous experiments used 1 − δ = 0.9999.

Evaluation
Missing patterns [chart]

Evaluation
Performance of the border collapsing algorithm, compared with:
–Max-Miner, one of the fastest algorithms for mining long frequent patterns;
–a sampling method that uses level-wise search to finalize the border.
Experimental results:
–CPU time vs. min_match
–number of database scans vs. min_match
–number of database scans vs. the length of the longest pattern

Evaluation
Performance of the border collapsing algorithm [charts]

Evaluation
Scalability w.r.t. the number of distinct symbols m
–synthetic database: 100K sequences with an average length of 1000
–a larger m leads to fewer frequent patterns
–a larger m leads to a larger (m × m) compatibility matrix

Evaluation
Scalability w.r.t. the number of distinct symbols [charts]

Conclusion
In a noisy environment, the observed symbols may differ from the real ones. The compatibility matrix provides a probabilistic connection from each observation to the underlying true value. A new metric, match, is proposed to measure significant patterns. The experimental results show that:
–the match model is robust w.r.t. noise;
–the border collapsing algorithm is very efficient for finding long patterns.

End. Questions?