Published by Jemimah Welch. Modified over 9 years ago.
Intelligent Database Systems Lab, National Yunlin University of Science and Technology (國立雲林科技大學)
CONTOUR: an efficient algorithm for discovering discriminating subsequences
Jianyong Wang, Yuzhou Zhang, Lizhu Zhou, George Karypis, Charu C. Aggarwal
DMKD, Vol. 18, No. 1, 2009, pp. 1-29.
Presenter: Wei-Shen Tai, 2009/3/11
Outline
  Introduction
  Problem formulation
  Efficiently mining summarization subsequences
  Summarization subsequence based clustering
  Empirical results
  Conclusions
  Comments
Motivation
  Make frequent sequence mining more efficient.
  One way to obtain a subset of useful frequent subsequences is to apply any existing frequent sequence mining algorithm to find the complete set and then select from it.
  However, mining the complete set of frequent subsequences is very time consuming for large sequence databases.
Objective
  Develop effective search space pruning methods.
  Find summarization subsequences that represent the original input sequences.
Problem formulation
  Subsequence
    If sequence Sα is contained in sequence Sβ, Sα is called a subsequence of Sβ.
  Absolute support of a sequence
    The number of input sequences in SDB that contain Sα, denoted by sup_SDB(Sα).
  Summarization subsequences
    A set of representative subsequences that serves as a concise summarization of the input sequences.
  Internal similarity of micro-cluster Cλ
  Slide figure label: CABAC → BAC
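The definitions of subsequence and absolute support can be illustrated with a minimal sketch. It assumes the usual sequential pattern mining convention that events must appear in order but gaps are allowed; the function names and the toy database are hypothetical and not taken from the paper.

```python
from typing import Iterable, Sequence


def is_subsequence(candidate: Sequence[str], sequence: Sequence[str]) -> bool:
    """Return True if `candidate` occurs in `sequence` in order (gaps allowed)."""
    events = iter(sequence)
    # `event in events` consumes the iterator, so the order of events is respected.
    return all(event in events for event in candidate)


def absolute_support(candidate: Sequence[str], sdb: Iterable[Sequence[str]]) -> int:
    """Count the input sequences in `sdb` that contain `candidate`."""
    return sum(1 for sequence in sdb if is_subsequence(candidate, sequence))


# Hypothetical toy database (an assumption, not the database used on the slides).
toy_sdb = ["CABAC", "ABCB", "CBAC", "ACBBA"]

print(is_subsequence("BAC", "CABAC"))    # True
print(absolute_support("BAC", toy_sdb))  # 2 (CABAC and CBAC)
```

A subsequence is frequent when its absolute support reaches the user-given threshold min_sup.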
Efficiently mining summarization subsequences
  Frequent subsequence enumeration
    For each prefix, the mining algorithm builds its projected database and computes the set of locally frequent events.
    Example with min_sup = 2: SDB|AA = {C, A}, but neither event can be used to extend the prefix AA (to AAC or AAA).
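The enumeration step can be sketched as a prefix projection followed by a count of locally frequent events. This is a PrefixSpan-style illustration under stated assumptions, not the authors' implementation: the helper names and the toy database are hypothetical, and each event is a single character.

```python
from collections import Counter
from typing import Dict, List, Optional


def project(sequence: str, prefix: str) -> Optional[str]:
    """Return the suffix after the leftmost instance of `prefix`, or None if absent."""
    position = 0
    for event in prefix:
        found = sequence.find(event, position)
        if found == -1:
            return None
        position = found + 1
    return sequence[position:]


def projected_database(sdb: List[str], prefix: str) -> Dict[int, str]:
    """Map the index of each sequence containing `prefix` to its projected sequence."""
    projected = {}
    for index, sequence in enumerate(sdb):
        suffix = project(sequence, prefix)
        if suffix is not None:
            projected[index] = suffix
    return projected


def locally_frequent_events(projected: Dict[int, str], min_sup: int) -> Dict[str, int]:
    """Events that occur in at least `min_sup` projected sequences."""
    counts: Counter = Counter()
    for suffix in projected.values():
        counts.update(set(suffix))  # count each event once per projected sequence
    return {event: n for event, n in counts.items() if n >= min_sup}


# Hypothetical toy database (an assumption, not the database used on the slides).
toy_sdb = ["CABAC", "ABCB", "CBAC", "ACBBA"]

print(projected_database(toy_sdb, "C"))   # {0: 'ABAC', 1: 'B', 2: 'BAC', 3: 'BBA'}
print(locally_frequent_events(projected_database(toy_sdb, "AA"), min_sup=2))  # {}
```

In this toy database the prefix AA has no locally frequent event at min_sup = 2, so it cannot be extended, which mirrors the situation described for AA on the slide.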
Closed sequence-based optimization
  BackScan search space pruning
  Semi-maximum period
    A subsequence between the first instance and the last instance of subsequence P (for example, prefix BB).
    There are first, second, ..., m-th semi-maximum periods.
    If an event A appears in each of the first semi-maximum periods of BB, then ABB occurs wherever BB occurs, and ABB is the longer one.
  Slide figure labels: ABCBA → ABCB ABCBACBBABCB
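The idea that "ABB and BB exist simultaneously" can be checked by brute force: an event e extends a prefix at the front if e followed by the prefix has the same support as the prefix itself. The sketch below is a naive stand-in for the BackScan test, not the semi-maximum-period scan itself, and it only covers extension in front of the prefix (the case corresponding to the first semi-maximum period). The function names and the toy database are hypothetical.

```python
from typing import List, Set


def is_subsequence(candidate: str, sequence: str) -> bool:
    """Return True if `candidate` occurs in `sequence` in order (gaps allowed)."""
    events = iter(sequence)
    return all(event in events for event in candidate)


def support(candidate: str, sdb: List[str]) -> int:
    """Number of sequences in `sdb` that contain `candidate`."""
    return sum(is_subsequence(candidate, sequence) for sequence in sdb)


def front_extension_events(prefix: str, sdb: List[str]) -> Set[str]:
    """Events e for which e + prefix is supported by every sequence supporting prefix."""
    base = support(prefix, sdb)
    if base == 0:
        return set()
    alphabet = {event for sequence in sdb for event in sequence}
    return {e for e in alphabet if support(e + prefix, sdb) == base}


# Hypothetical toy database (an assumption, not the database used on the slides).
toy_sdb = ["ABCB", "ACBBA", "CABAB"]

print(front_extension_events("BB", toy_sdb))  # {'A'}
```

Here every sequence containing BB also contains ABB, so BB cannot be closed and growing it is unnecessary; the actual BackScan method reaches the same conclusion by scanning semi-maximum periods instead of recomputing supports.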
Unpromising projected sequence pruning
  Current Frequent Covering Subsequence (CFCS)
    For an input sequence Si, the frequent subsequence covering Si that has the largest weight among those discovered so far.
  Trivial projected sequence
    Short projected sequences may not contain a sufficient number of events to generate any summarization subsequence.
  Example: prefix p = C:5
    SDB|p = {PS1 = ABAC, PS3 = B, PS4 = BAC, PS5 = BBA, PS6 = BC}
    CFCS1 = ABA:3, CFCS3 = ABCB:2, CFCS4 = BAC:2, CFCS5 = ABA:3, CFCS6 = ABCB:2
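The slide does not state the exact test for a trivial projected sequence. As a rough illustration only, the sketch below assumes a purely length-based bound: a projected sequence is treated as trivial when the prefix plus the whole projected sequence is no longer than the current CFCS of that input sequence. The paper's criterion is based on weights, so this rule and the function name are assumptions; only the input values come from the slide's example.

```python
from typing import Dict, List


def trivial_projected_ids(prefix: str,
                          projected: Dict[int, str],
                          cfcs: Dict[int, str]) -> List[int]:
    """Return ids whose projected sequence is too short under the ASSUMED length bound."""
    trivial = []
    for seq_id, suffix in sorted(projected.items()):
        # Longest pattern reachable by growing `prefix` inside this projected sequence.
        longest_reachable = len(prefix) + len(suffix)
        if longest_reachable <= len(cfcs[seq_id]):
            trivial.append(seq_id)
    return trivial


# Values from the slide's example for prefix p = C (supports omitted).
projected = {1: "ABAC", 3: "B", 4: "BAC", 5: "BBA", 6: "BC"}
cfcs = {1: "ABA", 3: "ABCB", 4: "BAC", 5: "ABA", 6: "ABCB"}

print(trivial_projected_ids("C", projected, cfcs))  # [3, 6]
```

Under this assumed bound, PS3 and PS6 cannot yield anything longer than the ABCB already covering them, so they could be skipped when extending the prefix C.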
Further discussions
  Event weight assignment
    It is similar to the TF-IDF concept.
  Multiple summarization subsequence mining
    An input sequence may support multiple summarization subsequences.
Summarization subsequence based clustering
  Micro-cluster generation
    Input sequences with the same summarization subsequence are grouped together.
  Macro-cluster creation
    An agglomerative hierarchical clustering paradigm is used to create K macro-clusters.
  Slide figure labels: ABA ABCBCBAC
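Micro-cluster generation amounts to grouping input sequences by their summarization subsequence. The sketch below shows that grouping step only; for illustration it treats the CFCS values listed in the pruning example (prefix C) as if they were the final summarization subsequences, which is an assumption, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def micro_clusters(summaries: Dict[int, str]) -> Dict[str, List[int]]:
    """Group sequence ids by their summarization subsequence."""
    clusters: Dict[str, List[int]] = defaultdict(list)
    for seq_id in sorted(summaries):
        clusters[summaries[seq_id]].append(seq_id)
    return dict(clusters)


# Sequence id -> summarization subsequence (ASSUMED to equal the CFCS values on the slide).
summaries = {1: "ABA", 3: "ABCB", 4: "BAC", 5: "ABA", 6: "ABCB"}

print(micro_clusters(summaries))  # {'ABA': [1, 5], 'ABCB': [3, 6], 'BAC': [4]}
```

The macro-cluster step would then merge these micro-clusters agglomeratively until K clusters remain; it is not sketched here because the transcript does not preserve the similarity equations it relies on.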
Empirical results
Conclusions
  CONTOUR
    A set of summarization subsequences is a concise representation of the original sequence database.
    It preserves much structural information and can be used to efficiently cluster the input sequences with high clustering quality.
Comments
  Advantage
    This method provides a more concise representation of the original sequences than feature selection methods.
    The summarization subsequences can be used efficiently in most conventional sequence mining methods.
  Drawback
    In equations 1 and 2, the internal similarity is computed under a single summarization subsequence, so these equations may not be suitable when an input sequence has multiple summarization subsequences.
  Application
    Sequential pattern mining and clustering.