
CONTOUR: an efficient algorithm for discovering discriminating subsequences
Jianyong Wang, Yuzhou Zhang, Lizhu Zhou, George Karypis, Charu C. Aggarwal
DMKD, Vol. 18, No. 1, 2009, pp. 1-29.
Presenter: Wei-Shen Tai, 2009/3/11
Intelligent Database Systems Lab, 國立雲林科技大學 (National Yunlin University of Science and Technology)

Outline
• Introduction
• Problem formulation
• Efficiently mining summarization subsequences
• Summarization subsequence based clustering
• Empirical results
• Conclusions
• Comments

Motivation
Make frequent sequence mining more efficient
• Mining the complete set of frequent subsequences is very time consuming for large sequence databases.
• The straightforward way to obtain a subset of useful frequent subsequences is to apply an existing frequent sequence mining algorithm and then select from its output, which still pays the full mining cost.

Objective
Effective search space pruning methods
• Find the summarization subsequence that represents each original input sequence.

Problem formulation
• Subsequence: if sequence Sα is contained in sequence Sβ, Sα is called a subsequence of Sβ.
• Absolute support of a sequence: the number of input sequences in SDB that contain Sα, denoted by sup_SDB(Sα).
• Summarization subsequences: a set of representative subsequences that serves as a concise summarization of the input sequences.
• Internal similarity of a micro-cluster Cλ.
CABAC → BAC
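The subsequence and absolute support definitions above can be sketched in a few lines of Python. This is a minimal illustration, assuming each event is a single character; the function names and the toy database are hypothetical and do not come from the paper.

```python
def is_subsequence(s_alpha, s_beta):
    """Return True if s_alpha is contained in s_beta (order kept, gaps allowed)."""
    it = iter(s_beta)
    # `event in it` consumes the iterator, so matches must occur in order.
    return all(event in it for event in s_alpha)


def absolute_support(sdb, s_alpha):
    """Count the input sequences in sdb that contain s_alpha."""
    return sum(1 for seq in sdb if is_subsequence(s_alpha, seq))


# Hypothetical toy database of single-character event sequences.
toy_sdb = ["CABAC", "ABA", "ABCB"]
print(absolute_support(toy_sdb, "AB"))
print(absolute_support(toy_sdb, "BAC"))
```

Support counts sequences, not occurrences: a pattern that appears twice inside one input sequence still contributes 1.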

Efficiently mining summarization subsequences
Frequent subsequence enumeration
• For each prefix, the mining algorithm builds its projected database and computes the set of locally frequent events.
• Example with min_sup = 2: SDB|AA = {C, A}, but these cannot be used to extend the prefix AA (to AAC or AAA).
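The enumeration step above (build the projected database of a prefix, then count locally frequent events) might look roughly like the following sketch. It assumes single-character events and projects each sequence after the first instance of the prefix; the helper names and the toy database are hypothetical, chosen so that the projected database of AA comes out as {C, A}.

```python
from collections import Counter


def project(seq, prefix):
    """Return the part of seq after the first instance of prefix, or None."""
    pos = 0
    for event in prefix:
        pos = seq.find(event, pos)
        if pos < 0:
            return None  # seq does not contain the prefix
        pos += 1
    return seq[pos:]


def projected_database(sdb, prefix):
    """Collect the projected sequences of all input sequences containing prefix."""
    return [ps for ps in (project(s, prefix) for s in sdb) if ps is not None]


def locally_frequent_events(sdb, prefix, min_sup):
    """Events that occur in at least min_sup projected sequences."""
    counts = Counter()
    for ps in projected_database(sdb, prefix):
        counts.update(set(ps))  # count each event once per projected sequence
    return {e for e, c in counts.items() if c >= min_sup}


# Hypothetical toy database; events are single characters.
toy_sdb = ["AAC", "AAA", "ABC"]
print(projected_database(toy_sdb, "AA"))          # projected sequences of prefix AA
print(locally_frequent_events(toy_sdb, "AA", 2))  # events that may extend AA
```

In this toy database the events C and A each appear in only one projected sequence of AA, so with min_sup = 2 neither AAC nor AAA is generated.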

Closed sequence-based optimization
BackScan search space pruning
• Semi-maximum period: the piece of an input sequence between the first instance and the last instance of a subsequence P (for example, prefix BB). A prefix of m events has a first, a second, ..., and an m-th semi-maximum period.
• If an event A appears in each of the first semi-maximum periods of BB, then every sequence that contains BB also contains ABB, and ABB is the longer one.
ABCBA → ABCB
ABCBACBBABCB

Unpromising projected sequence pruning
Current Frequent Covering Subsequence (CFCS)
• For an input sequence Si, the frequent subsequence that covers Si and has the largest weight among those discovered so far.
Trivial projected sequence
• Short projected sequences may not contain a sufficient number of events to generate any summarization subsequence.
• For example, prefix p = C:5, SDB|p = {PS1 = ABAC, PS3 = B, PS4 = BAC, PS5 = BBA, PS6 = BC}, CFCS1 = ABA:3, CFCS3 = ABCB:2, CFCS4 = BAC:2, CFCS5 = ABA:3, and CFCS6 = ABCB:2.
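The slide does not state the exact test for a trivial projected sequence. The sketch below is only a length-based stand-in under a loud assumption: it ignores supports and event weights and treats a projected sequence as trivial when even the longest possible extension of the prefix could not be longer than the current CFCS. The function name and this criterion are hypothetical; the data are the values listed on the slide.

```python
def is_trivial(prefix, projected_seq, cfcs_pattern):
    """Length-only stand-in (an assumption, not the paper's weight test).

    The longest pattern reachable from this projected sequence has
    len(prefix) + len(projected_seq) events; if that cannot exceed the
    length of the current CFCS, the projected sequence is called trivial.
    """
    return len(prefix) + len(projected_seq) <= len(cfcs_pattern)


# Values from the slide for prefix p = C (supports after ':' are dropped here).
prefix = "C"
projected = {1: "ABAC", 3: "B", 4: "BAC", 5: "BBA", 6: "BC"}
cfcs = {1: "ABA", 3: "ABCB", 4: "BAC", 5: "ABA", 6: "ABCB"}

trivial_ids = [i for i in projected if is_trivial(prefix, projected[i], cfcs[i])]
print(trivial_ids)
```

Under this simplified bound, PS3 = B and PS6 = BC are flagged because CB and CBC cannot be longer than ABCB; a weight-based test as used in the paper may decide differently.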

Further discussions
• Event weight assignment: similar to the TF-IDF concept.
• Multiple summarization subsequence mining: an input sequence may support multiple summarization subsequences.
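The slide only says that event weights follow a TF-IDF-like idea and gives no formula. As a rough illustration, the sketch below uses the plain inverse-document-frequency form log(N / df), which is an assumption rather than the paper's definition; the function name and toy database are hypothetical.

```python
import math
from collections import Counter


def idf_event_weights(sdb):
    """IDF-style weight per event: log(N / number of sequences containing it)."""
    n = len(sdb)
    doc_freq = Counter()
    for seq in sdb:
        doc_freq.update(set(seq))  # count each event once per input sequence
    return {event: math.log(n / df) for event, df in doc_freq.items()}


# Hypothetical toy database of single-character event sequences.
toy_sdb = ["CABAC", "ABA", "ABCB"]
weights = idf_event_weights(toy_sdb)
print(weights)
```

Events that occur in every input sequence get weight 0 under this form, which reflects the intuition that ubiquitous events discriminate poorly between sequences.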

Summarization subsequence based clustering
• Micro-cluster generation: input sequences with the same summarization subsequence are grouped together.
• Macro-cluster creation: the agglomerative hierarchical clustering paradigm is used to create K macro-clusters.
ABA ABCBCBAC
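Micro-cluster generation as described above amounts to grouping sequence identifiers by their summarization subsequence. A minimal sketch follows, assuming one summarization subsequence per input sequence; the function name and the assignment in `toy_summaries` are hypothetical, and the macro-cluster merging step is not covered.

```python
from collections import defaultdict


def micro_clusters(summaries):
    """Group input-sequence ids by their summarization subsequence."""
    clusters = defaultdict(list)
    for seq_id, summary in summaries.items():
        clusters[summary].append(seq_id)
    return dict(clusters)


# Hypothetical assignment: sequence id -> summarization subsequence.
toy_summaries = {1: "ABA", 2: "ABA", 3: "ABCB", 4: "BAC", 5: "ABA", 6: "ABCB"}
print(micro_clusters(toy_summaries))
```

Each key of the result is a micro-cluster label; an agglomerative step would then repeatedly merge the most similar micro-clusters until K macro-clusters remain.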

Empirical results

Conclusions
CONTOUR
• A set of summarization subsequences is a concise representation of the original sequence database.
• It preserves much structural information and can be used to efficiently cluster the input sequences with a high clustering quality.

Comments
Advantage
• This method provides a more concise representation of the original sequences than feature selection methods.
• The summarization subsequences can be efficiently adopted in most conventional sequence mining methods.
Drawback
• In Equations 1 and 2, the internal similarity is computed under a single summarization subsequence, so these equations may not be suitable when an input sequence has multiple summarization subsequences.
Application
• Sequence pattern mining and clustering.
