CloSpan: Mining Closed Sequential Patterns in Large Datasets Xifeng Yan, Jiawei Han and Ramin Afshar Proceedings of 2003 SIAM International Conference.

Presentation transcript:

CloSpan: Mining Closed Sequential Patterns in Large Datasets
Xifeng Yan, Jiawei Han and Ramin Afshar
Proceedings of 2003 SIAM International Conference on Data Mining (SDM'03), pp , San Francisco, CA, May
Advisor: Professor Hsin-Hsi Chen
Reporter: Clarence Min-Chi Hsieh
Natural Language Processing Laboratory, Dept. of Computer Science and Info. Engineering, NTU
2006/01/10

Copyright © Natural Language Processing Lab., NTU, 2006 Reporter: Clarence Min-Chi Hsieh CloSpan: Mining Closed Sequential Patterns in Large Datasets Slide - 2 Outline
Introduction
Search Space Pruning
CloSpan
Experimental Results
Conclusions

Slide - 3 Introduction
Apriori-like algorithms generate a huge set of candidate sequences.
  – Ex. With 1000 frequent sequences of length 1, there are 1000×1000 + (1000×999)/2 = 1,499,500 candidate sequences of length 2.
Many database scans are needed during mining.
  – Ex. For the sequential pattern {(abc)(abc)(abc)(abc)(abc)}, an Apriori-based method must scan the database at least 15 times.
Long sequential patterns are difficult to mine.
  – Ex. With a single sequence of length 100 and min_sup = 1, there are 100 length-1 candidate sequences, 14950 length-2 candidates, …, for a total of 2^100 − 1 ≈ 10^30.
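The counts on this slide can be checked with a few lines of arithmetic. The sketch below assumes the usual reading of the formula: n×n ordered pairs for sequence extensions plus n(n−1)/2 unordered pairs for itemset extensions. The function name is illustrative only.

```python
from math import comb

def length2_candidates(n):
    # n*n candidates of the form <x y> plus n*(n-1)/2 of the form <(x y)>
    return n * n + n * (n - 1) // 2

# Total number of non-empty subsequences of one sequence of 100 distinct items
total = sum(comb(100, k) for k in range(1, 101))

print(length2_candidates(1000))  # 1499500
print(length2_candidates(100))   # 14950
print(total == 2**100 - 1)       # True
```

The total is about 1.27 × 10^30, which is where the "≈ 10^30" figure comes from.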

Slide - 4 Introduction (Cont.)
Definition
  – Sequence, Elements, Subsequence and Sequential Pattern
A sequence database:
  SID | sequence
  10  | …
  20  | <(ad)c(bc)(ae)>
  30  | …
  40  | <eg(af)cbc>
A sequence: …
Elements: items within an element are listed alphabetically
… is a subsequence of …
Given support threshold min_sup_count = 2, … is a sequential pattern

Slide - 5 Introduction (Cont.)
Definition
  – Frequent Sequential Pattern (FS)
    Includes all the sequences whose support is no less than min_sup
  – Closed Frequent Sequential Pattern (CS)
    Includes no sequence which has a super-sequence with the same support
    CS ⊆ FS

Slide - 6 Introduction (Cont.)
Example – FS & CS
  ID | Sequence
  0  | (af)dea
  1  | eab
  2  | e(abf)(bde)
min_sup_count = 2
FS: a:3, b:2, d:2, e:3, f:2, ab:2, ad:2, ae:2, (af):2, ea:3, eb:2, fd:2, fe:2, (af)d:2, (af)e:2, eab:2
CS: ea:3, (af)d:2, (af)e:2, eab:2
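The FS and CS sets of this example can be reproduced with a small brute-force miner. This is only an illustrative sketch, not CloSpan itself: it grows patterns by itemset and sequence extension, counts support by rescanning the whole database, and filters closed patterns afterwards. All function names are assumptions made for the example.

```python
def contains(seq, pat):
    """True if pattern `pat` is a subsequence of `seq` (greedy left-to-right match)."""
    i = 0
    for element in seq:
        if i < len(pat) and pat[i] <= element:
            i += 1
    return i == len(pat)

def mine_frequent(db, min_sup):
    """Enumerate all frequent sequences by depth-first pattern growth."""
    items = sorted({x for seq in db for element in seq for x in element})
    found = {}

    def grow(pat):
        for x in items:
            candidates = [pat + (frozenset([x]),)]               # new element
            if pat and x > max(pat[-1]):
                candidates.append(pat[:-1] + (pat[-1] | {x},))   # grow last element
            for cand in candidates:
                sup = sum(contains(seq, cand) for seq in db)
                if sup >= min_sup:
                    found[cand] = sup
                    grow(cand)

    grow(())
    return found

def closed_only(found):
    """Keep patterns that have no proper super-sequence with the same support."""
    return {p: s for p, s in found.items()
            if not any(q != p and t == s and contains(q, p) for q, t in found.items())}

def show(pat):
    return "".join(e if len(e) == 1 else "(" + e + ")"
                   for e in ("".join(sorted(el)) for el in pat))

raw = [["af", "d", "e", "a"], ["e", "a", "b"], ["e", "abf", "bde"]]
db = [tuple(frozenset(e) for e in seq) for seq in raw]
fs = mine_frequent(db, 2)
cs = closed_only(fs)
print(len(fs))                                          # 16
print(sorted(f"{show(p)}:{s}" for p, s in cs.items()))  # ['(af)d:2', '(af)e:2', 'ea:3', 'eab:2']
```

Filtering after the fact is quadratic in the number of frequent patterns, which is the cost the pruning techniques in the later slides aim to avoid.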

Slide - 7 Introduction (Cont.)
Definition
  – Prefix and Postfix (Projection)
Given sequence <a(abc)(ac)d(cf)>, its prefixes and the corresponding postfixes:
  Prefix | Postfix / Projection
  <a>    | <(abc)(ac)d(cf)>
  <aa>   | <(_bc)(ac)d(cf)>
  <ab>   | <(_c)(ac)d(cf)>
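The prefix/postfix table can be reproduced with a short helper. This is a minimal sketch under two assumptions: elements are written as strings of sorted single-character items, and the slide's prefixes <aa> and <ab> are read as two one-item elements. The function name `postfix` is hypothetical.

```python
def postfix(seq, prefix):
    """Return what remains of `seq` after the leftmost match of `prefix`.

    Items left over in the element where the match ends are kept with a
    leading "_" marker, as in the slide's (_bc) notation.
    """
    pos = -1
    for element in prefix:
        pos = next((j for j in range(pos + 1, len(seq))
                    if set(element) <= set(seq[j])), None)
        if pos is None:
            return None                      # prefix does not occur in seq
    last = max(prefix[-1])
    rest = "".join(x for x in seq[pos] if x > last)
    return (["_" + rest] if rest else []) + seq[pos + 1:]

seq = ["a", "abc", "ac", "d", "cf"]          # <a(abc)(ac)d(cf)>
print(postfix(seq, ["a"]))       # ['abc', 'ac', 'd', 'cf']
print(postfix(seq, ["a", "a"]))  # ['_bc', 'ac', 'd', 'cf']
print(postfix(seq, ["a", "b"]))  # ['_c', 'ac', 'd', 'cf']
```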

Slide - 8 Introduction (Cont.)
Definition
  – sequence s = <t_1, …, t_m>
  – an item α
  – I-Step extension: s ◇_i α = <t_1, …, t_m ∪ {α}>
    Ex: … is an I-Step extension of …
  – S-Step extension: s ◇_s α = <t_1, …, t_m, {α}>
    Ex: … is an S-Step extension of …
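The two extension steps are easy to state in code. The sketch below assumes a sequence is a tuple of elements and each element is a tuple of items in alphabetical order; `i_step` and `s_step` are illustrative names, not part of the original slides.

```python
def i_step(s, item):
    """Itemset extension: add `item` to the last element of s."""
    *head, last = s
    if item <= max(last):
        raise ValueError("item must come after every item of the last element")
    return tuple(head) + (last + (item,),)

def s_step(s, item):
    """Sequence extension: append `item` as a new one-item element."""
    return s + ((item,),)

a = (("a",),)                  # <(a)>
print(i_step(a, "b"))          # (('a', 'b'),)       i.e. <(ab)>
print(s_step(a, "b"))          # (('a',), ('b',))    i.e. <(a)(b)>
```

Requiring the new item to follow the existing items of the last element means each pattern is reached by exactly one chain of extensions.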

Slide - 9 Introduction (Cont.)
Definition
  – Prefix Search Tree (edge label x_i = I-Step extension by x, x_s = S-Step extension by x)
<>
  a_s: <(a)>
    b_i: <(ab)>
      a_s: <(ab)(a)>
      b_s: <(ab)(b)>
    a_s: <(a)(a)>
    b_s: <(a)(b)>
      c_i: <(a)(bc)>
      d_i: <(a)(bd)>
  b_s: <(b)>

Slide - 10 Search Space Pruning
Definition
  – Common Prefix
    Example
    – D_s = {de(af), de(fg)}
    – s ◇ … not closed → unnecessary to extend s ◇ …
    – s ◇ … not closed → unnecessary to extend s ◇ …
  – Partial Order
    Example
    – Before projecting D into D_a, D_b, D_d, D_e, D_f
    – a is always before f in all the sequences
    – Need not search any sequence beginning with f

Slide - 11 Search Space Pruning (Cont.)
Definition
  – I(D): total number of items in D
  – Equivalence of Projected Database
    Two sequences s and s', s ⊑ s'
    D_s = D_s' ⇔ I(D_s) = I(D_s')
Example
  – D_(af) = D_f = {de, (de)}
  – I(D_(af)) = I(D_f) = 4
  ID | Sequence
  0  | (af)dea
  1  | eab
  2  | e(abf)(bde)
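The size comparison in this example can be sketched as follows. The slide's figures (for instance I(D_f) = 4) appear to count only locally frequent items, so this sketch prunes items whose local support is below min_sup before counting; that reading is an assumption. It is also simplified: itemset extensions are detected only in the element where the prefix match ends. The names `match_end` and `project` are hypothetical.

```python
from collections import Counter

def match_end(seq, prefix):
    """Index of the element where the leftmost match of `prefix` ends, or None."""
    pos = -1
    for element in prefix:
        pos = next((j for j in range(pos + 1, len(seq))
                    if set(element) <= set(seq[j])), None)
        if pos is None:
            return None
    return pos

def project(db, prefix, min_sup):
    """Return (pruned projected database, total number of items in it)."""
    postfixes = []
    for seq in db:
        pos = match_end(seq, prefix)
        if pos is None:
            continue
        last = max(prefix[-1])
        partial = ["_" + x for x in seq[pos] if x > last]   # e.g. (_f)
        rest = [list(element) for element in seq[pos + 1:]]
        postfixes.append(([partial] if partial else []) + rest)
    counts = Counter()
    for post in postfixes:
        counts.update({x for element in post for x in element})
    pruned = []
    for post in postfixes:
        kept = [[x for x in element if counts[x] >= min_sup] for element in post]
        pruned.append([element for element in kept if element])
    size = sum(len(element) for post in pruned for element in post)
    return pruned, size

DB = [["af", "d", "e", "a"], ["e", "a", "b"], ["e", "abf", "bde"]]
d_af, size_af = project(DB, ["af"], 2)
d_f, size_f = project(DB, ["f"], 2)
print(size_af, size_f, d_af == d_f)   # 4 4 True
```

Because (af) contains f, comparing the two integer sizes is enough to conclude that the two projected databases are the same, which is what makes the check cheap.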

Slide - 12 Search Space Pruning (Cont.)
Definition
  – Early Termination by Equivalence
    Two sequences s and s', s ⊑ s'
    And also I(D_s) = I(D_s')
    Then ∀γ, support(s ◇ γ) = support(s' ◇ γ)
Example
  – I(D_(af)) = I(D_f)
  – (af)d & (af)e are frequent
  – support((af)d) = support(fd)
  – support((af)e) = support(fe)
  – the supports of fd and fe need not be counted separately

Slide - 13 Search Space Pruning (Cont.)
Definition
  – Backward Sub-Pattern
    sequence s < s' and s ⊐ s'
    I(D_s) = I(D_s')
    Stop searching any descendant of s' in the prefix search tree
(Figure: prefix search tree with nodes a, f, f, marking s and s'.)

Slide - 14 Search Space Pruning (Cont.)
Definition
  – Backward Super-Pattern
    sequence s < s' and s ⊏ s'
    I(D_s) = I(D_s')
    Transplant the descendants of s to s' instead of searching any descendant of s' in the prefix search tree
(Figure: prefix search tree with nodes b, e, b, marking s and s'.)

Slide - 15 Search Space Pruning (Cont.)
Definition
  – Partial Prefix Sequence Lattice
    Search space
(Figure: prefix sequence lattice rooted at <> with edges f_i, f_s, a_s, e_s, b_s, d_s.)
I(D_eb) = I(D_b)
I(D_eab) = I(D_ab)
I(D_(af)) = I(D_f)

Slide - 16 CloSpan
CloSpan(s, D_s, min_sup, L)
  – Input: A sequence s, a projected DB D_s, and min_sup
  – Output: The prefix search lattice L
  – Check whether a discovered sequence s' exists s.t. either s ⊑ s' or s' ⊑ s, and I(D_s) = I(D_s');
  – if such super-pattern or sub-pattern exists then
      modify the link in L, return;
  – else insert s into L;
  – scan D_s once, find every frequent item α such that
      s can be extended to (s ◇_i α), or
      s can be extended to (s ◇_s α);
  – if no valid α available then
      return;
  – for each valid α do (I-Step)
      call CloSpan(s ◇_i α, D_{s ◇_i α}, min_sup, L);
  – for each valid α do (S-Step)
      call CloSpan(s ◇_s α, D_{s ◇_s α}, min_sup, L);
  – return;

Slide - 17 CloSpan (Cont.)
Hash for Fast Condition Checking
(Figure: prefix search lattice rooted at <> with edges f_i, a_s, e_s, b_s, a_s, d_s, e_s, next to a hash table whose entries are initially nil.)
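One way to picture the hash-based check is an index keyed on the projected database size, so that only sequences with a matching size are tested for containment. The sketch below is an assumption about how such an index could look, not the authors' data structure. The class name, the bucket count of 4 (borrowed from the "Mod 4" in the example slides) and the string encoding of elements are all illustrative.

```python
from collections import defaultdict

def is_subseq(p, q):
    """True if sequence p is contained in sequence q (elements are item strings)."""
    i = 0
    for element in q:
        if i < len(p) and set(p[i]) <= set(element):
            i += 1
    return i == len(p)

class SizeIndex:
    """Discovered sequences bucketed by projected-database size mod `buckets`."""

    def __init__(self, buckets=4):
        self.buckets = buckets
        self.table = defaultdict(list)

    def find_equivalent(self, s, size):
        """Return a discovered sub- or super-pattern of s with the same size, if any."""
        for other, other_size in self.table[size % self.buckets]:
            if other_size == size and (is_subseq(s, other) or is_subseq(other, s)):
                return other
        return None

    def insert(self, s, size):
        self.table[size % self.buckets].append((s, size))

index = SizeIndex()
index.insert(["af"], 4)                  # <(af)> is discovered first, size 4
print(index.find_equivalent(["f"], 4))   # ['af']  -> no need to search below <f>
print(index.find_equivalent(["e"], 6))   # None    -> <e> must be explored
```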

Slide - 18 CloSpan (Cont.)
Example
  ID | Sequence
  0  | (af)dea
  1  | eab
  2  | e(abf)(bde)
min_sup_count = 2
Hash Function: I(D_s) mod 4
Frequent items: a:3, b:2, d:2, e:3, f:2

Slide - 19 CloSpan (Cont.)
Example (Cont.)
  Projected DB | Postfixes              | After removing infrequent items | I(D_s) | Locally frequent items
  D_a          | (_f)dea, b, (_bf)(bde) | (_f)de, b, (_f)(bde)            | 8      | (_f):2, b:2, d:2, e:2
  D_b          | (_f)(bde)              | empty                           | 0      | none
  D_d          | ea, (_e)               | empty                           | 0      | none
  D_e          | a, ab, (abf)(bde)      | a, ab, (ab)b                    | 6      | a:3, b:2
  D_f          | dea, (bde)             | de, (de)                        | 4      | d:2, e:2
(Figure: lattice containing only the root <>; hash table entries nil.)

Slide - 20 CloSpan (Cont.)
Example (Cont.)
D_a = {(_f)de, b, (_f)(bde)}; I(D_a) = 8; 8 mod 4 = 0; locally frequent items: (_f):2, b:2, d:2, e:2
(Figure: lattice with nodes <>, a_s:3; hash table entries nil.)

Slide - 21 CloSpan (Cont.)
Example (Cont.)
  Projected DB | Postfixes | After removing infrequent items | I(D_s) | Locally frequent items
  D_(af)       | de, (bde) | de, (de)                        | 4      | d:2, e:2
  D_ab         | de        | empty                           | 0      | none
  D_ad         | e, e      | e, e                            | 2      | e:2
  D_ae         | empty     | empty                           | 0      | none

Slide - 22 CloSpan (Cont.)
Example (Cont.)
D_(af) = {de, (de)}; I(D_(af)) = 4; 4 mod 4 = 0; locally frequent items: d:2, e:2
(Figure: lattice with nodes <>, a_s:3, f_i:2; hash table holds size 4.)

Slide - 23 CloSpan (Cont.)
Example (Cont.)
  Projected DB | Postfixes | After removing infrequent items | I(D_s) | Locally frequent items
  D_(af)d      | e, (_e)   | empty                           | 0      | none
  D_(af)e      | empty     | empty                           | 0      | none

Slide - 24 CloSpan (Cont.)
Example (Cont.)
D_(af)d is empty; I(D_(af)d) = 0; 0 mod 4 = 0
(Figure: lattice with nodes <>, a_s:3, f_i:2, d_s:2; hash table holds size 4.)

Slide - 25 CloSpan (Cont.)
Example (Cont.)
D_(af)e is empty; I(D_(af)e) = 0; 0 mod 4 = 0
(Figure: lattice with nodes <>, a_s:3, f_i:2, d_s:2, e_s:2; hash table holds sizes 4 and 0.)

Slide - 26 CloSpan (Cont.)
Example (Cont.)
D_ab is empty; I(D_ab) = 0; 0 mod 4 = 0
(Figure: lattice with nodes <>, a_s:3, f_i:2, d_s:2, e_s:2, b_s:2; hash table holds sizes 4, 0 and 0.)

Slide - 27 CloSpan (Cont.)
Example (Cont.)
D_b is empty; I(D_b) = 0; 0 mod 4 = 0
(Figure: lattice with nodes <>, a_s:3, f_i:2, d_s:2, e_s:2, b_s:2; hash table holds sizes 4, 0 and 0.)

Slide - 28 CloSpan (Cont.)
Example (Cont.)
D_d is empty; I(D_d) = 0; 0 mod 4 = 0
(Figure: lattice with nodes <>, a_s:3, f_i:2, d_s:2, e_s:2, b_s:2; hash table holds sizes 4, 0 and 0.)

Slide - 29 CloSpan (Cont.)
Example (Cont.)
D_e = {a, ab, (ab)b}; I(D_e) = 6; 6 mod 4 = 2; locally frequent items: a:3, b:2
(Figure: lattice with nodes <>, a_s:3, f_i:2, d_s:2, e_s:2, b_s:2, e_s:3; hash table holds sizes 4, 0, 0 and 6.)

Slide - 30 CloSpan (Cont.)
Example (Cont.)
D_ea: postfixes b, (_b)b; after removing infrequent items b, b; I(D_ea) = 2; locally frequent items: b:2
D_eb: empty; I(D_eb) = 0

Slide - 31 CloSpan (Cont.)
Example (Cont.)
D_ea = {b, b}; I(D_ea) = 2; 2 mod 4 = 2; locally frequent items: b:2
(Figure: lattice with nodes <>, a_s:3, f_i:2, d_s:2, e_s:2, b_s:2, e_s:3, a_s:3; hash table holds sizes 4, 0, 0, 2 and 6.)

Slide - 32 CloSpan (Cont.)
Example (Cont.)
D_eab is empty; I(D_eab) = 0; 0 mod 4 = 0

Slide - 33 CloSpan (Cont.)
Example (Cont.)
D_eab is empty; I(D_eab) = 0; 0 mod 4 = 0
(Figure: lattice with nodes <>, a_s:3, f_i:2, d_s:2, e_s:2, b_s:2, e_s:3, a_s:3; hash table holds sizes 4, 0, 0, 2 and 6.)

Slide - 34 CloSpan (Cont.)
Example (Cont.)
D_eab is empty; I(D_eab) = 0; 0 mod 4 = 0
(Figure: lattice with nodes <>, a_s:3, f_i:2, d_s:2, e_s:2, b_s:2, e_s:3, a_s:3, b_s:2; hash table holds sizes 4, 0, 0, 2 and 6.)

Slide - 35 CloSpan (Cont.)
Example (Cont.)
D_eb is empty; I(D_eb) = 0; 0 mod 4 = 0
(Figure: lattice with nodes <>, a_s:3, f_i:2, d_s:2, e_s:2, b_s:2, e_s:3, a_s:3, b_s:2; hash table holds sizes 4, 0, 0, 2 and 6.)

Slide - 36 CloSpan (Cont.)
Example (Cont.)
D_eb is empty; I(D_eb) = 0; 0 mod 4 = 0
(Figure: lattice with nodes <>, a_s:3, f_i:2, d_s:2, e_s:2, b_s:2, e_s:3, a_s:3, b_s:2; hash table holds sizes 4, 0, 0, 2 and 6.)

Slide - 37 CloSpan (Cont.)
Example (Cont.)
D_f = {de, (de)}; I(D_f) = 4; 4 mod 4 = 0; locally frequent items: d:2, e:2
(Figure: lattice with nodes <>, a_s:3, f_i:2, d_s:2, e_s:2, b_s:2, e_s:3, a_s:3, b_s:2; hash table holds sizes 4, 0, 0, 2 and 6.)

Slide - 38 CloSpan (Cont.)
Example (Cont.)
D_f = {de, (de)}; I(D_f) = 4; 4 mod 4 = 0; locally frequent items: d:2, e:2
(Figure: lattice with nodes <>, a_s:3, f_i:2, d_s:2, e_s:2, b_s:2, e_s:3, a_s:3, b_s:2; hash table holds sizes 4, 0, 0, 2 and 6.)

Slide - 39 CloSpan (Cont.)
Example (Cont.)
(Figure: final lattice with nodes <>, a_s:3, f_i:2, d_s:2, e_s:2, b_s:2, e_s:3, a_s:3, b_s:2.)
Closed sequential patterns: (af)d:2, (af)e:2, eab:2, ea:3

Slide - 40 Experimental Results
Synthetic Data
  – Parameters
    D: Number of sequences (in thousands)
    C: Average itemsets per sequence
    T: Average items per itemset
    N: Number of different items (in thousands)
    S: Average itemsets in maximal sequences
    I: Average items in maximal sequences
  – Two Data Sets
    D10 C10 T2.5 N10 S6 I2.5
    D5 C20 T20 N10 S20 I20
Real world datasets
  – KDDCup2000 – Gazelle Click Stream

Slide - 41 Experimental Results (Cont.)
Synthetic Data
  D10 C10 T2.5 N10 S6 I2.5

Slide - 42 Experimental Results (Cont.)
Synthetic Data
  D5 C20 T20 N10 S20 I20

Slide - 43 Experimental Results (Cont.)
Real world datasets
  – KDDCup2000
    29,369 sequences
    35,722 sessions
    87,546 page views
    The average number of sessions in a sequence is around 1
    The average number of page views in a session is 2
    The largest session contains 342 views
    The longest sequence has 140 sessions
    The largest sequence contains 651 page views

Slide - 44 Experimental Results (Cont.)

Slide - 45 Conclusions
CloSpan mines frequent closed sequences efficiently.
CloSpan outperforms PrefixSpan.

Slide - 46 Lexicographic Order
Definition
  – Lexicographic Order
    t = {i_1, i_2, …, i_k}, i_1 ≤ i_2 ≤ … ≤ i_k
    t' = {j_1, j_2, …, j_l}, j_1 ≤ j_2 ≤ … ≤ j_l
    t < t' iff either of the following is true:
    – For some h, 0 ≤ h ≤ min{k, l}, we have i_r = j_r for r < h, and i_h < j_h, or
    – k < l, and i_1 = j_1, i_2 = j_2, …, i_k = j_k
Example
  – (a,f) < (b,f)
  – (a,b) < (a,b,c)
  – (a,b,c) < (b,c)
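The itemset order above can be written directly as a comparison function. This is a minimal sketch; the name `itemset_less` is illustrative, and items are assumed to be comparable strings.

```python
def itemset_less(t, u):
    """True if itemset t precedes itemset u in lexicographic order."""
    t, u = sorted(t), sorted(u)
    for x, y in zip(t, u):
        if x != y:
            return x < y          # first differing position decides
    return len(t) < len(u)        # otherwise the proper prefix comes first

print(itemset_less("af", "bf"))    # True
print(itemset_less("ab", "abc"))   # True
print(itemset_less("abc", "bc"))   # True
```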

Slide - 47 Sequence Lexicographic Order
Definition
  – Sequence Lexicographic Order
    If s' = s ◇ p, then s < s'
    If s = α ◇_i p and s' = α ◇_s p', then s < s' no matter what the order relation between p and p' is
    If s = α ◇_i p and s' = α ◇_i p', then p < p' indicates s < s'
    If s = α ◇_s p and s' = α ◇_s p', then p < p' indicates s < s'
Example
  – (ab) < (ab)(a)
  – (ac) < (a)(d), (ad) < (a)(c)
  – (ab) < (ac)
  – (a)(b) < (a)(c)
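One way to realise this order, offered as a sketch rather than the authors' formulation, is to flatten a sequence into its chain of extension steps and compare the chains as lists. Each step is encoded as (0, item) for an itemset extension and (1, item) for a sequence extension, so itemset extensions sort first. The names `steps` and `seq_less` are hypothetical.

```python
def steps(seq):
    """Flatten a sequence into (step_type, item) pairs: 1 = S-step, 0 = I-step."""
    out = []
    for element in seq:
        for k, item in enumerate(sorted(element)):
            out.append((1 if k == 0 else 0, item))
    return out

def seq_less(s, t):
    """True if s precedes t; a proper prefix always precedes its extensions."""
    return steps(s) < steps(t)

print(seq_less(["ab"], ["ab", "a"]))       # True: (ab) < (ab)(a)
print(seq_less(["ac"], ["a", "d"]))        # True: (ac) < (a)(d)
print(seq_less(["ad"], ["a", "c"]))        # True: (ad) < (a)(c)
print(seq_less(["ab"], ["ac"]))            # True: (ab) < (ac)
print(seq_less(["a", "b"], ["a", "c"]))    # True: (a)(b) < (a)(c)
```

Sorting sequences with this key visits them in the same order as a depth-first walk of the tree that tries itemset extensions before sequence extensions.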

Slide - 48 Lexicographic Sequence Tree
Definition
  – Lexicographic Sequence Tree
<>
  <(a)>
    <(ab)>
      <(ab)(a)>
      <(ab)(b)>
    <(a)(a)>
    <(a)(b)>
      <(a)(bc)>
      <(a)(bd)>
  <(b)>

Slide - 49 Search Space Pruning
Definition
  – Common Prefix
    a subsequence s, projected database D_s
    if ∃α such that α is a common prefix for all the sequences with the same extension type (either itemset-extension or sequence-extension) in D_s
    ∀β, if s ◇ β is closed, α must be a prefix of β
    ⇒ we need not search s ◇ β and its descendants except the branch of s ◇ α
Example
  – D_s = {de(af), de(fg)}
  – s ◇ … not closed → unnecessary to extend s ◇ …
  – s ◇ … not closed → unnecessary to extend s ◇ …

Slide - 50 Search Space Pruning (Cont.)
CommonPrefix
  – An intermediate algorithm
  – Adopts the PrefixSpan framework plus the common prefix pruning technique
  – Outperforms PrefixSpan

Slide - 51 Search Space Pruning (Cont.)
Definition
  – Partial Order
    A sequence s, projected database D_s
    if among all the sequences in D_s, an item α always occurs before an item β (either in the same itemset for all sequences in D_s or in a different itemset, but not both), then D_{s ◇ α ◇ β} = D_{s ◇ β}
    ⇒ s ◇ β is not closed. Need not search any sequence in the branch of s ◇ β
Example
  – Before projecting D into D_a, D_b, D_d, D_e, D_f
  – a is always before f in all the sequences
  – Need not search any sequence beginning with f