Your One Small Step, My One Giant Leap (你的一小步，我的一大步)
Jen-Wei Huang (黃仁暐), jwhuang@gmail.com, National Taiwan University

2015/10/7Jen-Wei Huang2

Image sources for the opening slides:
- http://www.wretch.cc/blog/EtudeBIKE
- http://www.giant-bicycles.com/zh-TW/
- http://cape7.pixnet.net/blog
- http://www.wretch.cc/blog/orzboyz
- http://blog.sina.com.tw/9winds/
- http://atomcinema.pixnet.net/blog
- http://www.amazon.com
- http://www.hq.nasa.gov/office/pao/History/ap11ann/kippsphotos/apollo.html

A General Model for Sequential Pattern Mining with a Progressive Database
Jen-Wei Huang, Chi-Yao Tseng, Jian-Chih Ou and Ming-Syan Chen, National Taiwan University
IEEE Trans. on Knowledge and Data Engineering, Vol. 20, No. 6, June 2008

Outline
- Introduction
- Preliminaries
- Algorithm Pisa
- Experiments
- Conclusions
- Q & A

Introduction to SPM
- “Mining of frequently occurring patterns related to time or other sequences.” (J. Han, Data Mining: Concepts and Techniques)
- “Given a set of sequences, find the complete set of frequent subsequences.” (J. Pei, PrefixSpan)
- Example: which items a customer will buy, given that he or she has already bought certain items.

Time-related data
- Customers’ buying behavior
- Natural phenomena
- Sensor network data
- Web access patterns
- Stock price changes
- DNA sequence applications

Definition
- Let $I = \{x_1, x_2, \ldots, x_n\}$ be a set of distinct items.
- An element $e$, denoted by $(x_i x_j \ldots)$, is a subset of $I$ whose items appear in a sequence at the same time.
- A sequence $s$, denoted by $\langle e_1, e_2, \ldots, e_m \rangle$, is an ordered list of elements.
- A sequence database $Db$ contains a set of sequences, and $|Db|$ is the number of sequences in $Db$.

Definition
A sequence $\alpha = \langle a_1, a_2, \ldots, a_n \rangle$ is a subsequence of another sequence $\beta = \langle b_1, b_2, \ldots, b_m \rangle$ if there exist integers $1 \le i_1 < i_2 < \cdots < i_n \le m$ such that $a_1 \subseteq b_{i_1}$, $a_2 \subseteq b_{i_2}$, ..., and $a_n \subseteq b_{i_n}$.

Definition
Sequential pattern mining can be defined as: “Given a sequence database $Db$ and a user-defined minimum support min_sup, find the complete set of subsequences whose occurrence frequencies are at least min_sup × $|Db|$.”
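The subsequence and minimum-support definitions can be sketched in a few lines of Python. This is a minimal illustration rather than code from the talk: elements are modeled as Python sets, the function names are made up for the example, the database is the five customer sequences used in the Apriori example of these slides, and the threshold min_sup = 0.5 is an assumed value.

```python
def is_subsequence(alpha, beta):
    """Return True if alpha is a subsequence of beta (both are lists of sets)."""
    j = 0
    for a in alpha:
        # Greedily match a against the earliest remaining element that contains it.
        while j < len(beta) and not a <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1  # the next element of alpha must match strictly later in beta
    return True


def support(pattern, db):
    """Number of sequences in db that contain pattern as a subsequence."""
    return sum(is_subsequence(pattern, seq) for seq in db)


def is_frequent(pattern, db, min_sup):
    """Frequent means: occurrence frequency >= min_sup * |Db|."""
    return support(pattern, db) >= min_sup * len(db)


# Customer sequences from the Apriori example in these slides.
db = [
    [{30}, {90}, {40, 70}],
    [{10, 20}, {30}, {40, 60, 70}, {90}],
    [{30, 50, 70}, {10}, {20}],
    [{30}, {40, 70}, {90}],
    [{90}, {10}, {20}],
]

MIN_SUP = 0.5  # assumed threshold, for illustration only
print(support([{30}, {90}], db), is_frequent([{30}, {90}], db, MIN_SUP))
print(support([{10}, {20}], db), is_frequent([{10}, {20}], db, MIN_SUP))
```

Note that the second customer bought 10 and 20 in the same element, so that sequence does not count toward the pattern in which 10 is followed later by 20.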

Three Categories
Depending on how the corresponding database is managed, sequential pattern mining can be divided into three categories:
- sequential pattern mining with a static database
- sequential pattern mining with an incremental database
- sequential pattern mining with a progressive database

How to Do Sequential Pattern Mining on a Static Database: An Overview

2006/03/24, jwhuang, National Taiwan University

How?
- Apriori-like algorithms
  - AprioriAll, by Agrawal et al.
  - GSP, by R. Srikant et al.
- Partition-based algorithms
  - FreeSpan, by J. Han et al.
  - PrefixSpan, by J. Pei et al.
- Vertical format algorithms
  - SPADE, by Zaki et al.
  - SPAM, by Ayres et al.

Apriori-like Algorithms
1. Sort phase
   - Sort the database with customer ID as the primary key and time as the secondary key.
2. Litemset phase
   - Count the frequency of each itemset, i.e., the fraction of customers who bought the itemset.

Apriori-like Algorithms
3. Transformation phase
   - Transform each transaction into the set of all litemsets it contains, giving one transformed sequence per customer (C01 through C05).

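Example: sort phase and litemset phase

Original transactions, in arrival order:
CID | Items
2 | 10 20
5 | 90
2 | 30
2 | 40 60 70
4 | 30
3 | 30 50 70
1 | 30
1 | 90
4 | 40 70
4 | 90
3 | 10
5 | 10
1 | 40 70
5 | 20
2 | 90
3 | 20

Customer sequences after the sort phase:
CID | Items
1 | 30 90 {40 70}
2 | {10 20} 30 {40 60 70} 90
3 | {30 50 70} 10 20
4 | 30 {40 70} 90
5 | 90 10 20

Itemset support counts (number of customers):
Itemset | #
10 | 3
20 | 3
30 | 4
40 | 3
50 | 1
60 | 1
70 | 4
90 | 4
{10 20} | 1
{40 60} | 1
{40 70} | 3
{60 70} | 1
{40 60 70} | 1
{30 50} | 1
{30 70} | 1
{50 70} | 1
{30 50 70} | 1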

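Example: transformation phase

Litemsets and their new IDs:
Itemset | # | New
10 | 3 | 1
20 | 3 | 2
30 | 4 | 3
40 | 3 | 4
70 | 4 | 5
90 | 4 | 6
{40 70} | 3 | 7

Transformed customer sequences:
CID | Items
1 | 3 6 {4, 5, 7}
2 | {1, 2} 3 {4, 5, 7} 6
3 | {3, 5} 1 2
4 | 3 {4, 5, 7} 6
5 | 6 1 2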
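The litemset and transformation phases of this example can be reproduced with a short Python sketch. It is an illustration under stated assumptions, not the AprioriAll implementation: the slides do not state the support threshold, so `MIN_COUNT = 2` is an assumed value (any threshold of 2 or 3 customers selects the same seven litemsets here), and the function and variable names are made up for the example.

```python
from collections import Counter
from itertools import combinations

# Customer sequences after the sort phase (elements are sets of items).
db = {
    1: [{30}, {90}, {40, 70}],
    2: [{10, 20}, {30}, {40, 60, 70}, {90}],
    3: [{30, 50, 70}, {10}, {20}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}, {10}, {20}],
}


def itemset_supports(db):
    """Count, for every itemset, the number of customers who bought it at once."""
    counts = Counter()
    for sequence in db.values():
        seen = set()  # each customer counts at most once per itemset
        for element in sequence:
            items = sorted(element)
            for r in range(1, len(items) + 1):
                seen.update(combinations(items, r))
        counts.update(seen)
    return counts


MIN_COUNT = 2  # assumption: the threshold is not given in the slides
supports = itemset_supports(db)
litemsets = sorted(
    (s for s, c in supports.items() if c >= MIN_COUNT),
    key=lambda s: (len(s), s),
)
new_id = {s: i for i, s in enumerate(litemsets, start=1)}


def transform(sequence):
    """Replace each element by the IDs of the litemsets it contains."""
    out = []
    for element in sequence:
        ids = {new_id[s] for s in litemsets if set(s) <= element}
        if ids:  # elements without any litemset are dropped
            out.append(ids)
    return out


transformed = {cid: transform(seq) for cid, seq in db.items()}
for cid, seq in transformed.items():
    print(cid, seq)
```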

Apriori-like Algorithms
4. Mining phase
   - Apply an Apriori-like algorithm to the transformed sequences.
5. Maximal phase
   - Find the maximal patterns.

Example: mining phase, candidate 2-sequences

Support count of each candidate 2-sequence <a b> over the transformed customer sequences (row = first item a, column = second item b):
a \ b | 1 | 2 | 3 | 4 | 5 | 6 | 7
1 | - | 2 | 1 | 1 | 1 | 1 | 1
2 | 0 | - | 1 | 1 | 1 | 1 | 1
3 | 1 | 1 | - | 3 | 3 | 3 | 3
4 | 0 | 0 | 0 | - | 0 | 2 | 0
5 | 1 | 1 | 0 | 0 | - | 2 | 0
6 | 1 | 1 | 0 | 1 | 1 | - | 1
7 | 0 | 0 | 0 | 0 | 0 | 2 | -
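The candidate counts in this mining-phase example can be checked with the following Python sketch. It counts, for each ordered pair of litemset IDs, how many transformed customer sequences contain the first ID in an earlier element than the second. The helper names are illustrative, and the code only reproduces the counting step, not candidate generation or pruning.

```python
from itertools import permutations

# Transformed customer sequences from the transformation phase.
transformed = [
    [{3}, {6}, {4, 5, 7}],
    [{1, 2}, {3}, {4, 5, 7}, {6}],
    [{3, 5}, {1}, {2}],
    [{3}, {4, 5, 7}, {6}],
    [{6}, {1}, {2}],
]


def contains(sequence, pattern):
    """True if the litemset IDs in pattern occur in order, in distinct elements."""
    pos = 0
    for item in pattern:
        while pos < len(sequence) and item not in sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def count(pattern):
    """Number of customers whose sequence contains pattern."""
    return sum(contains(seq, pattern) for seq in transformed)


# All 42 ordered pairs of distinct litemset IDs 1..7.
pair_counts = {p: count(p) for p in permutations(range(1, 8), 2)}

print(pair_counts[(3, 4)], pair_counts[(4, 6)], pair_counts[(2, 1)])
print(count((3, 4, 6)), count((3, 5, 6)), count((3, 7, 6)))
```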

Example: mining phase, candidate 3-sequences

Sequence | #
<3 4 6> | 2
<3 5 6> | 2
<3 7 6> | 2

Therefore, frequent sequential patterns are: [...]

Example: maximal phase

According to the mappings, the original frequent sequential patterns are: [...]
Because [...] and [...] are contained by [...], and [...] are contained by [...], the final maximal sequential patterns are: [...]

Related Works: static database
- AprioriAll, by Agrawal et al.
- GSP, by R. Srikant et al.
- SPADE, by Zaki et al.
- FreeSpan, by J. Han et al.
- PrefixSpan, by J. Pei et al.
- SPAM, by Ayres et al.

Related Works: incremental database
- ISM, by Parthasarathy et al.
- IncSP, by Lin et al.
- ISE, by Masseglia et al.
- IncSpan, by Cheng et al.
- MILE, by Chen et al.

Motivation
- The assumption of a static database may not hold in practice: real-world data change on the fly.
- Sequential patterns found in an incremental database may be of little interest to users, who are usually more interested in recent data than in old data.

Motivation
- If a sequence has no newly arriving elements, it still stays in the database and undesirably contributes to |Db|.
- New sequential patterns that appear frequently in recent sequences may therefore not be recognized as frequent sequential patterns.

Definition: Period of Interest
- The period of interest (POI) is a sliding window whose length is a user-specified time interval and which advances continuously as time goes by.
- Sequences having elements whose timestamps fall into the POI contribute to |Db| for the current sequential patterns.
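A small Python sketch may make the POI definition concrete. It is only an illustration: the function names are invented, sequence S01 follows the arrival times used in the DirApp walk-through of these slides, and the sequences S_old and S_new are hypothetical data added to show a sequence dropping out of, and entering, the window.

```python
def poi_window(now, poi):
    """Timestamps [p, q] covered by the POI when the current timestamp is now."""
    return (max(1, now - poi + 1), now)


def db_size(sequences, p, q):
    """|Db_{p,q}|: sequences with at least one element timestamped in [p, q]."""
    return sum(
        1
        for elements in sequences.values()
        if any(p <= t <= q for t, _ in elements)
    )


sequences = {
    "S01": [(1, "A"), (2, "B"), (4, "C"), (5, "AD"), (7, "B")],
    "S_old": [(1, "A")],   # hypothetical: no new elements after t1
    "S_new": [(8, "C")],   # hypothetical: first element arrives at t8
}

for now in (5, 6, 8):
    p, q = poi_window(now, poi=5)
    print(f"Db_{p},{q}: {db_size(sequences, p, q)} sequences")
```

Once the window has moved past t1, S_old no longer counts toward |Db|, which is exactly the effect the POI is meant to achieve.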

Figure: example progressive sequence database. Sequences S01 to S06 receive elements over timestamps t1 to t10; with POI = 5 and min_supp = 0.5, the sub-databases are $Db_{1,5}$, $Db_{2,6}$, $Db_{3,7}$, $Db_{4,8}$, $Db_{5,9}$ and $Db_{6,10}$.

Progressive Sequential Pattern
The progressive sequential pattern mining problem is defined as follows: “Given a progressive sequence database, a user-specified period of interest, and a user-defined minimum support threshold, find the complete set of frequent subsequences whose occurrence frequencies are greater than or equal to the minimum support times the number of sequences in every period of interest of the database.”

Naïve Algorithm
- Use a conventional static sequential pattern mining algorithm to mine sequential patterns separately from every POI, e.g., $Db_{1,5}$, $Db_{2,6}$, $Db_{3,7}$, $Db_{4,8}$, $Db_{5,9}$, etc.
- For a sequence database whose elements appear over an interval of n timestamps, the total number of POIs in this interval is n − POI + 1.
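The window count n − POI + 1 and the re-mining loop of the naïve approach can be sketched as follows. The callback `mine_static` is a hypothetical placeholder for any static miner (such as GSP or PrefixSpan); it is not an API from the talk, and the stand-in used below does no real mining.

```python
def poi_windows(n, poi):
    """All full POI windows (p, q) over timestamps 1..n; there are n - poi + 1."""
    return [(p, p + poi - 1) for p in range(1, n - poi + 2)]


def naive_progressive_mining(n, poi, mine_static):
    """Re-mine every window from scratch with a static algorithm.

    mine_static(p, q) is a hypothetical callback that returns the frequent
    patterns of the sub-database Db_{p,q}.
    """
    return {window: mine_static(*window) for window in poi_windows(n, poi)}


windows = poi_windows(n=10, poi=5)
print(len(windows), windows)

# Stand-in miner that only records which window it was asked to mine.
results = naive_progressive_mining(10, 5, lambda p, q: f"patterns of Db_{p},{q}")
print(results[(3, 7)])
```

With n = 10 and POI = 5 this yields the six sub-databases of the running example, each of which the naïve approach mines independently.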

Prior Work
- The only prior work on progressive databases is GSP+ and MFS+, proposed by Zhang et al. and based on the static algorithms GSP and MFS (MFS was also devised by the same authors).
- However, these algorithms still have to re-mine each sub-database using the static algorithms GSP and MFS.
- Moreover, the performance improvement of GSP+ and MFS+ over GSP and MFS is within 15%, as reported by their authors.

Algorithm DirApp
- DirApp stands for Direct Append.
- It consists of two procedures:
  - Progressively Updating, abbreviated as PrUp
  - Immediately Filtering, abbreviated as ImFi

Procedure PrUp
When progressively reading newly incoming elements, Procedure PrUp can
- update each sequence in the sequence database
- generate candidate sequential patterns
- calculate the occurrence frequencies of all candidate sequential patterns in the current POI

Procedure ImFi
DirApp uses Procedure ImFi to
- filter out obsolete data from the existing sequence database
- prune away obsolete candidate sequential patterns from the candidate set
- report the most up-to-date frequent sequential patterns to the user in every POI

Example: candidate sets of sequence S01 (POI = 5)

S01 receives A at t1, B at t2, C at t4, (AD) at t5 and B at t7. Each candidate pattern is written with its starting timestamp after an underscore, so AB_1 is the pattern AB starting at t1.

(1) $Db_{1,1}$: A_1
(2) $Db_{1,2}$: A_1, B_2, AB_1
(3) $Db_{1,3}$: A_1, B_2, AB_1
(4) $Db_{1,4}$: A_1, B_2, AB_1, C_4, AC_1, BC_2, ABC_1
(5) $Db_{1,5}$: A_5, B_2, AB_1, C_4, AC_1, BC_2, ABC_1, D_5, (AD)_5, AD_1, A(AD)_1, BA_2, BD_2, B(AD)_2, ABD_1, AB(AD)_1, CA_4, CD_4, C(AD)_4, ACD_1, AC(AD)_1, BCA_2, BCD_2, BC(AD)_2, ABCD_1, ABC(AD)_1
(6) $Db_{2,6}$: A_5, B_2, C_4, BC_2, D_5, (AD)_5, BA_2, BD_2, B(AD)_2, CA_4, CD_4, C(AD)_4, BCA_2, BCD_2, BC(AD)_2
(7) $Db_{3,7}$: A_5, C_4, D_5, (AD)_5, CA_4, CD_4, C(AD)_4, B_7, AB_5, CB_4, DB_5, (AD)B_5, CAB_4, CDB_4, C(AD)B_4
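The S01 walk-through can be replayed with a short Python sketch of the two DirApp steps for a single sequence. The rules below are inferred from the example rather than taken from the paper's pseudocode, so treat them as assumptions: candidates whose starting timestamp falls before the window are dropped (the ImFi step); every surviving candidate is extended by each non-empty subset of the arriving element that is not already one of its elements, keeping the candidate's starting timestamp (the PrUp step); and each subset on its own restarts at the current timestamp. All function names are illustrative.

```python
from itertools import combinations


def nonempty_subsets(element):
    """'AD' -> ['A', 'D', 'AD']: every candidate element of an arriving element."""
    items = sorted(element)
    return ["".join(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r)]


def advance(candidates, t, poi, new_element=None):
    """Move one sequence's candidate set to timestamp t.

    candidates maps a pattern (tuple of element strings) to its starting timestamp.
    """
    oldest = t - poi + 1
    # ImFi-like step: drop candidates that start before the current POI.
    kept = {p: s for p, s in candidates.items() if s >= oldest}
    if new_element is None:
        return kept
    updated = dict(kept)
    subsets = nonempty_subsets(new_element)
    # PrUp-like step: append each candidate element to every surviving pattern.
    for pattern, start in kept.items():
        for e in subsets:
            if e not in pattern:  # assumption: no repeated element in a pattern
                updated[pattern + (e,)] = start
    for e in subsets:
        updated[(e,)] = t  # a single element (re)starts at the current timestamp
    return updated


def label(pattern):
    """Render ('A', 'AD') as 'A(AD)'."""
    return "".join(e if len(e) == 1 else f"({e})" for e in pattern)


arrivals = {1: "A", 2: "B", 4: "C", 5: "AD", 7: "B"}  # sequence S01
candidates, history = {}, {}
for t in range(1, 8):
    candidates = advance(candidates, t, poi=5, new_element=arrivals.get(t))
    history[t] = {label(p): s for p, s in candidates.items()}

print(len(history[5]), len(history[6]), len(history[7]))
print(sorted(history[4].items()))
```

Under these assumptions the candidate sets at t4, t5, t6 and t7 contain 7, 26, 15 and 15 patterns, matching the counts in the walk-through.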

Example: merging the candidate sets of all sequences in $Db_{1,2}$ (4 sequences)

Candidate sets per sequence:
- S01: A_1, B_2, AB_1
- S02: A_1, D_1, (AD)_1, B_2, AB_1, DB_1, (AD)B_1
- S03: A_1, B_2, C_2, (BC)_2, AB_1, AC_1, A(BC)_1
- S04: D_2

Merged candidate set (pattern, starting timestamp, occurrence count):
Pattern | Start | #
AB | 1 | 3
A(BC) | 1 | 1
AC | 1 | 1
(AD)B | 1 | 1
DB | 1 | 1

Frequent pattern reported: AB_1, with 3 occurrences.

Figure: DirApp on all six sequences, continued from $Db_{1,2}$ through $Db_{5,9}$. For every sub-database the slides list the merged candidate set with each pattern's starting timestamp and occurrence count, together with the frequent patterns reported at that timestamp.

The Advantages of DirApp
- DirApp needs only one scan of the newly arriving elements and of the candidate set at each timestamp, rather than the quadratic number of scans required by conventional algorithms.
- DirApp can
  - maintain the latest data sequences
  - find the complete set of up-to-date sequential patterns
  - delete obsolete data and patterns rapidly

The Disadvantages of DirApp
- DirApp needs a lot of working space to store the candidate sets of all sequences.
- Scanning all candidate sets incurs a large computational cost in execution time.
- DirApp needs another data structure to calculate the occurrence frequencies of all candidate sequential patterns.

Algorithm Pisa
- Pisa stands for Progressive mIning of Sequential pAtterns.
- Pisa uses a progressive sequential tree (abbreviated as PS-tree) to maintain the information of all sequences in each POI, in order to
  - update each sequence
  - find up-to-date sequential patterns

PS-tree
- The nodes of a PS-tree are of two types:
  - the Root node
  - common nodes
- Each common node stores two pieces of information:
  - a node label, which is an element in a sequence
  - a sequence list, which holds the IDs of the sequences containing this element, each marked with its corresponding timestamp

PS-tree
- Whenever a series of elements appears in the same sequence, there is a series of nodes, one labeled by each element, with the same sequence ID in their sequence lists.
- The node representing the first element is connected to the Root node.
- The other nodes are connected below the first node analogously.

Figure: PS-tree for the elements A, B and C of sequence S01. Each node shows its label and a sequence list of (sequence ID, timestamp) pairs; the chain Root → A → B → C holds sequence ID 01 with timestamp 1 in every node.

PS-tree
- The path from the Root node to any other node represents a candidate sequential pattern appearing in this sequence.
- The timestamp at which each candidate sequential pattern appears is marked in the node labeled by its last element.

Figure: complete PS-tree for S01 after A (t1), B (t2) and C (t4). Root has children A (01, 1), B (01, 2) and C (01, 4); below A are B (01, 1), which has child C (01, 1), and C (01, 1); below B is C (01, 2).

Algorithm Pisa
When receiving elements at timestamp t+1, Pisa traverses the PS-tree of timestamp t in post-order and transforms it into the PS-tree of timestamp t+1. During the traversal it
- deletes the obsolete elements from the tree
- updates the current sequences in the tree
- inserts the newly arriving elements into the tree

For a common node
- Pisa deletes the obsolete sequences from the sequence list of this node.
- If no sequence ID is left in the sequence list, Pisa prunes this node away from its parent.
- Otherwise, Pisa checks whether any of the sequences left in the sequence list has a newly arriving element.
- If there is no newly arriving element, Pisa goes to the next node.

For a common node (continued)
- Otherwise, Pisa generates all combinations of candidate elements from the arriving element. Example: ABC yields A, B, C, AB, AC, BC and ABC.
- For each candidate element that does not exist on the path from Root to the current node:
  - If there is a child with the same label, Pisa updates the timestamp of this sequence in the child to the timestamp of the same sequence in the parent's sequence list.
  - Otherwise, Pisa creates a new child for this element, with the sequence ID and the timestamp of the same sequence in the parent's sequence list.

For the Root node
- Instead of checking a sequence list, Pisa examines all sequences that have newly arriving elements.
- After Pisa generates all combinations of candidate elements, for each of them:
  - If there is a child with the same label, Pisa updates the timestamp of this sequence to t+1.
  - Otherwise, Pisa creates a new child for this element with the sequence ID and timestamp t+1.
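Generating the candidate elements of an arriving element amounts to enumerating its non-empty subsets. The Python sketch below reproduces the ABC example and the filter on elements already present on the path; the function name and the representation of elements as strings of item letters are choices made for this illustration only.

```python
from itertools import combinations


def candidate_elements(element, path=()):
    """Non-empty subsets of the arriving element that are not already on the path.

    element is a string of item letters such as 'ABC'; path is a tuple of the
    element labels from Root to the current node.
    """
    items = sorted(element)
    candidates = ["".join(c) for r in range(1, len(items) + 1)
                  for c in combinations(items, r)]
    return [c for c in candidates if c not in path]


print(candidate_elements("ABC"))
print(candidate_elements("AD", path=("A",)))
```

An element with k items yields 2^k − 1 candidate elements, so the number of children created per node grows quickly with the size of the arriving element.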

Algorithm Pisa
After Pisa processes a common node, if the number of sequence IDs in its sequence list is greater than or equal to min_supp × $|Db_{p,q}|$, the path from Root to this node is output as a frequent sequential pattern.
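The PS-tree update can be sketched in Python along the lines of the steps described for common nodes and the Root node. This is a simplified illustration under stated assumptions, not the authors' implementation: elements are strings of item letters, frequent patterns are collected in a separate pass after the whole tree has been updated (rather than while each node is processed), |Db| is taken to be the number of distinct sequence IDs in the sequence lists of Root's children, and the three-sequence stream in the demo is hypothetical data. All class and function names are made up for the example.

```python
from itertools import combinations


class Node:
    """PS-tree node: an element label plus a sequence list."""

    def __init__(self, label=None):
        self.label = label    # element such as "A" or "AD"; None marks Root
        self.seqs = {}        # sequence ID -> starting timestamp
        self.children = {}    # child label -> Node


def subsets(element):
    items = sorted(element)
    return ["".join(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r)]


def update(node, arrivals, t, poi, path=()):
    """Post-order update of the subtree at node to timestamp t."""
    for child in list(node.children.values()):   # children first (post-order)
        update(child, arrivals, t, poi, path + (child.label,))
    if node.label is None:                       # Root: new elements start at t
        for sid, element in arrivals.items():
            for e in subsets(element):
                node.children.setdefault(e, Node(e)).seqs[sid] = t
    else:
        oldest = t - poi + 1                     # delete obsolete sequences
        node.seqs = {sid: s for sid, s in node.seqs.items() if s >= oldest}
        for sid, start in node.seqs.items():
            for e in subsets(arrivals.get(sid, "")):
                if e not in path:                # skip elements already on the path
                    node.children.setdefault(e, Node(e)).seqs[sid] = start
    for lab in [l for l, c in node.children.items() if not c.seqs]:
        del node.children[lab]                   # prune children left empty


def frequent(root, min_supp):
    """Paths whose sequence list holds at least min_supp * |Db| sequence IDs."""
    db = {sid for child in root.children.values() for sid in child.seqs}
    found = {}

    def walk(node, path):
        for child in node.children.values():
            p = path + (child.label,)
            if len(child.seqs) >= min_supp * len(db):
                found[p] = len(child.seqs)
            walk(child, p)

    walk(root, ())
    return found


def run(stream, poi):
    """stream[i] maps sequence ID -> element arriving at timestamp i + 1."""
    root = Node()
    for t, arrivals in enumerate(stream, start=1):
        update(root, arrivals, t, poi)
    return root


# Hypothetical three-sequence stream over two timestamps.
stream = [{"S01": "A", "S02": "A", "S03": "C"}, {"S01": "B", "S02": "B"}]
print(frequent(run(stream, poi=5), min_supp=0.5))
```

Because every path from Root is a distinct pattern, sequences sharing a pattern share one node, which is why the tree needs less space than one candidate set per sequence.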

Figure: PS-tree walk-through for sequences S01 to S06 with POI = 5 and min_supp = 0.5. For each timestamp the slides show the updated tree, with the number of sequences in the sub-database and the frequent patterns reported, given here as pattern, starting timestamp and support count:
- $Db_{1,1}$ (3 sequences): none shown
- $Db_{1,2}$ (4 sequences): AB, t1, support 3
- $Db_{1,3}$ (5 sequences): AB, t1, support 3
- $Db_{1,4}$ (5 sequences): AB, t1, support 3
- $Db_{1,5}$ (5 sequences): AB, t1, support 3; BA, t4, support 3; DA, t3, support 3
- $Db_{2,6}$ (5 sequences): CA, t4, support 3; BA, t4, support 4
- $Db_{3,7}$ (5 sequences): CA, t4, support 3
- $Db_{4,8}$ (6 sequences): AC, t6, support 3
- $Db_{5,9}$ (5 sequences): AC, t8, support 4
- $Db_{6,10}$ (5 sequences): CD, t8, support 4

The Advantages of Pisa
- Pisa needs only one scan of the newly arriving elements and of the PS-tree at each timestamp, rather than the quadratic number of scans required by conventional algorithms.
- Pisa can
  - maintain the latest data sequences
  - find the complete set of up-to-date sequential patterns
  - delete obsolete data and patterns rapidly

The Advantages of Pisa
- Each path from Root to any other node of the PS-tree forms a unique candidate sequential pattern.
- Pisa therefore merges identical candidate patterns, and patterns do not have to store their prefix elements separately.
- The PS-tree consumes less space.
- Handling identical sequential patterns together is also efficient in execution time.
- FastPisa is a faster variant that returns approximate results.

Experiments
- Comparative algorithms
  - GSP+: re-mining version of GSP
  - SPAM+: re-mining version of SPAM
  - DirApp
- Environment
  - Pentium 4 3 GHz CPU and 2 GB RAM
  - Coded in C++

Experiments
The synthetic datasets are generated in a way similar to the IBM data generator designed for testing sequential pattern mining algorithms.

Experiments
We divide the target dataset into n timestamps. According to the POI, the first m timestamps (m = POI and m < n) are viewed as the original database, and the remaining transactions in the dataset are received by the system incrementally.

Experiments
The first run of the experiments mines the first POI from the first m timestamps of the dataset. After that, we shift the POI forward by t timestamps (t << m) for each of the following runs.

Experiments
- The real data sets are from KDDCUP’07.
- We randomly choose 120 consecutive days for the performance evaluation.
- A timestamp is set to 3 days in order to obtain sufficient frequent sequential patterns. There are therefore 40 timestamps in total, and the POI is set to 10.
- The new datasets contain more than 5000 sequences and 2000 different items.

Figures: experimental results
- Cumulative Execution Time
- Minimum Support
- Length of POI
- Number of Sequences
- Scalability of Pisa
- Real Data Set
- Improvement of FastPisa
- Information Loss of FastPisa

Conclusions
- We proposed a progressive algorithm, Pisa, to handle the progressive sequential pattern mining problem without re-mining all sub-databases at each timestamp.
- Pisa needs only one scan of the newly arriving elements and of the PS-tree at each timestamp, rather than the quadratic number of scans required by conventional algorithms.

Conclusions
- Pisa can
  - maintain the latest information of sequences
  - find the complete set of up-to-date sequential patterns
  - delete obsolete data and patterns rapidly
- Pisa also
  - consumes less space
  - is highly efficient
  - scales well

References
- R. Srikant and R. Agrawal, “Mining Sequential Patterns: Generalizations and Performance Improvements.” Proc. of ICDE, 1995.
- J. Ayres, J. Gehrke, T. Yiu, and J. Flannick, “Sequential pattern mining using a bitmap representation.” Proc. of ACM SIGKDD, 2002.
- M. Zhang, B. Kao, D. W.-L. Cheung, and C. L. Yip, “Efficient algorithms for incremental update of frequent sequences.” Proc. of PAKDD, 2002.

Thank You! Q & A
