1 Multi-dimensional Sequential Pattern Mining. Helen Pinto, Jiawei Han, Jian Pei, Ke Wang, Qiming Chen, Umeshwar Dayal. From: 10th ACM International Conference on Information and Knowledge Management (CIKM 2001), Atlanta. Presenter: 阮士峰 (second-year in-service master's program)
2 Outline: Why multidimensional sequential pattern mining? Problem definition. UniSeq. Algorithms Dim-Seq and Seq-Dim. Experimental results. Conclusions.
3 Why Sequential Pattern Mining? Sequential pattern mining: finding time-related frequent patterns (frequent subsequences). Many data and applications are time-related: customer shopping patterns, telephone calling patterns; natural disasters (e.g., earthquake, hurricane); disease and treatment; stock market fluctuation; Weblog click stream analysis; DNA sequence analysis.
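4 Sequential Pattern: Basics. A sequence database is a table of (Seq. ID, Sequence) rows. A sequence is an ordered list of elements, and each element is a set of items. A sequence is a subsequence of another if its elements can be matched, in order, to elements of the other that contain them. Given support threshold min_sup = 2, a sequence that is a subsequence of at least 2 sequences in the database is a sequential pattern.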
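The definitions on this slide can be sketched in a few lines of Python. This is only an illustration: the sequences in `SDB` are assumed example values (the slide's own example sequences are not recoverable from this text), and `seq`, `is_subsequence`, and `support` are hypothetical helper names, not part of any library.

```python
from typing import Dict, FrozenSet, List

Seq = List[FrozenSet[str]]  # a sequence is an ordered list of elements (itemsets)


def seq(*elements: str) -> Seq:
    """Build a sequence from strings, one string per element: seq("bd", "c")."""
    return [frozenset(e) for e in elements]


def is_subsequence(pattern: Seq, sequence: Seq) -> bool:
    """True if each pattern element is contained in a distinct element of
    `sequence`, with the order of elements preserved (greedy left-to-right)."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern element must match a later element
    return True


def support(pattern: Seq, database: Dict[int, Seq]) -> int:
    """Number of database sequences that contain `pattern`."""
    return sum(is_subsequence(pattern, s) for s in database.values())


# Assumed example data, keyed by sequence ID.
SDB: Dict[int, Seq] = {
    10: seq("bd", "c", "b", "a"),
    20: seq("bf", "ce", "fg"),
    30: seq("ah", "a", "b", "f"),
    40: seq("be", "ce"),
}
```

With min_sup = 2, a pattern is a sequential pattern when `support(pattern, SDB) >= 2`; for instance `seq("b", "f")` (b, later followed by f) qualifies under this assumed data, while `seq("bf")` (b and f in the same element) does not.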
5 Multi-Dimensional Sequence Database
cid  Cust_grp      City      Age_grp  sequence
10   Business      Boston    Middle
20   Professional  Chicago   Young
30   Business      Chicago   Middle
40   Education     New York  Retired
P = (*,Chicago,*, ) matches tuples 20 and 30, so with support threshold 2, P is an MD sequential pattern.
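To make the matching rule concrete, here is a small Python sketch. The dimension values come from the table above; the sequence column and the sequence part of P are assumed values (they do not survive in this text), chosen so that P matches tuples 20 and 30 as the slide states. All function names are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

Seq = List[FrozenSet[str]]
Dims = Tuple[str, str, str]  # (Cust_grp, City, Age_grp)


def seq(*elements: str) -> Seq:
    """One string per element, e.g. seq("bf", "ce")."""
    return [frozenset(e) for e in elements]


def is_subsequence(pattern: Seq, sequence: Seq) -> bool:
    """Order-preserving containment of pattern elements in sequence elements."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


# Dimension values are from the slide; the sequences are ASSUMED for illustration.
MD_DB: Dict[int, Tuple[Dims, Seq]] = {
    10: (("Business", "Boston", "Middle"), seq("bd", "c", "b", "a")),
    20: (("Professional", "Chicago", "Young"), seq("bf", "ce", "fg")),
    30: (("Business", "Chicago", "Middle"), seq("ah", "a", "b", "f")),
    40: (("Education", "New York", "Retired"), seq("be", "ce")),
}


def matches(p_dims: Dims, p_seq: Seq, dims: Dims, sequence: Seq) -> bool:
    """A tuple matches when every non-'*' dimension agrees and the
    sequence part of the pattern is a subsequence of the tuple's sequence."""
    dims_ok = all(p == "*" or p == d for p, d in zip(p_dims, dims))
    return dims_ok and is_subsequence(p_seq, sequence)


def matching_cids(p_dims: Dims, p_seq: Seq, db: Dict[int, Tuple[Dims, Seq]]) -> List[int]:
    """cids of all tuples matched by the MD sequential pattern."""
    return [cid for cid, (dims, s) in db.items() if matches(p_dims, p_seq, dims, s)]
```

Calling `matching_cids(("*", "Chicago", "*"), seq("b", "f"), MD_DB)` returns the cids whose tuple matches; the support of the MD pattern is the length of that list.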
6 Problem definition. Sequential patterns are useful, for example the sequence “try a 100 hour free internet access package”, “subscribe to 15 hours/month package”, “upgrade to 30 hours/month package”, “upgrade to unlimited package”. Applications: marketing, product design & development. Problem: lack of focus; various groups of customers may have different patterns. MD-sequential pattern mining: integrate multi-dimensional analysis and sequential pattern mining.
7 UniSeq. Embed MD information into sequences, then mine the extended sequence database using sequential pattern mining methods.
Table 1: SDB
cid  Cust_grp      City      Age_grp  sequence
10   Business      Boston    Middle
20   Professional  Chicago   Young
30   Business      Chicago   Middle
40   Education     New York  Retired
Table 2: SDB MD
cid  MD-extension of sequences
8 UniSeq (cont.) The sequence database SDB MD can be mined using PrefixSpan. In the first scan of the database, PrefixSpan finds all frequent single-item sequences; there are eight, with supports 2, 2, 2, 2, 4, 3, 2 and 2. The complete set of sequential patterns can then be partitioned into 8 subsets, one per frequent single-item prefix.
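The embedding step and the first scan can be sketched as follows. This is an illustrative reading of UniSeq, not the authors' code: the dimension values are prepended to each sequence as one extra element, and the sequence column is an assumed example (it is not recoverable from this text). `md_extend` and `frequent_items` are hypothetical names. The sketch also assumes that no dimension value collides with an item name.

```python
from collections import Counter
from typing import Dict, FrozenSet, Iterable, List, Tuple

Seq = List[FrozenSet[str]]
Dims = Tuple[str, str, str]


def seq(*elements: str) -> Seq:
    """One string per element, e.g. seq("bf", "ce")."""
    return [frozenset(e) for e in elements]


# Dimension values are from the slides; the sequences are ASSUMED for illustration.
MD_DB: Dict[int, Tuple[Dims, Seq]] = {
    10: (("Business", "Boston", "Middle"), seq("bd", "c", "b", "a")),
    20: (("Professional", "Chicago", "Young"), seq("bf", "ce", "fg")),
    30: (("Business", "Chicago", "Middle"), seq("ah", "a", "b", "f")),
    40: (("Education", "New York", "Retired"), seq("be", "ce")),
}


def md_extend(dims: Dims, sequence: Seq) -> Seq:
    """Embed the MD information as an extra first element of the sequence."""
    return [frozenset(dims)] + list(sequence)


def frequent_items(sequences: Iterable[Seq], min_sup: int) -> Dict[str, int]:
    """First database scan: count each item once per sequence and keep
    the items whose support reaches min_sup."""
    counts: Counter = Counter()
    for s in sequences:
        counts.update(set().union(*s))
    return {item: n for item, n in counts.items() if n >= min_sup}


SDB_MD = {cid: md_extend(dims, s) for cid, (dims, s) in MD_DB.items()}
FREQUENT = frequent_items(SDB_MD.values(), min_sup=2)
```

Under the assumed sequences, `FREQUENT` holds eight items with supports 2, 2, 2, 2, 4, 3, 2 and 2, which is consistent with the counts reported on this slide. Each frequent item then becomes the prefix of one partition of the search space.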
9 UniSeq (cont.) Example: the -projected database contains two postfix sequences: and. PrefixSpan then prints out the sequential pattern and finds the frequent items in this projected database. They are : and, which form the sequential patterns “ :2” and “ :2” respectively. However, the -projected database contains postfix sequences for: and, with one frequent item between them, which gives “ :2”, i.e. (*,Chicago,*, ).
10 Mine Sequential Patterns by Prefix Projections. Step 1: find the length-1 sequential patterns (six in the example). Step 2: divide the search space. The complete set of sequential patterns can be partitioned into 6 subsets, one for each length-1 pattern used as prefix.
11 Find Seq. Patterns with a Given Prefix. Only the projections with respect to the prefix need to be considered; in the example, the projected database contains four postfix sequences. Find all the length-2 sequential patterns having that prefix (six in the example). Further partition into 6 subsets, one for each length-2 pattern used as prefix.
12 Completeness of PrefixSpan. [Diagram: SDB, then length-1 sequential patterns, then one projected database per length-1 prefix, then length-2 sequential patterns, then one projected database per length-2 prefix, and so on.]
13 Efficiency of PrefixSpan. No candidate sequence needs to be generated. Projected databases keep shrinking. Major cost of PrefixSpan: constructing projected databases.
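The prefix-projection idea of the preceding slides can be sketched in Python. This is a deliberately simplified version, not the published algorithm in full: it assumes every element holds a single item (so a sequence is just a list of items), builds projected databases by physical copying, and uses a small hypothetical database `DB`. The function name `prefixspan` is likewise an illustrative choice.

```python
from collections import Counter
from typing import Dict, List, Tuple

Pattern = Tuple[str, ...]


def prefixspan(db: List[List[str]], min_sup: int) -> Dict[Pattern, int]:
    """Mine all sequential patterns (with supports) by recursive prefix projection.

    Simplification: each element of a sequence is a single item.
    """
    patterns: Dict[Pattern, int] = {}

    def mine(prefix: Pattern, projected: List[List[str]]) -> None:
        # Scan the projected database once: count each item once per postfix.
        counts: Counter = Counter()
        for postfix in projected:
            counts.update(set(postfix))
        for item, n in sorted(counts.items()):
            if n < min_sup:
                continue
            new_prefix = prefix + (item,)
            patterns[new_prefix] = n
            # Project: keep what follows the first occurrence of `item`.
            new_projected = [p[p.index(item) + 1:] for p in projected if item in p]
            mine(new_prefix, new_projected)

    mine((), db)
    return patterns


# Hypothetical toy database, for illustration only.
DB = [["a", "b", "c"], ["a", "c"], ["b", "c"], ["a", "b"]]
RESULT = prefixspan(DB, min_sup=2)
```

No candidate sequences are generated: patterns only grow from items that are actually frequent in the current projected database, and each projection is no larger than the one it came from. The copying in `new_projected` is the cost the slide refers to.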
14 Dim-Seq. First find MD-patterns, e.g. (*,Chicago,*). Form the projected sequence database for each MD-pattern, e.g. the sequences of tuples 20 and 30 for (*,Chicago,*). Find sequential patterns in the projected database, e.g. (*,Chicago,*, ).
cid  Cust_grp      City      Age_grp  sequence
10   Business      Boston    Middle
20   Professional  Chicago   Young
30   Business      Chicago   Middle
40   Education     New York  Retired
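The three Dim-Seq steps can be sketched as below, under stated assumptions: the sequence column is an assumed example, MD-patterns are enumerated by brute force rather than by the BUC-like method, and the sequence miner only finds patterns made of single items (one item per pattern element). All names are hypothetical.

```python
from collections import Counter
from itertools import product
from typing import Dict, FrozenSet, List, Tuple

Seq = List[FrozenSet[str]]
Dims = Tuple[str, ...]


def seq(*elements: str) -> Seq:
    return [frozenset(e) for e in elements]


# Dimension values are from the slide; the sequences are ASSUMED for illustration.
MD_DB: Dict[int, Tuple[Dims, Seq]] = {
    10: (("Business", "Boston", "Middle"), seq("bd", "c", "b", "a")),
    20: (("Professional", "Chicago", "Young"), seq("bf", "ce", "fg")),
    30: (("Business", "Chicago", "Middle"), seq("ah", "a", "b", "f")),
    40: (("Education", "New York", "Retired"), seq("be", "ce")),
}


def md_patterns(rows: List[Dims], min_sup: int) -> Dict[Dims, int]:
    """Brute force: count every generalization of every row."""
    counts: Counter = Counter()
    for dims in rows:
        for mask in product([False, True], repeat=len(dims)):
            counts[tuple(d if keep else "*" for d, keep in zip(dims, mask))] += 1
    return {p: n for p, n in counts.items() if n >= min_sup}


def mine_sequences(seqs: List[Seq], min_sup: int) -> Dict[Tuple[str, ...], int]:
    """PrefixSpan-style miner restricted to single-item pattern elements."""
    out: Dict[Tuple[str, ...], int] = {}

    def grow(prefix: Tuple[str, ...], projected: List[Seq]) -> None:
        counts: Counter = Counter()
        for postfix in projected:
            counts.update(set().union(*postfix))
        for item in sorted(counts):
            if counts[item] < min_sup:
                continue
            out[prefix + (item,)] = counts[item]
            nxt = []
            for postfix in projected:
                for i, element in enumerate(postfix):
                    if item in element:
                        nxt.append(postfix[i + 1:])  # elements after the match
                        break
            grow(prefix + (item,), nxt)

    grow((), seqs)
    return out


def dim_seq(db: Dict[int, Tuple[Dims, Seq]], min_sup: int) -> Dict[Tuple[Dims, Tuple[str, ...]], int]:
    """Step 1: MD-patterns. Step 2: project sequences. Step 3: mine them."""
    result = {}
    for pattern in md_patterns([dims for dims, _ in db.values()], min_sup):
        projected = [s for dims, s in db.values()
                     if all(p in ("*", d) for p, d in zip(pattern, dims))]
        for sp, n in mine_sequences(projected, min_sup).items():
            result[(pattern, sp)] = n
    return result


RESULT = dim_seq(MD_DB, min_sup=2)
```

Note that the sequence miner runs once per frequent MD-pattern, which hints at why this ordering can become expensive as the number of dimension combinations grows.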
15 Seq-Dim. First find sequential patterns. Form the projected MD-database for each sequential pattern, e.g. (Professional,Chicago,Young) and (Business,Chicago,Middle). Mine MD-patterns in the projected MD-database, e.g. (*,Chicago,*, ).
cid  Cust_grp      City      Age_grp  sequence
10   Business      Boston    Middle
20   Professional  Chicago   Young
30   Business      Chicago   Middle
40   Education     New York  Retired
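Seq-Dim reverses the order of the two sub-problems. The sketch below makes the same assumptions as before: the sequence column is an assumed example, the sequence miner handles only single-item pattern elements, and MD-patterns are enumerated by brute force. All names are hypothetical.

```python
from collections import Counter
from itertools import product
from typing import Dict, FrozenSet, List, Tuple

Seq = List[FrozenSet[str]]
Dims = Tuple[str, ...]
ItemPattern = Tuple[str, ...]


def seq(*elements: str) -> Seq:
    return [frozenset(e) for e in elements]


# Dimension values are from the slide; the sequences are ASSUMED for illustration.
MD_DB: Dict[int, Tuple[Dims, Seq]] = {
    10: (("Business", "Boston", "Middle"), seq("bd", "c", "b", "a")),
    20: (("Professional", "Chicago", "Young"), seq("bf", "ce", "fg")),
    30: (("Business", "Chicago", "Middle"), seq("ah", "a", "b", "f")),
    40: (("Education", "New York", "Retired"), seq("be", "ce")),
}


def contains(pattern: ItemPattern, sequence: Seq) -> bool:
    """True if the items of `pattern` occur in order, in distinct elements."""
    pos = 0
    for item in pattern:
        while pos < len(sequence) and item not in sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def mine_sequences(seqs: List[Seq], min_sup: int) -> Dict[ItemPattern, int]:
    """PrefixSpan-style miner restricted to single-item pattern elements."""
    out: Dict[ItemPattern, int] = {}

    def grow(prefix: ItemPattern, projected: List[Seq]) -> None:
        counts: Counter = Counter()
        for postfix in projected:
            counts.update(set().union(*postfix))
        for item in sorted(counts):
            if counts[item] < min_sup:
                continue
            out[prefix + (item,)] = counts[item]
            nxt = []
            for postfix in projected:
                for i, element in enumerate(postfix):
                    if item in element:
                        nxt.append(postfix[i + 1:])
                        break
            grow(prefix + (item,), nxt)

    grow((), seqs)
    return out


def md_patterns(rows: List[Dims], min_sup: int) -> Dict[Dims, int]:
    """Brute force: count every generalization of every row."""
    counts: Counter = Counter()
    for dims in rows:
        for mask in product([False, True], repeat=len(dims)):
            counts[tuple(d if keep else "*" for d, keep in zip(dims, mask))] += 1
    return {p: n for p, n in counts.items() if n >= min_sup}


def seq_dim(db: Dict[int, Tuple[Dims, Seq]], min_sup: int) -> Dict[Tuple[Dims, ItemPattern], int]:
    """Step 1: sequential patterns. Step 2: project MD tuples. Step 3: MD-patterns."""
    result = {}
    for sp in mine_sequences([s for _, s in db.values()], min_sup):
        projected = [dims for dims, s in db.values() if contains(sp, s)]
        for pattern, n in md_patterns(projected, min_sup).items():
            result[(pattern, sp)] = n
    return result


RESULT = seq_dim(MD_DB, min_sup=2)
```

For the assumed pattern (b, f), the projected MD-database holds exactly the two tuples named on the slide, (Professional,Chicago,Young) and (Business,Chicago,Middle), and the only MD-patterns with support 2 in it are (*,*,*) and (*,Chicago,*).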
16 Dim-Seq and Seq-Dim. The problem of multi-dimensional sequential pattern mining can be reduced to two sub-problems: sequential pattern mining and MD-pattern mining. As introduced before, sequential pattern mining can be done efficiently by PrefixSpan. For MD-pattern mining, we adopt a BUC-like algorithm.
17 BUC algorithm. Kevin Beyer and Raghu Ramakrishnan. Bottom-up computation of sparse and iceberg CUBEs. Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, May 31-June 3, 1999, Philadelphia, Pennsylvania, United States.
18 Mining MD-Patterns (BUC-like)
BUC processing of the cuboid lattice:
All
(cust-grp,*,*)  (*,city,*)  (*,*,age-grp)
(cust-grp,city,*)  (cust-grp,*,age-grp)
(cust-grp,city,age-grp)
cid  Cust_grp      City      Age_grp  sequence
10   Business      Boston    Middle
20   Professional  Chicago   Young
30   Business      Chicago   Middle
40   Education     New York  Retired
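A bottom-up, BUC-style traversal of this lattice might look like the following Python sketch. It is written in the spirit of BUC (partition on one dimension at a time, and stop descending as soon as a partition falls below min_sup); it is not the authors' implementation, and `buc_md_patterns` is a hypothetical name. The rows are the dimension values from the table above.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Dims = Tuple[str, ...]

ROWS: List[Dims] = [
    ("Business", "Boston", "Middle"),
    ("Professional", "Chicago", "Young"),
    ("Business", "Chicago", "Middle"),
    ("Education", "New York", "Retired"),
]


def buc_md_patterns(rows: List[Dims], min_sup: int) -> Dict[Dims, int]:
    """Return every MD-pattern whose support is at least min_sup."""
    n_dims = len(rows[0]) if rows else 0
    out: Dict[Dims, int] = {}

    def recurse(part: List[Dims], pattern: List[str], start: int) -> None:
        out[tuple(pattern)] = len(part)
        # Specialize one more dimension, only to the right of the last one fixed,
        # so that each pattern is reached exactly once.
        for d in range(start, n_dims):
            groups: Dict[str, List[Dims]] = defaultdict(list)
            for row in part:
                groups[row[d]].append(row)
            for value, group in groups.items():
                if len(group) >= min_sup:  # iceberg pruning
                    pattern[d] = value
                    recurse(group, pattern, d + 1)
                    pattern[d] = "*"

    if len(rows) >= min_sup:
        recurse(rows, ["*"] * n_dims, 0)
    return out


PATTERNS = buc_md_patterns(ROWS, min_sup=2)
```

Because support can only shrink as more dimensions are fixed, a partition that is already too small is never expanded; for example, no pattern containing Boston is ever explored for min_sup = 2.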
19 Experimental results. Experiments were run on a Pentium III PC with 1 GB of main memory, using Microsoft Visual C. In this dataset, the number of items is set to 10,000, and the number of sequences is 10,000. The average number of items within each element is 2.5. The average number of elements in one sequence is 8.
20 Scalability Over Dimensionality
21 Scalability Over Cardinality
22 Scalability Over Support Threshold
23 Scalability Over Database Size
24 Pros & Cons of Algorithms. Seq-Dim is efficient and scalable: fastest in most cases. UniSeq is also efficient and scalable: fastest with low dimensionality. Dim-Seq has poor scalability.
25 Conclusions. MD sequential pattern mining is interesting and useful. Mining MD sequential patterns efficiently: UniSeq, Dim-Seq, and Seq-Dim. Future work: applications of sequential pattern mining.
End of presentation
27 References (1)
R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94.
R. Agrawal and R. Srikant. Mining sequential patterns. ICDE'95.
K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and iceberg CUBEs. Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, May 31-June 3, 1999, Philadelphia, Pennsylvania, United States.
C. Bettini, X. S. Wang, and S. Jajodia. Mining temporal relationships with multiple granularities in time sequences. Data Engineering Bulletin, 21:32-38.
M. Garofalakis, R. Rastogi, and K. Shim. SPIRIT: Sequential pattern mining with regular expression constraints. VLDB'99.
J. Han, G. Dong, and Y. Yin. Efficient mining of partial periodic patterns in time series database. ICDE'99.
J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.-C. Hsu. FreeSpan: Frequent pattern-projected sequential pattern mining. KDD'00.
28 References (2)
J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD'00.
H. Lu, J. Han, and L. Feng. Stock movement and n-dimensional intertransaction association rules. DMKD'98, pages 12:1-12:7.
H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, vol. 1.
B. Özden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. ICDE'98.
J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. ICDE'01.
R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT'96, pages 3-17.