Pattern-growth Methods for Sequential Pattern Mining: Principles and Extensions
Jiawei Han (UIUC), Jian Pei (Simon Fraser Univ.)



Outline
  Sequential pattern mining
  Pattern-growth methods
  Performance study
  Mining sequential patterns with regular expression constraints

Why Sequential Pattern Mining?
Sequential pattern mining: finding time-related frequent patterns (frequent subsequences)
Most data and applications are time-related
  Customer shopping patterns, telephone calling patterns
    E.g., first buy a computer, then CD-ROMs and software, within 3 months
  Natural disasters (e.g., earthquakes, hurricanes)
  Disease and treatment
  Stock market fluctuation
  Weblog click stream analysis
  DNA sequence analysis

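Sequential Pattern Mining
Given a set of sequences, find the complete set of frequent subsequences
A sequence database:
  SID  sequence
  10   <a(abc)(ac)d(cf)>
  20   <(ad)c(bc)(ae)>
  30   <(ef)(ab)(df)cb>
  40   <eg(af)cbc>
A sequence, e.g. <(ef)(ab)(df)cb>, is an ordered list of elements
  An element may contain several items; items within an element are listed alphabetically
<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
Given support threshold min_sup = 2, <(ab)c> is a sequential pattern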
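The definitions on this slide can be made concrete with a small sketch. It assumes the four-sequence running example of the PrefixSpan paper (SIDs 10 to 40), encoded as lists of Python sets; the names `DB`, `is_subsequence`, and `support` are illustrative and do not come from the slides.

```python
# Running example, assumed to be the database these slides use.
# Each sequence is a list of elements; each element is a set of items.
DB = {
    10: [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],
    20: [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],
    30: [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],
    40: [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],
}


def is_subsequence(pattern, sequence):
    """True if the elements of `pattern` can be matched, in order, to
    distinct elements of `sequence` that contain them (greedy matching)."""
    pos = 0
    for element in pattern:
        # Advance to the first remaining element that contains `element`.
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern element must match strictly later
    return True


def support(pattern, db):
    """Number of database sequences that contain `pattern`."""
    return sum(is_subsequence(pattern, seq) for seq in db.values())


# <a(bc)dc> is a subsequence of sequence 10
print(is_subsequence([{"a"}, {"b", "c"}, {"d"}, {"c"}], DB[10]))  # True
# <(ab)c> occurs in sequences 10 and 30, so it meets min_sup = 2
print(support([{"a", "b"}, {"c"}], DB))  # 2
```

Greedy left-to-right matching is enough here because matching each pattern element as early as possible never rules out a later match.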

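Sequential Pattern: Basics
A sequence database is a set of sequences, each identified by a sequence ID
A sequence is an ordered list of elements, and each element is a set of items
A sequence s is a subsequence of a sequence t if the elements of s can be matched, in order, to elements of t that contain them
Given support threshold min_sup = 2, a subsequence contained in at least 2 sequences of the database is a sequential pattern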

Apriori Property
If a sequence S is not frequent, then no super-sequence of S is frequent
Given support threshold min_sup = 2, once a sequence is found in fewer than 2 database sequences, every sequence containing it can be discarded without counting

Apriori-like Sequential Pattern Mining Methods
Proposed by Agrawal and Srikant, ICDE'95 and EDBT'96
  GSP (Generalized Sequential Pattern) algorithm
Outline of the method, level by level:
  Generate candidate sequences
  Scan the database to collect support counts
  Use the Apriori property to prune candidates
    Only generate candidates satisfying the Apriori property
Advantages
  Candidate pruning, scalability

The GSP Mining Process (min_sup = 2)
  1st scan: 8 candidates, 6 length-1 sequential patterns
  2nd scan: 51 candidates, 19 length-2 sequential patterns; 10 candidates not in the DB at all
  3rd scan: 46 candidates, 19 length-3 sequential patterns; 20 candidates not in the DB at all
  4th scan: 8 candidates, 6 length-4 sequential patterns
  5th scan: 1 candidate, 1 length-5 sequential pattern
Candidates are eliminated either because they cannot pass the support threshold or because they do not appear in the DB at all

Bottlenecks of Apriori-like Methods
A huge set of candidates could be generated
  1,000 frequent length-1 sequences generate 1,000 × 1,000 + (1,000 × 999)/2 = 1,499,500 length-2 candidates!
Many scans of the database in mining
Difficulty when mining long sequential patterns
  An exponential number of short candidates
  A length-100 sequential pattern needs 2^100 - 1 ≈ 10^30 candidate sequences!
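The candidate counts behind these bottlenecks can be checked with a short calculation. The sketch assumes the usual GSP join for length-2 candidates (every ordered pair <xy>, including <xx>, plus every unordered pair <(xy)>), which also reproduces the 51 candidates that the GSP mining process slide reports for 6 frequent items. The function names are illustrative.

```python
from math import comb


def length2_candidates(n):
    """Length-2 candidates generated from n frequent length-1 sequences:
    n * n sequences <xy> plus n * (n - 1) / 2 itemset elements <(xy)>."""
    return n * n + n * (n - 1) // 2


def subpatterns(length):
    """Non-empty sub-patterns of a pattern with `length` single-item elements:
    the sum over i of C(length, i), which equals 2**length - 1."""
    return sum(comb(length, i) for i in range(1, length + 1))


print(length2_candidates(6))      # 51
print(length2_candidates(1000))   # 1499500
print(subpatterns(100) == 2**100 - 1)  # True
```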

Mine Sequential Patterns by Prefix Projections
Step 1: find length-1 sequential patterns
  <a>, <b>, <c>, <d>, <e>, <f>
Step 2: divide the search space. The complete set of sequential patterns can be partitioned into 6 subsets:
  The ones having prefix <a>;
  …
  The ones having prefix <f>

Find Seq. Patterns with Prefix <a>
Only need to consider projections w.r.t. <a>
  <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
Find all the length-2 sequential patterns having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
  Further partition into 6 subsets
    Having prefix <aa>;
    …
    Having prefix <af>

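Completeness of PrefixSpan
SDB
  Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
  Having prefix <a>: <a>-projected database
    Length-2 sequential patterns: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
    Having prefix <aa>: <aa>-projected database, …
    …
    Having prefix <af>: <af>-projected database, …
  Having prefix <b>: <b>-projected database, …
  Having prefix <c>, …, <f>: …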
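The recursive partitioning on this slide can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: it assumes every element holds a single item, so sequences are plain strings and the (_x) extensions inside an element are not handled. The toy database and the name `prefixspan` are made up for the example.

```python
from collections import Counter


def prefixspan(db, min_sup):
    """Return {pattern: support} for a database of single-item sequences.
    Patterns are tuples of items; each recursion level mines one projected database."""
    patterns = {}

    def grow(prefix, projected):
        # Count each item once per postfix to get its local support.
        counts = Counter()
        for postfix in projected:
            counts.update(set(postfix))
        for item, sup in sorted(counts.items()):
            if sup < min_sup:
                continue  # locally infrequent items cannot extend the prefix
            new_prefix = prefix + (item,)
            patterns[new_prefix] = sup
            # Project: keep what follows the first occurrence of `item`.
            new_projected = [p[p.index(item) + 1:] for p in projected if item in p]
            grow(new_prefix, new_projected)

    grow((), [tuple(s) for s in db])
    return patterns


toy_db = ["abcb", "acb", "bab"]  # hypothetical data, not from the slides
for pattern, sup in prefixspan(toy_db, min_sup=2).items():
    print("".join(pattern), sup)
```

Each pattern is reached through exactly one chain of prefixes, which is why the partition is complete and no candidate is generated that is absent from the projected database.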

Efficiency of PrefixSpan
No candidate sequence needs to be generated
Projected databases keep shrinking
Major cost of PrefixSpan: constructing projected databases
  Can be improved by bi-level projections

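Pair-wise Checking Using S-matrix
SDB has length-1 sequential patterns <a>, <b>, <c>, <d>, <e>, <f>
S-matrix:

         a          b          c          d          e        f
  a      2
  b  (4, 2, 2)      1
  c  (4, 2, 1)  (3, 3, 2)      3
  d  (2, 1, 1)  (2, 2, 0)  (1, 3, 0)      0
  e  (1, 2, 1)  (1, 2, 0)  (1, 2, 0)  (1, 1, 0)      0
  f  (2, 1, 1)  (2, 2, 0)  (1, 2, 1)  (1, 1, 1)  (2, 0, 1)    1

A diagonal entry is the support of <xx>: <aa> happens twice
An entry in row y, column x holds the supports of (<xy>, <yx>, <(xy)>): in row c, column a, <ac> happens 4 times, <ca> happens twice, and <(ac)> happens once
All length-2 sequential patterns are found in the S-matrix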
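The way an S-matrix cell is filled can be illustrated as follows. The sketch assumes the four-sequence running example of the PrefixSpan paper, and it recomputes each support with a full subsequence test for clarity, whereas the method on the slide fills the whole matrix in one scan. `DB`, `contains`, `support`, and `s_matrix_cell` are illustrative names.

```python
# Running example, assumed to be the database behind the S-matrix on this slide.
DB = [
    [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],   # SID 10
    [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],               # SID 20
    [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],        # SID 30
    [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],           # SID 40
]


def contains(sequence, pattern):
    """Greedy in-order matching of pattern elements to sequence elements."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def support(pattern):
    return sum(contains(seq, pattern) for seq in DB)


def s_matrix_cell(row, col):
    """Diagonal: support of <xx>. Otherwise, for row y and column x:
    (support of <xy>, support of <yx>, support of <(xy)>)."""
    if row == col:
        return support([{row}, {row}])
    return (
        support([{col}, {row}]),
        support([{row}, {col}]),
        support([{row, col}]),
    )


print(s_matrix_cell("a", "a"))  # 2
print(s_matrix_cell("b", "a"))  # (4, 2, 2)
print(s_matrix_cell("c", "a"))  # (4, 2, 1)
```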

Scaling-up by Bi-level Projection
Partition the search space based on length-2 sequential patterns
Form projected databases only for length-2 sequential patterns, and pursue recursive mining over these bi-level projected databases

Mining <ab>-projected Database
In the S-matrix of SDB (length-1 sequential patterns <a>, <b>, <c>, <d>, <e>, <f>), the entry in row b, column a is (4, 2, 2), so <ab> has support 4
The <ab>-projected database has local length-1 sequential patterns <a>, <c>, <(_c)>
S-matrix of the <ab>-projected database:

            a          c        (_c)
  a         0
  c     (1, 0, 1)      1
  (_c)  (∅, 2, ∅)  (∅, 1, ∅)    ∅

∅ marks combinations that are not counted: there is no hope to form (_ac), so no need to count it
The entry 2 in row (_c), column a leads to pattern <a(bc)a>

Benefits of Bi-level Projection
More patterns are found in each pass
Far fewer projections
  In the example, there are 53 patterns
  Level-by-level projection: 53 projections
  Bi-level projection: 22 projections

3-way Apriori Checking
In the S-matrix of SDB, the entry in row d, column c is (1, 3, 0): <cd> has support 1 and <(cd)> has support 0
  <acd> cannot be a pattern!
  <a(cd)> cannot be a pattern!
  Exclude d from the <ac>-projected database
Use the Apriori heuristic to prune items in projected databases
Absorb the goodness of Apriori-like algorithms

Speed-up by Pseudo-projection
Major cost of PrefixSpan: projection
  Postfixes of sequences often appear repeatedly in recursive projected databases
When the (projected) database fits in memory, use pointers to form projections
  Pointer to the sequence
  Offset of the postfix
Example: for s = <a(abc)(ac)d(cf)>, s|<a> is (pointer to s, 2) and s|<ab> is (pointer to s, 4)
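A minimal sketch of the pointer-and-offset idea, under simplifying assumptions: every element is a single item (sequences are strings), the pointer is just an index into an in-memory list, and offsets are 0-based rather than the 1-based offsets on the slide. The toy database and the name `project` are hypothetical.

```python
def project(db, projection, item):
    """Extend a pseudo-projection by one item.
    `projection` is a list of (sequence index, offset) pairs; the postfix of
    each entry is db[index][offset:], which is never copied."""
    result = []
    for seq_index, offset in projection:
        pos = db[seq_index].find(item, offset)  # first occurrence at or after offset
        if pos != -1:
            result.append((seq_index, pos + 1))  # postfix starts right after the item
    return result


toy_db = ["abcb", "acb", "bab"]                # hypothetical data
whole_db = [(i, 0) for i in range(len(toy_db))]

proj_a = project(toy_db, whole_db, "a")
print(proj_a)    # [(0, 1), (1, 1), (2, 2)]

proj_ac = project(toy_db, proj_a, "c")
print(proj_ac)   # [(0, 3), (1, 2)]

# Postfixes are materialized only on demand, e.g. for display.
print([toy_db[i][off:] for i, off in proj_ac])  # ['b', 'b']
```

Each projected database costs two integers per sequence instead of a copied postfix, which is the saving the slide refers to when the data fits in memory.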

Pseudo-Projection vs. Physical Projection
Pseudo-projection avoids physically copying postfixes
  Efficient when the database fits in main memory
  Not efficient when the database cannot fit in main memory
    Disk-based random access is very costly
Suggested approach:
  Integrate physical and pseudo-projection
  Swap to pseudo-projection when the data set fits in memory

Seeing is Believing: Experiments and Performance Analysis
Comparing PrefixSpan with GSP and FreeSpan in large databases
  GSP (IBM Almaden, Srikant & Agrawal, EDBT'96)
  FreeSpan (J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, M.-C. Hsu, KDD'00)
  PrefixSpan-1 (single-level projection)
  PrefixSpan-2 (bi-level projection)
Comparing effects of pseudo-projection
Comparing I/O cost and scalability

PrefixSpan Is Faster Than GSP and FreeSpan

Effect of Pseudo-Projection

I/O Cost: When It Cannot Fit in Memory

Scalability (When DB Is Large)

Major Features of PrefixSpan
Both PrefixSpan and FreeSpan are pattern-growth methods
  Searches are more focused and thus efficient
  Prefix-projected pattern growth (PrefixSpan) is more elegant than frequent pattern-guided projection (FreeSpan)
The Apriori heuristic is integrated into bi-level projection PrefixSpan
Pseudo-projection substantially enhances the performance of memory-based processing

Regular Expression Constraints
Constraints in the form of an automaton
Deterministic finite automaton for the regular expression a*(bb|bcd|dd)
(Figure: state diagram of the automaton, with transitions labeled a, b, c, d.)

PrefixSpan for Constrained Mining
Any prefix failing an RE-constraint cannot lead to a valid pattern
  Prune invalid patterns immediately
Only grow prefixes satisfying the RE-constraint
  Only project items that can still appear in the remainder of the RE
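The pruning rule can be illustrated with a hand-built automaton for the regular expression a*(bb|bcd|dd) from the previous slide. This is a sketch under assumptions: patterns are treated as strings of single items matched from the start of the expression, and the state names and helper functions are invented for the example.

```python
# Transition table of a DFA for a*(bb|bcd|dd); missing entries mean "reject".
DFA = {
    ("start", "a"): "start",
    ("start", "b"): "seen_b",
    ("start", "d"): "seen_d",
    ("seen_b", "b"): "accept",
    ("seen_b", "c"): "seen_bc",
    ("seen_bc", "d"): "accept",
    ("seen_d", "d"): "accept",
}


def run(prefix):
    """Return the state reached after reading `prefix`, or None if it dies."""
    state = "start"
    for item in prefix:
        state = DFA.get((state, item))
        if state is None:
            return None
    return state


def is_viable(prefix):
    """A prefix is worth growing only if the automaton is still alive."""
    return run(prefix) is not None


def is_valid(pattern):
    """A pattern satisfies the constraint if the automaton ends in `accept`."""
    return run(pattern) == "accept"


def allowed_next_items(prefix):
    """Items that can still extend `prefix`; only these need to be projected."""
    state = run(prefix)
    return sorted(item for (s, item) in DFA if s == state)


print(is_viable("aab"), is_valid("aab"))  # True False
print(is_valid("aabcd"))                  # True
print(is_viable("ac"))                    # False
print(allowed_next_items("ab"))           # ['b', 'c']
```

In a pattern-growth miner, `is_viable` would gate each recursive call and `allowed_next_items` would restrict which local items are counted in the projected database.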

Conclusions
PrefixSpan: an efficient sequential pattern mining method
General idea: examine only the prefixes and project only their corresponding postfixes
  Two kinds of projections: level-by-level and bi-level
  Pseudo-projection
Extending PrefixSpan to mine with RE-constraints
  Prune invalid prefixes immediately

References (1)
R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94.
R. Agrawal and R. Srikant. Mining sequential patterns. ICDE'95.
C. Bettini, X. S. Wang, and S. Jajodia. Mining temporal relationships with multiple granularities in time sequences. Data Engineering Bulletin, 21:32-38.
M. Garofalakis, R. Rastogi, and K. Shim. SPIRIT: Sequential pattern mining with regular expression constraints. VLDB'99.
J. Han, G. Dong, and Y. Yin. Efficient mining of partial periodic patterns in time series database. ICDE'99.
J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.-C. Hsu. FreeSpan: Frequent pattern-projected sequential pattern mining. KDD'00.

References (2)
J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD'00.
H. Lu, J. Han, and L. Feng. Stock movement and n-dimensional intertransaction association rules. DMKD'98, pages 12:1-12:7.
H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1.
B. Özden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. ICDE'98.
J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. ICDE'01.
R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT'96, pages 3-17.