Mining Sequential Patterns © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 Slides are adapted from Introduction to Data Mining by Tan, Steinbach,

Slides:



Advertisements
Similar presentations
Mining Sequence Data. © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Sequence Data ObjectTimestampEvents A102, 3, 5 A206, 1 A231 B114,
Advertisements

1 of 25 1 of 45 Association Rule Mining CIT366: Data Mining & Data Warehousing Instructor: Bajuna Salehe The Institute of Finance Management: Computing.
Data Mining Association Analysis: Basic Concepts and Algorithms
Association Analysis. Association Rule Mining: Definition Given a set of records each of which contain some number of items from a given collection; –Produce.
Chapter 5: Mining Frequent Patterns, Association and Correlations
Data Mining Association Analysis: Basic Concepts and Algorithms Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Generalized Sequential Pattern (GSP) Step 1: – Make the first pass over the sequence database D to yield all the 1-element frequent sequences Step 2: Repeat.
Multi-dimensional Sequential Pattern Mining
Data Mining Association Analysis: Basic Concepts and Algorithms
Sequential Pattern Mining
Data Mining Association Analysis: Basic Concepts and Algorithms
Sequence Databases & Sequential Patterns
Business Systems Intelligence: 4. Mining Association Rules Dr. Brian Mac Namee (
Association Analysis: Basic Concepts and Algorithms.
Data Mining Association Analysis: Basic Concepts and Algorithms
Data Mining: Concepts and Techniques 1 Mining Sequence Patterns in Transactional Databases CS240B --UCLA Notes by Carlo Zaniolo Based on those by J. Han.
The UNIVERSITY of Kansas EECS 800 Research Seminar Mining Biological Data Instructor: Luke Huan Fall, 2006.
Data Mining Association Rules: Advanced Concepts and Algorithms
Mining Sequences. Examples of Sequence Web sequence:  {Homepage} {Electronics} {Digital Cameras} {Canon Digital Camera} {Shopping Cart} {Order Confirmation}
Data Mining Association Rules: Advanced Concepts and Algorithms Lecture Notes for Chapter 7 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Pattern-growth Methods for Sequential Pattern Mining: Principles and Extensions Jiawei Han (UIUC) Jian Pei (Simon Fraser Univ.)
Mining Sequential Patterns: Generalizations and Performance Improvements R. Srikant R. Agrawal IBM Almaden Research Center Advisor: Dr. Hsu Presented by:
Data Mining Association Rules: Advanced Concepts and Algorithms
A Short Introduction to Sequential Data Mining
What Is Sequential Pattern Mining?
Advanced Association Rule Mining and Beyond. Continuous and Categorical Attributes Example of Association Rule: {Number of Pages  [5,10)  (Browser=Mozilla)}
October 2, 2015 Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 8 — 8.3 Mining sequence patterns in transactional.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining By Tan, Steinbach, Kumar Lecture.
1 Multi-dimensional Sequential Pattern Mining Helen Pinto, Jiawei Han, Jian Pei, Ke Wang, Qiming Chen, Umeshwar Dayal ~From: 10th ACM Intednational Conference.
Mining Sequential Patterns Rakesh Agrawal Ramakrishnan Srikant Proc. of the Int ’ l Conference on Data Engineering (ICDE) March 1995 Presenter: Sam Brown.
Modul 8: Sequential Pattern Mining. Terminology  Item  Itemset  Sequence (Customer-sequence)  Subsequence  Support for a sequence  Large/frequent.
Data Mining Association Rules: Advanced Concepts and Algorithms Lecture Notes for Chapter 7 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Data Mining Association Rules: Advanced Concepts and Algorithms
Data Mining Association Analysis Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/
Data Mining Association Rules: Advanced Concepts and Algorithms Lecture Notes for Chapter 7 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Modul 8: Sequential Pattern Mining
Mining Association Rules: Advanced Concepts and Algorithms Lecture Notes for Chapter 7 By Gun Ho Lee Intelligent Information Systems Lab.
Data & Text Mining1 Introduction to Association Analysis Zhangxi Lin ISQS 3358 Texas Tech University.
Pattern-Growth Methods for Sequential Pattern Mining Iris Zhang
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Sequential Pattern Mining COMP Seminar BCB 713 Module Spring 2011.
Lecture 11 Sequential Pattern Mining MW 4:00PM-5:15PM Dr. Jianjun Hu CSCE822 Data Mining and Warehousing University.
Sequential Pattern Mining
Jian Pei Jiawei Han Behzad Mortazavi-Asl Helen Pinto ICDE’01
CSE4334/5334 DATA MINING CSE4334/5334 Data Mining, Fall 2014 Department of Computer Science and Engineering, University of Texas at Arlington Chengkai.
Data Mining Association Rules: Advanced Concepts and Algorithms Lecture Notes Introduction to Data Mining by Tan, Steinbach, Kumar.
Data Mining Association Rules: Advanced Concepts and Algorithms
1 1 MSCIT 5210: Knowledge Discovery and Data Mining Acknowledgement: Slides modified by Dr. Lei Chen based on the slides provided by Tan, Steinbach, Kumar.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining: Association Analysis This lecture node is modified based on Lecture Notes for.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
CS685 : Special Topics in Data Mining, UKY The UNIVERSITY of KENTUCKY Association Rule Mining CS 685: Special Topics in Data Mining Jinze Liu.
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Association Rule Mining COMP Seminar BCB 713 Module Spring 2011.
Sequential Pattern Mining
Data Mining Association Rules: Advanced Concepts and Algorithms
Association rule mining
Frequent Pattern Mining
Advanced Pattern Mining 02
Data Mining: Concepts and Techniques
Data Mining Association Analysis: Basic Concepts and Algorithms
Data Mining Association Analysis: Basic Concepts and Algorithms
Mining Sequential Patterns
Data Mining Association Rules: Advanced Concepts and Algorithms
Data Mining Association Analysis: Basic Concepts and Algorithms
Data Warehousing Mining & BI
Association Rule Mining
Presentation transcript:

Mining Sequential Patterns © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 Slides are adapted from Introduction to Data Mining by Tan, Steinbach, Kumar

Sequence Data ObjectTimestampEvents A102, 3, 5 A206, 1 A231 B114, 5, 6 B172 B217, 8, 1, 2 B281, 6 C141, 8, 7 Sequence Database:

Examples of Sequence Data Sequence Database SequenceElement (Transaction)Event (Item) CustomerPurchase history of a given customer A set of items bought by a customer at time t Books, diary products, CDs, etc Web DataBrowsing activity of a particular Web visitor A collection of files viewed by a Web visitor after a single mouse click Home page, index page, contact info, etc Event dataHistory of events generated by a given sensor Events triggered by a sensor at time t Types of alarms generated by sensors Genome sequences DNA sequence of a particular species An element of the DNA sequence Bases A,T,G,C Sequence E1 E2 E1 E3 E2 E3 E4 E2 Element (Transaction) Event (Item)

Formal Definition of a Sequence l A sequence is an ordered list of elements (transactions) s = –Each element contains a collection of events (items) e i = {i 1, i 2, …, i k } –Each element is attributed to a specific time or location l Length of a sequence, |s|, is given by the number of elements of the sequence l A k-sequence is a sequence that contains k events (items)

Examples of Sequence l Web sequence l Sequence of initiating events causing the nuclear accident at 3-mile Island: l Sequence of books checked out at a library:

Formal Definition of a Subsequence l A sequence is contained in another sequence (m ≥ n) if there exist integers i 1 < i 2 < … < i n such that a 1  b i1, a 2  b i2, …, a n  b in l The support of a subsequence w is defined as the fraction of data sequences that contain w l A sequential pattern is a frequent subsequence (i.e., a subsequence whose support is ≥ minsup) Data sequenceSubsequenceContain? Yes No Yes

Sequential Pattern Mining: Definition l Given: –a database of sequences –a user-specified minimum support threshold, minsup l Task: –Find all subsequences with support ≥ minsup

What Is Sequential Pattern Mining? l Given a set of sequences, find the complete set of frequent subsequences A sequence database A sequence : An element may contain a set of items. Items within an element are unordered and we list them alphabetically. is a subsequence of Given support threshold min_sup =2, is a sequential pattern SIDsequence

Challenges on Sequential Pattern Mining l A huge number of possible sequential patterns are hidden in databases l A mining algorithm should –find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold –be highly efficient, scalable, involving only a small number of database scans –be able to incorporate various kinds of user-specific constraints

Sequential Pattern Mining: Example Minsup = 50% Examples of Frequent Subsequences: s=60% s=80% s=60%

Studies on Sequential Pattern Mining l Concept introduction and an initial Apriori-like algorithm –R. Agrawal & R. Srikant. “Mining sequential patterns,” ICDE’95 l GSP—An Apriori-based, influential mining method (developed at IBM Almaden) –R. Srikant & R. Agrawal. “Mining sequential patterns: Generalizations and performance improvements,” EDBT’96 l From sequential patterns to episodes (Apriori-like + constraints) –H. Mannila, H. Toivonen & A.I. Verkamo. “Discovery of frequent episodes in event sequences,” Data Mining and Knowledge Discovery, 1997 l Mining sequential patterns with constraints –M.N. Garofalakis, R. Rastogi, K. Shim: SPIRIT: Sequential Pattern Mining with Regular Expression Constraints. VLDB 1999

Extracting Sequential Patterns l Given n events: i 1, i 2, i 3, …, i n l Candidate 1-subsequences:,,, …, l Candidate 2-subsequences:,, …,,, …, l Candidate 3-subsequences:,, …,,, …,,, …,,, …

A Basic Property of Sequential Patterns: Apriori l A basic property: Apriori (Agrawal & Sirkant’94) –If a sequence S is not frequent –Then none of the super-sequences of S is frequent –E.g, is infrequent  so do and SequenceSeq. ID Given support threshold min_sup =2

Generalized Sequential Pattern (GSP) l Step 1: –Make the first pass over the sequence database D to yield all the 1- element frequent sequences l Step 2: Repeat until no new frequent sequences are found –Candidate Generation:  Merge pairs of frequent subsequences found in the (k-1)th pass to generate candidate sequences that contain k items –Candidate Pruning:  Prune candidate k-sequences that contain infrequent (k-1)-subsequences –Support Counting:  Make a new pass over the sequence database D to find the support for these candidate sequences –Candidate Elimination:  Eliminate candidate k-sequences whose actual support is less than minsup

Finding Length-1 Sequential Patterns l Examine GSP using an example l Initial candidates: all singleton sequences –,,,,,,, l Scan database once, count support for candidates SequenceSeq. ID min_sup =2 CandSup

Candidate Generation l Base case (k=2): –Merging two frequent 1-sequences and will produce two candidate 2-sequences: and l General case (k>2): –A frequent (k-1)-sequence w 1 is merged with another frequent (k-1)-sequence w 2 to produce a candidate k-sequence if the subsequence obtained by removing the first event in w 1 is the same as the subsequence obtained by removing the last event in w 2  The resulting candidate after merging is given by the sequence w 1 extended with the last event of w 2. –If the last two events in w 2 belong to the same element, then the last event in w 2 becomes part of the last element in w 1 –Otherwise, the last event in w 2 becomes a separate element appended to the end of w 1

Candidate Generation Examples l Merging the sequences w 1 = and w 2 = will produce the candidate sequence because the last two events in w 2 (4 and 5) belong to the same element l Merging the sequences w 1 = and w 2 = will produce the candidate sequence because the last two events in w 2 (4 and 5) do not belong to the same element l We do not have to merge the sequences w 1 = and w 2 = to produce the candidate because if the latter is a viable candidate, then it can be obtained by merging w 1 with

GSP Example

Generating Length-2 Candidates 51 length-2 Candidates Without Apriori property, 8*8+8*7/2=92 candidates Apriori prunes 44.57% candidates

Finding Length-2 Sequential Patterns l Scan database one more time, collect support count for each length-2 candidate l There are 19 length-2 candidates which pass the minimum support threshold –They are length-2 sequential patterns

Generating Length-3 Candidates and Finding Length-3 Patterns l Generate Length-3 Candidates –Self-join length-2 sequential patterns  Based on the Apriori property , and are all length-2 sequential patterns  is a length-3 candidate –46 candidates are generated l Find Length-3 Sequential Patterns –Scan database once more, collect support counts for candidates –19 out of 46 candidates pass support threshold

The GSP Mining Process … … … … 1 st scan: 8 cand. 6 length-1 seq. pat. 2 nd scan: 51 cand. 19 length-2 seq. pat. 10 cand. not in DB at all 3 rd scan: 46 cand. 19 length-3 seq. pat. 20 cand. not in DB at all 4 th scan: 8 cand. 6 length-4 seq. pat. 5 th scan: 1 cand. 1 length-5 seq. pat. Cand. cannot pass sup. threshold Cand. not in DB at all SequenceSeq. ID min_sup =2

The GSP Algorithm l Take sequences in form of as length-1 candidates l Scan database once, find F 1, the set of length-1 sequential patterns l Let k=1; while F k is not empty do –Form C k+1, the set of length-(k+1) candidates from F k ; –If C k+1 is not empty, scan database once, find F k+1, the set of length-(k+1) sequential patterns –Let k=k+1;

Bottlenecks of GSP l A huge set of candidates could be generated –1,000 frequent length-1 sequences generate length-2 candidates! l Multiple scans of database in mining l Real challenge: mining long sequential patterns –An exponential number of short candidates –A length-100 sequential pattern needs candidate sequences!