Mining Sequential Patterns

Slides:

Advertisements

Similar presentations

Association Rule and Sequential Pattern Mining for Episode Extraction Jonathan Yip.

Advertisements

Frequent Itemset Mining Methods. The Apriori algorithm Finding frequent itemsets using candidate generation Seminal algorithm proposed by R. Agrawal and.

Data Mining Techniques Association Rule

Data Mining (Apriori Algorithm)DCS 802, Spring DCS 802 Data Mining Apriori Algorithm Spring of 2002 Prof. Sung-Hyuk Cha School of Computer Science.

Mining Frequent Patterns II: Mining Sequential & Navigational Patterns Bamshad Mobasher DePaul University Bamshad Mobasher DePaul University.

Data e Web Mining Paolo Gobbo

Mining Multiple-level Association Rules in Large Databases

LOGO Association Rule Lecturer: Dr. Bo Yuan

IT 433 Data Warehousing and Data Mining Association Rules Assist.Prof.Songül Albayrak Yıldız Technical University Computer Engineering Department

10 -1 Lecture 10 Association Rules Mining Topics –Basics –Mining Frequent Patterns –Mining Frequent Sequential Patterns –Applications.

Mining Sequential Patterns Authors: Rakesh Agrawal and Ramakrishnan Srikant. Presenter: Jeremy Dalmer.

Association Rules l Mining Association Rules between Sets of Items in Large Databases (R. Agrawal, T. Imielinski & A. Swami) l Fast Algorithms for.

Rakesh Agrawal Ramakrishnan Srikant

4/3/01CS632 - Data Mining1 Data Mining Presented By: Kevin Seng.

Mining Sequential Patterns Rakesh Agrawal Ramakrishnan Srikant Proc. of the Int’l Conference on Data Engineering (ICDE) March 1995 Presenter: Phil Schlosser.

ACCTG 6910 Building Enterprise & Business Intelligence Systems (e.bis)

Efficient Data Mining for Path Traversal Patterns CS401 Paper Presentation Chaoqiang chen Guang Xu.

Association Rule Mining (Some material adapted from: Mining Sequential Patterns by Karuna Pande Joshi)‏

2/8/00CSE 711 data mining: Apriori Algorithm by S. Cha 1 CSE 711 Seminar on Data Mining: Apriori Algorithm By Sung-Hyuk Cha.

Fast Algorithms for Association Rule Mining

Mining Association Rules

Mining Sequences. Examples of Sequence Web sequence:  {Homepage} {Electronics} {Digital Cameras} {Canon Digital Camera} {Shopping Cart} {Order Confirmation}

1 Fast Algorithms for Mining Association Rules Rakesh Agrawal Ramakrishnan Srikant Slides from Ofer Pasternak.

『 Data Mining 』 By Jung, hae-sun. 1.Introduction 2.Definition 3.Data Mining Applications 4.Data Mining Tasks 5. Overview of the System 6. Data Mining.

Mining Association Rules between Sets of Items in Large Databases presented by Zhuang Wang.

Mining Sequential Patterns: Generalizations and Performance Improvements R. Srikant R. Agrawal IBM Almaden Research Center Advisor: Dr. Hsu Presented by:

USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM.

Data Mining Techniques Sequential Patterns. Sequential Pattern Mining Progress in bar-code technology has made it possible for retail organizations to.

Ch5 Mining Frequent Patterns, Associations, and Correlations

Secure Incremental Maintenance of Distributed Association Rules.

Mining Sequential Patterns Rakesh Agrawal Ramakrishnan Srikant Proc. of the Int ’ l Conference on Data Engineering (ICDE) March 1995 Presenter: Sam Brown.

Modul 8: Sequential Pattern Mining. Terminology  Item  Itemset  Sequence (Customer-sequence)  Subsequence  Support for a sequence  Large/frequent.

Modul 8: Sequential Pattern Mining

Pattern-Growth Methods for Sequential Pattern Mining Iris Zhang

Fast Algorithms For Mining Association Rules By Rakesh Agrawal and R. Srikant Presented By: Chirayu Modi.

CS 8751 ML & KDDSupport Vector Machines1 Mining Association Rules KDD from a DBMS point of view –The importance of efficiency Market basket analysis Association.

Fast Algorithms for Mining Association Rules Rakesh Agrawal and Ramakrishnan Srikant VLDB '94 presented by kurt partridge cse 590db oct 4, 1999.

1 Efficient Algorithms for Incremental Update of Frequent Sequences Minghua ZHANG Dec. 7, 2001.

CMU SCS : Multimedia Databases and Data Mining Lecture #30: Data Mining - assoc. rules C. Faloutsos.

Course on Data Mining: Seminar Meetings Page 1/30 Course on Data Mining ( ): Seminar Meetings Ass. Rules EpisodesEpisodes Text Mining

Mining Sequential Patterns © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 Slides are adapted from Introduction to Data Mining by Tan, Steinbach,

Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,

A Scalable Association Rules Mining Algorithm Based on Sorting, Indexing and Trimming Chuang-Kai Chiou, Judy C. R Tseng Proceedings of the Sixth International.

CS685 : Special Topics in Data Mining, UKY The UNIVERSITY of KENTUCKY Association Rule Mining CS 685: Special Topics in Data Mining Jinze Liu.

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Association Rule Mining COMP Seminar BCB 713 Module Spring 2011.

Rakesh Agrawal Ramakrishnan Srikant Proc. of the Int ’ l Conference on Data Engineering (ICDE) March 1995 Presenter: İlkcan Keleş.

Market Basket Analysis

Data Mining Find information from data data ? information.

Spring 2016 Presentation by: Julianne Daly

On the Discovery of Interesting Patterns in Association Rules

Association rule mining

Association Rules Repoussis Panagiotis.

Frequent Pattern Mining

Byung Joon Park, Sung Hee Kim

Mining Sequential Patterns

Market Basket Analysis and Association Rules

Market Basket Many-to-many relationship between different objects

Mining Sequential Patterns

DIRECT HASHING AND PRUNING (DHP) ALGORITHM

Data Mining Association Rules Assoc.Prof.Songül Varlı Albayrak

Association Rule Mining

A Parameterised Algorithm for Mining Association Rules

Mining Access Pattrens Efficiently from Web Logs Jian Pei, Jiawei Han, Behzad Mortazavi-asl, and Hua Zhu 2000년 5월 26일 DE Lab. 윤지영.

Association Rule Mining

Farzaneh Mirzazadeh Fall 2007

COMP5331 FP-Tree Prepared by Raymond Wong Presented by Raymond Wong

Market Basket Analysis and Association Rules

Mining Path Traversal Patterns with User Interaction for Query Recommendation 龚赛赛

15-826: Multimedia Databases and Data Mining

Fast Algorithms for Mining Association Rules

Presentation transcript:

Mining Sequential Patterns Rakesh Agrawal Ramakrishnan Srikant Proc. of the Int’l Conference on Data Engineering (ICDE) March 1995 Presenter: Qiang Jing

Mining Sequential Patterns Introduction Bar code data allows us to store massive amounts of sales data (basket data). We would like to extract sequential buying patterns. E.g. Video rental: Star Wars; Empire Strikes Back; Return of the Jedi. One element of the sequence can include multiple items. E.g. Sheets and pillow cases; comforter; drapes and ruffles. January 16, 2019 Mining Sequential Patterns

Mining Sequential Patterns Introduction Compare to Association Rule Association Rule concerns about what items are bought together (at the same time) Intra-transaction patterns Sequential Pattern concerns about what items are bought at different times Inter-transaction patterns January 16, 2019 Mining Sequential Patterns

Mining Sequential Patterns Outline Introduction Problem Statement Finding Sequential Patterns (The Main Algorithm) Sequence Phase (AprioriAll, AprioriSome, DynamicSome) Conclusion (Final exam questions) January 16, 2019 Mining Sequential Patterns

Mining Sequential Patterns Problem Statement Given a database of transactions: Each transaction: Customer ID Transaction Time Set of items purchased * No two transactions have same customer ID and transaction time * Don’t consider quantity of items purchased. Item was either purchased or not purchased January 16, 2019 Mining Sequential Patterns

Problem Statement (terms) Itemset: non-empty set of items. Each itemset is mapped to an integer. Sequence: Ordered list of itemsets. Customer Sequence: List of customer transactions ordered by increasing transaction time. A customer supports a sequence if the sequence is contained in the customer-sequence. Support for a Sequence: Fraction of total customers that support a sequence. Maximal Sequence: A sequence that is not contained in any other sequence. Large Sequence: Sequence that meets minisup. January 16, 2019 Mining Sequential Patterns

Example Customer ID Transaction Time Items Bought 1 June 25 '93 30 June 30 '93 90 2 June 10 '93 10,20 June 15 '93 June 20 '93 40,60,70 3 30,50,70 4 40,70 July 25 '93 5 June 12 '93 Customer ID Customer Sequence 1 < (30) (90) > 2 < (10 20) (30) (40 60 70) > 3 < (30) (50) (70) > 4 < (30) (40 70) (90) > 5 < (90) > Maximal seq with support > 40% < (30) (90) > < (30) (40 70) > Note: Use Minisup of 40%, no less than two customers must support the sequence < (10 20) (30) > Does not have enough support (Only by Customer #2) < (30) >, < (70) >, < (30) (40) > … are not maximal. January 16, 2019 Mining Sequential Patterns

Mining Sequential Patterns The Algorithm Sort Phase Litemset Phase Transformation Phase Sequence Phase Maximal Phase January 16, 2019 Mining Sequential Patterns

Mining Sequential Patterns Sort Phase Sort the database: Customer ID as the major key. Transaction-Time as the minor key. Convert the original transaction DB into a customer sequence DB. Customer ID Transaction Time Items Bought 1 June 25 '93 30 June 30 '93 90 2 June 10 '93 10,20 June 15 '93 June 20 '93 40,60,70 3 30,50,70 4 40,70 July 25 '93 5 June 12 '93 January 16, 2019 Mining Sequential Patterns

Mining Sequential Patterns Litemset Phase Litemset (Large Itemset): Supported by fraction of customers greater than minisup. Customer ID Transaction Time Items Bought 1 June 25 '93 30 June 30 '93 90 2 June 10 '93 10,20 June 15 '93 June 20 '93 40,60,70 3 30,50,70 4 40,70 July 25 '93 5 June 12 '93 Large Itemsets Mapped To (30) 1 (40) 2 (70) 3 (40 70) 4 (90) 5 *Reason of mapping: treating litemsets as single entities - Compare two litemsets in constant time - Reduce the time to check if a sequence is contained in a customer sequence January 16, 2019 Mining Sequential Patterns

Transformation Phase Replace each transaction with all litemsets contained in the transaction. Transactions with no litemsets are dropped. (But empty customer sequences still contribute to the total count) Cust ID Original Cust Sequence Transformed Customer Sequence After Mapping 1 < (30) (90) > <{(30)} {(90)}> <{1} {5}> 2 < (10 20) (30) (40 60 70) > <{(30)} {(40),(70),(40 70)}> <{1} {2,3,4}> 3 < (30) (50) (70) > <{(30),(70)}> <{1,3}> 4 < (30) (40 70) (90) > <{(30)} {(40),(70),(40 70)} {(90)}> <{1} {2,3,4} {5}> 5 < (90) > <{(90)}> <{5}> Note: (10 20) dropped because of lack of support. (40 60 70) replaced with set of litemsets {(40),(70),(40 70)} (60 does not have minisup) January 16, 2019 Mining Sequential Patterns

Mining Sequential Patterns Sequence Phase Finds Large Sequences: AprioriAll AprioriSome DynamicSome January 16, 2019 Mining Sequential Patterns

Mining Sequential Patterns Maximal Phase Find maximal sequences among large sequences. k-sequence: sequence of length k S: the set of all large sequences for (k=n; k>1; k--) do for each k-sequence sk do Delete from S all subsequences of sk Authors claim data-structures and an algorithm exist to do this efficiently. (hash trees) January 16, 2019 Mining Sequential Patterns

Sequence Phase Sequence phase finds the large sequences. seed set of sequences candidate sequences scan data to find sup for candidates large sequences January 16, 2019 Mining Sequential Patterns

Mining Sequential Patterns AprioriAll Based on the normal Apriori algorithm Counts all the large sequences Prunes non-maximal sequences in the "Maximal phase" January 16, 2019 Mining Sequential Patterns

Mining Sequential Patterns AprioriAll L1 = {large 1-sequences} for (k = 2; Lk-1 ≠ {}; k++) do begin Ck = New candidates generated from Lk-1 foreach customer-sequence c in the database do Increment the count of all candidates in Ck that are contained in c. Lk = Candidates in Ck with minimum support. end Answer = Maximal Sequences in UkLk Notation: Lk: Set of all large k-sequences Ck: Set of candidate k-sequences January 16, 2019 Mining Sequential Patterns

Mining Sequential Patterns AprioriAll (Example) Candidate generation Join Lk-1 with itself to form Ck Delete c in Ck such that some (k-1)-sub sequence of c is not in Lk-1 L3 3-Seq <1 2 3> <1 2 4> <1 3 4> <1 3 5> <2 3 4> C4: after join 4-Seq <1 2 3 4> <1 2 4 3> <1 3 4 5> <1 3 5 4> C4: after pruning 4-Seq <1 2 3 4> January 16, 2019 Mining Sequential Patterns

Mining Sequential Patterns AprioriAll (Example) L1 1-Seq Sup <1> 4 <2> 2 <3> <4> <5> L2 2-Seq Sup <1 2> 2 <1 3> 4 <1 4> 3 <1 5> <2 3> <2 4> <3 4> <3 5> <4 5> L3 3-Seq Sup <1 2 3> 2 <1 2 4> <1 3 4> 3 <1 3 5> <2 3 4> Minisup = 40% Cust Sequences <{1 5} {2} {3} {4} > <{1} {3} {4} {3 5}> <{1} {2} {3} {4}> <{1} {3} {5}> <{4} {5}> <3 4 5> 1 L4 4-Seq Sup <1 2 3 4> 2 Answer: <1 2 3 4>, <1 3 5>, <4 5> <2 5> January 16, 2019 Mining Sequential Patterns

Mining Sequential Patterns AprioriSome Avoid counting sequences that are contained in longer sequences by counting the longer ones first, also avoid having to count many subsequences because their super-sequences are not large Forward phase: find all large sequences of certain lengths Backward phase: find all remaining large sequences January 16, 2019 Mining Sequential Patterns

AprioriSome (Forward Phase) L1 = {large 1-sequences} C1 = L1 last = 1 for (k = 2; Ck-1 ≠ {} and Llast ≠ {}; k++) do begin if (Lk-1 known) then Ck = New candidates generated from Lk-1 else Ck = New candidates generated from Ck-1 if (k==next(last)) then begin // (next k to count?) foreach customer-sequence c in the database do Increment the count of all candidates in Ck that are contained in c. Lk = Candidates in Ck with minimum support. last = k; end January 16, 2019 Mining Sequential Patterns

AprioriSome (Backward Phase) for (k--; k>=1; k--) do if (Lk not found in forward phase) then begin Delete all sequences in Ck contained in some Li i>k; foreach customer-sequence c in DT do Increment the count of all candidates in Ck that are contained in c Lk = Candidates in Ck with minimum support end else // lk already known Answer = UkLk //(Maximal Phase not Needed) *Notation: DT; Transformed database January 16, 2019 Mining Sequential Patterns

AprioriSome (Example) 3-Seq Sup <1 2 3> 2 <1 2 4> <1 3 4> 3 <1 3 5> <2 3 4> L2 2-Seq Sup <1 2> 2 <1 3> 4 <1 4> 3 <1 5> <2 3> <2 4> <3 4> <3 5> <4 5> C3 3-Seq <1 2 3> <1 2 4> <1 3 4> <1 3 5> <2 3 4> <3 4 5> C4 4-Seq <1 2 3 4> L4 4-Seq Sup <1 2 3 4> 2 January 16, 2019 Mining Sequential Patterns

Mining Sequential Patterns AprioriSome Function "next" determines the next sequence length which is counted: this is based on the assumption that if, e.g, almost all sequences of length k are large (frequent), then many of the sequences of length k+1 are also large (frequent) Most of the sequences are large (>85%) => next round is k+5 ... Not many of the sequences are large (<67%) => next round is k+1 (AprioriAll) January 16, 2019 Mining Sequential Patterns

Mining Sequential Patterns DynamicSome Similar to AprioriSome AprioriSome generates Ck from Ck-1 DynamicSome generates Ck “on the fly” based on large sequences found from the previous passes and the customer sequences read from the database January 16, 2019 Mining Sequential Patterns

Mining Sequential Patterns DynamicSome In the initialization phase, count only sequences up to and including step variable length If step is 3, count sequences of length 1, 2 and 3 In the forward phase, we generate sequences of length 2 × step, 3 × step, 4 × step, etc. on-the-fly based on previous passes and customer sequences in the database While generating sequences of length 9 with a step size 3: While passing the data, if sequences s6  L6 and s3  L3 are both contained in the customer sequence c in hand, and they do not overlap in c, then  s3 . s6  is a candidate (3+6)-sequence January 16, 2019 Mining Sequential Patterns

Mining Sequential Patterns DynamicSome In the intermediate phase, generate the candidate sequences for the skipped lengths If we have counted L6 and L3 , and L9 turns out to be empty: we generate C7 and C8 , count C8 followed by C7 after deleting non-maximal sequences, and repeat the process for C4 and C5 The backward phase is identical to AprioriSome January 16, 2019 Mining Sequential Patterns

Mining Sequential Patterns Final Exam Question #1 What was the greatest hardware concern regarding the algorithms contained in the paper? Main memory capacity. When there is little main memory, or many potentially large sequences, the benefits of AprioriSome vanish. January 16, 2019 Mining Sequential Patterns

Mining Sequential Patterns Final Exam Question #2 There were two types of algorithms presented to find sequential patterns, CountSome and CountAll. What was the main difference between the two algorithms? CountAll (AprioriAll) is careful with respect to minimum support, careless with respect to maximality. CountSome (AprioriSome) is careful with respect to maximality, careless with respect to minimum support. January 16, 2019 Mining Sequential Patterns

Mining Sequential Patterns Final Exam Question #3 How did the two best sequence mining algorithms (AprioriAll and AprioriSome) perform compared with each other? Take into consideration memory, speed, and usefulness of the data. January 16, 2019 Mining Sequential Patterns

Mining Sequential Patterns Final Exam Question #3 Memory: In terms of main memory usage, AprioriAll is better. In terms of secondary storage access, AprioriSome is better. January 16, 2019 Mining Sequential Patterns

Mining Sequential Patterns Final Exam Question #3 Speed: With sufficient memory, as minimum support decreases the difference between AprioriAll and AprioriSome increases. (AprioriSome is better.) More large sequences not maximal are generated. January 16, 2019 Mining Sequential Patterns

Mining Sequential Patterns Final Exam Question #3 Usefulness of the data: For the problem of finding maximal large sequences, the answer is “Precisely the same”. However, AprioriAll finds all large sequences, while AprioriSome discards some large sequences that aren’t maximal. AprioriAll, then, generates more “useful” data. “The user may want to know the ratio of the number of people who bought the first k + 1 items in a sequence to the number of people who bought the first k items.” January 16, 2019 Mining Sequential Patterns