Course on Data Mining (581550-4): Seminar Meetings

Course topics: Association Rules, Episodes, Text Mining, Clustering, KDD Process, Home Exam
Seminar dates: 02.11., 09.11., 16.11., 23.11., 30.11.
Seminars by Mika (M) and Pirjo (P)

Today, 09.11.2001

- Rakesh Agrawal and Ramakrishnan Srikant: Mining Sequential Patterns. Int'l Conference on Data Engineering, 1995.
- F. Masseglia, P. Poncelet and M. Teisseire: Incremental Mining of Sequential Patterns in Large Databases. 16èmes Journées Bases de Données Avancées, 2000.

Mining Sequential Patterns

Rakesh Agrawal and Ramakrishnan Srikant, IBM Almaden Research Center, USA
Published in ICDE'95 (Int'l Conf. on Data Engineering)
Data Mining course, Autumn 2001, University of Helsinki
Summary by Mika Klemettinen

Problem statement:
- Database D with customer transactions
- Each transaction consists of a customer id, a transaction time, and the items purchased
- Quantities of items purchased are NOT considered

Definitions:
- Itemset: a non-empty set of items, (i1 i2 ... im)
- Sequence: an ordered list of itemsets, <s1 s2 ... sn>
- A sequence <a1 a2 ... an> is contained in <b1 b2 ... bm> if there exist integers i1 < i2 < ... < in such that a1 ⊆ b_i1, a2 ⊆ b_i2, ..., an ⊆ b_in
- E.g., <(3)(4 5)(8)> is contained in <(7)(3 8)(9)(4 5 6)(8)>, since (3) ⊆ (3 8), (4 5) ⊆ (4 5 6) and (8) ⊆ (8)
- However, note that the sequence <(3)(5)> is not contained in <(3 5)> (and vice versa)
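The containment definition can be checked mechanically. The following is a minimal sketch, assuming sequences are represented as Python lists of sets; the function name `is_contained` is hypothetical and not taken from the paper.

```python
def is_contained(a, b):
    """Return True if sequence a is contained in sequence b.

    Both sequences are lists of itemsets (Python sets).
    """
    pos = 0
    for itemset in a:
        # Greedily take the earliest itemset of b (at or after pos)
        # that is a superset of the current itemset of a.
        while pos < len(b) and not itemset <= b[pos]:
            pos += 1
        if pos == len(b):
            return False
        pos += 1  # the next itemset of a must match strictly later in b
    return True


a = [{3}, {4, 5}, {8}]
b = [{7}, {3, 8}, {9}, {4, 5, 6}, {8}]
print(is_contained(a, b))                  # True
print(is_contained([{3}, {5}], [{3, 5}]))  # False
print(is_contained([{3, 5}], [{3}, {5}]))  # False
```

Matching each itemset at the earliest possible position is sufficient here, because an earlier match never rules out a later one.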

- Customer sequence: the sequence of transactions ("shopping baskets") of a customer, ordered by transaction times T_i: <itemset(T_1) itemset(T_2) ... itemset(T_n)>
- A customer supports a sequence s if s is contained in the customer sequence of this customer
- The support for a sequence is defined as the fraction of all customers who support this sequence

Task:
- Given a database D of customer transactions, the problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain user-specified minimum support
- Each such maximal sequence represents a sequential pattern

Example database:

Customer Id   Transaction time   Items bought
1             June 25, 1993      30
1             June 30, 1993      90
2             June 10, 1993      10, 20
2             June 15, 1993      30
2             June 20, 1993      40, 60, 70
...           ...                ...

Customer Id   Customer sequence
1             <(30)(90)>
2             <(10 20)(30)(40 60 70)>
3             <(30 50 70)>
4             <(30)(40 70)(90)>
5             <(90)>

Min. support 25% => 2 customers: <(30)(90)> (supported by customers 1 and 4) and <(30)(40 70)> (supported by customers 2 and 4) are the maximal sequences
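As an illustration of the support definition, the sketch below recomputes the support of two sequences over the five customer sequences of the example. The data layout (a dict of lists of sets) and the helper names are assumptions made for this example only.

```python
def is_contained(a, b):
    """True if sequence a (list of sets) is contained in sequence b."""
    pos = 0
    for itemset in a:
        while pos < len(b) and not itemset <= b[pos]:
            pos += 1
        if pos == len(b):
            return False
        pos += 1
    return True


# Customer sequences from the example database.
customers = {
    1: [{30}, {90}],
    2: [{10, 20}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}],
}


def supporters(seq, customers):
    """Ids of the customers whose customer sequence contains seq."""
    return sorted(cid for cid, cs in customers.items() if is_contained(seq, cs))


def support(seq, customers):
    """Fraction of all customers who support seq."""
    return len(supporters(seq, customers)) / len(customers)


print(supporters([{30}, {90}], customers))      # [1, 4]
print(supporters([{30}, {40, 70}], customers))  # [2, 4]
print(support([{30}, {90}], customers))         # 0.4
```

Both sequences are supported by 2 of 5 customers (40%), which is above the 25% threshold.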

Definitions:
- The length of a sequence is the number of itemsets in the sequence
- A sequence of length k is called a k-sequence
- A sequence concatenated from sequences x and y is denoted by x.y
- The support for an itemset i is defined as the fraction of customers who bought the items in i in a single transaction
- An itemset with minimum support is called a large itemset or litemset
- Each itemset in a large sequence must have minimum support, i.e., any large sequence must be a list of litemsets (Apriori trick!)

Three algorithms, all for sequential patterns:
- AprioriSome
- AprioriAll
- DynamicSome

Mining of sequential patterns:

1. Sort Phase
- Sort the database by customer id and transaction time

2. Litemset Phase
- Find the large itemsets in an Apriori fashion, but, as in MaxFreq, increment the support count only once per customer even if the customer buys the same set of items in two different transactions
- The large itemsets are mapped to a set of contiguous integers (e.g., (30), (40), (70), (40 70) and (90) become 1, 2, 3, 4 and 5); checking equality is then fast (constant time)!
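The per-customer counting rule of the litemset phase can be sketched as follows. Note the assumption: for brevity this sketch enumerates every subset of every transaction by brute force instead of generating candidates level by level as Apriori does, so it only illustrates the counting rule, not the algorithm of the paper. The function name `large_itemsets` is hypothetical.

```python
from itertools import combinations
from math import ceil


def large_itemsets(customers, min_support):
    """Itemsets (as sorted tuples) supported by at least min_support of the customers."""
    counts = {}
    for transactions in customers.values():
        seen = set()  # itemsets this customer supports; each counts once
        for transaction in transactions:
            items = sorted(transaction)
            for size in range(1, len(items) + 1):
                seen.update(combinations(items, size))
        for itemset in seen:
            counts[itemset] = counts.get(itemset, 0) + 1
    min_count = ceil(min_support * len(customers))
    return {itemset for itemset, n in counts.items() if n >= min_count}


customers = {
    1: [{30}, {90}],
    2: [{10, 20}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}],
}
print(sorted(large_itemsets(customers, 0.25), key=lambda s: (len(s), s)))
# [(30,), (40,), (70,), (90,), (40, 70)]
```

With a 25% threshold (2 of 5 customers) this yields the five litemsets listed in the example.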

3. Transformation Phase
- There is a need to repeatedly check which large itemsets are contained in the customer sequences
- To make this fast, each customer sequence is transformed into a list of sets of large itemsets
- Then the large itemsets are mapped to integers

CId   Original sequence          Transformed sequence                     After mapping
1     <(30)(90)>                 <{(30)}{(90)}>                           <{1}{5}>
2     <(10 20)(30)(40 60 70)>    <{(30)}{(40),(70),(40 70)}>              <{1}{2,3,4}>
3     <(30 50 70)>               <{(30),(70)}>                            <{1,3}>
4     <(30)(40 70)(90)>          <{(30)}{(40),(70),(40 70)}{(90)}>        <{1}{2,3,4}{5}>
5     <(90)>                     <{(90)}>                                 <{5}>
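The transformation can be reproduced with a few lines of code. This is a sketch under the assumption that the litemset-to-integer mapping is given as a dictionary (here filled in with the mapping from the example); `transform` is a hypothetical helper name.

```python
# Mapping of large itemsets to integers, as in the example.
LITEMSET_IDS = {
    frozenset({30}): 1,
    frozenset({40}): 2,
    frozenset({70}): 3,
    frozenset({40, 70}): 4,
    frozenset({90}): 5,
}


def transform(customer_sequence, litemset_ids):
    """Replace each transaction by the set of ids of the litemsets it contains."""
    transformed = []
    for transaction in customer_sequence:
        ids = {lid for litemset, lid in litemset_ids.items() if litemset <= transaction}
        if ids:  # a transaction containing no litemset is dropped
            transformed.append(ids)
    return transformed


print(transform([{10, 20}, {30}, {40, 60, 70}], LITEMSET_IDS))  # [{1}, {2, 3, 4}]
print(transform([{30, 50, 70}], LITEMSET_IDS))                  # [{1, 3}]
```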

4. Sequence Phase
- The large itemsets are used to find the desired sequences
- AprioriAll:
  - Based on the normal Apriori algorithm
  - Counts all the large sequences
  - Prunes the non-maximal ones in the "Maximal phase"
- *Some (AprioriSome and DynamicSome):
  - Avoid counting sequences that are contained in longer sequences by counting the longer ones first; also avoid having to count many subsequences because their supersequences are not large

  - Forward phase: find all large sequences of certain lengths
  - Backward phase: find all remaining large sequences
  - AprioriSome: use only the large sequences from the previous pass to generate candidates and validate their supports (i.e., whether they are frequent or not)
  - DynamicSome: generate candidates on-the-fly based on the large sequences found in the previous passes and the customer sequences read from the database

5. Maximal Phase
- Find the maximal sequences among the large sequences
- In practice, starting from the longest sequences, delete all their subsequences

AprioriAll:
- Find all large sequences "normally"
- Prune the non-maximal ones away, starting from <1 2 3 4>, by deleting all its subsequences (<1 2 3>, <1 2 4>, <1 3 4>, <2 3 4>, <1 2>, <1 3>, ..., <4>); then take the remaining <1 3 5> and prune all its subsequences, ...
- The maximal large sequences are <1 2 3 4>, <1 3 5> and <4 5>
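The pruning step can be sketched as below. Two assumptions are made for the sake of the example: sequences are tuples of litemset ids that are treated as atomic symbols, and the set of large sequences is taken to be exactly the three maximal sequences plus all of their subsequences. The helper names are hypothetical.

```python
from itertools import combinations


def is_subsequence(s, t):
    """True if tuple s is a (not necessarily contiguous) subsequence of tuple t."""
    remaining = iter(t)
    return all(symbol in remaining for symbol in s)


def maximal_sequences(large):
    """Keep only the sequences not contained in another sequence of the set."""
    kept = []
    # Longest first, so every possible supersequence is already in `kept`.
    for s in sorted(large, key=len, reverse=True):
        if not any(is_subsequence(s, m) for m in kept):
            kept.append(s)
    return kept


def all_subsequences(s):
    return {c for size in range(1, len(s) + 1) for c in combinations(s, size)}


large = set()
for m in [(1, 2, 3, 4), (1, 3, 5), (4, 5)]:
    large |= all_subsequences(m)

print(sorted(maximal_sequences(large)))
# [(1, 2, 3, 4), (1, 3, 5), (4, 5)]
```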

AprioriSome:
- Count only sequences of, e.g., length 1, 2, 4 and 6 in the "forward phase", and count sequences of length 3 and 5 in the "backward phase"
- Note: in the forward phase, candidates are generated for all levels, even for those that are not counted:
  - If the large sequences L_{k-1} were counted in the previous pass, generate the new candidates C_k based on them
  - If the large sequences L_{k-1} were NOT counted, generate the new candidates C_k based on the candidates C_{k-1}
- In the backward phase: delete all sequences of length k from the candidate collection if they are contained in some longer large sequence in L_i (i > k)
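Candidate generation in the Apriori style (join, then prune) might look like the following sketch. It assumes sequences are tuples of litemset ids and that a sequence is not joined with itself; the function name `generate_candidates` is hypothetical. The same function can be fed either L_{k-1} or C_{k-1}, matching the two cases above.

```python
def generate_candidates(prev):
    """Generate candidate k-sequences from a set of (k-1)-sequences (tuples)."""
    prev = set(prev)
    # Join step: two sequences sharing their first k-2 elements are combined.
    joined = {
        p + (q[-1],)
        for p in prev
        for q in prev
        if p != q and p[:-1] == q[:-1]
    }

    # Prune step: every (k-1)-subsequence of a candidate must be in prev.
    def subsequences_present(c):
        return all(c[:i] + c[i + 1:] in prev for i in range(len(c)))

    return {c for c in joined if subsequences_present(c)}


l3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
print(generate_candidates(l3))  # {(1, 2, 3, 4)}
```

Here the join produces (1, 2, 3, 4), (1, 2, 4, 3), (1, 3, 4, 5) and (1, 3, 5, 4); only the first survives pruning, since for example (2, 4, 3) is not in the input set.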

Function "next" determines the next sequence length to be counted. It is based on the assumption that if, e.g., almost all candidate sequences of length k are large (frequent), then many of the sequences of length k+1 are also large (frequent). E.g.,
- Most of the sequences are large (85%) => next round is k+5
- Not many of the sequences are large (67%) => next round is k+1 (as in AprioriAll)
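A possible shape for such a "next" function is sketched below. Only the two end points (85% giving k+5, 67% giving k+1) come from the slide; the behaviour between them, the parameter names, and the handling of an empty candidate set are illustrative assumptions.

```python
def next_length(k, num_large, num_candidates, low=0.67, high=0.85, max_skip=5):
    """Choose the next sequence length to count after a pass of length k."""
    if num_candidates == 0:
        return k + 1  # assumption: nothing to extrapolate from
    hit_ratio = num_large / num_candidates
    if hit_ratio <= low:
        return k + 1           # few candidates were large: proceed like AprioriAll
    if hit_ratio >= high:
        return k + max_skip    # most candidates were large: skip far ahead
    # In between: scale the skip from 2 up to max_skip - 1 (illustrative assumption).
    fraction = (hit_ratio - low) / (high - low)
    return k + 2 + int(fraction * (max_skip - 2))


print(next_length(3, 85, 100))  # 8
print(next_length(3, 50, 100))  # 4
```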

DynamicSome:
- In the initialization phase, count only sequences up to and including the length given by the step variable
  - E.g., if step is 3, count sequences of length 1, 2 and 3
- In the forward phase, generate sequences of length 2 × step, 3 × step, 4 × step, etc. on-the-fly based on the previous passes and the customer sequences in the database
  - E.g., while generating sequences of length 9 with step size 3: while passing over the data, if sequences s6 ∈ L_6 and s3 ∈ L_3 are both contained in the customer sequence c in hand, and they do not overlap in c, then <s6.s3> is a candidate 9-sequence
  - In general, non-overlapping s_k ∈ L_k and s_j ∈ L_j yield the candidate (k+j)-sequence <s_k.s_j>
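The on-the-fly generation can be sketched as follows, assuming the customer sequence is already transformed (a list of sets of litemset ids) and patterns are tuples of ids. The non-overlap test used here compares the earliest position where the first pattern can end with the latest position where the second can start. For brevity the example joins 2-sequences with 2-sequences rather than L_6 with L_3; all names are hypothetical.

```python
def earliest_end(pattern, c):
    """Index in c where the leftmost match of pattern ends, or None if no match."""
    pos = 0
    for lit in pattern:
        while pos < len(c) and lit not in c[pos]:
            pos += 1
        if pos == len(c):
            return None
        pos += 1
    return pos - 1


def latest_start(pattern, c):
    """Index in c where the rightmost match of pattern starts, or None if no match."""
    pos = len(c) - 1
    for lit in reversed(pattern):
        while pos >= 0 and lit not in c[pos]:
            pos -= 1
        if pos < 0:
            return None
        pos -= 1
    return pos + 1


def otf_generate(l_k, l_j, c):
    """Candidates x.y with x in l_k, y in l_j, both in c without overlapping."""
    candidates = set()
    for x in l_k:
        end = earliest_end(x, c)
        if end is None:
            continue
        for y in l_j:
            start = latest_start(y, c)
            if start is not None and end < start:
                candidates.add(x + y)
    return candidates


c = [{1}, {2}, {3, 7}, {4}]
l2 = {(1, 2), (1, 3), (1, 4), (3, 4)}
print(otf_generate(l2, l2, c))  # {(1, 2, 3, 4)}
```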

- In the intermediate phase, generate the candidate sequences for the skipped lengths
  - E.g., if we have counted L_6 and L_3, and L_9 turns out to be empty: we generate C_7 and C_8, count C_8 followed by C_7 after deleting non-maximal sequences, and repeat the process for C_4 and C_5
- The backward phase is identical to that of AprioriSome

Then we go on and spare a few thoughts on incremental mining of sequential patterns

Incremental Mining of Sequential Patterns in Large Databases

F. Masseglia, P. Poncelet and M. Teisseire, Laboratoire PRiSM & LIRMM UMR CNRS, France
Published in BDA'00 (Bases de Données Avancées)
Data Mining course, Autumn 2001, University of Helsinki
Summary by Mika Klemettinen

Problem setting:
- Let us consider an original and an incremental customer transaction database
- For the original database, the frequent patterns have already been computed
- The incremental database may contain new customers, and new transactions for both old and new customers
- To compute the set of sequential patterns in the updated database, we want to avoid counting everything from scratch
- Some main things one has to consider:
  - Discover all sequential patterns that are NOT frequent in the original database but become frequent with the increment
  - Examine all sequences in the original database that can be extended to become frequent
  - Old frequent sequences may become invalid when a customer or customers are added

- Definitions are basically the same as in the "Mining Sequential Patterns" paper
- Again, the problem is to find all (maximal) sequences whose support is greater than a specified threshold (minimum support)

Additional definitions:
- DB is the original database, minSupp is the minimum support
- db is the increment database
- U = DB ∪ db is the updated database containing all sequences from DB and db
- L^DB is the set of frequent sequences in DB
- The task is to find the frequent sequences in U, denoted L^U, with respect to minSupp

An example database is presented on the next slide
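To make U = DB ∪ db concrete, the sketch below merges an increment into an original database. The customer ids and itemsets are invented for illustration (they are not the example database of the paper), and the sketch assumes that all transactions in db are later in time than those already stored for the same customer.

```python
def merge(original, increment):
    """Build the updated database U from DB and db without modifying either."""
    updated = {cid: list(seq) for cid, seq in original.items()}
    for cid, new_transactions in increment.items():
        # Existing customers get the new transactions appended;
        # unknown customers are added as new entries.
        updated.setdefault(cid, []).extend(new_transactions)
    return updated


# Hypothetical data: C2 gets a new transaction, C5 is a new customer.
DB = {"C1": [{10, 20}, {30}], "C2": [{10, 20}]}
db = {"C2": [{50}], "C5": [{60}]}
U = merge(DB, db)
print(U["C2"])  # [{10, 20}, {50}]
```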

[Slide 20: example database (Figures 1 and 2); figure not included in this transcript]

First problem (Figure 1): Append new transactions to customers already existing in the original database
- Suppose that we have a minSupp threshold of 50%
- In the original database, the frequent (maximal) sequences L^DB are { <(10 20)(30)>, <(10 20)(40)> }
- New transactions are appended to customers C2 and C3
- Sequences <(60)(90)> and <(10 20)(50 70)> become frequent
  - Customers C3 and C4 contain the first one, thus its support is 50%
  - Customers C1, C2, and C3 contain <(10 20)>; the increments for C2 and C3 make the second one frequent, since customers C1 and C2 contain it; thus its support is 50%
- Sequences <(10 20)(30)(50 60)(80)> and <(10 20)(40)(50 60)(80)> become frequent, since <(50 60)(80)> is frequent in db and was added to the rows already containing the frequent sequences <(10 20)(30)> and <(10 20)(40)>

Second problem (Figure 2): Append new customers and new transactions to the original database
- Suppose again that we have a minSupp threshold of 50%
- When one new customer is added to the database, a frequent sequence must be observed for 3 customers (previously 2)
- In the original database, the set of frequent (maximal) sequences L^DB used to be { <(10 20)(30)>, <(10 20)(40)> }, but is now just { <(10 20)> }
  - Sequences <(10 20)(30)> and <(10 20)(40)> occur only for customers C2 and C3
  - Sequence <(10 20)> occurs for C1, C2, and C3
- By introducing the increment database db, L^U becomes { <(10 20)(50)>, <(10)(70)>, <(10)(80)>, <(40)(80)>, <(60)> }
  - E.g., sequence <(10 20)(50)> occurs in the original database only for C1, and is not frequent; as the item 50 becomes frequent with the increment database, the sequence also matches C2 and C3
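The change from 2 to 3 required customers follows from rounding up minSupp times the number of customers. A small sketch, assuming the original database holds 4 customers and minSupp is given as an integer percentage (the function name is hypothetical):

```python
def min_customer_count(min_supp_percent, num_customers):
    """Smallest number of customers reaching the minimum support (rounded up)."""
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-min_supp_percent * num_customers // 100)


print(min_customer_count(50, 4))  # 2
print(min_customer_count(50, 5))  # 3
```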

Algorithm (ISE): The incremental mining is decomposed into two subproblems (k = length of the longest frequent sequences in DB)
1. Find all new frequent sequences of size j ≤ (k+1). During this phase, three kinds of frequent sequences are considered:
   - Sequences in DB can become frequent since they have sufficient support with the increment
   - There can be new frequent sequences appearing in the increment db but not in the original DB
   - Sequences in DB can become frequent when items of db are added
2. Find all new frequent sequences of size j > (k+1)
   - This is a straightforward application of an Apriori-like algorithm, since all frequent (k+1)-sequences have been discovered in the previous phase

First iteration (1):
- Make a pass over db and count the support for the individual items of db
- This provides 1-candExt, the sequences occurring in db
- Determine which items of db are frequent in U => L_1^db
- Prune out the sequences that used to be frequent in L^DB but are no longer frequent in U
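The step "determine which items of db are frequent in U" can be sketched as below. The data is invented for illustration, and the sketch recounts each item over all of U for clarity; an incremental implementation would presumably reuse counts already known for DB instead. The function name is hypothetical.

```python
def frequent_db_items(updated, increment, min_supp_percent):
    """Items occurring in db whose support in U reaches the minimum support."""
    db_items = set()
    for sequence in increment.values():
        for transaction in sequence:
            db_items |= transaction
    min_count = -(-min_supp_percent * len(updated) // 100)  # rounded up
    frequent = set()
    for item in db_items:
        # Each customer counts at most once per item.
        count = sum(
            1 for sequence in updated.values()
            if any(item in transaction for transaction in sequence)
        )
        if count >= min_count:
            frequent.add(item)
    return frequent


# Hypothetical data: U already includes the transactions of db.
U = {
    "C1": [{10, 20}, {30}],
    "C2": [{10, 20}, {50}],
    "C3": [{10}, {50, 60}],
    "C4": [{40}],
}
db = {"C2": [{50}], "C3": [{50, 60}]}
print(frequent_db_items(U, db, 50))  # {50}
```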

First iteration (2):
- Create candidate sequences of length 2 by joining L_1^db with L_1^db => 2-candExt
- Generate from L^DB the set of frequent sub-sequences
- Scan U to find the frequent 2-sequences from 2-candExt and the frequent sub-sequences occurring before items of L_1^db

First iteration (3):
- freqSeed <= frequent sub-sequences occurring before items of L_1^db, appended with the item
- 2-freqExt <= frequent 2-sequences from 2-candExt

j-th iteration with j ≤ (k+1):

    while (j-freqExt != ∅ AND j ≤ (k+1)) do
        candInc <= generate candidates from freqSeed and j-freqExt;
        j++;
        j-candExt <= generate candidate j-sequences from (j-1)-freqExt;
        scan db for j-candExt;
        if (j-candExt != ∅ AND candInc != ∅) then
            scan U for j-candExt and candInc;
        endif
        j-freqExt <= frequent j-sequences;
        freqInc <= freqInc + candidates from candInc verifying the support on U;
    enddo
    L^U <= L^DB ∪ { max. freq. sequences in freqSeed ∪ freqInc ∪ freqExt };

j-th iteration with j > (k+1):
- Apply an Apriori-style algorithm until all frequent sequences are discovered
- L^U <= L^U ∪ { max. freq. sequences obtained from the previous step };
- On the next slide, the processes in the first and the j-th iteration with j > (k+1) are summarized

Optimization in "candInc <= generate candidates from freqSeed and j-freqExt":
- Consider two sequences (s ∈ freqSeed, s' ∈ freqExt) such that an item i ∈ L_1^db is the last item of s and the first item of s'
- Do not append s' ∈ freqExt to s ∈ freqSeed if there exists an item j ∈ L_1^db such that j is in s' and j is not preceded by s

[Slide 29: summary of the processes in the first and the j-th iteration with j > (k+1); figure not included in this transcript]

Unofficial Evaluation (Personal Views...)

Mining Sequential Patterns:
- The paper comes from one of the top research groups in the data mining area (IBM Almaden Data Mining group led by Rakesh Agrawal)
- Quite well-written paper: good language, clear examples and presentation => rather "easy to read"
- Simple ideas, not very "break-through" ideas (at least this is the interpretation now); quite good international conference
- One has to remember: this was written already in 1995

Incremental Mining of Sequential Patterns in Large Databases:
- The paper comes from a not so well-known French research group
- Good: lots of examples
- Bad: the language is not always as good as it could be, and the definitions are sometimes somewhat "blurry"; maybe too many abbreviations are used
- Probably not very "break-through" ideas; national DB conference
- Remember: this is from the year 2000, so rather new!
