Data Mining Techniques: Sequential Patterns
Sequential Pattern Mining
Progress in bar-code technology has made it possible for retail organizations to collect and store massive amounts of sales data, referred to as basket data.
A record in such data typically consists of the transaction date and the items bought in the transaction.
Very often, data records also contain a customer-id, particularly when the purchase has been made using a credit card or a frequent-buyer card.
Catalog companies also collect such data using the orders they receive.
Sequential Pattern Mining
An example of such a pattern is that customers typically rent “Star Wars (星際大戰)”, then “Empire Strikes Back (帝國大反擊)”, and then “Return of the Jedi (絕地大反攻)”.
These rentals need not be consecutive.
  – Customers who rent some other videos in between also support this sequential pattern.
Elements of a sequential pattern need not be simple items.
  – “Computer Science and Programming Language”, followed by “Data Structure”, followed by “System Programs and Operating Systems” is an example of a sequential pattern in which the elements are sets of items.
Sequential Pattern Mining
Given: transaction records with the fields Transaction Time, Customer Id, and Items Bought.
[Tables: Original Database; Answer Set]
Definition
The length of a sequence is the number of itemsets in the sequence.
A sequence of length k is called a k-sequence.
The support for an itemset i is defined as the fraction of customers who bought the items in i in a single transaction.
The itemset i and the 1-sequence <i> have the same support.
An itemset with minimum support is called a large (frequent) itemset, or litemset.
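The support definition above can be illustrated with a minimal Python sketch. The toy data and the names `transactions_by_customer` and `itemset_support` are assumptions made for illustration; they do not come from the slides.

```python
# Hypothetical toy data: customer id -> list of transactions (each a set of items).
transactions_by_customer = {
    1: [{"a"}, {"b", "c"}],
    2: [{"a", "b"}, {"c"}],
    3: [{"b", "c", "d"}],
    4: [{"d"}],
}


def itemset_support(itemset, db):
    """Fraction of customers with at least one single transaction
    that contains every item of `itemset`."""
    if not db:
        return 0.0
    target = frozenset(itemset)
    supporters = sum(
        1
        for transactions in db.values()
        if any(target <= set(t) for t in transactions)
    )
    return supporters / len(db)


# Customer 2 bought b and c, but in different transactions,
# so only customers 1 and 3 support the itemset {b, c}.
support_bc = itemset_support({"b", "c"}, transactions_by_customer)
```

Note that a customer counts at most once, however many of their transactions contain the itemset; this is what distinguishes customer-based support from transaction-based support in association rule mining.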
AprioriAll Algorithm
Each itemset in a large sequence must have minimum support.
Any large sequence must be a list of litemsets.
Finding all sequential patterns takes five phases:
  – Sort Phase
  – Litemset Phase
  – Transformation Phase
  – Sequence Phase
  – Maximal Phase
AprioriAll Algorithm: Sort Phase
[Table: Customer-Sequence Version of the Database]
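As a rough sketch of how a customer-sequence version of the database might be built, the Python below sorts transaction records by customer id and then by transaction time, and groups each customer's itemsets in that order. The record layout `(customer_id, transaction_time, items)` and the function name `to_customer_sequences` are assumptions for illustration.

```python
from collections import defaultdict


def to_customer_sequences(records):
    """Turn (customer_id, transaction_time, items) records into
    {customer_id: [itemset, ...]}, each list ordered by transaction time."""
    # Customer id is the major sort key, transaction time the minor key.
    ordered = sorted(records, key=lambda r: (r[0], r[1]))
    sequences = defaultdict(list)
    for customer_id, _time, items in ordered:
        sequences[customer_id].append(frozenset(items))
    return dict(sequences)


# Hypothetical records; ISO date strings sort chronologically.
records = [
    (2, "2024-01-05", {"c"}),
    (1, "2024-01-03", {"b"}),
    (1, "2024-01-01", {"a"}),
    (2, "2024-01-02", {"a", "b"}),
]
customer_sequences = to_customer_sequences(records)
```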
AprioriAll Algorithm: Litemset Phase
Find the litemsets with Apriori/DHP or FP-Growth (min_sup_count = 2).
AprioriAll Algorithm: Transformation Phase
AprioriAll Algorithm: Sequence Phase
[Tables: Customer Sequences; Large 1-Sequences; Large 2-Sequences; Large 3-Sequences; Large 4-Sequences; Maximal Large Sequences]
Sequence Phase: Candidate Generation
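A minimal sketch of Apriori-style candidate generation for sequences, assuming each large (k-1)-sequence is a tuple of litemset identifiers (as produced by the transformation phase). Two sequences that agree on all but their last element are joined, and a candidate is then pruned if any of its (k-1)-subsequences is not large. The function name `generate_candidates` and the sample input are illustrative and not taken from the slide; this sketch also joins a sequence with itself, which is an assumption.

```python
def generate_candidates(large_prev):
    """Build candidate k-sequences from the set of large (k-1)-sequences.

    Each sequence is a tuple of litemset ids, e.g. (1, 2, 3).
    """
    large_prev = set(large_prev)

    # Join step: p and q share their first k-2 elements;
    # the candidate is p extended with the last element of q.
    joined = {
        p + (q[-1],)
        for p in large_prev
        for q in large_prev
        if p[:-1] == q[:-1]
    }

    # Prune step: every subsequence obtained by dropping one element
    # must itself be a large (k-1)-sequence.
    def all_subsequences_large(candidate):
        return all(
            candidate[:i] + candidate[i + 1:] in large_prev
            for i in range(len(candidate))
        )

    return {c for c in joined if all_subsequences_large(c)}


# Illustrative input: five large 3-sequences.
large_3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
candidates_4 = generate_candidates(large_3)
```

For this input only (1, 2, 3, 4) survives pruning; for example, (1, 2, 4, 3) is dropped because its subsequence (2, 4, 3) is not large.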
AprioriAll Algorithm: Maximal Phase
The sequence <(3) (4 5) (8)> is contained in <(7) (3 8) (9) (4 5 6) (8)>, since (3) ⊆ (3 8), (4 5) ⊆ (4 5 6) and (8) ⊆ (8).
The sequence <(3) (5)> is not contained in <(3 5)> (and vice versa).
  – The former represents items 3 and 5 being bought one after the other.
  – The latter represents items 3 and 5 being bought together.
In a set of sequences, a sequence s is maximal if s is not contained in any other sequence.
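The containment test and the maximality filter can be sketched in Python as follows. Sequences are represented as tuples of frozensets; the helper names `seq`, `is_contained`, and `maximal_sequences` are assumptions for illustration, and the sample sequences are chosen to match the itemset pairs discussed above.

```python
def seq(*elements):
    """Build a sequence as a tuple of frozensets."""
    return tuple(frozenset(e) for e in elements)


def is_contained(a, b):
    """True if each itemset of `a` is a subset of a distinct itemset of `b`,
    with the matched itemsets of `b` appearing in the same order."""
    pos = 0
    for element in a:
        # Greedily take the earliest itemset of b that covers this element.
        while pos < len(b) and not element <= b[pos]:
            pos += 1
        if pos == len(b):
            return False
        pos += 1
    return True


def maximal_sequences(sequences):
    """Keep the sequences that are not contained in any other sequence."""
    return [
        s
        for s in sequences
        if not any(s != t and is_contained(s, t) for t in sequences)
    ]


small = seq({3}, {4, 5}, {8})
big = seq({7}, {3, 8}, {9}, {4, 5, 6}, {8})
in_order = seq({3}, {5})   # 3 and 5 bought one after the other
together = seq({3, 5})     # 3 and 5 bought together
maximal = maximal_sequences([seq({3}), in_order, together])
```

Here `seq({3})` is dropped because it is contained in both other sequences, while `in_order` and `together` are both kept since neither contains the other.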
AprioriAll Algorithm
With minimum support set to 25%, i.e., a minimum support of 2 customers:
  – <(30) (90)> and <(30) (40 70)> are maximal.
  – <(10 20) (30)>, which is only supported by customer 2, does not have minimum support.
  – <(30)>, <(40)>, <(70)>, <(90)>, <(30) (40)>, <(30) (70)> and <(40 70)>, though having minimum support, are not in the answer because they are not maximal.
[Table: Answer Set]
Summary
Discussions
The AprioriAll algorithm will generate a huge set of candidate sequences.
  – If there are 1000 frequent sequences of length 1, the algorithm will generate 1000 × 1000 + (1000 × 999) / 2 = 1,499,500 candidate sequences of length 2.
Mining requires many scans of the database.
Mining long sequential patterns is difficult.
Research Topics
  – Time-Interval Sequential Patterns
  – Time-Gap Sequential Patterns
  – Non-redundant Sequential Patterns
  – Constrained Sequential Pattern Mining
  – Multi-dimensional Sequential Patterns
  – Generalized Sequential Patterns
  – Incremental Mining of Sequential Patterns
  – Data Stream Sequential Pattern Mining
  – Interactive Mining of Sequential Patterns
Exercise 6
A Sequence Database (min-sup = 50%)
[Table: columns SID, Customer sequence]