Presented by Yaron Gonen
Outline
- Introduction
- Problem definition and motivation
- Previous work
- The CAMLS algorithm: overview and main contributions
- Results
- Future work
Frequent Item-sets: The Market-Basket Model
- A set of items, e.g., the stuff sold in a supermarket.
- A set of baskets (later called events or transactions), each of which is a small set of the items, e.g., the things one customer buys on one day.
Support
- Support for item-set I = the number of baskets containing all items in I (usually given as a percentage).
- Given a support threshold minSup, sets of items that appear in > minSup baskets are called frequent item-sets.
- Simplest question: find all frequent item-sets.
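To make the definition concrete, here is a minimal Python sketch (not from the slides; the basket contents and threshold are made up for illustration) of computing the support of an item-set:

```python
# Minimal sketch: support of an item-set over a toy list of baskets.
# The baskets and the threshold below are illustrative, not the slides' example.
baskets = [
    {"beer", "diapers", "milk"},
    {"beer", "diapers"},
    {"milk", "bread"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

min_sup = 0.6
print(support({"beer", "diapers"}, baskets))          # 0.666... -> frequent
print(support({"beer", "bread"}, baskets) > min_sup)  # False
```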
Example: minimum support = 0.6 (2 baskets). [The items and example baskets are shown on the slide.]
Application (1)
- Items: products at a supermarket.
- Baskets: the set of products a customer bought at one time.
- Example: many people buy beer and diapers together.
  - Place beer next to diapers to increase both sales.
  - Run a sale on diapers and raise the price of beer.
Application (2) (Counter-Intuitive)
- Items: species of plants.
- Baskets: each basket represents an attribute; a basket contains the items (plants) that have that attribute.
- Frequent sets may indicate similarity between plants.
Scale of the Problem
- Costco sells more than 120k different items and has 57m members (from Wikipedia).
- Botany has identified about 350k extant species of plants.
The Naïve Algorithm
- Generate all possible item-sets.
- Check the support of each one.
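A hedged sketch of what this brute-force approach looks like in Python (the function name and structure are assumptions for illustration):

```python
# Naive approach: enumerate every non-empty item-set (2^|items| - 1 of them)
# and count its support with a full pass over the baskets.
from itertools import chain, combinations

def naive_frequent_itemsets(baskets, min_sup):
    items = sorted(set().union(*baskets))
    all_itemsets = chain.from_iterable(
        combinations(items, k) for k in range(1, len(items) + 1))
    frequent = []
    for candidate in all_itemsets:
        count = sum(set(candidate) <= basket for basket in baskets)
        if count / len(baskets) > min_sup:
            frequent.append(candidate)
    return frequent
```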
The Apriori Property
All nonempty subsets of a frequent itemset must also be frequent.
The Apriori Algorithm
1. Find the frequent 1-itemsets.
2. Merge and prune the frequent itemsets of the previous size to generate candidates of the next size (this is where the Apriori property is used).
3. If there are no candidates, stop.
4. Otherwise, go through the whole DB to count the support of each candidate; candidates with support > minSup become frequent itemsets, and we return to step 2.
In total, the DB is scanned once per level, i.e., as many times as the length of the largest frequent itemset.
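The following is a compact, generic Apriori sketch in Python that mirrors the flow above (join, prune with the Apriori property, one DB scan per level); it is an assumed illustration, not the authors' implementation:

```python
from itertools import combinations

def apriori(baskets, min_count):
    items = sorted(set().union(*baskets))
    # frequent 1-item-sets
    level = {frozenset([i]) for i in items
             if sum(i in b for b in baskets) >= min_count}
    frequent = set(level)
    k = 2
    while level:
        # join: merge pairs of frequent (k-1)-item-sets into size-k candidates
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # prune: every (k-1)-subset must itself be frequent (Apriori property)
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        # one scan of the DB per level to count support
        level = {c for c in candidates
                 if sum(c <= b for b in baskets) >= min_count}
        frequent |= level
        k += 1
    return frequent
```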
Vertical Format
- Index the database on items: each item is mapped to the list of transactions (ids) that contain it.
- Calculating support is fast.
[The slide shows an example of the resulting item-to-transaction index.]
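A small sketch of the idea (the data layout is assumed, not taken from the slides): build the item-to-transaction index once, then compute support as an intersection of tid-sets.

```python
from collections import defaultdict

def to_vertical(baskets):
    """Map each item to the set of transaction ids (tids) that contain it."""
    index = defaultdict(set)
    for tid, basket in enumerate(baskets):
        for item in basket:
            index[item].add(tid)
    return index

def support_count(itemset, index):
    """Support = size of the intersection of the items' tid-sets."""
    return len(set.intersection(*(index[item] for item in itemset)))
```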
Frequent Sequences: Taking It to the Next Level
A large set of sequences, each of which is a time-ordered list of events (baskets), e.g., all the stuff a single customer buys over time. [The slide shows an example timeline with gaps of 2 weeks and 5 days between events.]
Support
- Subsequence: a sequence whose events are all subsets of events of another sequence, appearing in the same order (but not necessarily consecutively).
- Support for subsequence s = the number of sequences containing s (usually given as a percentage).
- Given a support threshold minSup, subsequences that appear in > minSup sequences are called frequent subsequences.
- Simplest question: find all frequent subsequences.
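A hedged Python sketch of the containment test and support count implied by these definitions (sequences are modeled as lists of item-sets; the representation is an assumption):

```python
def is_subsequence(sub, seq):
    """True if every event of `sub` is a subset of some event of `seq`,
    matched in order but not necessarily consecutively (greedy matching)."""
    i = 0
    for event in seq:
        if i < len(sub) and sub[i] <= event:
            i += 1
    return i == len(sub)

def sequence_support(sub, sequences):
    """Fraction of sequences in the database that contain `sub`."""
    return sum(is_subsequence(sub, s) for s in sequences) / len(sequences)

# Example: <(ab) d> is contained in <(abc) (de) f>
print(is_subsequence([{"a", "b"}, {"d"}],
                     [{"a", "b", "c"}, {"d", "e"}, {"f"}]))   # True
```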
Notation
- Items are letters: a, b, …
- Events are parenthesized: (ab), (bdf), … except for events with a single item.
- Sequences are surrounded by angle brackets < >.
- Every sequence has an identifier, sid.
Example: a database of four sequences (sid 1-4) and the frequent sequences found with minSup = 0.5, together with their support counts (4, 3, 4, 2, …). [The sequences themselves are shown in a table on the slide.]
Motivation
- Customer shopping patterns
- Stock market fluctuations
- Weblog click-stream analysis
- Symptoms of a disease
- DNA sequence analysis
- Weather forecasting
- Machine anti-aging
- Many more…
Much Harder than Frequent Item-sets!
There are 2^(m*n) possible candidates, where m is the number of items and n is the number of transactions in the longest sequence.
The Apriori Property If a sequence is not frequent, then any sequence that contains it cannot be frequent
Constraints
- Problem: there are too many frequent sequences, and most of them are not useful.
- Solution: remove them. Constraints are a way to define usefulness.
- The trick: do so while mining.
Previous Work
- GSP (Srikant and Agrawal, 1996)
  - Generation-and-test, Apriori-based approach.
- SPADE (Zaki, 2001)
  - Generation-and-test, Apriori-based approach.
  - Uses equivalence classes for memory optimization.
  - Uses a vertical-format DB.
- PrefixSpan (Pei, 2004)
  - No candidate generation.
  - Uses a DB-projection method.
Why a New Algorithm?
- A huge set of candidate sequences / projected DBs is generated.
- Multiple scans of the database are needed.
- Inefficient for mining long sequential patterns.
- Domain-specific properties are not exploited.
- Weak support for constraints.
The CAMLS Algorithm
- CAMLS: Constraint-based Apriori algorithm for Mining Long Sequences.
- Designed especially for efficient mining of long sequences.
- Outperforms SPADE and PrefixSpan on both synthetic and real data.
The CAMLS Algorithm
Makes a logical distinction between two types of constraints:
- Intra-event: not time-related (e.g., mutually exclusive items).
- Inter-event: addresses the temporal aspect of the data (e.g., values that can or cannot appear one after the other).
Event-wise Constraints
- An event must / must not contain a specific item.
- Two items cannot occur at the same time.
- max_event_length: an event cannot contain more than a fixed number of items.
Sequence-wise Constraints
- max_sequence_length: a sequence cannot contain more than a fixed number of events.
- max_gap: a pattern is dismissed if the time between consecutive events is too long (see the sketch of both constraint families below).
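One plausible way to express the two constraint families as predicates, sketched in Python (the parameter names follow the slides; everything else is an assumption):

```python
def satisfies_event_constraints(event, required, forbidden_pairs, max_event_length):
    """Intra-event checks: required items, mutually exclusive items,
    and the max_event_length bound."""
    if len(event) > max_event_length:
        return False
    if not required <= event:
        return False
    return not any(a in event and b in event for a, b in forbidden_pairs)

def satisfies_sequence_constraints(event_times, max_sequence_length, max_gap):
    """Inter-event checks: max_sequence_length, and max_gap between
    consecutive events (event_times are the events' timestamps, in order)."""
    if len(event_times) > max_sequence_length:
        return False
    return all(t2 - t1 <= max_gap for t1, t2 in zip(event_times, event_times[1:]))
```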
CAMLS Overview
The input (sequence DB) and the constraints (minSup, maxGap, …) feed the event-wise phase; its output (frequent events + an occurrence index) feeds the sequence-wise phase, which produces the final output (all frequent sequences).
What Do We Get?
The best of both worlds:
- Far fewer candidates are generated.
- The support check is fast.
- Worst case: works like SPADE.
- Tradeoff: uses a bit more memory (for storing the frequent item-sets).
Event-wise Phase
- Input: sequence database and constraints.
- Output: frequent events + occurrence index.
- Uses Apriori or FP-Growth to find the frequent itemsets (both with minor modifications).
Event-wise (pseudocode)
1. L1 = all frequent items
2. for (k = 2; L(k-1) ≠ ∅; k++) do
   a. generateCandidates(L(k-1))   // if two frequent (k-1)-events have the same prefix, merge them to form a new candidate
   b. Lk = pruneCandidates()       // prune, calculate support counts and create the occurrence index
   c. L = L ∪ Lk
3. end for
(Example on a later slide.)
Occurrence Index
- A compact representation of all occurrences of a sequence.
- Structure: a list of sids, each associated with a list of eids.
(Example on the next slide; a small sketch of one possible in-memory layout follows.)
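A minimal sketch of one possible in-memory layout for this index (assumed; the paper's actual structure may differ):

```python
# Occurrence index of some pattern: sid -> sorted list of eids where it occurs.
occurrence_index = {
    1: [0, 5],    # the pattern occurs in sequence 1 at eids 0 and 5
    2: [8],
    3: [0, 11],
}

def support_from_index(index):
    # Support = number of distinct sequences (sids) that contain the pattern.
    return len(index)
```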
Event-wise Example (Using Apriori), minSup = 2

  event  eid  sid
  (acd)   0    1
  (bcd)   5    1
  b      10    1
  a       0    2
  c       4    2
  (bd)    8    2
  (cde)   0    3
  e       7    3
  (acd)  11    3

- All frequent items: a:3, b:2, c:3, d:3.
- Candidates: (ab), (ac), (ad), (bc), …  Support count: (ac):2, (ad):2, (bd):2, (cd):2.
- Candidates: (abc), (abd), (acd), …  Support count: (acd):2.
- No more candidates!
[The slide also shows the occurrence index built for the frequent events.]
Sequence-wise Phase
- Input: frequent events + occurrence index, and constraints.
- Output: all frequent sequences.
- Similar to GSP's and SPADE's candidate-generation phase, except that the frequent itemsets are used as seeds.
Sequence-wise (pseudocode)
1. L1 = all frequent 1-sequences
2. for (k = 2; L(k-1) ≠ ∅; k++) do
   a. generateCandidates(L(k-1))
   b. Lk = pruneAndSupCalc()
   c. L = L ∪ Lk
3. end for
(Candidate generation and pruning are elaborated on the next two slides.)
Sequence-wise Candidate Generation
If two frequent k-sequences s' = <s'1 s'2 … s'k> and s'' = <s''1 s''2 … s''k> share a common (k-1)-prefix and s' is a generator, we form a new candidate <s'1 s'2 … s'k s''k>.
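A hedged sketch of this join step, with sequences modeled as lists of frozensets (the representation and the lack of de-duplication are simplifications):

```python
def generate_candidates(frequent_k):
    """Join two frequent k-sequences that share their first k-1 events into a
    longer candidate by appending the last event of the second sequence.
    (CAMLS additionally requires the first sequence to be a generator; omitted here.)"""
    candidates = []
    for s1 in frequent_k:
        for s2 in frequent_k:
            if s1 is not s2 and s1[:-1] == s2[:-1]:
                candidates.append(s1 + [s2[-1]])
    return candidates
```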
Sequence-wise Pruning
1. Keep a radix-ordered list of the sequences pruned in the current iteration.
2. Within the same iteration, one k-sequence may contain another k-sequence that has already been pruned.
3. For each new candidate:
   a. Check whether it contains a subsequence from the pruned list: very fast!
   b. Otherwise, test it for frequency.
   c. Add it to the pruned list if needed.
Support Calculation
- A simple intersection operation between the occurrence indexes of the forming sequences.
- Once the new occurrence index is formed, the support calculation is trivial (a sketch of this join appears below).
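An assumed sketch of this join of occurrence indexes, using the dict-of-eid-lists layout from the earlier sketch, with an optional maxGap check folded in:

```python
def join_indexes(prefix_index, suffix_index, max_gap=None):
    """Per sid, keep the eids of the suffix event that occur after some occurrence
    of the prefix (and within max_gap of it, when max_gap is given)."""
    joined = {}
    for sid, prefix_eids in prefix_index.items():
        eids = [e for e in suffix_index.get(sid, [])
                if any(e > p and (max_gap is None or e - p <= max_gap)
                       for p in prefix_eids)]
        if eids:
            joined[sid] = eids
    return joined

# The support of the new candidate is then simply len(join_indexes(...)).
```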
The maxGap Constraint
maxGap is a special kind of constraint:
- It is data-dependent, and the Apriori property is not applicable to it.
- The occurrence index enables a fast maxGap check.
- A frequent sequence that does not satisfy maxGap is flagged as a non-generator.
Example: assume <a b> is frequent but the gap between a and b exceeds maxGap, while longer frequent sequences containing both a and b (with intermediate events) satisfy maxGap everywhere. Then <a b> is a non-generator, but it is kept so that those longer sequences are not pruned.
Sequence-wise Example (same database as before; minSup = 2, maxGap = 5)
The frequent events from the event-wise phase are used as seeds, and candidate sequences are generated and checked level by level. The slide walks through the bookkeeping:
- A pruned candidate is added to the pruned list.
- A super-sequence of a pruned sequence is itself pruned (without a support check).
- A sequence that does not pass maxGap is not a generator, but it is kept.
[The concrete candidate sequences, their support counts and their generator ('g') flags are shown in a table on the slide.]
Evaluation (1): Machine Anti-Aging
How can sequence mining help?
- Data collected from a machine is a sequence.
- Discover typical behavior leading to failure.
- Monitor the machine and alert before failure.
Domain: light intensity per wavelength (continuous).
Pre-processing: discretization; meta-features (maxDisc, maxWL, isBurned).
Synm stands for a synthetic database simulating the machine's behavior with m meta-features.
Evaluation (2): Real Stock Data
Rn stands for stock data (10 different stocks) over n days.
CAMLS Compared with PrefixSpan
CAMLS Compared with SPADE and PrefixSpan
So, What Is CAMLS's Contribution?
- The distinction between constraint types: easy implementation.
- Two phases.
- Handling of the maxGap constraint.
- The occurrence index data structure.
- A fast new pruning method.
Future Research
- Main issue: closed sequences.
- More constraints (aspiring to regular-expression constraints).
Thank You!