Presented by Yaron Gonen
Outline
- Introduction
- Problem definition and motivation
- Previous work
- The CAMLS algorithm: overview and main contributions
- Results
- Future work
Frequent Item-sets: The Market-Basket Model
- A set of items, e.g., the stuff sold in a supermarket.
- A set of baskets (later called events or transactions), each of which is a small set of the items, e.g., the things one customer buys on one day.
Support
- Support for item-set I = the number of baskets containing all items in I (usually given as a percentage).
- Given a support threshold minSup, sets of items that appear in > minSup baskets are called frequent item-sets.
- Simplest question: find all frequent item-sets.
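To make the definition concrete, here is a minimal Python sketch (not from the slides; the basket contents and threshold are made up for illustration) of computing the support of an item-set:

```python
# Minimal sketch: support of an item-set over a toy list of baskets.
# The baskets and the threshold below are illustrative, not the slides' example.
baskets = [
    {"beer", "diapers", "milk"},
    {"beer", "diapers"},
    {"milk", "bread"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

min_sup = 0.6
print(support({"beer", "diapers"}, baskets))          # 0.666... -> frequent
print(support({"beer", "bread"}, baskets) > min_sup)  # False
```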
Example: minimum support = 0.6 (2 baskets). [The items and example baskets are shown on the slide.]
Application (1)
- Items: products at a supermarket.
- Baskets: the set of products a customer bought at one time.
- Example: many people buy beer and diapers together.
  - Place beer next to diapers to increase both sales.
  - Run a sale on diapers and raise the price of beer.
Application (2) (Counter-Intuitive)
- Items: species of plants.
- Baskets: each basket represents an attribute; a basket contains the items (plants) that have that attribute.
- Frequent sets may indicate similarity between plants.
Scale of the Problem
- Costco sells more than 120k different items and has 57m members (from Wikipedia).
- Botany has identified about 350k extant species of plants.
The Naïve Algorithm
- Generate all possible item-sets.
- Check the support of each one.
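A hedged sketch of what this brute-force approach looks like in Python (the function name and structure are assumptions for illustration):

```python
# Naive approach: enumerate every non-empty item-set (2^|items| - 1 of them)
# and count its support with a full pass over the baskets.
from itertools import chain, combinations

def naive_frequent_itemsets(baskets, min_sup):
    items = sorted(set().union(*baskets))
    all_itemsets = chain.from_iterable(
        combinations(items, k) for k in range(1, len(items) + 1))
    frequent = []
    for candidate in all_itemsets:
        count = sum(set(candidate) <= basket for basket in baskets)
        if count / len(baskets) > min_sup:
            frequent.append(candidate)
    return frequent
```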
The Apriori Property
All nonempty subsets of a frequent itemset must also be frequent.
The Apriori Algorithm
1. Find the frequent 1-itemsets.
2. Merge and prune the frequent itemsets of the previous size to generate candidates of the next size (this is where the Apriori property is used).
3. If there are no candidates, stop.
4. Otherwise, go through the whole DB to count the support of each candidate; candidates with support > minSup become frequent itemsets, and we return to step 2.
In total, the DB is scanned once per level, i.e., as many times as the length of the largest frequent itemset.
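The following is a compact, generic Apriori sketch in Python that mirrors the flow above (join, prune with the Apriori property, one DB scan per level); it is an assumed illustration, not the authors' implementation:

```python
from itertools import combinations

def apriori(baskets, min_count):
    items = sorted(set().union(*baskets))
    # frequent 1-item-sets
    level = {frozenset([i]) for i in items
             if sum(i in b for b in baskets) >= min_count}
    frequent = set(level)
    k = 2
    while level:
        # join: merge pairs of frequent (k-1)-item-sets into size-k candidates
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # prune: every (k-1)-subset must itself be frequent (Apriori property)
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        # one scan of the DB per level to count support
        level = {c for c in candidates
                 if sum(c <= b for b in baskets) >= min_count}
        frequent |= level
        k += 1
    return frequent
```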
Vertical Format
- Index the database on items: each item is mapped to the list of transactions (ids) that contain it.
- Calculating support is fast.
[The slide shows an example of the resulting item-to-transaction index.]
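A small sketch of the idea (the data layout is assumed, not taken from the slides): build the item-to-transaction index once, then compute support as an intersection of tid-sets.

```python
from collections import defaultdict

def to_vertical(baskets):
    """Map each item to the set of transaction ids (tids) that contain it."""
    index = defaultdict(set)
    for tid, basket in enumerate(baskets):
        for item in basket:
            index[item].add(tid)
    return index

def support_count(itemset, index):
    """Support = size of the intersection of the items' tid-sets."""
    return len(set.intersection(*(index[item] for item in itemset)))
```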
Frequent Sequences: Taking It to the Next Level
A large set of sequences, each of which is a time-ordered list of events (baskets), e.g., all the stuff a single customer buys over time. [The slide shows an example timeline with gaps of 2 weeks and 5 days between events.]
Support
- Subsequence: a sequence whose events are all subsets of events of another sequence, appearing in the same order (but not necessarily consecutively).
- Support for subsequence s = the number of sequences containing s (usually given as a percentage).
- Given a support threshold minSup, subsequences that appear in > minSup sequences are called frequent subsequences.
- Simplest question: find all frequent subsequences.
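A hedged Python sketch of the containment test and support count implied by these definitions (sequences are modeled as lists of item-sets; the representation is an assumption):

```python
def is_subsequence(sub, seq):
    """True if every event of `sub` is a subset of some event of `seq`,
    matched in order but not necessarily consecutively (greedy matching)."""
    i = 0
    for event in seq:
        if i < len(sub) and sub[i] <= event:
            i += 1
    return i == len(sub)

def sequence_support(sub, sequences):
    """Fraction of sequences in the database that contain `sub`."""
    return sum(is_subsequence(sub, s) for s in sequences) / len(sequences)

# Example: <(ab) d> is contained in <(abc) (de) f>
print(is_subsequence([{"a", "b"}, {"d"}],
                     [{"a", "b", "c"}, {"d", "e"}, {"f"}]))   # True
```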
Notation
- Items are letters: a, b, …
- Events are parenthesized: (ab), (bdf), … except for events with a single item.
- Sequences are surrounded by angle brackets < >.
- Every sequence has an identifier, sid.
Example: a database of four sequences (sid 1-4) and the frequent sequences found with minSup = 0.5, together with their support counts (4, 3, 4, 2, …). [The sequences themselves are shown in a table on the slide.]
Motivation
- Customer shopping patterns
- Stock market fluctuations
- Weblog click-stream analysis
- Symptoms of a disease
- DNA sequence analysis
- Weather forecasting
- Machine anti-aging
- Many more…
Much Harder than Frequent Item-sets!
There are 2^(m*n) possible candidates, where m is the number of items and n is the number of transactions in the longest sequence.
The Apriori Property If a sequence is not frequent, then any sequence that contains it cannot be frequent
Constraints
- Problem: there are too many frequent sequences, and most of them are not useful.
- Solution: remove them. Constraints are a way to define usefulness.
- The trick: do so while mining.
Previous Work
- GSP (Srikant and Agrawal, 1996)
  - Generation-and-test, Apriori-based approach.
- SPADE (Zaki, 2001)
  - Generation-and-test, Apriori-based approach.
  - Uses equivalence classes for memory optimization.
  - Uses a vertical-format DB.
- PrefixSpan (Pei, 2004)
  - No candidate generation.
  - Uses a DB-projection method.
Why a New Algorithm?
- A huge set of candidate sequences / projected DBs is generated.
- Multiple scans of the database are needed.
- Inefficient for mining long sequential patterns.
- Domain-specific properties are not exploited.
- Weak support for constraints.
The CAMLS Algorithm
- CAMLS: Constraint-based Apriori algorithm for Mining Long Sequences.
- Designed especially for efficient mining of long sequences.
- Outperforms SPADE and PrefixSpan on both synthetic and real data.
The CAMLS Algorithm
Makes a logical distinction between two types of constraints:
- Intra-event: not time-related (e.g., mutually exclusive items).
- Inter-event: addresses the temporal aspect of the data (e.g., values that can or cannot appear one after the other).
Event-wise Constraints
- An event must / must not contain a specific item.
- Two items cannot occur at the same time.
- max_event_length: an event cannot contain more than a fixed number of items.
Sequence-wise Constraints
- max_sequence_length: a sequence cannot contain more than a fixed number of events.
- max_gap: a pattern is dismissed if the time between consecutive events is too long (see the sketch of both constraint families below).
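One plausible way to express the two constraint families as predicates, sketched in Python (the parameter names follow the slides; everything else is an assumption):

```python
def satisfies_event_constraints(event, required, forbidden_pairs, max_event_length):
    """Intra-event checks: required items, mutually exclusive items,
    and the max_event_length bound."""
    if len(event) > max_event_length:
        return False
    if not required <= event:
        return False
    return not any(a in event and b in event for a, b in forbidden_pairs)

def satisfies_sequence_constraints(event_times, max_sequence_length, max_gap):
    """Inter-event checks: max_sequence_length, and max_gap between
    consecutive events (event_times are the events' timestamps, in order)."""
    if len(event_times) > max_sequence_length:
        return False
    return all(t2 - t1 <= max_gap for t1, t2 in zip(event_times, event_times[1:]))
```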
CAMLS Overview
The input (sequence DB) and the constraints (minSup, maxGap, …) feed the event-wise phase; its output (frequent events + an occurrence index) feeds the sequence-wise phase, which produces the final output (all frequent sequences).
What Do We Get?
The best of both worlds:
- Far fewer candidates are generated.
- The support check is fast.
- Worst case: works like SPADE.
- Tradeoff: uses a bit more memory (for storing the frequent item-sets).
Event-wise Phase
- Input: sequence database and constraints.
- Output: frequent events + occurrence index.
- Uses Apriori or FP-Growth to find the frequent itemsets (both with minor modifications).
Event-wise (pseudocode)
1. L1 = all frequent items
2. for (k = 2; L(k-1) ≠ ∅; k++) do
   a. generateCandidates(L(k-1))   // if two frequent (k-1)-events have the same prefix, merge them to form a new candidate
   b. Lk = pruneCandidates()       // prune, calculate support counts and create the occurrence index
   c. L = L ∪ Lk
3. end for
(Example on a later slide.)
Occurrence Index
- A compact representation of all occurrences of a sequence.
- Structure: a list of sids, each associated with a list of eids.
(Example on the next slide; a small sketch of one possible in-memory layout follows.)
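A minimal sketch of one possible in-memory layout for this index (assumed; the paper's actual structure may differ):

```python
# Occurrence index of some pattern: sid -> sorted list of eids where it occurs.
occurrence_index = {
    1: [0, 5],    # the pattern occurs in sequence 1 at eids 0 and 5
    2: [8],
    3: [0, 11],
}

def support_from_index(index):
    # Support = number of distinct sequences (sids) that contain the pattern.
    return len(index)
```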
Event-wise Example (Using Apriori), minSup = 2

  event  eid  sid
  (acd)   0    1
  (bcd)   5    1
  b      10    1
  a       0    2
  c       4    2
  (bd)    8    2
  (cde)   0    3
  e       7    3
  (acd)  11    3

- All frequent items: a:3, b:2, c:3, d:3.
- Candidates: (ab), (ac), (ad), (bc), …  Support count: (ac):2, (ad):2, (bd):2, (cd):2.
- Candidates: (abc), (abd), (acd), …  Support count: (acd):2.
- No more candidates!
[The slide also shows the occurrence index built for the frequent events.]
Sequence-wise Phase
- Input: frequent events + occurrence index, and constraints.
- Output: all frequent sequences.
- Similar to GSP's and SPADE's candidate-generation phase, except that the frequent itemsets are used as seeds.
Sequence-wise (pseudocode)
1. L1 = all frequent 1-sequences
2. for (k = 2; L(k-1) ≠ ∅; k++) do
   a. generateCandidates(L(k-1))
   b. Lk = pruneAndSupCalc()
   c. L = L ∪ Lk
3. end for
(Candidate generation and pruning are elaborated on the next two slides.)
Sequence-wise Candidate Generation
If two frequent k-sequences s' = <s'1 s'2 … s'k> and s'' = <s''1 s''2 … s''k> share a common (k-1)-prefix and s' is a generator, we form a new candidate <s'1 s'2 … s'k s''k>.
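A hedged sketch of this join step, with sequences modeled as lists of frozensets (the representation and the lack of de-duplication are simplifications):

```python
def generate_candidates(frequent_k):
    """Join two frequent k-sequences that share their first k-1 events into a
    longer candidate by appending the last event of the second sequence.
    (CAMLS additionally requires the first sequence to be a generator; omitted here.)"""
    candidates = []
    for s1 in frequent_k:
        for s2 in frequent_k:
            if s1 is not s2 and s1[:-1] == s2[:-1]:
                candidates.append(s1 + [s2[-1]])
    return candidates
```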
Sequence-wise Pruning
1. Keep a radix-ordered list of the sequences pruned in the current iteration.
2. Within the same iteration, one k-sequence may contain another k-sequence that has already been pruned.
3. For each new candidate:
   a. Check whether it contains a subsequence from the pruned list: very fast!
   b. Otherwise, test it for frequency.
   c. Add it to the pruned list if needed.
Support Calculation
- A simple intersection operation between the occurrence indexes of the forming sequences.
- Once the new occurrence index is formed, the support calculation is trivial (a sketch of this join appears below).
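An assumed sketch of this join of occurrence indexes, using the dict-of-eid-lists layout from the earlier sketch, with an optional maxGap check folded in:

```python
def join_indexes(prefix_index, suffix_index, max_gap=None):
    """Per sid, keep the eids of the suffix event that occur after some occurrence
    of the prefix (and within max_gap of it, when max_gap is given)."""
    joined = {}
    for sid, prefix_eids in prefix_index.items():
        eids = [e for e in suffix_index.get(sid, [])
                if any(e > p and (max_gap is None or e - p <= max_gap)
                       for p in prefix_eids)]
        if eids:
            joined[sid] = eids
    return joined

# The support of the new candidate is then simply len(join_indexes(...)).
```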
The maxGap Constraint
maxGap is a special kind of constraint:
- It is data-dependent, and the Apriori property is not applicable to it.
- The occurrence index enables a fast maxGap check.
- A frequent sequence that does not satisfy maxGap is flagged as a non-generator.
Example: assume <a b> is frequent but the gap between a and b exceeds maxGap, while longer frequent sequences containing both a and b (with intermediate events) satisfy maxGap everywhere. Then <a b> is a non-generator, but it is kept so that those longer sequences are not pruned.
Sequence-wise Example (same database as before; minSup = 2, maxGap = 5)
The frequent events from the event-wise phase are used as seeds, and candidate sequences are generated and checked level by level. The slide walks through the bookkeeping:
- A pruned candidate is added to the pruned list.
- A super-sequence of a pruned sequence is itself pruned (without a support check).
- A sequence that does not pass maxGap is not a generator, but it is kept.
[The concrete candidate sequences, their support counts and their generator ('g') flags are shown in a table on the slide.]
Evaluation (1): Machine Anti-Aging
How can sequence mining help?
- Data collected from a machine is a sequence.
- Discover typical behavior leading to failure.
- Monitor the machine and alert before failure.
Domain: light intensity per wavelength (continuous).
Pre-processing: discretization; meta-features (maxDisc, maxWL, isBurned).
Synm stands for a synthetic database simulating the machine's behavior with m meta-features.
Evaluation (2): Real Stock Data
Rn stands for stock data (10 different stocks) over n days.
CAMLS Compared with PrefixSpan
CAMLS Compared with SPADE and PrefixSpan
So, What Is CAMLS's Contribution?
- The distinction between constraint types: easy implementation.
- Two phases.
- Handling of the maxGap constraint.
- The occurrence index data structure.
- A fast new pruning method.
Future Research
- Main issue: closed sequences.
- More constraints (aspiring to regular-expression constraints).
Thank You!