Association Rule Mining


1 Association Rule Mining
C.Eng 714 Spring 2010

2 What Is Frequent Pattern Analysis?
Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set. First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining.
Motivation: finding inherent regularities in data. What products were often purchased together? (Beer and diapers?!) What are the subsequent purchases after buying a PC? What kinds of DNA are sensitive to this new drug?
Applications: basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.

3 Basic Concepts: Frequent Patterns and Association Rules
Transaction database:
Transaction-id: Items bought
10: A, B, D
20: A, C, D
30: A, D, E
40: B, E, F
50: B, C, D, E, F
Itemset X = {x1, …, xk}. Find all rules X => Y with minimum support and confidence:
support, s: probability that a transaction contains X ∪ Y
confidence, c: conditional probability that a transaction having X also contains Y
(Venn diagram: customers buying beer, buying diaper, and buying both.)
Let sup_min = 50%, conf_min = 50%. Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}. Association rules: A => D (60%, 100%), D => A (60%, 75%)
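The support and confidence figures above can be checked directly. A minimal Python sketch (illustrative only; the function names are my own, not from the slides):

```python
transactions = [
    {"A", "B", "D"},            # 10
    {"A", "C", "D"},            # 20
    {"A", "D", "E"},            # 30
    {"B", "E", "F"},            # 40
    {"B", "C", "D", "E", "F"},  # 50
]

def support(itemset, db):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(lhs, rhs, db):
    """Conditional probability that a transaction with `lhs` also has `rhs`."""
    return support(lhs | rhs, db) / support(lhs, db)

print(support({"A", "D"}, transactions))       # 0.6  -> A => D has 60% support
print(confidence({"A"}, {"D"}, transactions))  # 1.0  -> A => D has 100% confidence
print(confidence({"D"}, {"A"}, transactions))  # 0.75 -> D => A has 75% confidence
```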

4 Closed Patterns and Max-Patterns
A long pattern contains a combinatorial number of sub-patterns; e.g., {a1, …, a100} contains C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27×10^30 sub-patterns!
Solution: mine closed patterns and max-patterns instead.
An itemset X is closed if it is frequent and there exists no super-pattern Y ⊃ X with the same support as X.
An itemset X is a max-pattern if it is frequent and there exists no frequent super-pattern Y ⊃ X.
A closed pattern is a lossless compression of frequent patterns: it reduces the number of patterns and rules.
An itemset X is a generator if it is frequent and there exists no sub-pattern Y ⊂ X with the same support as X.

5 Closed Patterns and Max-Patterns
Example: DB = {<a1, …, a100>, <a1, …, a50>}, minimum support count 1.
What is the set of closed itemsets? <a1, …, a100>: 1 and <a1, …, a50>: 2
What is the set of max-patterns?
What is the set of all frequent patterns?

6 Methods for Mining Frequent Patterns
Apriori rule (the downward closure property): any subset of a frequent itemset must be frequent. If {beer, diaper, nuts} is frequent, so is {beer, diaper}, i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}.
Basic FIM methods: Apriori; frequent-pattern growth (FP-growth, Closet — Han, Pei & Yin); vertical data format approach (SPADE, CHARM)

7 Apriori: A Candidate Generation-and-Test Approach
Apriori pruning principle: if there is any itemset which is infrequent, its supersets should not be generated/tested! (Agrawal & Srikant @VLDB'94; Mannila et al. @KDD'94)
Method:
Initially, scan DB once to get the frequent 1-itemsets
Generate length-(k+1) candidate itemsets from length-k frequent itemsets
Test the candidates against the DB
Terminate when no frequent or candidate set can be generated

8 The Apriori Algorithm—An Example
sup_min = 2. Database TDB: 10: A, C, D; 20: B, C, E; 30: A, B, C, E; 40: B, E
1st scan → C1: {A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3 → L1: {A}: 2, {B}: 3, {C}: 3, {E}: 3
Self-join L1 → C2: {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}; 2nd scan → {A,B}: 1, {A,C}: 2, {A,E}: 1, {B,C}: 2, {B,E}: 3, {C,E}: 2 → L2: {A,C}: 2, {B,C}: 2, {B,E}: 3, {C,E}: 2
Self-join L2 → C3: {B,C,E}; 3rd scan → {B,C,E}: 2 → L3: {B,C,E}: 2

9 The Apriori Algorithm
Pseudo-code:
Ck: candidate itemsets of size k; Lk: frequent itemsets of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
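A compact runnable Python sketch of this loop (an illustration under the slide's definitions, not the course's code; `min_count` is an absolute support count):

```python
from itertools import combinations

def apriori(db, min_count):
    """Return all frequent itemsets of `db` with support count >= min_count."""
    db = [frozenset(t) for t in db]
    items = {i for t in db for i in t}
    # L1: frequent 1-itemsets
    Lk = {frozenset([i]) for i in items
          if sum(1 for t in db if i in t) >= min_count}
    frequent = set(Lk)
    k = 1
    while Lk:
        # generate (k+1)-candidates by joining Lk with itself, then prune any
        # candidate that has an infrequent k-subset (downward closure)
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k))}
        # one pass over the database to count the surviving candidates
        counts = {c: sum(1 for t in db if c <= t) for c in Ck}
        Lk = {c for c, n in counts.items() if n >= min_count}
        frequent |= Lk
        k += 1
    return frequent

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(sorted(sorted(s) for s in apriori(tdb, 2)))
```

Running it on the slide-8 database with min_count = 2 reproduces the same L1, L2, and L3 as the worked example.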

10 Important Details of Apriori
How to generate candidates? Step 1: self-joining Lk; Step 2: pruning.
Example of candidate generation: L3 = {abc, abd, acd, ace, bcd}
Self-joining L3*L3: abcd from abc and abd; acde from acd and ace
Pruning: acde is removed because ade is not in L3
C4 = {abcd}

11 How to Generate Candidates?
Suppose the items in Lk-1 are listed in an order.
Step 1: self-joining Lk-1
    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
Step 2: pruning
    forall itemsets c in Ck do
        forall (k-1)-subsets s of c do
            if (s is not in Lk-1) then delete c from Ck
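A Python rendering of the same join-and-prune step (a sketch, assuming itemsets are kept as lexicographically sorted tuples; not the original pseudocode):

```python
from itertools import combinations

def gen_candidates(L_prev):
    """Self-join + prune; L_prev holds (k-1)-itemsets as sorted tuples."""
    L_prev = sorted(L_prev)
    k_minus_1 = len(L_prev[0])
    joined = []
    # Step 1: join p and q that agree on their first k-2 items
    for p in L_prev:
        for q in L_prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                joined.append(p + (q[-1],))
    # Step 2: prune candidates having an infrequent (k-1)-subset
    freq = set(L_prev)
    return [c for c in joined
            if all(s in freq for s in combinations(c, k_minus_1))]

L3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")]
print(gen_candidates(L3))  # [('a', 'b', 'c', 'd')]; acde is pruned (ade not in L3)
```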

12 How to Count Supports of Candidates?
Why is counting supports of candidates a problem? The total number of candidates can be huge, and one transaction may contain many candidates.
Method: candidate itemsets are stored in a hash-tree; a leaf node contains a list of itemsets and counts, an interior node contains a hash table. The subset function finds all the candidates contained in a transaction.
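A simplified counting sketch (assumed; it replaces the hash-tree with a plain dictionary lookup over the k-subsets of each transaction, which is the operation the hash-tree accelerates):

```python
from itertools import combinations

def count_candidates(db, candidates, k):
    """Count each candidate k-itemset by enumerating k-subsets of transactions."""
    counts = {c: 0 for c in candidates}
    for t in db:
        for subset in combinations(sorted(t), k):
            s = frozenset(subset)
            if s in counts:
                counts[s] += 1
    return counts

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
C2 = [frozenset(p) for p in (("A", "C"), ("B", "C"), ("B", "E"), ("C", "E"))]
print({tuple(sorted(c)): n for c, n in count_candidates(tdb, C2, 2).items()})
# {('A', 'C'): 2, ('B', 'C'): 2, ('B', 'E'): 3, ('C', 'E'): 2}
```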

13 Example: Counting Supports of Candidates
(Figure: a hash tree storing candidate 3-itemsets; for a given transaction, the subset function follows the hashed branches to reach only the leaves whose candidates can be contained in it.)

14 Challenges of Frequent Pattern Mining
Challenges: multiple scans of the transaction database; a huge number of candidates; tedious workload of support counting for candidates.
Improving Apriori, general ideas: reduce the number of transaction-database scans; shrink the number of candidates; facilitate support counting of candidates.

15 Partition: Scan Database Only Twice
Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.
Scan 1: partition the database and find local frequent patterns
Scan 2: consolidate global frequent patterns
A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In VLDB'95

16 Transaction reduction
A transaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets. Therefore, such a transaction can be marked or removed from further consideration because subsequent scans of the database for j-itemsets, where j > k, will not require it.

17 Sampling for Frequent Patterns
Select a sample of the original database and mine frequent patterns within the sample using Apriori (PL: potentially large itemsets).
Find additional candidates using the negative border function: the candidate set is PL ∪ BD⁻(PL), where BD⁻(PL) is the minimal set of itemsets that are not in PL but whose subsets are all in PL.
Scan the database once to verify the frequent itemsets found in the sample and to check only the border of the closure of the frequent patterns (first pass).
If there are uncovered new itemsets in the result C of the scan, iteratively apply BD⁻(C) to expand the border of the closure and scan the database again to find missed frequent patterns (second pass — may or may not happen).
H. Toivonen. Sampling large databases for association rules. In VLDB'96

18 Sampling for Frequent Patterns
Example DB:
t1: {Bread, Jelly, Peanutbutter}
t2: {Bread, Peanutbutter}
t3: {Bread, Milk, Peanutbutter}
t4: {Beer, Bread}
t5: {Beer, Milk}
Support threshold on DB: 40%
Sample: t1: {Bread, Jelly, Peanutbutter}, t2: {Bread, Peanutbutter}
Support threshold on the sample: 10%

19 Sampling for Frequent Patterns
Sample: t1: {Bread, Jelly, Peanutbutter}, t2: {Bread, Peanutbutter}; support threshold 10% → support count 1 on the sample
PL = {{Bread}, {Jelly}, {Peanutbutter}, {Bread, Jelly}, {Bread, Peanutbutter}, {Jelly, Peanutbutter}, {Bread, Jelly, Peanutbutter}}
BD⁻(PL) = {{Beer}, {Milk}}
PL ∪ BD⁻(PL) = {{Bread}, {Jelly}, {Peanutbutter}, {Bread, Jelly}, {Bread, Peanutbutter}, {Jelly, Peanutbutter}, {Bread, Jelly, Peanutbutter}, {Beer}, {Milk}}
C = {{Bread}, {Peanutbutter}, {Bread, Peanutbutter}, {Beer}, {Milk}} (the candidates that are frequent on DB with the 40% support threshold)
Need one more pass
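A small sketch (assumed, not the course code) of the negative-border function BD⁻ used above, reproducing BD⁻(PL) = {{Beer}, {Milk}} for this example; PL is assumed to be downward closed:

```python
from itertools import combinations

def negative_border(PL, items):
    """Minimal itemsets not in PL whose proper subsets are all in PL."""
    PL = {frozenset(s) for s in PL} | {frozenset()}  # treat the empty set as present
    border = set()
    max_k = max(len(s) for s in PL) + 1
    for k in range(1, max_k + 1):
        for c in combinations(sorted(items), k):
            c = frozenset(c)
            if c not in PL and all(frozenset(s) in PL
                                   for s in combinations(c, k - 1)):
                border.add(c)
    return border

PL = [{"Bread"}, {"Jelly"}, {"Peanutbutter"}, {"Bread", "Jelly"},
      {"Bread", "Peanutbutter"}, {"Jelly", "Peanutbutter"},
      {"Bread", "Jelly", "Peanutbutter"}]
items = {"Bread", "Jelly", "Peanutbutter", "Beer", "Milk"}
print([sorted(s) for s in sorted(negative_border(PL, items), key=sorted)])
# [['Beer'], ['Milk']]
```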

20 Sampling for Frequent Patterns
C = {{Bread}, {Peanutbutter}, {Bread, Peanutbutter}, {Beer}, {Milk}}
BD⁻(C) = {{Bread, Beer}, {Bread, Milk}, {Beer, Milk}, {Beer, Peanutbutter}, {Milk, Peanutbutter}}
C = C ∪ BD⁻(C)
BD⁻(C) = {{Bread, Beer, Milk}, {Bread, Beer, Peanutbutter}, {Beer, Milk, Peanutbutter}, {Bread, Milk, Peanutbutter}}
BD⁻(C) = {{Bread, Beer, Milk, Peanutbutter}}
F = {{Bread}, {Peanutbutter}, {Bread, Peanutbutter}, {Beer}, {Milk}} (on DB with the 40% support threshold)

21 Dynamic Itemset Counting (DIC): Reduce Number of Scans
Once both A and D are determined frequent, the counting of AD begins; once all length-2 subsets of BCD are determined frequent, the counting of BCD begins.
(Figure: the itemset lattice over {A, B, C, D}, from {} up to ABCD, contrasting Apriori's level-by-level counting of 1-itemsets, 2-itemsets, … with DIC, which starts counting longer itemsets as soon as their subsets are known to be frequent, while the transactions are still being scanned.)

22 Dynamic Itemset Counting (DIC): Reduce Number of Scans
(Figure: a small example with transactions T1–T4 over items A, B, C, illustrating how DIC starts and stops counters during a single pass.)

23 Mining Frequent Patterns Without Candidate Generation
Build a compact representation (the FP-tree) so that the database does not need to be scanned several times, then find frequent itemsets recursively.

24 Construct FP-tree from a Transaction Database
min_support = 3
TID: Items bought → (ordered) frequent items
100: {f, a, c, d, g, i, m, p} → {f, c, a, m, p}
200: {a, b, c, f, l, m, o} → {f, c, a, b, m}
300: {b, f, h, j, o, w} → {f, b}
400: {b, c, k, s, p} → {c, b, p}
500: {a, f, c, e, l, p, m, n} → {f, c, a, m, p}
Header table (item: frequency): f: 4, c: 4, a: 3, b: 3, m: 3, p: 3
1. Scan DB once, find frequent 1-itemsets (single-item patterns)
2. Sort the frequent items of each transaction in frequency-descending order (the f-list)
3. Scan DB again and insert each ordered transaction into the FP-tree
(Figure: the resulting FP-tree rooted at {}, with main path f:4 – c:3 – a:3 – m:2 – p:2, branches b:1 – m:1 under a and b:1 under f, and a separate path c:1 – b:1 – p:1.)
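A minimal FP-tree construction sketch in Python (assumed, not the course code; class and function names are illustrative). Ties in item frequency are broken alphabetically here, so the f/c order differs from the slide's f-list, which changes the tree shape but not the per-item counts:

```python
from collections import Counter, defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(db, min_support):
    counts = Counter(i for t in db for i in t)
    freq = {i for i, c in counts.items() if c >= min_support}
    root = Node(None, None)
    header = defaultdict(list)  # item -> node-links into the tree
    for t in db:
        # keep only frequent items, ordered by descending global frequency
        ordered = sorted((i for i in t if i in freq),
                         key=lambda i: (-counts[i], i))
        node = root
        for item in ordered:
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header

db = [list("facdgimp"), list("abcflmo"), list("bfhjow"),
      list("bcksp"), list("afcelpmn")]
root, header = build_fp_tree(db, 3)
print({i: sum(n.count for n in header[i]) for i in "fcabmp"})
# {'f': 4, 'c': 4, 'a': 3, 'b': 3, 'm': 3, 'p': 3}
```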

25 Partition Patterns and Databases
Frequent patterns can be partitioned into subsets according to the f-list (f-list = f-c-a-b-m-p): patterns containing p; patterns having m but no p; …; patterns having c but none of a, b, m, p; the pattern f. This partitioning is complete and non-redundant.

26 Find Patterns Having P From P-conditional Database
Starting at the frequent-item header table of the FP-tree, traverse the FP-tree by following the node-links of each frequent item p, and accumulate all transformed prefix paths of item p to form p's conditional pattern base.
Conditional pattern bases (item: cond. pattern base): c: f:3; a: fc:3; b: fca:1, f:1, c:1; m: fca:2, fcab:1; p: fcam:2, cb:1

27 From Conditional Pattern-bases to Conditional FP-trees
For each pattern base: accumulate the count for each item in the base, then construct the FP-tree for the frequent items of the pattern base.
m-conditional pattern base: fca:2, fcab:1 → m-conditional FP-tree: {} – f:3 – c:3 – a:3 (b is dropped, being infrequent in the base)
All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam

28 Recursion: Mining Each Conditional FP-tree
Conditional pattern base of "am": (fc:3) → am-conditional FP-tree: {} – f:3 – c:3
Conditional pattern base of "cm": (f:3) → cm-conditional FP-tree: {} – f:3
Conditional pattern base of "cam": (f:3) → cam-conditional FP-tree: {} – f:3
(All derived recursively from the m-conditional FP-tree {} – f:3 – c:3 – a:3.)

29 A Special Case: Single Prefix Path in FP-tree
Suppose a (conditional) FP-tree T has a shared single prefix path P. Mining can then be decomposed into two parts: reduction of the single prefix path into one node, and concatenation of the mining results of the two parts.
(Figure: a tree whose root path a1:n1 – a2:n2 – a3:n3 is a single prefix shared by branches b1:m1, c1:k1, c2:k2, c3:k3; it is split into the prefix part r1 and the branching part, which are mined separately.)

30 Mining Frequent Patterns With FP-trees
Idea: frequent pattern growth — recursively grow frequent patterns by pattern and database partition.
Method: for each frequent item, construct its conditional pattern base and then its conditional FP-tree; repeat the process on each newly created conditional FP-tree until the resulting FP-tree is empty or contains only one path — a single path generates all the combinations of its sub-paths, each of which is a frequent pattern.

31 Scaling FP-growth by DB Projection
FP-tree cannot fit in memory? — DB projection: first partition the database into a set of projected DBs, then construct and mine an FP-tree for each projected DB. Parallel projection vs. partition projection techniques: parallel projection is space-costly.

32 Partition-based Projection
(Figure: the transaction DB {fcamp, fcabm, fb, cbp, fcamp} and its projected databases — e.g., the p-projected DB {fcam, cb, fcam} and the m-, b-, a-, c-, and f-projected DBs, down to the am- and cm-projected DBs.) Parallel projection needs a lot of disk space; partition projection saves it.

33 FP-Growth vs. Apriori: Scalability With the Support Threshold
(Figure: run time of FP-growth vs. Apriori as the support threshold decreases, on data set T25I20D10K.)

34 MaxMiner: Mining Max-patterns
Tid: Items — 10: A, B, C, D, E; 20: B, C, D, E; 30: A, C, D, F
1st scan: find frequent items A, B, C, D, E
2nd scan: find support for AB, AC, AD, AE, ABCDE; BC, BD, BE, BCDE; CD, CE, CDE; DE (potential max-patterns: ABCDE, BCDE, CDE, DE)
Since BCDE is a max-pattern, there is no need to check BCD, BDE, CDE in a later scan.
R. Bayardo. Efficiently mining long patterns from databases. In SIGMOD'98

35 Mining by Exploring Vertical Data Format
Vertical format: t(AB) = {T11, T25, …} — the tid-list is the list of transaction ids containing an itemset.
Deriving closed patterns based on vertical intersections: t(X) = t(Y) means X and Y always happen together; t(X) ⊆ t(Y) means a transaction having X always has Y.
Using diffsets to accelerate mining: only keep track of differences of tid-lists. If t(X) = {T1, T2, T3} and t(XY) = {T1, T3}, then diffset(XY, X) = {T2}.
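A tiny sketch (assumed) of the vertical representation on the slide-8 database: itemsets are grown by intersecting tid-lists, and a diffset stores what the intersection removed:

```python
def tidlists(db):
    """Map each item to the set of transaction ids containing it."""
    t = {}
    for tid, items in db.items():
        for i in items:
            t.setdefault(i, set()).add(tid)
    return t

db = {10: {"A", "C", "D"}, 20: {"B", "C", "E"}, 30: {"A", "B", "C", "E"}, 40: {"B", "E"}}
t = tidlists(db)
t_CE = t["C"] & t["E"]        # tid-list of the itemset CE
print(sorted(t_CE))           # [20, 30] -> support count 2
print(sorted(t["C"] - t_CE))  # [10]     -> diffset(CE, C): tids having C but not CE
```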

36 Parallel And Distributed Algorithms
Data parallelism: parallelize the data. Reduced communication cost compared to task parallelism (only initial candidates and local counts are exchanged), but the memory at each processor should be large enough to hold all candidates at each scan.
Task parallelism: parallelize the candidates. Candidates are partitioned and counted separately, so they can fit in memory, but both the candidates and the local sets of transactions must be passed around.

37 Parallel And Distributed Algorithms
Data parallelism: Count Distribution Algorithm (CDA)
Pass k = 1: 1. generate local C1 from the local data Di; 2. merge the local candidate sets to find C1
Pass k > 1: 1. generate the complete Ck from Lk-1; 2. count the local data Di to find support for Ck; 3. exchange local counts for Ck to obtain the global counts for Ck; 4. compute Lk from Ck

38 Parallel And Distributed Algorithms
Task parallelism: Data Distribution Algorithm (DDA) — partition both the data and the candidate sets
1. Generate Ck from Lk-1; retain |Ck| / P candidates locally (P processors)
2. Count the local Ck using both local and remote data
3. Calculate the local Lk using the local Ck and synchronize

39 Sequential Associations
Sequence: an ordered list of itemsets. I = {i1, i2, …, im} is the set of items; S = <s1, s2, …, sn>, where each si ⊆ I, e.g., <{A,B}, {B,C}, {C}>.
A subsequence of a given sequence is one that can be obtained by removing some items, and any resulting empty itemsets, from the original sequence.
e.g., <{A}, {C}> is a subsequence of <{A,B}, {B,C}, {C}>; <{A,C}, {B}> is NOT a subsequence of the above sequence.
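A small sketch (assumed) of the subsequence test: each itemset of the candidate must be contained in a distinct, later element of the sequence, matched greedily:

```python
def is_subsequence(sub, seq):
    """True if `sub` can be obtained from `seq` by removing items/itemsets."""
    i = 0
    for itemset in seq:
        if i < len(sub) and sub[i] <= itemset:  # greedy earliest match
            i += 1
    return i == len(sub)

seq = [{"A", "B"}, {"B", "C"}, {"C"}]
print(is_subsequence([{"A"}, {"C"}], seq))       # True
print(is_subsequence([{"A", "C"}, {"B"}], seq))  # False
```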

40 Sequential Associations
Support(s): percentage of sessions (customers) whose session sequence contains the sequence s.
Confidence(s → t): ratio of the number of sessions that contain both sequences s and t to the number that contain s.
Customer: Time: Itemset — C1: 10: AB, 20: BC, 30: D; C2: 15: ABC; C3: ACD
Example: s = <{A},{C}>, Support(s) = 1/3; u = <{B,C},{D}>, Support(u) = 2/3

41 Sequential Associations
Basic algorithms:
AprioriAll — the basic algorithm for finding sequential associations
GSP — more efficient than AprioriAll; includes extensions: sliding window, concept hierarchies
SPADE — improves performance by using a vertical table structure and equivalence classes
Episode mining — events are ordered by occurrence time; uses a window

42 Sequential Associations
AprioriAll Support for a sequence: Fraction of total customers that support a sequence. Maximal Sequence: A sequence that is not contained in any other sequence. Large Sequence: Sequence that meets min_sup. June 26, 2018

43 Sequential Associations - Example
Customer sequences (Customer ID: Customer Sequence):
1: { (30) (90) }
2: { (10 20) (30) (40 60 70) }
3: { (30 50 70) }
4: { (30) (40 70) (90) }
5: { (90) }
Transactions (Customer ID: Transaction Time: Items Bought):
1: June 25 '93: 30; June 30 '93: 90
2: June 10 '93: 10, 20; June 15 '93: 30; June 20 '93: 40, 60, 70
3: 30, 50, 70
4: 30; 40, 70; July 25 '93: 90
5: June 12 '93: 90
With min_sup = 25%, the maximal sequences with support > 25% are { (30) (90) } and { (30) (40 70) }. { (10 20) (30) } does not satisfy min_sup (it is supported only by customer 2); { (30) }, { (70) }, { (30) (40) }, … are large but not maximal.

44 AprioriAll- Litemset Phase
Litemset (large itemset): an itemset supported by a fraction of customers greater than min_sup.
Large itemsets and their mappings: (30) → 1, (40) → 2, (70) → 3, (40 70) → 4, (90) → 5

45 AprioriAll- Transformation Phase
Replace each transaction with all litemsets contained in the transaction. Transactions with no litemsets are dropped (but are still considered for support counts).
Customer 1: { (30) (90) } → <{(30)} {(90)}> → after mapping <{1} {5}>
Customer 2: { (10 20) (30) (40 60 70) } → <{(30)} {(40), (70), (40 70)}> → <{1} {2,3,4}>
Customer 3: { (30 50 70) } → <{(30), (70)}> → <{1,3}>
Customer 4: { (30) (40 70) (90) } → <{(30)} {(40), (70), (40 70)} {(90)}> → <{1} {2,3,4} {5}>
Customer 5: { (90) } → <{(90)}> → <{5}>
Note: (10 20) is dropped because of lack of support; (40 60 70) is replaced with the set of litemsets {(40), (70), (40 70)} (60 does not have min_sup).

46 AprioriAll (Example)
Customer sequences (after transformation): <{1 5} {2} {3} {4}>, <{1} {3} {4} {3 5}>, <{1} {2} {3} {4}>, <{1} {3} {5}>, <{4} {5}>; min_sup = 25% (support count 2)
L1 (1-seq: support): <1>: 4, <2>: 2, <3>: 4, <4>: 4, <5>: 4
L2 (2-seq: support): <1 2>: 2, <1 3>: 4, <1 4>: 3, <1 5>: 2, <2 3>: 2, <2 4>: 2, <3 4>: 3, <3 5>: 2, <4 5>: 2
L3 (3-seq: support): <1 2 3>: 2, <1 2 4>: 2, <1 3 4>: 3, <1 3 5>: 2, <2 3 4>: 2 (<3 4 5> has support 1 and is not large)
L4 (4-seq: support): <1 2 3 4>: 2
Maximal: <1 2 3 4>, <1 3 5>, <4 5>

47 Sequential Associations
GSP improves the pattern generation process and extends the problem with concept hierarchies (taxonomies), a sliding window, and time constraints.

48 Sequential Associations
GSP example: min-sup count = 2
Taxonomy: Science Fiction → Asimov → {Foundation, Foundation and Empire, Second Foundation}; Science Fiction → Niven → {Ringworld, Ringworld Engineers}; Spy → Le Carre → {Perfect Spy, Smiley's People}
Data (Sequence-Id: Transaction Time: Items):
C1: 1: Ringworld; 2: Foundation; 15: Ringworld Engineers, Second Foundation
C2: Foundation, Ringworld; 20: Foundation and Empire; 50: Ringworld Engineers
Results: plain min-sup → <(Ringworld) (Ringworld Engineers)>; with sliding window = 7 days → <(Foundation, Ringworld) (Ringworld Engineers)>; with max-gap = 30 days → no patterns; with the taxonomy → <(Foundation) (Asimov)>

49 Sequential Associations
GSP initially finds the frequent 1-element sequences. Each subsequent pass starts with a seed set (the frequent sequences found in the previous pass); each candidate sequence has one more item than a seed sequence, so all candidate sequences in the same pass have the same number of items.
Candidate generation:
Join: s1 joins with s2 if the subsequence obtained by dropping the first item of s1 is the same as the subsequence obtained by dropping the last item of s2; s1 is extended with the last item of s2.
Prune: delete a candidate if it has an infrequent contiguous subsequence.

50 Sequential Associations
min_sup = 2. Sequence database (Seq. ID: Sequence): 10: <(bd)cb(ac)>; 20: <(bf)(ce)b(fg)>; 30: <(ah)(bf)abf>; 40: <(be)(ce)d>; 50: <a(bd)bcb(ade)>
GSP initial candidates: <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>. Scan the database once and count support for the candidates: <a>: 3, <b>: 5, <c>: 4, <d>: 3, <e>: 3, <f>: 2, <g>: 1, <h>: 1


52 Sequential Associations
GSP: 51 length-2 candidates — the 36 ordered pairs <xy> over the six frequent items a–f (<aa>, <ab>, …, <ff>) plus the 15 within-element pairs <(xy)> (<(ab)>, …, <(ef)>).
Without the Apriori property there would be 8*8 + 8*7/2 = 92 candidates; Apriori pruning removes 44.57% of the candidates.

53 Sequential Associations
GSP scan by scan (min_sup = 2, same sequence database as above):
1st scan: 8 candidates (<a> … <h>), 6 length-1 sequential patterns
2nd scan: 51 candidates, 19 length-2 sequential patterns; 10 candidates do not appear in the DB at all
3rd scan: 46 candidates, 19 length-3 sequential patterns; 20 candidates not in the DB at all
4th scan: 8 candidates, 6 length-4 sequential patterns
5th scan: 1 candidate, 1 length-5 sequential pattern
Candidates are eliminated either because they cannot pass the support threshold or because they do not appear in the DB at all.

54 Sequential Associations
GSP example:
L3 (frequent 3-sequences): <(1,2) (3)>, <(1,2) (4)>, <(1) (3,4)>, <(1,3) (5)>, <(2) (3,4)>, <(2) (3) (5)>
C4 after join: <(1,2) (3,4)>, <(1,2) (3) (5)>
C4 after pruning: <(1,2) (3,4)> (the candidate <(1,2) (3) (5)> is dropped because its contiguous subsequence <(1) (3) (5)> is not frequent)

55 Sequential Associations
SPADE (Sequential PAttern Discovery using Equivalence classes), developed by Zaki (2001): a vertical-format sequential pattern mining method. The sequence database is mapped to a large set of entries of the form Item: <SID, EID>. Sequential pattern mining is performed by growing the subsequences (patterns) one item at a time, using Apriori-style candidate generation.

56 Sequential Associations - SPADE

57 Sequential Associations
Episode mining. Event sequence s: <(A1, t1), (A2, t2), …, (An, tn)>, where Ai is an event and ti its occurrence time; Ts is the starting time, Te the ending time, and the sequence is considered to be in [Ts, Te).
Example: (<(A,1), (B,2), (C,2), (A,7), (C,10), (B,10), (C,12), (A,12), (C,13), (D,14), (C,14), (E,20)>, 1, 21)
Time window (w, ts, te): includes the events occurring between ts and te; win is the window size.
Episodes: partially ordered collections of events. Serial episodes: A → B. Parallel episodes: A & B. Episodes can also be neither parallel nor serial, e.g., (A | B) C* (D → E).

58 Sequential Associations
Support (frequency) of an episode: the ratio of windows in which the episode occurs, s(e) = |windows including e| / |all windows|, where |all windows| = Te − Ts + win − 1.
Example: (<(A,1), (B,2), (C,2), (A,7), (C,10), (B,10), (C,12), (A,12), (C,13), (D,14), (C,14), (E,20)>, 1, 21), support threshold = 30%, win = 3:
s(<A>) = 9/22, s(A & C) = 6/22, s(A → C) = 5/22

59 Visualization of Association Rules: Plane Graph
(Figure: plane-graph visualization of association rules.)

60 Mining Various Kinds of Association Rules
Mining multilevel association Mining multidimensional association Mining quantitative association Mining interesting correlation patterns

61 Mining Multiple-Level Association Rules
Items often form hierarchies. Flexible support settings: items at a lower level are expected to have lower support. Exploration of shared multi-level mining, with uniform support vs. reduced support:
Level 1 (min_sup = 5%): Milk [support = 10%]
Level 2 (min_sup = 3%): 2% Milk [support = 6%], Skim Milk [support = 4%]

62 Multi-level Association: Redundancy Filtering
Some rules may be redundant due to "ancestor" relationships between items. Example: milk ⇒ wheat bread [support = 8%, confidence = 70%] and 2% milk ⇒ wheat bread [support = 2%, confidence = 72%]. We say the first rule is an ancestor of the second rule. A rule is redundant if its support is close to the "expected" value, based on the rule's ancestor.

63 Mining Multi-Dimensional Association
Single-dimensional rules: buys(X, "milk") ⇒ buys(X, "bread")
Multi-dimensional rules: ≥ 2 dimensions or predicates
Inter-dimension association rules (no repeated predicates): age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke")
Hybrid-dimension association rules (repeated predicates): age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke")

64 Static Discretization of Quantitative Attributes
Attributes are discretized prior to mining using a concept hierarchy: numeric values are replaced by ranges. In a relational database, finding all frequent k-predicate sets requires k or k+1 table scans. A data cube is well suited for mining: the cells of an n-dimensional cuboid correspond to the predicate sets, and mining from data cubes can be much faster.
(Figure: the lattice of cuboids over (age, income, buys), from the apex () down to (age, income, buys).)

65 Mining Other Interesting Patterns
Flexible support constraints (Wang et al. @VLDB'02): some items (e.g., diamonds) may occur rarely but are valuable; customized sup_min specification and application.
Top-k closed frequent patterns (Han et al. @ICDM'02): it is hard to specify sup_min, but top-k with a minimum length is more desirable; dynamically raise sup_min during FP-tree construction and mining, and select the most promising path to mine.

66 Interestingness Measure: Correlations (Lift)
play basketball ⇒ eat cereal [40%, 66.7%] is misleading: the overall percentage of students eating cereal is 75% > 66.7%. play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence.
Measure of dependent/correlated events: lift (also called correlation or interest), lift(A, B) = P(A ∧ B) / (P(A) P(B)).
Contingency table:
            Basketball | Not basketball | Sum (row)
Cereal      2000       | 1750           | 3750
Not cereal  1000       | 250            | 1250
Sum (col.)  3000       | 2000           | 5000
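A quick check of these numbers (illustrative):

```python
def lift(n_ab, n_a, n_b, n):
    """lift(A, B) = P(A and B) / (P(A) * P(B)), from contingency-table counts."""
    return (n_ab / n) / ((n_a / n) * (n_b / n))

n = 5000
print(round(lift(2000, 3000, 3750, n), 2))  # basketball & cereal     -> 0.89 (< 1)
print(round(lift(1000, 3000, 1250, n), 2))  # basketball & not cereal -> 1.33 (> 1)
```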

67 Interestingness Measure: Conviction
Conviction measures the independence of the negation of the implication: Conviction(A → B) = P(A) P(¬B) / P(A, ¬B). A value of 1 means A and B are not related; ∞ means the rule always holds.
Example (using the basketball/cereal table above): Conviction(basketball → cereal) = (3000/5000) * (1250/5000) / (1000/5000) = 0.75

68 Interestingness Measure: Conviction
Conviction(A → B) = P(A) P(¬B) / P(A, ¬B)
Example: t1: {Bread, Jelly, Peanutbutter}, t2: {Bread, Peanutbutter}, t3: {Bread, Milk, Peanutbutter}, t4: {Beer, Bread}, t5: {Beer, Milk}
Conviction(Peanutbutter → Bread) = (3/5 * 1/5) / 0 = ∞
Conviction(Bread → Peanutbutter) = (4/5 * 2/5) / (1/5) = 8/5 > 1
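The same two conviction values, checked with a short sketch (assumed helper, not from the slides):

```python
import math

transactions = [
    {"Bread", "Jelly", "Peanutbutter"},
    {"Bread", "Peanutbutter"},
    {"Bread", "Milk", "Peanutbutter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]

def conviction(lhs, rhs, db):
    """Conviction(lhs -> rhs) = P(lhs) * P(not rhs) / P(lhs, not rhs)."""
    n = len(db)
    p_lhs = sum(1 for t in db if lhs <= t) / n
    p_not_rhs = sum(1 for t in db if not rhs <= t) / n
    p_lhs_not_rhs = sum(1 for t in db if lhs <= t and not rhs <= t) / n
    return math.inf if p_lhs_not_rhs == 0 else p_lhs * p_not_rhs / p_lhs_not_rhs

print(conviction({"Peanutbutter"}, {"Bread"}, transactions))  # inf (rule always holds)
print(conviction({"Bread"}, {"Peanutbutter"}, transactions))  # 1.6 (= 8/5 > 1)
```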

69 Interestingness Measure: f-measure
f-measure = ((1 + α²) * support * confidence) / ((α² * confidence) + support), where α takes a value in [0, 1]; when α = 0, support has minimal effect (the measure reduces to the confidence).
Another representation: F-metric = ((β² + 1) * confidence * support) / ((β² * confidence) + support)

70 Are lift and χ² Good Measures of Correlation?
"Buy walnuts ⇒ buy milk [1%, 80%]" is misleading if 85% of customers buy milk. Support and confidence are not good at representing correlations, and lift and χ² are not good measures for correlations in large transactional DBs either.
Contingency table: rows Coffee / No Coffee / Sum (col.), columns Milk / No Milk / Sum (row), with cells m,c; ~m,c; m,~c; ~m,~c.
DB: m,c | ~m,c | m,~c | ~m,~c | lift | all-conf | coh | χ²
A1: 1000 | 100 | 100 | 10,000 | 9.26 | 0.91 | 0.83 | 9055
A2: 100 | 1000 | 1000 | 100,000 | 8.44 | 0.09 | 0.05 | 670
A3: 1000 | 100 | 10,000 | 100,000 | 9.18 | 0.09 | 0.09 | 8172
A4: 1000 | 1000 | 1000 | 1000 | 1 | 0.5 | 0.33 | 0

71 Which Measures Should Be Used?
all-conf or coherence could be good measures; both all-conf and coherence have the downward closure property.

72 Constraints in Data Mining
Knowledge type constraint: classification, association, etc.
Data constraint — using SQL-like queries: find product pairs sold together in stores in Chicago in Dec. '02
Dimension/level constraint: in relevance to region, price, brand, customer category
Rule (or pattern) constraint: small sales (price < $10) triggers big sales (sum > $200)
Interestingness constraint: strong rules, min_support ≥ 3%, min_confidence ≥ 60%

73 Anti-Monotonicity in Constraint Pushing
TDB (min_sup = 2): 10: a, b, c, d, f; 20: b, c, d, f, g, h; 30: a, c, d, e, f; 40: c, e, f, g
Item profits: a: 40, c: −20, d: 10, e: −30, f: 30, g: 20, h: −10
Anti-monotonicity: when an itemset S violates the constraint, so does any of its supersets. sum(S.price) ≤ v is anti-monotone; sum(S.price) ≥ v is not anti-monotone.
Example. C: range(S.profit) ≤ 15 is anti-monotone: itemset ab violates C, and so does every superset of ab.
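A toy sketch (assumed; the items and prices are made up, not the profit table above) of pushing an anti-monotone constraint into candidate testing:

```python
# Hypothetical items and prices, chosen only to illustrate the pruning rule.
price = {"A": 1, "B": 2, "C": 3, "D": 4}

def violates(itemset, v=5):
    """sum(S.price) <= v is anti-monotone: once violated, it stays violated."""
    return sum(price[i] for i in itemset) > v

candidates = [{"A", "B"}, {"A", "C"}, {"B", "C"}, {"C", "D"}]
kept = [c for c in candidates if not violates(c)]
print([sorted(c) for c in kept])
# [['A', 'B'], ['A', 'C'], ['B', 'C']] -- {'C','D'} is pruned, and no superset
# of it ever needs to be generated or counted
```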

74 Monotonicity for Constraint Pushing
Monotonicity: when an itemset S satisfies the constraint, so does any of its supersets. sum(S.price) ≥ v is monotone; min(S.price) ≤ v is monotone.
Example. C: range(S.profit) ≥ 15 is monotone: itemset ab satisfies C, and so does every superset of ab. (Same TDB and item profits as on the previous slide.)

75 Succinctness
Succinctness: given A1, the set of items satisfying a succinctness constraint C, any set S satisfying C is based on A1, i.e., S contains a subset belonging to A1.
Idea: whether an itemset S satisfies constraint C can be determined based on the selection of items, without looking at the transaction database.
min(S.price) ≤ v is succinct; sum(S.price) ≥ v is not succinct.
Optimization: if C is succinct, C is pre-counting pushable.

76 The Apriori Algorithm — Example
(Figure: the Apriori flow on an example database D — Scan D → C1 → L1 → C2 → Scan D → L2 → C3 → Scan D → L3.)

77 Naïve Algorithm: Apriori + Constraint
(Figure: the same Apriori flow as above, run with the constraint Sum{S.price} < 5.)

78 The Constrained Apriori Algorithm: Push an Anti-monotone Constraint Deep
(Figure: the Apriori flow with the anti-monotone constraint Sum{S.price} < 5 pushed into candidate pruning at each level.)

79 The Constrained Apriori Algorithm: Push a Succinct Constraint Deep
(Figure: the Apriori flow with the succinct constraint min{S.price} <= 1 pushed deep; itemsets that violate it are marked "not immediately to be used".)

80 What Constraints Are Convertible?
Constraint | Convertible anti-monotone | Convertible monotone | Strongly convertible
avg(S) ≤ v, ≥ v | Yes | Yes | Yes
median(S) ≤ v, ≥ v | Yes | Yes | Yes
sum(S) ≤ v (items could be of any value, v ≥ 0) | Yes | No | No
sum(S) ≤ v (items could be of any value, v ≤ 0) | No | Yes | No
sum(S) ≥ v (items could be of any value, v ≥ 0) | No | Yes | No
sum(S) ≥ v (items could be of any value, v ≤ 0) | Yes | No | No
……

81 Constraint-Based Mining—A General Picture
Constraint | Antimonotone | Monotone | Succinct
v ∈ S | no | yes | yes
S ⊇ V | no | yes | yes
S ⊆ V | yes | no | yes
min(S) ≤ v | no | yes | yes
min(S) ≥ v | yes | no | yes
max(S) ≤ v | yes | no | yes
max(S) ≥ v | no | yes | yes
count(S) ≤ v | yes | no | weakly
count(S) ≥ v | no | yes | weakly
sum(S) ≤ v (∀a ∈ S, a ≥ 0) | yes | no | no
sum(S) ≥ v (∀a ∈ S, a ≥ 0) | no | yes | no
range(S) ≤ v | yes | no | no
range(S) ≥ v | no | yes | no
avg(S) θ v, θ ∈ {=, ≤, ≥} | convertible | convertible | no
support(S) ≥ ξ | yes | no | no
support(S) ≤ ξ | no | yes | no

