1
Data Mining: Concepts and Techniques
Mining Frequent Patterns & Association Rules Dr. Maher Abuhamdeh
2
What Is Frequent Pattern Analysis?
Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set.
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining.
Motivation: finding inherent regularities in data:
- What products were often purchased together? Tea and diapers?!
- What are the subsequent purchases after buying a PC?
- What kinds of DNA are sensitive to this new drug?
- Can we automatically classify web documents?
Applications: basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.
3
Basic Concepts: Frequent Patterns
Tid  Items bought
10   Tea, Nuts, Diaper
20   Tea, Coffee, Diaper
30   Tea, Diaper, Eggs
40   Nuts, Eggs, Milk
50   Nuts, Coffee, Diaper, Eggs, Milk

- itemset: a set of one or more items
- k-itemset: X = {x1, ..., xk}
- (absolute) support, or support count, of X: frequency of occurrence of the itemset X
- (relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
- An itemset X is frequent if X's support is no less than a minsup threshold
(Venn diagram: customers who buy Tea, customers who buy diapers, and customers who buy both.)
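The two support notions above can be checked directly on the slide's transaction table. A minimal sketch (the function names `support_count` and `support` are mine, not from the slides):

```python
# Transaction table from the slide (Tid -> items bought).
db = {
    10: {"Tea", "Nuts", "Diaper"},
    20: {"Tea", "Coffee", "Diaper"},
    30: {"Tea", "Diaper", "Eggs"},
    40: {"Nuts", "Eggs", "Milk"},
    50: {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
}

def support_count(itemset, db):
    """Absolute support: number of transactions containing the itemset."""
    return sum(1 for items in db.values() if itemset <= items)

def support(itemset, db):
    """Relative support: fraction of transactions containing the itemset."""
    return support_count(itemset, db) / len(db)

print(support_count({"Tea", "Diaper"}, db))  # 3
print(support({"Tea", "Diaper"}, db))        # 0.6
```

With minsup = 60%, {Tea, Diaper} would be frequent here, since its relative support is exactly 0.6.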
4
Basic Concepts: Association Rules
Tid  Items bought
10   Tea, Nuts, Diaper
20   Tea, Coffee, Diaper
30   Tea, Diaper, Eggs
40   Nuts, Eggs, Milk
50   Nuts, Coffee, Diaper, Eggs, Milk

Find all rules X → Y with minimum support and confidence:
- support, s: probability that a transaction contains X ∪ Y
- confidence, c: conditional probability that a transaction containing X also contains Y
Let minsup = 50%, minconf = 50%. Frequent patterns: Tea:3, Nuts:3, Diaper:4, Eggs:3, {Tea, Diaper}:3
Example rules: Tea → Diaper (support 60%, confidence 100%); Diaper → Tea (support 60%, confidence 75%)
5
Rule Measures: Support & Confidence
Simple formulas:
Confidence(A → B) = (#tuples containing both A and B) / (#tuples containing A) = P(B|A) = P(A ∪ B) / P(A)
Support(A → B) = (#tuples containing both A and B) / (total number of tuples) = P(A ∪ B)
(Here P(A ∪ B) denotes the probability that a transaction contains every item of both A and B.)
What do they actually mean? Find all rules X ∧ Y → Z with minimum confidence and support:
- support, s: probability that a transaction contains {X, Y, Z}
- confidence, c: conditional probability that a transaction containing {X, Y} also contains Z
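The confidence formula above is just a ratio of two support counts. A minimal sketch over the earlier Tea/Diaper table (the helper name `confidence` is mine):

```python
# Transaction table from the earlier slide.
db = {
    10: {"Tea", "Nuts", "Diaper"},
    20: {"Tea", "Coffee", "Diaper"},
    30: {"Tea", "Diaper", "Eggs"},
    40: {"Nuts", "Eggs", "Milk"},
    50: {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
}

def confidence(X, Y, db):
    """confidence(X -> Y) = count(X and Y together) / count(X)."""
    both = sum(1 for t in db.values() if X | Y <= t)
    x_only = sum(1 for t in db.values() if X <= t)
    return both / x_only

# Every transaction with Tea also has Diaper: confidence 3/3 = 1.0
print(confidence({"Tea"}, {"Diaper"}, db))   # 1.0
# Diaper appears in 4 transactions, 3 of them with Tea: 3/4 = 0.75
print(confidence({"Diaper"}, {"Tea"}, db))   # 0.75
```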
6
Support & Confidence : An Example
Let minimum support = 50% and minimum confidence = 50%; then we have A → C (support 50%, confidence 66.6%) and C → A (support 50%, confidence 100%).
7
Apriori: A Candidate Generation & Test Approach
Apriori pruning principle: if an itemset is infrequent, none of its supersets can be frequent, so they need not be generated or tested.
Method:
- Initially, scan the DB once to get the frequent 1-itemsets
- Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
- Test the candidates against the DB
- Terminate when no frequent or candidate set can be generated
8
Frequent Itemset Generation
Given d items, there are 2^d possible candidate itemsets.
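The 2^d count can be verified by brute-force enumeration for a small d (a minimal sketch; the item names are made up):

```python
from itertools import combinations

d = 5
items = [f"item{j}" for j in range(d)]

# Enumerate every possible itemset (including the empty set) and count them.
n_itemsets = sum(1 for k in range(d + 1) for _ in combinations(items, k))
print(n_itemsets, 2 ** d)  # 32 32
```

Even at d = 30 this is over a billion itemsets, which is why Apriori's pruning of the lattice matters.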
9
Illustrating Apriori Principle
(Itemset lattice figure: once an itemset is found to be infrequent, all of its supersets are pruned.)
10
Illustrating Apriori Principle, Cont.
11
The Apriori Algorithm—An Example
Supmin = 2
Database TDB:
Tid  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1: {A}:2, {B}:3, {C}:3, {E}:3
C2 (generated from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → C2 counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2
C3 (generated from L2): {B,C,E}
3rd scan → {B,C,E}:2
L3: {B,C,E}:2
12
The Apriori Algorithm (Pseudo-Code)
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk ≠ ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
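The pseudocode above can be sketched as a self-contained Python function (a minimal sketch; the function name `apriori` and the dictionary-based bookkeeping are my choices, not from the slides). It is run here on the Database TDB example with minsup count = 2:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # L1: count single items, keep those meeting min_support.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)
    k = 1
    while current:
        # Self-join: unite pairs of frequent k-itemsets into (k+1)-candidates,
        # pruning any candidate with an infrequent k-subset.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k + 1:
                if all(frozenset(s) in current
                       for s in combinations(union, k)):
                    candidates.add(union)
        # Count the surviving candidates against the database.
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent

# Database TDB from the example slide (Supmin = 2):
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
freq = apriori(tdb, 2)
print(freq[frozenset({"B", "C", "E"})])  # 2
```

The result reproduces the slide's scans: {D} and {A,B} are pruned, and {B,C,E} survives with support 2.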
13
Implementation of Apriori
How to generate candidates?
- Step 1: self-joining Lk
- Step 2: pruning
Example of candidate generation:
L3 = {abc, abd, acd, ace, bcd}
Self-joining L3 * L3: abcd from abc and abd; acde from acd and ace
Pruning: acde is removed because ade is not in L3
C4 = {abcd}
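The two steps can be sketched directly on the L3 example above (a minimal sketch; the function name `gen_candidates` is mine):

```python
from itertools import combinations

def gen_candidates(Lk, k):
    """Self-join Lk with itself, then prune candidates
    that have an infrequent k-subset."""
    joined = set()
    for a, b in combinations(Lk, 2):
        union = a | b
        if len(union) == k + 1:       # self-join step
            joined.add(union)
    # Pruning step: every k-subset of a candidate must be in Lk.
    return {c for c in joined
            if all(frozenset(s) in Lk for s in combinations(c, k))}

L3 = {frozenset(s) for s in ("abc", "abd", "acd", "ace", "bcd")}
C4 = gen_candidates(L3, 3)
print(sorted("".join(sorted(c)) for c in C4))  # ['abcd']
```

As on the slide, acde is produced by the join but removed in pruning (ade is not in L3), leaving C4 = {abcd}.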
14
Example
TID   List of item_IDs
T100  I1, I2, I5
T200  I2, I4
T300  I2, I3
T400  I1, I2, I4
T500  I1, I3
T600  I2, I3
T700  I1, I3
T800  I1, I2, I3, I5
T900  I1, I2, I3
15
Generation of CK and LK (min.supp. count=2)
Scan D to count each candidate → C1: {I1}:6, {I2}:7, {I3}:6, {I4}:2, {I5}:2
Compare candidate support counts with the minimum support count → L1: {I1}:6, {I2}:7, {I3}:6, {I4}:2, {I5}:2
Generate C2 candidates from L1: {I1,I2}, {I1,I3}, {I1,I4}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I3,I4}, {I3,I5}, {I4,I5}
Scan D → C2 counts: {I1,I2}:4, {I1,I3}:4, {I1,I4}:1, {I1,I5}:2, {I2,I3}:4, {I2,I4}:2, {I2,I5}:2, {I3,I4}:0, {I3,I5}:1, {I4,I5}:0
Compare with the minimum support count → L2: {I1,I2}:4, {I1,I3}:4, {I1,I5}:2, {I2,I3}:4, {I2,I4}:2, {I2,I5}:2
16
Generation of CK and LK (min.supp. count=2)
Generate C3 candidates from L2: {I1,I2,I3}, {I1,I2,I5}
Scan D → C3 counts: {I1,I2,I3}:2, {I1,I2,I5}:2
Compare with the minimum support count → L3: {I1,I2,I3}:2, {I1,I2,I5}:2
17
Apriori: Step-by-Step Example
Transactions: {B,C}, {B,C}, {A,C,D}, {A,B,C,D}, {B,D}; minsup = 2. Candidate 1-itemsets: A, B, C, D. Scan the transactions one at a time, incrementing the count of every candidate each one contains.
18
After the first transaction {B,C}: B:1, C:1
19
After the second transaction {B,C}: B:2, C:2
20
After {A,C,D}: A:1, B:2, C:3, D:1
21
After {A,B,C,D}: A:2, B:3, C:4, D:2
22
After {B,D}: A:2, B:4, C:4, D:3. All four 1-itemsets meet minsup = 2.
23
Generate candidate 2-itemsets from the frequent 1-itemsets: AB, AC, AD, BC, BD, CD
24
Scan again to count the pairs: AB:1, AC:2, AD:2, BC:3, BD:2, CD:2. AB is infrequent and is pruned.
25
Generate candidate 3-itemsets from the frequent pairs: ACD, BCD
26
Count the triples: ACD:2, BCD:1. Only ACD is frequent, and no 4-itemset candidate can be generated, so the algorithm stops.
27
TABLE 1: BASKET DATA FORMAT
Transaction_ID  Purchased_items
T200  {TV, Camera, Laptop}
T250  {Camera}
T300  {TV, Monitor, Laptop}

Market basket (binary) format:
Transaction_ID  TV  Camera  Laptop  Monitor
T200            1   1       1       0
T250            0   1       0       0
T300            1   0       1       1

Relational format:
Transaction_ID  Purchased_item
T200  TV
T200  Camera
T200  Laptop
T250  Camera
T300  TV
T300  Monitor
T300  Laptop
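Converting between these representations is mechanical. A minimal sketch deriving the binary matrix and the relational rows from the basket table (variable names are mine):

```python
# Basket-format table from the slide.
baskets = {
    "T200": {"TV", "Camera", "Laptop"},
    "T250": {"Camera"},
    "T300": {"TV", "Monitor", "Laptop"},
}

# Market-basket (binary) format: one 0/1 column per distinct item.
items = sorted({i for b in baskets.values() for i in b})
matrix = {tid: [1 if i in b else 0 for i in items]
          for tid, b in baskets.items()}

# Relational format: one (transaction, item) row per purchased item.
relational = [(tid, i) for tid, b in sorted(baskets.items())
              for i in sorted(b)]

print(items)           # ['Camera', 'Laptop', 'Monitor', 'TV']
print(matrix["T200"])  # [1, 1, 0, 1]
```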
28
FP-Growth (Han, Pei, Yin 2000)
One problematic aspect of Apriori is candidate generation, the source of exponential growth. Another approach is to use a divide-and-conquer strategy. Idea: compress the database into a frequent-pattern (FP) tree representing the frequent items.
29
FP Growth (Tree construction)
- Initially, scan the database for frequent 1-itemsets
- Place the resulting set in a list L, in descending order of frequency (support)
- Construct the FP-tree:
  - Create a root node labeled null
  - Scan the database again, processing the items of each transaction in L order
  - From the root, add nodes in the order in which items appear in the transactions, sharing prefixes and incrementing counts along existing paths
  - Link nodes representing the same item across different branches (node-links)
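The construction steps above can be sketched as a small two-pass function (a minimal sketch; the `Node` class, `build_fp_tree` name, and the header table keeping a list of node-links per item are my choices, not from the slides):

```python
class Node:
    """One FP-tree node: item label, count, parent link, children by item."""
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}

def build_fp_tree(transactions, min_count):
    # Pass 1: count item frequencies; keep items meeting min_count,
    # ordered by descending support (the list L from the slide).
    counts = {}
    for t in transactions:
        for i in t:
            counts[i] = counts.get(i, 0) + 1
    order = [i for i, c in sorted(counts.items(), key=lambda x: -x[1])
             if c >= min_count]
    root = Node(None, None)          # root labeled null
    header = {i: [] for i in order}  # item -> node-links across branches
    # Pass 2: insert each transaction with its items in L order,
    # sharing prefixes and incrementing counts along existing paths.
    for t in transactions:
        node = root
        for item in (i for i in order if i in t):
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
    return root, header

root, header = build_fp_tree(
    [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}], 1)
print(root.children["I2"].count)  # 3
```

Because I2 is the most frequent item, every transaction starts its path at the shared I2 node, which is exactly the prefix compression the FP-tree relies on.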
30
FP-Tree Algorithm -- Simple Case
Transaction_id  Time  Items_bought
101   6:35  Milk, bread, cookies, juice
792   7:38  Milk, juice
1130  8:05  Milk, eggs
1735  8:40  Bread, cookies, coffee
31
FP Growth – Another Example
Minimum support of 20% (frequency of 2). Frequent 1-itemsets: I1, I2, I3, I4, I5. Construct list L = {(I2,7), (I1,6), (I3,6), (I4,2), (I5,2)}
TID  Items
1  I1, I2, I5
2  I2, I4
3  I2, I3, I6
4  I1, I2, I4
5  I1, I3
6  I2, I3
7  I1, I3
8  I1, I2, I3, I5
9  I1, I2, I3
32
Build FP-Tree
Create the root node (labeled null), then scan the database.
Transaction 1: I1, I2, I5; in L order: I2, I1, I5
Process the transaction, adding nodes in item order and labeling each with its item and count: null → (I2,1) → (I1,1) → (I5,1)
Maintain a header table holding each item, its current count, and a link to the first node of that item type.
33
Build FP-Tree (continued)
After transaction 2 (I2, I4): the tree is null → (I2,2), with children (I1,1) → (I5,1) and (I4,1). Header table: I2:2, I1:1, I5:1, I4:1.
Exercise: Build the rest of the tree.
34
Mining the FP-tree: start at the last item in the header table.
- Find all paths containing the item (follow the node-links)
- Identify conditional patterns: patterns in those paths with the required frequency
- Build the conditional FP-tree C
- Append the item to all paths in C, generating frequent patterns
- Mine C recursively (appending the item)
- Remove the item from the table and the tree
35
Mining the FP-Tree
Prefix paths of I5: (I2 I1, 1) and (I2 I1 I3, 1)
Conditional path: (I2 I1, 2); the conditional FP-tree for I5 is null → (I2,2) → (I1,2)
Frequent pattern generated: (I2 I1 I5, 2)
(Figure: the full FP-tree with header table I2:7, I1:6, I3:6, I4:2, I5:2 and node-links into the branches.)
Exercise: What other frequent patterns come from the conditional-pattern tree? Mine all frequent patterns.
36
FP-Tree Example Continued
Item  Conditional pattern base            Conditional FP-tree     Frequent patterns generated
I5    {(I2 I1: 1), (I2 I1 I3: 1)}         <I2:2, I1:2>            I2 I5:2, I1 I5:2, I2 I1 I5:2
I4    {(I2 I1: 1), (I2: 1)}               <I2:2>                  I2 I4:2
I3    {(I2 I1: 2), (I2: 2), (I1: 2)}      <I2:4, I1:2>, <I1:2>    I2 I3:4, I1 I3:4, I2 I1 I3:2
I1    {(I2: 4)}                           <I2:4>                  I2 I1:4
42
FP-tree construction: another example
Minimum support of 20% (frequency of 2). After reading TID=1, the tree is null → (B:1) → (A:1).
43
FP-Tree Construction
Transaction database; header table: B:8, A:7, C:7, D:5, E:3
(FP-tree figure with node-link chains.)
Chain pointers help in quickly finding all the paths of the tree containing some given item. There is a pointer chain for each item; only the chains for E and D are shown here.
44
Transactions containing E
(Figure: the FP-tree paths that contain E, extracted from the full tree.)
45
Suffix E
The set of paths ending in E: take each such path, truncate E, and insert it into a new tree.
(New) header table: A:2, C:2, D:2. B does not survive because it has support 1, which is lower than the minimum support of 2.
(Figure: conditional FP-tree for suffix E.)
We continue recursively; the base of the recursion is reached when the tree has a single path only. FI: E
46
Suffix DE
The set of paths, from the E-conditional FP-tree, ending in D: insert each path (after truncating D) into a new tree.
(New) header table: A:2. The conditional FP-tree for suffix DE is the single path null → (A:2).
We have reached the base of the recursion. FI: DE, ADE
47
Suffix CE
The set of paths, from the E-conditional FP-tree, ending in C: insert each path (after truncating C) into a new tree.
The resulting conditional FP-tree is empty (no remaining item reaches the minimum support).
We have reached the base of the recursion. FI: CE
48
Suffix AE
The set of paths, from the E-conditional FP-tree, ending in A: insert each path (after truncating A) into a new tree. The resulting conditional FP-tree is empty.
We have reached the base of the recursion. FI: AE
49
Suffix D
The set of paths containing D: insert each path (after truncating D) into a new tree.
(New) header table: A:4, B:3, C:3
(Figure: conditional FP-tree for suffix D.)
We continue recursively; the base of the recursion is reached when the tree has a single path only. FI: D
50
Suffix CD
The set of paths, from the D-conditional FP-tree, ending in C: insert each path (after truncating C) into a new tree.
(New) header table: A:2, B:2
(Figure: conditional FP-tree for suffix CD.)
We continue recursively; the base of the recursion is reached when the tree has a single path only. FI: CD
51
Suffix BCD
The set of paths, from the CD-conditional FP-tree, ending in B: insert each path (after truncating B) into a new tree. The resulting conditional FP-tree is empty.
We have reached the base of the recursion. FI: BCD
52
Suffix ACD
The set of paths, from the CD-conditional FP-tree, ending in A: insert each path (after truncating A) into a new tree. The resulting conditional FP-tree is empty.
We have reached the base of the recursion. FI: ACD
53
Suffix C
The set of paths ending in C: insert each path (after truncating C) into a new tree.
(New) header table: B:6, A:4
(Figure: conditional FP-tree for suffix C.)
We continue recursively; the base of the recursion is reached when the tree has a single path only. FI: C
54
Suffix AC
The set of paths, from the C-conditional FP-tree, ending in A: insert each path (after truncating A) into a new tree.
(New) header table: B:3. The conditional FP-tree for suffix AC is the single path null → (B:3).
We have reached the base of the recursion. FI: AC, BAC
55
Suffix BC
The set of paths, from the C-conditional FP-tree, ending in B: insert each path (after truncating B) into a new tree. The resulting conditional FP-tree is empty.
We have reached the base of the recursion. FI: BC
56
Suffix A
The set of paths ending in A: insert each path (after truncating A) into a new tree.
(New) header table: B:5. The conditional FP-tree for suffix A is the single path null → (B:5).
We have reached the base of the recursion. FI: A, BA
57
Suffix B
The set of paths ending in B: insert each path (after truncating B) into a new tree. The resulting conditional FP-tree is empty.
We have reached the base of the recursion. FI: B
58
Apriori: Advantages and Disadvantages
- Generating all candidate itemsets and testing them against the database is expensive in both space and time
- Apriori is slow compared with FP-Growth
- Apriori is easy to implement
59
FP-Growth: Advantages and Disadvantages
- No candidate generation and no candidate tests
- No repeated scans of the entire database
- Basic operations are counting local frequent items and building sub-FP-trees; no pattern search and matching
- Faster than Apriori
- The FP-tree is expensive to build
- The FP-tree may not fit in memory