Data Mining: Concepts and Techniques


Data Mining: Concepts and Techniques Mining Frequent Patterns & Association Rules Dr. Maher Abuhamdeh

What Is Frequent Pattern Analysis?
- Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
- First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
- Motivation: finding inherent regularities in data
  - What products were often purchased together? Tea and milk?
  - What are the subsequent purchases after buying a PC?
  - What kinds of DNA are sensitive to this new drug?
  - Can we automatically classify web documents?
- Applications: basket data analysis, cross-marketing, catalog design, sales campaign analysis, etc.

Applications
- Market basket analysis: given a database of customer transactions, where each transaction is a set of items, the goal is to find groups of items that are frequently purchased together
- Telecommunication (each customer is a transaction containing the set of phone calls)
- Credit cards / banking services (each card/account is a transaction containing the set of the customer's payments)
- Medical treatments (each patient is represented as a transaction containing the ordered set of diseases)
- Basketball-game analysis (each game is represented as a transaction containing the ordered set of ball passes)

Basic Concepts: Frequent Patterns

Tid   Items bought
10    Tea, Nuts, Diaper
20    Tea, Coffee, Diaper
30    Tea, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

- itemset: a set of one or more items
- k-itemset: X = {x1, ..., xk}
- (absolute) support, or support count, of X: the frequency or number of occurrences of an itemset X
- (relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
- An itemset X is frequent if X's support is no less than a minsup threshold

(Venn diagram: customer buys Tea, customer buys Diaper, customer buys both)

Basic Concepts: Association Rules

Tid   Items bought
10    Tea, Nuts, Diaper
20    Tea, Coffee, Diaper
30    Tea, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

- Find all the rules X → Y with minimum support and confidence
  - support, s: probability that a transaction contains X ∪ Y
  - confidence, c: conditional probability that a transaction containing X also contains Y
- Let minsup = 50%, minconf = 50%
- Frequent patterns: Tea:3, Nuts:3, Diaper:4, Eggs:3, {Tea, Diaper}:3
- Resulting rules: Tea → Diaper (support 60%, confidence 100%), Diaper → Tea (support 60%, confidence 75%)

(Venn diagram: customer buys Tea, customer buys Diaper, customer buys both)

Rule Measures: Support & Confidence

Simple formulas:
- Confidence(A → B) = (#tuples containing both A and B) / (#tuples containing A) = P(B|A) = P(A ∪ B) / P(A)
- Support(A → B) = (#tuples containing both A and B) / (total number of tuples) = P(A ∪ B)

What do they actually mean? Find all the rules X & Y → Z with minimum confidence and support:
- support, s: probability that a transaction contains {X, Y, Z}
- confidence, c: conditional probability that a transaction having {X, Y} also contains Z

Support & Confidence: An Example

Let minimum support be 50% and minimum confidence be 50%. Then we have:
- A → C (50%, 66.6%)
- C → A (50%, 100%)

FP-Growth (Han, Pei, Yin 2000)
- One problematic aspect of Apriori is candidate generation, the source of exponential growth
- Another approach is to use a divide-and-conquer strategy
- Idea: compress the database into a frequent-pattern tree (FP-tree) representing the frequent items

FP-Growth (Tree Construction)
- Initially, scan the database for frequent 1-itemsets
- Place the resulting set in a list L in descending order of frequency (support)
- Construct the FP-tree:
  - Create a root node labeled null
  - Scan the database again, processing the items in each transaction in L order
  - From the root, add nodes in the order in which items appear in the transactions
  - Link nodes representing the same item along different branches
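The construction steps above can be sketched in Python. This is a minimal illustration, not the lecture's code; the class and variable names are invented, and the items of TID 7 (blank in the slide's table) are assumed to be I1, I3, which is consistent with the vertical-format TID_sets shown later in the deck:

```python
# Minimal FP-tree construction (illustrative names, not the lecture's code).
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}              # item -> child Node

def build_fp_tree(transactions, min_count):
    # Pass 1: find frequent 1-itemsets, ordered by descending support
    # (ties broken by item name) -- this is the list L.
    counts = Counter(i for t in transactions for i in t)
    L = sorted((i for i, c in counts.items() if c >= min_count),
               key=lambda i: (-counts[i], i))
    rank = {item: r for r, item in enumerate(L)}

    # Pass 2: insert each transaction, with its items in L order.
    root = Node(None, None)
    header = {item: [] for item in L}   # item -> its nodes (the node-links)
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header, L

# TID 7, blank in the slide, is assumed to be I1, I3.
transactions = [
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3", "I6"],
    ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"], ["I1", "I3"],
    ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"],
]
root, header, L = build_fp_tree(transactions, min_count=2)
print(L)                          # ['I2', 'I1', 'I3', 'I4', 'I5']
print(root.children["I2"].count)  # 7
```

The header table keeps a list of all nodes per item, which plays the role of the node-links (pointer chains) described on the slides.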

FP-Tree Algorithm -- Simple Case

Transaction_id   Time   Items_bought
101              6:35   Milk, bread, cookies, juice
792              7:38   Milk, juice
1130             8:05   Milk, eggs
1735             8:40   Bread, cookies, coffee

FP-Growth -- Another Example

Minimum support of 20% (frequency of 2). Frequent 1-itemsets: I1, I2, I3, I4, I5. Construct list L = {(I2,7), (I1,6), (I3,6), (I4,2), (I5,2)}.

TID   Items
1     I1, I2, I5
2     I2, I4
3     I2, I3, I6
4     I1, I2, I4
5     I1, I3
6     I2, I3
7     I1, I3
8     I1, I2, I3, I5
9     I1, I2, I3

Build FP-Tree
- Create the root node (null)
- Scan the database. Transaction 1: I1, I2, I5; in L order: I2, I1, I5
- Process the transaction by adding nodes in item order, labeling each node with its item and count: null → (I2,1) → (I1,1) → (I5,1)
- Maintain a header table over I2, I1, I3, I4, I5; it holds each item, its current count, and a link to the first node of that item type

Build FP-Tree (continued)
- After transaction 2 (I2, I4), I2's count becomes 2 and a new branch appears: null → (I2,2) → (I1,1) → (I5,1), with (I4,1) as a second child of (I2,2)
- Exercise: build the rest of the tree

Mining the FP-Tree
- Start at the last item in the header table
- Find all paths containing the item by following the node-links
- Identify the conditional patterns: patterns in those paths with the required frequency
- Build the conditional FP-tree C
- Append the item to all paths in C, generating frequent patterns
- Mine C recursively (appending the item)
- Remove the item from the table and the tree
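These steps, together with the tree construction they rely on, can be sketched as one self-contained program. This is a minimal illustration with invented names, not the lecture's code; as before, TID 7 (blank in the slide's table) is assumed to be I1, I3:

```python
# Self-contained FP-growth sketch: build the tree, then mine it
# recursively via conditional pattern bases. Illustrative names only.
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}

def build(transactions, min_count):
    counts = Counter(i for t in transactions for i in t)
    items = sorted((i for i, c in counts.items() if c >= min_count),
                   key=lambda i: (-counts[i], i))
    order = {i: r for r, i in enumerate(items)}
    root, header = Node(None, None), {}
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in order), key=order.get):
            if item not in node.children:
                node.children[item] = Node(item, node)
                header.setdefault(item, []).append(node.children[item])
            node = node.children[item]
            node.count += 1
    return header, order

def fp_growth(transactions, min_count, suffix=frozenset()):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    header, order = build(transactions, min_count)
    results = {}
    # Process items bottom-up: least frequent (last in the table) first.
    for item in sorted(header, key=order.get, reverse=True):
        pattern = suffix | {item}
        results[pattern] = sum(n.count for n in header[item])
        # Conditional pattern base: the prefix path of every `item` node,
        # repeated once per count; mine it recursively with `item` appended.
        cond_db = []
        for n in header[item]:
            path, p = [], n.parent
            while p.item is not None:
                path.append(p.item)
                p = p.parent
            cond_db.extend([path] * n.count)
        results.update(fp_growth(cond_db, min_count, pattern))
    return results

transactions = [
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3", "I6"],
    ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"], ["I1", "I3"],
    ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"],
]
freq = fp_growth(transactions, 2)
print(freq[frozenset({"I2", "I1", "I5"})])  # 2
print(freq[frozenset({"I1", "I3"})])        # 4
```

Each recursive call mines one conditional FP-tree; the counts it returns match the conditional-pattern-base table shown later in this section (e.g., I2 I1 I5 : 2).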

Mining the FP-Tree (suffix I5)
- Prefix paths: (I2 I1, 1) and (I2 I1 I3, 1)
- Conditional path: (I2 I1, 2)
- Conditional FP-tree: null → (I2,2) → (I1,2), yielding the frequent pattern (I2 I1 I5, 2)
- Exercise: what other frequent patterns come from the conditional-pattern tree?
- Exercise: mine all frequent patterns

FP-Tree Example Continued

Item   Conditional pattern base         Conditional FP-tree     Frequent patterns generated
I5     {(I2 I1: 1), (I2 I1 I3: 1)}      <I2:2, I1:2>            I2 I5:2, I1 I5:2, I2 I1 I5:2
I4     {(I2 I1: 1), (I2: 1)}            <I2:2>                  I2 I4:2
I3     {(I2 I1: 2), (I2: 2), (I1: 2)}   <I2:4, I1:2>, <I1:2>    I2 I3:4, I1 I3:4, I2 I1 I3:2
I1     {(I2: 4)}                        <I2:4>                  I2 I1:4

FP-Tree Construction (second example)

Minimum support of 20% (frequency of 2). After reading TID=1, the tree is: null → B:1 → A:1.

FP-Tree Construction (continued)

Header table: B:8, A:7, C:7, D:5, E:3. (Tree figure: root null with a main branch B:8 → A:5 and further branches through C, D, and E nodes.) Chain pointers help in quickly finding all the paths of the tree containing a given item; there is a pointer chain for each item (shown in the figure only for E and D).

Transactions Containing E

Following E's pointer chain and collecting the prefix path of each E node extracts, from the full tree, the subtree of paths ending in E (figure: paths through B:1, C:1, A:2, D:1 down to the E nodes).

Suffix E
- Take the set of paths ending in E and insert each path (after truncating E) into a new tree
- B does not survive because it has support 1, which is lower than the minimum support of 2
- New header table: A:2, C:2, D:2; this is the conditional FP-tree for suffix E
- We continue recursively; base of recursion: when the tree has a single path only
- FI: E

Suffix DE
- Take the set of paths, from the E-conditional FP-tree, ending in D and insert each path (after truncating D) into a new tree
- New header table: A:2; the conditional FP-tree for suffix DE is null → A:2
- We have reached the base of recursion
- FI: DE, ADE

Suffix CE
- Take the set of paths, from the E-conditional FP-tree, ending in C and insert each path (after truncating C) into a new tree
- No item survives the minimum support, so the conditional FP-tree for suffix CE is empty
- We have reached the base of recursion
- FI: CE

Suffix AE
- Take the set of paths, from the E-conditional FP-tree, ending in A and insert each path (after truncating A) into a new tree
- The resulting conditional FP-tree for suffix AE is empty (null)
- We have reached the base of recursion
- FI: AE

Suffix D
- Take the set of paths containing D and insert each path (after truncating D) into a new tree
- New header table: A:4, B:3, C:3; this is the conditional FP-tree for suffix D
- We continue recursively; base of recursion: when the tree has a single path only
- FI: D

Suffix CD
- Take the set of paths, from the D-conditional FP-tree, ending in C and insert each path (after truncating C) into a new tree
- New header table: A:2, B:2; this is the conditional FP-tree for suffix CD
- We continue recursively; base of recursion: when the tree has a single path only
- FI: CD

Suffix BCD
- Take the set of paths, from the CD-conditional FP-tree, ending in B and insert each path (after truncating B) into a new tree
- The resulting conditional FP-tree is empty; we have reached the base of recursion
- FI: BCD

Suffix ACD
- Take the set of paths, from the CD-conditional FP-tree, ending in A and insert each path (after truncating A) into a new tree
- The resulting conditional FP-tree is empty; we have reached the base of recursion
- FI: ACD

Suffix C
- Take the set of paths ending in C and insert each path (after truncating C) into a new tree
- New header table: B:6, A:4; this is the conditional FP-tree for suffix C (figure: B:6 with A:3 beneath it, and A:1 on a separate branch)
- We continue recursively; base of recursion: when the tree has a single path only
- FI: C

Suffix AC
- Take the set of paths, from the C-conditional FP-tree, ending in A and insert each path (after truncating A) into a new tree
- New header table: B:3; the conditional FP-tree for suffix AC is null → B:3
- We have reached the base of recursion
- FI: AC, BAC

Suffix BC
- Take the set of paths, from the C-conditional FP-tree, ending in B (B:6) and insert each path (after truncating B) into a new tree
- The resulting conditional FP-tree is empty (null); we have reached the base of recursion
- FI: BC

Suffix A
- Take the set of paths ending in A and insert each path (after truncating A) into a new tree
- New header table: B:5; the conditional FP-tree for suffix A is null → B:5
- We have reached the base of recursion
- FI: A, BA

Suffix B
- Take the set of paths ending in B and insert each path (after truncating B) into a new tree
- The resulting conditional FP-tree is empty (null); we have reached the base of recursion
- FI: B

Apriori: Advantages and Disadvantages
- Generating all candidate itemsets and testing them against the minimum support is expensive in both space and time
- Apriori is slow compared with FP-growth
- Apriori is easy to implement
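For contrast with FP-growth, the level-wise generate-and-test loop that makes Apriori expensive can be sketched as follows. This is illustrative code, not from the lecture; it reuses the nine-transaction example, with TID 7's blank items assumed to be I1, I3:

```python
# Level-wise Apriori: generate candidate k-itemsets from the frequent
# (k-1)-itemsets, prune by the Apriori property, then test with a full
# database scan per level.
from collections import Counter
from itertools import combinations

def apriori(transactions, min_count):
    transactions = [frozenset(t) for t in transactions]
    counts = Counter(i for t in transactions for i in t)
    Lk = {frozenset([i]) for i, c in counts.items() if c >= min_count}
    frequent = set(Lk)
    k = 2
    while Lk:
        # Join step: unions of frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk
                             for s in combinations(c, k - 1))}
        # Test step: one scan of the whole database per level.
        Lk = {c for c in candidates
              if sum(c <= t for t in transactions) >= min_count}
        frequent |= Lk
        k += 1
    return frequent

transactions = [
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3", "I6"],
    ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"], ["I1", "I3"],
    ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"],
]
freq = apriori(transactions, min_count=2)
print(len(freq))                              # 13
print(frozenset({"I1", "I2", "I5"}) in freq)  # True
```

It finds the same 13 frequent itemsets as FP-growth on this data, but at the cost of one full database scan per level and an explicit candidate set at each k.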

FP-Growth: Advantages and Disadvantages
- No candidate generation, no candidate tests
- No repeated scans of the entire database
- Basic operations are counting local frequent items and building sub-FP-trees: no pattern search and matching
- Faster than Apriori
- However, the FP-tree is expensive to build, and it may not fit in memory

ECLAT Algorithm by Example

ECLAT (Equivalence CLAss Transformation) transforms the horizontally formatted data into the vertical format by scanning the database once. The support count of an itemset is simply the length of the itemset's TID_set.

TID    List of item IDs
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

itemset   TID_set
I1        {T100, T400, T500, T700, T800, T900}
I2        {T100, T200, T300, T400, T600, T800, T900}
I3        {T300, T500, T600, T700, T800, T900}
I4        {T200, T400}
I5        {T100, T800}

ECLAT Algorithm by Example (continued)

With min_sup = 2, the frequent 1-itemsets in vertical format are as above. The frequent k-itemsets can be used to construct the candidate (k+1)-itemsets based on the Apriori property.

2-itemsets in vertical format:

itemset    TID_set
{I1,I2}    {T100, T400, T800, T900}
{I1,I3}    {T500, T700, T800, T900}
{I1,I4}    {T400}
{I1,I5}    {T100, T800}
{I2,I3}    {T300, T600, T800, T900}
{I2,I4}    {T200, T400}
{I2,I5}    {T100, T800}
{I3,I5}    {T800}

ECLAT Algorithm by Example (continued)

Frequent 3-itemsets in vertical format (min_sup = 2):

itemset      TID_set
{I1,I2,I3}   {T800, T900}
{I1,I2,I5}   {T100, T800}

This process repeats, with k incremented by 1 each time, until no frequent itemsets or candidate itemsets can be found.

Properties of mining with the vertical data format:
- Takes advantage of the Apriori property in generating candidate (k+1)-itemsets from k-itemsets
- No need to scan the database to find the support of (k+1)-itemsets for k >= 1: the TID_set of each k-itemset carries the complete information required for counting its support
- The TID_sets can be quite long, and hence expensive to manipulate
- The diffset technique can be used to optimize the support count computation
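The whole vertical-format process can be sketched as a short recursion over TID-list intersections. This is illustrative code (names invented), built directly from the TID_sets in the slides:

```python
# ECLAT sketch: depth-first expansion over vertical (item -> TID_set) data.
vertical = {
    "I1": {"T100", "T400", "T500", "T700", "T800", "T900"},
    "I2": {"T100", "T200", "T300", "T400", "T600", "T800", "T900"},
    "I3": {"T300", "T500", "T600", "T700", "T800", "T900"},
    "I4": {"T200", "T400"},
    "I5": {"T100", "T800"},
}
MIN_SUP = 2

def eclat(items, results):
    """items: list of (itemset_tuple, tid_set); depth-first expansion."""
    while items:
        iset, tids = items.pop(0)
        results[iset] = len(tids)          # support count = |TID_set|
        # Intersect with each remaining sibling to form (k+1)-itemsets;
        # infrequent intersections are pruned immediately (Apriori property).
        children = []
        for jset, jtids in items:
            new_tids = tids & jtids
            if len(new_tids) >= MIN_SUP:
                children.append((iset + (jset[-1],), new_tids))
        eclat(children, results)

results = {}
eclat([((i,), t) for i, t in sorted(vertical.items())], results)
print(results[("I1", "I2", "I5")])  # 2
print(results[("I2", "I3")])        # 4
```

No database scan occurs after the initial vertical transformation: every support count comes from a set intersection, exactly as the properties above describe.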