DATA MINING ASSOCIATION RULES.


Outline: Large Itemsets; Basic Algorithms; Simple Apriori Algorithm; FP Algorithm; Advanced Topics.

Association Rules Outline: Association Rules Problem Overview; Large Itemsets; Association Rules Algorithms (Apriori, Sampling, Partitioning, Parallel Algorithms); Comparing Techniques; Incremental Algorithms; Advanced AR Techniques.

Example: Market Basket Data. Items frequently purchased together: Computer → Printer. Uses: placement, advertising, sales, coupons. Objective: increase sales and reduce costs. Also called Market Basket Analysis or Shopping Cart Analysis.

Association Rule Definitions. Set of items: I = {I1, I2, …, Im}. Transactions: D = {t1, t2, …, tn}, tj ⊆ I. Itemset: {Ii1, Ii2, …, Iik} ⊆ I. Support of an itemset: the percentage of transactions which contain that itemset. Large (frequent) itemset: an itemset whose number of occurrences is above a threshold.

Transaction Data: Supermarket Data. Market basket transactions: t1: {bread, cheese, milk}; t2: {apple, jam, salt, ice-cream}; … ; tn: {biscuit, jam, milk}. Concepts: an item is an item/article in a basket; I is the set of all items sold in the store; a transaction is the set of items purchased in a basket (it may have a TID, transaction ID); a transactional dataset is a set of transactions.

Transaction Data: A Set of Documents. A text document data set in which each document is treated as a “bag” of keywords: doc1: Student, Teach, School; doc2: Student, School; doc3: Teach, School, City, Game; doc4: Baseball, Basketball; doc5: Basketball, Player, Spectator; doc6: Baseball, Coach, Game, Team; doc7: Basketball, Team, City, Game.

Association Rules Example: {Biscuit} → {FruitJuice}; {Milk, Bread} → {Eggs, Coke}; {FruitJuice, Bread} → {Milk}.

Association Rule Definitions. Association Rule (AR): an implication X → Y where X, Y ⊆ I and X ∩ Y = ∅. Support of AR (s), X → Y: the percentage of transactions that contain X ∪ Y. Confidence of AR (α), X → Y: the ratio of the number of transactions that contain X ∪ Y to the number that contain X.

Association Rules Example (cont’d). Itemset: a collection of one or more items, e.g. {Milk, Bread, Biscuit}. k-itemset: an itemset that contains k items. Support count (σ): the frequency of occurrence of an itemset, e.g. σ({Milk, Bread, Biscuit}) = 2. Support: the fraction of transactions that contain an itemset, e.g. s({Milk, Bread, Biscuit}) = 2/5. Frequent itemset: an itemset whose support is greater than or equal to a minsup threshold.

Association Rule Problem. Given a set of items I = {I1, I2, …, Im} and a database of transactions D = {t1, t2, …, tn} where ti = {Ii1, Ii2, …, Iik} and Iij ∈ I, the Association Rule Problem is to identify all association rules X → Y with a minimum support and confidence. NOTE: the support of X → Y is the same as the support of X ∪ Y.

Definition: Association Rule. An implication expression of the form X → Y, where X and Y are itemsets, e.g. {Milk, Biscuit} → {FruitJuice}. Rule evaluation metrics: Support (s), the fraction of transactions that contain both X and Y; Confidence (c), which measures how often items in Y appear in transactions that contain X.
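To make these metrics concrete, here is a minimal Python sketch that computes s and c for a rule over a handful of hypothetical transactions (the baskets below are invented for illustration; they are not the dataset behind the slides' figures):

# Minimal sketch: computing support and confidence of a rule X -> Y.
# The five transactions are hypothetical, chosen only to illustrate the formulas.
transactions = [
    {"Milk", "Biscuit", "FruitJuice"},
    {"Milk", "Biscuit", "Bread"},
    {"Biscuit", "FruitJuice"},
    {"Milk", "Bread", "Eggs"},
    {"Milk", "Biscuit", "FruitJuice", "Bread"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y, transactions):
    """support(X union Y) / support(X)."""
    return support(set(X) | set(Y), transactions) / support(X, transactions)

X, Y = {"Milk", "Biscuit"}, {"FruitJuice"}
print("s =", support(X | Y, transactions))    # fraction containing Milk, Biscuit and FruitJuice
print("c =", confidence(X, Y, transactions))  # how often FruitJuice accompanies {Milk, Biscuit}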

Association Rule Mining Task. Given a set of transactions T, the goal of association rule mining is to find all rules having support ≥ the minsup threshold and confidence ≥ the minconf threshold. Brute-force approach: list all possible association rules, compute the support and confidence for each rule, and prune rules that fail the minsup and minconf thresholds. This is computationally prohibitive!

Mining Association Rules. Example of rules: {Milk, Biscuit} → {FruitJuice} (s=0.4, c=0.67); {Milk, FruitJuice} → {Biscuit} (s=0.4, c=1.0); {Biscuit, FruitJuice} → {Milk} (s=0.4, c=0.67); {FruitJuice} → {Milk, Biscuit} (s=0.4, c=0.67); {Biscuit} → {Milk, FruitJuice} (s=0.4, c=0.5); {Milk} → {Biscuit, FruitJuice} (s=0.4, c=0.5).

Observations: all of the above rules are binary partitions of the same itemset, {Milk, Biscuit, FruitJuice}. Rules originating from the same itemset have identical support but can have different confidence. Thus, we may decouple the support and confidence requirements.

Example 2. Transaction data: t1: Butter, Cocoa, Milk; t2: Butter, Cheese; t3: Cheese, Boots; t4: Butter, Cocoa, Cheese; t5: Butter, Cocoa, Clothes, Cheese, Milk; t6: Cocoa, Clothes, Milk; t7: Cocoa, Milk, Clothes. Assume minsup = 30% and minconf = 80%. An example frequent itemset: {Cocoa, Clothes, Milk} [sup = 3/7]. Association rules from the itemset: Clothes → Milk, Cocoa [sup = 3/7, conf = 3/3]; … ; Clothes, Cocoa → Milk [sup = 3/7, conf = 3/3].

Mining Association Rules. Two-step approach: (1) Frequent Itemset Generation: generate all itemsets whose support ≥ minsup; (2) Rule Generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset. Frequent itemset generation is still computationally expensive.

Frequent Itemset Generation. Given d items, there are 2^d possible candidate itemsets.

Frequent Itemset Generation. Brute-force approach: each itemset in the lattice is a candidate frequent itemset; count the support of each candidate by scanning the database, matching each transaction against every candidate. Complexity ~ O(NMw), where N is the number of transactions, M is the number of candidates and w is the maximum transaction width. Expensive, since M = 2^d!

Frequent Itemset Generation Strategies. Reduce the number of candidates (M): complete search has M = 2^d; use pruning techniques to reduce M. Reduce the number of transactions (N): reduce the size of N as the size of the itemset increases. Reduce the number of comparisons (NM): use efficient data structures to store the candidates or transactions, so there is no need to match every candidate against every transaction.

Reducing the Number of Candidates. Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent. The principle holds due to the anti-monotone property of the support measure: the support of an itemset never exceeds the support of its subsets, i.e. ∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y).

Apriori Rule. Frequent Itemset Property: any subset of a frequent itemset is frequent. Contrapositive: if an itemset is not frequent, none of its supersets are frequent.

Illustrating the Apriori Principle (figure: itemset lattice in which one itemset is found to be infrequent and all of its supersets are pruned).

Illustrating the Apriori Principle (figure: support counts for items (1-itemsets), pairs (2-itemsets) and triplets (3-itemsets) with minimum support = 3; no candidates involving Coke or Eggs need to be generated). If every subset is considered, 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41 candidates; with support-based pruning, only 6 + 6 + 1 = 13.

Apriori Algorithm. Method: let k = 1 and generate frequent itemsets of length 1. Repeat until no new frequent itemsets are identified: generate length-(k+1) candidate itemsets from the length-k frequent itemsets; prune candidate itemsets containing subsets of length k that are infrequent; count the support of each candidate by scanning the DB; eliminate candidates that are infrequent, leaving only those that are frequent.

Apriori Algorithm. A level-wise, candidate-generation-and-test approach (Agrawal & Srikant 1994). Trace on database D with min_sup = 2: scan D for 1-candidates C1 = {a:2, b:3, c:3, d:1, e:3}, giving frequent 1-itemsets F1 = {a:2, b:3, c:3, e:3}; form 2-candidates C2 = {ab, ac, ae, bc, be, ce}; scan D and count, giving {ab:1, ac:2, ae:1, bc:2, be:3, ce:2} and frequent 2-itemsets F2 = {ac:2, bc:2, be:3, ce:2}; form the 3-candidate C3 = {bce}; scan D, giving the frequent 3-itemset F3 = {bce:2}.

Example: Finding Frequent Itemsets. Dataset T (minsup = 0.5): T100: {1, 3, 4}; T200: {2, 3, 5}; T300: {1, 2, 3, 5}; T400: {2, 5}. (itemset:count)
1. Scan T → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3 → F1: {1}:2, {2}:3, {3}:3, {5}:3 → C2: {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}
2. Scan T → C2: {1,2}:1, {1,3}:2, {1,5}:1, {2,3}:2, {2,5}:3, {3,5}:2 → F2: {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2 → C3: {2,3,5}
3. Scan T → C3: {2,3,5}:2 → F3: {2,3,5}
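A compact, unoptimized Python sketch of this level-wise procedure, run on the four-transaction dataset T above with minsup = 0.5 (it reproduces F1, F2 and F3, but omits the hash-tree counting structures a real implementation would use):

from itertools import combinations

# Dataset T from the slide, minsup = 0.5 (i.e. at least 2 of 4 transactions).
T = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
min_count = 0.5 * len(T)

def count(itemset):
    return sum(set(itemset) <= t for t in T)

# F1: frequent 1-itemsets.
items = sorted({i for t in T for i in t})
F = [{frozenset([i]) for i in items if count([i]) >= min_count}]

k = 1
while F[-1]:
    prev = F[-1]
    # Candidate generation: join frequent k-itemsets whose union has k+1 items,
    # then prune candidates with an infrequent k-subset (Apriori principle).
    candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
    candidates = {c for c in candidates
                  if all(frozenset(s) in prev for s in combinations(c, k))}
    # Scan T to count each surviving candidate and keep the frequent ones.
    F.append({c for c in candidates if count(c) >= min_count})
    k += 1

for level, freq in enumerate(F, start=1):
    if freq:
        print(f"F{level}:", sorted(sorted(f) for f in freq))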

Details: Ordering of Items. The items in I are sorted in lexicographic order (which is a total order). The order is used throughout the algorithm in each itemset. {w[1], w[2], …, w[k]} represents a k-itemset w consisting of items w[1], w[2], …, w[k], where w[1] < w[2] < … < w[k] according to the total order.

Apriori-Gen. Generate candidates of size i+1 from large (frequent) itemsets of size i. Approach used: join large itemsets of size i if they agree on their first i−1 items. Candidates that have subsets which are not large may also be pruned.

Candidate-Gen Function
Function candidate-gen(Fk-1)
  Ck ← ∅;
  forall f1, f2 ∈ Fk-1 with f1 = {i1, …, ik-2, ik-1} and f2 = {i1, …, ik-2, i'k-1} and ik-1 < i'k-1 do
    c ← {i1, …, ik-1, i'k-1};       // join f1 and f2
    Ck ← Ck ∪ {c};
    for each (k-1)-subset s of c do
      if (s ∉ Fk-1) then delete c from Ck;   // prune
    end
  end
  return Ck;
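A direct Python transcription of candidate-gen might look as follows; it assumes itemsets are represented as sorted tuples so that the lexicographic ordering described earlier applies (an illustrative sketch, not the lecture's code):

from itertools import combinations

def candidate_gen(F_prev):
    """Apriori-gen: join frequent (k-1)-itemsets that share their first k-2 items,
    then prune candidates having any (k-1)-subset outside F_prev.
    F_prev is a collection of sorted tuples, all of length k-1."""
    F_prev = set(F_prev)
    Ck = set()
    for f1 in F_prev:
        for f2 in F_prev:
            # Join step: same prefix, last items in increasing order.
            if f1[:-1] == f2[:-1] and f1[-1] < f2[-1]:
                c = f1 + (f2[-1],)
                # Prune step: every (k-1)-subset of c must be frequent.
                if all(s in F_prev for s in combinations(c, len(c) - 1)):
                    Ck.add(c)
    return Ck

# The example on the next slide: F3 -> C4 = {(1, 2, 3, 4)}.
F3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
print(candidate_gen(F3))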

An Example. F3 = {{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}}. After the join: C4 = {{1, 2, 3, 4}, {1, 3, 4, 5}}. After pruning: C4 = {{1, 2, 3, 4}}, because {1, 4, 5} is not in F3 ({1, 3, 4, 5} is removed).

Apriori-Gen Example (figure).

Apriori-Gen Example, continued (figure).

Apriori Advantages/Disadvantages. Advantages: uses the large (frequent) itemset property; easily parallelized; easy to implement. Disadvantages: assumes the transaction database is memory resident; requires up to m database scans, where m is the size of the longest frequent itemset.

Step 2: Generating Rules from Frequent Itemsets. Frequent itemsets → association rules: one more step is needed to generate association rules. For each frequent itemset X and each proper nonempty subset A of X, let B = X − A; then A → B is an association rule if confidence(A → B) ≥ minconf, where support(A → B) = support(A ∪ B) = support(X) and confidence(A → B) = support(A ∪ B) / support(A).
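A sketch of this rule-generation step in Python, reusing the supports recorded during itemset generation; the support table below is taken from the worked example on the next slide:

from itertools import combinations

def generate_rules(freq_itemsets, support, minconf):
    """freq_itemsets: iterable of frozensets; support: dict frozenset -> support value.
    Yields (antecedent, consequent, support, confidence) for rules meeting minconf."""
    for X in freq_itemsets:
        if len(X) < 2:
            continue
        for r in range(1, len(X)):                 # all proper non-empty subsets A of X
            for A in map(frozenset, combinations(X, r)):
                B = X - A
                conf = support[X] / support[A]     # support(A u B) / support(A)
                if conf >= minconf:
                    yield A, B, support[X], conf

# Supports for the {2, 3, 4} example on the next slide.
support = {
    frozenset({2, 3, 4}): 0.50,
    frozenset({2, 3}): 0.50, frozenset({2, 4}): 0.50, frozenset({3, 4}): 0.75,
    frozenset({2}): 0.75, frozenset({3}): 0.75, frozenset({4}): 0.75,
}
for A, B, s, c in generate_rules([frozenset({2, 3, 4})], support, minconf=0.6):
    print(sorted(A), "->", sorted(B), f"sup={s:.0%} conf={c:.0%}")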

Generating Rules: An Example. Suppose {2, 3, 4} is frequent, with sup = 50%. Its proper nonempty subsets are {2,3}, {2,4}, {3,4}, {2}, {3}, {4}, with sup = 50%, 50%, 75%, 75%, 75%, 75% respectively. These generate the association rules: {2,3} → {4}, confidence = 100%; {2,4} → {3}, confidence = 100%; {3,4} → {2}, confidence = 67%; {2} → {3,4}, confidence = 67%; {3} → {2,4}, confidence = 67%; {4} → {2,3}, confidence = 67%. All rules have support = 50%.

Rule Generation. Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement. If {A,B,C,D} is a frequent itemset, the candidate rules are: ABC → D, ABD → C, ACD → B, BCD → A, A → BCD, B → ACD, C → ABD, D → ABC, AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB. If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L).

Generating Rules. To recap: in order to obtain A → B, we need support(A ∪ B) and support(A). All the information required for the confidence computation has already been recorded during itemset generation, so there is no need to look at the data T any more. This step is not as time-consuming as frequent itemset generation (Han and Kamber 2001).

Rule Generation. How do we efficiently generate rules from frequent itemsets? In general, confidence does not have an anti-monotone property: c(ABC → D) can be larger or smaller than c(AB → D). But the confidence of rules generated from the same itemset does have an anti-monotone property, e.g. for L = {A,B,C,D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD).

Rule Generation for the Apriori Algorithm (figure: lattice of rules in which a low-confidence rule and all rules below it are pruned).

Apriori: Performance Bottlenecks. The core of the Apriori algorithm: use frequent (k−1)-itemsets to generate candidate frequent k-itemsets, and use database scans and pattern matching to collect counts for the candidate itemsets. The bottleneck of Apriori is candidate generation. Huge candidate sets: 10^4 frequent 1-itemsets will generate about 10^7 candidate 2-itemsets; to discover a frequent pattern of size 100, e.g. {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates. Multiple scans of the database: it needs (n + 1) scans, where n is the length of the longest pattern.

Mining Frequent Patterns Without Candidate Generation. Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure: highly condensed, but complete for frequent pattern mining, and avoiding costly database scans. Develop an efficient FP-tree-based frequent pattern mining method: a divide-and-conquer methodology that decomposes mining tasks into smaller ones and avoids candidate generation (sub-database tests only).

Construct an FP-tree from a Transaction DB (min_support = 0.5, i.e. an item must appear in at least 3 of the 5 transactions).
TID 100: items bought {f, a, c, d, g, i, m, p}, ordered frequent items {f, c, a, m, p}
TID 200: items bought {a, b, c, f, l, m, o}, ordered frequent items {f, c, a, b, m}
TID 300: items bought {b, f, h, j, o}, ordered frequent items {f, b}
TID 400: items bought {b, c, k, s, p}, ordered frequent items {c, b, p}
TID 500: items bought {a, f, c, e, l, p, m, n}, ordered frequent items {f, c, a, m, p}
Header table (item: frequency): f: 4, c: 4, a: 3, b: 3, m: 3, p: 3.
Steps: (1) scan the DB once to find the frequent 1-itemsets (single-item patterns); (2) order frequent items in frequency-descending order; (3) scan the DB again and construct the FP-tree. (Figure: the resulting FP-tree rooted at {}, with branches such as {} → f:4 → c:3 → a:3 → m:2 → p:2 and {} → c:1 → b:1 → p:1.)
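The following Python sketch builds an FP-tree along these lines for the five transactions above (a bare-bones illustration: ties among equally frequent items are broken arbitrarily, so the tree may order such items slightly differently from the figure while remaining a valid FP-tree):

from collections import Counter, defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 1, {}

def build_fp_tree(transactions, min_count):
    # Pass 1: find frequent items and a frequency-descending order.
    freq = Counter(i for t in transactions for i in t)
    freq = {i: c for i, c in freq.items() if c >= min_count}
    order = sorted(freq, key=lambda i: -freq[i])
    rank = {i: r for r, i in enumerate(order)}

    root = Node(None, None)
    header = defaultdict(list)          # item -> list of nodes (node-links)
    # Pass 2: insert each transaction's frequent items in frequency-descending order.
    for t in transactions:
        path = sorted((i for i in t if i in freq), key=lambda i: rank[i])
        node = root
        for item in path:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            else:
                child.count += 1
            node = child
    return root, header

transactions = [
    ["f", "a", "c", "d", "g", "i", "m", "p"],
    ["a", "b", "c", "f", "l", "m", "o"],
    ["b", "f", "h", "j", "o"],
    ["b", "c", "k", "s", "p"],
    ["a", "f", "c", "e", "l", "p", "m", "n"],
]
root, header = build_fp_tree(transactions, min_count=3)   # min_support 0.5 of 5 transactions
print({item: sum(n.count for n in nodes) for item, nodes in header.items()})  # header-table counts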

Benefits of the FP-tree Structure. Completeness: it never breaks a long pattern of any transaction and preserves complete information for frequent pattern mining. Compactness: it reduces irrelevant information (infrequent items are gone); frequency-descending ordering means more frequent items are more likely to be shared; it is never larger than the original database (if node-links and counts are not counted).

Major Steps to Mine an FP-tree: (1) construct the conditional pattern base for each node in the FP-tree; (2) construct the conditional FP-tree from each conditional pattern base; (3) recursively mine the conditional FP-trees and grow the frequent patterns obtained so far; if a conditional FP-tree contains a single path, simply enumerate all the patterns.

Step 1: FP-tree to Conditional Pattern Base. Starting at the frequent-item header table in the FP-tree, traverse the FP-tree by following the node-links of each frequent item, and accumulate all transformed prefix paths of that item to form its conditional pattern base. Conditional pattern bases for the example tree: c: f:3; a: fc:3; b: fca:1, f:1, c:1; m: fca:2, fcab:1; p: fcam:2, cb:1.

Properties of the FP-tree for Conditional Pattern Base Construction. Node-link property: for any frequent item ai, all the possible frequent patterns that contain ai can be obtained by following ai's node-links, starting from ai's head in the FP-tree header. Prefix path property: to calculate the frequent patterns for a node ai in a path P, only the prefix sub-path of ai in P needs to be accumulated, and its frequency count should carry the same count as node ai.

Step 2: Construct the Conditional FP-tree. For each pattern base, accumulate the count for each item in the base and construct the FP-tree for the frequent items of the pattern base. Example: the m-conditional pattern base is fca:2, fcab:1; the resulting m-conditional FP-tree is the single path {} → f:3 → c:3 → a:3, and all frequent patterns concerning m are m, fm, cm, am, fcm, fam, cam, fcam.

Mining Frequent Patterns by Creating Conditional Pattern Bases (item: conditional pattern base → conditional FP-tree):
c: {(f:3)} → {(f:3)}|c
a: {(fc:3)} → {(f:3, c:3)}|a
b: {(fca:1), (f:1), (c:1)} → empty
m: {(fca:2), (fcab:1)} → {(f:3, c:3, a:3)}|m
p: {(fcam:2), (cb:1)} → {(c:3)}|p

Step 3: Recursively Mine the Conditional FP-trees. From the m-conditional FP-tree ({} → f:3 → c:3 → a:3): the conditional pattern base of “am” is (fc:3), giving the am-conditional FP-tree {} → f:3 → c:3; the conditional pattern base of “cm” is (f:3), giving the cm-conditional FP-tree {} → f:3; the conditional pattern base of “cam” is (f:3), giving the cam-conditional FP-tree {} → f:3.

Single FP-tree Path Generation. Suppose an FP-tree T has a single path P. The complete set of frequent patterns of T can be generated by enumerating all the combinations of the sub-paths of P. Example: the m-conditional FP-tree is the single path {} → f:3 → c:3 → a:3, so all frequent patterns concerning m are m, fm, cm, am, fcm, fam, cam, fcam.
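To show the recursion concretely, here is a small self-contained sketch that mines conditional pattern bases directly, growing each suffix item by item instead of materializing the conditional FP-trees, and skipping the single-path shortcut; the transactions are the ordered frequent-item lists from the construction slide, with min_count = 3:

from collections import Counter

def fp_growth(pattern_base, min_count, suffix=()):
    """pattern_base: list of (items_tuple, count) pairs, i.e. a (conditional) pattern base.
    Yields (frequent_itemset, support_count) pairs for itemsets extending `suffix`."""
    # Count items in this conditional pattern base.
    counts = Counter()
    for items, cnt in pattern_base:
        for i in set(items):
            counts[i] += cnt
    for item, cnt in counts.items():
        if cnt < min_count:
            continue
        new_suffix = (item,) + suffix
        yield frozenset(new_suffix), cnt
        # Conditional pattern base for `item`: the prefixes that precede it in each path.
        conditional = []
        for items, c in pattern_base:
            if item in items:
                prefix = items[:items.index(item)]
                if prefix:
                    conditional.append((prefix, c))
        # Recurse: patterns frequent here, extended with `item`, stay frequent (pattern growth).
        yield from fp_growth(conditional, min_count, new_suffix)

ordered_transactions = [
    ("f", "c", "a", "m", "p"), ("f", "c", "a", "b", "m"), ("f", "b"),
    ("c", "b", "p"), ("f", "c", "a", "m", "p"),
]
base = [(t, 1) for t in ordered_transactions]
for itemset, cnt in sorted(fp_growth(base, min_count=3),
                           key=lambda x: (len(x[0]), sorted(x[0]))):
    print(sorted(itemset), cnt)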

Principles of Frequent Pattern Growth. Pattern growth property: let α be a frequent itemset in DB, B be α's conditional pattern base, and β be an itemset in B; then α ∪ β is a frequent itemset in DB iff β is frequent in B. For example, “abcdef” is a frequent pattern if and only if “abcde” is a frequent pattern and “f” is frequent in the set of transactions containing “abcde”.

Why Is Frequent Pattern Growth Fast? Performance studies show that FP-growth is an order of magnitude faster than Apriori, and is also faster than tree-projection. Reasoning: no candidate generation and no candidate tests; a compact data structure; repeated database scans are eliminated; the basic operations are counting and FP-tree building.

FP-Tree Construction (second example). After reading TID=1, the tree is the single branch null → A:1 → B:1; after reading TID=2, a second branch null → B:1 → C:1 → D:1 is added.

FP-Tree Construction (figure: the complete FP-tree for the transaction database, with its header table; among its nodes are A:7, B:5, B:3, C:3, C:1 and several D:1 and E:1 nodes). Pointers (node-links) are used to assist frequent itemset generation.

FP-growth. Build the conditional pattern base for E: P = {(A:1, C:1, D:1), (A:1, D:1), (B:1, C:1)}, then recursively apply FP-growth on P. (Figure: the FP-tree with the node-links for E highlighted.)

FP-growth. Conditional pattern base for E: P = {(A:1, C:1, D:1, E:1), (A:1, D:1, E:1), (B:1, C:1, E:1)}. The count for E is 3, so {E} is a frequent itemset. Recursively apply FP-growth on P. (Figure: the conditional tree for E, with nodes A:2, B:1, C:1, C:1, D:1, D:1.)

FP-growth. Conditional pattern base for D within the conditional base for E: P = {(A:1, C:1, D:1), (A:1, D:1)}. The count for D is 2, so {D, E} is a frequent itemset. Recursively apply FP-growth on P. (Figure: the conditional tree for D within the conditional tree for E, with nodes A:2, C:1, D:1, D:1.)

FP-growth. Conditional pattern base for C within D within E: P = {(A:1, C:1)}. The count for C is 1, so {C, D, E} is NOT a frequent itemset. (Figure: the conditional tree for C within D within E, with nodes A:1, C:1.)

FP-growth. Conditional tree for A within D within E: the count for A is 2, so {A, D, E} is a frequent itemset. Next step: construct the conditional tree for C within the conditional tree for E, and continue until the conditional tree for A (which has only node A) has been explored. (Figure: the conditional tree, a single node A:2.)

Iceberg Queries. An iceberg query computes aggregates over one attribute or a set of attributes, keeping only those groups whose aggregate value is above a certain threshold. Example:
select P.custID, P.itemID, sum(P.qty)
from purchase P
group by P.custID, P.itemID
having sum(P.qty) >= 10
Compute iceberg queries efficiently with an Apriori-style strategy: first compute the lower dimensions, then compute higher dimensions only when all the lower ones are above the threshold.
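A rough Python sketch of that Apriori-style evaluation: aggregate over the single attributes first, and only form (custID, itemID) groups whose one-dimensional aggregates both clear the threshold (the purchase rows and threshold below are hypothetical stand-ins for the table in the SQL above):

from collections import defaultdict

# Hypothetical purchase rows: (custID, itemID, qty), mirroring "purchase P" in the SQL.
purchases = [
    ("c1", "i1", 6), ("c1", "i1", 5), ("c1", "i2", 2),
    ("c2", "i1", 4), ("c2", "i2", 9), ("c2", "i2", 3),
]
threshold = 10

# Lower dimensions first: total qty per custID and per itemID.
per_cust, per_item = defaultdict(int), defaultdict(int)
for cust, item, qty in purchases:
    per_cust[cust] += qty
    per_item[item] += qty

# Higher dimension only for combinations whose 1-D aggregates both reach the threshold
# (Apriori idea: a (cust, item) group cannot reach the threshold if either projection misses it).
pairs = defaultdict(int)
for cust, item, qty in purchases:
    if per_cust[cust] >= threshold and per_item[item] >= threshold:
        pairs[(cust, item)] += qty

print({k: v for k, v in pairs.items() if v >= threshold})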

Multiple-Level Association Rules (figure: a concept hierarchy over items such as Food, milk, bread, skim, 2%, Amul Dairy, white, wheat). Items often form a hierarchy, and items at the lower levels are expected to have lower support. Rules regarding itemsets at appropriate levels can be quite useful. The transaction database can be encoded based on dimensions and levels, and we can explore shared multi-level mining.

Mining Multi-Level Associations. A top-down, progressive deepening approach: first find high-level strong rules, e.g. milk → bread [20%, 60%]; then find their lower-level “weaker” rules, e.g. 2% milk → wheat bread [6%, 50%]. Variations on mining multiple-level association rules: level-crossed association rules, e.g. 2% milk → Wonder wheat bread; association rules with multiple, alternative hierarchies, e.g. 2% milk → Wonder bread.

Multi-level Association: Uniform Support vs. Reduced Support. Uniform support uses the same minimum support for all levels. Pro: only one minimum support threshold, and there is no need to examine itemsets containing any item whose ancestors do not have minimum support. Con: lower-level items do not occur as frequently, so if the support threshold is too high we miss low-level associations, and if it is too low we generate too many high-level associations. Reduced support uses a reduced minimum support at lower levels.

Multi-level Association: Redundancy Filtering. Some rules may be redundant due to “ancestor” relationships between items. Example: milk → wheat bread [support = 8%, confidence = 70%] and 2% milk → wheat bread [support = 2%, confidence = 72%]. We say the first rule is an ancestor of the second rule. A rule is redundant if its support is close to the “expected” value based on the rule's ancestor; for instance, if 2% milk accounts for roughly a quarter of milk purchases, the expected support of the second rule is about 8% × 1/4 = 2%, which matches its actual support, so the second rule adds little beyond its ancestor.

Multi-Dimensional Association: Concepts. Single-dimensional rules: buys(X, “milk”) → buys(X, “bread”). Multi-dimensional rules involve ≥ 2 dimensions or predicates: inter-dimension association rules (no repeated predicates), e.g. age(X, “19-25”) ∧ occupation(X, “student”) → buys(X, “coke”); hybrid-dimension association rules (repeated predicates), e.g. age(X, “19-25”) ∧ buys(X, “popcorn”) → buys(X, “coke”). Categorical attributes have a finite number of possible values with no ordering among them; quantitative attributes are numeric, with an implicit ordering among values.

Techniques for Mining MD Associations. Search for frequent k-predicate sets; for example, {age, occupation, buys} is a 3-predicate set. Techniques can be categorized by how quantitative attributes such as age are treated. (1) Static discretization of quantitative attributes: quantitative attributes are statically discretized using predefined concept hierarchies. (2) Quantitative association rules: quantitative attributes are dynamically discretized into “bins” based on the distribution of the data. (3) Distance-based association rules: a dynamic discretization process that considers the distance between data points.
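For technique (1), a small illustrative sketch of static discretization: a quantitative attribute such as age is mapped onto predefined ranges so that each record becomes a set of categorical predicate values suitable for ordinary frequent-itemset mining (the bins and records below are invented for illustration):

# Illustrative static discretization of a quantitative attribute (age) into predefined bins,
# turning multi-dimensional records into "itemsets" of predicate values.
age_bins = [(0, 18, "age:0-18"), (19, 25, "age:19-25"), (26, 40, "age:26-40"), (41, 200, "age:41+")]

def discretize_age(age):
    for lo, hi, label in age_bins:
        if lo <= age <= hi:
            return label
    return "age:unknown"

records = [
    {"age": 21, "occupation": "student", "buys": "coke"},
    {"age": 23, "occupation": "student", "buys": "popcorn"},
    {"age": 35, "occupation": "teacher", "buys": "milk"},
]
# Each record becomes a set of predicates; these sets feed a standard Apriori/FP-growth run.
transactions = [
    {discretize_age(r["age"]), f'occupation:{r["occupation"]}', f'buys:{r["buys"]}'}
    for r in records
]
print(transactions)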

Interestingness Measurements. Objective measures: two popular measurements are support and confidence. Subjective measures: a rule (pattern) is interesting if it is unexpected (surprising to the user) and/or actionable (the user can do something with it).

Criticism of Support and Confidence. Example 1: among 5000 students, 3000 play basketball, 3750 eat cereal, and 2000 both play basketball and eat cereal. The rule play basketball → eat cereal [40%, 66.7%] is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%. The rule play basketball → not eat cereal [20%, 33.3%] is far more accurate, although it has lower support and confidence.
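Anticipating the lift measure defined on the next slide, a quick check of these numbers (not code from the lecture):

# Lift of "play basketball => eat cereal" using the counts quoted above.
n, basketball, cereal, both = 5000, 3000, 3750, 2000
support = both / n                      # 0.40
confidence = both / basketball          # ~0.667 = P(cereal | basketball)
lift = confidence / (cereal / n)        # P(cereal | basketball) / P(cereal)
print(support, round(confidence, 3), round(lift, 3))   # 0.4 0.667 0.889 -> lift < 1: negative correlation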

(Cont.) Example 2: X and Y are positively correlated, while X and Z are negatively related, yet the support and confidence of X ⇒ Z dominate. We therefore need a measure of dependent or correlated events: P(B|A)/P(B), which is also called the lift of the rule A ⇒ B.