DATA MINING ASSOCIATION RULES
Outline: Large Itemsets; Basic Algorithms (the simple Apriori algorithm, the FP-growth algorithm); Advanced Topics.
Association Rules Outline: Association Rules Problem Overview; Large Itemsets; Association Rule Algorithms (Apriori, Sampling, Partitioning, Parallel Algorithms); Comparing Techniques; Incremental Algorithms; Advanced AR Techniques.
Example: Market Basket Data. Items frequently purchased together, e.g., Computer and Printer. Uses: placement, advertising, sales, coupons. Objective: increase sales and reduce costs. Also called Market Basket Analysis or Shopping Cart Analysis.
Association Rule Definitions. Set of items: I = {I1, I2, …, Im}. Transactions: D = {t1, t2, …, tn}, where each tj ⊆ I. Itemset: {Ii1, Ii2, …, Iik} ⊆ I. Support of an itemset: the percentage of transactions that contain that itemset. Large (frequent) itemset: an itemset whose number of occurrences is above a threshold.
Transaction Data: Supermarket Data. Market basket transactions: t1: {bread, cheese, milk}; t2: {apple, jam, salt, ice-cream}; …; tn: {biscuit, jam, milk}. Concepts: an item is an item/article in a basket; I is the set of all items sold in the store; a transaction is the set of items purchased in one basket, possibly identified by a TID (transaction ID); a transactional dataset is a set of transactions.
Transaction Data: A Set of Documents. A text document data set where each document is treated as a “bag” of keywords: doc1: Student, Teach, School; doc2: Student, School; doc3: Teach, School, City, Game; doc4: Baseball, Basketball; doc5: Basketball, Player, Spectator; doc6: Baseball, Coach, Game, Team; doc7: Basketball, Team, City, Game.
Association Rules Example: {Biscuit} → {FruitJuice}; {Milk, Bread} → {Eggs, Coke}; {FruitJuice, Bread} → {Milk}.
Association Rule Definitions. Association rule (AR): an implication X → Y, where X, Y ⊆ I and X ∩ Y = ∅. Support of an AR X → Y (s): the percentage of transactions that contain X ∪ Y. Confidence of an AR X → Y (α): the ratio of the number of transactions that contain X ∪ Y to the number that contain X.
Association Rules Example (cont'd). Itemset: a collection of one or more items, e.g., {Milk, Bread, Biscuit}. k-itemset: an itemset that contains k items. Support count (σ): the frequency of occurrence of an itemset, e.g., σ({Milk, Bread, Biscuit}) = 2. Support: the fraction of transactions that contain an itemset, e.g., s({Milk, Bread, Biscuit}) = 2/5. Frequent itemset: an itemset whose support is greater than or equal to a minsup threshold.
Association Rule Problem. Given a set of items I = {I1, I2, …, Im} and a database of transactions D = {t1, t2, …, tn}, where ti = {Ii1, Ii2, …, Iik} and Iij ∈ I, the association rule problem is to identify all association rules X → Y with a minimum support and confidence. This is a form of link analysis. NOTE: the support of X → Y is the same as the support of X ∪ Y.
Definition: Association Rule. An implication expression of the form X → Y, where X and Y are itemsets; for example, {Milk, Biscuit} → {FruitJuice}. Rule evaluation metrics: support (s) is the fraction of transactions that contain both X and Y; confidence (c) measures how often items in Y appear in transactions that contain X.
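To make these two metrics concrete, here is a minimal Python sketch; the five toy transactions below are hypothetical, but they are chosen to be consistent with the support and confidence values quoted on the following slides.

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Biscuit", "FruitJuice", "Eggs"},
    {"Milk", "Biscuit", "FruitJuice", "Coke"},
    {"Bread", "Milk", "Biscuit", "FruitJuice"},
    {"Bread", "Milk", "Biscuit", "Coke"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item of `itemset`
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y, transactions):
    # confidence(X -> Y) = support(X union Y) / support(X)
    return support(set(X) | set(Y), transactions) / support(X, transactions)

print(support({"Milk", "Biscuit", "FruitJuice"}, transactions))        # 0.4
print(confidence({"Milk", "Biscuit"}, {"FruitJuice"}, transactions))   # 0.666...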
Association Rule Mining Task. Given a set of transactions T, the goal of association rule mining is to find all rules having support ≥ minsup and confidence ≥ minconf. Brute-force approach: list all possible association rules, compute the support and confidence of each rule, and prune the rules that fail the minsup or minconf thresholds. This is computationally prohibitive.
Mining Association Rules. Example rules: {Milk, Biscuit} → {FruitJuice} (s = 0.4, c = 0.67); {Milk, FruitJuice} → {Biscuit} (s = 0.4, c = 1.0); {Biscuit, FruitJuice} → {Milk} (s = 0.4, c = 0.67); {FruitJuice} → {Milk, Biscuit} (s = 0.4, c = 0.67); {Biscuit} → {Milk, FruitJuice} (s = 0.4, c = 0.5); {Milk} → {Biscuit, FruitJuice} (s = 0.4, c = 0.5).
Observations. All the above rules are binary partitions of the same itemset {Milk, Biscuit, FruitJuice}. Rules originating from the same itemset have identical support but can have different confidence. Thus, we may decouple the support and confidence requirements.
Example 2. Transaction data:
t1: Butter, Cocoa, Milk
t2: Butter, Cheese
t3: Cheese, Boots
t4: Butter, Cocoa, Cheese
t5: Butter, Cocoa, Clothes, Cheese, Milk
t6: Cocoa, Clothes, Milk
t7: Cocoa, Milk, Clothes
Assume minsup = 30% and minconf = 80%. An example frequent itemset: {Cocoa, Clothes, Milk} [sup = 3/7]. Association rules from this itemset include Clothes → Milk, Cocoa [sup = 3/7, conf = 3/3], …, and Clothes, Cocoa → Milk [sup = 3/7, conf = 3/3].
Mining Association Rules: Two-Step Approach. 1. Frequent itemset generation: generate all itemsets whose support ≥ minsup. 2. Rule generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset. Frequent itemset generation is still computationally expensive.
Frequent Itemset Generation. Given d items, there are 2^d possible candidate itemsets.
Frequent Itemset Generation. Brute-force approach: every itemset in the lattice is a candidate frequent itemset; count the support of each candidate by scanning the database and matching each transaction against every candidate. The complexity is about O(NMw), where N is the number of transactions, M = 2^d is the number of candidates, and w is the maximum transaction width, so this is expensive.
Frequent Itemset Generation Strategies. Reduce the number of candidates (M): a complete search has M = 2^d; use pruning techniques to reduce M. Reduce the number of transactions (N): shrink N as the size of the itemsets increases. Reduce the number of comparisons (NM): use efficient data structures to store the candidates or transactions, so there is no need to match every candidate against every transaction.
Reducing the Number of Candidates. Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent. The principle holds because support is anti-monotone: the support of an itemset never exceeds the support of its subsets, i.e., for all X ⊆ Y, s(X) ≥ s(Y).
Apriori Rule. Frequent itemset property: any subset of a frequent itemset is frequent. Contrapositive: if an itemset is not frequent, none of its supersets are frequent.
Illustrating Apriori Principle. (Lattice figure: once an itemset is found to be infrequent, all of its supersets are pruned.)
Illustrating Apriori Principle. With 6 items and minimum support = 3: start from the items (1-itemsets), then the pairs (2-itemsets; no need to generate candidates involving Coke or Eggs), then the triplets (3-itemsets). If every subset is considered, 6C1 + 6C2 + 6C3 = 41 candidates; with support-based pruning, only 6 + 6 + 1 = 13.
Apriori Algorithm. Method: let k = 1 and generate frequent itemsets of length 1. Repeat until no new frequent itemsets are identified: generate length-(k+1) candidate itemsets from the length-k frequent itemsets; prune candidate itemsets containing subsets of length k that are infrequent; count the support of each remaining candidate by scanning the DB; eliminate candidates that are infrequent, leaving only those that are frequent.
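A minimal Python sketch of this level-wise loop (the function and variable names are my own, minsup is taken as an absolute count, and the join is done by pairwise unions rather than the prefix-based join shown on a later slide; the resulting frequent itemsets are the same).

from itertools import combinations

def apriori(transactions, minsup):
    # Returns a dict mapping each frequent itemset (frozenset) to its support count.
    transactions = [set(t) for t in transactions]
    counts = {}
    for t in transactions:                       # frequent 1-itemsets
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Fk = {c: n for c, n in counts.items() if n >= minsup}
    frequent = dict(Fk)
    k = 1
    while Fk:
        # Join: unions of two frequent k-itemsets that yield a (k+1)-itemset
        items = list(Fk)
        candidates = set()
        for i in range(len(items)):
            for j in range(i + 1, len(items)):
                union = items[i] | items[j]
                if len(union) == k + 1:
                    candidates.add(union)
        # Prune: drop candidates that have an infrequent k-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in Fk for s in combinations(c, k))}
        # Count: one scan of the database per level
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        Fk = {c: n for c, n in counts.items() if n >= minsup}
        frequent.update(Fk)
        k += 1
    return frequent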
Apriori Algorithm. A level-wise, candidate-generation-and-test approach (Agrawal & Srikant, 1994). (Figure: a trace on a small database D over items a, b, c, d, e with min_sup = 2; three scans of D successively produce the frequent 1-itemsets, the frequent 2-itemsets, and the single frequent 3-itemset {b, c, e} with support 2.)
Example: Finding Frequent Itemsets. Dataset T, minsup = 0.5:
TID    Items
T100   1, 3, 4
T200   2, 3, 5
T300   1, 2, 3, 5
T400   2, 5
(notation: itemset:count)
1. Scan T: C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3; F1: {1}:2, {2}:3, {3}:3, {5}:3; C2: {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}
2. Scan T: C2: {1,2}:1, {1,3}:2, {1,5}:1, {2,3}:2, {2,5}:3, {3,5}:2; F2: {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2; C3: {2,3,5}
3. Scan T: C3: {2,3,5}:2; F3: {2,3,5}
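Running the apriori sketch from the previous slide on this dataset (minsup passed as the absolute count 2, i.e., 0.5 of 4 transactions) reproduces the trace above.

T = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]        # T100 .. T400
freq = apriori(T, minsup=2)
for itemset in sorted(freq, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), freq[itemset])
# 1-itemsets: {1}:2 {2}:3 {3}:3 {5}:3; 2-itemsets: {1,3}:2 {2,3}:2 {2,5}:3 {3,5}:2; 3-itemset: {2,3,5}:2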
Details: Ordering of Items. The items in I are sorted in lexicographic order (which is a total order). The order is used throughout the algorithm in each itemset. {w[1], w[2], …, w[k]} represents a k-itemset w consisting of items w[1], w[2], …, w[k], where w[1] < w[2] < … < w[k] according to the total order.
Apriori-Gen. Generate candidates of size i+1 from the large itemsets of size i. Approach used: join two large itemsets of size i that agree on their first i−1 items. Candidates that have subsets that are not large may also be pruned.
Candidate-Gen Function
Function candidate-gen(Fk−1)
  Ck ← ∅;
  forall f1, f2 ∈ Fk−1
      with f1 = {i1, …, ik−2, ik−1}
      and  f2 = {i1, …, ik−2, i'k−1}
      and  ik−1 < i'k−1 do
    c ← {i1, …, ik−1, i'k−1};      // join f1 and f2
    Ck ← Ck ∪ {c};
    for each (k−1)-subset s of c do
      if (s ∉ Fk−1) then
        delete c from Ck;          // prune
    end
  end
  return Ck;
An Example. F3 = {{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}}. After the join: C4 = {{1, 2, 3, 4}, {1, 3, 4, 5}}. After pruning: C4 = {{1, 2, 3, 4}}, because {1, 4, 5} is not in F3 ({1, 3, 4, 5} is removed).
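A small Python sketch of candidate-gen applied to this F3 (names are illustrative; the prune is applied as each candidate is created rather than after the join, which yields the same C4).

from itertools import combinations

def candidate_gen(F_prev, k):
    # Build C_k from the frequent (k-1)-itemsets by join + prune.
    F_prev = set(F_prev)
    Ck = set()
    for f1 in F_prev:
        for f2 in F_prev:
            a, b = sorted(f1), sorted(f2)
            # Join: identical first k-2 items, last item of f1 smaller than last item of f2
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                c = f1 | f2
                # Prune: every (k-1)-subset of c must itself be frequent
                if all(frozenset(s) in F_prev for s in combinations(c, k - 1)):
                    Ck.add(c)
    return Ck

F3 = [frozenset(s) for s in ({1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4})]
print(candidate_gen(F3, 4))     # {frozenset({1, 2, 3, 4})} -- {1, 3, 4, 5} is pruned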
Apriori-Gen Example (figures, continued over two slides; not reproduced here).
Apriori: Advantages and Disadvantages. Advantages: it uses the large itemset property, is easily parallelized, and is easy to implement. Disadvantages: it assumes the transaction database is memory resident, and it requires up to m database scans, where m is the size of the largest frequent itemset.
Step 2: Generating Rules from Frequent Itemsets. Frequent itemsets alone are not association rules; one more step is needed to generate rules. For each frequent itemset X and each proper nonempty subset A of X, let B = X − A. Then A → B is an association rule if confidence(A → B) ≥ minconf, where support(A → B) = support(A ∪ B) = support(X) and confidence(A → B) = support(A ∪ B) / support(A).
Generating Rules: An Example. Suppose {2,3,4} is frequent with sup = 50%. Its proper nonempty subsets {2,3}, {2,4}, {3,4}, {2}, {3}, {4} have sup = 50%, 50%, 75%, 75%, 75%, 75% respectively. These generate the association rules: 2,3 → 4 (confidence = 100%); 2,4 → 3 (confidence = 100%); 3,4 → 2 (confidence = 67%); 2 → 3,4 (confidence = 67%); 3 → 2,4 (confidence = 67%); 4 → 2,3 (confidence = 67%). All rules have support = 50%.
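A Python sketch of this step; the support table mirrors the {2, 3, 4} example above, and minconf = 80% is an assumed threshold chosen only for illustration. It assumes the support of every subset was recorded during itemset generation (see the remark two slides below).

from itertools import combinations

def generate_rules(freq, minconf):
    # freq maps frozenset -> support (fraction); returns (antecedent, consequent, confidence) triples.
    rules = []
    for X, supX in freq.items():
        if len(X) < 2:
            continue
        for r in range(1, len(X)):                      # all proper nonempty subsets A of X
            for A in map(frozenset, combinations(X, r)):
                conf = supX / freq[A]                   # support(X) / support(A)
                if conf >= minconf:
                    rules.append((set(A), set(X - A), conf))
    return rules

freq = {frozenset({2}): 0.75, frozenset({3}): 0.75, frozenset({4}): 0.75,
        frozenset({2, 3}): 0.50, frozenset({2, 4}): 0.50, frozenset({3, 4}): 0.75,
        frozenset({2, 3, 4}): 0.50}
for A, B, c in generate_rules(freq, minconf=0.8):
    print(sorted(A), "->", sorted(B), round(c, 2))
# Only 2,3 -> 4 and 2,4 -> 3 (confidence 1.0) survive at minconf = 80%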
Rule Generation. Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement. If {A,B,C,D} is a frequent itemset, the candidate rules are: ABC → D, ABD → C, ACD → B, BCD → A, A → BCD, B → ACD, C → ABD, D → ABC, AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB. If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L).
Generating Rules. To recap, in order to obtain A → B we need support(A ∪ B) and support(A). All the information required for the confidence computation has already been recorded during itemset generation, so there is no need to scan the data T again. This step is not as time-consuming as frequent itemset generation. (Han and Kamber, 2001)
Rule Generation. How do we efficiently generate rules from frequent itemsets? In general, confidence does not have an anti-monotone property: c(ABC → D) can be larger or smaller than c(AB → D). But the confidence of rules generated from the same itemset is anti-monotone with respect to the number of items on the right-hand side; e.g., for L = {A,B,C,D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD).
Rule Generation for the Apriori Algorithm. (Figure: the lattice of rules from one frequent itemset; a low-confidence rule is identified and all rules below it are pruned.)
Apriori: Performance Bottlenecks. The core of the Apriori algorithm: use frequent (k−1)-itemsets to generate candidate frequent k-itemsets, and use database scans and pattern matching to collect counts for the candidates. The bottleneck of Apriori is candidate generation. Huge candidate sets: 10^4 frequent 1-itemsets generate about 10^7 candidate 2-itemsets; to discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates. Multiple scans of the database: it needs (n + 1) scans, where n is the length of the longest pattern.
Mining Frequent Patterns Without Candidate Generation. Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure: highly condensed, but complete for frequent pattern mining, and it avoids costly repeated database scans. Develop an efficient FP-tree-based frequent pattern mining method: a divide-and-conquer methodology that decomposes mining tasks into smaller ones and avoids candidate generation (sub-database tests only).
Construct an FP-tree from a Transaction DB (min_support = 0.5, i.e., count ≥ 3):
TID   Items bought                (Ordered) frequent items
100   {f, a, c, d, g, i, m, p}    {f, c, a, m, p}
200   {a, b, c, f, l, m, o}       {f, c, a, b, m}
300   {b, f, h, j, o}             {f, b}
400   {b, c, k, s, p}             {c, b, p}
500   {a, f, c, e, l, p, m, n}    {f, c, a, m, p}
Header table (item: frequency): f: 4, c: 4, a: 3, b: 3, m: 3, p: 3.
(Figure: the resulting FP-tree; its main branch is {} → f:4 → c:3 → a:3 → m:2 → p:2, with a b:1 → m:1 side branch under a:3, a b:1 child under f:4, and a separate branch {} → c:1 → b:1 → p:1.)
Steps: 1. Scan the DB once and find the frequent 1-itemsets (single-item patterns). 2. Order the frequent items in frequency-descending order. 3. Scan the DB again and construct the FP-tree.
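A compact two-scan construction sketch in Python (class and function names are my own; ties between equally frequent items are broken arbitrarily, so the exact tree shape may differ slightly from the figure, although the per-item counts match).

from collections import Counter, defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}

def build_fptree(transactions, minsup_count):
    # Scan 1: find frequent single items and fix a frequency-descending order
    freq = Counter(item for t in transactions for item in t)
    freq = {i: n for i, n in freq.items() if n >= minsup_count}
    order = {i: r for r, i in enumerate(sorted(freq, key=freq.get, reverse=True))}
    root, header = FPNode(None, None), defaultdict(list)
    # Scan 2: insert each transaction's frequent items, most frequent first
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in freq), key=order.get):
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])    # node-link chain
            node = node.children[item]
            node.count += 1
    return root, header

db = [list("facdgimp"), list("abcflmo"), list("bfhjo"), list("bcksp"), list("afcelpmn")]
root, header = build_fptree(db, minsup_count=3)
print({i: sum(n.count for n in header[i]) for i in header})
# {'f': 4, 'c': 4, 'a': 3, 'm': 3, 'p': 3, 'b': 3} (key order may vary)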
Benefits of the FP-tree Structure. Completeness: it never breaks a long pattern of any transaction and preserves complete information for frequent pattern mining. Compactness: it discards irrelevant information (infrequent items are gone); the frequency-descending ordering means more frequent items are more likely to be shared; and the tree is never larger than the original database (not counting node-links and counts).
Major Steps to Mine an FP-tree: 1. Construct the conditional pattern base for each node in the FP-tree. 2. Construct the conditional FP-tree from each conditional pattern base. 3. Recursively mine the conditional FP-trees and grow the frequent patterns obtained so far. If a conditional FP-tree contains a single path, simply enumerate all the patterns.
Step 1: FP-tree to Conditional Pattern Base. Starting from the header table of the FP-tree, traverse the tree by following the node-links of each frequent item and accumulate all of that item's transformed prefix paths to form its conditional pattern base.
Conditional pattern bases (for the FP-tree above):
item   conditional pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1
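Continuing the build_fptree sketch above, the conditional pattern base of an item is collected by walking each of its node-links up towards the root; for p this reproduces the fcam:2, cb:1 entry of the table (the ordering of ties among equally frequent items may differ from the figure).

def conditional_pattern_base(item, header):
    # Returns a list of (prefix_path, count) pairs, one per node labelled `item`.
    base = []
    for node in header[item]:
        path, parent = [], node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)            # climb towards the root
            parent = parent.parent
        if path:
            base.append((list(reversed(path)), node.count))
    return base

print(conditional_pattern_base('p', header))
# [(['f', 'c', 'a', 'm'], 2), (['c', 'b'], 1)]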
Properties of the FP-tree for Conditional Pattern Base Construction. Node-link property: for any frequent item ai, all possible frequent patterns that contain ai can be obtained by following ai's node-links, starting from ai's entry in the FP-tree header table. Prefix-path property: to calculate the frequent patterns for a node ai in a path P, only the prefix sub-path of ai in P needs to be accumulated, and its frequency count should carry the same count as node ai.
Step 2: Construct the Conditional FP-tree. For each pattern base: accumulate the count of each item in the base, then construct an FP-tree over the frequent items of the pattern base. Example: the m-conditional pattern base is fca:2, fcab:1, giving accumulated counts f:3, c:3, a:3, b:1; b is infrequent and is dropped, so the m-conditional FP-tree is the single path {} → f:3 → c:3 → a:3. All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam.
Mining Frequent Patterns by Creating Conditional Pattern Bases
Item   Conditional pattern base       Conditional FP-tree
p      {(fcam:2), (cb:1)}             {(c:3)} | p
m      {(fca:2), (fcab:1)}            {(f:3, c:3, a:3)} | m
b      {(fca:1), (f:1), (c:1)}        Empty
a      {(fc:3)}                       {(f:3, c:3)} | a
c      {(f:3)}                        {(f:3)} | c
f      Empty                          Empty
Step 3: Recursively Mine the Conditional FP-tree. Starting from the m-conditional FP-tree ({} → f:3 → c:3 → a:3): the conditional pattern base of “am” is (fc:3), giving the am-conditional FP-tree {} → f:3 → c:3; the conditional pattern base of “cm” is (f:3), giving the cm-conditional FP-tree {} → f:3; the conditional pattern base of “cam” is (f:3), giving the cam-conditional FP-tree {} → f:3.
Single FP-tree Path Generation. Suppose an FP-tree T has a single path P. The complete set of frequent patterns of T can be generated by enumerating all combinations of the sub-paths of P. Example: the m-conditional FP-tree is the single path {} → f:3 → c:3 → a:3, so all frequent patterns concerning m are m, fm, cm, am, fcm, fam, cam, fcam.
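Since the enumeration for a single path is just a power set over the path nodes, it can be written directly; a small sketch, where the suffix support 3 is the count of m itself.

from itertools import combinations

def patterns_from_single_path(path, suffix, suffix_support):
    # path: list of (item, count) pairs on the single path; suffix: the conditioning item.
    for r in range(len(path) + 1):
        for combo in combinations(path, r):
            items = [i for i, _ in combo] + [suffix]
            support = min((c for _, c in combo), default=suffix_support)
            yield "".join(items), support

for pattern, sup in patterns_from_single_path([("f", 3), ("c", 3), ("a", 3)], "m", 3):
    print(pattern, sup)
# m 3, fm 3, cm 3, am 3, fcm 3, fam 3, cam 3, fcam 3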
Principles of Frequent Pattern Growth. Pattern growth property: let α be a frequent itemset in DB, B be α's conditional pattern base, and β be an itemset in B. Then α ∪ β is a frequent itemset in DB if and only if β is frequent in B. Example: “abcdef” is a frequent pattern if and only if “abcde” is a frequent pattern and “f” is frequent in the set of transactions containing “abcde”.
Why Is Frequent Pattern Growth Fast? A performance study shows that FP-growth is an order of magnitude faster than Apriori, and also faster than tree-projection. Reasoning: no candidate generation and no candidate test; a compact data structure; no repeated database scans; the basic operations are counting and FP-tree building.
FP-Tree Construction. (Figure: after reading TID=1 the tree is null → A:1 → B:1; after reading TID=2 a second branch null → B:1 → C:1 → D:1 is added alongside it.)
FP-Tree Construction. (Figure: the complete FP-tree built from the transaction database, with node counts such as A:7, B:5, B:3, and C:3, and a header table whose node-link pointers chain together all nodes carrying the same item.) Pointers are used to assist frequent itemset generation.
FP-growth. Build the conditional pattern base for E: P = {(A:1, C:1, D:1), (A:1, D:1), (B:1, C:1)}, then recursively apply FP-growth on P. (Figure: the paths of the FP-tree ending in E.)
FP-growth. Conditional pattern base for E: P = {(A:1, C:1, D:1, E:1), (A:1, D:1, E:1), (B:1, C:1, E:1)}. The count for E is 3, so {E} is a frequent itemset. Recursively apply FP-growth on P. (Figure: the conditional tree for E.)
FP-growth. Conditional pattern base for D within the conditional base for E: P = {(A:1, C:1, D:1), (A:1, D:1)}. The count for D is 2, so {D, E} is a frequent itemset. Recursively apply FP-growth on P. (Figure: the conditional tree for D within the conditional tree for E.)
FP-growth. Conditional pattern base for C within D within E: P = {(A:1, C:1)}. The count for C is 1, so {C, D, E} is NOT a frequent itemset. (Figure: the conditional tree for C within D within E.)
FP-growth. The conditional tree for A within D within E contains only A:2. The count for A is 2, so {A, D, E} is a frequent itemset. Next step: construct the conditional tree for C within the conditional tree for E, and continue until the conditional tree for A (which contains only the node A) has been explored.
Iceberg Queries. An iceberg query computes aggregates over one attribute or a set of attributes only for those groups whose aggregate value is above a certain threshold. Example:
select P.custID, P.itemID, sum(P.qty)
from purchase P
group by P.custID, P.itemID
having sum(P.qty) >= 10
Iceberg queries can be computed efficiently with an Apriori-style strategy: first compute the lower dimensions, then compute higher dimensions only when all the lower ones are above the threshold.
Multiple-Level Association Rules. Items often form a hierarchy (figure: Food at the top, with dairy/milk branching into skim, 2%, and Amul, and bread branching into white and wheat). Items at the lower levels are expected to have lower support. Rules regarding itemsets at the appropriate levels can be quite useful. The transaction database can be encoded based on dimensions and levels, and we can explore shared multi-level mining.
Mining Multi-Level Associations. A top-down, progressive deepening approach: first find high-level strong rules, e.g., milk → bread [20%, 60%]; then find their lower-level “weaker” rules, e.g., 2% milk → wheat bread [6%, 50%]. Variations in mining multiple-level association rules: level-crossing association rules, e.g., 2% milk → Wonder wheat bread; association rules with multiple, alternative hierarchies, e.g., 2% milk → Wonder bread.
Multi-level Association: Uniform Support vs. Reduced Support. Uniform support uses the same minimum support for all levels. + Only one minimum support threshold is needed, and there is no need to examine itemsets containing any item whose ancestors do not have minimum support. - Lower-level items do not occur as frequently: if the support threshold is too high we miss low-level associations, and if it is too low we generate too many high-level associations. Reduced support uses a reduced minimum support at lower levels.
Multi-level Association: Redundancy Filtering. Some rules may be redundant due to “ancestor” relationships between items. Example: milk → wheat bread [support = 8%, confidence = 70%] and 2% milk → wheat bread [support = 2%, confidence = 72%]. We say the first rule is an ancestor of the second rule. A rule is redundant if its support is close to the “expected” value derived from the rule's ancestor.
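A toy redundancy check based on the two rules above; the figure of 25% for the share of milk purchases that are 2% milk is an assumption made only for this illustration.

ancestor_support = 0.08      # milk -> wheat bread
child_share = 0.25           # assumed fraction of milk transactions that are "2% milk"
expected_child_support = ancestor_support * child_share     # 0.02
observed_child_support = 0.02                                # 2% milk -> wheat bread
print(abs(observed_child_support - expected_child_support) < 0.005)
# True: the child rule's support is close to its expected value, so the rule is redundant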
Multi-Dimensional Association: Concepts. Single-dimensional rules: buys(X, “milk”) → buys(X, “bread”). Multi-dimensional rules involve two or more dimensions or predicates. Inter-dimension association rules (no repeated predicates): age(X, “19-25”) ∧ occupation(X, “student”) → buys(X, “coke”). Hybrid-dimension association rules (repeated predicates): age(X, “19-25”) ∧ buys(X, “popcorn”) → buys(X, “coke”). Categorical attributes have a finite number of possible values with no ordering among them; quantitative attributes are numeric, with an implicit ordering among values.
Techniques for Mining Multi-Dimensional (MD) Associations. Search for frequent k-predicate sets; for example, {age, occupation, buys} is a 3-predicate set. Techniques can be categorized by how quantitative attributes such as age are treated. 1. Static discretization of quantitative attributes: the attributes are discretized in advance using predefined concept hierarchies. 2. Quantitative association rules: the attributes are dynamically discretized into “bins” based on the distribution of the data. 3. Distance-based association rules: a dynamic discretization process that considers the distance between data points.
Interestingness Measurements. Objective measures: two popular measures are support and confidence. Subjective measures: a rule (pattern) is interesting if it is unexpected (surprising to the user) and/or actionable (the user can do something with it).
Criticism of Support and Confidence. Example 1: among 5000 students, 3000 play basketball, 3750 eat cereal, and 2000 both play basketball and eat cereal. The rule play basketball → eat cereal [40%, 66.7%] is misleading because the overall percentage of students eating cereal is 75%, which is higher than 66.7%. The rule play basketball → not eat cereal [20%, 33.3%] is far more accurate, although it has lower support and confidence.
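The same numbers expressed as a lift computation (lift is defined on the next slide); a lift below 1 makes the negative correlation explicit.

n, basketball, cereal, both = 5000, 3000, 3750, 2000

support = both / n                         # 0.40
confidence = both / basketball             # 0.666...
lift = confidence / (cereal / n)           # P(cereal | basketball) / P(cereal)
print(round(support, 2), round(confidence, 3), round(lift, 3))   # 0.4 0.667 0.889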
Criticism of Support and Confidence (cont.). Example 2: X and Y are positively correlated while X and Z are negatively correlated, yet the support and confidence of X → Z dominate. We therefore need a measure of dependent or correlated events: P(B|A)/P(B), equivalently P(A ∪ B)/(P(A) P(B)), is called the lift of the rule A → B; lift above 1 indicates positive correlation, lift below 1 negative correlation, and lift equal to 1 independence.