DATA MINING ASSOCIATION RULES
Outline: Large Itemsets; Basic Algorithms; the Apriori Algorithm; the FP-Growth Algorithm; Advanced Topics.
Association Rules Outline
Association rules problem overview; large itemsets; association rule algorithms (Apriori, sampling, partitioning, parallel algorithms); comparing techniques; incremental algorithms; advanced AR techniques.
Example: Market Basket Data
Items frequently purchased together, e.g. Computer → Printer. Uses: shelf placement, advertising, sales, coupons. Objective: increase sales and reduce costs. Also called Market Basket Analysis or Shopping Cart Analysis.
Association Rule Definitions
Set of items: I = {I1, I2, …, Im}. Transactions: D = {t1, t2, …, tn}, where each tj ⊆ I. Itemset: {Ii1, Ii2, …, Iik} ⊆ I. Support of an itemset: the percentage of transactions that contain that itemset. Large (frequent) itemset: an itemset whose number of occurrences is above a threshold.
Transaction Data: Supermarket Data
Market basket transactions: t1: {bread, cheese, milk}; t2: {apple, jam, salt, ice-cream}; …; tn: {biscuit, jam, milk}. Concepts: an item is an article in a basket; I is the set of all items sold in the store; a transaction is the set of items purchased in a basket (it may have a TID, transaction ID); a transactional dataset is a set of transactions.
Transaction Data: A Set of Documents
A text document data set, where each document is treated as a "bag" of keywords: doc1: Student, Teach, School; doc2: Student, School; doc3: Teach, School, City, Game; doc4: Baseball, Basketball; doc5: Basketball, Player, Spectator; doc6: Baseball, Coach, Game, Team; doc7: Basketball, Team, City, Game.
Association Rules Example
{Biscuit} → {FruitJuice}, {Milk, Bread} → {Eggs, Coke}, {FruitJuice, Bread} → {Milk}.
Association Rule Definitions
Association rule (AR): an implication X → Y, where X, Y ⊆ I and X ∩ Y = ∅. Support of an AR X → Y (s): the percentage of transactions that contain X ∪ Y. Confidence of an AR X → Y (α): the ratio of the number of transactions that contain X ∪ Y to the number that contain X.
Association Rules Example (cont'd)
Itemset: a collection of one or more items, e.g. {Milk, Bread, Biscuit}. k-itemset: an itemset that contains k items. Support count (σ): the frequency of occurrence of an itemset, e.g. σ({Milk, Bread, Biscuit}) = 2. Support: the fraction of transactions that contain an itemset, e.g. s({Milk, Bread, Biscuit}) = 2/5. Frequent itemset: an itemset whose support is greater than or equal to a minsup threshold.
Association Rule Problem
Given a set of items I = {I1, I2, …, Im} and a database of transactions D = {t1, t2, …, tn}, where ti = {Ii1, Ii2, …, Iik} and Iij ∈ I, the association rule problem is to identify all association rules X → Y with a minimum support and confidence. (Also known as link analysis.) Note: the support of X → Y is the same as the support of X ∪ Y.
Definition: Association Rule
An association rule is an implication expression of the form X → Y, where X and Y are itemsets, e.g. {Milk, Biscuit} → {FruitJuice}. Rule evaluation metrics: Support (s) is the fraction of transactions that contain both X and Y; Confidence (c) measures how often items in Y appear in transactions that contain X. Example: for {Milk, Biscuit} → {FruitJuice} in the sample data, s = 0.4 and c = 0.67 (these values reappear in the rule list a few slides below); a small computation sketch follows.
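To make the two metrics concrete, here is a minimal Python sketch. The five transactions are a hypothetical reconstruction (the slides' transaction table is not reproduced in this text), chosen only to be consistent with the support and confidence values quoted here and in the later rule list.

```python
# Minimal sketch of support/confidence computation.
# The transactions below are illustrative, not the slides' original table.
transactions = [
    {"Milk", "Biscuit", "FruitJuice"},
    {"Milk", "Biscuit", "FruitJuice"},
    {"Milk", "Biscuit"},
    {"Biscuit", "FruitJuice"},
    {"Milk", "Bread"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """support(lhs ∪ rhs) / support(lhs)."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

print(support({"Milk", "Biscuit", "FruitJuice"}, transactions))      # 0.4
print(confidence({"Milk", "Biscuit"}, {"FruitJuice"}, transactions)) # ~0.67
```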
Association Rule Mining Task
Given a set of transactions T, the goal of association rule mining is to find all rules having support ≥ the minsup threshold and confidence ≥ the minconf threshold. Brute-force approach: list all possible association rules, compute the support and confidence for each rule, and prune the rules that fail the minsup and minconf thresholds. This is computationally prohibitive!
Mining Association Rules
Example rules: {Milk, Biscuit} → {FruitJuice} (s = 0.4, c = 0.67); {Milk, FruitJuice} → {Biscuit} (s = 0.4, c = 1.0); {Biscuit, FruitJuice} → {Milk} (s = 0.4, c = 0.67); {FruitJuice} → {Milk, Biscuit} (s = 0.4, c = 0.67); {Biscuit} → {Milk, FruitJuice} (s = 0.4, c = 0.5); {Milk} → {Biscuit, FruitJuice} (s = 0.4, c = 0.5).
Observations: all the above rules are binary partitions of the same itemset, {Milk, Biscuit, FruitJuice}. Rules originating from the same itemset have identical support but can have different confidence. Thus, we may decouple the support and confidence requirements.
Example 2
Transaction data: t1: Butter, Cocoa, Milk; t2: Butter, Cheese; t3: Cheese, Boots; t4: Butter, Cocoa, Cheese; t5: Butter, Cocoa, Clothes, Cheese, Milk; t6: Cocoa, Clothes, Milk; t7: Cocoa, Milk, Clothes. Assume minsup = 30% and minconf = 80%. An example frequent itemset: {Cocoa, Clothes, Milk} [sup = 3/7]. Association rules from this itemset include: Clothes → Milk, Cocoa [sup = 3/7, conf = 3/3]; …; Clothes, Cocoa → Milk [sup = 3/7, conf = 3/3].
Mining Association Rules
Two-step approach: (1) Frequent itemset generation: generate all itemsets whose support ≥ minsup. (2) Rule generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset. Frequent itemset generation is still computationally expensive.
Frequent Itemset Generation
Given d items, there are 2^d possible candidate itemsets.
Frequent Itemset Generation
Brute-force approach: each itemset in the lattice is a candidate frequent itemset. Count the support of each candidate by scanning the database, matching each of the N transactions against each of the M candidates. With maximum transaction width w, the complexity is roughly O(NMw), which is expensive since M = 2^d.
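A minimal sketch of this brute-force enumeration, assuming transactions are given as Python sets; it is exponential in the number of items d, which is exactly the problem the following slides address.

```python
from itertools import combinations

def brute_force_frequent_itemsets(transactions, minsup):
    """Enumerate every candidate itemset in the lattice (2^d of them)
    and keep those whose support is at least minsup."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    frequent = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):  # M = 2^d candidates overall
            count = sum(set(candidate) <= t for t in transactions)  # full scan each time
            if count / n >= minsup:
                frequent[candidate] = count
    return frequent
```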
Frequent Itemset Generation Strategies
Reduce the number of candidates (M): complete search has M = 2^d; use pruning techniques to reduce M. Reduce the number of transactions (N): reduce the size of N as the size of the itemset increases. Reduce the number of comparisons (NM): use efficient data structures to store the candidates or transactions, so there is no need to match every candidate against every transaction.
Reducing the Number of Candidates
Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent. The Apriori principle holds due to the anti-monotone property of the support measure: the support of an itemset never exceeds the support of any of its subsets.
Apriori Rule
Frequent itemset property: any subset of a frequent itemset is frequent. Contrapositive: if an itemset is not frequent, none of its supersets is frequent.
Illustrating the Apriori Principle
(Lattice figure: an itemset found to be infrequent has all of its supersets pruned.)
Illustrating the Apriori Principle
Items (1-itemsets), then pairs (2-itemsets), with no need to generate candidates involving Coke or Eggs, then triplets (3-itemsets); minimum support count = 3. If every subset is considered, 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41 candidates; with support-based pruning, only 13 remain.
Apriori Algorithm
Method: let k = 1 and generate the frequent itemsets of length 1. Repeat until no new frequent itemsets are identified: generate length-(k+1) candidate itemsets from the length-k frequent itemsets; prune candidate itemsets containing any length-k subset that is infrequent; count the support of each candidate by scanning the DB; eliminate candidates that are infrequent, leaving only those that are frequent. A minimal Python sketch of this loop follows.
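A runnable sketch of the level-wise loop above (not the slides' exact pseudocode). It represents itemsets as frozensets and joins any two frequent k-itemsets whose union has k+1 items, rather than the prefix-based join of the candidate-gen function shown later; minsup is a fraction of the number of transactions.

```python
from itertools import combinations
from collections import defaultdict

def apriori(transactions, minsup):
    """Level-wise Apriori sketch: returns {frozenset itemset: support count}."""
    n = len(transactions)
    min_count = minsup * n

    # k = 1: count single items
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[frozenset([item])] += 1
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    all_frequent = dict(frequent)

    k = 1
    while frequent:
        # candidate generation: join frequent k-itemsets, then prune candidates
        # that contain an infrequent k-subset
        prev = list(frequent)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k + 1 and all(
                    frozenset(s) in frequent for s in combinations(union, k)
                ):
                    candidates.add(union)
        # support counting: one scan of the DB per level
        counts = defaultdict(int)
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {c: cnt for c, cnt in counts.items() if cnt >= min_count}
        all_frequent.update(frequent)
        k += 1
    return all_frequent
```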
Apriori Algorithm
A level-wise, candidate-generation-and-test approach (Agrawal & Srikant, 1994). Worked trace with min_sup = 2: scan D once to count the 1-candidates (e.g. a:2, b:3, d:1) and keep the frequent 1-itemsets {a}, {b}, {c}, {e} (d is pruned). Join these to form the 2-candidates ab, ac, ae, bc, be, ce; scan D again to count them (e.g. ab:1, ac:2, be:3) and keep the frequent 2-itemsets {ac}, {bc}, {be}, {ce}. Join these to form the single 3-candidate bce; a third scan of D gives its count, 2, so the frequent 3-itemset is {b, c, e}.
Example: Finding Frequent Itemsets
Dataset T (minsup = 0.5): T100: {1, 3, 4}; T200: {2, 3, 5}; T300: {1, 2, 3, 5}; T400: {2, 5}.
Scan 1 of T: C1 = {1}:2, {2}:3, {3}:3, {4}:1, {5}:3; F1 = {1}:2, {2}:3, {3}:3, {5}:3; C2 = {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}.
Scan 2 of T: C2 = {1,2}:1, {1,3}:2, {1,5}:1, {2,3}:2, {2,5}:3, {3,5}:2; F2 = {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2; C3 = {2,3,5}.
Scan 3 of T: C3 = {2,3,5}:2; F3 = {2,3,5}.
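As a check, running the Apriori sketch given earlier (the `apriori` function is from that sketch, not from the slides) on dataset T with minsup = 0.5 reproduces F1, F2 and F3 above.

```python
T = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = apriori(T, minsup=0.5)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
# Expected: {1}:2, {2}:3, {3}:3, {5}:3, {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2, {2,3,5}:2
```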
Details: Ordering of Items
The items in I are sorted in lexicographic order (which is a total order). This order is used throughout the algorithm in each itemset. {w[1], w[2], …, w[k]} represents a k-itemset w consisting of items w[1], w[2], …, w[k], where w[1] < w[2] < … < w[k] according to the total order.
Apriori-Gen
Generate candidates of size i+1 from large itemsets of size i. Approach used: join large itemsets of size i if they agree on their first i-1 items. Candidates that have a subset which is not large may also be pruned.
Candidate-Gen Function
Function candidate-gen(Fk-1):
Ck ← ∅;
forall f1, f2 ∈ Fk-1 with f1 = {i1, …, ik-2, ik-1} and f2 = {i1, …, ik-2, i'k-1} and ik-1 < i'k-1 do
  c ← {i1, …, ik-1, i'k-1};   // join f1 and f2
  Ck ← Ck ∪ {c};
  for each (k-1)-subset s of c do
    if (s ∉ Fk-1) then delete c from Ck;   // prune
  end
end
return Ck;
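A runnable Python version of the join-and-prune step above, as a sketch. Itemsets are represented here as sorted tuples; that representation is an implementation choice, not something the slides prescribe.

```python
from itertools import combinations

def candidate_gen(F_prev):
    """Generate C_k from the frequent (k-1)-itemsets F_prev (sorted tuples).
    Join two itemsets when they share their first k-2 items, then prune
    candidates that have an infrequent (k-1)-subset."""
    F_prev = sorted(F_prev)
    F_set = set(F_prev)
    C_k = []
    for i in range(len(F_prev)):
        for j in range(i + 1, len(F_prev)):
            f1, f2 = F_prev[i], F_prev[j]
            if f1[:-1] == f2[:-1] and f1[-1] < f2[-1]:  # join condition
                c = f1 + (f2[-1],)
                # prune: every (k-1)-subset of c must be frequent
                if all(s in F_set for s in combinations(c, len(c) - 1)):
                    C_k.append(c)
    return C_k

# Example from the next slide:
F3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
print(candidate_gen(F3))   # [(1, 2, 3, 4)]; (1, 3, 4, 5) is pruned since {1, 4, 5} is not in F3
```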
An Example
F3 = {{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}}. After the join: C4 = {{1, 2, 3, 4}, {1, 3, 4, 5}}. After pruning: C4 = {{1, 2, 3, 4}}, because {1, 4, 5} is not in F3, so {1, 3, 4, 5} is removed.
Apriori-Gen Example / Apriori-Gen Example (cont'd)
(Figure-only slides: a worked Apriori-Gen example; the figures are not reproduced in this text.)
Apriori: Advantages and Disadvantages
Advantages: uses the large itemset property; easily parallelized; easy to implement. Disadvantages: assumes the transaction database is memory resident; requires up to m database scans.
Step 2: Generating Rules from Frequent Itemsets
Frequent itemsets are not yet association rules: one more step is needed to generate the rules. For each frequent itemset X and each proper nonempty subset A of X, let B = X - A; then A → B is an association rule if confidence(A → B) ≥ minconf, where support(A → B) = support(A ∪ B) = support(X) and confidence(A → B) = support(A ∪ B) / support(A). A small sketch of this step follows.
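A sketch of this rule-generation step in Python, assuming `frequent` maps each frequent itemset (a frozenset) to its support count, as the earlier Apriori sketch produces; confidence then reduces to a ratio of counts.

```python
from itertools import combinations

def generate_rules(frequent, minconf):
    """Generate rules A -> B from frequent itemsets.
    `frequent`: {frozenset itemset: support count}."""
    rules = []
    for X, x_count in frequent.items():
        if len(X) < 2:
            continue
        for r in range(1, len(X)):
            for A in combinations(X, r):
                A = frozenset(A)
                B = X - A
                conf = x_count / frequent[A]   # A is frequent because X is
                if conf >= minconf:
                    rules.append((set(A), set(B), conf))
    return rules

# e.g. rules = generate_rules(apriori(T, minsup=0.5), minconf=0.8)
```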
Generating Rules: An Example
Suppose {2, 3, 4} is frequent, with sup = 50%. Its proper nonempty subsets are {2,3}, {2,4}, {3,4}, {2}, {3}, {4}, with sup = 50%, 50%, 75%, 75%, 75%, 75% respectively. These generate the association rules: 2,3 → 4 (confidence = 100%); 2,4 → 3 (confidence = 100%); 3,4 → 2 (confidence = 67%); 2 → 3,4 (confidence = 67%); 3 → 2,4 (confidence = 67%); 4 → 2,3 (confidence = 67%). All of these rules have support = 50%.
Rule Generation
Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L - f satisfies the minimum confidence requirement. If {A, B, C, D} is a frequent itemset, the candidate rules are: ABC → D, ABD → C, ACD → B, BCD → A, A → BCD, B → ACD, C → ABD, D → ABC, AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB. If |L| = k, then there are 2^k - 2 candidate association rules (ignoring L → ∅ and ∅ → L).
Generating Rules
To recap, in order to obtain A → B we need support(A ∪ B) and support(A). All the information required for the confidence computation has already been recorded during itemset generation, so there is no need to see the data T again. This step is not as time-consuming as frequent itemset generation (Han and Kamber, 2001).
Rule Generation
How do we efficiently generate rules from frequent itemsets? In general, confidence does not have an anti-monotone property: c(ABC → D) can be larger or smaller than c(AB → D). But the confidence of rules generated from the same itemset does have an anti-monotone property; e.g., for L = {A, B, C, D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD).
Rule Generation for the Apriori Algorithm
(Figure: a lattice of rules; once a low-confidence rule is found, the rules below it in the lattice are pruned.)
Apriori: Performance Bottlenecks
The core of the Apriori algorithm: use frequent (k-1)-itemsets to generate candidate frequent k-itemsets, and use database scans and pattern matching to collect counts for the candidates. The bottleneck of Apriori is candidate generation. Huge candidate sets: 10^4 frequent 1-itemsets will generate about 10^7 candidate 2-itemsets; to discover a frequent pattern of size 100, e.g. {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates. Multiple scans of the database: it needs n + 1 scans, where n is the length of the longest pattern.
Mining Frequent Patterns Without Candidate Generation
Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure: highly condensed, but complete for frequent pattern mining, and it avoids costly database scans. Develop an efficient FP-tree-based frequent pattern mining method: a divide-and-conquer methodology that decomposes mining tasks into smaller ones and avoids candidate generation (sub-database tests only).
Construct an FP-tree from a Transaction DB (min_support = 0.5)
TID 100: items bought {f, a, c, d, g, i, m, p}; ordered frequent items {f, c, a, m, p}
TID 200: items bought {a, b, c, f, l, m, o}; ordered frequent items {f, c, a, b, m}
TID 300: items bought {b, f, h, j, o}; ordered frequent items {f, b}
TID 400: items bought {b, c, k, s, p}; ordered frequent items {c, b, p}
TID 500: items bought {a, f, c, e, l, p, m, n}; ordered frequent items {f, c, a, m, p}
Header table (item: frequency): f:4, c:4, a:3, b:3, m:3, p:3.
Resulting FP-tree (root {}): the branch f:4 - c:3 - a:3 - m:2 - p:2, with a side branch b:1 - m:1 under a:3 and a branch b:1 under f:4, plus the branch c:1 - b:1 - p:1 directly under the root.
Steps: (1) scan the DB once to find the frequent 1-itemsets (single-item patterns); (2) order the frequent items in descending frequency; (3) scan the DB again and construct the FP-tree. A construction sketch in Python follows.
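A compact sketch of the two-scan construction in Python. The FPNode class and the alphabetical tie-break between equally frequent items are implementation choices, not part of the slides; with the tie-break used here, items f and c (both with count 4) may be ordered differently than in the slide's figure.

```python
from collections import defaultdict

class FPNode:
    """One node of the FP-tree: an item, a count, a parent link and children."""
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}                      # item -> FPNode

def build_fptree(transactions, minsup):
    """Two-scan FP-tree construction sketch.
    Returns (root, header); header maps each frequent item to the list of
    tree nodes labelled with that item (the node-links)."""
    n = len(transactions)
    # Scan 1: frequent items and their counts
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    frequent = {i: c for i, c in counts.items() if c / n >= minsup}

    root, header = FPNode(None), defaultdict(list)
    # Scan 2: insert each transaction with items in descending frequency order
    # (ties broken alphabetically here, an arbitrary but deterministic choice)
    for t in transactions:
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            if item not in node.children:
                child = FPNode(item, parent=node)
                node.children[item] = child
                header[item].append(child)      # maintain node-links
            node = node.children[item]
            node.count += 1
    return root, header
```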
Benefits of the FP-tree Structure
Completeness: it never breaks a long pattern of any transaction and preserves complete information for frequent pattern mining. Compactness: it reduces irrelevant information (infrequent items are gone); the frequency-descending ordering means more frequent items are more likely to be shared; and the tree is never larger than the original database (not counting node-links and counts).
Major Steps to Mine an FP-tree
(1) Construct the conditional pattern base for each node in the FP-tree. (2) Construct the conditional FP-tree from each conditional pattern base. (3) Recursively mine the conditional FP-trees and grow the frequent patterns obtained so far. If a conditional FP-tree contains a single path, simply enumerate all the patterns.
Step 1: FP-tree to Conditional Pattern Base
Starting from the header table of the FP-tree, traverse the tree by following the node-links of each frequent item, and accumulate all of the transformed prefix paths of that item to form its conditional pattern base. Conditional pattern bases for the example tree (item: conditional pattern base): c: f:3; a: fc:3; b: fca:1, f:1, c:1; m: fca:2, fcab:1; p: fcam:2, cb:1. A small extraction sketch follows.
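A small sketch of this traversal, reusing the FPNode and header structures from the construction sketch above (those names are assumptions of that sketch, not the slides).

```python
def conditional_pattern_base(item, header):
    """For a given item, walk each node carrying that item (via the header's
    node-links) up to the root, collecting the prefix path and the node's count."""
    patterns = []
    for node in header[item]:
        path, parent = [], node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:
            patterns.append((list(reversed(path)), node.count))
    return patterns

# For the slides' example tree (with f ordered before c), item 'm'
# yields the pattern base fca:2 and fcab:1.
```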
Properties of the FP-tree for Conditional Pattern Base Construction
Node-link property: for any frequent item ai, all possible frequent patterns that contain ai can be obtained by following ai's node-links, starting from ai's entry in the FP-tree header. Prefix-path property: to calculate the frequent patterns for a node ai in a path P, only the prefix sub-path of ai in P needs to be accumulated, and its frequency count should carry the same count as node ai.
Step 2: Construct the Conditional FP-tree
For each pattern base, accumulate the count for each item in the base, then construct the FP-tree for the frequent items of the pattern base. Example: the m-conditional pattern base is fca:2, fcab:1; the m-conditional FP-tree is the single path {} - f:3 - c:3 - a:3 (b is dropped because its count of 1 is below the minimum support count). All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam.
Mining Frequent Patterns by Creating Conditional Pattern Bases
Item: conditional pattern base; conditional FP-tree:
c: {(f:3)}; {(f:3)}|c
a: {(fc:3)}; {(f:3, c:3)}|a
b: {(fca:1), (f:1), (c:1)}; empty
m: {(fca:2), (fcab:1)}; {(f:3, c:3, a:3)}|m
p: {(fcam:2), (cb:1)}; {(c:3)}|p
f: empty; empty
Step 3: Recursively Mine the Conditional FP-tree
Start from the m-conditional FP-tree, the single path {} - f:3 - c:3 - a:3. The conditional pattern base of "am" is (fc:3), giving the am-conditional FP-tree {} - f:3 - c:3. The conditional pattern base of "cm" is (f:3), giving the cm-conditional FP-tree {} - f:3. The conditional pattern base of "cam" is (f:3), giving the cam-conditional FP-tree {} - f:3.
Single FP-tree Path Generation
Suppose an FP-tree T has a single path P. The complete set of frequent patterns of T can be generated by enumerating all combinations of the sub-paths of P. For the m-conditional FP-tree {} - f:3 - c:3 - a:3, this gives all frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam.
Principles of Frequent Pattern Growth
Pattern growth property: let α be a frequent itemset in DB, let B be α's conditional pattern base, and let β be an itemset in B. Then α ∪ β is a frequent itemset in DB iff β is frequent in B. For example, "abcdef" is a frequent pattern if and only if "abcde" is a frequent pattern and "f" is frequent in the set of transactions containing "abcde".
Why Is Frequent Pattern Growth Fast?
A performance study shows that FP-growth is an order of magnitude faster than Apriori, and is also faster than tree-projection. Reasoning: no candidate generation and no candidate tests; a compact data structure; no repeated database scans; the basic operations are counting and FP-tree building.
FP-tree Construction (second example)
After reading TID=1, the tree is the single path null - A:1 - B:1. After reading TID=2, the root has two branches: A:1 - B:1 and B:1 - C:1 - D:1.
FP-tree Construction: Complete Tree
(Figure: the FP-tree built from the full transaction database, with root children A:7 and B:3, plus its header table.) Pointers (node-links) are used to assist frequent itemset generation.
FP-growth
Build the conditional pattern base for E: P = {(A:1, C:1, D:1), (A:1, D:1), (B:1, C:1)}, then recursively apply FP-growth on P. (Figure: the paths of the full FP-tree that end in E.)
FP-growth: Conditional Tree for E
Conditional pattern base for E: P = {(A:1, C:1, D:1, E:1), (A:1, D:1, E:1), (B:1, C:1, E:1)}. The count for E is 3, so {E} is a frequent itemset. Recursively apply FP-growth on P. (Figure: the E-conditional tree, with root children A:2 and B:1.)
FP-growth: Conditional Tree for D within the Conditional Tree for E
Conditional pattern base for D within the conditional base for E: P = {(A:1, C:1, D:1), (A:1, D:1)}. The count for D is 2, so {D, E} is a frequent itemset. Recursively apply FP-growth on P. (Figure: the corresponding conditional tree.)
FP-growth: Conditional Tree for C within D within E
Conditional pattern base for C within D within E: P = {(A:1, C:1)}. The count for C is 1, so {C, D, E} is NOT a frequent itemset. (Figure: the corresponding conditional tree.)
FP-growth: Conditional Tree for A within D within E
The count for A is 2, so {A, D, E} is a frequent itemset. Next step: construct the conditional tree for C within the conditional tree for E, and continue until exploring the conditional tree for A (which has only node A). (Figure: the corresponding conditional tree, the single node A:2.)
Iceberg Queries
Iceberg query: compute aggregates over one or a set of attributes, keeping only those whose aggregate value is above a certain threshold. Example:
select P.custID, P.itemID, sum(P.qty)
from purchase P
group by P.custID, P.itemID
having sum(P.qty) >= 10
Iceberg queries can be computed efficiently with an Apriori-style strategy: first compute the lower dimensions, then compute higher dimensions only when all the lower ones are above the threshold. A small sketch of this pruning follows.
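A sketch of the Apriori-style evaluation in plain Python, assuming the purchase table is a list of (custID, itemID, qty) tuples with non-negative quantities; the single-attribute totals act like frequent 1-itemsets and prune which pairs are aggregated.

```python
from collections import defaultdict

def iceberg_sum(purchases, threshold=10):
    """Apriori-style evaluation of the iceberg query above.
    Assumes qty >= 0, so a pair can reach the threshold only if both of
    its single-attribute totals do."""
    cust_total, item_total = defaultdict(int), defaultdict(int)
    for cust, item, qty in purchases:
        cust_total[cust] += qty
        item_total[item] += qty
    pair_total = defaultdict(int)
    for cust, item, qty in purchases:
        if cust_total[cust] >= threshold and item_total[item] >= threshold:
            pair_total[(cust, item)] += qty
    return {k: v for k, v in pair_total.items() if v >= threshold}
```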
Multiple-Level Association Rules
(Concept hierarchy figure: Food at the top; below it categories such as milk and bread; below those, lower-level items such as skim and 2% milk, brands such as Amul, and white and wheat bread.) Items often form a hierarchy. Items at the lower level are expected to have lower support. Rules regarding itemsets at appropriate levels could be quite useful. The transaction database can be encoded based on dimensions and levels, and we can explore shared multi-level mining.
Mining Multi-Level Associations
A top-down, progressive deepening approach: first find high-level strong rules, e.g. milk → bread [20%, 60%]; then find their lower-level "weaker" rules, e.g. 2% milk → wheat bread [6%, 50%]. Variations in mining multiple-level association rules: level-crossed association rules, e.g. 2% milk → Wonder wheat bread; association rules with multiple, alternative hierarchies, e.g. 2% milk → Wonder bread.
Multi-Level Association: Uniform Support vs. Reduced Support
Uniform support: the same minimum support for all levels. Pro: only one minimum support threshold, and there is no need to examine itemsets containing any item whose ancestors do not have minimum support. Con: lower-level items do not occur as frequently, so if the support threshold is too high we miss low-level associations, and if it is too low we generate too many high-level associations. Reduced support: a reduced minimum support at lower levels.
Multi-Level Association: Redundancy Filtering
Some rules may be redundant due to "ancestor" relationships between items. Example: milk → wheat bread [support = 8%, confidence = 70%] and 2% milk → wheat bread [support = 2%, confidence = 72%]. We say the first rule is an ancestor of the second rule. A rule is redundant if its support is close to the "expected" value based on the rule's ancestor.
Multi-Dimensional Association: Concepts
Single-dimensional rules: buys(X, "milk") → buys(X, "bread"). Multi-dimensional rules involve two or more dimensions or predicates. Inter-dimension association rules (no repeated predicates): age(X, "19-25") ∧ occupation(X, "student") → buys(X, "coke"). Hybrid-dimension association rules (repeated predicates): age(X, "19-25") ∧ buys(X, "popcorn") → buys(X, "coke"). Categorical attributes have a finite number of possible values, with no ordering among values; quantitative attributes are numeric, with an implicit ordering among values.
Techniques for Mining Multi-Dimensional Associations
Search for frequent k-predicate sets; for example, {age, occupation, buys} is a 3-predicate set. Techniques can be categorized by how quantitative attributes such as age are treated. (1) Static discretization of quantitative attributes: quantitative attributes are statically discretized using predefined concept hierarchies. (2) Quantitative association rules: quantitative attributes are dynamically discretized into "bins" based on the distribution of the data. (3) Distance-based association rules: a dynamic discretization process that considers the distance between data points.
Interestingness Measurements
Objective measures: two popular measurements are support and confidence. Subjective measures: a rule (pattern) is interesting if it is unexpected (surprising to the user) and/or actionable (the user can do something with it).
Criticism of Support and Confidence
Example 1: among 5000 students, 3000 play basketball, 3750 eat cereal, and 2000 both play basketball and eat cereal. The rule play basketball → eat cereal [40%, 66.7%] is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%. The rule play basketball → not eat cereal [20%, 33.3%] is far more accurate, although it has lower support and confidence.
Criticism of Support and Confidence (cont.)
Example 2: X and Y are positively correlated, while X and Z are negatively related, yet the support and confidence of X => Z dominate. We need a measure of dependent or correlated events: P(B|A)/P(B), also called the lift of the rule A => B.
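A quick numeric check of Example 1 above using lift (the numbers are taken from that slide):

```python
# lift(A => B) = P(B|A) / P(B) for the basketball/cereal example
n = 5000
basketball, cereal, both = 3000, 3750, 2000

conf = both / basketball               # P(cereal | basketball) ~ 0.667
lift = conf / (cereal / n)             # 0.667 / 0.75 ~ 0.89, i.e. below 1
print(round(conf, 3), round(lift, 3))  # a lift below 1 flags the misleading rule
```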