CSci 8980: Data Mining (Fall 2002)
Vipin Kumar
Army High Performance Computing Research Center
Department of Computer Science, University of Minnesota
http://www.cs.umn.edu/~kumar
© Vipin Kumar, CSci 8980 Fall 2002
Rule Generation

- Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement
  - If {A,B,C,D} is a frequent itemset, the candidate rules are: ABC → D, ABD → C, ACD → B, BCD → A, A → BCD, B → ACD, C → ABD, D → ABC, AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB
- If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L)
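The candidate-rule enumeration above can be sketched in a few lines. This is an illustrative helper (the function name `candidate_rules` is my own, not from the slides): every non-empty proper subset of L becomes an antecedent, and the remaining items form the consequent.

```python
from itertools import combinations

def candidate_rules(itemset):
    """Enumerate all candidate rules f -> (L - f) for a frequent itemset L,
    skipping the empty antecedent and the empty consequent."""
    L = set(itemset)
    rules = []
    for r in range(1, len(L)):                      # antecedent sizes 1 .. |L|-1
        for antecedent in combinations(sorted(L), r):
            consequent = L - set(antecedent)
            rules.append((set(antecedent), consequent))
    return rules

rules = candidate_rules({"A", "B", "C", "D"})
print(len(rules))  # 2^4 - 2 = 14 candidate rules
```

For |L| = 4 this yields exactly the 14 rules listed on the slide.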
Rule Generation

- How to efficiently generate rules from frequent itemsets?
  - In general, confidence does not have an anti-monotone property
  - But the confidence of rules generated from the same itemset does have an anti-monotone property
  - For L = {A,B,C,D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
- Confidence is non-increasing as the number of items in the rule consequent increases
Rule Generation for Apriori Algorithm

[Figure: lattice of rules. The lattice corresponds to a partial order on the items in the rule consequent.]
Rule Generation for Apriori Algorithm

- A candidate rule is generated by merging two rules that share the same prefix in the rule consequent
- join(CD → AB, BD → AC) would produce the candidate rule D → ABC
- Prune rule D → ABC if its subset AD → BC does not have high confidence
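The join step can be sketched as follows. This is a minimal illustration of the merge rule described above, not the course's actual code; rules are represented as (antecedent, consequent) pairs of sorted tuples, and the `join` helper is hypothetical.

```python
def join(rule1, rule2):
    """Merge two rules whose consequents share the same (k-1)-item prefix,
    producing a candidate rule with a consequent one item larger.
    Returns None if the consequent prefixes do not match."""
    ant1, con1 = rule1
    ant2, con2 = rule2
    if con1[:-1] != con2[:-1]:                      # consequent prefixes must match
        return None
    new_con = tuple(sorted(set(con1) | set(con2)))  # union of consequents
    new_ant = tuple(sorted(set(ant1) & set(ant2)))  # remaining items
    return (new_ant, new_con)

# join(CD => AB, BD => AC) yields D => ABC
print(join((("C", "D"), ("A", "B")), (("B", "D"), ("A", "C"))))
# (('D',), ('A', 'B', 'C'))
```

After the join, the candidate D → ABC would be pruned (as the slide notes) if any rule with a one-item-smaller consequent, such as AD → BC, falls below the confidence threshold.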
Other Frequent Itemset Algorithms

- Traversal of the itemset lattice
  - Apriori uses breadth-first (level-wise) traversal
- Representation of the database
  - Apriori uses a horizontal data layout
- Generate-and-count paradigm
FP-growth (Pei et al. 2000)

- Uses a compressed representation of the database in the form of an FP-tree
- Once an FP-tree has been constructed, a recursive divide-and-conquer approach mines the frequent itemsets
FP-tree Construction

[Figure: FP-tree after reading TID=1 (path null → A:1 → B:1) and after reading TID=2, which adds a second branch from the null root; each transaction extends or shares a prefix path, incrementing node counts.]
FP-tree Construction

[Figure: completed FP-tree built from the transaction database, with node counts such as A:7, B:5, and C:3 recording how many transactions share each prefix path. A header table with pointers links all nodes holding the same item; these pointers are used to assist frequent itemset generation.]
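The construction pictured above can be sketched compactly. This is an illustrative skeleton under my own naming (`FPNode`, `build_fp_tree`), assuming transactions are already sorted in decreasing frequency order as FP-growth requires; it builds the tree and the header table but omits the mining phase.

```python
class FPNode:
    """One node of an FP-tree: an item, a count, a parent link, and children."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions):
    """Insert each transaction as a path from the null root, incrementing
    counts on shared prefixes. Returns the root and a header table mapping
    each item to the list of tree nodes that hold it."""
    root = FPNode(None, None)
    header = {}
    for t in transactions:
        node = root
        for item in t:
            if item not in node.children:           # new branch needed
                child = FPNode(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            node = node.children[item]
            node.count += 1                         # shared prefix: just count
    return root, header

root, header = build_fp_tree([["A", "B"], ["B", "C", "D"], ["A", "C", "D", "E"]])
print(root.children["A"].count)  # 2: two transactions begin with A
```

The header table plays the role of the slide's pointers: following `header["D"]` visits every D node, from which parent links recover D's conditional pattern base.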
FP-growth

[Figure: FP-tree restricted to paths ending in D.]
- Conditional pattern base for D: P = {(A:1,B:1,C:1), (A:1,B:1), (A:1,C:1), (A:1), (B:1,C:1)}
- Recursively apply FP-growth on P
- Frequent itemsets found (with support > 1): AD, BD, CD, ACD, BCD
Tree Projection (Agrawal et al. 1999)

[Figure: set enumeration tree.]
- Possible extensions: E(A) = {B,C,D,E}
- Possible extensions: E(ABC) = {D,E}
Tree Projection

- Items are listed in lexicographic order
- Each node P stores the following information:
  - The itemset for node P
  - The list of possible lexicographic extensions of P: E(P)
  - A pointer to the projected database of its ancestor node
  - A bitvector recording which transactions in the projected database contain the itemset
Projected Database

[Figure: original database alongside the projected database for node A.]
- For each transaction T, the projected transaction at node A is T ∩ E(A)
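The projection T ∩ E(A) can be sketched as below. The helper name `project` and the sample database are my own; I also assume (beyond the slide's formula) that transactions not containing the node's itemset are dropped, since they contribute nothing to counts at that node.

```python
def project(transactions, node_itemset, extensions):
    """Projected transaction at node P is T ∩ E(P). Transactions that do not
    contain P's itemset are dropped (assumption: they add no information here)."""
    P, E = set(node_itemset), set(extensions)
    return [sorted(set(t) & E) for t in transactions if P <= set(t)]

db = [["A", "B", "C"], ["B", "D"], ["A", "C", "E"]]
# Projected database for node A, with E(A) = {B, C, D, E}
print(project(db, {"A"}, {"B", "C", "D", "E"}))  # [['B', 'C'], ['C', 'E']]
```

Each node's projected database shrinks as the tree deepens, which is what makes counting at deep nodes cheap.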
ECLAT (Zaki 2000)

- For each item, store a list of transaction ids (tids): the TID-list
ECLAT

- Determine the support of any k-itemset by intersecting the tid-lists of two of its (k−1)-subsets
- Three traversal approaches: top-down, bottom-up, and hybrid
- Advantage: very fast support counting
- Disadvantage: intermediate tid-lists may become too large for memory
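The tid-list intersection idea can be sketched as follows. The `support` helper and the sample tid-lists are illustrative; for simplicity it intersects the individual items' tid-lists directly, whereas ECLAT proper would intersect the cached tid-lists of two (k−1)-subsets to get the same result.

```python
def support(tidlists, itemset):
    """Support of an itemset = size of the intersection of its items' tid-lists."""
    tids = set.intersection(*(tidlists[item] for item in itemset))
    return len(tids)

# Vertical layout: for each item, the set of transaction ids containing it
tidlists = {
    "A": {1, 4, 5, 6, 7, 8, 9},
    "B": {1, 2, 5, 7, 8, 10},
    "C": {2, 3, 4, 8, 9},
}
print(support(tidlists, ["A", "B"]))  # |{1, 5, 7, 8}| = 4
```

Note that the intermediate tid-list for {A, B} ({1, 5, 7, 8}) would itself be cached and intersected further in a real implementation, which is exactly where the memory cost mentioned above comes from.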
Interestingness Measures

- Association rule algorithms tend to produce too many rules
  - Many of them are uninteresting or redundant
  - A rule is redundant if, for example, {A,B,C} → {D} and {A,B} → {D} have the same support and confidence
- Interestingness measures can be used to prune or rank the derived patterns
- In the original formulation of association rules, support and confidence are the only measures used
Application of Interestingness Measures

[Figure: where interestingness measures fit in the pattern-mining pipeline.]
Computing Interestingness Measures

- Given a rule X → Y, the information needed to compute rule interestingness can be obtained from a contingency table

Contingency table for X → Y:

          Y      ¬Y
  X      f11    f10    f1+
  ¬X     f01    f00    f0+
         f+1    f+0    |T|

- f11: support of X and Y
- f10: support of X and ¬Y
- f01: support of ¬X and Y
- f00: support of ¬X and ¬Y
- Various measures can be applied: support, confidence, lift, Gini, J-measure, etc.
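A few of the measures named above follow directly from the four cell counts. This is a minimal sketch with my own helper name (`measures`) and made-up counts; support, confidence, and lift are standard definitions, and the other measures (Gini, J-measure) are omitted.

```python
def measures(f11, f10, f01, f00):
    """Support, confidence, and lift for X -> Y from contingency-table counts."""
    n = f11 + f10 + f01 + f00          # |T|
    support = f11 / n                  # P(X and Y)
    confidence = f11 / (f11 + f10)     # P(Y | X)
    p_y = (f11 + f01) / n              # P(Y)
    lift = confidence / p_y            # > 1 suggests positive correlation
    return support, confidence, lift

s, c, l = measures(60, 10, 20, 10)     # hypothetical counts, n = 100
print(round(c, 3), round(l, 3))
```

Lift compares P(Y|X) against the baseline P(Y), which is exactly the correction that confidence alone lacks, as the next slide illustrates.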
Drawback of Confidence

         Coffee   ¬Coffee
  Tea       15         5     20
  ¬Tea      75         5     80
            90        10    100

Association rule: Tea → Coffee
Confidence = P(Coffee|Tea) = 15/20 = 0.75, but P(Coffee) = 0.9
Although the confidence is high, the rule is misleading: P(Coffee|¬Tea) = 75/80 = 0.9375
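The numbers in the table above can be checked directly; the short script below (variable names are mine) also computes the lift, which the slide implies but does not state: since lift < 1, tea drinkers are actually less likely than average to drink coffee.

```python
# Counts from the Tea/Coffee contingency table above
tea_coffee, tea_no_coffee = 15, 5
no_tea_coffee, no_tea_no_coffee = 75, 5
n = tea_coffee + tea_no_coffee + no_tea_coffee + no_tea_no_coffee  # 100

confidence = tea_coffee / (tea_coffee + tea_no_coffee)   # P(Coffee | Tea)
p_coffee = (tea_coffee + no_tea_coffee) / n              # P(Coffee)
lift = confidence / p_coffee

print(confidence)        # 0.75
print(p_coffee)          # 0.9
print(round(lift, 3))    # 0.833 -> below 1: Tea and Coffee are negatively associated
```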
Other Measures

[Figure: table of additional interestingness measures.]