1
Frequent Pattern and Association Analysis (based on the slides for the book Data Mining: Concepts and Techniques)
2
Frequent Pattern and Association Analysis (outline)
Basic concepts
Scalable mining methods
Mining a variety of rules and interesting patterns
Constraint-based mining
Mining sequential and structured patterns
Extensions and applications
3
Frequent Pattern and Association Analysis (outline; this part: Basic concepts)
Basic concepts
Scalable mining methods
Mining a variety of rules and interesting patterns
Constraint-based mining
Mining sequential and structured patterns
Extensions and applications
4
What Is Frequent Pattern Analysis?
Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set.
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of association rule mining.
5
Motivation
Finding inherent regularities in data:
What products were often purchased together? (Beer and diapers?!)
What are the subsequent purchases after buying a PC?
What kinds of DNA are sensitive to this new drug?
Can we automatically classify web documents?
Examples: basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, DNA sequence analysis, etc.
6
Basic Concepts (1)
Item: Boolean variable representing its presence or absence.
Basket: Boolean vector of variables, analyzed to discover patterns of items that are frequently associated (or bought) together.
Association rules: association patterns X => Y.
7
Basic Concepts (2)
Itemset: X = {x_1, ..., x_k}.
Goal: find all rules X => Y with minimum support and confidence.
Support s: probability that a transaction contains X ∪ Y. Support = # tuples containing X ∪ Y / # total tuples = P(X ∪ Y).
Confidence c: conditional probability that a transaction containing X also contains Y. Confidence = # tuples containing X ∪ Y / # tuples containing X = P(Y|X).
[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]
Transaction-id   Items bought
10               A, B, D
20               A, C, D
30               A, D, E
40               B, E, F
50               B, C, D, E, F
8
Basic Concepts (3)
Interesting (or strong) rules: satisfy both a minimum support threshold and a minimum confidence threshold.
Let min_sup = 50%, min_conf = 50%.
Transaction-id   Items bought
10               A, B, D
20               A, C, D
30               A, D, E
40               B, E, F
50               B, C, D, E, F
Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}
Association rules:
A => D (60%, 100%): 60% of all analyzed transactions show that A and D are bought together; 100% of customers that bought A also bought D.
D => A (60%, 75%)
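These definitions translate directly into code. The following minimal Python sketch (not from the slides; the function names are illustrative) recomputes the support and confidence figures for A => D and D => A on the table above:

transactions = {
    10: {"A", "B", "D"},
    20: {"A", "C", "D"},
    30: {"A", "D", "E"},
    40: {"B", "E", "F"},
    50: {"B", "C", "D", "E", "F"},
}

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(lhs, rhs):
    """P(rhs | lhs) = support(lhs U rhs) / support(lhs)."""
    return support(lhs | rhs) / support(lhs)

print(support({"A", "D"}))        # 0.6  -> A => D has 60% support
print(confidence({"A"}, {"D"}))   # 1.0  -> A => D has 100% confidence
print(confidence({"D"}, {"A"}))   # 0.75 -> D => A has 75% confidence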
9
Basic Concepts (4)
Itemset (or pattern): set of items (a k-itemset has k items).
Occurrence frequency of an itemset: # of transactions containing the itemset; also known as the frequency, support count, or count of the itemset.
Frequent itemset: an itemset satisfying minimum support, i.e., frequency >= min_sup × total # of transactions.
Confidence(A => B) = P(B|A) = frequency(A ∪ B) / frequency(A).
The problem of mining association rules thus reduces to the problem of mining frequent itemsets.
10
Mining frequent itemsets
A long pattern contains a combinatorial number of sub-patterns. Solution: mine closed patterns and max-patterns.
An itemset X is closed frequent if X is frequent and there exists no super-itemset Y ⊃ X with the same support as X.
An itemset X is a max-itemset if X is frequent and there exists no frequent super-itemset Y ⊃ X.
Closed itemsets are a lossless compression of frequent patterns: they reduce the # of patterns and rules.
11
Example
DB = {⟨a1, ..., a100⟩, ⟨a1, ..., a50⟩}, min_sup = 1.
What is the set of closed itemsets? ⟨a1, ..., a100⟩: 1 and ⟨a1, ..., a50⟩: 2.
What is the set of max-patterns? ⟨a1, ..., a100⟩: 1.
What is the set of all patterns? All 2^100 − 1 nonempty sub-itemsets of ⟨a1, ..., a100⟩: far too many to enumerate!
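For a concrete view of the three notions, here is a brute-force Python sketch (illustrative only, not from the slides; it runs on a tiny analogue DB = {⟨a1, a2, a3⟩, ⟨a1, a2⟩}, since enumerating the 2^100 patterns above is impossible):

from itertools import combinations

db = [frozenset({"a1", "a2", "a3"}), frozenset({"a1", "a2"})]
min_sup = 1

def sup(itemset):
    return sum(1 for t in db if itemset <= t)

items = set().union(*db)
# All frequent itemsets, by exhaustively checking the powerset.
frequent = {
    frozenset(c): sup(frozenset(c))
    for k in range(1, len(items) + 1)
    for c in combinations(sorted(items), k)
    if sup(frozenset(c)) >= min_sup
}

# Closed: no proper superset with the same support.
closed = {x for x in frequent
          if not any(x < y and frequent[y] == frequent[x] for y in frequent)}
# Maximal: no frequent proper superset at all.
maximal = {x for x in frequent if not any(x < y for y in frequent)}

print(len(frequent))  # 7 = 2^3 - 1 frequent itemsets
print(closed)         # {a1,a2,a3} (sup 1) and {a1,a2} (sup 2)
print(maximal)        # {a1,a2,a3} only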
12
Frequent Pattern and Association Analysis (outline; this part: Scalable mining methods)
Basic concepts
Scalable mining methods
Mining a variety of rules and interesting patterns
Constraint-based mining
Mining sequential and structured patterns
Extensions and applications
13
Scalable Methods for Mining Frequent Patterns
Downward closure (or Apriori) property of frequent patterns: any subset of a frequent itemset must be frequent. If {beer, diaper, nuts} is frequent, so is {beer, diaper}, i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}.
Scalable mining methods: three major approaches
Apriori (Agrawal & Srikant @VLDB'94)
Frequent-pattern growth (FPgrowth, by Han, Pei & Yin @SIGMOD'00)
Vertical data format approach (Charm, by Zaki & Hsiao @SDM'02)
14
Apriori: A Candidate Generation-and-Test Approach
Method:
Initially, scan the DB once to get the frequent 1-itemsets.
Generate length-(k+1) candidate itemsets from length-k frequent itemsets.
Test the candidates against the DB.
Terminate when no frequent or candidate set can be generated.
Apriori pruning principle: if there is any itemset that is infrequent, its supersets should not be generated/tested!
15
How to Generate Candidates?
Suppose the items in L_{k-1} are listed in an order.
Step 1: self-joining L_{k-1}:
insert into C_k
select p.item_1, p.item_2, ..., p.item_{k-1}, q.item_{k-1}
from L_{k-1} p, L_{k-1} q
where p.item_1 = q.item_1, ..., p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1}
Step 2: pruning:
forall itemsets c in C_k do
    forall (k-1)-subsets s of c do
        if (s is not in L_{k-1}) then delete c from C_k
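A Python rendering of this two-step procedure (a sketch; itemsets are represented as sorted tuples, and the name apriori_gen is taken from the original paper's terminology rather than from these slides):

from itertools import combinations

def apriori_gen(L_prev):
    """Generate C_k from the frequent (k-1)-itemsets in L_prev."""
    L_prev = sorted(L_prev)
    k_minus_1 = len(L_prev[0])
    # Step 1: self-join -- combine pairs agreeing on the first k-2 items.
    candidates = set()
    for i, p in enumerate(L_prev):
        for q in L_prev[i + 1:]:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                candidates.add(p + (q[-1],))
    # Step 2: prune -- drop candidates with an infrequent (k-1)-subset.
    L_set = set(L_prev)
    return {c for c in candidates
            if all(s in L_set for s in combinations(c, k_minus_1))}

L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")]
print(apriori_gen(L3))  # {('a','b','c','d')} -- acde pruned since ade not in L3

The final print reproduces the L3 example from the "Important Details of Apriori" slide below: abcd survives, while acde is pruned because ade is not in L3.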
16
Example (min_sup = 2)
Database TDB:
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E
1st scan, C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1: {A}:2, {B}:3, {C}:3, {E}:3
C2 (from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan, C2 counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2
C3: {B,C,E}
3rd scan, L3: {B,C,E}:2
17
The Apriori Algorithm
Pseudo-code (C_k: candidate itemsets of size k; L_k: frequent itemsets of size k):
L_1 = {frequent items};
for (k = 1; L_k != ∅; k++) do begin
    C_{k+1} = candidates generated from L_k;
    for each transaction t in database do
        increment the count of all candidates in C_{k+1} that are contained in t
    L_{k+1} = candidates in C_{k+1} with min_support
end
return ∪_k L_k;
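Putting the pieces together, here is a runnable sketch of the pseudo-code (again illustrative, not the slides' own code; it reuses the apriori_gen sketch above and reproduces the min_sup = 2 trace on TDB from the example slide):

def apriori(db, min_sup):
    """db: list of item collections; returns {itemset (tuple): support count}."""
    # L1: frequent 1-itemsets.
    counts = {}
    for t in db:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    L = {c: n for c, n in counts.items() if n >= min_sup}
    all_frequent = dict(L)
    while L:
        candidates = apriori_gen(list(L))      # C_{k+1} from L_k
        counts = {c: 0 for c in candidates}
        for t in db:                           # one DB scan per level
            t = set(t)
            for c in candidates:
                if set(c) <= t:
                    counts[c] += 1
        L = {c: n for c, n in counts.items() if n >= min_sup}
        all_frequent.update(L)
    return all_frequent

tdb = [{"A","C","D"}, {"B","C","E"}, {"A","B","C","E"}, {"B","E"}]
print(apriori(tdb, 2))
# Contains (ordering may vary): ('A',):2, ('B',):3, ('C',):3, ('E',):3,
# ('A','C'):2, ('B','C'):2, ('B','E'):3, ('C','E'):2, ('B','C','E'):2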
18
Important Details of Apriori
How to generate candidates? Step 1: self-joining L_k; Step 2: pruning.
Example of candidate generation:
L_3 = {abc, abd, acd, ace, bcd}
Self-joining L_3 * L_3: abcd from abc and abd; acde from acd and ace
Pruning: acde is removed because ade is not in L_3
C_4 = {abcd}
19
Another example (1)
20
Another example (2)
21
Generation of association rules from frequent itemsets
Confidence(A => B) = P(B|A) = support_count(A ∪ B) / support_count(A), where support_count(A ∪ B) is the number of transactions containing the itemset A ∪ B and support_count(A) is the number of transactions containing the itemset A.
Association rules can be generated as follows:
For each frequent itemset l, generate all nonempty subsets of l.
For every nonempty subset s of l, output the rule s => (l − s) if support_count(l) / support_count(s) >= min_conf.
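A Python sketch of this procedure (illustrative; `frequent` is assumed to be the {itemset: support count} map returned by the Apriori sketch earlier):

from itertools import combinations

def gen_rules(frequent, min_conf):
    rules = []
    for l, sup_l in frequent.items():
        for r in range(1, len(l)):                 # nonempty proper subsets
            for s in combinations(l, r):
                conf = sup_l / frequent[s]         # every subset of l is frequent
                if conf >= min_conf:
                    rules.append((set(s), set(l) - set(s), conf))
    return rules

for lhs, rhs, conf in gen_rules(apriori(tdb, 2), min_conf=0.7):
    print(lhs, "=>", rhs, f"conf={conf:.2f}")
# e.g. {'A'} => {'C'} conf=1.00, {'B','C'} => {'E'} conf=1.00, ...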
22
How to Count Supports of Candidates?
Why is counting supports of candidates a problem? The total number of candidates can be huge, and one transaction may contain many candidates.
Method based on hashing: candidate itemsets are stored in a hash-tree. A leaf node of the hash-tree contains a list of itemsets and counts; an interior node contains a hash table. The subset function finds all the candidates contained in a transaction.
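The hash-tree itself takes some machinery; the Python sketch below keeps only the subset-function idea in simplified form (store candidates in a hash table and enumerate each transaction's k-subsets to probe it), so it is a stand-in for, not an implementation of, the hash-tree. It reuses tdb from the Apriori sketch above:

from itertools import combinations

def count_supports(db, candidates):
    """candidates: set of sorted k-tuples; returns {candidate: support count}."""
    k = len(next(iter(candidates)))
    counts = {c: 0 for c in candidates}
    for t in db:
        # Subset function: probe only the k-subsets of t, never the
        # candidates that t cannot possibly contain.
        for s in combinations(sorted(t), k):
            if s in counts:
                counts[s] += 1
    return counts

print(count_supports(tdb, {("B", "C"), ("B", "E"), ("C", "E")}))
# {('B','C'): 2, ('B','E'): 3, ('C','E'): 2}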
23
Challenges of Frequent Pattern Mining
Challenges: multiple scans of the transaction database; huge number of candidates; tedious workload of support counting for candidates.
Improving Apriori: general ideas include partitioning, sampling, and others.
24
Partition: Scan Database Only Twice
Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.
Scan 1: partition the database and find local frequent patterns.
Scan 2: consolidate global frequent patterns.
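A two-scan Python sketch of this idea (illustrative; it reuses the apriori sketch above as the local miner, and the local threshold is the global support fraction scaled to the partition size, which is what makes the property stated above hold):

import math

def partition_mine(db, min_sup_frac, n_parts=2):
    size = math.ceil(len(db) / n_parts)
    parts = [db[i:i + size] for i in range(0, len(db), size)]
    # Scan 1: collect the local frequent itemsets of each partition.
    global_candidates = set()
    for part in parts:
        local = apriori(part, max(1, math.ceil(min_sup_frac * len(part))))
        global_candidates.update(local)
    # Scan 2: count all candidates over the whole database.
    counts = {c: sum(1 for t in db if set(c) <= set(t))
              for c in global_candidates}
    return {c: n for c, n in counts.items() if n >= min_sup_frac * len(db)}

print(partition_mine(tdb, 0.5))  # same frequent itemsets as apriori(tdb, 2)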
25
Sampling for Frequent Patterns
Select a sample of the original database and mine frequent patterns within the sample using Apriori, with a lower support threshold to find the sample's frequent itemsets.
Scan the database once to verify the frequent itemsets found in the sample.
Scan the database again to find missed frequent patterns.
Trade-off: accuracy vs. efficiency.
26
Bottleneck of Frequent-Pattern Mining
Multiple database scans are costly, and mining long patterns needs many passes of scanning and generates lots of candidates. To find the frequent itemset i_1 i_2 ... i_100:
# of scans: 100
# of candidates: C(100,1) + C(100,2) + ... + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30!
Bottleneck: candidate generation and test. Can we avoid candidate generation? Yes!
27
Visualization of Association Rules: Plane Graph
28
Visualization of Association Rules: Rule Graph
29
Visualization of Association Rules (SGI/MineSet 3.0)
30
Frequent Pattern and Association Analysis (outline; this part: Mining a variety of rules and interesting patterns)
Basic concepts
Scalable mining methods
Mining a variety of rules and interesting patterns
Constraint-based mining
Mining sequential and structured patterns
Extensions and applications
31
Mining Various Kinds of Association Rules
Mining multi-level association
Mining multi-dimensional association
Mining quantitative association
Mining interesting correlation patterns
32
Example
33
Mining Multiple-Level Association Rules
Items often form a hierarchy, and items at the lower level are expected to have lower support, so flexible support settings are needed.
Example: Milk [support = 10%]; 2% Milk [support = 6%]; Skim Milk [support = 4%]
Uniform support: Level 1 min_sup = 5%, Level 2 min_sup = 5% (Skim Milk, at 4%, is missed)
Reduced support: Level 1 min_sup = 5%, Level 2 min_sup = 3% (both sub-items are kept)
34
Multi-level Association: Redundancy Filtering
Some rules may be redundant due to "ancestor" relationships between items. Example:
R1: milk => wheat bread [support = 8%, confidence = 70%]
R2: 2% milk => wheat bread [support = 2%, confidence = 72%]
R1 is an ancestor of R2: R1 can be obtained from R2 by replacing items with their ancestors in the concept hierarchy.
R2 is redundant if its support is close to the "expected" value based on the rule's ancestor: for instance, if 2% milk accounted for about a quarter of all milk purchases, R2's expected support would be 8% × 1/4 = 2%, exactly what is observed, so R2 would add no new information.
35
Mining Multi-Dimensional Association
Single-dimensional rules: buys(X, "milk") => buys(X, "bread")
Multi-dimensional rules: two or more dimensions or predicates.
Inter-dimension association rules (no repeated predicates): age(X, "19-25") ∧ occupation(X, "student") => buys(X, "coke")
Hybrid-dimension association rules (repeated predicates): age(X, "19-25") ∧ buys(X, "popcorn") => buys(X, "coke")
36
Recall: categorical vs. quantitative attributes
Categorical (or nominal) attributes: finite number of possible values, no ordering among values.
Quantitative attributes: numeric, implicit ordering among values.
37
Mining Quantitative Associations
Techniques can be categorized by how numerical attributes, such as age or salary, are treated:
Static discretization based on predefined concept hierarchies (data cube methods)
Dynamic discretization based on data distribution (quantitative rules, e.g., Agrawal & Srikant @SIGMOD'96)
Clustering: distance-based association (e.g., Yang & Miller @SIGMOD'97): one-dimensional clustering, then association
Deviation (such as Aumann & Lindell @KDD'99): Sex = female => Wage: mean = $7/hr (overall mean = $9)
38
Static Discretization of Quantitative Attributes (1)
Attributes are discretized prior to mining using a concept hierarchy; numeric values are replaced by ranges.
Frequent itemset mining algorithms must be modified so that frequent predicate sets are searched: instead of searching only one attribute (e.g., buys), search through all relevant attributes (e.g., age, occupation, buys) and treat each attribute-value pair as an itemset.
39
Static Discretization of Quantitative Attributes (2)
A data cube is well suited for mining and may already exist. The cells of an n-dimensional cuboid correspond to the predicate sets and can be used to store the support counts. Mining from data cubes can be much faster.
[Figure: lattice of cuboids over (age, income, buys): (), (age), (income), (buys), (age, income), (age, buys), (income, buys), (age, income, buys)]
40
Mining Various Kinds of Association Rules (this part: Mining interesting correlation patterns)
Mining multi-level association
Mining multi-dimensional association
Mining quantitative association
Mining interesting correlation patterns
41
Mining interesting correlation patterns
Strong association rules (with high support and confidence) can be uninteresting. Example:
play basketball => eat cereal [40%, 66.7%] is misleading: the overall percentage of students eating cereal is 75%, which is higher than 66.7%.
play basketball => not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence.

             Basketball   Not basketball   Sum (row)
Cereal       2000         1750             3750
Not cereal   1000         250              1250
Sum (col.)   3000         2000             5000
42
Are All the Rules Found Interesting?
The confidence of a rule only estimates the conditional probability of item B given item A; it doesn't measure the real correlation between A and B.
Another example: buy walnuts => buy milk [1%, 80%] is misleading if 85% of customers buy milk.
Support and confidence are not good measures of correlation. Many other interestingness measures exist (Tan, Kumar & Srivastava @KDD'02).
43
Are All the Rules Found Interesting?
Correlation rule: A => B [support, confidence, correlation measure].
A measure of dependent/correlated events: the correlation coefficient, or lift: lift = P(A ∪ B) / (P(A) P(B)).
The occurrence of A is independent of the occurrence of B if P(A ∪ B) = P(A) P(B), i.e., lift = 1.
If lift > 1, A and B are positively correlated; if lift < 1, they are negatively correlated.
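As a quick check against the basketball/cereal table above (pure arithmetic, using only the numbers from that table):

\mathrm{lift}(\text{basketball}, \text{cereal})
  = \frac{P(\text{basketball} \cup \text{cereal})}{P(\text{basketball})\,P(\text{cereal})}
  = \frac{2000/5000}{(3000/5000) \times (3750/5000)}
  = \frac{0.40}{0.45} \approx 0.89 < 1

so playing basketball and eating cereal are negatively correlated, consistent with the earlier discussion.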
44
Lift: example for Coffee => Milk

             Milk     No Milk   Sum (row)
Coffee       m, c     ~m, c     c
No Coffee    m, ~c    ~m, ~c    ~c
Sum (col.)   m        ~m

DB    m, c    ~m, c   m, ~c    ~m, ~c    lift
A1    1000    100     100      10,000    9.26
A2    100     1000    1000     100,000   8.44
A3    1000    100     10,000   100,000   9.18
A4    1000    1000    1000     1000      1
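A small Python sketch (illustrative) that recomputes the lift column from the four contingency counts, confirming the table:

datasets = {
    "A1": (1000, 100, 100, 10_000),
    "A2": (100, 1000, 1000, 100_000),
    "A3": (1000, 100, 10_000, 100_000),
    "A4": (1000, 1000, 1000, 1000),
}

for name, (mc, nmc, mnc, nmnc) in datasets.items():
    total = mc + nmc + mnc + nmnc
    p_mc = mc / total            # P(m and c)
    p_m = (mc + mnc) / total     # P(m)
    p_c = (mc + nmc) / total     # P(c)
    print(name, round(p_mc / (p_m * p_c), 2))
# A1 9.26, A2 8.44, A3 9.18, A4 1.0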
45
Mining Highly Correlated Patterns
Lift and χ² are not good measures for correlations in transactional DBs.
all-confidence or coherence could be good measures (Omiecinski @TKDE'03).
Both all-confidence and coherence have the downward closure property, so efficient algorithms can be derived for mining (Lee et al. @ICDM'03sub).
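The slides name the measures without defining them; the definitions usually meant (following Han & Kamber's presentation, so treat the exact variants as an assumption) are:

\mathrm{all\_conf}(X) = \frac{\sup(X)}{\max\{\sup(\{i\}) : i \in X\}}
\qquad
\mathrm{coherence}(A, B) = \frac{\sup(A \cup B)}{\sup(A) + \sup(B) - \sup(A \cup B)}

Both are bounded in [0, 1] and, unlike lift, are unaffected by the number of transactions that contain neither itemset, which is what makes them better suited to sparse transactional data.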
46
Bibliography
(Book) Data Mining: Concepts and Techniques, J. Han & M. Kamber, Morgan Kaufmann, 2001 (Chapter 6 of the 2001 book; Chapter 4 of the draft)