Frequent Pattern and Association Analysis (based on the slides for the book Data Mining: Concepts and Techniques)


Frequent Pattern and Association Analysis
Basic concepts
Scalable mining methods
Mining a variety of rules and interesting patterns
Constraint-based mining
Mining sequential and structured patterns
Extensions and applications

Frequent Pattern and Association Analysis
→ Basic concepts
Scalable mining methods
Mining a variety of rules and interesting patterns
Constraint-based mining
Mining sequential and structured patterns
Extensions and applications

What Is Frequent Pattern Analysis?
Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of association rule mining

Motivation
Finding inherent regularities in data:
What products are often purchased together? Beer and diapers?!
What are the subsequent purchases after buying a PC?
What kinds of DNA are sensitive to this new drug?
Can we automatically classify web documents?
Examples: basket data analysis, cross-marketing, catalog design, sales campaign analysis, Web log (click stream) analysis, DNA sequence analysis, etc.

Basic Concepts (1)
Item: Boolean variable representing its presence or absence
Basket: Boolean vector of such variables
Baskets are analyzed to discover patterns of items that are frequently associated (or bought) together
Association rules: association patterns of the form X => Y

Basic Concepts (2)
Itemset X = {x1, …, xk}
Goal: find all rules X => Y with minimum support and confidence
Support, s: probability that a transaction contains X ∪ Y
  support = #transactions containing X ∪ Y / #total transactions = P(X ∪ Y)
Confidence, c: conditional probability that a transaction containing X also contains Y
  confidence = #transactions containing X ∪ Y / #transactions containing X = P(Y|X)
[Venn diagram: customers who buy beer, customers who buy diapers, and customers who buy both]
Transaction-id | Items bought
10 | A, B, D
20 | A, C, D
30 | A, D, E
40 | B, E, F
50 | B, C, D, E, F
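To make the two formulas concrete, here is a minimal Python sketch that computes them over the five-transaction table above (plain Python, nothing assumed beyond the table itself):

```python
# Minimal sketch: support and confidence over the toy database above.
db = [{"A", "B", "D"}, {"A", "C", "D"}, {"A", "D", "E"},
      {"B", "E", "F"}, {"B", "C", "D", "E", "F"}]

def support(itemset, db):
    """P(itemset): fraction of transactions containing every item."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(x, y, db):
    """P(y | x) = support(x union y) / support(x)."""
    return support(x | y, db) / support(x, db)

print(support({"A", "D"}, db))       # 0.6
print(confidence({"A"}, {"D"}, db))  # 1.0
print(confidence({"D"}, {"A"}, db))  # 0.75
```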

Basic Concepts (3)
Interesting (or strong) rules satisfy both a minimum support threshold and a minimum confidence threshold
Transaction-id | Items bought
10 | A, B, D
20 | A, C, D
30 | A, D, E
40 | B, E, F
50 | B, C, D, E, F
Let min_sup = 50%, min_conf = 50%
Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}
Association rules:
A => D (support 60%, confidence 100%): 60% of all analyzed transactions show that A and D are bought together; 100% of customers that bought A also bought D
D => A (support 60%, confidence 75%)

Basic Concepts (4)
Itemset (or pattern): a set of items (a k-itemset has k items)
Occurrence frequency of an itemset: number of transactions containing the itemset; also known as the frequency, support count, or count of the itemset
Frequent itemset: an itemset satisfying minimum support, i.e., frequency >= min_sup × #total transactions
confidence(A => B) = P(B|A) = frequency(A ∪ B) / frequency(A)
The problem of mining association rules reduces to the problem of mining frequent itemsets

Mining frequent itemsets
A long pattern contains a combinatorial number of sub-patterns
Solution: mine closed patterns and max-patterns
An itemset X is closed frequent if X is frequent and there exists no super-itemset Y ⊃ X with the same support as X
An itemset X is a max-itemset if X is frequent and there exists no frequent super-itemset Y ⊃ X
The closed itemsets are a lossless compression of the frequent patterns: they reduce the number of patterns and rules without losing support information
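A brute-force sketch of the two definitions, run on the four-transaction database used in the Apriori example later in the deck (fine for a toy database; real miners avoid enumerating every subset):

```python
from itertools import combinations

db = [frozenset("ACD"), frozenset("BCE"), frozenset("ABCE"), frozenset("BE")]
min_sup = 2

# Enumerate all frequent itemsets by brute force (exponential; demo only).
items = sorted(set().union(*db))
freq = {}
for k in range(1, len(items) + 1):
    for cand in combinations(items, k):
        s = frozenset(cand)
        count = sum(s <= t for t in db)
        if count >= min_sup:
            freq[s] = count

# Closed: no frequent superset has the same support.
closed = {x for x in freq
          if not any(x < y and freq[y] == freq[x] for y in freq)}
# Maximal: no frequent superset at all.
maximal = {x for x in freq if not any(x < y for y in freq)}
print(len(freq), len(closed), len(maximal))  # 9 4 2: maximal ⊆ closed ⊆ frequent
```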

Example
DB = {<a1, …, a100>, <a1, …, a50>}, min_sup = 1
What is the set of closed itemsets?
<a1, …, a100>: 1
<a1, …, a50>: 2
What is the set of max-patterns?
<a1, …, a100>: 1
What is the set of all frequent patterns?
All 2^100 − 1 nonempty subsets of {a1, …, a100}!!

Frequent Pattern and Association Analysis
Basic concepts
→ Scalable mining methods
Mining a variety of rules and interesting patterns
Constraint-based mining
Mining sequential and structured patterns
Extensions and applications

Scalable Methods for Mining Frequent Patterns
Downward closure (or Apriori) property of frequent patterns: any subset of a frequent itemset must be frequent
If {beer, diaper, nuts} is frequent, so is {beer, diaper}: every transaction containing {beer, diaper, nuts} also contains {beer, diaper}
Scalable mining methods, three major approaches:
Apriori (Agrawal & Srikant @VLDB'94)
Frequent-pattern growth (FP-growth; Han, Pei & Yin @SIGMOD'00)
Vertical data format approach (CHARM; Zaki & Hsiao @SDM'02)

Apriori: A Candidate Generation-and-Test Approach
Method:
Initially, scan the DB once to get the frequent 1-itemsets
Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
Test the candidates against the DB
Terminate when no frequent or candidate set can be generated
Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated or tested!

How to Generate Candidates?
Suppose the items in L_{k-1} are listed in an order.
Step 1: self-joining L_{k-1}
insert into C_k
select p.item_1, p.item_2, …, p.item_{k-1}, q.item_{k-1}
from L_{k-1} p, L_{k-1} q
where p.item_1 = q.item_1, …, p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1}
Step 2: pruning
forall itemsets c in C_k do
  forall (k-1)-subsets s of c do
    if (s is not in L_{k-1}) then delete c from C_k
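The same two steps as a plain-Python sketch; itemsets are kept as sorted tuples so the join condition p.item_{k-1} < q.item_{k-1} applies directly:

```python
from itertools import combinations

def generate_candidates(L_prev, k):
    """Self-join L_{k-1} with itself, then prune by the Apriori property.

    L_prev: set of frequent (k-1)-itemsets as sorted tuples.
    Returns the candidate k-itemsets C_k.
    """
    candidates = set()
    for p in L_prev:
        for q in L_prev:
            # Join: first k-2 items equal, last item of p < last item of q.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                candidates.add(p + (q[-1],))
    # Prune: drop any candidate with an infrequent (k-1)-subset.
    return {c for c in candidates
            if all(s in L_prev for s in combinations(c, k - 1))}

L3 = {("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")}
print(generate_candidates(L3, 4))  # {('a','b','c','d')}; acde pruned (ade not in L3)
```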

Example (min_sup = 2)
Database TDB:
Tid | Items
10 | A, C, D
20 | B, C, E
30 | A, B, C, E
40 | B, E
1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1: {A}:2, {B}:3, {C}:3, {E}:3
C2 (self-join of L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → C2 with counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2
C3: {B,C,E}
3rd scan → L3: {B,C,E}:2

The Apriori Algorithm
Pseudo-code:
C_k: candidate itemsets of size k
L_k: frequent itemsets of size k
L_1 = {frequent items};
for (k = 1; L_k != ∅; k++) do begin
  C_{k+1} = candidates generated from L_k;
  for each transaction t in the database do
    increment the count of all candidates in C_{k+1} that are contained in t
  L_{k+1} = candidates in C_{k+1} with min_support
end
return ∪_k L_k;
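The pseudo-code as a runnable Python sketch, with a brute-force containment test in place of the hash-tree discussed later:

```python
from itertools import combinations
from collections import Counter

def apriori(db, min_sup):
    """Return {itemset (sorted tuple): support count} for all frequent itemsets.

    db: list of transactions (iterables of items); min_sup: absolute count.
    """
    db = [frozenset(t) for t in db]
    # L1: frequent 1-itemsets.
    counts = Counter(item for t in db for item in t)
    L = {(i,): c for i, c in counts.items() if c >= min_sup}
    result, k = dict(L), 2
    while L:
        # C_k by self-join + prune, as in the previous sketch.
        C = set()
        for p in L:
            for q in L:
                if p[:-1] == q[:-1] and p[-1] < q[-1]:
                    cand = p + (q[-1],)
                    if all(s in L for s in combinations(cand, k - 1)):
                        C.add(cand)
        # Count the candidates contained in each transaction.
        counts = Counter()
        for t in db:
            for c in C:
                if set(c) <= t:
                    counts[c] += 1
        L = {c: n for c, n in counts.items() if n >= min_sup}
        result.update(L)
        k += 1
    return result

tdb = [["A","C","D"], ["B","C","E"], ["A","B","C","E"], ["B","E"]]
print(apriori(tdb, 2))  # reproduces L1, L2, and L3 from the trace above
```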

Important Details of Apriori
How to generate candidates?
Step 1: self-joining L_k
Step 2: pruning
Example of candidate generation:
L3 = {abc, abd, acd, ace, bcd}
Self-joining L3 * L3: abcd from abc and abd; acde from acd and ace
Pruning: acde is removed because ade is not in L3
C4 = {abcd}

Another example (1)

Another example (2)

Generation of association rules from frequent itemsets
confidence(A => B) = P(B|A) = support_count(A ∪ B) / support_count(A)
support_count(A ∪ B): number of transactions containing the itemset A ∪ B
support_count(A): number of transactions containing the itemset A
Association rules can be generated as follows:
For each frequent itemset l, generate all nonempty subsets of l
For every nonempty subset s of l, output the rule s => (l − s) if support_count(l) / support_count(s) >= min_conf
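A direct sketch of this generation loop; it assumes the miner has already produced a dictionary of support counts, here the counts for A, D, and AD from the five-transaction table earlier:

```python
from itertools import combinations

def generate_rules(freq, min_conf):
    """freq: {itemset (sorted tuple): support count}; yields (lhs, rhs, conf)."""
    for l, sup_l in freq.items():
        if len(l) < 2:
            continue
        for r in range(1, len(l)):                 # all nonempty proper subsets
            for s in combinations(l, r):
                conf = sup_l / freq[s]             # sup(l) / sup(s) = P(l-s | s)
                if conf >= min_conf:
                    yield s, tuple(x for x in l if x not in s), conf

freq = {("A",): 3, ("D",): 4, ("A", "D"): 3}       # counts from the 5-row table
for lhs, rhs, conf in generate_rules(freq, 0.5):
    print(lhs, "=>", rhs, f"conf={conf:.2f}")      # A=>D 1.00, D=>A 0.75
```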

How to Count Supports of Candidates?
Why is counting the supports of candidates a problem?
The total number of candidates can be huge
One transaction may contain many candidates
Method based on hashing:
Candidate itemsets are stored in a hash-tree
A leaf node of the hash-tree contains a list of itemsets and counts
An interior node contains a hash table
A subset function finds all the candidates contained in a transaction
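The hash-tree itself is more involved; as a simplified stand-in, the sketch below keeps candidates in a hash set and enumerates each transaction's k-subsets, which plays the role of the subset function when transactions are short:

```python
from itertools import combinations
from collections import Counter

def count_supports(db, candidates, k):
    """Count each candidate k-itemset by enumerating transaction k-subsets.

    Simplified stand-in for the hash-tree: one hash lookup per k-subset.
    """
    cand_set = set(candidates)
    counts = Counter()
    for t in db:
        for sub in combinations(sorted(t), k):
            if sub in cand_set:                  # the "subset function"
                counts[sub] += 1
    return counts

db = [{"A","C","D"}, {"B","C","E"}, {"A","B","C","E"}, {"B","E"}]
C2 = [("A","C"), ("B","C"), ("B","E"), ("C","E")]
print(count_supports(db, C2, 2))  # the L2 counts from the trace above
```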

Challenges of Frequent Pattern Mining
Challenges:
Multiple scans of the transaction database
Huge number of candidates
Tedious workload of support counting for candidates
Improving Apriori, general ideas:
Partitioning
Sampling
Others

Partition: Scan the Database Only Twice
Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
Scan 1: partition the database and find the local frequent patterns in each partition
Scan 2: consolidate the global frequent patterns by counting the union of the local results; see the sketch below
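A two-scan sketch under these assumptions; the local miner here is deliberately brute-force and capped at 3-itemsets for brevity (a real implementation would run Apriori or FP-growth per partition):

```python
from itertools import combinations
from collections import Counter

def local_frequent(partition, min_sup_ratio, max_len=3):
    """Brute-force local miner; fine for small partitions, demo only."""
    n = len(partition)
    counts = Counter()
    for t in partition:
        for k in range(1, min(len(t), max_len) + 1):
            for items in combinations(sorted(t), k):
                counts[items] += 1
    return {i for i, c in counts.items() if c >= min_sup_ratio * n}

def partition_mine(db, min_sup_ratio, num_parts=2):
    size = (len(db) + num_parts - 1) // num_parts
    parts = [db[i:i + size] for i in range(0, len(db), size)]
    # Scan 1: any globally frequent itemset is frequent in >= 1 partition.
    candidates = set().union(*(local_frequent(p, min_sup_ratio) for p in parts))
    # Scan 2: count global support of the consolidated candidates.
    counts = Counter()
    for t in db:
        t = set(t)
        for c in candidates:
            if set(c) <= t:
                counts[c] += 1
    return {c: n for c, n in counts.items() if n >= min_sup_ratio * len(db)}

tdb = [["A","C","D"], ["B","C","E"], ["A","B","C","E"], ["B","E"]]
print(partition_mine(tdb, 0.5))  # same frequent itemsets as plain Apriori
```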

Sampling for Frequent Patterns
Select a sample of the original database and mine frequent patterns within the sample using Apriori
Use a lower support threshold to find the frequent itemsets in the sample
Scan the full database once to verify the frequent itemsets found in the sample
Scan the database again to find any missed frequent patterns
Trade-off: accuracy vs. efficiency

Bottleneck of Frequent-Pattern Mining
Multiple database scans are costly
Mining long patterns needs many passes of scanning and generates lots of candidates
To find the frequent itemset i1 i2 … i100:
Number of scans: 100
Number of candidates: C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30 !
Bottleneck: candidate generation and test
Can we avoid candidate generation? Yes!
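A quick Python sanity check of that count:

```python
import math

# Sum of C(100, k) over k = 1..100 is 2^100 - 1 (all nonempty subsets).
total = sum(math.comb(100, k) for k in range(1, 101))
assert total == 2**100 - 1
print(f"{total:.2e}")  # 1.27e+30 candidate itemsets
```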

Visualization of Association Rules: Plane Graph

Visualization of Association Rules: Rule Graph

Visualization of Association Rules (SGI/MineSet 3.0)

Frequent Pattern and Association Analysis
Basic concepts
Scalable mining methods
→ Mining a variety of rules and interesting patterns
Constraint-based mining
Mining sequential and structured patterns
Extensions and applications

Mining Various Kinds of Association Rules
Mining multi-level associations
Mining multi-dimensional associations
Mining quantitative associations
Mining interesting correlation patterns

Example

Mining Multiple-Level Association Rules
Items often form a hierarchy; items at the lower levels are expected to have lower support
Flexible support settings, e.g. for Milk [support = 10%] with children 2% Milk [support = 6%] and Skim Milk [support = 4%]:
Uniform support: Level 1 min_sup = 5%, Level 2 min_sup = 5% (Skim Milk, at 4%, is pruned)
Reduced support: Level 1 min_sup = 5%, Level 2 min_sup = 3% (Skim Milk now qualifies)

Multi-level Association: Redundancy Filtering
Some rules may be redundant due to "ancestor" relationships between items. Example:
R1: milk => wheat bread [support = 8%, confidence = 70%]
R2: 2% milk => wheat bread [support = 2%, confidence = 72%]
R1 is an ancestor of R2: R1 can be obtained from R2 by replacing items with their ancestors in the concept hierarchy
R2 is redundant if its support is close to the "expected" value based on the rule's ancestor: e.g., if roughly a quarter of milk purchases are 2% milk, R2's expected support is 8% × 1/4 = 2%, which matches the observed support, so R2 adds no new information

Mining Multi-Dimensional Associations
Single-dimensional rules: buys(X, "milk") => buys(X, "bread")
Multi-dimensional rules: >= 2 dimensions or predicates
Inter-dimension association rules (no repeated predicates): age(X, "19-25") ∧ occupation(X, "student") => buys(X, "coke")
Hybrid-dimension association rules (repeated predicates): age(X, "19-25") ∧ buys(X, "popcorn") => buys(X, "coke")

Recall: Categorical vs. Quantitative Attributes
Categorical (or nominal) attributes: finite number of possible values, no ordering among values
Quantitative attributes: numeric, with an implicit ordering among values

Mining Quantitative Associations
Techniques can be categorized by how numerical attributes, such as age or salary, are treated:
Static discretization based on predefined concept hierarchies (data cube methods)
Dynamic discretization based on data distribution (quantitative rules, e.g., Agrawal & Srikant @SIGMOD'96)
Clustering: distance-based association (e.g., Miller & Yang @SIGMOD'97), one-dimensional clustering then association
Deviation analysis (such as Aumann & Lindell @KDD'99), e.g.: Sex = female => Wage: mean = $7/hr (overall mean = $9)

Static Discretization of Quantitative Attributes (1)
Attributes are discretized prior to mining using a concept hierarchy; numeric values are replaced by ranges
The frequent-itemset mining algorithm must be modified so that frequent predicate sets are searched: instead of searching only one attribute (e.g., buys), search through all relevant attributes (e.g., age, occupation, buys) and treat each attribute-value pair as an itemset
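A small sketch of this pre-processing step; the bin boundaries and attribute names below are made up for illustration, and the output items can be fed to any standard frequent-itemset miner unchanged:

```python
def discretize(record, bins):
    """Map a record to (attribute, value-or-range) items."""
    items = set()
    for attr, value in record.items():
        if attr in bins:  # numeric: replace the value by its range
            for lo, hi in bins[attr]:
                if lo <= value < hi:
                    items.add((attr, f"{lo}-{hi}"))
                    break
        else:             # categorical: keep the pair as-is
            items.add((attr, value))
    return items

bins = {"age": [(15, 25), (25, 35), (35, 65)]}  # hypothetical hierarchy
print(discretize({"age": 22, "occupation": "student", "buys": "coke"}, bins))
# {('age', '15-25'), ('occupation', 'student'), ('buys', 'coke')}
```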

Static Discretization of Quantitative Attributes (2)
A data cube is well suited for this mining and may already exist
The cells of an n-dimensional cuboid correspond to the predicate sets and can be used to store the support counts
Mining from data cubes can be much faster
[Figure: lattice of cuboids over (age, income, buys): (), (age), (income), (buys), (age, income), (age, buys), (income, buys), (age, income, buys)]

Mining Various Kinds of Association Rules
Mining multi-level associations
Mining multi-dimensional associations
Mining quantitative associations
→ Mining interesting correlation patterns

Mining interesting correlation patterns
Strong association rules (with high support and confidence) can be uninteresting. Example:
play basketball => eat cereal [40%, 66.7%] is misleading: the overall percentage of students eating cereal is 75%, which is higher than 66.7%
play basketball => not eat cereal [20%, 33.3%] is more accurate, although it has lower support and confidence
Contingency table (out of 5,000 students):
           | Basketball | Not basketball | Sum (row)
Cereal     | 2000       | 1750           | 3750
Not cereal | 1000       | 250            | 1250
Sum (col.) | 3000       | 2000           | 5000

Are All the Rules Found Interesting?
The confidence of a rule only estimates the conditional probability of item B given item A; it does not measure the real correlation between A and B
Another example: buy walnuts => buy milk [1%, 80%] is misleading if 85% of customers buy milk anyway
Support and confidence are not good measures of correlation
Many other interestingness measures exist (Tan, Kumar & Srivastava @KDD'02)

Are All the Rules Found Interesting? (cont.)
Correlation rule: A => B [support, confidence, correlation measure]
Measure of dependent/correlated events: the correlation coefficient, or lift
The occurrence of A is independent of the occurrence of B if P(A ∪ B) = P(A) P(B); lift is defined as
lift = P(A ∪ B) / (P(A) P(B))
If lift > 1, A and B are positively correlated; if lift < 1, they are negatively correlated
For the basketball/cereal example above, lift = 0.40 / (0.60 × 0.75) ≈ 0.89 < 1, confirming the negative correlation

Lift: example (Coffee => Milk)
           | Milk  | No Milk | Sum (row)
Coffee     | m, c  | ~m, c   | c
No Coffee  | m, ~c | ~m, ~c  | ~c
Sum (col.) | m     | ~m      |
DB | m, c  | ~m, c | m, ~c  | ~m, ~c  | lift
A1 | 1,000 | 100   | 100    | 10,000  | 9.26
A2 | 100   | 1,000 | 1,000  | 100,000 | 8.44
A3 | 1,000 | 100   | 10,000 | 100,000 | 9.18
A4 | 1,000 | 1,000 | 1,000  | 1,000   | 1.00
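Lift from a 2x2 contingency table, as a small helper sketch (the sample calls use the A1 and A4 rows above):

```python
def lift(mc, nm_c, m_nc, nm_nc):
    """Lift of coffee => milk from a 2x2 contingency table.

    mc: #(milk, coffee), nm_c: #(no milk, coffee),
    m_nc: #(milk, no coffee), nm_nc: #(no milk, no coffee).
    """
    n = mc + nm_c + m_nc + nm_nc
    p_both = mc / n
    p_milk = (mc + m_nc) / n
    p_coffee = (mc + nm_c) / n
    return p_both / (p_milk * p_coffee)

print(round(lift(1000, 100, 100, 10000), 2))   # 9.26 (A1)
print(round(lift(1000, 1000, 1000, 1000), 2))  # 1.0 (A4): independent
```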

Mining Highly Correlated Patterns
lift and χ2 are not good measures for correlations in large transactional DBs
all-confidence or coherence could be good measures
Both all-confidence and coherence have the downward closure property
Efficient algorithms can be derived for mining them (Lee et al. @ICDM'03)
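A sketch of all-confidence, using the standard definition all_conf(X) = sup(X) / max over items x in X of sup({x}); it is evaluated here on the four-transaction TDB from the Apriori example:

```python
def all_confidence(itemset, db):
    """all_conf(X) = sup(X) / max item support among the items of X."""
    itemset = set(itemset)
    sup_x = sum(itemset <= set(t) for t in db)
    max_item = max(sum(i in t for t in db) for i in itemset)
    return sup_x / max_item

db = [{"A","C","D"}, {"B","C","E"}, {"A","B","C","E"}, {"B","E"}]
print(all_confidence({"B", "E"}, db))  # 3/3 = 1.0
print(all_confidence({"A", "C"}, db))  # 2/3 ≈ 0.67
```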

Bibliography
(Book) Data Mining: Concepts and Techniques, J. Han & M. Kamber, Morgan Kaufmann, 2001 (Chapter 6 in the 2001 book; Chapter 4 in the draft)