Association Rule Mining (II)
Instructor: Qiang Yang. Thanks: J. Han and J. Pei

Bottleneck of Frequent-pattern Mining
- Multiple database scans are costly.
- Mining long patterns needs many passes of scanning and generates lots of candidates. To find the frequent itemset i1 i2 ... i100:
  - # of scans: 100
  - # of candidates: C(100,1) + C(100,2) + ... + C(100,100) = 2^100 - 1 ≈ 1.27 × 10^30
- Bottleneck: candidate generation and test.
- Can we avoid candidate generation?
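As a quick sanity check on the figure above, the candidate count can be reproduced directly; a minimal Python snippet (illustrative only):

```python
from math import comb

# Candidates Apriori would have to enumerate for a single frequent
# 100-itemset: C(100,1) + C(100,2) + ... + C(100,100) = 2^100 - 1.
n_candidates = sum(comb(100, k) for k in range(1, 101))
assert n_candidates == 2 ** 100 - 1
print(f"{n_candidates:.2e}")   # ~1.27e+30
```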

FP-growth: Frequent-pattern Mining Without Candidate Generation
- Heuristic: let P be a frequent itemset, S be the set of transactions that contain P, and x be an item. If x is a frequent item in S, then {x} ∪ P must be a frequent itemset.
- No candidate generation!
- A compact data structure, the FP-tree, stores the information needed for frequent-pattern mining.
- A recursive mining algorithm mines the complete set of frequent patterns.

Example
Transactions (items bought):
  f, a, c, d, g, i, m, p
  a, b, c, f, l, m, o
  b, f, h, j, o
  b, c, k, s, p
  a, f, c, e, l, p, m, n
Min support = 3

Constructing the FP-tree
- Scan the database once and list the frequent items, sorted by support (item:support): f:4, c:4, a:3, b:3, m:3, p:3.
- Create the root of the tree and label it "{}".
- Scan the database a second time: scanning the first transaction leads to the first branch of the tree, with items ordered according to frequency.

Scanning TID=100
Transaction 100 contains f, a, c, d, g, i, m, p; its frequent items, in frequency order, are f, c, a, m, p. Inserting them creates the first branch of the tree:
  {} -> f:1 -> c:1 -> a:1 -> m:1 -> p:1
The header table links each frequent item (f, c, a, m, p) to its node in the tree.

Scanning TID=200
- Frequent single items: F1 = {f, c, a, b, m, p}.
- TID=200 contains a, b, c, f, l, m, o; intersecting with F1 and ordering by frequency gives f, c, a, b, m.
- Follow the existing first branch along the shared prefix f, c, a, then generate two new children: b under a, and m under that b.

Scanning TID=200 (result)
After inserting f, c, a, b, m, the counts along the shared prefix become f:2, c:2, a:2, and a new branch b:1 -> m:1 hangs off a:2; the original path continues with m:1 -> p:1. The header table now records one node each for f, c, a, b and p, and two nodes for m.

The Final FP-tree
Transaction database (min support = 3):
  100: f, a, c, d, g, i, m, p
  200: a, b, c, f, l, m, o
  300: b, f, h, j, o
  400: b, c, k, s, p
  500: a, f, c, e, l, p, m, n
Frequent 1-items in frequency-descending order: f, c, a, b, m, p.
The final tree:
  {}
  ├── f:4
  │   ├── c:3
  │   │   └── a:3
  │   │       ├── m:2
  │   │       │   └── p:2
  │   │       └── b:1
  │   │           └── m:1
  │   └── b:1
  └── c:1
      └── b:1
          └── p:1
The header table records how many tree nodes carry each item: f:1, c:2, a:1, b:3, m:2, p:2.
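The two-scan construction just described can be sketched in a few lines of Python. This is an illustrative sketch, not a library implementation; the FPNode and build_fptree names are made up here, and ties in item frequency are broken alphabetically, so the tree is laid out slightly differently from the slide (c before f) while the mined patterns are unaffected.

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}                 # item -> FPNode

def build_fptree(transactions, min_sup):
    # Scan 1: count items and keep only the frequent ones.
    freq = defaultdict(int)
    for t in transactions:
        for item in t:
            freq[item] += 1
    freq = {i: c for i, c in freq.items() if c >= min_sup}

    root = FPNode(None, None)
    header = defaultdict(list)             # item -> nodes carrying that item

    # Scan 2: insert each transaction's frequent items in
    # frequency-descending order, sharing common prefixes.
    for t in transactions:
        items = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            if item not in node.children:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
            node.count += 1
    return root, header, freq

# The five transactions from the slides, min support = 3.
db = [list("facdgimp"), list("abcflmo"), list("bfhjo"),
      list("bcksp"), list("afcelpmn")]
root, header, freq = build_fptree(db, 3)
print(sorted(freq.items(), key=lambda kv: (-kv[1], kv[0])))
# [('c', 4), ('f', 4), ('a', 3), ('b', 3), ('m', 3), ('p', 3)]
```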

FP-tree Construction
- Scans the database only twice.
- All subsequent mining is based on the FP-tree.

How to Mine an FP-tree?
- Step 1: form the conditional pattern base.
- Step 2: construct the conditional FP-tree.
- Step 3: recursively mine the conditional FP-trees.

Conditional Pattern Base
- Let I be a frequent item. Its conditional pattern base is the sub-database consisting of the prefix paths in the FP-tree that co-occur with I as a suffix pattern.
- Example: m is a frequent item. Read off the tree above, m's conditional pattern base is (f, c, a) with support 2 and (f, c, a, b) with support 1.
- Mine recursively on such sub-databases.
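The same conditional pattern base can also be read directly off the ordered transactions, since each prefix path is just the frequent items preceding the suffix item. A small sketch (the db, order, and function names are illustrative):

```python
from collections import Counter

db = [list("facdgimp"), list("abcflmo"), list("bfhjo"),
      list("bcksp"), list("afcelpmn")]
order = ["f", "c", "a", "b", "m", "p"]        # F-list in the slide's order

def conditional_pattern_base(item):
    base = Counter()
    for t in db:
        proj = [i for i in order if i in t]   # frequent items, F-list order
        if item in proj:
            prefix = tuple(proj[:proj.index(item)])
            if prefix:
                base[prefix] += 1
    return base

print(conditional_pattern_base("m"))
# Counter({('f', 'c', 'a'): 2, ('f', 'c', 'a', 'b'): 1})
```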

Conditional Pattern Tree
- Let I be a suffix item and DB|I be its conditional pattern base. The frequent-pattern tree Tree_I built on DB|I is known as the conditional pattern tree (conditional FP-tree).
- Example: m's conditional pattern base is (f, c, a): 2 and (f, c, a, b): 1. With min support 3, b drops out, and m's conditional pattern tree is the single path {} -> f:3 -> c:3 -> a:3.

Composition of Patterns α and β
- Let α be a frequent item in DB, B be α's conditional pattern base, and β be an itemset in B. Then α ∪ β is frequent in DB if and only if β is frequent in B.
- Example: start with α = {p}. From the tree, {p}'s conditional pattern base is B = {(f, c, a, m): 2, (c, b): 1}. Let β be {c}. Then α ∪ β = {p, c}, with support = 3.

Single-Path Tree
- Let P be a single-path FP-tree and {I1, I2, ..., Ik} be an itemset in the tree. Let Ij be the item with the lowest support on the path. Then support({I1, I2, ..., Ik}) = support(Ij).
- Example: in m's conditional pattern tree {} -> f:3 -> c:3 -> a:3, the itemset {f, c, a} has support 3.

FP-growth Algorithm (Fig. 6.10)
A recursive algorithm.
Input: a transaction database and min_sup.
Output: the complete set of frequent patterns.
1. Construct the FP-tree.
2. Mine it by calling FP_growth(FP_tree, null).
Key idea: handle single-path and multi-path FP-trees separately, and keep splitting until a single-path FP-tree is obtained.

FP_growth(Tree, α)
if Tree contains a single path P then
    for each combination β of the nodes in the path P do
        generate pattern β ∪ α with support = minimum support of the nodes in β
else for each item a in the header of Tree do {
    generate pattern β = a ∪ α with support = a.support;
    construct (1) β's conditional pattern base and (2) β's conditional FP-tree Tree_β;
    if Tree_β is not empty then
        call FP_growth(Tree_β, β);
}
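Below is a compact Python sketch of this recursion. To stay short it represents each conditional pattern base as a list of (prefix, count) pairs instead of building an actual compressed FP-tree, so it illustrates the divide-and-conquer recursion rather than the tree compression; all names are illustrative.

```python
from collections import Counter

def fp_growth(cond_db, suffix, min_sup, results):
    """Mine all frequent patterns extending `suffix`.

    cond_db : list of (items, count) pairs -- a conditional pattern base,
              with items kept in a fixed global (F-list) order.
    results : dict frozenset -> support, filled in place.
    """
    counts = Counter()
    for items, c in cond_db:
        for i in items:
            counts[i] += c
    for item, sup in counts.items():
        if sup < min_sup:
            continue
        pattern = suffix | {item}
        results[frozenset(pattern)] = sup
        # Conditional pattern base of `pattern`: for each entry containing
        # `item`, keep only the items preceding it (its prefix path).
        new_db = []
        for items, c in cond_db:
            if item in items:
                prefix = items[:items.index(item)]
                if prefix:
                    new_db.append((prefix, c))
        fp_growth(new_db, pattern, min_sup, results)

# Running example: transactions projected onto the F-list f, c, a, b, m, p.
db = [list("facdgimp"), list("abcflmo"), list("bfhjo"),
      list("bcksp"), list("afcelpmn")]
order = ["f", "c", "a", "b", "m", "p"]
cond_db = [(tuple(i for i in order if i in t), 1) for t in db]

results = {}
fp_growth(cond_db, set(), 3, results)
print(len(results))                               # 18 frequent itemsets
print(results[frozenset({"f", "c", "a", "m"})])   # 3
```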

FP-growth vs. Apriori: Scalability with the Support Threshold
Data set: T25I20D10K

FP-growth vs. Tree-Projection: Scalability with the Support Threshold
Data set: T25I20D100K

Why Is FP-growth the Winner?
- Divide-and-conquer: it decomposes both the mining task and the database according to the frequent patterns obtained so far, which leads to focused searches of smaller databases.
- Other factors: no candidate generation and no candidate test; a compressed database (the FP-tree structure); no repeated scan of the entire database; the basic operations are counting and FP-tree building, not pattern search and matching.

Implications of the Methodology: Papers by Han et al.
- Mining closed frequent itemsets and max-patterns: CLOSET (DMKD'00)
- Mining sequential patterns: FreeSpan (KDD'00), PrefixSpan (ICDE'01)
- Constraint-based mining of frequent patterns: convertible constraints (KDD'00, ICDE'01)
- Computing iceberg data cubes with complex measures: H-tree and H-cubing algorithm (SIGMOD'01)

Visualization of Association Rules: Pane Graph

Visualization of Association Rules: Rule Graph

Mining Various Kinds of Rules or Regularities
- Multi-level and quantitative association rules, correlation and causality, ratio rules, sequential patterns, emerging patterns, temporal associations, partial periodicity.
- Classification, clustering, iceberg cubes, etc.

Multiple-level Association Rules
- Items often form a hierarchy; items at the lower level are expected to have lower support.
- The transaction database can be encoded based on dimensions and levels, and shared multi-level mining can then be explored.
- Flexible support settings. Example: Milk [support = 10%], 2% Milk [support = 6%], Skim Milk [support = 4%].
  - Uniform support: level 1 min_sup = 5%, level 2 min_sup = 5%. 2% Milk passes, but Skim Milk (4%) is missed.
  - Reduced support: level 1 min_sup = 5%, level 2 min_sup = 3%. Both lower-level items pass.
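A tiny check of the two support settings on the milk example from this slide (the dictionaries are just the slide's numbers written out):

```python
support = {"Milk": 0.10, "2% Milk": 0.06, "Skim Milk": 0.04}
level = {"Milk": 1, "2% Milk": 2, "Skim Milk": 2}

settings = {
    "uniform support": {1: 0.05, 2: 0.05},
    "reduced support": {1: 0.05, 2: 0.03},
}
for name, min_sup in settings.items():
    kept = [item for item in support if support[item] >= min_sup[level[item]]]
    print(f"{name}: {kept}")
# uniform support: ['Milk', '2% Milk']          (Skim Milk at 4% is missed)
# reduced support: ['Milk', '2% Milk', 'Skim Milk']
```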

Quantitative Association Rules
- Numeric attributes are dynamically discretized so that the confidence or compactness of the mined rules is maximized.
- 2-D quantitative association rules: A_quan1 ∧ A_quan2 ⇒ A_cat.
- Cluster "adjacent" association rules to form general rules using a 2-D grid.
- Example: age(X, "34-35") ∧ income(X, "30K-50K") ⇒ buys(X, "high resolution TV").

Redundant Rules [SA95]
Which rule is redundant?
  milk ⇒ wheat bread [support = 8%, confidence = 70%]
  skim milk ⇒ wheat bread [support = 2%, confidence = 72%]
The first rule is more general than the second. A rule is redundant if its support is close to the "expected" value derived from a more general rule, and its confidence is close to that of the general rule.
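A rough sketch of that redundancy test. The 25% share of skim milk within milk sales and the 10% tolerance are made-up numbers for illustration only; the slide gives just the two rules' support and confidence.

```python
def is_redundant(rule_sup, rule_conf, parent_sup, parent_conf,
                 child_share, tol=0.10):
    """Flag a specialized rule as redundant when its support and confidence
    are within a relative tolerance of the values expected from the more
    general (parent) rule."""
    expected_sup = parent_sup * child_share   # expected support of the specialization
    expected_conf = parent_conf               # confidence expected to carry over
    return (abs(rule_sup - expected_sup) <= tol * expected_sup and
            abs(rule_conf - expected_conf) <= tol * expected_conf)

# "skim milk => wheat bread" vs. its parent "milk => wheat bread".
print(is_redundant(rule_sup=0.02, rule_conf=0.72,
                   parent_sup=0.08, parent_conf=0.70,
                   child_share=0.25))         # True -> redundant
```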

Incremental Mining [CHNW96]
- Setting: the rules in DB have already been found, and a set of new tuples db is added to DB. Task: find the rules in DB + db. Usually DB is much larger than db.
- Properties of itemsets:
  - Frequent in DB + db if frequent in both DB and db.
  - Infrequent in DB + db if infrequent in both DB and db.
  - Frequent only in DB: merge its count in DB with its count in db; no scan of DB is needed.
  - Frequent only in db: scan DB once to update its itemset count.
- The same principle applies to distributed/parallel mining.
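A sketch of that update strategy in Python. The two counting helpers are assumed to exist (one scans the small increment db, the other makes a single pass over DB); everything else follows directly from the four cases above, and the demo counts at the bottom are made up.

```python
def incremental_frequent(freq_DB, n_DB, freq_db, n_db, min_sup,
                         count_in_db, count_in_DB):
    """Update the frequent itemsets of DB after an increment db is added.

    freq_DB / freq_db : dict itemset -> count of itemsets frequent in DB / db
    count_in_db(sets) : assumed helper; counts the given itemsets in db (cheap)
    count_in_DB(sets) : assumed helper; one full scan of DB, needed only for
                        itemsets frequent in db but not in DB
    """
    min_count = min_sup * (n_DB + n_db)
    result = {}

    # Itemsets frequent in DB: add their (cheaply obtained) counts in db.
    db_counts = count_in_db(list(freq_DB))
    for s, c in freq_DB.items():
        total = c + db_counts.get(s, 0)
        if total >= min_count:
            result[s] = total

    # Itemsets frequent only in db: one scan of DB for their old counts.
    new_only = [s for s in freq_db if s not in freq_DB]
    old_counts = count_in_DB(new_only)
    for s in new_only:
        total = freq_db[s] + old_counts.get(s, 0)
        if total >= min_count:
            result[s] = total
    return result

# Tiny demo with made-up counts: |DB| = 10, |db| = 4, min_sup = 40%.
DB_freq = {frozenset("a"): 6, frozenset("ab"): 4}
db_freq = {frozenset("a"): 3, frozenset("c"): 3}
count_in_db = lambda sets: {s: {frozenset("a"): 3, frozenset("ab"): 1}.get(s, 0) for s in sets}
count_in_DB = lambda sets: {s: {frozenset("c"): 2}.get(s, 0) for s in sets}
print(incremental_frequent(DB_freq, 10, db_freq, 4, 0.4, count_in_db, count_in_DB))
# {frozenset({'a'}): 9} -- 'ab' and 'c' fall below 40% of the 14 combined tuples
```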

Correlation Rules
- Association does not measure correlation [BMS97, AY98].
- Among 5000 students: 3000 play basketball, 3750 eat cereal, and 2000 do both.
- play basketball ⇒ eat cereal [40%, 66.7%]
- The conclusion "basketball and cereal are correlated" is misleading, because the overall percentage of students eating cereal is 75%, higher than 66.7%. Confidence does not always give the correct picture!

Correlation Rules (Lift)
- If A and B are independent events, P(A ∧ B) = P(A) · P(B).
- P(B|A) / P(B) = P(A ∧ B) / (P(A) · P(B)) is known as the lift of the rule A ⇒ B.
- If the value is less than 1, A and B are negatively correlated; otherwise A and B are positively correlated.
- Basketball ⇒ Cereal: lift = (2000/5000) / (3000/5000 × 3750/5000) = (2000 × 5000) / (3000 × 3750) ≈ 0.89 < 1, so playing basketball and eating cereal are negatively correlated.
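Recomputing the lift from the slide's numbers (a trivial check):

```python
n = 5000
basketball, cereal, both = 3000, 3750, 2000

p_b, p_c, p_bc = basketball / n, cereal / n, both / n
lift = p_bc / (p_b * p_c)           # = P(cereal | basketball) / P(cereal)
print(round(lift, 3))               # 0.889 < 1 -> negatively correlated
```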

Chi-square Correlation [BMS97]
The χ² statistic computed for the example contingency table is 0.9, while the cutoff value at the 95% significance level is 3.84. Since 0.9 < 3.84, we do not reject the independence assumption.
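The 0.9 statistic refers to the slide's own contingency table, which is not reproduced in the transcript. As an illustration of the test itself, here is the χ² computation applied to the basketball/cereal counts from the earlier slide; that table, by contrast, strongly rejects independence.

```python
n = 5000
basketball, cereal, both = 3000, 3750, 2000

# 2x2 contingency table: rows = plays basketball yes/no, cols = eats cereal yes/no.
observed = [[both, basketball - both],
            [cereal - both, n - basketball - cereal + both]]

row_tot = [sum(r) for r in observed]
col_tot = [sum(c) for c in zip(*observed)]
chi2 = 0.0
for i in range(2):
    for j in range(2):
        expected = row_tot[i] * col_tot[j] / n
        chi2 += (observed[i][j] - expected) ** 2 / expected
print(round(chi2, 1))   # 277.8 >> 3.84, so independence is rejected for this table
```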

Constraint-based Data Mining
- Finding all the patterns in a database autonomously? Unrealistic! The patterns could be too many and not focused.
- Data mining should be an interactive process: the user directs what is to be mined using a data mining query language (or a graphical user interface).
- Constraint-based mining:
  - User flexibility: the user provides constraints on what is to be mined.
  - System optimization: the system exploits such constraints for efficient mining.

Constraints in Data Mining
- Knowledge type constraint: classification, association, etc.
- Data constraint (using SQL-like queries): e.g., find product pairs sold together in Vancouver stores in Dec. '00.
- Dimension/level constraint: in relevance to region, price, brand, customer category.
- Rule (or pattern) constraint: e.g., small sales (price under $10) triggering big sales (sum over $200).
- Interestingness constraint: strong rules, e.g., min_support ≥ 3%, min_confidence ≥ 60%.

Constrained Mining vs. Constraint-Based Search
- Constrained mining vs. constraint-based search/reasoning:
  - Both aim at reducing the search space.
  - Finding all patterns satisfying the constraints vs. finding some (or one) answer in constraint-based search in AI.
  - Constraint pushing vs. heuristic search.
  - How to integrate the two is an interesting research problem.
- Constrained mining vs. query processing in DBMS:
  - Database query processing is required to find all answers.
  - Constrained pattern mining shares a similar philosophy with pushing selections deeply into query processing.

Constrained Frequent Pattern Mining: A Mining Query Optimization Problem
- Given a frequent-pattern mining query with a set of constraints C, the algorithm should be:
  - Sound: it only finds frequent sets that satisfy the given constraints C.
  - Complete: all frequent sets satisfying the given constraints C are found.
- A naive solution: first find all frequent sets, then test them for constraint satisfaction.
- More efficient approaches: analyze the properties of the constraints comprehensively and push them as deeply as possible inside the frequent-pattern computation.

Anti-Monotonicity in Constraint-Based Mining
- Anti-monotonicity: when an itemset S satisfies the constraint, so does any of its subsets (equivalently, when S violates it, so does every superset).
- sum(S.price) ≤ v is anti-monotone; sum(S.price) ≥ v is not anti-monotone.
- Example. C: range(S.profit) ≤ 15 is anti-monotone. Itemset ab violates C (range = 40 − 0 = 40 > 15), and so does every superset of ab.
Transaction database TDB (min_sup = 2):
  10: a, b, c, d, f
  20: b, c, d, f, g, h
  30: a, c, d, e, f
  40: c, e, f, g
Item profits:
  a: 40, b: 0, c: −20, d: 10, e: −30, f: 30, g: 20, h: −10

Which Constraints Are Anti-Monotone?
  Constraint                         Anti-monotone?
  v ∈ S                              no
  S ⊇ V                              no
  S ⊆ V                              yes
  min(S) ≤ v                         no
  min(S) ≥ v                         yes
  max(S) ≤ v                         yes
  max(S) ≥ v                         no
  count(S) ≤ v                       yes
  count(S) ≥ v                       no
  sum(S) ≤ v (∀a ∈ S, a ≥ 0)         yes
  sum(S) ≥ v (∀a ∈ S, a ≥ 0)         no
  range(S) ≤ v                       yes
  range(S) ≥ v                       no
  avg(S) θ v, θ ∈ {=, ≤, ≥}          convertible
  support(S) ≥ ξ                     yes
  support(S) ≤ ξ                     no
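A small level-wise sketch that pushes the anti-monotone constraint range(S.profit) ≤ 15 into the mining loop, using the TDB and profit table from the anti-monotonicity slide (min_sup = 2). Any itemset that violates the constraint is discarded immediately, since every superset would violate it too; function and variable names are illustrative.

```python
tdb = [set("abcdf"), set("bcdfgh"), set("acdef"), set("cefg")]
profit = {"a": 40, "b": 0, "c": -20, "d": 10,
          "e": -30, "f": 30, "g": 20, "h": -10}
min_sup = 2

def support(itemset):
    return sum(1 for t in tdb if itemset <= t)

def profit_range(itemset):
    values = [profit[i] for i in itemset]
    return max(values) - min(values)

def ok(itemset):
    # Both tests are anti-monotone, so a failing itemset is pruned for good.
    return support(itemset) >= min_sup and profit_range(itemset) <= 15

level = [frozenset([i]) for i in sorted(profit) if ok(frozenset([i]))]
answers = list(level)
while level:
    candidates = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
    level = [c for c in candidates if ok(c)]
    answers.extend(level)

print(sorted("".join(sorted(s)) for s in answers))
# ['a', 'af', 'b', 'bd', 'c', 'ce', 'd', 'e', 'f', 'fg', 'g']
```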

Monotonicity in Constraint-Based Mining
- Monotonicity: when an itemset S satisfies the constraint, so does any of its supersets.
- sum(S.price) ≥ v is monotone; min(S.price) ≤ v is monotone.
- Example. C: range(S.profit) ≥ 15. Itemset ab satisfies C (range = 40 ≥ 15), and so does every superset of ab.
(TDB and item profits as on the anti-monotonicity slide, min_sup = 2.)

Which Constraints Are Monotone?
  Constraint                         Monotone?
  v ∈ S                              yes
  S ⊇ V                              yes
  S ⊆ V                              no
  min(S) ≤ v                         yes
  min(S) ≥ v                         no
  max(S) ≤ v                         no
  max(S) ≥ v                         yes
  count(S) ≤ v                       no
  count(S) ≥ v                       yes
  sum(S) ≤ v (∀a ∈ S, a ≥ 0)         no
  sum(S) ≥ v (∀a ∈ S, a ≥ 0)         yes
  range(S) ≤ v                       no
  range(S) ≥ v                       yes
  avg(S) θ v, θ ∈ {=, ≤, ≥}          convertible
  support(S) ≥ ξ                     no
  support(S) ≤ ξ                     yes

Succinct, Convertible, and Inconvertible Constraints
These are covered in the book; we will not consider them in this course.

Associative Classification
- Mine association rules of the form itemset ⇒ class, where the itemset is a set of attribute-value pairs and the class is a class label.
- Build the classifier: organize the rules in decreasing precedence based on confidence and support.
- B. Liu, W. Hsu & Y. Ma. Integrating classification and association rule mining. In KDD'98.
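A minimal sketch of the rule-ranking and first-match classification step. The precedence here (confidence, then support, then rule length) approximates CBA's ordering, whose final tie-break is actually rule generation order; the rules and the test instance are made-up examples.

```python
from typing import NamedTuple

class Rule(NamedTuple):
    itemset: frozenset      # attribute=value conditions
    label: str              # class label
    confidence: float
    support: float

def rank(rules):
    # Higher confidence first, then higher support, then shorter rules.
    return sorted(rules, key=lambda r: (-r.confidence, -r.support, len(r.itemset)))

def classify(rules, instance, default="unknown"):
    for r in rank(rules):
        if r.itemset <= instance:          # every condition of the rule holds
            return r.label
    return default

rules = [
    Rule(frozenset({"age<=30", "student=yes"}), "buys=yes", 0.92, 0.10),
    Rule(frozenset({"income=high"}), "buys=yes", 0.75, 0.20),
    Rule(frozenset({"age<=30"}), "buys=no", 0.60, 0.15),
]
print(classify(rules, frozenset({"age<=30", "student=yes", "income=low"})))
# buys=yes -- the highest-precedence rule whose conditions are all satisfied
```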

Classification by Aggregating Emerging Patterns
- Emerging pattern (EP): a pattern frequent in one class of data but infrequent in others.
- Example: "age ≤ 30" is frequent in class buys_computer = yes and infrequent in class buys_computer = no. Rule: age ≤ 30 ⇒ buys_computer = yes.
- G. Dong & J. Li. Efficient mining of emerging patterns: discovering trends and differences. In KDD'99.