Chapter 6: Mining Frequent Patterns, Association and Correlations


Chapter 6: Mining Frequent Patterns, Association and Correlations
- Basic concepts
- Frequent itemset mining methods
- Constraint-based frequent pattern mining (Chapter 7)
- Association rules

What Is Frequent Pattern Analysis?
- Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
- First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
- Motivation: finding inherent regularities in data
  - What products were often purchased together? Beer and diapers?!
  - What are the subsequent purchases after buying a PC?
  - What kinds of DNA are sensitive to this new drug?
  - Can we automatically classify web documents?
- Applications: basket data analysis, cross-marketing, catalog design, sales campaign analysis, web log (click stream) analysis, and DNA sequence analysis

Why Is Frequent Pattern Mining Important?
- Frequent patterns are an intrinsic and important property of data sets
- Foundation for many essential data mining tasks:
  - Association, correlation, and causality analysis
  - Sequential and structural (e.g., sub-graph) patterns
  - Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
  - Classification: associative classification
  - Cluster analysis: frequent-pattern-based clustering
  - Data warehousing: iceberg cubes and cube gradients
  - Semantic data compression: fascicles
- Broad applications

Basic Concepts: Frequent Patterns

Tid | Items bought
10 | Beer, Nuts, Diaper
20 | Beer, Coffee, Diaper
30 | Beer, Diaper, Eggs
40 | Nuts, Eggs, Milk
50 | Nuts, Coffee, Diaper, Eggs, Milk

- Itemset: a set of items
- k-itemset: an itemset X = {x1, ..., xk} containing k items
- (Absolute) support, or support count, of X: the number of transactions containing the itemset X
- (Relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
- An itemset X is frequent if X's support is no less than a minsup threshold

(Figure: Venn diagram of customers buying beer, customers buying diapers, and the overlap buying both.)
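As a quick check of these definitions, here is a minimal Python sketch (an illustration, not part of the original slides) that computes absolute and relative support over the transaction table above:

```python
# Minimal sketch: absolute and relative support over the slide's TDB.
db = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def support_count(itemset):
    """Absolute support: number of transactions containing the itemset."""
    return sum(itemset <= t for t in db)

def support(itemset):
    """Relative support: fraction of transactions containing the itemset."""
    return support_count(itemset) / len(db)

X = {"Beer", "Diaper"}
minsup = 0.5
print(support_count(X), support(X))   # 3 0.6
print(support(X) >= minsup)           # True: {Beer, Diaper} is frequent
```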

Closed Patterns and Max-Patterns
- A long pattern contains a combinatorial number of sub-patterns: {a1, ..., a100} contains 2^100 - 1 ≈ 1.27 × 10^30 sub-patterns!
- Solution: mine closed patterns and max-patterns instead
- An itemset X is a closed pattern if X is frequent and no super-pattern has the same support (every super-pattern must have strictly smaller support)
- An itemset X is a max-pattern if X is frequent and no super-pattern is frequent
- Relationship: every max-pattern is a closed pattern, but not vice versa, so there are at least as many closed patterns as max-patterns
- Closed patterns are a lossless compression of the frequent patterns: all frequent patterns and their supports can be derived. To derive the support of an itemset S that is not closed, find all closed patterns that contain S and take the largest of their supports; S is a subset of each such closed pattern, so its support is at least that large, and it cannot be larger, otherwise S would itself be closed.
- Max-patterns are a lossy compression with a bigger compression ratio: all frequent patterns can still be derived, but their support information is lost
- Closed patterns were proposed by Pasquier et al. @ ICDT'99; max-patterns by Bayardo @ SIGMOD'98

Closed Patterns and Max-Patterns: An Example
- DB = {⟨a1, ..., a100⟩, ⟨a1, ..., a50⟩}, min_sup = 1
- The closed patterns: ⟨a1, ..., a100⟩ with support 1, and ⟨a1, ..., a50⟩ with support 2
- The max-pattern: ⟨a1, ..., a100⟩ with support 1
- The set of all frequent patterns: {a1}: 2, ..., {a1, a2}: 2, ..., {a1, a51}: 1, ..., {a1, a2, ..., a100}: 1 — in total 2^100 - 1 patterns, because every nonempty subset of {a1, ..., a100} is frequent, with support either 1 or 2
- From the closed patterns, the frequent patterns and their supports can all be derived; from the max-pattern alone, the same patterns can be derived but without their support values
- If instead min_sup = 2, then ⟨a1, ..., a50⟩: 2 is the only closed pattern and the only max-pattern

Closed Patterns and Max-Patterns: An Exercise
- Given a dataset over the items I = {a, b, c, d} with min_sup = 8, suppose the closed patterns are {a, b, c, d} with support 10, {a, b, c} with support 12, and {a, b, d} with support 14
- Derive the frequent 2-itemsets together with their support values: each 2-itemset takes the largest support among the closed patterns that contain it
  - {a, b}: 14
  - {a, c}: 12
  - {a, d}: 14
  - {b, c}: 12
  - {b, d}: 14
  - {c, d}: 10
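A short Python sketch (an illustration, not from the slides) of this derivation rule — the support of any frequent itemset S is the maximum support among the closed patterns containing S:

```python
from itertools import combinations

# Closed patterns from the exercise, with their supports.
closed = {
    frozenset("abcd"): 10,
    frozenset("abc"): 12,
    frozenset("abd"): 14,
}

def derived_support(S):
    """support(S) = max support over all closed patterns containing S."""
    S = frozenset(S)
    return max(sup for pattern, sup in closed.items() if S <= pattern)

for pair in combinations("abcd", 2):
    print(set(pair), derived_support(pair))
# {a,b}: 14, {a,c}: 12, {a,d}: 14, {b,c}: 12, {b,d}: 14, {c,d}: 10
```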

Chapter 6: Mining Frequent Patterns, Association and Correlations
- Basic concepts
- Frequent itemset mining methods
- Constraint-based frequent pattern mining (Chapter 7)
- Association rules

Scalable Frequent Itemset Mining Methods
- Apriori: a candidate generation-and-test approach
- Improving the efficiency of Apriori
- FP-Growth: a frequent pattern-growth approach
- ECLAT: frequent pattern mining with the vertical data format

Scalable Methods for Mining Frequent Patterns
- The downward closure (anti-monotonic) property of frequent patterns: any subset of a frequent itemset must be frequent
  - If {beer, diaper, nuts} is frequent, so is {beer, diaper}, because every transaction containing {beer, diaper, nuts} also contains {beer, diaper}
- Scalable mining methods: three major approaches
  - Apriori (Agrawal & Srikant @ VLDB'94)
  - Frequent pattern growth (FP-Growth: Han, Pei & Yin @ SIGMOD'00)
  - Vertical data format (CHARM: Zaki & Hsiao @ SDM'02)

Apriori: A Candidate Generation-and-Test Approach
- Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated or tested! (Agrawal & Srikant @ VLDB'94; Mannila et al. @ KDD'94)
- Method:
  - Initially, scan the DB once to get the frequent 1-itemsets
  - Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
  - Test the candidates against the DB
  - Terminate when no frequent or candidate set can be generated

The Apriori Algorithm: An Example (min_sup = 2)

DB:
Tid | Items
10 | a, c, d
20 | b, c, e
30 | a, b, c, e
40 | b, e

1st scan → C1: {a}: 2, {b}: 3, {c}: 3, {d}: 1, {e}: 3
L1 (drop {d}, infrequent): {a}: 2, {b}: 3, {c}: 3, {e}: 3

C2 (self-join of L1): {a, b}, {a, c}, {a, e}, {b, c}, {b, e}, {c, e}
2nd scan → C2 counts: {a, b}: 1, {a, c}: 2, {a, e}: 1, {b, c}: 2, {b, e}: 3, {c, e}: 2
L2: {a, c}: 2, {b, c}: 2, {b, e}: 3, {c, e}: 2

C3: {b, c, e}
3rd scan → L3: {b, c, e}: 2

The Apriori Algorithm (Pseudocode)

Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent 1-itemsets};
for (k = 1; Lk ≠ ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in the database do
        increment the count of every candidate in Ck+1 that is contained in t;
    Lk+1 = candidates in Ck+1 with support ≥ min_sup;
end
return ⋃k Lk;
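The pseudocode translates directly into a short runnable program. The following Python sketch (an illustration, not the book's reference implementation) mines the example database from the previous slide:

```python
from itertools import combinations

def apriori(db, min_sup):
    """Mine all frequent itemsets; db is a list of transactions,
    min_sup an absolute support count. Returns {frozenset: count}."""
    transactions = [frozenset(t) for t in db]
    # First scan: frequent 1-itemsets (L1).
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(Lk)
    k = 1
    while Lk:
        # Generate C(k+1): self-join Lk, then Apriori-prune any
        # candidate that has an infrequent k-subset.
        keys = list(Lk)
        candidates = set()
        for i in range(len(keys)):
            for j in range(i + 1, len(keys)):
                union = keys[i] | keys[j]
                if len(union) == k + 1 and all(
                    frozenset(sub) in Lk for sub in combinations(union, k)
                ):
                    candidates.add(union)
        # One scan of the DB to count candidate supports.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        Lk = {s: c for s, c in counts.items() if c >= min_sup}
        frequent.update(Lk)
        k += 1
    return frequent

# The example DB above, with min_sup = 2.
db = [["a", "c", "d"], ["b", "c", "e"], ["a", "b", "c", "e"], ["b", "e"]]
for s, c in sorted(apriori(db, 2).items(),
                   key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(s), c)
```

The output reproduces L1, L2, and L3 from the example slide, ending with {b, c, e} at support 2.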

Implementation of Apriori
- Generate candidates, then count support for the generated candidates
- How to generate candidates from Lk?
  - Step 1: self-join Lk with itself
  - Step 2: prune any candidate with an infrequent k-subset
- Example with L3 = {abc, abd, acd, ace, bcd}:
  - Self-joining L3 * L3 yields abcd (from abc and abd) and acde (from acd and ace)
  - Pruning removes acde because its subset ade is not in L3
  - C4 = {abcd}
- This procedure misses no legitimate candidate, so Apriori mines the complete set of frequent patterns. Why complete? If a 4-itemset such as abde is frequent, then abd and abe must be frequent and therefore in L3, so abde must have been generated by the self-join. (A code sketch of this step follows below.)
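A standalone sketch of just the join-and-prune step, run on this slide's L3 (the helper name is my own, not from the slides):

```python
from itertools import combinations

def gen_candidates(L_k, k):
    """Apriori candidate generation: self-join L_k, then prune candidates
    with an infrequent k-subset. L_k is a set of frozensets of size k."""
    frequent = set(L_k)
    members = list(frequent)
    joined = set()
    for i in range(len(members)):
        for j in range(i + 1, len(members)):
            union = members[i] | members[j]
            if len(union) == k + 1:   # the two itemsets share k-1 items
                joined.add(union)
    # Prune: keep a candidate only if every k-subset of it is frequent.
    return {c for c in joined
            if all(frozenset(s) in frequent for s in combinations(c, k))}

L3 = {frozenset(s) for s in ["abc", "abd", "acd", "ace", "bcd"]}
print([sorted(c) for c in gen_candidates(L3, 3)])   # [['a', 'b', 'c', 'd']]
```

The join also produces abce and acde, but both are pruned (abe, bce, and ade are not in L3), leaving exactly C4 = {abcd} as on the slide.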

How to Count Supports of Candidates?
- Why is counting the supports of candidates a problem?
  - The total number of candidates can be huge
  - One transaction may contain many candidates
- Method: store candidate itemsets in a hash tree
  - A leaf node of the hash tree contains a list of itemsets and counts
  - An interior node contains a hash table
  - The subset function finds all the candidates contained in a transaction

Example: Counting Supports of Candidates

(Figure: a hash tree storing the candidate 3-itemsets 1 2 4, 1 2 5, 1 4 5, 1 5 9, 1 3 6, 2 3 4, 3 4 5, 3 5 6, 3 5 7, 3 6 7, 3 6 8, 4 5 7, 4 5 8, 5 6 7, 6 8 9. Interior nodes hash items into three branches: {1, 4, 7}, {2, 5, 8}, {3, 6, 9}. The subset function matches the transaction 1 2 3 5 6 against the tree by recursively expanding prefixes such as 1 + 2 3 5 6, 1 2 + 3 5 6, and 1 3 + 5 6, following only the branches those items hash to.)
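The logic the hash tree implements can be sketched in a simplified, hash-map form (an illustration, not the slide's hash tree) that enumerates each transaction's k-subsets instead of walking a tree:

```python
from itertools import combinations

def count_supports(candidates, transactions, k):
    """Simplified stand-in for the hash tree: store candidates in a hash
    map and probe it with every k-subset of each transaction. A real hash
    tree avoids enumerating subsets that cannot reach any stored leaf."""
    counts = {frozenset(c): 0 for c in candidates}
    for t in transactions:
        for subset in combinations(sorted(t), k):
            key = frozenset(subset)
            if key in counts:
                counts[key] += 1
    return counts

candidates = [{1, 2, 4}, {1, 2, 5}, {1, 3, 6}, {3, 5, 6}, {3, 5, 7}]
result = count_supports(candidates, [{1, 2, 3, 5, 6}], 3)
for c, n in result.items():
    print(sorted(c), n)   # 1 2 5, 1 3 6, and 3 5 6 each match once
```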

Further Improvement of the Apriori Method
- Major computational challenges:
  - Multiple scans of the transaction database
  - A huge number of candidates
  - Tedious support counting for the candidates
- Improving Apriori: general ideas
  - Reduce the number of passes over the transaction database
  - Shrink the number of candidates
  - Facilitate the support counting of candidates

Apriori Applications Beyond Frequent Pattern Mining
- Example: given a set S of students, find every subset of S whose age range is less than 5
- Apriori's level-wise search, using the downward closure property for pruning, gains efficiency for any kind of subset search with a downward-closed property (i.e., an anti-monotone constraint), not just frequent itemsets (a sketch follows below)
- CLIQUE for subspace clustering uses the same Apriori principle, with the dimensions playing the role of the itemset
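A sketch (with made-up names and ages, not from the slides) of this level-wise search under the anti-monotone age-range constraint — if a subset violates range < 5, every superset does too, so levels are built only from surviving sets:

```python
def subsets_with_small_range(ages, max_range=5):
    """Level-wise (Apriori-style) search for all subsets of students
    whose age range is below max_range."""
    level = [frozenset([name]) for name in ages]   # singletons: range 0
    result = list(level)
    while level:
        next_level = set()
        size = len(level[0])
        for i in range(len(level)):
            for j in range(i + 1, len(level)):
                union = level[i] | level[j]
                if len(union) == size + 1:
                    values = [ages[n] for n in union]
                    # Anti-monotone constraint check; violators are
                    # dropped, pruning all of their supersets.
                    if max(values) - min(values) < max_range:
                        next_level.add(union)
        level = list(next_level)
        result.extend(level)
    return result

ages = {"ann": 20, "bob": 22, "carl": 23, "dee": 30}
for s in subsets_with_small_range(ages):
    print(sorted(s))   # all subsets of {ann, bob, carl}; dee joins none
```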

Chapter 6: Mining Frequent Patterns, Association and Correlations
- Basic concepts
- Frequent itemset mining methods
- Constraint-based frequent pattern mining (Chapter 7)
- Association rules

Constraint-Based (Query-Directed) Mining
- Finding all the patterns in a database autonomously? Unrealistic! The patterns could be too many, and not focused
- Data mining should be an interactive process: the user directs what is to be mined using a data mining query language (or a graphical user interface)
- Constraint-based mining
  - User flexibility: the user provides constraints on what to mine
  - Optimization: the system exploits the constraints for efficient mining; this constraint pushing is similar to pushing selections early in DB query processing
- Note: constraint-based mining still finds all the answers satisfying the constraints; it does not merely find some answers the way a heuristic search would

Constrained Mining vs. Constraint-Based Search
- Constrained mining vs. constraint-based search/reasoning in AI
  - Both aim at reducing the search space
  - Constrained mining finds all patterns satisfying the constraints, whereas constraint-based search in AI finds some (or one) answer
  - Constraint pushing vs. heuristic search; how to integrate the two is an interesting research problem
- Constrained mining vs. query processing in a DBMS
  - Database query processing also requires finding all answers
  - Constrained pattern mining shares a similar philosophy with pushing selections deeply into query processing
- Do constraints make an (optimization or mining) problem harder or easier? In general, easier, since they prune the search space (unless there is no constraint at all, in which case there is nothing to prune)

Constraint-Based Frequent Pattern Mining
- Pattern space pruning constraints:
  - Anti-monotone: if constraint c is violated, further mining along that branch can be terminated
  - Monotone: if c is satisfied, there is no need to check c again on supersets
  - Succinct: c must be satisfied, so one can start directly from the item sets satisfying c
  - Convertible: c is neither monotone nor anti-monotone, but it can be converted into one of them if the items in a transaction are properly ordered
- Data space pruning constraints:
  - Data succinct: the data space can be pruned at the start of the mining process
  - Data anti-monotone: if a transaction t does not satisfy c, t can be pruned from further mining

Anti-Monotonicity in Constraint Pushing

TDB (min_sup = 2):
TID | Transaction
10 | a, b, c, d, f
20 | b, c, d, f, g, h
30 | a, c, d, e, f
40 | c, e, f, g

Item | Profit
a | 40
b | 0
c | -20
d | 10
e | -30
f | 30
g | 20
h | -10

- Anti-monotonicity: when an itemset S violates the constraint, so does any superset of S
- sum(S.Price) ≤ v is anti-monotone (for non-negative prices); sum(S.Price) ≥ v is not anti-monotone
- C: range(S.profit) ≤ 15 is anti-monotone: itemset ab violates C (range = 40 - 0 = 40 > 15), and so does every superset of ab
- support count ≥ min_sup is anti-monotone: this is the core property used in Apriori
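A small sketch (not from the slides) of this pruning check, using the profit table above:

```python
# Check the anti-monotone constraint C: range(S.profit) <= 15.
# Once an itemset violates C, every superset violates it too,
# so the whole branch can be pruned.
profit = {"a": 40, "b": 0, "c": -20, "d": 10,
          "e": -30, "f": 30, "g": 20, "h": -10}

def satisfies_C(itemset, v=15):
    values = [profit[i] for i in itemset]
    return max(values) - min(values) <= v

print(satisfies_C({"a", "b"}))        # False: range = 40, prune ab ...
print(satisfies_C({"a", "b", "c"}))   # ... and every superset stays violated
print(satisfies_C({"d", "g"}))        # True: range = 10
```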

Monotonicity for Constraint Pushing
- Monotonicity: when an itemset S satisfies the constraint, so does any superset of S
- sum(S.Price) ≥ v is monotone (for non-negative prices); min(S.Price) ≤ v is monotone
- C: range(S.profit) ≥ 15 is monotone: itemset ab satisfies C, and so does every superset of ab (using the same profit table as above)

Succinctness
- Given A1, the set of items satisfying a succinct constraint C, any set S satisfying C is based on A1, i.e., S contains a subset belonging to A1
- Idea: whether an itemset S satisfies constraint C can be determined from the selection of items alone, without looking at the transaction database
- If a constraint is succinct, we can directly generate precisely the sets that satisfy it, even before support counting begins; this avoids the substantial overhead of generate-and-test, i.e., such a constraint is pre-counting pushable
- min(S.Price) ≤ v is succinct; sum(S.Price) ≥ v is not succinct
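A sketch (with a hypothetical price table, not from the slides) of why min(S.Price) ≤ v is succinct — satisfaction can be decided, and satisfying sets generated, from the items alone:

```python
# min(S.Price) <= v is succinct: S satisfies it iff S contains at least
# one item priced <= v, so the satisfying sets can be characterized
# before any support counting.
price = {"a": 40, "b": 5, "c": 20, "d": 10}   # hypothetical prices
v = 10
A1 = {i for i, p in price.items() if p <= v}  # items witnessing min <= v
print(A1)                                     # {'b', 'd'}

def satisfies(itemset):
    # Equivalent to min(price[i] for i in itemset) <= v.
    return bool(set(itemset) & A1)

print(satisfies({"a", "b"}))   # True  (b costs 5)
print(satisfies({"a", "c"}))   # False (cheapest item costs 20)
```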

Constraint-Based Mining: A General Picture

Constraint | Anti-monotone | Monotone | Succinct
v ∈ S | no | yes | yes
S ⊇ V | no | yes | yes
S ⊆ V | yes | no | yes
min(S) ≤ v | no | yes | yes
min(S) ≥ v | yes | no | yes
max(S) ≤ v | yes | no | yes
max(S) ≥ v | no | yes | yes
count(S) ≤ v | yes | no | weakly
count(S) ≥ v | no | yes | weakly
sum(S) ≤ v (∀a ∈ S, a ≥ 0) | yes | no | no
sum(S) ≥ v (∀a ∈ S, a ≥ 0) | no | yes | no
range(S) ≤ v | yes | no | no
range(S) ≥ v | no | yes | no
avg(S) θ v, θ ∈ {=, ≤, ≥} | convertible | convertible | no
support(S) ≥ ξ | yes | no | no
support(S) ≤ ξ | no | yes | no

Chapter 6: Mining Frequent Patterns, Association and Correlations
- Basic concepts
- Frequent itemset mining methods
- Constraint-based frequent pattern mining (Chapter 7)
- Association rules

Basic Concepts: Association Rules
- An association rule has the form X ⇒ Y [minimum support, minimum confidence], where X, Y ⊆ I and X ∩ Y = ∅
- support(X ⇒ Y): the probability that a transaction contains X ∪ Y, i.e., support(X ⇒ Y) = P(X ∪ Y)
  - Estimated as the percentage of transactions in the DB that contain X ∪ Y; here P(X ∪ Y) means the probability of a transaction containing both X and Y, not to be confused with P(X or Y)
- confidence(X ⇒ Y): the conditional probability that a transaction containing X also contains Y:
  confidence(X ⇒ Y) = P(Y | X) = support(X ∪ Y) / support(X) = support_count(X ∪ Y) / support_count(X)
- Since confidence is easily derived from the support counts of X and X ∪ Y, association rule mining can be reduced to frequent pattern mining

Basic Concepts: Association Rules (Example)

Tid | Items bought
10 | Beer, Nuts, Diaper
20 | Beer, Coffee, Diaper
30 | Beer, Diaper, Eggs
40 | Nuts, Eggs, Milk
50 | Nuts, Coffee, Diaper, Eggs, Milk

- Let minsup = 50%, minconf = 50%
- Frequent patterns: Beer: 3, Nuts: 3, Diaper: 4, Eggs: 3, {Beer, Diaper}: 3
- Association rules (among many more):
  - Beer ⇒ Diaper (support 60%, confidence 100%)
  - Diaper ⇒ Beer (support 60%, confidence 75%)
- Questions (the answer to all three is no):
  - If {a} ⇒ {b} is an association rule, is {b} ⇒ {a} also one? They have the same support but different confidence
  - If {a, b} ⇒ {c} is an association rule, is {b} ⇒ {c} also one?
  - If {b} ⇒ {c} is an association rule, is {a, b} ⇒ {c} also one?
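A small sketch (not from the slides) computing the support and confidence of the two rules above from this transaction database:

```python
db = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def support(itemset):
    """Fraction of transactions containing the itemset."""
    return sum(itemset <= t for t in db) / len(db)

def rule(X, Y):
    sup = support(X | Y)          # support of the rule X => Y
    conf = sup / support(X)       # confidence = P(Y | X)
    print(f"{sorted(X)} => {sorted(Y)}: "
          f"support {sup:.0%}, confidence {conf:.0%}")

rule({"Beer"}, {"Diaper"})   # support 60%, confidence 100%
rule({"Diaper"}, {"Beer"})   # support 60%, confidence 75%
```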

Interestingness Measure: Correlations (Lift)
- play basketball ⇒ eat cereal [support 40%, confidence 66.7%] is misleading: the overall share of students eating cereal is 75% > 66.7%
- play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although it has lower support and confidence
- Support and confidence alone are not good indicators of correlation
- A measure of dependent/correlated events: lift, lift(A, B) = P(A ∪ B) / (P(A) P(B))

| Basketball | Not basketball | Sum (row)
Cereal | 2000 | 1750 | 3750
Not cereal | 1000 | 250 | 1250
Sum (col.) | 3000 | 2000 | 5000

- Note: here A and B are sets of items (basketball, cereal, ...), not sets of transactions (students), so the event is A ∪ B, a transaction containing both A and B, not A ∩ B
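A worked check of these numbers (a sketch, not from the slides), using the 5000-student contingency table above:

```python
# Lift on the basketball/cereal contingency table.
n = 5000
P_b = 3000 / n     # P(basketball)
P_c = 3750 / n     # P(cereal)
P_bc = 2000 / n    # P(basketball and cereal)

lift_cereal = P_bc / (P_b * P_c)
print(f"lift(basketball, cereal) = {lift_cereal:.2f}")
# 0.89 < 1: negatively correlated, despite 66.7% confidence

P_not_c = 1250 / n
P_b_not_c = 1000 / n
lift_not_cereal = P_b_not_c / (P_b * P_not_c)
print(f"lift(basketball, not cereal) = {lift_not_cereal:.2f}")
# 1.33 > 1: positively correlated, despite lower support and confidence
```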