COMP53311 Association Rule Mining Prepared by Raymond Wong Presented by Raymond Wong

Slides:



Advertisements
Similar presentations
Association Rules Evgueni Smirnov.
Advertisements

Association Rule and Sequential Pattern Mining for Episode Extraction Jonathan Yip.
Frequent Itemset Mining Methods. The Apriori algorithm Finding frequent itemsets using candidate generation Seminal algorithm proposed by R. Agrawal and.
Association Rule Mining. 2 The Task Two ways of defining the task General –Input: A collection of instances –Output: rules to predict the values of any.
1 of 25 1 of 45 Association Rule Mining CIT366: Data Mining & Data Warehousing Instructor: Bajuna Salehe The Institute of Finance Management: Computing.
Data Mining Association Analysis: Basic Concepts and Algorithms
Association Rules l Mining Association Rules between Sets of Items in Large Databases (R. Agrawal, T. Imielinski & A. Swami) l Fast Algorithms for.
Data Mining Techniques Cluster Analysis Induction Neural Networks OLAP Data Visualization.
Rakesh Agrawal Ramakrishnan Srikant
Association Analysis. Association Rule Mining: Definition Given a set of records each of which contain some number of items from a given collection; –Produce.
Chapter 5: Mining Frequent Patterns, Association and Correlations
Data Mining Techniques So Far: Cluster analysis K-means Classification Decision Trees J48 (C4.5) Rule-based classification JRIP (RIPPER) Logistic Regression.
Data Mining Association Analysis: Basic Concepts and Algorithms Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
732A02 Data Mining - Clustering and Association Analysis ………………… Jose M. Peña Association rules Apriori algorithm FP grow algorithm.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Data Mining Association Analysis: Basic Concepts and Algorithms
4/3/01CS632 - Data Mining1 Data Mining Presented By: Kevin Seng.
Association Analysis: Basic Concepts and Algorithms.
Data Mining Association Analysis: Basic Concepts and Algorithms
Fast Algorithms for Mining Association Rules * CS401 Final Presentation Presented by Lin Yang University of Missouri-Rolla * Rakesh Agrawal, Ramakrishnam.
Mining Association Rules in Large Databases
Mining Association Rules in Large Databases
6/23/2015CSE591: Data Mining by H. Liu1 Association Rules Transactional data Algorithm Applications.
1 ACCTG 6910 Building Enterprise & Business Intelligence Systems (e.bis) Association Rule Mining Olivia R. Liu Sheng, Ph.D. Emma Eccles Jones Presidential.
© Vipin Kumar CSci 8980 Fall CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance Computing Research Center Department of Computer.
Fast Algorithms for Association Rule Mining
Lecture14: Association Rules
Mining Association Rules
Mining Association Rules
SEG Tutorial 2 – Frequent Pattern Mining.
Mining Association Rules in Large Databases. What Is Association Rule Mining?  Association rule mining: Finding frequent patterns, associations, correlations,
Pattern Recognition Lecture 20: Data Mining 3 Dr. Richard Spillman Pacific Lutheran University.
Association Discovery from Databases Association rules are a simple formalism for expressing positive connections between columns in a 0/1 matrix. A classical.
Eick, Tan, Steinbach, Kumar: Association Analysis Part1 Organization “Association Analysis” 1. What is Association Analysis? 2. Association Rules 3. The.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining By Tan, Steinbach, Kumar Lecture.
Modul 7: Association Analysis. 2 Association Rule Mining  Given a set of transactions, find rules that will predict the occurrence of an item based on.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
DATA MINING LECTURE 3 Frequent Itemsets Association Rules.
Data Mining Association Analysis Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/
Frequent Item Mining. What is data mining? =Pattern Mining? What patterns? Why are they useful?
CS 8751 ML & KDDSupport Vector Machines1 Mining Association Rules KDD from a DBMS point of view –The importance of efficiency Market basket analysis Association.
CSE4334/5334 DATA MINING CSE4334/5334 Data Mining, Fall 2014 Department of Computer Science and Engineering, University of Texas at Arlington Chengkai.
Association Rule Mining Data Mining and Knowledge Discovery Prof. Carolina Ruiz and Weiyang Lin Department of Computer Science Worcester Polytechnic Institute.
Data Mining Find information from data data ? information.
Association Rule Mining
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining: Association Analysis This lecture node is modified based on Lecture Notes for.
Mining Frequent Patterns, Associations, and Correlations Compiled By: Umair Yaqub Lecturer Govt. Murray College Sialkot.
Dept. of Information Management, Tamkang University
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
A Scalable Association Rules Mining Algorithm Based on Sorting, Indexing and Trimming Chuang-Kai Chiou, Judy C. R Tseng Proceedings of the Sixth International.
Chapter 8 Association Rules. Data Warehouse and Data Mining Chapter 10 2 Content Association rule mining Mining single-dimensional Boolean association.
Chap 6: Association Rules. Rule Rules!  Motivation ~ recent progress in data mining + warehousing have made it possible to collect HUGE amount of data.
Data Mining Association Rules Mining Frequent Itemset Mining Support and Confidence Apriori Approach.
CS685 : Special Topics in Data Mining, UKY The UNIVERSITY of KENTUCKY Association Rule Mining CS 685: Special Topics in Data Mining Jinze Liu.
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Association Rule Mining COMP Seminar BCB 713 Module Spring 2011.
Introduction to Machine Learning Lecture 13 Introduction to Association Rules Albert Orriols i Puig Artificial.
1 Data Mining Lecture 6: Association Analysis. 2 Association Rule Mining l Given a set of transactions, find rules that will predict the occurrence of.
Data Mining – Association Rules
Data Mining Find information from data data ? information.
Association rule mining
Association Rules Repoussis Panagiotis.
Frequent Pattern Mining
Data Mining Association Analysis: Basic Concepts and Algorithms
A Parameterised Algorithm for Mining Association Rules
Data Mining Association Analysis: Basic Concepts and Algorithms
Association Rule Mining
COMP5331 FP-Tree Prepared by Raymond Wong Presented by Raymond Wong
Association Analysis: Basic Concepts
Presentation transcript:

COMP53311 Association Rule Mining Prepared by Raymond Wong Presented by Raymond Wong

COMP53312 Introduction Raymond David Emily Derek Supermarket Application apple coke coffee coke diaper … coke biscuit milk … … We want to find some associations between items. An interesting association: Diaper and Beer are usually bought together. beer diaper Why? Is it strange? Item History or Transaction

COMP53313 Introduction Supermarket Application An interesting association: Diaper and Beer are usually bought together. beer diaper Why? Is it strange?

COMP53314 Introduction Supermarket Application An interesting association: Diaper and Beer are usually bought together. beer diaper Why? Is it strange? Reasons: This pattern occurs frequently in the early evening. Office Daytime Working …

COMP53315 Introduction Supermarket Application An interesting association: Diaper and Beer are usually bought together. beer diaper Why? Is it strange? Reasons: This pattern occurs frequently in the early evening. Office Early Evening Home Morning Please buy diapers

COMP53316 Introduction Applications of Association Rule Mining Supermarket Web Mining Medical analysis Bioinformatics Network analysis (e.g., Denial-of-service (DoS)) Programming Pattern Finding

COMP53317 Outline Association Rule Mining Problem Definition NP-hardness Algorithm Apriori Properties Algorithm

COMP53318 Association Rule Mining TIDABCDE t t t t t A, D A, B, D, E B, C A, B, C, D, E B, C, E Single Items (or simply items): ABCDE Itemsets:{B, C}{A, B, C}{B, C, D}{A} 2-itemset3-itemset 1-itemset

COMP53319 Association Rule Mining TIDABCDE t t t t t Single Items (or simply items): ABCDE Itemsets:{B, C}{A, B, C}{B, C, D}{A} Support = 3 Support = 4 Support = 3Support = 1 Large itemsets: itemsets with support >= a threshold (e.g., 3) e.g., {A}, {B}, {B, C} but NOT {A, B, C} Frequent itemsets 3-frequent itemset of size 2 1-frequent itemset of size 3

COMP Association Rule Mining TIDABCDE t t t t t Association rules: {B, C}  E Support = 2

COMP Association Rule Mining TIDABCDE t t t t t Association rules: {B, C}  E Support = 2 Confidence = 2/3 = 66.7%

COMP Association Rule Mining TIDABCDE t t t t t Association rules: {B, C}  E Support = 2 Confidence = 2/3 = 66.7% B  C Support = 3

COMP Association Rule Mining TIDABCDE t t t t t Association rules: {B, C}  E Support = 2 Confidence = 2/3 = 66.7% B  C Support = 3 Confidence= 3/4 = 75%

COMP Association Rule Mining TIDABCDE t t t t t Problem: We want to find some “ interesting ” association rules {B, C}  E B  C Support = 2 Confidence = 2/3 = 66.7% Support = 3 Confidence = 3/4 = 75% … Association rules with 1. Support >= a threshold (e.g., 3) 2. Confidence >= another threshold (e.g., 50%) How can we find all “ interesting ” association rules? Step 1: to find all “ large ” itemsets (i.e., itemsets with support >= 3) (e.g., itemset {B, C} has support = 3) Step 2: to find all “ interesting ” rules after Step 1 - from all “ large ” itemsets find the association rule with confidence >= 50%

COMP Outline Association Rule Mining Problem Definition NP-hardness Algorithm Apriori Properties Algorithm

COMP NP-Completeness Step 1: to find all “ large ” itemsets (i.e., itemsets with support >= 3) (e.g., itemset {B, C} has support = 3) Step 2: to find all “ interesting ” rules after Step 1 - from all “ large ” itemsets find the association rule with confidence >= 50% Problem: to find all “ large ” itemsets (i.e., itemsets with support >= 3) Problem: to find all “ large ” J-itemsets for each positive integer J (i.e., J-itemsets with support >= 3)

COMP NP-Completeness Finding Large J-itemsets INSTANCE: Given a database of transaction records QUESTION: Is there an f-frequent itemset of size J? EggRiceOilJuice

COMP NP-Completeness Balanced Complete Bipartite Subgraph INSTANCE: Bipartite graph G = (V, E), positive integer K  |V| QUESTION: Are there two disjoint subsets V 1, V 2  V such that |V 1 | = |V 2 | = K and such that, for each u  V 1 and each v  V 2, {u, v}  E? A B C D E F G H A B C D E F G H NP-complete problem

COMP NP-Completeness We can transform the graph problem into itemset problem. For each vertex in V 1, create a transaction For each vertex in V 2, create an item For each edge (u, v), create a purchase of item v in transaction u f  K J  K Is there a K-frequent itemset of size K? A B C D E F G H ABCD E1110 F1111 G1110 H0010

COMP NP-Completeness It is easy to verify that solving the problem Finding Large K-itemsets is equal to solving problem Balanced Complete Bipartite Subgraph Finding Large K-itemsets is NP-hard.

COMP Methods to prove that a problem P is NP-hard Step 1: Find an existing NP-complete problem (e.g., complete bipartite graph) Step 2: Transform this NP-complete problem to P (in polynomial-time) Step 3: Show that solving the “ transformed ” problem is equal to solving “ original ” NP-complete problem

COMP Outline Association Rule Mining Problem Definition NP-hardness Algorithm Apriori Properties Algorithm

COMP Apriori TIDABCDE t t t t t Suppose we want to find all “ large ” itemsets (e.g., itemsets with support >= 3) {B, C} is large Support of {B, C} = 3 Is {B} large? Is {C} large? Property 1: If an itemset S is large, then any proper subset of S must be large.

COMP Apriori TIDABCDE t t t t t Suppose we want to find all “ large ” itemsets (e.g., itemsets with support >= 3) {B, C, E} is NOT large Support of {B, C, E} = 2 Is {A, B, C, E} large? Is {B, C, D, E} large? Property 2: If an itemset S is NOT large, then any proper superset of S must NOT be large.

COMP Apriori Property 1: If an itemset S is large, then any proper subset of S must be large. Property 2: If an itemset S is NOT large, then any proper superset of S must NOT be large.

COMP Outline Association Rule Mining Problem Definition NP-hardness Algorithm Apriori Properties Algorithm

COMP Apriori TIDABCDE t t t t t ItemCount A B C D E 3

COMP Apriori TIDABCDE t t t t t ItemCount A3 B4 C3 D3 E3 Suppose we want to find all “ large ” itemsets (e.g., itemsets with support >= 3) Thus, {A}, {B}, {C}, {D} and {E} are “ large ” itemsets of size 1 (or, “ large ” 1-itemsets). We set L 1 = {{A}, {B}, {C}, {D}, {E}}

COMP Apriori TIDABCDE t t t t t Suppose we want to find all “ large ” itemsets (e.g., itemsets with support >= 3) L1L1 C2C2 L2L2 Candidate Generation “ Large ” Itemset Generation Thus, {A}, {B}, {C}, {D} and {E} are “ large ” itemsets of size 1 (or, “ large ” 1-itemsets). We set L 1 = {{A}, {B}, {C}, {D}, {E}} Large 2-itemset Generation C3C3 L3L3 Candidate Generation “ Large ” Itemset Generation Large 3-itemset Generation …

COMP Apriori TIDABCDE t t t t t Suppose we want to find all “ large ” itemsets (e.g., itemsets with support >= 3) L1L1 C2C2 L2L2 Candidate Generation “ Large ” Itemset Generation Thus, {A}, {B}, {C}, {D} and {E} are “ large ” itemsets of size 1 (or, “ large ” 1-itemsets). We set L 1 = {{A}, {B}, {C}, {D}, {E}} Large 2-itemset Generation C3C3 L3L3 Candidate Generation “ Large ” Itemset Generation Large 3-itemset Generation … 1.Join Step 2.Prune Step Counting Step

COMP Candidate Generation Join Step Prune Step

COMP Join Step TIDABCDE t t t t t Property 1: If an itemset S is large, then any proper subset of S must be large. Property 2: If an itemset S is NOT large, then any proper superset of S must NOT be large. Suppose we know that itemset {B, C} and itemset {B, E} are large (i.e., L 2 ). It is possible that itemset {B, C, E} is also large (i.e., C 3 ).

COMP Join Step Input: L k-1, a set of all large (k-1)-itemsets Output: C k, a set of candidates k-itemsets Algorithm: insert into C k select p.item 1, p.item 2, …, p.item k-1, q.item k-1 from L k-1 p, L k-1 q where p.item 1 = q.item 1, p.item 2 = q.item 2, … p.item k-2 = q.item k-2, p.item k-1 < q.item k-1

COMP Prune Step TIDABCDE t t t t t Property 1: If an itemset S is large, then any proper subset of S must be large. Property 2: If an itemset S is NOT large, then any proper superset of S must NOT be large. Suppose we know that itemset {B, C} and itemset {B, E} are large (i.e., L 2 ). It is possible that itemset {B, C, E} is also large (i.e., C 3 ). Suppose we know that {C, E} is not large. We can prune {B, C, E} in C 3.

COMP Prune Step forall itemsets c  C k (from Join Step) do for all (k-1)-subsets s of c do if (s not in L k-1 ) then delete c from C k

COMP Apriori TIDABCDE t t t t t Suppose we want to find all “ large ” itemsets (e.g., itemsets with support >= 3) L1L1 C2C2 L2L2 Candidate Generation “ Large ” Itemset Generation Thus, {A}, {B}, {C}, {D} and {E} are “ large ” itemsets of size 1 (or, “ large ” 1-itemsets). We set L 1 = {{A}, {B}, {C}, {D}, {E}} Large 2-itemset Generation C3C3 L3L3 Candidate Generation “ Large ” Itemset Generation Large 3-itemset Generation … 1.Join Step 2.Prune Step Counting Step

COMP Counting Step After the candidate generation (i.e., Join Step and Prune Step), we are given a set of candidate itemsets We need to verify whether these candidate itemsets are large or not We have to scan the database to obtain the count of each itemset in the candidate set. Algorithm For each itemset c in C k, obtain the count of c (from the database) If the count of c is smaller than a given threshold, remove it from C k The remaining itemsets in C k correspond to L k