Mining Association Rules in Large Databases

What Is Association Rule Mining?  Association rule mining: Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transactional databases, relational databases, and other information repositories  Motivation (market basket analysis): If customers are buying milk, how likely is that they also buy bread? Such rules help retailers to:  plan the shelf space: by placing milk close to bread they may increase the sales  provide advertisements/recommendation to customers that are likely to buy some products  put items that are likely to be bought together on discount, in order to increase the sales

What Is Association Rule Mining?  Applications: Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.  Rule form: “Body ead [support, confidence]”.  Examples. buys(x, “diapers”)  buys(x, “beers”) [0.5%, 60%] major(x, “CS”) ^ takes(x, “DB”) grade(x, “A”) [1%, 75%]

Association Rules: Basic Concepts
- Given: (1) a database of transactions; (2) each transaction is a list of items (purchased by a customer in one visit)
- Find: all rules that correlate the presence of one set of items with that of another set of items
  - E.g., 98% of people who purchase tires and auto accessories also get automotive services done

What Are the Components of Rules?
- In data mining, a set of items is referred to as an itemset.
- Let D be a database of transactions.
- Let I be the set of items that appear in the database, e.g., I = {A,B,C,D,E,F}.
- A rule is defined as X → Y, where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅.
  - E.g., {B,C} → {E} is a rule.
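Under these definitions, a rule's well-formedness is easy to check in code. The sketch below models itemsets as Python frozensets; the names I, X, Y mirror the slide, and is_valid_rule is an illustrative helper, not part of the slides.

```python
# Itemsets and rules modeled with frozensets (illustrative).
I = frozenset("ABCDEF")       # the set of all items
X = frozenset({"B", "C"})     # rule body (antecedent)
Y = frozenset({"E"})          # rule head (consequent)

def is_valid_rule(X, Y, I):
    """A rule X -> Y requires X ⊆ I, Y ⊆ I, and X ∩ Y = ∅."""
    return X <= I and Y <= I and not (X & Y)

print(is_valid_rule(X, Y, I))                  # {B,C} -> {E} is a rule
print(is_valid_rule(X, frozenset("B"), I))     # head overlaps body: not a rule
```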

Are All the Rules Interesting?
- The number of potential rules is huge; we may not be interested in all of them.
- We are interested in rules that:
  - have items that appear frequently in the database
  - hold with a high probability
- We use the following thresholds:
  - the support of a rule indicates how frequently its items appear in the database
  - the confidence of a rule indicates the probability that if the left-hand side appears in a transaction, the right-hand side appears in it as well

Rule Measures: Support and Confidence
- Find all the rules X → Y with minimum confidence and support:
  - support, s: the probability that a transaction contains X ∪ Y
  - confidence, c: the conditional probability that a transaction having X also contains Y
- With minimum support 50% and minimum confidence 50%, we have:
  - A → C (50%, 66.6%)
  - C → A (50%, 100%)
(Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both.)
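Both measures are straightforward to compute. The sketch below uses an assumed four-transaction database (the slide's original table did not survive transcription); it is chosen to reproduce the quoted numbers for A → C.

```python
# Support and confidence over a small assumed database; these four
# transactions reproduce the slide's numbers: A -> C has support 50%
# and confidence 66.6%.
D = [
    {"A", "B", "C"},   # TID 10
    {"A", "C"},        # TID 20
    {"A", "D"},        # TID 30
    {"B", "E", "F"},   # TID 40
]

def support(itemset, D):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(itemset <= t for t in D) / len(D)

def confidence(X, Y, D):
    """P(Y ⊆ t | X ⊆ t) = support(X ∪ Y) / support(X)."""
    return support(X | Y, D) / support(X, D)

print(support({"A", "C"}, D))        # 0.5
print(confidence({"A"}, {"C"}, D))   # 0.666...
```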

Example
TID   date      items_bought
100   10/10/99  {F,A,D,B}
200   15/10/99  {D,A,C,E,B}
300   19/10/99  {C,A,B,E}
400   20/10/99  {B,A,D}
- What are the support and confidence of the rule {B,D} → {A}?
  - Support: percentage of tuples that contain {A,B,D} = 3/4 = 75%
  - Confidence: support({A,B,D}) / support({B,D}) = 75% / 75% = 100%
- Remember: conf(X → Y) = support(X ∪ Y) / support(X)
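The numbers above can be checked mechanically; this sketch re-uses the four transactions from the table (support is a hypothetical helper, as before).

```python
# Verifying the rule {B,D} -> {A} on the table's four transactions.
D = [
    {"F", "A", "D", "B"},       # TID 100
    {"D", "A", "C", "E", "B"},  # TID 200
    {"C", "A", "B", "E"},       # TID 300
    {"B", "A", "D"},            # TID 400
]

def support(itemset, D):
    return sum(itemset <= t for t in D) / len(D)

sup = support({"A", "B", "D"}, D)     # 3 of 4 transactions
conf = sup / support({"B", "D"}, D)   # 0.75 / 0.75
print(f"support = {sup:.0%}, confidence = {conf:.0%}")
```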

Association Rule Mining  Boolean vs. quantitative associations (Based on the types of values handled) buys(x, “SQLServer”) ^ buys(x, “DMBook”) buys(x, “DBMiner”) [0.2%, 60%] age(x, “30..39”) ^ income(x, “42..48K”) buys(x, “PC”) [1%, 75%]  Single dimension vs. multiple dimensional associations (see ex. Above)  Single level vs. multiple-level analysis age(x, “30..39”) buys(x, “laptop”) age(x, “30..39”) buys(x, “computer”)  Various extensions Correlation, causality analysis  Association does not necessarily imply correlation or causality Maximal frequent itemsets: no frequent supersets frequent closed itemsets: no superset with the same support

Mining Association Rules: An Example
Min. support 50%; min. confidence 50%.
For the rule A → C:
- support = support({A,C}) = 50%
- confidence = support({A,C}) / support({A}) = 66.6%
The Apriori principle: any subset of a frequent itemset must be frequent!

Mining Frequent Itemsets: the Key Step
- Find the frequent itemsets: the sets of items that have minimum support.
  - A subset of a frequent itemset must also be a frequent itemset, i.e., if {A,B} is a frequent itemset, both {A} and {B} must be frequent itemsets.
  - Iteratively find frequent itemsets with cardinality from 1 to m (m-itemsets): use the frequent k-itemsets to explore the (k+1)-itemsets.
- Use the frequent itemsets to generate association rules.
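The last step, generating rules from frequent itemsets, can be sketched as follows: every non-empty proper subset X of a frequent itemset yields a candidate rule X → (itemset − X), kept when its confidence meets the threshold. The small support table below is illustrative, matching the A → C example's numbers; rules_from_itemset is a hypothetical helper.

```python
from itertools import combinations

# Assumed supports for one frequent itemset {A,C} and its subsets,
# matching the slide's example (support(A)=75%, support(C)=50%).
supports = {
    frozenset("A"): 0.75,
    frozenset("C"): 0.50,
    frozenset("AC"): 0.50,
}

def rules_from_itemset(itemset, supports, min_conf):
    """Emit every rule X -> itemset-X with confidence >= min_conf."""
    out = []
    for r in range(1, len(itemset)):
        for body in combinations(sorted(itemset), r):
            X = frozenset(body)
            conf = supports[itemset] / supports[X]
            if conf >= min_conf:
                out.append((set(X), set(itemset - X), conf))
    return out

for X, Y, conf in rules_from_itemset(frozenset("AC"), supports, 0.5):
    print(X, "->", Y, f"conf={conf:.1%}")   # A->C (66.7%) and C->A (100%)
```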

Visualization of the level-wise process:
- Level 1 (frequent items): {A}, {B}, {D}, {F}
- Level 2 (frequent pairs): {A,B}, {A,D}, {B,F}, {B,D}, {D,F}
- Level 3 (frequent triplets): {A,B,D}, {B,D,F}
- Level 4 (frequent quadruples): {}
Remember: all subsets of a frequent itemset must be frequent.
Question: can {A,D,F} be frequent? No, because {A,F} is not frequent.

The Apriori Algorithm (the general idea)
1. Find the frequent items and put them into Lk (k = 1).
2. Use Lk to generate a collection of candidate itemsets Ck+1 of size k+1.
3. Scan the database to find which itemsets in Ck+1 are frequent, and put them into Lk+1.
4. If Lk+1 is not empty, set k = k+1 and go to step 2.
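The four steps can be sketched end to end. This is a minimal, unoptimized rendering under assumed names; the sample database and min_sup = 2 anticipate the worked example later in the deck.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {itemset: count} for all frequent itemsets (absolute min_sup)."""
    # Step 1: frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            s = frozenset([item])
            counts[s] = counts.get(s, 0) + 1
    L = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(L)
    k = 1
    while L:
        # Step 2: candidate generation (join Lk with itself, then prune).
        keys = sorted(L, key=sorted)
        C = set()
        for i in range(len(keys)):
            for j in range(i + 1, len(keys)):
                cand = keys[i] | keys[j]
                if len(cand) == k + 1 and all(
                    frozenset(s) in L for s in combinations(cand, k)
                ):
                    C.add(cand)
        # Step 3: count the candidates in one database scan.
        counts = {c: 0 for c in C}
        for t in transactions:
            for c in C:
                if c <= t:
                    counts[c] += 1
        L = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(L)
        k += 1            # Step 4: loop while Lk+1 is non-empty
    return frequent

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(apriori(D, min_sup=2))   # all frequent itemsets with their counts
```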

The Apriori Algorithm  Pseudo-code: C k : Candidate itemset of size k L k : frequent itemset of size k L 1 = {frequent items}; for (k = 1; L k !=; k++) do begin C k+1 = candidates generated from L k ; for each transaction t in database do increment the count of all candidates in C k+1 that are contained in t L k+1 = candidates in C k+1 with min_support (frequent) end return  k L k ;  Important steps in candidate generation: Join Step: C k+1 is generated by joining L k with itself Prune Step: Any k-itemset that is not frequent cannot be a subset of a frequent (k+1)-itemset

The Apriori Algorithm: Example
(Figure: a trace of Apriori on a database D with min_sup = 2 (50%): scanning D yields C1, filtered to L1; joining L1 gives C2, which is counted and filtered to L2; joining L2 gives C3, filtered to L3.)

How to Generate Candidates?
- Suppose the items in Lk are listed in an order.
- Step 1: self-joining Lk (in SQL):

  insert into Ck+1
  select p.item1, p.item2, ..., p.itemk, q.itemk
  from Lk p, Lk q
  where p.item1 = q.item1, ..., p.itemk-1 = q.itemk-1, p.itemk < q.itemk

- Step 2: pruning:

  forall itemsets c in Ck+1 do
    forall k-subsets s of c do
      if (s is not in Lk) then delete c from Ck+1
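The same join-and-prune logic can be mirrored outside SQL. In this sketch itemsets are kept as sorted tuples, so the "first k−1 items equal, last item smaller" join condition is direct; generate_candidates is an illustrative helper.

```python
from itertools import combinations

def generate_candidates(L_k):
    """Join: merge two k-itemsets sharing their first k-1 items.
    Prune: drop any candidate with an infrequent k-subset."""
    L = sorted(L_k)            # itemsets as sorted tuples
    Lset = set(L)
    C = []
    for i in range(len(L)):
        for j in range(i + 1, len(L)):
            p, q = L[i], L[j]
            if p[:-1] == q[:-1] and p[-1] < q[-1]:   # join condition
                cand = p + (q[-1],)
                if all(s in Lset for s in combinations(cand, len(p))):
                    C.append(cand)                   # survived pruning
    return C

# The deck's example: L3 = {abc, abd, acd, ace, bcd}
L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")]
print(generate_candidates(L3))   # [('a', 'b', 'c', 'd')]; acde pruned (ade not in L3)
```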

Example of Candidate Generation
- L3 = {abc, abd, acd, ace, bcd}
- Self-joining: L3 * L3
  - abcd, from abc and abd
  - acde, from acd and ace
- Pruning:
  - acde is removed because its subset ade is not in L3
- C4 = {abcd}

How to Count Supports of Candidates?
- Why is counting the supports of candidates a problem?
  - The total number of candidates can be huge.
  - One transaction may contain many candidates.
- Method:
  - Candidate itemsets are stored in a hash tree.
  - A leaf node of the hash tree contains a list of itemsets and counts.
  - An interior node contains a hash table.
  - The subset function finds all the candidates contained in a transaction.
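A minimal hash tree can be sketched as follows. The hash function item mod 3 follows the figure on the next slides; the leaf capacity is an assumed parameter, and the structure is simplified relative to a production implementation.

```python
# Minimal hash-tree sketch for the subset-counting step (illustrative:
# hash h(item) = item mod 3; leaves split when they exceed LEAF_CAP).
LEAF_CAP = 3

class Node:
    def __init__(self, depth=0):
        self.depth = depth
        self.children = {}   # interior node: hash bucket -> child Node
        self.itemsets = []   # leaf node: [candidate, count] pairs

    def insert(self, itemset):
        if self.children:                      # interior: route by hash
            b = itemset[self.depth] % 3
            self.children.setdefault(b, Node(self.depth + 1)).insert(itemset)
            return
        self.itemsets.append([itemset, 0])
        if len(self.itemsets) > LEAF_CAP and self.depth < len(itemset):
            for entry in self.itemsets:        # overflow: split the leaf
                b = entry[0][self.depth] % 3
                self.children.setdefault(b, Node(self.depth + 1)).itemsets.append(entry)
            self.itemsets = []

    def count(self, transaction):
        """Increment every stored candidate contained in the transaction."""
        if self.children:
            for b in {item % 3 for item in transaction}:
                if b in self.children:
                    self.children[b].count(transaction)
        else:
            tset = set(transaction)
            for entry in self.itemsets:
                if set(entry[0]) <= tset:
                    entry[1] += 1

def all_counts(node, out):
    """Collect {candidate: count} from every leaf."""
    for child in node.children.values():
        all_counts(child, out)
    for itemset, c in node.itemsets:
        out[itemset] = c
    return out

root = Node()
for cand in [(1, 2, 3), (1, 2, 5), (1, 3, 5), (2, 3, 5)]:
    root.insert(cand)
for t in [(1, 2, 3, 5), (2, 3, 5)]:
    root.count(t)
print(all_counts(root, {}))   # (2,3,5) seen twice, the others once
```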

Example of the hash tree for C3
(Figure: a hash tree with hash function h(item) = item mod 3; interior nodes hash on the 1st, 2nd, and 3rd item of a candidate, sending items 1,4,... / 2,5,... / 3,6,... to the three buckets.)

Example of the hash tree for C3 (continued)
(Figure: finding the candidates contained in transaction 12345: at the root, look for candidates starting with 1 (then search 2345), with 2 (then 345), and with 3 (then 45).)

Example of the hash tree for C3 (continued)
(Figure: the search one level down: within the 1XX subtree, look for 12X, for 13X (null), and for 14X.)

AprioriTid: Use D Only for the First Pass
- The database is not used after the 1st pass.
- Instead, the set Ck' is used for each step: each member of Ck' has the form <TID, {Xk}>, where each Xk is a potentially frequent k-itemset in the transaction with id = TID.
- At each step, Ck' is generated from Ck-1' at the pruning step of constructing Ck, and is used to compute Lk.
- For small values of k, Ck' can be larger than the database!
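The construction of Ck' can be sketched (a simplified illustration, not the full algorithm; build_ck_prime is a hypothetical helper): a candidate k-itemset is kept for a transaction exactly when all of its (k−1)-subsets were kept for that transaction in the previous pass, so later passes never touch the raw database.

```python
from itertools import combinations

# Sketch of AprioriTid's transformed database: Ck' maps each TID to the
# candidate k-itemsets contained in that transaction.
D = {100: {1, 3, 4}, 200: {2, 3, 5}, 300: {1, 2, 3, 5}, 400: {2, 5}}

def build_ck_prime(candidates, ck_prime_prev):
    """Ck' from Ck-1': keep a candidate for a TID iff the TID's
    previous-pass itemsets cover all the candidate's (k-1)-subsets."""
    out = {}
    for tid, prev in ck_prime_prev.items():
        hits = {c for c in candidates
                if all(frozenset(s) in prev for s in combinations(c, len(c) - 1))}
        if hits:
            out[tid] = hits
    return out

# The first pass uses D itself: C1' is just each transaction's items.
c1p = {tid: {frozenset([i]) for i in items} for tid, items in D.items()}
C2 = [frozenset(p) for p in [(1,2), (1,3), (1,5), (2,3), (2,5), (3,5)]]
c2p = build_ck_prime(C2, c1p)
print(c2p[100])   # only {1 3} survives for TID 100, as on the next slide
```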

AprioriTid Example (min_sup = 2)

C1':
TID   Sets of itemsets
100   {{1},{3},{4}}
200   {{2},{3},{5}}
300   {{1},{2},{3},{5}}
400   {{2},{5}}

C2':
TID   Sets of itemsets
100   {{1 3}}
200   {{2 3},{2 5},{3 5}}
300   {{1 2},{1 3},{1 5},{2 3},{2 5},{3 5}}
400   {{2 5}}

C3':
TID   Sets of itemsets
200   {{2 3 5}}
300   {{2 3 5}}

(Database D and the collections L1, C2, L2, C3, L3 are those of the earlier Apriori example.)

Methods to Improve Apriori's Efficiency
- Hash-based itemset counting: a k-itemset whose corresponding hash bucket count is below the threshold cannot be frequent.
- Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans.
- Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.
- Sampling: mining on a subset of the given data, with a lowered support threshold plus a method to determine completeness.
- Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent.
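Transaction reduction, for instance, is easy to sketch: once pass k finishes, any transaction containing no frequent k-itemset cannot support a longer frequent itemset and can be dropped from later scans (names and data below are illustrative).

```python
from itertools import combinations

def reduce_transactions(transactions, frequent_k, k):
    """Drop transactions that contain no frequent k-itemset:
    they cannot support any (k+1)-itemset in later passes."""
    return [t for t in transactions
            if any(frozenset(s) in frequent_k for s in combinations(t, k))]

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}, {4, 6}]
L1 = {frozenset([i]) for i in (1, 2, 3, 5)}   # assumed frequent items
print(reduce_transactions(D, L1, 1))          # {4, 6} is dropped
```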