1 Association Rule Mining
Instructor: Qiang Yang
Slides from Jiawei Han and Jian Pei, and from Introduction to Data Mining by Tan, Steinbach, Kumar



Frequent-pattern mining methods

2 What Is Frequent Pattern Mining?
Frequent pattern: a pattern (set of items, sequence, etc.) that occurs frequently in a database [AIS93]
Frequent pattern mining: finding regularities in data
  What products were often purchased together?
  What are the subsequent purchases after buying a PC?

3 Why Is Frequent Pattern Mining an Essential Task in Data Mining?
Foundation for many essential data mining tasks
  Association, correlation, causality
  Sequential patterns, temporal or cyclic association, partial periodicity, spatial and multimedia association
  Associative classification, cluster analysis, iceberg cube, fascicles (semantic data compression)
Broad applications
  Basket data analysis, cross-marketing, catalog design, sale campaign analysis
  Web log (click stream) analysis, DNA sequence analysis, etc.

4 Basic Concepts: Frequent Patterns and Association Rules
Itemset X = {x1, …, xk}
Find all the rules X → Y with minimum confidence and support
  support, s: probability that a transaction contains X ∪ Y
  confidence, c: conditional probability that a transaction having X also contains Y

Transaction-id  Items bought
10              A, B, C
20              A, C
30              A, D
40              B, E, F

Let min_support = 50%, min_conf = 50%:
  A → C (50%, 66.7%)
  C → A (50%, 100%)
[Figure labels: Customer buys diaper; Customer buys both; Customer buys beer]
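The support and confidence figures on this slide can be checked with a few lines of Python. This is a minimal sketch, not code from the slides: the helper names `support` and `confidence` are chosen here for illustration, and probabilities are estimated as simple fractions of the four transactions.

```python
# Transactions from the slide (Transaction-id -> items bought).
transactions = {
    10: {"A", "B", "C"},
    20: {"A", "C"},
    30: {"A", "D"},
    40: {"B", "E", "F"},
}

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= items for items in db.values()) / len(db)

def confidence(lhs, rhs, db):
    """Estimate of Pr(rhs | lhs): support(lhs U rhs) / support(lhs)."""
    return support(set(lhs) | set(rhs), db) / support(lhs, db)

for lhs, rhs in [("A", "C"), ("C", "A")]:
    s = support(set(lhs) | set(rhs), transactions)
    c = confidence(lhs, rhs, transactions)
    print(f"{lhs} -> {rhs}: support={s:.1%}, confidence={c:.1%}")
```

Both rules pass the 50% thresholds, matching the (50%, 66.7%) and (50%, 100%) pairs quoted on the slide.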

5 Concept: Frequent Itemsets

Outlook   Temperature  Humidity  Play
sunny     hot          high      no
sunny     hot          high      no
overcast  hot          high      yes
rainy     mild         high      yes
rainy     cool         normal    yes
rainy     cool         normal    no
overcast  cool         normal    yes
sunny     mild         high      no
sunny     cool         normal    yes
rainy     mild         normal    yes
sunny     mild         normal    yes
overcast  mild         high      yes
overcast  hot          normal    yes
rainy     mild         high      no

Minimum support = 2:
  {sunny, hot, no}
  {sunny, hot, high, no}
  {rainy, normal}
Minimum support = 3: ?
How strong is {sunny, no}? Count = ?  Percentage = ?

6 Concept: Itemset → Rules
{sunny, hot, no} = {Outlook=sunny, Temp=hot, Play=no}
Generate a rule: Outlook=sunny and Temp=hot → Play=no
How strong is this rule?
  Support of the rule = support of the itemset {sunny, hot, no}
    = 2 in count form, or Pr({sunny, hot, no}) = 2/14 in percentage form
  Confidence = Pr(Play=no | {Outlook=sunny, Temp=hot})
In general, for a rule LHS → RHS:
  Confidence = Pr(RHS | LHS) = count(LHS and RHS) / count(LHS)
What is the confidence of Outlook=sunny → Play=no?
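The count-based definition of confidence can be tried directly on the weather table. The sketch below is an illustration under stated assumptions: each row is turned into a set of (attribute, value) items, and the helper names `count` and `rule_confidence` are hypothetical, not taken from the slides.

```python
# Weather table from the slides: (Outlook, Temperature, Humidity, Play).
COLUMNS = ("Outlook", "Temperature", "Humidity", "Play")
ROWS = [
    ("sunny", "hot", "high", "no"),
    ("sunny", "hot", "high", "no"),
    ("overcast", "hot", "high", "yes"),
    ("rainy", "mild", "high", "yes"),
    ("rainy", "cool", "normal", "yes"),
    ("rainy", "cool", "normal", "no"),
    ("overcast", "cool", "normal", "yes"),
    ("sunny", "mild", "high", "no"),
    ("sunny", "cool", "normal", "yes"),
    ("rainy", "mild", "normal", "yes"),
    ("sunny", "mild", "normal", "yes"),
    ("overcast", "mild", "high", "yes"),
    ("overcast", "hot", "normal", "yes"),
    ("rainy", "mild", "high", "no"),
]

# Each row becomes a set of (attribute, value) items.
records = [set(zip(COLUMNS, row)) for row in ROWS]

def count(items):
    """Number of rows that contain every (attribute, value) pair in `items`."""
    items = set(items)
    return sum(items <= record for record in records)

def rule_confidence(lhs, rhs):
    """count(LHS and RHS) / count(LHS), as defined on the slide."""
    return count(set(lhs) | set(rhs)) / count(lhs)

sunny = ("Outlook", "sunny")
hot = ("Temperature", "hot")
no_play = ("Play", "no")

print("support count of {sunny, hot, no}:", count([sunny, hot, no_play]))
print("confidence of sunny -> no:", rule_confidence([sunny], [no_play]))
print("confidence of sunny, hot -> no:", rule_confidence([sunny, hot], [no_play]))
```

On this table, three of the five sunny rows have Play=no, so the rule Outlook=sunny → Play=no has confidence 3/5.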

7 Frequent Patterns
Patterns = itemsets {i1, i2, …, in}, where each item is a pair (Attribute=value)
Frequent patterns: itemsets whose support >= minimum support
Support = count(itemset) / count(database)

8 Frequent Itemset Generation
Given d items, there are 2^d possible candidate itemsets
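To see how quickly the candidate space grows, the subsets can simply be enumerated. A small sketch, assuming five hypothetical items A to E; note that 2^d counts the empty set as well, so there are 2^d - 1 non-empty candidates.

```python
from itertools import combinations

def all_itemsets(items):
    """Yield every non-empty subset of `items` as a frozenset."""
    items = sorted(items)
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)

items = ["A", "B", "C", "D", "E"]  # d = 5 illustrative items
candidates = list(all_itemsets(items))

# 2**d subsets in total; leaving out the empty set gives 2**d - 1.
print(len(candidates), 2 ** len(items) - 1)

# The count doubles with every additional item.
for d in (10, 20, 30):
    print(d, 2 ** d)
```

Counting every candidate against the database is therefore only feasible for very small d, which is what motivates the pruning methods that follow.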

9 Max-patterns
Max-pattern: a frequent pattern without a proper frequent super-pattern
  BCDE and ACD are max-patterns
  BCD is not a max-pattern

Tid  Items
10   A, B, C, D, E
20   B, C, D, E
30   A, C, D, F

Min_sup = 2
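The max-pattern claim for this small database can be verified by brute force. The sketch below enumerates every itemset, which is only practical for toy data; `frequent_itemsets` is an illustrative helper, not an algorithm from the slides.

```python
from itertools import combinations

transactions = [
    {"A", "B", "C", "D", "E"},  # Tid 10
    {"B", "C", "D", "E"},       # Tid 20
    {"A", "C", "D", "F"},       # Tid 30
]
MIN_SUP = 2

def frequent_itemsets(db, min_sup):
    """Brute force: count every non-empty itemset and keep the frequent ones."""
    items = sorted(set().union(*db))
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            itemset = frozenset(combo)
            sup = sum(itemset <= t for t in db)
            if sup >= min_sup:
                result[itemset] = sup
    return result

frequent = frequent_itemsets(transactions, MIN_SUP)

# A max-pattern is a frequent itemset with no frequent proper superset.
max_patterns = [s for s in frequent if not any(s < other for other in frequent)]

for pattern in max_patterns:
    print("".join(sorted(pattern)), frequent[pattern])
```

BCD is frequent but is contained in the frequent itemset BCDE, so it does not appear among the max-patterns.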

10 Maximal Frequent Itemset
An itemset is maximal frequent if none of its immediate supersets is frequent
[Figure labels: Border; Infrequent Itemsets; Maximal Itemsets]

11 Frequent Max Patterns
Succinct expression of frequent patterns
  Let {a, b, c} be frequent
  Then {a, b}, {b, c}, {a, c} must also be frequent
  Then {a}, {b}, {c} must also be frequent
  By writing down {a, b, c} once, we save lots of computation
Max pattern
  If {a, b, c} is a frequent max pattern, then {a, b, c, x} is NOT a frequent pattern for any other item x

12 Find Frequent Max Patterns

Outlook   Temperature  Humidity  Play
sunny     hot          high      no
sunny     hot          high      no
overcast  hot          high      yes
rainy     mild         high      yes
rainy     cool         normal    yes
rainy     cool         normal    no
overcast  cool         normal    yes
sunny     mild         high      no
sunny     cool         normal    yes
rainy     mild         normal    yes
sunny     mild         normal    yes
overcast  mild         high      yes
overcast  hot          normal    yes
rainy     mild         high      no

Minimum support = 2
{sunny, hot, no} ??

13 Closed Patterns
An itemset is closed if none of its immediate supersets has the same support as the itemset
  {a, b}, {a, b, d}, {a, b, c} are closed patterns
  But {a, b} is not a max pattern
See where changes happen
Reduce # of patterns and rules
N. Pasquier et al. In ICDT'99

TID  Items
10   a, b, c
20   a, b, c
30   a, b, d
40   a, b, d
50   c, e, f
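The closedness test can be written as a direct comparison of supports. This is a brute-force sketch for the five transactions above; the minimum support of 2 is an assumption made here (the transcript does not state one for this slide), and `is_closed` is an illustrative name.

```python
from itertools import combinations

transactions = [
    {"a", "b", "c"},  # TID 10
    {"a", "b", "c"},  # TID 20
    {"a", "b", "d"},  # TID 30
    {"a", "b", "d"},  # TID 40
    {"c", "e", "f"},  # TID 50
]
MIN_SUP = 2  # assumed threshold, not given on the slide

def support_count(itemset):
    return sum(itemset <= t for t in transactions)

items = sorted(set().union(*transactions))
frequent = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        itemset = frozenset(combo)
        sup = support_count(itemset)
        if sup >= MIN_SUP:
            frequent[itemset] = sup

def is_closed(itemset):
    """Closed: no immediate superset (one extra item) has the same support."""
    sup = support_count(itemset)
    return all(support_count(itemset | {x}) < sup for x in items if x not in itemset)

closed = [s for s in frequent if is_closed(s)]
maximal = [s for s in frequent if not any(s < other for other in frequent)]

print("closed :", sorted("".join(sorted(s)) for s in closed))
print("maximal:", sorted("".join(sorted(s)) for s in maximal))
```

Under these assumptions {a, b} (support 4) is closed but not maximal, because its frequent supersets {a, b, c} and {a, b, d} have the lower support 2. The sketch also reports {c} (support 3) as closed, an itemset the slide's list does not mention.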

14 Maximal vs Closed Itemsets
[Figure: itemset lattice. The indexes beside an itemset are the IDs of the transactions that contain it; some itemsets are not supported by any transaction.]

15 Maximal vs Closed Frequent Itemsets
Minimum support = 2
# Closed = 9
# Maximal = 4
[Figure labels: Closed and maximal; Closed but not maximal]

16 Note on Closed Patterns
Closed patterns can be defined without specifying a minimum support
  Given a dataset, we can find its set of closed patterns once; then, for any minimum support value, we can immediately find the qualifying patterns (a subset of the closed patterns)
Closed frequent patterns
  Both closed and above the minimum support

17 Maximal vs Closed Itemsets

18 Mining Association Rules: an Example
Min. support 50%, min. confidence 50%

Transaction-id  Items bought
10              A, B, C
20              A, C
30              A, D
40              B, E, F

Frequent pattern  Support
{A}               75%
{B}               50%
{C}               50%
{A, C}            50%

For rule A → C:
  support = support({A} ∪ {C}) = 50%
  confidence = support({A} ∪ {C}) / support({A}) = 66.7%
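Once the frequent patterns and their supports are known, rules can be generated without touching the database again. A minimal sketch, assuming the support table from this slide is held in a dictionary; `generate_rules` is an illustrative name, and it relies on every subset of a frequent itemset also being present in the table.

```python
from itertools import combinations

# Frequent patterns and their supports, copied from the slide's table.
supports = {
    frozenset("A"): 0.75,
    frozenset("B"): 0.50,
    frozenset("C"): 0.50,
    frozenset("AC"): 0.50,
}
MIN_CONF = 0.5

def generate_rules(supports, min_conf):
    """Split each frequent itemset into LHS -> RHS and keep confident rules."""
    rules = []
    for itemset, sup in supports.items():
        for k in range(1, len(itemset)):  # size of the left-hand side
            for lhs_items in combinations(sorted(itemset), k):
                lhs = frozenset(lhs_items)
                conf = sup / supports[lhs]  # support(LHS U RHS) / support(LHS)
                if conf >= min_conf:
                    rules.append((lhs, itemset - lhs, sup, conf))
    return rules

rules = generate_rules(supports, MIN_CONF)
for lhs, rhs, sup, conf in rules:
    print(f"{set(lhs)} -> {set(rhs)}: support={sup:.0%}, confidence={conf:.1%}")
```

Only {A, C} has more than one item, so it yields exactly the two rules A → C and C → A.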

19 Method 1: Apriori: A Candidate Generation-and-test Approach
Any subset of a frequent itemset must be frequent
  If {beer, diaper, nuts} is frequent, so is {beer, diaper}
  Every transaction having {beer, diaper, nuts} also contains {beer, diaper}
Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated or tested
Method: generate length-(k+1) candidate itemsets from length-k frequent itemsets, and test the candidates against the DB
Performance studies show its efficiency and scalability
Agrawal & Srikant 1994; Mannila et al. 1994

20 The Apriori Algorithm: An Example

Database TDB
Tid  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

1st scan, C1:
Itemset  sup
{A}      2
{B}      3
{C}      3
{D}      1
{E}      3

L1:
Itemset  sup
{A}      2
{B}      3
{C}      3
{E}      3

C2 (generated from L1):
Itemset
{A, B}
{A, C}
{A, E}
{B, C}
{B, E}
{C, E}

2nd scan, C2 with counts:
Itemset  sup
{A, B}   1
{A, C}   2
{A, E}   1
{B, C}   2
{B, E}   3
{C, E}   2

L2:
Itemset  sup
{A, C}   2
{B, C}   2
{B, E}   3
{C, E}   2

C3:
Itemset
{B, C, E}

3rd scan, L3:
Itemset    sup
{B, C, E}  2
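The three scans in this example can be reproduced with a compact level-wise loop. The following is a sketch of the generate-and-test idea, not the authors' implementation: the function name `apriori` and the set-based join are choices made here, and the minimum support count of 2 is inferred from the tables ({D} with count 1 is dropped, {A} with count 2 is kept).

```python
from itertools import combinations

def apriori(db, min_sup):
    """Return {itemset: support count} for all frequent itemsets in `db`."""
    # 1st scan: count single items (C1) and keep the frequent ones (L1).
    level = {}
    for item in sorted(set().union(*db)):
        itemset = frozenset([item])
        n = sum(itemset <= t for t in db)
        if n >= min_sup:
            level[itemset] = n
    frequent = dict(level)
    k = 1
    while level:
        prev = set(level)
        # Join: unions of two frequent k-itemsets that have exactly k+1 items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # Prune: every k-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k))}
        # Scan: count the surviving candidates against the database.
        level = {}
        for c in candidates:
            n = sum(c <= t for t in db)
            if n >= min_sup:
                level[c] = n
        frequent.update(level)
        k += 1
    return frequent

tdb = [
    {"A", "C", "D"},       # Tid 10
    {"B", "C", "E"},       # Tid 20
    {"A", "B", "C", "E"},  # Tid 30
    {"B", "E"},            # Tid 40
]
frequent = apriori(tdb, min_sup=2)
for itemset in sorted(frequent, key=lambda s: (len(s), sorted(s))):
    print("".join(sorted(itemset)), frequent[itemset])
```

The prune step is where the Apriori principle pays off: {A, B, C} and {A, C, E} are never counted because {A, B} and {A, E} are infrequent, leaving {B, C, E} as the only candidate in C3.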

21 Speeding up Association Rules
Dynamic Hashing and Pruning technique
Thanks to Cheng Hong & Hu Haibo

22 DHP: Reduce the Number of Candidates
A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
  Candidates: a, b, c, d, e
  Hash entries: {ab, ad, ae}, {bd, be, de}, …
  Frequent 1-itemsets: a, b, d, e
  ab is not a candidate 2-itemset if the sum of the counts of {ab, ad, ae} is below the support threshold
J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. In SIGMOD'95

23 Still challenging, the niche for DHP
DHP (Park '95): Dynamic Hashing and Pruning
The set of candidate large 2-itemsets is huge
  DHP: trim them using hashing
The transaction database is so large that one scan per iteration is costly
  DHP: prune both the number of transactions and the number of items in each transaction after each iteration

24 Hash Table Construction
Consider 2-itemsets, with all items numbered i1, i2, …, in. Any pair (x, y) is hashed according to the hash function
  bucket # = h({x y}) = ((order of x)*10 + (order of y)) % 7
Example:
  Items = A, B, C, D, E; Order = 1, 2, 3, 4, 5
  h({C, E}) = (3*10 + 5) % 7 = 0
  Thus, {C, E} belongs to bucket 0

25 How to trim candidate itemsets
In iteration k, hash all candidate (k+1)-itemsets into a hash table, and count the itemsets in each bucket
In iteration k+1, examine each candidate itemset to see whether its corresponding bucket count reaches the support threshold (a necessary condition)

26 Example

TID  Items
100  A C D
200  B C E
300  A B C E
400  B E

Figure 1. An example transaction database

27 Generation of C1 & L1 (1st iteration)

C1:
Itemset  Sup
{A}      2
{B}      3
{C}      3
{D}      1
{E}      3

L1:
Itemset  Sup
{A}      2
{B}      3
{C}      3
{E}      3

28 Hash Table Construction
Find all 2-itemsets of each transaction

TID  2-itemsets
100  {A C} {A D} {C D}
200  {B C} {B E} {C E}
300  {A B} {A C} {A E} {B C} {B E} {C E}
400  {B E}

29 Hash Table Construction (2)
Hash function: h({x y}) = ((order of x)*10 + (order of y)) % 7

Hash table
Bucket  Itemsets
0       {C E} {C E} {A D}
1       {A E}
2       {B C} {B C}
3
4       {B E} {B E} {B E}
5       {A B}
6       {A C} {C D} {A C}

30 C2 Generation (2nd iteration)

L1*L1  # in the bucket
{A B}  1
{A C}  3
{A E}  1
{B C}  2
{B E}  3
{C E}  3

Resulted C2: {A C}, {B C}, {B E}, {C E}
C2 of Apriori: {A B}, {A C}, {A E}, {B C}, {B E}, {C E}
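The bucket counts and the trimmed C2 can be reproduced with the hash function from the slides. This is a minimal sketch of the hashing step only (it does not include DHP's transaction trimming); the names `bucket`, `apriori_c2` and `dhp_c2` are illustrative, and the minimum support count of 2 is inferred from the C1 and L1 tables.

```python
from collections import Counter
from itertools import combinations

ORDER = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}
NUM_BUCKETS = 7
MIN_SUP = 2  # inferred from the C1/L1 tables

transactions = {
    100: ["A", "C", "D"],
    200: ["B", "C", "E"],
    300: ["A", "B", "C", "E"],
    400: ["B", "E"],
}

def bucket(x, y):
    """Hash function from the slides; x must precede y in item order."""
    return (ORDER[x] * 10 + ORDER[y]) % NUM_BUCKETS

# 1st iteration: count single items and, in the same pass,
# hash every 2-itemset of each transaction into a bucket.
item_counts = Counter()
bucket_counts = Counter()
for items in transactions.values():
    items = sorted(items, key=ORDER.get)
    item_counts.update(items)
    for x, y in combinations(items, 2):
        bucket_counts[bucket(x, y)] += 1

L1 = sorted((i for i, n in item_counts.items() if n >= MIN_SUP), key=ORDER.get)

# 2nd iteration: Apriori would keep every pair from L1 * L1;
# DHP keeps a pair only if its bucket count reaches the threshold.
apriori_c2 = list(combinations(L1, 2))
dhp_c2 = [(x, y) for x, y in apriori_c2
          if bucket_counts[bucket(x, y)] >= MIN_SUP]

for x, y in apriori_c2:
    print(x, y, "bucket", bucket(x, y), "count", bucket_counts[bucket(x, y)])
print("C2 of Apriori:", apriori_c2)
print("Resulted C2  :", dhp_c2)
```

The bucket count is only an upper bound on an itemset's support, since several itemsets can share a bucket; {C, E} sits in bucket 0 together with {A, D}, for example. Passing the bucket test is therefore necessary but not sufficient, and the surviving candidates still have to be counted.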

31 Effective Database Pruning
Apriori
  Does not prune the database
  Prunes C_k by support counting on the original database
DHP
  More efficient support counting can be achieved on the pruned database

32 Performance Comparison

33 Performance Comparison (2)

34 FP-growth Algorithm
Uses a compressed representation of the database, the FP-tree
Once an FP-tree has been constructed, a recursive divide-and-conquer approach is used to mine the frequent itemsets

35 FP-tree construction
[Figure: the FP-tree after reading TID=1 (node labels: null, A:1, B:1) and after reading TID=2 (node labels: null, A:1, B:1, C:1, D:1)]

36 FP-Tree Construction
[Figure: transaction database, header table, and the resulting FP-tree (node labels: null, A:7, B:5, B:3, C:3, D:1, C:1, D:1, C:3, D:1, E:1, D:1, E:1)]
Pointers are used to assist frequent itemset generation
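Because the transaction database behind this figure is not preserved in the transcript, the construction can be illustrated on other data. The sketch below is an assumption-laden illustration: it reuses the four-transaction TDB from the Apriori example, orders items by descending frequency with alphabetical tie-breaking (a common convention, chosen here), and uses hypothetical names `FPNode` and `build_fp_tree`. The header table is modelled as a plain list of nodes per item rather than linked pointers.

```python
from collections import Counter

class FPNode:
    """One node of the FP-tree: an item, a count, and child links."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_sup):
    """Return (root, header); header maps each item to its nodes in the tree."""
    freq = Counter(item for t in transactions for item in set(t))
    keep = {item for item, n in freq.items() if n >= min_sup}
    root = FPNode(None)
    header = {}
    for t in transactions:
        # Drop infrequent items; sort by descending frequency, then by name.
        path = sorted((i for i in set(t) if i in keep),
                      key=lambda i: (-freq[i], i))
        node = root
        for item in path:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            child.count += 1  # shared prefixes only increment counts
            node = child
    return root, header

def show(node, depth=0):
    """Print the tree, one node per line, indented by depth."""
    for child in sorted(node.children.values(), key=lambda n: n.item):
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

transactions = [
    ["A", "C", "D"],
    ["B", "C", "E"],
    ["A", "B", "C", "E"],
    ["B", "E"],
]
root, header = build_fp_tree(transactions, min_sup=2)
show(root)
```

Transactions that share a prefix share a path, which is where the compression comes from; the per-item node lists play the role of the pointers mentioned on the slide.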

37 FP-growth
[Figure: FP-tree paths ending in D (node labels: null, A:4, B:2, B:1, C:1, D:1, C:1, D:1, C:1, D:1, D:1)]
Conditional pattern base for D:
  (PB | D) = {(A:1, B:1, C:1), (A:1, B:1), (A:1, C:1), (A:1), (B:1, C:1)}
Recursively apply FP-growth on PB, and then append D to each pattern found
Thus, the frequent itemsets found from PB|D (with min support = 2): AD, BD, CD, ABD, ACD, BCD
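The recursion on the conditional pattern base can be sketched without building the conditional FP-trees explicitly. The code below is a simplified illustration, not the full FP-growth algorithm: it works directly on lists of (prefix path, count) pairs, and `pattern_growth` is a name chosen here. The paths for D are copied from the slide, with each count of 1 taken from the node labels.

```python
from collections import Counter

def pattern_growth(pattern_base, suffix, min_sup):
    """Mine frequent itemsets that end with `suffix`.

    pattern_base: list of (path, count), where path is a tuple of items in
    tree order. Returns {itemset: support count}.
    """
    counts = Counter()
    for path, n in pattern_base:
        for item in path:
            counts[item] += n
    found = {}
    for item, sup in counts.items():
        if sup < min_sup:
            continue
        new_suffix = frozenset(suffix | {item})
        found[new_suffix] = sup
        # Conditional pattern base of `item`: the part of each path before it.
        conditional = [(path[:path.index(item)], n)
                       for path, n in pattern_base
                       if item in path and path.index(item) > 0]
        found.update(pattern_growth(conditional, new_suffix, min_sup))
    return found

# Conditional pattern base for D, as listed on the slide.
pb_d = [
    (("A", "B", "C"), 1),
    (("A", "B"), 1),
    (("A", "C"), 1),
    (("A",), 1),
    (("B", "C"), 1),
]
result = pattern_growth(pb_d, frozenset({"D"}), min_sup=2)
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print("".join(sorted(itemset)), result[itemset])
```

With a minimum support of 2 this yields AD, BD, CD, ABD, ACD and BCD, the itemsets listed on the slide; ABCD is not reported because the path (A, B, C) occurs only once.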

38 FP-Growth vs. Apriori: Scalability With the Support Threshold
Data set T25I20D10K