Information Management course

Slides:



Advertisements
Similar presentations
Association Rule Mining
Advertisements

LOGO Association Rule Lecturer: Dr. Bo Yuan
1 of 25 1 of 45 Association Rule Mining CIT366: Data Mining & Data Warehousing Instructor: Bajuna Salehe The Institute of Finance Management: Computing.
Frequent Item Mining.
Rakesh Agrawal Ramakrishnan Srikant
Chapter 5: Mining Frequent Patterns, Association and Correlations
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Data Mining: Concepts and Techniques (2nd ed.) — Chapter 5 —
732A02 Data Mining - Clustering and Association Analysis ………………… Jose M. Peña Association rules Apriori algorithm FP grow algorithm.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Data Mining Association Analysis: Basic Concepts and Algorithms
Data Mining Association Analysis: Basic Concepts and Algorithms
1 Association Rule Mining Instructor Qiang Yang Slides from Jiawei Han and Jian Pei And from Introduction to Data Mining By Tan, Steinbach, Kumar.
Business Systems Intelligence: 4. Mining Association Rules Dr. Brian Mac Namee (
Association Analysis: Basic Concepts and Algorithms.
Data Mining Association Analysis: Basic Concepts and Algorithms
1 Association Rule Mining Instructor Qiang Yang Thanks: Jiawei Han and Jian Pei.
Chapter 4: Mining Frequent Patterns, Associations and Correlations
Mining Association Rules in Large Databases
Mining Association Rules
Mining Association Rules
Performance and Scalability: Apriori Implementation.
Mining Association Rules in Large Databases. What Is Association Rule Mining?  Association rule mining: Finding frequent patterns, associations, correlations,
Pattern Recognition Lecture 20: Data Mining 3 Dr. Richard Spillman Pacific Lutheran University.
Chapter 2: Mining Frequent Patterns, Associations and Correlations
Ch5 Mining Frequent Patterns, Associations, and Correlations
Data Mining: Concepts and Techniques (3rd ed.) — Chapter 6 —
What Is Association Mining? l Association rule mining: – Finding frequent patterns, associations, correlations, or causal structures among sets of items.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining By Tan, Steinbach, Kumar Lecture.
Modul 7: Association Analysis. 2 Association Rule Mining  Given a set of transactions, find rules that will predict the occurrence of an item based on.
Data Warehousing 資料倉儲 Min-Yuh Day 戴敏育 Assistant Professor 專任助理教授 Dept. of Information Management, Tamkang University Dept. of Information ManagementTamkang.
Chapter 4 – Frequent Pattern Mining Shuaiqiang Wang ( 王帅强 ) School of Computer Science and Technology Shandong University of Finance and Economics Homepage:
Frequent Item Mining. What is data mining? =Pattern Mining? What patterns? Why are they useful?
CSE4334/5334 DATA MINING CSE4334/5334 Data Mining, Fall 2014 Department of Computer Science and Engineering, University of Texas at Arlington Chengkai.
Data Mining Find information from data data ? information.
Mining Frequent Patterns, Associations, and Correlations Compiled By: Umair Yaqub Lecturer Govt. Murray College Sialkot.
Mining Frequent Patterns. What Is Frequent Pattern Analysis? Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs.
Chapter 6: Mining Frequent Patterns, Association and Correlations
Dept. of Information Management, Tamkang University
What is Frequent Pattern Analysis?
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
CS685 : Special Topics in Data Mining, UKY The UNIVERSITY of KENTUCKY Association Rule Mining CS 685: Special Topics in Data Mining Jinze Liu.
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Association Rule Mining COMP Seminar BCB 713 Module Spring 2011.
1 Data Mining Lecture 6: Association Analysis. 2 Association Rule Mining l Given a set of transactions, find rules that will predict the occurrence of.
Chapter 4: Mining Frequent Patterns, Associations and Correlations
Mining Association Rules in Large Databases
Data Mining Find information from data data ? information.
Reducing Number of Candidates
Data Mining Association Analysis: Basic Concepts and Algorithms
Data Mining: Concepts and Techniques
Data Mining: Concepts and Techniques
Association rule mining
Knowledge discovery & data mining Association rules and market basket analysis--introduction UCLA CS240A Course Notes*
Frequent Pattern Mining
Data Mining Association Analysis: Basic Concepts and Algorithms
Data Mining Association Analysis: Basic Concepts and Algorithms
Data Mining: Concepts and Techniques (3rd ed.) — Chapter 6 —
Market Baskets Frequent Itemsets A-Priori Algorithm
Mining Association Rules in Large Databases
Association Rule Mining
Data Mining Association Analysis: Basic Concepts and Algorithms
Association Rule Mining
Data Mining: Concepts and Techniques (3rd ed.) — Chapter 6 —
Department of Computer Science National Tsing Hua University
Association Rule Mining
Data Mining: Concepts and Techniques (3rd ed.) — Chapter 6 —
Mining Association Rules in Large Databases
Association Analysis: Basic Concepts
What Is Association Mining?
Presentation transcript:

Information Management course Università degli Studi di Milano Master Degree in Computer Science Information Management course Teacher: Alberto Ceselli Lecture 11: 20/11/2012

Data Mining: Concepts and Techniques (3rd ed.) — Chapter 6 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University ©2011 Han, Kamber & Pei. All rights reserved. 2 2 2 2

Frequent Itemset Mining Methods Chapter 5: Mining Frequent Patterns, Association and Correlations: Basic Concepts and Methods Basic Concepts Frequent Itemset Mining Methods Which Patterns Are Interesting?—Pattern Evaluation Methods Summary 3 3

What Is Frequent Pattern Analysis? Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set First proposed by Agrawal, Imielinski, and Swami (1993) in the context of frequent itemsets and association rule mining Motivation: Finding inherent regularities in data What products were often purchased together? Beer and diapers?! What are the subsequent purchases after buying a PC? What kinds of DNA are sensitive to this new drug? Can we automatically classify web documents? Applications Basket data analysis, cross-marketing ..., Web log (click stream) analysis, and DNA sequence analysis. 4 4

Why Is Freq. Pattern Mining Important? Freq. pattern: An intrinsic and important property of datasets Foundation for many essential data mining tasks Association, correlation, and causality analysis Sequential, structural (e.g., sub-graph) patterns Pattern analysis in spatiotemporal, multimedia, time-series, and stream data Classification: discriminative, frequent pattern analysis Cluster analysis: frequent pattern-based clustering Data warehousing: iceberg cube and cube-gradient Semantic data compression: fascicles Broad applications 5 5

Basic Concepts: Frequent Patterns Tid Items bought 10 Beer, Nuts, Diaper 20 Beer, Coffee, Diaper 30 Beer, Diaper, Eggs 40 Nuts, Eggs, Milk 50 Nuts, Coffee, Diaper, Eggs, Milk itemset: A set of one or more items k-itemset X = {x1, …, xk} (absolute) support, or, support count of X: number of occurrences of an itemset X in the dataset (relative) support, s, is the fraction of transactions that contains X (i.e., the probability that a transaction contains X) An itemset X is frequent if X’s support is no less than a minsup threshold Customer buys diaper buys both buys beer 6 6

Basic Concepts: Association Rules Tid Items bought Find all the rules X  Y fixing a minimum support and confidence support, s, probability that a transaction contains X  Y confidence, c, conditional probability that a transaction having X also contains Y 10 Beer, Nuts, Diaper 20 Beer, Coffee, Diaper 30 Beer, Diaper, Eggs 40 Nuts, Eggs, Milk 50 Nuts, Coffee, Diaper, Eggs, Milk Customer buys both Customer buys diaper Customer buys beer Let minsup = 50%, minconf = 50% Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3 Association rules: (many more!) Beer  Diaper (60%, 100%) Diaper  Beer (60%, 75%) 7 7

Closed Patterns and Max-Patterns A (long) pattern contains a combinatorial number of sub- patterns, e.g., {a1, …, a100} contains sub-patterns! Idea: restrict to closed and maximal patterns An itemset X is a closed p. if X is frequent and there exists no super-pattern Y ⊃ X, with the same support as X An itemset X is a maximal p. if X is frequent and there exists no frequent superpattern Y ⊃ X Closed pattern is a lossless compression of freq. Patterns: reducing the # of patterns and rules 100 1 + 100 2 +...+ 100 100 = 2 100 −1=1.27⋅ 10 30 8 8

Closed Patterns and Max-Patterns Exercise. DB = {<a1, …, a100>, < a1, …, a50>} Min_sup = 1. What is the set of closed itemset? <a1, …, a100>: 1 < a1, …, a50>: 2 What is the set of maximal pattern? What is the set of all patterns? !! 9 9

Computational Complexity of Frequent Itemset Mining How many itemsets are potentially to be generated in the worst case? The number of frequent itemsets to be generated is sensitive to the minsup threshold When minsup is low, there exist potentially an exponential number of frequent itemsets The worst case: MN where M: # distinct items, and N: max length of transactions The worst case complexty vs. the expected probability Ex. Suppose Walmart has 104 kinds of products The chance to pick up one product 10-4 The chance to pick up a particular set of 10 products: ~10-40 What is the chance this particular set of 10 products to be frequent 103 times in 109 transactions? 10 10

Frequent Itemset Mining Methods Chapter 5: Mining Frequent Patterns, Association and Correlations: Basic Concepts and Methods Basic Concepts Frequent Itemset Mining Methods Which Patterns Are Interesting?—Pattern Evaluation Methods Summary 11 11

Scalable Frequent Itemset Mining Methods Apriori: Candidate Generate&Test Approach Improving the Efficiency of Apriori FPGrowth: A Frequent Pattern-Growth Approach ECLAT: Frequent Pattern Mining with Vertical Data Format 12 12

The Downward Closure Property and Scalable Mining Methods The downward closure property of frequent patterns Any subset of a frequent itemset is frequent If {beer, diaper, nuts} is frequent, so is {beer, diaper} i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper} Scalable mining methods: Three major approaches Apriori (Agrawal & Srikant@VLDB’94) Freq. pattern growth (FPgrowth—Han, Pei & Yin @SIGMOD’00) Vertical data format approach (Charm—Zaki & Hsiao @SDM’02) 13 13

Apriori: A Candidate Generate & Test Approach Apriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated/tested! (Agrawal & Srikant @VLDB’94, Mannila, et al. @ KDD’ 94) Method: Initially, scan DB once to get frequent 1-itemset Generate length (k+1) candidate itemsets from length k frequent itemsets Test the candidates against DB Terminate when no frequent or candidate set can be generated 14 14

The Apriori Algorithm—An Example Supmin = 2 Itemset sup {A} 2 {B} 3 {C} {D} 1 {E} Database TDB Itemset sup {A} 2 {B} 3 {C} {E} L1 C1 Tid Items 10 A, C, D 20 B, C, E 30 A, B, C, E 40 B, E 1st scan C2 C2 Itemset sup {A, B} 1 {A, C} 2 {A, E} {B, C} {B, E} 3 {C, E} Itemset {A, B} {A, C} {A, E} {B, C} {B, E} {C, E} L2 2nd scan Itemset sup {A, C} 2 {B, C} {B, E} 3 {C, E} C3 L3 Itemset {B, C, E} 3rd scan Itemset sup {B, C, E} 2 15 15

The Apriori Algorithm (Pseudo-Code) Ck: Candidate itemset of size k Lk : frequent itemset of size k L1 = {frequent items}; for (k = 1; Lk !=; k++) do begin Ck+1 = candidates generated from Lk; for each transaction t in database do increment the count of all candidates in Ck+1 that are contained in t Lk+1 = candidates in Ck+1 with enough support end return k Lk; 16 16

Implementation of Apriori How to generate candidates? Step 1: self-joining Lk Step 2: pruning Example of Candidate-generation L3={abc, abd, acd, ace, bcd} Self-joining: L3*L3 abcd from abc and abd acde from acd and ace Pruning: acde is removed because ade is not in L3 C4 = {abcd} 17 17

How to Count Supports of Candidates? Why counting supports of candidates is a problem? The total number of candidates can be very huge Each transaction may contain many candidates Method: Candidate itemsets are stored in a hash-tree Leaf node of hash-tree contains a list of itemsets and counts Interior node contains a hash table Subset function: finds all the candidates contained in a transaction 18 18

Counting Supports of Candidates Using Hash Tree 1,4,7 2,5,8 3,6,9 Subset function Transaction: 1 2 3 5 6 1 + 2 3 5 6 2 3 4 5 6 7 1 4 5 1 3 6 1 2 4 4 5 7 1 2 5 4 5 8 1 5 9 3 4 5 3 5 6 3 5 7 6 8 9 3 6 7 3 6 8 1 3 + 5 6 1 2 + 3 5 6 Build: store only frequent candidates and their count; do it incrementally while building Lk Query for a candidate: visit the tree; Query for an itemset: perform a visit for each sub-itemset; 19 19

Generating Association Rules from frequent itemsets When all frequent itemsets are found, generate strong association rules: Pick each frequent itemset f, generate all its nonempty subsets For each such subset s, test the rule s → (f \ s) support(s → (f \ s)) is above the threshold (as f is frequent by construction) confidence (s → (f \ s)) = P( (f \ s) | s ) = count(f) / count(s) count(f) and count(s) are known, and so checking is quick 20 20

Candidate Generation: An SQL Implementation SQL Implementation of candidate generation Suppose the items in Lk-1 are listed in an order Step 1: self-joining Lk-1 insert into Ck select p.item1, p.item2, …, p.itemk-1, q.itemk-1 from Lk-1 p, Lk-1 q where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1 Step 2: pruning forall itemsets c in Ck do forall (k-1)-subsets s of c do if (s is not in Lk-1) then delete c from Ck Use object-relational extensions like UDFs, BLOBs, and Table functions for efficient implementation [See: S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD’98] 21 21

Scalable Frequent Itemset Mining Methods Apriori: A Candidate Generation-and-Test Approach Improving the Efficiency of Apriori FPGrowth: A Frequent Pattern-Growth Approach ECLAT: Frequent Pattern Mining with Vertical Data Format Mining Close Frequent Patterns and Maxpatterns 22 22

Further Improvement of the Apriori Method Major computational challenges Multiple scans of transaction database Huge number of candidates Tedious workload of support counting for candidates Improving Apriori: general ideas Reduce passes of transaction database scans Shrink number of candidates Facilitate support counting of candidates 23 23

Partition: Scan Database Only Twice Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB Scan 1: partition database and find local frequent patterns Scan 2: consolidate global frequent patterns A. Savasere, E. Omiecinski and S. Navathe ’95 24

Sampling for Frequent Patterns Select a sample of original database, mine frequent patterns within sample using Apriori Scan database once to verify frequent itemsets found in sample, only borders of closure of frequent patterns are checked Example: check abcd instead of ab, ac, …, etc. Scan database again to find missed frequent patterns H. Toivonen. Sampling large databases for association rules. In VLDB’96 26 26

Dynamic Itemset Counting: Reduce Number of Scans ABCD Once both A and D are determined frequent, the counting of AD begins Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins ABC ABD ACD BCD AB AC BC AD BD CD Transactions 1-itemsets A B C D 2-itemsets Apriori … {} Itemset lattice 1-itemsets 2-items S. Brin R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. SIGMOD’97 DIC 3-items 27 27