Unit 3 MINING FREQUENT PATTERNS ASSOCIATION AND CORRELATIONS


Introduction Frequent patterns are patterns (such as itemsets, subsequences, or substructures) that appear in a data set frequently. For example, a set of items, such as milk and bread, that appears frequently together in a transaction data set is a frequent itemset. A subsequence, such as buying first a PC, then a digital camera, and then a memory card, is a (frequent) sequential pattern if it occurs frequently in a shopping history database. A substructure can refer to different structural forms, such as subgraphs, subtrees, or sublattices, which may be combined with itemsets or subsequences. If a substructure occurs frequently, it is called a (frequent) structured pattern. Finding such frequent patterns plays an essential role in mining associations, correlations, and many other interesting relationships among data.

Market Basket Analysis: A Motivating Example Frequent itemset mining leads to the discovery of associations and correlations among items in large transactional or relational data sets. The discovery of interesting correlation relationships among huge amounts of data can help in many business decision-making processes, such as customer shopping behavior analysis. A typical example of frequent itemset mining is market basket analysis: it analyzes customer buying habits by finding associations between the different items that customers place in their “shopping baskets,” which helps retailers develop marketing strategies.

Market basket analysis We consider the universe to be the set of items available at the store. Each item is a Boolean variable representing the presence or absence of that item, and each basket is then represented as a Boolean vector over these variables. Buying patterns can be expressed in the form of association rules, for example: computer ⇒ antivirus_software [support = 2%, confidence = 60%]. Rule support and confidence are two measures of rule interestingness; they reflect the usefulness and certainty of discovered rules. A support of 2% for this association rule means that 2% of all the transactions under analysis show that computer and antivirus software are purchased together. A confidence of 60% means that 60% of the customers who purchased a computer also bought the software. Typically, association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold. Such thresholds can be set by users or domain experts.

Frequent Itemsets Let I = {I1, I2, …, Im} be a set of items. Let D, the task-relevant data, be a set of database transactions, where each transaction T is a set of items such that T ⊆ I. Each transaction is associated with an identifier, called a TID. An association rule is an implication of the form A ⇒ B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅. The rule A ⇒ B holds in the transaction set D with support s, where s is the percentage of transactions in D that contain A ∪ B (i.e., the union of sets A and B). The rule A ⇒ B has confidence c in the transaction set D, where c is the percentage of transactions in D containing A that also contain B; this is taken to be the conditional probability P(B|A).

Frequent Itemsets support(A ⇒ B) = P(A ∪ B); confidence(A ⇒ B) = P(B|A). Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called strong. An itemset that contains k items is a k-itemset; the set {computer, antivirus software} is a 2-itemset. If an itemset I satisfies a prespecified minimum support threshold, then I is a frequent itemset. The set of frequent k-itemsets is commonly denoted by Lk. Note that confidence(A ⇒ B) = P(B|A) = support(A ∪ B)/support(A) = support_count(A ∪ B)/support_count(A).
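These definitions translate directly into code. The following is a minimal Python sketch; the five-transaction basket set and item names are made up purely for illustration.

```python
# Illustrative transaction database: each transaction is a set of items.
transactions = [
    {"computer", "antivirus"},
    {"computer", "printer"},
    {"computer", "antivirus", "printer"},
    {"printer"},
    {"computer"},
]

def support_count(itemset, transactions):
    """Number of transactions containing every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """support(A => B) = P(A u B): fraction of transactions containing A u B."""
    return support_count(itemset, transactions) / len(transactions)

def confidence(A, B, transactions):
    """confidence(A => B) = support_count(A u B) / support_count(A)."""
    return support_count(A | B, transactions) / support_count(A, transactions)

A, B = {"computer"}, {"antivirus"}
print(support(A | B, transactions))   # 2 of 5 transactions -> 0.4
print(confidence(A, B, transactions)) # 2 of 4 computer buyers -> 0.5
```

With these toy numbers, the rule computer ⇒ antivirus has 40% support and 50% confidence; whether it is "strong" depends on the user-specified thresholds.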

Association rule mining Association rule mining can be viewed as a two-step process: 1. Find all frequent itemsets: by definition, each of these itemsets will occur at least as frequently as a predetermined minimum support count, min_sup. 2. Generate strong association rules from the frequent itemsets: by definition, these rules must satisfy minimum support and minimum confidence. Note that if an itemset is frequent, each of its subsets is frequent as well (the Apriori property).
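The second step can be sketched on its own: given a frequent itemset and the support counts gathered in step 1, emit every rule A ⇒ (itemset − A) whose confidence meets min_conf. The support counts below are hypothetical values for illustration only.

```python
from itertools import combinations

# Hypothetical support counts from step 1, for a database of 9 transactions.
support_count = {
    frozenset({"I1"}): 6,
    frozenset({"I2"}): 7,
    frozenset({"I1", "I2"}): 4,
}
min_conf = 0.6

def generate_rules(itemset, support_count, min_conf):
    """Step 2: for every nonempty proper subset A of the frequent itemset,
    keep the rule A => (itemset - A) if its confidence meets min_conf."""
    rules = []
    for r in range(1, len(itemset)):
        for A in combinations(itemset, r):
            A = frozenset(A)
            conf = support_count[itemset] / support_count[A]
            if conf >= min_conf:
                rules.append((A, itemset - A, conf))
    return rules

rules = generate_rules(frozenset({"I1", "I2"}), support_count, min_conf)
for A, B, conf in rules:
    print(sorted(A), "->", sorted(B), round(conf, 2))
```

Here only I1 ⇒ I2 survives (confidence 4/6 ≈ 0.67); the reverse rule I2 ⇒ I1 has confidence 4/7 ≈ 0.57 and is discarded. No extra database scan is needed, since all the required counts were collected in step 1.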

Further Improvement of the Apriori Method Major computational challenges: multiple scans of the transaction database; a huge number of candidates; the tedious workload of support counting for candidates. Improving Apriori, general idea: reduce the number of passes over the transaction database.

The Apriori Algorithm (Pseudo-Code) Ck: candidate itemsets of size k; Lk: frequent itemsets of size k.

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
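The pseudocode above can be rendered as a short runnable Python sketch. The nine-transaction data set and min_support_count = 2 are illustrative choices; the join and prune steps of candidate generation are marked in comments.

```python
from itertools import combinations

def apriori(transactions, min_support_count):
    """Compact rendering of the pseudocode above (Ck = candidates,
    Lk = frequent k-itemsets); returns the union of all Lk."""
    items = sorted({i for t in transactions for i in t})
    # L1: frequent 1-itemsets
    Lk = {frozenset([i]) for i in items
          if sum(1 for t in transactions if i in t) >= min_support_count}
    all_frequent = set(Lk)
    k = 1
    while Lk:
        # join step: candidate (k+1)-itemsets built from pairs in Lk
        Ck1 = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # prune step (Apriori property): every k-subset must itself be frequent
        Ck1 = {c for c in Ck1
               if all(frozenset(s) in Lk for s in combinations(c, k))}
        # one scan of the database to count the surviving candidates
        Lk = {c for c in Ck1
              if sum(1 for t in transactions if c <= t) >= min_support_count}
        all_frequent |= Lk
        k += 1
    return all_frequent

transactions = [{"I1","I2","I5"}, {"I2","I4"}, {"I2","I3"},
                {"I1","I2","I4"}, {"I1","I3"}, {"I2","I3"},
                {"I1","I3"}, {"I1","I2","I3","I5"}, {"I1","I2","I3"}]
result = apriori(transactions, min_support_count=2)
print(sorted(sorted(s) for s in result))
```

On this data the algorithm finds 5 frequent 1-itemsets, 6 frequent 2-itemsets, and 2 frequent 3-itemsets ({I1, I2, I3} and {I1, I2, I5}), after which no candidate 4-itemset survives pruning.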

Improving the Efficiency of Apriori “How can we further improve the efficiency of Apriori-based mining?” Hash-based technique (hashing itemsets into corresponding buckets): a hash-based technique can be used to reduce the size of the candidate k-itemsets, Ck, for k > 1. For example, when scanning each transaction in the database to generate the frequent 1-itemsets, L1, from the candidate 1-itemsets in C1, we can generate all of the 2-itemsets for each transaction, hash (i.e., map) them into the different buckets of a hash table structure, and increase the corresponding bucket counts. A 2-itemset whose bucket count is below the minimum support count cannot be frequent and can therefore be removed from the candidate set.
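A minimal sketch of this hash-based filter follows. The bucket count (7) and the use of Python's built-in hash are illustrative choices; a real implementation (as in the DHP algorithm) would pick a purpose-built hash function.

```python
from itertools import combinations

# While scanning for L1, hash every 2-itemset of each transaction into a
# small bucket table and bump the bucket counter.
transactions = [{"I1","I2","I5"}, {"I2","I4"}, {"I2","I3"},
                {"I1","I2","I4"}, {"I1","I3"}, {"I2","I3"},
                {"I1","I3"}, {"I1","I2","I3","I5"}, {"I1","I2","I3"}]
min_support_count = 2
n_buckets = 7                      # illustrative table size
buckets = [0] * n_buckets

def bucket_of(pair):
    return hash(frozenset(pair)) % n_buckets

for t in transactions:
    for pair in combinations(sorted(t), 2):
        buckets[bucket_of(pair)] += 1

def may_be_frequent(pair):
    """A 2-itemset whose bucket count is below min_support cannot be
    frequent, so it is pruned before the second scan. (Collisions can
    only over-count, so no frequent itemset is ever lost.)"""
    return buckets[bucket_of(pair)] >= min_support_count

print(may_be_frequent(("I1", "I2")))  # True: {I1, I2} occurs 4 times
```

The bucket table is much smaller than the set of all candidate 2-itemsets, which is why this check is cheap relative to full support counting.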

Transaction reduction Transaction reduction (reducing the number of transactions scanned in future iterations): a transaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets. Therefore, such a transaction can be marked or removed, because subsequent scans of the database for j-itemsets, where j > k, will not require it. Partitioning (partitioning the data to find candidate itemsets): a partitioning technique can be used that requires just two database scans to mine the frequent itemsets. It consists of two phases. In Phase I, the algorithm subdivides the transactions of D into n nonoverlapping partitions. If the minimum support threshold for transactions in D is min_sup, then the minimum support count for a partition is min_sup × the number of transactions in that partition. For each partition, all frequent itemsets within the partition are found.
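The transaction-reduction idea is a one-line filter in code. The small transaction list and frequent-itemset set below are made up for illustration.

```python
# After the scan that determines the frequent k-itemsets Lk, drop every
# transaction that contains none of them: such a transaction cannot
# contribute to any frequent (k+1)-itemset in later scans.
transactions = [{"I1", "I2"}, {"I3", "I4"}, {"I1", "I2", "I5"}, {"I4"}]
L2 = {frozenset({"I1", "I2"})}  # illustrative frequent 2-itemsets

reduced = [t for t in transactions
           if any(fk <= t for fk in L2)]
print(reduced)  # keeps only the two transactions containing {I1, I2}
```

Here the second and fourth transactions are dropped, so scans for 3-itemsets touch half as much data.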

Improving the Efficiency of Apriori Sampling (mining on a subset of the given data): the basic idea of the sampling approach is to pick a random sample S of the given data D, and then search for frequent itemsets in S instead of D. In this way, we trade off some degree of accuracy against efficiency. The sample size of S is chosen so that the search for frequent itemsets in S can be done in main memory. Dynamic itemset counting (adding candidate itemsets at different points during a scan): a dynamic itemset counting technique was proposed in which the database is partitioned into blocks marked by start points. The technique is dynamic in that it estimates the support of all of the itemsets that have been counted so far, adding new candidate itemsets if all of their subsets are estimated to be frequent. The resulting algorithm requires fewer database scans than Apriori.
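A sketch of the sampling idea follows. The data, sample size, and thresholds are illustrative; in particular, the sample is typically mined with a threshold lower than min_sup to reduce the chance of missing itemsets that are frequent in D but not in S.

```python
import random

random.seed(0)  # for reproducibility of this sketch
D = [{"I1","I2"}, {"I2","I3"}, {"I1","I2"}, {"I3"},
     {"I1","I2","I3"}, {"I2"}, {"I1","I2"}, {"I2","I3"}]
S = random.sample(D, k=4)   # random sample small enough for main memory
min_sup_D = 0.5             # support threshold on the full database
min_sup_S = 0.4             # lowered threshold used when mining the sample

def support(itemset, data):
    return sum(1 for t in data if itemset <= t) / len(data)

candidate = {"I1", "I2"}
freq_in_sample = support(candidate, S) >= min_sup_S
# Candidates found frequent in S are then verified against D in a single
# full scan, which restores exactness for those candidates.
if freq_in_sample:
    print("frequent in D too:", support(candidate, D) >= min_sup_D)
```

The in-memory search over S is cheap, and only one (or, in the worst case, two) scans of the full database are needed to verify the result.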

DIC: Reduce Number of Scans [Figure: itemset lattice over items A, B, C, D, from {} up to ABCD, contrasting Apriori's level-by-level counting of 1-itemsets, 2-itemsets, and so on with DIC's earlier starts.] Once both A and D are determined frequent, the counting of AD begins. Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins.

Mining Various Kinds of Association Rules This section covers mining multilevel association rules, multidimensional association rules, and quantitative association rules in transactional and/or relational databases. Mining Multilevel Association Rules: at low levels of abstraction it is often difficult to find strong associations among data items, so data mining systems should provide capabilities for mining association rules at multiple levels of abstraction, with sufficient flexibility.

Example The concept hierarchy of Figure 5.10 has five levels, referred to as levels 0 to 4, starting with level 0 at the root node for all (the most general abstraction level). Here, level 1 includes computer, software, printer&camera, and computer accessory; level 2 includes laptop computer, desktop computer, office software, antivirus software, and so on; and level 3 includes IBM desktop computer, Microsoft office software, and so on. Level 4 is the most specific abstraction level of this hierarchy. Association rules generated from mining data at multiple levels of abstraction are called multiple-level or multilevel association rules.

A top-down strategy is employed, where counts are accumulated for the calculation of frequent itemsets at each concept level, starting at concept level 1 and working downward in the hierarchy toward the more specific concept levels, until no more frequent itemsets can be found. For each level, any algorithm for discovering frequent itemsets may be used.

Using uniform minimum support for all levels (referred to as uniform support): the same minimum support threshold is used when mining at each level of abstraction. For example, in Figure 5.11, a minimum support threshold of 5% is used throughout, so the search procedure is simplified. The method is also simple in that users are required to specify only one minimum support threshold.

The uniform support approach, however, has some difficulties. It is unlikely that items at lower levels of abstraction will occur as frequently as those at higher levels of abstraction. If the minimum support threshold is set too high, it could miss some meaningful associations occurring at low abstraction levels. If the threshold is set too low, it may generate many uninteresting associations occurring at high abstraction levels.

Reduced support Using reduced minimum support at lower levels (referred to as reduced support): each level of abstraction has its own minimum support threshold. The deeper the level of abstraction, the smaller the corresponding threshold. For example, in Figure 5.12, the minimum support thresholds for levels 1 and 2 are 5% and 3%, respectively.
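The level-wise thresholds can be sketched as a simple lookup. The item names, observed supports, and hierarchy levels below are made up; the 5% and 3% thresholds mirror the Figure 5.12 example.

```python
# Per-level minimum support: smaller thresholds at deeper, more specific
# levels of the concept hierarchy (reduced support).
min_sup_by_level = {1: 0.05, 2: 0.03}

# Illustrative supports measured for items at each hierarchy level.
observed = {
    ("computer", 1): 0.10,         # level-1 item: passes the 5% bar
    ("laptop computer", 2): 0.04,  # level-2 item: passes the 3% bar,
                                   # but would fail a uniform 5% bar
}

frequent = {item for (item, level), sup in observed.items()
            if sup >= min_sup_by_level[level]}
print(sorted(frequent))  # both items are frequent under reduced support
```

Under a uniform 5% threshold, laptop computer (support 4%) would be discarded; the reduced-support scheme keeps it because the level-2 bar is only 3%.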

Using item- or group-based minimum support Using item- or group-based minimum support (referred to as group-based support): because users or experts often have insight as to which groups are more important than others, it is sometimes more desirable to set up user-specific, item-, or group-based minimum support thresholds when mining multilevel rules. For example, a user could set up the minimum support thresholds based on product price, or on items of interest. Note that the Apriori property may not always hold uniformly across all of the items when mining under reduced support and group-based support.

Mining Multidimensional Association Rules from Relational Databases and Data Warehouses So far we have studied association rules that imply a single predicate, that is, the predicate buys. For instance, in mining our AllElectronics database, we may discover the Boolean association rule buys(X, “digital camera”) ⇒ buys(X, “HP printer”). This is a single-dimensional or intradimensional association rule because it contains a single distinct predicate (e.g., buys) with multiple occurrences.