Apriori algorithm Seminar of Popular Algorithms in Data Mining and Machine Learning, TKK Presentation 12.3.2008 Lauri Lahti.

Similar presentations
Association Rule and Sequential Pattern Mining for Episode Extraction Jonathan Yip.
Mining Association Rules from Microarray Gene Expression Data.
Huffman Codes and Asssociation Rules (II) Prof. Sin-Min Lee Department of Computer Science.
Association Analysis (2). Example TIDList of item ID’s T1I1, I2, I5 T2I2, I4 T3I2, I3 T4I1, I2, I4 T5I1, I3 T6I2, I3 T7I1, I3 T8I1, I2, I3, I5 T9I1, I2,
Frequent Itemset Mining Methods. The Apriori algorithm Finding frequent itemsets using candidate generation Seminal algorithm proposed by R. Agrawal and.
CSE 634 Data Mining Techniques
Data Mining Techniques Association Rule
Association rules and frequent itemsets mining
Mining Multiple-level Association Rules in Large Databases
Association Rules Spring Data Mining: What is it?  Two definitions:  The first one, classic and well-known, says that data mining is the nontrivial.
Association Rule Mining. 2 The Task Two ways of defining the task General –Input: A collection of instances –Output: rules to predict the values of any.
10 -1 Lecture 10 Association Rules Mining Topics –Basics –Mining Frequent Patterns –Mining Frequent Sequential Patterns –Applications.
Data Mining Association Analysis: Basic Concepts and Algorithms
Association Rules l Mining Association Rules between Sets of Items in Large Databases (R. Agrawal, T. Imielinski & A. Swami) l Fast Algorithms for.
Rakesh Agrawal Ramakrishnan Srikant
Association Analysis. Association Rule Mining: Definition Given a set of records each of which contain some number of items from a given collection; –Produce.
Data Mining Techniques So Far: Cluster analysis K-means Classification Decision Trees J48 (C4.5) Rule-based classification JRIP (RIPPER) Logistic Regression.
Data Mining Association Analysis: Basic Concepts and Algorithms Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
1 Fast Algorithms for Mining Association Rules Rakesh Agrawal Ramakrishnan Srikant.
Association Rule Mining Part 2 (under construction!) Introduction to Data Mining with Case Studies Author: G. K. Gupta Prentice Hall India, 2006.
Spring 2003Data Mining by H. Liu, ASU1 5. Association Rules Market Basket Analysis and Itemsets APRIORI Efficient Association Rules Multilevel Association.
Spring 2005CSE 572, CBS 598 by H. Liu1 5. Association Rules Market Basket Analysis and Itemsets APRIORI Efficient Association Rules Multilevel Association.
4/3/01CS632 - Data Mining1 Data Mining Presented By: Kevin Seng.
Association Analysis: Basic Concepts and Algorithms.
Fast Algorithms for Mining Association Rules * CS401 Final Presentation Presented by Lin Yang University of Missouri-Rolla * Rakesh Agrawal, Ramakrishnam.
Association Rules Presented by: Anilkumar Panicker Presented by: Anilkumar Panicker.
6/23/2015CSE591: Data Mining by H. Liu1 Association Rules Transactional data Algorithm Applications.
© Vipin Kumar CSci 8980 Fall CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance Computing Research Center Department of Computer.
Fast Algorithms for Association Rule Mining
1 Fast Algorithms for Mining Association Rules Rakesh Agrawal Ramakrishnan Srikant Slides from Ofer Pasternak.
Association Rule Mining. Mining Association Rules in Large Databases  Association rule mining  Algorithms Apriori and FP-Growth  Max and closed patterns.
Mining Association Rules in Large Databases. What Is Association Rule Mining?  Association rule mining: Finding frequent patterns, associations, correlations,
Efficient Discovery of Concise Association Rules from Large Databases Vikram Pudi IIIT Hyderabad.
1 Apriori Algorithm Review for Finals. SE 157B, Spring Semester 2007 Professor Lee By Gaurang Negandhi.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining By Tan, Steinbach, Kumar Lecture.
Modul 7: Association Analysis. 2 Association Rule Mining  Given a set of transactions, find rules that will predict the occurrence of an item based on.
Association Rules. CS583, Bing Liu, UIC 2 Association rule mining Proposed by Agrawal et al in Initially used for Market Basket Analysis to find.
1 Mining Association Rules Mohamed G. Elfeky. 2 Introduction Data mining is the discovery of knowledge and useful information from the large amounts of.
Expert Systems with Applications 34 (2008) 459–468 Multi-level fuzzy mining with multiple minimum supports Yeong-Chyi Lee, Tzung-Pei Hong, Tien-Chin Wang.
Introduction of Data Mining and Association Rules cs157 Spring 2009 Instructor: Dr. Sin-Min Lee Student: Dongyi Jia.
Fast Algorithms For Mining Association Rules By Rakesh Agrawal and R. Srikant Presented By: Chirayu Modi.
CS 8751 ML & KDDSupport Vector Machines1 Mining Association Rules KDD from a DBMS point of view –The importance of efficiency Market basket analysis Association.
CSE4334/5334 DATA MINING CSE4334/5334 Data Mining, Fall 2014 Department of Computer Science and Engineering, University of Texas at Arlington Chengkai.
Fast Algorithms for Mining Association Rules Rakesh Agrawal and Ramakrishnan Srikant VLDB '94 presented by kurt partridge cse 590db oct 4, 1999.
Privacy-preserving rule mining. Outline  A brief introduction to association rule mining  Privacy preserving rule mining Single party  Perturbation.
Association Rule Mining
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining: Association Analysis This lecture node is modified based on Lecture Notes for.
Mining Frequent Patterns, Associations, and Correlations Compiled By: Umair Yaqub Lecturer Govt. Murray College Sialkot.
CMU SCS : Multimedia Databases and Data Mining Lecture #30: Data Mining - assoc. rules C. Faloutsos.
Chapter 8 Association Rules. Data Warehouse and Data Mining Chapter 10 2 Content Association rule mining Mining single-dimensional Boolean association.
1 Data Mining Lecture 6: Association Analysis. 2 Association Rule Mining l Given a set of transactions, find rules that will predict the occurrence of.
Knowledge discovery & data mining Association rules and market basket analysis--introduction UCLA CS240A Course Notes*
Data Mining II: Association Rule mining & Classification
DIRECT HASHING AND PRUNING (DHP) ALGORITHM
Data Mining Association Rules Assoc.Prof.Songül Varlı Albayrak
Unit 3 MINING FREQUENT PATTERNS ASSOCIATION AND CORRELATIONS
Market Basket Analysis and Association Rules
Presentation transcript:

Apriori algorithm. Seminar of Popular Algorithms in Data Mining and Machine Learning, TKK. Presentation 12.3.2008, Lauri Lahti.

Association rules. Techniques for data mining and knowledge discovery in databases. Five important algorithms in the development of association rules (Yilmaz et al., 2003):
– AIS algorithm (1993)
– SETM algorithm (1995)
– Apriori, AprioriTid and AprioriHybrid (1994)

Apriori algorithm. Developed by Agrawal and Srikant in 1994. An innovative way to find association rules on a large scale, allowing implication outcomes that consist of more than one item. Based on a minimum support threshold (already used in the AIS algorithm). Three versions:
– Apriori (basic version): faster in the first iterations
– AprioriTid: faster in the later iterations
– AprioriHybrid: can change from Apriori to AprioriTid after the first iterations

Limitations of the Apriori algorithm. Needs several iterations over the data. Uses a uniform minimum support threshold. Has difficulties finding rarely occurring events. Alternative methods (other than Apriori) can address this by using a non-uniform minimum support threshold. Some competing alternative approaches focus on partitioning and sampling.

Phases of knowledge discovery (Elmasri and Navathe, 2000):
1) data selection
2) data cleansing
3) data enrichment (integration with additional resources)
4) data transformation or encoding
5) data mining
6) reporting and display (visualization) of the discovered knowledge

Application of data mining. Data mining can typically be used with transactional databases (e.g. in shopping cart analysis). The aim can be to build association rules about the shopping events, based on itemsets such as {milk, cocoa powder} (a 2-itemset) or {milk, corn flakes, bread} (a 3-itemset).

Association rules. Items that often occur together can be associated with each other. These co-occurring items form a frequent itemset. Conclusions based on the frequent itemsets form association rules. For example, {milk, cocoa powder} can yield the rule cocoa powder ⇒ milk.

Sets of the database. Transactional database D. All products form an itemset I = {i1, i2, …, im}. A unique shopping event (transaction) is T ⊆ I. T contains itemset X iff X ⊆ T. Based on itemsets X and Y, an association rule can be X ⇒ Y. It is required that X ⊂ I, Y ⊂ I and X ∩ Y = ∅.
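As a minimal sketch (not from the original slides; the item names are invented for illustration), these definitions map directly onto Python sets, where the subset operator plays the role of "T contains X":

```python
# Illustrative sketch: transactions as sets, itemset containment as subset test.
I = {"milk", "cocoa powder", "bread", "corn flakes"}  # all products

# Transactional database D: each element is one shopping event T ⊆ I
D = [
    {"milk", "cocoa powder"},
    {"milk", "corn flakes", "bread"},
    {"milk", "bread"},
]

X = {"milk", "cocoa powder"}  # an itemset of interest

# T contains itemset X iff X is a subset of T
contains = [X <= T for T in D]
print(contains)  # [True, False, False]
```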

Properties of rules. Types of item values: boolean, quantitative, categorical. Dimensions of rules:
– 1D: buys(cocoa powder) ⇒ buys(milk)
– 3D: age(X, 'under 12') ∧ gender(X, 'male') ⇒ buys(X, 'comic book')
The latter is an example of a profile association rule. Intradimension rules, interdimension rules, hybrid-dimension rules (Han and Kamber, 2001). Concept hierarchies and multilevel association rules.

Quality of rules. The interestingness problem (Liu et al., 1999):
– some generated rules can be self-evident
– some marginal events can dominate
– interesting events can be rarely occurring
There is a need to estimate how interesting the rules are. Subjective and objective measures.

Subjective measures. Often based on earlier user experiences and beliefs. Unexpectedness: rules are interesting if they are unknown or contradict existing knowledge (or expectations). Actionability: rules are interesting if users can gain an advantage by acting on them. Weak and strong beliefs.

Objective measures. Based on threshold values controlled by the user. Some typical measures (Han and Kamber, 2001):
– simplicity
– support (utility)
– confidence (certainty)

Simplicity. Focus on generating simple association rules. The length of a rule can be limited by a user-defined threshold. With smaller itemsets the interpretation of rules is more intuitive. Unfortunately this can increase the number of rules too much. Quantitative values can be quantized (e.g. into age groups).

Simplicity, example. One association rule that links cocoa powder and milk: buys(cocoa powder) ⇒ buys(bread, milk, salt). Simpler and more intuitive might be: buys(cocoa powder) ⇒ buys(milk).

Support (utility). The usefulness of a rule can be measured with a minimum support threshold. This parameter measures how many events contain itemsets that match both sides of the implication in the association rule. Rules for events whose itemsets do not match both sides sufficiently often (as defined by the threshold value) can be excluded.

Support (utility) (2). Database D consists of events T1, T2, …, Tm, that is D = {T1, T2, …, Tm}. Let there be an itemset X that is a subset of an event Tk, that is X ⊆ Tk. The support can be defined as

sup(X) = |{Tk ∈ D | X ⊆ Tk}| / |D|

This ratio compares the number of events containing itemset X to the number of all events in the database.

Support (utility), example. Let's assume D = {(1,2,3), (2,3,4), (1,2,4), (1,2,5), (1,3,5)}. The support for itemset (1,2) is

sup((1,2)) = |{Tk ∈ D | (1,2) ⊆ Tk}| / |D| = 3/5

That is: the ratio of the number of events containing itemset (1,2) to the number of all events in the database.
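The support value above can be checked with a few lines of Python (an illustrative sketch, not part of the slides; `Fraction` is used only so the result prints as an exact ratio):

```python
from fractions import Fraction

def support(D, X):
    """sup(X) = |{T in D : X ⊆ T}| / |D|, returned as an exact fraction."""
    X = set(X)
    return Fraction(sum(1 for T in D if X <= T), len(D))

# the database from the example slide
D = [{1, 2, 3}, {2, 3, 4}, {1, 2, 4}, {1, 2, 5}, {1, 3, 5}]
print(support(D, {1, 2}))  # 3/5
```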

Confidence (certainty). The certainty of a rule can be measured with a threshold for confidence. This parameter measures how often an event whose itemset matches the left side of the implication in the association rule also matches the right side. Rules whose itemsets do not match the right side sufficiently often while matching the left side (as defined by the threshold value) can be excluded.

Confidence (certainty) (2). Database D consists of events T1, T2, …, Tm, that is: D = {T1, T2, …, Tm}. Let there be a rule Xa ⇒ Xb so that itemsets Xa and Xb are subsets of an event Tk, that is: Xa ⊆ Tk ∧ Xb ⊆ Tk. Also let Xa ∩ Xb = ∅. The confidence can be defined as

conf(Xa ⇒ Xb) = sup(Xa ∪ Xb) / sup(Xa)

This ratio compares the number of events containing both itemsets Xa and Xb to the number of events containing itemset Xa.

Confidence (certainty), example. Let's assume D = {(1,2,3), (2,3,4), (1,2,4), (1,2,5), (1,3,5)}. The confidence for the rule 1 ⇒ 2 is

conf(1 ⇒ 2) = sup((1,2)) / sup((1)) = (3/5) / (4/5) = 3/4

That is: the ratio of the number of events containing both itemsets (1) and (2) to the number of events containing itemset (1).
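The confidence value can likewise be verified in Python (an illustrative sketch, not from the slides; exact fractions keep the arithmetic identical to the slide):

```python
from fractions import Fraction

def support(D, X):
    """sup(X) = |{T in D : X ⊆ T}| / |D|."""
    X = set(X)
    return Fraction(sum(1 for T in D if X <= T), len(D))

def confidence(D, Xa, Xb):
    """conf(Xa => Xb) = sup(Xa ∪ Xb) / sup(Xa)."""
    return support(D, set(Xa) | set(Xb)) / support(D, Xa)

D = [{1, 2, 3}, {2, 3, 4}, {1, 2, 4}, {1, 2, 5}, {1, 3, 5}]
print(confidence(D, {1}, {2}))  # 3/4
```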

Support and confidence. If confidence reaches 100 %, the rule is an exact rule. Even if confidence reaches high values, the rule is not useful unless the support value is high as well. Rules that have both high confidence and high support are called strong rules. Some competing alternative approaches (other than Apriori) can generate useful rules even with low support values.

Generating association rules. This usually consists of two subproblems (Han and Kamber, 2001): 1) finding the frequent itemsets whose occurrences exceed a predefined minimum support threshold, and 2) deriving association rules from those frequent itemsets (with the constraint of a minimum confidence threshold). These two subproblems are solved iteratively until no new rules emerge. The second subproblem is quite straightforward, and most of the research focus is on the first subproblem.
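The second subproblem can be sketched as follows (illustrative Python, not from the slides; `rules_from_itemset` is an invented helper name): enumerate every split of a frequent itemset F into a left side Xa and right side Xb, and keep the splits whose confidence meets the threshold.

```python
from fractions import Fraction
from itertools import combinations

def support(D, X):
    X = set(X)
    return Fraction(sum(1 for T in D if X <= T), len(D))

def confidence(D, Xa, Xb):
    return support(D, set(Xa) | set(Xb)) / support(D, Xa)

def rules_from_itemset(D, F, min_conf):
    """Derive rules Xa => Xb with Xa ∪ Xb = F and Xa ∩ Xb = ∅ (subproblem 2)."""
    rules = []
    for r in range(1, len(F)):               # every non-trivial split of F
        for left in combinations(sorted(F), r):
            Xa = frozenset(left)
            Xb = frozenset(F) - Xa
            if confidence(D, Xa, Xb) >= min_conf:
                rules.append((set(Xa), set(Xb)))
    return rules

D = [{1, 2, 3}, {2, 3, 4}, {1, 2, 4}, {1, 2, 5}, {1, 3, 5}]
print(rules_from_itemset(D, {1, 2}, 0.7))  # [({1}, {2}), ({2}, {1})]
```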

Use of the Apriori algorithm. Initial information: the transactional database D and a user-defined numeric minimum support threshold min_sup. The algorithm uses knowledge from the previous iteration phase to produce frequent itemsets. This is reflected in the Latin origin of the name, which means "from what comes before".

Creating frequent sets. Let's define: Ck as a candidate itemset of size k, and Lk as a frequent itemset of size k. The main steps of an iteration are:
1) Find the frequent set Lk-1.
2) Join step: Ck is generated by joining Lk-1 with itself (Cartesian product Lk-1 × Lk-1).
3) Prune step (Apriori property): any (k−1)-size itemset that is not frequent cannot be a subset of a frequent k-size itemset, and hence should be removed.
4) The frequent set Lk has been achieved.
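The join and prune steps can be sketched in Python (an illustrative function, not from the slides; `apriori_gen` follows the naming in Agrawal and Srikant's paper):

```python
from itertools import combinations

def apriori_gen(Lk1, k):
    """One iteration's candidate generation.
    Lk1 is the set of frequent (k-1)-itemsets, each a frozenset."""
    # join step: unions of frequent (k-1)-itemsets that yield a k-itemset
    joined = {a | b for a in Lk1 for b in Lk1 if len(a | b) == k}
    # prune step (Apriori property): every (k-1)-subset of a candidate
    # must itself be frequent, otherwise the candidate is removed
    return {c for c in joined
            if all(frozenset(s) in Lk1 for s in combinations(c, k - 1))}

L2 = {frozenset({1, 2}), frozenset({1, 3}), frozenset({2, 3}), frozenset({1, 4})}
print(apriori_gen(L2, 3))  # only {1, 2, 3} survives the prune step
```

For example, {1, 2, 4} is generated by the join step but pruned because its subset {2, 4} is not in L2.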

Creating frequent sets (2). The algorithm uses breadth-first search and a hash tree structure to generate candidate itemsets efficiently. Then the occurrence frequency of each candidate itemset is counted. Those candidate itemsets that meet the minimum support threshold qualify as frequent itemsets.

Apriori algorithm in pseudocode.

L1 = {frequent 1-itemsets};
for (k = 2; Lk-1 ≠ ∅; k++) do begin
    Ck = candidates generated from Lk-1
         (that is: the Cartesian product Lk-1 × Lk-1, eliminating any
         candidate that has a (k−1)-size subset that is not frequent);
    for each transaction t in the database do
        increment the count of all candidates in Ck that are contained in t;
    Lk = candidates in Ck with support ≥ min_sup
end
return ∪k Lk;
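A runnable version of this pseudocode might look as follows (a sketch, not the slides' own code; it assumes a relative min_sup in [0, 1] and uses plain sets rather than a hash tree):

```python
from itertools import combinations

def apriori(D, min_sup):
    """D: list of transactions (sets of items); min_sup: relative threshold."""
    n = len(D)

    def frequent(cands):
        # keep candidates whose support meets the threshold
        return {c for c in cands if sum(1 for T in D if c <= T) / n >= min_sup}

    items = {frozenset({i}) for T in D for i in T}
    L = frequent(items)                 # L1 = {frequent 1-itemsets}
    result = set(L)
    k = 2
    while L:                            # loop while Lk-1 is non-empty
        # join step: Cartesian product of Lk-1 with itself
        cands = {a | b for a in L for b in L if len(a | b) == k}
        # prune step: every (k-1)-subset of a candidate must be frequent
        cands = {c for c in cands
                 if all(frozenset(s) in L for s in combinations(c, k - 1))}
        L = frequent(cands)             # Lk
        result |= L
        k += 1
    return result                       # union of all Lk

D = [{1, 2, 3}, {2, 3, 4}, {1, 2, 4}, {1, 2, 5}, {1, 3, 5}]
print(apriori(D, 0.6))  # frequent itemsets: {1}, {2}, {3} and {1, 2}
```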

Apriori algorithm in pseudocode (2). [Figure: join step and prune step (Wikipedia)]