Data Mining CSCI 307, Spring 2019, Lecture 18: Association Rules

Mining Association Rules

Naive method for finding association rules: use the separate-and-conquer method, treating every possible combination of attribute values as a separate class.

Two problems:
- Computational complexity
- The resulting number of rules (which would have to be pruned on the basis of support/coverage and confidence/accuracy)

But: we can look for high-support/coverage rules directly!

Item Sets

Support/Coverage: the number of instances correctly covered by an association rule. This is the same as the number of instances covered by all the tests in the rule (LHS and RHS!).

Item: an attribute-value pair
Item set: all items occurring in a rule

Goal: only rules that exceed a pre-defined support
==> Do it by finding all item sets with the given minimum support and generating rules from them! (A small counting sketch follows the weather table below.)

Weather Data

Outlook    Temperature  Humidity  Windy  Play
Sunny      Hot          High      False  No
Sunny      Hot          High      True   No
Overcast   Hot          High      False  Yes
Rainy      Mild         High      False  Yes
Rainy      Cool         Normal    False  Yes
Rainy      Cool         Normal    True   No
Overcast   Cool         Normal    True   Yes
Sunny      Mild         High      False  No
Sunny      Cool         Normal    False  Yes
Rainy      Mild         Normal    False  Yes
Sunny      Mild         Normal    True   Yes
Overcast   Mild         High      True   Yes
Overcast   Hot          Normal    False  Yes
Rainy      Mild         High      True   No
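
To make the definition of support concrete, here is a minimal Python sketch (illustrative code, not from the lecture; the names support and weather are chosen here) that counts how many rows of the table above are covered by a given item set:

    # Minimal sketch: counting the support (coverage) of an item set
    # over the weather data above.
    weather = [
        {"outlook": "sunny",    "temperature": "hot",  "humidity": "high",   "windy": False, "play": "no"},
        {"outlook": "sunny",    "temperature": "hot",  "humidity": "high",   "windy": True,  "play": "no"},
        {"outlook": "overcast", "temperature": "hot",  "humidity": "high",   "windy": False, "play": "yes"},
        {"outlook": "rainy",    "temperature": "mild", "humidity": "high",   "windy": False, "play": "yes"},
        {"outlook": "rainy",    "temperature": "cool", "humidity": "normal", "windy": False, "play": "yes"},
        {"outlook": "rainy",    "temperature": "cool", "humidity": "normal", "windy": True,  "play": "no"},
        {"outlook": "overcast", "temperature": "cool", "humidity": "normal", "windy": True,  "play": "yes"},
        {"outlook": "sunny",    "temperature": "mild", "humidity": "high",   "windy": False, "play": "no"},
        {"outlook": "sunny",    "temperature": "cool", "humidity": "normal", "windy": False, "play": "yes"},
        {"outlook": "rainy",    "temperature": "mild", "humidity": "normal", "windy": False, "play": "yes"},
        {"outlook": "sunny",    "temperature": "mild", "humidity": "normal", "windy": True,  "play": "yes"},
        {"outlook": "overcast", "temperature": "mild", "humidity": "high",   "windy": True,  "play": "yes"},
        {"outlook": "overcast", "temperature": "hot",  "humidity": "normal", "windy": False, "play": "yes"},
        {"outlook": "rainy",    "temperature": "mild", "humidity": "high",   "windy": True,  "play": "no"},
    ]

    def support(item_set, instances):
        """An item set is a dict of attribute -> value pairs; its support
        is the number of instances that match every item in the set."""
        return sum(all(row[a] == v for a, v in item_set.items()) for row in instances)

    print(support({"humidity": "normal", "play": "yes"}, weather))  # -> 6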

Sample of Item Sets for Weather Data

One-item sets:    Outlook = Sunny (5)
                  Temperature = Cool (4)
                  ...
Two-item sets:    Outlook = Sunny  Temperature = Hot (2)
                  Outlook = Sunny  Humidity = High (3)
                  ...
Three-item sets:  Outlook = Sunny  Temperature = Hot  Humidity = High (2)
                  Outlook = Sunny  Humidity = High  Windy = False (2)
                  ...
Four-item sets:   Outlook = Sunny  Temperature = Hot  Humidity = High  Play = No (2)
                  Outlook = Rainy  Temperature = Mild  Windy = False  Play = Yes (2)
                  ...

In total (with minimum support/coverage of two):
12 one-item sets (3 values for outlook, 3 for temperature, 2 for humidity, 2 for windy, and 2 for play)
47 two-item sets
39 three-item sets
6 four-item sets
0 five-item sets

Sample Item Sets: Weather Data

Consider the two-item set Humidity = Normal and Play = Yes. It is covered by six of the fourteen instances in the weather table above, so its support is 6!

Generating Potential Rules from an Item Set

Once all item sets with minimum support have been generated, turn them into rules.

Example (a 3-item set): humidity = normal, windy = false, play = yes

Seven (2^N - 1, with N = 3) potential rules:

if humidity = normal and windy = false then play = yes
if humidity = normal and play = yes then windy = false
if windy = false and play = yes then humidity = normal
if humidity = normal then windy = false and play = yes
if windy = false then humidity = normal and play = yes
if play = yes then humidity = normal and windy = false
if true then humidity = normal and windy = false and play = yes
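
As a sketch of this enumeration (illustrative code, not from the lecture; potential_rules is a name chosen here): every non-empty subset of the item set can serve as the consequent, with the remaining items as the antecedent, which gives the 2^N - 1 count.

    # Illustrative sketch: enumerate all 2^N - 1 candidate rules from an
    # N-item set by choosing every non-empty subset as the consequent.
    from itertools import combinations

    def potential_rules(item_set):
        items = sorted(item_set)
        for r in range(1, len(items) + 1):
            for consequent in combinations(items, r):
                antecedent = [i for i in items if i not in consequent]
                # an empty antecedent becomes the trivial "if true then ..."
                yield (tuple(antecedent) or ("true",), consequent)

    for lhs, rhs in potential_rules({"humidity=normal", "windy=false", "play=yes"}):
        print("if", " and ".join(lhs), "then", " and ".join(rhs))  # 7 rules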

Generating Potential Rules from an Item Set, continued

Checking each of the seven rules against the weather table gives its coverage and accuracy (the item set itself covers 4 instances):

if humidity = normal and windy = false then play = yes (4/4)
if humidity = normal and play = yes then windy = false (4/6)
if windy = false and play = yes then humidity = normal (4/6)
if humidity = normal then windy = false and play = yes (4/7)
if windy = false then humidity = normal and play = yes (4/8)
if play = yes then humidity = normal and windy = false (4/9)
if true then humidity = normal and windy = false and play = yes (4/14)

Only the first rule is 100% accurate.

Another Example (a 2-item set)

temperature = cool, humidity = normal (support 4)

Three (2^N - 1, with N = 2) potential rules, with coverage/accuracy from the weather table:

if humidity = normal then temperature = cool (4/7)
if temperature = cool then humidity = normal (4/4)
if true then temperature = cool and humidity = normal (4/14)

Multiple Rules from the Same Set

One of the six 4-item sets:

temperature = cool, humidity = normal, windy = false, play = yes

Three resulting rules (all with 100% confidence):

temperature = cool, windy = false ==> humidity = normal, play = yes
temperature = cool, windy = false, humidity = normal ==> play = yes
temperature = cool, windy = false, play = yes ==> humidity = normal

These hold because the following subset item sets are also "frequent" (i.e. their coverage is 2, the same as the 4-item set's):

temperature = cool, windy = false
temperature = cool, humidity = normal, windy = false
temperature = cool, windy = false, play = yes

Rules for Weather Data

Rules with support > 1 and confidence = 100%:

     Association rule                                   Sup.  Conf.
  1  Humidity=Normal Windy=False  ⇒  Play=Yes            4    100%
  2  Temperature=Cool  ⇒  Humidity=Normal                4    100%
  3  Outlook=Overcast  ⇒  Play=Yes                       4    100%
  4  Temperature=Cool Play=Yes  ⇒  Humidity=Normal       3    100%
 ...
 58  Outlook=Sunny Temperature=Hot  ⇒  Humidity=High     2    100%

In total:
3 rules with support four
5 with support three
50 with support two

Generating Item Sets Efficiently

How can we efficiently find all frequent item sets?

Finding one-item sets is easy.

Idea: use the one-item sets to generate two-item sets, use the two-item sets to generate three-item sets, ...

If (A B) is a frequent item set, then (A) and (B) have to be frequent item sets as well!

In general: if X is a frequent k-item set, then all (k-1)-item subsets of X are also frequent.

==> Compute k-item sets by merging (k-1)-item sets (a sketch of this merging step follows).
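
A minimal Python sketch of this candidate generation (illustrative, not the lecture's code; candidate_k_item_sets is a name chosen here): merge (k-1)-item sets that share their first k-2 items, and keep a candidate only if all of its (k-1)-item subsets are frequent.

    # Illustrative sketch of Apriori-style candidate generation:
    # merge (k-1)-item sets sharing their first k-2 items, then
    # prune any candidate with an infrequent (k-1)-item subset.
    from itertools import combinations

    def candidate_k_item_sets(frequent, k):
        """frequent: set of lexicographically sorted (k-1)-tuples."""
        candidates = set()
        for a in frequent:
            for b in frequent:
                if a[:k-2] == b[:k-2] and a[k-2] < b[k-2]:
                    cand = a + (b[k-2],)
                    # subset property: every (k-1)-subset must be frequent
                    if all(s in frequent for s in combinations(cand, k-1)):
                        candidates.add(cand)
        return candidates

    two_sets = {("F", "G"), ("F", "H"), ("F", "I"), ("G", "H")}
    print(candidate_k_item_sets(two_sets, 3))  # -> {('F', 'G', 'H')}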

Example

To find the candidate two-item sets, take all the one-item sets and combine them into pairs. To find the candidate three-item sets, take all the two-item sets, "pair" them, and then check that all of each candidate's two-item subsets exist.

Given: four two-item sets (ordered lexicographically):

(F G), (F H), (F I), (G H)

Candidate three-item sets: only (F G H). Merging (F G) and (F H) gives (F G H), whose remaining two-item subset (G H) is frequent; (F G I) and (F H I) are rejected because (G I) and (H I) are not frequent.

Example

Given: five three-item sets (ordered lexicographically):

(A B C), (A B D), (A C D), (A C E), (B C D)

Since the sets are lexicographically ordered, we only need to consider pairs with the same first two members.

Candidate four-item sets:
(A B C D), from (A B C) and (A B D): kept, because (A C D) and (B C D) are also frequent.
(A C D E), from (A C D) and (A C E): rejected, because (C D E) is not frequent.

A final check determines whether each surviving candidate really has minimum coverage, by counting its instances in the dataset!

Generating Rules Efficiently

We are looking for all high-confidence rules.

The support of each antecedent can be obtained from a hash table, but the brute-force method of checking all 2^N - 1 possible rules for an item set is expensive.

Better way: build (c+1)-consequent rules from c-consequent ones.

Observation: a (c+1)-consequent rule can only hold if all of the corresponding c-consequent rules also hold.

Example

1-consequent rules:

if outlook = sunny and windy = false and play = no then humidity = high (2/2)
if humidity = high and windy = false and play = no then outlook = sunny (2/2)

Corresponding 2-consequent rule:

if windy = false and play = no then outlook = sunny and humidity = high (2/2)

A sketch of this step appears below.
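
A minimal sketch of growing consequents (illustrative; grow_consequents is a name chosen here, and it reuses support() and weather from the earlier sketch): merge two accepted c-item consequents of the same item set that differ in one item, and keep the (c+1)-consequent rule only if it still clears the confidence threshold.

    # Illustrative sketch: grow (c+1)-consequent rules from c-consequent
    # ones (reuses support() and weather from the earlier sketch).
    def grow_consequents(item_set, c_consequents, min_conf=1.0):
        """c_consequents: frozensets of attribute names, each a c-item
        consequent whose rule already meets min_conf."""
        total = support(item_set, weather)
        grown = []
        for i, a in enumerate(c_consequents):
            for b in c_consequents[i+1:]:
                consequent = a | b
                if len(consequent) != len(a) + 1:
                    continue  # consequents must differ in exactly one item
                antecedent = {k: v for k, v in item_set.items() if k not in consequent}
                # confidence = support(whole item set) / support(antecedent)
                if total / support(antecedent, weather) >= min_conf:
                    grown.append(consequent)
        return grown

    item_set = {"outlook": "sunny", "windy": False, "play": "no", "humidity": "high"}
    one_consequents = [frozenset({"humidity"}), frozenset({"outlook"})]
    print(grow_consequents(item_set, one_consequents))
    # -> [frozenset({'humidity', 'outlook'})]  (element order may vary)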

Association Rules: Discussion

The above method makes one pass through the data for each different size of item set.

Other possibility: generate (k+2)-item sets just after the (k+1)-item sets have been generated. Result: more (k+2)-item sets than necessary will be considered, but fewer passes through the data. This makes sense if the data are too large for main memory.

Practical issue: generating a certain number of rules, e.g. by incrementally reducing the minimum support/coverage (a sketch follows).
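
One way to realize this in practice (an illustrative sketch; find_rules is a hypothetical stand-in for the whole item-set and rule generation pipeline above): start with a high minimum support and lower it step by step until enough rules are produced.

    # Illustrative sketch: relax the minimum support until a desired
    # number of rules is found. find_rules() is a hypothetical stand-in
    # for the full item-set generation + rule generation pipeline.
    def mine_n_rules(instances, n_rules, find_rules, start=1.0, delta=0.05, floor=0.1):
        min_support, rules = start, []
        while min_support >= floor:
            rules = find_rules(instances, min_support)
            if len(rules) >= n_rules:
                return rules[:n_rules]
            min_support -= delta  # not enough rules yet: relax the threshold
        return rules  # best effort once the floor is reached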

Other Issues

The standard ARFF format is very inefficient for typical market basket data: the attributes represent the items in a basket, and most items are usually missing. Such data should be represented in a sparse format. (Instances are also called transactions.)

Confidence/accuracy is not necessarily the best measure. Example: milk occurs in almost every supermarket transaction, so almost any rule predicting milk will have high confidence regardless of its antecedent.
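
For illustration, here is a small sketch of WEKA's sparse ARFF representation (the relation and attribute names are made up): each instance lists only index value pairs, and omitted attributes are treated as 0, not as missing values.

    @relation basket
    @attribute milk   {0, 1}
    @attribute bread  {0, 1}
    @attribute butter {0, 1}
    @attribute beer   {0, 1}
    @data
    {0 1, 1 1}    % milk and bread bought; butter and beer omitted (= 0)
    {2 1, 3 1}    % butter and beer bought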