M. Sulaiman Khan Dept. of Computer Science University of Liverpool 2009 COMP527: Data Mining ARM: Advanced Techniques March 11, 2009.


Slide 1: ARM: Advanced Techniques

Slide 2: Course Outline
Introduction to the Course; Introduction to Data Mining; Introduction to Text Mining; General Data Mining Issues; Data Warehousing; Classification: Challenges, Basics; Classification: Rules; Classification: Trees; Classification: Trees 2; Classification: Bayes; Classification: Neural Networks; Classification: SVM; Classification: Evaluation; Classification: Evaluation 2; Regression, Prediction; Input Preprocessing; Attribute Selection; Association Rule Mining; ARM: Apriori and Data Structures; ARM: Improvements; ARM: Advanced Techniques; Clustering: Challenges, Basics; Clustering: Improvements; Clustering: Advanced Algorithms; Hybrid Approaches; Graph Mining, Web Mining; Text Mining: Challenges, Basics; Text Mining: Text-as-Data; Text Mining: Text-as-Language; Revision for Exam

Slide 3: Today's Topics
- Parallelization
- Constraints
- Multi-Level Rule Mining
- Other Issues

Slide 4: Parallelization
Task-based distribution vs. data-based distribution of processing. Data parallelism divides the database into partitions, one for each node. Task parallelism has each node count a different candidate set (e.g. node 1 counts the 1-itemsets, node 2 counts the 2-itemsets, and so on). Main advantage: by using multiple machines we can avoid database scans, as there is more memory to use -- the total size of all candidates is more likely to fit into the combined memory of N machines.

Slide 5: The Count Distribution Algorithm: Data Parallelism
The database is divided into N partitions. Each partition can have a different number of records, depending on the capabilities of its node. Each node counts the candidates against its own partition, then broadcasts its counts to all other nodes. As the counts are received, each node sums them into global support counts, which it then uses to determine the candidates for the next level (e.g. from 2-itemsets to 3-itemsets).

Slide 6: Count Distribution
Rough pseudo-code for the CDA approach, running at each processor p:

while potential frequent itemsets remain:
    using partition Dp of database D, count supports in Dp
    broadcast the local counts
    on receive(counts): globalCounts += counts
    determine candidates for level k+1
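The loop above can be simulated in a single process: each "node" counts candidates against its own partition, and summing the local counters stands in for the broadcast/receive step. Partition contents and candidate sets here are illustrative, not from the slides.

```python
from collections import Counter

def local_counts(partition, candidates):
    """Count each candidate itemset's support within one node's partition."""
    counts = Counter()
    for transaction in partition:
        t = set(transaction)
        for cand in candidates:
            if set(cand) <= t:  # candidate is contained in the transaction
                counts[cand] += 1
    return counts

# Simulated database split across two "nodes" (data parallelism).
partitions = [
    [("A", "B"), ("A", "C")],       # node 1's partition
    [("A", "B", "C"), ("B", "C")],  # node 2's partition
]
candidates = [("A",), ("B",), ("C",), ("A", "B")]

# Each node counts locally, then "broadcasts"; here we just sum the counters.
global_counts = Counter()
for part in partitions:
    global_counts.update(local_counts(part, candidates))

print(global_counts[("A", "B")])  # 2: support count of {A,B} across all partitions
```

In the real algorithm only the (small) count vectors cross the network, never the partitions themselves, which is exactly why CDA is cheaper to communicate than the task-parallel variant on the next slide.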

Slide 7: Task Parallelism
The candidates, as well as the database, are distributed amongst the processors. Each processor counts the candidates given to it, using the database subset given to it. Each processor then broadcasts its database partition to the other processors for use in the global count; the counts are broadcast again, so that each processor can find the globally frequent itemsets. The candidates for the next level are then shared amongst the available processors. Yes, that's a lot of broadcasting, which is a lot of network traffic, which is a lot of SLOW! (Not going to go through the algorithm for this.)

Slide 8: Constraints
Constrained association rule mining simply means setting more rules up front as to what counts as an interesting rule. For example:
- Statistics: support, confidence, lift, correlation
- Data: specify task-relevant data to include in transactions
- Dimensions: dimensions of hierarchical data to be used (next time)
- Meta-rules: the form of the useful rules to be found

Slide 9: Meta-Rules
Examples:
- Rule templates
- Max/min number of predicates in the antecedent/consequent
- Types of relationship among attributes or attribute values

E.g. interested only in pairs of attributes for a customer that buys a certain type of item:
P(x,a) AND Q(x,b) => R(x,c)
e.g. age(x, ) AND income(x, 20k..30k) => buys(x, computer)
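A meta-rule like P(x,a) AND Q(x,b) => R(x,c) is just a shape test applied to candidate rules before (or after) mining. A minimal sketch, with a hypothetical rule representation of (antecedent predicates, consequent predicates):

```python
def matches_template(rule, n_antecedent=2, n_consequent=1):
    """Keep only rules whose form matches the meta-rule template:
    exactly n_antecedent predicates implying n_consequent predicates."""
    antecedent, consequent = rule
    return len(antecedent) == n_antecedent and len(consequent) == n_consequent

# age(x, 20k..30k-style predicates as (attribute, value-range) pairs.
rule = ((("age", "20..29"), ("income", "20k..30k")),
        (("buys", "computer"),))
print(matches_template(rule))  # True: two antecedent predicates, one consequent
```

In practice the template is pushed into candidate generation rather than applied as a post-filter, so non-matching rules are never generated at all.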

Slide 10: Item Level Thresholds
Or, the rare item problem. If you have a very rare item, its support may fall well below the minimum support given for an interesting rule. For example, 48" plasma TVs are sold very infrequently, but rules about them could be interesting, especially if they made it more likely for someone to buy a big TV. Solution: multiple minimum support thresholds. Simply give rare items a lower threshold than the rest of the dataset -- which could be extended out to one threshold per item...

Slide 11: MIS-Apriori
Minimum Item Support Apriori. The minimum support required for an itemset is the minimum support of any item in the itemset. This breaks our lovely Apriori downward closure principle :(
E.g. minimum supports: {A 20%, B 3%, C 4%}; actual supports: {A 18%, B 4%, C 3%}. A is infrequent, but AB is frequent, because the threshold for AB is 3% and both A and B meet that threshold.
Solution: sort items by ascending MIS value; candidate generation then only looks at items which come after the current one in this list.
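The slide's numbers can be checked directly: an itemset's threshold is the smallest MIS among its items, and the fix is to order items by ascending MIS before generating candidates. A small sketch using the example values above:

```python
# Minimum item supports (MIS) from the slide: rare items get lower thresholds.
mis = {"A": 0.20, "B": 0.03, "C": 0.04}
actual = {"A": 0.18, "B": 0.04, "C": 0.03}

def itemset_min_support(itemset, mis):
    """Threshold for an itemset = the smallest MIS of any item in it."""
    return min(mis[item] for item in itemset)

# Items sorted by ascending MIS; candidate generation walks this order,
# only extending an itemset with items that appear later in the list.
order = sorted(mis, key=mis.get)
print(order)  # ['B', 'C', 'A']

# A fails its own 20% threshold, yet {A, B} only needs min(20%, 3%) = 3%:
print(itemset_min_support(("A", "B"), mis))  # 0.03
```

This is exactly why plain downward closure fails here: the superset {A, B} is judged against a lower bar than the subset {A}.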

Slide 12: Multi-Level Rule Mining
Our examples have been supermarket baskets. But you don't buy 'bread'; you buy a certain brand of bread, with a certain flavour and thickness -- e.g. White Warburton's Toast bread, or a 2-litre bottle of Tesco's semi-skimmed milk, not 'milk'. We could compact all of the 'milks' and 'breads' together before data mining, but what if buying 'white bread' and 'semi-skimmed milk' together is an interesting rule, as compared to 'skim milk' and 'whole grain bread'? Or Tesco's milk and Tesco's bread? Or... We need a hierarchy of products to capture these different levels.

Slide 13: Multi-Level Rule Mining
We could have a large tree of how the products inter-relate:

All Products
├── Bread
│   ├── White
│   │   └── White/Toast
│   └── Brown
│       └── Brown/Tesco
└── Milk
    ├── Whole
    │   └── Whole/2Litre
    └── Semi-skim

Slide 14: Multi-Level Rule Mining
We can count support for the items at the bottom level and propagate them upwards, or count each level for frequency in a top-down approach. Note that what we really need is some sort of clever cluster system with different axes: bread has colour, size, brand, thickness... milk, on the other hand, has size, brand, skimmed-ness... beer has a totally different set of properties. But maybe those axes share values: Tesco has a milk range and a bread range... but not a beer range... Let's leave that alone :)
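The bottom-up option can be sketched simply: count supports at the leaves, then add each leaf's count to every ancestor on its path to the root. The hierarchy and counts below are illustrative stand-ins for the slide's bread/milk tree.

```python
from collections import Counter

# Hypothetical product hierarchy: each item maps to its parent category.
parent = {
    "white_toast_bread": "white_bread",
    "white_bread": "bread",
    "semi_skim_milk": "milk",
}

def ancestors(item):
    """All higher-level categories above a leaf item, nearest first."""
    result = []
    while item in parent:
        item = parent[item]
        result.append(item)
    return result

# Leaf-level support counts, propagated upward to every ancestor.
leaf_counts = Counter({"white_toast_bread": 5, "semi_skim_milk": 7})
total = Counter(leaf_counts)
for item, n in leaf_counts.items():
    for anc in ancestors(item):
        total[anc] += n

print(total["bread"])  # 5: every white toast bread sale also counts as a bread sale
```

Note that an ancestor's support is always at least that of any descendant, which is what makes the per-level reduced thresholds on the next slide workable.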

Slide 15: Multi-Level Rule Mining
To avoid the rare item problem, each level in the tree could have a reduced minimum support threshold. E.g. level 1 could be 8%; level 2 (more specific) needs a lower threshold of 5%; then 3%, 2%, etc. (And in our graph, it would be path distance rather than tree level.) We need some search strategies to crawl the tree in comparison to the transaction database.

Slide 16: Multi-Level Rule Mining
- Level-by-level independent: full breadth search. May examine a lot of infrequent items!
- Cross filtering by itemset: a k-itemset at level i is examined only if the corresponding k-itemset at level i-1 is frequent. Might filter out valuable patterns (e.g. the 20%/3% issue).
- Cross filtering by item: an item at level i will only be examined if its parent node at level i-1 is frequent. A compromise between the previous two.
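Cross filtering by item combines the per-level thresholds from the previous slide with parent-based pruning. A sketch under assumed supports and thresholds (all values here are illustrative):

```python
# Per-level minimum supports: more specific levels get lower thresholds.
min_sup = {1: 0.08, 2: 0.05}

# Hypothetical observed supports and child -> parent links in the hierarchy.
support = {"bread": 0.10, "milk": 0.06, "white_bread": 0.07, "skim_milk": 0.04}
parent = {"white_bread": "bread", "skim_milk": "milk"}

def frequent_at_level(items, level):
    """Cross filtering by item: examine a child only if its parent was
    frequent at the level above; then apply this level's own threshold."""
    out = []
    for item in items:
        p = parent.get(item)
        if p is not None and support[p] < min_sup[level - 1]:
            continue  # parent infrequent one level up: prune the child unseen
        if support[item] >= min_sup[level]:
            out.append(item)
    return out

# milk (6%) misses level 1's 8% bar, so skim_milk is pruned without being counted.
print(frequent_at_level(["white_bread", "skim_milk"], 2))  # ['white_bread']
```

The level passage threshold on the next slide loosens exactly the `support[p] < min_sup[level - 1]` test, letting borderline parents still pass items down.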

Slide 17: Multi-Level Rule Mining
- Controlled cross filtering by single item: two thresholds at each level -- one for frequency at that level, and one called a level passage threshold, which controls which items can pass down to the next level. If an item doesn't make the passage threshold, it doesn't pass down. This threshold is typically between the two levels' support thresholds.
None of these address cross-level association rules, e.g. rules that link buying items at one level with items at a different level.

Slide 18: Multi-Level Rule Mining
Many similar rules can be generated between different levels. E.g. white bread -> skim milk is similar to bread -> milk, and to white toast bread -> 2l skim milk, and... If we allow cross-level rules, these become astronomical, and we can get totally redundant rules:
milk -> bread
skim milk -> bread
tesco milk -> bread

Slide 19: Multi-dimensional Rule Mining
We could mine dimensions other than 'buys', assuming that we have some knowledge about the buyer. For example:
age(20..29) & buys(milk) => buys(bread)
occupation(student) & buys(laptop) => buys(blank dvds)
This isn't necessarily any more difficult; it just involves putting these items into the transaction to be mined. It can usefully be combined with meta-rules or constraints.
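"Putting these items into the transaction" can be as simple as encoding each dimension's value as an ordinary item string, after which any standard single-dimensional ARM algorithm applies unchanged. The record fields and predicate names below are illustrative:

```python
# Hypothetical customer records with extra dimensions beyond 'buys'.
raw = [
    {"age": 24, "occupation": "student", "buys": ["laptop", "blank_dvds"]},
    {"age": 45, "occupation": "teacher", "buys": ["milk", "bread"]},
]

def to_transaction(record):
    """Flatten a multi-dimensional record into a set of item strings."""
    items = set()
    if 20 <= record["age"] <= 29:
        items.add("age(20..29)")  # pre-discretized age range as an item
    items.add("occupation({})".format(record["occupation"]))
    items.update("buys({})".format(x) for x in record["buys"])
    return items

t = to_transaction(raw[0])
print(sorted(t))  # predicates and purchases mixed in one transaction
```

Once flattened like this, a rule such as occupation(student) & buys(laptop) => buys(blank dvds) is just an ordinary itemset rule over these strings.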

Slide 20: Discretization
We have the same 'range' problem we have with numeric data, but in spades. We don't want to classify by it; we want to find arbitrary rules using arbitrary ranges. For example, we might want age() somehow linked to buying, but we don't know how to discretize it. Equally, we might want some sort of distance-based association rule, where the distance between data points is important -- either physical (item A is spatially close to item B) or similarity (item A is similar to item B).

Slide 21: Quantity
Not only could we discretize single numeric attributes; we can also have a number attached to each item: I might buy 10 cans of cat food, 2 bottles of coke, 3 packets of chicken pieces... We could then look for rules that use this quantity (orthogonally to all of the other dimensions we've looked at). E.g.:
buys(cat food, 5+) -> buys(cat litter, 1)
buys(soda, 2) -> buys(potato chips, 2+)
(I feel sympathy for your encroaching headaches!)
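A quantity predicate like buys(cat food, 5+) is a threshold test against a basket that records counts, not just item presence. A minimal sketch with illustrative item names:

```python
# A basket that carries quantities rather than mere item presence.
basket = {"cat_food": 10, "coke": 2, "chicken_pieces": 3}

def buys_at_least(basket, item, n):
    """Evaluate a buys(item, n+) predicate against a quantified basket."""
    return basket.get(item, 0) >= n

# buys(cat food, 5+) holds for this basket; buys(coke, 5+) does not.
print(buys_at_least(basket, "cat_food", 5))  # True
print(buys_at_least(basket, "coke", 5))      # False
```

A rule's antecedent then becomes a conjunction of such predicates, and its support is the fraction of baskets on which all of them hold.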

Slide 22: Time
(But not that much sympathy!) You could use association rule mining techniques to find episodic rules -- for example, that I buy cheese every 3 weeks, milk and bread every week, and DVDs apparently randomly. The metric could be number of transactions rather than calendar days/weeks. If the items were a sequence of events, then the order within the transaction is important and could be mined for rules. Trend rules examine the same attribute over time, e.g. trends in the stock market -- and could be applied to many attributes concurrently.

Slide 23: Classification ARM
A final note: once association rules have been discovered, they can be used to form a classifier -- for example, by adding a constraint that the consequent must be one of the attributes specified as a class.
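The constraint described above yields rules whose consequents are class labels; classifying then amounts to firing the first rule whose antecedent the new transaction satisfies. A sketch of that idea (rule list, item strings, and the first-match policy are all assumptions, not the slide's specific method):

```python
# Mined rules kept only if the consequent is a class label:
# each rule is (antecedent item set, class consequent).
rules = [
    ({"buys(laptop)", "occupation(student)"}, "class=tech_buyer"),
    ({"buys(milk)"}, "class=grocery_buyer"),
]

def classify(transaction, rules, default="class=unknown"):
    """Apply the first rule whose antecedent is fully contained
    in the transaction; fall back to a default class otherwise."""
    for antecedent, consequent in rules:
        if antecedent <= transaction:  # all antecedent items present
            return consequent
    return default

print(classify({"buys(milk)", "buys(bread)"}, rules))  # class=grocery_buyer
```

Real associative classifiers typically also order the rule list by confidence and support before matching, so the strongest applicable rule wins.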

Slide 24: Further Reading
- The rest of Zhang!
- Berry and Browne, chapters 15 and 16
- Han 5.3, 5.5
- Dunham 6.4, 6.7