Data Analytics CMIS Short Course part II Day 1 Part 1: Clustering Sam Buttrey December 2015.

Slides:



Advertisements
Similar presentations
Recap: Mining association rules from large datasets
Advertisements

Mining Association Rules from Microarray Gene Expression Data.
Huffman Codes and Asssociation Rules (II) Prof. Sin-Min Lee Department of Computer Science.
Association Analysis (2). Example TIDList of item ID’s T1I1, I2, I5 T2I2, I4 T3I2, I3 T4I1, I2, I4 T5I1, I3 T6I2, I3 T7I1, I3 T8I1, I2, I3, I5 T9I1, I2,
Frequent Itemset Mining Methods. The Apriori algorithm Finding frequent itemsets using candidate generation Seminal algorithm proposed by R. Agrawal and.
Data Mining Techniques Association Rule
Mustafa Cayci INFS 795 An Evaluation on Feature Selection for Text Clustering.
DATA MINING Association Rule Discovery. AR Definition aka Affinity Grouping Common example: Discovery of which items are frequently sold together at a.
Association Analysis (Data Engineering). Type of attributes in assoc. analysis Association rule mining assumes the input data consists of binary attributes.
Data Mining of Very Large Data
IT 433 Data Warehousing and Data Mining Association Rules Assist.Prof.Songül Albayrak Yıldız Technical University Computer Engineering Department
Association Rule Mining. 2 The Task Two ways of defining the task General –Input: A collection of instances –Output: rules to predict the values of any.
10 -1 Lecture 10 Association Rules Mining Topics –Basics –Mining Frequent Patterns –Mining Frequent Sequential Patterns –Applications.
1 of 25 1 of 45 Association Rule Mining CIT366: Data Mining & Data Warehousing Instructor: Bajuna Salehe The Institute of Finance Management: Computing.
Association Analysis. Association Rule Mining: Definition Given a set of records each of which contain some number of items from a given collection; –Produce.
Data Mining Techniques So Far: Cluster analysis K-means Classification Decision Trees J48 (C4.5) Rule-based classification JRIP (RIPPER) Logistic Regression.
Data Mining Association Analysis: Basic Concepts and Algorithms Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.
Mining Association Rules. Association rules Association rules… –… can predict any attribute and combinations of attributes … are not intended to be used.
Data Mining, Frequent-Itemset Mining
Association Analysis (2). Example TIDList of item ID’s T1I1, I2, I5 T2I2, I4 T3I2, I3 T4I1, I2, I4 T5I1, I3 T6I2, I3 T7I1, I3 T8I1, I2, I3, I5 T9I1, I2,
Data Mining Association Analysis: Basic Concepts and Algorithms
Data Mining Association Analysis: Basic Concepts and Algorithms
Asssociation Rules Prof. Sin-Min Lee Department of Computer Science.
Data Mining, Frequent-Itemset Mining. Data Mining Some mining problems Find frequent itemsets in "market-basket" data – "50% of the people who buy hot.
Fast Algorithms for Association Rule Mining
Lecture14: Association Rules
Introduction to Data Mining Data mining is a rapidly growing field of business analytics focused on better understanding of characteristics and.
Bayesian Decision Theory Making Decisions Under uncertainty 1.
Information Retrieval from Data Bases for Decisions Dr. Gábor SZŰCS, Ph.D. Assistant professor BUTE, Department Information and Knowledge Management.
Basic Data Mining Techniques
Apriori algorithm Seminar of Popular Algorithms in Data Mining and Machine Learning, TKK Presentation Lauri Lahti.
Data Mining An Introduction.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining By Tan, Steinbach, Kumar Lecture.
Modul 7: Association Analysis. 2 Association Rule Mining  Given a set of transactions, find rules that will predict the occurrence of an item based on.
M. Sulaiman Khan Dept. of Computer Science University of Liverpool 2009 COMP527: Data Mining Association Rule Mining March 5, 2009.
CS 8751 ML & KDDSupport Vector Machines1 Mining Association Rules KDD from a DBMS point of view –The importance of efficiency Market basket analysis Association.
Association Rule Mining Data Mining and Knowledge Discovery Prof. Carolina Ruiz and Weiyang Lin Department of Computer Science Worcester Polytechnic Institute.
Sampling Large Databases for Association Rules Jingting Zeng CIS 664 Presentation March 13, 2007.
Frequent-Itemset Mining. Market-Basket Model A large set of items, e.g., things sold in a supermarket. A large set of baskets, each of which is a small.
Association Rule Mining
DATA MINING WITH CLUSTERING AND CLASSIFICATION Spring 2007, SJSU Benjamin Lam.
Charles Tappert Seidenberg School of CSIS, Pace University
CURE Clustering Using Representatives Handles outliers well. Hierarchical, partition First a constant number of points c, are chosen from each cluster.
M. Sulaiman Khan Dept. of Computer Science University of Liverpool 2009 COMP527: Data Mining ARM: Improvements March 10, 2009 Slide.
Data Mining and Decision Support
Data Analytics CMIS Short Course part II Day 1 Part 1: Introduction Sam Buttrey December 2015.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Chapter 8 Association Rules. Data Warehouse and Data Mining Chapter 10 2 Content Association rule mining Mining single-dimensional Boolean association.
Chap 6: Association Rules. Rule Rules!  Motivation ~ recent progress in data mining + warehousing have made it possible to collect HUGE amount of data.
1 Data Mining Lecture 6: Association Analysis. 2 Association Rule Mining l Given a set of transactions, find rules that will predict the occurrence of.
Tallahassee, Florida, 2016 CIS4930 Introduction to Data Mining Midterm Review Peixiang Zhao.
Chapter 3 Data Mining: Classification & Association Chapter 4 in the text box Section: 4.3 (4.3.1),
Introduction to Data Mining Mining Association Rules Reference: Tan et al: Introduction to data mining. Some slides are adopted from Tan et al.
Data Mining Association Analysis: Basic Concepts and Algorithms
A Research Oriented Study Report By :- Akash Saxena
Frequent Pattern Mining
Waikato Environment for Knowledge Analysis
CPS216: Advanced Database Systems Data Mining
Market Basket Analysis and Association Rules
Market Basket Many-to-many relationship between different objects
Association Rule Mining
Data Mining Association Analysis: Basic Concepts and Algorithms
Data Mining II: Association Rule mining & Classification
DIRECT HASHING AND PRUNING (DHP) ALGORITHM
Data Mining Association Rules Assoc.Prof.Songül Varlı Albayrak
SEG 4630 E-Commerce Data Mining — Final Review —
Data Mining Association Analysis: Basic Concepts and Algorithms
Association Analysis: Basic Concepts and Algorithms
Market Basket Analysis and Association Rules
Presentation transcript:

Data Analytics CMIS Short Course part II Day 1 Part 1: Clustering Sam Buttrey December 2015

Association Rules Another unsupervised technique Given categorical data, construct rules A rule usually has the form If variable A has value a and variable B has value, then variable D will have value d Here we have two antecedents (conditions) and one consequent Usually rules have antecedents joined by “and” and exactly one consequent

Rules Rules generally concern categorical variables (sets) Continuous variables must be discretized Binary variables should often be treated asymmetrically No response variable: any variable can be “predictor” or “response” Example: market basket

Market Baskets Items are items, baskets are baskets, look for shoppers’ similarities Items are words, baskets are documents, look for linked concepts Items are sentences, baskets are documents, look for plagiarism Items are genes, baskets are people…

Scale Humans have about 30,000 genes and there are six-seven billion of us Wal-Mart sells about 100,000 items and can store millions of baskets The Web has 100,000,000 words and billions of pages Data off-line, need few passes

Support and Confidence Object: find rules with high support (frequency, coverage) and high confidence (accuracy) Support: proportion of sample meeting the conditions of the antecedents Confidence: proportion of supported sample meeting the consequent

Example I go to work by tunnel, by Rte 68, or through the Presidio of Monterey “If I take the tunnel, I’ll be late for class” Each rule carries a probability “If I take the tunnel, I’ll be late with probability 90%” compared to “Today I will be late with probability 20%” The difference between the 90% and the 20% is one measure of the rule’s usefulness – unless I never take the tunnel (which is measured by “coverage”)

Example RouteRte 68TunnelPresidio Not Late 4%20%24% Late21% 10% Marginal25%41%34% The rule “If Rte 68, then Late” has 25% support. The confidence is the usual estimate of the conditional probability Pr (L | 68) = Pr (L & 68) / Pr (68) =.21/.25 = 84%

Rules vs. Classification Trees In a tree, there is one response variable The “rules” are mutually exclusive and exhaustive (no overlap) Association rules use any variable as a response Many observations are covered by more than one rule (much overlap)

Finding Rules The hard part is finding “frequent itemset patterns,” sets of conditions whose support exceeds a threshold s In one pass we can find frequent items defined by one condition “A and B” is frequent only if A and B are both frequent: Pr (A & B) < Pr (A) In general a set is frequent only if all its subsets are themselves frequent

Finding Frequent Sets On pass two we can examine all pairs of frequent sets to compute their frequency On pass three, examine all triplets –The number of “sets of sets” grows quickly, so s needs to be chosen wisely –Number of conditions can be limited One more pass generates rules –For a set A, enumerate all possible rules like “if A, then B”, evaluate accuracy Lots of rules, lots of overlap

More on Rules Evaluating “If A then B” : 1.Confidence: Pr (B | A) 2.Conf. difference: Pr (B|A) – Pr (B) 3.Lift: Pr (B|A) / Pr (B) –Or Pr(A&B)/Pr(A)Pr(B) in our package 4.Conf. ratio: [Pr (B|A) / Pr (B)] – 1 5.Information Difference: Gain A – Gain 6.Normalized  2 (Cramer’s coefficient) These seem like the two most commonly used

“A Priori” Notes R implementation in library (arules) Tabular vs. transactional data –Transactional data: only the items present in the basket are passed Items that are rarer than the minimum support level can never appear in a rule –Some algorithms let you specify a different minimum support for each item –We need to keep very common items out of rules for rare items –Note: our “support” is Pr(A), but apriori() uses Pr (A and B)

Example Database of tunnels under the U.S. border Goal 1: characterize tunnels by entrance, length, sophistication, etc. Goal 2: identify locations where tunnels are more likely to appear Let’s do this thing!

Course Recap Day 1: Classification & Regression Trees –Virtually automatic supervised models –Lots of strengths, not always too accurate –Ensembles: the strength of weak learners Day 2: Unsupervised models –Principal components: dimensionality reduction –Clustering, measuring inter-object distances –Association rules Ask me questions at 15