DATA MINING E0 261 Jayant Haritsa Computer Science and Automation


DATA MINING E0 261 Jayant Haritsa Computer Science and Automation Indian Institute of Science

A Typical Web-Service Form (e.g. Amazon.com)

Why collect this information? Serves as the input to Data Mining Tools. Data Mining: Hot area of computer science research linking Database Management Systems (DBMS) with AI (machine learning) and Statistics. Automated and efficient extraction of “interesting” statistical patterns (or models) from enormous disk-resident archival databases. Petabyte (10^15 bytes) databases are a reality today! “Like prospectors searching for gold in a mine, we are trying to discover nuggets of crucial information from mountains of raw data.”

Interesting Patterns Associations: Capture object attribute relationships. E.g. If student “state = AP”, high likelihood of having taken “Ramaiah coaching classes”. Clustering: Group similar objects together. E.g. Google groups web-pages with similar answers: “In order to show you the most relevant results, we have omitted some entries very similar to the 97 already displayed.” Classification: Assign objects to categories. E.g. Vehicle insurance categories (low-risk, medium-risk and high-risk) are based on owner’s age, vehicle color. Deviations: Detect “abnormal” behavior. E.g. Doping in Olympics! Match-fixing!

Data Mining: Applications Business Strategy (marketing, advertising,…) Fraud Detection (credit card, telephone) Gene and Protein Sequencing Medical Prognosis and Diagnosis Sports ! (NBA stats - IBM Advanced Scout) .... Among top 5 technologies of the decade [Gartner]

Benefits of Data Mining Better business strategies “Action movies released in July usually succeed at the box office” Improved customer services [amazon.com]: “People who bought Macbeth were also inclined to read The Count of Monte Cristo”

DATA MINING vs. DBMS (1) Querying a database is a search process: “List the names of employees who retired in 1996”. Quarrying a database is a discovery process: “People who buy milk and coffee usually buy sugar as well”. Data Mining is the study of efficient techniques for quarrying.

Data Mining vs. DBMS (2) Operational Data Processing Historical Data Processing

Association Rules

Association Rules Co-occurrence of events: On supermarket purchases, indicates which items are typically bought together. 80 percent of customers purchasing coffee also purchased milk: Coffee → Milk (0.8). To ensure statistical significance, need to also compute the “support”: coffee and milk are purchased together by 60 percent of customers. Coffee → Milk (0.8, 0.6)

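Problem Formulation Given a table with columns Coffee, Sugar, Milk, Bread and one Y/N row per transaction (Y N Y N, N Y Y Y, Y N ..., ...), find all rules X → Y (c, s) where c > min_confidence and s > min_support, with c = P(Y | X), s = P(X ∪ Y) (the fraction of transactions containing all items of X and Y), and X and Y disjoint.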
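The two measures in the formulation can be sketched in a few lines of Python. The function names and the `baskets` list are illustrative assumptions: the first two baskets follow the first two rows of the slide's table, while the last two are made up so that the supports match the percentages quoted on the next slide.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(itemset: Iterable[str], transactions: List[Transaction]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    wanted = frozenset(itemset)
    hits = sum(1 for t in transactions if wanted <= t)
    return hits / len(transactions)


def confidence(x: Iterable[str], y: Iterable[str],
               transactions: List[Transaction]) -> float:
    """Estimate of P(Y | X): support(X u Y) / support(X)."""
    sup_x = support(x, transactions)
    if sup_x == 0:
        raise ValueError("antecedent never occurs, confidence is undefined")
    return support(set(x) | set(y), transactions) / sup_x


# Hypothetical data: rows 3 and 4 are invented for illustration.
baskets: List[Transaction] = [
    frozenset({"coffee", "milk"}),
    frozenset({"sugar", "milk", "bread"}),
    frozenset({"coffee", "milk", "bread"}),
    frozenset({"coffee", "milk", "bread"}),
]

s = support({"coffee", "milk"}, baskets)
c = confidence({"coffee"}, {"milk"}, baskets)
print(f"coffee -> milk (c={c}, s={s})")
```

A rule is reported only if both values strictly exceed the user-specified thresholds, as in the formulation above.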

Problem Breakup 1) Find all itemsets I such that the support of I is greater than the minimum support specified by the user. Example: min_support = 60%: coffee (75%), milk (100%), bread (75%), coffee-milk (75%). Called “large” or “frequent” itemsets. Hard to compute.

Problem Breakup (contd) 2) Use the frequent itemsets to generate rules (simple to compute): for every frequent itemset f, find all non-empty proper subsets of f; for every such subset s, output the rule s → (f - s) if support(f) / support(s) > min_conf
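A minimal sketch of this rule-generation step, assuming the supports of all frequent itemsets are already available in a dictionary. The name `generate_rules` and the dictionary layout are assumptions made for illustration; the support values are the ones from the min_support = 60% example.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

Rule = Tuple[FrozenSet[str], FrozenSet[str], float, float]


def generate_rules(supports: Dict[FrozenSet[str], float],
                   min_conf: float) -> List[Rule]:
    """Return (antecedent, consequent, confidence, support) tuples.

    `supports` must hold every frequent itemset; by monotonicity every
    subset of a frequent itemset is then present as a key as well.
    """
    rules: List[Rule] = []
    for f, sup_f in supports.items():
        if len(f) < 2:
            continue  # a rule needs a non-empty antecedent and consequent
        for size in range(1, len(f)):
            for antecedent in combinations(sorted(f), size):
                s = frozenset(antecedent)
                conf = sup_f / supports[s]
                if conf > min_conf:
                    rules.append((s, f - s, conf, sup_f))
    return rules


supports = {
    frozenset({"coffee"}): 0.75,
    frozenset({"milk"}): 1.0,
    frozenset({"bread"}): 0.75,
    frozenset({"coffee", "milk"}): 0.75,
}
for lhs, rhs, conf, sup in generate_rules(supports, min_conf=0.8):
    print(sorted(lhs), "->", sorted(rhs), (conf, sup))
```

No database scan is needed here: every confidence is a ratio of two supports that the first step has already computed.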

Simple Solution The set of all itemsets is a lattice. Do one scan of the database, incrementing the count for each itemset present in each tuple. Problem: Number of counters is 2^m (m = |I|); m can be in thousands (KDD cup data had 150000!); wasteful since most itemsets will be infrequent. [Figure: lattice of itemsets over items A, B, C, with ABC at the top, AB, AC, BC in the middle, and A, B, C at the bottom]
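The one-scan approach might look like the sketch below. It is only an illustration of why the idea does not scale: it keeps a dictionary of counters rather than a fixed array of 2^m entries (an assumption of this sketch), but it still enumerates every subset of every transaction.

```python
from collections import Counter
from itertools import combinations
from typing import FrozenSet, Iterable, Set


def count_all_itemsets(transactions: Iterable[Set[str]]) -> Counter:
    """Single scan: bump a counter for every non-empty subset of each row."""
    counts: Counter = Counter()
    for t in transactions:
        items = sorted(t)
        for size in range(1, len(items) + 1):
            for subset in combinations(items, size):
                counts[frozenset(subset)] += 1
    return counts


rows = [{"A", "B", "C"}, {"A", "B"}, {"B"}]
counts = count_all_itemsets(rows)
print(len(counts), "counters for", 3, "items")
```

A transaction with n items contributes 2^n - 1 updates, so long transactions alone make this approach impractical.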

Apriori Algorithm

Procedure Multiple scans over the complete database. In scan i: read database row by row; count occurrences of “candidate” itemsets of length i (counters stored in special data structure); at end of scan, determine the frequent itemsets of length i: F_i; determine “candidate” itemsets (length i+1) for the next scan. Return the union of all F_i. [Figure: lattice of itemsets over items A, B, C, with ABC at the top, AB, AC, BC in the middle, and A, B, C at the bottom]
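The level-wise procedure can be sketched as follows. This is a simplified illustration, not the tuned algorithm: candidates are kept in a plain dictionary and matched by a linear scan instead of the special data structures described on the following slides, and the `baskets` data is hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List


def apriori(transactions: List[FrozenSet[str]],
            min_support: float) -> Dict[FrozenSet[str], float]:
    """Return every frequent itemset with its support (strictly > min_support)."""
    n = len(transactions)
    # Candidates for scan 1: every individual item.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent: Dict[FrozenSet[str], float] = {}
    k = 1
    while candidates:
        # Scan k: read the database row by row and count each candidate.
        counts = dict.fromkeys(candidates, 0)
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        # End of scan: keep the frequent itemsets of length k.
        f_k = {c: cnt / n for c, cnt in counts.items() if cnt / n > min_support}
        frequent.update(f_k)
        # Candidates of length k+1: unions of two frequent k-itemsets
        # all of whose k-subsets are themselves frequent.
        candidates = set()
        for a, b in combinations(f_k, 2):
            union = a | b
            if len(union) == k + 1 and all(
                    frozenset(s) in f_k for s in combinations(union, k)):
                candidates.add(union)
        k += 1
    return frequent


# Hypothetical transactions chosen for illustration.
baskets = [
    frozenset({"coffee", "milk"}),
    frozenset({"sugar", "milk", "bread"}),
    frozenset({"coffee", "milk", "bread"}),
    frozenset({"coffee", "milk", "bread"}),
]
result = apriori(baskets, min_support=0.6)
for itemset, sup in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

The loop ends as soon as a scan produces no candidates for the next length, so the number of scans tracks the length of the longest frequent itemset.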

First Pass Read database row by row Count occurrences of all individual items (called “1-itemsets”) counters stored in a 1-D array At end of scan, determine F1 Determine “candidate” itemsets (length 2) for the next scan F1 * F1

Second Pass Read database row by row. Count occurrences of candidate 2-itemsets; counters stored in a lower 2-D triangular array. At end of scan, determine F2. Determine “candidate” itemsets (length 3) for the next scan using AprioriGen. [Figure labels: fn,I f1,I]
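One way to lay out the lower triangular array of pair counters is sketched below. The ranking of items and the index formula `i * (i - 1) // 2 + j` (for ranks i > j) are assumptions of this sketch; the slide does not specify the exact layout.

```python
from itertools import combinations
from typing import Dict, Iterable, List, Set, Tuple


def count_pairs(transactions: Iterable[Set[str]],
                frequent_items: Iterable[str]) -> Tuple[Dict[str, int], List[int]]:
    """Count all pairs of frequent items in a flat lower-triangular array."""
    rank = {item: r for r, item in enumerate(sorted(frequent_items))}
    n = len(rank)
    counts = [0] * (n * (n - 1) // 2)  # one cell per unordered pair
    for t in transactions:
        ranks = sorted(rank[item] for item in t if item in rank)
        for j, i in combinations(ranks, 2):  # j < i
            counts[i * (i - 1) // 2 + j] += 1
    return rank, counts


def pair_count(rank: Dict[str, int], counts: List[int], a: str, b: str) -> int:
    """Look up the counter of the pair {a, b}."""
    i, j = sorted((rank[a], rank[b]), reverse=True)
    return counts[i * (i - 1) // 2 + j]


# Hypothetical transactions and first-pass result, for illustration only.
baskets = [
    {"coffee", "milk"},
    {"sugar", "milk", "bread"},
    {"coffee", "milk", "bread"},
    {"coffee", "milk", "bread"},
]
rank, counts = count_pairs(baskets, frequent_items={"coffee", "milk", "bread"})
print(pair_count(rank, counts, "coffee", "milk"))
```

Only pairs of items from F1 get a cell, so the array has |F1| (|F1| - 1) / 2 entries and needs no hashing or pointer chasing during the scan.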

AprioriGen To generate candidates C_k: (1) Join F_{k-1} * F_{k-1}, (2) Prune.
Join:
  insert into C_k
  select p.i_1, p.i_2, ..., p.i_{k-1}, q.i_{k-1}
  from F_{k-1} p, F_{k-1} q
  where p.i_1 = q.i_1, ..., p.i_{k-2} = q.i_{k-2} and p.i_{k-1} < q.i_{k-1}
Prune:
  for all itemsets c ∈ C_k do
    for all (k-1)-subsets s of c do
      if s ∉ F_{k-1} then delete c from C_k

Example F2 = AB, AC, CD, CE, DE Join: ABC, CDE Prune: Remove ABC since BC is not in F2
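The join and prune steps can be sketched in Python and checked against this example. Representing each itemset as a sorted tuple and returning both the joined and the pruned lists are choices made for this sketch only.

```python
from itertools import combinations
from typing import Iterable, List, Tuple

Itemset = Tuple[str, ...]


def apriori_gen(f_prev: Iterable[Itemset]) -> Tuple[List[Itemset], List[Itemset]]:
    """Return (joined, pruned) candidates of length k from F_{k-1}.

    Every itemset in `f_prev` must be a sorted tuple of the same length.
    """
    prev = set(f_prev)
    # Join: pair up itemsets that agree on all but their last item.
    joined = sorted(
        p + (q[-1],)
        for p in prev
        for q in prev
        if p[:-1] == q[:-1] and p[-1] < q[-1]
    )
    # Prune: drop a candidate if any of its (k-1)-subsets is not frequent.
    pruned = [
        c for c in joined
        if all(s in prev for s in combinations(c, len(c) - 1))
    ]
    return joined, pruned


f2 = [("A", "B"), ("A", "C"), ("C", "D"), ("C", "E"), ("D", "E")]
joined, pruned = apriori_gen(f2)
print("Join: ", joined)
print("Prune:", pruned)
```

Requiring `p[-1] < q[-1]` in the join ensures that each candidate is generated exactly once and stays in sorted order.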

Monotonicity Property Any subset of a frequent itemset must itself be frequent. basic foundation of all association rule mining algorithms implies build frequent itemsets bottom-up

Third (and following) passes Same as before except that candidate itemsets are stored in a hash-tree. Non-leaf nodes contain a hash table, with each entry in the bucket pointing to another node at next level. Leaf nodes contain a list of itemsets. In Pass k, the maximum height of the tree is equal to k. [Figure: example hash-tree over items A to F, with leaf itemsets ABC, ACD and ADF]

Hash-Tree Transaction ACDF in Pass 3: the subsets to be checked are ACD, ACF, ADF, CDF. [Figure: hash-tree with interior hash tables over items A to F and leaf itemsets ABC, ACD and ADF]

Subset Search To check all candidate subsets of transaction T: at the root, hash on every item in T; if at a leaf, find which of the itemsets stored there are contained in T; if at an internal node that has been reached by hashing on item i, hash on each item after i in T in turn, and continue recursively for each such successor.
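A compact hash-tree with this subset search might look like the following sketch. The bucket count, the leaf capacity, the use of Python's built-in `hash`, and the rule that an overfull leaf is split lazily are all assumptions of this illustration; the slides do not fix these details.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple

Itemset = Tuple[str, ...]


class Node:
    def __init__(self) -> None:
        self.children: Optional[Dict[int, "Node"]] = None  # set for interior nodes
        self.itemsets: List[Itemset] = []                   # used by leaves


class HashTree:
    def __init__(self, k: int, buckets: int = 3, leaf_capacity: int = 2) -> None:
        self.k, self.buckets, self.leaf_capacity = k, buckets, leaf_capacity
        self.root = Node()

    def _bucket(self, item: str) -> int:
        return hash(item) % self.buckets

    def insert(self, itemset: Itemset) -> None:
        """Insert a sorted k-itemset, hashing on its d-th item at depth d."""
        node, depth = self.root, 0
        while node.children is not None:
            node = node.children.setdefault(self._bucket(itemset[depth]), Node())
            depth += 1
        node.itemsets.append(itemset)
        # Split an overfull leaf while there is still an item left to hash on,
        # so the tree never grows deeper than k levels below the root.
        if len(node.itemsets) > self.leaf_capacity and depth < self.k:
            stored, node.itemsets, node.children = node.itemsets, [], {}
            for s in stored:
                child = node.children.setdefault(self._bucket(s[depth]), Node())
                child.itemsets.append(s)

    def subsets_in(self, transaction: Iterable[str]) -> Set[Itemset]:
        """Return the stored candidates that are contained in the transaction."""
        t = tuple(sorted(transaction))
        items = set(t)
        found: Set[Itemset] = set()

        def visit(node: Node, start: int) -> None:
            if node.children is None:
                # Leaf: test each stored itemset against the transaction.
                found.update(c for c in node.itemsets if items.issuperset(c))
                return
            # Interior node: hash on each remaining item and recurse.
            for pos in range(start, len(t)):
                child = node.children.get(self._bucket(t[pos]))
                if child is not None:
                    visit(child, pos + 1)

        visit(self.root, 0)
        return found


tree = HashTree(k=3)
for cand in [("A", "B", "C"), ("A", "C", "D"), ("A", "D", "F")]:
    tree.insert(cand)
matches = tree.subsets_in("ACDF")
print(sorted(matches))
```

Because different items can share a bucket, a leaf may be reached more than once or hold itemsets that are not in the transaction, which is why the leaf step still checks containment explicitly and the results are collected in a set.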

Summary Apriori is considered the “classical” association rule mining algorithm. Used in IBM’s Intelligent Miner product. Number of passes proportional to the length of the longest frequent itemset. Works for sparse matrices.

END DATA MINING E0 261