Data Mining: Confluence of Multiple Disciplines –Database Systems –Statistics –Machine Learning –Algorithms –Visualization –Other Disciplines
Data Mining Outline Introduction Classification Clustering Association Rules
Introduction: Data is growing at a phenomenal rate. Users expect more sophisticated information. How? UNCOVER HIDDEN INFORMATION: DATA MINING
Data Mining Definition: Finding hidden information in a database. Fit data to a model: descriptive or predictive. Similar terms –Exploratory data analysis –Data driven discovery –Inductive learning
But it isn’t Magic: You must know what you are looking for. You must know how to look for it. Suppose you knew that a specific cave had gold: What would you look for? How would you look for it? You might need an expert miner.
“If it looks like a duck, walks like a duck, and quacks like a duck, then it’s a duck.” Description, Behavior, Associations. Classification, Clustering, Link Analysis. “If it looks like a terrorist, walks like a terrorist, and quacks like a terrorist, then it’s a terrorist.”
Query Examples
Database
– Find all credit applicants with last name of Smith.
– Identify customers who have purchased more than $10,000 in the last month.
– Find all customers who have purchased milk.
Data Mining
– Find all credit applicants who are poor credit risks. (classification)
– Identify customers with similar buying habits. (clustering)
– Find all items which are frequently purchased with milk. (association rules)
KDD Process
– Selection: Obtain data from various sources.
– Preprocessing: Cleanse data.
– Transformation: Convert to common format. Transform to new format.
– Data Mining: Obtain desired results.
– Interpretation/Evaluation: Present results to user in a meaningful manner.
© Prentice Hall
Data Mining Outline Introduction Classification – Assign data to a predefined class –Decision Trees –Neural Networks –Distance Based Clustering Association Rules
Training database (measurement values not shown): columns Insect ID, Abdomen Length, Antennae Length, Insect Class. Class labels of the ten training rows: Grasshopper, Katydid, Grasshopper, Grasshopper, Katydid, Grasshopper, Katydid, Grasshopper, Katydid, Katydid. Previously unseen instance: Insect Class = ???????
The classification problem can now be expressed as: given a training database, predict the class label of a previously unseen instance.
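One way to make the problem statement concrete is a distance-based classifier, one of the approaches named in the outline. The sketch below is a minimal nearest-neighbor example, not the method of the slides. The measurements are invented for illustration, and the name `predict_nearest` is hypothetical.

```python
import math

# Hypothetical training rows: (abdomen_length, antennae_length, insect_class).
# These numbers are made up for illustration; they are not the slide's values.
TRAINING = [
    (2.7, 5.5, "Grasshopper"),
    (8.0, 9.1, "Katydid"),
    (0.9, 4.7, "Grasshopper"),
    (1.1, 3.1, "Grasshopper"),
    (5.4, 8.5, "Katydid"),
]

def predict_nearest(abdomen, antennae, training=TRAINING):
    """Return the class label of the training row closest in Euclidean distance."""
    nearest = min(
        training,
        key=lambda row: math.dist((abdomen, antennae), row[:2]),
    )
    return nearest[2]

# Classify a previously unseen instance (also hypothetical).
print(predict_nearest(5.1, 7.0))
```

The training database plays the role of the model here: prediction simply looks up the most similar labeled insect.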
Classification Process (1): Model Construction. Training Data → Classification Algorithms → Classifier (Model), e.g. IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Classification Process (2): Use the Model in Prediction. The Classifier is applied to Testing Data and then to Unseen Data, e.g. (Jeff, Professor, 4): Tenured?
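The prediction step can be sketched by coding the rule from the model-construction slide and applying it to the unseen tuple. This is an illustration under two assumptions that the slides do not state: any tuple the rule does not cover is labeled ‘no’, and the rank comparison ignores case. The function name `is_tenured` is hypothetical.

```python
def is_tenured(rank: str, years: int) -> str:
    """Apply the rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    if rank.lower() == "professor" or years > 6:
        return "yes"
    # Assumption: tuples not covered by the rule are labeled 'no'.
    return "no"

# Unseen tuple from the slide: (Jeff, Professor, 4)
name, rank, years = ("Jeff", "Professor", 4)
print(name, is_tenured(rank, years))  # prints "Jeff yes"
```

Jeff has only 4 years, but the rank test alone satisfies the OR condition, so the classifier predicts tenured = ‘yes’.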
Training Dataset: This follows an example from Quinlan’s ID3.
Output: A Decision Tree for “buys_computer” [Figure: decision tree with root test age? (branches include <=30 and >40), internal tests student? (no / yes) and credit rating? (fair / excellent), and leaves no / yes]
Neural Network Example [Figure labels: Tuple, Input, Output]
Data Mining Outline Introduction Classification Clustering – Place data into groups –Hierarchical –K-Means –Partitional Association Rules
Clustering Examples: Segment customer database based on similar buying patterns. Group houses in a town into neighborhoods based on similar features. Identify new plant species. Identify similar Web usage patterns.
Clustering vs. Classification
No prior knowledge of:
– Number of clusters
– Meaning of clusters
Unsupervised learning
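To show what grouping without class labels looks like, here is a minimal sketch of K-Means, one of the clustering methods listed in the outline. It makes simplifying assumptions: the first k points seed the centroids (real implementations usually pick seeds more carefully), the sample points are invented, and the name `k_means` is hypothetical. Note that k itself still has to be chosen by the user.

```python
import math

def k_means(points, k, iterations=20):
    """Plain k-means on numeric tuples; the first k points seed the centroids."""
    centroids = [points[i] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(
                    tuple(sum(c) / len(cluster) for c in zip(*cluster))
                )
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 2-D points forming two visually separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(points, k=2)
print(centroids)
print(clusters)
```

No labels are supplied anywhere. The algorithm only returns groups, and interpreting what each group means is left to the analyst.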
Data Mining Outline Introduction Classification Clustering Association Rules – Find relationships between data –Apriori
Association Rules Example: I = {Beer, Bread, Jelly, Milk, PeanutButter}. Support of {Bread, PeanutButter} is 60%.
Association Rules Ex (cont’d)
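The support figure can be checked with a short sketch. The five transactions below are hypothetical, chosen only so that {Bread, PeanutButter} comes out at the 60% quoted in the example. The function names are illustrative, and confidence is computed with its standard definition: support of the whole rule divided by support of the antecedent.

```python
# Hypothetical market-basket transactions over the item set I.
# Chosen so that {Bread, PeanutButter} has 60% support, as in the example.
TRANSACTIONS = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]

def count(itemset, transactions=TRANSACTIONS):
    """Number of transactions that contain every item in itemset."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)

def support(itemset, transactions=TRANSACTIONS):
    """Fraction of transactions that contain every item in itemset."""
    return count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions=TRANSACTIONS):
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / count(antecedent, transactions)

print(support({"Bread", "PeanutButter"}))       # 0.6
print(confidence({"Bread"}, {"PeanutButter"}))  # 0.75
```

Under these assumed transactions, the rule Bread ⇒ PeanutButter holds in 3 of the 4 baskets that contain Bread.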
AR & Market Baskets: Determine items often purchased together (market-basket data). Determine optimal placement of items on the store floor. Determine items for sales and/or specials. Increase sales of items.
Summary: Data Mining is a fast-growing area with many applications. Data Mining algorithms are usually computationally expensive. Data Mining tools may be difficult to use effectively.