DATA MINING Part I IIIT Allahabad Margaret H. Dunham Department of Computer Science and Engineering Southern Methodist University Dallas, Texas 75275, USA Some slides extracted from Data Mining, Introductory and Advanced Topics, Prentice Hall, Support provided by Fulbright Grant and IIIT Allahabad IIIT Allahabad 1
2
Data Mining Outline Part I: Introduction (19/1 – 20/1) Part I: Introduction (19/1 – 20/1) Part II: Classification (24/1 – 27/1) Part II: Classification (24/1 – 27/1) Part III: Clustering (31/1 – 3/2) Part III: Clustering (31/1 – 3/2) Part IV: Association Rules (7/2 – 10/2) Part IV: Association Rules (7/2 – 10/2) Part V: Applications (14/2 – 17/2) Part V: Applications (14/2 – 17/2) 3IIIT Allahabad
Class Structure Each class is two hours Each class is two hours Tuesday/Wednesday presentation Tuesday/Wednesday presentation Thursday/Friday Lab Thursday/Friday Lab 4IIIT Allahabad
Data Mining Part I Introduction Outline Lecture Lecture –Define data mining –Data mining vs. databases –Basic data mining tasks –Data mining development –Data mining issues Lab Lab –Download XLMiner and Weka –Analyze simple dataset 5IIIT Allahabad Goal: Provide an overview of data mining.
Introduction Data is growing at a phenomenal rate Data is growing at a phenomenal rate Users expect more sophisticated information Users expect more sophisticated information How? How? 6IIIT Allahabad UNCOVER HIDDEN INFORMATION DATA MINING
Data Mining Definition Finding hidden information in a database Finding hidden information in a database Fit data to a model Fit data to a model Similar terms Similar terms –Exploratory data analysis –Data driven discovery –Deductive learning 7IIIT Allahabad
Data Mining Algorithm Objective: Fit Data to a Model Objective: Fit Data to a Model –Descriptive –Predictive Preference – Technique to choose the best model Preference – Technique to choose the best model Search – Technique to search the data Search – Technique to search the data –“Query” 8IIIT Allahabad
Database Processing vs. Data Mining Processing Query Query –Well defined –SQL Query –Poorly defined –No precise query language 9IIIT Allahabad Data Data – Operational data Output Output – Precise – Subset of database Data Data – Not operational data Output Output – Fuzzy – Not a subset of database
Query Examples Database Database Data Mining Data Mining 10IIIT Allahabad – Find all customers who have purchased milk – Find all items which are frequently purchased with milk. (association rules) – Find all credit applicants with last name of Smith. – Identify customers who have purchased more than $10,000 in the last month. – Find all credit applicants who are poor credit risks. (classification) – Identify customers with similar buying habits. (Clustering)
Basic Data Mining Tasks Classification maps data into predefined groups or classes Classification maps data into predefined groups or classes –Supervised learning –Prediction –Regression Clustering groups similar data together into clusters. Clustering groups similar data together into clusters. –Unsupervised learning –Segmentation –Partitioning 11IIIT Allahabad
Basic Data Mining Tasks (cont’d) Link Analysis uncovers relationships among data. Link Analysis uncovers relationships among data. –Affinity Analysis –Association Rules –Sequential Analysis determines sequential patterns. 12IIIT Allahabad
CLASSIFICATION Assign data into predefined groups or classes. Assign data into predefined groups or classes. 13IIIT Allahabad
But it isn’t Magic You must know what you are looking for You must know what you are looking for You must know how to look for you You must know how to look for you 14IIIT Allahabad Suppose you knew that a specific cave had gold: What would you look for? What would you look for? How would you look for it? How would you look for it? Might need an expert miner Might need an expert miner
“If it looks like a duck, walks like a duck, and quacks like a duck, then it’s a duck.” “If it looks like a duck, walks like a duck, and quacks like a duck, then it’s a duck.” 15IIIT Allahabad Description Behavior Associations Classification Clustering Link Analysis () (Similarity) (Profiling) (Similarity) “If it looks like a terrorist, walks like a terrorist, and quacks like a terrorist, then it’s a terrorist.”
Classification Ex: Grading 16IIIT Allahabad >=90<90 x >=80<80 x >=70<70 x F B A >=60<50 x C D
17IIIT Allahabad Grasshoppers Katydids Given a collection of annotated data. (in this case 5 instances of Katydids and five of Grasshoppers), decide what type of insect the unlabeled example is. (c) Eamonn Keogh,
18IIIT Allahabad Insect ID AbdomenLengthAntennaeLength Insect Class Grasshopper Katydid Grasshopper Grasshopper Katydid Grasshopper Katydid Grasshopper Katydid Katydid ??????? ??????? The classification problem can now be expressed as: Given a training database predict the class label of a previously unseen instance Given a training database predict the class label of a previously unseen instance previously unseen instance = (c) Eamonn Keogh,
19IIIT Allahabad Antenna Length Grasshoppers Katydids Abdomen Length (c) Eamonn Keogh,
20IIIT Allahabad Facial Recognition (c) Eamonn Keogh,
21IIIT Allahabad Handwriting Recognition George Washington Manuscript (c) Eamonn Keogh,
Anomaly Detection 22IIIT Allahabad
23IIIT Allahabad
CLUSTERING Partition data into previously undefined groups. Partition data into previously undefined groups. 24IIIT Allahabad
25IIIT Allahabad
26IIIT Allahabad What is Similarity? (c) Eamonn Keogh,
Two Types of Clustering 27IIIT Allahabad Hierarchical Partitional (c) Eamonn Keogh,
Hierarchical Clustering Example Iris Data Set 28IIIT Allahabad Setosa Versicolor Virginica The data originally appeared in Fisher, R. A. (1936). "The Use of Multiple Measurements in Axonomic Problems," Annals of Eugenics 7, Hierarchical Clustering Explorer Version 3.0, Human-Computer Interaction Lab, University of Maryland,
29IIIT Allahabad
Microarray Data Analysis Each probe location associated with gene Each probe location associated with gene Color indicates degree of gene expression Color indicates degree of gene expression Compare different samples (normal/disease) Compare different samples (normal/disease) Track same sample over time Track same sample over time Questions Questions –Which genes are related to this disease? –Which genes behave in a similar manner? –What is the function of a gene? Clustering Clustering –Hierarchical –K-means 30IIIT Allahabad
Microarray Data - Clustering 31IIIT Allahabad Gene expression profiling identifies clinically relevant subtypes " Gene expression profiling identifies clinically relevant subtypes of prostate cancer" Proc. Natl. Acad. Sci. USA, Vol. 101, Issue 3, , January 20, 2004 Proc. Natl. Acad. Sci. USA, Vol. 101, Issue 3, , January 20, 2004
ASSOCIATION RULES/ LINK ANALYSIS Find relationships between data Find relationships between data 32IIIT Allahabad
ASSOCIATION RULES EXAMPLES People who buy diapers also buy beer People who buy diapers also buy beer If gene A is highly expressed in this disease then gene A is also expressed If gene A is highly expressed in this disease then gene A is also expressed Relationships between people Relationships between people Book Stores Book Stores Department Stores Department Stores Advertising Advertising Product Placement Product Placement 33IIIT Allahabad
34IIIT Allahabad Data Mining Introductory and Advanced Topics, by Margaret H. Dunham, Prentice Hall, DILBERT reprinted by permission of United Feature Syndicate, Inc.
35IIIT Allahabad Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts:, Dallas Morning News, June 4, 2007 Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts:, Dallas Morning News, June 4, 2007.
No/Little Cheating 36IIIT Allahabad Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts:, Dallas Morning News, June 4, 2007.
Rampant Cheating 37IIIT Allahabad Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts:, Dallas Morning News, June 4, 2007.
38IIIT Allahabad Jialun Qin, Jennifer J. Xu, Daning Hu, Marc Sageman and Hsinchun Chen, “Analyzing Terrorist Networks: A Case Study of the Global Salafi Jihad Network” Lecture Notes in Computer Science, Publisher: Springer-Verlag GmbH, Volume 3495 / 2005, p. 287.
Ex: Stock Market Analysis Example: Stock Market Example: Stock Market Predict future values Predict future values Determine similar patterns over time Determine similar patterns over time Classify behavior Classify behavior 39IIIT Allahabad
Ex: Stock Market Analysis 40IIIT Allahabad
Data Mining vs. KDD Knowledge Discovery in Databases (KDD): process of finding useful information and patterns in data. Knowledge Discovery in Databases (KDD): process of finding useful information and patterns in data. Data Mining: Use of algorithms to extract the information and patterns derived by the KDD process. Data Mining: Use of algorithms to extract the information and patterns derived by the KDD process. 41IIIT Allahabad
KDD Process Selection: Obtain data from various sources. Selection: Obtain data from various sources. Preprocessing: Cleanse data. Preprocessing: Cleanse data. Transformation: Convert to common format. Transform to new format. Transformation: Convert to common format. Transform to new format. Data Mining: Obtain desired results. Data Mining: Obtain desired results. Interpretation/Evaluation: Present results to user in meaningful manner. Interpretation/Evaluation: Present results to user in meaningful manner. 42IIIT Allahabad Modified from [FPSS96C]
KDD Process Ex: Web Log Selection: Selection: –Select log data (dates and locations) to use Preprocessing: Preprocessing: – Remove identifying URLs; Remove error logs Transformation: Transformation: –Sessionize (sort and group) Data Mining: Data Mining: –Identify and count patterns; Construct data structure Interpretation/Evaluation: Interpretation/Evaluation: –Identify and display frequently accessed sequences. Potential User Applications: Potential User Applications: –Cache prediction –Personalization 43IIIT Allahabad
Related Topics Databases Databases OLTP OLTP OLAP OLAP Information Retrieval Information Retrieval 44IIIT Allahabad
45 DB & OLTP Systems Schema Schema –(ID,Name,Address,Salary,JobNo) Data Model Data Model –ER –Relational Transaction Transaction Query: Query: SELECT Name FROM T WHERE Salary > DM: Only imprecise queries IIIT Allahabad
46 Classification/Prediction is Fuzzy Loan Amnt SimpleFuzzy Accept Reject IIIT Allahabad
47 Information Retrieval Information Retrieval (IR): retrieving desired information from textual data. Information Retrieval (IR): retrieving desired information from textual data. Library Science Library Science Digital Libraries Digital Libraries Web Search Engines Web Search Engines Traditionally keyword based Traditionally keyword based Sample query: Sample query: Find all documents about “data mining”. DM: Similarity measures; Mine text/Web data. Mine text/Web data. IIIT Allahabad
48 Information Retrieval (cont’d) Similarity: measure of how close a query is to a document. Similarity: measure of how close a query is to a document. Documents which are “close enough” are retrieved. Documents which are “close enough” are retrieved. Metrics: Metrics: –Precision = |Relevant and Retrieved| |Retrieved| |Retrieved| –Recall = |Relevant and Retrieved| |Relevant| |Relevant| IIIT Allahabad
49 IR Query Result Measures and Classification IRClassification IIIT Allahabad
50 OLAP Online Analytic Processing (OLAP): provides more complex queries than OLTP. Online Analytic Processing (OLAP): provides more complex queries than OLTP. OnLine Transaction Processing (OLTP): traditional database/transaction processing. OnLine Transaction Processing (OLTP): traditional database/transaction processing. Dimensional data; cube view Dimensional data; cube view Visualization of operations: Visualization of operations: –Slice: examine sub-cube. –Dice: rotate cube to look at another dimension. –Roll Up/Drill Down DM: May use OLAP queries. IIIT Allahabad
51 DM vs. Related Topics IIIT Allahabad
Data Mining Development 52IIIT Allahabad Similarity Measures Hierarchical Clustering IR Systems Imprecise Queries Textual Data Web Search Engines Bayes Theorem Regression Analysis EM Algorithm K-Means Clustering Time Series Analysis Neural Networks Decision Tree Algorithms Algorithm Design Techniques Algorithm Analysis Data Structures Relational Data Model SQL Association Rule Algorithms Data Warehousing Scalability Techniques
KDD Issues Human Interaction Human Interaction Overfitting Overfitting Outliers Outliers Interpretation Interpretation Visualization Visualization Large Datasets Large Datasets High Dimensionality High Dimensionality 53IIIT Allahabad
Overfitting Suppose we want to predict whether an individual is short, medium, or tall in height. What is wrong with this data? Suppose we want to predict whether an individual is short, medium, or tall in height. What is wrong with this data? 54IIIT Allahabad NameGenderHeightOutput MaryF1.6Short MaggieF1.9Medium MarthaF1.88Medium StephanieF1.7Short BobM1.85Medium KathyF1.6Short GeorgeM1.7Short DebbieF1.8Medium ToddM1.95Medium KimF1.9Medium AmyF1.8Medium WynetteF1.75Medium
KDD Issues (cont’d) Multimedia Data Multimedia Data Missing Data Missing Data Irrelevant Data Irrelevant Data Noisy Data Noisy Data Changing Data Changing Data Integration Integration Application Application 55IIIT Allahabad
WARNING With data mining you don’t always know what you are looking for. With data mining you don’t always know what you are looking for. There is not one right answer. There is not one right answer. The data you are using is noisy The data you are using is noisy Data Mining is a very applied discipline. Data Mining is a very applied discipline. A data mining course provides you tools to use to analyze data. A data mining course provides you tools to use to analyze data. Experience provides you knowledge of how to use these tools. Experience provides you knowledge of how to use these tools. 56IIIT Allahabad
57IIIT Allahabad r=32236
58IIIT Allahabad
Social Implications of DM Privacy Privacy Profiling Profiling Unauthorized use Unauthorized use Invalid results and claims Invalid results and claims 59IIIT Allahabad
Data Mining Metrics Usefulness Usefulness Return on Investment (ROI) Return on Investment (ROI) Accuracy Accuracy … Space/Time Space/Time 60IIIT Allahabad
Visualization Techniques Graphical Graphical Geometric Geometric Icon-based Icon-based Pixel-based Pixel-based Hierarchical Hierarchical Hybrid Hybrid 61IIIT Allahabad
Models Based on Summarization Visualization: Frequency distribution, mean, variance, median, mode, etc. Visualization: Frequency distribution, mean, variance, median, mode, etc. Box Plot: Box Plot: 62IIIT Allahabad
Scatter Diagram 63IIIT Allahabad
DM Tools XLMiner – Easy addin to Excel XLMiner – Easy addin to Excel Weka – Open Source; Visualization, Functionality, Interface Weka – Open Source; Visualization, Functionality, Interface SAS (JMP) – Commercial Product SAS (JMP) – Commercial Product SPSS – Commercial Product SPSS – Commercial Product MATLAB – Statistical/Math Applications MATLAB – Statistical/Math Applications R – Programming R – Programming 64IIIT Allahabad