DATA MINING Part I IIIT Allahabad Margaret H. Dunham Department of Computer Science and Engineering Southern Methodist University Dallas, Texas 75275,

Slides:



Advertisements
Similar presentations
Supporting End-User Access
Advertisements

DATA MINING Introductory
Data Mining Sangeeta Devadiga CS 157B, Spring 2007.
2/25/13 - Union University 1 ADVENTURES IN DATA MINING Margaret H. Dunham Southern Methodist University Dallas, Texas This material.
DATA MINING Introductory and Advanced Topics Part I
1 ACCTG 6910 Building Enterprise & Business Intelligence Systems (e.bis) Introduction to Data Mining Olivia R. Liu Sheng, Ph.D. Emma Eccles Jones Presidential.
© Prentice Hall1 DATA MINING TECHNIQUES Introductory and Advanced Topics Eamonn Keogh (some slides adapted from) Margaret Dunham Dr. M.H.Dunham, Data Mining,
1 DATA MINING. 2 Introduction Outline Define data mining Data mining vs. databases Basic data mining tasks Data mining development Data mining issues.
Data Mining By Archana Ketkar.
Data Mining CS 341, Spring 2007 Lecture 4: Data Mining Techniques (I)
© Prentice Hall1 DATA MINING Introductory and Advanced Topics Part II Margaret H. Dunham Department of Computer Science and Engineering Southern Methodist.
Data Mining – Intro.
Advanced Database Applications Database Indexing and Data Mining CS591-G1 -- Fall 2001 George Kollios Boston University.
CIS 674 Introduction to Data Mining
Introduction to Data Mining Data mining is a rapidly growing field of business analytics focused on better understanding of characteristics and.
1 © Goharian & Grossman 2003 Introduction to Data Mining (CS 422) Fall 2010.
OLAM and Data Mining: Concepts and Techniques. Introduction Data explosion problem: –Automated data collection tools and mature database technology lead.
Chapter 5: Data Mining for Business Intelligence
Data Mining Techniques
MAKING THE BUSINESS BETTER Presented By Mohammed Dwikat DATA MINING Presented to Faculty of IT MIS Department An Najah National University.
11/05/05SMU Homecoming1 DATA MINING AND TERRORISM Margaret H. Dunham CSE Department Southern Methodist University Dallas, Texas 75275
SharePoint 2010 Business Intelligence Module 6: Analysis Services.
Kansas State University Department of Computing and Information Sciences CIS 830: Advanced Topics in Artificial Intelligence From Data Mining To Knowledge.
Understanding Data Analytics and Data Mining Introduction.
Southern Methodist University
1 DATA MINING Source : Margaret H. Dunham Department of Computer Science and Engineering Southern Methodist University Companion slides for the text by.
Data Mining CS157B Fall 04 Professor Lee By Yanhua Xue.
CISC 4631 Data Mining Lecture 03: Introduction to classification Linear classifier Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook.
Chapter 1 Introduction to Data Mining
INTRODUCTION TO DATA MINING MIS2502 Data Analytics.
1 1 Slide Introduction to Data Mining and Business Intelligence.
Data Mining Chapter 1 Introduction -- Basic Data Mining Tasks -- Related Concepts -- Data Mining Techniques.
Fox MIS Spring 2011 Data Mining Week 9 Introduction to Data Mining.
Vision + Focus + Execution Meiliu Lu, RVR 5016, For CSc 209 Spring 2003, 5/6/03.
© Prentice Hall1 CIS 674 Introduction to Data Mining Srinivasan Parthasarathy Office Hours: TTH 4:30-5:25PM DL693.
Data Mining – Intro. Course Overview Spatial Databases Temporal and Spatio-Temporal Databases Multimedia Databases Data Mining.
1 Improving quality of graduate students by data mining Asst. Prof. Kitsana Waiyamai, Ph.D. Dept. of Computer Engineering Faculty of Engineering, Kasetsart.
Advanced Database Course (ESED5204) Eng. Hanan Alyazji University of Palestine Software Engineering Department.
© Prentice Hall1 DATA MINING Introductory and Advanced Topics Part I Margaret H. Dunham Department of Computer Science and Engineering Southern Methodist.
Chapter 5: Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization DECISION SUPPORT SYSTEMS AND BUSINESS.
CSE 5331/7331 F'071 CSE 5331/7331 Fall 2007 Image Mining Margaret H. Dunham Department of Computer Science and Engineering Southern Methodist University.
CSE 8331 Spring CSE 8331 Spring 2010 Image Mining Margaret H. Dunham Department of Computer Science and Engineering Southern Methodist University.
Data Mining BY JEMINI ISLAM. Data Mining Outline: What is data mining? Why use data mining? How does data mining work The process of data mining Tools.
1 Introduction to Data Mining C hapter 1. 2 Chapter 1 Outline Chapter 1 Outline – Background –Information is Power –Knowledge is Power –Data Mining.
Margaret H. Dunham Department of Computer Science and Engineering Southern Methodist University Companion slides for the text by Dr. M.H.Dunham, Data Mining,
MIS2502: Data Analytics Advanced Analytics - Introduction.
CSE 5331/7331 F'071 CSE 5331/7331 Fall 2007 Dimensional Modeling Margaret H. Dunham Department of Computer Science and Engineering Southern Methodist University.
Data Mining and Decision Support
Academic Year 2014 Spring Academic Year 2014 Spring.
1 DATA MINING Introductory and Advanced Topics Part I References from Dunham.
© Prentice Hall1 DATA MINING Web Mining Margaret H. Dunham Department of Computer Science and Engineering Southern Methodist University Companion slides.
CS570: Data Mining Spring 2010, TT 1 – 2:15pm Li Xiong.
DATA MINING LECTURE 1 INTRODUCTION TO DATA MINING.
Data Mining: Confluence of Multiple Disciplines Data Mining Database Systems Statistics Other Disciplines Algorithm Machine Learning Visualization.
Data Mining – Intro.
Data Mining ICCM
MIS2502: Data Analytics Advanced Analytics - Introduction
DATA MINING © Prentice Hall.
DATA MINING CSE 8331 Spring 2002 Part I
DATA MINING Introductory and Advanced Topics Part I
Data Mining: Concepts and Techniques Course Outline
Sangeeta Devadiga CS 157B, Spring 2007
DATA MINING Introductory and Advanced Topics Part I
Data Warehousing and Data Mining
Supporting End-User Access
DATA MINING Introductory and Advanced Topics Part I
Data Mining: Introduction
DATA MINING Introductory and Advanced Topics Part I
DATA MINING Source : Margaret H. Dunham
Presentation transcript:

DATA MINING Part I IIIT Allahabad Margaret H. Dunham Department of Computer Science and Engineering Southern Methodist University Dallas, Texas 75275, USA Some slides extracted from Data Mining, Introductory and Advanced Topics, Prentice Hall, Support provided by Fulbright Grant and IIIT Allahabad IIIT Allahabad 1

2

Data Mining Outline Part I: Introduction (19/1 – 20/1) Part I: Introduction (19/1 – 20/1) Part II: Classification (24/1 – 27/1) Part II: Classification (24/1 – 27/1) Part III: Clustering (31/1 – 3/2) Part III: Clustering (31/1 – 3/2) Part IV: Association Rules (7/2 – 10/2) Part IV: Association Rules (7/2 – 10/2) Part V: Applications (14/2 – 17/2) Part V: Applications (14/2 – 17/2) 3IIIT Allahabad

Class Structure Each class is two hours Each class is two hours Tuesday/Wednesday presentation Tuesday/Wednesday presentation Thursday/Friday Lab Thursday/Friday Lab 4IIIT Allahabad

Data Mining Part I Introduction Outline Lecture Lecture –Define data mining –Data mining vs. databases –Basic data mining tasks –Data mining development –Data mining issues Lab Lab –Download XLMiner and Weka –Analyze simple dataset 5IIIT Allahabad Goal: Provide an overview of data mining.

Introduction Data is growing at a phenomenal rate Data is growing at a phenomenal rate Users expect more sophisticated information Users expect more sophisticated information How? How? 6IIIT Allahabad UNCOVER HIDDEN INFORMATION DATA MINING

Data Mining Definition Finding hidden information in a database Finding hidden information in a database Fit data to a model Fit data to a model Similar terms Similar terms –Exploratory data analysis –Data driven discovery –Deductive learning 7IIIT Allahabad

Data Mining Algorithm Objective: Fit Data to a Model Objective: Fit Data to a Model –Descriptive –Predictive Preference – Technique to choose the best model Preference – Technique to choose the best model Search – Technique to search the data Search – Technique to search the data –“Query” 8IIIT Allahabad

Database Processing vs. Data Mining Processing Query Query –Well defined –SQL Query –Poorly defined –No precise query language 9IIIT Allahabad Data Data – Operational data Output Output – Precise – Subset of database Data Data – Not operational data Output Output – Fuzzy – Not a subset of database

Query Examples Database Database Data Mining Data Mining 10IIIT Allahabad – Find all customers who have purchased milk – Find all items which are frequently purchased with milk. (association rules) – Find all credit applicants with last name of Smith. – Identify customers who have purchased more than $10,000 in the last month. – Find all credit applicants who are poor credit risks. (classification) – Identify customers with similar buying habits. (Clustering)

Basic Data Mining Tasks Classification maps data into predefined groups or classes Classification maps data into predefined groups or classes –Supervised learning –Prediction –Regression Clustering groups similar data together into clusters. Clustering groups similar data together into clusters. –Unsupervised learning –Segmentation –Partitioning 11IIIT Allahabad

Basic Data Mining Tasks (cont’d) Link Analysis uncovers relationships among data. Link Analysis uncovers relationships among data. –Affinity Analysis –Association Rules –Sequential Analysis determines sequential patterns. 12IIIT Allahabad

CLASSIFICATION Assign data into predefined groups or classes. Assign data into predefined groups or classes. 13IIIT Allahabad

But it isn’t Magic You must know what you are looking for You must know what you are looking for You must know how to look for you You must know how to look for you 14IIIT Allahabad Suppose you knew that a specific cave had gold: What would you look for? What would you look for? How would you look for it? How would you look for it? Might need an expert miner Might need an expert miner

“If it looks like a duck, walks like a duck, and quacks like a duck, then it’s a duck.” “If it looks like a duck, walks like a duck, and quacks like a duck, then it’s a duck.” 15IIIT Allahabad Description Behavior Associations Classification Clustering Link Analysis () (Similarity) (Profiling) (Similarity) “If it looks like a terrorist, walks like a terrorist, and quacks like a terrorist, then it’s a terrorist.”

Classification Ex: Grading 16IIIT Allahabad >=90<90 x >=80<80 x >=70<70 x F B A >=60<50 x C D

17IIIT Allahabad Grasshoppers Katydids Given a collection of annotated data. (in this case 5 instances of Katydids and five of Grasshoppers), decide what type of insect the unlabeled example is. (c) Eamonn Keogh,

18IIIT Allahabad Insect ID AbdomenLengthAntennaeLength Insect Class Grasshopper Katydid Grasshopper Grasshopper Katydid Grasshopper Katydid Grasshopper Katydid Katydid ??????? ??????? The classification problem can now be expressed as: Given a training database predict the class label of a previously unseen instance Given a training database predict the class label of a previously unseen instance previously unseen instance = (c) Eamonn Keogh,

19IIIT Allahabad Antenna Length Grasshoppers Katydids Abdomen Length (c) Eamonn Keogh,

20IIIT Allahabad Facial Recognition (c) Eamonn Keogh,

21IIIT Allahabad Handwriting Recognition George Washington Manuscript (c) Eamonn Keogh,

Anomaly Detection 22IIIT Allahabad

23IIIT Allahabad

CLUSTERING Partition data into previously undefined groups. Partition data into previously undefined groups. 24IIIT Allahabad

25IIIT Allahabad

26IIIT Allahabad What is Similarity? (c) Eamonn Keogh,

Two Types of Clustering 27IIIT Allahabad Hierarchical Partitional (c) Eamonn Keogh,

Hierarchical Clustering Example Iris Data Set 28IIIT Allahabad Setosa Versicolor Virginica The data originally appeared in Fisher, R. A. (1936). "The Use of Multiple Measurements in Axonomic Problems," Annals of Eugenics 7, Hierarchical Clustering Explorer Version 3.0, Human-Computer Interaction Lab, University of Maryland,

29IIIT Allahabad

Microarray Data Analysis Each probe location associated with gene Each probe location associated with gene Color indicates degree of gene expression Color indicates degree of gene expression Compare different samples (normal/disease) Compare different samples (normal/disease) Track same sample over time Track same sample over time Questions Questions –Which genes are related to this disease? –Which genes behave in a similar manner? –What is the function of a gene? Clustering Clustering –Hierarchical –K-means 30IIIT Allahabad

Microarray Data - Clustering 31IIIT Allahabad Gene expression profiling identifies clinically relevant subtypes " Gene expression profiling identifies clinically relevant subtypes of prostate cancer" Proc. Natl. Acad. Sci. USA, Vol. 101, Issue 3, , January 20, 2004 Proc. Natl. Acad. Sci. USA, Vol. 101, Issue 3, , January 20, 2004

ASSOCIATION RULES/ LINK ANALYSIS Find relationships between data Find relationships between data 32IIIT Allahabad

ASSOCIATION RULES EXAMPLES People who buy diapers also buy beer People who buy diapers also buy beer If gene A is highly expressed in this disease then gene A is also expressed If gene A is highly expressed in this disease then gene A is also expressed Relationships between people Relationships between people Book Stores Book Stores Department Stores Department Stores Advertising Advertising Product Placement Product Placement 33IIIT Allahabad

34IIIT Allahabad Data Mining Introductory and Advanced Topics, by Margaret H. Dunham, Prentice Hall, DILBERT reprinted by permission of United Feature Syndicate, Inc.

35IIIT Allahabad Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts:, Dallas Morning News, June 4, 2007 Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts:, Dallas Morning News, June 4, 2007.

No/Little Cheating 36IIIT Allahabad Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts:, Dallas Morning News, June 4, 2007.

Rampant Cheating 37IIIT Allahabad Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts:, Dallas Morning News, June 4, 2007.

38IIIT Allahabad Jialun Qin, Jennifer J. Xu, Daning Hu, Marc Sageman and Hsinchun Chen, “Analyzing Terrorist Networks: A Case Study of the Global Salafi Jihad Network” Lecture Notes in Computer Science, Publisher: Springer-Verlag GmbH, Volume 3495 / 2005, p. 287.

Ex: Stock Market Analysis Example: Stock Market Example: Stock Market Predict future values Predict future values Determine similar patterns over time Determine similar patterns over time Classify behavior Classify behavior 39IIIT Allahabad

Ex: Stock Market Analysis 40IIIT Allahabad

Data Mining vs. KDD Knowledge Discovery in Databases (KDD): process of finding useful information and patterns in data. Knowledge Discovery in Databases (KDD): process of finding useful information and patterns in data. Data Mining: Use of algorithms to extract the information and patterns derived by the KDD process. Data Mining: Use of algorithms to extract the information and patterns derived by the KDD process. 41IIIT Allahabad

KDD Process Selection: Obtain data from various sources. Selection: Obtain data from various sources. Preprocessing: Cleanse data. Preprocessing: Cleanse data. Transformation: Convert to common format. Transform to new format. Transformation: Convert to common format. Transform to new format. Data Mining: Obtain desired results. Data Mining: Obtain desired results. Interpretation/Evaluation: Present results to user in meaningful manner. Interpretation/Evaluation: Present results to user in meaningful manner. 42IIIT Allahabad Modified from [FPSS96C]

KDD Process Ex: Web Log Selection: Selection: –Select log data (dates and locations) to use Preprocessing: Preprocessing: – Remove identifying URLs; Remove error logs Transformation: Transformation: –Sessionize (sort and group) Data Mining: Data Mining: –Identify and count patterns; Construct data structure Interpretation/Evaluation: Interpretation/Evaluation: –Identify and display frequently accessed sequences. Potential User Applications: Potential User Applications: –Cache prediction –Personalization 43IIIT Allahabad

Related Topics Databases Databases OLTP OLTP OLAP OLAP Information Retrieval Information Retrieval 44IIIT Allahabad

45 DB & OLTP Systems Schema Schema –(ID,Name,Address,Salary,JobNo) Data Model Data Model –ER –Relational Transaction Transaction Query: Query: SELECT Name FROM T WHERE Salary > DM: Only imprecise queries IIIT Allahabad

46 Classification/Prediction is Fuzzy Loan Amnt SimpleFuzzy Accept Reject IIIT Allahabad

47 Information Retrieval Information Retrieval (IR): retrieving desired information from textual data. Information Retrieval (IR): retrieving desired information from textual data. Library Science Library Science Digital Libraries Digital Libraries Web Search Engines Web Search Engines Traditionally keyword based Traditionally keyword based Sample query: Sample query: Find all documents about “data mining”. DM: Similarity measures; Mine text/Web data. Mine text/Web data. IIIT Allahabad

48 Information Retrieval (cont’d) Similarity: measure of how close a query is to a document. Similarity: measure of how close a query is to a document. Documents which are “close enough” are retrieved. Documents which are “close enough” are retrieved. Metrics: Metrics: –Precision = |Relevant and Retrieved| |Retrieved| |Retrieved| –Recall = |Relevant and Retrieved| |Relevant| |Relevant| IIIT Allahabad

49 IR Query Result Measures and Classification IRClassification IIIT Allahabad

50 OLAP Online Analytic Processing (OLAP): provides more complex queries than OLTP. Online Analytic Processing (OLAP): provides more complex queries than OLTP. OnLine Transaction Processing (OLTP): traditional database/transaction processing. OnLine Transaction Processing (OLTP): traditional database/transaction processing. Dimensional data; cube view Dimensional data; cube view Visualization of operations: Visualization of operations: –Slice: examine sub-cube. –Dice: rotate cube to look at another dimension. –Roll Up/Drill Down DM: May use OLAP queries. IIIT Allahabad

51 DM vs. Related Topics IIIT Allahabad

Data Mining Development 52IIIT Allahabad Similarity Measures Hierarchical Clustering IR Systems Imprecise Queries Textual Data Web Search Engines Bayes Theorem Regression Analysis EM Algorithm K-Means Clustering Time Series Analysis Neural Networks Decision Tree Algorithms Algorithm Design Techniques Algorithm Analysis Data Structures Relational Data Model SQL Association Rule Algorithms Data Warehousing Scalability Techniques

KDD Issues Human Interaction Human Interaction Overfitting Overfitting Outliers Outliers Interpretation Interpretation Visualization Visualization Large Datasets Large Datasets High Dimensionality High Dimensionality 53IIIT Allahabad

Overfitting Suppose we want to predict whether an individual is short, medium, or tall in height. What is wrong with this data? Suppose we want to predict whether an individual is short, medium, or tall in height. What is wrong with this data? 54IIIT Allahabad NameGenderHeightOutput MaryF1.6Short MaggieF1.9Medium MarthaF1.88Medium StephanieF1.7Short BobM1.85Medium KathyF1.6Short GeorgeM1.7Short DebbieF1.8Medium ToddM1.95Medium KimF1.9Medium AmyF1.8Medium WynetteF1.75Medium

KDD Issues (cont’d) Multimedia Data Multimedia Data Missing Data Missing Data Irrelevant Data Irrelevant Data Noisy Data Noisy Data Changing Data Changing Data Integration Integration Application Application 55IIIT Allahabad

WARNING With data mining you don’t always know what you are looking for. With data mining you don’t always know what you are looking for. There is not one right answer. There is not one right answer. The data you are using is noisy The data you are using is noisy Data Mining is a very applied discipline. Data Mining is a very applied discipline. A data mining course provides you tools to use to analyze data. A data mining course provides you tools to use to analyze data. Experience provides you knowledge of how to use these tools. Experience provides you knowledge of how to use these tools. 56IIIT Allahabad

57IIIT Allahabad r=32236

58IIIT Allahabad

Social Implications of DM Privacy Privacy Profiling Profiling Unauthorized use Unauthorized use Invalid results and claims Invalid results and claims 59IIIT Allahabad

Data Mining Metrics Usefulness Usefulness Return on Investment (ROI) Return on Investment (ROI) Accuracy Accuracy … Space/Time Space/Time 60IIIT Allahabad

Visualization Techniques Graphical Graphical Geometric Geometric Icon-based Icon-based Pixel-based Pixel-based Hierarchical Hierarchical Hybrid Hybrid 61IIIT Allahabad

Models Based on Summarization Visualization: Frequency distribution, mean, variance, median, mode, etc. Visualization: Frequency distribution, mean, variance, median, mode, etc. Box Plot: Box Plot: 62IIIT Allahabad

Scatter Diagram 63IIIT Allahabad

DM Tools XLMiner – Easy addin to Excel XLMiner – Easy addin to Excel Weka – Open Source; Visualization, Functionality, Interface Weka – Open Source; Visualization, Functionality, Interface SAS (JMP) – Commercial Product SAS (JMP) – Commercial Product SPSS – Commercial Product SPSS – Commercial Product MATLAB – Statistical/Math Applications MATLAB – Statistical/Math Applications R – Programming R – Programming 64IIIT Allahabad