Presentation is loading. Please wait.

Presentation is loading. Please wait.

DATA MINING Part I IIIT Allahabad Margaret H. Dunham Department of Computer Science and Engineering Southern Methodist University Dallas, Texas 75275,

Similar presentations


Presentation on theme: "DATA MINING Part I IIIT Allahabad Margaret H. Dunham Department of Computer Science and Engineering Southern Methodist University Dallas, Texas 75275,"— Presentation transcript:

1

2 DATA MINING Part I IIIT Allahabad Margaret H. Dunham Department of Computer Science and Engineering Southern Methodist University Dallas, Texas 75275, USA mhd@lyle.smu.edu http://lyle.smu.edu/~mhd/dmiiit.html Some slides extracted from Data Mining, Introductory and Advanced Topics, Prentice Hall, 2002. Support provided by Fulbright Grant and IIIT Allahabad IIIT Allahabad 1

3 2

4 Data Mining Outline Part I: Introduction (19/1 – 20/1) Part I: Introduction (19/1 – 20/1) Part II: Classification (24/1 – 27/1) Part II: Classification (24/1 – 27/1) Part III: Clustering (31/1 – 3/2) Part III: Clustering (31/1 – 3/2) Part IV: Association Rules (7/2 – 10/2) Part IV: Association Rules (7/2 – 10/2) Part V: Applications (14/2 – 17/2) Part V: Applications (14/2 – 17/2) 3IIIT Allahabad

5 Class Structure Each class is two hours Each class is two hours Tuesday/Wednesday presentation Tuesday/Wednesday presentation Thursday/Friday Lab Thursday/Friday Lab 4IIIT Allahabad

6 Data Mining Part I Introduction Outline Lecture Lecture –Define data mining –Data mining vs. databases –Basic data mining tasks –Data mining development –Data mining issues Lab Lab –Download XLMiner and Weka –Analyze simple dataset 5IIIT Allahabad Goal: Provide an overview of data mining.

7 Introduction Data is growing at a phenomenal rate Data is growing at a phenomenal rate Users expect more sophisticated information Users expect more sophisticated information How? How? 6IIIT Allahabad UNCOVER HIDDEN INFORMATION DATA MINING

8 Data Mining Definition Finding hidden information in a database Finding hidden information in a database Fit data to a model Fit data to a model Similar terms Similar terms –Exploratory data analysis –Data driven discovery –Deductive learning 7IIIT Allahabad

9 Data Mining Algorithm Objective: Fit Data to a Model Objective: Fit Data to a Model –Descriptive –Predictive Preference – Technique to choose the best model Preference – Technique to choose the best model Search – Technique to search the data Search – Technique to search the data –“Query” 8IIIT Allahabad

10 Database Processing vs. Data Mining Processing Query Query –Well defined –SQL Query –Poorly defined –No precise query language 9IIIT Allahabad Data Data – Operational data Output Output – Precise – Subset of database Data Data – Not operational data Output Output – Fuzzy – Not a subset of database

11 Query Examples Database Database Data Mining Data Mining 10IIIT Allahabad – Find all customers who have purchased milk – Find all items which are frequently purchased with milk. (association rules) – Find all credit applicants with last name of Smith. – Identify customers who have purchased more than $10,000 in the last month. – Find all credit applicants who are poor credit risks. (classification) – Identify customers with similar buying habits. (Clustering)

12 Basic Data Mining Tasks Classification maps data into predefined groups or classes Classification maps data into predefined groups or classes –Supervised learning –Prediction –Regression Clustering groups similar data together into clusters. Clustering groups similar data together into clusters. –Unsupervised learning –Segmentation –Partitioning 11IIIT Allahabad

13 Basic Data Mining Tasks (cont’d) Link Analysis uncovers relationships among data. Link Analysis uncovers relationships among data. –Affinity Analysis –Association Rules –Sequential Analysis determines sequential patterns. 12IIIT Allahabad

14 CLASSIFICATION Assign data into predefined groups or classes. Assign data into predefined groups or classes. 13IIIT Allahabad

15 But it isn’t Magic You must know what you are looking for You must know what you are looking for You must know how to look for you You must know how to look for you 14IIIT Allahabad Suppose you knew that a specific cave had gold: What would you look for? What would you look for? How would you look for it? How would you look for it? Might need an expert miner Might need an expert miner

16 “If it looks like a duck, walks like a duck, and quacks like a duck, then it’s a duck.” “If it looks like a duck, walks like a duck, and quacks like a duck, then it’s a duck.” 15IIIT Allahabad Description Behavior Associations Classification Clustering Link Analysis () (Similarity) (Profiling) (Similarity) “If it looks like a terrorist, walks like a terrorist, and quacks like a terrorist, then it’s a terrorist.”

17 Classification Ex: Grading 16IIIT Allahabad >=90<90 x >=80<80 x >=70<70 x F B A >=60<50 x C D

18 17IIIT Allahabad Grasshoppers Katydids Given a collection of annotated data. (in this case 5 instances of Katydids and five of Grasshoppers), decide what type of insect the unlabeled example is. (c) Eamonn Keogh, eamonn@cs.ucr.edu

19 18IIIT Allahabad Insect ID AbdomenLengthAntennaeLength Insect Class 12.75.5Grasshopper 28.09.1Katydid 30.94.7Grasshopper 41.13.1Grasshopper 55.48.5Katydid 62.91.9Grasshopper 76.16.6Katydid 80.51.0Grasshopper 98.36.6Katydid 108.14.7Katydid 11 5.1 7.0 ??????? ??????? The classification problem can now be expressed as: Given a training database predict the class label of a previously unseen instance Given a training database predict the class label of a previously unseen instance previously unseen instance = (c) Eamonn Keogh, eamonn@cs.ucr.edu

20 19IIIT Allahabad Antenna Length 10 123456789 1 2 3 4 5 6 7 8 9 Grasshoppers Katydids Abdomen Length (c) Eamonn Keogh, eamonn@cs.ucr.edu

21 20IIIT Allahabad Facial Recognition (c) Eamonn Keogh, eamonn@cs.ucr.edu

22 21IIIT Allahabad Handwriting Recognition George Washington Manuscript 0 50100150200250300350400450 0 0.5 1 (c) Eamonn Keogh, eamonn@cs.ucr.edu

23 Anomaly Detection 22IIIT Allahabad

24 23IIIT Allahabad

25 CLUSTERING Partition data into previously undefined groups. Partition data into previously undefined groups. 24IIIT Allahabad

26 25IIIT Allahabad http://149.170.199.144/multivar/ca.htm

27 26IIIT Allahabad What is Similarity? (c) Eamonn Keogh, eamonn@cs.ucr.edu

28 Two Types of Clustering 27IIIT Allahabad Hierarchical Partitional (c) Eamonn Keogh, eamonn@cs.ucr.edu

29 Hierarchical Clustering Example Iris Data Set 28IIIT Allahabad Setosa Versicolor Virginica The data originally appeared in Fisher, R. A. (1936). "The Use of Multiple Measurements in Axonomic Problems," Annals of Eugenics 7, 179-188. Hierarchical Clustering Explorer Version 3.0, Human-Computer Interaction Lab, University of Maryland, http://www.cs.umd.edu/hcil/multi-cluster.http://www.cs.umd.edu/hcil/multi-cluster

30 http://www.time.com/time/magazine/article/0,9171,1541283,00.html http://www.time.com/time/magazine/article/0,9171,1541283,00.html 29IIIT Allahabad

31 Microarray Data Analysis Each probe location associated with gene Each probe location associated with gene Color indicates degree of gene expression Color indicates degree of gene expression Compare different samples (normal/disease) Compare different samples (normal/disease) Track same sample over time Track same sample over time Questions Questions –Which genes are related to this disease? –Which genes behave in a similar manner? –What is the function of a gene? Clustering Clustering –Hierarchical –K-means 30IIIT Allahabad

32 Microarray Data - Clustering 31IIIT Allahabad Gene expression profiling identifies clinically relevant subtypes " Gene expression profiling identifies clinically relevant subtypes of prostate cancer" Proc. Natl. Acad. Sci. USA, Vol. 101, Issue 3, 811- 816, January 20, 2004 Proc. Natl. Acad. Sci. USA, Vol. 101, Issue 3, 811- 816, January 20, 2004

33 ASSOCIATION RULES/ LINK ANALYSIS Find relationships between data Find relationships between data 32IIIT Allahabad

34 ASSOCIATION RULES EXAMPLES People who buy diapers also buy beer People who buy diapers also buy beer If gene A is highly expressed in this disease then gene A is also expressed If gene A is highly expressed in this disease then gene A is also expressed Relationships between people Relationships between people Book Stores Book Stores Department Stores Department Stores Advertising Advertising Product Placement Product Placement 33IIIT Allahabad

35 34IIIT Allahabad Data Mining Introductory and Advanced Topics, by Margaret H. Dunham, Prentice Hall, 2003. DILBERT reprinted by permission of United Feature Syndicate, Inc.

36 35IIIT Allahabad Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts:, Dallas Morning News, June 4, 2007 Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts:, Dallas Morning News, June 4, 2007.

37 No/Little Cheating 36IIIT Allahabad Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts:, Dallas Morning News, June 4, 2007.

38 Rampant Cheating 37IIIT Allahabad Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts:, Dallas Morning News, June 4, 2007.

39 38IIIT Allahabad Jialun Qin, Jennifer J. Xu, Daning Hu, Marc Sageman and Hsinchun Chen, “Analyzing Terrorist Networks: A Case Study of the Global Salafi Jihad Network” Lecture Notes in Computer Science, Publisher: Springer-Verlag GmbH, Volume 3495 / 2005, p. 287.

40 Ex: Stock Market Analysis Example: Stock Market Example: Stock Market Predict future values Predict future values Determine similar patterns over time Determine similar patterns over time Classify behavior Classify behavior 39IIIT Allahabad

41 Ex: Stock Market Analysis 40IIIT Allahabad

42 Data Mining vs. KDD Knowledge Discovery in Databases (KDD): process of finding useful information and patterns in data. Knowledge Discovery in Databases (KDD): process of finding useful information and patterns in data. Data Mining: Use of algorithms to extract the information and patterns derived by the KDD process. Data Mining: Use of algorithms to extract the information and patterns derived by the KDD process. 41IIIT Allahabad

43 KDD Process Selection: Obtain data from various sources. Selection: Obtain data from various sources. Preprocessing: Cleanse data. Preprocessing: Cleanse data. Transformation: Convert to common format. Transform to new format. Transformation: Convert to common format. Transform to new format. Data Mining: Obtain desired results. Data Mining: Obtain desired results. Interpretation/Evaluation: Present results to user in meaningful manner. Interpretation/Evaluation: Present results to user in meaningful manner. 42IIIT Allahabad Modified from [FPSS96C]

44 KDD Process Ex: Web Log Selection: Selection: –Select log data (dates and locations) to use Preprocessing: Preprocessing: – Remove identifying URLs; Remove error logs Transformation: Transformation: –Sessionize (sort and group) Data Mining: Data Mining: –Identify and count patterns; Construct data structure Interpretation/Evaluation: Interpretation/Evaluation: –Identify and display frequently accessed sequences. Potential User Applications: Potential User Applications: –Cache prediction –Personalization 43IIIT Allahabad

45 Related Topics Databases Databases OLTP OLTP OLAP OLAP Information Retrieval Information Retrieval 44IIIT Allahabad

46 45 DB & OLTP Systems Schema Schema –(ID,Name,Address,Salary,JobNo) Data Model Data Model –ER –Relational Transaction Transaction Query: Query: SELECT Name FROM T WHERE Salary > 100000 DM: Only imprecise queries IIIT Allahabad

47 46 Classification/Prediction is Fuzzy Loan Amnt SimpleFuzzy Accept Reject IIIT Allahabad

48 47 Information Retrieval Information Retrieval (IR): retrieving desired information from textual data. Information Retrieval (IR): retrieving desired information from textual data. Library Science Library Science Digital Libraries Digital Libraries Web Search Engines Web Search Engines Traditionally keyword based Traditionally keyword based Sample query: Sample query: Find all documents about “data mining”. DM: Similarity measures; Mine text/Web data. Mine text/Web data. IIIT Allahabad

49 48 Information Retrieval (cont’d) Similarity: measure of how close a query is to a document. Similarity: measure of how close a query is to a document. Documents which are “close enough” are retrieved. Documents which are “close enough” are retrieved. Metrics: Metrics: –Precision = |Relevant and Retrieved| |Retrieved| |Retrieved| –Recall = |Relevant and Retrieved| |Relevant| |Relevant| IIIT Allahabad

50 49 IR Query Result Measures and Classification IRClassification IIIT Allahabad

51 50 OLAP Online Analytic Processing (OLAP): provides more complex queries than OLTP. Online Analytic Processing (OLAP): provides more complex queries than OLTP. OnLine Transaction Processing (OLTP): traditional database/transaction processing. OnLine Transaction Processing (OLTP): traditional database/transaction processing. Dimensional data; cube view Dimensional data; cube view Visualization of operations: Visualization of operations: –Slice: examine sub-cube. –Dice: rotate cube to look at another dimension. –Roll Up/Drill Down DM: May use OLAP queries. IIIT Allahabad

52 51 DM vs. Related Topics IIIT Allahabad

53 Data Mining Development 52IIIT Allahabad Similarity Measures Hierarchical Clustering IR Systems Imprecise Queries Textual Data Web Search Engines Bayes Theorem Regression Analysis EM Algorithm K-Means Clustering Time Series Analysis Neural Networks Decision Tree Algorithms Algorithm Design Techniques Algorithm Analysis Data Structures Relational Data Model SQL Association Rule Algorithms Data Warehousing Scalability Techniques

54 KDD Issues Human Interaction Human Interaction Overfitting Overfitting Outliers Outliers Interpretation Interpretation Visualization Visualization Large Datasets Large Datasets High Dimensionality High Dimensionality 53IIIT Allahabad

55 Overfitting Suppose we want to predict whether an individual is short, medium, or tall in height. What is wrong with this data? Suppose we want to predict whether an individual is short, medium, or tall in height. What is wrong with this data? 54IIIT Allahabad NameGenderHeightOutput MaryF1.6Short MaggieF1.9Medium MarthaF1.88Medium StephanieF1.7Short BobM1.85Medium KathyF1.6Short GeorgeM1.7Short DebbieF1.8Medium ToddM1.95Medium KimF1.9Medium AmyF1.8Medium WynetteF1.75Medium

56 KDD Issues (cont’d) Multimedia Data Multimedia Data Missing Data Missing Data Irrelevant Data Irrelevant Data Noisy Data Noisy Data Changing Data Changing Data Integration Integration Application Application 55IIIT Allahabad

57 WARNING With data mining you don’t always know what you are looking for. With data mining you don’t always know what you are looking for. There is not one right answer. There is not one right answer. The data you are using is noisy The data you are using is noisy Data Mining is a very applied discipline. Data Mining is a very applied discipline. A data mining course provides you tools to use to analyze data. A data mining course provides you tools to use to analyze data. Experience provides you knowledge of how to use these tools. Experience provides you knowledge of how to use these tools. 56IIIT Allahabad

58 57IIIT Allahabad http://ieeexplore.ieee.org/iel5/6/32236/01502526.pdf?tp=&arnumber=1502526&isnumbe r=32236

59 58IIIT Allahabad

60 Social Implications of DM Privacy Privacy Profiling Profiling Unauthorized use Unauthorized use Invalid results and claims Invalid results and claims 59IIIT Allahabad

61 Data Mining Metrics Usefulness Usefulness Return on Investment (ROI) Return on Investment (ROI) Accuracy Accuracy … Space/Time Space/Time 60IIIT Allahabad

62 Visualization Techniques Graphical Graphical Geometric Geometric Icon-based Icon-based Pixel-based Pixel-based Hierarchical Hierarchical Hybrid Hybrid 61IIIT Allahabad

63 Models Based on Summarization Visualization: Frequency distribution, mean, variance, median, mode, etc. Visualization: Frequency distribution, mean, variance, median, mode, etc. Box Plot: Box Plot: 62IIIT Allahabad

64 Scatter Diagram 63IIIT Allahabad

65 DM Tools XLMiner – Easy addin to Excel XLMiner – Easy addin to Excel http://www.solver.com/xlminer/index.html Weka – Open Source; Visualization, Functionality, Interface Weka – Open Source; Visualization, Functionality, Interface http://www.cs.waikato.ac.nz/ml/weka/ SAS (JMP) – Commercial Product SAS (JMP) – Commercial Product SPSS – Commercial Product SPSS – Commercial Product MATLAB – Statistical/Math Applications MATLAB – Statistical/Math Applications R – Programming R – Programming 64IIIT Allahabad


Download ppt "DATA MINING Part I IIIT Allahabad Margaret H. Dunham Department of Computer Science and Engineering Southern Methodist University Dallas, Texas 75275,"

Similar presentations


Ads by Google