2008 NVO Summer School 1: Basic Concepts in Data Mining
Kirk Borne, George Mason University
The US National Virtual Observatory

2008 NVO Summer School 2: Basic Concepts = Key Steps
The key steps in a data mining project usually invoke and/or follow these basic concepts:
– Data browse, preview, and selection
– Data cleaning and preparation
– Feature selection
– Data normalization and transformation
– Similarity/distance metric selection
– Select the data mining method
– Apply the data mining method
– Gather and analyze data mining results
– Accuracy estimation
– Avoiding overfitting

2008 NVO Summer School 3: Key Concept for Data Mining: Data Previewing
Data previewing allows you to get a sense of the good, bad, and ugly parts of the database. This includes:
– Histograms of attribute distributions
– Scatter plots of attribute combinations
– Max-min value checks (versus expectations)
– Summarizations, aggregations (GROUP BY)
– SELECT UNIQUE values (versus expectations)
– Checking physical units (and scale factors)
– External checks (cross-database comparisons)
– Verification against the input database
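As a minimal sketch of these previewing steps (assuming a hypothetical catalog file survey_catalog.csv with illustrative column names such as r, g_r, and obj_class), the pandas library covers most of the checks listed above:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical sky-survey catalog; file and column names are illustrative.
df = pd.read_csv("survey_catalog.csv")

print(df.describe())                    # max-min value checks versus expectations
df.hist(column=["r", "g_r"], bins=50)   # histograms of attribute distributions
df.plot.scatter(x="g_r", y="r")         # scatter plot of an attribute combination
print(df.groupby("obj_class").size())   # summarization/aggregation (GROUP BY)
print(df["obj_class"].unique())         # SELECT UNIQUE values versus expectations
plt.show()
```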

2008 NVO Summer School 4: Key Concept for Data Mining: Data Preparation = Cleaning the Data
Data preparation can take 40-80% (or more) of the effort in a data mining project. This includes:
– Dealing with NULL (missing) values
– Dealing with errors
– Dealing with noise
– Dealing with outliers (unless that is your science!)
– Transformations: units, scale, projections
– Data normalization
– Relevance analysis: feature selection
– Removing redundant attributes
– Dimensionality reduction
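A sketch of a few of these cleaning steps, again with hypothetical column names and a made-up -9999 sentinel value standing in for an error code:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("survey_catalog.csv")            # hypothetical catalog

# NULL (missing) values: impute with the column median (dropping rows is the other option)
df["z"] = df["z"].fillna(df["z"].median())

# Error codes: treat a -9999 sentinel as missing, then drop those rows
df = df.replace(-9999, np.nan).dropna()

# Outliers: keep only rows within 5 sigma of the mean (unless outliers are your science!)
mu, sigma = df["r"].mean(), df["r"].std()
df = df[np.abs(df["r"] - mu) < 5.0 * sigma]
```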

2008 NVO Summer School 5: Key Concept for Data Mining: Feature Selection and the Feature Vector
A feature vector is the attribute vector for a database record (tuple). The feature vector's components are database attributes: v = {w, x, y, z}. It contains the set of database attributes that you have chosen to uniquely represent (describe) each data element (tuple).
– This is only a subset of all possible attributes in the DB.
Example: a sky survey database object feature vector:
– Generic: {RA, Dec, mag, redshift, color, size}
– Specific: {ra2000, dec2000, r, z, g-r, R_eff}
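For concreteness, one way to assemble the specific feature vector above from a single catalog record (the attribute values here are invented):

```python
import numpy as np

# One database record (tuple), with invented values
record = {"ra2000": 185.72, "dec2000": 15.82, "r": 17.3,
          "z": 0.023, "g_r": 0.74, "R_eff": 4.1}

# The chosen subset of attributes that uniquely describes each data element
features = ["ra2000", "dec2000", "r", "z", "g_r", "R_eff"]
v = np.array([record[f] for f in features])   # the feature vector v
print(v)
```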

2008 NVO Summer School 6: Key Concept for Data Mining: Data Types
Different data types:
– Continuous: numeric (e.g., salaries, ages, temperatures, rainfall, sales)
– Discrete:
  - Binary (0 or 1; Yes/No; Male/Female)
  - Boolean (True/False)
  - Specific list of allowed values (e.g., zip codes; country names; chemical elements; amino acids; planets)
– Categorical: non-numeric (character/text data), e.g., people's names; can be Ordinal (ordered) or Nominal (not ordered)
Examples of data mining classification techniques:
– Regression for continuous numeric data
– Logistic regression for discrete data
– Bayesian classification for categorical data

2008 NVO Summer School 7: Key Concept for Data Mining: Data Normalization & Data Transformation
Data normalization transforms data values for different database attributes into a uniform set of units or onto a uniform scale (i.e., to a common min-max range). Data normalization assigns the correct numerical weighting to the values of different attributes. For example:
– Transform all numerical values from min to max onto a 0 to 1 scale (or 0 to Weight; or -1 to 1; or 0 to 100; ...).
– Convert discrete or character (categorical) data into numeric values.
– Transform ordinal data to a ranked list (numeric).
– Discretize continuous data into bins.
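A minimal sketch of these transformations in NumPy (the attribute values are invented):

```python
import numpy as np

x = np.array([12.1, 17.3, 21.9, 15.4])           # invented numeric attribute values

# Min-max normalization onto a 0-1 scale (multiply by Weight for a 0-Weight scale)
x01 = (x - x.min()) / (x.max() - x.min())

# Convert categorical data into numeric values (simple integer coding)
colors = ["blue", "green", "blue", "red"]
code = {c: i for i, c in enumerate(sorted(set(colors)))}
colors_num = np.array([code[c] for c in colors])

# Discretize continuous data into bins
edges = np.linspace(x.min(), x.max(), 5)          # five edges -> equal-width bins
x_binned = np.digitize(x, edges)                  # bin index for each value
```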

2008 NVO Summer School 8: Key Concept for Data Mining: Similarity and Distance Metrics
Similarity between complex data objects is one of the central notions in data mining. The fundamental problem is to determine whether any selected pair of data objects exhibits similar characteristics. The problem is both interesting and difficult because the similarity measures should allow for imprecise matches. Similarity and its inverse, distance, provide the basis for all of the fundamental data mining clustering techniques and for many data mining classification techniques.

2008 NVO Summer School 9: Similarity and Distance Measures
Most clustering algorithms depend on a distance or similarity measure to determine (a) the closeness or "alikeness" of cluster members, and (b) the distance or "unlikeness" of members of different clusters.
General requirements for any similarity or distance metric:
– Non-negative: dist(A,B) >= 0 and sim(A,B) >= 0
– Symmetric: dist(A,B) = dist(B,A) and sim(A,B) = sim(B,A)
In order to calculate the "distance" between different attribute values, those attributes must be transformed or normalized (either to the same units, or else normalized to a similar scale). The normalization of both categorical (non-numeric) data and numerical data with units generally requires domain expertise. This is part of the pre-processing (data preparation) step in any data mining activity.

2008 NVO Summer School 10: Popular Similarity and Distance Measures
General L_p distance: ||x - y||_p = [ sum_i |x_i - y_i|^p ]^(1/p)
Euclidean distance (p=2):
– D_E = sqrt[ (x_1-y_1)^2 + (x_2-y_2)^2 + (x_3-y_3)^2 + ... ]
Manhattan distance (p=1; the number of city blocks walked):
– D_M = |x_1-y_1| + |x_2-y_2| + |x_3-y_3| + ...
Cosine distance = angle between two feature vectors:
– d(X,Y) = arccos[ X·Y / (||X|| ||Y||) ]
– d(X,Y) = arccos[ (x_1 y_1 + x_2 y_2 + x_3 y_3) / (||X|| ||Y||) ]
Similarity function: s(x,y) = 1 / [1 + d(x,y)]
– s varies from 1 to 0 as distance d varies from 0 to infinity.
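These measures are straightforward to compute directly; a sketch with invented feature vectors:

```python
import numpy as np

def lp_distance(x, y, p):
    """General L_p distance ||x - y||_p."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def cosine_distance(x, y):
    """Angle between two feature vectors, in radians."""
    c = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return np.arccos(np.clip(c, -1.0, 1.0))   # clip guards against round-off error

def similarity(d):
    """s = 1/(1+d): equals 1 when d = 0, tends to 0 as d tends to infinity."""
    return 1.0 / (1.0 + d)

x = np.array([1.0, 2.0, 3.0])   # invented feature vectors
y = np.array([2.0, 4.0, 1.0])
print(lp_distance(x, y, 2))           # Euclidean (p=2): sqrt(1+4+4) = 3.0
print(lp_distance(x, y, 1))           # Manhattan (p=1): 1+2+2 = 5.0
print(cosine_distance(x, y))          # angle between x and y
print(similarity(lp_distance(x, y, 2)))   # 1/(1+3) = 0.25
```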

2008 NVO Summer School 11: Data Mining Clustering and Nearest Neighbor Algorithms: Issues
Clustering algorithms and nearest neighbor algorithms (for classification) require a distance or similarity metric. You must be especially careful with categorical data, which can be a problem. For example:
– What is the distance between blue and green? Is it larger than the distance from green to red?
– How do you "metrify" different attributes (color, shape, text labels)? This is essential in order to calculate distance in multiple dimensions. Is the distance from blue to green larger or smaller than the distance from round to square? Which of these are most similar?
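The slides leave these questions open. One common (but crude) workaround, shown here as an illustration rather than a recommendation, is the simple matching (overlap) distance: the fraction of categorical attributes on which two objects disagree.

```python
import numpy as np

def overlap_distance(a, b):
    """Simple matching distance for categorical tuples:
    fraction of attributes on which the two objects disagree."""
    a, b = np.asarray(a), np.asarray(b)
    return np.mean(a != b)

obj1 = ("blue", "round", "smooth")     # invented (color, shape, texture) tuples
obj2 = ("green", "round", "smooth")
obj3 = ("red", "square", "clumpy")

print(overlap_distance(obj1, obj2))    # 1/3 -> obj1 and obj2 are the most similar pair
print(overlap_distance(obj1, obj3))    # 3/3 = 1.0 -> maximally dissimilar
```

Note that this treats all attribute disagreements as equally large, which is exactly the assumption the questions above call into doubt; choosing better per-attribute distances requires domain expertise.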

2008 NVO Summer School 12: Key Concept for Data Mining: Classification Accuracy
Typical error matrix, comparing the TRAINING DATA (actual classes) against the NEURAL NETWORK CLASSIFICATION (output):

                       Output: Class-A        Output: Class-B
  Actual Class-A       2834 (True Positive)   (False Negative)
  Actual Class-B       (False Positive)       (True Negative)

2008 NVO Summer School 13: Typical Measures of Accuracy
– Overall Accuracy = (TP+TN) / (TP+TN+FP+FN)
– Producer's Accuracy (Class A) = TP / (TP+FN)
– Producer's Accuracy (Class B) = TN / (FP+TN)
– User's Accuracy (Class A) = TP / (TP+FP)
– User's Accuracy (Class B) = TN / (TN+FN)
Accuracy of the classification on the preceding slide:
– Overall Accuracy = 92.4%
– Producer's Accuracy (Class A) = 89.9%
– Producer's Accuracy (Class B) = 94.7%
– User's Accuracy (Class A) = 94.2%
– User's Accuracy (Class B) = 90.7%
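A small helper that computes all five measures from the error-matrix counts (the counts below are invented for illustration, not the full matrix from the preceding slide):

```python
def accuracy_measures(tp, tn, fp, fn):
    """The five accuracy measures defined above, from error-matrix counts."""
    return {
        "overall":     (tp + tn) / (tp + tn + fp + fn),
        "producers_A": tp / (tp + fn),
        "producers_B": tn / (fp + tn),
        "users_A":     tp / (tp + fp),
        "users_B":     tn / (tn + fn),
    }

print(accuracy_measures(tp=90, tn=95, fp=5, fn=10))   # invented counts
```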

2008 NVO Summer School 14: Key Concept for Data Mining: Overfitting
[Figure: three candidate curves fit through the same data points.]
– g(x) is a poor fit (fitting a straight line through the points).
– h(x) is a good fit.
– d(x) is a very poor fit (fitting every point) = overfitting.

2008 NVO Summer School 15: How to Avoid Overfitting in Data Mining Models
In data mining, the problem arises because you train the model on a set of training data (i.e., a subset of the total database). That training data set is simply intended to be representative of the entire database, not an exact copy of it. So, if you try to fit every nuance in the training data, then you will probably over-constrain the problem and produce a bad fit.
This is where a TEST DATA SET comes in very handy. You can train the data mining model (decision tree or neural network) on the TRAINING DATA, and then measure its accuracy with the TEST DATA, prior to unleashing the model (e.g., a classifier) on real new data.
Different ways of subsetting the TRAINING and TEST data sets:
– 50% of the data used to TRAIN, 50% used to TEST
– 10 different sets of (90% for TRAINING, 10% for TESTING), i.e., 10-fold cross-validation
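Both subsetting schemes are short one-liners in scikit-learn (assuming scikit-learn is available; the data below are random placeholders):

```python
import numpy as np
from sklearn.model_selection import train_test_split, KFold

X = np.random.rand(1000, 6)          # 1000 placeholder feature vectors
y = np.random.randint(0, 2, 1000)    # two placeholder classes

# 50% TRAIN / 50% TEST split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

# 10 different (90% TRAIN, 10% TEST) sets: 10-fold cross-validation
for train_idx, test_idx in KFold(n_splits=10, shuffle=True).split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    # ... train the model on the TRAINING fold, measure accuracy on the TEST fold ...
```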

2008 NVO Summer School 16: Schematic Approach to Avoiding Overfitting
[Figure: error versus training epoch, showing the Training Set error falling steadily while the Test Set error reaches a minimum and then rises. STOP training there!]
To avoid overfitting, you need to know when to stop training the model. Although the Training Set error may continue to decrease, you may simply be overfitting the Training Data. Test this by applying the model to Test Data (not part of the Training Set). If the Test Set error starts to increase, then you know that you are overfitting the Training Set and it is time to stop!
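A sketch of this stopping rule as a generic loop; train_one_epoch and test_error are placeholder callables standing in for whatever model API you use:

```python
def train_with_early_stopping(train_one_epoch, test_error, max_epochs=1000, patience=5):
    """train_one_epoch() runs one pass over the TRAINING data;
    test_error() returns the current error on the TEST data."""
    best_err, bad_epochs = float("inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch()                 # Training Set error keeps falling...
        err = test_error()                # ...but watch the TEST Set error
        if err < best_err:
            best_err, bad_epochs = err, 0 # test error still falling: keep training
        else:
            bad_epochs += 1               # test error rising: possible overfitting
            if bad_epochs >= patience:
                break                     # STOP training here!
    return best_err
```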