Supervised Learning and k-Nearest Neighbors
Business Intelligence for Managers

Supervised learning and classification
- Given: a dataset of instances with known categories
- Goal: using the "knowledge" in the dataset, classify a given instance, i.e., predict the category of the given instance that is rationally consistent with the dataset

Classifiers
[Diagram: a classifier takes the feature values X1, X2, X3, ..., Xn of an instance and outputs a category Y, drawing on a DB, a collection of instances with known categories.]

Algorithms
- k-Nearest Neighbors (kNN)
- Naïve-Bayes
- Decision trees
- Many others (support vector machines, neural networks, genetic algorithms, etc.)

k-Nearest Neighbors
For a given instance T, get the top k dataset instances that are "nearest" to T:
1. Select a reasonable distance measure.
2. Inspect the categories of these k instances, and choose the category C that represents the most instances.
3. Conclude that T belongs to category C.
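
A minimal Python sketch of this procedure, using Euclidean distance as the "reasonable distance measure"; the names knn_classify and euclidean are illustrative, and nothing here is prescribed by the slides beyond the three steps above:

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(query, dataset, k=3):
    """Classify `query` by majority vote among its k nearest neighbors.

    `dataset` is a list of (feature_vector, category) pairs.
    """
    # Rank all known instances by their distance to the query
    neighbors = sorted(dataset, key=lambda item: euclidean(item[0], query))
    # Tally the categories of the k nearest instances
    top_k_categories = [category for _, category in neighbors[:k]]
    # The category representing the most of the k neighbors wins
    return Counter(top_k_categories).most_common(1)[0][0]
```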

Example 1
Determining the decision on a scholarship application based on the following features:
- Household income (annual income in millions of pesos)
- Number of siblings in family
- High school grade (on a QPI scale of 1.0 – 4.0)
Intuition (reflected in the data set): award scholarships to high performers and to those with financial need.
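
To make the feature set concrete, here is the earlier knn_classify sketch applied to invented applicant records; the values and decisions below are made up purely for illustration and are not from the original dataset:

```python
# Each record: ((income in millions of pesos, siblings, QPI grade), decision)
applicants = [
    ((0.3, 4, 3.6), "award"),
    ((0.3, 2, 2.4), "deny"),
    ((1.2, 1, 3.9), "award"),
    ((2.5, 0, 2.8), "deny"),
]
# Classify a new applicant with the knn_classify sketch defined above
print(knn_classify((0.4, 3, 3.2), applicants, k=3))  # -> "award"
```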

Distance formula
Euclidean distance: the square root of the sum of the squares of the differences; for two features, d = √((Δx)² + (Δy)²).
- Intuition: similar samples should be close to each other
- May not always apply (example: quota and actual sales)
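
As a quick numeric check of the formula, with assumed feature differences Δx = 3 and Δy = 4:

```latex
d = \sqrt{(\Delta x)^2 + (\Delta y)^2} = \sqrt{3^2 + 4^2} = \sqrt{25} = 5
```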

Incomparable ranges
The Euclidean distance formula carries the implicit assumption that the different dimensions are comparable. Features that span wider ranges affect the distance value more than features with limited ranges. In the scholarship example, if income were expressed in raw pesos, income differences in the thousands would swamp grade differences of at most a few points.

Example revisited
Suppose household income was instead indicated in thousands of pesos per month, and that grades are given on a different scale. Note the different results produced by the kNN algorithm on what is substantively the same dataset.

Non-numeric data
Feature values are not always numbers. Examples:
- Boolean values: yes or no, presence or absence of an attribute
- Categories: colors, educational attainment, gender
How do these values factor into the computation of distance?

Dealing with non-numeric data
- Boolean values: convert to 0 or 1 (applies to yes-no / presence-absence attributes)
- Non-binary characterizations:
  - Use a natural progression when applicable; e.g., educational attainment: GS, HS, College, MS, PhD => 1, 2, 3, 4, 5
  - Assign arbitrary numbers, but be careful about the distances they imply; e.g., color: red, yellow, blue => 1, 2, 3
- How about unavailable data? (A 0 value is not always the answer.)
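
A small sketch of such conversions; the attribute names and the specific number assignments are illustrative choices following the progressions suggested above, not values fixed by the slides:

```python
# Boolean attribute: yes/no becomes 1/0
BOOLEAN = {"no": 0, "yes": 1}

# Natural progression: educational attainment has an inherent order
EDUCATION = {"GS": 1, "HS": 2, "College": 3, "MS": 4, "PhD": 5}

# Arbitrary assignment: colors have no inherent order, so the implied
# distances (red-to-yellow = 1, red-to-blue = 2) are artifacts of the coding
COLOR = {"red": 1, "yellow": 2, "blue": 3}

def encode(has_attribute, education, color):
    """Map one raw record to a numeric feature vector."""
    return (BOOLEAN[has_attribute], EDUCATION[education], COLOR[color])

print(encode("yes", "MS", "blue"))  # -> (1, 4, 3)
```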

Preprocessing your dataset
The dataset may need to be preprocessed to ensure more reliable data mining results:
- Conversion of non-numeric data to numeric data
- Calibration of numeric data to reduce the effects of disparate ranges, particularly when using the Euclidean distance metric
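
One common calibration is min-max rescaling of every feature to [0, 1]; the slides call for calibration without naming a method, so this particular choice is an assumption:

```python
def min_max_normalize(dataset):
    """Rescale each feature column of a list of numeric vectors to [0, 1]."""
    columns = list(zip(*dataset))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple((value - lo) / (hi - lo) if hi > lo else 0.0
              for value, lo, hi in zip(row, lows, highs))
        for row in dataset
    ]

# Income (thousands of pesos/month), siblings, and grade span disparate ranges
print(min_max_normalize([(25.0, 4, 3.6), (90.0, 1, 2.4), (150.0, 0, 3.9)]))
```

After rescaling, each feature contributes to the Euclidean distance on an equal footing.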

k-NN variations
- Value of k:
  - A larger k increases confidence in the prediction
  - Note that if k is too large, the decision may be skewed
- Weighted evaluation of nearest neighbors:
  - A plain majority may unfairly skew the decision
  - Revise the algorithm so that closer neighbors have greater "vote weight" (see the sketch below)
- Other distance measures
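
A sketch of the weighted variation using inverse-distance weights, which is one common way to give closer neighbors a greater vote; the slides do not prescribe a specific weighting scheme:

```python
import math
from collections import defaultdict

def weighted_knn_classify(query, dataset, k=3):
    """k-NN where each of the k nearest neighbors votes with weight 1/distance."""
    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Sort (distance, category) pairs from nearest to farthest
    ranked = sorted((euclidean(features, query), category)
                    for features, category in dataset)
    votes = defaultdict(float)
    for distance, category in ranked[:k]:
        votes[category] += 1.0 / (distance + 1e-9)  # epsilon avoids divide-by-zero
    return max(votes, key=votes.get)
```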

Other distance measures
- City-block distance (Manhattan distance): add the absolute values of the differences
- Cosine similarity: measure the angle formed by the two samples (with the origin)
- Jaccard distance: determine the percentage of exact matches between the samples (not including unavailable data)
- Others
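
Sketches of these three measures; the exact-match measure is written over discrete feature vectors with None marking unavailable data, which is one possible reading of the slide:

```python
import math

def manhattan(a, b):
    """City-block distance: sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle the two samples form with the origin."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def match_percentage(a, b):
    """Percentage of exact matches, skipping positions with unavailable data."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)
```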

k-NN time complexity
Suppose there are m instances and n features in the dataset:
- The nearest neighbor algorithm requires computing m distances
- Each distance computation involves scanning through each of the n feature values
- The running time is therefore proportional to m × n, i.e., O(mn)
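
The m × n cost is visible in a brute-force sketch of the nearest-neighbor scan; it compares squared distances, which preserves the ordering while saving the square root:

```python
def nearest_neighbor(query, dataset):
    """Brute-force 1-NN: one pass over all m instances."""
    best_category, best_distance = None, float("inf")
    for features, category in dataset:                          # m instances
        d = sum((x - y) ** 2 for x, y in zip(features, query))  # n features each
        if d < best_distance:
            best_category, best_distance = category, d
    return best_category
```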