Math 5364 Notes Chapter 5: Alternative Classification Techniques Jesse Crawford Department of Mathematics Tarleton State University

Today's Topics The k-Nearest Neighbors Algorithm Methods for Standardizing Data in R The class package, knn, and knn.cv

k-Nearest Neighbors Divide the data into training and test data. For each record in the test data: find the k closest training records, find the most frequently occurring class label among them, and classify the test record into that category, breaking ties at random. Example (the slide shows a green query point surrounded by points from two classes): if k = 1, the green point is classified into the class of its single nearest neighbor; if k = 3, into the majority class of its three nearest neighbors; if k = 2, the two nearest neighbors may come from different classes, in which case the tie is broken at random.

k-Nearest Neighbors Algorithm The algorithm depends on a distance metric d.

Euclidean Distance Metric Example 1: x = (percentile rank, SAT), x1 = (90, 1300), x2 = (85, 1200), d(x1, x2) = sqrt((90 - 85)^2 + (1300 - 1200)^2) ≈ 100.12. Example 2: x1 = (70, 950), x2 = (40, 880), d(x1, x2) = sqrt((70 - 40)^2 + (950 - 880)^2) ≈ 76.16. Euclidean distance is sensitive to measurement scales: the SAT differences dominate both distances. Need to standardize the variables!
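The two distances can be checked directly in R (a side calculation, not part of the original slides):

x1 = c(90, 1300); x2 = c(85, 1200)
sqrt(sum((x1 - x2)^2))    # 100.12
x3 = c(70, 950); x4 = c(40, 880)
sqrt(sum((x3 - x4)^2))    # 76.16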

Standardizing Variables Compute the mean and standard deviation of percentile rank and of SAT from the data (the numerical values appear on the slide) and standardize each variable: z = (x - mean) / (standard deviation). Example 1: x = (percentile rank, SAT), x1 = (90, 1300), x2 = (85, 1200), z1 = (1.23, 2.43), z2 = (0.97, 1.68), d(z1, z2) = 0.80. Example 2: x1 = (70, 950), x2 = (40, 880), z1 = (0.16, -0.21), z2 = (-1.45, -0.74), d(z1, z2) = 1.70.

Standardizing iris Data

x = iris[, 1:4]                            # the four quantitative predictors
xbar = apply(x, 2, mean)                   # column means
xbarMatrix = cbind(rep(1, 150)) %*% xbar   # 150 x 4 matrix of column means
s = apply(x, 2, sd)                        # column standard deviations
sMatrix = cbind(rep(1, 150)) %*% s         # 150 x 4 matrix of standard deviations
z = (x - xbarMatrix) / sMatrix             # standardized data
apply(z, 2, mean)                          # check: means are (approximately) 0
apply(z, 2, sd)                            # check: standard deviations are 1
plot(z[, 3:4], col = iris$Species)
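As a side note (not on the original slide), R's built-in scale() function performs the same standardization in one step:

z2 = scale(iris[, 1:4])    # agrees with z above up to floating-point error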

Another Way to Split Data

#Split iris into 70% training and 30% test data.
set.seed(5364)                          # note: set.seed(5364), not set.seed=5364, which would only create a variable
train = sample(nrow(z), nrow(z) * .7)   # row indices of the training data
z[train, ]    #This is the training data
z[-train, ]   #This is the test data

The class Package and knn Function

library(class)
Species = iris$Species
predSpecies = knn(train = z[train, ], test = z[-train, ], cl = Species[train], k = 3)
confmatrix(Species[-train], predSpecies)

Accuracy = 93.33%
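confmatrix() is a helper function defined earlier in the course notes and not shown in this transcript; a minimal sketch with the same interface (a list containing the confusion matrix and the accuracy) might look like this:

confmatrix = function(actual, predicted){
  mat = table(Actual = actual, Predicted = predicted)   # confusion matrix
  list(matrix = mat, accuracy = sum(diag(mat)) / sum(mat))
}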

Leave-one-out CV with knn

predSpecies = knn.cv(train = z, cl = Species, k = 3)
confmatrix(Species, predSpecies)

CV estimate for accuracy is 94.67%

Optimizing k with knn.cv

accvect = 1:10
for(k in 1:10){
  predSpecies = knn.cv(train = z, cl = Species, k = k)
  accvect[k] = confmatrix(Species, predSpecies)$accuracy
}
which.max(accvect)

For binary classification problems, odd values of k avoid ties.

General Comments about k Smaller values of k result in greater model complexity. If k is too small, the model is sensitive to noise. If k is too large, many records will simply be classified into the most frequent class.

Today's Topics Weighted k-Nearest Neighbors Algorithm Kernels The kknn package Minkowski Distance Metric

Indicator Functions

max and argmax

k-Nearest Neighbors Algorithm

Kernel Functions

Weighted k-Nearest Neighbors
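The weighted k-nearest neighbors rule on this slide appears as an image in the source transcript. In the notation of the preceding slides (indicator function I, kernel K, distance metric d), a standard form of the rule, which may differ in detail from the slide's version, is

\hat{y}(x) = \operatorname*{argmax}_{c} \sum_{i \in N_k(x)} K\big(d(x, x_i)\big)\, I(y_i = c)

where N_k(x) is the set of the k nearest training records and the kernel downweights more distant neighbors (in Hechenbichler and Schliep's scheme the distances are first rescaled by the distance to the (k+1)st neighbor).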

kknn Package train.kknn uses leave-one-out cross-validation to optimize k and the kernel kknn gives predictions for a specific choice of k and kernel (see R script) R Documentation Hechenbichler, K. and Schliep, K.P. (2004) "Weighted k-Nearest-Neighbor Techniques and Ordinal Classification".
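A hedged usage sketch (the course's own R script is referenced on the slide but not reproduced here), assuming the train indices from the earlier 70/30 split:

library(kknn)
fit = train.kknn(Species ~ ., data = iris, kmax = 15,
                 kernel = c("rectangular", "triangular", "gaussian", "optimal"))
fit$best.parameters    # k and kernel chosen by leave-one-out CV
pred = kknn(Species ~ ., train = iris[train, ], test = iris[-train, ],
            k = 7, kernel = "triangular")
confmatrix(iris$Species[-train], fitted(pred))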

Minkowski Distance Metric Euclidean distance is Minkowski distance with q = 2
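The distance formula itself is an image on the slide; in standard notation, the Minkowski distance of order q between points x and y in R^p is

d_q(x, y) = \left( \sum_{j=1}^{p} |x_j - y_j|^q \right)^{1/q}

so q = 2 gives the Euclidean distance and q = 1 gives the Manhattan distance.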

Today's Topics Naïve Bayes Classification

HouseVotes84 Data Want to calculate P(Y = Republican | X1 = no, X2 = yes, …, X16 = yes). Possible method: look at all records where X1 = no, X2 = yes, …, X16 = yes and calculate the proportion of those records with Y = Republican. Problem: there are 2^16 = 65,536 combinations of the Xj's, but only 435 records. Possible solution: use Bayes' Theorem.

Setting for Naïve Bayes p.m.f. for Y Prior distribution for Y Joint conditional distribution of the Xj's given Y Conditional distribution of each Xj given Y Assumption: the Xj's are conditionally independent given Y
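The distributional formulas on this and the following slides appear as images in the source transcript. Under the stated conditional-independence assumption, the joint conditional distribution factors as

P(X_1 = x_1, \ldots, X_p = x_p \mid Y = y) = \prod_{j=1}^{p} P(X_j = x_j \mid Y = y)

and Bayes' Theorem then gives the posterior, up to a normalizing constant,

P(Y = y \mid X_1 = x_1, \ldots, X_p = x_p) \propto P(Y = y) \prod_{j=1}^{p} P(X_j = x_j \mid Y = y)

with the prior P(Y = y) and the conditional probabilities P(X_j = x_j | Y = y) estimated by the corresponding sample proportions in the training data.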

Bayes' Theorem

Prior Probabilities Conditional Probabilities Posterior Probability How can we estimate prior probabilities?

Prior Probabilities Conditional Probabilities Posterior Probability How can we estimate conditional probabilities?

Prior Probabilities Conditional Probabilities Posterior Probability How can we calculate posterior probabilities?

Naïve Bayes Classification
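The classification-rule formulas on this slide are not in the transcript. As a hedged illustration (not necessarily the course's own script), the same classifier can be fit to the HouseVotes84 data with the e1071 and mlbench packages:

library(e1071)
data(HouseVotes84, package = "mlbench")
nbfit = naiveBayes(Class ~ ., data = HouseVotes84)   # priors and conditionals from sample proportions
predClass = predict(nbfit, HouseVotes84)             # class with the largest posterior
table(HouseVotes84$Class, predClass)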

Naïve Bayes with Quantitative Predictors Option 1: Assume each quantitative predictor is (approximately) normally distributed within each class, and estimate the class-conditional means and standard deviations from the training data.

Testing Normality qq Plots Straight line: evidence of normality Deviates from straight line: evidence against normality
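A hedged example of producing a qq plot in R for one quantitative predictor (not taken from the course script):

qqnorm(iris$Petal.Length[iris$Species == "setosa"])   # sample quantiles vs normal quantiles
qqline(iris$Petal.Length[iris$Species == "setosa"])   # reference line through the quartiles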

Naïve Bayes with Quantitative Predictors Option 2: Discretize the predictor variables using the cut function (convert each quantitative variable into a categorical variable by breaking its range into bins).
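A hedged example of the cut function (the bin boundaries here are illustrative, not the course's):

sepal.bins = cut(iris$Sepal.Length, breaks = 3)   # three equal-width bins
table(sepal.bins, iris$Species)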

Today's Topics The Class Imbalance Problem Sensitivity, Specificity, Precision, and Recall Tuning probability thresholds

Class Imbalance Problem

Confusion Matrix
                   Predicted +   Predicted -
  Actual +         f++           f+-
  Actual -         f-+           f--

Class imbalance: one class is much less frequent than the other. Rare class: presence of an anomaly (fraud, disease, loan default, flight delay, defective product). +: the anomaly is present. -: the anomaly is absent.

Confusion Matrix
                   Predicted +   Predicted -
  Actual +         f++ (TP)      f+- (FN)
  Actual -         f-+ (FP)      f-- (TN)

TP = True Positive, FP = False Positive, TN = True Negative, FN = False Negative

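The metric definitions on the confusion-matrix slides of this section appear as images in the source transcript; the standard definitions in terms of the cell counts are

\text{sensitivity} = \text{recall } r = \frac{TP}{TP + FN}, \qquad \text{specificity} = \frac{TN}{TN + FP}, \qquad \text{precision } p = \frac{TP}{TP + FP}, \qquad F_1 = \frac{2pr}{p + r}.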

F1 is the harmonic mean of the precision p and the recall r. Large values of F1 ensure reasonably large values of both p and r.

Probability Threshold

We can modify the probability threshold p0 to optimize performance metrics

Today's Topics Receiver Operating Characteristic (ROC) Curves Cost Sensitive Learning Oversampling and Undersampling

Receiver Operating Characteristic (ROC) Curves Plot of True Positive Rate vs. False Positive Rate, i.e., of Sensitivity vs. 1 - Specificity. AUC = Area Under the Curve.

AUC is a measure of model discrimination: how good is the model at discriminating between +'s and -'s?
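A hedged sketch of computing an ROC curve and its AUC in R with the pROC package (not part of the course script), reusing the naïve Bayes fit to HouseVotes84 from above:

library(pROC)
probs = predict(nbfit, HouseVotes84, type = "raw")[, "republican"]   # predicted probabilities of one class
rocCurve = roc(response = HouseVotes84$Class, predictor = probs)
plot(rocCurve)    # the ROC curve
auc(rocCurve)     # area under the curve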

Cost Sensitive Learning

Confusion Matrix
                   Predicted +   Predicted -
  Actual +         f++ (TP)      f+- (FN)
  Actual -         f-+ (FP)      f-- (TN)

Assign different misclassification costs to the cells (for a rare class, a false negative typically costs more than a false positive) and choose the classifier or threshold that minimizes the expected cost.

Example: Flight Delays

Confusion Matrix
                        Predicted Delay (+)   Predicted Ontime (-)
  Actual Delay (+)      f++ (TP)              f+- (FN)
  Actual Ontime (-)     f-+ (FP)              f-- (TN)

Undersampling and Oversampling Split training data into cases with Y = + and Y = - Take a random sample with replacement from each group Combine samples together to create new training set Undersampling: decreasing frequency of one of the groups Oversampling: increasing frequency of one of the groups
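A minimal R sketch of the procedure just described, assuming a training data frame train.data with a class-label column y whose rare class is "+" (both names are hypothetical):

pos = train.data[train.data$y == "+", ]                                 # rare-class cases
neg = train.data[train.data$y == "-", ]                                 # common-class cases
pos.over = pos[sample(nrow(pos), size = nrow(neg), replace = TRUE), ]   # oversample the rare class
new.train = rbind(pos.over, neg)                                        # balanced training set
# Undersampling works the same way, but samples fewer rows from the common class instead.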

Today's Topics Support Vector Machines

Hyperplanes

Equation of a Hyperplane
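The equation on this slide is an image in the source transcript; in standard notation, a hyperplane in R^n is the set

H = \{ x \in \mathbb{R}^n : w^T x + b = 0 \}, \quad w \neq 0,

a translate of the (n - 1)-dimensional null space of the linear map x \mapsto w^T x, which is where the rank-nullity theorem on the next slide comes in.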

Rank-nullity Theorem

Support Vector Machines Goal: Separate different classes with a hyperplane

Support Vector Machines Goal: Separate different classes with a hyperplane Here, it's possible This is a linearly separable problem

Support Vector Machines Another hyperplane that works

Support Vector Machines Many possible hyperplanes

Support Vector Machines Which one is better?

Support Vector Machines Want the hyperplane with the maximal margin

Support Vector Machines Want the hyperplane with the maximal margin How can we find this hyperplane?

Support Vector Machines

Karush-Kuhn-Tucker Theorem Want to maximize the margin subject to the constraint that every training point lies on the correct side of the margin (see the formulation below).
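The objective and constraints themselves are images on the slide; the standard hard-margin formulation, equivalent to maximizing the margin 2/||w||, is

\min_{w,\, b} \; \tfrac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad y_i (w^T x_i + b) \ge 1, \quad i = 1, \ldots, n,

and the Karush-Kuhn-Tucker conditions characterize its solution through the Lagrangian dual.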

Karush-Kuhn-Tucker Theorem Kuhn, H.W. and Tucker, A.W. (1951). "Nonlinear Programming". Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, pp. 481–492. Derivation of SVMs: Cortes, C. and Vapnik, V. (1995). "Support-Vector Networks". Machine Learning, 20, pp. 273–297.

Key Results

Today's Topics Soft Margin Support Vector Machines Nonlinear Support Vector Machines Kernel Methods

Soft Margin SVM Allows points to be on the wrong side of the hyperplane. Uses slack variables.

Soft Margin SVM Want to minimize an objective that trades off the margin against the total amount of slack (see below).
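The objective on this slide is an image in the source; the standard soft-margin formulation, with slack variables ξ_i and cost parameter C, is

\min_{w,\, b,\, \xi} \; \tfrac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i (w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0.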

Soft Margin SVM

Relationship Between Soft and Hard Margins

Nonlinear SVM

Can be computationally expensive

Kernel Trick

Kernels
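The kernel formulas on these slides are images in the source transcript. As a hedged illustration (not the course's own script), the e1071 package's svm() function fits soft-margin SVMs with several standard kernels, with C controlled by the cost argument:

library(e1071)
fit.linear = svm(Species ~ ., data = iris, kernel = "linear", cost = 1)
fit.radial = svm(Species ~ ., data = iris, kernel = "radial", cost = 1)   # Gaussian (RBF) kernel
predSpecies = predict(fit.radial, iris)
table(iris$Species, predSpecies)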

Today's Topics Neural Networks

The Logistic Function
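The formula on this slide is an image in the source; the logistic (sigmoid) function is

\sigma(t) = \frac{1}{1 + e^{-t}},

which maps any real number to a value in (0, 1).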

Neural Networks

Probabilities This flower would be classified as setosa

Gradient Descent
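The update rule on this slide is an image in the source; for a weight vector w, error function E, and learning rate η, the gradient descent step is

w^{(t+1)} = w^{(t)} - \eta \, \nabla E\big(w^{(t)}\big).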

Gradient Descent for Multiple Regression Models

Neural Network (Perceptron)

Gradient for Neural Network

A neural network with one hidden layer of 30 neurons achieves a classification accuracy of 98.7%.
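A hedged sketch of fitting such a network in R with the nnet package (the data set and split behind the 98.7% figure are not shown in the transcript, so the accuracy here will differ):

library(nnet)
set.seed(5364)
fit = nnet(Species ~ ., data = iris, size = 30, maxit = 500)   # one hidden layer with 30 neurons
predSpecies = predict(fit, iris, type = "class")
mean(predSpecies == iris$Species)                              # training accuracy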

Two-layer Neural Networks Two-layer neural network (one hidden layer). A two-layer neural network with sigmoid (logistic) activation functions can approximate any decision boundary arbitrarily well, given enough hidden neurons.

Multi-layer Perceptron 91% Accuracy

Gradient Descent for Multi-layer Perceptron Error Back Propagation Algorithm At each iteration Feed inputs forward through the neural network using current weights. Use a recursion formula (back propagation) to obtain the gradient with respect to all weights in the neural network. Update the weights using gradient descent.

Today's Topics Ensemble Methods Bagging Random Forests Boosting

Ensemble Methods

Bagging (Bootstrap Aggregating)

Random Forests Uses bagging. Uses decision trees. The features considered at each split of a decision tree are a random subset of all features.
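A hedged example with the randomForest package (not necessarily the course's script):

library(randomForest)
set.seed(5364)
rf = randomForest(Species ~ ., data = iris, ntree = 500)   # 500 bagged trees with randomized splits
rf$confusion       # out-of-bag confusion matrix
importance(rf)     # variable importance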

Boosting Idea: Create classifiers sequentially Later classifiers focus on mistakes of previous classifiers


Today's Topics The Multiclass Problem One-against-one approach One-against-rest approach

The Multiclass Problem Binary dependent variable y: Only two possible values Multiclass dependent variable y: More than two possible values How can we deal with multiclass variables?

Classification Algorithms Decision trees, k-nearest neighbors, naïve Bayes, and neural networks deal with multiclass output by default. Support vector machines only deal with binary classification problems. How can we extend SVM to multiclass problems? How can we extend other algorithms to multiclass problems?


One-against-one Approach Train a binary classifier for every pair of classes and classify a new record by majority vote among the pairwise classifiers.

One-against-rest Approach Train one binary classifier per class (that class versus all other classes combined) and classify a new record into the class whose classifier gives the strongest positive vote or score.