STT592-002: Intro. to Statistical Learning
Classification Methods, Chapter 04 (part 02)
Disclaimer: This PPT is modified based on IOM 530: Intro. to Statistical Learning
Linear Discriminant Analysis (LDA) & Quadratic Discriminant Analysis (QDA)
Overview of classification methods:
Chap 4: Logistic regression; LDA/QDA; KNN
Chap 7: Generalized Additive Models
Chap 8: Trees, Random Forest, Boosting
Chap 9: Support Vector Machines (SVM)
Outline
Overview of LDA
Why not Logistic Regression?
Estimating Bayes' Classifier
LDA Example with One Predictor (p = 1)
LDA Example with More than One Predictor (p > 1)
LDA on Default Data
Overview of QDA
Comparison between LDA and QDA
Overview of KNN Classification
Linear Discriminant Analysis
LDA undertakes the same task as logistic regression: classify observations based on a categorical response variable.
E.g.: making a profit or not; buying a product or not; satisfied customer or not; political party voting intention.
Why Linear? Why Discriminant?
LDA determines a linear equation (just like linear regression) that predicts which group a case belongs to:
D = v_1 X_1 + v_2 X_2 + ... + v_p X_p + a
D: discriminant function
v: discriminant coefficients (weights) for the variables
X: predictors (p-dimensional)
a: constant
Purpose of LDA
Choose the v's to maximize the distance between the means of the different categories.
Good predictors tend to have large weights |v|.
We want to discriminate between the different categories.
Think of a food recipe: changing the proportions (weights) of the ingredients changes the characteristics of the finished cakes, hopefully producing different types of cake!
Why LDA, when we have Logistic Regression?
When the classes are well separated, the parameter estimates for logistic regression are surprisingly unstable. LDA does not suffer from this problem.
If n is small and the distribution of X is approximately normal in each class, LDA is again more stable than logistic regression.
LDA is popular when we have more than two response classes (polytomous responses).
Assumptions of LDA
The predictor variables within each class of Y are normally distributed, with a common variance/covariance across classes.
Notation
p_k(X) = Pr(Y = k | X): the posterior probability that an observation X = x belongs to the kth class.
Objective: classify an observation to the class for which p_k(X) is largest.
Bayes' Theorem:
p_k(X) = Pr(Y = k | X = x) = π_k f_k(x) / Σ_{l=1}^{K} π_l f_l(x)
Estimating Bayes' Classifier
With logistic regression we modeled the probability of Y being from the kth class directly as a function of X, e.g. Pr(Y = 1 | X) = e^(β_0 + β_1 X) / (1 + e^(β_0 + β_1 X)).
However, Bayes' Theorem states p_k(X) = π_k f_k(x) / Σ_{l=1}^{K} π_l f_l(x), where
π_k: probability of coming from class k (prior probability)
f_k(x): density function for X given that X is an observation from class k
LDA for p = 1 (single predictor)
The Bayes classifier assigns an observation X = x to the class for which the posterior p_k(x) (4.12) is largest, or equivalently for which the discriminant function (4.13)
δ_k(x) = x μ_k / σ² − μ_k² / (2σ²) + log(π_k)
is largest.
The linear log-odds function implies that the decision boundary between classes k and ℓ, i.e. the set where Pr(G = k | X = x) = Pr(G = ℓ | X = x), is linear in x; in p dimensions it is a hyperplane. This is true for any pair of classes, so all decision boundaries are linear.
Applying LDA
LDA starts by assuming that each class has a normal distribution with a common variance.
The class means and the common variance are estimated from the training data.
Finally, Bayes' theorem is used to compute p_k(x), and the observation is assigned to the class with the maximum probability among all K classes.
Estimating π_k and f_k(x)
We can estimate π_k and f_k(x) to compute p_k(x).
The most common model for f_k(x) is the normal density:
f_k(x) = (1 / (√(2π) σ)) exp(−(x − μ_k)² / (2σ²))
Using this density, we only need to estimate three quantities, μ_k, σ², and π_k, to compute the discriminant function:
δ_k(x) = x μ_k / σ² − μ_k² / (2σ²) + log(π_k)
Use the Training Data Set for Estimation
The mean μ̂_k is estimated by the average of all training observations from the kth class:
μ̂_k = (1 / n_k) Σ_{i: y_i = k} x_i
The variance σ̂² is estimated as a weighted average of the sample variances of the K classes (pooled variance):
σ̂² = (1 / (n − K)) Σ_{k=1}^{K} Σ_{i: y_i = k} (x_i − μ̂_k)²
And π̂_k is estimated as the proportion of training observations that belong to the kth class: π̂_k = n_k / n.
Plugging these into δ_k(x) gives the estimated discriminant function:
δ̂_k(x) = x μ̂_k / σ̂² − μ̂_k² / (2σ̂²) + log(π̂_k)
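For illustration, a minimal R sketch of these estimates for a single predictor; the vectors x and y below are hypothetical, and the function simply mirrors the formulas above rather than calling MASS::lda():
# Minimal sketch: LDA estimates for p = 1, given a numeric vector x and class labels y
lda_p1 <- function(x, y) {
  y <- factor(y); classes <- levels(y)
  n <- length(x); K <- length(classes)
  pi_hat <- as.numeric(table(y)) / n           # prior proportions, in levels(y) order
  mu_hat <- tapply(x, y, mean)                 # class means
  # pooled variance: within-class sums of squares divided by (n - K)
  sigma2_hat <- sum(tapply(x, y, function(v) sum((v - mean(v))^2))) / (n - K)
  # discriminant scores delta_k(x0) for a new point x0
  delta <- function(x0) x0 * mu_hat / sigma2_hat - mu_hat^2 / (2 * sigma2_hat) + log(pi_hat)
  list(pi = pi_hat, mu = mu_hat, sigma2 = sigma2_hat, delta = delta)
}
# Example usage with simulated data
set.seed(1)
x <- c(rnorm(20, -1.25), rnorm(20, 1.25))
y <- rep(c("class1", "class2"), each = 20)
fit <- lda_p1(x, y)
fit$delta(0.4)   # the class with the largest score is the LDA prediction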
A Simple Example with One Predictor (p = 1)
Suppose we have only one predictor (p = 1).
Two normal density functions, f_1(x) and f_2(x), represent two distinct classes.
The two density functions overlap, so there is some uncertainty about the class to which an observation of unknown class belongs.
The dashed vertical line represents the Bayes' decision boundary.
LDA for p = 1 and K = 2
The Bayes classifier assigns an observation X = x to the class for which (4.12) is largest, or equivalently for which the discriminant function (4.13) is largest.
For K = 2 with equal priors (π_1 = π_2) and common variance, this means assigning to class 1 if x < (μ_1 + μ_2)/2 and to class 2 otherwise; in this example the boundary is x = 0, so the Bayes classifier assigns observations to class 1 if x < 0 and to class 2 otherwise.
LDA for p = 1 and K = 2
20 observations were drawn from each of the two classes.
The dashed vertical line is the Bayes' decision boundary; the solid vertical line is the LDA decision boundary.
Bayes' error rate: 10.6%. LDA error rate: 11.1%.
Thus, LDA is performing pretty well!
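A small R sketch of a simulation in this spirit, assuming the two classes are N(−1.25, 1) and N(1.25, 1) with 20 observations each, as in the textbook figure:
# Simulate two classes, fit LDA with MASS::lda(), and compare the estimated
# boundary with the Bayes boundary at x = 0 (equal priors)
library(MASS)
set.seed(1)
x <- c(rnorm(20, mean = -1.25), rnorm(20, mean = 1.25))
y <- factor(rep(c(1, 2), each = 20))
fit <- lda(y ~ x)
# LDA boundary for two classes with equal priors: midpoint of the estimated means
lda_boundary <- mean(tapply(x, y, mean))
lda_boundary                      # compare with the Bayes boundary, 0
# Test error on a fresh sample
x_new <- c(rnorm(5000, -1.25), rnorm(5000, 1.25))
y_new <- factor(rep(c(1, 2), each = 5000))
pred <- predict(fit, data.frame(x = x_new))$class
mean(pred != y_new)               # roughly the 11% error rate quoted on the slide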
An Example When p > 1
If X is multidimensional (p > 1), we use exactly the same approach, except the density function f_k(x) is modeled using the multivariate normal density with class-specific mean vector μ_k and common covariance matrix Σ.
LDA for p > 1 (multiple predictors, similar to p = 1)
The Bayes classifier assigns an observation X = x to the class for which the discriminant function is largest:
δ_k(x) = xᵀ Σ⁻¹ μ_k − (1/2) μ_kᵀ Σ⁻¹ μ_k + log(π_k)
LDA for p > 1 (multiple predictors)
Two predictors (p = 2); three classes (K = 3).
20 observations were generated from each class.
The three ellipses represent regions that contain 95% of the probability for each of the three classes.
Dashed lines: Bayes' boundaries; solid lines: LDA boundaries.
The Default Dataset
library(MASS); library(ISLR)
data(Default)
head(Default)
attach(Default)
lda.fit = lda(default ~ student + balance, data = Default)
lda.fit
plot(lda.fit)
lda.pred = predict(lda.fit, Default)   # predict on the (training) data
names(lda.pred)
lda.class = lda.pred$class
table(lda.class, default)              # confusion matrix
mean(lda.class == default)             # overall accuracy
mean(default == "Yes")                 # proportion of defaulters: 0.0333
Note: Training error rates will usually be lower than test error rates, which are the real quantity of interest. In other words, we might expect this classifier to perform worse if we use it to predict whether or not a new set of individuals will default.
Running LDA on Default Data
LDA makes 252 + 23 = 275 mistakes on 10,000 predictions (2.75% misclassification error rate).
But LDA mis-predicts 252/333 = 75.7% of the defaulters!
Perhaps we shouldn't use 0.5 as the threshold for predicting default?
LDA for Default
Overall accuracy = 97.25%. The total number of mistakes is 252 + 23 = 275 (2.75% misclassification error rate).
But we mis-predicted 252/333 = 75.7% of the defaulters.
Examine the error rates through sensitivity and specificity:
Sensitivity = Pr(Predicted = "Yes" | defaulters) = true positive rate = 81/333 = 0.2432
Specificity = Pr(Predicted = "No" | non-defaulters) = 1 − false positive rate = 9644/9667 = 0.9976
E.g., sensitivity = % of true defaulters that are identified = 24.3% (low); specificity = % of non-defaulters that are correctly identified = 99.8%.
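Continuing the R code above, a short sketch that reads sensitivity and specificity off the confusion matrix:
# Sensitivity and specificity from the LDA confusion matrix (default threshold 0.5)
cm <- table(Predicted = lda.class, Truth = default)
sensitivity <- cm["Yes", "Yes"] / sum(cm[, "Yes"])   # fraction of defaulters caught
specificity <- cm["No", "No"]   / sum(cm[, "No"])    # fraction of non-defaulters kept
c(sensitivity = sensitivity, specificity = specificity)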
Use 0.2 as the Threshold for Default
Now the total number of mistakes is 138 + 235 = 373 (3.73% misclassification error rate).
But we mis-predicted only 138/333 = 41.4% of the defaulters.
Again examine sensitivity and specificity:
Sensitivity = % of true defaulters that are identified = 58.6% (higher).
Specificity = % of non-defaulters that are correctly identified = 97.6%.
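A sketch of how the 0.2 threshold can be applied to the posterior probabilities from the earlier LDA fit (lda.pred), rather than the default 0.5 cutoff:
# Re-classify using a 0.2 threshold on the posterior probability of default
post.yes <- lda.pred$posterior[, "Yes"]      # P(default = Yes | x) for each case
lda.class.02 <- ifelse(post.yes > 0.2, "Yes", "No")
table(lda.class.02, default)                 # new confusion matrix
mean(lda.class.02 != default)                # overall error rate (about 3.7%)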
Default Threshold Values vs. Error Rates
Black solid: overall error rate.
Blue dashed: fraction of defaulters missed.
Orange dotted: fraction of non-defaulters incorrectly classified.
Receiver Operating Characteristic (ROC) Curve
The overall performance of a classifier, summarized over all possible thresholds, is given by the area under the ROC curve (AUC).
An ideal ROC curve hugs the top-left corner, so the larger the AUC, the better the classifier.
For these data the AUC is 0.95, which is close to the maximum of one and so would be considered very good.
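One way to compute the ROC curve and AUC in R is the pROC add-on package; a minimal sketch, assuming pROC is installed (it is not part of the slides' code):
# ROC curve and AUC for the LDA posterior probability of default
library(pROC)
roc.obj <- roc(response = default, predictor = lda.pred$posterior[, "Yes"])
plot(roc.obj)      # ROC curve; an ideal curve hugs the top-left corner
auc(roc.obj)       # area under the curve, about 0.95 for these data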
Receiver Operating Characteristic (ROC) Curve
False positive = (Truth = "No") & (Predicted = "Yes") = incorrectly predicted to be positive.
E.g., in the Default data, "+" indicates an individual who defaults and "−" indicates one who does not.
Connecting to the classical hypothesis-testing literature, we think of "−" as the null hypothesis and "+" as the alternative (non-null) hypothesis.
Receiver Operating Characteristic (ROC) Curve
Reference: https://en.wikipedia.org/wiki/Sensitivity_and_specificity
Quadratic Discriminant Analysis (QDA)
LDA assumes that every class has the same variance/covariance. However, LDA may perform poorly if this assumption is far from true.
QDA works just like LDA except that it estimates a separate variance/covariance matrix for each class.
Which is better: LDA or QDA?
Since QDA allows for different variances among classes, the resulting decision boundaries are quadratic.
QDA will work best when the variances are very different between classes and we have enough observations to accurately estimate them.
LDA will work best when the variances are similar among classes or we don't have enough data to accurately estimate them.
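In R, QDA is fit just like LDA, via MASS::qda(); a brief sketch (not part of the slides' code), assuming the earlier Default objects are still loaded:
# QDA on the Default data: same call pattern as lda(), but class-specific covariances
qda.fit <- qda(default ~ student + balance, data = Default)
qda.pred <- predict(qda.fit, Default)
table(qda.pred$class, default)          # confusion matrix
mean(qda.pred$class == default)         # overall accuracy, for comparison with LDA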
Comparing LDA to QDA
Black dotted: LDA boundary. Purple dashed: Bayes' boundary. Green solid: QDA boundary.
Left panel: the variances of the classes are equal (LDA is the better fit).
Right panel: the variances of the classes are not equal (QDA is the better fit).
K-Nearest Neighbors (KNN) Classifier (Sec. 2.2)
Given a positive integer K and a test observation x_0, the KNN classifier first identifies the K points in the training data that are closest to x_0, represented by N_0.
It then estimates the conditional probability for class j as the fraction of points in N_0 whose response values equal j:
Pr(Y = j | X = x_0) = (1/K) Σ_{i ∈ N_0} I(y_i = j)
Finally, KNN applies the Bayes rule and classifies the test observation x_0 to the class with the largest estimated probability.
K-Nearest Neighbors (KNN) Classifier (Sec. 2.2)
A small training data set: 6 blue and 6 orange observations. Goal: make a prediction for the black cross.
Consider K = 3. KNN identifies the 3 observations that are closest to the cross. This neighborhood is shown as a circle.
It consists of 2 blue points and 1 orange point, giving estimated probabilities of 2/3 for the blue class and 1/3 for the orange class.
KNN therefore predicts that the black cross belongs to the blue class.
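For illustration, a minimal R sketch of KNN with the class package; the two-predictor training data below are made up and are not the slide's toy example:
# KNN with K = 3 on a made-up two-predictor training set
library(class)
set.seed(1)
train.X <- rbind(matrix(rnorm(12, mean = 0), ncol = 2),   # 6 "blue" points
                 matrix(rnorm(12, mean = 2), ncol = 2))    # 6 "orange" points
train.y <- factor(rep(c("blue", "orange"), each = 6))
test.X  <- matrix(c(1, 1), ncol = 2)                        # the test point
knn(train = train.X, test = test.X, cl = train.y, k = 3, prob = TRUE)
# The returned factor is the predicted class; attr(., "prob") gives the
# proportion of the 3 neighbors voting for that class.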
K-Nearest Neighbors (KNN) Classifier (Sec. 2.2)
Note: The choice of K has a drastic effect on the KNN classifier obtained. A small K gives a very flexible (low-bias, high-variance) boundary; a large K gives a smoother, less variable boundary.
Comparison of Classification Methods
KNN (Chapter 2.2)
Logistic Regression (Chapter 4)
LDA (Chapter 4)
QDA (Chapter 4)
Review: Why LDA, when we have Logistic Regression?
When the classes are well separated, the parameter estimates for logistic regression are surprisingly unstable. LDA does not suffer from this problem.
If n is small and the distribution of X is approximately normal in each class, LDA is again more stable than logistic regression.
LDA is popular when we have more than two response classes (polytomous responses).
Logistic Regression vs. LDA
Similarity: both logistic regression and LDA produce linear decision boundaries.
Difference: LDA assumes that the observations are drawn from a normal distribution with a common variance in each class, while logistic regression makes no such assumption.
LDA tends to do better than logistic regression when the normality assumption holds; otherwise logistic regression can outperform LDA.
KNN vs. LDA and Logistic Regression
KNN takes a completely different approach: it is completely non-parametric, making no assumptions about the shape of the decision boundary.
Advantage of KNN: we can expect KNN to dominate both LDA and logistic regression when the decision boundary is highly non-linear.
Disadvantage of KNN: KNN does not tell us which predictors are important (there is no table of coefficients).
QDA vs. LDA, Logistic Regression, and KNN
QDA is a compromise between the non-parametric KNN method and the linear LDA and logistic regression approaches.
Summary, if the true decision boundary is:
Linear: LDA and logistic regression outperform.
Moderately non-linear: QDA outperforms.
Complicated non-linear: KNN is superior.
QDA, LDA, Logistic Regression, and KNN
Scenario 1: uncorrelated random normal variables with a different mean in each class.
Scenario 2: same as (1), but with correlation −0.5 between the predictors.
Scenario 3: predictors drawn from a t-distribution.
QDA, LDA, Logistic Regression, and KNN
Scenario 4: both X ~ Normal, with correlation 0.5 in the first class and −0.5 in the second (non-constant covariances).
Scenario 5: both X ~ Normal, with y a function of X1², X2², and X1·X2 (quadratic decision boundary).
Scenario 6: both X ~ Normal, with y a complicated non-linear function of the predictors (non-linear decision boundary).
Iris Data
library(MASS)
attach(iris)
View(iris)                               # browse the data
names(iris)                              # variable names
table(Species)                           # class counts (50 per species)
lda.model = lda(Species ~ ., data = iris)
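A brief follow-up sketch (not on the slide) showing how the fitted model can be inspected and used for prediction:
# Inspect the fit and compute training predictions and a confusion matrix
lda.model                                # group means, priors, discriminant coefficients
iris.pred <- predict(lda.model, iris)
table(iris.pred$class, iris$Species)     # confusion matrix on the training data
mean(iris.pred$class == iris$Species)    # training accuracy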