STT592-002: Intro. to Statistical Learning
Classification Methods, Chapter 04 (part 02)
Disclaimer: This PPT is modified based on IOM 530: Intro. to Statistical Learning
Linear Discriminant Analysis (LDA) & Quadratic Discriminant Analysis (QDA)
Overview of classification methods:
Chap 4: Logistic regression; LDA/QDA; KNN
Chap 7: Generalized Additive Models
Chap 8: Trees, Random Forest, Boosting
Chap 9: Support Vector Machines (SVM)
Outline
Overview of LDA
Why not Logistic Regression?
Estimating Bayes' Classifier
LDA Example with One Predictor (p = 1)
LDA Example with More than One Predictor (p > 1)
LDA on Default Data
Overview of QDA
Comparison between LDA and QDA
Overview of KNN Classification
Linear Discriminant Analysis
LDA undertakes the same task as logistic regression: classify observations based on a categorical response variable.
E.g.: making a profit or not; buying a product or not; satisfied customer or not; political party voting intention.
Why Linear? Why Discriminant?
LDA determines a linear equation (just like linear regression) that predicts which group a case belongs to:
D = v_1 X_1 + v_2 X_2 + ... + v_p X_p + a
D: discriminant function
v: discriminant coefficients (weights) for the variables
X: predictors (p-dimensional)
a: constant
Purpose of LDA
Choose the v's to maximize the distance between the means of the different categories.
Good predictors tend to have large weights |v|.
We want to discriminate between the different categories.
Think of a food recipe: changing the proportions (weights) of the ingredients changes the characteristics of the finished cakes, hopefully producing different types of cake!
Why LDA, when we have Logistic Regression?
When the classes are well separated, the parameter estimates for logistic regression are surprisingly unstable. LDA does not suffer from this problem.
If n is small and the distribution of X is approximately normal in each class, LDA is again more stable than logistic regression.
LDA is popular when we have more than two response classes (polytomous responses).
Assumptions of LDA
The predictor variables within each class of Y are normally distributed, with a common variance/covariance across classes.
Notation
p_k(X) = Pr(Y = k | X): the posterior probability that an observation X = x belongs to the kth class.
Objective: classify an observation to the class for which p_k(X) is largest.
Bayes' Theorem:
p_k(X) = Pr(Y = k | X = x) = π_k f_k(x) / Σ_{l=1}^{K} π_l f_l(x)
Estimating Bayes' Classifier
With logistic regression we modeled the probability of Y being from the kth class directly as a function of X, e.g. Pr(Y = 1 | X) = e^(β_0 + β_1 X) / (1 + e^(β_0 + β_1 X)).
However, Bayes' Theorem states p_k(X) = π_k f_k(x) / Σ_{l=1}^{K} π_l f_l(x), where
π_k: probability of coming from class k (prior probability)
f_k(x): density function for X given that X is an observation from class k
LDA for p = 1 (single predictor)
The Bayes classifier assigns an observation X = x to the class for which the posterior p_k(x) (4.12) is largest, or equivalently for which the discriminant function (4.13)
δ_k(x) = x μ_k / σ² − μ_k² / (2σ²) + log(π_k)
is largest.
The linear log-odds function implies that the decision boundary between classes k and ℓ, i.e. the set where Pr(G = k | X = x) = Pr(G = ℓ | X = x), is linear in x; in p dimensions it is a hyperplane. This is true for any pair of classes, so all decision boundaries are linear.
Applying LDA
LDA starts by assuming that each class has a normal distribution with a common variance.
The class means and the common variance are estimated from the training data.
Finally, Bayes' theorem is used to compute p_k(x), and the observation is assigned to the class with the maximum probability among all K classes.
Estimating π_k and f_k(x)
We can estimate π_k and f_k(x) to compute p_k(x).
The most common model for f_k(x) is the normal density:
f_k(x) = (1 / (√(2π) σ)) exp(−(x − μ_k)² / (2σ²))
Using this density, we only need to estimate three quantities, μ_k, σ², and π_k, to compute the discriminant function:
δ_k(x) = x μ_k / σ² − μ_k² / (2σ²) + log(π_k)
Use the Training Data Set for Estimation
The mean μ̂_k is estimated by the average of all training observations from the kth class:
μ̂_k = (1 / n_k) Σ_{i: y_i = k} x_i
The variance σ̂² is estimated as a weighted average of the sample variances of the K classes (pooled variance):
σ̂² = (1 / (n − K)) Σ_{k=1}^{K} Σ_{i: y_i = k} (x_i − μ̂_k)²
And π̂_k is estimated as the proportion of training observations that belong to the kth class: π̂_k = n_k / n.
Plugging these into δ_k(x) gives the estimated discriminant function:
δ̂_k(x) = x μ̂_k / σ̂² − μ̂_k² / (2σ̂²) + log(π̂_k)
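For illustration, a minimal R sketch of these estimates for a single predictor; the vectors x and y below are hypothetical, and the function simply mirrors the formulas above rather than calling MASS::lda():
# Minimal sketch: LDA estimates for p = 1, given a numeric vector x and class labels y
lda_p1 <- function(x, y) {
  y <- factor(y); classes <- levels(y)
  n <- length(x); K <- length(classes)
  pi_hat <- as.numeric(table(y)) / n           # prior proportions, in levels(y) order
  mu_hat <- tapply(x, y, mean)                 # class means
  # pooled variance: within-class sums of squares divided by (n - K)
  sigma2_hat <- sum(tapply(x, y, function(v) sum((v - mean(v))^2))) / (n - K)
  # discriminant scores delta_k(x0) for a new point x0
  delta <- function(x0) x0 * mu_hat / sigma2_hat - mu_hat^2 / (2 * sigma2_hat) + log(pi_hat)
  list(pi = pi_hat, mu = mu_hat, sigma2 = sigma2_hat, delta = delta)
}
# Example usage with simulated data
set.seed(1)
x <- c(rnorm(20, -1.25), rnorm(20, 1.25))
y <- rep(c("class1", "class2"), each = 20)
fit <- lda_p1(x, y)
fit$delta(0.4)   # the class with the largest score is the LDA prediction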
A Simple Example with One Predictor (p = 1)
Suppose we have only one predictor (p = 1).
Two normal density functions, f_1(x) and f_2(x), represent two distinct classes.
The two density functions overlap, so there is some uncertainty about the class to which an observation of unknown class belongs.
The dashed vertical line represents the Bayes' decision boundary.
LDA for p = 1 and K = 2
The Bayes classifier assigns an observation X = x to the class for which (4.12) is largest, or equivalently for which the discriminant function (4.13) is largest.
For K = 2 with equal priors (π_1 = π_2) and common variance, this means assigning to class 1 if x < (μ_1 + μ_2)/2 and to class 2 otherwise; in this example the boundary is x = 0, so the Bayes classifier assigns observations to class 1 if x < 0 and to class 2 otherwise.
LDA for p = 1 and K = 2
20 observations were drawn from each of the two classes.
The dashed vertical line is the Bayes' decision boundary; the solid vertical line is the LDA decision boundary.
Bayes' error rate: 10.6%. LDA error rate: 11.1%.
Thus, LDA is performing pretty well!
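A small R sketch of a simulation in this spirit, assuming the two classes are N(−1.25, 1) and N(1.25, 1) with 20 observations each, as in the textbook figure:
# Simulate two classes, fit LDA with MASS::lda(), and compare the estimated
# boundary with the Bayes boundary at x = 0 (equal priors)
library(MASS)
set.seed(1)
x <- c(rnorm(20, mean = -1.25), rnorm(20, mean = 1.25))
y <- factor(rep(c(1, 2), each = 20))
fit <- lda(y ~ x)
# LDA boundary for two classes with equal priors: midpoint of the estimated means
lda_boundary <- mean(tapply(x, y, mean))
lda_boundary                      # compare with the Bayes boundary, 0
# Test error on a fresh sample
x_new <- c(rnorm(5000, -1.25), rnorm(5000, 1.25))
y_new <- factor(rep(c(1, 2), each = 5000))
pred <- predict(fit, data.frame(x = x_new))$class
mean(pred != y_new)               # roughly the 11% error rate quoted on the slide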
An Example When p > 1
If X is multidimensional (p > 1), we use exactly the same approach, except the density function f_k(x) is modeled using the multivariate normal density with class-specific mean vector μ_k and common covariance matrix Σ.
LDA for p > 1 (multiple predictors, similar to p = 1)
The Bayes classifier assigns an observation X = x to the class for which the discriminant function is largest:
δ_k(x) = xᵀ Σ⁻¹ μ_k − (1/2) μ_kᵀ Σ⁻¹ μ_k + log(π_k)
LDA for p > 1 (multiple predictors)
Two predictors (p = 2); three classes (K = 3).
20 observations were generated from each class.
The three ellipses represent regions that contain 95% of the probability for each of the three classes.
Dashed lines: Bayes' boundaries; solid lines: LDA boundaries.
The Default Dataset
library(MASS); library(ISLR)
data(Default)
head(Default)
attach(Default)
lda.fit = lda(default ~ student + balance, data = Default)
lda.fit
plot(lda.fit)
lda.pred = predict(lda.fit, Default)   # predict on the (training) data
names(lda.pred)
lda.class = lda.pred$class
table(lda.class, default)              # confusion matrix
mean(lda.class == default)             # overall accuracy
mean(default == "Yes")                 # proportion of defaulters: 0.0333
Note: Training error rates will usually be lower than test error rates, which are the real quantity of interest. In other words, we might expect this classifier to perform worse if we use it to predict whether or not a new set of individuals will default.
Running LDA on Default Data
LDA makes 252 + 23 = 275 mistakes on 10,000 predictions (2.75% misclassification error rate).
But LDA mis-predicts 252/333 = 75.7% of the defaulters!
Perhaps we shouldn't use 0.5 as the threshold for predicting default?
LDA for Default
Overall accuracy = 97.25%. The total number of mistakes is 252 + 23 = 275 (2.75% misclassification error rate).
But we mis-predicted 252/333 = 75.7% of the defaulters.
Examine the error rates through sensitivity and specificity:
Sensitivity = Pr(Predicted = "Yes" | defaulters) = true positive rate = 81/333 = 0.2432
Specificity = Pr(Predicted = "No" | non-defaulters) = 1 − false positive rate = 9644/9667 = 0.9976
E.g., sensitivity = % of true defaulters that are identified = 24.3% (low); specificity = % of non-defaulters that are correctly identified = 99.8%.
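Continuing the R code above, a short sketch that reads sensitivity and specificity off the confusion matrix:
# Sensitivity and specificity from the LDA confusion matrix (default threshold 0.5)
cm <- table(Predicted = lda.class, Truth = default)
sensitivity <- cm["Yes", "Yes"] / sum(cm[, "Yes"])   # fraction of defaulters caught
specificity <- cm["No", "No"]   / sum(cm[, "No"])    # fraction of non-defaulters kept
c(sensitivity = sensitivity, specificity = specificity)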
Use 0.2 as the Threshold for Default
Now the total number of mistakes is 138 + 235 = 373 (3.73% misclassification error rate).
But we mis-predicted only 138/333 = 41.4% of the defaulters.
Again examine sensitivity and specificity:
Sensitivity = % of true defaulters that are identified = 58.6% (higher).
Specificity = % of non-defaulters that are correctly identified = 97.6%.
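A sketch of how the 0.2 threshold can be applied to the posterior probabilities from the earlier LDA fit (lda.pred), rather than the default 0.5 cutoff:
# Re-classify using a 0.2 threshold on the posterior probability of default
post.yes <- lda.pred$posterior[, "Yes"]      # P(default = Yes | x) for each case
lda.class.02 <- ifelse(post.yes > 0.2, "Yes", "No")
table(lda.class.02, default)                 # new confusion matrix
mean(lda.class.02 != default)                # overall error rate (about 3.7%)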
Default Threshold Values vs. Error Rates
Black solid: overall error rate.
Blue dashed: fraction of defaulters missed.
Orange dotted: fraction of non-defaulters incorrectly classified.
Receiver Operating Characteristic (ROC) Curve
The overall performance of a classifier, summarized over all possible thresholds, is given by the area under the ROC curve (AUC).
An ideal ROC curve hugs the top-left corner, so the larger the AUC, the better the classifier.
For these data the AUC is 0.95, which is close to the maximum of one and so would be considered very good.
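One way to compute the ROC curve and AUC in R is the pROC add-on package; a minimal sketch, assuming pROC is installed (it is not part of the slides' code):
# ROC curve and AUC for the LDA posterior probability of default
library(pROC)
roc.obj <- roc(response = default, predictor = lda.pred$posterior[, "Yes"])
plot(roc.obj)      # ROC curve; an ideal curve hugs the top-left corner
auc(roc.obj)       # area under the curve, about 0.95 for these data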
Receiver Operating Characteristic (ROC) Curve
False positive = (Truth = "No") & (Predicted = "Yes") = incorrectly predicted to be positive.
E.g., in the Default data, "+" indicates an individual who defaults and "−" indicates one who does not.
Connecting to the classical hypothesis-testing literature, we think of "−" as the null hypothesis and "+" as the alternative (non-null) hypothesis.
Receiver Operating Characteristic (ROC) Curve
Reference: https://en.wikipedia.org/wiki/Sensitivity_and_specificity
Quadratic Discriminant Analysis (QDA)
LDA assumes that every class has the same variance/covariance. However, LDA may perform poorly if this assumption is far from true.
QDA works just like LDA except that it estimates a separate variance/covariance matrix for each class.
Which is better: LDA or QDA?
Since QDA allows for different variances among classes, the resulting decision boundaries are quadratic.
QDA will work best when the variances are very different between classes and we have enough observations to accurately estimate them.
LDA will work best when the variances are similar among classes or we don't have enough data to accurately estimate them.
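In R, QDA is fit just like LDA, via MASS::qda(); a brief sketch (not part of the slides' code), assuming the earlier Default objects are still loaded:
# QDA on the Default data: same call pattern as lda(), but class-specific covariances
qda.fit <- qda(default ~ student + balance, data = Default)
qda.pred <- predict(qda.fit, Default)
table(qda.pred$class, default)          # confusion matrix
mean(qda.pred$class == default)         # overall accuracy, for comparison with LDA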
Comparing LDA to QDA
Black dotted: LDA boundary. Purple dashed: Bayes' boundary. Green solid: QDA boundary.
Left panel: the variances of the classes are equal (LDA is the better fit).
Right panel: the variances of the classes are not equal (QDA is the better fit).
K-Nearest Neighbors (KNN) Classifier (Sec. 2.2)
Given a positive integer K and a test observation x_0, the KNN classifier first identifies the K points in the training data that are closest to x_0, represented by N_0.
It then estimates the conditional probability for class j as the fraction of points in N_0 whose response values equal j:
Pr(Y = j | X = x_0) = (1/K) Σ_{i ∈ N_0} I(y_i = j)
Finally, KNN applies the Bayes rule and classifies the test observation x_0 to the class with the largest estimated probability.
K-Nearest Neighbors (KNN) Classifier (Sec. 2.2)
A small training data set: 6 blue and 6 orange observations. Goal: make a prediction for the black cross.
Consider K = 3. KNN identifies the 3 observations that are closest to the cross. This neighborhood is shown as a circle.
It consists of 2 blue points and 1 orange point, giving estimated probabilities of 2/3 for the blue class and 1/3 for the orange class.
KNN therefore predicts that the black cross belongs to the blue class.
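For illustration, a minimal R sketch of KNN with the class package; the two-predictor training data below are made up and are not the slide's toy example:
# KNN with K = 3 on a made-up two-predictor training set
library(class)
set.seed(1)
train.X <- rbind(matrix(rnorm(12, mean = 0), ncol = 2),   # 6 "blue" points
                 matrix(rnorm(12, mean = 2), ncol = 2))    # 6 "orange" points
train.y <- factor(rep(c("blue", "orange"), each = 6))
test.X  <- matrix(c(1, 1), ncol = 2)                        # the test point
knn(train = train.X, test = test.X, cl = train.y, k = 3, prob = TRUE)
# The returned factor is the predicted class; attr(., "prob") gives the
# proportion of the 3 neighbors voting for that class.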
K-Nearest Neighbors (KNN) Classifier (Sec. 2.2)
Note: The choice of K has a drastic effect on the KNN classifier obtained. A small K gives a very flexible (low-bias, high-variance) boundary; a large K gives a smoother, less variable boundary.
Comparison of Classification Methods
KNN (Chapter 2.2)
Logistic Regression (Chapter 4)
LDA (Chapter 4)
QDA (Chapter 4)
Review: Why LDA, when we have Logistic Regression?
When the classes are well separated, the parameter estimates for logistic regression are surprisingly unstable. LDA does not suffer from this problem.
If n is small and the distribution of X is approximately normal in each class, LDA is again more stable than logistic regression.
LDA is popular when we have more than two response classes (polytomous responses).
Logistic Regression vs. LDA
Similarity: both logistic regression and LDA produce linear decision boundaries.
Difference: LDA assumes that the observations are drawn from a normal distribution with a common variance in each class, while logistic regression makes no such assumption.
LDA tends to do better than logistic regression when the normality assumption holds; otherwise logistic regression can outperform LDA.
KNN vs. LDA and Logistic Regression
KNN takes a completely different approach: it is completely non-parametric, making no assumptions about the shape of the decision boundary.
Advantage of KNN: we can expect KNN to dominate both LDA and logistic regression when the decision boundary is highly non-linear.
Disadvantage of KNN: KNN does not tell us which predictors are important (there is no table of coefficients).
QDA vs. LDA, Logistic Regression, and KNN
QDA is a compromise between the non-parametric KNN method and the linear LDA and logistic regression approaches.
Summary, if the true decision boundary is:
Linear: LDA and logistic regression outperform.
Moderately non-linear: QDA outperforms.
Complicated non-linear: KNN is superior.
QDA, LDA, Logistic Regression, and KNN
Scenario 1: uncorrelated random normal variables with a different mean in each class.
Scenario 2: same as (1), but with correlation −0.5 between the predictors.
Scenario 3: predictors drawn from a t-distribution.
QDA, LDA, Logistic Regression, and KNN
Scenario 4: both X ~ Normal, with correlation 0.5 in the first class and −0.5 in the second (non-constant covariances).
Scenario 5: both X ~ Normal, with y a function of X1², X2², and X1·X2 (quadratic decision boundary).
Scenario 6: both X ~ Normal, with y a complicated non-linear function of the predictors (non-linear decision boundary).
Iris Data
library(MASS)
attach(iris)
View(iris)                               # browse the data
names(iris)                              # variable names
table(Species)                           # class counts (50 per species)
lda.model = lda(Species ~ ., data = iris)
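A brief follow-up sketch (not on the slide) showing how the fitted model can be inspected and used for prediction:
# Inspect the fit and compute training predictions and a confusion matrix
lda.model                                # group means, priors, discriminant coefficients
iris.pred <- predict(lda.model, iris)
table(iris.pred$class, iris$Species)     # confusion matrix on the training data
mean(iris.pred$class == iris$Species)    # training accuracy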