Classification Methods


1 Classification Methods
STT592-002: Intro. to Statistical Learning Classification Methods Chapter 04 (part 02) Disclaimer: This PPT is adapted from IOM 530: Intro. to Statistical Learning

2 STT592-002: Intro. to Statistical Learning
Linear Discriminant Analysis (LDA) & Quadratic Discriminant Analysis (QDA)
Overview of classification methods:
Chap 4: Logistic regression; LDA/QDA; KNN
Chap 7: Generalized Additive Models
Chap 8: Trees, Random Forests, Boosting
Chap 9: Support Vector Machines (SVM)

3 STT592-002: Intro. to Statistical Learning
Outline
Overview of LDA
Why not Logistic Regression?
Estimating Bayes' Classifier
LDA Example with One Predictor (p = 1)
LDA Example with More than One Predictor (p > 1)
LDA on Default Data
Overview of QDA
Comparison between LDA and QDA
Overview of KNN Classification

4 Linear Discriminant Analysis
STT592-002: Intro. to Statistical Learning Linear Discriminant Analysis LDA undertakes the same task as logistic regression: it classifies data based on a categorical response variable. E.g.: making a profit or not; buying a product or not; satisfied customer or not; political party voting intention.

5 Why Linear? Why Discriminant?
STT592-002: Intro. to Statistical Learning Why Linear? Why Discriminant? LDA involves determining a linear equation (just like linear regression) that predicts which group a case belongs to: D = v1*X1 + v2*X2 + ... + vp*Xp + a, where D is the discriminant function, the v's are the discriminant coefficients (weights) for the variables, X denotes the p predictors, and a is a constant.

6 STT592-002: Intro. to Statistical Learning
Purpose of LDA Choose the v's to maximize the distance between the means of the different categories. Good predictors tend to have large weights |v|. We want to discriminate between the different categories. Think of a food recipe: changing the proportions (weights) of the ingredients changes the character of the finished cake, and hopefully produces different types of cake!

7 Why LDA, when we have Logistic Regression?
STT592-002: Intro. to Statistical Learning Why LDA, when we have Logistic Regression? When the classes are well-separated, the parameter estimates for logistic regression are surprisingly unstable. LDA does not suffer from this problem. If n is small and the distribution of X is approximately normal in each class, LDA is again more stable than logistic regression. LDA is also popular when we have more than two response classes (polytomous responses).

8 STT592-002: Intro. to Statistical Learning
Assumptions of LDA The predictor variable within each class (each level of Y) is normally distributed, and the classes share a common variance.

9 STT592-002: Intro. to Statistical Learning
Notations pk(X) = Pr(Y = k|X): the posterior probability that an observation X = x belongs to the kth class. Objective: classify an observation to the class for which pk(X) is largest. Bayes' Theorem: pk(X) = πk fk(x) / Σℓ πℓ fℓ(x), where πk is the prior probability of class k and fk(x) is the density of X within class k.

10 Estimating Bayes’ Classifier
STT592-002: Intro. to Statistical Learning Estimating Bayes' Classifier With logistic regression we modeled the probability of Y being from the kth class directly as a logistic function of X. However, Bayes' Theorem states pk(X) = πk fk(x) / Σℓ πℓ fℓ(x), where πk is the probability of coming from class k (prior probability) and fk(x) is the density function for X given that X is an observation from class k.

11 LDA for p=1 (single predictor)
STT592-002: Intro. to Statistical Learning LDA for p=1 (single predictor) The Bayes classifier assigns an observation X = x to the class for which pk(x) (Eq. 4.12) is largest, or equivalently, for which the discriminant function (Eq. 4.13) δk(x) = x μk/σ² − μk²/(2σ²) + log(πk) is largest. The linear log-odds function implies that the decision boundary between classes k and ℓ (the set where Pr(Y = k|X = x) = Pr(Y = ℓ|X = x)) is linear in x; in p dimensions it is a hyperplane. This is of course true for any pair of classes, so all decision boundaries are linear.

12 STT592-002: Intro. to Statistical Learning
Apply LDA LDA starts by assuming that each class has a normal distribution with a common variance. The class means and the common variance are estimated from the training data. Finally, Bayes' theorem is used to compute pk(x), and the observation is assigned to the class with the maximum probability among all K classes.

13 STT592-002: Intro. to Statistical Learning
Estimate πk and fk(x) We can estimate πk and fk(x) to compute pk(x). The most common model for fk(x) is the normal density: fk(x) = (1/√(2πσk²)) exp(−(x − μk)²/(2σk²)). Using this density, we only need to estimate three quantities, μk, σ², and πk, to compute the discriminant function δk(x).

14 Use Training Data set for Estimation
STT592-002: Intro. to Statistical Learning Use Training Data Set for Estimation The mean μk is estimated by the average of all training observations from the kth class. The variance is estimated by a weighted average of the sample variances of the K classes (the pooled variance): σ² = Σk (nk − 1) σk² / (n − K). And πk is estimated as the proportion of training observations that belong to the kth class: πk = nk/n. Plugging these estimates into δk(x) gives the estimated discriminant function.
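To make these estimation steps concrete, the following is a minimal R sketch (hypothetical data and variable names, not from the slides) that computes the pooled-variance discriminant scores by hand for p = 1 and checks them against MASS::lda:
# Hand-computed LDA for one predictor (p = 1) and two classes -- an illustrative sketch
library(MASS)
set.seed(1)
x <- c(rnorm(20, mean = 0), rnorm(20, mean = 2))    # hypothetical training data
y <- factor(rep(c("A", "B"), each = 20))
n <- length(x); K <- 2
nk <- table(y)                                      # class sizes n_k
mu <- tapply(x, y, mean)                            # estimated class means
s2 <- sum((nk - 1) * tapply(x, y, var)) / (n - K)   # pooled variance estimate
prior <- nk / n                                     # estimated priors pi_k
# Discriminant function: delta_k(x) = x * mu_k / s2 - mu_k^2 / (2 * s2) + log(pi_k)
delta <- function(x0, k) x0 * mu[k] / s2 - mu[k]^2 / (2 * s2) + log(prior[k])
pred.hand <- ifelse(delta(x, "A") > delta(x, "B"), "A", "B")
pred.lda <- predict(lda(y ~ x))$class               # the same rule via MASS::lda
table(pred.hand, pred.lda)                          # the two classifications agree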

15 A Simple Example with One Predictor (p = 1)
STT592-002: Intro. to Statistical Learning A Simple Example with One Predictor (p = 1) Suppose we have only one predictor (p = 1). Two normal density functions, f1(x) and f2(x), represent two distinct classes. The two density functions overlap, so there is some uncertainty about the class to which an observation with an unknown class belongs. The dashed vertical line represents the Bayes decision boundary.

16 STT592-002: Intro. to Statistical Learning
LDA for p=1 and K=2 The Bayes classifier assigns an observation X = x to the class for which pk(x) (Eq. 4.12), or equivalently the discriminant function δk(x) (Eq. 4.13), is largest. In this example the two classes have equal priors and means symmetric about zero, so the boundary is the midpoint of the two means, and the Bayes classifier assigns observations to class 1 if x < 0 and to class 2 otherwise.
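As an illustration, here is a small R simulation of this two-class setting (the specific means −1.25 and 1.25 and unit variance are assumptions for the sketch, not taken from the slide), comparing the estimated LDA boundary with the Bayes boundary at x = 0:
# Two classes with one predictor: compare the fitted LDA boundary with the Bayes boundary
library(MASS)
set.seed(42)
x <- c(rnorm(20, mean = -1.25), rnorm(20, mean = 1.25))  # assumed class means -1.25 and 1.25, sd = 1
y <- factor(rep(1:2, each = 20))
fit <- lda(y ~ x)
# With equal priors and a shared variance, the boundary is the midpoint of the class means
c(LDA = mean(fit$means), Bayes = 0)   # Bayes boundary = (mu1 + mu2)/2 = 0 for the true means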

17 STT592-002: Intro. to Statistical Learning
LDA for p=1 and K=2 20 observations were drawn from each of the two classes. The dashed vertical line is the Bayes decision boundary; the solid vertical line is the LDA decision boundary. Bayes error rate: 10.6%; LDA error rate: 11.1%. Thus, LDA is performing pretty well!

18 STT592-002: Intro. to Statistical Learning
An Example When p > 1 If X is multidimensional (p > 1), we use exactly the same approach, except that the density function f(x) is modeled using the multivariate normal density.

19 LDA for p>1 (multiple predictors, similar to p=1)
STT592-002: Intro. to Statistical Learning LDA for p>1 (multiple predictors, similar to p=1) The Bayes classifier assigns an observation X = x to the class for which the discriminant function δk(x) = x^T Σ^(-1) μk − (1/2) μk^T Σ^(-1) μk + log(πk) is largest, where μk is the mean vector of class k and Σ is the common covariance matrix.

20 LDA for p>1 (Multiple predictors)
STT592-002: Intro. to Statistical Learning LDA for p>1 (multiple predictors) Two predictors (p = 2); three classes (K = 3). 20 observations were generated from each class. The three ellipses represent regions that contain 95% of the probability for each of the three classes. Dashed lines: Bayes boundaries; solid lines: LDA boundaries.

21 LDA for p>1 (Multiple predictors)
STT592-002: Intro. to Statistical Learning LDA for p>1 (multiple predictors)

22 STT592-002: Intro. to Statistical Learning
The Default Dataset
library(MASS); library(ISLR)
data(Default)
head(Default)
attach(Default)
lda.fit = lda(default ~ student + balance)
lda.fit
plot(lda.fit)
lda.pred = predict(lda.fit, Default)   # predict on the Default data frame, not the response vector
names(lda.pred)
lda.class = lda.pred$class
table(lda.class, default)
mean(lda.class == default)
mean(default == "Yes")   #[1]
Note: Training error rates will usually be lower than test error rates, which are the real quantity of interest. In other words, we might expect this classifier to perform worse if we use it to predict whether or not a new set of individuals will default.

23 Running LDA on Default Data
STT592-002: Intro. to Statistical Learning Running LDA on Default Data LDA makes mistakes on its predictions (2.75% misclassification error rate). But LDA mis-predicts 252/333 = 75.7% of the actual defaulters! Perhaps we shouldn't use 0.5 as the threshold for predicting default?

24 STT592-002: Intro. to Statistical Learning
LDA for Default Overall accuracy = 97.25%. The total number of mistakes is 23 + 252 = 275 (2.75% misclassification error rate), but we mis-predicted 252/333 = 75.7% of the defaulters. Examine the error rates with other thresholds: sensitivity and specificity. Sensitivity = Pr(Predicted = "Yes" | defaulter) = true positives/333 = 81/333 = 24.3%. Specificity = Pr(Predicted = "No" | non-defaulter) = 1 − false positives/9667 = 1 − 23/9667 = 99.8%. E.g.: sensitivity = % of true defaulters that are identified = 24.3% (low); specificity = % of non-defaulters that are correctly identified = 99.8%.

25 Use 0.2 as Threshold for Default
STT592-002: Intro. to Statistical Learning Use 0.2 as Threshold for Default Now the total number of mistakes is 235 + 138 = 373 (3.73% misclassification error rate), but we mis-predict only 138/333 = 41.4% of the defaulters. Examine the error rates at this threshold: sensitivity and specificity. E.g.: sensitivity = % of true defaulters that are identified = 58.6% (higher); specificity = % of non-defaulters that are correctly identified = 97.6%.
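Continuing the earlier R session (so that lda.pred and the attached default variable exist), a minimal sketch of how the 0.2 posterior threshold could be applied; the column name "Yes" matches the levels of the default factor in ISLR's Default data:
# Classify as a defaulter whenever the posterior probability of "Yes" exceeds 0.2
post.yes <- lda.pred$posterior[, "Yes"]
lda.class2 <- ifelse(post.yes > 0.2, "Yes", "No")
table(lda.class2, default)     # confusion matrix at the 0.2 threshold
mean(lda.class2 != default)    # overall misclassification rate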

26 Default Threshold Values vs. Error Rates
STT592-002: Intro. to Statistical Learning Default Threshold Values vs. Error Rates Black solid: overall error rate. Blue dashed: fraction of defaulters missed. Orange dotted: fraction of non-defaulters incorrectly classified.

27 Receiver Operating Characteristics (ROC) curve
STT592-002: Intro. to Statistical Learning Receiver Operating Characteristics (ROC) curve The overall performance of a classifier, summarized over all possible thresholds, is given by the area under the ROC curve (AUC). An ideal ROC curve hugs the top left corner, so the larger the AUC, the better the classifier. For this data the AUC is 0.95, which is close to the maximum of one and so would be considered very good.
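For illustration, a small base-R sketch (again continuing the same session; sweeping thresholds by hand is a generic approach, not the slides' own code) that traces the ROC curve and approximates the AUC with the trapezoid rule:
# ROC curve by sweeping the posterior-probability threshold from 1 down to 0
post.yes <- lda.pred$posterior[, "Yes"]
truth <- (default == "Yes")
thresholds <- seq(1, 0, by = -0.001)
tpr <- sapply(thresholds, function(t) mean(post.yes > t & truth) / mean(truth))     # sensitivity
fpr <- sapply(thresholds, function(t) mean(post.yes > t & !truth) / mean(!truth))   # 1 - specificity
plot(fpr, tpr, type = "l", xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)   # the "no information" diagonal
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)   # trapezoidal approximation
auc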

28 Receiver Operating Characteristics (ROC) curve
STT592-002: Intro. to Statistical Learning Receiver Operating Characteristics (ROC) curve False positive = (Truth = "No") & (Predicted = "Yes"), i.e., an observation incorrectly predicted to be positive. E.g., in the Default data, "+" indicates an individual who defaults, and "−" indicates one who does not. Connecting to the classical hypothesis-testing literature, we think of "−" as the null hypothesis and "+" as the alternative (non-null) hypothesis.

29 Receiver Operating Characteristics (ROC) curve
STT592-002: Intro. to Statistical Learning Receiver Operating Characteristics (ROC) curve

30 Quadratic Discriminant Analysis (QDA)
STT592-002: Intro. to Statistical Learning Quadratic Discriminant Analysis (QDA) LDA assumes that every class has the same variance/covariance. However, LDA may perform poorly if this assumption is far from true. QDA works identically to LDA except that it estimates a separate variance/covariance matrix for each class.
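For reference, a minimal sketch of how QDA could be fit to the same Default data with MASS::qda (same interface as lda; continuing the earlier session):
# QDA on the Default data; qda() estimates a separate covariance for each class
qda.fit = qda(default ~ student + balance, data = Default)
qda.fit
qda.class = predict(qda.fit, Default)$class
table(qda.class, default)     # confusion matrix
mean(qda.class == default)    # overall accuracy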

31 Which is better? LDA or QDA?
STT592-002: Intro. to Statistical Learning Which is better? LDA or QDA? Since QDA allows for different variances among the classes, the resulting decision boundaries become quadratic. Which approach is better? QDA works best when the variances are very different between classes and we have enough observations to estimate them accurately. LDA works best when the variances are similar among the classes or we don't have enough data to estimate the variances accurately.

32 STT592-002: Intro. to Statistical Learning
Comparing LDA to QDA Black dotted: LDA boundary. Purple dashed: Bayes boundary. Green solid: QDA boundary. Left: the variances of the classes are equal (LDA is the better fit). Right: the variances of the classes are not equal (QDA is the better fit).

33 K-Nearest Neighbors (KNN) classifier (Sec 2.2)
STT592-002: Intro. to Statistical Learning K-Nearest Neighbors (KNN) classifier (Sec 2.2) Given a positive integer K and a test observation x0, the KNN classifier first identifies the K points in the training data that are closest to x0, represented by N0. It then estimates the conditional probability for class j as the fraction of points in N0 whose response values equal j: Pr(Y = j | X = x0) = (1/K) Σ_{i ∈ N0} I(yi = j). Finally, KNN applies the Bayes rule and classifies the test observation x0 to the class with the largest estimated probability.
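A minimal R sketch of KNN on the same Default data (the use of class::knn, the choice of the two numeric predictors, the train/test split, and K = 3 are illustrative assumptions, not the slides' own code):
# KNN with K = 3 on the Default data, using class::knn
library(class)
X <- scale(Default[, c("balance", "income")])   # KNN is distance-based, so scale the predictors
set.seed(1)
train <- sample(nrow(Default), 5000)            # a simple 50/50 train/test split
knn.pred <- knn(train = X[train, ], test = X[-train, ],
                cl = Default$default[train], k = 3)
table(knn.pred, Default$default[-train])        # confusion matrix on the held-out half
mean(knn.pred == Default$default[-train])       # test accuracy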

34 K-Nearest Neighbors (KNN) classifier (Sec 2.2)
STT592-002: Intro. to Statistical Learning K-Nearest Neighbors (KNN) classifier (Sec 2.2) A small training data set: 6 blue and 6 orange observations. Goal: make a prediction for the black cross. Consider K = 3. KNN identifies the 3 observations that are closest to the cross; this neighborhood is shown as a circle. It consists of 2 blue points and 1 orange point, resulting in estimated probabilities of 2/3 for the blue class and 1/3 for the orange class. KNN therefore predicts that the black cross belongs to the blue class.

35 K-Nearest Neighbors (KNN) classifier (Sec 2.2)
STT592-002: Intro. to Statistical Learning K-Nearest Neighbors (KNN) classifier (Sec 2.2)

36 K-Nearest Neighbors (KNN) classifier (Sec 2.2)
STT592-002: Intro. to Statistical Learning K-Nearest Neighbors (KNN) classifier (Sec 2.2) Note: The choice of K has a drastic effect on the KNN classifier obtained.

37 K-Nearest Neighbors (KNN) classifier (Sec 2.2)
STT592-002: Intro. to Statistical Learning K-Nearest Neighbors (KNN) classifier (Sec 2.2) Note: The choice of K has a drastic effect on the KNN classifier obtained.

38 Comparison of Classification Methods
STT592-002: Intro. to Statistical Learning Comparison of Classification Methods KNN (Chapter 2.2); Logistic Regression (Chapter 4); LDA (Chapter 4); QDA (Chapter 4)

39 Review: Why LDA, when we have Logistic Regression?
STT592-002: Intro. to Statistical Learning Review: Why LDA, when we have Logistic Regression? When the classes are well-separated, the parameter estimates for logistic regression are surprisingly unstable. LDA does not suffer from this problem. If n is small and the distribution of X is approximately normal in each class, LDA is again more stable than logistic regression. LDA is also popular when we have more than two response classes (polytomous responses).

40 Logistic Regression vs. LDA
STT592-002: Intro. to Statistical Learning Logistic Regression vs. LDA Similarity: both logistic regression and LDA produce linear decision boundaries. Difference: LDA assumes that the observations are drawn from normal distributions with a common variance in each class, while logistic regression does not make this assumption. LDA will do better than logistic regression if the normality assumption holds; otherwise logistic regression can outperform LDA.

41 KNN vs. (LDA and Logistic Regression)
STT592-002: Intro. to Statistical Learning KNN vs. (LDA and Logistic Regression) KNN takes a completely different approach: it is completely non-parametric, so no assumptions are made about the shape of the decision boundary! Advantage of KNN: we can expect KNN to dominate both LDA and logistic regression when the decision boundary is highly non-linear. Disadvantage of KNN: KNN does not tell us which predictors are important (there is no table of coefficients).

42 QDA vs. (LDA, Logistic Regression, and KNN)
STT592-002: Intro. to Statistical Learning QDA vs. (LDA, Logistic Regression, and KNN) QDA is a compromise between the non-parametric KNN method and the linear LDA and logistic regression approaches. Summary: if the true decision boundary is linear, LDA and logistic regression outperform; if it is moderately non-linear, QDA outperforms; if it is a complicated non-linear boundary, KNN is superior.

43 QDA, LDA, Logistic Regression, and KNN
STT592-002: Intro. to Statistical Learning QDA, LDA, Logistic Regression, and KNN Scenario 1: uncorrelated random normal variables with different means in each class; Scenario 2: same as (1), but with correlation −0.5; Scenario 3: both predictors from a t-distribution;

44 QDA, LDA, Logistic Regression, and KNN
STT592-002: Intro. to Statistical Learning QDA, LDA, Logistic Regression, and KNN Scenario 4: both X ~ Normal, with r1 = 0.5 and r2 = −0.5 (non-constant variances); Scenario 5: both X ~ Normal, y = function(X1^2, X2^2, X1*X2) [quadratic decision boundary]; Scenario 6: both X ~ Normal, y = a complicated non-linear function [non-linear decision boundary].

45 STT592-002: Intro. to Statistical Learning
Iris Data
library(MASS)
attach(iris)
View(iris)
names(iris)
table(Species)
lda.model = lda(Species ~ ., data = iris)
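A possible follow-up (not on the slide) is to look at the fitted model's predictions; note that this is only the training confusion matrix, so it will be optimistic:
# Training-set predictions from the iris LDA model
lda.pred = predict(lda.model, iris)
table(lda.pred$class, iris$Species)     # confusion matrix on the training data
mean(lda.pred$class == iris$Species)    # training accuracy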

