Linear Discriminant Analysis (LDA)
Goal: To classify observations into 2 or more groups using k discriminant functions (the dependent variable Y is categorical with k classes).
Assumptions:
Multivariate normal distribution: the independent variables are normally distributed within each class/group.
Similar group covariances: the correlations between variables and the variances within each group should be similar across groups.
Dependent Variable: Must be categorical with 2 or more classes (groups). If there are only 2 classes, the discriminant analysis procedure gives the same result as the multiple regression procedure.
Independent Variables: Continuous or categorical. If categorical, they are converted into binary (dummy) variables, as in multiple linear regression.
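The dummy coding mentioned above can be sketched in a few lines. The age levels and the helper name dummy_code are hypothetical, not from the slides; one level is used as the baseline, as in multiple regression:

```python
# Sketch of dummy coding a categorical predictor (hypothetical 3-level
# Age variable -> two 0/1 columns, with the first level as baseline).

def dummy_code(values, levels):
    # levels[0] is the baseline; one indicator column per other level
    return [[1 if v == lvl else 0 for lvl in levels[1:]] for v in values]

ages = ["31-49", "18-30", "50+", "18-30"]
print(dummy_code(ages, ["31-49", "18-30", "50+"]))
# -> [[0, 0], [1, 0], [0, 1], [1, 0]]
```

The two resulting columns correspond to the Age (18-30) and Age (50+) indicators used in the regression example that follows.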
Output Example: Assume 3 classes (Y = 1, 2, 3) of the dependent variable.

Y   X1   X2   X3   X4   f1   f2   f3   Pred. Y
…   …    …    …    …    …    …    …    …

(f1, f2, f3 are the discriminant-function scores for each case; the predicted class is the one with the largest score.)
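A minimal sketch of how f1/f2/f3 scores can be produced with Fisher's linear classification functions, assuming two predictors, three classes, equal priors, and fictitious well-separated data (all names and values are illustrative, not the slide's data):

```python
# Fisher's linear classification functions for 3 classes, 2 predictors:
# f_k(x) = (S^-1 mu_k) . x - 1/2 mu_k' S^-1 mu_k, using the pooled
# within-group covariance S; assign each case to the class with the
# largest score. Equal priors are assumed, so the ln(prior) term drops.

def mean(v):
    return sum(v) / len(v)

def pooled_covariance(groups):
    # groups: one list of (x1, x2) rows per class
    n_total = sum(len(g) for g in groups)
    s11 = s12 = s22 = 0.0
    for g in groups:
        m1 = mean([r[0] for r in g])
        m2 = mean([r[1] for r in g])
        for x1, x2 in g:
            s11 += (x1 - m1) ** 2
            s12 += (x1 - m1) * (x2 - m2)
            s22 += (x2 - m2) ** 2
    d = n_total - len(groups)          # pooled degrees of freedom
    return [[s11 / d, s12 / d], [s12 / d, s22 / d]]

def inv2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def classification_functions(groups):
    si = inv2(pooled_covariance(groups))
    funcs = []
    for g in groups:
        mu = [mean([r[0] for r in g]), mean([r[1] for r in g])]
        w = [si[0][0] * mu[0] + si[0][1] * mu[1],
             si[1][0] * mu[0] + si[1][1] * mu[1]]   # S^-1 mu_k
        c = -0.5 * (w[0] * mu[0] + w[1] * mu[1])     # -1/2 mu_k' S^-1 mu_k
        funcs.append((w, c))
    return funcs

def predict(funcs, x):
    scores = [w[0] * x[0] + w[1] * x[1] + c for w, c in funcs]
    return scores.index(max(scores)) + 1             # classes numbered 1..k

# Three fictitious, well-separated groups (class 1, 2, 3)
groups = [
    [(1.0, 1.1), (1.2, 0.9), (0.9, 1.0)],
    [(4.0, 4.2), (4.1, 3.9), (3.9, 4.0)],
    [(7.0, 1.0), (7.2, 1.1), (6.9, 0.8)],
]
funcs = classification_functions(groups)
print(predict(funcs, (1.0, 1.0)))   # near the class-1 mean
print(predict(funcs, (4.0, 4.0)))
print(predict(funcs, (7.0, 1.0)))
```

Maximizing f_k is equivalent to picking the class whose mean is closest to x in Mahalanobis distance under the pooled covariance, which is why the similar-group-covariances assumption matters.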
Binary Dependent - Regression
If the dependent has only 2 classes, one can use multiple regression. Sample data shown below:

Status   Age (18-30)   Age (50+)   Income
Y        X1            X2          X3
…        …             …           …
Regression Output
[Excel SUMMARY OUTPUT: regression statistics (Multiple R, R Square, Adjusted R Square, Standard Error; Observations = 24), ANOVA table (Significance F on the order of E-05), and coefficient estimates with standard errors, t stats, p-values, and 95% confidence limits for the Intercept, X1, X2, and Income; numeric values not recovered.]
Classification

Status   Age (18-30)   Age (50+)   Income   Predicted Y   Class
Y        X1            X2          X3       …             …

Classification rule in this case: if Pred. Y > 0.5 then Class = 1; else Class = 0. This model yielded 2 misclassifications out of 24. How good is R-square?
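The cutoff rule can be sketched with simple least squares on fictitious one-predictor data (the helper names and the data are illustrative, not the 24-case data from the slides):

```python
# Regress a 0/1 Status on a single predictor by ordinary least squares,
# then apply the rule: Pred. Y > 0.5 -> Class 1, else Class 0,
# and count misclassifications.

def fit_simple_ols(x, y):
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = my - b1 * mx
    return b0, b1

# Fictitious data: income (in $1000s) vs. status (1 = good, 0 = bad)
income = [20, 25, 30, 35, 55, 60, 65, 70]
status = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_simple_ols(income, status)
pred = [1 if b0 + b1 * xi > 0.5 else 0 for xi in income]
miscl = sum(p != y for p, y in zip(pred, status))
print(pred, miscl)   # 2 cases fall on the wrong side of the cutoff
```

Note that the fitted values are not probabilities (they can fall outside [0, 1]), which is one reason R-square is an awkward quality measure here; classification accuracy is more informative.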
Crosstab of Pred. Y and Y
For large datasets, one can bin the Predicted Y variable and create a crosstab with Y to see how accurately the model classifies the data (fictitious results shown here). The Good and Bad columns contain the counts of actual Y values in each score band.

Predicted Y * 1000   Good   Bad
[Score bands (e.g., 900 to 1000) with Good and Bad counts; values not recovered.]
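A sketch of building such a crosstab in plain Python, on fictitious scores and labels (band width and all values are illustrative):

```python
# Bin Pred. Y * 1000 into bands of a given width and count actual
# Good (Y = 1) and Bad (Y = 0) cases per band.

def crosstab(scores, labels, width=100):
    table = {}
    for s, y in zip(scores, labels):
        lo = int(s * 1000) // width * width
        band = (lo, lo + width)            # e.g. (900, 1000)
        good, bad = table.get(band, (0, 0))
        table[band] = (good + 1, bad) if y == 1 else (good, bad + 1)
    return table

# Fictitious predicted scores and actual labels (1 = Good, 0 = Bad)
scores = [0.95, 0.91, 0.88, 0.72, 0.65, 0.40, 0.35, 0.12]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

table = crosstab(scores, labels)
for band in sorted(table, reverse=True):
    print(band, table[band])
```

A well-discriminating model concentrates Goods in the high bands and Bads in the low bands, which is exactly what the KS test on the next slide quantifies.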
Kolmogorov-Smirnov Test
Use the crosstab shown on the last slide to conduct the KS test to determine: 1. the cutoff score, 2. classification accuracy, and 3. forecasts of model performance.
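A sketch of the KS computation on the same kind of score/label data (fictitious values): walk down the scores from highest to lowest, track the cumulative shares of Goods and Bads captured, and take the largest gap as the KS statistic; the score at that gap is a candidate cutoff.

```python
# KS statistic for a scored binary outcome: the maximum gap between
# the cumulative distributions of Good (1) and Bad (0) cases.

def ks_statistic(scores, labels):
    pairs = sorted(zip(scores, labels), reverse=True)
    n_good = sum(labels)
    n_bad = len(labels) - n_good
    cum_g = cum_b = 0
    best_gap, cutoff = 0.0, None
    for s, y in pairs:
        if y == 1:
            cum_g += 1
        else:
            cum_b += 1
        gap = abs(cum_g / n_good - cum_b / n_bad)
        if gap > best_gap:
            best_gap, cutoff = gap, s
    return best_gap, cutoff

# Fictitious predicted scores and actual labels (1 = Good, 0 = Bad)
scores = [0.95, 0.91, 0.88, 0.72, 0.65, 0.40, 0.35, 0.12]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
print(ks_statistic(scores, labels))
```

KS ranges from 0 (no separation between Goods and Bads) to 1 (perfect separation); classifying cases above the returned cutoff as Good gives the best single-threshold split of the two groups.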