Université d’Ottawa / University of Ottawa 2001
Bio 8100s Applied Multivariate Biostatistics

Lecture 9: Discriminant function analysis (DFA)
• Rationale and use of DFA
• The underlying model (what is a discriminant function anyway?)
• Finding discriminant functions: principles and procedures
• Linear versus quadratic discriminant functions
• Significance testing
• Rotating discriminant functions
• Component retention, significance, and reliability
What is discriminant function analysis?
• Given a set of p variables $X_1, X_2, \dots, X_p$ and a set of N objects belonging to m known groups (classes) $G_1, G_2, \dots, G_m$, we try to construct a set of functions $Z_1, Z_2, \dots, Z_{\min\{m-1,\,p\}}$ that allow us to classify each object correctly.
• The hope (sometimes faint) is that “good” classification results (i.e., low misclassification rate, high reliability) will be obtained through a relatively small set of simple functions.
What is a discriminant function anyway?
• A discriminant function is a function $Z = f(X_1, X_2, \dots, X_p)$ which maximizes the “separation” between the groups under consideration or, more technically, maximizes the ratio of between-group to within-group variation.
[Figure: frequency distributions of Group 1 and Group 2 along two candidate functions, one with heavy overlap (“not so good”) and one with clear separation (“better”).]
The linear discriminant model
• For a set of p variables $X_1, X_2, \dots, X_p$, the general model is
$$Z_i = a_{i1}X_1 + a_{i2}X_2 + \dots + a_{ip}X_p$$
where the $X_j$ are the original variables and the $a_{ij}$ are the discriminant function coefficients.
• Note: unlike in PCA and FA, the discriminant functions are based on the raw (unstandardized) variables, since the resulting classifications are unaffected by scale.
• For p variables and m groups, the maximum number of DFs is $\min\{p,\, m-1\}$.
The geometry of a single linear discriminant function
• 2 groups with measurements of two variables ($X_1$ and $X_2$) on each object.
• In this case, the linear DF $Z^*$ results in no misclassifications, whereas another possible DF ($Z$) gives two misclassifications.
[Figure: Group 1 and Group 2 plotted on $X_1$ and $X_2$, with the axes $Z$ and $Z^*$; two objects are marked as misclassified under $Z$ but not under $Z^*$.]
Finding discriminant functions: principles
• The first discriminant function is that which maximizes the differences between groups compared to the differences within groups…
• …which is equivalent to maximizing F in a one-way ANOVA.
[Figure: $F(Z)$ as a function of the coefficient vector $a = (a_1, \dots, a_p)$, with $Z_1$ marked.]
Finding discriminant functions: principles
• The second discriminant function is that which maximizes the differences between groups compared to the differences within groups unaccounted for by $Z_1$…
• …which is equivalent to maximizing F in a one-way ANOVA given the constraint that $Z_1$ and $Z_2$ are uncorrelated.
[Figure: $F(Z)$ as a function of the coefficient vector $a = (a_1, \dots, a_p)$, with $Z_2$ marked.]
The geometry of several linear discriminant functions
• 2 groups with measurements of two variables ($X_1$ and $X_2$) on each individual.
• Using only $Z_1$, 4 objects are misclassified, whereas using both $Z_1$ and $Z_2$, only one object is misclassified.
[Figure: Group 1 and Group 2 plotted on $X_1$ and $X_2$, with the axes $Z_1$ and $Z_2$; objects misclassified using only $Z_1$ and the object misclassified using both $Z_1$ and $Z_2$ are marked.]
SSCP matrices: within, between, and total
• The total (T) SSCP matrix (based on p variables $X_1, X_2, \dots, X_p$) in a sample of objects belonging to m groups $G_1, G_2, \dots, G_m$ with sizes $n_1, n_2, \dots, n_m$ can be partitioned into within-groups (W) and between-groups (B) SSCP matrices:
$$T = W + B$$
$$t_{rc} = \sum_{j=1}^{m}\sum_{i=1}^{n_j} (x_{ijr} - \bar{x}_r)(x_{ijc} - \bar{x}_c)$$
$$w_{rc} = \sum_{j=1}^{m}\sum_{i=1}^{n_j} (x_{ijr} - \bar{x}_{jr})(x_{ijc} - \bar{x}_{jc})$$
where $x_{ijk}$ is the value of variable $X_k$ for the ith observation in group j, $\bar{x}_{jk}$ is the mean of variable $X_k$ for group j, $\bar{x}_k$ is the overall mean of variable $X_k$, and $t_{rc}$ and $w_{rc}$ are the elements in row r and column c of the total (T) and within (W) SSCP matrices.
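The partition of the total SSCP matrix can be checked numerically. The following is a minimal sketch in plain Python, not part of the original lecture: the two small groups are made-up illustrative data, and the function names (`mean_vector`, `sscp`, `partition`) are hypothetical. It computes T about the overall means, W about the group means, and B as the difference T − W.

```python
def mean_vector(rows):
    """Mean of each variable over a list of observations."""
    p = len(rows[0])
    return [sum(row[k] for row in rows) / len(rows) for k in range(p)]


def sscp(rows, centre):
    """Sums of squares and cross-products of deviations from `centre`."""
    p = len(centre)
    return [[sum((row[r] - centre[r]) * (row[c] - centre[c]) for row in rows)
             for c in range(p)] for r in range(p)]


def partition(groups):
    """Return (T, W, B) for a list of groups of p-variable observations."""
    p = len(groups[0][0])
    pooled = [row for g in groups for row in g]
    T = sscp(pooled, mean_vector(pooled))      # deviations from overall means
    W = [[0.0] * p for _ in range(p)]
    for g in groups:                           # deviations from group means
        Wg = sscp(g, mean_vector(g))
        for r in range(p):
            for c in range(p):
                W[r][c] += Wg[r][c]
    B = [[T[r][c] - W[r][c] for c in range(p)] for r in range(p)]
    return T, W, B


# Made-up data: 2 groups, 3 objects each, 2 variables (X1, X2).
groups = [
    [[1.0, 2.0], [2.0, 3.0], [3.0, 5.0]],
    [[6.0, 5.0], [7.0, 8.0], [8.0, 7.0]],
]
T, W, B = partition(groups)
print("T =", T)
print("W =", W)
print("B =", B)
```

Computing B by subtraction is a convenience; it could equally be built directly from the deviations of group means from the overall means.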
Finding discriminant functions: analytic procedures
• Calculate the total (T), within (W) and between (B) SSCP matrices.
• Determine the eigenvalues and eigenvectors of the product $W^{-1}B$.
• The eigenvalue $\lambda_i$ is the ratio of between to within SSs for the ith discriminant function $Z_i$…
• …and the elements of the corresponding eigenvectors are the discriminant function coefficients.
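For the special case of two groups there is only one discriminant function, and the eigenproblem has a closed form: with $d$ the difference between the two group mean vectors, $B = (n_1 n_2 / N)\, d d'$, so the eigenvector is proportional to $W^{-1}d$ and the eigenvalue is $(n_1 n_2 / N)\, d' W^{-1} d$. The sketch below illustrates this for two variables in plain Python. The data are made up, and the helper names (`mean`, `within_sscp`, `solve2`) are hypothetical; a general analysis with more groups would use a numerical eigenvalue routine instead.

```python
def mean(rows):
    """Mean of each of the two variables."""
    n = len(rows)
    return [sum(row[k] for row in rows) / n for k in range(2)]


def within_sscp(groups):
    """Pooled within-groups SSCP matrix W (2 x 2)."""
    W = [[0.0, 0.0], [0.0, 0.0]]
    for g in groups:
        m = mean(g)
        for row in g:
            for r in range(2):
                for c in range(2):
                    W[r][c] += (row[r] - m[r]) * (row[c] - m[c])
    return W


def solve2(M, v):
    """Solve the 2 x 2 system M a = v, i.e. return M^-1 v."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(M[1][1] * v[0] - M[0][1] * v[1]) / det,
            (M[0][0] * v[1] - M[1][0] * v[0]) / det]


# Made-up data: 2 groups, 3 objects each, 2 variables.
g1 = [[1.0, 2.0], [2.0, 3.0], [3.0, 5.0]]
g2 = [[6.0, 5.0], [7.0, 8.0], [8.0, 7.0]]
n1, n2 = len(g1), len(g2)
m1, m2 = mean(g1), mean(g2)
d = [m1[0] - m2[0], m1[1] - m2[1]]       # difference of group means

W = within_sscp([g1, g2])
a = solve2(W, d)                          # eigenvector of W^-1 B (up to scale)
lam = (n1 * n2 / (n1 + n2)) * (d[0] * a[0] + d[1] * a[1])   # eigenvalue

print("coefficients a =", a)
print("eigenvalue (between/within SS ratio) =", lam)
```

The coefficient vector is only defined up to a scale factor; statistical packages usually rescale it, so the values printed here need not match package output.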
Assumptions
• Equality of within-group covariance matrices ($C_1 = C_2 = \dots$) implies that each element of $C_1$ is equal to the corresponding element in $C_2$, etc.
[Figure: variance and covariance elements of the covariance matrices for groups $G_1$ and $G_2$.]
The quadratic discriminant model
• For a set of p variables $X_1, X_2, \dots, X_p$, the general quadratic model is [equation not preserved in the source], where the $X_j$ are the original variables, the $a_{ij}$ are the linear coefficients and the $b_{ij}$ are the 2nd order coefficients.
• Because the quadratic model involves many more parameters, sample sizes must be considerably larger to get reasonably stable estimates of coefficients.
[Figure: Group 1 and Group 2 plotted on $X_1$ and $X_2$, with a linear $Z_1$ and a quadratic $Z_1$.]
Fitting discriminant function models: the problems
• Goal: find the “best” model, given the available data.
• Problem 1: what is “best”?
• Problem 2: even if “best” is defined, by what method do we find it?
• Possibilities:
  - If there are p variables, we might compute DFs for all possible subsets of variables ($2^p - 1$ models) and choose the best one.
  - Use some procedure for winnowing down the set of possible models.
Criteria for choosing the “best” discriminant model
• Discriminating ability: better models are better able to distinguish among groups.
• Implication: better models will have lower misclassification rates.
• N.B. Raw misclassification rates can be very misleading.
• Parsimony: a discriminant model which includes fewer variables is better than one with more variables.
• Implication: if the elimination/addition of a variable does not significantly increase/decrease the misclassification rate, it may not be very useful.
Criteria for choosing the “best” discriminant model (cont’d)
• Model stability: better models have coefficients that are stable.
• Procedure: judge stability through cross-validation (jackknifing, bootstrapping).
• N.B. 1. In general, linear discriminant functions will be more stable than quadratic functions, especially if the sample is small.
• N.B. 2. If the sample is small, then “outliers” may dramatically decrease model stability.
Analytic procedures: general approach
• Evaluate the significance of a variable ($X_i$) in a DF by computing the difference in group resolution between two models, one with the variable included, the other with it excluded.
• Evaluate the change in discriminating ability ($\Delta$DA) associated with inclusion of the variable in question.
• Unfortunately, the change in discriminating ability may depend on what other variables are in the model!
[Figure: Model A ($X_i$ in) is compared with Model B ($X_i$ out) to give $\Delta$DA. If $\Delta$DA is small, delete $X_i$; if $\Delta$DA is large, retain $X_i$.]
Strategy I: computing all possible models
• Compute all possible models and choose the “best” one.
• Impractical unless the number of variables is relatively small.
[Figure: the candidate set {$X_1, X_2, X_3$} and its seven possible models: {$X_1$}, {$X_2$}, {$X_3$}, {$X_1, X_2$}, {$X_1, X_3$}, {$X_2, X_3$}, {$X_1, X_2, X_3$}.]
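To see why the all-subsets strategy quickly becomes impractical, the candidate models can simply be enumerated. This is a small illustrative sketch in Python; the function name `all_subsets` is hypothetical, and fitting and scoring each model is deliberately left out.

```python
from itertools import combinations


def all_subsets(variables):
    """Return every non-empty subset of the candidate variables."""
    return [subset
            for size in range(1, len(variables) + 1)
            for subset in combinations(variables, size)]


models = all_subsets(["X1", "X2", "X3"])
for model in models:
    print(model)
print("number of models:", len(models))

# The count grows as 2**p - 1 with the number of variables p.
for p in (5, 10, 20):
    print(p, "variables:", 2 ** p - 1, "models")
```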
Strategy II: forward selection
• Start with the variable for which differences among group means are the largest (largest F-value).
• Add others one at a time based on F to enter (p to enter) until no further significant increase in discriminating ability is achieved.
• Problem: if $X_j$ is included, it stays in even if it contributes little to discriminating ability once other variables are included.
[Figure: forward selection from all variables ($X_1, X_2, X_3, X_4$).
  1. $F_2 > F_1, F_3, F_4$ and $F_2 >$ F to enter ($p <$ p to enter): model is ($X_2$).
  2. $F_1 > F_3, F_4$ and $F_1 >$ F to enter ($p <$ p to enter): model is ($X_1, X_2$).
  3. $F_4 > F_3$ but $F_4 <$ F to enter ($p >$ p to enter): final model is ($X_1, X_2$).]
What is F to enter/remove (p to enter/remove) anyway?
• When no variables are in the model, F to enter is the F-value from a univariate one-way ANOVA comparing group means with respect to the variable in question, and p to enter is the Type I probability associated with the null that all group means are equal.
• When other variables are in the model, F to enter corresponds to the F-value for an ANCOVA comparing group means with respect to the variable in question, where the covariates are the variables already entered.
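The first case (no variables yet in the model) can be sketched directly: compute the one-way ANOVA F-value for each candidate variable and enter the one with the largest F. The Python below is an illustration only; the data are made up, the function name `one_way_f` is hypothetical, and the ANCOVA step used once variables are already in the model is not shown.

```python
def one_way_f(groups):
    """One-way ANOVA F-value for a single variable.

    `groups` is a list of lists, one list of values per group.
    F = [SS_between / (m - 1)] / [SS_within / (N - m)].
    """
    m = len(groups)
    N = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / N
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups)
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    return (ss_between / (m - 1)) / (ss_within / (N - m))


# Made-up data: two candidate variables measured in two groups.
candidates = {
    "X1": [[1.0, 2.0, 3.0], [6.0, 7.0, 8.0]],
    "X2": [[2.0, 3.0, 5.0], [5.0, 8.0, 7.0]],
}
f_to_enter = {name: one_way_f(groups) for name, groups in candidates.items()}
first = max(f_to_enter, key=f_to_enter.get)

print(f_to_enter)
print("first variable to enter:", first)
```

In practice the largest F would also be compared with the chosen F-to-enter threshold before the variable is admitted.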
Strategy III: backward selection
• Start with all variables and drop that for which differences among group means are the smallest (smallest F-value).
• Delete others one at a time based on F to remove (p to remove) until further removal results in a significant reduction in the ability to discriminate groups.
• Problem: if $X_j$ is excluded, it stays out even if it contributes substantially to discriminating ability once other variables are excluded.
[Figure: backward selection starting with all variables in ($X_1, X_2, X_3, X_4$).
  1. $F_2 < F_1, F_3, F_4$ and $F_2 <$ F to remove ($p >$ p to remove): model is ($X_1, X_3, X_4$).
  2. $F_1 < F_3, F_4$ and $F_1 <$ F to remove ($p >$ p to remove): model is ($X_3, X_4$).
  3. $F_4 >$ F to remove ($p <$ p to remove): final model is ($X_3, X_4$).]
Canonical scores
• Because discriminant functions are functions, we can “plug in” the values for each variable for each observation, and calculate a canonical score for each observation and each discriminant function.
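“Plugging in” can be made concrete with a few lines of Python. This is a sketch under stated assumptions: the constants and coefficients below are invented for illustration (they are not estimates from any data set), and the function name `canonical_scores` is hypothetical. Each function is taken to have the form constant plus a weighted sum of the variables.

```python
def canonical_scores(observations, functions):
    """Score every observation on every discriminant function.

    `functions` is a list of (constant, coefficients) pairs; the score is
    constant + sum of coefficient * variable value.
    """
    return [[const + sum(a * x for a, x in zip(coefs, obs))
             for const, coefs in functions]
            for obs in observations]


# Hypothetical discriminant functions for two variables (X1, X2).
functions = [
    (-1.0, [0.5, 0.25]),   # Z1
    (2.0, [-1.0, 0.5]),    # Z2
]
observations = [[2.0, 4.0], [6.0, 0.0]]

scores = canonical_scores(observations, functions)
for obs, (z1, z2) in zip(observations, scores):
    print(obs, "-> Z1 =", z1, ", Z2 =", z2)
```

Plotting the first column of scores against the second gives the kind of canonical scores plot used to judge group separation.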
Canonical scores plots
• Plots of canonical scores for each object.
• The better the model, the greater the separation between clouds of points representing individual groups, e.g. Fisher’s famous irises.
[Figure: canonical scores plot showing the canonical scores of group means and confidence ellipses.]
Priors
• In standard DFA, it is assumed that, in the absence of any other information, the a priori (prior) probability $\pi_i$ of a given object belonging to group i (i = 1, …, m) is the same for all groups:
$$\pi_1 = \pi_2 = \dots = \pi_m = \frac{1}{m}$$
• But if the groups are not equally likely, the priors should be adjusted to reflect this bias.
• E.g. in species with biased sex ratios, males and females should have unequal priors.
Caveats: unequal priors
• For a given set of discriminant functions, misclassification rates will usually depend on the priors…
• …so that artificially low misclassification rates can be obtained simply by strategically adjusting the priors.
• So, only adjust priors if you are confident that the true frequency of each group in the population is (reasonably) accurately estimated by the group frequencies in the sample.
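The dependence of classification on priors can be illustrated with a deliberately simplified example. The sketch below assumes a single variable, two groups with normal distributions and a common standard deviation, and classification by the largest posterior probability (prior times density, rescaled to sum to 1). The group means, standard deviation, and priors are all invented for illustration, and the function names are hypothetical.

```python
import math


def posteriors(x, means, sd, priors):
    """Posterior probability of each group for an observation x.

    With a common sd the normalising constant of the normal density is the
    same for every group, so it cancels and is omitted.
    """
    weighted = [pi * math.exp(-((x - mu) ** 2) / (2 * sd ** 2))
                for mu, pi in zip(means, priors)]
    total = sum(weighted)
    return [w / total for w in weighted]


def classify(x, means, sd, priors):
    """Index of the group with the largest posterior probability."""
    post = posteriors(x, means, sd, priors)
    return post.index(max(post))


means, sd = [0.0, 2.0], 1.0      # hypothetical group means and common sd
x = 1.2                          # an object slightly closer to group 2

equal = classify(x, means, sd, [0.5, 0.5])
biased = classify(x, means, sd, [0.8, 0.2])
print("equal priors   -> group", equal + 1)
print("priors 0.8/0.2 -> group", biased + 1)
```

The same object is assigned to a different group once the priors change, which is why adjusted priors should reflect real population frequencies rather than be tuned to lower the misclassification rate.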
Significance testing
• Question: which discriminant functions are statistically “significant”?
• To test the significance of all r DFs for m groups based on p variables, calculate Bartlett’s V and compare it to a $\chi^2$ distribution with $p(m-1)$ degrees of freedom:
$$V = \left[N - 1 - \frac{p + m}{2}\right] \sum_{i=1}^{r} \ln(1 + \lambda_i)$$
where $\lambda_i$ is the eigenvalue associated with the ith discriminant function.
Significance testing (cont’d)
• Each DF is tested in a hierarchical fashion by first testing the significance of all DFs combined.
• If all DFs combined are not significant, then no DF is significant.
• If all DFs combined are significant, then remove the first DF, recalculate V (= $V_1$) and test.
• Continue until the residual $V_j$ is no longer significant at df = $(p - j)(m - j - 1)$.
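The hierarchical sequence of test statistics can be tabulated as follows. This sketch assumes the usual form of Bartlett's statistic, $V_j = [N - 1 - (p + m)/2] \sum_{i > j} \ln(1 + \lambda_i)$, with $V_0$ covering all DFs; the eigenvalues and sample size are invented for illustration, and the function name `bartlett_sequence` is hypothetical. Looking up each $V_j$ against the $\chi^2$ distribution is left to a table or a statistics package.

```python
import math


def bartlett_sequence(eigenvalues, N, p, m):
    """Return a list of (j, V_j, df_j), where j DFs have been removed.

    V_0 tests all DFs combined with df = p(m - 1); V_j tests what remains
    after removing the first j DFs, with df = (p - j)(m - j - 1).
    """
    scale = N - 1 - (p + m) / 2
    results = []
    for j in range(len(eigenvalues)):
        v = scale * sum(math.log(1 + lam) for lam in eigenvalues[j:])
        df = (p - j) * (m - j - 1)
        results.append((j, v, df))
    return results


# Hypothetical values: p = 4 variables, m = 3 groups, N = 50 objects,
# and two eigenvalues (at most min(p, m - 1) = 2 DFs).
tests = bartlett_sequence([2.0, 0.1], N=50, p=4, m=3)
for j, v, df in tests:
    print("DFs removed:", j, " V =", round(v, 3), " df =", df)
```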
Caveats/assumptions: tests of significance
• Tests of significance assume that within-group covariance matrices are the same for all groups, and that within groups, observations have a multivariate normal distribution.
• Tests of significance can be very misleading because the jth discriminant function in the population may not appear as the jth discriminant function in the sample due to sampling errors…
• So be careful, especially if the sample is small!
Caveats/assumptions: tests of significance
• If stepwise (forward or backward) procedures are used, significance tests are biased because, given enough variables, significant discriminant functions can be produced by chance alone.
• In such cases, it is advisable to (1) test results with more standard analyses or (2) use randomization procedures whereby objects are randomly assigned to groups.
Assessing classification accuracy I. Raw classification results
• The derived discriminant functions are used to classify all objects in the sample, and a classification table is produced.
• Classification accuracy is likely to be overestimated, since the data used to generate the DFs in the first place are themselves being classified.
[Table: classification table by group, with totals.]
Misclassification ($G_1$) = 5/48
Misclassification ($G_2$) = 8/22
Overall misclassification = 13/70
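The rates quoted on this slide follow mechanically from a classification table. In the sketch below the cell counts are an assumption: the original table is not preserved, so the counts are the ones implied by the stated rates and group totals (43 + 5 = 48 for $G_1$, 8 + 14 = 22 for $G_2$), and the layout (rows as actual group, columns as assigned group) is also assumed. The function name `misclassification` is hypothetical.

```python
from fractions import Fraction

# Assumed layout: table[actual][assigned] = number of objects.
# Counts are inferred from the rates 5/48 and 8/22, not copied from a table.
table = {
    "G1": {"G1": 43, "G2": 5},
    "G2": {"G1": 8, "G2": 14},
}


def misclassification(table):
    """Per-group and overall misclassification rates as exact fractions."""
    rates = {}
    wrong_total = 0
    n_total = 0
    for actual, row in table.items():
        n = sum(row.values())
        wrong = n - row[actual]          # everything off the diagonal
        rates[actual] = Fraction(wrong, n)
        wrong_total += wrong
        n_total += n
    rates["overall"] = Fraction(wrong_total, n_total)
    return rates


rates = misclassification(table)
for name, rate in rates.items():
    print(name, rate, "=", round(float(rate), 3))
```

Note that `Fraction` prints rates in lowest terms, so 8/22 appears as 4/11.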
Assessing classification accuracy II. Jackknifed classification
• Discriminant functions are derived using N – 1 objects, and the object left out is then classified.
• This procedure is repeated for all N objects, each time leaving a different one out, and a classification table is produced.
• In general, jackknifed classification results are worse than raw classification results, but more reliable.
[Table: jackknifed classification table by group, with totals.]
Misclassification ($G_1$) = 7/48
Misclassification ($G_2$) = 9/22
Overall misclassification = 16/70
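The difference between raw and jackknifed classification can be shown with a toy classifier. The sketch below is a simplified stand-in for a full DFA: it uses one variable and assigns each object to the group with the nearest mean. The data and function names are invented for illustration. The point is only the leave-one-out loop: each object is classified by a rule built without it.

```python
def nearest_mean_label(x, training):
    """Assign x to the group whose training mean is closest.

    `training` is a list of (value, label) pairs.
    """
    totals = {}
    for value, label in training:
        s, n = totals.get(label, (0.0, 0))
        totals[label] = (s + value, n + 1)
    return min(totals, key=lambda lab: abs(x - totals[lab][0] / totals[lab][1]))


def raw_errors(data):
    """Misclassifications when the rule is built from ALL objects."""
    return sum(nearest_mean_label(v, data) != lab for v, lab in data)


def jackknife_errors(data):
    """Misclassifications when each object is left out of its own rule."""
    errors = 0
    for i, (value, label) in enumerate(data):
        training = data[:i] + data[i + 1:]       # leave object i out
        if nearest_mean_label(value, training) != label:
            errors += 1
    return errors


# Made-up data: one variable, two groups of four objects.
data = [(1.0, "A"), (2.0, "A"), (3.0, "A"), (4.4, "A"),
        (5.0, "B"), (6.0, "B"), (7.0, "B"), (8.0, "B")]

print("raw misclassifications:       ", raw_errors(data), "of", len(data))
print("jackknifed misclassifications:", jackknife_errors(data), "of", len(data))
```

Here the object with value 4.4 is classified correctly only when it helps define its own group mean, which is the kind of optimism that raw classification results hide.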
Assessing classification accuracy III. Data splitting
• Use a randomly selected 2/3 of the sample data to generate discriminant functions (learning set).
• Use the derived discriminant functions to classify the other 1/3 (test set) and produce a classification table.
• In general, data-splitting classification results are worse than both raw and jackknifed classification results, but more reliable.
[Table: data-splitting classification table by group, with totals.]
Misclassification ($G_1$) = 8/48
Misclassification ($G_2$) = 9/22
Overall misclassification = 17/70
Assessing classification accuracy IV. Bootstrapped data splitting
• Use 2/3 of the sample data (randomly sampled) to generate discriminant functions (learning set).
• Use the derived discriminant functions to classify the other 1/3 (test set) and produce classification results.
• Repeat a large number (e.g. 1000) of times, each time sampling with replacement.
• Generate classification statistics over the bootstrapped samples, e.g. mean classification results, standard errors, etc.
[Table: bootstrapped classification table by group, with totals.]
Misclassification ($G_1$) = 14.2%
Misclassification ($G_2$) = 42.3%
Overall misclassification = 23.0%
Interpreting discriminant functions
• Examine standardized coefficients (coefficients of discriminant functions based on standardized values).
• For interpretation, use variables with large absolute standardized coefficients.
• Examine the discriminant-variable correlations.
• For interpretation, use variables with high correlations with important discriminant functions.
Example: Fisher’s famous irises
• Data: four variables (sepal length, sepal width, petal length, petal width), 3 species, N = 150 (50 for each species).
• Problem: find the “best” set of DFs.
Example: Fisher’s famous irises: between-groups F-matrix
• Matrix entries are F-values from one-way MANOVA comparing group means, and can be considered measures of the distance between group centroids.
• Do not use associated probabilities to determine “significance” unless you correct for multiple tests.
[Table: between-groups F-matrix, species by species.]
Example: Fisher’s famous irises: canonical discriminant functions
• Four variables (sepal length, sepal width, petal length, petal width), 3 species, N = 150 (50 for each species).
[Table: canonical discriminant functions 1 and 2, with rows Constant, SEPALLEN, SEPALWID, PETALLEN, PETALWID.]
Note: discriminant functions are derived using equal priors.
Example: Fisher’s famous irises: standardized canonical discriminant functions
• Four variables (sepal length, sepal width, petal length, petal width), 3 species, N = 150 (50 for each species).
[Table: standardized canonical discriminant functions 1 and 2, with rows SEPALLEN, SEPALWID, PETALLEN, PETALWID.]
Note: standardized canonical discriminant functions are based on standardized values.
• Eigenvalues give the amount of difference among groups captured by a particular discriminant function, and the cumulative proportion of dispersion is the proportion of the total captured by that function together with those before it.
[Table: for discriminant functions 1 and 2, the rows Eigenvalues, Canonical correlation, and Cumulative proportion of dispersion.]
• Canonical correlation is the correlation between a given canonical variate (DF) and a set of two dummy variables representing each group.
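The three summary quantities on this slide are linked, which a short calculation makes visible. This sketch assumes the eigenvalues are those of $W^{-1}B$, in which case the canonical correlation for each function is $\sqrt{\lambda_i / (1 + \lambda_i)}$ and the cumulative proportion of dispersion is the running sum of eigenvalues divided by their total. The eigenvalues used are invented for illustration (they are not the iris values), and the function name `dfa_summary` is hypothetical.

```python
import math


def dfa_summary(eigenvalues):
    """Return (eigenvalue, canonical correlation, cumulative proportion)
    for each discriminant function, in order."""
    total = sum(eigenvalues)
    running = 0.0
    rows = []
    for lam in eigenvalues:
        running += lam
        canonical_r = math.sqrt(lam / (1 + lam))
        rows.append((lam, canonical_r, running / total))
    return rows


# Hypothetical eigenvalues for two discriminant functions.
rows = dfa_summary([3.0, 1.0])
for i, (lam, r, cum) in enumerate(rows, start=1):
    print("DF", i, " eigenvalue =", lam,
          " canonical r =", round(r, 3),
          " cumulative proportion =", round(cum, 3))
```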
Fisher’s irises: raw and jackknifed classification results
• In this case, results are identical (a relatively rare occurrence!)
[Tables: raw and jackknifed classification tables by species, each with % correct and totals.]
Discriminant function analysis: caveats and notes
• Unless the ratio of number of objects to number of variables is large (> 20), standardized coefficients and correlations are unstable.
• DFA is unaffected by differences among variables in scale, so standardization is not required (unlike PCA, FA, etc.).
• Linear DFA is quite sensitive to the assumption of equality of covariance matrices among groups. If this assumption is violated, use quadratic classification.
• However, quadratic DFA is less stable when N is small and normality does not hold.