
Slide 1: A Problem in Personnel Classification

This problem is from Phillip J. Rulon, David V. Tiedeman, Maurice Tatsuoka, and Charles R. Langmuir, Multivariate Statistics for Personnel Classification, 1967. The sample data are for "World Airlines, a company employing over 50,000 persons and operating scheduled flights. This company naturally needs many men who can be assigned to a particular set of functions. The mechanics on the line who service the equipment of World Airlines form one of the groups we shall consider. A second group are the agents who deal with the passengers of the airline. A third group are the men in operations who coordinate airline activities. The personnel officer of World Airlines has developed an Activity Preference Inventory for the use of the airline. The first section of this inventory contains 30 pairs of activities, each pair naming an indoor activity and an outdoor activity. One item is _____ Billiards : Golf _____ The applicant for a job in World Airlines checks the activity he prefers. The score is the number of outdoor activities marked." (page 24)

The second section of the Activity Preference Inventory "contains 35 items. One activity of each pair is a solitary activity, the other convivial. An example is _____ Solitaire : Bridge _____ The apprentice's score is the number of convivial activities he prefers." (page 82)

The third section of the Activity Preference Inventory "contains 25 items. One activity of each pair is a liberal activity, the other a conservative activity. An example is _____ Counseling : Advising _____

Slide 2: A Problem in Personnel Classification (continued)

The apprentice's score is the number of conservative activities he prefers." (page 153)

The Activity Preference Inventory was administered to 244 employees in the three job classifications who were successful and satisfied with their jobs. The dependent variable, JOBCLASS 'Job Classification', included three job classifications: 1 - Passenger Agents, 2 - Mechanics, and 3 - Operations Control. The purpose of the analysis is to develop a classification scheme, based on scores on the Activity Preference Inventory, to assign new employees to the different job groups.

Slide 3: Stage One: Define the Research Problem

In this stage, the following issues are addressed:
- Relationship to be analyzed
- Specifying the dependent and independent variables
- Method for including independent variables

Relationship to be analyzed
We are interested in the relationship between scores on the three scales of the Activity Preference Inventory and the different job classifications.

Specifying the dependent and independent variables
The dependent variable is JOBCLASS 'Job Classification'. The independent variables are OUTDOOR 'Outdoor Activity Score', CONVIV 'Convivial Score', and CONSERV 'Conservative Score'.

Method for including independent variables
Since the purpose of this analysis is to articulate the relationship between the activity scores and job classification, direct entry of the independent variables would be an appropriate method for selecting variables. However, I prefer to use a stepwise method in order to identify which predictors are statistically significant.

Slide 4: Stage 2: Develop the Analysis Plan: Sample Size Issues

In this stage, the following issues are addressed:
- Missing data analysis
- Minimum sample size requirement: 20+ cases per independent variable
- Division of the sample: 20+ cases in each dependent variable group

Missing data analysis
There is no missing data in this data set.

Minimum sample size requirement: 20+ cases per independent variable
The data set contains 244 subjects and 3 independent variables. The ratio of about 81 cases per independent variable exceeds the minimum sample size requirement.

Division of the sample: 20+ cases in each dependent variable group
There were 85 Passenger Agents, 93 Mechanics, and 66 Operations Control staff in the sample. There are more than 20 cases in each dependent variable group.

Slide 5: Stage 2: Develop the Analysis Plan: Measurement Issues

In this stage, the following issues are addressed:
- Incorporating nonmetric data with dummy variables
- Representing curvilinear effects with polynomials
- Representing interaction or moderator effects

Incorporating nonmetric data with dummy variables
None of the independent variables are nonmetric.

Representing curvilinear effects with polynomials
We do not have any evidence of curvilinear effects at this point in the analysis.

Representing interaction or moderator effects
We do not have any evidence at this point in the analysis that we should add interaction or moderator variables.

Slide 6: Stage 3: Evaluate Underlying Assumptions

In this stage, the following issues are addressed:
- Nonmetric dependent variable and metric or dummy-coded independent variables
- Multivariate normality of metric independent variables: assess normality of individual variables
- Linear relationships among variables
- Assumption of equal dispersion for dependent variable groups

Nonmetric dependent variable and metric or dummy-coded independent variables
The dependent variable is nonmetric. All of the independent variables are metric.

Multivariate normality of metric independent variables
Since we do not have a direct test of multivariate normality available, we assess the normality of the individual metric variables.

Slide 7: Run the 'NormalityAssumptionAndTransformations' Script

Slide 8: Complete the 'Test for Assumption of Normality' Dialog Box

Tests of Normality
We find that all three of the independent variables fail the test of normality, and that none of the transformations induced normality in any of the variables. We should note the failure to meet the normality assumption for possible inclusion in our discussion of findings.
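The transcript runs these tests through an SPSS script, but the same per-variable screening can be sketched in Python. The scores below are synthetic stand-ins for the three Activity Preference Inventory scales (the real data are not reproduced in this transcript), so the verdicts here are illustrative only.

```python
# Sketch of per-variable normality screening, using synthetic stand-in
# scores for the three inventory scales (the real data are not included).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
scores = {
    "OUTDOOR": rng.integers(0, 31, 244).astype(float),  # 30-item scale
    "CONVIV":  rng.integers(0, 36, 244).astype(float),  # 35-item scale
    "CONSERV": rng.integers(0, 26, 244).astype(float),  # 25-item scale
}

for name, x in scores.items():
    w, p = stats.shapiro(x)  # Shapiro-Wilk test; H0: the variable is normal
    verdict = "fails" if p < 0.05 else "passes"
    print(f"{name}: W = {w:.3f}, p = {p:.4f} ({verdict} the normality test)")
```

A transformation (log, square root, etc.) would be applied to each failing variable and the test re-run, mirroring the slide's workflow.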

Slide 9: Linear relationships among variables

Since our dependent variable is not metric, we cannot use it to test for linearity of the independent variables. As an alternative, we can plot each metric independent variable against all other independent variables in a scatterplot matrix to look for patterns of nonlinear relationships. If one of the independent variables shows multiple nonlinear relationships to the other independent variables, we consider it a candidate for transformation.
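The slides build this plot in SPSS; a comparable scatterplot matrix can be sketched with pandas. The data frame here is a synthetic stand-in for the inventory scores, and the figure is saved to a hypothetical file name for visual inspection.

```python
# Sketch of the scatterplot-matrix check for nonlinear relationships
# among predictors, on synthetic stand-in data.
import matplotlib
matplotlib.use("Agg")  # render off-screen
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "OUTDOOR": rng.normal(15, 5, 244),
    "CONVIV":  rng.normal(17, 6, 244),
    "CONSERV": rng.normal(12, 4, 244),
})

# one panel per pair of predictors; inspect each panel for curved trends
axes = scatter_matrix(df, diagonal="kde", figsize=(6, 6))
axes[0, 0].figure.savefig("scatter_matrix.png")
```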

Slide 10: Requesting a Scatterplot Matrix

Slide 11: Specifications for the Scatterplot Matrix

Slide 12: The Scatterplot Matrix

Blue fit lines were added to the scatterplot matrix to improve interpretability. Having computed a scatterplot for all combinations of metric independent variables, we identify all of the variables that appear in any plot showing a nonlinear trend. We will call these variables our nonlinear candidates. To identify which of the nonlinear candidates is producing the nonlinear pattern, we look at all of the plots for each of the candidate variables. The candidate variable that is not linear should show up in a nonlinear relationship in several plots with other, linear variables. Hopefully, the form of the plot will suggest the power term that best represents the relationship, e.g. a squared term, a cubed term, etc.

None of the scatterplots show evidence of any nonlinear relationships.

Slide 13: Assumption of equal dispersion for dependent variable groups

Box's M statistic tests for homogeneity of the dispersion matrices across the subgroups of the dependent variable. The null hypothesis is that the dispersion matrices are homogeneous. If the analysis fails this test, we can request classification using separate-group dispersion matrices to see if this improves the model's accuracy rate. Box's M statistic is produced by the SPSS discriminant procedure, so we will defer this question until we have obtained the discriminant analysis output.
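SPSS computes Box's M as part of the discriminant procedure; as a sketch of what the statistic actually measures, here is a small numpy implementation of the standard formula with its chi-square approximation, demonstrated on synthetic groups sized like the three job classifications. This is an illustration of the test, not the SPSS code path.

```python
# Sketch of Box's M test for equality of group covariance matrices.
import numpy as np
from scipy import stats

def boxs_m(groups):
    """Box's M with its chi-square approximation.
    H0: the group covariance (dispersion) matrices are equal."""
    k, p = len(groups), groups[0].shape[1]
    ns = np.array([g.shape[0] for g in groups])
    N = ns.sum()
    covs = [np.cov(g, rowvar=False) for g in groups]          # per-group S_i
    S_pool = sum((n - 1) * S for n, S in zip(ns, covs)) / (N - k)
    M = (N - k) * np.log(np.linalg.det(S_pool)) - sum(
        (n - 1) * np.log(np.linalg.det(S)) for n, S in zip(ns, covs))
    c = ((2 * p**2 + 3 * p - 1) / (6 * (p + 1) * (k - 1))) * (
        (1.0 / (ns - 1)).sum() - 1.0 / (N - k))
    chi2 = (1 - c) * M                                        # approximation
    df = p * (p + 1) * (k - 1) // 2
    return M, chi2, df, stats.chi2.sf(chi2, df)

# demo on synthetic groups sized like the three job classifications
rng = np.random.default_rng(2)
groups = [rng.normal(0, 1, size=(n, 3)) for n in (85, 93, 66)]
M, chi2, df, p_value = boxs_m(groups)
print(f"Box's M = {M:.2f}, chi-square({df}) = {chi2:.2f}, p = {p_value:.3f}")
```

A small p-value (below the conservative 0.01 alpha the slides recommend) would indicate unequal dispersion matrices.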

Slide 14: Stage 4: Estimation of Discriminant Functions and Overall Fit: The Discriminant Functions

In this stage, the following issues are addressed:
- Compute the discriminant analysis
- Overall significance of the discriminant function(s)

Compute the discriminant analysis
The steps to obtain a discriminant analysis are detailed on the following screens.

Slide 15: Requesting a Discriminant Analysis

Slide 16: Specifying the Dependent Variable

Slide 17: Specifying the Independent Variables

Slide 18: Specifying Statistics to Include in the Output

Slide 19: Specifying the Stepwise Method for Selecting Variables

Slide 20: Specifying the Classification Requirement

Slide 21: Complete the Discriminant Analysis Request

Slide 22: Overall significance of the discriminant function(s) - 1

Our first task is to determine whether or not there is a statistically significant relationship between the independent variables and the dependent variable. We navigate to the section of output titled "Summary of Canonical Discriminant Functions" to locate the following outputs.

Recall that the maximum number of discriminant functions is equal to the number of groups in the dependent variable minus one, or the number of variables in the analysis, whichever is smaller. For this problem, the maximum number of discriminant functions is two.

Slide 23: Overall significance of the discriminant function(s) - 2

In the Wilks' Lambda table, SPSS successively tests models with an increasing number of functions. The first line of the table tests the null hypothesis that the mean discriminant scores for the two possible functions are equal in the three groups of the dependent variable. Since the probability of the chi-square statistic for this test is less than 0.0001, we reject the null hypothesis and conclude that there is at least one statistically significant function. Had the probability for this test been larger than 0.05, we would have concluded that there are no discriminant functions which separate the groups of the dependent variable.

The second line of the Wilks' Lambda table tests the null hypothesis that the mean discriminant scores for the second possible discriminant function are equal in the three groups of the dependent variable. Since the probability of the chi-square statistic for this test is less than 0.0001, we reject the null hypothesis and conclude that the second discriminant function, as well as the first, is statistically significant. Had the probability for this test been larger than 0.05, we would have concluded that there is only one discriminant function to separate the groups of the dependent variable.

Our conclusion from this output is that there are two statistically significant discriminant functions for this problem.
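The first line of SPSS's Wilks' Lambda table can be sketched directly from group data: Lambda is the ratio of the within-groups to the total sum-of-squares-and-cross-products determinants, and Bartlett's chi-square approximation tests it. The groups below are synthetic with clearly separated means, so this reproduces only the overall (first-line) test, not the residual tests for later functions.

```python
# Sketch of the overall Wilks' Lambda test (Bartlett's approximation),
# on synthetic three-group data with separated means.
import numpy as np
from scipy import stats

def wilks_lambda(groups):
    """Overall Wilks' Lambda = |W| / |T|, with Bartlett's chi-square."""
    X = np.vstack(groups)
    N, p = X.shape
    k = len(groups)
    dev = X - X.mean(axis=0)
    T = dev.T @ dev                                       # total SSCP matrix
    W = sum((g - g.mean(axis=0)).T @ (g - g.mean(axis=0)) for g in groups)
    lam = np.linalg.det(W) / np.linalg.det(T)
    chi2 = -(N - 1 - (p + k) / 2) * np.log(lam)           # Bartlett's statistic
    df = p * (k - 1)
    return lam, chi2, df, stats.chi2.sf(chi2, df)

# demo: three synthetic groups sized like the job classifications
rng = np.random.default_rng(3)
centers = ([0, 5, 1], [4, 1, 2], [2, 2, 5])
groups = [rng.normal(c, 1.5, size=(n, 3)) for c, n in zip(centers, (85, 93, 66))]
lam, chi2, df, p_value = wilks_lambda(groups)
print(f"Lambda = {lam:.3f}, chi-square({df}) = {chi2:.1f}, p = {p_value:.2g}")
```

A probability below 0.05, as here, leads to the same conclusion the slide draws: at least one discriminant function separates the groups.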

Slide 24: Stage 4: Estimation of Discriminant Functions and Overall Fit: Assessing Model Fit

In this stage, the following issues are addressed:
- Assumption of equal dispersion for dependent variable groups
- Classification accuracy chance criteria
- Press's Q statistic
- Presence of outliers

Slide 25: Assumption of equal dispersion for dependent variable groups

In discriminant analysis, the best measure of overall fit is classification accuracy. The appropriateness of using the pooled covariance matrix in the classification phase is evaluated by Box's M statistic. We examine the probability of Box's M statistic to determine whether or not we meet the assumption of equal dispersion or covariance matrices (a multivariate measure of variance). This test is very sensitive, so we should select a conservative alpha value of 0.01. At that alpha level, we fail to reject the null hypothesis for this analysis. Had we failed this test, our remedy would be to re-run the discriminant analysis requesting the use of separate covariance matrices in classification.

Slide 26: Classification accuracy chance criteria - 1

The classification matrix for this problem computed by SPSS is shown below.

Slide 27: Classification accuracy chance criteria - 2

Following the text, we compare the accuracy rate for the cross-validated sample (75.0%) to each of the by-chance accuracy rates. In the table of Prior Probabilities for Groups, we see that the three groups contained .348, .381, and .270 of the sample of 244 cases used to derive the discriminant model.

The proportional chance criterion for assessing model fit is calculated by summing the squared proportion that each group represents of the sample, in this case (0.348 x 0.348) + (0.381 x 0.381) + (0.270 x 0.270) = 0.339. Based on the requirement that model accuracy be 25% better than the chance criterion, the standard for comparing the model's accuracy is 1.25 x 0.339 = 0.424. Our model accuracy rate of 75% exceeds this standard.

The maximum chance criterion is the proportion of cases in the largest group, 38.1% in this problem. Based on the requirement that model accuracy be 25% better than the chance criterion, the standard for comparing the model's accuracy is 1.25 x 38.1% = 47.6%. Our model accuracy rate of 75% exceeds this standard.
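The chance-criterion arithmetic on this slide is easy to verify from the group counts. Using the exact counts (85, 93, 66 of 244) rather than the rounded proportions shifts the third decimal slightly.

```python
# Verifying the by-chance accuracy standards from the slide.
props = [85 / 244, 93 / 244, 66 / 244]   # agents, mechanics, operations

c_pro = sum(p * p for p in props)        # proportional chance criterion
c_max = max(props)                       # maximum chance criterion

print(f"proportional: {c_pro:.3f}, 1.25x standard: {1.25 * c_pro:.3f}")  # 0.340, 0.425
print(f"maximum:      {c_max:.3f}, 1.25x standard: {1.25 * c_max:.3f}")  # 0.381, 0.476
```

The cross-validated accuracy of 75.0% clears both standards, as the slide concludes.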

Slide 28: Press's Q statistic

Substituting the values for this problem (244 cases, 183 correct classifications, and 3 groups) into the formula for Press's Q statistic, we obtain a value of [244 - (183 x 3)]^2 / [244 x (3 - 1)] = 190.6. This value exceeds the critical value of 6.63 (Text, page 305), so we conclude that the prediction accuracy is greater than that expected by chance.

By all three criteria, we would interpret our model as having an accuracy above that expected by chance. Thus, this is a valuable or useful model that supports predictions of the dependent variable.
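The Press's Q computation above can be checked with a few lines:

```python
# Press's Q for the classification results on this slide.
def press_q(N, n_correct, K):
    """Press's Q = [N - nK]^2 / (N(K - 1)), where n is the number of
    correct classifications and K is the number of groups."""
    return (N - n_correct * K) ** 2 / (N * (K - 1))

q = press_q(N=244, n_correct=183, K=3)
print(f"Press's Q = {q:.1f}")   # 190.6, well above the 6.63 critical value
```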

Slide 29: Presence of outliers - 1

SPSS prints Mahalanobis distance scores for each case in the table of Casewise Statistics, so we can use this as a basis for detecting outliers. According to the SPSS Applications Guide, p. 227, cases with large values of the Mahalanobis distance from their group mean can be identified as outliers. For large samples from a multivariate normal distribution, the square of the Mahalanobis distance from a case to its group mean is approximately distributed as a chi-square statistic with degrees of freedom equal to the number of variables in the analysis. The critical value of chi-square with 3 degrees of freedom (the stepwise procedure entered three variables into the function) and an alpha of 0.01 (we only want to detect major outliers) is 11.345.

We can request this figure from SPSS using the following compute command:

COMPUTE mahcutpt = IDF.CHISQ(0.99,3).
EXECUTE.

Here 0.99 is the cumulative probability up to the significance level of interest and 3 is the number of degrees of freedom. SPSS will create a column of values in the data set that contains the desired value. We scan the table of Casewise Statistics to identify any cases that have a squared Mahalanobis distance greater than 11.345 for the group to which the case is most likely to belong, i.e. under the column labeled 'Highest Group.'
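The same cutoff that SPSS's IDF.CHISQ function returns is the 0.99 quantile of a chi-square distribution with 3 degrees of freedom, which scipy computes directly:

```python
# Equivalent of the SPSS command COMPUTE mahcutpt = IDF.CHISQ(0.99,3):
# the inverse chi-square CDF at 0.99 with 3 degrees of freedom.
from scipy.stats import chi2

cutoff = chi2.ppf(0.99, df=3)
print(f"Mahalanobis D^2 cutoff = {cutoff:.3f}")   # 11.345
```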

Slide 30: Presence of outliers - 2

In this particular analysis, I find one case, number 23, with a large enough Mahalanobis distance to indicate that it is an outlier and might be considered for removal from the analysis. However, since it is only one case out of 244, it is not likely to make any difference, so we will forego re-running the analysis without this case.

Slide 31: Stage 5: Interpret the Results

In this section, we address the following issues:
- Number of functions to be interpreted
- Relationship of functions to categories of the dependent variable
- Assessing the contribution of predictor variables
- Impact of multicollinearity on solution

Number of functions to be interpreted
As indicated previously, there are two significant discriminant functions to be interpreted.

Slide 32: Role of functions in differentiating categories of the dependent variable

The combined-groups scatterplot enables us to link the discriminant functions to the categories of the dependent variable. I have modified the SPSS output by changing the symbols for the different points so that we can easily detect the group members. In addition, I have added reference lines at the zero value for each axis. Analyzing this plot, we see that the first function differentiates Passenger Agents from Mechanics and Operations Control personnel. The second function differentiates Mechanics from Operations Control staff.

Slide 33: Assessing the contribution of predictor variables - 1

Identifying the statistically significant predictor variables
The summary table of variables entering and leaving the discriminant functions is shown below. We can see that we have three independent variables included in the analysis, in the order shown in the table. We would conclude that all three of the independent variables, Outdoor Activity Score, Convivial Score, and Conservative Score, make a statistically significant contribution to group membership on the dependent variable.

Slide 34: Assessing the contribution of predictor variables - 2

Importance of variables and the structure matrix
To determine which predictor variables are more important in predicting group membership when we use a stepwise method of variable selection, we can simply look at the order in which the variables entered, as shown in the following table.

Slide 35: Assessing the contribution of predictor variables - 3

While we know which variables were important to the overall analysis, we are also concerned with which variables are important to which discriminant function. This information is provided by the structure matrix, which is a rotated correlation matrix containing the correlations between each of the independent variables and the discriminant function scores. From the structure matrix, we see that two of the three variables entered into the functions (Convivial Score and Conservative Score) are the important variables in the first discriminant function, while Outdoor Activity Score is the important variable on the second function.

Slide 36: Assessing the contribution of predictor variables - 4

Comparing group means to determine direction of relationships
If we examine the pattern of means for the three statistically significant variables for the three job classifications, we can provide a fuller discussion of the relationships between the independent variables, the dependent variable groups, and the discriminant functions.
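The group-means comparison is a simple groupby-and-average; the sketch below uses a synthetic stand-in data set with the slide's group sizes and variable names (the real inventory scores are not reproduced in this transcript), so the means themselves are illustrative.

```python
# Sketch of the group-means table, on synthetic stand-in data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
labels = {1: "Passenger Agents", 2: "Mechanics", 3: "Operations Control"}
sizes = {1: 85, 2: 93, 3: 66}
frames = []
for g, n in sizes.items():
    frames.append(pd.DataFrame({
        "JOBCLASS": labels[g],
        "OUTDOOR": rng.normal(15, 4, n),
        "CONVIV":  rng.normal(17, 5, n),
        "CONSERV": rng.normal(12, 4, n),
    }))
df = pd.concat(frames, ignore_index=True)

# one row per job classification, one column per predictor
group_means = df.groupby("JOBCLASS")[["OUTDOOR", "CONVIV", "CONSERV"]].mean()
print(group_means.round(2))
```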

Slide 37: Assessing the contribution of predictor variables - 5

The first discriminant function distinguishes Passenger Agents from Mechanics and Operations Control staff. The two variables that are important on the first function are the convivial score and the conservative score. Passenger Agents had higher convivial scores and lower conservative scores than the other two groups. Operations Control staff are distinguished from Mechanics by the second discriminant function, which contains only a single variable, the outdoor activity score. Mechanics had a higher average on the outdoor activity score than did Operations Control staff.

In sum, Passenger Agents are more outgoing (convivial) and more tolerant (less conservative) than Mechanics and Operations Control personnel. Mechanics differ from Operations Control personnel in their stronger preference for outdoor-oriented activities.

Slide 38: Impact of multicollinearity on solution

Multicollinearity is indicated in SPSS discriminant analysis by very small tolerance values for variables, e.g. less than 0.10 (0.10 is the size of the tolerance, not its significance value). If we look at the table of Variables Not in the Analysis, we see that nothing is printed for step 3, indicating that all variables were included in the analysis. Multicollinearity is not an issue in this problem.

Slide 39: Stage 6: Validate the Model

In this stage, we are normally concerned with the following issues:
- Conducting the validation analysis
- Generalizability of the discriminant model

Conducting the validation analysis
To validate the discriminant analysis, we can randomly divide our sample into two groups: a screening sample and a validation sample. The analysis is computed for the screening sample and used to predict membership on the dependent variable in the validation sample. If the model in the screening sample is valid, we would expect the accuracy rates for both samples to be about the same. In the double cross-validation strategy, we reverse the designation of the screening and validation samples and re-run the analysis. We can then compare the discriminant functions derived for both samples. If the two sets of functions contain a very different set of variables, it indicates that the variables might have achieved significance because of the sample size and not because of the strength of the relationship. Our finding about these individual variables would be that their predictive utility is not generalizable.
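The split-sample validation strategy can be sketched with scikit-learn's linear discriminant analysis standing in for the SPSS procedure, and synthetic data standing in for the real inventory scores; the accuracies printed are therefore illustrative only.

```python
# Sketch of split-sample validation with LDA as a stand-in for the SPSS
# discriminant procedure, on synthetic three-group data.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(5)
centers = {1: (10, 25, 8), 2: (20, 12, 15), 3: (14, 15, 12)}  # hypothetical
sizes = {1: 85, 2: 93, 3: 66}
X = np.vstack([rng.normal(centers[g], 4.0, size=(sizes[g], 3)) for g in sizes])
y = np.concatenate([np.full(sizes[g], g) for g in sizes])

screen = rng.random(len(y)) < 0.5        # random half: the screening sample
lda = LinearDiscriminantAnalysis().fit(X[screen], y[screen])
screen_acc = lda.score(X[screen], y[screen])
valid_acc = lda.score(X[~screen], y[~screen])   # accuracy in the holdout
print(f"screening accuracy {screen_acc:.1%}, validation accuracy {valid_acc:.1%}")
# for double cross-validation, swap the roles of the two halves and re-fit
```

Roughly equal screening and validation accuracies, as the slide explains, support the generalizability of the model.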

Slide 40: Set the Starting Point for Random Number Generation

Slide 41: Compute the Variable to Randomly Split the Sample into Two Halves

Slide 42: Specify the Cases to Include in the First Screening Sample

Slide 43: Specify the Value of the Selection Variable for the First Validation Analysis

Slide 44: Specify the Value of the Selection Variable for the Second Validation Analysis

Slide 45: Generalizability of the Discriminant Model

We base our decisions about the generalizability of the discriminant model on a table which compares key outputs of the analysis with the full data set to each of the validation runs.

                                        Full Model   Split = 0   Split = 1
Number of Significant Functions         2            2           2
Cross-validated Accuracy                75.0%        74.8%       76.0%
Accuracy Rate for Validation Sample     --           77.7%       76.4%

Significant coefficients (p < 0.05), in order of entry:
  Full Model: 1. OUTDOOR Outdoor Activity Score, 2. CONVIV Convivial Score, 3. CONSERV Conservative Score
  Split = 0:  1. CONVIV Convivial Score, 2. OUTDOOR Outdoor Activity Score, 3. CONSERV Conservative Score
  Split = 1:  1. OUTDOOR Outdoor Activity Score, 2. CONVIV Convivial Score, 3. CONSERV Conservative Score

In both of the validation analyses, two significant discriminant functions were found. The cross-validated accuracy rates and the accuracy rates for the validation samples were approximately the same size. Both validation analyses included the three available independent variables, though the order of entry differed. The results of the validation analyses are similar to the model with the full data set. We can conclude that the model is generalizable.

