
1 SW388R7 Data Analysis & Computers II Slide 1 Multiple Regression – Split Sample Validation General criteria for split sample validation Sample problems

2 SW388R7 Data Analysis & Computers II Slide 2 General criteria for split sample validation It is expected that the results obtained from split sample validations will vary somewhat from the results obtained from the analysis using the full data set. We will use the following as our criteria that the validation verified our analysis and supports the generalizability of our findings: First, the overall relationship between the dependent variable and the set of independent variables must be statistically significant for both validation analyses. Second, the R² for each validation must be within 5 percentage points (plus or minus) of the R² for the model using the full sample.

3 SW388R7 Data Analysis & Computers II Slide 3 General criteria for split sample validation - 2  Third, the pattern of statistical significance for the coefficients of the independent variables for both validation analyses must be the same as the pattern for the full analysis, i.e. the same variables are statistically significant or not significant.  For stepwise multiple regression, we require that the same variables be significant, but it is not required that they enter in exactly the same order.  For hierarchical multiple regression, the R² change for the validation must be statistically significant.
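
The three general criteria can be sketched as a small check. This is only an illustration, not part of the SPSS procedure: the function name, the dictionary keys, and the reading of the 5% tolerance as 0.05 on the R² scale are assumptions made for the example.

```python
def validation_supported(full, splits, alpha=0.05, r2_tolerance=0.05):
    """Return (supported, reasons) for a split sample validation.

    `full` and each item of `splits` are dicts with these (assumed) keys:
      "p_overall"   - probability of the overall F statistic
      "r2"          - R squared as a proportion (0.307, not 30.7)
      "significant" - set of predictor names significant at `alpha`
    """
    reasons = []
    for i, split in enumerate(splits, start=1):
        # Criterion 1: the overall relationship must be significant.
        if split["p_overall"] > alpha:
            reasons.append(f"validation {i}: overall relationship not significant")
        # Criterion 2: R squared must be close to the full-sample value.
        if abs(split["r2"] - full["r2"]) > r2_tolerance:
            reasons.append(f"validation {i}: R squared not within tolerance")
        # Criterion 3: the same predictors must be significant.
        if split["significant"] != full["significant"]:
            reasons.append(f"validation {i}: different pattern of significance")
    return (not reasons, reasons)
```

The function returns the list of failed criteria along with the verdict, so it is easy to see which row of a validation table caused the failure.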

4 SW388R7 Data Analysis & Computers II Slide 4 Notes Findings are stated based on the results of the analysis of the full data set, not the validations. If our validation analysis does not support the findings of the analysis on the full data set, we will declare the answer to the problem to be false. There is, however, another common option, which is to cite statistical significance only for independent variables supported by the validation analysis as well as the full data set analysis. All other variables are considered to be non-significant. Generally it is the independent variables with the weakest individual relationship to the dependent variable which fail to validate.
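
The alternative option mentioned in these notes amounts to a set intersection. A minimal sketch, with a hypothetical function name and with the lists of significant variables supplied by hand from the regression output:

```python
def validated_predictors(full_significant, validation_significant):
    """Predictors significant in the full analysis and in every validation."""
    supported = set(full_significant)
    for significant in validation_significant:
        # Keep only the predictors that are also significant in this validation.
        supported &= set(significant)
    return supported
```

Any predictor missing from the returned set would be reported as non-significant under this option.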

5 SW388R7 Data Analysis & Computers II Slide 5 Problem 1 1. In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.05 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions. Validate the results of your regression analysis by splitting the sample in two, using 552804 as the random number seed. The variables "age" [age], "highest year of school completed" [educ], and "sex" [sex] have a moderate relationship to the variable "occupational prestige score" [prestg80]. Survey respondents who were older had more prestigious occupations. Survey respondents who had completed more years of school had more prestigious occupations. The variable sex did not have a relationship to "occupational prestige score" [prestg80]. 1. True 2. True with caution 3. False 4. Inappropriate application of a statistic

6 SW388R7 Data Analysis & Computers II Slide 6 Steps prior to the validation analysis Prior to the split sample validation analysis, we must test for conformity to assumptions and examine outliers, making whatever transformations are needed and removing outliers. Next, we must solve the regression problem to make certain that the findings (existence, strength, direction, and importance of relationships) stated in the problem are correct and, therefore, in need of validation before final interpretation. When we do the validation, we include whatever transformations and omission of outliers were present in the model we want to validate.

7 SW388R7 Data Analysis & Computers II Slide 7 OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT VARIABLES - 1 The probability of the F statistic (36.639) for the overall regression relationship is <0.001, less than or equal to the level of significance of 0.05. We reject the null hypothesis that there is no relationship between the set of independent variables and the dependent variable (R² = 0). We support the research hypothesis that there is a statistically significant relationship between the set of independent variables and the dependent variable.

8 SW388R7 Data Analysis & Computers II Slide 8 OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT VARIABLES - 2 The Multiple R for the relationship between the set of independent variables and the dependent variable is 0.554, which would be characterized as moderate using the rule of thumb that a correlation less than or equal to 0.20 is characterized as very weak; greater than 0.20 and less than or equal to 0.40 is weak; greater than 0.40 and less than or equal to 0.60 is moderate; greater than 0.60 and less than or equal to 0.80 is strong; and greater than 0.80 is very strong. The R Square statistic is used in the validation analysis. In this example, the proportion of variance in the dependent variable explained by all of the independent variables is 30.7%.
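
The rule of thumb for characterizing Multiple R can be written as a simple lookup. The function name is made up for this sketch; the cut points are the ones stated on the slide.

```python
def describe_strength(r):
    """Label a correlation using the rule-of-thumb cut points from the slide."""
    size = abs(r)
    if size <= 0.20:
        return "very weak"
    if size <= 0.40:
        return "weak"
    if size <= 0.60:
        return "moderate"
    if size <= 0.80:
        return "strong"
    return "very strong"
```

For this problem, describe_strength(0.554) gives "moderate". Squaring Multiple R gives R Square: 0.554 squared is about 0.307, the 30.7% reported above.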

9 SW388R7 Data Analysis & Computers II Slide 9 RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 1 For the independent variable age, the probability of the t statistic (2.749) for the b coefficient is 0.006 which is less than or equal to the level of significance of 0.05. We reject the null hypothesis that the slope associated with age is equal to zero (b = 0) and conclude that there is a statistically significant relationship between age and occupational prestige score. The b coefficient associated with age (0.125) is positive, indicating a direct relationship in which higher numeric values for age are associated with higher numeric values for occupational prestige score. Therefore, the positive value of b implies that survey respondents who were older had more prestigious occupations.

10 SW388R7 Data Analysis & Computers II Slide 10 RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 2 For the independent variable highest year of school completed, the probability of the t statistic (10.352) for the b coefficient is <0.001 which is less than or equal to the level of significance of 0.05. We reject the null hypothesis that the slope associated with highest year of school completed is equal to zero (b = 0) and conclude that there is a statistically significant relationship between highest year of school completed and occupational prestige score. The b coefficient associated with highest year of school completed (2.810) is positive, indicating a direct relationship in which higher numeric values for highest year of school completed are associated with higher numeric values for occupational prestige score. Therefore, the positive value of b implies that survey respondents who had completed more years of school had more prestigious occupations.

11 SW388R7 Data Analysis & Computers II Slide 11 RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 3 For the independent variable sex, the probability of the t statistic (-0.917) for the b coefficient is 0.360 which is greater than the level of significance of 0.05. We fail to reject the null hypothesis that the slope associated with sex is equal to zero (b = 0) and conclude that there is not a statistically significant relationship between sex and occupational prestige score.
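
The decision rule applied to each coefficient on the last three slides can be summarized in a few lines. This is a sketch with a hypothetical function name; the probability and b coefficient are read from the SPSS coefficients table rather than computed here.

```python
def interpret_coefficient(p_value, b=None, alpha=0.05):
    """Apply the significance rule, then read the direction from the sign of b."""
    if p_value > alpha:
        # Fail to reject the null hypothesis that the slope is zero.
        return "not significant"
    if b is None:
        return "significant"
    return "significant, direct" if b > 0 else "significant, inverse"
```

With the values above, age (p = 0.006, b = 0.125) is significant with a direct relationship, while sex (p = 0.360) is not significant.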

12 SW388R7 Data Analysis & Computers II Slide 12 Setting the random number seed To set the random number seed, select the Random Number Seed… command from the Transform menu.

13 SW388R7 Data Analysis & Computers II Slide 13 Set the random number seed First, click on the Set seed to option button to activate the text box. Second, type in the random seed stated in the problem. Third, click on the OK button to complete the dialog box. Note that SPSS does not provide you with any feedback about the change.

14 SW388R7 Data Analysis & Computers II Slide 14 Select the compute command To enter the formula for the variable that will split the sample in two parts, click on the Compute… command.

15 SW388R7 Data Analysis & Computers II Slide 15 The formula for the split variable First, type the name for the new variable, split, into the Target Variable text box. Second, the formula for the value of split is shown in the text box. The uniform(1) function generates a random decimal number between 0 and 1. The random number is compared to the value 0.50. If the random number is less than or equal to 0.50, the value of the formula will be 1, the SPSS numeric equivalent to true. If the random number is larger than 0.50, the formula will return a 0, the SPSS numeric equivalent to false. Third, click on the OK button to complete the dialog box.
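
The logic of the split formula can be sketched outside SPSS as follows. This is an analogy only: Python's random number generator is not the one SPSS uses, so the same seed will not reproduce the SPSS assignments, and the function name is made up for the example.

```python
import random


def make_split(n_cases, seed):
    """Assign each case a 0 or 1 by comparing a uniform random number to 0.50."""
    rng = random.Random(seed)  # seeded so the split can be reproduced
    # 1 when the random number is <= 0.50 (true), otherwise 0 (false).
    return [1 if rng.random() <= 0.50 else 0 for _ in range(n_cases)]
```

Calling make_split twice with the same seed returns the same pattern of zeros and ones, which is the reason the problem specifies a random number seed.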

16 SW388R7 Data Analysis & Computers II Slide 16 The split variable in the data editor In the data editor, the split variable shows a random pattern of zeros and ones. To select half of the sample for each validation analysis, we will first select the cases where split = 0, then select the cases where split = 1.

17 SW388R7 Data Analysis & Computers II Slide 17 Repeat the regression with first validation sample To repeat the multiple regression analysis for the first validation sample, select Linear Regression from the Dialog Recall tool button.

18 SW388R7 Data Analysis & Computers II Slide 18 Using "split" as the selection variable First, scroll down the list of variables and highlight the variable split. Second, click on the right arrow button to move the split variable to the Selection Variable text box.

19 SW388R7 Data Analysis & Computers II Slide 19 Setting the value of split to select cases When the variable named split is moved to the Selection Variable text box, SPSS adds "=?" after the name to prompt us to enter a specific value for split. Click on the Rule… button to enter a value for split.

20 SW388R7 Data Analysis & Computers II Slide 20 Completing the value selection First, type the value for the first half of the sample, 0, into the Value text box. Second, click on the Continue button to complete the value entry.

21 SW388R7 Data Analysis & Computers II Slide 21 Requesting output for the first validation sample When the value entry dialog box is closed, SPSS adds the value we entered after the equal sign. This specification now tells SPSS to include in the analysis only those cases that have a value of 0 for the split variable. Click on the OK button to request the output. Since the validation analysis requires us to compare the results of the analysis using the two split samples, we will request the output for the second sample before doing any comparison.

22 SW388R7 Data Analysis & Computers II Slide 22 Repeat the regression with second validation sample To repeat the multiple regression analysis for the second validation sample, select Linear Regression from the Dialog Recall tool button.

23 SW388R7 Data Analysis & Computers II Slide 23 Setting the value of split to select cases Since the split variable is already in the Selection Variable text box, we only need to change its value. Click on the Rule… button to enter a different value for split.

24 SW388R7 Data Analysis & Computers II Slide 24 Completing the value selection First, type the value for the second half of the sample, 1, into the Value text box. Second, click on the Continue button to complete the value entry.

25 SW388R7 Data Analysis & Computers II Slide 25 Requesting output for the second validation sample When the value entry dialog box is closed, SPSS adds the value we entered after the equal sign. This specification now tells SPSS to include in the analysis only those cases that have a value of 1 for the split variable. Click on the OK button to request the output.

26 SW388R7 Data Analysis & Computers II Slide 26 SPLIT-SAMPLE VALIDATION - 1 In both of the split-sample validation analyses, the relationship between the independent variables and the dependent variable was statistically significant. In the first validation, the probability for the F statistic testing overall relationship was <0.001. For the second validation analysis, the probability for the F statistic testing overall relationship was <0.001. Thus far, the validation verifies the existence of the relationship between the dependent variable and the independent variables.

27 SW388R7 Data Analysis & Computers II Slide 27 SPLIT-SAMPLE VALIDATION - 2 The proportion of variance in the relationship utilizing the full data set was 30.7% compared to 24.5% for the first split sample validation and 39.6% for the second split sample validation. In both of the split-sample validation analyses, the proportion of variance in the dependent variable explained by the independent variables was not within 5% of the variance explained in the model using the full data set (30.7%). The strength of the relationship between the independent variables and the dependent variable is not supported by the validation. The answer to the question is false, because the validation did not verify the strength of the relationship in the analysis of the full data set.

28 SW388R7 Data Analysis & Computers II Slide 28 Table of validation results: standard regression It may be helpful to create a table for our validation results and fill in its cells as we complete the analysis.
Columns: Full Data Set; Split = 0 (Split1 = 1); Split = 1 (Split2 = 1)
ANOVA significance (sig <= 0.05): <0.001; <0.001; <0.001
R²: 0.307; 0.245; 0.396
Significant Coefficients (sig <= 0.05): Age of respondent, Highest year of school completed; (age of respondent not significant); Age of respondent, Highest year of school completed
This table shows us that the validation failed in the strength of relationship row (R²), where the R² for the validations did not fall within 5% of the R² for the full data set. Had we satisfied this criterion, the validation would have failed in the Significant Coefficients row, where age of respondent was not statistically significant in the first validation analysis.

29 SW388R7 Data Analysis & Computers II Slide 29 Problem 2 1. In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.05 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions. Validate the results of your regression analysis by splitting the sample in two, using 397155 as the random number seed. After controlling for the effects of the variables "age" [age] and "sex" [sex], the addition of the variable "happiness of marriage" [hapmar] reduces the error in predicting "general happiness" [happy] by 37.0%. After controlling for age and sex, the variable happiness of marriage makes an individual contribution to reducing the error in predicting general happiness. Survey respondents who were less happy with their marriages were less happy overall. 1. True 2. True with caution 3. False 4. Inappropriate application of a statistic

30 SW388R7 Data Analysis & Computers II Slide 30 Steps prior to the validation analysis Prior to the split sample validation analysis, we must test for conformity to assumptions and examine outliers, making whatever transformations are needed and removing outliers. Next, we must solve the regression problem to make certain that the findings (existence, strength, direction, and importance of relationships) stated in the problem are correct and, therefore, in need of validation before final interpretation. When we do the validation, we include whatever transformations and omission of outliers were present in the model we want to validate. For hierarchical regression, the R² change statistics must be statistically significant in order for the model to be validated.

31 SW388R7 Data Analysis & Computers II Slide 31 OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT VARIABLES - 1 The probability of the F statistic (25.476) for the overall regression relationship for all independent variables is <0.001, less than or equal to the level of significance of 0.05. We reject the null hypothesis that there is no relationship between the set of all independent variables and the dependent variable (R² = 0). We support the research hypothesis that there is a statistically significant relationship between the set of all independent variables and the dependent variable.

32 SW388R7 Data Analysis & Computers II Slide 32 OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT VARIABLES - 2 The R Square Change statistic for the increase in R² associated with the added variables (happiness of marriage) is 0.370. Using a proportional reduction in error interpretation for R², information provided by the added variables reduces our error in predicting general happiness by 37.0%. The R Square statistic is used in the validation analysis. In this example, the proportion of variance in the dependent variable explained by all of the independent variables is 37.2%.

33 SW388R7 Data Analysis & Computers II Slide 33 OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT VARIABLES - 3 The probability of the F statistic (75.972) for the change in R² associated with the addition of the predictor variables to the regression analysis containing the control variables is <0.001, less than or equal to the level of significance of 0.05. We reject the null hypothesis that there is no improvement in the relationship between the set of independent variables and the dependent variable when the predictors are added (R² Change = 0). We support the research hypothesis that there is a statistically significant improvement in the relationship between the set of independent variables and the dependent variable.
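
The F statistics reported for this problem follow from the R² values by the standard formulas, sketched below. The function names are made up, and the residual degrees of freedom must come from the SPSS ANOVA table; the slides do not state them. The value 129 mentioned after the code is an assumption, back-calculated from the reported statistics.

```python
def overall_f(r2, n_predictors, df_residual):
    """F statistic for the null hypothesis R squared = 0."""
    return (r2 / n_predictors) / ((1 - r2) / df_residual)


def r2_change_f(r2_change, r2_full, n_added, df_residual):
    """F statistic for the null hypothesis R squared change = 0.

    r2_full and df_residual refer to the model containing all variables.
    """
    return (r2_change / n_added) / ((1 - r2_full) / df_residual)
```

Under the assumed 129 residual degrees of freedom, r2_change_f(0.370, 0.372, 1, 129) is about 76.0 and overall_f(0.372, 3, 129) is about 25.5, close to the reported 75.972 and 25.476. The small gaps reflect rounding of R² to three decimals.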

34 SW388R7 Data Analysis & Computers II Slide 34 RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 1 For the independent variable happiness of marriage, the probability of the t statistic (8.716) for the b coefficient is <0.001 which is less than or equal to the level of significance of 0.05. We reject the null hypothesis that the slope associated with happiness of marriage is equal to zero (b = 0) and conclude that there is a statistically significant relationship between happiness of marriage and general happiness.

35 SW388R7 Data Analysis & Computers II Slide 35 RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 2 The b coefficient associated with happiness of marriage (0.658) is positive, indicating a direct relationship in which higher numeric values for happiness of marriage are associated with higher numeric values for general happiness. The independent variable happiness of marriage is an ordinal variable that is coded so that higher numeric values are associated with survey respondents who were less happy with their marriages. The dependent variable general happiness is also an ordinal variable. It is coded so that higher numeric values are associated with survey respondents who were less happy overall. Therefore, the positive value of b implies that survey respondents who were less happy with their marriages were less happy overall.

36 SW388R7 Data Analysis & Computers II Slide 36 RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 3 Age and sex are variables whose effects are controlled for, so there is no requirement that they be statistically significant.

37 SW388R7 Data Analysis & Computers II Slide 37 Setting the random number seed To set the random number seed, select the Random Number Seed… command from the Transform menu.

38 SW388R7 Data Analysis & Computers II Slide 38 Set the random number seed First, click on the Set seed to option button to activate the text box. Second, type in the random seed stated in the problem. Third, click on the OK button to complete the dialog box. Note that SPSS does not provide you with any feedback about the change.

39 SW388R7 Data Analysis & Computers II Slide 39 Select the compute command To enter the formula for the variable that will split the sample in two parts, click on the Compute… command.

40 SW388R7 Data Analysis & Computers II Slide 40 The formula for the split variable First, type the name for the new variable, split, into the Target Variable text box. Second, the formula for the value of split is shown in the text box. The uniform(1) function generates a random decimal number between 0 and 1. The random number is compared to the value 0.50. If the random number is less than or equal to 0.50, the value of the formula will be 1, the SPSS numeric equivalent to true. If the random number is larger than 0.50, the formula will return a 0, the SPSS numeric equivalent to false. Third, click on the OK button to complete the dialog box.

41 SW388R7 Data Analysis & Computers II Slide 41 The split variable in the data editor In the data editor, the split variable shows a random pattern of zeros and ones. To select half of the sample for each validation analysis, we will first select the cases where split = 0, then select the cases where split = 1.

42 SW388R7 Data Analysis & Computers II Slide 42 Repeat the regression with first validation sample To repeat the multiple regression analysis for the first validation sample, select Linear Regression from the Dialog Recall tool button.

43 SW388R7 Data Analysis & Computers II Slide 43 Using "split" as the selection variable First, scroll down the list of variables and highlight the variable split. Second, click on the right arrow button to move the split variable to the Selection Variable text box.

44 SW388R7 Data Analysis & Computers II Slide 44 Setting the value of split to select cases When the variable named split is moved to the Selection Variable text box, SPSS adds "=?" after the name to prompt us to enter a specific value for split. Click on the Rule… button to enter a value for split.

45 SW388R7 Data Analysis & Computers II Slide 45 Completing the value selection First, type the value for the first half of the sample, 0, into the Value text box. Second, click on the Continue button to complete the value entry.

46 SW388R7 Data Analysis & Computers II Slide 46 Requesting output for the first validation sample When the value entry dialog box is closed, SPSS adds the value we entered after the equal sign. This specification now tells SPSS to include in the analysis only those cases that have a value of 0 for the split variable. Click on the OK button to request the output. Since the validation analysis requires us to compare the results of the analysis using the two split samples, we will request the output for the second sample before doing any comparison.

47 SW388R7 Data Analysis & Computers II Slide 47 Repeat the regression with second validation sample To repeat the multiple regression analysis for the second validation sample, select Linear Regression from the Dialog Recall tool button.

48 SW388R7 Data Analysis & Computers II Slide 48 Setting the value of split to select cases Since the split variable is already in the Selection Variable text box, we only need to change its value. Click on the Rule… button to enter a different value for split.

49 SW388R7 Data Analysis & Computers II Slide 49 Completing the value selection First, type the value for the second half of the sample, 1, into the Value text box. Second, click on the Continue button to complete the value entry.

50 SW388R7 Data Analysis & Computers II Slide 50 Requesting output for the second validation sample When the value entry dialog box is closed, SPSS adds the value we entered after the equal sign. This specification now tells SPSS to include in the analysis only those cases that have a value of 1 for the split variable. Click on the OK button to request the output.

51 SW388R7 Data Analysis & Computers II Slide 51 SPLIT-SAMPLE VALIDATION - 1 In both of the split-sample validation analyses, the relationship between the independent variables and the dependent variable was statistically significant. In the first validation, the probability for the F statistic testing overall relationship was <0.001. For the second validation analysis, the probability for the F statistic testing overall relationship was <0.001. Thus far, the validation verifies the existence of the relationship between the dependent variable and the independent variables.

52 SW388R7 Data Analysis & Computers II Slide 52 SPLIT-SAMPLE VALIDATION - 2 The total proportion of variance in the relationship utilizing the full data set was 37.2% compared to 36.8% for the first split sample validation and 39.5% for the second split sample validation. In both of the split-sample validation analyses, the total proportion of variance in the dependent variable explained by the independent variables was within 5% of the variance explained in the model using the full data set (37.2%).
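
A short sketch of the arithmetic behind this comparison. It reports the gap both in percentage points and as a relative difference, because the two readings of a 5% tolerance give different answers here. The conclusion on this slide matches the percentage-point reading. The function name is an assumption made for the example.

```python
def r2_differences(full_r2, split_r2):
    """Return the R squared gap in percentage points and as a relative percent."""
    gap = abs(split_r2 - full_r2)
    points = gap * 100              # e.g. 37.2% vs 39.5% is 2.3 points
    relative = gap / full_r2 * 100  # the same gap as a percent of the full R squared
    return points, relative
```

For the second validation (0.395 against 0.372) the gap is 2.3 percentage points, inside the tolerance, although it is about 6.2% of the full-sample R². For the first validation (0.368) the gap is 0.4 points, or about 1.1% in relative terms.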

53 SW388R7 Data Analysis & Computers II Slide 53 SPLIT-SAMPLE VALIDATION - 3 In both of the split-sample validation analyses, the change in R² from the model including only control variables to the model containing all variables was statistically significant. In the first validation, the probability for the F statistic testing the change in R² was <0.001. For the second validation analysis, the probability for the F statistic testing the change in R² was <0.001. Thus far, the validation verifies the existence of the relationship between the dependent variable and the independent variables, and the significance of the contribution of the independent variables included after the control variables.

54 SW388R7 Data Analysis & Computers II Slide 54 SPLIT-SAMPLE VALIDATION - 4 The relationship between "happiness of marriage" [hapmar] and "general happiness" [happy] was statistically significant for the model using the full data set (p<0.001). Similarly, the relationships in both of the validation analyses were statistically significant. In the first validation analysis, the probability for the test of relationship between "happiness of marriage" [hapmar] and "general happiness" [happy] was <0.001, which was less than or equal to 0.05 and statistically significant.

55 SW388R7 Data Analysis & Computers II Slide 55 SPLIT-SAMPLE VALIDATION - 5 In the second validation analysis, the probability for the test of relationship between "happiness of marriage" [hapmar] and "general happiness" [happy] was <0.001, which was less than or equal to 0.05 and statistically significant. The split sample validation supports the findings of the regression analysis using the full data set. A caution is added because of the inclusion of ordinal level variables. The answer to the question is true with caution.

56 SW388R7 Data Analysis & Computers II Slide 56 Table of validation results: hierarchical regression
Columns: Full Data Set; Split = 0 (Split1 = 1); Split = 1 (Split2 = 1)
ANOVA significance (sig <= 0.05): <0.001; <0.001; <0.001
R²: 0.372; 0.368; 0.395
R² Change (sig <= 0.05): <0.001; <0.001; <0.001
Significant Coefficients (sig <= 0.05): Happiness of marriage; Happiness of marriage; Happiness of marriage
In this example, we satisfy all of the validation criteria. The validation supports the generalizability of the regression model to the population represented by this sample. NOTE: we add the R² change statistic to the validation results because it indicates the contribution of the variables added in the second stage.

57 SW388R7 Data Analysis & Computers II Slide 57 Problem 3 1. In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.01 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions. Validate the results of your regression analysis by splitting the sample in two, using 636396 as the random number seed. From the list of variables "how many in family earned money" [earnrs], "income" [rincom98], and "age" [age], the best predictors of "total family income" [income98] are "income" [rincom98] and "how many in family earned money" [earnrs]. Income and how many in family earned money have a strong relationship to total family income. The most important predictor of total family income is income. The second most important predictor of total family income is how many in family earned money. Survey respondents who had higher incomes had higher total family incomes. Survey respondents who had more family members earning money had higher total family incomes. 1. True 2. True with caution 3. False 4. Inappropriate application of a statistic

58 SW388R7 Data Analysis & Computers II Slide 58 Steps prior to the validation analysis Prior to the split sample validation analysis, we must test for conformity to assumptions and examine outliers, making whatever transformations are needed and removing outliers. Next, we must solve the regression problem to make certain that the findings (existence, strength, direction, and importance of relationships) stated in the problem are correct and, therefore, in need of validation before final interpretation. When we do the validation, we include whatever transformations and omission of outliers were present in the model we want to validate. For stepwise regression, we will require that the same variables be selected in each validation analysis. We will not require that they be entered in the same order.
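
The stepwise requirement, same variables but not necessarily the same order of entry, is an order-insensitive comparison. A minimal sketch, with a hypothetical function name and the entry orders typed in from each "Variables Entered/Removed" table:

```python
def same_variables_selected(full_order, validation_orders):
    """True when every validation selected the same variables, in any order."""
    expected = set(full_order)  # converting to a set discards the order of entry
    return all(set(order) == expected for order in validation_orders)
```

A validation that entered earnrs before rincom98 would still pass, while one that also selected age would not.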

59 SW388R7 Data Analysis & Computers II Slide 59 OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT VARIABLES - 1 The best subset of predictors for total family income included the independent variables: income and how many in family earned money. The probability of the F statistic (98.194) for the regression relationship which includes these variables is <0.001, less than or equal to the level of significance of 0.01. We reject the null hypothesis that there is no relationship between the best subset of independent variables and the dependent variable (R² = 0). We support the research hypothesis that there is a statistically significant relationship between the best subset of independent variables and the dependent variable.

60 SW388R7 Data Analysis & Computers II Slide 60 OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT VARIABLES - 2 The Multiple R for the relationship between the subset of independent variables that best predict the dependent variable is 0.744, which would be characterized as strong using the rule of thumb that a correlation less than or equal to 0.20 is very weak; greater than 0.20 and less than or equal to 0.40 is weak; greater than 0.40 and less than or equal to 0.60 is moderate; greater than 0.60 and less than or equal to 0.80 is strong; and greater than 0.80 is very strong. The R Square statistic is used in the validation analysis. In this example, the proportion of variance in the dependent variable explained by the independent variables included in the analysis is 55.4%.

61 SW388R7 Data Analysis & Computers II Slide 61 OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT VARIABLES - 3 Based on the table of "Variables Entered/Removed," the most important predictor of total family income is income. The second most important predictor of total family income is how many in family earned money.

62 SW388R7 Data Analysis & Computers II Slide 62 RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 1 For the independent variable income, the probability of the t statistic (11.322) for the b coefficient is <0.001 which is less than or equal to the level of significance of 0.01. We reject the null hypothesis that the slope associated with income is equal to zero (b = 0) and conclude that there is a statistically significant relationship between income and total family income.

63 SW388R7 Data Analysis & Computers II Slide 63 RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 2 The b coefficient associated with income (0.470) is positive, indicating a direct relationship in which higher numeric values for income are associated with higher numeric values for total family income. The independent variable income is an ordinal variable that is coded so that higher numeric values are associated with survey respondents who had higher incomes. The dependent variable total family income is also an ordinal variable. It is coded so that higher numeric values are associated with survey respondents who had higher total family incomes. Therefore, the positive value of b implies that survey respondents who had higher incomes had higher total family incomes.

64 SW388R7 Data Analysis & Computers II Slide 64 RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 3 For the independent variable how many in family earned money, the probability of the t statistic (8.566) for the b coefficient is <0.001 which is less than or equal to the level of significance of 0.01. We reject the null hypothesis that the slope associated with how many in family earned money is equal to zero (b = 0) and conclude that there is a statistically significant relationship between how many in family earned money and total family income.

65 SW388R7 Data Analysis & Computers II Slide 65 RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 4 The b coefficient associated with how many in family earned money (16.009) is positive, indicating a direct relationship in which higher numeric values for how many in family earned money are associated with higher numeric values for total family income. Therefore, the positive value of b implies that survey respondents who had more family members earning money had higher total family incomes. If we wanted to specifically interpret the value of the b coefficient (16.009), we would have to convert it out of log units.

66 SW388R7 Data Analysis & Computers II Slide 66 RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 5 The relationship between "age" [age] and "total family income" [income98] was not statistically significant for the model using the full data set (p=0.102).

67 SW388R7 Data Analysis & Computers II Slide 67 Setting the random number seed To set the random number seed, select the Random Number Seed… command from the Transform menu.

68 SW388R7 Data Analysis & Computers II Slide 68 Set the random number seed First, click on the Set seed to option button to activate the text box. Second, type in the random seed stated in the problem. Third, click on the OK button to complete the dialog box. Note that SPSS does not provide you with any feedback about the change.

69 SW388R7 Data Analysis & Computers II Slide 69 Select the compute command To enter the formula for the variable that will split the sample in two parts, click on the Compute… command.

70 SW388R7 Data Analysis & Computers II Slide 70 The formula for the split variable First, type the name for the new variable, split, into the Target Variable text box. Second, the formula for the value of split is shown in the text box. The uniform(1) function generates a random decimal number between 0 and 1. The random number is compared to the value 0.50. If the random number is less than or equal to 0.50, the value of the formula will be 1, the SPSS numeric equivalent to true. If the random number is larger than 0.50, the formula will return a 0, the SPSS numeric equivalent to false. Third, click on the OK button to complete the dialog box.
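
The logic of the split formula can be sketched outside SPSS. The Python below is a minimal illustration, not the SPSS implementation: the function name make_split and the seed value 12345 are hypothetical, and Python's random number generator differs from the one in SPSS, so the same seed will not reproduce the SPSS split.

```python
import random


def make_split(n_cases, seed):
    """Return a list of 0/1 values, one per case.

    A uniform random number is drawn for each case and compared
    to 0.50: the value is 1 when the number is <= 0.50, else 0.
    """
    rng = random.Random(seed)  # seeding makes the split repeatable
    return [1 if rng.random() <= 0.50 else 0 for _ in range(n_cases)]


# Hypothetical seed; in the problems the seed is stated in the question.
split = make_split(10, seed=12345)
print(split)
```

Setting the seed first is what makes the split repeatable: calling make_split twice with the same seed returns the same pattern of zeros and ones.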

71 SW388R7 Data Analysis & Computers II Slide 71 The split variable in the data editor In the data editor, the split variable shows a random pattern of zeros and ones. To select half of the sample for each validation analysis, we will first select the cases where split = 0, then select the cases where split = 1.

72 SW388R7 Data Analysis & Computers II Slide 72 Repeat the regression with first validation sample To repeat the multiple regression analysis for the first validation sample, select Linear Regression from the Dialog Recall tool button.

73 SW388R7 Data Analysis & Computers II Slide 73 Using "split" as the selection variable First, scroll down the list of variables and highlight the variable split. Second, click on the right arrow button to move the split variable to the Selection Variable text box.

74 SW388R7 Data Analysis & Computers II Slide 74 Setting the value of split to select cases When the variable named split is moved to the Selection Variable text box, SPSS adds "=?" after the name to prompt us to enter a specific value for split. Click on the Rule… button to enter a value for split.

75 SW388R7 Data Analysis & Computers II Slide 75 Completing the value selection First, type the value for the first half of the sample, 0, into the Value text box. Second, click on the Continue button to complete the value entry.

76 SW388R7 Data Analysis & Computers II Slide 76 Requesting output for the first validation sample When the value entry dialog box is closed, SPSS adds the value we entered after the equal sign. This specification now tells SPSS to include in the analysis only those cases that have a value of 0 for the split variable. Click on the OK button to request the output. Since the validation analysis requires us to compare the results of the analysis using the two split samples, we will request the output for the second sample before doing any comparison.

77 SW388R7 Data Analysis & Computers II Slide 77 Repeat the regression with second validation sample To repeat the multiple regression analysis for the second validation sample, select Linear Regression from the Dialog Recall tool button.

78 SW388R7 Data Analysis & Computers II Slide 78 Setting the value of split to select cases Since the split variable is already in the Selection Variable text box, we only need to change its value. Click on the Rule… button to enter a different value for split.

79 SW388R7 Data Analysis & Computers II Slide 79 Completing the value selection First, type the value for the second half of the sample, 1, into the Value text box. Second, click on the Continue button to complete the value entry.

80 SW388R7 Data Analysis & Computers II Slide 80 Requesting output for the second validation sample When the value entry dialog box is closed, SPSS adds the value we entered after the equal sign. This specification now tells SPSS to include in the analysis only those cases that have a value of 1 for the split variable. Click on the OK button to request the output.
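
The two selection rules (split = 0, then split = 1) amount to filtering the cases on the value of the split variable. A minimal Python sketch follows; the function name select_cases and the six-case data set are made up for illustration and are not taken from the problem's data.

```python
def select_cases(cases, split_value):
    """Keep only the cases whose split variable equals split_value."""
    return [case for case in cases if case["split"] == split_value]


# Hypothetical miniature data set; ids and split values are invented.
cases = [
    {"id": 1, "split": 0},
    {"id": 2, "split": 1},
    {"id": 3, "split": 1},
    {"id": 4, "split": 0},
    {"id": 5, "split": 1},
    {"id": 6, "split": 0},
]

first_validation = select_cases(cases, 0)
second_validation = select_cases(cases, 1)

print([case["id"] for case in first_validation])    # [1, 4, 6]
print([case["id"] for case in second_validation])   # [2, 3, 5]
```

Every case falls into exactly one of the two validation samples, so the two regressions are run on non-overlapping halves of the data.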

81 SW388R7 Data Analysis & Computers II Slide 81 SPLIT-SAMPLE VALIDATION - 1 In both of the split-sample validation analyses, the relationship between the independent variables and the dependent variable was statistically significant. In the first validation, the probability for the F statistic testing overall relationship was <0.001. For the second validation analysis, the probability for the F statistic testing overall relationship was <0.001.

82 SW388R7 Data Analysis & Computers II Slide 82 SPLIT-SAMPLE VALIDATION - 2 The total proportion of variance explained in the model using the full data set was 55.4%, compared to 55.6% for the first split sample validation and 57.0% for the second split sample validation. In both of the split-sample validation analyses, the total proportion of variance in the dependent variable explained by the independent variables was within 5% of the variance explained in the model using the full data set (55.4%).
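
The R² comparison can be checked with a few lines of Python. This sketch assumes that "within 5%" means within 5 percentage points (plus or minus 0.05 on the R² scale) of the full-sample value; the function name within_tolerance is hypothetical. The validation values in this example are close enough to the full-sample value that they would also pass a stricter, relative reading of 5%.

```python
def within_tolerance(r2_full, r2_validation, tolerance=0.05):
    """True when the validation R² is within +/- tolerance of the full-sample R².

    Assumption: the 5% criterion is read as 5 percentage points.
    """
    return abs(r2_validation - r2_full) <= tolerance


r2_full = 0.554  # R² for the full data set
validations = [("split = 0", 0.556), ("split = 1", 0.570)]

for label, r2 in validations:
    print(label, within_tolerance(r2_full, r2))
# split = 0 True
# split = 1 True
```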

83 SW388R7 Data Analysis & Computers II Slide 83 SPLIT-SAMPLE VALIDATION - 3 The relationship between "age" [age] and "total family income" [income98] was not statistically significant for the model using the full data set (p=0.102). Similarly, the relationships in both of the validation analyses were not statistically significant. In the first validation analysis, the probability for the test of relationship between "age" [age] and "total family income" [income98] was 0.263, which was greater than the level of significance of 0.01 and not statistically significant. In the second validation analysis, the probability for the test of relationship between "age" [age] and "total family income" [income98] was 0.246, which was greater than the level of significance of 0.01 and not statistically significant.

84 SW388R7 Data Analysis & Computers II Slide 84 SPLIT-SAMPLE VALIDATION - 4 The relationship between "how many in family earned money" [earnrs] and "total family income" [income98] was statistically significant for the model using the full data set (p<0.001). Similarly, the relationships in both of the validation analyses were statistically significant. In the first validation analysis, the probability for the test of relationship between "how many in family earned money" [earnrs] and "total family income" [income98] was <0.001, which was less than or equal to the level of significance of 0.01 and statistically significant.

85 SW388R7 Data Analysis & Computers II Slide 85 SPLIT-SAMPLE VALIDATION - 5 In the second validation analysis, the probability for the test of relationship between "how many in family earned money" [earnrs] and "total family income" [income98] was <0.001, which was less than or equal to the level of significance of 0.01 and statistically significant.

86 SW388R7 Data Analysis & Computers II Slide 86 SPLIT-SAMPLE VALIDATION - 6 The relationship between "income" [rincom98] and "total family income" [income98] was statistically significant for the model using the full data set (p<0.001). Similarly, the relationships in both of the validation analyses were statistically significant. In the first validation analysis, the probability for the test of relationship between "income" [rincom98] and "total family income" [income98] was <0.001, which was less than or equal to the level of significance of 0.01 and statistically significant.

87 SW388R7 Data Analysis & Computers II Slide 87 SPLIT-SAMPLE VALIDATION - 7 In the second validation analysis, the probability for the test of relationship between "income" [rincom98] and "total family income" [income98] was <0.001, which was less than or equal to the level of significance of 0.01 and statistically significant. The split sample validation supports the findings of the regression analysis using the full data set. A caution is added because of the inclusion of ordinal level variables. The answer to the question is true with caution.

88 SW388R7 Data Analysis & Computers II Slide 88 Table of validation results: stepwise regression
Columns: Full Data Set | Split = 0 (Split1 = 1) | Split = 1 (Split2 = 1)
ANOVA significance (sig <= 0.01): <0.001 | <0.001 | <0.001
R²: 0.554 | 0.556 | 0.570
Significant coefficients (R² change sig <= 0.01): Respondent’s income and How many in family earned income, in the full data set and in both validation samples
NOTE: we use a lower level of significance to offset stepwise regression’s tendency to overfit a model. In this example, we satisfy all of the validation criteria. The validation supports the generalizability of the regression model to the population represented by this sample. The same variables entered the validation analyses in a different order. The difference in order does not negate the validation.
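
The pattern-of-significance criterion can be sketched in Python by comparing the set of significant variables across the three analyses. Because sets are unordered, the comparison ignores the order in which stepwise regression entered the variables. The function name significant is hypothetical, and probabilities reported only as <0.001 are entered as 0.001, an upper bound used for illustration.

```python
ALPHA = 0.01  # level of significance used in this example


def significant(p_values, alpha=ALPHA):
    """Return the set of variables whose probability is <= alpha."""
    return {name for name, p in p_values.items() if p <= alpha}


# Probabilities from the slides; 0.001 stands in for "<0.001".
full = {"rincom98": 0.001, "earnrs": 0.001, "age": 0.102}
first = {"rincom98": 0.001, "earnrs": 0.001, "age": 0.263}
second = {"rincom98": 0.001, "earnrs": 0.001, "age": 0.246}

same_pattern = significant(full) == significant(first) == significant(second)

print(sorted(significant(full)))   # ['earnrs', 'rincom98']
print(same_pattern)                # True
```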

89 SW388R7 Data Analysis & Computers II Slide 89 Split sample validation - 1 The following is a guide to the decision process for answering problems about split sample validation: Is the dependent variable metric, and are the independent variables metric or dichotomous? If no, the problem is an inappropriate application of a statistic. If yes, is the ratio of cases to independent variables at least 5 to 1? If no, the problem is an inappropriate application of a statistic. If yes, compute the regression analysis, using transformations and omitting outliers as needed.

90 SW388R7 Data Analysis & Computers II Slide 90 Split sample validation - 2 Are there statistical findings requiring validation? If no, the answer is false. If yes, are there enough valid cases to split the sample and keep the 5 to 1 ratio of cases to variables? If yes, set the random seed and compute the split variable, then re-run the regression with split = 0 and again with split = 1. If no, set the first random seed and compute the split1 variable, re-run the regression with split1 = 1, then set the second random seed and compute the split2 variable, and re-run the regression with split2 = 1.

91 SW388R7 Data Analysis & Computers II Slide 91 Split sample validation - 3 Compute the regression analysis for both validation samples, using the same method as the analysis of the full sample. Is the probability of the ANOVA test less than or equal to the level of significance for both validation analyses? If no, the answer is false. If yes, is the R² for both validations within 5% of the R² for the analysis of the full data set? If no, the answer is false. If yes, continue to the next check.

92 SW388R7 Data Analysis & Computers II Slide 92 Split sample validation - 4 Does the pattern of significance for the independent variables in both validations match the pattern for the full data set? If no, the answer is false. If yes, and for hierarchical regression only, is the change in R² statistically significant in both validation analyses? If no, the answer is false. If yes, does the analysis satisfy the ratio for preferred sample size of 15 to 1 (stepwise: 50 to 1)? If no, the answer is true with caution. If yes, continue to the next check.

93 SW388R7 Data Analysis & Computers II Slide 93 Split sample validation - 5 Were other cautions added for ordinal variables or violation of assumptions? If yes, the answer is true with caution. If no, the answer is true.
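
The decision process in slides 89 to 93 can be summarized as a single function. The Python below is a sketch of one reading of the flowchart, not an official scoring rule: the function name split_sample_answer and all of its parameter names are hypothetical, and each argument is the yes/no answer to one of the questions in the guide.

```python
def split_sample_answer(
    variables_ok,                      # DV metric; IVs metric or dichotomous
    ratio_at_least_5_to_1,             # minimum cases-to-variables ratio met
    findings_to_validate,              # full-sample analysis found something
    anova_significant_both,            # ANOVA p <= alpha in both validations
    r2_within_5_both,                  # both validation R² within 5% of full R²
    same_pattern,                      # same significant variables everywhere
    r2_change_significant_both=True,   # hierarchical regression only
    preferred_ratio_met=True,          # 15 to 1 (stepwise: 50 to 1)
    other_cautions=False,              # ordinal variables, assumption violations
):
    """Walk the split sample validation decision process in order."""
    if not (variables_ok and ratio_at_least_5_to_1):
        return "Inappropriate application of a statistic"
    if not findings_to_validate:
        return "False"
    if not (anova_significant_both and r2_within_5_both
            and same_pattern and r2_change_significant_both):
        return "False"
    if not preferred_ratio_met or other_cautions:
        return "True with caution"
    return "True"


# The worked example: all criteria met, with a caution for ordinal variables.
answer = split_sample_answer(
    variables_ok=True,
    ratio_at_least_5_to_1=True,
    findings_to_validate=True,
    anova_significant_both=True,
    r2_within_5_both=True,
    same_pattern=True,
    other_cautions=True,
)
print(answer)  # True with caution
```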

