Multiple Regression – Split Sample Validation

Multiple Regression – Split Sample Validation
General criteria for split sample validation Sample problems

General criteria for split sample validation
It is expected that the results obtained from split sample validations will vary somewhat from the results obtained from the analysis using the full data set. We will use the following as our criteria that the validation verified our analysis and supports the generalizability of our findings: First, the overall relationship between the dependent variable and the set of independent variable must be statistically significant for both validation analyses. Second, the R² for each validation must be within 5% (plus or minus) of the R² for the model using the full sample.

General criteria for split sample validation - 2
Third, the pattern of statistical significance for the coefficients of the independent variables for both validation analyses must be the same as the pattern for the full analysis, i.e. the same variables are statistically significant or not significant. For stepwise multiple regression, we require that the same variables be significant, but it is not required that they enter in exactly the same order. For hierarchical multiple regression, the R² change for the validation must be statistically significant.

Notes Findings are stated on the results for the analysis of the full data set, not the validations. If our validation analysis does not support the findings of the analysis on the full data set, we will declare the answer to the problem to be false. There is, however, another common option, which is to only cite statistical significance for independent variables supported by the validation analysis as well as the full data set analysis. All other variables are considered to be non-significant. Generally it is the independent variables with the weakest individual relationship to the dependent variable which fail to validate.

Problem 1 1. In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.05 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions. Validate the results of your regression analysis by splitting the sample in two, using as the random number seed. The variables "age" [age], "highest year of school completed" [educ], and "sex" [sex] have a moderate relationship to the variable "occupational prestige score" [prestg80]. Survey respondents who were older had more prestigious occupations. Survey respondents who had completed more years of school had more prestigious occupations. The variable sex did not have a relationship to "occupational prestige score" [prestg80]. 1. True 2. True with caution 3. False 4. Inappropriate application of a statistic

Steps prior to the validation analysis
Prior to the split sample validation analysis, we must test for conformity to assumptions and examine outliers, making whatever transformations are needed and removing outliers. Next, we must solve the regression problem to make certain the that findings (existence, strength, direction, and importance of relationships) stated in the problem are correct and, therefore, in need of validation before final interpretation. When we do the validation, we include whatever transformations and omission of outliers were present in the model we want to validate.

OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT VARIABLES - 1
The probability of the F statistic (36.639) for the overall regression relationship is <0.001, less than or equal to the level of significance of We reject the null hypothesis that there is no relationship between the set of independent variables and the dependent variable (R² = 0). We support the research hypothesis that there is a statistically significant relationship between the set of independent variables and the dependent variable.

The Multiple R for the relationship between the set of independent variables and the dependent variable is 0.554, which would be characterized as moderate using the rule of thumb than a correlation less than or equal to 0.20 is characterized as very weak; greater than 0.20 and less than or equal to 0.40 is weak; greater than 0.40 and less than or equal to 0.60 is moderate; greater than 0.60 and less than or equal to 0.80 is strong; and greater than 0.80 is very strong. The R Square statistic is used in the validation analysis. In this example, the proportion of variance in the dependent variable explained by all of the independent variables is 30.7%.

RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 1
For the independent variable age, the probability of the t statistic (2.749) for the b coefficient is which is less than or equal to the level of significance of We reject the null hypothesis that the slope associated with age is equal to zero (b = 0) and conclude that there is a statistically significant relationship between age and occupational prestige score. The b coefficient associated with age (0.125) is positive, indicating a direct relationship in which higher numeric values for age are associated with higher numeric values for occupational prestige score. Therefore, the positive value of b implies that survey respondents who were older had more prestigious occupations.

For the independent variable highest year of school completed, the probability of the t statistic (10.352) for the b coefficient is <0.001 which is less than or equal to the level of significance of We reject the null hypothesis that the slope associated with highest year of school completed is equal to zero (b = 0) and conclude that there is a statistically significant relationship between highest year of school completed and occupational prestige score. The b coefficient associated with highest year of school completed (2.810) is positive, indicating a direct relationship in which higher numeric values for highest year of school completed are associated with higher numeric values for occupational prestige score. Therefore, the positive value of b implies that survey respondents who had completed more years of school had more prestigious occupations.

For the independent variable sex, the probability of the t statistic (-0.917) for the b coefficient is which is greater than the level of significance of We fail to reject the null hypothesis that the slope associated with sex is equal to zero (b = 0) and conclude that there is not a statistically significant relationship between sex and occupational prestige score.

Setting the random number seed
To set the random number seed, select the Random Number Seed… command from the Transform menu.

Set the random number seed
First, click on the Set seed to option button to activate the text box. Second, type in the random seed stated in the problem. Third, click on the OK button to complete the dialog box. Note that SPSS does not provide you with any feedback about the change.

Select the compute command
To enter the formula for the variable that will split the sample in two parts, click on the Compute… command.

The formula for the split variable
First, type the name for the new variable, split, into the Target Variable text box. Second, the formula for the value of split is shown in the text box. The uniform(1) function generates a random decimal number between 0 and 1. The random number is compared to the value 0.50. If the random number is less than or equal to 0.50, the value of the formula will be 1, the SPSS numeric equivalent to true. If the random number is larger than 0.50, the formula will return a 0, the SPSS numeric equivalent to false. Third, click on the OK button to complete the dialog box.

The split variable in the data editor
In the data editor, the split variable shows a random pattern of zero’s and one’s. To select half of the sample for each validation analysis, we will first select the cases where split = 0, then select the cases where split = 1.

Repeat the regression with first validation sample
To repeat the multiple regression analysis for the first validation sample, select Linear Regression from the Dialog Recall tool button.

Using "split" as the selection variable
First, scroll down the list of variables and highlight the variable split. Second, click on the right arrow button to move the split variable to the Selection Variable text box.

Setting the value of split to select cases
When the variable named split is moved to the Selection Variable text box, SPSS adds "=?" after the name to prompt up to enter a specific value for split. Click on the Rule… button to enter a value for split.

Completing the value selection
First, type the value for the first half of the sample, 0, into the Value text box. Second, click on the Continue button to complete the value entry.

Requesting output for the first validation sample
Click on the OK button to request the output. When the value entry dialog box is closed, SPSS adds the value we entered after the equal sign. This specification now tells SPSS to include in the analysis only those cases that have a value of 0 for the split variable. Since the validation analysis requires us to compare the results of the analysis using the two split sample, we will request the output for the second sample before doing any comparison.

Repeat the regression with second validation sample
To repeat the multiple regression analysis for the second validation sample, select Linear Regression from the Dialog Recall tool button.

Since the split variable is already in the Selection Variable text box, we only need to change its value. Click on the Rule… button to enter a different value for split.

Requesting output for the second validation sample
Click on the OK button to request the output. When the value entry dialog box is closed, SPSS adds the value we entered after the equal sign. This specification now tells SPSS to include in the analysis only those cases that have a value of 1 for the split variable.

SPLIT-SAMPLE VALIDATION - 1
In both of the split-sample validation analyses, the relationship between the independent variables and the dependent variable was statistically significant. In the first validation, the probability for the F statistic testing overall relationship was <0.001. For the second validation analysis, the probability for the F statistic testing overall relationship was <0.001. Thus far, the validation verifies the existence of the relationship between the dependent variable and the independent variables.

The proportion of variance in the relationship utilizing the full data set was 30.7% compared to 24.5% for the first split sample validation and 39.6% for the second split sample validation. In both of the split-sample validation analyses, the proportion of variance in the dependent variable explained by the independent variables was not within 5% of the variance explained in the model using the full data set (30.7%). The strength of the relationship between the independent variables and the dependent variable is not supported by the validation. The answer to the question is false, because the validation did not verify the strength of the relationship in the analysis of the full data set.

Table of validation results: standard regression
It may be helpful to create a table for our validation results and fill in its cells as we complete the analysis. Full Data Set Split = 0 (Split1 = 1) Split = 1 (Split2 = 1) ANOVA significance (sig <= 0.05) <0.001 R2 0.307 0.245 0.396 Significant Coefficients (sig <= 0.05) Age of respondent Highest year of school completed This table shows us that the validation failed in the strength of relationship row (R²), where the R² for the validations did not fall within 5% if the R² for the full data set. Had we satisfied this criteria, the validation would have failed in the Significant Coefficients row, where age of respondent was not statistically significant in the first validation analysis.

Problem 2 1. In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.05 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions. Validate the results of your regression analysis by splitting the sample in two, using as the random number seed. After controlling for the effects of the variables "age" [age] and "sex" [sex], the addition of the variable "happiness of marriage" [hapmar] reduces the error in predicting "general happiness" [happy] by 37.0%. After controlling for age and sex, the variable happiness of marriage makes an individual contribution to reducing the error in predicting general happiness. Survey respondents who were less happy with their marriages were less happy overall. 1. True 2. True with caution 3. False 4. Inappropriate application of a statistic

Prior to the split sample validation analysis, we must test for conformity to assumptions and examine outliers, making whatever transformations are needed and removing outliers. Next, we must solve the regression problem to make certain the that findings (existence, strength, direction, and importance of relationships) stated in the problem are correct and, therefore, in need of validation before final interpretation. When we do the validation, we include whatever transformations and omission of outliers were present in the model we want to validate. For hierarchical regression, the R² change statistics must be statistically significant in order for the model to be validated.

The probability of the F statistic (25.476) for the overall regression relationship for all indpendent variables is <0.001, less than or equal to the level of significance of We reject the null hypothesis that there is no relationship between the set of all independent variables and the dependent variable (R² = 0). We support the research hypothesis that there is a statistically significant relationship between the set of all independent variables and the dependent variable.

The R Square Change statistic for the increase in R² associated with the added variables (happiness of marriage) is Using a proportional reduction in error interpretation for R², information provided by the added variables reduces our error in predicting general happiness by 37.0%. The R Square statistic is used in the validation analysis. In this example, the proportion of variance in the dependent variable explained by all of the independent variables is 37.2%.

The probability of the F statistic (75.972) for the change in R² associated with the addition of the predictor variables to the regression analysis containing the control variables is <0.001, less than or equal to the level of significance of We reject the null hypothesis that there is no improvement in the relationship between the set of independent variables and the dependent variable when the predictors are added (R² Change = 0). We support the research hypothesis that there is a statistically significant improvement in the relationship between the set of independent variables and the dependent variable.

For the independent variable happiness of marriage, the probability of the t statistic (8.716) for the b coefficient is <0.001 which is less than or equal to the level of significance of We reject the null hypothesis that the slope associated with happiness of marriage is equal to zero (b = 0) and conclude that there is a statistically significant relationship between happiness of marriage and general happiness.

The b coefficient associated with happiness of marriage (0.658) is positive, indicating a direct relationship in which higher numeric values for happiness of marriage are associated with higher numeric values for general happiness. The independent variable happiness of marriage is an ordinal variable that is coded so that higher numeric values are associated with survey respondents who were less happy with their marriages. The dependent variable general happiness is also an ordinal variable. It is coded so that higher numeric values are associated with survey respondents who were less happy overall. Therefore, the positive value of b implies that survey respondents who were less happy with their marriages were less happy overall.

Age and sex are variables whose effects are controlled for, so there is no requirement that they be statistically significant.

In both of the split-sample validation analyses, the relationship between the independent variables and the dependent variable was statistically significant. In the first validation, the probability for the F statistic testing overall relationship was <0.001. For the second validation analysis, the probability for the F statistic testing overall relationship was <0.001. Thus far, the validation verifies the existence of the relationship between the dependent variable and the independent variables.

The total proportion of variance in the relationship utilizing the full data set was 37.2% compared to 36.8% for the first split sample validation and 39.5% for the second split sample validation. In both of the split-sample validation analyses, the total proportion of variance in the dependent variable explained by the independent variables was within 5% of the variance explained in the model using the full data set (37.2%).

In both of the split-sample validation analyses, the change in R² from the model including only control variables to the model containing all variables was statistically significant. In the first validation, the probability for the F statistic testing the change in R² was <0.001. Thus far, the validation verifies the existence of the relationship between the dependent variable and the independent variables, and the significance of the contribution of the independent variables included after the control variables. For the second validation analysis, the probability for the F statistic testing the change in R² was <0.001.

The relationship between "happiness of marriage" [hapmar] and "general happiness" [happy] was statistically significant for the model using the full data set (p<0.001). Similarly, the relationships in both of the validation analyses were statisically significant. In the first validation analysis, the probability for the test of relationship between "happiness of marriage" [hapmar] and "general happiness" [happy] was <0.001, which was less than or equal to 0.05 and statistically significant.

In the second validation analysis, the probability for the test of relationship between "happiness of marriage" [hapmar] and "general happiness" [happy] was <0.001, which was less than or equal to 0.05 and statistically significant. The split sample validation supports the findings of the regression analysis using the full data set. A caution is added because of the inclusion of ordinal level variables. The answer to the question is true with caution.

Table of validation results: hierarchical regression
In this example, we satisfy all of the validation criteria. The validation supports the generalizability of the regression model to the population represented by this sample. Full Data Set Split = 0 (Split1 = 1) Split = 1 (Split2 = 1) ANOVA significance (sig <= 0.05) <0.001 R2 0.372 0.368 0.395 R2 Change (sig <= 0.05) Significant Coefficients (sig <= 0.05) Happiness of marriage NOTE: we add the R² change statistic to the validation results because it indicates the contribution of the variables added in the second stage.

Problem 3 1. In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.01 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions. Validate the results of your regression analysis by splitting the sample in two, using as the random number seed. From the list of variables "how many in family earned money" [earnrs], "income" [rincom98], and "age" [age], the best predictors of "total family income" [income98] are "income" [rincom98] and "how many in family earned money" [earnrs]. Income and how many in family earned money have a strong relationship to total family income. The most important predictor of total family income is income. The second most important predictor of total family income is how many in family earned money. Survey respondents who had higher incomes had higher total family incomes. Survey respondents who had more family members earning money had higher total family incomes. 1. True 2. True with caution 3. False 4. Inappropriate application of a statistic

Prior to the split sample validation analysis, we must test for conformity to assumptions and examine outliers, making whatever transformations are needed and removing outliers. Next, we must solve the regression problem to make certain the that findings (existence, strength, direction, and importance of relationships) stated in the problem are correct and, therefore, in need of validation before final interpretation. When we do the validation, we include whatever transformations and omission of outliers were present in the model we want to validate. For stepwise regression, we will require that the same variables be selected in each validation analysis. We will not require that they be entered in the same order.

The best subset of predictors for total family income included the independent variables: income and how many in family earned money. The probability of the F statistic (98.194) for the regression relationship which includes these variables is <0.001, less than or equal to the level of significance of We reject the null hypothesis that there is no relationship between the best subset of independent variables and the dependent variable (R² = 0). We support the research hypothesis that there is a statistically significant relationship between the best subset of independent variables and the dependent variable.

The Multiple R for the relationship between the subset of independent variables that best predict the dependent variable is 0.744, which would be characterized as strong using the rule of thumb than a correlation less than or equal to 0.20 is very weak; greater than 0.20 and less than or equal to 0.40 is weak; greater than 0.40 and less than or equal to 0.60 is moderate; greater than 0.60 and less than or equal to 0.80 is strong; and greater than 0.80 is very strong. The R Square statistic is used in the validation analysis. In this example, the proportion of variance in the dependent variable explained by the independent variables included in the analysis is 55.4%.

Based on the table of "Variables Entered/Removed," the most important predictor of total family income is income. The second most important predictor of total family income is how many in family earned money.

For the independent variable income, the probability of the t statistic (11.322) for the b coefficient is <0.001 which is less than or equal to the level of significance of We reject the null hypothesis that the slope associated with income is equal to zero (b = 0) and conclude that there is a statistically significant relationship between income and total family income.

The b coefficient associated with income (0.470) is positive, indicating a direct relationship in which higher numeric values for income are associated with higher numeric values for total family income. The independent variable income is an ordinal variable that is coded so that higher numeric values are associated with survey respondents who had higher incomes. The dependent variable total family income is also an ordinal variable. It is coded so that higher numeric values are associated with survey respondents who had higher total family incomes. Therefore, the positive value of b implies that survey respondents who had higher incomes had higher total family incomes.

For the independent variable how many in family earned money, the probability of the t statistic (8.566) for the b coefficient is <0.001 which is less than or equal to the level of significance of We reject the null hypothesis that the slope associated with how many in family earned money is equal to zero (b = 0) and conclude that there is a statistically significant relationship between how many in family earned money and total family income.

The b coefficient associated with how many in family earned money (16.009) is positive, indicating a direct relationship in which higher numeric values for how many in family earned money are associated with higher numeric values for total family income. Therefore, the positive value of b implies that survey respondents who had more family members earning money had higher total family incomes. If we wanted to specifically interpret the value of the b coefficient (16.009), we would have to convert it out of log unit.

The relationship between "age" [age] and "total family income" [income98] was not statistically significant for the model using the full data set (p=0.102).

In both of the split-sample validation analyses, the relationship between the independent variables and the dependent variable was statistically significant. In the first validation, the probability for the F statistic testing overall relationship was <0.001. For the second validation analysis, the probability for the F statistic testing overall relationship was <0.001.

The total proportion of variance in the relationship utilizing the full data set was 55.4% compared to 55.6% for the first split sample validation and 57.0% for the second split sample validation. In both of the split-sample validation analyses, the total proportion of variance in the dependent variable explained by the independent variables was within 5% of the variance explained in the model using the full data set (55.4%).

The relationship between "age" [age] and "total family income" [income98] was not statistically significant for the model using the full data set (p=0.102). Similarly, the relationships in both of the validation analyses were not statistically significant. In the first validation analysis, the probability for the test of relationship between "age" [age] and "total family income" [income98] was 0.263, which was greater than the level of significance of 0.01 and not statistically significant. In the second validation analysis, the probability for the test of relationship between "age" [age] and "total family income" [income98] was 0.246, which was greater than the level of significance of 0.01 and not statistically significant.

The relationship between "how many in family earned money" [earnrs] and "total family income" [income98] was statistically significant for the model using the full data set (p<0.001). Similarly, the relationships in both of the validation analyses were statistically significant. In the first validation analysis, the probability for the test of relationship between "how many in family earned money" [earnrs] and "total family income" [income98] was <0.001, which was less than or equal to the level of significance of 0.01 and statistically significant.

In the second validation analysis, the probability for the test of relationship between "how many in family earned money" [earnrs] and "total family income" [income98] was <0.001, which was less than or equal to the level of significance of 0.01 and statistically significant.

The relationship between "income" [rincom98] and "total family income" [income98] was statistically significant for the model using the full data set (p<0.001). Similarly, the relationships in both of the validation analyses were statistically significant. In the first validation analysis, the probability for the test of relationship between "income" [rincom98] and "total family income" [income98] was <0.001, which was less than or equal to the level of significance of 0.01 and statistically significant.

In the second validation analysis, the probability for the test of relationship between "income" [rincom98] and "total family income" [income98] was <0.001, which was less than or equal to the level of significance of 0.01 and statistically significant. The split sample validation supports the findings of the regression analysis using the full data set. A caution is added because of the inclusion of ordinal level variables. The answer to the question is true with caution.

Table of validation results: stepwise regression
In this example, we satisfy all of the validation criteria. The validation supports the generalizability of the regression model to the population represented by this sample. Full Data Set Split = 0 (Split1 = 1) Split = 1 (Split2 = 1) ANOVA significance (sig <= 0.01) <0.001 R2 0.554 0.556 0.570 Significant Coefficients (R2 change sig <= 0.01) Respondent’s income How many in family earned income NOTE: we use a lower level of significance to offset stepwise regression’s tendency to over fit a model. The same variables entered the validation analyses in a different order. The difference is order does not negate the validation.

Split sample validation - 1
The following is a guide to the decision process for answering problems about split sample validation: Dependent variable metric? Independent variables metric or dichotomous? No Inappropriate application of a statistic Yes Ratio of cases to independent variables at least 5 to 1? No Inappropriate application of a statistic Yes Yes Compute regression analysis, using transformations and omitting outliers as needed.

Are there statistical findings requiring validation? No False Yes Enough valid cases to split sample and keep 5 to 1 ratio of cases/variables? No Yes Set the first random seed and compute the split1 variable Re-run factor with split1 = 1 Set the second random seed and compute the split2 variable Re-run factor with split2 = 1 Set the random seed and compute the split variable Re-run factor with split = 0 Re-run factor with split = 1 Yes

Compute regression analysis for both validation samples, using same method as analysis of full sample Probability of ANOVA test <= level of significance for both validation analyses? No False Yes Yes R² for both validations within 5% of R² for analysis of full data set? No False Yes Yes

Yes Change in R² statistically significant in both validation analyses? (Hierarchical only) No False Yes Pattern of significance for independent variables in both validations matches pattern for full data set? No False Yes Satisfies ratio for preferred sample size: 15 to 1 (stepwise: 50 to 1) No True with caution Yes Yes

Yes Other cautions added for ordinal variables or violation of assumptions? No True Yes Yes True with caution

Multiple Regression – Split Sample Validation

Similar presentations

Presentation on theme: "Multiple Regression – Split Sample Validation"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Multiple Regression – Split Sample Validation

Similar presentations

Presentation on theme: "Multiple Regression – Split Sample Validation"— Presentation transcript:

Similar presentations

About project

Feedback