Testing whether a multivariate specification can be simplified

Jane E. Miller, PhD

This podcast discusses how to test whether a multivariate regression model can be simplified. Such simplification could involve omitting one or more variables from an initial specification, whether those are main effects or interaction terms. Before watching this podcast, you should be familiar with the material in the podcasts on interpreting coefficients from OLS and logit models, testing differences across coefficients, and comparing model goodness of fit. If you are interested in using this approach to evaluate which terms to include in an interaction model, you should also be familiar with the prior podcasts on specifying a model to test for interactions and on specification errors for interaction models.

The Chicago Guide to Writing about Multivariate Analysis, 2nd Edition.

Overview
- Initial model specifications with full set of independent variables (IVs)
- How to test whether simpler specification fits the data as well
- Approaches to simplifying model specification
  - Creating a reference category that combines categories
  - Collapsing other categories
- Presenting results of analyses to evaluate model specification

When estimating multivariate model(s) to answer our research question, we usually begin with a full set of independent variables based on theoretical considerations for our topic as well as the ways our variables are measured and coded in our data. For instance, you might test a model with several different measures of socioeconomic status, each of which has multiple categories. In the example we’ve been using throughout WAMA II and the associated podcasts, we’ve estimated a model of birth weight as a function of race/ethnicity (three categories) and mother’s education (three categories). We might want to know whether it is possible to simplify that specification by combining categories of those variables while still retaining a model that fits the data well. In this podcast, I’ll work through empirical examples of how to assess whether a simpler model might be the parsimonious one for the topic and data. Then I’ll close with some illustrations of how to describe how you arrived at your final model specification.

Estimated coefficients from an OLS model of birth weight in grams
                                             Coefficient   Standard error
Intercept                                      3,317.8**        25.1
Race/Hispanic origin (Non-Hispanic white)
  Mexican American                               –23.0          22.7
  Non-Hispanic black                            –172.6**        17.5
Mother’s education (> High school; >HS)
  < High school (<HS)                            –55.5**        19.3
  = High school (=HS)                            –53.9**        14.8

** denotes p < .01. Reference category in parentheses.

Here is a table of coefficients from an OLS model of birth weight as a function of race and mother’s education. Other variables controlled in the model include gender, age, income, and smoking, but their coefficients are not needed for the examples I will work through.

Testing whether a specification could be simplified
- For a specification that involves several multi-category variables, might be able to simplify the specification if some terms can be omitted without worsening the overall fit of the model
- E.g., for a three-category variable, might it be possible to
  - Combine one of the modeled categories with the reference category?
  - Combine the two modeled categories with one another?

When I talk about simplifying a specification, one way that could be done is to omit some of the dummy variables used to specify the association between a three-category IV and the DV. (Read slide.)

Example 1: Revising the reference category for race
- The estimated coefficient for Mexican American is not statistically significantly different from zero
  - I.e., predicted birth weight is not statistically significantly different for Mexican American than for non-Hispanic white infants (the reference category)
  - βMexicanAmerican = –23.0; standard error = 22.7
- Since birth weights for those two racial/ethnic groups are so close, could combine them to create the reference category
  - Reference category now includes BOTH non-Hispanic white AND Mexican American infants

The first example we will look at is whether we could simplify the race/ethnic specification in our birth weight model. We see from the coefficients reported on the earlier slide that the estimated coefficient for Mexican American is not statistically significantly different from zero. (Read slide.)
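
As a rough check on the significance markers in the table, each coefficient can be divided by its standard error and compared with a critical value. The sketch below is not from the book: it assumes the large-sample cutoff of 1.96 (two-sided p < .05) that this podcast uses later, and the dictionary layout and names are illustrative.

```python
# Coefficients and standard errors copied from the birth weight table.
# The dictionary layout and names are illustrative assumptions.
estimates = {
    "Mexican American": (-23.0, 22.7),
    "Non-Hispanic black": (-172.6, 17.5),
    "<HS": (-55.5, 19.3),
    "=HS": (-53.9, 14.8),
}

CRITICAL_VALUE = 1.96  # two-sided p < .05, large-sample assumption


def t_ratio(coefficient, standard_error):
    """Test statistic for H0: coefficient = 0."""
    return coefficient / standard_error


t_ratios = {name: t_ratio(b, se) for name, (b, se) in estimates.items()}

# Categories whose coefficient is not distinguishable from the reference group.
not_significant = [n for n, t in t_ratios.items() if abs(t) < CRITICAL_VALUE]

for name, t in t_ratios.items():
    print(f"{name}: t = {t:.2f}")
print("Candidates for merging with the reference category:", not_significant)
```

Only the Mexican American coefficient falls below the cutoff, which is why it is the candidate for merging with the reference category. Whether the merge makes substantive sense is a separate question.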

Test a revised race/ethnicity specification
- Replace Specification A
  - BW = f (NHB, MA, other independent variables)
- With Specification B
  - BW = f (NHB, same set of other independent variables)
  - Reference category is now non-Hispanic white and Mexican American infants
- Compare overall fit of specifications A and B
  - Goodness-of-fit (GOF) statistics
- If fit of the model with simpler racial/ethnic specification is not statistically significantly worse than that of the more detailed specification, it is the parsimonious specification

To test that possibility, we estimate a new specification, which we’ll call “Specification B,” that includes the same set of independent variables as the original specification (“Specification A”) EXCEPT that it drops the dummy variable “MA.” Now the reference category is non-Hispanic white plus Mexican American (NHW + MA), so the racial/ethnic comparison is now non-Hispanic black against all other infants. To see whether that specification fits the data as well as the one with the more detailed racial/ethnic contrasts, use the techniques we learned in the podcast on comparing model GOF to formally compare the overall fit of Specifications A and B.
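
One common way to compare the fit of two nested OLS specifications is an F-test on the change in the residual sum of squares. The sketch below assumes that approach; the podcast on comparing model GOF may use a different statistic. The function name and all input values are hypothetical placeholders, not results from the birth weight data.

```python
def nested_f_statistic(sse_restricted, sse_full, n_dropped, df_resid_full):
    """F statistic for H0: the dropped terms add nothing to the fit.

    sse_restricted: residual sum of squares of the simpler model (e.g., Spec B)
    sse_full:       residual sum of squares of the detailed model (e.g., Spec A)
    n_dropped:      number of terms omitted from the detailed model
    df_resid_full:  residual degrees of freedom of the detailed model
    """
    if sse_restricted < sse_full:
        raise ValueError("a nested, simpler model cannot fit better than the full model")
    numerator = (sse_restricted - sse_full) / n_dropped
    denominator = sse_full / df_resid_full
    return numerator / denominator


# Placeholder inputs for illustration only; NOT from the birth weight model.
sse_a = 1000.0  # hypothetical Specification A (detailed)
sse_b = 1002.0  # hypothetical Specification B (drops one dummy)
f_stat = nested_f_statistic(sse_b, sse_a, n_dropped=1, df_resid_full=500)
print(f"F = {f_stat:.2f}")
```

With one dropped term and a large sample, the F statistic would be compared against a critical value of roughly 3.84 at p < .05. A smaller value suggests the simpler specification does not fit statistically significantly worse.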

Combining two modeled categories with one another
Testing differences of s from one model To formally test statistical significance of differences between coefficients, e.g., H0: βj = βk, calculate the test statistic: Divide the difference between the estimated coefficients (j − i) by the standard error of the difference Compare the value of the test statistic against the critical value with one degree of freedom Another possibility is to combine two categories currently being modeled in Specification A. Here we use the approach explained in testing differences between coefficients in one model – in the podcast by that name. As a review, we would calculate the test statistic as follows Read bullets

Standard error of the difference
- The standard error of the difference is calculated:
  √[var(βj) – (2 × cov(βj, βk)) + var(βk)]
  - var(βj) and var(βk) are the variances of βj and βk, respectively
  - cov(βj, βk) is the covariance between βj and βk
- The complete variance-covariance matrix for a regression can be requested as part of the output
- The variance of each coefficient can be calculated from its standard error (s.e.): var(βj) = [s.e.(βj)]²

As a review, the standard error of a difference between coefficients βj and βk from the same multivariate model is calculated from their variances and covariance. (Read slide.)

Example 2: Testing whether β<HS = β=HS
From the table, <HS = –55.5 and =HS = –53.9 The difference between β<HS and β=HS is calculated β<HS – β=HS = –55.5 – (–53.9) = 1.6 For that model, var(<HS) = 370.9 var(=HS) = 218.8 cov(<HS, =HS) = 137.8 Plugging those values into the formula for the standard error of the difference yields = √[ (2 × 137.8) ] = 17.72 Suppose we want to test H0: β< HS = β= HS. From table 11.1, we have ^< HS = –55.5 and ^= HS = –53.9, for a difference of 1.6. For that model, var(^< HS) = , var(^= HS) = , and cov(^< HS, ^= HS) = I haven’t shown the complete variance-covariance matrix here Plugging those values into the formula for the standard error of the difference yields The Chicago Guide to Writing about Multivariate Analysis, 2nd Edition.

Example 2, cont.: Test statistic for β<HS = β=HS
- To calculate the test statistic, divide the difference between β<HS and β=HS by the standard error of the difference:
  (β<HS – β=HS)/s.e.(β<HS – β=HS) = –1.6/17.7 = –0.09
- In absolute value, 0.09 < 1.96 (the critical value for a two-tailed t-test at p < .05 with ∞ degrees of freedom)
- Thus we cannot reject the null hypothesis that β<HS = β=HS

Next we calculate the test statistic and compare it against the critical value. (Read slide.)
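
The comparison against the critical value can be sketched as follows. The variable names are illustrative, and the 1.96 cutoff assumes a two-sided test at p < .05 with a very large sample, as stated on the slide.

```python
# Estimated coefficients from the birth weight table.
b_lt_hs = -55.5
b_eq_hs = -53.9

# Standard error of the difference, from the previous step of the example.
se_diff = 17.72

CRITICAL_VALUE = 1.96  # two-sided p < .05, infinite degrees of freedom

test_statistic = (b_lt_hs - b_eq_hs) / se_diff

# Only the magnitude matters for a two-sided test, so compare the absolute value.
reject_null = abs(test_statistic) > CRITICAL_VALUE

print(f"test statistic = {test_statistic:.2f}")
print("reject H0 of equal coefficients:", reject_null)
```

The sign of the statistic depends only on which coefficient is subtracted from which. The conclusion is the same either way.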

TEST statement
- Alternatively, request the test statistic for equality of coefficients for pairs of coefficients as part of the regression procedure
- E.g., to test whether predicted birth weight is statistically significantly different for infants whose mothers have less than a high school education than for those whose mothers have exactly a high school education
  - Specify “TEST ‘<HS’ = ‘=HS’ ” in your SAS syntax
  - Output for H0: β<HS = β=HS reports an F-statistic of 0.01 with a p-value of 0.93
- Conclusion: No statistically significant difference between the estimated coefficients for <HS and =HS

Alternatively, we can have the software do the calculations for us by requesting an option to the regression procedure. In SAS… (Read slide.)
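
The hand calculation and the software output can be reconciled: for a single restriction such as H0: β<HS = β=HS, the F statistic equals the square of the t statistic. The sketch below checks that. It uses a large-sample normal approximation for the p-value, which is an assumption; the software would use the F distribution with the model's actual degrees of freedom.

```python
import math

# Hand-calculated values from Example 2 (difference and its standard error).
difference = -1.6
se_diff = 17.72

t_stat = difference / se_diff

# For one restriction, F = t squared.
f_stat = t_stat ** 2

# Two-sided p-value from the standard normal distribution (large-sample assumption).
p_value = math.erfc(abs(t_stat) / math.sqrt(2.0))

print(f"F = {f_stat:.2f}, p = {p_value:.2f}")
```

Rounded to two decimals, these values agree with the F-statistic of 0.01 and p-value of 0.93 reported in the output described above.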

Collapsing the education classification
- Because
  - Cannot reject the null hypothesis that β<HS = β=HS based on the estimates from the model
  - Both β<HS and β=HS are statistically significantly different from zero
  - β<HS and β=HS are empirically very similar (–55.5 and –53.9, respectively)
- Could simplify the specification by creating one dummy to capture ≤HS
  - Collapses the categories of <HS and =HS

From the results of the original specification and the test for a difference between β<HS and β=HS, we see the points listed on the slide. (Read slide.)

Test a revised education specification
- Replace Specification A
  - BW = f (<HS, =HS, other independent variables)
- With Specification C
  - BW = f (≤HS, same set of other independent variables)
- Compare overall fit of specifications A and C
  - GOF statistics
- If fit of the simpler specification is not statistically significantly worse than that of the more detailed education specification
  - It would be the parsimonious specification

To test that possibility, we could estimate a new specification, which we’ll call “Specification C,” that includes the same set of independent variables as the original specification (“Specification A”) EXCEPT that it replaces the separate dummy variables <HS and =HS with ONE dummy that captures BOTH of those levels of mother’s education combined. The reference category remains >HS, so the education comparison is now ≤HS against >HS. To see whether that specification fits the data as well as the one with the more detailed educational attainment contrasts, you would use the techniques we learned in the podcast on comparing model GOF to formally compare the overall fit of Specifications A and C.
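
Building the single ≤HS dummy for Specification C is a small recoding step. The sketch below shows one way to do it; the variable names (`lt_hs`, `eq_hs`, `le_hs`) and the toy records are hypothetical, not taken from the birth weight data.

```python
def collapse_education(lt_hs, eq_hs):
    """Return 1 if mother's education is <HS or =HS, else 0 (>HS stays the reference)."""
    if lt_hs and eq_hs:
        raise ValueError("a case cannot be coded as both <HS and =HS")
    return 1 if (lt_hs or eq_hs) else 0


# Toy records for illustration only: one per original education category.
rows = [
    {"lt_hs": 1, "eq_hs": 0},  # less than high school
    {"lt_hs": 0, "eq_hs": 1},  # high school graduate
    {"lt_hs": 0, "eq_hs": 0},  # more than high school (reference)
]

for row in rows:
    row["le_hs"] = collapse_education(row["lt_hs"], row["eq_hs"])

le_hs_values = [row["le_hs"] for row in rows]
print(le_hs_values)
```

Specification C would then include `le_hs` in place of the two original dummies, with every other independent variable left unchanged.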

Caveat about combining categories
- Only combine categories for which it makes substantive sense to do so
  - E.g., <HS and >HS aren’t adjacent ordinal categories, so you would NOT combine them with one another to compare against =HS.
  - However, for some research questions, you could combine non-Hispanic blacks with Mexican Americans because both are considered racial/ethnic minority groups in the US

Before doing such a combination, it is important to think carefully about whether it makes substantive sense to combine categories, even when empirical tests of statistical significance show that their effect sizes are similar. For example… (Read slide.)

Describing exploratory work on model specification
- Always explain in your methods or results section how you arrived at your final model specification
- Describe the criteria you used to decide which independent variables to include in both initial and final models
  - Theoretical criteria about which variables and classifications were used in initial specification
  - Empirical criteria used to test simplifications to that specification
- Theoretical criteria might override empirical criteria due to the role of that variable in your specific research question

Since this series of podcasts is linked to a book on WRITING ABOUT multivariate analysis, I will close this lecture by discussing what you should present about the exploratory work you have done to arrive at your final model specification. If you explore more than a few specifications, do NOT present the full set of coefficients, etc., for every single model you estimated! Instead, describe the logic you used to decide
- Which IVs to include in the initial, most detailed specification
- The theoretical and empirical criteria you used to test possible simplifications to arrive at a more parsimonious specification
(Read last bullet.)

Example description of exploratory work on model specification
“Although birth weight for Mexican American infants was not statistically significantly different from that of non-Hispanic white infants, because race/ethnicity is a variable of primary interest for our research question, we retained it as a separate category in the sequence of models.”
- Theoretical criteria, used if race/ethnicity is of central interest in the analysis

Here is an example of a decision made based on theoretical criteria for the specific research question, which in this case overrode results of empirical tests for similarity of coefficients on two racial/ethnic groups in the birth weight model.

Alternative description of exploratory work on model specification
“Our initial model specification compared three racial/ethnic categories (Mexican American, non-Hispanic black, and non-Hispanic white). However, birth weight for Mexican American infants was not statistically significantly different from that of non-Hispanic white infants, so we combined those two groups to create the revised reference category for race/ethnicity.”
- Empirical criteria, to be used if race/ethnicity is not a key IV

On the other hand, if race/ethnicity were not a key independent variable in your analysis, you might decide to combine Mexican American with non-Hispanic white infants based on empirical criteria that showed a lack of statistically significant difference between those groups’ birth weight (the DV).

Summary
- Initial model specifications often include a full set of independent variables (IVs) related to the substantive research question. E.g.,
  - Detailed classifications of 1+ categorical variables
  - All pertinent main effects and interaction terms
- If some of those variables are not statistically significant in the initial model, test simpler specifications to assess whether there is a statistically significant loss of fit
- Decisions about which IVs to include in the final model should be based on both theoretical and empirical criteria related to your research question and data

In summary, (read first two bullets). So, as with most aspects of research and statistical analysis, don’t blindly follow the same steps regardless of your research question and data. Think through what makes sense for your analysis, conduct the steps, and explain to your readers
- What you did
- Why you did it that way
- What you concluded

Suggested resources
Miller, J. E. The Chicago Guide to Writing about Multivariate Analysis, 2nd Edition. University of Chicago Press, chapters 11 and 15.

Suggested online resources
Podcasts on
- Testing statistical significance of differences between coefficients
- Comparing overall goodness of fit across models
- Creating variables and specifying models to test for interactions

Contact information Jane E. Miller, PhD Online materials available at The Chicago Guide to Writing about Multivariate Analysis, 2nd Edition.
