Testing whether a multivariate specification can be simplified

Jane E. Miller, PhD
The Chicago Guide to Writing about Multivariate Analysis, 2nd Edition.
This podcast discusses how to test whether a multivariate regression model can be simplified. Such simplification could involve omitting one or more variables from an initial specification, whether those are main effects or interaction terms. Before watching this podcast, you should be familiar with the material in the podcasts on interpreting coefficients from OLS and logit models, testing differences across coefficients, and comparing model goodness of fit. If you are interested in using this approach to evaluate which terms to include in an interaction model, you should also be familiar with the prior podcasts on specifying a model to test for interactions and on specification errors for interaction models.

Overview
- Initial model specifications with a full set of independent variables (IVs)
- How to test whether a simpler specification fits the data as well
- Approaches to simplifying model specification
  - Creating a reference category that combines categories
  - Collapsing other categories
- Presenting results of analyses to evaluate model specification
When estimating multivariate models to answer our research question, we usually begin with a full set of independent variables based on theoretical considerations for our topic as well as the ways our variables are measured and coded in our data. For instance, you might test a model with several different measures of socioeconomic status, each of which has multiple categories. In the example we’ve been using throughout WAMA II and the associated podcasts, we’ve estimated a model of birth weight as a function of race/ethnicity (three categories) and mother’s education (three categories). We might want to know whether it is possible to simplify that specification by combining categories of those variables while still retaining a model that fits the data well. In this podcast, I’ll work through empirical examples of how to assess whether a simpler model might be the parsimonious one for the topic and data. Then I’ll close with some illustrations of how to describe how you arrived at your final model specification.

Estimated coefficients from an OLS model of birth weight in grams
                                            Coefficient   Standard error
Intercept                                     3,317.8**        25.1
Race/Hispanic origin (Non-Hispanic white)
  Mexican American                              –23.0          22.7
  Non-Hispanic black                           –172.6**        17.5
Mother’s education (> High school; >HS)
  < High school (<HS)                           –55.5**        19.3
  = High school (=HS)                           –53.9**        14.8
** denotes p < .01. Reference categories are in parentheses.
Here is a table of coefficients from an OLS model of birth weight as a function of race and mother’s education. Other variables controlled in the model include gender, age, income, and smoking, but their coefficients are not needed for the examples I will work through.

Testing whether a specification could be simplified
When I talk about simplifying a specification, one way that could be done is to omit some of the dummy variables used to specify the association between a three-category IV and the DV.
- For a specification that involves several multi-category variables, we might be able to simplify the specification if some terms can be omitted without worsening the overall fit of the model
- E.g., for a three-category variable, might it be possible to
  - Combine one of the modeled categories with the reference category?
  - Combine the two modeled categories with one another?

Example 1: Revising the reference category for race
The first example we will look at is whether we could simplify the race/ethnic specification in our birth weight model. We see from the coefficients reported on the earlier slide that:
- The estimated coefficient for Mexican American is not statistically significantly different from zero
  - I.e., predicted birth weight is not statistically significantly different for Mexican American than for non-Hispanic white infants (the reference category)
  - βMexicanAmerican = –23.0; standard error = 22.7
- Since birth weights for those two racial/ethnic groups are so close, we could combine them to create the reference category
  - The reference category now includes BOTH non-Hispanic white AND Mexican American infants

Test a revised race/ethnicity specification
- Replace Specification A: BW = f (NHB, MA, other independent variables)
- With Specification B: BW = f (NHB, same set of other independent variables)
  - Reference category is now non-Hispanic white and Mexican American infants
- Compare overall fit of Specifications A and B
  - Goodness-of-fit (GOF) statistics
  - If the fit of the model with the simpler racial/ethnic specification is not statistically significantly worse than that of the more detailed specification, it is the parsimonious specification
To test that possibility, we estimate a new specification, which we’ll call “Specification B”, that includes the same set of independent variables as the original specification (“Specification A”) EXCEPT that it drops the dummy variable “MA”. The reference category is now non-Hispanic white plus Mexican American infants, so the racial/ethnic comparison is now non-Hispanic black against all other infants. To see whether that specification fits the data as well as the one with the more detailed racial/ethnic contrasts, use the techniques we learned in the podcast on comparing model GOF to formally compare the overall fit of Specifications A and B.
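The podcast defers the choice of GOF statistic to the earlier podcast on comparing model fit. One common option for nested OLS specifications is an F-test on the residual sums of squares; the sketch below assumes that approach. The function name and all numbers are hypothetical illustrations, not results from the birth weight model.

```python
def nested_f_statistic(rss_restricted, rss_full, n_dropped, n_obs, n_params_full):
    """F statistic for H0: the terms dropped from the full model add nothing to fit.

    rss_restricted -- residual sum of squares of the simpler model (e.g., Spec B)
    rss_full       -- residual sum of squares of the detailed model (e.g., Spec A)
    n_dropped      -- number of terms omitted from the full model
    n_obs          -- number of observations
    n_params_full  -- number of estimated coefficients in the full model,
                      including the intercept
    """
    if n_dropped <= 0:
        raise ValueError("n_dropped must be positive")
    df_residual = n_obs - n_params_full
    if df_residual <= 0:
        raise ValueError("full model has no residual degrees of freedom")
    numerator = (rss_restricted - rss_full) / n_dropped
    denominator = rss_full / df_residual
    return numerator / denominator


# Hypothetical values, chosen only to make the arithmetic easy to follow.
f_stat = nested_f_statistic(
    rss_restricted=1001.0, rss_full=1000.0,
    n_dropped=1, n_obs=1011, n_params_full=11,
)

# 3.84 is the 5% critical value of F with 1 numerator df and a very large
# denominator df (it equals 1.96 squared).
CRITICAL_F_1_INF = 3.84
simpler_model_is_acceptable = f_stat < CRITICAL_F_1_INF
print(f_stat, simpler_model_is_acceptable)
```

With these made-up inputs the F statistic falls below the critical value, which is the situation in which the simpler specification would be preferred on empirical grounds.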

Combining two modeled categories with one another
Another possibility is to combine two categories currently being modeled in Specification A. Here we use the approach explained in the podcast on testing differences between coefficients in one model. As a review, we would calculate the test statistic as follows:
- Testing differences of βs from one model
  - To formally test statistical significance of differences between coefficients, e.g., H0: βj = βk, calculate the test statistic:
  - Divide the difference between the estimated coefficients (βj − βk) by the standard error of the difference
  - Compare the value of the test statistic against the critical value with one degree of freedom

Standard error of the difference
As a review, the standard error of a difference between coefficients βj and βk from the same multivariate model is calculated as follows:
- s.e.(βj − βk) = √[var(βj) − (2 × cov(βj, βk)) + var(βk)]
  - var(βj) and var(βk) are the variances of βj and βk, respectively
  - cov(βj, βk) is the covariance between βj and βk
- The complete variance-covariance matrix for a regression can be requested as part of the output
- The variance of each coefficient can be calculated from its standard error (s.e.): var(βj) = [s.e.(βj)]²
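For a difference of two coefficients, the covariance term enters with a negative sign. A short derivation, using the general rule for the variance of a linear combination with weights a = 1 and b = −1 (the hat notation for estimated coefficients is an addition here):

```latex
\begin{aligned}
\operatorname{var}(aX + bY) &= a^2\operatorname{var}(X) + b^2\operatorname{var}(Y) + 2ab\,\operatorname{cov}(X, Y) \\
\operatorname{var}(\hat\beta_j - \hat\beta_k)
  &= \operatorname{var}(\hat\beta_j) + \operatorname{var}(\hat\beta_k)
     - 2\,\operatorname{cov}(\hat\beta_j, \hat\beta_k) \\
\text{s.e.}(\hat\beta_j - \hat\beta_k)
  &= \sqrt{\operatorname{var}(\hat\beta_j)
     - 2\,\operatorname{cov}(\hat\beta_j, \hat\beta_k)
     + \operatorname{var}(\hat\beta_k)}
\end{aligned}
```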

Example 2: Testing whether β<HS = β=HS
Suppose we want to test H0: β<HS = β=HS.
- From the table (table 11.1 in the book), β<HS = –55.5 and β=HS = –53.9
- The difference between β<HS and β=HS is calculated:
  β<HS – β=HS = –55.5 – (–53.9) = –1.6, or 1.6 in absolute value
- For that model,
  - var(β<HS) = 370.87
  - var(β=HS) = 218.79
  - cov(β<HS, β=HS) = 137.83
- Plugging those values into the formula for the standard error of the difference yields
  √[370.87 − (2 × 137.83) + 218.79] = 17.72
I haven’t shown the complete variance-covariance matrix here.
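The arithmetic for the standard error of the difference can be checked with a few lines of Python. This is a minimal sketch: the function and variable names are made up for illustration, and the inputs are the variances and covariance quoted in this example.

```python
import math


def se_of_difference(var_j, var_k, cov_jk):
    """Standard error of (b_j - b_k) for two coefficients from the same model."""
    variance_of_difference = var_j + var_k - 2.0 * cov_jk
    if variance_of_difference < 0:
        raise ValueError("inputs imply a negative variance; check the values")
    return math.sqrt(variance_of_difference)


# Values quoted in the example for the two education dummies.
var_lt_hs = 370.87   # variance of the <HS coefficient
var_eq_hs = 218.79   # variance of the =HS coefficient
cov_lt_eq = 137.83   # covariance between the two coefficients

se_diff = se_of_difference(var_lt_hs, var_eq_hs, cov_lt_eq)
print(round(se_diff, 2))  # prints 17.72
```

Note that adding the covariance term instead of subtracting it would give a much larger value (about 29.4), which does not match the 17.72 reported in the example.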

Example 2, cont.: Test statistic for β<HS = β=HS
Next we calculate the test statistic and compare it against the critical value.
- To calculate the test statistic, divide the difference between β<HS and β=HS by the standard error of the difference:
  |β<HS – β=HS| / s.e.(β<HS – β=HS) = 1.6/17.7 = 0.09
- 0.09 < 1.96 (the critical value for a two-tailed t-test at p < .05 with ∞ degrees of freedom)
- Thus we cannot reject the null hypothesis that β<HS = β=HS

TEST statement
Alternatively, we can have the software do the calculations for us by requesting an option to the regression procedure.
- Request the test statistic for equality of pairs of coefficients as part of the regression procedure
- E.g., to test whether predicted birth weight is statistically significantly different for infants whose mothers have <HS than for those whose mothers have =HS
  - Specify “TEST ‘<HS’ = ‘=HS’ ” in your SAS syntax
  - Output for H0: β<HS = β=HS reports an F-statistic of 0.01 with a p-value of 0.93
- Conclusion: No statistically significant difference between the estimated coefficients for <HS and =HS
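The software output can be reconciled with the hand calculation: for a single restriction, the F-statistic is the square of the t-type statistic. The sketch below assumes a large-sample normal approximation for the p-value (the transcript does not give the exact degrees of freedom), and the variable names are illustrative.

```python
import math

# Inputs from the worked example for the two education coefficients.
difference = -55.5 - (-53.9)   # estimated b(<HS) minus b(=HS)
se_diff = 17.72                # standard error of that difference

t_stat = difference / se_diff
f_stat = t_stat ** 2           # F with 1 numerator df equals t squared

# Two-tailed p-value under a standard normal approximation (an assumption):
# 2 * (1 - Phi(|t|)) can be written as erfc(|t| / sqrt(2)).
p_value = math.erfc(abs(t_stat) / math.sqrt(2.0))

print(round(abs(t_stat), 2), round(f_stat, 2), round(p_value, 2))
```

After rounding, these values agree with the test statistic of 0.09 from the hand calculation and with the reported F-statistic of 0.01 and p-value of 0.93.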

Collapsing the education classification
From the results of the original specification and the test for a difference between β<HS and β=HS, we see that:
- We cannot reject the null hypothesis that β<HS = β=HS based on the estimates from the model
- Both β<HS and β=HS are statistically significantly different from zero
- β<HS and β=HS are empirically very similar (–55.5 and –53.9, respectively)
Because of those three results, we could simplify the specification by creating one dummy variable to capture ≤HS, which collapses the categories of <HS and =HS.

Test a revised education specification
- Replace Specification A: BW = f (<HS, =HS, other independent variables)
- With Specification C: BW = f (≤HS, same set of other independent variables)
- Compare overall fit of Specifications A and C
  - GOF statistics
  - If the fit of the simpler specification is not statistically significantly worse than that of the more detailed education specification, it would be the parsimonious specification
To test that possibility, we could estimate a new specification, which we’ll call “Specification C”, that includes the same set of independent variables as the original specification (“Specification A”) EXCEPT that it replaces the separate dummy variables <HS and =HS with ONE dummy that captures BOTH of those levels of mother’s education combined. The reference category remains >HS, so the education comparison is now ≤HS against >HS. To see whether that specification fits the data as well as the one with the more detailed educational attainment contrasts, you would use the techniques we learned in the podcast on comparing model GOF to formally compare the overall fit of Specifications A and C.
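The two education codings can be sketched as follows. The category labels, function names, and dummy-variable names are hypothetical; the point is only that Specification A uses two dummies against the >HS reference category, while Specification C uses a single ≤HS dummy against the same reference category.

```python
EDUCATION_LEVELS = ("<HS", "=HS", ">HS")  # assumed labels; ">HS" is the reference


def _check_level(education):
    if education not in EDUCATION_LEVELS:
        raise ValueError(f"unknown education level: {education!r}")


def spec_a_dummies(education):
    """Detailed coding: separate dummies for <HS and =HS."""
    _check_level(education)
    return {
        "lt_hs": int(education == "<HS"),
        "eq_hs": int(education == "=HS"),
    }


def spec_c_dummies(education):
    """Collapsed coding: one dummy for high school or less."""
    _check_level(education)
    return {"le_hs": int(education in ("<HS", "=HS"))}


for level in EDUCATION_LEVELS:
    print(level, spec_a_dummies(level), spec_c_dummies(level))
```

In both codings a mother with more than high school has every dummy equal to zero, so the reference category is unchanged. The collapsed dummy is the sum of the two detailed dummies, which is why Specification C is nested within Specification A and their fit can be compared formally.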

Caveat about combining categories
Before combining categories, it is important to think carefully about whether it makes substantive sense to do so, even when empirical tests of statistical significance show that their effect sizes are similar. For example:
- Only combine categories for which it makes substantive sense to do so
- E.g., <HS and >HS aren’t adjacent ordinal categories, so you would NOT combine them with one another to compare against =HS
- However, for some research questions, you could combine non-Hispanic blacks with Mexican Americans because both are considered racial/ethnic minority groups in the US

Describing exploratory work on model specification
- Always explain in your methods or results section how you arrived at your final model specification
- Describe the criteria you used to decide which independent variables to include in both initial and final models
  - Theoretical criteria about which variables and classifications were used in the initial specification
  - Empirical criteria used to test simplifications to that specification
- Theoretical criteria might override empirical criteria due to the role of that variable in your specific research question
Since this series of podcasts is linked to a book on WRITING ABOUT multivariate analysis, I will close this lecture by discussing what you should present about the exploratory work you have done to arrive at your final model specification. If you explore more than a few specifications, do NOT present the full set of coefficients for every single model you estimated! Instead, describe the logic you used to decide which IVs to include in the initial, most detailed specification, and the theoretical and empirical criteria you used to test possible simplifications to arrive at a more parsimonious specification.

Example description of exploratory work on model specification
“Although birth weight for Mexican American infants was not statistically significantly different from that of non-Hispanic white infants, because race/ethnicity is a variable of primary interest for our research question, we retained it as a separate category in our sequence of models.”
- Theoretical criteria, used if race/ethnicity is of central interest in the analysis
Here is an example of a decision made based on theoretical criteria for the specific research question, which in this case overrode the results of empirical tests for similarity of coefficients on two racial/ethnic groups in the birth weight model.

Alternative description of exploratory work on model specification
“Our initial model specification compared three racial/ethnic categories (Mexican American, non-Hispanic black, and non-Hispanic white). However, birth weight for Mexican American infants was not statistically significantly different from that of non-Hispanic white infants, so we combined those two groups to create the revised reference category for race/ethnicity.”
- Empirical criteria, to be used if race/ethnicity is not a key IV
On the other hand, if race/ethnicity were not a key independent variable in your analysis, you might decide to combine Mexican American with non-Hispanic white infants based on empirical criteria that showed a lack of statistically significant difference between those groups’ birth weight (the DV).

Summary
- Initial model specifications often include a full set of independent variables (IVs) related to the substantive research question, e.g.,
  - Detailed classifications of one or more categorical variables
  - All pertinent main effects and interaction terms
- If some of those variables are not statistically significant in the initial model, test simpler specifications to assess whether there is a statistically significant loss of fit
- Decisions about which IVs to include in the final model should be based on both theoretical and empirical criteria related to your research question and data
So, as with most aspects of research and statistical analysis, don’t blindly follow the same steps regardless of your research question and data. Think through what makes sense for your analysis, conduct the steps, and explain to your readers what you did, why you did it that way, and what you concluded.

Suggested resources
Miller, J. E. 2013. The Chicago Guide to Writing about Multivariate Analysis, 2nd Edition. University of Chicago Press, chapters 11 and 15.

Suggested online resources
Podcasts on
- Testing statistical significance of differences between coefficients
- Comparing overall goodness of fit across models
- Creating variables and specifying models to test for interactions

Contact information
Jane E. Miller, PhD
jmiller@ifh.rutgers.edu
Online materials available at http://press.uchicago.edu/books/miller/multivariate/index.html