Creating variables and specifying models to test for interactions between two categorical independent variables

This lecture is the third in the series on interactions. Before you watch this one, please watch the lectures on the introduction to interactions and on visualizing shapes of interaction patterns.

Jane E. Miller, PhD

Overview

- Creating variables for an interaction between two categorical variables
- Review: dummy variables
- Review: reference categories
- Aside on missing values
- Specifying a model with an interaction between two categorical variables

Here, we will learn how to create interaction terms (or variables) for interactions between two categorical independent variables, and for interactions between a continuous and a categorical independent variable. Along the way, we will review types of variables and discuss some issues related to missing values.

List of variables used in examples
- Dependent variable: birth weight in grams (BW).
- Independent variables, main effects terms:
  - Race: two nominal categories (non-Hispanic black; non-Hispanic white is the reference category). One main effect dummy variable, NHB, coded 1 = non-Hispanic black, 0 = non-Hispanic white.
  - Mother’s education: three ordinal categories (<HS; =HS; >HS is the reference category). Two main effects dummies, <HS and =HS, each coded 1 = named category, 0 = all other values.

We will use the same research question we have been tracing throughout the interactions lectures so far: testing whether the association between education and birth weight differs for black and for white infants. As a review, our source variables are listed here. Throughout this module and the subsequent ones on interpreting coefficients and calculating patterns, yellow is used to identify the main effects terms and green to identify the interaction terms.

List of variables, continued
- Interaction between race and mother’s education: two interaction term dummies, NHB_<HS and NHB_=HS.
- Each is named using the “_” convention to link the names of the component variables.
- Each is coded 1 = named category, 0 = all other values. E.g., NHB_<HS = 1 for those who are both NHB and <HS, and 0 for all other combinations of race and education.

The interaction terms NHB_<HS and NHB_=HS are defined as explained in the previous podcast. For each case, they take on the value of the product of the two component dummy variables, e.g., NHB_<HS = NHB × <HS.

Interaction between two categorical independent variables
Example: race and education

- Race is a 2-category independent variable:
  - Non-Hispanic black (NHB)
  - Non-Hispanic white (NHW) = reference category
- Mother’s educational attainment is a 3-category independent variable:
  - Less than complete high school (<HS)
  - High school diploma, no higher (=HS)
  - More than high school (>HS) = reference category

Tracing the same research question that we introduced in the overview module, we will start by considering the set of variables needed to specify a model to test an interaction between two categorical independent variables (race and mother’s educational attainment) in an OLS model of birth weight. Before we can discuss the set of variables needed for our model, let’s familiarize ourselves with the specific variables available in our data set and how they are coded.

Race is a two-category independent variable that differentiates between non-Hispanic black and non-Hispanic white infants. Mother’s educational attainment is a three-category variable with ordinal categories <HS, =HS (in other words, high school diploma, no higher), and >HS (at least some college). Note that in other data sets, you might have measures of race and education that are defined differently, but the basic points shown here about how to handle nominal and ordinal variables will apply.

Coding of variables

- Each of the dummy (also known as “binary”) variables will be coded 1 for each case that has the trait after which the variable is named, and 0 for all other cases.
- E.g., the dummy variable “NHB” will be coded 1 for all non-Hispanic black infants and 0 for all others (in this example, all non-Hispanic white infants).

For each of the categorical independent variables, we will create dummy variables coded in this way.

Reference category for an interaction
- We need a set of independent variables to uniquely identify each possible combination of race and mother’s educational attainment.
- With one 2-category variable and one 3-category variable, there are six such combinations.
- Choose one category to be the basis of comparison: the reference category.
- Define dummy variables to differentiate among the other five categories.

To correctly specify a model with interactions between two categorical independent variables, we need a set of dummy variables that collectively identify all possible combinations of the two independent variables. In this example, race is a 2-category variable and education a 3-category variable, so there are six possible combinations of race and education that need to be uniquely represented by the variables in the specification. As with main effects for categorical variables, we will choose one category against which to compare all of the others, called the reference or omitted category.
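
The counting argument above can be sketched in a few lines of Python. This is only an illustration: the category labels are taken from the slides, while the variable names (`races`, `educations`, `reference`) are made up for the example.

```python
from itertools import product

# Category labels as used in the lecture; the variable names are illustrative.
races = ["NHB", "NHW"]
educations = ["<HS", "=HS", ">HS"]

# Reference category for the interaction: the combination of the two
# reference categories chosen for the main effects.
reference = ("NHW", ">HS")

# Every possible race-by-education combination.
combinations = list(product(races, educations))

# The groups that need dummy variables to set them apart from the reference.
non_reference = [combo for combo in combinations if combo != reference]

print(len(combinations))   # number of race-by-education groups
print(len(non_reference))  # number of groups to distinguish from the reference
```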

Possible combinations of race and mother’s educational attainment

| | <HS | =HS | >HS |
|---|---|---|---|
| Non-Hispanic black | | | |
| Non-Hispanic white | | | Reference category |

Here is a schematic diagram illustrating the six possible combinations of race (in the rows) and mother’s education (in the columns). We chose non-Hispanic white as the reference category for race and >HS as the reference category for education, so the reference category for the interaction will be the combination of those two reference categories: non-Hispanic white infants born to women with more than a high school education.

Source variables used to create main effects and interaction terms
Three source variables:

- RACE, a two-category race variable, coded 1 = non-Hispanic white; 2 = non-Hispanic black
- MOMED, a three-category education variable, coded 1 = <HS; 2 = “=HS”; 3 = >HS
- IPR, a continuous income variable: annual family income (in $) divided by the Federal Poverty Level for a family of that size and age composition

On the next few slides:

- PINK = original (“source”) variable
- YELLOW = main effect term
- GREEN = interaction term

To show how those variables are created from the original source variables, in the rest of this lecture I use pink to denote the original (or source) variables, coded as shown here, yellow to identify main effects terms (or variables), and green to identify interaction terms.
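
As a rough sketch of how the main effects dummies could be derived from the source variables, the Python below uses the RACE and MOMED codes listed above. The function name and the Python-safe names `LT_HS` and `EQ_HS` (standing in for <HS and =HS) are assumptions made for this example, and the sketch assumes neither source variable is missing.

```python
def main_effect_dummies(race, momed):
    """Return the three main effects dummies for one case.

    Codes follow the slide: RACE 1 = non-Hispanic white, 2 = non-Hispanic black;
    MOMED 1 = <HS, 2 = =HS, 3 = >HS. Assumes both values are non-missing.
    """
    return {
        "NHB": 1 if race == 2 else 0,     # non-Hispanic white is the reference
        "LT_HS": 1 if momed == 1 else 0,  # stands in for the <HS dummy
        "EQ_HS": 1 if momed == 2 else 0,  # stands in for the =HS dummy; >HS is the reference
    }


# A non-Hispanic black infant whose mother did not complete high school.
print(main_effect_dummies(race=2, momed=1))

# The reference group: non-Hispanic white, mother with more than high school.
print(main_effect_dummies(race=1, momed=3))
```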

Coding of main effects and interaction terms: race/ethnicity and education
| Case characteristics (race & education) | NHB (race main effect) | <HS (education main effect) | =HS (education main effect) | NHB_<HS (interaction) | NHB_=HS (interaction) |
|---|---|---|---|---|---|
| Non-H white & <HS | 0 | 1 | 0 | 0 | 0 |
| Non-H white & =HS | 0 | 0 | 1 | 0 | 0 |
| Non-H white & >HS | 0 | 0 | 0 | 0 | 0 |
| Non-H black & <HS | 1 | 1 | 0 | 1 | 0 |
| Non-H black & =HS | 1 | 0 | 1 | 0 | 1 |
| Non-H black & >HS | 1 | 0 | 0 | 0 | 0 |

For a two-category race variable (non-Hispanic white = reference category) and a three-category educational attainment variable (>HS = reference category).

Here is a table showing each of the possible combinations of race and education, one in each row. The columns each hold one of the five dummy variables we will use to distinguish among those groups. Each of the dummy variables will be coded 1 if the infant is in that group, 0 otherwise. On the next few slides, I will walk you step by step through how those variables help us to identify (differentiate among) our six race*education categories.

Coding of main effects and interaction variables: non-Hispanic white infants
| Case characteristics (race & education) | NHB (race main effect) | <HS (education main effect) | =HS (education main effect) | NHB_<HS (interaction) | NHB_=HS (interaction) |
|---|---|---|---|---|---|
| Non-H white & <HS | 0 | 1 | 0 | 0 | 0 |
| Non-H white & =HS | 0 | 0 | 1 | 0 | 0 |
| Non-H white & >HS | 0 | 0 | 0 | 0 | 0 |

This slide looks specifically at the rows pertaining to non-Hispanic white infants. For all non-Hispanic whites, the dummy variable NHB = 0 because that trait does not apply to them; they are the reference category for race. This means that both of the interaction terms, NHB_<HS and NHB_=HS, will also equal 0 for all non-Hispanic whites, because to obtain a value of 1 one must be BOTH NHB and in the pertinent education category.

Non-Hispanic whites born to mothers with more than a high school education (bottom row shown here) have a value of 0 for each of the main effects terms, including the NHB race dummy and both the <HS and =HS dummies, because none of those characteristics pertain to them. They are the reference category, also known as the omitted category, because they are identified by 0s on all of the dummy variables pertaining to both of the categorical variables involved in this interaction.

However, white infants born to mothers with lower educational attainment each take on the value 1 for ONE of the education main effect dummies: EITHER <HS if their mother did not complete high school, OR =HS if a high school diploma is her highest earned degree.

Calculating an interaction term from two dummy main effects terms
- By convention, the interaction term is named with an “_” connecting the names of the two component variables.
- The interaction term between NHB and <HS is calculated as NHB × <HS.
- Since both component main effects terms are coded 1 for the named group and 0 for all others, NHB_<HS = 1 only when both NHB = 1 and <HS = 1.
- A value of 1 for that interaction term identifies infants with BOTH of those traits.
- E.g., for an infant who is NHW and <HS, we have 0 × 1 = 0.
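
The product rule can be checked for all four combinations of the two component dummies with a short Python sketch. The function name `interaction_term` and the argument name `lt_hs` (standing in for <HS) are illustrative, not part of the lecture.

```python
def interaction_term(nhb, lt_hs):
    """Interaction dummy as the product of two 0/1 main effects dummies."""
    return nhb * lt_hs


# Walk through every combination of the two component dummies.
for nhb in (0, 1):
    for lt_hs in (0, 1):
        print(f"NHB={nhb}, <HS={lt_hs} -> NHB_<HS={interaction_term(nhb, lt_hs)}")
```

Only the case with both component dummies equal to 1 yields an interaction term of 1.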

Coding of main effects and interaction variables: non-Hispanic black infants
| Case characteristics (race & education) | NHB (race main effect) | <HS (education main effect) | =HS (education main effect) | NHB_<HS (interaction) | NHB_=HS (interaction) |
|---|---|---|---|---|---|
| Non-H black & <HS | 1 | 1 | 0 | 1 | 0 |
| Non-H black & =HS | 1 | 0 | 1 | 0 | 1 |
| Non-H black & >HS | 1 | 0 | 0 | 0 | 0 |

For all non-Hispanic blacks, the dummy variable for the race main effect, NHB, equals 1, as we discussed before. Black infants born to mothers with >HS education will have values of 0 for both education main effects and for both race*education interaction terms, because >HS is the reference category for the educational attainment variable.

However, black infants born to mothers in the two lower educational attainment groups will take on values of the education main effects and interactions as shown in the table. For instance, black infants whose mothers did not complete high school will have both the <HS main effect and the NHB_<HS interaction term equal to 1, because the infant fits both of those characteristics. Likewise, black infants whose mothers have a high school diploma but no higher will have the dummy variables =HS and NHB_=HS equal to 1, because the infant fits both of those traits.

Put differently, the interaction terms are the product of the two pertinent main effects terms involved in that interaction, so to be coded 1 on NHB_<HS, a case must equal 1 on both the NHB and <HS main effects terms.

Coding of main effects and interaction variables: race and educational attainment
| Case characteristics (race & education) | NHB (race main effect) | <HS (education main effect) | =HS (education main effect) | NHB_<HS (interaction) | NHB_=HS (interaction) |
|---|---|---|---|---|---|
| Non-H white & <HS | 0 | 1 | 0 | 0 | 0 |
| Non-H white & =HS | 0 | 0 | 1 | 0 | 0 |
| Non-H white & >HS | 0 | 0 | 0 | 0 | 0 |
| Non-H black & <HS | 1 | 1 | 0 | 1 | 0 |
| Non-H black & =HS | 1 | 0 | 1 | 0 | 1 |
| Non-H black & >HS | 1 | 0 | 0 | 0 | 0 |

To review, each of the six possible combinations of race and education is shown in a separate row in this table. Scanning down the rows, you will see that each row has a unique combination of values of the five variables involved in the interaction, so that taken together, the values of the three main effects and two interaction terms tell us exactly which race/education group the infant is in.

Because it is the combination of values of all of those variables that tells us which group an infant is in, understanding the shape and size of the association between these two independent variables and our dependent variable will require that we look at the coefficients on the full set of variables TOGETHER. That is the topic of a later lecture.
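
The claim that each group has a unique pattern across the five dummies can be verified mechanically. The Python sketch below rebuilds the coding from the definitions given in this lecture; the function name `code_case` and the tuple layout are assumptions made for the example.

```python
from itertools import product


def code_case(race, educ):
    """Return (NHB, <HS, =HS, NHB_<HS, NHB_=HS) for one race/education group."""
    nhb = int(race == "NHB")
    lt_hs = int(educ == "<HS")
    eq_hs = int(educ == "=HS")
    # Interaction terms are products of the component main effects dummies.
    return (nhb, lt_hs, eq_hs, nhb * lt_hs, nhb * eq_hs)


groups = list(product(["NHW", "NHB"], ["<HS", "=HS", ">HS"]))
coding = {group: code_case(*group) for group in groups}

for group, values in coding.items():
    print(group, values)

# Each of the six groups should have its own pattern of dummy values.
print(len(set(coding.values())) == len(groups))
```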

Aside: Missing values

- For each new variable created, the new variable should take on a missing value if the original source variable was missing for a given case.
- This must be specified as an extra step for IF/THEN logic such as that used in creating the dummies. E.g., IF RACE = . THEN NHB = .; (in the statistical package SAS, “.” is the code for missing).
- For variables created using arithmetic, if any component source variable is missing, the result of the calculation will also be missing. E.g., if IPR = ., then IPR_NHB will also be missing.

One more important point: as you are creating the new variables for your interaction specifications, make sure they correctly assign missing values to any case that was missing on either of the variables involved in the interaction. Read your syntax carefully.
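
The same two points can be illustrated outside SAS. The Python sketch below uses `math.nan` as a stand-in for the SAS missing code “.”; that choice, the function names, and the RACE coding of 2 = non-Hispanic black (from the source variable slide) are assumptions of this example rather than the lecture's own syntax.

```python
import math

MISSING = math.nan  # stand-in for the SAS missing code "."


def is_missing(value):
    return isinstance(value, float) and math.isnan(value)


def naive_nhb(race):
    # IF/THEN logic without a missing-value step:
    # a missing RACE silently falls into the 0 category.
    return 1 if race == 2 else 0


def nhb_dummy(race):
    # IF/THEN logic with the extra step: missing in, missing out.
    if is_missing(race):
        return MISSING
    return 1 if race == 2 else 0


def ipr_nhb(ipr, nhb):
    # Arithmetic propagates missing values on its own in this representation.
    return ipr * nhb


print(naive_nhb(MISSING))        # miscoded as 0
print(nhb_dummy(MISSING))        # stays missing
print(ipr_nhb(MISSING, 1))       # missing IPR gives a missing interaction term
print(ipr_nhb(1.5, nhb_dummy(MISSING)))  # missing race also gives missing
```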

Be parsimonious in deciding which interactions to test
- As shown here, the number of variables in the regression model proliferates rapidly with each additional interaction.
- Specify interactions only between key independent variables.
- Communicating results becomes unwieldy:
  - Considerable behind-the-scenes calculations.
  - Extra tables or charts to convey the shape of the interaction.
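
To make the proliferation concrete, here is a small Python sketch that counts the terms needed for an interaction between two categorical variables, using the usual rule that a variable with k categories needs k - 1 dummies. The function name `count_terms` is made up for this example, and the 4-by-5 case is a hypothetical comparison, not one from the lecture.

```python
def count_terms(categories_a, categories_b):
    """Return (main effects dummies, interaction dummies) for two categorical variables."""
    main_effects = (categories_a - 1) + (categories_b - 1)
    interactions = (categories_a - 1) * (categories_b - 1)
    return main_effects, interactions


# Race (2 categories) by mother's education (3 categories), as in this lecture.
print(count_terms(2, 3))

# A hypothetical 4-category by 5-category interaction, for comparison.
print(count_terms(4, 5))
```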

Criteria for identifying pertinent interactions to test
- Theoretical reasons why the association between X1 and Y might differ by X2 for the particular variables you are studying.
- Empirical evidence that the association between X1 and Y varies by X2 in your data:
  - Three-way association among X1, X2, and Y.
  - See Babbie’s elaboration paradigm.

To identify which interactions pertain to your research question and data, consider both theoretical and empirical criteria. Read the previous literature to learn about theoretical reasons why the association between X1 and Y might differ by X2. Also examine the empirical evidence, both in other studies and in a simple three-way association among X1, X2, and Y in your own data. Babbie’s elaboration paradigm can be a useful systematic approach to identifying when you might need to test for interactions.

Model specification with interactions: race and education
- BW = f(race, education, race_education)
  - Birth weight is a function of race, education, and the race-by-education interaction.
- To specify the model, we need ALL of the main effects and interaction term variables related to race and mother’s education:
  - BW = f(NHB, <HS, =HS, NHB_<HS, NHB_=HS)

Initially we specify the model with all three main effects dummy variables and both interaction terms.
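
Written out as a linear (OLS) equation, the specification might look like the following. The coefficient symbols \beta_0 through \beta_5 and the error term \varepsilon are notation assumed for this sketch; the lecture itself gives only the functional form f(...), and interpreting the coefficients is left to a later lecture.

```latex
\mathrm{BW} = \beta_0
  + \beta_1\,\mathrm{NHB}
  + \beta_2\,({<}\mathrm{HS})
  + \beta_3\,({=}\mathrm{HS})
  + \beta_4\,(\mathrm{NHB}\_{<}\mathrm{HS})
  + \beta_5\,(\mathrm{NHB}\_{=}\mathrm{HS})
  + \varepsilon
```

For the reference category (non-Hispanic white, >HS), all five dummies equal 0, so only the intercept term \beta_0 remains in the equation.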

Parsimonious specification
- Most interaction specifications should initially include:
  - Main effects terms for all variables involved in the interaction
  - Interaction terms
- You might be able to omit some main effect or interaction terms based on:
  - Theoretical criteria
  - Empirical statistical significance tests for combining groups

For novice researchers, it is best to start with a model that includes all of the pertinent main effects and interaction terms. Later, you might be able to omit some of those terms based on theoretical criteria or tests of statistical significance. That is a more advanced topic to be covered in a later podcast.

Summary

- A model specification to test for interactions includes both main effects and interaction terms.
- The combination of those terms in the model uniquely identifies each possible combination of values of the component variables.
- The number and type of interaction terms needed depend on:
  - Type(s) of variables in the interaction.
  - Number of categories, for categorical variables in the interaction.
- For most situations, test interactions among key variables only.
- For criteria to help you decide which interactions to test for your topic and data, see the podcast on visualizing shapes of interaction patterns.

Suggested resources

- Miller, J. E. The Chicago Guide to Writing about Multivariate Analysis, 2nd Edition.
  - Chapter 16, on interactions
  - Chapter 9, on defining dummy variables
  - Chapter 8, on choice of reference category
- Cohen et al. Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences, 3rd Edition. Florence, KY: Routledge. Chapters 8 and 9.

Suggested online resources
Podcasts on:

- Introduction to interactions
- Visualizing shapes of interaction patterns
- Choosing a reference category

Suggested practice exercises
- Study guide to The Chicago Guide to Writing about Multivariate Analysis, 2nd Edition.
- Suggested course extensions for Chapter 16:
  - “Reviewing” exercises #2, 3, and 4.
  - “Applying statistics and writing” exercises #1, 2, and 3.
  - “Revising” exercises #1 and 3.

Contact information Jane E. Miller, PhD Online materials available at
