Lecture 28
Categorical variables:
–Review of slides from lecture 27 (reprint of lecture 27 categorical variables slides with typos corrected)
–Practice problem
Review of main themes from course
Directions for future study
Sex discrimination revisited
At the beginning of the class, in case study 1.2, we examined data from a sex discrimination case. There was strong evidence that male clerks were paid more than female clerks. But the bank’s defense lawyers say that this is because the males have more education and experience, i.e., there are omitted confounding variables.
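Multiple regression model for sex discrimination
Let’s look at controlling for education level first. To examine the bank’s claim, we want to look at the mean salary of men with a given education level, $\mu\{\text{salary} \mid \text{educ}, \text{male}\}$, and compare it to the mean salary of women with the same education level, $\mu\{\text{salary} \mid \text{educ}, \text{female}\}$. How do we incorporate a categorical explanatory variable into multiple regression? Dummy variables.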
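Dummy variables
Define a 0/1 dummy variable for sex, e.g., sexdummy = 1 for males and sexdummy = 0 for females. Multiple regression model: $\mu\{\text{salary} \mid \text{educ}, \text{sexdummy}\} = \beta_0 + \beta_1\,\text{educ} + \beta_2\,\text{sexdummy}$. Here $\beta_2$, the coefficient on the dummy variable for sex, is the difference in mean earnings between the populations of men and women with the same education level.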
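To make the dummy-variable idea concrete, here is a minimal sketch that fits mean salary = b0 + b1*educ + b2*sexdummy by least squares. Everything in it is an assumption for illustration: the data are made-up, noise-free toy numbers (not the bank data from case study 1.2), the coding sexdummy = 1 for males is one possible choice, and `ols` is a small hypothetical helper written here, not a JMP or library function.

```python
def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(p):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * w for v, w in zip(a[r], a[col])]
    return [a[i][p] / a[i][i] for i in range(p)]


# Hypothetical toy data, NOT the bank data.
educ = [8, 10, 12, 12, 15, 16, 8, 12, 12, 15, 16, 16]
sex = ["F"] * 6 + ["M"] * 6

# Assumed coding: 1 for males, 0 for females.
sexdummy = [1 if s == "M" else 0 for s in sex]

# Salaries generated exactly from a parallel-lines model, so the fit
# should recover the generating coefficients (4000, 100, 700).
salary = [4000 + 100 * e + 700 * d for e, d in zip(educ, sexdummy)]

# Design matrix columns: intercept, educ, sexdummy.
X = [[1, e, d] for e, d in zip(educ, sexdummy)]
b0, b1, b2 = ols(X, salary)

print("intercept:", round(b0, 4))
print("educ coefficient:", round(b1, 4))
print("sexdummy coefficient (male minus female, same educ):", round(b2, 4))
```

With real data the fit would not be exact, and the sexdummy coefficient would come with a standard error and confidence interval, which this sketch does not compute.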
Categorical variables in JMP
To color and mark the points by a categorical variable such as Sex, click the red triangle to the left of the first column and select Color or Mark by Column. Select Set Marker by Value to use a different marker for each value of the column.
Parallel Regression Lines
The model implies that the regression lines for males and females as education varies are parallel: there is no interaction between sex and education.
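One way to see why the lines are parallel is to write out the model for each sex separately. The derivation below assumes the model $\mu\{\text{salary} \mid \text{educ}, \text{sexdummy}\} = \beta_0 + \beta_1\,\text{educ} + \beta_2\,\text{sexdummy}$ and the coding sexdummy = 1 for males, 0 for females; the symbols are chosen here for illustration.

```latex
\begin{align*}
\text{Females } (\text{sexdummy} = 0):\quad
  & \mu\{\text{salary} \mid \text{educ}\} = \beta_0 + \beta_1\,\text{educ} \\
\text{Males } (\text{sexdummy} = 1):\quad
  & \mu\{\text{salary} \mid \text{educ}\} = (\beta_0 + \beta_2) + \beta_1\,\text{educ} \\
\text{Difference (males minus females)}:\quad
  & \beta_2 \quad \text{at every education level}
\end{align*}
```

Both lines have slope $\beta_1$; only the intercepts differ, by $\beta_2$.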
Plot produced by JMP version 5 in Fit Model output that shows the parallel regression lines and the actual observations.
Interactions with Dummy Variables
The model assumes that the difference between men’s and women’s mean salaries at a fixed level of education is the same for all levels of education. There might instead be an interaction between sex and education: the difference between men and women might depend on the level of education.
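Interaction Model
Multiple regression model that allows for interaction between sex and education: $\mu\{\text{salary} \mid \text{educ}, \text{sexdummy}\} = \beta_0 + \beta_1\,\text{educ} + \beta_2\,\text{sexdummy} + \beta_3\,(\text{sexdummy} \times \text{educ})$. To add the interaction in JMP, create a new column sexdummy*educ: right click on the column, select Formula and use the formula sexdummy*educ. The difference in mean salary between men and women of the same education level, $\beta_2 + \beta_3\,\text{educ}$, depends on the education level.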
The model with one continuous explanatory variable, one categorical variable and an interaction is called the separate regression lines model because the regression lines of y on the continuous explanatory variable for the two levels of the dummy variable are “separate,” neither coincident nor parallel.
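The "separate regression lines" name can be illustrated directly: with one dummy, one continuous variable and their interaction, the least-squares fit is the same as fitting a simple regression line within each group and re-expressing the two lines as b0 + b1*educ + b2*sexdummy + b3*(sexdummy*educ). The sketch below does this on made-up toy numbers (not the bank data), again assuming sexdummy = 1 for males; `fit_line` and `gap` are hypothetical helpers named here for illustration.

```python
def fit_line(x, y):
    """Simple least-squares line; returns (intercept, slope)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


# Hypothetical toy data, NOT the bank data.
educ_f = [8, 10, 12, 12, 15, 16]
sal_f = [4750, 5050, 5150, 5300, 5450, 5700]
educ_m = [8, 12, 12, 15, 16, 16]
sal_m = [4900, 5700, 5600, 6250, 6300, 6450]

# One line per group.
a_f, s_f = fit_line(educ_f, sal_f)   # females (sexdummy = 0)
a_m, s_m = fit_line(educ_m, sal_m)   # males   (sexdummy = 1)

# Re-express the two lines as interaction-model coefficients.
b0 = a_f          # intercept of the female line
b1 = s_f          # slope of the female line
b2 = a_m - a_f    # difference in intercepts
b3 = s_m - s_f    # difference in slopes (the interaction)


def gap(educ):
    """Estimated male-minus-female mean salary at a given education level."""
    return b2 + b3 * educ


for e in (8, 12, 16):
    print("educ", e, "estimated gap", round(gap(e), 2))
```

When b3 is nonzero the printed gap changes with education, which is exactly what the parallel lines model rules out.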
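Multiple regression with education, experience and sex
We can easily control for both education and experience in the sex discrimination case by adding them both to the multiple regression. A model without interactions is: $\mu\{\text{salary} \mid \text{educ}, \text{exper}, \text{sexdummy}\} = \beta_0 + \beta_1\,\text{educ} + \beta_2\,\text{exper} + \beta_3\,\text{sexdummy}$. Note that $\beta_3$ is the difference between the mean salaries of males and females of the same education and experience level.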
Course Summary
Techniques:
–Methods for comparing two groups
–Methods for comparing more than two groups (one-way ANOVA F test, multiple comparisons)
–Method for testing hypotheses about the distribution of one population of a nominal variable (chi-squared test)
–Simple and multiple linear regression for predicting a response variable based on explanatory variables and (with a randomized experiment or no omitted confounding variables) finding the causal effect of explanatory variables on a response variable.
Course Summary Cont.
Key messages:
–Always do a randomized experiment if possible. Inferences about causal effects from observational studies require the always questionable assumption that there are no omitted confounding variables. Similarly, always take a random sample if possible.
–p-values only assess whether there is strong evidence against the null hypothesis. They do not provide information about practical significance.
–Always form confidence intervals for the parameters (e.g., difference in means, regression coefficients) in addition to making point estimates and doing hypothesis tests. Confidence intervals provide information about the accuracy of the estimate and the practical significance of the finding.
Course Summary Cont.
Key messages:
–Beware of multiple comparisons and data snooping. Use the Tukey-Kramer method or Bonferroni to adjust for multiple comparisons.
–Simple/multiple linear regression is a powerful method for making predictions of a variable y based on explanatory variables. However, beware of extrapolation.
–Multiple regression can be used to control for known confounding variables in order to obtain good estimates of the causal effect of a variable on an outcome. However, if there are omitted confounding variables, the estimate of the causal effect will be biased. The sign and magnitude of the bias are indicated by the omitted variable bias formula.
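The omitted variable bias message can be checked numerically. In its in-sample form, the coefficient on sex from a regression that omits education equals the coefficient from the regression that includes education, plus (education coefficient) times (slope from regressing education on sex). The sketch below verifies that identity on made-up toy numbers; the data are hypothetical (not the bank data), the coding sexdummy = 1 for males is assumed, and `ols` is an illustrative helper written here rather than a JMP or library routine.

```python
def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(p):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * w for v, w in zip(a[r], a[col])]
    return [a[i][p] / a[i][i] for i in range(p)]


# Hypothetical toy data, NOT the bank data. Males here have more
# education on average, so education is a confounder.
educ = [8, 10, 12, 12, 12, 15, 12, 12, 15, 15, 16, 16]
sexdummy = [0] * 6 + [1] * 6
salary = [4800, 5000, 5300, 5100, 5250, 5500,
          5900, 6100, 6200, 6400, 6300, 6600]

# Short regression: salary on sexdummy only (education omitted).
_, short_sex = ols([[1, d] for d in sexdummy], salary)

# Long regression: salary on sexdummy and educ.
_, long_sex, long_educ = ols([[1, d, e] for d, e in zip(sexdummy, educ)], salary)

# Auxiliary regression: educ on sexdummy (difference in mean education).
_, delta = ols([[1, d] for d in sexdummy], educ)

# Omitted variable bias in the short regression's sex coefficient.
bias = long_educ * delta

print("short-regression sex coefficient:", round(short_sex, 4))
print("long-regression sex coefficient + bias:", round(long_sex + bias, 4))
```

The two printed values agree up to rounding, and the sign of the bias is the sign of the education coefficient times the sign of the male-minus-female difference in mean education.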
Directions for Future Study
Stat 500: Applied Regression and Analysis of Variance. Offered next fall. Natural follow-up to Stat 112, giving a more advanced treatment of the topics in 112.
Stat 501: Introduction to Nonparametric Methods and Log-linear Models. Offered this spring. Follow-up to Stat 500.
Stat 430: Probability. Will be offered next fall and next spring.
Stat 431: Statistical Inference. Will be offered next fall and next spring.
Stat 210: Sample Survey Design.
Stat 202: Intermediate Statistics.