

1 Class 8: Tues., Oct. 5
Causation, Lurking Variables in Regression (Ch. 2.4, 2.5)
Inference for Simple Linear Regression (Ch. 10.1)
Where we're headed:
– This Week: Inference (Ch. 10)
– Next Week: Transformations and Polynomial Regression (Ch. 2.6), Example Regression Analysis
– Tue., Oct. 19: Review for Midterm I
– Thu., Oct. 21: Midterm I
– Fall Break!

2 Regression without Center City Philadelphia

3 The Question of Causation
The community that ran this regression would like to increase property values. If low crime rates increase property values, the community might be able to cover the costs of increased police protection with gains in tax revenue from higher property values. The regression without Center City Philadelphia is
HousePrice = 225233.55 - 2288.6894 * CrimeRate
The community concludes that if it can cut its crime rate from 30 down to 20 incidents per 1000 population, it will increase its average house price by 2288.6894 * 10 ≈ $22,887. Is the community's conclusion justified?
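
The arithmetic behind that conclusion is simple to check. Here is a minimal sketch using the fitted coefficients from the regression above (the function name predicted_price is just for illustration):

# Fitted line from the regression without Center City Philadelphia:
# HousePrice = 225233.55 - 2288.6894 * CrimeRate
intercept = 225233.55
slope = -2288.6894

def predicted_price(crime_rate):
    """Predicted mean house price at a given crime rate (incidents per 1000)."""
    return intercept + slope * crime_rate

# Predicted change in mean house price if crime rate drops from 30 to 20
change = predicted_price(20) - predicted_price(30)
print(f"Predicted increase: ${change:,.0f}")  # about $22,887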

4 Potential Outcomes Model
Let $Y_i^{30}$ denote what the house price for community i would be if its crime rate were 30 and all other aspects of community i were held fixed, and let $Y_i^{20}$ denote what the house price for community i would be if its crime rate were 20 and all other aspects of community i were held fixed. X (crime rate) causes a change in Y (house price) for community i if $Y_i^{30} \neq Y_i^{20}$. A decrease in crime rate causes an increase in house price for community i if $Y_i^{20} > Y_i^{30}$.

5 Association is Not Causation
A regression model tells us about how the mean of Y|X is associated with changes in X. A regression model does not tell us what would happen if we actually changed X.
Possible explanations for an observed association between Y and X:
1. X causes Y
2. Y causes X
3. There is a lurking variable Z that is associated with changes in both X and Y.
Any combination of the three explanations may apply to an observed association.

6 Y Causes X Perhaps it is changes in house price that cause changes in crime rate. When house prices increase, the residents of a community have more to lose by engaging in criminal activities; this is called the economic theory of crime.

7 Lurking Variables Lurking variable for the causal relationship between X and Y: A variable Z that is associated with both X and Y. Example of lurking variable in Philadelphia crime rate data: Level of education. Level of education may be associated with both house prices and crime rate. The effect of crime rate on house price is confounded with the effect of education on house price. If we just look at data on house price and crime rate, we can’t distinguish between the effect of crime rate on house price and the effect of education on house price. Lurking variables are sometimes called confounding variables.
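
To see how a lurking variable can produce an association on its own, here is a small hypothetical simulation (not the Philadelphia data): Z drives both X and Y, X has no effect on Y, yet the regression of Y on X shows a clearly nonzero slope.

import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical lurking variable Z (think: level of education)
z = rng.normal(size=n)

# X and Y both depend on Z; X has no direct effect on Y at all
x = -1.0 * z + rng.normal(scale=0.5, size=n)  # e.g., crime rate falls as Z rises
y = 2.0 * z + rng.normal(scale=0.5, size=n)   # e.g., house price rises with Z

# Yet the least squares slope of Y on X is clearly nonzero
slope = np.polyfit(x, y, 1)[0]
print(f"Slope of Y on X: {slope:.2f}")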

8 Weekly Wages (Y) and Education (X) in March 1988 CPS Will getting an extra year of education cause an increase of $50.41 on average in your weekly wage? What are some potential lurking variables?

9 Establishing Causation
The best method is an experiment, but often an experiment is not ethically or practically possible (e.g., smoking and cancer, education and earnings).

10 Establishing Causation from an Observational Study Main strategy for learning about causation when we can’t do an experiment: Consider all lurking variables you can think of. Look at how Y is associated with X when the lurking variables are held “fixed.” We will study methods for doing this when we study multiple regression in Chapter 11.
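
As a preview of the Chapter 11 strategy, the sketch below reruns the hypothetical confounded simulation from slide 7 and adds Z to the regression; with Z held "fixed," the estimated X slope drops to near zero. (This assumes the statsmodels package is available; it is an illustrative sketch, not the textbook's worked example.)

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
z = rng.normal(size=n)                        # lurking variable
x = -1.0 * z + rng.normal(scale=0.5, size=n)  # X driven by Z
y = 2.0 * z + rng.normal(scale=0.5, size=n)   # Y driven by Z; no X effect

# Simple regression of Y on X: the slope is confounded with Z's effect
simple = sm.OLS(y, sm.add_constant(x)).fit()

# Multiple regression of Y on X and Z: Z is "held fixed"
both = sm.OLS(y, sm.add_constant(np.column_stack([x, z]))).fit()

print(f"X slope ignoring Z:      {simple.params[1]:.2f}")
print(f"X slope holding Z fixed: {both.params[1]:.2f}")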

11 Statistics and Smoking
Doctors had long observed a strong association between smoking and death from lung cancer. But did smoking cause lung cancer? There were many possible lurking variables: smokers have worse diets, drink more alcohol, and get less exercise than nonsmokers. The possibility was also raised that a genetic factor predisposes people both to nicotine addiction and to lung cancer. Statistical evidence from observational studies formed an essential part of the Surgeon General's 1964 report declaring that smoking causes lung cancer. How were objections that this evidence came entirely from observational studies overcome? This smoker said the findings "didn't frighten him at all."

12 Criteria for Establishing Causation Without an Experiment
– The association is strong.
– The association is consistent.
– Higher doses are associated with stronger responses.
– The alleged cause precedes the effect in time.
– The alleged cause is plausible.

13 Random Samples and Inference
The Current Population Survey is a monthly sample survey of the labor force behavior of American households. The data in cpswages.JMP are the weekly wages and education for a random sample of 25,631 men from the March 1988 Current Population Survey. Suppose we take random subsamples of size 25 from these data. In JMP, we can take a random sample of the data by clicking Tables, Subset, then clicking Random Sample and putting the size of the sample you want in the box Sampling Rate or Sample Size. Then click OK and a new data table will be created that consists of a random sample of the rows in the original data.
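
Outside JMP, the same subsampling step might look like the sketch below (the file name cpswages.csv and the column names wage and education are assumptions about how the data could be exported):

import pandas as pd

# Assumed CSV export of the CPS data with columns "wage" and "education"
cps = pd.read_csv("cpswages.csv")

# Random subsample of 25 rows, analogous to Tables > Subset > Random Sample
subsample = cps.sample(n=25, random_state=1)
print(subsample.head())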

14 Four Random Samples of Size 25 from cpswage.JMP

15 Least Squares Slopes in 1000 Random Samples of Size 25
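
In the same spirit, the histogram of slopes on this slide could be reproduced with a short simulation (same hypothetical file and column names as above):

import numpy as np
import pandas as pd

cps = pd.read_csv("cpswages.csv")  # assumed export with "wage" and "education"

rng = np.random.default_rng(0)
slopes = []
for _ in range(1000):
    s = cps.sample(n=25, random_state=rng)
    b1 = np.polyfit(s["education"], s["wage"], 1)[0]  # least squares slope
    slopes.append(b1)

# The spread of these slopes shows the sampling variability of the estimate
print(f"Mean of slopes: {np.mean(slopes):.2f}, SD of slopes: {np.std(slopes):.2f}")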

16 Inference Based on a Sample
The whole Current Population Survey (25,631 men ages 18-70) is a random sample from the U.S. population (roughly 75 million men ages 18-70). In most regression analyses, the data we have is a sample from some larger (hypothetical) population. We are interested in the true regression line for the larger population.
Inference questions:
– How accurate is the least squares estimate of the slope as an estimate of the true slope in the larger population?
– What is a plausible range of values for the true slope in the larger population, based on the sample?
– Is it plausible that the slope equals a particular value (e.g., 0), based on the sample?
Regression applet: http://gsbwww.uchicago.edu/fac/robert.mcculloch/research/webpage/teachingApplets/ciSLR/index.html

17 Model for Inference
For inference, we assume the simple linear regression model is true. We should first check the assumptions using residual plots and also look for outliers and influential points before making inferences.
Simple linear regression model:
– $Y_i = \beta_0 + \beta_1 X_i + e_i$
– $e_i$ has a normal distribution with mean 0 and standard deviation (SD) $\sigma_e$
– The subpopulation of Y with corresponding $X = X_i$ has a normal distribution with mean $\beta_0 + \beta_1 X_i$ and SD $\sigma_e$
– Technical note: For inference for simple linear regression, we assume we take repeated samples from the simple linear regression model with the X's set equal to the X's in the data, $X_1, \ldots, X_n$.
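
To make "repeated samples with the X's held fixed" concrete, this sketch draws one sample from the model using made-up parameter values (β0 = 100, β1 = 50, σe = 200 are illustrative, not estimates from the CPS data):

import numpy as np

rng = np.random.default_rng(0)

# Made-up parameter values for illustration only
beta0, beta1, sigma_e = 100.0, 50.0, 200.0

# The X's are held fixed at the values in the data (here, years of education)
x = np.repeat(np.arange(8, 19), 5).astype(float)

# One sample from the simple linear regression model
e = rng.normal(loc=0.0, scale=sigma_e, size=x.size)
y = beta0 + beta1 * x + e

# Least squares estimates from this one sample
b1, b0 = np.polyfit(x, y, 1)
print(f"b0 = {b0:.1f}, b1 = {b1:.1f}")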

18 Standard Error for the Slope
True model: $Y_i = \beta_0 + \beta_1 X_i + e_i$. From the sample of size n, we estimate $\beta_1$ by the least squares estimate $b_1$. In repeated samples of size n with the X's set equal to $X_1, \ldots, X_n$, the standard error $SE(b_1)$ is the "typical" absolute value of the error made in estimating $\beta_1$ by $b_1$: $SE(b_1) = s_e / \sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}$, where $s_e$ is the root mean square error of the regression.
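
Computed directly from a sample, the standard error looks like this (a sketch reusing the simulated data from the previous example):

import numpy as np

rng = np.random.default_rng(0)

# Simulated sample (same setup as the previous sketch)
x = np.repeat(np.arange(8, 19), 5).astype(float)
y = 100.0 + 50.0 * x + rng.normal(scale=200.0, size=x.size)

n = x.size
b1, b0 = np.polyfit(x, y, 1)

# Root mean square error, with n - 2 degrees of freedom
resid = y - (b0 + b1 * x)
s_e = np.sqrt(np.sum(resid**2) / (n - 2))

# SE(b1) = s_e / sqrt(sum of squared deviations of the X's)
se_b1 = s_e / np.sqrt(np.sum((x - x.mean())**2))
print(f"b1 = {b1:.2f}, SE(b1) = {se_b1:.2f}")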

19 Full Data Set vs. Random Sample of Size 25

20 Confidence Intervals
Confidence interval: a range of values that are plausible for a parameter given the data.
95% confidence interval: an interval that 95% of the time will contain the true parameter.
Approximate 95% confidence interval: estimate of parameter ± 2*SE(estimate of parameter).
Approximate 95% confidence interval for the slope: $b_1 \pm 2 \cdot SE(b_1)$. For the wage-education data, substitute the least squares slope and its standard error from the regression output.
Interpretation of a 95% confidence interval: it is most plausible that the true slope is in the 95% confidence interval. It is possible that the true slope is outside the 95% confidence interval, but unlikely; the confidence interval will fail to contain the true slope only 5% of the time in repeated samples.
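
Continuing the simulated example, the approximate interval is one line of arithmetic:

import numpy as np

rng = np.random.default_rng(0)
x = np.repeat(np.arange(8, 19), 5).astype(float)
y = 100.0 + 50.0 * x + rng.normal(scale=200.0, size=x.size)

n = x.size
b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)
s_e = np.sqrt(np.sum(resid**2) / (n - 2))
se_b1 = s_e / np.sqrt(np.sum((x - x.mean())**2))

# Approximate 95% CI: estimate +/- 2 * SE(estimate)
print(f"Approximate 95% CI for slope: ({b1 - 2*se_b1:.2f}, {b1 + 2*se_b1:.2f})")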

21 Conf. Intervals for Slope in JMP
After Fit Line, right-click in the parameter estimates table, go to Columns and click on Lower 95% and Upper 95%. The exact 95% confidence interval is close to, but not equal to, the approximate interval $b_1 \pm 2 \cdot SE(b_1)$.
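
The exact interval replaces the 2 with a t quantile with n - 2 degrees of freedom (a sketch assuming scipy is available):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.repeat(np.arange(8, 19), 5).astype(float)
y = 100.0 + 50.0 * x + rng.normal(scale=200.0, size=x.size)

n = x.size
b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)
s_e = np.sqrt(np.sum(resid**2) / (n - 2))
se_b1 = s_e / np.sqrt(np.sum((x - x.mean())**2))

# Exact 95% CI uses the t quantile with n - 2 degrees of freedom, not 2
t_crit = stats.t.ppf(0.975, df=n - 2)
print(f"Exact 95% CI for slope: ({b1 - t_crit*se_b1:.2f}, {b1 + t_crit*se_b1:.2f})")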

22 Confidence Intervals and the Polls
Margin of error = 2*SE(Estimate). A 95% CI for the Bush-Kerry difference is the estimated difference between Bush's and Kerry's proportions plus or minus the margin of error.
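
For a single poll in which each respondent names one candidate, the standard error of the difference in proportions can be computed as below (made-up illustrative numbers, not the actual 2004 poll results):

import numpy as np

# Hypothetical poll: n respondents, sample proportions for each candidate
n = 1000
p_bush, p_kerry = 0.49, 0.47  # made-up illustrative values

# For one multinomial sample, Var(p1 - p2) = (p1 + p2 - (p1 - p2)^2) / n
diff = p_bush - p_kerry
se_diff = np.sqrt((p_bush + p_kerry - diff**2) / n)

# Margin of error = 2 * SE; approximate 95% CI for the difference
moe = 2 * se_diff
print(f"Difference: {diff:.3f}, 95% CI: ({diff - moe:.3f}, {diff + moe:.3f})")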

23 Why Do the Polls Sometimes Disagree So Much?

24 Assumptions for Validity of Confidence Interval
The margin of error in a confidence interval covers only random sampling errors under the assumed random sampling model; the confidence interval's "95% guarantee" assumes the model is correct. In presidential polls, it must be determined who is "likely to vote." Different polls use different models for determining who is likely to vote, and the margin of error assumes the poll's model is correct. For simple linear regression, the confidence interval for the slope assumes the simple linear regression model is correct; if the model is not correct, the "95% guarantee" (that the interval will contain the true slope 95% of the time) is not valid. Always check the assumptions of the simple linear regression model before doing inference.

