Stat 112: Notes 2 Today’s class: Section 3.3. –Full description of simple linear regression model. –Checking the assumptions of the simple linear regression model. –Inferences for simple linear regression model.
Wages and Education A random sample of 100 men (ages 18-70) was surveyed about their weekly wages in 1988 and their education (part of the 1988 March U.S. Current Population Survey; data in file wagedatasubset.JMP). How much more on average do men with one extra year of education make? If a man has a high school diploma but no further education, what’s the best prediction of his earnings? Regression addresses these two questions. X = Education, Y = Weekly Wage.
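A minimal Python sketch of how a least squares line answers both questions: the slope estimates the average wage difference per extra year of education, and the fitted line evaluated at 12 years gives the prediction for a high school graduate. The numbers below are made up for illustration and are not the wagedatasubset.JMP data; the helper name least_squares and the use of 12 years for a high school diploma are assumptions of this sketch.

```python
def least_squares(x, y):
    """Return (intercept b0, slope b1) of the least squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical illustrative values, NOT taken from wagedatasubset.JMP.
education = [8, 10, 12, 12, 14, 16, 16, 18]            # years of education (X)
weekly_wage = [300, 380, 450, 470, 540, 640, 620, 720]  # weekly wage in dollars (Y)

b0, b1 = least_squares(education, weekly_wage)

# Question 1: estimated average wage difference per extra year of education.
print("slope b1:", b1)

# Question 2: predicted weekly wage at 12 years of education
# (assumed here to correspond to a high school diploma).
print("prediction at X = 12:", b0 + b1 * 12)
```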
Simple Linear Regression Model
Sample vs. Population We can view the data $(X_1, Y_1), \ldots, (X_n, Y_n)$ as a sample from a population. Our goal is to learn about the relationship between X and Y in the population: –We don’t care about the particular 100 men sampled but about the population of US men ages 18-70. –From Notes 1, we don’t care about the relationship between tracks counted and the density of deer for the particular sample, but about the relationship among the population of all tracks; this enables us to predict the density of deer in the future from the number of tracks counted.
Simple Linear Regression Model
Assumptions of the Simple Linear Regression Model
Checking the Assumptions
Residual Plot
Checking Linearity Assumption
Violation of Linearity
Checking Constant Variance
Checking Normality
Checking Assumptions It is important to check the assumptions of a regression model because the inferences depend on the assumptions approximately holding. The assumptions don’t need to hold exactly but only approximately. We will study more about checking assumptions and how to deal with violations of the assumptions in Chapters 5 and 6.
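As a rough illustration of what checking the assumptions involves, the sketch below computes residuals from a least squares fit and compares their spread for low and high values of X, a crude numeric stand-in for looking at a residual plot for non-constant variance. The data are simulated, and the split at the median of X is an assumption of this sketch, not a procedure from the course.

```python
import random
import statistics


def least_squares(x, y):
    """Return (intercept b0, slope b1) of the least squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def residuals(x, y):
    """Residual = observed y minus fitted value on the least squares line."""
    b0, b1 = least_squares(x, y)
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


# Simulated data (an assumption for illustration): linear mean,
# normal errors with constant standard deviation 60.
random.seed(0)
x = [random.uniform(8, 18) for _ in range(100)]
y = [50 + 40 * xi + random.gauss(0, 60) for xi in x]

res = residuals(x, y)

# Least squares residuals always average to zero (up to rounding).
print("mean residual:", statistics.mean(res))

# Crude constant-variance check: compare residual spread for X below
# and above the median of X. Similar values are consistent with
# constant variance; very different values suggest a violation.
x_med = statistics.median(x)
low = [r for xi, r in zip(x, res) if xi <= x_med]
high = [r for xi, r in zip(x, res) if xi > x_med]
print("residual SD, low X: ", statistics.stdev(low))
print("residual SD, high X:", statistics.stdev(high))
```

Plotting the residuals against X remains the more informative check; the two standard deviations only summarize one feature of that plot.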
Inferences
Sampling Distribution of $b_0, b_1$ The sampling distribution of $b_0, b_1$ describes the probability distribution of the estimates over repeated samples from the simple linear regression model. Understanding the sampling distribution is the key to drawing inferences from the sample to the population.
Sampling distribution in wage data To see how the least squares estimates can differ over different samples from the population, we consider the “population” to be all 25,632 men surveyed in the March 1988 Current Population Survey in wagedata1988.JMP and the sample to be random samples of size 100 like the one in wagedatasubset.JMP.
Samples of wage data To take samples in JMP, click the Tables menu, then click Subset and then click the circle next to Random Sample Size and set the sample size. JMP will create a new data subset which is a random sample of the original data set.
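The same repeated-sampling idea can be sketched outside JMP. The Python sketch below draws many random samples of size 100 from a "population" and records the least squares slope from each. The population here is synthetic: only its size, 25,632, comes from the notes, while the wage-generating formula, the seed, and the number of repetitions are assumptions made for illustration.

```python
import random
import statistics


def least_squares_slope(x, y):
    """Return the least squares slope b1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sxx


rng = random.Random(112)  # fixed seed so the run is reproducible

# Synthetic stand-in for the population; NOT the wagedata1988.JMP values.
POP_SIZE = 25632
population = []
for _ in range(POP_SIZE):
    educ = rng.randint(8, 18)                      # years of education
    wage = 100 + 50 * educ + rng.gauss(0, 300)     # assumed wage model
    population.append((educ, wage))

# Slope computed from the whole population.
pop_slope = least_squares_slope([e for e, _ in population],
                                [w for _, w in population])

# Repeatedly draw random samples of size 100 (without replacement)
# and record the slope estimate from each sample.
slopes = []
for _ in range(1000):
    sample = rng.sample(population, 100)
    xs = [e for e, _ in sample]
    ys = [w for _, w in sample]
    slopes.append(least_squares_slope(xs, ys))

print("population slope:       ", pop_slope)
print("mean of sample slopes:  ", statistics.mean(slopes))
print("SD of sample slopes:    ", statistics.stdev(slopes))
```

The spread of the recorded slopes is an empirical picture of the sampling distribution of the slope estimate: individual samples give noticeably different slopes, but they center near the population slope.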
Sampling distributions Only the sample, not the population, is usually available, so we need to understand the sampling distribution. Sampling distribution of $b_0, b_1$: –Sampling distribution is normally distributed. –Even if the normality assumption fails, the sampling distributions of $b_0, b_1$ are still approximately normal if n>30.
Properties of $b_0$ and $b_1$ as estimators of $\beta_0$ and $\beta_1$ Unbiased Estimators: Mean of the sampling distribution is equal to the population parameter being estimated. Consistent Estimators: As the sample size n increases, the probability that the estimator will become as close as you specify to the true parameter converges to 1. Minimum Variance Estimator: The variance of the estimator is smaller than the variance of any other linear unbiased estimator of, say, $\beta_1$.
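Confidence Intervals Point Estimate: $b_1$. Confidence interval: range of plausible values for the true slope $\beta_1$. $(1-\alpha)100\%$ Confidence Interval: $b_1 \pm t_{\alpha/2,\,n-2}\, s_{b_1}$, where $s_{b_1}$ is an estimate of the standard deviation of $b_1$ ($s_{b_1} = s_e / \sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}$, with $s_e$ the residual standard deviation). Typically we use a 95% CI. 95% CI is approximately $b_1 \pm 2 s_{b_1}$. 95% CIs for a parameter are usually approximately point estimate $\pm$ 2 × (standard error of the point estimate), where the standard error of the point estimate is an estimate of the standard deviation of the point estimate.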
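A short Python sketch of the approximate 95% interval for the slope, using the rule "point estimate plus or minus 2 standard errors". The standard error is computed with the usual formula, residual standard deviation (on n - 2 degrees of freedom) divided by the square root of the sum of squared deviations of x. The function name slope_ci and the small data set are hypothetical; for an exact interval the multiplier 2 would be replaced by the appropriate t critical value.

```python
import math


def slope_ci(x, y, multiplier=2.0):
    """Return (b1, se_b1, lower, upper) for the least squares slope.

    Uses the approximate interval b1 +/- multiplier * se_b1.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar

    # Residual standard deviation, dividing by n - 2 because two
    # parameters (intercept and slope) were estimated.
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))

    # Standard error of the slope.
    se_b1 = s_e / math.sqrt(sxx)

    return b1, se_b1, b1 - multiplier * se_b1, b1 + multiplier * se_b1


# Hypothetical illustrative values, NOT the course data.
x = [1, 2, 3, 4]
y = [2, 4, 5, 8]
b1, se_b1, lower, upper = slope_ci(x, y)
print("slope:", b1)
print("standard error:", se_b1)
print("approximate 95% CI:", (lower, upper))
```

With only four points the "plus or minus 2" rule is a poor approximation; it is used here only to keep the arithmetic easy to follow.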
Computing Confidence Interval with JMP
Summary We have described the assumptions of the simple linear regression model and how to check them. We have come up with a method of describing the uncertainty in our estimates of the slope and the intercept via confidence intervals. Note: These confidence intervals are only accurate if the assumptions of the simple linear regression model are approximately correct. Next class: Hypothesis tests.