Stat 512 – Lecture 17 Inference for Regression (9.5, 9.6)
Last Time – Two Quantitative Variables. Question: Is there an association between the two variables? Graphical summary: scatterplot of response variable (vertical) vs. explanatory variable (horizontal). Description: direction, form, strength. Numerical summary: if linear, r, Pearson’s correlation coefficient, -1 ≤ r ≤ 1
Practice Problem: r = .504, r = .938
Temperatures vs. time: r = .029. Always plot the data!!
Least-Squares Regression Line. Model: least-squares regression line, which minimizes the sum of squared residuals. Response-hat = a + b × explanatory. a = intercept, predicted value when explanatory = 0. b = slope, predicted change in response associated with a 1-unit increase in the explanatory variable. Use the regression line for making predictions. Warnings: the regression line is not resistant. Influential observation = removing the value changes the regression equation. Outlier = observation with an extreme residual value
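A minimal sketch of the least-squares fit described above, using the standard closed-form formulas b = Sxy/Sxx and a = ȳ - b·x̄. The function name `least_squares_line` and the foot-length/height numbers are made up for illustration; they are not the lecture's data.

```python
def least_squares_line(x, y):
    """Return (a, b) for the line y-hat = a + b*x that minimizes
    the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sxy and Sxx: sums of cross-products and squares about the means
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx           # slope
    a = y_bar - b * x_bar   # intercept: the line passes through (x_bar, y_bar)
    return a, b


# Hypothetical data (assumption, not from the lecture)
foot_length = [22.0, 24.0, 25.0, 27.0, 29.0]
height = [62.0, 65.0, 66.0, 70.0, 72.0]

a, b = least_squares_line(foot_length, height)

# Residual = observed response minus predicted response
residuals = [yi - (a + b * xi) for xi, yi in zip(foot_length, height)]

# Prediction for a new explanatory value inside the range of the data
predicted_at_26 = a + b * 26.0
```

The residuals from a least-squares line with an intercept always sum to zero (up to rounding), which is a quick sanity check on the fit.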
R²: If we predict everyone to have the same height, lots of “unexplained” variation (SSE = ). If we take the explanatory variable into account, much less “unexplained” variation (SSE = 235). % change = ( ) = 50.6%
R²: Of the variability in the heights, 50.6% is explained by this regression on foot length. BAD interpretations: 50.6% of points lie on the line; 50.6% of predictions will be correct
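The "% change in unexplained variation" idea can be sketched as follows: compare the sum of squared errors when every prediction is the mean response with the sum of squared errors around the regression line. The function name `unexplained_variation` and the data are hypothetical, chosen only to make the example runnable.

```python
def unexplained_variation(x, y):
    """Return (sse_mean_only, sse_regression, r_squared)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    a = y_bar - b * x_bar
    # Predict the same value (the mean) for everyone
    sse_mean_only = sum((yi - y_bar) ** 2 for yi in y)
    # Predict with the regression line instead
    sse_regression = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    # Proportional reduction in unexplained variation
    r_squared = (sse_mean_only - sse_regression) / sse_mean_only
    return sse_mean_only, sse_regression, r_squared


# Hypothetical data (assumption, not from the lecture)
foot_length = [22.0, 24.0, 25.0, 27.0, 29.0, 30.0]
height = [64.0, 63.0, 68.0, 67.0, 73.0, 70.0]

sse_mean_only, sse_regression, r_squared = unexplained_variation(foot_length, height)
```

For a simple linear regression this quantity equals the square of the correlation coefficient r.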
Example 1 (cont.): Airline costs. Each flight has a ‘set up’ cost of $151, and each additional mile of travel is associated with a predicted increase in cost of about 7 cents. 19.3% of the variability in airfare is explained by this regression on distance (still lots of unexplained variability). Might investigate further why the cost for ACK was so much higher than expected
Inference in Regression. Is the relationship between the two variables statistically significant? Need to understand the behavior/variability of regression lines from different random samples
Example 1: House Prices. Observational units, variable: houses, price (quantitative). Positive, linear, moderately strong association. Larger houses tend to cost more! Predicted price = sq ft, r² = 42.1%. Perhaps houses in Northern CA tend to be a bit more expensive even for the same size, but not a huge difference
Example 1: House Prices. H₀: β = 0, no association between price and size. Hₐ: β > 0, there is a positive association between price and size
Example 1: House Prices. 1) Curvature? Not really 2) Normality? No 3) Independence? Random sample 4) Equal spread? No
Example 1: House Prices. 1) Curvature? No 2) Normality? Better 3) Independence? Random sample 4) Equal spread? Yes
Simulating Regression Lines: Sampling variability. A slope of .8899 would be quite surprising! .8899 is more than 7 standard errors from 0! p-value < .001: fewer than .1% of random samples from a population with β = 0 would see such an extreme sample slope by chance alone
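A rough sketch of this kind of simulation: draw many random samples from a population in which the true slope is 0, fit a least-squares slope to each, and see how far an observed slope sits from that null distribution. Everything numeric here (sample size, spread, the observed slope, the seed) is an assumption for illustration, not the lecture's applet settings or data.

```python
import random
import statistics


def fit_slope(x, y):
    """Least-squares slope of y on x."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


def simulate_null_slopes(x, n_samples, sigma, rng):
    """Sample slopes when the response is unrelated to x (true slope = 0)."""
    slopes = []
    for _ in range(n_samples):
        y = [rng.gauss(0.0, sigma) for _ in x]  # response ignores x entirely
        slopes.append(fit_slope(x, y))
    return slopes


rng = random.Random(512)                        # fixed seed for reproducibility
x = [rng.uniform(10, 40) for _ in range(30)]    # hypothetical explanatory values
null_slopes = simulate_null_slopes(x, n_samples=1000, sigma=5.0, rng=rng)

observed_slope = 0.30                           # hypothetical sample slope
se = statistics.stdev(null_slopes)              # empirical standard error of the slope
standard_errors_from_zero = observed_slope / se
# One-sided empirical p-value: share of null slopes at least as large
p_value = sum(s >= observed_slope for s in null_slopes) / len(null_slopes)
```

The null slopes center near 0, and their standard deviation plays the role of the standard error used to judge how surprising an observed slope is.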
Example 2: Airfare Costs. H₀: β = 0 (no association between price and distance). Hₐ: β > 0 (cities farther away are associated with more expensive flights). p-value = .002/2 = .001. Strong evidence against the null hypothesis: statistically significant evidence of a positive relationship between price and distance. BUT residual plots don’t look so great
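The .002/2 step converts a two-sided p-value, as software typically reports it, into a one-sided one. A small sketch of that conversion follows; the helper name `one_sided_p_value` is hypothetical, and the rule assumes a symmetric null distribution such as the t distribution used for slopes. Halving is only valid when the sample slope falls in the direction of the alternative.

```python
def one_sided_p_value(two_sided_p, sample_slope, alternative="greater"):
    """Convert a two-sided p-value to a one-sided one.

    Assumes a symmetric null distribution (e.g. t). `alternative` is
    "greater" for Ha: slope > 0 or "less" for Ha: slope < 0.
    """
    if alternative not in ("greater", "less"):
        raise ValueError("alternative must be 'greater' or 'less'")
    if alternative == "greater":
        in_direction_of_ha = sample_slope > 0
    else:
        in_direction_of_ha = sample_slope < 0
    half = two_sided_p / 2
    # If the sample slope points the wrong way, the one-sided p-value is large
    return half if in_direction_of_ha else 1 - half


# Airfare example: two-sided p-value .002 and a positive sample slope
# (about 7 cents per mile, written here as 0.07 dollars)
p = one_sided_p_value(0.002, sample_slope=0.07, alternative="greater")
```

Had the sample slope been negative with the same two-sided p-value, the one-sided p-value for Hₐ: β > 0 would be 1 - .001 = .999, so the direction check matters.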
Example 3: Money-Making Movies
For Tuesday: Skim non-starred sections of Ch. 11. Submit PP 15 in Blackboard by 3pm. Submit Project Report 3 in class (see syllabus for details)