Statistics: Unlocking the Power of Data Lock 5 STAT 101 Dr. Kari Lock Morgan Simple Linear Regression SECTION 2.6, 9.1 Least squares line Interpreting coefficients Prediction Cautions Inference for slope, correlation
Statistics: Unlocking the Power of Data Lock 5 ANOVA is used to test for an association between a)Two categorical variables b)One categorical and one quantitative variable c)Two quantitative variables Review
Statistics: Unlocking the Power of Data Lock 5 A 2 test is used to test for an association between a)Two categorical variables b)One categorical and one quantitative variable c)Two quantitative variables Review
Statistics: Unlocking the Power of Data Lock 5 MODELING
Statistics: Unlocking the Power of Data Lock 5 Can you estimate the temperature on a summer evening, just by listening to crickets chirp? We will fit a model to predict temperature based on cricket chirp rate Crickets and Temperature
Statistics: Unlocking the Power of Data Lock 5 Crickets and Temperature Response Variable, y Explanatory Variable, x
Statistics: Unlocking the Power of Data Lock 5 Linear Model A linear model predicts a response variable, y, using a linear function of explanatory variables Simple linear regression predicts on response variable, y, as a linear function of one explanatory variable, x
Statistics: Unlocking the Power of Data Lock 5 Regression Line Goal: Find a straight line that best fits the data in a scatterplot
Statistics: Unlocking the Power of Data Lock 5 Equation of the Line The estimated regression line is InterceptSlope Slope: increase in predicted y for every unit increase in x Intercept: predicted y value when x = 0
Statistics: Unlocking the Power of Data Lock 5 Regression in StatKey
Statistics: Unlocking the Power of Data Lock 5 Regression in RStudio
Statistics: Unlocking the Power of Data Lock 5 Regression Model Which is a correct interpretation? a) The average temperature is b) For every extra 0.23 chirps per minute, the predicted temperate increases by 1 degree c) Predicted temperature increases by 0.23 degrees for each extra chirp per minute d) For every extra 0.23 chirps per minute, the predicted temperature increases by
Statistics: Unlocking the Power of Data Lock 5 Units It is helpful to think about units when interpreting a regression equation y units x units degrees chirps per minute degrees/ chirps per min
Statistics: Unlocking the Power of Data Lock 5 Prediction The regression equation can be used to predict y for a given value of x If you listen and hear crickets chirping about 140 times per minute, your best guess at the outside temperature is
Statistics: Unlocking the Power of Data Lock 5 Prediction
Statistics: Unlocking the Power of Data Lock 5 Prediction If the crickets are chirping about 180 times per minute, your best guess at the temperature is (a) 60 (b) 70 (c) 80
Statistics: Unlocking the Power of Data Lock 5 Prediction The intercept tells us that the predicted temperature when the crickets are not chirping at all is . Do you think this is a good prediction? (a) Yes (b) No
Statistics: Unlocking the Power of Data Lock 5 Regression Caution 1 Do not use the regression equation or line to predict outside the range of x values available in your data (do not extrapolate!) If none of the x values are anywhere near 0, then the intercept is meaningless!
Statistics: Unlocking the Power of Data Lock 5 Duke Rank and Duke Shirts a)positively associated b)negatively associated c)not associated d)other Are the rank of Duke among schools applied to and the number of Duke shirts owned
Statistics: Unlocking the Power of Data Lock 5 Regression Caution 2 Computers will calculate a regression line for any two quantitative variables, even if they are not associated or if the association is not linear ALWAYS PLOT YOUR DATA! The regression line/equation should only be used if the association is approximately linear
Statistics: Unlocking the Power of Data Lock 5 Regression Caution 3 Outliers (especially outliers in both variables) can be very influential on the regression line ALWAYS PLOT YOUR DATA! l.aspx?ID=L455
Statistics: Unlocking the Power of Data Lock 5 Life Expectancy and Birth Rate Coefficients: (Intercept) LifeExpectancy Which of the following interpretations is correct? (a)A decrease of 0.89 in the birth rate corresponds to a 1 year increase in predicted life expectancy (b)Increasing life expectancy by 1 year will cause birth rate to decrease by 0.89 (c)Both (d)Neither
Statistics: Unlocking the Power of Data Lock 5 Regression Caution 4 Higher values of x may lead to higher (or lower) predicted values of y, but this does NOT mean that changing x will cause y to increase or decrease Causation can only be determined if the values of the explanatory variable were determined randomly (which is rarely the case for a continuous explanatory variable)
Statistics: Unlocking the Power of Data Lock 5 Explanatory and Response Unlike correlation, for linear regression it does matter which is the explanatory variable and which is the response
Statistics: Unlocking the Power of Data Lock 5 Regression Line How do we find the best fitting line???
Statistics: Unlocking the Power of Data Lock 5 Predicted and Actual Values
Statistics: Unlocking the Power of Data Lock 5 Predicted and Actual Values
Statistics: Unlocking the Power of Data Lock 5 Residual The residual is also the vertical distance from each point to the line
Statistics: Unlocking the Power of Data Lock 5 Residual Want to make all the residuals as small as possible. How would you measure this?
Statistics: Unlocking the Power of Data Lock 5 Least Squares Regression Least squares regression chooses the regression line that minimizes the sum of squared residuals
Statistics: Unlocking the Power of Data Lock 5 Least Squares Regression
Statistics: Unlocking the Power of Data Lock 5 Sample to Population Everything we have done so far is based solely on sample data Now, we will extend from the sample to the population
Statistics: Unlocking the Power of Data Lock 5 Intercept Slope Simple Linear Model Random error
Statistics: Unlocking the Power of Data Lock 5 Inference for the Slope
Statistics: Unlocking the Power of Data Lock 5 Inference for Slope Give a 95% confidence interval for the true slope. Is the slope significantly different from 0? (a) Yes (b) No
Statistics: Unlocking the Power of Data Lock 5 Confidence Interval
Statistics: Unlocking the Power of Data Lock 5 Hypothesis Test
Statistics: Unlocking the Power of Data Lock 5 Test for a correlation between temperature and cricket chirps (r = ). Correlation Test
Statistics: Unlocking the Power of Data Lock 5 Two Quantitative Variables The t-statistic (and p-value) for a test for a non-zero slope and a test for a non-zero correlation are identical! They are equivalent ways of testing for an association between two quantitative variables.
Statistics: Unlocking the Power of Data Lock 5 Small Samples The t-distribution is only appropriate for large samples (definitely not n = 7)! We should have done inference for the slope using simulation methods...
Statistics: Unlocking the Power of Data Lock 5
r = 0 Challenge: If the correlation between x and y is 0, what would the regression line be?
Statistics: Unlocking the Power of Data Lock 5 To Do Read Section 2.6, 9.1 Do HW 7 (due Monday, 3/31) NO LATE HOMEWORK ACCEPTED – SOLUTIONS WILL BE POSTED IMMEDIATELY AFTER CLASS TO HELP YOU PREPARE FOR EXAM 2