7/2/ Lecture 51 STATS 330: Lecture 5
Tutorials: These will cover computing details. Held in the basement floor tutorial lab, Building 303 South (303S B75). Tutorial 1: Wednesday. Tutorial 2: Friday. Tutorial 2: Friday 2-3. Start this Thursday.
My office hours: Tuesday 10:30 – 12:00 and Thursday 10:30 – 12:00, Room 265, Building 303S. Come up to the second floor in the Computer Science lifts, go through the doors and turn right.
Tutor office hours: Tuesday 1-4 pm; Thursday 12-2, 3-4 pm; Friday 1-3 pm. All in Room (First floor, Building 303).
Next week: Monday and Thursday lectures by Arden Miller. No lecture Tuesday. Don’t forget the assignment due Aug 4.
Today’s lecture: The multiple regression model. Aims of the lecture: to describe the type of data suitable for multiple regression; to review the multiple regression model; to describe some plots that check the suitability of the model.
Suitable data: Suppose our data frame contains a numeric “response” variable $Y$ and one or more numeric “explanatory” variables $X_1, \dots, X_k$ (also called covariates or independent variables), and we want to “explain” (model, predict) $Y$ in terms of the explanatory variables, e.g. explain the volume of cherry trees in terms of height and diameter, or explain NOx in terms of C and E.
The regression model: The model assumes that (1) the responses are normally distributed with means $\mu$ (each response has a different mean) and constant variance $\sigma^2$; (2) the mean response of a typical observation depends on the covariates through a linear relationship $\mu = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$; (3) the responses are independent.
The regression plane: The relationship $\mu = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$ can be visualized as a plane. The plane is determined by the coefficients $\beta_0, \beta_1, \dots, \beta_k$, e.g. for $k = 2$ ($k$ = number of covariates).
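Figure (regression plane, $k = 2$): intercept $\beta_0$, slope $\beta_1$ in the $x_1$ direction, slope $\beta_2$ in the $x_2$ direction.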
The regression model (cont): The data are scattered above and below the plane. The size of the “sticks” is random, controlled by $\sigma^2$, and doesn’t depend on $x_1$, $x_2$.
Alternative form of model: $Y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + \varepsilon$, where $\varepsilon$ is normally distributed with mean zero and variance $\sigma^2$.
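To make the error form of the model concrete, here is a minimal R sketch that simulates data from it with two covariates. The sample size, coefficient values, covariate ranges and the name `sim.df` are illustrative assumptions, not values from the lecture.

```r
# Simulate n observations from Y = b0 + b1*x1 + b2*x2 + e, e ~ N(0, sigma^2).
# All numeric values below are assumptions chosen for illustration.
set.seed(330)
n     <- 200
beta  <- c(b0 = 10, b1 = 2, b2 = -3)
sigma <- 1.5

x1 <- runif(n, 0, 10)
x2 <- runif(n, 0, 10)

# Mean response: the height of the regression plane at (x1, x2)
mu <- beta[["b0"]] + beta[["b1"]] * x1 + beta[["b2"]] * x2

# Observed response: plane height plus an independent normal "stick"
y <- mu + rnorm(n, mean = 0, sd = sigma)

sim.df <- data.frame(y, x1, x2)
head(sim.df)
```

The spread of `y - mu` is governed only by `sigma`, not by `x1` or `x2`, which is the constant-variance assumption.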
Checking the data: How can we tell when this model is suitable? Suitable = data randomly scattered about a plane, “planar” for short. For $k = 1$, draw a scatterplot of y vs x. For $k = 2$, use a spinner and look for an “edge”. For $k > 2$, use coplots: if the data are planar, the plots should be parallel: coplot(y~x1|x2*x3)
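As a rough illustration of the coplot check, the sketch below simulates planar data with three covariates and draws the coplot. The data, the coefficients and the name `planar.df` are assumptions made up for this example. `panel.smooth` is the standard graphics panel function that adds a smoother to each panel.

```r
# Simulated planar data with three covariates (illustrative values only)
set.seed(1)
n  <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + 0.5 * x3 + rnorm(n)
planar.df <- data.frame(y, x1, x2, x3)

# One panel of y vs x1 for each combination of x2 and x3 intervals;
# the smoother in each panel makes the slopes easier to compare
coplot(y ~ x1 | x2 * x3, data = planar.df, panel = panel.smooth)
```

Because the data really are planar here, the smoothed curves should be roughly straight with similar slopes in every panel.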
Interpretation of coefficients: The regression coefficient $\beta_0$ gives the average response when all covariates are zero. The regression coefficient $\beta_1$ gives the slope of the plane in the $x_1$ direction; measures the average increase in the response for a unit increase in $x_1$ when all other covariates are held constant; is the slope of plots of y versus x1 in a coplot; and is not necessarily the slope of y versus x1 in a scatter plot.
Example: Consider data from the model y = 5 - 1*x1 + 4*x2 + N(0,1), with x1 and x2 highly correlated, r = 0.95. The regression of y on x1 alone has slope 2.767.
Scatter plot: y vs x1 (slope 2.767).
Coplot: slopes are all about -1.
Points to note: The regression coefficients $\beta_1$ and $\beta_2$ refer to the conditional mean of the response, given the covariates (in this case conditional on x1 and x2). $\beta_1$ is the slope in the coplot, conditioning on x2. The coefficient in the plot of y versus x1 refers to a different conditional distribution (conditional on x1 only). The two are the same if and only if x1 and x2 are uncorrelated.
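The example can be imitated with a short simulation. This is a sketch under assumptions: the lecture's actual data are not available, so x1 and x2 are generated here as standard normal variables with population correlation 0.95, and the sample size and seed are arbitrary. With unit variances, the population slope of y on x1 alone is -1 + 4 * 0.95 = 2.8, in line with the 2.767 quoted above.

```r
set.seed(5)
n  <- 1000
x1 <- rnorm(n)
# Construct x2 so that it has unit variance and correlation 0.95 with x1
x2 <- 0.95 * x1 + sqrt(1 - 0.95^2) * rnorm(n)
y  <- 5 - 1 * x1 + 4 * x2 + rnorm(n)

# Slope of y on x1 alone (conditional on x1 only)
marginal <- coef(lm(y ~ x1))[["x1"]]

# Coefficient of x1 with x2 in the model (conditional on x1 and x2)
partial <- coef(lm(y ~ x1 + x2))[["x1"]]

c(marginal = marginal, partial = partial)
```

The marginal slope comes out positive and near 2.8, while the coefficient from the two-covariate fit is near -1, the slope seen in the coplot.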
Estimation of the coefficients: We estimate the (unknown) regression plane by the “least squares plane” (best fitting plane). Best fitting plane = the plane that minimizes the sum of squared vertical deviations from the plane. That is, minimize the least squares criterion $\sum_{i=1}^{n} (y_i - b_0 - b_1 x_{i1} - \dots - b_k x_{ik})^2$ over $b_0, b_1, \dots, b_k$.
Estimation of the coefficients (2): The R command lm calculates the coefficients of the best fitting plane. This function finds the solution of the normal equations, a set of linear equations derived by differentiating the least squares criterion with respect to the coefficients.
Math Stuff (only if you like it): Arrange the data on the response and covariates into a vector $y = (y_1, \dots, y_n)^T$ and a matrix $X$ whose $i$-th row is $(1, x_{i1}, \dots, x_{ik})$.
Math Stuff: Normal equations. The estimates $\hat\beta$ are solutions of $X^T X \hat\beta = X^T y$.
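The normal equations can be solved directly in R and compared with lm. This sketch assumes that R's built-in `trees` data (columns `Girth`, `Height`, `Volume`, where `Girth` is documented as the tree diameter) is an acceptable stand-in for the course's cherry tree data. The object names are made up for the example.

```r
# Assumption: the built-in `trees` data stands in for the course data frame
X <- cbind(1, trees$Girth, trees$Height)   # n x (k + 1) matrix, first column all 1s
y <- trees$Volume

# Solve the normal equations (X'X) b = X'y
b.normal <- solve(crossprod(X), crossprod(X, y))

# Coefficients from lm for comparison
b.lm <- coef(lm(Volume ~ Girth + Height, data = trees))

cbind(normal.equations = drop(b.normal), lm = b.lm)
```

The two columns agree up to rounding error. In practice lm is preferable to forming $X^T X$ by hand, since it is more careful numerically.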
Calculation of best fitting plane:
> lm(y~x1+x2)
Call:
lm(formula = y ~ x1 + x2)
Coefficients:
(Intercept)  x1  x2
How well does the plane fit? Judge this by examining the residuals and fitted values. Each observation has a fitted value. Notation: $Y_i$ is the response for observation $i$; $x_{i1}, x_{i2}$ are the values of the explanatory variables for observation $i$. The fitted plane is $\hat y = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2$. The fitted value for observation $i$ is $\hat y_i = \hat\beta_0 + \hat\beta_1 x_{i1} + \hat\beta_2 x_{i2}$ (the height of the fitted plane at $(x_{i1}, x_{i2})$).
How well does the plane fit? (cont): The residual is the difference between the response and the fitted value: $e_i = y_i - \hat y_i$.
How well does the plane fit? (cont): For a good fit, the residuals must be small relative to the y’s and have no pattern (they don’t depend on the fitted values, x’s etc). For the model to be useful, we should have a strong relationship between the response and the explanatory variables.
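Fitted values and residuals can be pulled out of a fitted model with `fitted` and `resid`. The sketch below again assumes the built-in `trees` data as a stand-in for the course's cherry tree data. The plot is one common way to look for pattern in the residuals.

```r
# Assumption: built-in `trees` data used in place of the course data frame
fit  <- lm(Volume ~ Girth + Height, data = trees)

yhat <- fitted(fit)   # height of the fitted plane at each observation
e    <- resid(fit)    # response minus fitted value

# Check the definition: residual = response - fitted value
all.equal(unname(e), trees$Volume - unname(yhat))

# Residuals against fitted values: look for curvature or changing spread
plot(yhat, e, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```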
Measuring goodness of fit: How can we measure the relative size of the residuals and the strength of the relationship between y and the x’s? The “anova identity” is a way of explaining this.
Measuring goodness of fit (cont): Consider the “residual sum of squares” $\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat y_i)^2$. It provides an overall measure of the size of the residuals.
Measuring goodness of fit (cont): Consider the “total sum of squares” $\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar y)^2$. It provides an overall measure of the variability of the data.
Math stuff: ANOVA identity. Math result: TSS = RegSS + RSS, where $\mathrm{RegSS} = \sum_{i=1}^{n} (\hat y_i - \bar y)^2$ is the regression sum of squares. Since RegSS is non-negative, RSS cannot exceed TSS, so RSS/TSS is between 0 and 1.
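The identity is easy to confirm numerically. A minimal sketch, assuming the built-in `trees` data as a stand-in for the course's cherry tree data:

```r
# Assumption: built-in `trees` data used in place of the course data frame
fit  <- lm(Volume ~ Girth + Height, data = trees)
y    <- trees$Volume
yhat <- fitted(fit)

RSS   <- sum((y - yhat)^2)         # residual sum of squares
TSS   <- sum((y - mean(y))^2)      # total sum of squares
RegSS <- sum((yhat - mean(y))^2)   # regression sum of squares

c(TSS = TSS, RegSS.plus.RSS = RegSS + RSS)
```

The two numbers agree up to rounding. The identity relies on the model containing a constant term.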
Interpretation: The smaller RSS is compared to TSS, the better the fit. If RSS is 0 and TSS is positive, we have a perfect fit. If RegSS = 0, then RSS = TSS and the plane is flat (all coefficients except the constant are zero), so the x’s don’t help predict y.
$R^2$: Another way to express this is the coefficient of determination $R^2$. This is defined as $R^2 = 1 - \mathrm{RSS}/\mathrm{TSS} = \mathrm{RegSS}/\mathrm{TSS}$. $R^2$ is always between 0 and 1: 1 means a perfect fit, 0 means a flat plane. $R^2$ is the square of the correlation between the observations and the fitted values.
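The ways of computing $R^2$ described here can be compared in R. This sketch assumes the built-in `trees` data as a stand-in for the course's cherry tree data. `summary(fit)$r.squared` is the value that `summary` reports as Multiple R-squared.

```r
# Assumption: built-in `trees` data used in place of the course data frame
fit <- lm(Volume ~ Girth + Height, data = trees)
y   <- trees$Volume

RSS <- sum(resid(fit)^2)
TSS <- sum((y - mean(y))^2)

R2.def <- 1 - RSS / TSS              # definition
R2.cor <- cor(y, fitted(fit))^2      # squared correlation of y and fitted values
R2.lm  <- summary(fit)$r.squared     # value reported by summary()

c(definition = R2.def, correlation = R2.cor, summary = R2.lm)
```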
Estimate of $\sigma^2$: Recall that $\sigma$ controls the scatter of the observations about the regression plane: the bigger $\sigma$, the more scatter; the bigger $\sigma$, the smaller $R^2$. $\sigma^2$ is estimated by $s^2 = \mathrm{RSS}/(n-k-1)$.
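The estimate $s^2$ can be computed by hand and checked against what R reports. This sketch assumes the built-in `trees` data as a stand-in for the course's cherry tree data. `summary(fit)$sigma` is the "Residual standard error" in the summary output, i.e. $s$ rather than $s^2$.

```r
# Assumption: built-in `trees` data used in place of the course data frame
fit <- lm(Volume ~ Girth + Height, data = trees)

n   <- nrow(trees)   # number of observations
k   <- 2             # number of covariates
RSS <- sum(resid(fit)^2)

s2 <- RSS / (n - k - 1)   # estimate of sigma^2

c(s.by.hand = sqrt(s2), s.from.summary = summary(fit)$sigma)
df.residual(fit)          # residual degrees of freedom, n - k - 1
```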
Calculations:
cherry.lm <- lm(volume~diameter+height, data=cherry.df)
summary(cherry.lm)
The lm function produces an “lm object” that contains all the information from fitting the regression. lm = “linear model”: the name starts with a lowercase “ell”, not a capital I!
Cherries
Call:
lm(formula = volume ~ diameter + height, data = cherry.df)
Residuals:
Min 1Q Median 3Q Max
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) e-07 ***
diameter < 2e-16 ***
height *
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: on 28 degrees of freedom
Multiple R-Squared: 0.948, Adjusted R-squared:
F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
Annotations on the output: info on residuals (Residuals block); coefficients (Coefficients table); estimate of $\sigma$ (Residual standard error); $R^2$ (Multiple R-Squared).