1 Statistics: Unlocking the Power of Data Lock 5 STAT 101 Dr. Kari Lock Morgan Multiple Regression SECTIONS 9.2, 10.1, 10.2 Multiple explanatory variables (10.1) Partitioning variability – R², ANOVA (9.2) Conditions – residual plot (10.2)

2 Statistics: Unlocking the Power of Data Lock 5 Exam 2 Grades: In-Class

3 Statistics: Unlocking the Power of Data Lock 5 Exam 2 Re-grades Re-grade requests due in writing by class on Monday, 4/15/14 Partial credit will not be altered – only submit a re-grade request if you think you have entirely the correct answer but got points off Grades may go up or down If points were added up incorrectly, just bring your exam to your TA (no need for an official re-grade)

4 Statistics: Unlocking the Power of Data Lock 5 Today we’ll finally learn a way to handle more than 2 variables! More than 2 variables!

5 Statistics: Unlocking the Power of Data Lock 5 Multiple Regression Multiple regression extends simple linear regression to include multiple explanatory variables: y = β₀ + β₁x₁ + β₂x₂ + … + βₖxₖ + ε
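As a rough illustration of what fitting a model with several explanatory variables involves, the sketch below finds the least-squares coefficients by solving the normal equations in plain Python. The function names and the tiny data set are hypothetical (they are not from the course data), and in practice statistical software such as R does this step for you.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, copied so the inputs are not modified.
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_multiple_regression(xs, y):
    """Return [b0, b1, ..., bk] minimizing the sum of squared residuals.

    xs holds one row per case, each row containing k explanatory values.
    """
    design = [[1.0] + list(row) for row in xs]  # leading 1 is the intercept
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y_i for r, y_i in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def predict(coefs, row):
    """Predicted response: b0 + b1*x1 + ... + bk*xk."""
    return coefs[0] + sum(b * x for b, x in zip(coefs[1:], row))


# Hypothetical data with two explanatory variables (made up for illustration).
explanatory = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
response = [18, 17, 28, 27, 35]
coefs = fit_multiple_regression(explanatory, response)
print(coefs)
```

The made-up responses were generated without noise, so the fitted coefficients simply recover the values used to build them; real data would leave residuals.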

6 Statistics: Unlocking the Power of Data Lock 5 We’ll use your current grades to predict final exam scores, based on a model from previous 101 students Response: final exam score Explanatory: hw average, clicker average, exam 1, exam 2 Grade on Final

7 Statistics: Unlocking the Power of Data Lock 5 What variable is the most significant predictor of final exam score? a) Homework average b) Clicker average c) Exam 1 d) Exam 2 Grade on Final

8 Statistics: Unlocking the Power of Data Lock 5 Inference for Coefficients The p-value for explanatory variable xᵢ is associated with the hypotheses H₀: βᵢ = 0 vs. Hₐ: βᵢ ≠ 0. For intervals and p-values of coefficients in multiple regression, use a t-distribution with n – k – 1 degrees of freedom, where k is the number of explanatory variables included in the model.
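To make the degrees-of-freedom rule concrete, here is a minimal sketch that computes a t-statistic for one coefficient and a two-sided p-value from a t-distribution with n – k – 1 degrees of freedom. The coefficient estimate, its standard error, and the sample size are hypothetical numbers, not values from the class model, and the p-value is obtained by numerically integrating the t density (Simpson's rule) only because the Python standard library has no t-distribution function.

```python
import math


def degrees_of_freedom(n, k):
    """Degrees of freedom for coefficient inference: n - k - 1."""
    return n - k - 1


def t_pdf(t, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(t * t / df))


def two_sided_p_value(t_stat, df, steps=2000):
    """P(|T| >= |t_stat|), using Simpson's rule on [0, |t_stat|]."""
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3  # P(0 <= T <= |t_stat|)
    return 1.0 - 2.0 * central


# Hypothetical values: 69 students, 4 explanatory variables,
# and one coefficient estimate with its standard error.
n, k = 69, 4
estimate, std_error = 0.45, 0.15
t_stat = estimate / std_error
df = degrees_of_freedom(n, k)
print(df, t_stat, two_sided_p_value(t_stat, df))
```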

9 Statistics: Unlocking the Power of Data Lock 5 Estimate your score on the final exam. What type of interval do you want for this estimate? a) Confidence interval b) Prediction interval Grade on Final

10 Statistics: Unlocking the Power of Data Lock 5 Estimate your score on the final exam. (for this data hw average was out of 10, clicker average was out of 2) Grade on Final

11 Statistics: Unlocking the Power of Data Lock 5 Is the clicker coefficient really negative?!? Grade on Final

12 Statistics: Unlocking the Power of Data Lock 5 Is your score on exam 2 really not a significant predictor of your final exam score?!? Grade on Final

13 Statistics: Unlocking the Power of Data Lock 5 The coefficient (and significance) for each explanatory variable depend on the other variables in the model! Coefficients

14 Statistics: Unlocking the Power of Data Lock 5 If you take Exam 1 out of the model… Grade on Final Model with Exam 1: Now Exam 2 is significant!

15 Statistics: Unlocking the Power of Data Lock 5 Multiple Regression The coefficient for each explanatory variable is the predicted change in y for a one-unit change in x, given the other explanatory variables in the model! The p-value for each coefficient indicates whether it is a significant predictor of y, given the other explanatory variables in the model! If explanatory variables are associated with each other, coefficients and p-values will change depending on what else is included in the model.

16 Statistics: Unlocking the Power of Data Lock 5 If you include Project 1 in the model… Grade on Final Model without Project 1:

17 Statistics: Unlocking the Power of Data Lock 5 Grades

18 Statistics: Unlocking the Power of Data Lock 5 Evaluating a Model How do we evaluate the success of a model? How do we determine the overall significance of a model? How do we choose between two competing models?

19 Statistics: Unlocking the Power of Data Lock 5 Variability One way to evaluate a model is to partition variability A good model “explains” a lot of the variability in Y Total Variability Variability Explained by the Model Error Variability

20 Statistics: Unlocking the Power of Data Lock 5 Exam Scores Without knowing the explanatory variables, we can say that a person’s final exam score will probably be between 60 and 98 (the range of Y). Knowing hw average, clicker average, exam 1 and 2 grades, and project 1 grades, we can give a narrower prediction interval for final exam score. We say that some of the variability in y is explained by the explanatory variables. How do we quantify this?

21 Statistics: Unlocking the Power of Data Lock 5 Variability How do we quantify variability in Y? a) Standard deviation of Y b) Sum of squared deviations from the mean of Y c) (a) or (b) d) None of the above

22 Statistics: Unlocking the Power of Data Lock 5 Sums of Squares Total variability (SST) = Variability explained by the model (SSM) + Error variability (SSE), where SST = Σ(y – ȳ)², SSM = Σ(ŷ – ȳ)², and SSE = Σ(y – ŷ)²

23 Statistics: Unlocking the Power of Data Lock 5 Variability If SSM is much higher than SSE, then the model explains a lot of the variability in Y

24 Statistics: Unlocking the Power of Data Lock 5 R² R² is the proportion of the variability in Y that is explained by the model: R² = SSM / SST = (variability explained by the model) / (total variability)
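The partition of variability and the resulting R² can be checked numerically. The sketch below fits a simple least-squares line to a small made-up data set (the scores are hypothetical, not class data), computes the total, model, and error sums of squares, and takes R² as the model sum of squares divided by the total.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def sums_of_squares(y, y_hat):
    """Return (SST, SSM, SSE) for actual values y and predictions y_hat."""
    y_bar = sum(y) / len(y)
    sst = sum((yi - y_bar) ** 2 for yi in y)            # total variability
    ssm = sum((fi - y_bar) ** 2 for fi in y_hat)        # explained by the model
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # error variability
    return sst, ssm, sse


# Hypothetical data: exam 1 scores (x) and final exam scores (y).
exam1 = [70, 75, 80, 85, 90, 95]
final = [68, 80, 77, 88, 86, 97]

intercept, slope = fit_line(exam1, final)
predicted = [intercept + slope * xi for xi in exam1]
sst, ssm, sse = sums_of_squares(final, predicted)
r_squared = ssm / sst
print(sst, ssm + sse, r_squared)
```

For a least-squares fit that includes an intercept, the first two printed numbers agree up to rounding error, which is the partition SST = SSM + SSE.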

25 Statistics: Unlocking the Power of Data Lock 5 R² For simple linear regression, R² is just the squared correlation between X and Y. For multiple regression, R² is the squared correlation between the actual values and the predicted values.
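Both descriptions of R² can be verified on a small example. This sketch uses hypothetical scores and a simple linear regression, where the two descriptions coincide; with several explanatory variables only the comparison of actual and predicted values applies. The helper names are made up for illustration.

```python
import math


def correlation(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    a_bar = sum(a) / n
    b_bar = sum(b) / n
    sab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    saa = sum((ai - a_bar) ** 2 for ai in a)
    sbb = sum((bi - b_bar) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)


def r_squared_simple(x, y):
    """Return (R^2, predictions), with R^2 = 1 - SSE/SST for the fitted line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    y_hat = [intercept + slope * xi for xi in x]
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - sse / sst, y_hat


# Hypothetical data: exam 1 scores (x) and final exam scores (y).
exam1 = [70, 75, 80, 85, 90, 95]
final = [68, 80, 77, 88, 86, 97]

r2, fitted = r_squared_simple(exam1, final)
print(r2, correlation(exam1, final) ** 2, correlation(final, fitted) ** 2)
```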

26 Statistics: Unlocking the Power of Data Lock 5 R²

27 Final Exam Grade

28 Statistics: Unlocking the Power of Data Lock 5 Is the model significant? If we want to test whether the model is significant (whether the model helps to predict y), we can test the hypotheses H₀: β₁ = β₂ = … = βₖ = 0 vs. Hₐ: at least one βᵢ ≠ 0. We do this with ANOVA!

29 Statistics: Unlocking the Power of Data Lock 5 ANOVA for Regression k: number of explanatory variables n: sample size
Source   df       Sum of Squares   Mean Square           F         p-value
Model    k        SSM              MSM = SSM/k           MSM/MSE   Use F(k, n-k-1)
Error    n-k-1    SSE              MSE = SSE/(n-k-1)
Total    n-1      SST

30 Statistics: Unlocking the Power of Data Lock 5 ANOVA for Regression
Source   df   Sum of Squares   Mean Square   F       p-value
Model    5    3125.8           625.16        20.71   ≈ 0
Error    63   1901.4           30.18
Total    68   5027.2
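The entries of this table follow from just four inputs: SSM, SSE, the number of explanatory variables k, and the sample size n. The sketch below rebuilds the mean squares and the F-statistic from the sums of squares on the slide, taking k = 5 and n = 69 from the degrees of freedom shown (5 and 68). The function name is made up, and the p-value is left out because it needs an F-distribution, which the Python standard library does not provide.

```python
def anova_for_regression(ssm, sse, k, n):
    """Build the regression ANOVA quantities from SSM, SSE, k, and n."""
    df_model = k
    df_error = n - k - 1
    msm = ssm / df_model  # mean square for the model
    mse = sse / df_error  # mean square error
    return {
        "df_model": df_model,
        "df_error": df_error,
        "df_total": n - 1,
        "SST": ssm + sse,
        "MSM": msm,
        "MSE": mse,
        "F": msm / mse,
    }


# Sums of squares from the slide; k and n inferred from its df column.
table = anova_for_regression(ssm=3125.8, sse=1901.4, k=5, n=69)
for name, value in table.items():
    print(name, round(value, 2))
```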

31 Statistics: Unlocking the Power of Data Lock 5 Final Exam Grade

32 Statistics: Unlocking the Power of Data Lock 5 Simple Linear Regression For simple linear regression, the following tests will all give equivalent p-values: t-test for non-zero correlation t-test for non-zero slope ANOVA for regression

33 Statistics: Unlocking the Power of Data Lock 5 Mean Square Error (MSE) Mean square error (MSE) measures the average variability in the errors (residuals). The square root of MSE gives the standard deviation of the residuals (giving a typical distance of points from the line). This number is also given in the R output as the residual standard error, and is known as s_ε in the textbook.
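As a small worked check, the residual standard error can be computed directly from the error sum of squares. The sketch below assumes the SSE of 1901.4 with 63 error degrees of freedom (n = 69, k = 5) from the ANOVA slide; the function name is hypothetical.

```python
import math


def residual_standard_error(sse, n, k):
    """Square root of MSE, where MSE = SSE / (n - k - 1)."""
    mse = sse / (n - k - 1)
    return math.sqrt(mse)


# SSE, n, and k as read off the ANOVA slide (n and k inferred from its df).
s = residual_standard_error(sse=1901.4, n=69, k=5)
print(round(s, 3))
```

Rounded to three decimals this gives 5.494, the residual standard error quoted later in the lecture.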

34 Statistics: Unlocking the Power of Data Lock 5 Final Exam Grade

35 Statistics: Unlocking the Power of Data Lock 5 Simple Linear Model Residual standard error = √MSE = s_ε estimates the standard deviation of the residuals (the spread of the normal distributions around the predicted values)

36 Statistics: Unlocking the Power of Data Lock 5 Residual Standard Error Use the fact that the residual standard error is 5.494 and your predicted final exam score to compute an approximate 95% prediction interval for your final exam score NOTE: This calculation only takes into account errors around the line, not uncertainty in the line itself, so your true prediction interval will be slightly wider
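The approximate interval described here is the predicted score plus or minus about two residual standard errors. A minimal sketch follows; the predicted score of 85 is a hypothetical placeholder for your own prediction, and the multiplier of 2 is the usual rough stand-in for the exact t-quantile.

```python
def approximate_prediction_interval(predicted, residual_se, multiplier=2.0):
    """Rough 95% prediction interval: predicted +/- multiplier * residual SE.

    This ignores uncertainty in the fitted coefficients, so the true
    prediction interval is somewhat wider.
    """
    margin = multiplier * residual_se
    return predicted - margin, predicted + margin


# 5.494 is the residual standard error from the slide; 85 is hypothetical.
low, high = approximate_prediction_interval(predicted=85.0, residual_se=5.494)
print(round(low, 3), round(high, 3))
```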

37 Statistics: Unlocking the Power of Data Lock 5 Revisiting Conditions For simple linear regression, we learned that the following should hold for inferences to be valid: Linearity Constant variability of the residuals Normality of the residuals How do we assess the first two conditions in multiple regression, when we can no longer visualize with a scatterplot?

38 Statistics: Unlocking the Power of Data Lock 5 Residual Plot A residual plot is a scatterplot of the residuals against the predicted responses. Should have: 1) No obvious pattern 2) Constant variability
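The points of a residual plot are easy to compute by hand: each point pairs a predicted response with its residual (actual minus predicted). The sketch below builds those pairs from hypothetical actual and predicted scores. The comparison of residual spread between the lower and upper halves of the predicted values is only a crude numerical companion to the plot, added here as an assumption for illustration; it is not a method from the lecture, and looking at the plot itself remains the intended check.

```python
import statistics


def residual_plot_points(actual, predicted):
    """Return (predicted, residual) pairs, with residual = actual - predicted."""
    return [(p, a - p) for a, p in zip(actual, predicted)]


def spread_by_half(points):
    """Standard deviation of residuals for low and high predicted values."""
    ordered = sorted(points)  # sorts by predicted value first
    mid = len(ordered) // 2
    low = [r for _, r in ordered[:mid]]
    high = [r for _, r in ordered[mid:]]
    return statistics.stdev(low), statistics.stdev(high)


# Hypothetical actual and predicted final exam scores.
actual = [72.0, 73.0, 81.0, 84.0, 93.0, 92.0]
predicted = [70.0, 75.0, 80.0, 85.0, 90.0, 95.0]

points = residual_plot_points(actual, predicted)
low_sd, high_sd = spread_by_half(points)
print(points)
print(low_sd, high_sd)
```

Plotting the first element of each pair on the horizontal axis and the second on the vertical axis gives the residual plot.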

39 Statistics: Unlocking the Power of Data Lock 5 Residual Plots Obvious pattern; variability not constant

40 Statistics: Unlocking the Power of Data Lock 5 Final Exam Score Are the conditions satisfied? (a) Yes (b) No

41 Statistics: Unlocking the Power of Data Lock 5 Conditions What if the conditions for inference aren’t met??? Option 1 (best option): Take STAT 210 and learn more about modeling! Option 2: Try a transformation…

42 Statistics: Unlocking the Power of Data Lock 5 Transformations If the conditions are not satisfied, there are some common transformations you can apply to the response variable. You can take any function of y and use it as the response, but the most common are log(y) (natural logarithm, ln), √y (square root), y² (squared), and e^y (exponential).
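Applying one of these transformations just means replacing each response value before fitting the model. A minimal sketch, with a hypothetical helper and made-up response values; note that log(y) needs positive values and the square root needs non-negative values.

```python
import math

# The four common transformations of the response listed above.
TRANSFORMS = {
    "log(y)": math.log,           # natural logarithm
    "sqrt(y)": math.sqrt,
    "y^2": lambda y: y ** 2,
    "e^y": math.exp,
}


def transform_response(y_values, name):
    """Apply the named transformation to every response value."""
    f = TRANSFORMS[name]
    return [f(y) for y in y_values]


# Hypothetical response values.
y = [1.0, 4.0, 9.0]
for name in TRANSFORMS:
    print(name, transform_response(y, name))
```

The transformed list is then used as the response in the regression in place of the original y.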

43 Statistics: Unlocking the Power of Data Lock 5 log(y) Original Response, y : Logged Response, log(y) :

44 Statistics: Unlocking the Power of Data Lock 5 √y Original Response, y : Square root of Response, √y :

45 Statistics: Unlocking the Power of Data Lock 5 y² Original Response, y : Squared response, y² :

46 Statistics: Unlocking the Power of Data Lock 5 e^y Original Response, y : Exponentiated Response, e^y :

47 Statistics: Unlocking the Power of Data Lock 5 Transformations Interpretation becomes a bit more complicated if you transform the response – it should only be done if it clearly helps the conditions to be met. If you transform the response, be careful when interpreting coefficients and predictions. The slope will now have a different meaning, and predictions and confidence/prediction intervals will be for the transformed response.
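One practical consequence: if the model was fit to log(y), its predictions and intervals are on the log scale and have to be converted back before they can be read as values of y. A minimal sketch, assuming a log-transformed response; the predicted value and interval are hypothetical numbers, and the helper name is made up.

```python
import math


def back_transform_log(prediction_log, interval_log):
    """Convert a log-scale prediction and interval back to the scale of y."""
    low, high = interval_log
    # exp is increasing, so the interval endpoints keep their order.
    return math.exp(prediction_log), (math.exp(low), math.exp(high))


# Hypothetical log-scale prediction and prediction interval.
point, (low, high) = back_transform_log(4.0, (3.8, 4.2))
print(round(point, 1), round(low, 1), round(high, 1))
```

The same idea applies to the other transformations, using the matching inverse function (squaring for √y, the square root for y², the natural log for e^y).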

48 Statistics: Unlocking the Power of Data Lock 5 Transformations You do NOT need to know which transformation would be appropriate for given data on the final, but they may help if conditions are not met for Project 2 or for future data you may want to analyze

49 Statistics: Unlocking the Power of Data Lock 5 How do we decide which explanatory variables to include in the model? How do we use categorical explanatory variables? What if the coefficient of one explanatory variable depends on the value of another explanatory variable? To Come…

50 Statistics: Unlocking the Power of Data Lock 5 Project 2 Project done in your lab groups – one project per group. 10-page (max) paper: due Wednesday, 4/23. Choose one quantitative variable and answer questions about it and its relationship with other variables. Use multiple regression and anything else we’ve learned in the course. Project 2 Details here

51 Statistics: Unlocking the Power of Data Lock 5 Project 2 Data Data on college students: sleep data from a 2-week sleep diary; gender; class year; early riser, night owl, or neither; early classes; missed classes; score on a test of cognitive skills; GPA; alcohol consumption; depression, anxiety, stress, happiness

52 Statistics: Unlocking the Power of Data Lock 5 To Do Read 9.2, 10.1, 10.2 Do HW 8 (due Wednesday, 4/16) Do Project 2 (due Wednesday, 4/23)

