MBP1010H – Lecture 4: March 26, 2012
1. Multiple regression
2. Survival analysis
Reading: Introduction to the Practice of Statistics, Chapters 2, 10 and 11; Multifactorial Analyses – chapter posted in Resources
Simple Linear Regression
- to assess the linear relationship between 2 variables
- to predict the response (y) based on a change in x
Multiple Linear Regression
- explore relationships among multiple variables
- find out which x variables are associated with the response (y)
- devise an equation to predict y from several x variables
- adjust for potential confounding (lurking) variables – the effect of one particular x variable after adjusting for differences in the other x variables
Confounding/Causation
[Diagram: a lurking variable (z) is related to both x and y, so an observed association between x and y need not reflect causation.]
Simple Linear Regression Model
Data = fit + residual
y_i = β_0 + β_1 x_i + ε_i
(observed y = intercept + slope · x + residual)
where the ε_i are independent and normally distributed N(0, σ).
Multiple Regression
Statistical model for n sample data points (i = 1, 2, …, n) and p explanatory variables:
Data = fit + residual
y_i = (β_0 + β_1 x_1i + … + β_p x_pi) + ε_i
where the ε_i are independent and normally distributed N(0, σ).
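As a minimal sketch of this model, the snippet below simulates data from y_i = β_0 + β_1 x_1i + β_2 x_2i + ε_i and recovers the coefficients by least squares. All numbers here (sample size, coefficients, σ) are illustrative assumptions, not values from the course data.

```python
import numpy as np

# Simulate from the multiple regression model (hypothetical values)
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))              # two explanatory variables
beta_true = np.array([1.0, 2.0, -0.5])   # beta_0, beta_1, beta_2
eps = rng.normal(scale=0.3, size=n)      # residuals ~ N(0, 0.3)
y = beta_true[0] + X @ beta_true[1:] + eps

# Design matrix with a leading column of 1s for the intercept
Xd = np.column_stack([np.ones(n), X])
b_hat, *_ = np.linalg.lstsq(Xd, y, rcond=None)
print(np.round(b_hat, 2))                # estimates close to beta_true
```

With n = 200 and small σ, the fitted coefficients land very close to the true ones; real analyses (e.g., the SAS output shown later) add standard errors and tests on top of this fit.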
Analysis of Variance (ANOVA) table for linear regression
Data = fit + residual
y_i = (β_0 + β_1 x_1i + … + β_p x_pi) + ε_i
SS Total = SS Model + SS Error   (SS = sum of squares)
ANOVA Table (p = number of explanatory variables)

Source | DF        | Sum of squares | Mean square     | F       | P-value
Model  | p         | SSM            | MSM = SSM/DFM   | MSM/MSE | tail area above F
Error  | n − p − 1 | SSE            | MSE = SSE/DFE   |         |
Total  | n − 1     | SST            | SST/DFT         |         |
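The table's quantities can be sketched numerically. The snippet below (simulated data, assumed coefficients) fits a regression, forms SSM, SSE and SST with their degrees of freedom, and checks the decomposition SST = SSM + SSE.

```python
import numpy as np

# ANOVA decomposition for a regression with p predictors (simulated data)
rng = np.random.default_rng(1)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = 2.0 + X @ np.array([1.0, 0.0, -1.0]) + rng.normal(scale=1.0, size=n)

Xd = np.column_stack([np.ones(n), X])
b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
y_hat = Xd @ b

SST = np.sum((y - y.mean()) ** 2)        # total sum of squares, DFT = n - 1
SSM = np.sum((y_hat - y.mean()) ** 2)    # model sum of squares, DFM = p
SSE = np.sum((y - y_hat) ** 2)           # error sum of squares, DFE = n - p - 1

MSM, MSE = SSM / p, SSE / (n - p - 1)
F = MSM / MSE                            # the F statistic from the table
assert np.isclose(SST, SSM + SSE)        # the decomposition holds exactly
print(round(F, 1))
```

The identity SST = SSM + SSE holds exactly whenever an intercept is included in the model.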
In the sample: ŷ_i = b_0 + b_1 x_1i + … + b_p x_pi
- The least-squares regression method minimizes the sum of squared deviations e_i (= y_i − ŷ_i) to express y as a linear function of the p explanatory variables.
- The regression coefficients (b_1, …, b_p) reflect the unique association of each independent variable with the y variable – analogous to the slope in simple regression.
Note: b_1 or β̂_1 can be used for the sample estimate in notation.
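The "minimizes the sum of squared deviations" claim can be checked directly: no perturbation of the least-squares coefficients gives a smaller residual sum of squares. A small sketch on simulated data (not the GPA study):

```python
import numpy as np

# Least squares minimizes sum(e_i^2): perturbing b never does better
rng = np.random.default_rng(2)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([0.5, 1.0, -2.0]) + rng.normal(size=n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)
rss = np.sum((y - X @ b) ** 2)           # residual sum of squares at b

for _ in range(100):                     # random perturbations of b
    b_alt = b + rng.normal(scale=0.1, size=3)
    rss_alt = np.sum((y - X @ b_alt) ** 2)
    assert rss_alt >= rss                # never beats least squares
```

This works because the residual sum of squares is a convex function of the coefficients with its minimum at the least-squares solution.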
Case Study of Multiple Regression Goal: to predict success in early university years. Measure of Success: GPA after 3 semesters
Data on 224 first-year computer science majors at a large university in a given year. The data for each student include: * Cumulative GPA (y, response variable) * Average high school grade in math (HSM, x1, explanatory variable) * Average high school grade in science (HSS, x2, explanatory variable) * Average high school grade in English (HSE, x3, explanatory variable) * SAT math score (SATM, x4, explanatory variable) * SAT verbal score (SATV, x5, explanatory variable) What factors are associated with GPA during first year of college?
Summary statistics for the data (from SAS software)
Univariate Associations between Variables
- plot each pairwise association to check linearity and to look for outliers
ANOVA table for model with HSM, HSS and HSE
F test highly significant: at least one of the regression coefficients is significantly different from zero.
R²: HSM, HSS and HSE explain about 20% of the variation in GPA.
ANOVA F-test for multiple regression
H_0: β_1 = β_2 = … = β_p = 0   versus   H_a: at least one β_j ≠ 0
F statistic: F = MSM / MSE
A significant P-value means that at least one explanatory variable has a significant influence on y.
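One way to see what the P-value of this test measures, without assuming the F distribution tables that software such as SAS uses, is a permutation approximation: shuffling y breaks any link to the x's, which mimics H_0. This is a hedged illustration on simulated data, not the course's method.

```python
import numpy as np

rng = np.random.default_rng(3)

def f_stat(X, y):
    """Overall regression F statistic, F = MSM / MSE."""
    n, p = X.shape
    Xd = np.column_stack([np.ones(n), X])
    b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    y_hat = Xd @ b
    SSM = np.sum((y_hat - y.mean()) ** 2)
    SSE = np.sum((y - y_hat) ** 2)
    return (SSM / p) / (SSE / (n - p - 1))

n, p = 80, 2
X = rng.normal(size=(n, p))
y = 1.0 + 0.8 * X[:, 0] + rng.normal(size=n)   # only x1 truly matters

F_obs = f_stat(X, y)
# Under H_0 the x's carry no information about y, so permuting y
# generates the null distribution of F.
F_null = np.array([f_stat(X, rng.permutation(y)) for _ in range(500)])
p_value = np.mean(F_null >= F_obs)
print(p_value)                                 # very small: reject H_0
```

With a real effect present, the observed F dwarfs the permutation values, so the estimated tail area is essentially zero and H_0 is rejected.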
R Square and Adjusted R Square
- adjusted R² is equal to or smaller than the regular R²
- adjusts for a bias in R²
- regular R² tends to be an overestimate, especially with many predictors and a small sample size
- statisticians and researchers differ on whether to use adjusted R²
- adjusted R² is not often used or reported
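The bias is easy to demonstrate with the usual adjustment formula, R²_adj = 1 − (1 − R²)(n − 1)/(n − p − 1). In the sketch below (simulated, assumed sizes) the predictors are pure noise, yet plain R² comes out well above zero, while adjusted R² is pulled back down.

```python
import numpy as np

# Many useless predictors + small sample => inflated plain R-square
rng = np.random.default_rng(4)
n, p = 30, 10
X = rng.normal(size=(n, p))                  # predictors unrelated to y
y = rng.normal(size=n)

Xd = np.column_stack([np.ones(n), X])
b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
SSE = np.sum((y - Xd @ b) ** 2)
SST = np.sum((y - y.mean()) ** 2)

r2 = 1 - SSE / SST
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(r2, 2), round(r2_adj, 2))        # r2 positive by chance; r2_adj smaller
```

Adjusted R² can even go negative in cases like this, which is a useful warning sign that the model has no real explanatory power.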
Multiple linear regression using HS grade averages: When all 3 high school averages are used together in the multiple regression analysis, only HSM contributes significantly to our ability to predict GPA.
Drop the least significant variable from the previous model: HSS. The conclusions are about the same, but the actual regression coefficients have changed.
SATM and SATV
Multiple linear regression with the two SAT scores only.
ANOVA test very significant: at least one slope is not zero.
R² is very small (0.06): only 6% of the variation in GPA is explained by these tests.
Multiple regression model with all the variables together
The overall test is significant, but only the average high school math score (HSM) makes a significant contribution in this model to predicting the cumulative GPA.
- P-value very significant
- R² fairly small (21%)
- HSM significant
Next Steps:
- refine the model: drop the non-significant variables
- check residuals:
  - histogram or Q-Q plot of residuals
  - plot residuals against predicted GPA
  - plot residuals against the explanatory variables
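The residual checks above can be started numerically before any plotting. A sketch on simulated data (the variable names and values are illustrative, not from the GPA study): with an intercept in the model, least-squares residuals average exactly zero and are uncorrelated with the fitted values, so any visible pattern in the plots signals a problem.

```python
import numpy as np

# Basic residual diagnostics after a least-squares fit (simulated data)
rng = np.random.default_rng(5)
n = 150
X = rng.normal(size=(n, 2))
y = 1.0 + X @ np.array([0.5, -0.5]) + rng.normal(scale=0.4, size=n)

Xd = np.column_stack([np.ones(n), X])
b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
fitted = Xd @ b
resid = y - fitted

print(round(resid.mean(), 10))                      # ~0 by construction
print(round(np.corrcoef(resid, fitted)[0, 1], 10))  # ~0 by construction
# In practice one would also draw a histogram or Q-Q plot of `resid`
# and plot it against `fitted` and each explanatory variable.
```

Because these two checks hold by construction, the informative diagnostics are the shapes of the plots (curvature, funnels, outliers), not these summary numbers.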
Assumptions for Linear Regression
- The relationship between x and y is linear.
- The variance of y is equal for all values of x.
- Residuals are approximately normally distributed.
- The observations are independent.
- Residuals randomly scattered: good!
- Curved pattern: the relationship is not linear (transform).
- Change in variability across the plot: variance is not equal for all values of x (transform y).
Do x and y need to have normal distributions?
For regression:
- y (probably) doesn’t matter
- x doesn’t matter
BUT: check for errors/outliers – they could be influential.
In practice, most analysts prefer y to be reasonably normal.
Residuals from the model should be normally distributed.