Topic 13: Multiple Linear Regression Example
Outline: Description of example; Descriptive summaries; Investigation of various models; Conclusions
Study of CS students: Too many computer science majors at Purdue were dropping out of the program. The goal was to find predictors of success to be used in the admissions process. Predictors must be available at the time of entry into the program.
Data available: GPA after three semesters; Overall high school math grade; Overall high school science grade; Overall high school English grade; SAT Math; SAT Verbal; Gender (of interest for other reasons)
Data for CS Example: Y is the student’s grade point average (GPA) after 3 semesters. The 3 HS grades and 2 SAT scores are the explanatory variables (p=6 parameters, counting the intercept). Have n=224 students.
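Written out, the model with all five explanatory variables might look like the following. The subscript i (indexing students) and the symbols β and e are notation assumed here for illustration; the six coefficients β0 through β5 are what p=6 counts.

```latex
\[
\mathrm{GPA}_i = \beta_0 + \beta_1\,\mathrm{HSM}_i + \beta_2\,\mathrm{HSS}_i + \beta_3\,\mathrm{HSE}_i
               + \beta_4\,\mathrm{SATM}_i + \beta_5\,\mathrm{SATV}_i + e_i,
\qquad i = 1, \dots, 224
\]
```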
Descriptive Statistics
  data a1;
    infile 'C:\...\csdata.dat';
    input id gpa hsm hss hse satm satv genderm1;
  run;
  proc means data=a1 maxdec=2;
    var gpa hsm hss hse satm satv;
  run;
Output from Proc Means
  [Table: columns Variable, N, Mean, Std Dev, Minimum, Maximum; rows gpa, hsm, hss, hse, satm, satv]
Descriptive Statistics
  proc univariate data=a1;
    var gpa hsm hss hse satm satv;
    histogram gpa hsm hss hse satm satv / normal;
  run;
Correlations
  proc corr data=a1;
    var hsm hss hse satm satv;
  run;
  proc corr data=a1;
    var hsm hss hse satm satv;
    with gpa;
  run;
Output from Proc Corr
  Pearson Correlation Coefficients, N = 224; Prob > |r| under H0: Rho=0
  [Correlation matrix with p-values: rows and columns gpa, hsm, hss, hse, satm, satv]
Output from Proc Corr
  Pearson Correlation Coefficients, N = 224; Prob > |r| under H0: Rho=0
  [Correlations with p-values: row gpa; columns hsm, hss, hse, satm, satv]
  All but SATV are significantly correlated with GPA
Scatter Plot Matrix
  proc corr data=a1 plots=matrix;
    var gpa hsm hss hse satm satv;
  run;
  Allows visual check of pairwise relationships
No “strong” linear relationships. Can see the discreteness of the high school scores.
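A possible caveat on the scatter plot matrix request above: according to the SAS documentation for PROC CORR, PLOTS=MATRIX displays at most five variables unless NVAR= is raised, so a six-variable VAR list may lose its last variable. A hedged variant, assuming ODS Graphics is available in the session (the HISTOGRAM suboption is an optional extra, not part of the original code):

```sas
ods graphics on;                      * needed for the PLOTS= option;
proc corr data=a1 plots=matrix(nvar=all histogram);
  var gpa hsm hss hse satm satv;      * request all six variables in one matrix;
run;
```

The diagonal histograms would also make the discreteness of the high school grades easy to see.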
Use high school grades to predict GPA (Model #1)
  proc reg data=a1;
    model gpa=hsm hss hse;
  run;
Results Model #1
  [Parameter Estimates table: columns Variable, DF, Parameter Estimate, Standard Error, t Value, Pr > |t|; rows Intercept, hsm (Pr > |t| <.0001), hss, hse]
  [Fit statistics: Root MSE, R-Square, Dependent Mean, Adj R-Sq, Coeff Var]
  Meaningful??
ANOVA Table #1
  [Analysis of Variance table: columns Source, DF, Sum of Squares, Mean Square, F Value, Pr > F; rows Model (Pr > F <.0001), Error, Corrected Total]
  Significant F test, but not all variable t tests are significant
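The F test and the t tests answer different questions, which is why they can disagree. A sketch for Model #1, using β1, β2, β3 for the hsm, hss, hse coefficients (notation assumed here, not taken from the slides):

```latex
\[
H_0:\ \beta_1 = \beta_2 = \beta_3 = 0
\qquad\text{vs.}\qquad
H_a:\ \text{at least one } \beta_k \neq 0
\]
\[
F = \frac{\mathrm{MS(Model)}}{\mathrm{MS(Error)}}
\ \sim\ F(3,\ 220) \text{ under } H_0,
\qquad 220 = n - p = 224 - 4
\]
```

Each t test, by contrast, tests a single β_k = 0 with the other two high school grades already in the model.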
Remove HSS (Model #2)
  proc reg data=a1;
    model gpa=hsm hse;
  run;
Results Model #2
  [Parameter Estimates table: columns Variable, DF, Parameter Estimate, Standard Error, t Value, Pr > |t|; rows Intercept, hsm (Pr > |t| <.0001), hse]
  [Fit statistics: Root MSE, R-Square, Dependent Mean, Adj R-Sq, Coeff Var]
  Slightly better MSE and adjusted R-Sq
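MSE and adjusted R-Sq always move together when models are fit to the same response, as the standard definitions show. Here p is the number of parameters including the intercept, and SSE and SSTO are the Error and Corrected Total sums of squares from the ANOVA table:

```latex
\[
R^2 = 1 - \frac{\mathrm{SSE}}{\mathrm{SSTO}},
\qquad
R^2_{a} = 1 - \frac{n-1}{n-p}\cdot\frac{\mathrm{SSE}}{\mathrm{SSTO}}
        = 1 - \frac{\mathrm{MSE}}{\mathrm{SSTO}/(n-1)}
\]
```

Because SSTO/(n-1) is the same for every model of gpa on these 224 students, adjusted R-Sq goes up exactly when MSE goes down. Plain R-Sq, by contrast, can never decrease when a variable is added.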
ANOVA Table #2
  [Analysis of Variance table: columns Source, DF, Sum of Squares, Mean Square, F Value, Pr > F; rows Model (Pr > F <.0001), Error, Corrected Total]
  Significant F test, but not all variable t tests are significant
Rerun with HSM only (Model #3)
  proc reg data=a1;
    model gpa=hsm;
  run;
Results Model #3
  [Parameter Estimates table: columns Variable, DF, Parameter Estimate, Standard Error, t Value, Pr > |t|; rows Intercept, hsm (Pr > |t| <.0001)]
  [Fit statistics: Root MSE, R-Square, Dependent Mean, Adj R-Sq, Coeff Var]
  Slightly worse MSE and adjusted R-Sq
ANOVA Table #3
  [Analysis of Variance table: columns Source, DF, Sum of Squares, Mean Square, F Value, Pr > F; rows Model (Pr > F <.0001), Error, Corrected Total]
  Significant F test, and all variable t tests are significant
SATs (Model #4)
  proc reg data=a1;
    model gpa=satm satv;
  run;
Results Model #4
  [Fit statistics: Root MSE, R-Square, Dependent Mean, Adj R-Sq, Coeff Var]
  [Parameter Estimates table: columns Variable, DF, Parameter Estimate, Standard Error, t Value, Pr > |t|; rows Intercept, satm, satv]
  Much worse MSE and adjusted R-Sq
ANOVA Table #4
  [Analysis of Variance table: columns Source, DF, Sum of Squares, Mean Square, F Value, Pr > F; rows Model, Error, Corrected Total]
  Significant F test, but not all variable t tests are significant
HS and SATs (Model #5)
  proc reg data=a1;
    model gpa=satm satv hsm hss hse;
    *Does general linear test;
    sat: test satm, satv;
    hs: test hsm, hss, hse;
  run;
Results Model #5
  [Fit statistics: Root MSE, R-Square, Dependent Mean, Adj R-Sq, Coeff Var]
  [Parameter Estimates table: columns Variable, DF, Parameter Estimate, Standard Error, t Value, Pr > |t|; rows Intercept, hsm, hss, hse, satm, satv]
Test sat
  Test sat Results for Dependent Variable gpa
  [Table: columns Source, DF, Mean Square, F Value, Pr > F; rows Numerator, Denominator]
  Cannot reject the reduced model. No significant information is lost, so we don’t need the SAT variables.
Test hs
  Test hs Results for Dependent Variable gpa
  [Table: columns Source, DF, Mean Square, F Value, Pr > F; rows Numerator (Pr > F <.0001), Denominator]
  Reject the reduced model. There is significant information lost, so we can’t remove the HS variables from the model.
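Both tests are instances of the general linear test, which compares the error sum of squares of the full model (F, all five predictors) with that of a reduced model (R, the tested variables dropped). A sketch of the statistic in its standard form, with the degrees of freedom worked out for this data set:

```latex
\[
F^{*} = \frac{\bigl(\mathrm{SSE}(R) - \mathrm{SSE}(F)\bigr) / (df_R - df_F)}
             {\mathrm{SSE}(F) / df_F},
\qquad df_F = n - p = 224 - 6 = 218
\]
\[
\text{Test sat: } df_R - df_F = 2
\qquad\qquad
\text{Test hs: } df_R - df_F = 3
\]
```

The numerator degrees of freedom equal the number of coefficients set to zero, which is what the Numerator row of each Test table reports in its DF column.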
Best Model? Likely the one with just HSM or the one with HSE and HSM. We’ll discuss comparison methods in Chapters 7 and 8
Key ideas from case study: First, look at graphical and numerical summaries one variable at a time. Then look at relationships between pairs of variables with graphical and numerical summaries. Use plots and correlations to understand relationships.
Key ideas from case study: The relationship between a response variable and an explanatory variable depends on what other explanatory variables are in the model. A variable can be a significant (P < 0.05) predictor alone and not significant (P > 0.5) when other X’s are in the model.
Key ideas from case study: Regression coefficients, standard errors, and the results of significance tests depend on what other explanatory variables are in the model.
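One way to see this dependence directly is to fit several of the models from this topic in a single PROC REG step and compare the hsm row across the outputs. This is a sketch only; the statement labels hsm_only, hs_all and full are hypothetical names chosen for illustration.

```sas
proc reg data=a1;
  hsm_only: model gpa = hsm;                     * Model #3;
  hs_all:   model gpa = hsm hss hse;             * Model #1;
  full:     model gpa = satm satv hsm hss hse;   * Model #5;
run;
quit;
```

Each labeled MODEL statement produces its own Parameter Estimates table, so the estimate, standard error, and p-value for hsm can be read side by side.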
Key ideas from case study: Significance tests (P values) do not tell the whole story. Squared multiple correlations (which give the proportion of variation in the response variable explained by the explanatory variables) can give a different view. We often express R² as a percent.
Key ideas from case study: You can fully understand the theory in terms of Y = Xβ + e. However, to use this methodology effectively in practice, you need to understand how the data were collected, the nature of the variables, and how they relate to each other.
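For reference, the matrix form for the full model in this example would have the dimensions below. The least squares estimator b is the standard one and assumes X'X is invertible; the symbol b is notation assumed here.

```latex
\[
\underset{224 \times 1}{Y} \;=\; \underset{224 \times 6}{X}\;\underset{6 \times 1}{\beta} \;+\; \underset{224 \times 1}{e},
\qquad
b = (X'X)^{-1} X'Y
\]
```

The first column of X is a column of ones for the intercept; the other five hold hsm, hss, hse, satm, and satv.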
Background Reading: Cs2.sas contains the SAS commands used in this topic.