Multiple Linear Regression
Introduction The common univariate techniques for data analyses do not allow us to take into account the effect of other covariates/confounders in an analysis. In such situations, a Regression Model would be required. Regression, perhaps the most widely used statistical technique, estimates relationships between independent (predictor or explanatory) variables and a dependent (response or outcome) variable. Regression models can be used to help understand and explain relationships among variables; they can also be used to predict actual outcomes.
Goals After this, you should be able to: Calculate and interpret the simple correlation between two variables. Determine whether the correlation is significant. Calculate and interpret the simple linear regression equation for a set of data. Understand the assumptions behind regression analysis. Determine whether a regression model is significant. Recognize regression analysis applications for purposes of prediction and description.
Scatter Plots and Correlation A scatter plot (or scatter diagram) is used to show the relationship between two variables. Correlation analysis is used to measure the strength of the association (linear relationship) between two variables. ◦ It is only concerned with the strength of the relationship. ◦ No causal effect is implied.
Scatter Plot Examples [Figure: scatter plots of y against x illustrating linear relationships, curvilinear relationships, strong relationships, weak relationships, and no relationship]
Correlation Coefficient The population correlation coefficient ρ (rho) measures the strength of the association between the variables. The sample correlation coefficient r is an estimate of ρ and is used to measure the strength of the linear relationship in the sample observations.
Features of ρ and r Unit free. Range between -1 and 1. The closer to -1, the stronger the negative linear relationship. The closer to 1, the stronger the positive linear relationship. The closer to 0, the weaker the linear relationship.
Examples of Approximate r Values [Figure: scatter plots of y against x for r = -1, r = -.6, r = 0, r = +.3 and r = +1]
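Calculating the Correlation Coefficient Sample correlation coefficient: $r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\left[\sum (x - \bar{x})^2\right]\left[\sum (y - \bar{y})^2\right]}}$ or the algebraic equivalent: $r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}}$ where: r = sample correlation coefficient, n = sample size, x = value of the independent variable, y = value of the dependent variable.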
Calculation Example [Table: tree height (y), trunk diameter (x), xy, y² and x² for each tree, with column totals] Σy = 321, Σx = 73, Σxy = 3142, Σy² = [value missing], Σx² = 713
Calculation Example [Figure: scatter plot of tree height, y, against trunk diameter, x] r = [value missing] → relatively strong positive linear association between x and y
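The algebraic form of the sample correlation coefficient can be sketched in a few lines of Python. This is only an illustration: the function name `pearson_r` and the small data set are hypothetical and are not the tree data from the slides.

```python
import math


def pearson_r(x, y):
    """Sample correlation coefficient r, computed from the raw sums."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


# Hypothetical illustration data (not taken from the slides).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print(round(r, 4))
```

A value of r close to +1 or -1 from such a calculation would indicate a strong linear association; a value near 0 a weak one.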
Significance Test for Correlation Hypotheses: H0: ρ = 0 (no correlation); HA: ρ ≠ 0 (correlation exists). Test statistic (with n - 2 degrees of freedom): $t = \frac{r}{\sqrt{\frac{1 - r^2}{n - 2}}}$
Example: Produce Stores Is there evidence of a linear relationship between tree height and trunk diameter at the .05 level of significance? H0: ρ = 0 (no correlation); HA: ρ ≠ 0 (correlation exists). α = .05, df = 8 - 2 = 6
Example: Test Solution d.f. = 8 - 2 = 6 [Figure: t distribution with rejection regions of area α/2 = .025 in each tail, below -t α/2 and above t α/2; do not reject H0 between them] Decision: Reject H0. Conclusion: There is evidence of a linear relationship at the 5% level of significance.
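As a rough sketch of the significance test for a correlation, the test statistic t = r / sqrt((1 - r²) / (n - 2)) can be computed as below. The function name `correlation_t_statistic` and the input values are hypothetical; the critical value still has to be looked up in a t table with n - 2 degrees of freedom.

```python
import math


def correlation_t_statistic(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    if n <= 2:
        raise ValueError("n must be greater than 2")
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    return r / math.sqrt((1 - r ** 2) / (n - 2))


# Hypothetical inputs (not the values from the slides).
t = correlation_t_statistic(r=0.5, n=27)
degrees_of_freedom = 27 - 2
print(round(t, 3), degrees_of_freedom)
```

H0 is rejected when the absolute value of t exceeds the two-tailed critical value for the chosen α.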
Calculate Correlation in SPSS In the menu, click on Analyze. Point to Correlate. Point to Bivariate… and click. Move Var1 and Var2 to the box labeled Variables by clicking the arrow. Click the OK button.
SPSS Output for Correlation
Linear Regression Definition: Regression analysis is the estimation of the linear relationship between a dependent variable and one or more independent variables or covariates. The dependent variable is assumed to be normally distributed, and the relationship between the dependent variable and dichotomous variables is assumed to be homoscedastic (homogeneity of variance). Regression explains the impact of changes in an independent variable on the dependent variable.
Linear Regression A linear regression line has an equation of the form Y = a + bX + e, where X is the explanatory variable, Y is the dependent variable and e is the error term. The error term accounts for the fact that two persons with the same X need not have the same Y. The difference between the observed value of the dependent variable (y) and the predicted value (ŷ) is called the residual; it is an estimate of e. The slope of the line is b, and a is the intercept (the value of y when x = 0).
Major assumptions The relationship between the outcomes and the predictors is (approximately) linear. The error term has zero mean. The error term has constant variance. The errors are uncorrelated. The errors are normally distributed, or we have an adequate sample size to rely on large-sample theory. For a good prediction, we should also make sure that there are no outliers. Failing to satisfy the assumptions does not mean that our answer is wrong; it means that our solution may under-report the strength of the relationships. So we should always check fitted models to make sure that these assumptions have not been violated.
Least-Squares Regression The most common method for fitting a regression line is the method of least squares. This method calculates the best-fitting line for the observed data by minimizing the sum of the squares of the vertical deviations from each data point to the line (if a point lies on the fitted line exactly, then its vertical deviation is 0).
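For a single predictor, the least-squares line has a closed-form solution: the slope is b = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² and the intercept is a = ȳ - b·x̄. The sketch below illustrates this; the function name `fit_line` and the data are hypothetical.

```python
def fit_line(x, y):
    """Return (a, b) for the least-squares line y = a + b * x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must not be constant")
    b = sxy / sxx               # slope
    a = mean_y - b * mean_x     # intercept
    return a, b


# Hypothetical illustration data (not taken from the slides).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b = fit_line(x, y)

# Residuals are the vertical deviations: observed y minus predicted y.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
print(a, b, sum(r ** 2 for r in residuals))
```

Any other choice of a and b would give a larger sum of squared residuals for the same data.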
Linear Regression [Figure: fitted line of y against x with intercept = a and slope = b, showing for a given x i the observed value of y, the predicted value of y, and the random error e i between them]
File: Cellular data.sav This is a hypothetical data file that concerns a cellular phone company’s efforts to reduce churn. Churn propensity scores are applied to accounts, ranging from 0 to 100.
File: Cellular data.sav Independent variables: bill (average monthly bill); business (percentage used for business); los (years using our service); Income (household income, 1998); Score (propensity to leave). Dependent variable: Minutes (average monthly minutes).
Start Analysis Before attempting to fit a linear model to observed data, a modeler should first determine whether or not there is a relationship between the variables of interest. This does not necessarily imply that one variable causes the other (for example, higher age does not cause higher income), but that there is some significant association between the two variables.
A scatter plot can be a helpful tool in determining the strength of the relationship between two variables. If there appears to be no association between the proposed explanatory and dependent variables (i.e., the scatter plot does not indicate any increasing or decreasing trends), then fitting a linear regression model to the data probably will not provide a useful model. A valuable numerical measure of association between two variables is the correlation coefficient, which is a value between -1 and 1 indicating the strength of the association of the observed data for the two variables.
Open Data In the menu, click File / Open / Data, then browse to: My Computer / C: / Program Files / IBM / SPSS / Statistics / 19 / Samples / English / cellular data
Scatter Plot In the menu, click on Graphs. Point to Legacy Dialogs. Point to Scatter Plot… and click. Select Matrix Scatter and click on Define.
Scatter Plot Move all variables to the box labeled Matrix Variables by clicking the arrow.
Scatter Plot
Correlation
Linear Regression In the menu, click on Analyze Point to Regression Point to Linear… and click Move minutes to the box labeled Dependent by clicking the arrow. Move bill, business, los, income and score to the box labeled Independents by clicking the arrow.
The independent variables can be entered into the analysis using five different methods. This tutorial uses the Enter method.
Enter Method Other methods based on statistical significance are available for model building, such as backward elimination or forward selection. When building the model on a substantive basis, however, the Enter method is best: variables are included in the regression equation regardless of whether or not they are statistically significant.
Then click the Statistics button to open the Statistics window. Tick Descriptives, click Continue, and then click OK in the previous window.
Descriptive Statistics This table shows the mean, standard deviation and number of observations for the dependent and independent variables.
Correlations This table indicates the correlation between all variables.
Variables Entered This table lists the independent variables. The method used to include the independent variables is Enter.
Model Summary In multiple regression, R measures the correlation between the observed value of the dependent variable and the predicted value based on the regression model. The sample estimate of R Square tends to be an overestimate of the population parameter; the Adjusted R Square is designed to compensate for the optimistic bias of R Square. In this example, the covariates together explain 51.3% of the variance in the dependent variable.
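The relationship between R Square and Adjusted R Square can be sketched with the usual formula, adjusted R² = 1 - (1 - R²)(n - 1) / (n - p - 1), where n is the number of cases and p the number of predictors. In the example below, treating 0.513 as R Square, taking p = 5 (bill, business, los, income, score) and setting n = 250 are all assumptions made for illustration; the slides do not state the sample size.

```python
def r_squared(observed, predicted):
    """Proportion of variance in the observed values explained by the model."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n, p):
    """Adjust R Square for the number of predictors p and sample size n."""
    if n - p - 1 <= 0:
        raise ValueError("need more cases than predictors plus one")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Assumed values: r2 read as R Square, n is hypothetical.
adj = adjusted_r_squared(r2=0.513, n=250, p=5)
print(round(adj, 3))
```

With at least one predictor the adjusted value is always a little below R Square, and the gap widens as predictors are added or the sample shrinks.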
ANOVA table The ANOVA table shows the ‘usefulness’ of the linear regression model as a whole. We want the p-value of the F test to be < 0.05.
Coefficients This table quantifies the relationship between average monthly minutes and the covariates. Each unstandardized coefficient shows how much the dependent variable increases (positive) or decreases (negative) for a one-unit increase in that covariate, holding the other covariates constant.
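To make the meaning of unstandardized coefficients concrete, the sketch below fits a multiple regression by solving the normal equations (X'X)b = X'y with plain Gaussian elimination. This is not how SPSS computes its output, and the two predictors and the data are hypothetical (generated exactly as y = 10 + 2·x1 - 3·x2), not the cellular data.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular (collinear predictors?)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_multiple_regression(rows, y):
    """Return [intercept, b1, ..., bp] for predictors given row by row."""
    design = [[1.0] + list(r) for r in rows]  # leading 1 for the intercept
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical data: y = 10 + 2 * x1 - 3 * x2 with no noise.
rows = [[1, 0], [0, 1], [2, 1], [3, 2], [4, 0], [5, 3]]
y = [12, 7, 11, 10, 18, 11]
coefficients = fit_multiple_regression(rows, y)
print([round(c, 6) for c in coefficients])
```

Here a one-unit increase in x1 raises the predicted y by 2 units and a one-unit increase in x2 lowers it by 3 units, each with the other predictor held fixed.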
Checking the assumptions There are two ways to check these assumptions: graphical and statistical.
Linearity assumption To check this assumption we use a scatter plot. Click the Plots button in the Regression window, then enter ZPRED (the standardized predicted values of the dependent variable) on the X axis and ZRESID (the standardized residuals) on the Y axis.
If the points scatter randomly around the horizontal line at zero, the linearity assumption is satisfied.
Non linearity
Error term has zero mean To check this assumption, tick Durbin-Watson under Residuals in the Statistics window.
Residual The mean of the residuals is zero.
Error term has constant variance This plot places the residuals on the vertical axis and the predicted values on the horizontal axis. If the points in the residual plot are randomly dispersed around the horizontal axis, the equal-variance assumption is satisfied.
To check this assumption, enter ZRESID (the standardized residuals) on the Y axis and ZPRED (the standardized predicted values of the dependent variable) on the X axis in the Plots window.
Scatter plot The residuals "bounce randomly" around the 0 line. This suggests that the assumption that the relationship is linear is reasonable. The residuals roughly form a "horizontal band" around the 0 line. This suggests that the variances of the error terms are equal. No one residual "stands out" from the basic random pattern of residuals. This suggests that there are no outliers. Overall, this graph shows that the equal-variance assumption is satisfied.
Unequal variance
In general, you want your residual vs fits plots to look something like the above plot. Don't forget though that interpreting these plots is subjective. My experience has been that students learning residual analysis for the first time tend to over-interpret these plots, looking at every twist and turn as something potentially troublesome. You'll especially want to be careful about putting too much weight on residual vs fits plots based on small data sets. Sometimes the data sets are just too small to make interpretation of a residuals vs fits plot worthwhile. Don't worry! You will learn — with practice — how to "read" these plots.
Errors are uncorrelated To check this assumption, use the Durbin-Watson statistic. This statistic is available in the Statistics window of the Regression window.
Durbin-Watson The Durbin-Watson statistic ranges in value from 0 to 4. A value near 2 indicates no autocorrelation; a value toward 0 indicates positive autocorrelation; a value toward 4 indicates negative autocorrelation.
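The Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. A minimal sketch follows; the function name `durbin_watson` and both residual series are hypothetical and chosen only to show the two extremes.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals in their original case order."""
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2
        for t in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    if denominator == 0:
        raise ValueError("residuals must not all be zero")
    return numerator / denominator


# Hypothetical residuals that flip sign at every step (negative autocorrelation).
alternating = [1, -1, 1, -1, 1, -1]
# Hypothetical residuals that stay on one side for a while (positive autocorrelation).
drifting = [1, 1, 1, -1, -1, -1]

dw_alternating = durbin_watson(alternating)
dw_drifting = durbin_watson(drifting)
print(round(dw_alternating, 3), round(dw_drifting, 3))
```

The alternating series gives a value well above 2 and the drifting series a value well below 2, matching the interpretation given above.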
Errors are normally distributed The best way to check this assumption is a Q-Q plot. To draw this plot, click the Plots button in the Regression window, then tick Normal probability plot, click Continue and then click the OK button.
Q-Q plot If the observations follow approximately a normal distribution, the resulting plot should be roughly a straight line with a positive slope.
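The points of a normal Q-Q plot can be sketched by pairing the sorted, standardized residuals with the matching quantiles of the standard normal distribution. The plotting positions (i - 0.5) / n used below are one common convention and an assumption here; SPSS may use a different formula. The function name `qq_points` and the residuals are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def qq_points(values):
    """Return (theoretical quantile, standardized sample value) pairs."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    m, s = mean(values), stdev(values)
    if s == 0:
        raise ValueError("values must not all be equal")
    standardized = sorted((v - m) / s for v in values)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    theoretical = [
        std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)
    ]
    return list(zip(theoretical, standardized))


# Hypothetical residuals (not taken from the slides).
residuals = [-1.2, 0.3, 0.8, -0.4, 0.1, 1.5, -0.9, 0.2, -0.6]
points = qq_points(residuals)
for theoretical_q, sample_q in points:
    print(round(theoretical_q, 3), round(sample_q, 3))
```

Plotting the second value of each pair against the first gives the Q-Q plot; roughly normal residuals fall close to a straight line with positive slope.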
In this course we do not cover what to do when one of the assumptions of linear regression is violated. We also do not cover outliers, collinearity and influential data.
Question How should categorical independent variables be handled? Data: car_insurance_claims