Multiple Regression
What is Multiple Regression? y = b0 + b1x1 + b2x2 + ... + bkxk. One dependent variable (y) and several independent variables (x1, x2, ..., xk).
Assumptions
- Quantitative data
- Independent observations
- For each value of the IVs, the distribution of the DV must be normal
- The variance of the distribution of the DV should be constant for all values of the IVs (homoscedasticity)
- The relationship between the DV and each IV should be linear
- Limited linear correlation among the independent variables (i.e. no serious multicollinearity)
- Residuals of the predicted DV values should be random
Using the SWI data set
DV: Trust towards social workers (Q2)
6 IVs:
- Social workers make people rely on welfare (Q1c)
- Social workers bring hope to those in adverse situations (Q1n)
- Social workers help the disadvantaged (Q1m)
- Age (Q13), Family income (Q17), Sex (Q12)
A model (or equation) involves the DV and a different combination of the IVs. With 5 independent variables, a model can include 0, 1, 2, 3, 4, or 5 of them, so there can be as many as 2^5 = 32 models.
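A quick sketch in Python confirming the count; the IV names here just follow the SWI example:

```python
from itertools import combinations

ivs = ["Q1c", "Q1n", "Q1m", "Q13", "Q17"]  # any 5 IVs

# every subset of the IVs is a candidate model
# (including the empty, intercept-only model)
models = [combo for r in range(len(ivs) + 1)
          for combo in combinations(ivs, r)]
print(len(models))  # 32 = 2**5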
Method Enter: include all the independent variables in the model. Some variables might contribute little to the model, yet they still affect the coefficients of the equation (or the model). Other methods of inclusion (Forward, Stepwise) include only the variables with greater contributions (changes in R2) and, usually, a coefficient b that differs significantly from 0.
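A minimal sketch of the Enter method in Python with pandas and statsmodels, assuming the SWI data have been exported to a hypothetical swi.csv with the question codes as column names:

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("swi.csv")  # hypothetical export of the SWI dataset

y = df["Q2"]                                        # DV: trust towards social workers
X = df[["Q1c", "Q1n", "Q1m", "Q13", "Q17", "Q12"]]  # all 6 IVs entered at once
X = sm.add_constant(X)                              # adds the intercept term b0

model = sm.OLS(y, X, missing="drop").fit()
print(model.summary())  # coefficients, t tests, R2, and the ANOVA F test
```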
Null hypotheses: The ANOVA table is used to test several equivalent null hypotheses: there is no linear relationship in the population between the dependent variable and the independent variables; all of the population partial regression coefficients are 0; and the population value for multiple R2 is 0. The alternative hypothesis says only that at least one partial regression coefficient is not 0.
Null hypothesis: the population partial regression coefficient for a variable is 0. This is tested using the t statistic and its observed significance level.
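Continuing the sketch above, the fitted statsmodels object exposes both the ANOVA-table test and the per-coefficient t tests:

```python
print(model.fvalue, model.f_pvalue)  # overall F test (the ANOVA-table hypothesis)
print(model.params)                  # the partial regression coefficients b
print(model.tvalues)                 # t statistic for each coefficient
print(model.pvalues)                 # observed significance level of each t test
```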
The equation: Trust of social workers = 3.3 + 0.759 × Q1n - 0.233 × Q1c + 0.375 × Q1m - 0.145 × Q17
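To see how the equation is used, plug in a hypothetical respondent's scores (the item values below are made up for illustration):

```python
# predicted trust for item scores 4, 2, 4 and income code 3 (all hypothetical)
trust = 3.3 + 0.759 * 4 - 0.233 * 2 + 0.375 * 4 - 0.145 * 3
print(round(trust, 3))  # 6.935
```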
Understanding partial correlation: it removes from both the given IV and the DV all variance accounted for by the control IVs, then correlates the unique component of the IV with the unique component of the DV. In other words, it is the correlation between an IV and the DV after the variance explained by the other IVs (the controlling variables) has been removed. We will say: the partial correlation of an IV is its correlation with the DV after the influence of the other IVs in the model has been controlled.
[Venn diagram: overlap of variance among IV1, IV2, and the DV]
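A sketch of the residual method just described, computing a partial correlation by hand; df and the column names follow the hypothetical SWI export above:

```python
import numpy as np
import statsmodels.api as sm

def partial_corr(df, iv, dv, controls):
    Z = sm.add_constant(df[controls])
    res_iv = sm.OLS(df[iv], Z).fit().resid  # unique component of the IV
    res_dv = sm.OLS(df[dv], Z).fit().resid  # unique component of the DV
    return np.corrcoef(res_iv, res_dv)[0, 1]

# e.g. partial correlation of Q1n with Q2, controlling for the other IVs:
# partial_corr(df, "Q1n", "Q2", ["Q1c", "Q1m", "Q13", "Q17", "Q12"])
```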
Method Stepwise: this method enters IVs one at a time, selecting the one that is significant at the 0.05 level (you can change it) and makes the largest change in the model's R2. It does one more thing: when an IV enters the model, it checks whether any of the existing variables has had its partial contribution reduced so that its significance level exceeds 0.1 (the default, which you can change). If so, that variable is excluded from the model.
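A simplified sketch of this entry/removal logic in Python, using the default thresholds from the slide; it illustrates the idea rather than reproducing SPSS's exact computations:

```python
import statsmodels.api as sm

def stepwise(df, dv, candidates, p_enter=0.05, p_remove=0.10):
    selected = []
    while True:
        # Entry step: among IVs not yet in the model, pick the one with the
        # smallest p-value (which also adds the most to R2 at this step).
        remaining = [c for c in candidates if c not in selected]
        pvals = {}
        for c in remaining:
            X = sm.add_constant(df[selected + [c]])
            pvals[c] = sm.OLS(df[dv], X).fit().pvalues[c]
        best = min(pvals, key=pvals.get) if pvals else None
        if best is None or pvals[best] >= p_enter:
            break
        selected.append(best)
        # Removal step: drop any already-entered IV whose contribution is
        # no longer significant once the new IV is in the model.
        X = sm.add_constant(df[selected])
        fitted = sm.OLS(df[dv], X).fit()
        worst = fitted.pvalues[selected].idxmax()
        if fitted.pvalues[worst] > p_remove:
            selected.remove(worst)
    return selected
```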
Examining the standardized residuals
- ZPRED (the standardized predicted values of the dependent variable based on the model): standardized forms of the values predicted by the model.
- ZRESID (the standardized residuals, or errors): standardized differences between the observed data and the values the model predicts.
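A sketch of this check in Python, z-scoring the predicted values and residuals of the fitted model above (an approximation of SPSS's ZPRED/ZRESID plot); the points should scatter randomly around zero with no funnel or curve:

```python
import matplotlib.pyplot as plt
from scipy.stats import zscore

zpred = zscore(model.fittedvalues)  # standardized predicted values
zresid = zscore(model.resid)        # standardized residuals

plt.scatter(zpred, zresid, s=10)
plt.axhline(0, color="grey")
plt.xlabel("ZPRED (standardized predicted value)")
plt.ylabel("ZRESID (standardized residual)")
plt.show()
```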
Multicollinearity: the existence of a high degree of linear correlation among two or more independent variables in a multiple regression model. In the presence of multicollinearity, it is difficult to assess the effect of each independent variable on the dependent variable. A tolerance of less than 0.20 and/or a VIF of 5 and above indicates a multicollinearity problem (Wiki). Note that tolerance is the reciprocal of VIF (tolerance = 1/VIF), so the two cut-offs are equivalent.
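A sketch of the tolerance/VIF check with statsmodels, reusing the design matrix X (with constant) from the Enter example above:

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor

for i, name in enumerate(X.columns):
    if name == "const":
        continue  # the intercept column is not an IV
    vif = variance_inflation_factor(X.values, i)
    print(f"{name}: VIF = {vif:.2f}, tolerance = {1 / vif:.2f}")
```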
Data Transformation: use the dataset Transformation.sav. Plot the scatter-plot of Life-satisfaction (Y-axis) against Income (X-axis) (Graphs/Legacy Dialogs/Scatter/Dot).
After transforming monthly income into LnIncome, we obtain a new scatter-plot.
For a curve like this, the best way is to transform the independent variable (dosage of drug) into its inverse (1/x). With the new variable on the X-axis, the plot now looks like this.
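Both transformations are one-liners in Python; the CSV export and column names below are assumptions standing in for Transformation.sav:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

tr = pd.read_csv("transformation.csv")  # hypothetical export of Transformation.sav
tr["LnIncome"] = np.log(tr["Income"])   # log transform pulls in the long right tail
tr["InvDosage"] = 1 / tr["Dosage"]      # inverse transform straightens a 1/x curve

tr.plot.scatter(x="LnIncome", y="LifeSat")
tr.plot.scatter(x="InvDosage", y="Response")
plt.show()
```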
Dummy variables: if you have a categorical (i.e. nominal) variable that you want to include in your model as an independent variable, you need to recode it into a number of dummy variables. For a dichotomous variable such as gender (male = 1, female = 0), gender = 1 means the case is male. You can rename the variable Male (1 = male, 0 = not male); Male is called a dummy variable. In a multiple regression equation, we can have something like this:
Marital satisfaction (MS) = a + b1 × Years married + b2 × Male + b3 × Income
For a male: MS = a + b1 × Years married + b2(1) + b3 × Income
For a female: MS = a + b1 × Years married + b2(0) + b3 × Income
For place of residence (HK, Kowloon, and NT), suppose you choose HK as the reference group. This gives two dummy variables: Kowloon and NT. For a person living in Kowloon, enter 1 for Kowloon and 0 for NT. If a person lives in NT, enter 0 for Kowloon and 1 for NT. For a person living in HK, both Kowloon and NT will be 0.
MS = a + b1 × Years married + b2 × Male + b3 × Income + b4 × Kowloon + b5 × NT.
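In Python, pandas can build these dummies directly; dropping the reference category (HK) leaves exactly the Kowloon and NT columns used in the equation above (the column names here are assumptions):

```python
import pandas as pd

d = pd.DataFrame({"Residence": ["HK", "Kowloon", "NT", "Kowloon"]})
dummies = pd.get_dummies(d["Residence"]).drop(columns="HK").astype(int)
print(dummies)
#    Kowloon  NT
# 0        0   0
# 1        1   0
# 2        0   1
# 3        1   0
```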
http://www.stattucino.com/berrie/dsl/regression/regression.html
Null hypothesis for the ANOVA test in regression: there is no linear relationship in the population between the dependent variable and the independent variables. Alternative: at least one population partial regression coefficient is not 0.
Null hypothesis for the t-test of a regression coefficient (b): the slope of the regression line fitting the two variables in the population is equal to zero. For the regression equation y = a + bx: H0: b = 0; Ha: b ≠ 0.