Applied Quantitative Analysis and Practices LECTURE#30 By Dr. Osman Sadiq Paracha
Previous Lecture Summary SPSS application for different methods of multiple regression; multiple regression equation formulation; interpretation of beta values; reporting the model; assumptions underlying multiple regression.
How to Interpret Beta Values Beta values: the change in the outcome associated with a unit change in the predictor. Standardised beta values: tell us the same but expressed as standard deviations.
Beta Values b1: as advertising increases by £1, album sales increase by b1 units. b2: each additional time per week a song is played on the radio, its sales increase by b2 units.
Constructing a Model
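As a sketch of the model behind the album sales example, the regression equation with two predictors can be written as below. The symbols are assumptions chosen for illustration: b_0 is the intercept, b_1 and b_2 are the unstandardized beta values for advertising and radio plays, and ε_i is the error for case i.

```latex
\[
\text{Sales}_i = b_0 + b_1\,\text{Advertising}_i + b_2\,\text{Plays}_i + \varepsilon_i
\]
```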
Standardised Beta Values β1 = 0.523: as advertising increases by 1 standard deviation, album sales increase by 0.523 of a standard deviation. β2 = 0.546: when the number of plays on the radio increases by 1 SD, sales increase by 0.546 standard deviations.
Interpreting Standardised Betas As advertising increases by £485,655 (1 SD), album sales increase by 0.523 × 80,699 ≈ 42,206. If the number of plays on the radio per week increases by 12 (1 SD), album sales increase by 0.546 × 80,699 ≈ 44,062.
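The arithmetic on this slide can be checked with a short Python sketch. The standard deviations are the ones quoted above. The two standardised betas (0.523 and 0.546) are assumptions back-computed from the slide's totals, and the function name is made up for illustration.

```python
# Standard deviations quoted on the slide.
SD_ADVERTISING = 485_655  # pounds
SD_PLAYS = 12             # radio plays per week
SD_SALES = 80_699         # album sales

# Assumed standardised betas, back-computed from the slide's totals.
BETA_ADVERTISING = 0.523
BETA_PLAYS = 0.546


def change_in_outcome(beta, sd_outcome):
    """Change in the outcome, in raw units, for a 1 SD rise in the predictor."""
    return beta * sd_outcome


advertising_effect = round(change_in_outcome(BETA_ADVERTISING, SD_SALES))
plays_effect = round(change_in_outcome(BETA_PLAYS, SD_SALES))

print(f"Advertising up by £{SD_ADVERTISING:,}: sales up by about {advertising_effect:,}")
print(f"Radio plays up by {SD_PLAYS} per week: sales up by about {plays_effect:,}")
```

Multiplying a standardised beta by the standard deviation of the outcome converts the effect back into the outcome's own units.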
Reporting the Model
How well does the Model fit the data? There are two ways to assess the accuracy of the model in the sample: residual statistics (standardized residuals) and influential cases (Cook’s distance).
Standardized Residuals In an average sample, 95% of standardized residuals should lie between ±2, and 99% of standardized residuals should lie between ±2.5. Outliers: any case for which the absolute value of the standardized residual is 3 or more is likely to be an outlier.
Cook’s Distance Measures the influence of a single case on the model as a whole. Absolute values greater than 1 may be cause for concern.
Generalization When we run regression, we hope to be able to generalize the sample model to the entire population. To do this, several assumptions must be met. Violating these assumptions stops us generalizing conclusions to our target population.
Multicollinearity Multicollinearity exists if predictors are highly correlated. The assumption of no multicollinearity can be checked with collinearity diagnostics.
Tolerance should be more than 0.2. VIF should be less than 10.
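A minimal Python sketch of these two diagnostics follows, assuming a model with only two predictors. In that case the R² from regressing one predictor on the other is the squared Pearson correlation, tolerance is 1 − R², and VIF is 1 / tolerance. The data and function names are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def tolerance_and_vif(predictor, other_predictor):
    """Tolerance and VIF for one predictor in a two-predictor model."""
    r_squared = pearson_r(predictor, other_predictor) ** 2
    tolerance = 1 - r_squared
    return tolerance, 1 / tolerance


# Hypothetical predictor values, for illustration only.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]

tolerance, vif = tolerance_and_vif(x1, x2)
print(f"tolerance = {tolerance:.3f}, VIF = {vif:.2f}")
print("within the slide's limits:", tolerance > 0.2 and vif < 10)
```

With more than two predictors, R² would come from regressing each predictor on all the others, which this sketch does not attempt.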
Checking Assumptions about Errors Homoscedasticity/Independence of Errors: Plot ZRESID against ZPRED. Normality of Errors: Normal probability plot.
Homoscedasticity: ZRESID vs. ZPRED
Normality of Errors: Histograms and P-P plots
Outliers and Residuals The normal or unstandardized residuals are measured in the same units as the outcome variable, so they are difficult to interpret across different models, and we cannot define a universal cut-off point for what constitutes a large residual. Instead, we use standardized residuals, which are the residuals divided by an estimate of their standard deviation.
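As a rough Python sketch of that division, the function below standardizes a list of raw residuals. It assumes the standard deviation is estimated by the residual standard error, sqrt(SSE / (n − k − 1)), where k is the number of predictors. The residual values and the function name are hypothetical.

```python
import math


def standardized_residuals(residuals, n_predictors):
    """Divide each residual by the residual standard error.

    The standard error is sqrt(SSE / (n - k - 1)), where k is the number
    of predictors; this is one common estimate of the residual SD.
    """
    n = len(residuals)
    sse = sum(e ** 2 for e in residuals)
    std_error = math.sqrt(sse / (n - n_predictors - 1))
    return [e / std_error for e in residuals]


# Hypothetical raw residuals from a one-predictor model.
raw = [-1.22, -0.74, 0.24, 0.62, 1.50, -0.40]
z = standardized_residuals(raw, n_predictors=1)
print([round(value, 2) for value in z])
```

Because the result is unit-free, the same cut-offs can be applied to any model.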
Outliers and Residuals Some general rules for standardized residuals are derived from these facts: (1) standardized residuals with an absolute value greater than 3.29 (we can use 3 as an approximation) are cause for concern because in an average sample a value this high is unlikely to happen by chance; (2) if more than 1% of our sample cases have standardized residuals with an absolute value greater than 2.58 (we usually just say 2.5), there is evidence that the level of error within our model is unacceptable (the model is a fairly poor fit of the sample data);
Outliers and Residuals (3) if more than 5% of cases have standardized residuals with an absolute value greater than 1.96 (we can use 2 for convenience), then there is also evidence that the model is a poor representation of the actual data. A related measure is the Studentized residual, which is the unstandardized residual divided by an estimate of its standard deviation that varies point by point. Studentized residuals have the same properties as the standardized residuals but usually provide a more precise estimate of the error variance of a specific case.
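The three rules can be applied mechanically once the standardized residuals are available. The Python sketch below is one way to do it. The function name, the dictionary keys, and the example values are assumptions made for illustration.

```python
def screen_standardized_residuals(z_values):
    """Apply the three rule-of-thumb checks to standardized residuals."""
    n = len(z_values)
    share_above_196 = sum(abs(z) > 1.96 for z in z_values) / n
    share_above_258 = sum(abs(z) > 2.58 for z in z_values) / n
    return {
        # Rule 1: any single case beyond 3.29 is a cause for concern.
        "extreme_cases": [i for i, z in enumerate(z_values) if abs(z) > 3.29],
        # Rule 2: more than 1% of cases beyond 2.58.
        "too_many_above_2.58": share_above_258 > 0.01,
        # Rule 3: more than 5% of cases beyond 1.96.
        "too_many_above_1.96": share_above_196 > 0.05,
    }


# Hypothetical sample of 100 standardized residuals.
z_values = [0.0] * 93 + [2.1] * 6 + [3.5]
report = screen_standardized_residuals(z_values)
print(report)
```

In this made-up sample, 7% of cases exceed 1.96, so the third rule is triggered. Only one case in 100 exceeds 2.58, which is not more than 1%, so the second rule is not.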
Influential Cases There are several residual statistics that can be used to assess the influence of a particular case. One is the adjusted predicted value: the predicted value for a case when that case is excluded from the analysis. The computer calculates a new model without a particular case and then uses this new model to predict the value of the outcome variable for the case that was excluded. If a case does not exert a large influence over the model, then we would expect the adjusted predicted value to be very similar to the predicted value when the case is included.
Influential Cases The difference between the adjusted predicted value and the original predicted value is known as DFFit. We can also look at the residual based on the adjusted predicted value: that is, the difference between the adjusted predicted value and the original observed value. This is the deleted residual. The deleted residual can be divided by an estimate of its standard deviation to give a standardized value known as the Studentized deleted residual. The deleted residuals are very useful for assessing the influence of a case on the ability of the model to predict that case.
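The leave-one-out idea can be sketched in Python for the simplest case of a single predictor. The data and function names are hypothetical. The sign conventions are also assumptions: DFFit is taken as adjusted minus original predicted value, and the deleted residual as observed minus adjusted predicted value.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def deletion_diagnostics(xs, ys):
    """Adjusted predicted value, DFFit and deleted residual for each case."""
    intercept, slope = fit_line(xs, ys)
    rows = []
    for i, (x, y) in enumerate(zip(xs, ys)):
        # Refit the model with case i left out.
        loo_intercept, loo_slope = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        predicted = intercept + slope * x
        adjusted = loo_intercept + loo_slope * x
        rows.append({
            "predicted": predicted,
            "adjusted": adjusted,
            "dffit": adjusted - predicted,
            "deleted_residual": y - adjusted,
        })
    return rows


# Hypothetical data; the last case sits far from the others.
xs = [1, 2, 3, 4, 5, 20]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 8.0]

rows = deletion_diagnostics(xs, ys)
for i, row in enumerate(rows):
    print(i, round(row["dffit"], 2), round(row["deleted_residual"], 2))
```

For the last case the ordinary residual is small because the fitted line is pulled towards it, while the deleted residual is large. This is why deleted residuals are more revealing than ordinary ones for such cases.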
Influential Cases One statistic that does consider the effect of a single case on the model as a whole is Cook’s distance. Cook’s distance is a measure of the overall influence of a case on the model and Cook and Weisberg (1982) have suggested that values greater than 1 may be cause for concern.
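As an illustration, Cook's distance can be computed by hand for a one-predictor model. The sketch assumes the standard formula D_i = e_i² / (p × MSE) × h_i / (1 − h_i)², where p = 2 is the number of estimated parameters and h_i is the leverage of case i. The data and function name are hypothetical, and a model with several predictors would need matrix algebra that is not shown here.

```python
def cooks_distances(xs, ys):
    """Cook's distance for each case in a one-predictor regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    n_params = 2  # intercept and slope
    mse = sum(e ** 2 for e in residuals) / (n - n_params)

    distances = []
    for x, e in zip(xs, residuals):
        leverage = 1 / n + (x - mean_x) ** 2 / sxx
        distances.append(e ** 2 / (n_params * mse) * leverage / (1 - leverage) ** 2)
    return distances


# Hypothetical data; the last case sits far from the others.
xs = [1, 2, 3, 4, 5, 20]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 8.0]

distances = cooks_distances(xs, ys)
flagged = [i for i, d in enumerate(distances) if d > 1]
print([round(d, 2) for d in distances])
print("cases with Cook's distance above 1:", flagged)
```

Only the last case exceeds the threshold of 1 in this made-up data set.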
Lecture Summary Outliers and residuals; example of model analysis for multiple regression.