Statistical Analysis SC504/HS927 Spring Term 2008 Session 7: Week 23: 7th March 2008 Complex independent variables and regression diagnostics.


1 Statistical Analysis SC504/HS927 Spring Term 2008 Session 7: Week 23: 7th March 2008 Complex independent variables and regression diagnostics

2 Overview
- Non-linearities
- Quadratics
- Cubics
- Log transformations
- Interaction terms
- Regression diagnostics

3 Non-linearities
- Linear regression assumes a linear relationship between the dependent and independent variables
- Non-linear relationships may be made linear through transformation
  - the relationship between y and x must be linear for it to be estimated by OLS
  - but x₂ can be constructed as x₂ = x₁ × x₁, i.e. x₂ = x₁² (adds a quadratic term), and x₃ = x₁³ (adds a cubic term)
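The construction above can be sketched in Python with NumPy. This is only an illustration: the data are simulated, and the true coefficients (5.0, 0.8, -0.01) are assumptions chosen to give an inverted-U shape, not values from the course.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y rises with x1 and then falls (an inverted U).
x1 = rng.uniform(20, 60, size=500)
y = 5.0 + 0.8 * x1 - 0.01 * x1**2 + rng.normal(0, 0.5, size=500)

# Construct the quadratic term exactly as on the slide: x2 = x1 * x1.
# (A cubic term would be built the same way, as x1 ** 3.)
x2 = x1 * x1

# The model is linear in (1, x1, x2), so ordinary least squares applies.
X = np.column_stack([np.ones_like(x1), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b1, b2 = coef
print(f"a={a:.3f}, b1={b1:.3f}, b2={b2:.4f}")
```

The point of the sketch is that OLS never sees the curvature directly: it fits a straight-line relationship between y and each column of X, and the curve in x1 comes from how x2 was built.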

4 The function of quadratic and cubic terms
- Suppose x₁ is age (in years)
  - define x₂ as x₁² and x₃ as x₁³
  - the relationship between y and x₂ is linear
  - the relationship between y and x₃ is linear
  - the relationship between y and age is non-linear

5 The shape of the non-linear relationship depends on the sign of the coefficients
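For the quadratic case, the dependence on sign can be made explicit with a short derivation. The symbols a, b₁, b₂ follow the notation used elsewhere in these slides; the turning-point symbol x₁* is introduced here only for illustration.

```latex
\begin{align*}
y &= a + b_1 x_1 + b_2 x_1^2 \\
\frac{dy}{dx_1} &= b_1 + 2 b_2 x_1 \\
x_1^{*} &= -\frac{b_1}{2 b_2} \qquad (b_2 \neq 0)
\end{align*}
```

If b₂ < 0 the curve is an inverted U with a maximum at x₁*; if b₂ > 0 it is U-shaped with a minimum at x₁*.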

6 Example

7 Log transformations
- Where variables have a right-skewed distribution, you might want to correct that skew by taking a log transformation
- E.g. income is often transformed in this way
- You can then run the regression with income in its log form
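A minimal sketch of the effect of a log transformation on skew, assuming simulated log-normal "incomes" (not real data). The `skewness` helper is a hypothetical function written for this example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical right-skewed incomes (log-normal draws, not real data).
income = rng.lognormal(mean=10.0, sigma=0.6, size=2000)

def skewness(v):
    """Sample skewness: the mean of the cubed standardised values."""
    z = (v - v.mean()) / v.std()
    return float((z ** 3).mean())

log_income = np.log(income)  # natural log, i.e. ln(income)

skew_raw = skewness(income)
skew_log = skewness(log_income)
print(f"skewness of income:     {skew_raw:.2f}")
print(f"skewness of ln(income): {skew_log:.2f}")
```

With data of this kind the raw variable shows clear positive skew, while the logged variable is close to symmetric.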

8 Example
The equations for the graphically illustrated relationships between income and age were:
Full-time: ln(income) = 8.1883 + 0.0444 × age - 0.000578 × age²
Part-time: ln(income) = 9.6948 - 0.1201 × age + 0.0031 × age² - 0.000026 × age³
where ln(income) is the natural log of income
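As a rough sketch, the two fitted equations can be evaluated directly. The coefficients are copied from the slide. The function names and the turning-point calculation are additions for illustration: for a quadratic a + b₁·age + b₂·age², the turning point is at age = -b₁ / (2·b₂).

```python
import math

# Coefficients copied from the slide.
def ln_income_fulltime(age):
    return 8.1883 + 0.0444 * age - 0.000578 * age ** 2

def ln_income_parttime(age):
    return 9.6948 - 0.1201 * age + 0.0031 * age ** 2 - 0.000026 * age ** 3

# Turning point of the full-time quadratic: -b1 / (2 * b2).
peak_age = -0.0444 / (2 * -0.000578)

# Convert the predicted log income back to the original scale with exp().
peak_income = math.exp(ln_income_fulltime(peak_age))

print(f"full-time predicted income peaks at age {peak_age:.1f}")
for age in (20, 30, 40, 50, 60):
    ft = math.exp(ln_income_fulltime(age))
    pt = math.exp(ln_income_parttime(age))
    print(f"age {age}: full-time {ft:,.0f}, part-time {pt:,.0f}")
```

Because the quadratic coefficient in the full-time equation is negative, predicted income rises with age, peaks at about age 38, and then declines.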

9 Interaction terms (1)
- a dummy variable for gender allows the intercept to differ between men and women
- an interaction term between the dummy and a continuous variable allows the slope to differ too
- x₁ is a continuous variable
- x₂ is 1 if female, 0 if male
- x₃ = x₁ × x₂
- estimate y = a + b₁x₁ + b₂x₂ + b₃x₃

10 Interaction terms (2)
- 1 continuous variable, 1 dummy variable and 1 interaction between them is equivalent to estimating separate regressions for each of the two categories distinguished by the dummy variable, e.g. men and women
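The equivalence can be checked numerically. The sketch below uses simulated data with assumed coefficients. It fits the pooled model y = a + b₁x₁ + b₂x₂ + b₃x₃ and then a separate regression for each group. The `ols` helper is a hypothetical convenience wrapper around `numpy.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400

x1 = rng.uniform(0, 10, size=n)              # continuous variable
x2 = (rng.random(n) < 0.5).astype(float)     # 1 if female, 0 if male
x3 = x1 * x2                                 # interaction term

# Hypothetical data with different intercepts and slopes by group.
y = 1.0 + 0.5 * x1 + 2.0 * x2 + 0.3 * x3 + rng.normal(0, 1, size=n)

def ols(X, y):
    """Return the least-squares coefficients for design matrix X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

ones = np.ones(n)

# Pooled model with dummy and interaction.
a, b1, b2, b3 = ols(np.column_stack([ones, x1, x2, x3]), y)

# Separate regressions for each category of the dummy.
men = x2 == 0
women = x2 == 1
a_m, b_m = ols(np.column_stack([ones[men], x1[men]]), y[men])
a_w, b_w = ols(np.column_stack([ones[women], x1[women]]), y[women])

print(f"men:   intercept {a:.4f} vs {a_m:.4f}, slope {b1:.4f} vs {b_m:.4f}")
print(f"women: intercept {a + b2:.4f} vs {a_w:.4f}, slope {b1 + b3:.4f} vs {b_w:.4f}")
```

The men's line is (a, b₁) and the women's line is (a + b₂, b₁ + b₃), so b₂ and b₃ measure the differences in intercept and slope. The coefficients match, but the standard errors need not, because the pooled model assumes a single residual variance for both groups.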

11 Assumptions underlying regression models (1)
- No specification error
  - relationship between dependent and independent variables is linear (NB independent variables can be transformed to allow for underlying non-linearities)
  - inclusion of all relevant independent variables
- No measurement error
  - or at least measurement errors are small and random

12 Assumptions underlying regression models (2)
- No perfect multicollinearity among the independent variables
  - high correlations between 2 or more independent variables produce multicollinearity
  - multicollinearity means the regression model can't distinguish variation in the dependent variable due to each of the correlated independent variables
  - symptom: unstable coefficients and fluctuating t-statistics depending on which of the correlated variables are included
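One common way to quantify multicollinearity is the variance inflation factor (VIF). It is not mentioned on the slides and is offered here only as an illustrative diagnostic. The sketch uses simulated regressors, and the `vif` helper is a hypothetical function written for this example.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300

x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # nearly a copy of x1
x3 = rng.normal(size=n)                    # unrelated to the others
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF for column j: 1 / (1 - R^2) from regressing it on the other columns."""
    target = X[:, j]
    others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
    coef, *_ = np.linalg.lstsq(others, target, rcond=None)
    resid = target - others @ coef
    r2 = 1.0 - resid.var() / target.var()
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]
for name, value in zip(("x1", "x2", "x3"), vifs):
    print(f"VIF({name}) = {value:.1f}")
```

A VIF near 1 indicates a regressor that is almost unrelated to the others; very large values flag the kind of high correlation that makes coefficients unstable.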

13 Assumptions about the error term (regression residuals)
- normally distributed, with zero mean and constant variance
- non-normality of residuals invalidates t-tests for significance of coefficients in small samples (coefficient estimates are unbiased)
- non-zero mean → intercept is under/over-estimated by the value of the mean
- constant variance = homoscedasticity = variation in residuals is the same for all values of the independent variables
- heteroscedasticity → variation in residuals is larger/smaller for some values of the independent variable(s) than others

14 Heteroscedasticity
[Scatter plot: cigarette consumption against age]
Variation in the residuals is greater at younger ages than at older ages

15 More assumptions about the residuals
- no correlation among the residuals
- residuals uncorrelated with the independent variables
- correlation between residuals and independent variables → biased coefficient estimates
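A simulated sketch of how correlation between the error term and an independent variable biases a coefficient. An omitted variable z, correlated with x, is left in the error term of the "short" regression. All names and coefficient values are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000

x = rng.normal(size=n)
z = 0.8 * x + rng.normal(scale=0.6, size=n)   # correlated with x
y = 1.0 + 2.0 * x + 3.0 * z + rng.normal(size=n)

ones = np.ones(n)

# Full model: z is included, so the error is unrelated to x.
full, *_ = np.linalg.lstsq(np.column_stack([ones, x, z]), y, rcond=None)

# Short model: z is omitted, so its effect sits in the error term,
# which is now correlated with x.
short, *_ = np.linalg.lstsq(np.column_stack([ones, x]), y, rcond=None)

b_x_full = full[1]
b_x_short = short[1]
print(f"coefficient on x, z included: {b_x_full:.2f} (true value 2.0)")
print(f"coefficient on x, z omitted:  {b_x_short:.2f}")
```

In the short model the coefficient on x absorbs part of the effect of z, landing near 2.0 + 3.0 × 0.8 = 4.4 rather than the true 2.0.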

16 To investigate residuals
- Use the plots command in the regression dialogue box:
  - for examining the normality of the residuals, check the histogram box
  - for examining heteroscedasticity, plot zresid (y) against zpred (x); you don't want to see a positive or negative relationship between the two
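Outside the regression dialogue box, a rough analogue of the zresid-against-zpred check can be sketched in Python. The data are simulated so that residual variation is larger at younger ages. Variable names mirror the slide, but the `standardise` helper is a simple z-score and may differ slightly from how a statistics package standardises residuals.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1000

age = rng.uniform(16, 80, size=n)
# Hypothetical heteroscedastic data: the noise shrinks as age rises.
noise_sd = 12.0 - 0.12 * age
cigs = 30.0 - 0.2 * age + rng.normal(size=n) * noise_sd

X = np.column_stack([np.ones(n), age])
coef, *_ = np.linalg.lstsq(X, cigs, rcond=None)
pred = X @ coef
resid = cigs - pred

def standardise(v):
    """Convert values to z-scores (mean 0, standard deviation 1)."""
    return (v - v.mean()) / v.std()

zpred = standardise(pred)
zresid = standardise(resid)

# Compare the spread of residuals in the lower and upper halves of zpred.
low = zpred < np.median(zpred)
spread_low = zresid[low].std()
spread_high = zresid[~low].std()
print(f"residual spread, low zpred:  {spread_low:.2f}")
print(f"residual spread, high zpred: {spread_high:.2f}")
```

The sketch compares the spread of the residuals rather than their trend, because the linear correlation between OLS residuals and fitted values is zero by construction when the model has an intercept. A clear difference in spread between the two halves corresponds to the funnel shape that would be visible in the plot.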

17 How serious are violations of these assumptions?
- Difficult to generalise
- Think about the application and e.g. possible omitted variables, sources of correlation between residuals and independent variable

