1
Statistical Analysis SC504/HS927, Spring Term 2008
Session 7 (Week 23), 7th March 2008
Complex independent variables and regression diagnostics
2
Overview
Non-linearities
  Quadratics
  Cubics
  Log transformations
Interaction terms
Regression diagnostics
3
Non-linearities
Linear regression assumes a linear relationship between the dependent and independent variables.
Non-linear relationships may be made linear through transformation.
The relationship between y and x must be linear for it to be estimated by OLS, but a new variable $x_2$ can be constructed as $x_2 = x_1 \times x_1$, i.e. $x_2 = x_1^2$ (adds a quadratic term), and $x_3 = x_1^3$ (adds a cubic term).
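A minimal sketch of this construction, using made-up data rather than anything from the course. The squared term is simply another column, so the model stays linear in its coefficients and can be fitted by ordinary least squares. The helper `ols` and all the numbers are illustrative assumptions.

```python
def ols(columns, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(columns)
    # Augmented matrix [X'X | X'y]
    a = [
        [sum(p * q for p, q in zip(columns[i], columns[j])) for j in range(k)]
        + [sum(p * t for p, t in zip(columns[i], y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [v - f * p for v, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


# Hypothetical data: y follows an exact quadratic in x1
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.0 + 3.0 * v - 0.5 * v**2 for v in x1]

# Construct the transformed variables
x2 = [v * v for v in x1]   # quadratic term
x3 = [v**3 for v in x1]    # cubic term, built the same way (not used below)

# Regress y on a constant, x1 and x2: linear in the coefficients
const = [1.0] * len(x1)
a_hat, b1_hat, b2_hat = ols([const, x1, x2], y)
print(a_hat, b1_hat, b2_hat)
```

Because the data were generated without noise, the fitted coefficients should recover the values used to build y, up to rounding error.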
4
The function of quadratic and cubic terms
Suppose $x_1$ is age (in years); define $x_2$ as $x_1^2$ and $x_3$ as $x_1^3$.
The relationship between y and $x_2$ is linear.
The relationship between y and $x_3$ is linear.
The relationship between y and age is non-linear.
5
The shape of the non-linear relationship depends on the sign of the coefficients.
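For the quadratic case this can be made concrete with a short derivation. The symbols $a$, $b_1$, $b_2$ follow the notation used elsewhere in these slides; the turning point $x_1^{*}$ is a label introduced here for illustration.

```latex
y = a + b_1 x_1 + b_2 x_1^2
\qquad
\frac{dy}{dx_1} = b_1 + 2 b_2 x_1 = 0
\;\Rightarrow\;
x_1^{*} = -\frac{b_1}{2 b_2}
```

If $b_2 < 0$ the curve is an inverted U with its maximum at $x_1^{*}$; if $b_2 > 0$ it is U-shaped with its minimum there. Whether the turning point matters in practice depends on whether $x_1^{*}$ falls inside the observed range of $x_1$.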
6
Example
7
Log transformations
Where a variable has a right-skewed distribution, you might want to correct that skew by taking a log transformation. Income, for example, is often transformed in this way.
You can then use income in its log form in the regression.
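As a rough illustration of the effect, the sketch below compares the skewness of some invented income figures before and after taking natural logs. The data and the moment-based `skewness` helper are assumptions made for this example only.

```python
import math


def skewness(values):
    """Moment-based skewness: positive values indicate a longer right tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical incomes with a long right tail
incomes = [12000, 15000, 18000, 20000, 22000, 25000, 30000, 40000, 60000, 150000]
log_incomes = [math.log(v) for v in incomes]

skew_raw = skewness(incomes)
skew_log = skewness(log_incomes)
print("skewness of income:    ", round(skew_raw, 2))
print("skewness of ln(income):", round(skew_log, 2))
```

On data like these the log transformation reduces the right skew without necessarily removing it, which is why it is worth checking the transformed distribution rather than assuming it is symmetric.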
8
Example
The equations for the graphically illustrated relationships between income and age were:
Full-time: $\ln(\text{income}) = 8.1883 + 0.0444 \times \text{age} - 0.000578 \times \text{age}^2$
Part-time: $\ln(\text{income}) = 9.6948 - 0.1201 \times \text{age} + 0.0031 \times \text{age}^2 - 0.000026 \times \text{age}^3$
where ln(income) is the natural log of income.
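The two fitted equations can be evaluated directly. The sketch below uses only the coefficients quoted on this slide; the function names and the choice of ages are illustrative, and the income units are not stated in the source.

```python
import math


def ln_income_full_time(age):
    """Quadratic in age, coefficients as quoted on the slide."""
    return 8.1883 + 0.0444 * age - 0.000578 * age**2


def ln_income_part_time(age):
    """Cubic in age, coefficients as quoted on the slide."""
    return 9.6948 - 0.1201 * age + 0.0031 * age**2 - 0.000026 * age**3


# Age at which the full-time quadratic peaks: -b1 / (2 * b2)
peak_age_full_time = -0.0444 / (2 * -0.000578)
print("full-time peak at age", round(peak_age_full_time, 1))

for age in (20, 40, 60):
    ft = ln_income_full_time(age)
    pt = ln_income_part_time(age)
    # exp() undoes the natural log to give predicted income on its original scale
    print(age, round(ft, 4), round(math.exp(ft)), round(pt, 4), round(math.exp(pt)))
```

The negative coefficient on the squared term in the full-time equation gives an inverted U: predicted log income rises with age up to the peak and falls thereafter.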
9
Interaction terms (1)
A dummy variable for gender allows the intercept to differ between men and women.
An interaction term between the dummy and a continuous variable allows the slope to differ too.
  $x_1$ is a continuous variable
  $x_2$ is 1 if female, 0 if male
  $x_3 = x_1 \times x_2$
Estimate $y = a + b_1 x_1 + b_2 x_2 + b_3 x_3$
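Substituting the two values of the dummy into the estimated equation shows how the intercept and slope differ by group. This is a worked restatement of the model on this slide, using the same symbols.

```latex
\text{Men } (x_2 = 0):\quad y = a + b_1 x_1
\qquad
\text{Women } (x_2 = 1):\quad y = (a + b_2) + (b_1 + b_3)\, x_1
```

So $b_2$ is the difference in intercepts between women and men, and $b_3$ is the difference in slopes.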
10
Interaction terms (2)
Including one continuous variable, one dummy variable and one interaction between them is equivalent to estimating separate regressions for each of the two categories distinguished by the dummy variable, e.g. men and women.
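The equivalence can be checked numerically. The sketch below fits the pooled model with an interaction and then fits each group separately, on invented data; the `ols` helper, the variable names and the numbers are all assumptions for illustration. The coefficient estimates coincide, although standard errors (not computed here) need not.

```python
def ols(columns, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(columns)
    a = [
        [sum(p * q for p, q in zip(columns[i], columns[j])) for j in range(k)]
        + [sum(p * t for p, t in zip(columns[i], y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [v - f * p for v, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


# Hypothetical data: five men (x2 = 0) followed by five women (x2 = 1)
x1 = [1.0, 2.0, 3.0, 4.0, 5.0] * 2
x2 = [0.0] * 5 + [1.0] * 5
y = [2.1, 2.9, 4.2, 4.8, 6.1, 1.0, 3.1, 4.9, 7.2, 8.8]

# Interaction term x3 = x1 * x2
x3 = [c * d for c, d in zip(x1, x2)]

# Pooled model: y = a + b1*x1 + b2*x2 + b3*x3
const = [1.0] * len(y)
a, b1, b2, b3 = ols([const, x1, x2, x3], y)


def subgroup_fit(flag):
    """Simple regression of y on x1 within one category of the dummy."""
    xs = [c for c, d in zip(x1, x2) if d == flag]
    ys = [t for t, d in zip(y, x2) if d == flag]
    return ols([[1.0] * len(xs), xs], ys)


a_men, b_men = subgroup_fit(0.0)
a_women, b_women = subgroup_fit(1.0)

print("men:   pooled", a, b1, "separate", a_men, b_men)
print("women: pooled", a + b2, b1 + b3, "separate", a_women, b_women)
```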
11
Assumptions underlying regression models (1)
No specification error
  The relationship between the dependent and independent variables is linear (NB independent variables can be transformed to allow for underlying non-linearities).
  All relevant independent variables are included.
No measurement error
  Or at least measurement errors are small and random.
12
Assumptions underlying regression models (2)
No perfect multicollinearity among the independent variables
  High correlations between two or more independent variables produce multicollinearity.
  Multicollinearity means the regression model cannot distinguish the variation in the dependent variable due to each of the correlated independent variables.
  Symptom: unstable coefficients and fluctuating t-statistics, depending on which of the correlated variables are included.
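One simple way to look for this problem is to inspect the correlation between pairs of independent variables. The sketch below does so for two invented predictors and also reports a variance inflation factor (VIF), a common diagnostic that is not mentioned on the slides; with only two predictors it reduces to 1 / (1 - r^2). All names and data are illustrative assumptions.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((p - mu) * (q - mv) for p, q in zip(u, v))
    suu = sum((p - mu) ** 2 for p in u)
    svv = sum((q - mv) ** 2 for q in v)
    return suv / math.sqrt(suu * svv)


# Hypothetical predictors that move almost in lockstep
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]

r = pearson(x1, x2)
# With two predictors, the R-squared from regressing one on the other is r**2
vif = 1.0 / (1.0 - r**2)
print("correlation:", round(r, 4))
print("VIF:", round(vif, 1))
```

A correlation this close to 1 means the two variables carry almost the same information, which is the situation in which coefficients become unstable as variables are added or dropped.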
13
Assumptions about the error term (regression residuals)
Normally distributed, with zero mean and constant variance
  Non-normality of residuals: invalidates t-tests for the significance of coefficients in small samples (coefficient estimates remain unbiased).
  Non-zero mean: the intercept is under- or over-estimated by the value of the mean.
  Constant variance (homoscedasticity): the variation in the residuals is the same for all values of the independent variables.
  Heteroscedasticity: the variation in the residuals is larger or smaller for some values of the independent variable(s) than for others.
14
Heteroscedasticity
[Figure: scatter plot of cigarette consumption against age]
Variation in the residuals is greater at younger ages than at older ages.
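A crude numerical counterpart to this picture is to fit the regression and compare the spread of the residuals in a younger and an older age band. The sketch below does this on invented data in which the noise shrinks with age; the data-generating rule, the variable names and the split at the midpoint are all assumptions, not the data behind the slide.

```python
# Hypothetical data: consumption falls with age and is noisier at younger ages
ages = list(range(20, 60))
noise = [(60 - a) / 10 * (1 if i % 2 == 0 else -1) for i, a in enumerate(ages)]
cigs = [30 - 0.3 * a + e for a, e in zip(ages, noise)]

# Simple regression of consumption on age
n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(cigs) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, cigs)) / sum(
    (x - mean_x) ** 2 for x in ages
)
intercept = mean_y - slope * mean_x
residuals = [y - (intercept + slope * x) for x, y in zip(ages, cigs)]


def variance(values):
    """Population variance of a list of numbers."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


# Compare residual spread in the younger and older halves of the sample
var_young = variance(residuals[: n // 2])
var_old = variance(residuals[n // 2:])
print("residual variance, ages 20-39:", round(var_young, 2))
print("residual variance, ages 40-59:", round(var_old, 2))
```

A markedly larger variance in one band than the other is the numerical signature of the fan shape in the scatter plot.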
15
More assumptions about the residuals
No correlation among the residuals.
Residuals uncorrelated with the independent variables.
  Correlation between residuals and independent variables leads to biased coefficient estimates.
16
To investigate residuals
Use the plots command in the regression dialogue box:
  To examine the normality of the residuals, check the histogram box.
  To examine heteroscedasticity, plot zresid (y) against zpred (x). You do not want to see any systematic pattern between the two, for example the spread of the residuals widening or narrowing as the predicted values increase.
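Outside the dialogue box, the same two quantities can be computed by hand. The sketch below uses invented data and a simplified standardisation (residuals divided by the root mean squared error, predicted values converted to z-scores), which may differ in detail from the package's own definitions. Because OLS residuals are uncorrelated with the fitted values by construction, the check looks at the size of the residuals instead.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((p - mu) * (q - mv) for p, q in zip(u, v))
    suu = sum((p - mu) ** 2 for p in u)
    svv = sum((q - mv) ** 2 for q in v)
    return suv / math.sqrt(suu * svv)


# Hypothetical data whose noise grows with x (heteroscedastic by design)
x = [float(i) for i in range(1, 21)]
y = [2.0 + 0.5 * v + 0.1 * v * (1 if i % 2 else -1) for i, v in enumerate(x)]

# Simple regression of y on x
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
slope = sum((p - mean_x) * (q - mean_y) for p, q in zip(x, y)) / sum(
    (p - mean_x) ** 2 for p in x
)
intercept = mean_y - slope * mean_x
pred = [intercept + slope * p for p in x]
resid = [q - f for q, f in zip(y, pred)]

# Standardised residuals: residual / root mean squared error (n - 2 d.f.)
rmse = math.sqrt(sum(e * e for e in resid) / (n - 2))
zresid = [e / rmse for e in resid]

# Standardised predicted values: z-scores of the fitted values
mean_p = sum(pred) / n
sd_p = math.sqrt(sum((f - mean_p) ** 2 for f in pred) / (n - 1))
zpred = [(f - mean_p) / sd_p for f in pred]

corr_level = pearson(zpred, zresid)                     # zero by construction
corr_spread = pearson(zpred, [abs(e) for e in zresid])  # picks up fanning
print("corr(zpred, zresid):  ", round(corr_level, 6))
print("corr(zpred, |zresid|):", round(corr_spread, 3))
```

A clearly non-zero correlation between the predicted values and the absolute residuals suggests the spread is changing across the range of predictions; plotting zresid against zpred remains the more informative check.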
17
How serious are violations of these assumptions?
Difficult to generalise.
Think about the application and, for example, possible omitted variables and sources of correlation between the residuals and the independent variables.