Statistical Data Analysis 2010/2011 M. de Gunst Lecture 10.

1 Statistical Data Analysis 2010/2011 M. de Gunst Lecture 10

2 Statistical Data Analysis: Introduction
Topics:
- Summarizing data
- Investigating distributions
- Bootstrap
- Robust methods
- Nonparametric tests
- Analysis of categorical data
- Multiple linear regression (continued)

3 Multiple linear regression (Reader: Chapter 8)
Relationship between one response variable and one or more explanatory variables
Last time:
- Statistical model
- Parameter estimation
- Selection of explanatory variables (determination coefficient, F- and t-tests)
- Model quality: global methods/diagnostics (plots)
This week: further investigation of model quality
- deviating observation points: outlier, leverage point/potential, influence point
  plots, numerical measures and tests: test for outliers, hat matrix, Cook’s distance
- explanatory variables that are themselves linearly related (collinearity)
  plots, numerical measures: variance inflation factors, condition indices, variance decomposition

4 Statistical model
Multiple linear regression model: $Y = X\beta + e$, with errors $e_1, \dots, e_n$ independent and normally distributed, $e_i \sim N(0, \sigma^2)$
Issues:
1) estimate $\beta$ and $\sigma^2$
2) select explanatory variables
3) assess model quality
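
As a reminder of issue 1), the following is a minimal sketch of least squares estimation in the model $Y = X\beta + e$, assuming NumPy. The data, the true coefficients and all variable names are illustrative assumptions, not taken from the lecture.

```python
import numpy as np

# Synthetic data (illustrative, not from the lecture)
rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])      # design matrix with intercept column
beta_true = np.array([1.0, 2.0, -1.0])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Least squares estimate of beta: solves the normal equations X^T X b = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Unbiased estimate of sigma^2: residual sum of squares / (n - number of parameters)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - X.shape[1])

print("beta_hat  :", np.round(beta_hat, 3))
print("sigma2_hat:", round(float(sigma2_hat), 4))
```

Solving the normal equations directly is fine for a well-conditioned design matrix; with (approximate) collinearities a decomposition-based solver such as `np.linalg.lstsq` is the safer choice.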

5 3) Assessment of model quality – deviating points
Consider observation point $(y_i, x_{i1}, \dots, x_{ip})$
Types of deviating observation points:
- deviating response: outlier
- deviating explanatory variable: potential or leverage point
- if point has influence: influence point
How to detect:
- outlier: test for outliers
- leverage point: hat matrix
- influence point: Cook’s distance

6 Example outlier
Forbes’ data: boiling temperature for different pressures
Small deviation in response may have large effect
Generally easy to detect in plots

7 3) Assessment of model quality – outliers
Outlier: deviating response
How to detect?
- Make plots - which ones?
- If possible outliers detected, do formal test
Idea: if k-th point is an outlier, then it fits the regression model up to a shift δ, i.e. it fits the mean shift outlier model
$Y_i = x_i^T\beta + e_i$ for $i \neq k$, $Y_k = x_k^T\beta + \delta + e_k$ (with $x_i^T$ the i-th row of $X$)
for sufficiently large | δ |, or in matrix notation
$Y = X\beta + \delta u + e$ with $u = (u_1, \dots, u_n)^T$ s.t. $u_k = 1$ and $u_i = 0$ for $i \neq k$
When is k-th point an outlier in terms of δ? How to test?

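8 3) Assessment of model quality – outliers
Outlier: deviating response
If k-th point is an outlier, then it fits the mean shift outlier model $Y = X\beta + \delta u + e$ for sufficiently large | δ |, with $u = (u_1, \dots, u_n)^T$ s.t. $u_k = 1$ and $u_i = 0$ for $i \neq k$
When is k-th point an outlier in terms of δ?
If δ is significantly different from 0, then k-th point is an outlier
Test for outlier:
$H_0$: δ = 0, β arbitrary
$H_1$: δ ≠ 0, β arbitrary (note: in Reader one-sided)
Test statistic: $T = \hat\delta / \widehat{\mathrm{se}}(\hat\delta)$, the t-statistic for δ in the mean shift outlier model; under $H_0$, $T \sim t_\nu$ with $\nu$ equal to n minus the number of estimated parameters (the components of β plus δ)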
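
The outlier test can be sketched numerically as follows, assuming NumPy and synthetic data. The helper name `outlier_t_statistic`, the data and the seed are illustrative assumptions, not from the lecture. The sketch adds the indicator column of the k-th point to the design matrix, fits the mean shift outlier model by least squares, and returns the t-statistic for δ together with its degrees of freedom.

```python
import numpy as np

def outlier_t_statistic(X, y, k):
    """t-statistic for delta in the mean shift outlier model for point k.

    X is the n x q design matrix (including a column of ones if the model
    has an intercept); the augmented model adds the indicator of point k.
    """
    n, q = X.shape
    u = np.zeros(n)
    u[k] = 1.0                      # u_k = 1, all other entries 0
    Xa = np.column_stack([X, u])    # design matrix of the mean shift model
    coef, *_ = np.linalg.lstsq(Xa, y, rcond=None)
    resid = y - Xa @ coef
    df = n - q - 1                  # n minus number of parameters (beta and delta)
    s2 = resid @ resid / df         # estimate of sigma^2 in the augmented model
    cov = s2 * np.linalg.inv(Xa.T @ Xa)
    delta_hat = coef[-1]
    return delta_hat / np.sqrt(cov[-1, -1]), df

# Synthetic example (illustrative data, not from the lecture)
rng = np.random.default_rng(0)
n = 20
x = np.linspace(0.0, 10.0, n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
y[7] += 5.0                         # shift one response by delta = 5

t_shifted, df = outlier_t_statistic(X, y, 7)
t_regular, _ = outlier_t_statistic(X, y, 3)
print(f"t for shifted point 7: {t_shifted:.2f}  (df = {df})")
print(f"t for regular point 3: {t_regular:.2f}")
```

The returned value is compared with a quantile of the t-distribution with `df` degrees of freedom; the shifted point should give a much larger |t| than a regular point.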

9 Example leverage point
Huber’s data:
Small deviation in explanatory variable may have large effect
Often difficult to detect in plots:
- on edge of range of values
- value of residual often not large

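10 3) Assessment of model quality – leverage points
Potential or leverage point: deviating explanatory variable
How to detect? With hat matrix $H = X(X^TX)^{-1}X^T$
stems from $\hat{Y} = X\hat\beta = X(X^TX)^{-1}X^TY = HY$
Properties of H: $H^T = H$ and $H^2 = H$, hence $h_{ii} = \sum_j h_{ij}^2$ and $0 \le h_{ii} \le 1$; if $h_{ii}$ large then other $h_{ij}$ small
We see $\hat{Y}_i = h_{ii}Y_i + \sum_{j \neq i} h_{ij}Y_j$ and $\mathrm{Var}(\hat{e}_i) = \sigma^2(1 - h_{ii})$ for the residual $\hat{e}_i = Y_i - \hat{Y}_i$
Hence, if $h_{ii}$ large, then i-th point has potential influence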
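
A small sketch of the hat matrix $H = X(X^TX)^{-1}X^T$ and its diagonal elements $h_{ii}$ (the leverages), assuming NumPy. The design below, with one x-value far from the rest, is an illustrative assumption and not Huber’s data.

```python
import numpy as np

# Synthetic design (illustrative): 10 points, one far from the rest in x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0])
X = np.column_stack([np.ones_like(x), x])    # intercept column plus x

# Hat matrix H = X (X^T X)^{-1} X^T, so that fitted values are H @ y
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)                               # leverages h_ii

# Properties: H is symmetric and idempotent, and trace(H) = number of columns of X
assert np.allclose(H, H.T)
assert np.allclose(H @ H, H)
assert np.isclose(h.sum(), X.shape[1])

# Consequence of H^2 = H: h_ii = sum_j h_ij^2, so a large h_ii forces the other h_ij to be small
assert np.allclose(h, (H ** 2).sum(axis=1))

print(np.round(h, 3))
print("largest leverage at index", int(np.argmax(h)))
```

The leverages depend only on the explanatory variables, not on the response, which is why a leverage point can go unnoticed in a residual plot.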

11 3) Assessment of model quality – influence points
Influence point: if point has influence
How to detect?
- check if point is outlier or leverage point
- if yes, then fit model with and without this point
- if result very different: point is influence point
Measure based on difference between estimated beta’s: Cook’s distance for i-th point:
$D_i = \frac{(\hat\beta_{(i)} - \hat\beta)^T X^TX (\hat\beta_{(i)} - \hat\beta)}{(p+1)\,\hat\sigma^2}$
where $\hat\beta_{(i)}$ is the parameter estimate without the i-th point and $p+1$ is the number of components of β (intercept included)
If $D_i$ larger than 1 (roughly), then i-th point is influence point

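12 3) Assessment of model quality – influence points
Measure of influence based on difference between estimated beta’s: Cook’s distance for i-th point:
$D_i = \frac{(\hat\beta_{(i)} - \hat\beta)^T X^TX (\hat\beta_{(i)} - \hat\beta)}{(p+1)\,\hat\sigma^2}$
where $\hat\beta_{(i)}$ is the parameter estimate without the i-th point and $p+1$ is the number of components of β (intercept included)
If $D_i$ larger than 1 (roughly), then i-th point is influence point
Explanation: the set
$\{ b : \frac{(b - \hat\beta)^T X^TX (b - \hat\beta)}{(p+1)\,\hat\sigma^2} \le F_{p+1,\,n-p-1;\,1-\alpha} \}$
is a confidence region with confidence 1 – α for parameter vector β
Thus $\frac{(b - \hat\beta)^T X^TX (b - \hat\beta)}{(p+1)\,\hat\sigma^2}$ defines a measure of distance of b from $\hat\beta$
For choices of α around 0.5 the values of b outside this set lie “far away” from $\hat\beta$
For choices of α around 0.5 the boundary of the set, $F_{p+1,\,n-p-1;\,1-\alpha}$, has value around 1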
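
Cook’s distance can be sketched directly from its definition by refitting the model without each point in turn, assuming NumPy. The helper name `cooks_distances` and the synthetic data are illustrative assumptions; `q` denotes the number of components of β. The sketch also checks the result against the well-known equivalent expression in terms of residuals and leverages, $D_i = \hat{e}_i^2 h_{ii} / (q\,\hat\sigma^2 (1 - h_{ii})^2)$.

```python
import numpy as np

def cooks_distances(X, y):
    """Cook's distance for every point, by refitting without that point."""
    n, q = X.shape                             # q = number of components of beta
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    s2 = resid @ resid / (n - q)               # estimate of sigma^2 from the full fit
    XtX = X.T @ X
    D = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]  # estimate without point i
        d = beta_i - beta
        D[i] = d @ XtX @ d / (q * s2)
    return D

# Synthetic data (illustrative, not from the lecture)
rng = np.random.default_rng(1)
x = np.append(np.linspace(1.0, 9.0, 9), 30.0)  # last point is a leverage point
X = np.column_stack([np.ones_like(x), x])
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=x.size)
y[-1] -= 8.0                                   # ... whose response also deviates

D = cooks_distances(X, y)

# Cross-check against the equivalent closed form using residuals and leverages
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)
e = y - H @ y
s2 = e @ e / (X.shape[0] - X.shape[1])
D_closed = e**2 * h / (X.shape[1] * s2 * (1.0 - h) ** 2)
assert np.allclose(D, D_closed)

print(np.round(D, 3))
```

The closed form avoids the n refits, but the loop mirrors the "fit with and without this point" idea more literally.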

13 Example influence points
Cook’s distances for different data sets [figure]

14 3) Assessment of model quality – collinearity
Explanatory variables that are themselves linearly related – collinearity
Numerical measures: variance inflation factors, condition indices, variance decomposition
When a problem?
- if variance of one or more estimators is large, then estimate(s) not reliable
How to detect?
- known methods: scatter plots, correlation coefficients (between pairs of variables), determination coefficient of $X_j$ on the others = squared multiple linear correlation coefficient between $X_j$ and the others
- plus several new numerical measures

15 3) Assessment of model quality – collinearity
Columns $X_j$ of a matrix $X$ are exactly collinear if $\sum_j c_j X_j = 0$ for some constants $c_j$ not all equal to 0
If one or more collinearities in (general) matrix X, then rank(X) not maximal and $(X^TX)^{-1}$ does not exist
With approximate collinearities $(X^TX)^{-1}$ difficult to compute
In design matrix X one or more (approximate) collinearities can exist between its columns
In that case $\hat\beta = (X^TX)^{-1}X^TY$ difficult to compute and/or one or more $\mathrm{Var}(\hat\beta_j)$ may be large

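16 3) Assessment of model quality – collinearity
How to detect collinearity
- scatter plots, correlation coefficients (between pairs of variables), determination coefficient of $X_j$ on all others = squared multiple linear correlation coefficient between $X_j$ and all others
- 4 new numerical measures
i) variance inflation factors
$VIF_j = \frac{1}{1 - R_j^2}$, with $R_j^2$ the determination coefficient of $X_j$ on all others,
because $\mathrm{Var}(\hat\beta_j) = \frac{\sigma^2}{\sum_i (x_{ij} - \bar{x}_j)^2} \cdot \frac{1}{1 - R_j^2}$
$VIF_j$ is amount of increase in variance of $\hat\beta_j$ due to relationship between $X_j$ and all others
If $VIF_j$ large, then estimate $\hat\beta_j$ unreliable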
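
A minimal sketch of variance inflation factors, computed as $VIF_j = 1/(1 - R_j^2)$ by regressing each explanatory variable on all the others (with intercept), assuming NumPy. The helper name `variance_inflation_factors` and the synthetic data are illustrative assumptions, not from the lecture.

```python
import numpy as np

def variance_inflation_factors(Z):
    """VIF_j = 1 / (1 - R_j^2) for each column of Z (explanatory variables, no intercept column)."""
    n, p = Z.shape
    vif = np.empty(p)
    for j in range(p):
        target = Z[:, j]
        others = np.column_stack([np.ones(n), np.delete(Z, j, axis=1)])
        fitted = others @ np.linalg.lstsq(others, target, rcond=None)[0]
        ss_res = np.sum((target - fitted) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r2 = 1.0 - ss_res / ss_tot             # determination coefficient of X_j on the others
        vif[j] = 1.0 / (1.0 - r2)
    return vif

# Synthetic data (illustrative, not from the lecture)
rng = np.random.default_rng(2)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + rng.normal(scale=0.05, size=n)  # nearly a linear combination of x1 and x2
Z = np.column_stack([x1, x2, x3])

vif = variance_inflation_factors(Z)

# Equivalent route: VIF_j is the j-th diagonal element of the inverse correlation matrix
vif_corr = np.diag(np.linalg.inv(np.corrcoef(Z, rowvar=False)))
assert np.allclose(vif, vif_corr, rtol=1e-6)
print(np.round(vif, 1))
```

All three variables take part in the near-collinearity here, so all three inflation factors come out large; the VIFs alone do not show which variables belong to the same collinearity.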

17 3) Assessment of model quality – collinearity
How to detect collinearity
ii) condition number (read in Reader)
iii) condition indices
makes use of singular value decomposition $X = UDV^T$ with $U^TU = V^TV = I$ and D = diagonal($\mu_1, \mu_2, \dots$), where $\mu_k \ge 0$ are the singular values of X
k-th condition index: $\kappa_k = \mu_{\max} / \mu_k$
If $\mu_k$ small, thus $\kappa_k$ large → collinearity, because then $Xv_k = \mu_k u_k \approx 0$, with $u_k$ and $v_k$ the k-th columns of U and V
If $v_{jk}$ not too small, then $X_j$ involved in collinearity
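
The condition indices can be sketched with NumPy's singular value decomposition. The synthetic data and the choice to scale the columns of X to unit length first (a common convention for this diagnostic) are assumptions of this sketch, not prescriptions from the lecture.

```python
import numpy as np

# Synthetic data (illustrative, not from the lecture)
rng = np.random.default_rng(3)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + rng.normal(scale=0.01, size=n)   # approximate collinearity: x3 ~ x1 + x2
X = np.column_stack([x1, x2, x3])
X = X / np.linalg.norm(X, axis=0)               # scale columns to unit length (assumed convention)

# Singular value decomposition X = U D V^T
U, mu, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T

cond_idx = mu.max() / mu                        # k-th condition index
k = int(np.argmax(cond_idx))                    # component with the smallest singular value

# X v_k = mu_k u_k, so a small mu_k means X v_k is nearly the zero vector
assert np.allclose(X @ V[:, k], mu[k] * U[:, k])

print("singular values  :", np.round(mu, 4))
print("condition indices:", np.round(cond_idx, 1))
print("v_k for largest index:", np.round(V[:, k], 3))
```

The entries of `V[:, k]` that are not close to zero point at the columns of X taking part in the collinearity; here that should be all three.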

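18 3) Assessment of model quality – collinearity
How to detect collinearity
iv) variance decomposition proportions
$\pi_{kj} = \frac{v_{jk}^2 / \mu_k^2}{\sum_l v_{jl}^2 / \mu_l^2}$,
because $\mathrm{Var}(\hat\beta_j) = \sigma^2 \sum_k v_{jk}^2 / \mu_k^2$ (from s.v.d. $X = UDV^T$, with $\mu_k$ the singular values of X and $v_{jk}$ the elements of V)
If the condition index $\kappa_k = \mu_{\max} / \mu_k$ is large, then investigate which terms are involved via the $\pi_{kj}$
Write the $\pi_{kj}$ in a matrix and look in the row of large $\kappa_k$ (= small $\mu_k$) which $\pi_{kj}$ are close to 1
Corresponding $X_j$ involved in collinearity
Easier to see than with method (iii)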
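
A sketch of the variance decomposition proportions, assuming NumPy. It uses the identity that the j-th diagonal element of $(X^TX)^{-1}$ equals $\sum_k v_{jk}^2 / \mu_k^2$, where $\mu_k$ are the singular values of X and $v_{jk}$ the elements of V in $X = UDV^T$. The helper name `variance_decomposition`, the synthetic data and the unit-length column scaling are illustrative assumptions.

```python
import numpy as np

def variance_decomposition(X):
    """Condition indices and variance decomposition proportions from the s.v.d. of X.

    Returns (cond_idx, pi) where pi[k, j] is the proportion of Var(beta_hat_j)
    associated with the k-th singular value.
    """
    _, mu, Vt = np.linalg.svd(X, full_matrices=False)
    V = Vt.T                                    # V[j, k] = v_jk
    phi = (V / mu) ** 2                         # phi[j, k] = v_jk^2 / mu_k^2
    pi = (phi / phi.sum(axis=1, keepdims=True)).T
    return mu.max() / mu, pi

# Synthetic data (illustrative, not from the lecture)
rng = np.random.default_rng(4)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + rng.normal(scale=0.01, size=n)   # x3 is nearly x1 + x2
x4 = rng.normal(size=n)                         # not involved in the collinearity
X = np.column_stack([x1, x2, x3, x4])
X = X / np.linalg.norm(X, axis=0)               # unit-length columns (assumed convention)

cond_idx, pi = variance_decomposition(X)
k = int(np.argmax(cond_idx))                    # row belonging to the smallest singular value

# Each column of pi sums to 1: the proportions decompose Var(beta_hat_j) completely
assert np.allclose(pi.sum(axis=0), 1.0)

print("condition indices:", np.round(cond_idx, 1))
print("proportions in row of largest index:", np.round(pi[k], 3))
```

In the row of the largest condition index, the proportions for the first three columns should be close to 1 and the proportion for the fourth column should be small, which identifies the variables involved.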

19 3) Assessment of model quality – collinearity
Solutions for collinearity
No general guideline exists
Sometimes:
- leave out one or more explanatory variables
- scale explanatory variables
- center explanatory variables
Note: variable may lose meaning
Always:
- try to find explanation, this may lead to right choice

20 3) Assessment of model quality – example
Now: example body fat data (different document)

