SOME PROBLEMS THAT MIGHT APPEAR IN DATA
MULTICOLLINEARITY, CONFOUNDING, INTERACTION
MULTICOLLINEARITY: HYPOTHETICAL EXAMPLE

Case i   X1i   X2i    Yi
  1       2     6     23
  2       8     9     83
  3       6     8     63
  4      10    10    103
HYPOTHETICAL EXAMPLE
Suppose that two statisticians were asked to fit a first-order multiple regression model to these data. The first statistician came up with the model Ŷ = -87 + X1 + 18X2 (1), while the second one used Ŷ = -7 + 9X1 + 2X2 (2). Which model is correct?
HYPOTHETICAL EXAMPLE
When we look at the fitted values, we see that both models fit the data perfectly: for every case, the fitted values from (1) and from (2) both equal the observed responses 23, 83, 63, 103. In fact, it can be shown that infinitely many response functions fit these data perfectly.
HYPOTHETICAL EXAMPLE
The reason is that the predictors X1 and X2 are perfectly related (here X2 = 5 + 0.5X1). Because X1 and X2 are perfectly related and the data contain no random error component, many different response functions lead to the same perfectly fitted values. In such cases we cannot interpret the regression coefficients: model (1) suggests that X2 plays a bigger role than X1 in estimating Y, while model (2) suggests the opposite.
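A minimal R sketch of this point, assuming the reconstructed data and the two fitted equations given above: lm() cannot separate the two perfectly collinear predictors, while two quite different coefficient vectors reproduce the observations exactly.

# Perfect collinearity: X2 is an exact linear function of X1
x1 <- c(2, 8, 6, 10)
x2 <- 5 + 0.5 * x1
y  <- c(23, 83, 63, 103)

coef(lm(y ~ x1 + x2))            # lm() detects the singularity and returns NA for one coefficient

fit1 <- -87 + 1 * x1 + 18 * x2   # fitted values from model (1)
fit2 <-  -7 + 9 * x1 +  2 * x2   # fitted values from model (2)
cbind(y, fit1, fit2)             # both sets of fitted values equal the observed Y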
MULTICOLLINEARITY
The existence of strong linear relationships between or among the predictor variables in a multiple regression. This is often simply termed collinearity.
EFFECTS OF MULTICOLLINEARITY
1. The estimated regression coefficients tend to vary widely when a new variable is added to the regression equation.

EXAMPLE: Body fat problem (Neter et al., 1996). Variables: Triceps skinfold thickness (X1), Thigh circumference (X2), Mid-arm circumference (X3), Body fat (Y).

Variables in the model    b1       b2
X1                        0.8571   -
X2                        -        0.8565
X1 and X2                 0.224    0.6594
X1, X2 and X3             4.334   -2.857

b1 and b2 vary markedly depending on which other variables are in the model. X1 and X2 are highly correlated: Corr(X1, X2) = 0.924.
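The body fat data are not reproduced here, so as a hedged illustration the sketch below simulates two predictors with correlation of about 0.92 (as in the example) and shows how the coefficient of X1 changes once X2 enters the model.

# Simulated data (not the body fat data): watch b1 change when x2 is added
set.seed(123)
n  <- 20
x1 <- rnorm(n)
x2 <- 0.92 * x1 + sqrt(1 - 0.92^2) * rnorm(n)   # cor(x1, x2) is roughly 0.92
y  <- 2 + 0.5 * x1 + 0.5 * x2 + rnorm(n, sd = 0.5)

coef(lm(y ~ x1))        # b1 estimated with x1 alone
coef(lm(y ~ x1 + x2))   # b1 typically shifts noticeably once x2 is included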
EFFECTS OF MULTICOLLINEARITY
2. The usual interpretation of a regression coefficient is not fully applicable when multicollinearity exists. Normally a regression coefficient is interpreted as “the effect on Y of a one-unit change in this variable, holding all the other X’s constant.” However, when the X’s are highly correlated, you cannot really hold one constant while changing another.
EFFECTS OF MULTICOLLINEARITY
3. The estimated regression coefficients b1 and b2 change markedly and become imprecise as more X’s are added to the model when the X’s are highly correlated.

EXAMPLE: Inflated variability due to multicollinearity (body fat problem).

Variables in the model    s(b1)    s(b2)
X1                        0.128    -
X2                        -        0.1100
X1 and X2                 0.3034   0.2912
X1, X2 and X3             3.016    2.582
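A hedged simulation sketch of the same phenomenon (again not the body fat data): the standard error of b1 grows roughly like 1/sqrt(1 - r^2) as the correlation r between x1 and x2 increases.

# How se(b1) inflates as the correlation between x1 and x2 grows
set.seed(1)
n <- 100
for (r in c(0, 0.5, 0.9, 0.99)) {
  x1 <- rnorm(n)
  x2 <- r * x1 + sqrt(1 - r^2) * rnorm(n)   # cor(x1, x2) is approximately r
  y  <- 1 + x1 + x2 + rnorm(n)
  se_b1 <- summary(lm(y ~ x1 + x2))$coefficients["x1", "Std. Error"]
  cat(sprintf("r = %.2f   se(b1) = %.3f   1/sqrt(1 - r^2) = %.2f\n",
              r, se_b1, 1 / sqrt(1 - r^2)))
}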
EFFECTS OF MULTICOLLINEARITY
4. The estimated regression coefficients individually may not be statistically significant even though a definite statistical relation exists between Y and the X’s. Affected b’s and affected standard errors lead to affected t- and/or F-statistics: the t-tests for the individual coefficients β1 and β2 might suggest that neither predictor is significantly associated with Y, while the F-test indicates that the model is useful for predicting Y. Contradiction!!!
EFFECTS OF MULTICOLLINEARITY
NO!!! This has to do with the interpretation of the regression coefficients. β1 is the expected change in Y due to X1 given that X2 is already in the model; β2 is the expected change in Y due to X2 given that X1 is already in the model. Since X1 and X2 contribute redundant information about Y, once one of the predictors is in the model, the other does not have much more to contribute. This is why the F-test indicates that at least one of them is important, while the individual tests indicate that the contribution of a predictor, given that the other is already included, is not really important.
EFFECTS OF MULTICOLLINEARITY
EXAMPLE: summary() output of a small regression of a response on two predictors (x and y), with 2 residual degrees of freedom and an F-statistic on 2 and 2 DF. [Numerical output not shown.]
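Since the numbers of that output are not shown here, the following hedged R sketch reproduces the phenomenon: with two nearly collinear predictors, neither individual t-test is significant, yet the overall F-test is.

# Overall F-test significant, individual t-tests not: a hallmark of collinearity
set.seed(42)
x1 <- rnorm(30)
x2 <- x1 + rnorm(30, sd = 0.05)   # x2 nearly identical to x1
y  <- 3 + x1 + x2 + rnorm(30)
summary(lm(y ~ x1 + x2))
# Typical result: large p-values for x1 and x2 individually,
# but a very small p-value for the overall F-statistic.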
DIAGNOSTICS FOR MULTICOLLINEARITY
INFORMAL DIAGNOSTICS
1. Large correlations between predictors. Matrix scatter plots are helpful:

library("GGally")                       # scatter-plot matrix with pairwise correlations
ggpairs(mtcars[, c(1, 3, 4, 5, 6, 7)])  # mpg, disp, hp, drat, wt, qsec
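A complementary check (a small added sketch, not from the original slide) is to look at the numeric correlation matrix of the same predictors:

# Pairwise correlations among the predictors used above
round(cor(mtcars[, c("disp", "hp", "drat", "wt", "qsec")]), 2)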
DIAGNOSTICS FOR MULTICOLLINEARITY
INFORMAL DIAGNOSTICS
2. Indicators of the effects of multicollinearity:
- Large changes in the estimates when X’s are added or deleted
- t-statistics and F-statistics that contradict each other (individual t-tests non-significant while the F-test is significant)
- Contradictory signs in regression coefficients

> summary(model)
Call: lm(formula = mpg ~ disp + hp + drat + wt + qsec, data = mtcars)
[Numerical output not shown; in this fit only wt is individually significant (**), yet the overall F-test on 5 and 26 DF has p-value 6.892e-10.]
DIAGNOSTICS FOR MULTICOLLINEARITY
# Fit a multiple linear regression on the mtcars data and evaluate collinearity
library(car)                                            # provides vif()
fit <- lm(mpg ~ disp + hp + wt + drat + qsec, data = mtcars)
summary(fit)
vif(fit)              # variance inflation factors
mean(vif(fit)) > 1    # rule of thumb: mean VIF well above 1 signals multicollinearity

[Numerical output not shown; in summary(fit) only wt is individually significant (**), the overall F-test on 5 and 26 DF has p-value 6.892e-10, and the mean-VIF comparison returns TRUE.]
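For reference, VIF_k = 1 / (1 - R_k^2), where R_k^2 is the R-squared from regressing X_k on the other predictors. A short added sketch verifying this for wt in the model above:

# Manual VIF for wt: regress wt on the remaining predictors
r2_wt <- summary(lm(wt ~ disp + hp + drat + qsec, data = mtcars))$r.squared
1 / (1 - r2_wt)      # should match vif(fit)["wt"]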
REMEDIES FOR MULTICOLLINEARITY
2. One or more X’s may be dropped from the model. Information about a dropped variable is still carried by the highly correlated variable that remains. Examine the correlation matrix of Y and all the X’s, and drop the variable that has a low correlation with Y.
3. Principal Component Analysis (PCA) replaces the correlated predictors with a smaller number of uncorrelated components (see the sketch below). It may be difficult to attach a concrete meaning to the components.
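A hedged sketch of principal components regression on the mtcars predictors used earlier (an addition, not from the original slides): the correlated predictors are replaced by their first few principal components, which are uncorrelated by construction.

# Principal components regression: regress mpg on the leading components
# of the (scaled) correlated predictors
X  <- scale(mtcars[, c("disp", "hp", "drat", "wt", "qsec")])
pc <- prcomp(X)
summary(pc)                              # variance explained by each component
pcr_fit <- lm(mtcars$mpg ~ pc$x[, 1:2])  # keep the first two components
summary(pcr_fit)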
REMEDIES FOR MULTICOLLINEARITY
4. Ridge regression shrinks the estimated coefficients by minimizing

    Σ_i (Y_i − β0 − Σ_j βj X_ij)²  +  λ Σ_j βj²

i.e. the least squares loss plus an L2 penalty term on the coefficients.
REMEDIES FOR MULTICOLLINEARITY
# Simulated example: x2 is almost a copy of x1 (near-perfect collinearity)
library(MASS)    # lm.ridge()
library(car)     # vif()
x1  <- rnorm(20)
x2  <- rnorm(20, mean = x1, sd = .01)
y   <- rnorm(20, mean = 3 + x1 + x2)
fit <- lm(y ~ x1 + x2)
summary(fit)     # intercept highly significant (p-value on the order of 1e-09),
                 # no significance codes for x1 or x2; overall F-test p-value 1.947e-05
vif(fit)         # very large variance inflation factors for x1 and x2
lm.ridge(y ~ x1 + x2, lambda = 1)   # ridge estimates with penalty lambda = 1

[Numerical output not shown.]
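One way to pick the ridge penalty, continuing the example above with the MASS package (an added sketch, not part of the original slide), is to fit a grid of lambda values and use the built-in selection criteria:

# Fit ridge over a grid of penalties and report the suggested lambda values
library(MASS)
ridge_path <- lm.ridge(y ~ x1 + x2, lambda = seq(0, 10, by = 0.1))
select(ridge_path)   # prints the modified HKB, modified L-W, and smallest-GCV choices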
REMEDIES FOR MULTICOLLINEARITY
5. LASSO: Least Absolute Shrinkage and Selection Operator. The lasso was introduced by Robert Tibshirani in 1996 and represents another modern approach to regression, similar to ridge estimation. It is described in detail in the textbook “The Elements of Statistical Learning” by Hastie, T., Tibshirani, R., and Friedman, J. (2nd edition, Springer), which is available online. The LASSO is a well-established method that reduces the variability of the estimates by shrinking the coefficients and, at the same time, produces interpretable models by shrinking some coefficients exactly to zero.
REMEDIES FOR MULTICOLLINEARITY
The lasso estimate minimizes

    Σ_i (Y_i − β0 − Σ_j βj X_ij)²  +  λ Σ_j |βj|

i.e. the least squares loss plus an L1 penalty term on the coefficients; it is the L1 penalty that forces some coefficients exactly to zero.
EXAMPLE OF LASSO REGRESSION

Consider the data from a 1989 study examining the relationship between prostate-specific antigen (PSA) and a number of clinical measures in a sample of 97 men who were about to receive a radical prostatectomy (data from Stamey, T.A. et al., 1989). The explanatory variables:
lcavol: log cancer volume
lweight: log prostate weight
age
lbph: log benign prostatic hyperplasia
svi: seminal vesicle invasion
lcp: log capsular penetration
gleason: Gleason score
pgg45: % Gleason score 4 or 5
library(lasso2)    # to get the Prostate data
data(Prostate)
head(Prostate)     # columns: lcavol lweight age lbph svi lcp gleason pgg45 lpsa

library(glmnet)    # functions for fitting the lasso
# glmnet requires a numeric design matrix and a response vector
x <- apply(Prostate[, -9], 2, as.numeric)
y <- Prostate[, 9]
lassofit <- glmnet(x, y)
print(lassofit)    # Df gives the number of non-zero coefficients at each penalty value

[print(lassofit) lists, for a decreasing sequence of penalty values, the number of non-zero coefficients (Df), the percentage of deviance explained (%Dev), and Lambda; numerical output not shown.]
par(mfrow = c(1, 2))
plot(lassofit, xvar = "lambda")   # coefficient paths versus log(lambda)
plot(lassofit, xvar = "norm")     # coefficient paths versus the L1 norm

The left figure shows how the coefficients shrink and finally all drop to zero as the penalty becomes more important (larger log(lambda)). The right figure shows the same paths plotted against the L1 norm of the coefficient vector, ||β̂lasso(λ)||1: when the norm is small, many of the coefficients are 0 or close to 0; as the norm increases, the coefficients move away from 0 and in the end all of them are different from 0. The results of this analysis indicate the order in which the variables would be dropped from the model as the importance of the penalty increases, but they do not answer which model should be chosen, i.e. which λ gives the “best” model.
Choosing the tuning parameter λ (in many texts it is denoted by a different symbol)
# Cross-validation to choose lambda
cvlassofit <- cv.glmnet(x, y)
print(cvlassofit)
plot(cvlassofit)
coef(cvlassofit, s = "lambda.min")   # coefficients at the "optimal" lambda
coef(cvlassofit, s = "lambda.1se")   # coefficients at the one-standard-error lambda

[The numerical values of lambda.min and lambda.1se are not shown.] The vertical lines in the figure show λ̂ (lambda.min, left) and λ̂* (lambda.1se, right); the numbers on top of the figure give the number of non-zero coefficients. For this example, instead of using all of the predictors we would use only three for the selected model if we chose the one-standard-error estimate. At this point, one should go back and re-estimate the model by ordinary least squares (the BLUE) based on the variables chosen.
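A sketch of that last step, continuing from the cross-validation fit above (whichever coefficients happen to be non-zero at lambda.1se are kept and refitted by ordinary least squares):

# Refit the lasso-selected variables by OLS
sel  <- as.matrix(coef(cvlassofit, s = "lambda.1se"))     # sparse -> ordinary matrix
keep <- setdiff(rownames(sel)[sel != 0], "(Intercept)")   # names of non-zero predictors
ols_fit <- lm(lpsa ~ ., data = Prostate[, c(keep, "lpsa")])
summary(ols_fit)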