
1 Regression II: analysis of diagnostics. Contents: standard diagnostics, bootstrap, cross-validation.

2 Standard diagnostics
Before starting to model:
1) Visualisation of the data: plot the predictors against the observations. These plots may give a clue about the form of the relationship and reveal outliers.
2) Smoothers.
After modelling and fitting:
3) Fitted values vs residuals: helps to identify outliers and to check the correctness of the model.
4) Normal QQ plot of the residuals: helps to check the distributional assumptions.
5) Cook's distance: reveals outliers and checks the correctness of the model.
6) Model assumptions: t tests given by the default print of lm.
Checking the model and designing tests:
7) Cross-validation: if there is a choice of models, cross-validation may help to choose the "best" one.
8) Bootstrap: the validity of the model can be checked if the distribution of the statistic of interest is available; otherwise these distributions can be generated by the bootstrap.

3 Visualisation prior to modelling
Different types of dataset may require different visualisation tools. For a simple overview, either plot(data) or pairs(data, panel = panel.smooth) can be used. Visualisation prior to modelling may help to propose a model (the form of the functional relationship between input and output, the probability distribution of the observations, etc.). For example, take the dataset women, which records the heights and weights of 15 women; the plot and pairs commands then produce pairwise scatterplots (see the sketch below).
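A minimal sketch with the women dataset, which is built into R:
# height (in) and weight (lb) of 15 American women
data(women)
# pairwise scatterplots of all variables
plot(women)
# the same, with a lowess smoother added to each panel
pairs(women, panel = panel.smooth)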

4 After modelling: linear models
After modelling, the results should be analysed. For example:
attach(women)
lm1 = lm(weight ~ height)
This fits a linear model (we believe that the dependence of weight on height is linear):
weight = β₀ + β₁·height
The results can be viewed using
lm1
summary(lm1)
The last command also reports the significance of the individual coefficients. Significance levels produced by summary should be treated with care: when there are many coefficients, the chance of observing at least one spuriously "significant" effect is very high.
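Equivalently, and without attach(), the model can be fitted with an explicit data argument (a minimal sketch):
# fit weight = β₀ + β₁·height by least squares
lm1 <- lm(weight ~ height, data = women)
# estimates, standard errors, t tests and p-values
summary(lm1)
# the coefficient table alone, as a numeric matrix
coef(summary(lm1))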

5 After modelling: linear models
It is a good idea to plot the data, the fitted model, and the differences between fitted and observed values on the same graph. For a linear model with one predictor this can be done using:
plot(height, weight)
abline(lm1)
segments(height, fitted(lm1), height, weight)
This plot already shows some systematic differences, an indication that the model may need to be revised.

6 Checking the validity of the model: standard tools
Plotting fitted values vs residuals, the normal QQ plot of the residuals, and Cook's distances can give some insight into the model and how to improve it. Most of these plots are produced automatically by plot(lm1), as sketched below.
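A minimal sketch showing the default diagnostic plots on one page, plus Cook's distances directly:
# residuals vs fitted, normal QQ, scale-location, residuals vs leverage
par(mfrow = c(2, 2))
plot(lm1)
par(mfrow = c(1, 1))
# Cook's distance for each case; which case is most influential?
cooks.distance(lm1)
which.max(cooks.distance(lm1))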

7 Prediction and confidence bands
lm1 = lm(height ~ weight)
pp = predict(lm1, interval = 'p')
pc = predict(lm1, interval = 'c')
plot(weight, height, ylim = range(height, pp))
n1 = order(weight)
matlines(weight[n1], pp[n1, ], lty = c(1, 2, 2), col = 'red')
matlines(weight[n1], pc[n1, ], lty = c(1, 3, 3), col = 'red')
These commands produce two sets of bands, one narrow and one wide. The narrow band is the confidence band; the wide band is the prediction band.
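A variant of the same idea evaluates the bands on an evenly spaced grid of weights instead of at the observed points (a sketch; the grid size of 100 is an arbitrary choice):
grid <- data.frame(weight = seq(min(weight), max(weight), length.out = 100))
pg <- predict(lm1, newdata = grid, interval = 'prediction')
cg <- predict(lm1, newdata = grid, interval = 'confidence')
plot(weight, height, ylim = range(height, pg))
matlines(grid$weight, pg, lty = c(1, 2, 2), col = 'red')
matlines(grid$weight, cg, lty = c(1, 3, 3), col = 'red')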

8 Bootstrap confidence lines
Similarly, bootstrap confidence lines can be calculated using
boot_lm(women, flm0, 1000)
The functions boot_lm and flm0 are available from the course's website.
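Since boot_lm and flm0 are not reproduced here, the following is a hand-rolled sketch of the same idea, resampling cases with replacement and overplotting the refitted lines (the case-resampling scheme and the 1000 replicates are assumptions, not the course code):
set.seed(1)  # for reproducibility
plot(women$weight, women$height)
for (b in 1:1000) {
  idx <- sample(nrow(women), replace = TRUE)       # resample the 15 cases
  fit_b <- lm(height ~ weight, data = women[idx, ])
  abline(fit_b, col = rgb(1, 0, 0, alpha = 0.02))  # faint red line per replicate
}
abline(lm(height ~ weight, data = women), lwd = 2) # original fit on top
The spread of the red lines gives a visual impression of the uncertainty of the fitted line.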

9 Most of the above indicators show that a quadratic model (quadratic in the predictor, not in the parameters) may be better. One obvious way of "improving" the model is to assume that the dependence of height on weight is quadratic. This can still be done within the linear-model framework, by fitting a model that is polynomial in the predictor:
height = β₀ + β₁·weight + β₂·weight² + …
We will use the quadratic model:
lm2 = lm(height ~ weight + I(weight^2))
Again the summary of lm2 should be inspected. The default diagnostic plots now look better.
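An equivalent way to write the quadratic fit uses poly() inside the formula (a sketch; raw = TRUE gives the coefficients of weight and weight² directly, while the default orthogonal polynomials are numerically more stable):
lm2 <- lm(height ~ poly(weight, 2, raw = TRUE), data = women)
summary(lm2)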

10 [Default diagnostic plots for the quadratic model lm2]

11 The confidence and prediction bands for the quadratic model look narrower. They are produced by the following commands:
lm2 = lm(height ~ weight + I(weight^2))
pp = predict(lm2, interval = 'p')
pc = predict(lm2, interval = 'c')
plot(weight, height, ylim = range(height, pp))
n1 = order(weight)
matlines(weight[n1], pp[n1, ], lty = c(1, 2, 2), col = 'red')
matlines(weight[n1], pc[n1, ], lty = c(1, 3, 3), col = 'red')

12 [Prediction and confidence bands for the quadratic model]

13 The spread of the bootstrap confidence lines is also much smaller.

14 Which model is better? One way of selecting a model is cross-validation. There is no built-in R command for cross-validation of lm models, but there is one, cv.glm from the boot package, for glm (the generalised linear model, the subject of the next lecture; for now we only need to know that lm and glm with family = 'gaussian' fit the same model). Let us use the default leave-one-out cross-validation:
library(boot)
lm1g = glm(height ~ weight, data = women, family = 'gaussian')
cv1.err = cv.glm(women, lm1g)
cv1.err$delta
Results: 0.2572698 0.2538942
women1 = data.frame(h = height, w1 = weight, w2 = weight^2)
lm2g = glm(h ~ w1 + w2, data = women1, family = 'gaussian')
cv2.err = cv.glm(women1, lm2g)
cv2.err$delta
Results: 0.007272508 0.007148601
The two components of delta are the raw cross-validation estimate of prediction error and its bias-adjusted version. The second model has the much smaller prediction error.
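For least-squares fits, the leave-one-out estimate can also be computed without any refitting, via the PRESS identity e(-i) = e_i / (1 - h_ii), where the h_ii are the hat values (a minimal sketch; loocv is a hypothetical helper name, and its result should reproduce the first delta value above):
# LOOCV mean squared prediction error of an lm fit
loocv <- function(fit) mean((resid(fit) / (1 - hatvalues(fit)))^2)
loocv(lm(height ~ weight, data = women))
loocv(lm(height ~ weight + I(weight^2), data = women))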


16 Exercise 3 Take the dataset city and analyse it with a linear model. Write a report. A possible starting point is sketched below.
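Assuming city refers to the small dataset of that name in the boot package (1920 population u and 1930 population x for 10 US cities), one possible starting point is:
library(boot)   # assumed source of the city dataset
data(city)
fit <- lm(x ~ u, data = city)
summary(fit)
par(mfrow = c(2, 2)); plot(fit); par(mfrow = c(1, 1))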

