
1 Regression II: analysis of diagnostics. Contents: standard diagnostics, bootstrap, cross-validation.

2 Standard diagnostics
Before starting to model:
1) Visualisation of the data: plot the predictors against the observations. These plots may give a clue about the form of the relationship and reveal outliers.
2) Smoothers.
After modelling and fitting:
3) Fitted values vs residuals: helps to identify outliers and to check the correctness of the model.
4) Normal QQ plot of the residuals: helps to check the distributional assumptions.
5) Cook's distance: reveals outliers and checks the correctness of the model.
6) Model assumptions: t tests given by the default print of lm.
Checking the model and designing tests:
7) Cross-validation: if there is a choice of models, cross-validation may help to choose the "best" one.
8) Bootstrap: the validity of the model can be checked if the distribution of the statistic of interest is available; otherwise these distributions can be generated by the bootstrap.

3 Visualisation prior to modelling
Different types of dataset may require different visualisation tools. For a simple overview, either plot(data) or pairs(data, panel = panel.smooth) can be used. Visualisation prior to modelling may help to propose a model (the form of the functional relationship between input and output, the probability distribution of the observations, etc.). For example, take the dataset women, which records the heights and weights of 15 women; the plot and pairs commands then produce pairwise scatterplots (see the sketch below).
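A minimal sketch with the women dataset, which is built into R:
# height (in) and weight (lb) of 15 American women
data(women)
# pairwise scatterplots of all variables
plot(women)
# the same, with a lowess smoother added to each panel
pairs(women, panel = panel.smooth)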

4 After modelling: linear models
After modelling, the results should be analysed. For example:
attach(women)
lm1 = lm(weight ~ height)
This fits a linear model (we believe that the dependence of weight on height is linear):
weight = β₀ + β₁·height
The results can be viewed using
lm1
summary(lm1)
The last command also reports the significance of the individual coefficients. Significance levels produced by summary should be treated with care: when there are many coefficients, the chance of observing at least one spuriously "significant" effect is very high.
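Equivalently, and without attach(), the model can be fitted with an explicit data argument (a minimal sketch):
# fit weight = β₀ + β₁·height by least squares
lm1 <- lm(weight ~ height, data = women)
# estimates, standard errors, t tests and p-values
summary(lm1)
# the coefficient table alone, as a numeric matrix
coef(summary(lm1))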

5 After modelling: linear models
It is a good idea to plot the data, the fitted model, and the differences between fitted and observed values on the same graph. For a linear model with one predictor this can be done using:
plot(height, weight)
abline(lm1)
segments(height, fitted(lm1), height, weight)
This plot already shows some systematic differences, an indication that the model may need to be revised.

6 Checking the validity of the model: standard tools
Plotting fitted values vs residuals, the normal QQ plot of the residuals, and Cook's distances can give some insight into the model and how to improve it. Most of these plots are produced automatically by plot(lm1), as sketched below.
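A minimal sketch showing the default diagnostic plots on one page, plus Cook's distances directly:
# residuals vs fitted, normal QQ, scale-location, residuals vs leverage
par(mfrow = c(2, 2))
plot(lm1)
par(mfrow = c(1, 1))
# Cook's distance for each case; which case is most influential?
cooks.distance(lm1)
which.max(cooks.distance(lm1))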

7 Prediction and confidence bands
lm1 = lm(height ~ weight)
pp = predict(lm1, interval = 'p')
pc = predict(lm1, interval = 'c')
plot(weight, height, ylim = range(height, pp))
n1 = order(weight)
matlines(weight[n1], pp[n1, ], lty = c(1, 2, 2), col = 'red')
matlines(weight[n1], pc[n1, ], lty = c(1, 3, 3), col = 'red')
These commands produce two sets of bands, one narrow and one wide. The narrow band is the confidence band; the wide band is the prediction band.
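A variant of the same idea evaluates the bands on an evenly spaced grid of weights instead of at the observed points (a sketch; the grid size of 100 is an arbitrary choice):
grid <- data.frame(weight = seq(min(weight), max(weight), length.out = 100))
pg <- predict(lm1, newdata = grid, interval = 'prediction')
cg <- predict(lm1, newdata = grid, interval = 'confidence')
plot(weight, height, ylim = range(height, pg))
matlines(grid$weight, pg, lty = c(1, 2, 2), col = 'red')
matlines(grid$weight, cg, lty = c(1, 3, 3), col = 'red')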

8 Bootstrap confidence lines
Similarly, bootstrap confidence lines can be calculated using
boot_lm(women, flm0, 1000)
The functions boot_lm and flm0 are available from the course's website.
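Since boot_lm and flm0 are not reproduced here, the following is a hand-rolled sketch of the same idea, resampling cases with replacement and overplotting the refitted lines (the case-resampling scheme and the 1000 replicates are assumptions, not the course code):
set.seed(1)  # for reproducibility
plot(women$weight, women$height)
for (b in 1:1000) {
  idx <- sample(nrow(women), replace = TRUE)       # resample the 15 cases
  fit_b <- lm(height ~ weight, data = women[idx, ])
  abline(fit_b, col = rgb(1, 0, 0, alpha = 0.02))  # faint red line per replicate
}
abline(lm(height ~ weight, data = women), lwd = 2) # original fit on top
The spread of the red lines gives a visual impression of the uncertainty of the fitted line.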

9 Most of the above indicators show that a quadratic model (quadratic in the predictor, not in the parameters) may be better. One obvious way of "improving" the model is to assume that the dependence of height on weight is quadratic. This can still be done within the linear-model framework, by fitting a model that is polynomial in the predictor:
height = β₀ + β₁·weight + β₂·weight² + …
We will use the quadratic model:
lm2 = lm(height ~ weight + I(weight^2))
Again the summary of lm2 should be inspected. The default diagnostic plots now look better.
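An equivalent way to write the quadratic fit uses poly() inside the formula (a sketch; raw = TRUE gives the coefficients of weight and weight² directly, while the default orthogonal polynomials are numerically more stable):
lm2 <- lm(height ~ poly(weight, 2, raw = TRUE), data = women)
summary(lm2)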

10 [Default diagnostic plots for the quadratic model lm2]

11 The confidence and prediction bands for the quadratic model look narrower. They are produced by the following commands:
lm2 = lm(height ~ weight + I(weight^2))
pp = predict(lm2, interval = 'p')
pc = predict(lm2, interval = 'c')
plot(weight, height, ylim = range(height, pp))
n1 = order(weight)
matlines(weight[n1], pp[n1, ], lty = c(1, 2, 2), col = 'red')
matlines(weight[n1], pc[n1, ], lty = c(1, 3, 3), col = 'red')

12 [Prediction and confidence bands for the quadratic model]

13 The spread of the bootstrap confidence lines is also much smaller.

14 Which model is better? One way of selecting a model is cross-validation. There is no built-in R command for cross-validation of lm models, but there is one, cv.glm from the boot package, for glm (the generalised linear model, the subject of the next lecture; for now we only need to know that lm and glm with family = 'gaussian' fit the same model). Let us use the default leave-one-out cross-validation:
library(boot)
lm1g = glm(height ~ weight, data = women, family = 'gaussian')
cv1.err = cv.glm(women, lm1g)
cv1.err$delta
Results: 0.2572698 0.2538942
women1 = data.frame(h = height, w1 = weight, w2 = weight^2)
lm2g = glm(h ~ w1 + w2, data = women1, family = 'gaussian')
cv2.err = cv.glm(women1, lm2g)
cv2.err$delta
Results: 0.007272508 0.007148601
The two components of delta are the raw cross-validation estimate of prediction error and its bias-adjusted version. The second model has the much smaller prediction error.
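For least-squares fits, the leave-one-out estimate can also be computed without any refitting, via the PRESS identity e(-i) = e_i / (1 - h_ii), where the h_ii are the hat values (a minimal sketch; loocv is a hypothetical helper name, and its result should reproduce the first delta value above):
# LOOCV mean squared prediction error of an lm fit
loocv <- function(fit) mean((resid(fit) / (1 - hatvalues(fit)))^2)
loocv(lm(height ~ weight, data = women))
loocv(lm(height ~ weight + I(weight^2), data = women))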


16 Exercise 3 Take the dataset city and analyse it with a linear model. Write a report. A possible starting point is sketched below.
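Assuming city refers to the small dataset of that name in the boot package (1920 population u and 1930 population x for 10 US cities), one possible starting point is:
library(boot)   # assumed source of the city dataset
data(city)
fit <- lm(x ~ u, data = city)
summary(fit)
par(mfrow = c(2, 2)); plot(fit); par(mfrow = c(1, 1))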

