Regression II: Analysis of diagnostics. Standard diagnostics, Bootstrap, Cross-validation

Standard diagnostics

Before starting to model:
1) Visualisation of the data:
   a) plotting predictors vs observations: these plots may give a clue about the relationship and reveal outliers;
   b) smoothers.

After modelling and fitting:
2) Fitted values vs residuals: may help to identify outliers and check the correctness of the model.
3) Normal QQ plot of residuals: may help to check the distributional assumptions.
4) Cook's distance: reveals outliers and checks the correctness of the model.
5) Model assumptions: t-tests given by the default print of lm.

Checking the model and designing tests:
6) Cross-validation: if you have a choice of models, cross-validation may help to choose the "best" one.
7) Bootstrap: the validity of the model can be checked if the distribution of the statistic of interest is available; otherwise these distributions can be generated using the bootstrap.
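As a quick orientation, here is a minimal sketch of the base-R functions behind these diagnostics (the toy data frame d and model m are hypothetical, for illustration only):

d <- data.frame(x = 1:20, y = 2 + 3*(1:20) + rnorm(20))  # hypothetical toy data
pairs(d, panel = panel.smooth)       # visualisation before modelling
m <- lm(y ~ x, data = d)             # fit a linear model
plot(fitted(m), resid(m))            # fitted values vs residuals
qqnorm(resid(m)); qqline(resid(m))   # normal QQ plot of residuals
cooks.distance(m)                    # Cook's distance, one value per case
summary(m)                           # coefficient estimates with t-tests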

Visualisation prior to modelling

Different types of datasets may require different visualisation tools. For simple visualisation either plot(data) or pairs(data, panel=panel.smooth) can be used. Visualisation prior to modelling may help to propose a model (the form of the functional relationship between input and output, the probability distribution of the observations, etc.). Take for example the dataset women, where weights and heights for 15 cases have been measured; the plot and pairs commands produce the corresponding plots.
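Written out for this dataset, the two commands mentioned above are:

data(women)                          # built-in dataset: height and weight, 15 cases
plot(women)                          # scatterplot of weight vs height
pairs(women, panel = panel.smooth)   # scatterplot matrix with a smoother in each panel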

After modelling: linear models

After modelling, the results should be analysed. For example:

attach(women)
lm1 = lm(height~weight)

This means that we want a linear model (we believe that the dependence of height on weight is linear):

height = β0 + β1*weight

The results can be viewed using:

lm1
summary(lm1)

The last command also produces the significance of the various coefficients. Significance levels produced by summary should be considered carefully: if there are many coefficients, the chance that at least one "significant" effect is observed purely by chance is very high.

After modelling: linear models

It is a good idea to plot the data, the fitted model, and the differences between fitted and observed values on the same graph. For linear models with one predictor this can be done using:

plot(weight, height)
abline(lm1)
segments(weight, fitted(lm1), weight, height)

This plot already shows some systematic differences, which is an indication that the model may need to be revised.

Checking the validity of the model: standard tools

Plotting fitted values vs residuals, the QQ plot, and Cook's distance can give some insight into the model and how to improve it. Some of these plots can be produced using plot(lm1).
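By default plot(lm1) draws the diagnostic plots one after another; they can be shown together on a single page like this:

par(mfrow = c(2, 2))  # arrange the four default diagnostic plots in a 2x2 grid
plot(lm1)             # residuals vs fitted, normal QQ, scale-location,
                      # and residuals vs leverage (with Cook's distance contours)
par(mfrow = c(1, 1))  # restore the default layout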

Prediction and confidence bands

lm1 = lm(height~weight)
pp = predict(lm1, interval='p')
pc = predict(lm1, interval='c')
plot(weight, height, ylim=range(height, pp))
n1 = order(weight)
matlines(weight[n1], pp[n1,], lty=c(1,2,2), col='red')
matlines(weight[n1], pc[n1,], lty=c(1,3,3), col='red')

These commands produce two sets of bands, one narrow and one wide. The narrow band is the confidence band (for the mean response) and the wide band is the prediction band (for new observations).

Bootstrap confidence lines

Similarly, bootstrap confidence lines can be calculated using:

boot_lm(women, flm0, 1000)

The functions boot_lm and flm0 are available from the course's website.
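The course functions are not reproduced here; as a rough illustration of the idea, here is a minimal case-resampling sketch (the name my_boot_lm and its details are our assumptions, not the course implementation):

my_boot_lm <- function(data, nboot = 1000) {
  # sketch only: case-resampling bootstrap of the regression line
  plot(data$weight, data$height)
  for (i in 1:nboot) {
    idx <- sample(nrow(data), replace = TRUE)      # resample rows with replacement
    b <- lm(height ~ weight, data = data[idx, ])   # refit on the bootstrap sample
    abline(b, col = "grey")                        # draw one bootstrap line
  }
  abline(lm(height ~ weight, data = data), col = "red", lwd = 2)  # original fit on top
  invisible(NULL)
}
my_boot_lm(women, 1000)

The spread of the grey lines gives a visual impression of the uncertainty of the fitted line.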

Most of the above indicators show that a quadratic model (quadratic in the predictor, not in the parameters) may be better. One obvious way of "improving" the model is to assume that the dependence of height on weight is quadratic. This can still be done within a linear model: we can fit a model that is polynomial in the predictor,

height = β0 + β1*weight + β2*weight^2 + …

We will use the quadratic model:

lm2 = lm(height~weight+I(weight^2))

Again the summary of lm2 should be examined. The default plot now looks better.
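As an aside, an equivalent way to specify the same quadratic fit is with poly:

lm2 <- lm(height ~ poly(weight, 2, raw = TRUE))  # same fit as weight + I(weight^2)

With raw = TRUE the coefficients match the I(weight^2) parameterisation; without it, orthogonal polynomials are used, which changes the coefficients but not the fitted values.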

The confidence and prediction bands produced by the same set of commands for the quadratic model look narrower:

lm2 = lm(height~weight+I(weight^2))
pp = predict(lm2, interval='p')
pc = predict(lm2, interval='c')
plot(weight, height, ylim=range(height, pp))
n1 = order(weight)
matlines(weight[n1], pp[n1,], lty=c(1,2,2), col='red')
matlines(weight[n1], pc[n1,], lty=c(1,3,3), col='red')

The spread of the bootstrap confidence lines is also much smaller.

Which model is better?

One way of selecting a model is cross-validation. There is no command in R for cross-validation of lm models, but there is one for glm (generalised linear models, the subject of the next lecture; for now we only need to know that lm and glm with family='gaussian' fit the same model). Let us use the default leave-one-out cross-validation from the boot package:

library(boot)
lm1g = glm(height~weight, data=women, family='gaussian')
cv1.err = cv.glm(women, lm1g)
cv1.err$delta

women1 = data.frame(h=height, w1=weight, w2=weight^2)
lm2g = glm(h~w1+w2, data=women1, family='gaussian')
cv2.err = cv.glm(women1, lm2g)
cv2.err$delta

The two components of delta are the raw leave-one-out estimate of the prediction error and a bias-adjusted version of it. The second model has the smaller prediction error.
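For ordinary lm fits there is also a closed-form shortcut for leave-one-out cross-validation using the hat values; a minimal sketch (the helper name loocv is ours, not part of R):

loocv <- function(fit) {
  # closed-form leave-one-out CV error for a linear model:
  # mean of squared leave-one-out residuals e_i / (1 - h_ii)
  mean((residuals(fit) / (1 - hatvalues(fit)))^2)
}
loocv(lm1)  # should agree with cv1.err$delta[1]
loocv(lm2)  # should agree with cv2.err$delta[1]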

References
1. Stuart, A., Ord, K.J. and Arnold, S. (1999) Kendall's Advanced Theory of Statistics, Volume 2A.
2. Box, G.E.P., Hunter, W.G. and Hunter, J.S. (1978) Statistics for Experimenters.
3. Berthold, M.J. and Hand, D.J. Intelligent Data Analysis.
4. Dalgaard, P. Introductory Statistics with R.

Exercise 3: Take the dataset city and analyse it using a linear model. Write a report.