
STATS 330: Lecture 10 (5/18/2015)

Diagnostics 2

Aim of today's lecture:
- To describe some more remedies for non-planar data
- To look at diagnostics and remedies for non-constant scatter

Remedies for non-planar data (cont.)
- Last time we looked at diagnostics for non-planar data.
- We discussed what to do if the diagnostics indicate a problem.
- The short answer was: we transform, so that the model fits the transformed data.
- How to choose a transformation? Theory, the ladder of powers, or polynomials.
- We illustrate with a few examples.

Example: using theory - cherry trees

A tree trunk is a bit like a cylinder:

Volume = π × (diameter/2)² × height

so

log(volume) = log(π/4) + 2 log(diameter) + log(height)

and a linear regression using the logged variables should work! In fact R² increases from 95% to 98%, and the residual plots are better.

Example: cherry trees (cont.)

> new.reg <- lm(log(volume) ~ log(diameter) + log(height), data = cherry.df)
> summary(new.reg)

[Coefficient estimates abridged in the transcript. All three terms are highly significant: intercept p ~ 1e-09, log(diameter) p < 2e-16, log(height) p ~ 1e-06. Residual standard error on 28 degrees of freedom; Multiple R-squared about 98% (previously 94.8%); F-statistic on 2 and 28 DF, p-value < 2.2e-16.]
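The fit above can be reproduced with R's built-in `trees` dataset, which appears to be the same classic cherry-tree data as cherry.df (with Girth standing in for diameter). A sketch, under that assumption:

```r
# Assumption: R's built-in `trees` data (Girth, Height, Volume) is the same
# cherry-tree data as cherry.df, with Girth playing the role of diameter.
raw.reg <- lm(Volume ~ Girth + Height, data = trees)                # original scale
log.reg <- lm(log(Volume) ~ log(Girth) + log(Height), data = trees) # logged scale

summary(raw.reg)$r.squared   # about 0.95
summary(log.reg)$r.squared   # about 0.98
```

The jump in R-squared matches the 95%-to-98% improvement quoted on the slide.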

Example: cherry trees (original) [plots of the untransformed data]

Example: cherry trees (logs) [plots of the logged data]

Tyre abrasion data: gam plots

rubber.gam <- gam(abloss ~ s(hardness) + s(tensile), data = rubber.df)
par(mfrow = c(1, 2)); plot(rubber.gam)

[Plots of the two fitted smooth terms.]

Tyre abrasion data: polynomial
- The GAM curve looks like a polynomial, so fit a polynomial (i.e. include terms tensile², tensile³, ...):

  lm(abloss ~ hardness + poly(tensile, 4), data = rubber.df)

  (the second argument of poly() is the degree of the polynomial).
- Usually a lot of trial and error is involved! We have succeeded when R² improves and the residual plots show no pattern.
- A 4th-degree polynomial works for the rubber data: R² increases from 84% to 94%.

Why 4th degree? Try a 5th degree and see which powers are significant:

> rubber.lm <- lm(abloss ~ poly(tensile, 5) + hardness, data = rubber.df)
> summary(rubber.lm)

[Coefficient estimates abridged in the transcript. The significance pattern survives: the degree-1 term (p ~ 1e-10), degree-3 term (p ~ 1e-05) and degree-4 term are significant, while the degree-2 and degree-5 terms are not; hardness is highly significant (p ~ 1e-13). Residual standard error on 23 degrees of freedom; F-statistic on 6 and 23 DF, p-value 3.931e-13.]

The highest significant power is 4, so we settle on a 4th-degree polynomial.
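The Rubber dataset shipped with the MASS package appears to be this same abrasion data. Under that assumption (loss = abloss, hard = hardness, tens = tensile), the polynomial fit can be sketched as:

```r
library(MASS)  # for the Rubber data

# Assumption: MASS's Rubber data matches rubber.df, with loss = abloss,
# hard = hardness and tens = tensile.
lin.fit  <- lm(loss ~ hard + tens, data = Rubber)           # plain linear fit
poly.fit <- lm(loss ~ hard + poly(tens, 4), data = Rubber)  # 4th degree in tensile

summary(lin.fit)$r.squared   # about 0.84
summary(poly.fit)$r.squared  # clearly higher
```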

Ladder of powers
- Rather than fit polynomials in some independent variables, guided by gam plots, we can transform the response using the "ladder of powers" (i.e. use y^p as the response rather than y, for some power p).
- Choose p either by trial and error using R², or use a "Box-Cox plot" (see later in this lecture).
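The trial-and-error version can be sketched in a few lines; here the built-in trees data stands in for a real analysis, and p = 0 is conventionally read as the log transformation:

```r
# Try each power on the ladder as a response transformation and record
# R-squared; p = 0 is taken to mean log(y). The trees data and these
# particular powers are illustrative choices, not the course's.
ladder <- c(-2, -1, -0.5, 0, 0.5, 1, 2)
r2 <- sapply(ladder, function(p) {
  y <- if (p == 0) log(trees$Volume) else trees$Volume^p
  summary(lm(y ~ Girth + Height, data = trees))$r.squared
})
names(r2) <- ladder
round(r2, 3)
```

One caveat: R² values computed on differently transformed responses are not strictly comparable, so this is only a rough screen; the Box-Cox plot later in the lecture is the more principled tool.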

Checking for equal scatter
- The model specifies that the scatter about the regression plane is uniform.
- In practice this means that the scatter doesn't depend on the explanatory variables or on the mean of the response.
- All tests and confidence intervals rely on this assumption.

Scatter
- Scatter is measured by the size of the residuals.
- A common problem is scatter that increases as the mean response increases: the big residuals happen when the fitted values are big.
- Recognize this by a "funnel effect" in the plot of residuals versus fitted values.

Example: education expenditure data
- Data for the 50 states of the USA.
- Variables:
  - Per capita expenditure on education (response), variable educ
  - Per capita income, variable percap
  - Number of residents per 1000 under 18, variable under18
  - Number of residents per 1000 in urban areas, variable urban
- Fit the model educ ~ percap + under18 + urban.

5/18/ lecture 1015 response Outlier! (response)

[Plot highlighting the outlier: point 50 (California).]

Basic fit, outlier in:

> educ.lm <- lm(educ ~ urban + percap + under18, data = educ.df)
> summary(educ.lm)

[Coefficient estimates abridged in the transcript: percap (p ~ 1e-07) and under18 (p ~ 1e-05) are highly significant, urban is not. Residual standard error on 46 degrees of freedom; F-statistic on 3 and 46 DF, p-value 5.271e-09.]

R² is 59%.

Basic fit, outlier out. Note how point 50 is excluded with subset = -50:

> educ50.lm <- lm(educ ~ urban + percap + under18, data = educ.df, subset = -50)
> summary(educ50.lm)

[Coefficient estimates abridged in the transcript: percap remains highly significant (***), the intercept and under18 are now only marginally significant (*), urban is not significant. Residual standard error on 45 degrees of freedom; F-statistic on 3 and 45 DF, p-value 8.365e-07.]

R² is now 49%.
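The negative-index convention of subset can be checked on any data. A sketch using the built-in trees data (31 rows):

```r
# subset = -i drops observation i, exactly like indexing with data[-i, ]
fit.all <- lm(Volume ~ Girth, data = trees)
fit.m1  <- lm(Volume ~ Girth, data = trees, subset = -1)

nobs(fit.all)  # 31 observations
nobs(fit.m1)   # 30 observations: row 1 excluded
```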

> par(mfrow = c(1, 2))
> plot(educ50.lm, which = c(1, 3))

[The residuals-vs-fitted plot shows a funnel effect; the scale-location plot shows an increasing relationship.]

Remedies
- Either transform the response,
- or estimate the variances of the observations and use "weighted least squares".

Transforming the response. Transform to the reciprocal:

> tr.educ50.lm <- lm(I(1/educ) ~ urban + percap + under18, data = educ.df[-50,])
> plot(tr.educ50.lm)

[The diagnostic plots look better: the funnel effect is gone.]

What power to choose?
- How did we know to use reciprocals?
- Think of a more general model I(educ^p) ~ percap + under18 + urban, where p is some power.
- Then estimate p from the data using a Box-Cox plot.

Transforming the response (how?)

boxcoxplot(educ ~ urban + percap + under18, educ.df[-50,])

This draws a "Box-Cox plot"; boxcoxplot is an "R330" function. [The curve has its minimum at about p = -1, supporting the reciprocal transformation.]
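If the R330 package is not to hand, MASS::boxcox does the same job, with one difference: it plots the profile log-likelihood, so one looks for a maximum rather than a minimum. A sketch on the built-in trees data (educ.df is not shipped with R), where the answer near 1/3 is the cube-root transformation one would expect for a volume:

```r
library(MASS)

# Profile the Box-Cox power over the default grid; plotit = FALSE returns
# the grid (x) and profile log-likelihood (y) instead of drawing the plot.
bc    <- boxcox(Volume ~ Girth + Height, data = trees, plotit = FALSE)
p.hat <- bc$x[which.max(bc$y)]
p.hat  # close to 1/3
```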

Weighted least squares
- Tests are invalid if the observations do not have constant variance.
- If the ith observation has variance v_i σ², then we can get valid tests by using "weighted least squares": minimise the sum of weighted squared residuals, Σ r_i²/v_i, rather than the sum of squared residuals, Σ r_i².
- We need to know the variances v_i.

Finding the weights
- Step 1: plot the squared residuals versus the fitted values.
- Step 2: smooth the plot.
- Step 3: estimate the variance of each observation by its smoothed squared residual.
- Step 4: the weight is the reciprocal of the smoothed squared residual.
- Rationale: the variance is a function of the mean.
- Use the "R330" function funnel.
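The four steps can be sketched without the R330 funnel function, on simulated data where the standard deviation is known to grow with the mean. The lowess smoother and the guard against negative smoothed values are illustrative choices, not the course's:

```r
set.seed(330)

# Simulate heteroscedastic data: SD proportional to the mean
x   <- runif(100, 1, 10)
y   <- 2 + 3 * x + rnorm(100, sd = 0.5 * (2 + 3 * x))
fit <- lm(y ~ x)

# Steps 1-3: smooth the squared residuals against the fitted values and
# read off the smoothed squared residual as each observation's variance
sm   <- lowess(fitted(fit), residuals(fit)^2)
vars <- approx(sm$x, sm$y, xout = fitted(fit))$y
vars <- pmax(vars, 1e-8)   # lowess can dip below zero; variances cannot

# Step 4: weights are the reciprocals of the estimated variances
wfit <- lm(y ~ x, weights = 1 / vars)
coef(summary(wfit))
```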

Doing it in R (1)

vars <- funnel(educ50.lm)  # an "R330" function

[The funnel plot has slope about 1.7, indicating p = 1 - 1.7 = -0.7, close to the reciprocal.]

> educ50.lm <- lm(educ ~ urban + percap + under18, data = educ.df[-50,])
> summary(educ50.lm)

[Coefficient estimates abridged in the transcript: the intercept and under18 are marginally significant (*), percap is highly significant (***), urban is not significant. Residual standard error on 45 degrees of freedom; F-statistic on 3 and 45 DF, p-value 8.365e-07.]

Note the p-values, for comparison with the weighted fit.

> weighted.lm <- lm(educ ~ urban + percap + under18, weights = 1/vars, data = educ.df[-50,])
> summary(weighted.lm)

[Coefficient estimates abridged in the transcript: percap is now significant at p ~ 1e-07, under18 at the ** level, urban still not significant. Residual standard error on 45 degrees of freedom; Multiple R-squared 0.629; F-statistic on 3 and 45 DF, p-value 8.944e-10.]

Note that the weights are the reciprocals 1/vars, and note how the p-values change. Conclusion: unequal variances matter. They can change the results!