Use of Weighted Least Squares

In fitting models of the form y_i = f(x_i) + ε_i, i = 1, …, n, least squares is optimal under the condition that ε_1, …, ε_n are i.i.d. N(0, σ²), and is a reasonable fitting method when this condition is at least approximately satisfied. (Most importantly, we require here that there should be no significant outliers.)

In the case where we have instead ε_1, …, ε_n independent N(0, σ_i²), it is natural to use weighted least squares: choose f̂ from within the permitted class of functions f to minimise Σ w_i (y_i − f(x_i))², where we take w_i proportional to 1/σ_i² (clearly only relative weights matter).
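As a quick illustration on simulated data (all object names here are ours), lm with a weights argument minimises exactly this criterion: it is equivalent to ordinary least squares after dividing each observation through by σ_i.

> set.seed(1)
> x <- 1:20
> sigma <- x/10                          # error standard deviation grows with x
> y <- 2 + 3*x + rnorm(20, sd = sigma)
> w <- 1/sigma^2                         # weights proportional to 1/sigma_i^2
> coef(lm(y ~ x, weights = w))
> coef(lm(I(y/sigma) ~ 0 + I(1/sigma) + I(x/sigma)))   # the same two estimates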

For the hill races data, it is natural to assume greater variability in the times for the longer races, with the variability perhaps proportional to the distance. We therefore try refitting the quadratic model with weights proportional to 1/distance²:

> model2w = lm(time ~ -1 + dist + I(dist^2) + climb + I(climb^2),
+              data = hills[-18,], weights = 1/dist^2)

The fitted model is now time = 4.94*distance + …*(distance)² + …*climb + …*(climb)² + ε′. Note that the residual summary for this fit is on a "reweighted" scale, and cannot be directly compared with the earlier residual summaries. While the coefficients here appear to have changed somewhat from those in the earlier, unweighted, fit of Model 2, the fitted model is not really very different.
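The coefficient estimates and the reweighted residual summary can be inspected in the usual way:

> summary(model2w)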

This is confirmed by the plot of the residuals from the weighted fit against those from the unweighted fit, produced by

> plot(resid(model2w) ~ resid(model2))

Resistant Regression

As already observed, least squares fitting is very sensitive to outlying observations. However, there are a large number of resistant fitting techniques available. One such is least trimmed squares: choose f̂ from within the permitted class of functions f to minimise the sum of the q smallest of the squared residuals (y_i − f(x_i))², where q is roughly half of n, so that the observations with the largest residuals (the potential outliers) have no influence on the fit.
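A minimal sketch of this objective for a straight-line fit f(x) = a + b*x; the choice of q here is for illustration only (R's lqs function uses a slightly different default):

> lts_objective <- function(a, b, x, y, q = floor(length(y)/2) + 1) {
+   r2 <- sort((y - (a + b*x))^2)   # squared residuals, smallest first
+   sum(r2[1:q])                    # keep only the q smallest
+ }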

Example: phones data. The R dataset phones in the package MASS gives the annual number of phone calls (in millions) in Belgium over the period 1950–73. Consider the model calls = a + b*year. The following two graphs plot the data and show the result of fitting the model by least squares and then fitting the same model by least trimmed squares.

These graphs are achieved by the following code:

> library(MASS)      # provides the phones data
> attach(phones)
> plot(calls ~ year)
> phonesls = lm(calls ~ year)
> abline(phonesls)
> plot(calls ~ year)
> library(lqs)       # in current versions of R, lqs is part of MASS instead
> phoneslts = lqs(calls ~ year)
> abline(phoneslts)

The explanation for the data is that, for a period of time, the total length of all phone calls in each year was accidentally recorded instead of the number of calls.

Nonparametric Regression

Sometimes we simply wish to fit a smooth model without specifying any particular functional form for f. Again there are very many techniques here. One such is called loess. This constructs the fitted value f̂(x_i) for each observation i by performing a local regression using only those observations with x values in the neighbourhood of x_i (and attaching most weight to the closest observations).
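To make the idea concrete, here is a deliberately naive sketch of a local linear fit at a single point x0, using tricube weights on the nearest observations; the function name and details are ours, and loess itself is considerably more refined:

> local_fit <- function(x0, x, y, span = 0.75) {
+   k <- ceiling(span * length(x))
+   d <- abs(x - x0)
+   keep <- order(d)[1:k]                       # the k observations nearest to x0
+   w <- (1 - (d[keep]/max(d[keep]))^3)^3       # tricube weights: closest points count most
+   fit <- lm(y[keep] ~ x[keep], weights = w)
+   unname(coef(fit)[1] + coef(fit)[2] * x0)    # value of the local line at x0
+ }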

Example: cars data. The R data frame cars (in the base package) records 50 observations of speed (mph) and stopping distance (ft). These observations were collected in the 1920s! We treat stopping distance as the response variable and seek to model its dependence on speed.

We try to fit a model using loess. Possible R code is

> data(cars)
> attach(cars)
> plot(cars)
> library(modreg)    # in current versions of R, loess is in the standard stats package
> carslo = loess(dist ~ speed)
> lines(fitted(carslo) ~ speed)

An optional argument span can be increased from its default value of 0.75 to give more smoothing:

> plot(cars)
> carslo2 = loess(dist ~ speed, span = 1)
> lines(fitted(carslo2) ~ speed)

More robust and resistant fits can be obtained by specifying the further optional argument family = "symmetric".
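For example, on the cars data (carslo3 is just our name for the new fit):

> carslo3 = loess(dist ~ speed, family = "symmetric")
> plot(cars)
> lines(fitted(carslo3) ~ speed)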

Models with Qualitative Explanatory Variables (Factors)

Data: n = 22 pairs (x_i, y_i), where y is the response; the data arise under two different sets of conditions (type = 1 or 2) and are presented below sorted by x within type.

(Table of the 22 observations, with columns Row, y, x and type; the values are not reproduced in this transcript.)

A plot of the data, distinguishing the two types, is useful here (an appropriate R command will do this; one possibility is sketched below).
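Assuming the variables x, y and type are available in the workspace (and the legend position is our choice):

> plot(x, y, pch = as.numeric(type))    # different plotting symbols for the two types
> legend("topleft", legend = c("type 1", "type 2"), pch = 1:2)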

We model the responses first ignoring the variable type.

> mod1 = lm(y ~ x)
> abline(mod1)

> summary(mod1)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
    ...     ...     ...     ...     ...

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)      ...        ...     ...      ... ***
x                ...        ...     ...  6.2e-08 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: ... on 20 degrees of freedom
Multiple R-Squared: 0.7763,    Adjusted R-squared: 0.7651
F-statistic: 69.4 on 1 and 20 DF, p-value: 6.201e-08

> summary.aov(mod1)
            Df Sum Sq Mean Sq F value    Pr(>F)
x            1    ...     ...    69.4 6.201e-08 ***
Residuals   20    ...     ...
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

We now model the responses using a model which includes the qualitative variable type, which was declared as a factor when the data frame was set up:

> type = factor(c(rep(1,14), rep(2,8)))
> mod2 = lm(y ~ x + type)

> summary(mod2)

Call:
lm(formula = y ~ x + type)

Residuals:
    Min      1Q  Median      3Q     Max
    ...     ...     ...     ...     ...

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)      ...        ...     ...  ...e-05 ***
x             0.6090        ...     ...  ...e-11 ***
type2            ...        ...     ...  ...e-06 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: ... on 19 degrees of freedom
Multiple R-Squared: 0.925,    Adjusted R-squared: 0.917
F-statistic: ... on 2 and 19 DF, p-value: 2.001e-11

Interpreting the output: the fit is y = a + b*x + c*I(type = 2), where a, b and c are the three estimated coefficients and I(type = 2) is 1 for type 2 observations and 0 otherwise (R's default treatment coding of the factor). So, e.g., for observation 1 (x = 2.4, type = 1) the fitted value is a + 2.4b, and for observation 20 (x = 9.1, type = 2) it is a + 9.1b + c.
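The 0/1 coding that R uses for the factor can be inspected directly via the model matrix:

> model.matrix(mod2)    # columns: (Intercept), x, type2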

> summary.aov(mod2)
            Df Sum Sq Mean Sq F value   Pr(>F)
x            1    ...     ...     ...  ...e-11 ***
type         1    ...     ...     ...  ...e-06 ***
Residuals   19    ...     ...
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

The fitted values for Model 2 can be obtained in R by:

> fitted.values(mod2)
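Since type enters only through the intercept, Model 2 fits two parallel lines, which can be added to a scatterplot of the data. A sketch (cf is just our name for the coefficient vector, assumed to be ordered (Intercept), x, type2):

> cf <- coef(mod2)
> abline(cf[1], cf[2])                     # fitted line for type 1
> abline(cf[1] + cf[3], cf[2], lty = 2)    # type 2: same slope, intercept shifted by cf[3]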

The total variation in the responses is S_yy = …; variable x explains 77.6% of this total, and the coefficient associated with it (0.6090) is highly significant (significantly different from 0): it has a negligible P-value.

In the presence of x, type explains a further 14.9% of the total variation, and its coefficient is also highly significant. Together the two variables explain 92.5% of the total variation. In the presence of x, we gain much by including type.
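These percentages can be recovered from the sequential ANOVA table; a sketch:

> ss <- summary.aov(mod2)[[1]][["Sum Sq"]]
> round(100 * ss / sum(ss), 1)    # % of S_yy explained by x, by type (after x), and residual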

Finally, we extend the previous model (mod2) by allowing for an interaction between the explanatory variables x and type. An interaction exists between two explanatory variables when the effect of one on a response variable is different at different values/levels of the other.

For example, consider the effect of a policyholder's age and gender on a response variable, claim rate. If the effect of age on claim rate is different for males and females, then there is an interaction between age and gender.
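In the present example the interaction model y ~ x * type expands to

y = a + b*x + c*I(type = 2) + d*x*I(type = 2) + ε,

so the type 1 observations lie about a line with slope b while the type 2 observations lie about a line with slope b + d: the interaction allows the two lines to differ in slope as well as in intercept.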

> mod3 = lm(y ~ x * type)
> summary(mod3)

Call:
lm(formula = y ~ x * type)

Residuals:
    Min      1Q  Median      3Q     Max
    ...     ...     ...     ...     ...

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)      ...        ...     ...  ...e-05 ***
x                ...        ...     ...  ...e-10 ***
type2            ...        ...     ...      ...
x:type2          ...        ...     ...      ...
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: ... on 18 degrees of freedom
Multiple R-Squared: ...,    Adjusted R-squared: ...
F-statistic: 74.6 on 3 and 18 DF, p-value: 2.388e-10

> summary.aov(mod3)
            Df Sum Sq Mean Sq F value   Pr(>F)
x            1    ...     ...     ...  ...e-11 ***
type         1    ...     ...     ...  ...e-05 ***
x:type       1    ...     ...     ...      ...
Residuals   18    ...     ...
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

The interaction appears to have added nothing: the coefficient of determination is effectively unchanged compared to the previous model. We also note that the estimate of the extra parameter is small and is not significant. In this particular case an interaction term is not helpful; including it has simply confused the issue.
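A direct check is an F-test comparing the two nested models:

> anova(mod2, mod3)    # tests whether the interaction term improves the fit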

In a case where an interaction term does improve the fit and its coefficient is significant, both variables and the interaction between them should be included in the model.