Robust Regression V & R: Section 6.5 Denise Hum. Leila Saberi. Mi Lam.

Linear Regression (from Ott & Longnecker) Use data to fit a prediction line that relates a dependent variable y and a single independent variable x. That is, we want to write y as a linear function of x: y = β_0 + β_1 x + ε.
Assumptions of regression analysis:
1. The relation is linear, so that the errors all have expected value zero: E(ε_i) = 0 for all i.
2. The errors all have the same variance: Var(ε_i) = σ_ε² for all i.
3. The errors are independent of each other.
4. The errors are all normally distributed: ε_i is normally distributed for all i.
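A minimal sketch of fitting such a model in R and checking the assumptions visually (dat, with columns x and y, is a hypothetical data frame):
fit <- lm(y ~ x, data = dat)
par(mfrow = c(2, 2))
plot(fit)          # residuals vs fitted, normal Q-Q, scale-location, leverage
par(mfrow = c(1, 1))
summary(fit)       # coefficient estimates, standard errors, R-squared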

Example – the least squares method works well. Data from an Ott & Longnecker Ch. 11 exercise, fit with lm(formula = y ~ x). The slide shows the summary() output: the slope on x is highly significant, the residual standard error is on 8 degrees of freedom, and the F-statistic on 1 and 8 DF has p-value 1.349e-06.

But what happens if your data has outliers and/or fails to meet the regression assumptions? Data: the phones data set in the MASS library. These data give the number of phone calls (in millions) in Belgium between 1950 and 1973. However, between 1964 and 1969 the total length of calls (in minutes) was recorded rather than the number, and both recording systems were in use during parts of 1963 and 1970.
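A short sketch for loading and plotting the phones data (the data set ships with MASS; the axis labels are our own):
library(MASS)                       # provides phones, rlm() and lqs()
str(phones)                         # components: year and calls
plot(phones$year, phones$calls, xlab = "Year", ylab = "Calls (millions)")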

Outliers Outliers can cause the estimate of the regression slope to change drastically. In the least squares approach we measure the response values in relation to the mean. However, the mean is very sensitive to outliers – a single outlier can change its value, so it has a breakdown point of 0%. The median, on the other hand, is not as sensitive – it is resistant to gross errors and has a 50% breakdown point. So if the data are not normal, the mean may not be the best measure of central tendency. Another option with a higher breakdown point is the trimmed mean.
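A quick illustration of the breakdown points with made-up numbers: one gross error moves the mean but barely touches the median or a trimmed mean.
x <- c(2.1, 2.4, 2.5, 2.7, 3.0, 3.2, 3.3, 3.5, 3.6, 40)  # one gross outlier
mean(x)               # pulled strongly toward 40
median(x)             # essentially unaffected
mean(x, trim = 0.1)   # 10% trimmed mean: drops the most extreme observations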

Why can’t we just delete the suspected outliers? Users don’t always screen the data. Rejecting outliers affects the distribution theory, which ought to be adjusted; in particular, variances will be underestimated from the “cleaned” data. The sharp decision to keep or reject an observation is wasteful. We can do better by down-weighting extreme observations rather than rejecting them, although we may still wish to reject completely wrong observations. So try robust or resistant regression.

What are robust and resistant regression? Robust and resistant regression analyses provide alternatives to a least squares model when the data violates the fundamental assumptions. Robust and resistant regression procedures dampen the influence of outliers, as compared to regular least squares estimation, in an effort to provide a better fit for the majority of data. In the V&R book, robustness refers to being immune to assumption violations while resistance refers to being immune to outliers. Robust regression, which uses M-estimators, is not very resistant to outliers in most cases.

Phones data with Least Squares, Robust, and Resistant regression lines

Contrasting three regression methods: the least squares linear model, robust methods, and resistant methods.

Least Squares Linear Model The traditional linear regression model. Determines the best-fitting line as the line that minimizes the sum of squared errors: SSE = Σ (y_i − ŷ_i)². If all the assumptions are met, this is the best linear unbiased estimate (BLUE). Less complex in terms of computation, but very sensitive to outliers.
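For example, on the phones data the least squares criterion can be computed directly (a sketch):
fit.ls <- lm(calls ~ year, data = phones)
sum(residuals(fit.ls)^2)   # SSE, the quantity least squares minimizes
deviance(fit.ls)           # the same residual sum of squares, as reported by R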

Robust Regression An alternative to the least squares method when errors are non-normal. Uses iterative methods to assign different weights to residuals until the estimation process converges. Useful for detecting outliers by finding cases whose final weights are relatively small. Can be used to confirm the appropriateness of the ordinary least squares model. Primarily helpful in finding cases that are outlying with respect to their y values (long-tailed errors); it cannot overcome problems due to the variance structure. Evaluating the precision of the regression coefficients is more complex than for the ordinary model.
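A sketch of using the final weights to flag possible outliers, assuming (as in MASS) that the fitted rlm object stores the final IWLS weights in its w component:
library(MASS)
fit.huber <- rlm(calls ~ year, data = phones, maxit = 50)
round(fit.huber$w, 2)        # final weights from the iterative fit
which(fit.huber$w < 0.5)     # heavily down-weighted cases are candidate outliers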

One robust method (V&R p. 158): M-estimators. Assume f is a scaled pdf and set ρ = −log f. The maximum likelihood estimator minimizes the following to find the β’s: Σ_i ρ((y_i − x_i b)/s) + n log s, where s is the scale, which must also be determined.
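As an illustration of this objective (not how rlm() is actually implemented), the Huber ρ can be minimized directly with the scale s held fixed at a MAD estimate; the function names below are ours:
library(MASS)
rho.huber <- function(u, c = 1.345)                    # Huber's rho function
  ifelse(abs(u) <= c, u^2 / 2, c * abs(u) - c^2 / 2)
y <- phones$calls
X <- cbind(1, phones$year)
s <- mad(residuals(lm(calls ~ year, data = phones)))   # scale held fixed
obj <- function(b) sum(rho.huber((y - X %*% b) / s))   # sum of rho(scaled residuals)
optim(coef(lm(calls ~ year, data = phones)), obj)$par  # compare with coef(rlm(...))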

Resistant Regression Unlike robust regression, it is model-based: the answer is always the same. Rejects all possible outliers. Useful for detecting outliers. Requires much more computing than least squares. Inefficient, taking into account only a portion of the data. Compared to robust methods, resistant methods are more resistant to outliers. Two common types: Least Median of Squares (LMS) and Least Trimmed Squares (LTS).

LMS method (V&R p. 159) Minimize the median of the squared residuals: min_b median_i |y_i − x_i b|². Replaces the sum in the least squares method with the median. Very inefficient. Not recommended for small samples, due to its high breakdown point.

LTS method (V&R p. 159) Minimize the sum of squares for the smallest q of the residuals: min_b Σ_{i=1}^{q} |y − x b|²_(i), where |y − x b|²_(1) ≤ … ≤ |y − x b|²_(n) are the ordered squared residuals. More efficient than LMS, but with the same resistance to errors. The recommended q is q = ⌊(n + p + 1)/2⌋.
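A sketch of the LTS fit on the phones data, spelling out q and the criterion (lqs() uses random subsampling, so results vary slightly between runs):
library(MASS)
n <- length(phones$calls); p <- 2            # p = number of coefficients
q <- floor((n + p + 1) / 2)                  # recommended quantile
set.seed(123)                                # lqs() samples subsets at random
fit.lts <- lqs(calls ~ year, data = phones, method = "lts", quantile = q)
sum(sort(residuals(fit.lts)^2)[1:q])         # LTS criterion: q smallest squared residuals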

Robust Regression Techniques began to be developed in the 1960s. Fitting is done by iterated re-weighted least squares (IWLS, also written IRLS). IWLS uses weights based on how far outlying a case is, as measured by the residual for that case; weights vary inversely with the size of the residual. Iteration continues until the process converges.
R code: rlm() fits a robust linear model.
summary(rlm(calls ~ year, data = phones, maxit = 50), cor = F)
Call: rlm(formula = calls ~ year, data = phones, maxit = 50)
The slide shows the residual summary (Min, 1Q, Median, 3Q, Max) and the coefficient table (Value, Std. Error, t value) for (Intercept) and year, with a residual standard error on 22 degrees of freedom.
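A bare-bones IRLS loop with Huber weights, to illustrate the idea behind rlm() (a sketch, not MASS's actual implementation; here the scale is simply re-estimated by MAD at each step):
y <- phones$calls
x <- phones$year
b <- coef(lm(y ~ x))                                  # start from least squares
for (iter in 1:50) {
  r <- y - b[1] - b[2] * x                            # current residuals
  u <- r / mad(r, center = 0)                         # scaled residuals
  w <- ifelse(abs(u) <= 1.345, 1, 1.345 / abs(u))     # Huber weights
  b.new <- coef(lm(y ~ x, weights = w))               # weighted least squares step
  if (max(abs(b.new - b)) < 1e-8) break
  b <- b.new
}
b   # close to coef(rlm(calls ~ year, data = phones, maxit = 50))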

Weight Functions for Robust Regression (linear regression book citation):
Huber’s M-estimator (the default in R) is used with tuning parameter c = 1.345:
w = ψ(u)/u = 1 for |u| ≤ 1.345, and 1.345/|u| for |u| > 1.345.
u denotes the scaled residual; the scale is estimated with the median absolute deviation (MAD) estimator (instead of sqrt(MSE)):
MAD = (1/0.6745) · median{ |e_i − median{e_i}| }, so u_i = e_i / MAD.
Bisquare (a redescending estimator):
w = ψ(u)/u = [1 − (u/4.685)²]² for |u| ≤ 4.685, and 0 for |u| > 4.685.
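These weight functions are easy to write down and plot (a sketch; the function names are ours, not MASS's):
huber.w    <- function(u, c = 1.345) ifelse(abs(u) <= c, 1, c / abs(u))
bisquare.w <- function(u, c = 4.685) ifelse(abs(u) <= c, (1 - (u / c)^2)^2, 0)
mad.scale  <- function(e) median(abs(e - median(e))) / 0.6745   # MAD as defined above
curve(huber.w(x), -6, 6, lty = 1, ylab = "weight")
curve(bisquare.w(x), -6, 6, lty = 2, add = TRUE)
legend("topright", lty = 1:2, legend = c("Huber", "Bisquare"))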

R output for three different linear models (LM, RLM with Huber, and RLM with bisquare):
summary(lm(calls ~ year, data = phones), cor = F)
summary(rlm(calls ~ year, data = phones, maxit = 50), cor = F)
summary(rlm(calls ~ year, data = phones, psi = psi.bisquare), cor = F)
Each summary on the slide reports the (Intercept) and year coefficients with their standard errors and t values, and a residual standard error on 22 degrees of freedom.

Comparison of Robust Weights using R
attach(phones); plot(year, calls); detach()
abline(lm(calls ~ year, data = phones), lty = 1, col = 'black')
abline(rlm(calls ~ year, phones, maxit = 50), lty = 1, col = 'red')   # default (Huber)
abline(rlm(calls ~ year, phones, psi = psi.bisquare, maxit = 50), lty = 2, col = 'blue')
abline(rlm(calls ~ year, phones, psi = psi.hampel, maxit = 50), lty = 3, col = 'purple')
legend(locator(1), lty = c(1, 1, 2, 3), col = c('black', 'red', 'blue', 'purple'),
       legend = c("LM", "Huber", "Bi-Square", "Hampel"))

Resistant Regression More estimators, developed in the 1980s, designed to be more resistant to outliers. The goal is to fit a regression to the good points in the dataset, thereby achieving a regression estimator with a high breakdown point. Least Median of Squares (LMS) and Least Trimmed Squares (LTS): both are inefficient, but very resistant. S-estimation (see p. 160): more efficient than LMS and LTS when the data are normal. MM-estimation (a combination of M-estimation and resistant regression techniques): the MM-estimator is an M-estimate starting at the coefficients given by the S-estimator and with the scale fixed at the value given by the S-estimator.
R code: lqs()
lqs(calls ~ year, data = phones)   # default LTS method
The slide shows the resulting (Intercept) and year coefficients and the scale estimates.

Comparison of Resistant Estimators using R: attach(phones); plot(year, calls); detach(); abline(lm(calls ~ year, data = phones), lty = 1,col = 'black') abline(lqs(calls ~ year, data = phones), lty = 1, col = 'red') abline(lqs(calls ~ year, data = phones, method = "lms"), lty = 2, col = 'blue') abline(lqs(calls ~ year, data = phones, method = "S"), lty = 3, col = 'purple') abline(rlm(calls ~ year, data = phones, method = "MM"), lty = 4, col = 'green') legend(locator(1), lty = c(1,1,2,3,4), col = c('black', 'red', 'blue', 'purple', 'green'), legend = c("LM","LTS", "LMS", "S", "MM"))

Summary Some reasons for using robust regression:
1. Protect against influential outliers
2. Useful for detecting outliers
3. Check results against a least squares fit
plot(x, y)
abline(lm(y ~ x), lty = 1, col = 1)
abline(rlm(y ~ x), lty = 2, col = 2)
abline(lqs(y ~ x), lty = 3, col = 3)
legend(locator(1), lty = 1:3, col = 1:3,
       legend = c("Least Squares", "M-estimate (Robust)", "Least Trimmed Squares (Resistant)"))
To use robust regression in R: function rlm(). To use resistant regression in R: function lqs().
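A compact way to compare the three approaches on the phones data (a sketch):
library(MASS)
fits <- list(LS  = lm(calls ~ year, data = phones),
             M   = rlm(calls ~ year, data = phones, maxit = 50),
             LTS = lqs(calls ~ year, data = phones))
sapply(fits, coef)   # side-by-side intercept and slope estimates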