Regression III: Robust regressions


Regression III: Robust regressions

Outline: outliers; tests for outlier detection; robust regressions; breakdown point; least trimmed squares; M-estimators.

Outliers

Outliers are observations that deviate significantly from all the others. They may occur by accident or as a result of measurement errors, and their presence may lead to misleading results. In the example shown in the figure, the mean without the outlier is -0.41 and with the outlier it is -0.01. If we test the hypothesis H0: mean = 0, then without the outlier we conclude that H0 can be rejected (p-value 0.009), whereas with the outlier we cannot reject it (p-value 0.98). Analysing and dealing with outliers is an important ingredient of modern statistical analysis: careful examination of outliers, and their removal or down-weighting, may change the conclusions considerably.

[Figure: the sample plotted with and without the outlier.]
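As a minimal illustration, these numbers can be reproduced in R (del1 is the sample listed on the breakdown-point slide below; the last value, 4.0, is the outlier):

del1 = c(-0.8,-0.6,-0.3,0.1,-1.1,0.2,-0.3,-0.5,-0.5,-0.3,4.0)
mean(del1[-11])            # without the outlier: -0.41
mean(del1)                 # with the outlier: -0.01 (rounded)
t.test(del1[-11])$p.value  # 0.009: reject H0: mean = 0
t.test(del1)$p.value       # 0.98: cannot reject H0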

Dealing with outliers

For simple cases such as mean or variance calculations, one way of dealing with outliers is to use trimmed data when calculating the statistic. For example, in the case above the mean without the outlier is -0.41, with the outlier it is -0.01, and the trimmed mean with 10% removed from each end is -0.33. The trimmed mean gives a “better” estimate than the mean based on all the data. Often the mean and other statistics are used for testing a hypothesis; in these cases non-parametric tests (wilcox.test) may be a better alternative to t.test (wilcox.test gives p-value 0.09 where t.test gives p-value 0.98). For simple statistics like means and covariances, and for tests based on them, the usual approach is to use the ranks of the observations instead of their values. Obviously, when ranks are used the power of the tests is reduced, but the conclusions are more reliable. Before carrying out analyses and tests it is always a good idea to visualise and explore the data to see whether there are outliers.
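A short sketch of both remedies on the same sample (del1 as defined above; the p-values are those quoted on the slide):

mean(del1, trim = 0.1)     # trims 10% from each end: -0.33
t.test(del1)$p.value       # 0.98, dominated by the outlier
wilcox.test(del1)$p.value  # about 0.09, rank-based and more robust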

Outlier detection: Grubbs’ test

It may be a good idea to test for outliers, and remove them if appropriate, before starting to analyse the data and doing hypothesis testing. One of the techniques for doing this is Grubbs’ test. It tests H0: there is no outlier versus H1: there is an outlier. To do this Grubbs suggested using the statistic:

G = (max(y)-mean(y))/sd(y) to test if the maximum value is an outlier,
G = (mean(y)-min(y))/sd(y) to test if the minimum value is an outlier, and
G = max(|yi-mean(y)|)/sd(y) to test if either the maximum or the minimum is an outlier.

There are also versions for two outliers. Obviously the distribution of the test statistic depends on the number of observations. Grubbs’ test (grubbs.test) is available from the package outliers, which is not part of the standard R distribution.
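The call itself is a one-liner (assuming the package has been installed with install.packages("outliers")):

library(outliers)
grubbs.test(del1)  # one-outlier test; reports the most extreme value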

Outlier detection: Grubbs’ test

Applying grubbs.test to the example we get:

Grubbs test for one outlier
data: del1
G = 2.9063, U = 0.0709, p-value = 9.858e-06
alternative hypothesis: highest value 4 is an outlier

Since the p-value is very small, we can reject the null hypothesis that there is no outlier. Once we are confident that there is an outlier, we remove it and carry out further tests and/or analysis.

Outlier detection: Simulated distribution

If the outliers package is not available, we can generate the distribution of the statistics given above by simulation. Let us write a function for one of them (H0: the maximum value is not an outlier):

outdist = function(nsample,n){
  ff = vector(length=n)
  for (i in 1:n){rr = rnorm(nsample); ff[i] = max(rr-mean(rr))/sd(rr)}
  ff
}

Now we can generate the distribution of the statistic for samples of different sizes. For example, the distributions for samples of size 10, 15 and 20 are shown in the figure. To generate the figure the following commands are used (add=TRUE overlays the second and third curves on the first plot):

oo10 = outdist(10,10000)
curve(ecdf(oo10)(x),from=0,to=15,lwd=3)
oo15 = outdist(15,10000)
curve(ecdf(oo15)(x),from=0,to=15,lwd=3,add=TRUE)
oo20 = outdist(20,10000)
curve(ecdf(oo20)(x),from=0,to=15,lwd=3,add=TRUE)

Obviously, as the sample size increases, the probability that large values will be genuinely observed increases. That is why the distribution shifts to the right as the sample size grows.

Outlier detection: Simulated distribution

Once we have the desired distribution we can calculate p-values. For the case we considered, the sample size is 11. Let us generate the empirical cumulative distribution function and use it for outlier detection:

oo11 = outdist(11,10000)
ec11 = ecdf(oo11)
st = max(del1-mean(del1))/sd(del1)
1-ec11(st)

This sequence of commands produces the p-value. If the maximum value is 4 the p-value is 0, if the maximum value is 2 the p-value is 0.001, and when the maximum value is 1 the p-value is 0.04. We can reject the null hypothesis in the first and second cases; in the third case we should be careful.

Breakdown point

The breakdown point of an estimator is the largest fraction of the sample that can be changed arbitrarily without changing the estimate arbitrarily. For example, by changing just one value we can change the mean as much as we want. Let us take an example:

-0.8 -0.6 -0.3 0.1 -1.1 0.2 -0.3 -0.5 -0.5 -0.3 4.0

The sample size is 11 and the mean is -0.01. If we change the last value to 100 the mean becomes 8.72; the breakdown point of the mean is therefore 0. The other limiting case is the median. The median of the above sample is -0.3. If we change one value and make it extremely large or small, the median will not change much: for example, if we change the last value to -100 the median becomes -0.5. The breakdown point of the median is 0.5, i.e. more than 50% of the sample would have to be changed arbitrarily to change the median arbitrarily. A breakdown point of 0.5 is the theoretical limit. The efficiency of estimators with a high breakdown point is usually worse than that of estimators with a lower breakdown point; in other words, the variances of estimators with a high breakdown point are larger.
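A sketch of this in R, using the sample above:

x = c(-0.8,-0.6,-0.3,0.1,-1.1,0.2,-0.3,-0.5,-0.5,-0.3,4.0)
mean(x); median(x)       # -0.01 and -0.3
x[11] = 100; mean(x)     # 8.72: one corrupted value drags the mean
x[11] = -100; median(x)  # -0.5: the median barely moves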

Outliers and regression

Let us recall the form of the least-squares criterion for regression. Again x is a vector of input (predictor) variables, β is a vector of parameters, y is the output, and the number of sample points is n; least squares minimises

Σ (yi − g(β,xi))², summed over i = 1,…,n.

As we know, in the special case when g(β,x) = β, with β a single value, least-squares estimation gives the mean of y, so the estimation above can be considered an extension of mean estimation. Its breakdown point is 0, so least squares is very sensitive to outliers. There are several approaches for dealing with outliers in regression analysis. We will consider only two of them: 1) least trimmed squares; 2) M-estimators.

[Figures: regression fit without outliers and with an outlier.]

Least trimmed squares

Least trimmed squares (LTS) works iteratively:

1) Set up initial values for the model parameters (for example using the simple least-squares method implemented in lm).
2) Calculate the squared residuals ri² = (yi − g(xi,β))².
3) Sort the squared residuals.
4) Remove the fraction of observations for which the squared residuals are largest.
5) Minimise the sum of squares using the remaining observations only.
6) Repeat 2)-5) until convergence is achieved.

The function lqs in R performs LTS, and handles several related methods as well (least median and least quantile squares). The number of residuals used differs between methods: for least median it is [(n+1)/2], for least quantile [(n+p+1)/2], and for LTS [n/2]+[(p+1)/2], where [] is the integer part of the argument.

[Figure: the result of the default lqs fit.]
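A minimal sketch of lqs on simulated data (a hypothetical example; lqs lives in package MASS and defaults to method = "lts"):

library(MASS)
set.seed(1)
x = 1:20
y = 2 + 0.5*x + rnorm(20, sd = 0.5)
y[20] = -10        # corrupt one observation
coef(lm(y ~ x))    # ordinary least squares, pulled by the outlier
coef(lqs(y ~ x))   # LTS fit, largely unaffected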

Robust M-estimators

A general extension of least squares has the form: minimise

Σ ρ(yi − g(β,xi)), summed over i = 1,…,n.

The form of the function ρ defines the various robust M-estimators; when ρ(z) = z² it becomes simple least squares. Let us first analyse this function. To minimise it let us use the Gauss-Newton method, for which we need the first derivative and (more precisely, an approximation to) the second derivative of the objective; these involve ρ’ and ρ’’, the first and second derivatives of ρ. In Gauss-Newton methods the second term of the second-derivative expression is usually ignored. Usually the notations ψ = ρ’ and w = ρ’’ are used. If we look at the resulting equations we can see that they are an extension of the least-squares equations, and the minimisation is done iteratively using iteratively reweighted least squares (IRLS or IWLS). ψ is called the influence function; analysing its values at the observations may help us understand the outliers in the data and how they are dealt with.
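A sketch using rlm from package MASS, which implements this IRLS scheme (continuing the simulated data above); the final IWLS weights, stored in fit$w, show how strongly each observation was down-weighted:

library(MASS)
fit = rlm(y ~ x)  # M-estimation, Huber psi by default
coef(fit)
fit$w             # weights near 0 flag down-weighted outliers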

Forms of robust regression

[Figure: example of ρ and ψ for the Geman-McClure function.]

Robust M-estimators are usually chosen so as to make the contribution of the gradients for large residuals small, in other words to weight down large deviations. They can be specified either through ρ or through ψ. The basic idea behind robust estimators is that for small deviations the behaviour of the function should be similar to least squares, while the contributions of large deviations should be weighted down. Different functions differ in the degree of weighting.

Forms of robust regression

The most popular forms of robust estimators are:

Huber
Tukey’s bisquare
Geman-McClure
Welsch
t-distribution (actually a slightly modified form of the negative log t-distribution density)
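For reference, standard textbook forms of the first four ρ functions, with tuning constant c (these are not reproduced from the slide’s figures, and scaling conventions vary between references):

Huber: ρ(z) = z²/2 for |z| ≤ c, and c|z| − c²/2 otherwise
Tukey’s bisquare: ρ(z) = (c²/6)[1 − (1 − (z/c)²)³] for |z| ≤ c, and c²/6 otherwise
Geman-McClure: ρ(z) = z²/(1 + z²)
Welsch: ρ(z) = (c²/2)[1 − exp(−(z/c)²)]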

Robust estimators

The result of robust regression with the Huber function is very good in this example. In practice the choice of function will depend on the number and severity of the outliers.

Robust estimators: Diagnostic plots

Boxplots for residuals and weighted residuals

R commands for robust estimation

lqs – least trimmed squares and least median squares estimation (package MASS)
rlm – robust linear model estimation using M-estimators (package MASS)
The outliers package may also be helpful.

References

P. J. Huber (1981) “Robust Statistics”.
F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw and W. A. Stahel (1986) “Robust Statistics: The Approach Based on Influence Functions”.
A. Marazzi (1993) “Algorithms, Routines and S Functions for Robust Statistics”.
W. N. Venables and B. D. Ripley (2002) “Modern Applied Statistics with S”.