Thinking Robustly
Sampling Distribution
In order to examine the properties of a statistic, we often want to take repeated samples from some population of data and calculate the relevant statistic on each sample. We can then look at the distribution of the statistic across these samples and ask a variety of questions about it.
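A minimal Python/NumPy sketch of this resampling idea (the skewed exponential "population", the sample size, and the number of samples are arbitrary choices for illustration):

```python
# Sketch: build a sampling distribution of the mean by repeated sampling.
import numpy as np

rng = np.random.default_rng(42)
population = rng.exponential(scale=10, size=100_000)  # a skewed "population"

n_samples, n = 5_000, 30
sample_means = np.array([rng.choice(population, size=n).mean()
                         for _ in range(n_samples)])

# The center of the sampling distribution estimates the population mean,
# and its spread is the standard error of the sample mean.
print(population.mean(), sample_means.mean(), sample_means.std(ddof=1))
```

The same scaffold works for any statistic: swap .mean() for np.median or a trimmed mean to compare their sampling distributions.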
Properties of a Statistic
Sufficiency: a sufficient statistic is one that makes use of all of the information in the sample to estimate its corresponding parameter. This property makes the mean more attractive as a measure of central tendency than the mode or median, which ignore much of the sample.
Unbiasedness: a statistic is an unbiased estimator if its expected value (i.e., the long-run mean of the statistic over repeated samples) equals the population parameter it estimates. As the resampling sketch above illustrates, the sample mean is an unbiased estimator of the population mean.
Properties of a Statistic
Efficiency: the efficiency of a statistic is reflected in the variance observed when the statistic is examined over independently chosen samples, i.e., its standard error. The smaller that variance, the more efficient the statistic is said to be.
Resistance: the resistance of an estimator refers to the degree to which the estimate is affected by extreme values (outliers). For a resistant estimator, small changes in the data result in only small changes in the estimate.
Finite-sample breakdown point: a measure of resistance to contamination, namely the smallest proportion of observations that, when altered sufficiently, can render the statistic arbitrarily large or small.
Median: roughly 1/2. Trimmed mean: the trimming proportion (e.g., .2 for 20% trimming). Mean: 1/n, since a single observation can be made to dominate it.
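A small sketch of breakdown in action (made-up data): corrupting a single observation is enough to drag the mean anywhere, while the median and 20% trimmed mean barely move:

```python
# Breakdown in action: contaminate one value and compare estimators.
import numpy as np
from scipy import stats

x = np.array([2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0])
x_bad = x.copy()
x_bad[-1] = 1e6  # a single wild value: the mean's 1/n breakdown point

for label, data in [("clean", x), ("contaminated", x_bad)]:
    print(label, data.mean(), np.median(data), stats.trim_mean(data, 0.2))
```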
Problematic Data
Typical problems include violations of normality and homoscedasticity, outliers in the data, etc. Mean-based methods can sometimes be 'robust,' but only with respect to type I error, and even that does not hold in many situations: the type I error rate may increase or drastically decrease.
More serious are the bias, inefficiency, and dramatic decrease in power (i.e., increase in type II error), which can lead to completely erroneous inferential conclusions and understatement of effects.
Measures of Central Tendency
What we want:
A statistic whose standard error is not grossly affected by small departures from normality.
Power comparable to that of the mean and sd when dealing with a normal population.
A value that is fairly stable when dealing with non-normal distributions.
Two classes to speak of: trimmed means and M-estimators.
Trimmed mean
You are already familiar with this idea in the form of the median, for which essentially everything but the middle value is trimmed. Here we want to retain as much of the data as possible for best performance, while trimming enough to ensure resistance to outliers.
How much to trim? About 20%, from each side. Example: with 15 values, .2 * 15 = 3, so remove the 3 largest and 3 smallest.
Advantages: in non-normal situations it performs better than the mean; it is resistant to outliers; it has a reduced standard error under heavy tails; and even under normal conditions it is only slightly less efficient than the mean.
Winsorized means: instead of trimming, change all values beyond the upper and lower 'trimming' points to the most extreme value retained. X = 1 2 3 4 5 6 becomes X = 2 2 3 4 5 5.
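Both estimators are one call in SciPy; a sketch using the slide's six-value example:

```python
import numpy as np
from scipy import stats
from scipy.stats.mstats import winsorize

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)

# 20% trimmed mean: drop the most extreme values from each end, then average.
print(stats.trim_mean(x, 0.2))

# Winsorizing instead replaces the extremes: 1 2 3 4 5 6 -> 2 2 3 4 5 5
xw = np.asarray(winsorize(x, limits=[0.2, 0.2]))
print(xw, xw.mean())
```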
M-estimators
M-estimators are another class of robust measures of location, built on the notion of a 'loss function.' Examples: minimizing squared errors from a particular value yields the mean as the measure of central tendency; minimizing absolute errors yields the median.
M-estimators are more mathematically complex, but the gist is that less weight is given to values farther from the 'center.' Think of a robust z-statistic calculated for each value: any value beyond some constant K is downweighted or possibly ignored. Different M-estimators assign different weights to deviating values.
M-estimators
Example of the weighting compared to the mean: where the mean weights every observation equally, an M-estimator uses robust estimates of distance from the center and weights observations accordingly, as in the sketch below.
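A hand-rolled sketch of a Huber-type M-estimator of location, computed by iterative reweighting. The tuning constant k = 1.28 is one conventional choice (an assumption here, not something fixed by the slides); observations whose robust z-score exceeds k are downweighted:

```python
import numpy as np

def huber_location(x, k=1.28, tol=1e-6, max_iter=100):
    mu = np.median(x)                            # robust starting value
    scale = np.median(np.abs(x - mu)) / 0.6745   # MAD rescaled to estimate sigma
    for _ in range(max_iter):
        z = (x - mu) / scale                     # robust z-scores
        w = np.ones_like(z)
        far = np.abs(z) > k
        w[far] = k / np.abs(z[far])              # downweight points beyond k
        mu_new = np.sum(w * x) / np.sum(w)       # weighted mean
        if abs(mu_new - mu) < tol:
            break
        mu = mu_new
    return mu

x = np.array([2, 3, 4, 5, 6, 7, 8, 9, 10, 100.0])
print(x.mean(), np.median(x), huber_location(x))  # the mean is dragged; the others are not
```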
Measures of Variability
We can also calculate trimmed and Winsorized variances/standard deviations; in fact, the Winsorized version typically performs better as a variance measure.
A common robust measure of deviation from the center is the median absolute deviation from the median (MAD). For a normal distribution, MAD ≈ .6745(σ), so MAD/.6745 estimates the standard deviation. The MAD is far more resistant than s, and more efficient when distributions are heavy-tailed or contaminated.
It also allows a much better means of detecting a univariate outlier than the usual 2s standard: flag X as an outlier if |X − M| > 2(MAD/.6745), where M is the median.
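A sketch of the MAD-based outlier rule from the slide, with made-up data (SciPy also exposes this scale estimate as scipy.stats.median_abs_deviation):

```python
import numpy as np

def mad_outliers(x, k=2.0):
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    sigma_hat = mad / 0.6745                   # MAD rescaled to estimate sigma
    return x[np.abs(x - med) > k * sigma_hat]  # flag |X - M| > 2(MAD/.6745)

x = np.array([4, 5, 5, 6, 6, 7, 7, 8, 30.0])
print(mad_outliers(x))                         # -> [30.]
```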
Inferential Use of Robust Statistics
In general, the approach to hypothesis testing and interval estimation is the same with robust statistics as with regular ones. Of particular concern will be estimating the standard error and the relative merits of the robust approach.
The Trimmed Mean
Consider the typical t-statistic for the one-sample case, t = (X̄ − μ) / (s/√n). The same logic holds using a trimmed mean, except that we use the values remaining after trimming (with a Winsorized estimate of the standard error, as on the next slide), and our inference concerns the population trimmed mean rather than the population mean.
Robust example: using trimmed means for an independent-samples t-test (Yuen's test)
Calculate the squared standard error for the group 1 trimmed mean as d1 = (n1 − 1)s_w1² / (h1(h1 − 1)), where h1 is the n remaining after trimming and s_w1² is the Winsorized variance. Do the same for group 2.
Then calculate the t-statistic as t = (Xt1 − Xt2) / √(d1 + d2), where Xt1 and Xt2 are the trimmed means; degrees of freedom are estimated with a Welch-type adjustment.
*Note that this formulation works for unequal sample sizes also.
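Recent versions of SciPy expose Yuen's test directly through the trim argument of ttest_ind; a sketch with made-up data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
g1 = np.concatenate([rng.normal(50, 10, 28), [140.0, 150.0]])  # two gross outliers
g2 = rng.normal(45, 10, 25)                                    # unequal n is fine

print(stats.ttest_ind(g1, g2, equal_var=False))            # Welch t on the raw means
print(stats.ttest_ind(g1, g2, equal_var=False, trim=0.2))  # Yuen's trimmed t
```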
Robust example: effect size
Cohen's d is a sample statistic, so for the trimmed case we simply substitute the appropriate robust quantities: calculate Cohen's d using the trimmed means and the Winsorized variance/sd in place of the ordinary ones. Even with M-estimators, the same conceptual approach remains in effect.
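A hedged sketch of such a robust effect size: the trimmed-mean difference scaled by a pooled Winsorized standard deviation. (Some authors additionally rescale the result, e.g. by .642 for 20% trimming, so it matches Cohen's d under normality; that refinement is omitted here.)

```python
import numpy as np
from scipy import stats
from scipy.stats.mstats import winsorize

def robust_cohens_d(g1, g2, trim=0.2):
    mt1 = stats.trim_mean(g1, trim)                          # trimmed means
    mt2 = stats.trim_mean(g2, trim)
    sw1 = np.asarray(winsorize(g1, limits=[trim, trim])).std(ddof=1)
    sw2 = np.asarray(winsorize(g2, limits=[trim, trim])).std(ddof=1)
    s_pooled = np.sqrt((sw1**2 + sw2**2) / 2)                # pooled Winsorized sd
    return (mt1 - mt2) / s_pooled
```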
Summary
Given the issues regarding means, variances, and inferences based on them, a robust approach is appropriate and preferred when dealing with outliers/non-normal data: increased power, more accurate assessment of group tendencies and differences, and more accurate assessment of effect size.
If we want the best estimates and best inference in not-so-normal situations, standard methods simply don't work too well. We now have the methods and computing capacity to take a more robust approach to data analysis, and should not be afraid to use them.
J. W. Tukey (1979)
"… just which robust/resistant methods you use is not important – what is important is that you use some. It is perfectly proper to use both classical and robust/resistant methods routinely, and only worry when they differ enough to matter. But when they differ, you should think hard."
None of you use robust stats, I can tell just by looking at you!
Intro to Robust Correlation
Effect of Outliers
Outliers can artificially and dramatically increase or decrease r. Options:
Compute r with and without the outliers.
Transform the variables.
Compute a robustified r! For example, recode outliers to more conservative scores (Winsorize).
What else?
r is the starting point for regression and related methods; both the slope and the magnitude of the residuals reflect r (if r = 0, the slope = 0). As such, a lone r is really just a starting point for understanding the relationship between two variables, but since it is that foundation, we'd like a good assessment of it.
Robust Approaches to Correlation
Rank approaches
Winsorized correlation
Percentage bend correlation
Rank approaches: Spearman's rho and Kendall's tau
Spearman's rho is calculated with the same formula as Pearson's r, but on variables converted to ranks. Simply rank the available data: X = 10 15 5 35 25 becomes X = 2 3 1 5 4. Do this for X and Y and calculate r as normal.
Kendall's tau is another rank-based approach, though the details of its calculation differ. For theoretical reasons it may be preferable to Spearman's rho, but the two should be largely consistent, and both perform better than Pearson's r when dealing with non-normal data.
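Both coefficients are one call in SciPy; the Y values here are hypothetical, made up to pair with the slide's X:

```python
import numpy as np
from scipy import stats

x = np.array([10, 15, 5, 35, 25])
y = np.array([3, 4, 2, 9, 40.0])       # hypothetical Y with one wild value

print(stats.rankdata(x))               # [2. 3. 1. 5. 4.], the ranks from the slide
print(stats.pearsonr(x, y))            # dominated by the outlier
print(stats.spearmanr(x, y))           # Pearson r computed on the ranks
print(stats.kendalltau(x, y))
```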
Winsorized Correlation
As mentioned before, Winsorizing involves changing some decided-upon percentage of extreme scores (high and low) to the most extreme value that is not Winsorized. Winsorize the X and Y values (each without regard to the other) and compute Pearson's r.
This has an advantage over rank-based approaches in that the nature of the scales of measurement remains unchanged. For theoretical reasons (recall our earlier discussion of the standard error for trimmed means), Winsorizing is preferable to trimming for correlation, though trimming is preferable for group comparisons.
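A sketch of the procedure (note the p-value returned by pearsonr ignores the Winsorizing; Wilcox adjusts the degrees of freedom for the proper test, which is skipped here):

```python
import numpy as np
from scipy import stats
from scipy.stats.mstats import winsorize

def win_cor(x, y, trim=0.2):
    xw = np.asarray(winsorize(x, limits=[trim, trim]))  # Winsorize X on its own
    yw = np.asarray(winsorize(y, limits=[trim, trim]))  # ...and Y on its own
    return stats.pearsonr(xw, yw)                       # ordinary r on the new scores
```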
Methods Related to M-estimators
The percentage bend correlation utilizes the median and a generalization of the MAD. A criticism of the Winsorized correlation is that the amount of Winsorizing is fixed in advance rather than determined by the data; r_pb gets around that.
While the details get a bit technical, you can get a sense of what is going on from what you already know about the robust approach in general. With independent X and Y variables (i.e., r = 0) and under normality, the values of these robust correlations match the Pearson r. With non-normal data, the robust approaches described guard against outliers on the respective X and Y variables, while Pearson's r does not.
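A deliberately simplified sketch of the percentage bend correlation (after Wilcox), substituting the median for the percentage bend measure of location to keep things short; the key idea is that standardized deviations are "bent" (capped at ±1) before correlating:

```python
import numpy as np

def pbcor_sketch(x, y, beta=0.2):
    def bent_scores(v):
        w = np.sort(np.abs(v - np.median(v)))
        omega = w[int(np.floor((1 - beta) * len(v))) - 1]  # a generalized MAD
        return np.clip((v - np.median(v)) / omega, -1, 1)  # the "bend"
    a, b = bent_scores(x), bent_scores(y)
    return np.sum(a * b) / np.sqrt(np.sum(a**2) * np.sum(b**2))
```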
Problem
While these alternative methods help in some sense, an issue remains: with correlation we are not considering the variables in isolation. A value that is an outlier on one variable or the other might not be a bivariate outlier; conversely, a bivariate outlier may not contain values that are outliers on X or Y themselves.
Global Measures of Association
Measures are available that take the bivariate nature of the situation into account:
Minimum Volume Ellipsoid estimator (MVE)
Minimum Covariance Determinant estimator (MCD)
Minimum Volume Ellipsoid Estimator
Robust elliptic plot (relplot): a relplot is a kind of bivariate boxplot, in which the inner ellipse contains half the values and anything outside the outer ellipse would be considered an outlier.
A strategy for robust estimation of correlation is to find the ellipse with the smallest area that contains half the data points, then calculate the correlation from those points. That is the MVE.
Minimum Covariance Determinant Estimator
The MCD is another alternative, built on the notion of a generalized variance: a measure of the overall variability in a cloud of points, given by the determinant of the covariance matrix. For the two-variable case, |S| = s_x² s_y² (1 − r²).
As the formula shows, the more tightly the points pack around a line, the larger r (a measure of linear association) becomes, and consequently the smaller the generalized variance. The MCD picks the half of the data that produces the smallest generalized variance and calculates r from that half.
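scikit-learn implements the MCD (via the FAST-MCD algorithm); a sketch computing a robust correlation from its covariance estimate, with simulated data:

```python
import numpy as np
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(0)
xy = rng.multivariate_normal([0, 0], [[1, .8], [.8, 1]], size=200)
xy[:5] = [4.0, -4.0]                       # plant a few bivariate outliers

mcd = MinCovDet(random_state=0).fit(xy)
cov = mcd.covariance_                      # np.linalg.det(cov) is the generalized variance
r_robust = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])
r_classic = np.corrcoef(xy.T)[0, 1]
print(r_classic, r_robust)                 # the classical r is pulled down; the MCD r is not
```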
Global Measures of Association
Note that both the MVE and MCD extend to situations with more than two variables; we'd just be dealing with a larger covariance matrix. (The original slides demonstrated this with the Robust library in S-Plus. OMG! Drop-down menus even!)
Remaining Issues: Curvature
The fact is that straight lines may not capture the true story. We may often fail to find a noticeable relationship because r, whichever "Pearsonesque" version we choose, is trying to capture a linear relationship. There may still be a relationship, and a strong one, just a more complex one.
Summary
Correlation, in terms of Pearson's r, gives us a sense of the strength of the linear association between two variables. One data point can render it a useless measure, as it is not robust to outliers.
Robust alternatives are available, and some take the bivariate nature of the data into account. However, curvilinear relationships may exist, and we should examine the data to see whether alternative explanations are viable.
Intro to Robust Regression
Violating Assumptions
In the usual situation, slight problems may not produce much change in type I error; however, type II error is a major concern even with modest violations. With multiple violations, type I error may also suffer. Additional assumptions come into play with multiple predictors.
Outliers
Because outliers can greatly influence r, they will naturally influence any analysis built on it, so detecting and dealing with outliers is part of the process of regression analysis. One issue is distinguishing univariate from multivariate outliers: a data point might be an outlier on a single variable yet not with respect to the model, and conversely, an outlier for the model might not have any of its individual variable values flagged as outliers.
Robust Regression
A single unusual point can greatly distort the picture of the relationship among variables. Heteroscedasticity, even in otherwise 'normal' situations, inflates the standard error of estimate and deflates our estimate of R². Nonnormality can hamper our ability to produce useful interval estimates for slopes. A couple of examples will give you an idea of the general approach.
Least Median of Squares
Instead of minimizing the sum of the squared residuals, find the slope and intercept that minimize the median of the squared residuals. While conceptually straightforward, it doesn't seem to perform as well in general as other robust approaches (a sketch covering both this and least trimmed squares follows the next slide).
Least Trimmed Squares
The least trimmed squares (LTS) approach trims the most extreme residuals (those largest in absolute value). If h is the number of values left after trimming and r²(1) ≤ r²(2) ≤ … ≤ r²(n) are the ordered squared residuals, the goal is to minimize the sum of the h smallest squared residuals, Σ from i = 1 to h of r²(i).
Note again that the optimal trimming amount is about .2.
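A brute-force sketch covering both LTS and (with the median objective) least median of squares; the Nelder-Mead optimizer, the random restarts, and the keep proportion are illustration choices, not a production algorithm:

```python
import numpy as np
from scipy.optimize import minimize

def trimmed_ss(beta, x, y, h):
    r2 = np.sort((y - beta[0] - beta[1] * x) ** 2)
    # Sum of the h smallest squared residuals (LTS);
    # returning np.median(r2) here would give least median of squares instead.
    return r2[:h].sum()

def lts_fit(x, y, keep=0.8, n_starts=25, seed=0):
    rng = np.random.default_rng(seed)
    h = int(np.ceil(keep * len(x)))
    best = None
    for _ in range(n_starts):
        i, j = rng.choice(len(x), size=2, replace=False)
        if x[i] == x[j]:
            continue
        slope = (y[j] - y[i]) / (x[j] - x[i])        # line through two random points
        start = np.array([y[i] - slope * x[i], slope])
        res = minimize(trimmed_ss, start, args=(x, y, h), method="Nelder-Mead")
        if best is None or res.fun < best.fun:
            best = res
    return best.x                                     # [intercept, slope]
```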
Least Trimmed Absolute Value
Same approach, but rather than minimizing the trimmed sum of squared residuals, we minimize the sum of the absolute residuals remaining after trimming. This may be preferable to LTS in heteroscedastic situations.
Theil-Sen Estimator
For any pair of data points (cases), we can plot the two points, draw the line connecting them, and note its slope. E.g., with 4 data points, X = 1, 2, 3, 4 and Y = 5, 7, 11, 15, we can calculate 6 such slopes.
If each slope is weighted by the squared difference in the X values of the corresponding points, the weighted average of all these pairwise slopes equals the least squares slope for the model. E.g., the line through (1, 5) and (2, 7) has slope = 2, weighted by (1 − 2)² = 1.
What if, instead of a weighted average, we take the median of those slopes as our slope estimate? That, in essence, is the Theil-Sen estimator.
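A sketch verifying the slide's arithmetic and showing SciPy's implementation:

```python
import numpy as np
from itertools import combinations
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([5.0, 7.0, 11.0, 15.0])

pairs = list(combinations(range(len(x)), 2))                # the 6 pairs of points
slopes = np.array([(y[j] - y[i]) / (x[j] - x[i]) for i, j in pairs])
weights = np.array([(x[j] - x[i]) ** 2 for i, j in pairs])

print(np.average(slopes, weights=weights))   # 3.4, the least squares slope
print(np.median(slopes))                     # the Theil-Sen slope estimate
print(stats.theilslopes(y, x)[0])            # SciPy's Theil-Sen slope
```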
Summary
In single-predictor situations, alternatives are available that perform well in ideal situations and much better than the least squares approach in others, Theil-Sen in particular. While we have kept to a single predictor, that will rarely be our research situation when using regression analysis. These methods generalize to the multiple-predictor setting, but their breakdown point (i.e., their resistance advantage) decreases as more predictors enter the equation, save for the recently developed 'deepest regression line' method, which appears to maintain its breakdown point.
Summary
Again we call on Tukey's suggestion: use these methods for comparison with what you see from traditional approaches. A general approach:
Check for linearity, perhaps using a smoother (the default for scatterplots in the R-commander menu system; a sketch follows below).
If things look fine there, use an estimator with a breakdown point of about .2-.3 and compare it with the least squares output.
If notable differences between the LS and robust results exist, figure out why and determine which is more appropriate.
If the assumptions are tenable and there is little difference between the LS and robust results, feel comfortable going with the LS output.
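For the linearity check, a lowess smoother works well; a sketch using statsmodels, with made-up nonlinear toy data:

```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 150)
y = np.sin(x) + rng.normal(0, 0.3, 150)   # a clearly nonlinear relationship

smoothed = lowess(y, x, frac=2/3)          # returns sorted (x, fitted y) pairs
plt.scatter(x, y, s=10, alpha=0.5)
plt.plot(smoothed[:, 0], smoothed[:, 1], color="red")
plt.show()
```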