Savannah Guo, Blair Marquardt, Tim Thorley 3/12/15

Quantile Regression vs Winsorization as methods for dealing with outlier prone processes

2 Outliers – What are they?
A data point that lies outside the range of the rest of the data. An observation that is distant from other observations. Both definitions describe outliers adequately, but they are too vague to guide critical decisions about your data.

3 Outliers – How do they occur?
Measurement error. Examine or “scrub” your data for typical errors. You may never catch all errors, and some errors do not lie at the extremes of x and y.

4 Outliers – How do they occur?
Extreme, but non-error, values. Extreme but accurate data may tell an important story. Perhaps these data were generated by a different process, or they are alerting you to critical omitted variables.

5 Outliers – How do they affect data analysis?
Different types of outliers have different impacts on data analysis. Some observations that do not appear to be outliers may still be influential. Influential observations are data points that have a large impact on the calculated values of various estimates (e.g., the mean, regression coefficients, standard errors).

6 Outliers and Leverage (Blatna, 2006)
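Extreme values on the Y (response, DV) variable are outliers; they may represent model failure, that is, a different process. Extreme values on the X (predictor, IV) variable are leverage points. A good leverage point is extreme in X but lies close to the regression line; a bad leverage point is extreme in X and lies far from it.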
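
The distinction between good and bad leverage points can be sketched numerically. The following is a minimal illustration in plain Python, not the course's R code; the data, the function names `ols_fit` and `hat_values`, and the 2p/n cutoff (a common rule of thumb, with p = 2 coefficients) are assumptions made for the example.

```python
def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def hat_values(x):
    """Leverage h_i = 1/n + (x_i - mean_x)^2 / Sxx for simple linear regression."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return [1.0 / n + (xi - mean_x) ** 2 / sxx for xi in x]


# Hypothetical data: nine ordinary x values and one that is far out in x.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
y_good = [1 + 2 * xi for xi in x]   # last point lies on the line: good leverage
y_bad = y_good[:-1] + [10]          # last point far from the line: bad leverage

leverage = hat_values(x)
cutoff = 2 * 2 / len(x)             # rule-of-thumb threshold 2p/n with p = 2
_, slope_good = ols_fit(x, y_good)
_, slope_bad = ols_fit(x, y_bad)

print("high-leverage indices:", [i for i, h in enumerate(leverage) if h > cutoff])
print("slope with good leverage point:", slope_good)
print("slope with bad leverage point:", slope_bad)
```

Leverage depends only on x, so the last point has the same hat value in both data sets; only when its y value departs from the line does it drag the fitted slope away from 2.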

7 Outlier examples in R
The code used in the outlier examples comes from Dr. Westfall’s Regression class webpage, including last year’s (2014) material.

8 Quantile Wage Example
First example under 3/12 on the course webpage.

9 Quantile regression
Fits a regression function for each quantile (decile, quartile, quintile, etc.). Comparing OLS and quantile regression: OLS estimates the mean of the distribution of Y conditional on X, whereas quantile regression estimates the quantiles of the distribution of Y conditional on X.

10 Conditional quantile function
The parameters of the conditional distribution p(y|x) are estimated separately for each chosen quantile (median, quartile, decile, etc.) of Y.

11 Optimization
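
In standard notation, OLS and quantile regression can be written as two minimization problems. The symbols $\tau$, $\rho_\tau$, $\beta$, and $x_i$ are assumptions here; they follow the usual presentation in Koenker and Hallock (2001) rather than anything stated in this text.

```latex
% OLS: minimize squared residuals (estimates the conditional mean)
\hat{\beta}_{\mathrm{OLS}} = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - x_i^{\top}\beta \right)^2

% Quantile regression at quantile \tau \in (0,1): minimize the check loss
\hat{\beta}(\tau) = \arg\min_{\beta} \sum_{i=1}^{n} \rho_{\tau}\!\left( y_i - x_i^{\top}\beta \right),
\qquad
\rho_{\tau}(u) = u \left( \tau - \mathbf{1}\{u < 0\} \right)
```

For $\tau = 0.5$ the check loss equals half the absolute residual, so median regression coincides with least absolute deviations.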

12 Optimization (cont’d)
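
To see why minimizing the check (pinball) loss yields a quantile, it helps to drop the predictors and fit a single constant. The sketch below is plain Python with made-up data; `check_loss` and `best_constant` are hypothetical helper names, and the search over observed values relies on the loss being convex and piecewise linear, so that a minimizer can always be found at a data point.

```python
def check_loss(u, tau):
    """Check loss rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def total_loss(c, ys, tau):
    """Sum of check losses when every observation is predicted by the constant c."""
    return sum(check_loss(y - c, tau) for y in ys)


def best_constant(ys, tau):
    """Observed value that minimizes the total check loss (a sample tau-quantile)."""
    return min(sorted(set(ys)), key=lambda c: total_loss(c, ys, tau))


# Hypothetical sample with one extreme value.
ys = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100]

sample_mean = sum(ys) / len(ys)        # minimizes squared error instead
median_fit = best_constant(ys, 0.5)    # tau = 0.5 gives the median
upper_fit = best_constant(ys, 0.9)     # tau = 0.9 gives an upper quantile

print("mean:", sample_mean)
print("tau = 0.5 minimizer:", median_fit)
print("tau = 0.9 minimizer:", upper_fit)
```

The extreme value pulls the mean above every ordinary observation, while the check-loss minimizers stay inside the bulk of the data. Quantile regression applies the same loss to residuals from a line instead of a constant.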

13 Strengths
Gives a more complete picture of the data-generating process. Handles “outliers” (birth-weight example in the reading material). Captures systematic differences in the process, i.e., heteroscedasticity (income example). Requirement: the dependent variable must be continuous.

14 An accounting example (Armstrong et al. 2014)
Examines the relation between the board’s financial knowledge and the level of tax avoidance. Prior research using OLS shows mixed results. The paper argues that it is important to examine the tails of the tax avoidance distribution. Findings (next slide)

15 An accounting example (cont’d)

16 Other examples
Peak energy use example often used by Dr. Westfall. Household income example from the reading material (R code from Koenker 2012).

17 Winsorization
The transformation of statistics by limiting or replacing extreme values. Developed by Charles P. Winsor. Used heavily in accounting and finance research.

18 Winsorization – How does it work?
Replaces data points that have extreme residuals with the values at chosen percentiles (e.g., the 1st and 99th, or the 5th and 95th). Because it uses quantiles, the winsorized variable must be continuous (not a dummy variable). Beneficial when the outliers represent a different process than the one of interest in our study. Note that this differs from truncation: winsorization replaces extreme values, whereas truncation drops them.
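
The contrast between winsorizing and truncating can be sketched in a few lines of plain Python. This example winsorizes the values of a single variable rather than residuals, uses linear-interpolation percentiles, and runs on made-up data; those choices, and the function names, are assumptions for illustration only.

```python
def percentile(sorted_vals, p):
    """Linear-interpolation percentile (p in [0, 100]) of an already sorted list."""
    n = len(sorted_vals)
    if n == 1:
        return sorted_vals[0]
    pos = (n - 1) * p / 100.0
    lower = int(pos)
    upper = min(lower + 1, n - 1)
    frac = pos - lower
    return sorted_vals[lower] + frac * (sorted_vals[upper] - sorted_vals[lower])


def cut_points(values, lower_pct, upper_pct):
    """Return the (low, high) percentile values used as limits."""
    s = sorted(values)
    return percentile(s, lower_pct), percentile(s, upper_pct)


def winsorize(values, lower_pct=5.0, upper_pct=95.0):
    """Replace values beyond the cut points with the cut points themselves."""
    lo, hi = cut_points(values, lower_pct, upper_pct)
    return [min(max(v, lo), hi) for v in values]


def truncate(values, lower_pct=5.0, upper_pct=95.0):
    """Drop values beyond the cut points instead of replacing them."""
    lo, hi = cut_points(values, lower_pct, upper_pct)
    return [v for v in values if lo <= v <= hi]


# Hypothetical variable: 1..99 plus one extreme observation.
data = list(range(1, 100)) + [1000]

winsorized = winsorize(data)
truncated = truncate(data)

print("n original, winsorized, truncated:", len(data), len(winsorized), len(truncated))
print("max original, winsorized:", max(data), max(winsorized))
```

Winsorizing keeps the sample size and pulls the extreme observation in to the 95th percentile, while truncating shrinks the sample.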

19 Limitations
Changes the data we analyze. Not appropriate for extrapolation or for developing prediction intervals. Originally designed to replace data points based on the residuals. Often, researchers winsorize all or certain independent and dependent variables, even if the winsorized data point is close to the fitted value. Reduces variability in the data. Biases standard errors downward.
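
The last two limitations can be seen in a small numeric check. This is a sketch on made-up data using Python's `statistics` module, with the standard error of the mean standing in for regression standard errors; the 5%/95% cut points and the helper names are assumptions.

```python
import math
import statistics


def winsorize(values, lower_pct=5, upper_pct=95):
    """Clamp values to the given integer percentiles (linear interpolation)."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")  # 99 cut points
    lo, hi = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, lo), hi) for v in values]


def standard_error_of_mean(values):
    """Sample standard deviation divided by sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))


# Hypothetical variable: 1..99 plus one extreme observation.
data = [float(v) for v in range(1, 100)] + [1000.0]
winsorized = winsorize(data)

se_raw = standard_error_of_mean(data)
se_winsorized = standard_error_of_mean(winsorized)

print("standard error, raw data:", se_raw)
print("standard error, winsorized data:", se_winsorized)
```

The sample size is unchanged, but the clamped tails carry less spread, so any standard error built from the winsorized values is smaller than one built from the raw data.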

20 Demonstrations
Winsorization in R: peak energy data from Dr. Westfall. Summary demonstrations comparing OLS, quantile regression, and winsorized OLS. Example 1: heteroscedastic model. Example 2: simulation of a known process with emphasis on outliers.
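
A rough stand-in for the second demonstration, written in plain Python rather than R and not taken from the course material: a known process y = 1 + 2x with small deterministic noise, two contaminated observations, and three slope estimates. The data, the 10%/90% cut points (chosen because 10% of the points are contaminated), and the brute-force median regression are all assumptions for illustration. The brute force relies on the fact that some optimal least-absolute-deviations line passes through two data points.

```python
def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def percentile(sorted_vals, p):
    """Linear-interpolation percentile (p in [0, 100]) of a sorted list."""
    pos = (len(sorted_vals) - 1) * p / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(sorted_vals) - 1)
    return sorted_vals[lower] + (pos - lower) * (sorted_vals[upper] - sorted_vals[lower])


def winsorize(values, lower_pct, upper_pct):
    """Clamp values to the given percentiles."""
    s = sorted(values)
    lo, hi = percentile(s, lower_pct), percentile(s, upper_pct)
    return [min(max(v, lo), hi) for v in values]


def median_fit(x, y):
    """Median (tau = 0.5) regression by brute force over all pairs of points."""
    best = None
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            if x[i] == x[j]:
                continue  # a vertical line is not a valid candidate
            slope = (y[j] - y[i]) / (x[j] - x[i])
            intercept = y[i] - slope * x[i]
            loss = sum(abs(yk - intercept - slope * xk) for xk, yk in zip(x, y))
            if best is None or loss < best[0]:
                best = (loss, intercept, slope)
    return best[1], best[2]


# Known process: y = 1 + 2x with alternating +/-0.5 noise (true slope is 2).
x = list(range(20))
y = [1 + 2 * xi + (0.5 if xi % 2 == 0 else -0.5) for xi in x]
y[18] += 40  # two contaminated observations at high x
y[19] += 40

_, ols_slope = ols_fit(x, y)
_, median_slope = median_fit(x, y)
_, winsorized_slope = ols_fit(x, winsorize(y, 10, 90))

print("OLS slope:", ols_slope)
print("median regression slope:", median_slope)
print("winsorized OLS slope:", winsorized_slope)
```

In this setup the OLS slope is pulled well above 2, while the median regression and winsorized OLS slopes stay close to it. The winsorized result depends on the cut points: with 5%/95% limits on 20 observations, the two contaminated values would be left almost untouched.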

21 References
Armstrong, C. S., Blouin, J. L., Jagolinzer, A. D., & Larcker, D. F. (2014). Corporate Governance, Incentives, and Tax Avoidance. Working Paper.
Blatna, D. ( ). Outliers in regression. Trutnov, Vol. 30.
Ghosh, D., & Vogt, A. (2012). Outliers: An evaluation of methodologies. In Joint Statistical Meetings.
Koenker, R. (2012). Quantile Regression in R: A Vignette. Working Paper.
Koenker, R., & Hallock, K. F. (Fall 2001). Quantile Regression. Journal of Economic Perspectives, Vol. 15, No. 4.
Kriegel, H., Kroger, P., & Zimek, A. (2010). Outlier detection techniques. Retrieved from
Leone, A. J., Minutti-Meza, M., & Wasley, C. (2013, August). Influential Observation and Inference in Accounting Research. Working Paper. Retrieved from Working Paper:
Logan, J., & Petscher, Y. (2013, May 23). An Introduction to Quantile Regression. Retrieved from Modern Modeling Methods Conference Presentation:
Mosteller, F., & Tukey, J. (1977). Data Analysis and Regression: A Second Course in Statistics. Reading, MA: Addison-Wesley.
Outlier. (2015, February 11). Retrieved from Wikipedia:
Quantile Regression. (2015, February 16). Retrieved from Wikipedia:
Shaw-Allen, P., Suter II, G., Cormier, S., & Yuan, L. (2012, July 31). Quantile Regression: Details. Retrieved from US Environmental Protection Agency:
Winsorising. (2015, January 31). Retrieved from Wikipedia:
Yale, C., & Forsythe, A. B. (August 1976). Winsorized Regression. Technometrics, Vol. 18, No. 3.

