Nasty data… When killer data can ruin your analyses


Nasty data… When killer data can ruin your analyses JENA GRADUATE ACADEMY Dr. Friedrich Funke

Learning Objectives: What will you have learnt today?
- Why to inspect your data
- Why data become nasty
- How to inspect your data
- Coping strategies

Why to inspect your data? Assumptions of parametric tests (e.g. ANOVA): the error terms are randomly, independently, and normally distributed, with a mean of zero and a common variance (homoscedasticity).

Why to inspect your data? Basic statistical method – Ordinary least squares (OLS)
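
As a minimal sketch of why OLS is sensitive to nasty data, the Python snippet below fits a straight line by least squares to made-up numbers and then refits after one value is mistyped as 55 instead of 5. The function name `ols_fit` and all data are assumptions for illustration and do not come from the slides.

```python
from statistics import mean

def ols_fit(x, y):
    """Return (intercept, slope) minimising the sum of squared residuals."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical data lying exactly on y = 2x + 1.
x = [1, 2, 3, 4, 5, 6]
y_clean = [3, 5, 7, 9, 11, 13]
y_typo = [3, 55, 7, 9, 11, 13]  # second value keyed in as 55 instead of 5

clean_fit = ols_fit(x, y_clean)
typo_fit = ols_fit(x, y_typo)

print("clean data: intercept=%.2f slope=%.2f" % clean_fit)
print("with typo:  intercept=%.2f slope=%.2f" % typo_fit)
```

In this toy example a single keying error is enough to turn a slope of 2 into a negative slope, because squared residuals give extreme values a disproportionate weight.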

Where are we?
- Why to inspect your data → violation of assumptions
- Why data become nasty: input errors (55 instead of 5), dropout/non-response, human nature keeps the game interesting
- How to inspect your data
- Coping strategies

Am I allowed to alter my data? (29% / 67% / 4%)
- It is unethical to alter data for any reason.
- Data points should be removed if they are outliers and there is an identifiable reason for invalidity.
- Data points should be removed if they are outliers. Extremity is reason enough.

Am I allowed to alter my data? It is unethical to alter data for any reason. A good model for most data is better than a poor model for all of your data.

Where are we?
- Why to inspect your data → violation of assumptions
- Why data become nasty
- How to inspect your data
- Coping strategies

Graphical data screening

Normal q-q plot
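
What a normal q-q plot puts on its two axes can be sketched in a few lines of Python. The snippet pairs each ordered observation with the quantile a normal distribution with the same mean and standard deviation would predict. The function name `qq_points`, the plotting positions (i - 0.5)/n, and the sample are assumptions for illustration; statistics packages may use slightly different plotting positions.

```python
from statistics import NormalDist, mean, stdev

def qq_points(sample):
    """Pair each ordered observation with its theoretical normal quantile."""
    n = len(sample)
    ordered = sorted(sample)
    fitted = NormalDist(mean(sample), stdev(sample))
    # Plotting positions (i - 0.5) / n are one common convention.
    theoretical = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

# Hypothetical sample with one extreme value at the upper end.
sample = [2.1, 2.4, 2.5, 2.7, 2.9, 3.0, 3.2, 3.3, 3.6, 9.8]

points = qq_points(sample)
for expected, observed in points:
    print(f"expected {expected:6.2f}   observed {observed:6.2f}")
```

Points from normally distributed data fall close to the diagonal (observed ≈ expected). Here the largest observation lies well above its expected quantile, which is the typical signature of a right tail or an outlier.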

Tests of normality: access e.g. via EXPLORE

My data are skewed – what shall I do?
- Transformed variables are difficult to interpret
- Scales are often arbitrary → no problem of interpretation
- Find a transformation that produces the prettiest picture and skewness and kurtosis near 0 (iterative)
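
The iterative search for a transformation with skewness and kurtosis near 0 could be sketched as follows. The snippet uses simple moment-based definitions of skewness and excess kurtosis (both are 0 for a normal distribution); statistics packages often report small-sample-adjusted versions, so their numbers may differ slightly. The function name and the data are assumptions for illustration.

```python
import math
from statistics import mean

def skewness_and_kurtosis(values):
    """Moment-based skewness and excess kurtosis."""
    n = len(values)
    centre = mean(values)
    m2 = sum((v - centre) ** 2 for v in values) / n
    m3 = sum((v - centre) ** 3 for v in values) / n
    m4 = sum((v - centre) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3

# Hypothetical right-skewed scores, all >= 1 so every transformation is defined.
before = [1, 2, 2, 3, 3, 4, 5, 7, 11, 26]

candidates = {
    "none": before,
    "sqrt": [math.sqrt(v) for v in before],
    "lg10": [math.log10(v) for v in before],
    "reciprocal": [1 / v for v in before],
}

results = {name: skewness_and_kurtosis(after) for name, after in candidates.items()}
for name, (skew, kurt) in results.items():
    print(f"{name:10s} skewness={skew:6.2f}  kurtosis={kurt:6.2f}")
```

Comparing the rows shows which candidate brings both statistics closest to 0 for this particular sample; the histogram or q-q plot of the chosen variable should still be inspected afterwards.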

Common data transformations (before/after)
  COMPUTE after = sqrt(before).
or
  COMPUTE after = lg10(before+constant).
  COMPUTE after = 1/(before+constant).

Common data transformations
- Add a constant to make the smallest value > 1
- For left-skewed variables, reverse the variable (reversed = max+1-old_var)
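
The two preparation steps named above, shifting by a constant and reversing a left-skewed variable, might look like this in Python. The helper names, the `margin` parameter (how far above 1 the smallest value should land), and the data are assumptions for illustration.

```python
import math

def shift_above_one(values, margin=0.01):
    """Add a constant so the smallest value ends up just above 1."""
    constant = 1 + margin - min(values)
    return [v + constant for v in values], constant

def reflect(values):
    """Reverse a left-skewed variable: reversed = max + 1 - old_var."""
    top = max(values)
    return [top + 1 - v for v in values]

# Hypothetical left-skewed ratings with a long tail towards low values.
old_var = [1, 16, 20, 22, 23, 24, 24, 25, 25, 26]
reversed_var = reflect(old_var)               # the long tail now points right
after = [math.log10(v) for v in reversed_var]

# Hypothetical variable with zero and negative values, shifted before lg10.
shifted, constant = shift_above_one([-3, 0, 2, 8])
log_shifted = [math.log10(v) for v in shifted]

print(reversed_var)
print(round(constant, 2), [round(v, 2) for v in shifted])
```

After reflection the direction of the scale is reversed, so high values of the new variable correspond to low values of the original; this has to be kept in mind when interpreting signs of correlations or regression weights.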

To be completed with residual analysis

Rules of thumb
- Studentized deleted residuals with an absolute value greater than 2 deserve a look (greater than 4: alarm bells).
- Cook's D: several cutoffs are in use. One recommendation is to consider values to be large which exceed 4/PAn. Another suggested rule is to consider any value greater than 1 or 2 as indicating that an observation requires a careful look. Finally, some researchers look for gaps between the D values.
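
For a regression with one predictor, the quantities behind these rules of thumb can be computed by hand. The sketch below uses the standard textbook formulas for leverage, studentized deleted residuals, and Cook's D. The cutoff 4/n applied to Cook's D is one widely used convention and an assumption of this sketch, as are the function name and the data.

```python
import math
from statistics import mean

def influence_measures(x, y):
    """Leverage, studentized deleted residual and Cook's D for y = a + b*x."""
    n, p = len(x), 2  # p = number of estimated coefficients (intercept, slope)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sse = sum(e ** 2 for e in residuals)
    mse = sse / (n - p)
    rows = []
    for xi, e in zip(x, residuals):
        h = 1 / n + (xi - x_bar) ** 2 / sxx  # leverage of this case
        # Deleted residual: error variance is re-estimated without this case.
        t = e * math.sqrt((n - p - 1) / (sse * (1 - h) - e ** 2))
        cooks_d = e ** 2 / (p * mse) * h / (1 - h) ** 2
        rows.append((h, t, cooks_d))
    return rows

# Hypothetical data close to y = 2x + 1, with one wild value in the last case.
x = list(range(1, 11))
y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9, 15.2, 16.8, 19.1, 40.0]

rows = influence_measures(x, y)
n = len(x)
look_residual = [i for i, (_, t, _) in enumerate(rows) if abs(t) > 2]
look_cooks = [i for i, (_, _, d) in enumerate(rows) if d > 4 / n]

for i, (h, t, d) in enumerate(rows):
    print(f"case {i}: leverage={h:.2f}  deleted residual={t:7.2f}  Cook's D={d:.2f}")
```

Only the last case is flagged by both rules here. A flag is a prompt to look at the case and find out why it is unusual, not a licence to delete it.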

Checklist for screening data (adapted from Tabachnick & Fidell)
- Inspect univariate descriptive statistics for accuracy of input
  - out-of-range values, be aware of measurement scales
  - plausible means and standard deviations
  - coefficient of variation
- Evaluate amount and distribution of missing data: deal with problem
- Independence of variables
- Identify and deal with nonnormal variables
  - check skewness and kurtosis, probability plots
  - transform variables (if desirable)
  - check results of transformations
- Identify and deal with outliers
  - univariate outliers
  - multivariate outliers
- Check pairwise plots for nonlinearity and heteroscedasticity
- Evaluate variables for multicollinearity and singularity
- Check for spatial autocorrelation
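
The first items of the checklist (out-of-range values, plausible means and standard deviations, coefficient of variation, amount of missing data) lend themselves to a small automated check. The following Python sketch assumes a variable with a known plausible range and uses `None` for missing responses; the function name, the range, and the data are made up for illustration.

```python
from statistics import mean, stdev

def screen_variable(values, low, high):
    """Basic univariate screening for one variable with a known plausible range."""
    present = [v for v in values if v is not None]
    m, sd = mean(present), stdev(present)
    return {
        "n_missing": len(values) - len(present),
        "out_of_range": [v for v in present if not low <= v <= high],
        "mean": m,
        "sd": sd,
        "coefficient_of_variation": sd / m,
    }

# Hypothetical item on a 1-7 rating scale; None marks a missing response.
item = [4, 5, 3, None, 6, 55, 2, 5, None, 4]

report = screen_variable(item, low=1, high=7)
for key, value in report.items():
    print(key, value)
```

The mean of 10.5 on a scale that ends at 7 is itself a warning sign, and the out-of-range list points directly at the mistyped 55.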

Best practice flow chart
- Plausible range, missing, normality, outliers, homoscedasticity
- Pairwise linearity (differential skewness?)
- Studentized deleted residuals, leverage, Cook's D
- … e.g. square root, lg10, arcsin

Take home message
- Detecting nasty data is important
- Knowing how to handle them is better
- Understanding WHY they are there is most important

Francis Bacon in Novum Organum: » For whoever knows the ways of Nature will more easily notice her deviations; and, on the other hand, whoever knows her deviations will more accurately describe her ways «