Data Screening

What is it? Data screening checks that your data are free of errors and outliers and that you have met the assumptions of your analysis. Each type of analysis calls for different types of data screening. This lecture lists all the types; check the individual analysis lectures for the ones that matter most there.

BIG IMPORTANT RULE Hypothesis testing: Traditionally we use p < .05 (so you want values less than .05, because that is the cutoff for calling a result statistically significant, which is what you are trying to find).

BIG IMPORTANT RULE Data screening: We want to use p < .001 because we want to make sure things are really crazy before we fix/delete/etc. If a score falls in the most extreme 0.1% of the distribution, it's really, really different/skewed/what have you, so you would want to fix it.

The Order IN THIS ORDER:
– Accuracy
– Missing
– Outliers
– Assumptions: Additivity, Normality, Linearity, Homogeneity / Homoscedasticity

Accuracy Check for typos and problems with the dataset. Generally, you are looking for values that are out of range. – Check the min and max to see if they are within what you would expect. – Fix 'em! Or delete that data point. Do not delete the whole person, just the wrong data point.
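In R, one quick way to run this accuracy check is with summary() and a range filter. The scores data frame and the 1–7 Likert range below are made up for illustration; substitute your own dataset and valid range:

```r
# Hypothetical dataset: a 1-7 Likert item with two out-of-range typos
scores <- data.frame(id = 1:6,
                     likert = c(3, 5, 77, 1, -2, 6))

summary(scores$likert)               # min and max reveal the typos

# Flag values outside the expected 1-7 range
bad <- scores$likert < 1 | scores$likert > 7
scores$likert[bad] <- NA             # blank the bad points, keep the people
summary(scores$likert)               # min and max are now in range
```

Note that only the offending cells become NA; the participants stay in the dataset.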

Missing Data Two types of missing data: – MCAR – missing completely at random (this is the kind you want) – usually caused by accidentally skipping a question or missing a trial.

Missing Data Two types of missing data: – MNAR – missing not at random – the question itself may be causing the problem. – For instance, what if you surveyed campus about alcohol abuse? What does it mean if everyone skips the same question?

Missing Data What can I do? – MNAR – exclude or eliminate the data – MCAR – replace the data with a special function

Missing Data How much can I replace? – Depends on your sample size – in large datasets <=5% is ok. – The 5% rule applies either across (by participant – how much data are they missing?) or down (by variable – how much of that variable is missing). – Small samples = you may need to collect more data.
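The 5% rule in both directions can be sketched in R with rowMeans() and colMeans() over is.na(); the df data frame here is hypothetical:

```r
# Hypothetical data frame with some missing values
df <- data.frame(a = c(1, NA, 3, 4),
                 b = c(NA, NA, 7, 8),
                 c = c(9, 10, 11, 12))

# "Across": percent missing per participant (row)
row_missing <- rowMeans(is.na(df)) * 100

# "Down": percent missing per variable (column)
col_missing <- colMeans(is.na(df)) * 100

row_missing    # participants above 5% are candidates for attention
col_missing    # same idea for variables
```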

Missing Data What should I replace? – Do not replace categorical variables. – Do not replace demographic variables. – Do not replace data that is MNAR. – Most people replace continuous variables (interval / ratio).

Missing Data How to replace? Old ways to replace: – Mean substitution – used to be a popular way to fill in data (because it was easy), but most people no longer recommend it. – Regression – uses the data you do have to estimate the missing values.

Missing Data Problems with the old ways: – Conservative – doesn’t change the mean values used to find significant differences – Does reduce the variance, which may cause significance tests to change with a lot of missing data

Missing Data New fancy ways to replace: – Expectation maximization – now considered the best way to replace missing data – uses multiple imputation. – Creates a set of expected values for each missing point. – Using matrix algebra, the program estimates the probability of each value and picks the most likely one.
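The slide doesn't name specific code, but one widely used R tool for multiple-imputation-style replacement is the mice package. A minimal sketch, assuming mice is installed and df holds only the continuous variables you are willing to replace:

```r
# Assumes the mice package is installed: install.packages("mice")
library(mice)

# df should contain only continuous variables you are willing to replace
imputed   <- mice(df, m = 5, seed = 123, printFlag = FALSE)  # 5 imputed datasets
completed <- complete(imputed, 1)   # pull one filled-in copy of the dataset
```

Whether you analyze one completed dataset or pool across all five is a further design choice; this only shows the replacement step.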

Missing Data Practical point here: – Noncompleters – Breaking apart rows and columns for replacement

Outliers Outliers are cases with an extreme value on one variable or on multiple variables. – Univariate outliers: you are an outlier on one variable. – Multivariate outliers: you are an outlier on a combination of variables. Your pattern of data is weird.

Outliers You can check for univariate outliers, but when you have a big dataset, that’s really not necessary. – We will cover how to check for univariate outliers when necessary, since it really only applies to those analyses (when you have 1 DV like ANOVA).

Outliers Multivariate – check with Mahalanobis distance. – Mahalanobis distance – the distance of a case from the centroid of the rest of the cases. The centroid is created by plotting the means of all the variables (like an average of averages), and then seeing how far each person's scores are from that middle.

Outliers How to check: – Create Mahalanobis scores. – Use the chi-square function to find the cutoff score (anything past this score is an outlier). – df = the number of variables you are testing (important: the ones you are testing, not all the variables in the dataset!). – Use the p < .001 value.
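These steps map onto the built-in mahalanobis() and qchisq() functions in R. The dat data frame and the planted outlier below are fabricated for illustration:

```r
# Fabricated data: 100 cases on three variables, with one planted outlier
set.seed(42)
dat <- data.frame(x = rnorm(100), y = rnorm(100), z = rnorm(100))
dat[1, ] <- c(8, -8, 8)                       # an obviously weird case

# Mahalanobis scores: distance of each case from the centroid
mahal <- mahalanobis(dat, colMeans(dat), cov(dat))

# df = number of variables you are testing; use the p < .001 cutoff
cutoff <- qchisq(1 - .001, df = ncol(dat))
which(mahal > cutoff)                         # row numbers flagged as outliers
```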

Outliers

Regression-based outlier analyses: – Leverage – a score that is far out on the line but doesn't influence the regression slopes (measured by leverage values). – Discrepancy – a score that is far away from everyone else and affects the slope. – Influence – the product of leverage and discrepancy (measured by Cook's distance values).
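In R, these diagnostics come straight out of a fitted lm() model. The toy regression below is made up for illustration, and using studentized residuals as the discrepancy measure is a common convention rather than something the slide specifies:

```r
# Toy regression purely to show where each diagnostic comes from
set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50)
fit <- lm(y ~ x)

lev  <- hatvalues(fit)         # leverage
disc <- rstudent(fit)          # discrepancy (studentized residuals)
cook <- cooks.distance(fit)    # influence (combines leverage and discrepancy)

head(sort(cook, decreasing = TRUE))   # the most influential cases
```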

Outliers What do I do with them when I find them? – Ask yourself: Did they do the study correctly? Are they part of the population you wanted? – Eliminate them – Leave them in

Assumption Checks Additivity Normality Linearity Homogeneity Homoscedasticity

Additivity Additivity is the assumption that each variable adds something to the analysis. Often, this assumption is thought of as the lack of multicollinearity – which is when variables are too highly correlated. – What is too high? General rule: r > .90.

Additivity So, we don't want to use two variables in an analysis that are essentially the same. – It lowers power. – The analysis may not run. – It's a waste of time. If you get a "singular matrix" error, you've used two variables in an analysis that are too similar, or the variance of a variable is essentially zero.
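A quick R sketch for spotting pairs correlated above .90; the variables here are fabricated so that b is nearly a copy of a:

```r
# Fabricated variables: b is nearly a copy of a, c is unrelated
set.seed(7)
a  <- rnorm(100)
df <- data.frame(a = a,
                 b = a + rnorm(100, sd = 0.1),
                 c = rnorm(100))

r <- cor(df)
diag(r) <- 0                            # ignore each variable with itself
which(abs(r) > .90, arr.ind = TRUE)     # flags the a/b pair (listed both ways)
```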

Assumption Checks A quick note: many of the statistical tests you would run have their own diagnostic plots and assumption checks built in. This guide lets you apply data screening to any analysis, if you want to learn one set of rules rather than one for each analysis. (BUT there are still things that only apply to ANOVA that you'd want to add when you run ANOVA.)

Assumptions For ANOVA, t-tests, and correlation: you will use a fake regression analysis – it's considered fake because it's not the real analysis, just a way to get the information you need to do data screening. For regression-based tests: you can run the real regression analysis to get the same information. The rules are altered slightly, so make sure you note in the regression section what's different.

Assumption Checks The random thing and why chi square: – Why chi-square? – For many of these assumptions, the errors should be chi-square distributed (aka lots of small errors, only a few big ones). – However, the standardized errors should be normally distributed around zero (don’t get these two things confused – we want the actual error numbers to be chi-square distributed, the standardized ones to be normal).
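Putting the last two slides together, the fake regression can be set up in R by predicting a random chi-square variable from your real variables. Names and data below are made up; your dataset and the chi-square df you pick may differ:

```r
# Fabricated stand-in for your real continuous variables
set.seed(3)
df <- data.frame(x = rnorm(100), y = rnorm(100), z = rnorm(100))

random <- rchisq(nrow(df), df = 7)    # chi-square distributed fake DV
fake   <- lm(random ~ ., data = df)   # fake regression: predict it from everything

standardized <- rstudent(fake)        # these should look normal around zero
```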

Assumption Checks For each of these assumptions, there are ways to check the univariate and multivariate issues. Start with the multivariate checks → if those look bad, then check the univariate ones.

Normality The assumption of normality is that the sampling distribution is normal. – Not the sample. – Not the population. – Distribution bunnies to the rescue!

Normality Multivariate Normality – each variable and all linear combinations of variables are normal Given the Central Limit Theorem – with N > 30 tests are robust (meaning that you can violate this assumption and get reasonably accurate statistics).

Normality So, the way to check for normality is to use the sample. – If N > 30, we tend not to worry too much. – If N < 30 and the sample looks bad, you may consider changing analyses.
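One way to eyeball normality under this approach is a histogram of the standardized residuals from a fake screening regression like the one described earlier; the data here are simulated for illustration:

```r
# Simulated stand-in for the fake screening regression
set.seed(3)
df     <- data.frame(x = rnorm(100), y = rnorm(100), z = rnorm(100))
random <- rchisq(nrow(df), df = 7)
fake   <- lm(random ~ ., data = df)

standardized <- rstudent(fake)
hist(standardized, breaks = 15)   # want a roughly normal pile centered on zero
```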

Linearity The assumption that there is a straight-line relationship between two variables (or the combination of all the variables).
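A hedged sketch: a normal probability (Q-Q) plot of the standardized residuals is one common way to eyeball linearity in screening guides (the slide doesn't name a specific plot; data here are simulated for illustration):

```r
# Simulated data; in practice use your own fake/real screening regression
set.seed(3)
df     <- data.frame(x = rnorm(100), y = rnorm(100), z = rnorm(100))
random <- rchisq(nrow(df), df = 7)
fake   <- lm(random ~ ., data = df)

qqnorm(rstudent(fake))   # points hugging a straight line suggest linearity is ok
qqline(rstudent(fake))
```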

HomoG + S Homogeneity: equal variances – the variables (or groups) have roughly equal variances. Homoscedasticity: spread of the variance of a variable is the same across all values of the other variables.

HomoG + S How to check them: – Both of these can be checked by looking at the residual scatterplot. – Fitted values = the predicted score for a person in your regression. – Residuals = the difference between the predicted score and a person’s actual score in the regression (y – y hat).

HomoG + S How to check them: – We are plotting them against each other. In theory, the residuals should be randomly distributed (hence why we created a random variable to test with). – Therefore, they should look like a bunch of random dots.
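That residual scatterplot can be sketched in R as below; the data are simulated, and in practice you would plot the fitted values and residuals from your own fake or real regression:

```r
# Simulated data; swap in your own screening regression model
set.seed(3)
df     <- data.frame(x = rnorm(100), y = rnorm(100), z = rnorm(100))
random <- rchisq(nrow(df), df = 7)
fake   <- lm(random ~ ., data = df)

fitted_z <- scale(fitted(fake))       # standardized fitted (predicted) values
resid_z  <- scale(residuals(fake))    # standardized residuals (y - y hat)

plot(fitted_z, resid_z)
abline(h = 0)
abline(v = 0)   # dots should scatter evenly around (0, 0) with no funnel shape
```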

R-Tips Multiple datasets will be created. – Remember to use the right data set. Sometimes, you don’t have problems. – So, you would just SKIP a step making sure to use the right dataset.