Presentation transcript:

The Beast of Bias Data Screening Chapter 5

Bias
Datasets can be biased in many ways, but here are the important ones:
– Bias in parameter estimates (e.g., the mean)
– Bias in standard errors and confidence intervals
– Bias in test statistics

Data Screening
So, I've got all this data… what now?
– Please note this is going to deviate from the book a bit and is based on Tabachnick & Fidell's data screening chapter, which is fantastic but terribly technical and can cure insomnia.

Why?
Data screening is important to check for errors, outliers, and assumptions.
What's the most important?
– Always check for errors, outliers, and missing data.
– For assumptions, it depends on the type of test, because different tests have different assumptions.

The List – In Order
– Accuracy
– Missing Data
– Outliers
– It Depends (we'll come back to these):
– Correlations/Multicollinearity
– Normality
– Linearity
– Homogeneity
– Homoscedasticity

The List – In Order
Why this order?
– Because if you fix something (accuracy)
– Or replace missing data
– Or take out outliers
– ALL THE REST OF THE ANALYSES CHANGE.

Accuracy
Check for typos:
– Frequencies – you can see if there are numbers that shouldn't be in your data set
– Check: min, max, means, SD, missing values

Accuracy

Interpret the output:
– Check for high and low values in minimum and maximum
– (You can also see the missing data.)
– Are the standard deviations really high?
– Are the means strange looking?
– This output will also give you a zillion charts – great for examining Likert scale data to see if you have all ceiling or floor effects.
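A minimal sketch of these accuracy checks in R (the data frame dat and its variables are hypothetical stand-ins for your own data):

    # Hypothetical data frame with an obvious typo and a suspicious code
    dat <- data.frame(
      age   = c(23, 25, 210, 31, NA),  # 210 is almost certainly a typo
      score = c(4, 5, 3, 99, 2)        # 99 may be a miscoded missing value
    )

    summary(dat)                        # min, max, quartiles, NA counts
    sapply(dat, mean, na.rm = TRUE)     # means
    sapply(dat, sd,   na.rm = TRUE)     # standard deviations
    table(dat$score, useNA = "ifany")   # frequencies: odd codes stand out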

Missing Data
With the output you already have, you can see if you have missing data in the variables.
– Go to the first table shown in the output.
– See the line that says Missing?
– Check it out!

Missing Data
Missing data is an important problem. First, ask yourself, "why is this data missing?"
– Because you forgot to enter it?
– Because there's a typo?
– Because people skipped one question? Or the whole end of the scale?

Missing Data
Two types of missing data:
– MCAR – missing completely at random (you want this)
– MNAR – missing not at random (eek!)
There are ways to test for the type, but usually you can see it:
– Randomly missing data appears all across your dataset.
– If everyone missed question 7 – that's not random.

Missing Data
MCAR – probably caused by skipping a question or missing a trial.
MNAR – may be the question itself that's causing a problem.
– For instance, what if you surveyed campus about alcohol abuse? What does it mean if everyone skips the same question?

Missing Data
How much can I have?
– Depends on your sample size – in large datasets, <5% is OK.
– Small samples = you may need to collect more data.
Please note: there is a difference between "missing data" and "did not finish the experiment".

Missing Data
How do I check if it's going to be a big deal?
– Frequencies – you can see which variables have the missing data.
– Sample test – you can code people into two groups, then test the people with missing data against those who don't have missing data (see the sketch below).
– Regular analysis – you can also try dropping the people with missing data and see if you get the same results as your regular analysis with the missing data.
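A sketch of the two-group check in R (dat and score are hypothetical; compare the groups on a variable that is fully observed, and both groups must be non-empty):

    colSums(is.na(dat))                       # missing count per variable

    # Flag anyone with missing data, then compare the two groups
    dat$any_missing <- rowSums(is.na(dat)) > 0
    t.test(score ~ any_missing, data = dat)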

Missing Data
Deleting people / variables:
You can exclude people "pairwise" or "listwise".
– Pairwise – only excludes people when they have missing values for that analysis
– Listwise – excludes them from all analyses
Variables – if it's just an extraneous variable (like GPA), you can just delete the variable.

Missing Data
What if you don't want to delete people (you're studying a special population, or you can't get other participants)?
– There are several estimation methods to "fill in" missing data.

Missing Data
Prior knowledge – replace with an obvious value for the missing data.
– Such as the median income when people don't list it
– Best when you have been working in the field for a while
– And when there is only a small number of missing cases

Missing Data
Mean substitution – a fairly popular way to enter missing data.
– Conservative – doesn't change the mean values used to find significant differences
– Does shrink the variance, which may cause significance tests to change when there is a lot of missing data
– SPSS will do this substitution with the grand mean
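Mean substitution for a single variable in R (a minimal sketch; dat$score is hypothetical):

    m <- mean(dat$score, na.rm = TRUE)
    dat$score[is.na(dat$score)] <- m   # every replaced value sits at the mean,
                                       # so the mean is unchanged but the
                                       # variance shrinks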

Missing Data
Regression – uses the data given and estimates the missing values.
– This approach is becoming more popular, since a computer will do it for you.
– More theoretically driven than mean substitution
– Reduces variance

Missing Data
Expectation maximization (EM) – now considered the best at replacing missing data.
– Creates a set of expected values for each missing point
– Using matrix algebra, the program estimates the probability of each value and picks the most likely one

Missing Data
Multiple imputation – for dichotomous variables, uses logistic regression (similar to regular regression) to predict which category a case should go into.
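A sketch using the mice package (an assumption – the deck describes SPSS, but mice is a common R route to regression-based and multiple imputation; dat, score, and age are hypothetical):

    library(mice)                        # assumes mice is installed

    # Build 5 imputed datasets; mice picks a method per variable type
    # (e.g., logistic regression for binary factors)
    imp <- mice(dat, m = 5, seed = 123)

    # Fit the same model to each imputed dataset and pool the results
    fit <- with(imp, lm(score ~ age))
    summary(pool(fit))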

Missing Data
DO NOT mean-replace categorical variables.
– You can't be 1.5 gender.
– So, either leave them out OR pairwise eliminate them (i.e., eliminate them only for the analyses they are used in).
Continuous variables – mean replace, linear trend, etc.
– Or leave them out.

Outliers can Bias a Parameter Estimate

…and the Error associated with that Estimate

Outliers
Outlier – a case with an extreme value on one variable or on multiple variables.
Why do they happen?
– Data input error
– Missing values coded as "9999"
– Not from the population you meant to sample
– From the intended population, but the distribution has really long tails and very extreme values

Outliers – Two Types
Univariate – for basic univariate statistics.
– Use these when you have ONE DV or Y variable.
Multivariate – for some univariate statistics and all multivariate statistics.
– Use these when you have multiple continuous variables or lots of DVs.

Outliers: Univariate
In a normal z-distribution, anyone with a z-score beyond +/- 3 falls in well under 1% of the population (about 0.3%). Therefore, we want to eliminate people whose scores are SO far away from the mean that they are very strange.

Outliers: Univariate

Outliers: Univariate
Now you can scroll through and find all the scores beyond |3|, OR:
– Rerun your frequency analysis on the z-scored data.
– Now you can see which variables have a min/max beyond |3|, which will tell you which ones to look at.
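The same check in R (a sketch; assumes dat contains only numeric variables):

    z <- as.data.frame(scale(dat))            # z-score every variable

    # Which variables contain at least one |z| > 3?
    sapply(z, function(v) any(abs(v) > 3, na.rm = TRUE))

    # Which cases are flagged on any variable?
    which(rowSums(abs(z) > 3, na.rm = TRUE) > 0)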

Spotting Outliers with Graphs

Outliers: Multivariate
Now we need some way to measure distance from the mean (z-scores measure distance from the mean for one variable), but from the mean of means – all the means at once!
Mahalanobis distance
– Creates a distance from the centroid (mean of means)

Outliers: Multivariate
The centroid is created by plotting the means of all the variables in multidimensional space and measuring each case's distance from that point.
– Similar to Euclidean distance
There is no set cut-off rule, so:
– Use a chi-square table.
– df = # of variables (the DVs – the variables you used to calculate Mahalanobis)
– Use p < .001
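Mahalanobis distances in R (a sketch; dat is hypothetical and must be all numeric):

    X  <- dat[complete.cases(dat), ]
    md <- mahalanobis(X, center = colMeans(X), cov = cov(X))

    # Chi-square cut-off at p < .001, df = number of variables
    cutoff <- qchisq(1 - .001, df = ncol(X))
    which(md > cutoff)                 # flagged multivariate outliers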

Outliers
The following steps will actually give you much of the "it depends" output. You should only check those assumptions AFTER you decide what to do about outliers, so you may have to run this twice.
– Don't delete outliers twice!

Outliers
– Go to the Mahalanobis variable (the last new variable on the right).
– Right-click on the column.
– Sort DESCENDING.
– Look for scores that are past your cut-off score.

Outliers
So do I delete them?
– Yes: they are far away from the middle!
– No: they may not affect your analysis!
– It depends: I need the sample size!
SO?!
– Try it with and without them. See what happens. FISH!

Reducing Bias
Trim the data:
– Delete a certain number of scores from the extremes.
Winsorizing:
– Substitute outliers with the highest value that isn't an outlier (see the sketch below).
Analyse with robust methods:
– Bootstrapping
Transform the data:
– By applying a mathematical function to scores.
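A minimal winsorizing sketch in R (the 5th/95th percentile bounds are one common convention, not something the slides fix; dat$score is hypothetical):

    winsorize <- function(x, probs = c(.05, .95)) {
      q <- quantile(x, probs = probs, na.rm = TRUE)
      pmin(pmax(x, q[1]), q[2])   # clamp extremes to the nearest bound
    }
    dat$score_w <- winsorize(dat$score)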

Assumptions
Parametric tests based on the normal distribution assume:
– Additivity and linearity
– Normality something or other
– Homogeneity of variance
– Independence

Additivity and Linearity
The outcome variable is, in reality, linearly related to any predictors.
If you have several predictors, then their combined effect is best described by adding their effects together.
If this assumption is not met, then your model is invalid.

Additivity
One problem with additivity = multicollinearity/singularity.
– The idea that variables are too correlated to be used together, as they do not both add something to the model.

Correlation
This check is only necessary if you have multiple continuous variables: regression, multivariate statistics, repeated measures, etc.
You want to make sure that your variables aren't so correlated that the math explodes.

Correlation
Multicollinearity = r > .90
Singularity = r > .95
SPSS will give you a "matrix is singular" error when you have variables that are too highly correlated, or a "Hessian matrix not definite" error.

Correlation
Run a bivariate correlation on all the variables (see the sketch below).
Look at the scores and see if they are too high. If so:
– Combine them (average, total)
– Use only one of them
Basically, you do not want to use the same variable twice – it reduces power and interpretability.
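Checking the correlation matrix in R (a sketch; dat is hypothetical and all numeric):

    r <- cor(dat, use = "pairwise.complete.obs")

    # Pairs approaching multicollinearity (r > .90)
    high <- which(abs(r) > .90 & upper.tri(r), arr.ind = TRUE)
    data.frame(var1 = rownames(r)[high[, 1]],
               var2 = colnames(r)[high[, 2]],
               r    = r[high])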

Linearity
Assumption that the relationship between variables is linear (and not curved).
Most parametric statistics have this assumption (ANOVAs, regression, etc.).

Linearity: Univariate
You can create bivariate scatterplots and make sure you don't see curved lines or rainbows.
– Matrix scatterplots to the rescue! (See the sketch below.)
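A one-line matrix scatterplot in base R (dat is hypothetical):

    pairs(dat, panel = panel.smooth)   # the smoothed line helps reveal curves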

Linearity: Multivariate
All the combinations of the variables should be linear (especially important for multiple regression and MANOVA).
– Use the output from your fake regression for Mahalanobis.

The P-P Plot

Normally Distributed Something or Other
The normal distribution is relevant to:
– Parameters
– Confidence intervals around a parameter
– Null hypothesis significance testing
This assumption tends to get incorrectly translated as "your data need to be normally distributed".

Normally Distributed Something or Other
– Parameters – we assume the sampling distribution is normal, so if our sample is not… then our estimates of the parameters (and their errors) are not correct.
– CIs – same problem, since they are based on our sample.
– NHST – if the sampling distribution is not normal, then our test will be biased.

When Does the Assumption of Normality Matter?
In small samples.
– The central limit theorem allows us to forget about this assumption in larger samples.
In practical terms, as long as your sample is fairly large, outliers are a much more pressing concern than normality.

Normality
See page 171 for a fantastic graph about why large samples are awesome.
– Remember the magic number is N = 30.

Normality
Nonparametric statistics (chi-square, logistic regression) do NOT require this assumption, so you don't have to check.

Spotting Normality
We don't have access to the sampling distribution, so we usually test the observed data.
Central limit theorem:
– If N > 30, the sampling distribution is normal anyway.
Graphical displays:
– P-P plot (or Q-Q plot)
– Histogram
Values of skew/kurtosis:
– 0 in a normal distribution
– Convert to z (by dividing the value by its SE)
Kolmogorov-Smirnov test:
– Tests if data differ from a normal distribution
– Significant = non-normal data
– Non-significant = normal data
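These checks in R (a sketch; base R has no skew function, so this uses the moment formula with the large-sample SE approximation sqrt(6/n); dat$score is hypothetical):

    x <- dat$score

    hist(x)                      # histogram
    qqnorm(x); qqline(x)         # Q-Q plot

    # Skew converted to z: skew statistic divided by its standard error
    n    <- sum(!is.na(x))
    skew <- sum((x - mean(x, na.rm = TRUE))^3, na.rm = TRUE) /
            (n * sd(x, na.rm = TRUE)^3)
    skew / sqrt(6 / n)           # |z| > 1.96 suggests significant skew

    # K-S test against a normal with the sample's mean/sd (estimating the
    # parameters from the data makes the p-value approximate)
    ks.test(x, "pnorm", mean(x, na.rm = TRUE), sd(x, na.rm = TRUE))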

Spotting Normality with Numbers: Skew and Kurtosis

Assessing Skew and Kurtosis

Assessing Normality

Tests of Normality

Normality within Groups
The Split File command

Normality within Groups

Normality within Groups

Normality: Multivariate
All the linear combinations of the variables need to be normal.
– Use this version when you have more than one variable.
– Basically, if you ran the Mahalanobis analysis, you want to assess multivariate normality.

Homogeneity
Assumption that the variances are roughly equal across groups.
Ways to check – you do NOT want p < .001:
– Levene's test – univariate
– Box's M – multivariate
You can also check a residual plot (this will give you both uni/multivariate); see the sketches below.
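Levene's test in R via the car package (an assumption – the deck uses SPSS; score and group are hypothetical, and group should be a factor):

    library(car)                             # assumes car is installed
    leveneTest(score ~ group, data = dat)    # significant = unequal variances

    # Base R alternative (more sensitive to non-normality)
    bartlett.test(score ~ group, data = dat)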

Homogeneity
Sphericity – the assumption that, in repeated measures, the variances of the differences between the time measurements are approximately equal.
A difficult assumption…

Assessing Homogeneity of Variance

Output for Levene's Test

Homoscedasticity
The spread (variance) of one variable is the same across all values of the other variable.
– The plot can't look like a snake ate something, or like a megaphone.
The best way to check is by looking at scatterplots (see the sketch below).
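A residual plot sketch in R (the model and variables are hypothetical; you want a flat, evenly spread cloud):

    fit <- lm(score ~ age, data = dat)
    plot(fitted(fit), rstandard(fit),
         xlab = "Fitted values", ylab = "Standardized residuals")
    abline(h = 0, lty = 2)   # fan/megaphone shapes signal heteroscedasticity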

Homoscedasticity / Homogeneity of Variance
Violations can affect the two main things that we might do when we fit models to data:
– Estimating parameters
– Null hypothesis significance testing

Spotting problems with Linearity or Homoscedasticity

Homogeneity of Variance

Independence
The errors in your model should not be related to each other.
If this assumption is violated:
– Confidence intervals and significance tests will be invalid.
– You should apply the techniques covered in Chapter 20.

Transforming Data
Log transformation, log(Xi):
– Reduces positive skew.
Square root transformation, √Xi:
– Also reduces positive skew. Can also be useful for stabilizing variance.
Reciprocal transformation, 1/Xi:
– Dividing 1 by each score also reduces the impact of large scores. This transformation reverses the scores; you can avoid this by reversing the scores before the transformation: 1/(X_highest – Xi).
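The three transformations in R (a sketch; x must be positive for the log and reciprocal versions – add a constant first if it contains zeros):

    x <- dat$score                     # hypothetical variable

    log_x   <- log(x)
    sqrt_x  <- sqrt(x)
    recip_x <- 1 / x

    # Reciprocal with the order of scores preserved, per the slide's formula
    # (note: as written, the largest score maps to 1/0, so in practice a
    # constant is usually added to the denominator)
    recip_rev <- 1 / (max(x, na.rm = TRUE) - x)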

Log Transformation: before vs. after

Square Root Transformation: before vs. after

Reciprocal Transformation: before vs. after

But…: before vs. after

To Transform… or Not
Transforming the data helps as often as it hinders the accuracy of F (Games & Lucas, 1966).
Games (1984):
– The central limit theorem: the sampling distribution will be normal in samples > 40 anyway.
– Transforming the data changes the hypothesis being tested – e.g., when using a log transformation and comparing means, you change from comparing arithmetic means to comparing geometric means.
– In small samples it is tricky to determine normality one way or another.
– The consequences for the statistical model of applying the "wrong" transformation could be worse than the consequences of analysing the untransformed scores.

SPSS Compute Function
Be sure you understand how to:
– Create an average score: mean(var,var,var)
– Create a random variable: I like rv.chisq, but rv.normal works too
– Create a sum score: sum(var,var,var)
– Take a square root: sqrt(var)
– Etc. (page 207)
(Rough R equivalents are sketched below.)
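Rough R equivalents of those compute examples (a sketch; q1–q3 are hypothetical variable names):

    dat$avg  <- rowMeans(dat[, c("q1", "q2", "q3")], na.rm = TRUE)
    dat$tot  <- rowSums(dat[, c("q1", "q2", "q3")],  na.rm = TRUE)
    dat$rand <- rchisq(nrow(dat), df = 1)   # rnorm(nrow(dat)) works too
    dat$root <- sqrt(dat$q1)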