
» So, I’ve got all this data…what now?

» Data screening – important to check for errors, assumptions, and outliers. » What’s the most important? ˃It depends on the type of test, because different tests have different assumptions.

» Accuracy » Missing Data » Outliers » It Depends: ˃Correlations ˃Normality ˃Linearity ˃Homogeneity ˃Homoscedasticity

» Why this order? ˃Because if you fix something (accuracy) ˃Or replace missing data ˃Or take out outliers ˃ALL THE REST OF THE ANALYSES CHANGE.

» Check for typos ˃Frequencies – you can see if there are numbers that shouldn’t be in your data set ˃Check: +Min +Max +Means +SD +Missing values
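
If you want to run the same accuracy check outside SPSS, here is a minimal pandas sketch (the file name survey.csv and the item q1 are made up for illustration):

```python
import pandas as pd

df = pd.read_csv("survey.csv")  # hypothetical file name

# min, max, mean, and SD for every numeric variable at once
print(df.describe().T[["min", "max", "mean", "std"]])

# missing-value count per variable
print(df.isna().sum())

# frequency table for one (hypothetical) item: impossible codes jump out here
print(df["q1"].value_counts(dropna=False).sort_index())
```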

» Interpret the output: ˃Check for high and low values in the minimum and maximum ˃(You can also see the missing data.) ˃Are the standard deviations really high? ˃Are the means strange-looking? ˃This output will also give you a zillion charts – great for examining Likert-scale data to see if you have ceiling or floor effects.

» With the output you already have, you can see if you have missing data in the variables. ˃Go to the first table shown in the output. ˃See the row that says Missing? ˃Check it out!

» Missing data is an important problem. » First, ask yourself, “why is this data missing?” ˃Because you forgot to enter it? ˃Because there’s a typo? ˃Because people skipped one question? Or the whole end of the scale?

» Two Types of Missing Data: ˃MCAR – missing completely at random (you want this) ˃MNAR – missing not at random (eek!) » There are ways to test for the type, but usually you can see it ˃Randomly missing data appears all across your dataset. ˃If everyone missed question 7 – that’s not random.

» MCAR – probably caused by skipping a question or missing a trial. » MNAR – may be the question that’s causing a problem. ˃For instance, what if you surveyed campus about alcohol abuse? What does it mean if everyone skips the same question?

» How much can I have? ˃Depends on your sample size – in large datasets <5% is ok. ˃Small samples = you may need to collect more data. » Please note: there is a difference between “missing data” and “did not finish the experiment”.
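
A quick sketch of the “how much” check, assuming the same hypothetical survey.csv — a spike of missingness on a single question is also your first hint that the data are MNAR:

```python
import pandas as pd

df = pd.read_csv("survey.csv")  # hypothetical file name

# percent missing per variable -- a spike on one question hints at MNAR
pct_missing = df.isna().mean() * 100
print(pct_missing.round(1))

# anything past the 5% rule of thumb?
print("Over 5%:", list(pct_missing[pct_missing > 5].index))
```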

» How do I check if it’s going to be a big deal? » Frequencies – you can see which variables have the missing data. » Two-group test – code people into two groups: those with missing data and those without. Test whether the groups differ on your other variables. » Regular analysis – try dropping the people with missing data and see if you get the same results as the analysis that keeps them.
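
Here is a sketch of that two-group check in Python (the file name and the variables q7 and age are hypothetical): code a missing/not-missing indicator, then test whether the groups differ.

```python
import pandas as pd
from scipy import stats

df = pd.read_csv("survey.csv")  # hypothetical file and variable names

# group 1: missing on q7; group 2: not missing on q7
missing_q7 = df["q7"].isna()

# do the two groups differ on another variable, such as age?
t, p = stats.ttest_ind(df.loc[missing_q7, "age"].dropna(),
                       df.loc[~missing_q7, "age"].dropna())
print(f"t = {t:.2f}, p = {p:.3f}")  # a small p suggests the missingness is not random
```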

» Deleting people / variables » You can exclude people “pairwise” or “listwise” ˃Pairwise – only excludes people when they have missing values for that analysis ˃Listwise – excludes them for all analyses » Variables – if it’s just an extraneous variable (like GPA) you can just delete the variable
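
A sketch of both exclusion styles in pandas, plus dropping an extraneous variable (gpa is a made-up name):

```python
import pandas as pd

df = pd.read_csv("survey.csv")  # hypothetical file name

# listwise: drop anyone missing anything, for every analysis
listwise = df.dropna()
print(len(df), "->", len(listwise), "cases after listwise deletion")

# pairwise: pandas .corr() already uses every available pair per correlation
print(df.corr(numeric_only=True))

# or drop an extraneous variable instead of dropping people
df = df.drop(columns=["gpa"])  # hypothetical variable name
```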

» What if you don’t want to delete people (a special population, or you can’t recruit more)? ˃There are several estimation methods to “fill in” missing data.

» Prior knowledge – if there is an obvious value for missing data ˃Such as the median income when people don’t list it ˃Works when you have been in the field for a while ˃And when there is only a small number of missing cases

» Mean substitution – a fairly popular way to fill in missing data ˃Conservative – doesn’t change the mean values used to find significant differences ˃Shrinks the variance, though, which may change significance tests when there is a lot of missing data ˃SPSS will do this substitution with the grand mean
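
A minimal sketch of grand-mean substitution with a made-up income variable, showing the variance shrinkage in action:

```python
import pandas as pd

df = pd.read_csv("survey.csv")          # hypothetical file name
before_sd = df["income"].std()          # hypothetical variable

# grand-mean substitution: the mean stays put, the SD shrinks
df["income"] = df["income"].fillna(df["income"].mean())
print(before_sd, "->", df["income"].std())
```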

» Regression – uses the other variables you measured to estimate the missing values ˃Becoming more popular, since a computer will do it for you. ˃More theoretically driven than mean substitution ˃Still reduces variance, because predicted values sit right on the regression line
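
A sketch of regression imputation with scikit-learn; it assumes the hypothetical predictors age and hours are themselves complete:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("survey.csv")       # hypothetical file name
predictors = ["age", "hours"]        # hypothetical, fully observed predictors

# fit on the cases that have income, predict it for the cases that don't
missing = df["income"].isna()
model = LinearRegression().fit(df.loc[~missing, predictors],
                               df.loc[~missing, "income"])
df.loc[missing, "income"] = model.predict(df.loc[missing, predictors])
```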

» Expectation maximization (EM) – now considered among the best ways to replace missing data ˃Creates a set of expected values for each missing point ˃Using matrix algebra, the program estimates the probability of each value and picks the most likely one

» Multiple Imputation – for dichotomous variables, uses logistic regression (similar to regular regression) to predict which category a case should go into
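
scikit-learn’s IterativeImputer is one readily available implementation of this model-based idea — it does single, chained-equations imputation, the same machinery that EM-style and multiple imputation build on. A hedged sketch, hypothetical file name again:

```python
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.read_csv("survey.csv")   # hypothetical file name
num = df.select_dtypes("number")

# each variable with missing values is modeled from the others, round after round
df[num.columns] = IterativeImputer(random_state=0).fit_transform(num)
```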

» DO NOT mean replace categorical variables ˃You can’t be 1.5 gender. ˃So, either leave them out OR pairwise eliminate them (aka eliminate only for the analysis they are used in). » Continuous variables – mean replace, linear trend, etc. ˃Or leave them out.

» Outlier – a case with an extreme value on one variable or on several variables » Why? ˃Data input error ˃Missing values coded as “9999” ˃Not from the population you meant to sample ˃From the right population, but the distribution has really long tails and this case is very extreme

» Outliers – Two Types » Univariate – for basic univariate statistics ˃Use these when you have ONE DV or Y variable. » Multivariate – for some univariate statistics and all multivariate statistics ˃Use these when you have multiple continuous variables or lots of DVs.

» Univariate » In a normal z-distribution, anyone with a z-score beyond +/- 3 is in the most extreme ~0.3% of the population. » Therefore, we want to eliminate people whose scores are SO far away from the mean that they are very strange.

» Univariate

» Now you can scroll through and find all the scores beyond |3| » OR ˃Rerun your frequency analysis on the z-scored data. ˃Now you can see which variables have a min or max beyond |3|, which tells you which ones to look at.
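
In Python the whole z-score check is a few lines — a sketch with the usual hypothetical survey.csv:

```python
import pandas as pd

df = pd.read_csv("survey.csv")  # hypothetical file name
num = df.select_dtypes("number")

# z-score every numeric variable (what SPSS's "save standardized values" does)
z = (num - num.mean()) / num.std()

# which variables have a min or max beyond |3|?
print(z.agg(["min", "max"]).T)

# and the flagged cases themselves
print(df[(z.abs() > 3).any(axis=1)])
```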

» Multivariate » Now we need some way to measure distance from the mean (z-scores are distances from the mean), but this time from the mean of means (all the means at once!) » Mahalanobis distance ˃Measures each case’s distance from the centroid (the mean of means)

» Multivariate » The centroid is found by plotting all the variables’ means at once (picture it in 3D) and measuring each case’s distance from that point ˃Similar to a Euclidean distance » No set cut-off rule ˃Use a chi-square table. ˃df = # of variables (the DVs/variables you used to calculate Mahalanobis) ˃Use p < .001
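
A sketch of the Mahalanobis check with numpy/scipy, using the chi-square cut-off just described (df = number of variables, p < .001); the file name is hypothetical:

```python
import numpy as np
import pandas as pd
from scipy import stats

df = pd.read_csv("survey.csv")                       # hypothetical file name
X = df.select_dtypes("number").dropna().to_numpy()   # the variables in the check

# squared Mahalanobis distance of every case from the centroid
diff = X - X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)

# chi-square cut-off: df = number of variables, p < .001
cutoff = stats.chi2.ppf(1 - .001, df=X.shape[1])
print("cut-off:", cutoff)
print("cases past it:", np.where(d2 > cutoff)[0])
```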

» The following steps will actually give you many of the “it depends” outputs. » You will only check them AFTER you decide what to do about outliers. » So you may have to run this twice. ˃Don’t delete outliers twice!

» Go to the Mahalanobis variable (last new variable on the right) » Right click on the column » Sort DESCENDING » Look for scores that are past your cut off score

» So do I delete them? » Yes: they are far away from the middle! » No: they may not affect your analysis! » It depends: I need the sample size! » SO?! ˃Try it with and without them. See what happens. FISH!

» This analysis will only be necessary if you have multiple variables » Regression, multivariate statistics, repeated measures, etc. » You want to make sure that your variables aren’t so highly correlated that the math explodes.

» Multicollinearity = r > .90 » Singularity = r > .95 » SPSS will give you a “matrix is singular” error when you have variables that are too highly correlated » Or a “Hessian matrix not definite” warning

» Run a bivariate correlation on all the variables » Look at the scores and see if any are too high » If so: ˃Combine them (average, total) ˃Or use just one of them » Basically, you do not want to use the same variable twice – it reduces power and interpretability
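
A sketch of that check in pandas — flag every pair of variables past the r > .90 and r > .95 lines:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("survey.csv")  # hypothetical file name
r = df.corr(numeric_only=True).abs()

# keep only the upper triangle so each pair is listed once
pairs = r.where(np.triu(np.ones(r.shape, dtype=bool), k=1)).stack()
print(pairs[pairs > .90])  # multicollinearity
print(pairs[pairs > .95])  # singularity
```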

» This assumption is implied for nearly everything we are going to cover in this course. » Parametric statistics (the things you know: ANOVA, MANOVA, t-tests, z-scores, etc.) – require that the underlying distribution is normal. » Why?

» However, it’s hard to know if that’s true. So you can check if the data you have is normal. » OR You can make sure you have the magical statistical number N = 30. » Why?

» Nonparametric statistics (chi-square, logistic regression) do NOT require this assumption, so you don’t have to check.

» Univariate » Check by looking at your skew and kurtosis values. » You want them to be within |3| – same idea as z-scores.

» Skewness – symmetry of a distribution ˃Skewed – the mean is not in the middle » Kurtosis – peakedness of a distribution ˃Tall and skinny, or short and fat » SPSS ˃Frequencies will give you the values for testing (see the analysis we did earlier). ˃Remember – if you changed something (deleted cases, whatever), you need to rerun those numbers!
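
A sketch of the skew/kurtosis check in pandas (hypothetical file name; pandas reports bias-adjusted skew and excess kurtosis, close to the values SPSS prints):

```python
import pandas as pd

df = pd.read_csv("survey.csv")  # hypothetical file name
num = df.select_dtypes("number")

# skew and (excess) kurtosis per variable; you want both within |3|
print(pd.DataFrame({"skew": num.skew(), "kurtosis": num.kurt()}))
```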

» Multivariate – all the linear combinations of the variables need to be normal » Use this version when you have more than one variable » Basically if you ran the Mahalanobis analysis – you want to analyze multivariate normality.

» Assumption that the relationship between variables is linear (and not curved). » Most parametric statistics have this assumption (ANOVAs, Regression, etc.).

» Univariate » You can create bivariate scatter plots and make sure you don’t see curved lines or rainbows.

» SPSS’s Chart Builder will make these scatterplots for you.
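
Outside SPSS, a scatterplot matrix does the same job in a couple of lines — a sketch with the hypothetical survey.csv:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("survey.csv")  # hypothetical file name

# one panel per pair of variables: look for curves or rainbows
pd.plotting.scatter_matrix(df.select_dtypes("number"), alpha=0.4)
plt.show()
```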

» Multivariate – all the combinations of the variables are linear (especially important for multiple regression and MANOVA) » Use the output from your fake regression for Mahalanobis.

» Assumption that the variances are roughly equal (across groups, or across combinations of variables). » Ways to check – you do NOT want p < .001: ˃Levene’s – univariate ˃Box’s – multivariate » You can also check a residual plot (that covers both the univariate and multivariate versions)
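
A sketch of Levene’s test with scipy, assuming made-up score and condition variables:

```python
import pandas as pd
from scipy import stats

df = pd.read_csv("survey.csv")  # hypothetical file and variable names

# Levene's test across groups of a hypothetical condition variable
groups = [g["score"].dropna() for _, g in df.groupby("condition")]
W, p = stats.levene(*groups)
print(f"W = {W:.2f}, p = {p:.4f}")  # you do NOT want p < .001
```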

» Sphericity – the assumption that, in repeated measures, the variances of the differences between each pair of time measurements are approximately equal » Difficult assumption…

» The spread (variance) of one variable is the same across all values of the other variable ˃The plot can’t look like a snake that ate something, or like a megaphone. » The best way to check is by looking at scatterplots.
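
A sketch of the residual-plot check: fit a quick regression (hypothetical variables again) and plot residuals against fitted values:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("survey.csv").dropna()   # hypothetical file and variables
X, y = df[["age", "hours"]], df["income"]

model = LinearRegression().fit(X, y)
fitted = model.predict(X)
resid = y - fitted

# an even band around zero = homoscedastic;
# a snake-that-ate-something or a megaphone = trouble
plt.scatter(fitted, resid, alpha=0.4)
plt.axhline(0, color="gray")
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.show()
```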