Stats Lunch: Day 2 Screening Your Data: Why and How.


Data Screening
Much of the following info comes from a more thorough (and frankly much better) description of data screening/cleaning: Tabachnick & Fidell (2001). Cleaning up your act: Screening data prior to analysis. In Using Multivariate Statistics (4th ed.). Allyn & Bacon, MA.

Why Screen?
1) To ensure the accuracy of your data, which keeps you from publishing absolute sh!t…
2) To ensure you meet the assumptions of your planned analyses, and to identify any special analysis needs. Helps ensure your stats aren't absolute sh!t…
3) To maximize the power of your data set, which helps ensure you can find at least SOME sh!t to publish…

The Usual Suspects
1) Inaccurate data
   a) Data entry errors
   b) Data merge errors
2) Missing values
3) Outliers
4) Lack of normality
5) Lack of linearity
6) Lack of homoscedasticity and other multivariate problems

Avoiding Inaccurate Data
1) KNOW WHAT YOUR DATA ARE SUPPOSED TO LOOK LIKE:
   - What the possible values are (including their range).
   - How many categories of subjects (controls, patients, etc.) there should be.
   - How many subjects should be in each group.
2) Find out what your data DO look like (a syntax sketch follows this slide):
   - Run Descriptives and Frequencies in SPSS.
   - Can use the "Explore" and/or "Crosstabs" options to look at values by group.
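
If you prefer syntax (or want a record of what you ran), something like the following does the same basic checks. The variable names (age, score, group, site) are placeholders for your own:

  * Check ranges and look for impossible values.
  DESCRIPTIVES VARIABLES=age score
    /STATISTICS=MEAN STDDEV MIN MAX.
  * Check that the grouping variable contains only the categories you expect.
  FREQUENCIES VARIABLES=group.
  * Check the N in each cell when groups are crossed with another factor.
  CROSSTABS /TABLES=group BY site.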

Example: Avoiding Inaccurate Data
1) Choose "Descriptive Statistics".
2) Choose "Explore".
3) Explore by "Factor List" (your grouping variable goes here).
4) Enter the variables you care about in "Dependent List".
5) Click on Display: "Statistics", "Plots", or "Both".
6) And then on the appropriate tabs…
Yes, I realize the crossed arrows make for a terrible slide…

Example: Avoiding Inaccurate Data (continued)
7) Choose whatever you want, then click "Continue".
8) It might be helpful to then click on "Options" and choose "Report missing values", then "Continue" and "OK".
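
If you paste the syntax instead of clicking "OK", the Explore steps above come out roughly like this EXAMINE command (score and group are placeholder names):

  EXAMINE VARIABLES=score BY group
    /PLOT=BOXPLOT HISTOGRAM NPPLOT
    /STATISTICS=DESCRIPTIVES EXTREME
    /MISSING=REPORT.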

What's wrong with this picture?
- Look for any warnings.
- Any unexpected groups.
- Unexpected Ns.
- Missing values.

What's wrong with this picture? (These are ITC data.)
- Look for means and standard deviations that are very different from the other groups.
- Do these differences make any sense?

What's wrong with this picture?
- Look for extreme cases.
- Do they look like OUTLIERS or INACCURATE DATA?

Dealing With Missing Data
- Are missing values RANDOM or SYSTEMATIC? E.g., are 2 subjects missing electrode F4, or are 50% of your ketamine group missing day-two data…
- Options if missing values are RANDOM (a syntax sketch follows this slide):
  1) Omit the case (but don't DELETE it).
  2) Substitute in a score:
     - The mean of that subject's other scores (e.g., the mean of the other frontal electrodes).
     - The group mean (e.g., the F4 mean of all other controls).
     - A predicted score from regression.
     - Lots of other more complicated substitution algorithms.
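
As a rough sketch of group-mean substitution in syntax (f4 and group are placeholder names; putting the imputed values in a new variable keeps the original intact):

  * Add each case's group mean as a new column.
  AGGREGATE
    /OUTFILE=* MODE=ADDVARIABLES
    /BREAK=group
    /f4_grpmean=MEAN(f4).
  * Copy the variable, then fill in missing values with the group mean.
  COMPUTE f4_filled = f4.
  IF (MISSING(f4)) f4_filled = f4_grpmean.
  EXECUTE.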

Dealing With Missing Data
- Options if missing values are NOT RANDOM:
  1) You're pretty much screwed… regardless of how you respond to the problem, it will affect the validity of the study. However:
     - You might randomly delete subjects in other groups to match the sample sizes with your attenuated group.
     - Analyze the data both with and without the missing cases.
     - Treat missingness as a variable: use dummy variables in multiple regression models (a sketch follows this slide).
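
A minimal sketch of the dummy-variable idea (f4 is again a placeholder); the MISSING() function returns 1 when the value is missing and 0 otherwise:

  * 1 = value was missing, 0 = value was present.
  COMPUTE f4_missing = MISSING(f4).
  EXECUTE.
  * f4_missing can then be entered as a predictor in the regression model.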

Outliers and Normality
- Parametric statistics (e.g., r, t-tests, ANOVA, regression) are built on the assumptions of NORMALITY and LINEARITY…
- If data are skewed or kurtotic, the normality assumption is violated.
- If the violation is large enough, it affects the accuracy of our results (typically toward making Type I errors).
- Outliers (extreme scores) can be a major contributor to both skewness and kurtosis (but particularly skewness).

Looking for Non-Normality
- You can get measures of skewness and kurtosis from several "Descriptives" modules in SPSS (a syntax sketch follows this slide).
- A basic rule of thumb is that if either number is greater than 1 or less than –1, you might have a problem…
- However, this depends on sample size.
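
In syntax, the same numbers come from DESCRIPTIVES (score is a placeholder name); the output also includes the standard errors of skewness and kurtosis:

  DESCRIPTIVES VARIABLES=score
    /STATISTICS=MEAN STDDEV SKEWNESS KURTOSIS.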

Looking for Outliers
- Outliers can be detected using several SPSS routines, but the easiest way is to plot your data (the same goes for checking the shape of the distribution).
- Two ways I like to do it:
1) The Histogram
   A. Go to "Graphs" and then select "Histogram".

Looking for Outliers
   B. Enter the variable you want to look at.
   C. I like to display the normal curve…
   D. Then click "OK".
Any score falling outside the curve is an outlier. (A syntax sketch follows this slide.)
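
The histogram-with-normal-curve steps above correspond roughly to this one-line GRAPH command (score is a placeholder):

  * Histogram of score with a normal curve overlaid.
  GRAPH /HISTOGRAM(NORMAL)=score.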

Looking for Outliers
- Another option is to "panel" the histogram by your grouping variable…

Looking for Outliers
2) The sneaky scatterplot method
   - Make sure your subject # variable is numeric.
   - Plot subject number on the X axis and the score of interest on the Y axis.
   - Easiest to do if you "Select Cases" in the Data menu (e.g., look at each group one at a time)…
   A. Go to "Graphs" and then select "Scatter/Dot".

Looking for Outliers
   B. Choose the kind of plot you want and click "Define".
   C. Enter the variables you want and click "OK".

Looking for Outliers
- You'll get a graph where you can easily see if there are outliers…
   D. By messing with the scale and chart size in the "Chart Editor" (double-click the graph on your output page), you can figure out the subject number of the outlier. (A syntax sketch follows this slide.)
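
In syntax, selecting one group at a time and plotting score against subject number looks roughly like this (group, subjid, and score are placeholders; FILTER hides the unselected cases rather than deleting them):

  * Look at group 1 only (turn the filter off later with FILTER OFF).
  USE ALL.
  COMPUTE filter_$ = (group = 1).
  FILTER BY filter_$.
  EXECUTE.
  * Subject number on X, score of interest on Y.
  GRAPH /SCATTERPLOT(BIVAR)=subjid WITH score.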

Dealing with Outliers
1) Make sure it's not a mistake (data entry error, impossible value, subject included in the wrong group, etc.).
2) If it's "real" data, then you can:
   - Run with the data as is (and be sure you're prepared to defend this choice).
   - Delete the case: this is dangerous ground, but can sometimes be justified (e.g., if z > 3… and again, be sure you can defend your choice).
   - Transform the data.
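
One easy way to check the "z > 3" criterion: DESCRIPTIVES with /SAVE adds a standardized (z-score) version of each listed variable to the data file (score is a placeholder; the new variable is typically named Zscore):

  * Adds a z-score version of score to the active dataset.
  DESCRIPTIVES VARIABLES=score
    /SAVE.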

Transformations
- A mathematical routine applied to ALL data in your set, designed to reduce the effect of extreme values (it helps with both skewness and kurtosis). It doesn't affect the rank order of scores!
- Common transformations:
  1) Taking the square root: good for moderate positive skew.
  2) Taking the log10: good for substantial positive skew.
  3) Taking the inverse: good for severe positive skew.
- You might need to add or subtract a constant from each score prior to transforming:
  - Add: if you have positively skewed data with some scores = 0.
  - Subtract (reflect): if you're dealing with negative skewness.

Transformations
These transforms are done using "Compute" commands in SPSS… (see Tabachnick and Fidell, Table 4.3).

Example of an SPSS Transformation
1) Click on "Transform", and then "Compute".
2) Name your new (target) variable.
3) Type in your expression.
4) Click "OK".
(A syntax sketch follows this slide.)
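
A rough syntax sketch of the common transforms (score is a placeholder; the "+ 1" guards against zeros, and the 101 in the reflection line is a hypothetical constant equal to the highest possible score plus 1):

  * Square root: positive skew.
  COMPUTE sqrt_score = SQRT(score).
  * Log10: substantial positive skew (the + 1 handles scores equal to 0).
  COMPUTE log_score = LG10(score + 1).
  * Inverse: severe positive skew.
  COMPUTE inv_score = 1/(score + 1).
  * Negative skew: reflect first, then transform.
  COMPUTE refl_score = SQRT(101 - score).
  EXECUTE.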

Example of an SPSS Transformation (continued)
5) Rerun your descriptives, graphs, etc., and compare the distributions: before, after the square root transform, and (if at first you don't succeed…) after a log transform.

Linearity
- The second major assumption of parametric stats is that relationships are LINEAR.
- If your data fail to meet this assumption, the stats won't work… even if there is a real relationship there…
- Another reason why you want to plot out your data…
- We'll deal with multivariate linearity at some later lunch…

If all else fails:
- No matter how you transform them, your data are non-normal, or you see a non-linear relationship between your variables…
1) Run the data using parametric stats, but recognize (and describe/defend) the limitations:
   - Use very conservative alpha levels (e.g., p < .001).
   - Treat your inferential stats primarily as descriptors…
2) Use nonparametric statistics. (A sketch follows this slide.)
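
For example, a Mann-Whitney U test (the nonparametric analogue of the two-sample t-test) can be run with NPAR TESTS; score and group are placeholders, and 1 and 2 are assumed group codes:

  * Mann-Whitney U comparing the groups coded 1 and 2 on score.
  NPAR TESTS /M-W=score BY group(1 2).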