Psych 706: Stats II Class #1.

Psych 706: Stats II Class #1

What is the point of this class? Reinforce the basics you learned in Stats I, focusing mostly on parametric statistics. Help you feel comfortable: exploring data and looking for outliers; choosing the appropriate statistical test(s); explaining the results of tests clearly to audiences; and displaying your results as clearly and simply as possible.

Syllabus “Dr. Stewart” or “Jenny” – either is fine Office Hours in SB A-312: by appt. Grade based on: Four homework assignments (40% grade) Two exams in-class (open book) (30% grade) Final take-home exam (20% grade) Attendance / Participation (10% grade)

Required textbook

Class structure Tuesdays 5-7:50pm 5-6pm Lecture/Discussion about main concepts Quick break, then 6-7pm SPSS tutorial 7-8pm lab, you can work on homework and I’ll answer questions you might have This picture makes me laugh - Both men and women can be good at statistics, but according to clipart you wouldn’t think that was true

Blackboard SITE FOR THIS CLASS Contains: SPSS files used in the Field chapters so you can practice along with book Assignments Additional required readings SPSS handouts we’ll review during in-class tutorials PowerPoint slides

Psychological research You want to understand relationships between particular constructs Develop a model about how constructs are related in a population Collect data on variables representing those constructs in a sample of the population See how well your hypotheses about the population fit the actual sample data you collected = STATISTICS

Normal distribution Depends on two factors: the population mean (μ) and standard deviation (σ). The mean determines the center; the SD determines height and width. The curve is symmetric around the mean, and the area under it totals 1.

We rarely have access to actual population parameters: means (μ) and standard deviations (σ) Instead we use sample statistics to estimate means and standard deviations and compare them to what we would expect in the normal distribution

Characterizing normal distribution Review descriptive statistics Go over formulas for mean and standard deviation Purpose of z-scores Graphing mean differences: Confidence intervals vs. standard error Assumptions of normal distribution and how to check for violations Reducing bias (outliers, violations) in your data

Descriptive statistics

  Center    Spread               Shape
  Mode      Range                Skewness
  Median    Variance             Kurtosis
  Mean      Standard Deviation

Mean: sum up all sample scores and divide by the number of scores.

Range/QUARTILES (used in boxplots) Range = largest number minus smallest. Quartiles = 3 values that split the sorted data into four equal parts. Second quartile = median; lower quartile = median of the lower half of the data; upper quartile = median of the upper half of the data. Interquartile range (IQR) = upper quartile minus lower quartile.
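The quartile rules above can be sketched in Python (the class uses SPSS; this stdlib sketch uses the median-of-halves rule described on the slide, which can differ slightly from SPSS's interpolation, and the data are hypothetical):

```python
from statistics import median

def quartiles(data):
    """Quartiles via the median-of-halves rule: Q2 is the median;
    Q1/Q3 are medians of the lower/upper halves (Q2 excluded for odd n)."""
    s = sorted(data)
    n = len(s)
    half = n // 2
    q2 = median(s)
    q1 = median(s[:half])          # lower half
    q3 = median(s[half + n % 2:])  # upper half (skip the median if n is odd)
    return q1, q2, q3

scores = [2, 4, 4, 5, 7, 8, 9, 11, 15]  # hypothetical data
q1, q2, q3 = quartiles(scores)
iqr = q3 - q1                           # interquartile range
```

Here q1, q2, and q3 come out as 4, 7, and 10, so the IQR is 6.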

Now we’re going to go through step by step to show: why the standard deviation is used to REFLECT the spread of a distribution

DEVIANCE We can calculate the spread of scores (the error in our model) by looking at how different each score is from the center of a distribution (the mean):

  Score   Mean   Deviance
    7      4        3
    8      4        4
    2      4       -2
    0      4       -4
    3      4       -1
  TOTAL DEVIANCE:

DEVIANCE PROBLEM: When we calculate total deviance, some deviances (errors) are negative and others are positive, so they cancel each other out!

  Score   Mean   Deviance
    7      4        3
    8      4        4
    2      4       -2
    0      4       -4
    3      4       -1
  TOTAL DEVIANCE:    0

SUM OF SQUARED ERRORS (SS) SOLUTION: square each individual deviance and THEN sum them up!

  Score   Mean   Deviance   Deviance Squared
    7      4        3              9
    8      4        4             16
    2      4       -2              4
    0      4       -4             16
    3      4       -1              1
  SS: 46

SUM OF SQUARED ERRORS (SS) PROBLEM: The larger the number of observations, the larger SS will automatically be.

  Score   Mean   Deviance   Deviance Squared
    7      4        3              9
    8      4        4             16
    2      4       -2              4
    0      4       -4             16
    3      4       -1              1
  SS: 46

VARIANCE (s²) SOLUTION: Divide the SS by the number of observations (minus 1*) to scale it!

  Score   Mean   Deviance   Deviance Squared
    7      4        3              9
    8      4        4             16
    2      4       -2              4
    0      4       -4             16
    3      4       -1              1
  SS: 46

N = 5 observations, so s² = 46/4 = 11.5.

* You lose 1 degree of freedom for estimating the mean in the first place. The number of degrees of freedom generally refers to the number of independent observations in a sample minus the number of population parameters that must be estimated from sample data. One population parameter (the mean) is estimated here, so the number of degrees of freedom equals the sample size minus one.

VARIANCE (s²) PROBLEM: Now our measure is in units squared, which is confusing to report in papers.

  Score   Mean   Deviance   Deviance Squared
    7      4        3              9
    8      4        4             16
    2      4       -2              4
    0      4       -4             16
    3      4       -1              1
  SS: 46

N = 5 observations, s² = 46/4 = 11.5.

STANDARD DEVIATION (S; or sd) SOLUTION: Take the square root of the whole darn equation and then you’re back in regular units! Whew! s = sqrt (11.5 units squared) = 3.39 units
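The whole deviance → SS → variance → SD chain can be run in a few lines of Python (the class uses SPSS; the scores here are the ones from the worked example, reconstructed so that the mean is 4 and SS is 46):

```python
from math import sqrt

# Scores consistent with the worked example (mean = 4, SS = 46)
scores = [7, 8, 2, 0, 3]
n = len(scores)
mean = sum(scores) / n                   # 4.0

deviances = [x - mean for x in scores]   # 3, 4, -2, -4, -1
total_deviance = sum(deviances)          # 0 -- positives and negatives cancel

ss = sum(d ** 2 for d in deviances)      # sum of squared errors = 46
variance = ss / (n - 1)                  # s^2 = 46/4 = 11.5  (df = n - 1)
sd = sqrt(variance)                      # s = 3.39, back in the original units
```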

Same mean, different standard deviation

Skewness Symmetry of the distribution. Positive skew: tail pointing at high values. Negative skew: tail pointing at low values.

Kurtosis Heaviness of the tails. Leptokurtic: heavy tails. Platykurtic: light tails.

SPSS descriptive statistics: ONE format

SPSS descriptive statistics: ANOTHER FORMAT

Creating a standard score: Z-scores Allows us to calculate probability of a sample score occurring within normal distribution Enables us to compare two scores that are from different normal distributions Expresses a score in terms of how many standard deviations it is away from the mean z distribution: mean of 0 and SD = 1.

normal distribution

Calculating Z-scores Using population parameters (we typically don't know these): ideally we would take each score, subtract the population mean, and divide by the population standard deviation: z = (X − μ) / σ. Using sample statistics based on our own data set: since we typically do not have population parameters, we instead subtract the sample mean and divide by the sample standard deviation: z = (X − M) / s.
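A minimal Python sketch of the sample-statistics version; the scores, means, and SDs below are hypothetical, chosen to show how z-scores let you compare scores from two different distributions:

```python
def z_score(x, mean, sd):
    """How many standard deviations a score lies from the mean."""
    return (x - mean) / sd

# Hypothetical: compare scores from two different distributions
z_a = z_score(85, mean=70, sd=10)  # 1.5 SDs above its mean
z_b = z_score(60, mean=50, sd=5)   # 2.0 SDs above its mean
# z_b is the more extreme score even though 60 < 85
```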

Z-score CUTOFFS

  Top/Bottom of Distribution    Z-score
  2.5%  (5% two-tailed)         +/- 1.96
  0.5%  (1% two-tailed)         +/- 2.58
  0.05% (0.1% two-tailed)       +/- 3.29

Once you’ve calculated descriptive statistics, what is a simple and clear way to display them?

Graphing mean differences: Confidence Intervals (CI) vs. Standard Error (SE)

[Figure: error bars on a normal curve] A 95% CI spans z = -1.96 to z = +1.96, or roughly +/- 2 SE; SE bars are typically +/- 1 SE.

Confidence intervals (CI) The true mean reaction time for all women and men in the population is unknowable. 95% CI = if we repeatedly studied different random samples of women, 95% of the intervals computed this way would contain the true mean for all women. You can do the same calculation for men. Say you did significance testing and decided to plot means with CIs: CIs can overlap as much as 25% and there can still be a significant (p < .05) difference between the means for men and women.

Standard error (SE) +/- 1 SE = 68% chance that the true mean falls within this range (sort of like a 68% CI). +/- 2 SE = 95% chance that the true mean falls within this range (almost equivalent to a 95% CI). Typically you only plot +/- 1 SE, though. When +/- 1 SE bars overlap, differences between men and women are not significant at p < .05 (here they are far from overlapping). In fact, for significance there needs to be about half an error bar's length of space between the two bars.
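Computing both kinds of error bar for one group, as a Python sketch rather than SPSS output (the reaction times are hypothetical):

```python
from math import sqrt
from statistics import mean, stdev

rt = [412, 398, 455, 430, 421, 389, 442, 405]  # hypothetical reaction times (ms)

m = mean(rt)
se = stdev(rt) / sqrt(len(rt))           # standard error of the mean

se_bar = (m - se, m + se)                # +/- 1 SE error bar
ci_95 = (m - 1.96 * se, m + 1.96 * se)   # approximate 95% CI
# The 95% CI is roughly twice as wide as the +/- 1 SE bar
```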

So which do you use, CIs or SEs? It often depends on the preference of a particular journal as well as your personal preference. What about line versus bar graphs for plotting means? Lines work well for continuous data; bars work well for categorical data. Often people go with personal preference.

Assumptions of the normal distribution* Additivity and Linearity Normality Homoscedasticity / Homogeneity of variance Independence of errors *As sample sizes for each group approach 30 or larger, you have to worry less about this, because sample statistics approximate population parameters more and more closely as sample size increases.

Assumptions of the normal distribution Additivity and Linearity: Outcome of any model we create is linearly related to predictor variables

Assumptions of the normal distribution Normality Confidence intervals, sampling distribution of means, and errors all need to be normally distributed

Assumptions of the normal distribution Homoscedasticity / Homogeneity of variance

Assumptions of the normal distribution Independence of errors They should not be correlated with each other!

Checking for OUTLIERS AND VIOLATION OF ASSUMPTIONS Histograms Boxplots Q-Q (quantile-quantile) plots P-P (probability-probability) plots Scatterplots Skewness/kurtosis z-score checks K-S test and Levene's test

HISTOGRAM Plots variable values on x-axis Plots frequency of responses on y-axis Can eyeball skewness, kurtosis, and outliers with help from normal curve outlined in black

BOXPLOT Uses info from the median and interquartile range to determine outliers and extreme values. Outliers = any scores beyond the upper quartile + [1.5 * interquartile range] (or below the lower quartile - [1.5 * interquartile range]). Extreme scores = any scores beyond the upper quartile + [3 * interquartile range] (or below the lower quartile - [3 * interquartile range]).
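The boxplot fences can be sketched as follows (Python rather than SPSS; the quartile rule is the simple median-of-halves one, so cutoffs may differ slightly from SPSS output, and the scores are hypothetical):

```python
from statistics import median

def classify(scores):
    """Split scores into boxplot outliers (beyond 1.5 * IQR past a quartile)
    and extreme scores (beyond 3 * IQR past a quartile)."""
    s = sorted(scores)
    half = len(s) // 2
    q1 = median(s[:half])
    q3 = median(s[half + len(s) % 2:])
    iqr = q3 - q1
    outliers, extreme = [], []
    for x in s:
        if x < q1 - 3 * iqr or x > q3 + 3 * iqr:
            extreme.append(x)
        elif x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr:
            outliers.append(x)
    return outliers, extreme

# Hypothetical data: 30 lies beyond the 3*IQR fence, so it is extreme
outliers, extreme = classify([1, 2, 3, 4, 5, 6, 7, 8, 30])
```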

P-P and Q-Q Plots PDF = probability density function – plotting two non-cumulative distributions against each other. Example: Q-Q plot. CDF = cumulative distribution function – plotting two cumulative (range: 0-1) distributions against each other. Example: P-P plot.

Q-Q plot Plots observed sample values on the X-axis and the expected values (assuming a normal distribution) on the Y-axis. If the sample is distributed exactly like a normal distribution, the points fall on a straight line. Plots quantiles of your sample data, not all points. Magnifies deviations from the proposed distribution in the tails.

Interpreting q-q plots

P-P Plot Plots cumulative probabilities, with observed probabilities on the X-axis and expected probabilities given the normal curve on the Y-axis. Again, if the sample were exactly normally distributed, the points would lie on a straight line Plots all probabilities from your data Magnifies deviations from normal distribution in middle

SCATTERPLOT Plotting data for one variable against another variable Look for outliers Check for homoscedasticity

Skewness/kurtosis z-score checks Divide each skewness or kurtosis value by its standard error and compare to the z-distribution cutoffs (2.5% / 5% two-tailed: +/- 1.96; 0.5% / 1% two-tailed: +/- 2.58; 0.05% / 0.1% two-tailed: +/- 3.29). Exam Performance: skewness -.373/.238 = -1.57; kurtosis -.852/.472 = -1.81. WITHIN NORMAL LIMITS. Exam Anxiety: skewness -2.012/.238 = -8.45; kurtosis 5.192/.472 = 11.00. EXTREMELY NON-NORMAL!
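The check is just a ratio against a cutoff; a Python sketch using the skewness values and standard errors reported above:

```python
def z_check(stat, se, cutoff=1.96):
    """Divide a skewness/kurtosis estimate by its SE and flag whether
    it stays within +/- cutoff (1.96 for p > .05, two-tailed)."""
    z = stat / se
    return z, abs(z) <= cutoff

# Values from the Exam Performance / Exam Anxiety output
z_perf, perf_ok = z_check(-0.373, 0.238)  # within normal limits
z_anx, anx_ok = z_check(-2.012, 0.238)    # far outside the cutoff: non-normal
```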

Kolmogorov-Smirnov (K-S) test Tests the normality assumption. Compares scores in your sample to a normally distributed set of scores with the same mean and standard deviation. If p < .05, your sample distribution is significantly different from the normal distribution. The Shapiro-Wilk test does the same thing.
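The class runs this in SPSS; a rough scipy equivalent on hypothetical data is below. Note that estimating the mean and SD from the sample makes the plain K-S test conservative (SPSS applies a correction for this):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_scores = rng.normal(loc=50, scale=10, size=200)  # hypothetical normal data
skewed_scores = rng.exponential(scale=10, size=200)     # hypothetical skewed data

def ks_normality(x):
    """K-S test against a normal with the sample's own mean and SD."""
    return stats.kstest(x, 'norm', args=(x.mean(), x.std(ddof=1)))

stat_n, p_normal = ks_normality(normal_scores)  # p > .05: consistent with normal
stat_s, p_skewed = ks_normality(skewed_scores)  # p < .05: non-normal
```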

Levene's test Tests the homogeneity of variance assumption: the variance of your outcome variable should be the same for each group (e.g., depressed vs. non-depressed). Tests whether the variances in each group are equal. If p < .05, the variances are unequal.

Reducing bias Trim the data Winsorizing Analyze with robust methods Transform the data

Reducing bias Trim the data: remove outliers based on a rule. %-based rules = trimmed mean, M-estimator (trims your data for you). SD-based rules = remove scores a certain # of SDs above/below the mean; but the SD and mean are influenced by outliers in the first place, so the criterion used to remove them is inherently biased. Winsorizing: replacing outliers with the next-highest score that is not an outlier (or with a cutoff value, e.g., the score 3 SD from the mean). Analyze with robust methods (non-parametric tests, see Field Chapter 6; or bootstrapping: sampling with replacement from our dataset 1,000-2,000 times to get an estimated confidence interval and SE for our data). Transform the data (log, square root, reciprocal).
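Two of these fixes sketched in Python (the 3-SD cutoff variant of winsorizing and a log transform; the data are hypothetical):

```python
import math

def winsorize_3sd(scores):
    """Clamp values beyond mean +/- 3 SD to the cutoff
    (the cutoff-replacement variant of winsorizing)."""
    n = len(scores)
    m = sum(scores) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in scores) / (n - 1))
    lo, hi = m - 3 * sd, m + 3 * sd
    return [min(max(x, lo), hi) for x in scores]

def log_transform(scores):
    """Log-transform positively skewed data; +1 so zeros are allowed."""
    return [math.log(x + 1) for x in scores]

# Hypothetical sample with one wild score: 1000 gets pulled down to the cutoff
cleaned = winsorize_3sd([10] * 20 + [11] * 20 + [1000])
```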

STEPS: Exploring your data Let's say you collected the same data in two groups (depressed and non-depressed) and entered the data in SPSS. What next? You're going to do the following, once across ALL subjects, and once for each group separately: Descriptive statistics (mean, standard deviation, range, skewness, kurtosis, etc.) Graphs (histograms, boxplots, P-P plots, Q-Q plots) Check whether your data are normally distributed (compute z-scores for skewness and kurtosis; compute K-S and Shapiro-Wilk tests) Figure out how you want to handle outliers and extreme scores Compute Levene's test to determine whether groups violate the homogeneity of variance assumption Reduce bias (e.g., perform transformations on any problematic variables and re-check with Levene's test)

SPSS tutorials (handout) after the break!

Detrended q-q plot Here, the Y-axis is the deviation (difference) between what was observed and what was expected. This plot sometimes makes the pattern easier to decipher (note the clear “S” pattern indicating skew)