Psych 706: Stats II Class #1
What is the point of this class? Reinforce the basics you learned in Stats I, focusing mostly on parametric statistics. Help you feel comfortable: exploring data and looking for outliers; choosing the appropriate statistical test(s); explaining results of tests clearly to audiences; displaying your results as clearly and simply as possible.
Syllabus. “Dr. Stewart” or “Jenny” – either is fine. Office hours in SB A-312: by appointment. Grade based on: four homework assignments (40% of grade); two in-class exams, open book (30%); final take-home exam (20%); attendance/participation (10%).
Required textbook
Class structure: Tuesdays 5-7:50pm. 5-6pm: lecture/discussion about main concepts. Quick break, then 6-7pm: SPSS tutorial. 7-7:50pm: lab, where you can work on homework and I'll answer questions you might have. This picture makes me laugh - both men and women can be good at statistics, but according to clipart you wouldn't think that was true.
Blackboard site for this class. Contains: SPSS files used in the Field chapters so you can practice along with the book; assignments; additional required readings; SPSS handouts we'll review during in-class tutorials; PowerPoint slides.
Psychological research. You want to understand relationships between particular constructs. Develop a model about how constructs are related in a population. Collect data on variables representing those constructs in a sample of the population. See how well your hypotheses about the population fit the actual sample data you collected = STATISTICS.
Normal distribution. Depends on two factors: the population mean (μ) and standard deviation (σ). The mean determines the center; the SD determines the height and width. Symmetric around the mean; the area under the curve totals 1.
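To see these two parameters at work, here is a minimal sketch (Python with numpy/scipy assumed; in class we do the equivalent in SPSS, and all names here are just for illustration):

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-10, 10, 201)
narrow = norm.pdf(x, loc=0, scale=1)    # mean 0, SD 1: tall and narrow
wide = norm.pdf(x, loc=0, scale=3)      # same mean, SD 3: shorter and wider
shifted = norm.pdf(x, loc=2, scale=1)   # mean 2: same shape, moved right

# Total area under the curve is 1 (here, the area between -10 and 10):
print(norm.cdf(10, loc=0, scale=3) - norm.cdf(-10, loc=0, scale=3))  # ~0.999
```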
We rarely have access to actual population parameters: means (μ) and standard deviations (σ). Instead we use sample statistics to estimate them and compare them to what we would expect under the normal distribution.
Characterizing the normal distribution: review descriptive statistics; go over formulas for the mean and standard deviation; the purpose of z-scores; graphing mean differences (confidence intervals vs. standard error); assumptions of the normal distribution and how to check for violations; reducing bias (outliers, violations) in your data.
Descriptive statistics.
Center: mode, median, mean (sum up all sample scores and divide by the number of scores).
Spread: range, variance, standard deviation.
Shape: skewness, kurtosis.
Range/QUARTILES (used in boxplots). Range = largest number minus smallest. Quartiles = 3 values that split the sorted data into four equal parts: the second quartile = the median; the lower quartile = the median of the lower half of the data; the upper quartile = the median of the upper half of the data. Interquartile range = upper quartile minus lower quartile (the middle 50% of scores).
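A quick illustration of these definitions (Python/numpy assumed, with hypothetical data; note that np.percentile's default interpolation can differ slightly from the median-of-the-halves rule on this slide):

```python
import numpy as np

scores = np.array([2, 3, 5, 7, 8, 9, 12, 15])     # hypothetical data
q1, q2, q3 = np.percentile(scores, [25, 50, 75])  # lower quartile, median, upper quartile
iqr = q3 - q1                                     # interquartile range
data_range = scores.max() - scores.min()          # largest minus smallest
print(q1, q2, q3, iqr, data_range)
```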
Now we’re going to go through step by step to show: why the standard deviation is used to REFLECT the spread of a distribution
DEVIANCE
We can calculate the spread of scores (the error in our model) by looking at how different each score is from the center of the distribution (the mean):

Score   Mean   Deviance
  7       4       3
  8       4       4
  2       4      -2
  0       4      -4
  3       4      -1
TOTAL DEVIANCE: ?
DEVIANCE
PROBLEM: When we calculate total deviance, since some deviances (errors) are negative and others are positive, they cancel each other out!
TOTAL DEVIANCE: 3 + 4 - 2 - 4 - 1 = 0
SUM OF SQUARED ERRORS (SS)
SOLUTION: square each individual deviance and THEN sum them up!

Score   Mean   Deviance   Deviance Squared
  7       4       3              9
  8       4       4             16
  2       4      -2              4
  0       4      -4             16
  3       4      -1              1
SS: 46
SUM OF SQUARED ERRORS (SS)
PROBLEM: The larger the number of observations, the larger SS will automatically be. (Same table as above: SS = 46 for N = 5 scores.)
VARIANCE (s²)
SOLUTION: Divide the SS by the number of observations minus 1* to scale it: N = 5 observations, so s² = 46/4 = 11.5.
*You lose 1 degree of freedom for estimating the mean in the first place. The number of degrees of freedom generally refers to the number of independent observations in a sample minus the number of population parameters that must be estimated from the sample data. Here one parameter (the mean) is estimated, so the degrees of freedom equal the sample size minus one.
VARIANCE (s²)
PROBLEM: Now our measure (s² = 46/4 = 11.5) is in units squared, which is confusing to report in papers.
STANDARD DEVIATION (s; or SD)
SOLUTION: Take the square root of the whole darn equation and then you're back in regular units! Whew!
s = sqrt(11.5 units squared) = 3.39 units
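Here is the whole deviance-to-SD walkthrough in one runnable sketch (Python/numpy assumed; the score column 7, 8, 2, 0, 3 is reconstructed from the slide's mean of 4 and SS of 46):

```python
import numpy as np

scores = np.array([7, 8, 2, 0, 3])        # the slide's example; mean = 4
deviance = scores - scores.mean()         # 3, 4, -2, -4, -1 (sums to 0!)
ss = np.sum(deviance ** 2)                # sum of squared errors = 46
variance = ss / (len(scores) - 1)         # 46 / 4 = 11.5 (N - 1 df)
sd = np.sqrt(variance)                    # ~3.39, back in original units
print(np.isclose(sd, scores.std(ddof=1))) # True: matches numpy's sample SD
```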
Same mean, different standard deviation
Skewness: symmetry of the distribution. Positive skew = tail pointing at high values; negative skew = tail pointing at low values.
Kurtosis: heaviness of the tails. Leptokurtic = heavy tails; platykurtic = light tails.
SPSS descriptive statistics: ONE format
SPSS descriptive statistics: ANOTHER FORMAT
Creating a standard score: Z-scores Allows us to calculate probability of a sample score occurring within normal distribution Enables us to compare two scores that are from different normal distributions Expresses a score in terms of how many standard deviations it is away from the mean z distribution: mean of 0 and SD = 1.
Normal distribution
Calculating z-scores. Ideally we would use population parameters: take each score, subtract the population mean, and divide by the population standard deviation: z = (X - μ) / σ. But since we typically do not have that information, we instead use sample statistics based on our own data set: subtract the sample mean and divide by the sample standard deviation: z = (X - x̄) / s.
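A minimal sketch of the sample-statistics version (Python/numpy assumed; the scores are hypothetical):

```python
import numpy as np

x = np.array([12, 15, 9, 18, 14, 11])  # hypothetical sample scores
z = (x - x.mean()) / x.std(ddof=1)     # subtract sample mean, divide by s
print(z.round(2))                      # z now has mean 0 and SD 1
```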
Z-score CUTOFFS

Top/bottom of distribution     z-score
2.5% (5% two-tailed)           +/- 1.96
0.5% (1% two-tailed)           +/- 2.58
0.05% (0.1% two-tailed)        +/- 3.29
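These cutoffs come straight from the normal distribution; if you want to verify them yourself, scipy's norm.ppf (the inverse CDF) reproduces the table (Python/scipy assumed):

```python
from scipy.stats import norm

# Two-tailed cutoffs: half of the stated alpha sits in each tail.
print(norm.ppf(0.975))   # ~1.96 -> 5% two-tailed (2.5% per tail)
print(norm.ppf(0.995))   # ~2.58 -> 1% two-tailed (0.5% per tail)
print(norm.ppf(0.9995))  # ~3.29 -> 0.1% two-tailed (0.05% per tail)
```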
Once you’ve calculated descriptive statistics, what is a simple and clear way to display them?
Graphing mean differences: confidence intervals (CI) vs. standard error (SE)
SE = s / sqrt(N). 95% CI = mean +/- 1.96 x SE (z = -1.96 and z = +1.96 cut off the middle 95% of the normal distribution), so a 95% CI is roughly mean +/- 2 SE.
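A minimal sketch of both quantities (Python/numpy assumed; the reaction times are hypothetical, and the slide's normal-based 1.96 is used even though a t critical value would be slightly more exact for small N):

```python
import numpy as np

rts = np.array([412, 398, 455, 430, 388, 441, 420, 405])  # hypothetical RTs (ms)
se = rts.std(ddof=1) / np.sqrt(len(rts))                  # SE = s / sqrt(N)
ci = (rts.mean() - 1.96 * se, rts.mean() + 1.96 * se)     # ~95% CI
print(f"mean = {rts.mean():.1f}, SE = {se:.2f}, 95% CI = {ci}")
```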
Confidence intervals (CI). The true mean reaction time for all women and men in the population is unknowable. 95% CI = if we repeatedly drew different random samples of women and computed a CI from each, 95% of those intervals would contain the true mean for all women. You can do the same calculation for men. Say you did significance testing and decided to plot means with CIs: CIs can overlap by as much as about 25% and there can still be a significant (p < .05) difference between the means for men and women.
Standard error (SE). +/- 1 SE covers about 68% of the sampling distribution of the mean (sort of like a 68% CI). +/- 2 SE covers about 95% (almost equivalent to a 95% CI). Typically you only plot +/- 1 SE, though. When +/- 1 SE bars overlap, the difference between men and women is not significant at p < .05 (here they are far from overlapping). As a rule of thumb, there needs to be a gap of about half an error bar between the two means for the difference to be significant.
So which do you use, CIs or SEs? It often depends on the preference of a particular journal as well as your personal preference. What about line versus bar graphs for plotting means? Lines work well for continuous data; bars work well for categorical data. Often people go with personal preference.
Assumptions of the normal distribution*: additivity and linearity; normality; homoscedasticity / homogeneity of variance; independence of errors. *As the sample size for each group approaches 30 or more, you have to worry less about these, because the sampling distribution of the mean approaches normality as sample size increases (the central limit theorem).
Assumptions of the normal distribution Additivity and Linearity: Outcome of any model we create is linearly related to predictor variables
Assumptions of the normal distribution Normality Confidence intervals, sampling distribution of means, and errors all need to be normally distributed
Assumptions of the normal distribution. Homoscedasticity / homogeneity of variance: the variance of the outcome should be roughly equal across levels of the predictor (e.g., across groups).
Assumptions of the normal distribution Independence of errors They should not be correlated with each other!
Checking for OUTLIERS AND VIOLATIONS OF ASSUMPTIONS: histograms; boxplots; Q-Q (quantile-quantile) plots; P-P (probability-probability) plots; scatterplots; skewness/kurtosis z-score checks; K-S test and Levene's test.
HISTOGRAM Plots variable values on x-axis Plots frequency of responses on y-axis Can eyeball skewness, kurtosis, and outliers with help from normal curve outlined in black
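A sketch of the same plot outside SPSS (Python with numpy/scipy/matplotlib assumed; the data are simulated for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

x = np.random.default_rng(1).normal(50, 10, size=200)  # hypothetical scores
plt.hist(x, bins=20, density=True)                     # scaled to density
grid = np.linspace(x.min(), x.max(), 200)
plt.plot(grid, norm.pdf(grid, x.mean(), x.std(ddof=1)), "k-")  # normal curve
plt.xlabel("score"); plt.ylabel("density")
plt.show()
```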
BOXPLOT. Uses info from the median and interquartile range (IQR) to determine outliers and extreme values. Outliers = any scores above the upper quartile + 1.5 x IQR (or below the lower quartile - 1.5 x IQR). Extreme scores = any scores above the upper quartile + 3 x IQR (or below the lower quartile - 3 x IQR).
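The same fences, computed by hand (Python/numpy assumed; the data are hypothetical):

```python
import numpy as np

x = np.array([3, 4, 4, 5, 5, 6, 6, 7, 8, 25])  # 25 looks suspicious
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]  # beyond 1.5*IQR
extremes = x[(x < q1 - 3.0 * iqr) | (x > q3 + 3.0 * iqr)]  # beyond 3*IQR
print(outliers, extremes)  # both flag 25 here
```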
P-P and Q-Q plots. PDF = probability density function: plotting two non-cumulative distributions against each other. Example: Q-Q plot. CDF = cumulative distribution function: plotting two cumulative distributions (range 0-1) against each other. Example: P-P plot.
Q-Q plot. Plots observed sample values on the x-axis and the expected values (assuming a normal distribution) on the y-axis. If the sample is distributed exactly like a normal distribution, the points fall on a straight line. Plots quantiles of your sample data, not all points. Magnifies deviations from the proposed distribution in the tails.
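Outside SPSS, scipy can draw the same plot (Python with scipy/matplotlib assumed; note that scipy puts the expected normal quantiles on the x-axis, the transpose of the SPSS layout described above):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.random.default_rng(2).exponential(size=100)  # deliberately skewed
stats.probplot(x, dist="norm", plot=plt)            # points bow off the line
plt.show()
```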
Interpreting Q-Q plots
P-P plot. Plots cumulative probabilities, with observed probabilities on the x-axis and expected probabilities given the normal curve on the y-axis. Again, if the sample were exactly normally distributed, the points would lie on a straight line. Plots all probabilities from your data. Magnifies deviations from the normal distribution in the middle.
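A hand-rolled version, to make the cumulative logic explicit (Python with numpy/scipy/matplotlib assumed; simulated data):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

x = np.sort(np.random.default_rng(3).normal(size=100))
observed = (np.arange(1, len(x) + 1) - 0.5) / len(x)  # empirical cum. prob.
expected = norm.cdf(x, x.mean(), x.std(ddof=1))       # normal cum. prob.
plt.plot(observed, expected, "o")                     # observed on x-axis
plt.plot([0, 1], [0, 1], "k-")                        # line if exactly normal
plt.xlabel("observed cumulative probability")
plt.ylabel("expected cumulative probability")
plt.show()
```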
SCATTERPLOT Plotting data for one variable against another variable Look for outliers Check for homoscedasticity
Skewness/kurtosis z-score checks. Divide each skewness or kurtosis value by its standard error and compare the result to the z-score cutoffs (|z| > 1.96 for p < .05, > 2.58 for p < .01, > 3.29 for p < .001, two-tailed).
Exam Performance: skewness -.373/.238 = -1.56; kurtosis -.852/.472 = -1.81. WITHIN NORMAL LIMITS.
Exam Anxiety: skewness -2.012/.238 = -8.45; kurtosis 5.192/.472 = 11.0. EXTREMELY NON-NORMAL!
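To reproduce this check outside SPSS, a sketch (Python/scipy assumed, with simulated data): it uses the common large-sample approximations sqrt(6/N) and sqrt(24/N) for the standard errors, whereas SPSS reports slightly more exact small-sample versions:

```python
import numpy as np
from scipy.stats import skew, kurtosis

x = np.random.default_rng(4).normal(size=103)  # hypothetical variable
n = len(x)
z_skew = skew(x) / np.sqrt(6 / n)              # skewness / SE(skewness)
z_kurt = kurtosis(x) / np.sqrt(24 / n)         # excess kurtosis / SE
print(abs(z_skew) > 1.96, abs(z_kurt) > 1.96)  # flag if beyond +/-1.96
```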
Kolmogorov-Smirnov (K-S) test. Tests the normality assumption. Compares scores in your sample to a normally distributed set of scores with the same mean and standard deviation. If p < .05, your sample distribution is significantly different from the normal distribution. The Shapiro-Wilk test does the same thing.
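Both tests are available in scipy (Python assumed; data simulated). One caveat: estimating the mean and SD from the same sample makes the plain K-S p-value too lenient, which is why SPSS reports a Lilliefors-corrected version:

```python
import numpy as np
from scipy import stats

x = np.random.default_rng(5).normal(size=100)  # hypothetical variable
z = (x - x.mean()) / x.std(ddof=1)             # standardize first
print(stats.kstest(z, "norm"))                 # K-S against N(0, 1)
print(stats.shapiro(x))                        # Shapiro-Wilk
```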
Levene's test. Tests the homogeneity of variance assumption: the variance of your outcome variable should be the same for each group (e.g., depressed vs. non-depressed). Tests whether the variances in the groups are equal. If p < .05, the variances are unequal.
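A minimal sketch (Python/scipy assumed; group scores simulated). scipy's default centers on the median (the Brown-Forsythe variant), so center="mean" is passed to match the classic Levene test SPSS reports:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
depressed = rng.normal(10, 2, size=40)      # hypothetical group scores
non_depressed = rng.normal(10, 4, size=40)  # same mean, larger spread
stat, p = stats.levene(depressed, non_depressed, center="mean")
print(p < .05)  # True -> variances unequal; assumption violated
```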
Reducing bias: trim the data; Winsorizing; analyze with robust methods; transform the data.
Reducing bias. Trim the data: remove outliers based on a rule. Percentage-based rules = trimmed mean, M-estimator (trims your data for you); SD-based rules = remove scores a certain number of SDs above/below the mean, but the mean and SD are influenced by outliers in the first place, so the criterion used to remove them is inherently biased. Winsorizing: replace outliers with the next-highest score that is not an outlier (e.g., with the value 3 SD from the mean). Analyze with robust methods: non-parametric tests (see Field, Chapter 6) or bootstrapping (sampling with replacement from our data set 1000-2000 times to get an estimated confidence interval and SE for our data). Transform the data: log, square root, reciprocal.
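Sketches of each option (Python with numpy/scipy assumed; the data and cut-offs are hypothetical):

```python
import numpy as np
from scipy import stats
from scipy.stats.mstats import winsorize

x = np.array([2., 3., 3., 4., 5., 5., 6., 40.])  # 40 is an outlier
trimmed = stats.trim_mean(x, 0.125)              # drop 1 score from each end
wins = winsorize(x, limits=[0.125, 0.125])       # replace tails with neighbors
logged = np.log(x)                               # transform (scores > 0)

# Percentile bootstrap: resample with replacement ~2000 times.
rng = np.random.default_rng(7)
boot_means = [rng.choice(x, size=len(x), replace=True).mean()
              for _ in range(2000)]
print(np.percentile(boot_means, [2.5, 97.5]))    # bootstrapped 95% CI
print(np.std(boot_means, ddof=1))                # bootstrapped SE
```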
STEPS: Exploring your data. Let's say you collected the same data in two groups (depressed and non-depressed) and entered the data in SPSS. What next? You're going to do the following, once across ALL subjects, and once for each group separately: descriptive statistics (mean, standard deviation, range, skewness, kurtosis, etc.); graphs (histograms, boxplots, P-P plots, Q-Q plots); check whether your data are normally distributed (compute z-scores for skewness and kurtosis; compute K-S and Shapiro-Wilk tests); figure out how you want to handle outliers and extreme scores; compute Levene's test to determine whether the groups violate the homogeneity of variance assumption; reduce bias (e.g., perform transformations on any problematic variables and re-check the distributions and Levene's test). A sketch of this workflow appears below.
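A compact end-to-end version of these steps (Python with pandas/scipy assumed; in SPSS the same numbers come from Analyze > Descriptive Statistics > Explore; the data below are simulated):

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(8)
df = pd.DataFrame({
    "group": ["depressed"] * 30 + ["non_depressed"] * 30,
    "score": np.r_[rng.normal(8, 2, 30), rng.normal(12, 3, 30)],
})

# 1. Descriptives, across all subjects and per group
print(df["score"].describe())
print(df.groupby("group")["score"].agg(["mean", "std", "skew"]))

# 2-3. Normality checks within each group
for name, g in df.groupby("group"):
    print(name, stats.shapiro(g["score"]))

# 5. Homogeneity of variance across groups
groups = [g["score"] for _, g in df.groupby("group")]
print(stats.levene(*groups, center="mean"))
```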
SPSS tutorials (handout) after the break!
Detrended Q-Q plot. Here the y-axis is the deviation (difference) between what was observed and what was expected. This plot sometimes makes the pattern easier to decipher (note the clear “S” pattern indicating skew).