Published by Anis Black. Modified over 8 years ago.
1
Copyright © 2005 by Lippincott Williams and Wilkins. PowerPoint Presentation to Accompany Statistical Methods for Health Care Research by Barbara Hazard Munro. Chapter 2 UNIVARIATE DESCRIPTIVE STATISTICS
2
Objectives for Chapter 2: Define measures of central tendency & dispersion. Select appropriate measures to use for a particular dataset. Discuss methods to identify & manage outliers. Discuss methods to handle missing data.
3
Basic Characteristics of a Distribution: Central Tendency, Variability, Skewness, Kurtosis
4
Measures of Central Tendency: When assessing the central tendency of your measurements, you are attempting to identify the “average” measurement. Mean: best known & most widely used average, describing the center of a frequency distribution. Median: the middle value/point of a set of ordered numbers, below which 50% of the distribution falls. Mode: the most frequent value or category in a distribution.
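The three measures can be sketched with Python's standard `statistics` module. The data below are hypothetical lengths of stay (in days), invented only for illustration; note how the single extreme value pulls the mean above the median.

```python
from statistics import mean, median, multimode

# Hypothetical lengths of stay in days (illustrative values, not from the text)
stay_days = [2, 3, 3, 4, 5, 6, 19]

center_mean = mean(stay_days)        # arithmetic average
center_median = median(stay_days)    # middle value of the ordered data
center_modes = multimode(stay_days)  # most frequent value(s), returned as a list

print(center_mean, center_median, center_modes)
```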
5
Comparison of Central Tendency Measures: In a perfect world, the mean, median & mode would be the same. However, the world is not perfect & very often the mean, median and mode are not the same.
6
Central Tendency - Graphed [Figure: distribution curve with the mean, median & mode marked]
7
Comparison of Central Tendency Measures: Use Mean when the distribution is reasonably symmetrical, has few extreme scores and has one mode. Use Median with nonsymmetrical distributions because it is not sensitive to skewness. Use Mode when dealing with a frequency distribution for nominal data.
8
Variability: A quantitative measure of the degree to which scores in a distribution are spread out or clustered together. Types of variability include: Standard Deviation: a measure of the dispersion of scores around the mean. Range: highest value minus the lowest value. Interquartile Range: range of values extending from the 25th percentile to the 75th percentile.
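A minimal sketch of the three measures, again using the standard `statistics` module on hypothetical scores. The quartile method is an assumption: `method="inclusive"` is one of several conventions, and statistics packages may report slightly different quartiles for small samples.

```python
from statistics import stdev, quantiles

# Hypothetical scores (illustrative values only)
scores = [2, 3, 3, 4, 5, 6, 19]

sd = stdev(scores)                         # sample standard deviation (n - 1 denominator)
value_range = (min(scores), max(scores))   # reported as minimum and maximum
q1, q2, q3 = quantiles(scores, n=4, method="inclusive")  # 25th, 50th, 75th percentiles
iqr = q3 - q1                              # interquartile range

print(round(sd, 2), value_range, iqr)
```

Because of the single extreme score, the standard deviation here is much larger than the IQR, which ignores the tails.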
9
Variability - Graphed [Figure: distribution curve with the range and the points -1 SD and +1 SD marked]
10
Standard Deviation: Most widely reported measure of variability. Commonly used to calculate other statistical measures. Indicates dispersion, or spread, of scores in a distribution.
11
Standard Deviation: The smaller the standard deviation, the more tightly clustered the scores. The larger the standard deviation, the more spread out the scores. Report the SD when you report the mean of a continuous variable’s distribution.
12
Range: Simplest measure of variability. Difference between the maximum value in the distribution and the minimum value. Unstable because it is based on only two values. Sensitive to extreme scores. Usually reported as the minimum and maximum scores, not as the difference between them.
13
Percentiles: A percentile is a score below which a given percentage of values fall (and above which the remainder fall). Symbolized by the letter P. Ex: P40 = 55 means that 40% of values in the distribution fall below the score 55.
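As a quick check of the P40 = 55 reading, the sketch below counts the percentage of values falling below a given score. The function name `percent_below` and the exam scores are hypothetical, chosen so that the example reproduces the slide's numbers.

```python
def percent_below(values, score):
    """Percentage of values strictly below `score`."""
    below = sum(1 for v in values if v < score)
    return 100 * below / len(values)

# Hypothetical scores: 4 of the 10 values fall below 55
exam_scores = [40, 45, 50, 52, 55, 58, 60, 65, 70, 80]

print(percent_below(exam_scores, 55))  # 40.0, i.e. 55 sits at P40
```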
14
Interpercentile Measures: Interquartile Range (IQR): range of values extending from P25 to P75. Like the median, the IQR is not sensitive to extreme scores. Most common use of the IQR is for growth charts.
15
Comparison of Measures of Variability: Standard Deviation. Most widely used measure of variability. Most reliable estimate of population variability. Best with symmetrical distributions with only one mode.
16
Comparison of Measures of Variability: Range. Main use is to call attention to the two extreme values of a distribution. Quick, rough estimate of variability. Greatly influenced by sample size: the larger the sample, the larger the range tends to be.
17
Comparison of Measures of Variability: Interpercentile Measures. Easy to understand. Can be used with distributions of any shape. Especially useful in very skewed distributions. Use the IQR when reporting the median of a distribution.
18
Shape of the Distribution: The shape of the distribution provides information about the central tendency and variability of measurements. Three common shapes of distributions are: Normal: bell-shaped curve; symmetrical. Skewed: non-normal; non-symmetrical; can be positively or negatively skewed. Multimodal: has more than one peak (mode).
19
Normal Distribution
20
Positively Skewed Distribution
21
Negatively Skewed Distribution
22
Bimodal Distribution
23
Variable Distribution Symmetry: The normal distribution is symmetrical & bell-shaped; often called the “bell-shaped curve”. When a variable’s distribution is non-symmetrical, it is skewed. This means that the mean is not in the center of the distribution.
24
Skewness: Skewness is the measure of the shape of a nonsymmetrical distribution. Two sets of data can have the same mean & SD but different skewness. Two types of skewness: positive skewness and negative skewness.
25
Relative Locations for Measures of Central Tendency (ordered from left to right along the score axis): Negatively Skewed: Mean, Median, Mode. Symmetric (Not Skewed): Mean, Median & Mode coincide. Positively Skewed: Mode, Median, Mean.
26
Positively Skewed Distribution
27
Positive Skewness: Has a pileup of cases to the left & the right tail of the distribution is too long.
28
Negatively Skewed Distribution
29
Negative Skewness: Has a pileup of cases to the right & the left tail of the distribution is too long.
30
Measures of Symmetry: Pearson’s Skewness Coefficient. Formula = (mean - median) / SD. Skewness values > 0.2 or < -0.2 indicate severe skewness.
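Pearson's coefficient, (mean - median) / SD, is simple enough to compute directly. The sketch below assumes the sample standard deviation (n - 1 denominator) and uses hypothetical data; the function name is illustrative.

```python
from statistics import mean, median, stdev

def pearson_skewness(values):
    """Pearson's skewness coefficient: (mean - median) / SD."""
    return (mean(values) - median(values)) / stdev(values)

# Hypothetical positively skewed data: one extreme high score
skewed = [2, 3, 3, 4, 5, 6, 19]
symmetric = [1, 2, 3, 4, 5]

print(round(pearson_skewness(skewed), 2))   # above the 0.2 cutoff
print(pearson_skewness(symmetric))          # 0.0 for symmetric data
```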
31
Measures of Symmetry: Fisher’s Skewness Coefficient. Formula = skewness coefficient / standard error of skewness. Values beyond ±1.96 SD (above +1.96 or below -1.96) indicate severe skewness. NB: Calculating the skewness coefficient & its standard error is an option in most descriptive statistics modules in statistics programs.
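Fisher's measure divides the skewness coefficient by its standard error. The slides do not say which estimator a given program uses, so the sketch below assumes the bias-adjusted sample skewness and the standard-error formula commonly reported by statistics packages; treat both formulas, and the function name, as assumptions.

```python
from math import sqrt
from statistics import mean, stdev

def fisher_skewness(values):
    """Return (skewness, standard error, ratio) for a sample of n >= 3 values."""
    n = len(values)
    m, s = mean(values), stdev(values)
    # Bias-adjusted sample skewness (assumed estimator)
    g1 = n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in values)
    # Standard error of skewness under normality (assumed formula)
    se = sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    return g1, se, g1 / se

# Hypothetical positively skewed data
skew, se_skew, ratio = fisher_skewness([2, 3, 3, 4, 5, 6, 19])
print(abs(ratio) > 1.96)  # True: flagged as severely skewed by the ±1.96 rule
```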
32
Data Transformation: With skewed data, the mean is not a good measure of central tendency because it is sensitive to extreme scores. May need to transform skewed data to make the distribution appear more normal or symmetrical. Must determine the degree & type of skewness prior to transformation.
33
Data Transformation: If positive skewness, can apply either square root (moderate skew) or log (severe skew) transformations directly. If negative skewness, must “reflect” the variable to turn the negative skewness into positive skewness, then apply the transformations for positive skew.
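The transformations named here can be sketched as follows. The slides do not give a reflection formula; the version below assumes the common convention of subtracting each score from (maximum + 1), and the base-10 logarithm is likewise an assumption. Function names are hypothetical.

```python
import math

def reduce_positive_skew(values, severe=False):
    """Square root for moderate positive skew, log10 for severe positive skew."""
    if severe:
        return [math.log10(v) for v in values]  # requires every value > 0
    return [math.sqrt(v) for v in values]       # requires every value >= 0

def reflect(values):
    """Reflect scores so negative skew becomes positive skew.

    Assumed convention: new score = (max + 1) - old score.
    """
    pivot = max(values) + 1
    return [pivot - v for v in values]

# Hypothetical negatively skewed scores: reflect first, then transform
negatively_skewed = [1, 8, 9, 9, 10]
reflected = reflect(negatively_skewed)            # [10, 3, 2, 2, 1]
transformed = reduce_positive_skew(reflected)     # square roots of reflected scores
print(reflected, [round(t, 2) for t in transformed])
```

After reflection the highest original score becomes the lowest, which is why the interpretation of the scale reverses.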
34
Data Transformation: Reflecting a variable changes the meaning of the scores. Ex. If high scores on a self-esteem total score meant high self-esteem before reflection, they now mean low self-esteem after reflection.
35
Data Transformation: As a rule, it is best to transform skewed variables, but keep in mind that transformed variables may be harder to interpret. Once transformed, always check that the transformed variable is normally or nearly normally distributed. If transformation does not work, may need to dichotomize the variable for use in subsequent analyses.
36
Kurtosis: A measure of whether the curve of a distribution is: Bell-shaped -- Mesokurtic; Peaked -- Leptokurtic; Flat -- Platykurtic.
37
Fisher’s Measure of Kurtosis: Formula = kurtosis coefficient / standard error of kurtosis. Values beyond ±1.96 SD (above +1.96 or below -1.96) indicate severe kurtosis. NB: Calculating the kurtosis coefficient & its standard error is an option in most descriptive statistics modules in statistics programs.
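A sketch of the kurtosis ratio, under the same caveats as for skewness: the bias-adjusted excess-kurtosis estimator and the standard-error formula below are assumptions about what a statistics program reports, not something the slides specify. With this estimator a normal distribution scores about 0, peaked distributions score positive and flat ones negative.

```python
from math import sqrt
from statistics import mean, stdev

def fisher_kurtosis(values):
    """Return (excess kurtosis, standard error, ratio) for a sample of n >= 4 values."""
    n = len(values)
    m, s = mean(values), stdev(values)
    z4 = sum(((x - m) / s) ** 4 for x in values)
    # Bias-adjusted sample excess kurtosis (assumed estimator)
    g2 = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))) * z4 \
        - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    # Standard errors under normality (assumed formulas)
    se_skew = sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_kurt = 2 * se_skew * sqrt((n * n - 1) / ((n - 3) * (n + 5)))
    return g2, se_kurt, g2 / se_kurt

# Hypothetical flat (evenly spread) data
kurt, se_kurt, ratio = fisher_kurtosis([1, 2, 3, 4, 5])
print(round(kurt, 2), round(se_kurt, 2), abs(ratio) > 1.96)
```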
38
Types of Charts/Graphs: Line Chart: frequently used to display longitudinal trends. Box Plot: graphic display using descriptive statistics based on percentiles. Simultaneously shows the median, IQR, & smallest & largest values for a group. Sometimes called a “box-and-whiskers” plot.
39
LINE CHART Medication Error Tracking
40
BOX PLOT Examples
41
Outliers: An outlier is a value that is extreme relative to the bulk of scores in the distribution. May be due to: a data recording error; a failure in data collection; an actual extreme value from an unusual respondent.
42
Handling Outliers: Try analyzing the data with outliers included in the distribution & with outliers removed. If results are similar, the outliers are not a problem. Could use a trimmed mean (remove a certain percentage of respondents from each end of the data, then calculate a new mean). Ex. A 5% trimmed mean is calculated on the middle 90% of respondents’ scores (top 5% & bottom 5% of scores dropped prior to calculation).
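The 5% trimmed mean can be sketched as below. The function name and the rounding-down rule for the number of dropped cases are assumptions; statistics programs may handle fractional case counts differently. The data are hypothetical.

```python
from statistics import mean

def trimmed_mean(values, proportion=0.05):
    """Mean after dropping `proportion` of cases from each tail."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # cases dropped per tail (rounded down)
    return mean(ordered[k:len(ordered) - k])

# Hypothetical scores: 1..19 plus one extreme value
scores = list(range(1, 20)) + [500]

print(mean(scores))           # 34.5, pulled upward by the outlier
print(trimmed_mean(scores))   # 10.5, based on the middle 90% of scores
```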
43
Handling Outliers: Move the outlier scores closer to the bulk of scores in the distribution by recoding them. This makes the outliers less deviant while they keep their relative position (rank order) in the distribution. Sometimes this method can reduce a serious skewness problem.
44
Missing Data: Especially problematic in longitudinal & repeated measures studies. Data analyst must: identify the pattern & amount of missing data; assess why it is missing; determine what to do about it.
45
Pattern & Amount of Missing Data: Pattern is more important than amount of missing data. Two basic patterns: Random pattern -- values missing in an unplanned or haphazard fashion throughout the dataset. Systematic pattern -- values missing in a methodical, nonrandom way throughout the data.
46
Pattern & Amount of Missing Data: If only a few data values are missing in a random pattern from a large dataset, no problem. If many data are missing from a small or moderate-sized sample, serious problems can ensue.
47
Random Missing Data Categories: Missing Completely at Random (MCAR); Missing at Random (MAR); Not Missing at Random (NMAR)
48
Missing Completely at Random (MCAR): Have the highest degree of randomness, showing no underlying reason that would contribute to biased data. MCAR data are randomly distributed across all cases & completely unrelated to other variables in the dataset.
49
Missing at Random (MAR): Display some randomness in the pattern of missing data, which can be traced or predicted from cases with no missing data. Occurs when the probability of a missing value is not dependent on the value itself but may rely on values of other variables in the dataset.
50
Not Missing at Random (NMAR): Occurs when missing values are systematically different from those observed, even from respondents with other similar characteristics. Systematic missing data, even in a few cases, should always be treated seriously because they affect the generalizability of results.
51
Testing for Patterns of Missing Data: Create a grouping variable with two levels: 1 = cases with missing values on the variable; 0 = cases with no missing values on the variable. Perform a test of difference (t-test, chi-square) using this grouping variable on the dependent variable(s). If serious differences are noted, systematic missing (NMAR) data are present and must be handled.
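The grouping-variable approach can be sketched with the standard library alone. The records, field names, and the choice of Welch's (unequal-variance) t statistic are all assumptions made for illustration; in practice a statistics package would also supply the degrees of freedom and p-value.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical records: income is missing (None) for some respondents
records = [
    {"age": 30, "income": 41000}, {"age": 70, "income": None},
    {"age": 35, "income": 52000}, {"age": 72, "income": None},
    {"age": 40, "income": 48000}, {"age": 68, "income": None},
    {"age": 32, "income": 39000}, {"age": 75, "income": None},
    {"age": 38, "income": 55000},
]

# Grouping variable: 1 = income missing, 0 = income present
for r in records:
    r["income_missing"] = 1 if r["income"] is None else 0

def welch_t(a, b):
    """Welch's t statistic for the difference between two group means."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))

age_missing = [r["age"] for r in records if r["income_missing"] == 1]
age_present = [r["age"] for r in records if r["income_missing"] == 0]

t_stat = welch_t(age_missing, age_present)
print(round(t_stat, 2))  # a large |t| suggests missingness is related to age
```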
52
Assessing Why Data Are Missing: Missing Data Process (MDP): any systematic event external to the respondent (data entry error or data collection problem) or action on the respondent’s part (refusal to answer) that leads to missing data. If the MDP is under the researcher’s control and can be explicitly defined, then the missing data can be ignored & no specific remedies are needed.
53
Assessing Why Data Are Missing: Often, the researcher has no idea why data are missing. Thus, need to examine the pattern of missing data. Major Question: Are respondents with missing data on some variables different from respondents with no missing data on these variables?
54
Handling Missing Data: Complete-Case Deletion (Listwise deletion); Available-Case Deletion (Pairwise deletion); Deleting Cases or Variables; Weighting Techniques; Estimating Missing Data through Imputation
55
Listwise Deletion: Analyzes only those cases with complete data. Easiest method for handling missing data. Often the default option in statistics programs. Use if the amount of missing data is small, the sample is sufficiently large & relationships in the data are strong enough to survive deleting cases.
56
Pairwise Deletion: Use only those cases with no missing data on the variables for a specific analysis. Commonly an option in statistics programs. Often used for correlations, linear regression & factor analysis.
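The difference between listwise and pairwise deletion is easiest to see in the number of cases each retains. The sketch below uses hypothetical records and field names, with `None` marking a missing value.

```python
# Hypothetical records; None marks a missing value
rows = [
    {"sbp": 120,  "bmi": 24.0, "age": 50},
    {"sbp": 135,  "bmi": None, "age": 61},
    {"sbp": None, "bmi": 29.5, "age": 47},
    {"sbp": 142,  "bmi": 31.0, "age": None},
    {"sbp": 118,  "bmi": 22.5, "age": 39},
]

def listwise(rows):
    """Keep only cases that are complete on every variable."""
    return [r for r in rows if all(v is not None for v in r.values())]

def pairwise(rows, a, b):
    """Keep cases that are complete on the two variables in this analysis."""
    return [r for r in rows if r[a] is not None and r[b] is not None]

print(len(listwise(rows)))                # 2 complete cases
print(len(pairwise(rows, "sbp", "bmi")))  # 3 cases for the sbp-bmi analysis
print(len(pairwise(rows, "sbp", "age")))  # 3 cases, but a different subset
```

Pairwise deletion keeps more data per analysis, at the cost that each analysis may rest on a different subset of respondents.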
57
Deleting Cases or Variables: Have a preset amount of missing data that can be tolerated (5% - 10%). Remove all cases or variables that exceed that amount. Good solution if the sample size is large enough.
58
Weighting Techniques: Disregard missing values & assign a weight to cases with complete data. Weight cases with no missing data higher than those with missing data. Decreases bias from case deletion methods as well as sample variance. Less common procedure than other missing data handling methods.
59
Missing Data Estimation Via Imputation: Process of estimating missing data based on valid values of other variables or cases in the sample. Goal is to use known relationships that can be identified in the valid values of the sample to help estimate the missing data.
60
Missing Data Estimation Methods: Prior Knowledge: replace the missing value with a value based on an educated guess. Mean/Median Replacement: replace the missing value with the variable mean or median. Regression: use other variables in the dataset as independent variables to develop a regression equation for the variable with missing data (dependent variable).
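Mean/median replacement is the simplest of these methods to sketch. The function name, the `strategy` parameter, and the data are hypothetical; `None` marks a missing value.

```python
from statistics import mean, median

def impute(values, strategy="mean"):
    """Replace missing values (None) with the mean or median of observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]

# Hypothetical variable with one missing value
pain_scores = [2, None, 4, 12]

print(impute(pain_scores, "mean"))    # [2, 6, 4, 12]
print(impute(pain_scores, "median"))  # [2, 4, 4, 12]
```

Every imputed case receives the same value, so this approach shrinks the variable's variance; that is one reason the regression and iterative methods are often preferred when much data is missing.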
61
Missing Data Estimation Methods: Expectation Maximization (EM): iterative process that can be used with randomly missing data. SPSS Missing Values Analysis performs EM to produce imputed values. Multiple Imputation (MI): iterative process that produces several datasets (3 - 5) with imputed values for missing data, then averages the resulting estimates & standard errors.