1
Descriptive Statistics, The Normal Distribution,
and Standardization © Scott Evans, Ph.D. and Lynne Peeples, M.S.
2
Happy Valentine’s Day! How many candy hearts are in a box of NECCO Sweethearts? 1, 2, 3, 4, …, 40?
3
Big Picture revisited…
Population: μ, σ, σ². Sample: x̄, s, s². Statistical Inference (w/ Probability).
1. Take a sample (Step I).
2. Describe the data: identify important features and extract useful information (Step II).
3. Make statistical inferences, using probability (Step III).
4
Step I: Take the Sample
POPULATION = all boxes of Sweethearts (μ, σ, σ²). About 8 billion hearts are made each year at NECCO!
SAMPLE = boxes of Sweethearts (x̄, s, s²).
Beware that poor samples may provide a distorted view of the population. This can happen due to biased sampling or due to chance. In general, a larger sample is more representative of the population.
We want a representative sample of boxes (i.e., hoping for different batches, purchased at different stores in Cambridge and Boston). The larger the sample, the better.
5
Step I: Take the Sample
x1 = 29, x2 = 31, x3 = 32, x4 = 27, x5 = 36, x6 = 35, x7 = 29, x8 = 30, x9 = 31, x10 = 29, x11 = 28, x12 = 33
SAMPLE = 12 boxes, with Sweetheart counts ranging from 27 to 36.
6
Step II: Describe the Sample
Descriptive statistics: measures of central tendency, measures of variability, and other descriptive measures. How can we describe our Sweetheart sample?
7
Measures of Central Tendency
These measure the “center” of the data. Examples: mean, median, mode. The choice of which to use depends… It is okay to report more than one; they are simply descriptive (not inferential). However, when presenting (e.g., in a journal) with limited space, you may be forced to choose.
8
Mean
The “average”. If the data are made up of n observations x1, x2, …, xn, then the mean is the sum of the observations divided by the number of observations. For example, if the data are x1 = 1, x2 = 2, x3 = 3, then the mean is (1+2+3)/3 = 2. Often denoted as x̄: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
9
Mean
The population mean is often denoted by μ. This is usually unknown (although we try to make inferences about it). The sample mean is an estimator of the population mean.
10
Mean
What is the mean of our sample of Sweethearts? x̄ = ( … )/12 = 370/12 = 30.83 ≈ 31 Sweethearts
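The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch; the list name `counts` is an assumption introduced here, and the values are the twelve box counts listed under Step I.

```python
from statistics import mean

# Hearts per box for the 12 sampled boxes (x1..x12 from the Step I slide).
counts = [29, 31, 32, 27, 36, 35, 29, 30, 31, 29, 28, 33]

total = sum(counts)        # sum of the observations
n = len(counts)            # number of observations
sample_mean = total / n    # sum divided by the number of observations

print(total, n, round(sample_mean, 2))  # prints: 370 12 30.83
```

`statistics.mean(counts)` returns the same value without the intermediate steps.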
11
Median
The “middle observation” according to its rank in the data. The median is:
The observation with rank (n+1)/2 if n is odd. For example, if the data are {1,2,3}, then the median is 2.
The average of the observations with ranks n/2 and (n+2)/2 if n is even. For example, if the data are {1,2,3,4}, then the median is 2.5.
12
Median
What is the median of our sample of Sweethearts? Sort our 12 boxes in order by counts: 27, 28, 29, 29, 29, 30 | 31, 31, 32, 33, 35, 36. In our example, 30 and 31 are the middle numbers, so the median = 30.5.
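The rank rule for the median can be sketched in Python as follows. The function name `median_by_rank` is hypothetical (it is not from the slides), and ranks are taken to start at 1, as in the definition of the median.

```python
def median_by_rank(values):
    """Median via the rank rule: ranks start at 1 in the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the observation with rank (n + 1) / 2.
        return ordered[(n + 1) // 2 - 1]
    # Even n: average of the observations with ranks n/2 and (n + 2)/2.
    return (ordered[n // 2 - 1] + ordered[(n + 2) // 2 - 1]) / 2

# Hearts per box for the 12 sampled boxes.
counts = [29, 31, 32, 27, 36, 35, 29, 30, 31, 29, 28, 33]

print(median_by_rank([1, 2, 3]))     # 2
print(median_by_rank([1, 2, 3, 4]))  # 2.5
print(median_by_rank(counts))        # 30.5
```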
13
Median
Another example: income level with Bill Gates in the room. The median is more robust than the mean to extreme observations.
If data are skewed to the right, then mean > median (in general). For example, if the data are {1,2,3,4,20}, then median = 3 and mean = 6.
If data are skewed to the left, then mean < median (in general). For example, if the data are {1,15,16,18,20}, then median = 16 and mean = 14.
If data are symmetric, then mean ≈ median.
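A short Python check of the two skewed examples, plus one extra illustration of robustness. The third data set, where 20 is replaced by 2000, is an assumption added here and is not from the slides.

```python
from statistics import mean, median

right_skewed = [1, 2, 3, 4, 20]     # long tail to the right
left_skewed = [1, 15, 16, 18, 20]   # long tail to the left

# Illustrative only: make the extreme value far more extreme.
more_extreme = [1, 2, 3, 4, 2000]

for label, data in [("right", right_skewed),
                    ("left", left_skewed),
                    ("extreme", more_extreme)]:
    print(label, "mean =", mean(data), "median =", median(data))
```

The median stays at 3 when 20 becomes 2000, while the mean jumps from 6 to 402. This is the sense in which the median is more robust to extreme observations.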
14
Mode
The value that occurs most often. For example, if the data are {1,1,2,2,2,2,3,3}, the mode is 2.
Good for ordinal or nominal data in which there are a limited number of categories.
Not very useful for continuous data. For example, if the data are {2,2,3,4,5,6,7,8,9}, the mode is 2, but it is not a good measure of central tendency in this case.
In our Sweetheart example, 29 appears most often (3 times).
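The mode examples can be reproduced with the Python standard library. This is a sketch; the variable names are assumptions introduced here.

```python
from collections import Counter
from statistics import mode

# Hearts per box for the 12 sampled boxes.
counts = [29, 31, 32, 27, 36, 35, 29, 30, 31, 29, 28, 33]

example_mode = mode([1, 1, 2, 2, 2, 2, 3, 3])  # most frequent value
sweetheart_mode = mode(counts)                  # most frequent box count
times_seen = Counter(counts)[sweetheart_mode]   # how often it occurs

print(example_mode, sweetheart_mode, times_seen)  # prints: 2 29 3
```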
15
Measures of Variability
These measure the “spread” in the data. Example: age distribution in the Extension School vs. FAS college. Some important measures: variance, standard deviation, range, interquartile range. The larger the value of these measures, the larger the spread and variability.
16
Variance
The sample variance (s²) may be calculated from the data. It is (approximately) the average of the squared deviations of the observations from the mean: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
The population variance is often denoted by σ². This is usually unknown.
17
Variance
The deviations are squared because we are only interested in the size of the deviation rather than its direction (larger or smaller than the mean). Note: we divide by n − 1 rather than n. Why?
18
Variance
The reason that we divide by n − 1 instead of n has to do with the number of “information units” in the variance. After estimating the sample mean, there are only n − 1 observations that are a priori unknown (degrees of freedom).
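One informal way to see the effect of the n − 1 divisor is a small simulation. The setup below is an assumption made for illustration and is not from the slides: samples of size 5 are drawn from a normal population with variance 1, and the sum of squared deviations is divided by n and by n − 1.

```python
import random

random.seed(0)   # fixed seed so the run is repeatable
n = 5            # observations per sample (illustrative choice)
reps = 20000     # number of simulated samples (illustrative choice)

sum_div_n = 0.0
sum_div_n_minus_1 = 0.0
for _ in range(reps):
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]  # true variance is 1
    xbar = sum(sample) / n
    ss = sum((x - xbar) ** 2 for x in sample)  # sum of squared deviations
    sum_div_n += ss / n
    sum_div_n_minus_1 += ss / (n - 1)

avg_div_n = sum_div_n / reps
avg_div_n_minus_1 = sum_div_n_minus_1 / reps
print(round(avg_div_n, 2), round(avg_div_n_minus_1, 2))
```

On average, dividing by n gives a value near (n − 1)/n = 0.8 of the true variance, while dividing by n − 1 gives a value near the true variance of 1.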
19
Variance
For our Sweetheart data… $s^2 = \frac{\sum_{i=1}^{12}(x_i - 30.83)^2}{12 - 1} = \frac{83.67}{11} = 7.61$
Note that this is in squared units, not 7.61 hearts…
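The sample variance of the Sweetheart data can be recomputed step by step in Python. This is a sketch; the variable names are assumptions introduced here.

```python
from statistics import variance

# Hearts per box for the 12 sampled boxes.
counts = [29, 31, 32, 27, 36, 35, 29, 30, 31, 29, 28, 33]

n = len(counts)
xbar = sum(counts) / n
squared_deviations = [(x - xbar) ** 2 for x in counts]
sum_sq = sum(squared_deviations)   # sum of squared deviations from the mean
s2 = sum_sq / (n - 1)              # divide by n - 1 (degrees of freedom)

print(round(sum_sq, 2), round(s2, 2))  # prints: 83.67 7.61
```

`statistics.variance(counts)` also uses the n − 1 divisor and returns the same value.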
20
Standard Deviation
The square root of the variance.
s = sqrt(s²) = sample SD. Calculate it from the data (see the formula for s²).
σ = sqrt(σ²) = population SD. Usually unknown.
Expressed in the same units as the mean (instead of squared units like the variance).
In our Sweetheart example, s = sqrt(7.61) ≈ 2.76 Sweethearts.
We have now summarized the sample with just 2 numbers!
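As a quick check, the sample SD can be computed directly from the twelve box counts in Python. This is a sketch with variable names assumed here.

```python
import math
from statistics import stdev, variance

# Hearts per box for the 12 sampled boxes.
counts = [29, 31, 32, 27, 36, 35, 29, 30, 31, 29, 28, 33]

s2 = variance(counts)   # sample variance (n - 1 divisor), in squared units
s = math.sqrt(s2)       # sample SD, in the same units as the data

print(round(s2, 2), round(s, 2))  # prints: 7.61 2.76
```

`statistics.stdev(counts)` gives the same SD in one call.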
21
Range
Maximum − Minimum. Sweetheart example: 36 − 27 = 9. Very sensitive to extreme observations (outliers).
22
Interquartile Range
IQR = Q3 − Q1, where Q1 is the first quartile and Q3 is the third quartile. More robust than the range to extreme observations.
In our example: 27, 28, 29 | 29, 29, 30 | 31, 31, 32 | 33, 35, 36. Q1 = 29 and Q3 = 32.5, so IQR = 32.5 − 29 = 3.5 Sweethearts.
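The IQR of 3.5 is what results from taking each quartile as the median of one half of the sorted data. The Python sketch below assumes that convention. Other quartile conventions exist and can give slightly different values, so this is one reasonable reading rather than the only one.

```python
from statistics import median

# Hearts per box for the 12 sampled boxes.
counts = [29, 31, 32, 27, 36, 35, 29, 30, 31, 29, 28, 33]

ordered = sorted(counts)
n = len(ordered)
lower_half = ordered[: n // 2]        # observations below the median
upper_half = ordered[(n + 1) // 2:]   # observations above the median

q1 = median(lower_half)   # first quartile
q3 = median(upper_half)   # third quartile
iqr = q3 - q1

print(q1, q3, iqr)  # prints: 29.0 32.5 3.5
```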
23
Other Descriptive Measures
Minimum and maximum: very sensitive to extreme observations.
Sample size (N), e.g., 12 boxes.
Percentiles. Examples: median = 50th percentile; Q1, Q3 = 25th and 75th percentiles.
24
Small Samples
For very small samples (e.g., <5 observations), summary statistics are not very meaningful (and can actually be misleading). It is better to simply list the data.
25
Example – Firefighter CHD Study
Table 4: CHD Retirements versus Active Firefighters (Controls)
Values are Mean (Median) or % (n).

Characteristic                             CHD Retirements (n=277)   Active Firefighters (n=310)
Age                                        54.2 (55.0)               39.3 (39.0)
Age ≥ 45 years old                         94% (261)                 21% (64)
Current Smoking                            30% (76)                  10% (31)
Hypertension                               59% (141)                 21% (65)
Cholesterol ≥ 5.18 mmol/L (200 mg/dL)      80% (169)                 63% (196)
Prior Diagnosis of CHD                     22% (48)                  1% (3)
BMI                                        30.3 (29.8)               28.9 (28.4)
Obesity, BMI ≥ 30                          41% (98)                  34% (104)
26
Example – A5095

Characteristic                             TZV (n=382)   Pooled EFV (n=765)
Male                                       81%           81%
Mean age, years
Race or ethnic group
  Non-Hispanic White                       39%           41%
  Non-Hispanic Black                       37%           36%
  Hispanic                                 21%           21%
  Other                                    2%            <1%
Mean baseline HIV RNA, log10 c/mL
  100,000 c/mL at screening                43%           43%
Mean baseline CD4 count, cells/mm³
27
Random Variables
Variable: a characteristic that can be measured, categorized, quantified, or qualified.
Random variable: a variable whose value is determined by a random phenomenon (i.e., not determined by study design).
Continuous random variable: can take on any value within a specified interval or continuum.
28
Probability Distributions
Every random variable has a corresponding probability distribution. A probability distribution describes the behavior of the random variable: it identifies the possible values of the random variable and provides information about the probability that these values (or ranges of values) will occur. A particularly important probability distribution is the Normal Distribution…
29
Normal Distribution
30
Normal Distribution
“Bell-shaped curve”. Symmetric about its mean (μ). The closer a value is to the mean, the more frequently it occurs. Notation: X~N(μ,σ)
31
Location & Shape
μ = LOCATION, σ = SHAPE. Note that some distributions may have the same mean but be differentiated by their spread (shape).
32
Normal Distribution
The normal distribution, N(μ, σ), can be described by the following “density function”: $f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
33
Normal Distribution
The area under this curve (function) is one. Probabilities may be calculated as the area under the curve (above the x-axis). Integration (calculus) can help quantify these areas (probabilities).
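The idea of probability as area under the curve can be sketched numerically in Python. Several things here are assumptions for illustration only: the helper names `normal_pdf` and `area`, the use of the trapezoid rule, and the choice μ = 100, σ = 15. The density is the standard normal density with mean μ and SD σ.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma), where sigma is the SD."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def area(a, b, mu, sigma, steps=10_000):
    """Trapezoid-rule approximation of the area under the density on [a, b]."""
    h = (b - a) / steps
    total = 0.5 * (normal_pdf(a, mu, sigma) + normal_pdf(b, mu, sigma))
    for i in range(1, steps):
        total += normal_pdf(a + i * h, mu, sigma)
    return total * h

mu, sigma = 100.0, 15.0   # illustrative values
total_area = area(mu - 8 * sigma, mu + 8 * sigma, mu, sigma)   # essentially the whole curve
within_one_sd = area(mu - sigma, mu + sigma, mu, sigma)        # P(mu - sigma < X < mu + sigma)

print(round(total_area, 4), round(within_one_sd, 4))
```

The total area comes out at about 1, and the area within one SD of the mean at about 0.68. In practice a library routine such as `statistics.NormalDist(mu, sigma).cdf` replaces the manual integration.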
34
Moving towards Step III…
Population: μ, σ, σ². Sample: x̄, s, s². Statistical Inference (w/ Probability). Step I, Step II, Step III.
35
Standard Normal Distribution
A special normal distribution: N(0,1). Values from this distribution represent the number of SDs away from the mean (0). The properties of this distribution are known, so we can make probabilistic statements using the standard normal table.
36
Standard Normal Distribution
For any variable X with mean μ and SD σ, define $Z = \frac{X - \mu}{\sigma}$. Z now has mean 0 and SD 1. This “standardization” creates a variable Z whose values represent the number of SDs away from the mean (0).
[Figure: axis marked at μ−2σ, μ−σ, μ, μ+σ, μ+2σ]
37
Standard Normal Table
38
Standardization
Common mistake: “If X has mean μ and SD σ, then Z = (X − μ)/σ ~ N(0,1).” This is NOT true in general! It is true that Z has mean 0 and SD 1 (standardization). However, Z is normal only if X was also normal.
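A small Python illustration of this point, using the right-skewed data set {1,2,3,4,20} from the median slide. The variable names are assumptions introduced here. Standardizing gives mean 0 and SD 1, but the z-scores keep the skewed shape of the original data.

```python
from statistics import mean, stdev

data = [1, 2, 3, 4, 20]   # right-skewed, clearly not normal

xbar = mean(data)
s = stdev(data)
z = [(x - xbar) / s for x in data]   # standardized values

print([round(v, 2) for v in z])
print(round(mean(z), 6), round(stdev(z), 6))
```

The largest z-score lies much further from 0 than the smallest one, so the standardized values are still skewed to the right even though their mean is 0 and their SD is 1.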
39
Standardization
However, if X~N(μ,σ), then Z = (X−μ)/σ ~ N(0,1). We can then make probabilistic statements about X. Thus we can make probabilistic statements about any variable with any normal distribution.
40
Example - IQ
IQ~N(100,15). What’s the probability that a person chosen at random has an IQ > 135? Z = (135 − 100)/15 = 2.33
41
Example - IQ
P(Z > 2.33) = 0.010
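The table lookup can be cross-checked with Python's standard library. This is a sketch: `statistics.NormalDist` takes the mean and the SD, matching the N(100, 15) notation, and the variable names are assumptions introduced here.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)   # IQ ~ N(100, 15), with 15 as the SD
standard_normal = NormalDist()      # N(0, 1)

z = (135 - iq.mean) / iq.stdev              # standardize 135
p_table = 1 - standard_normal.cdf(2.33)     # using z rounded to 2.33, as in a table
p_direct = 1 - iq.cdf(135)                  # same probability without rounding z

print(round(z, 2), round(p_table, 3), round(p_direct, 3))
```

Both routes give roughly 0.010, so about 1% of people would be expected to have an IQ above 135 under this model.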
42
Example – IQ
What’s the probability that a person chosen at random has an IQ < 90? Z = (90 − 100)/15 = −0.67. By symmetry, P(Z < −0.67) = P(Z > 0.67). Probabilities that a person chosen at random has an IQ between two values may also be obtained.
43
Example - IQ
P(Z > 0.67) = 0.251
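A Python sketch of the same calculation, including the symmetry step and a probability between two values. The interval 90 to 135 is chosen here only for illustration, and the variable names are assumptions.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)   # IQ ~ N(100, 15), with 15 as the SD
standard_normal = NormalDist()      # N(0, 1)

z = round((90 - 100) / 15, 2)                  # -0.67, rounded as for a table lookup
p_lower_tail = standard_normal.cdf(z)          # P(Z < -0.67)
p_upper_tail = 1 - standard_normal.cdf(-z)     # P(Z > 0.67), equal by symmetry

# Probability of an IQ between two values: difference of two cumulative probabilities.
p_between = iq.cdf(135) - iq.cdf(90)

print(z, round(p_lower_tail, 3), round(p_upper_tail, 3), round(p_between, 2))
```

The two tail probabilities agree at about 0.251, and the probability of an IQ between 90 and 135 comes out at roughly 0.74.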
44
Central Limit Theorem
A very important result in statistics that permits use of the normal distribution for making inferences (hypothesis testing and estimation) concerning the population mean.
45
Central Limit Theorem
[Diagram: from a population (any distribution) with mean μ and SD σ, draw Samples 1 to 5, all of size n; each sample yields a sample mean (x̄1, …, x̄5).]
46
Central Limit Theorem
If the distribution of each observation in the population has mean μ and standard deviation σ, then regardless of whether that distribution is normal or not:
1. The distribution of the sample means (from samples of size n taken from the population) has mean μ, identical to that of the population.
2. The standard deviation of this distribution is $\sigma/\sqrt{n}$.
3. As n gets large, the shape of the distribution of the sample means is approximately that of a normal distribution.
47
Central Limit Theorem
Variable X, population mean = 100, SD = 15. Samples of size 25 (for example): Sample 1, mean = 90; Sample 2, mean = 115; Sample 3, mean = 101; Sample 4, mean = 94; …; Sample 30, mean = 99.
48
Central Limit Theorem
Plot the sample means (histogram). The sample means have mean 100. The sample means have an SD of $\sigma/\sqrt{n} = 15/\sqrt{25} = 15/5 = 3$. The distribution of the sample means would tend to be normal as n gets large.
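The behaviour of the sample means can be explored with a short simulation. Everything specific below is an assumption made for illustration: the population is taken to be uniform (not normal) with mean 100 and SD 15, and 2000 samples are drawn rather than 30 so that the estimates are stable.

```python
import math
import random
from statistics import mean, stdev

random.seed(1)                      # fixed seed so the run is repeatable
mu, sigma = 100.0, 15.0             # population mean and SD
n = 25                              # size of each sample
num_samples = 2000                  # illustrative number of samples

# A uniform distribution on [mu - w, mu + w] has SD w / sqrt(3).
half_width = sigma * math.sqrt(3)

def draw_sample_mean():
    """Draw one sample of size n from the uniform population; return its mean."""
    sample = [random.uniform(mu - half_width, mu + half_width) for _ in range(n)]
    return sum(sample) / n

sample_means = [draw_sample_mean() for _ in range(num_samples)]

mean_of_means = mean(sample_means)   # should be close to mu = 100
sd_of_means = stdev(sample_means)    # should be close to sigma / sqrt(n) = 3

print(round(mean_of_means, 1), round(sd_of_means, 1))
```

A histogram of `sample_means` would look roughly bell-shaped even though the population itself is flat.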
49
Central Limit Theorem
Now we can combine this normality result from the CLT with standardization to make probabilistic statements about the population mean!
50
Sampling Distribution of Sweethearts
Assume, μ = … Then σ/√n = 2.71/√12 = 2.71/3.46 = 0.78.
[Figure panels: Population Distribution; Sampling Distribution of Means]
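The division on this slide is the SD of the sample means, σ/√n. A minimal Python check follows, taking σ = 2.71 and n = 12 as given on the slide; the variable names are assumptions introduced here.

```python
import math

sigma = 2.71   # SD value used on the slide
n = 12         # number of boxes in the sample

root_n = math.sqrt(n)
sd_of_sample_means = sigma / root_n   # SD of the sampling distribution of means

print(round(root_n, 2), round(sd_of_sample_means, 2))  # prints: 3.46 0.78
```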
51
Sampling Distribution of Sweethearts
[Figure panels: Population Distribution; Sampling Distribution of Means]