The Practice of Statistics in the Life Sciences Fourth Edition

1 The Practice of Statistics in the Life Sciences Fourth Edition
Chapter 13: Sampling distributions Copyright © 2018 W. H. Freeman and Company

2 Objectives
Sampling distributions
Parameter versus statistic
Sampling distribution of the sample mean
The central limit theorem
Sampling distribution of the sample proportion
The law of large numbers

3 Parameter versus Statistic
Population: the entire group of individuals in which we are interested but usually can’t assess directly. A parameter is a number summarizing the population. Parameters are usually unknown.
Sample: the part of the population we actually examine and for which we do have data. A statistic is a number summarizing a sample. We often use a statistic to estimate an unknown population parameter.

4 Sampling distributions
Different random samples taken from the same population will give different statistics. But there is a predictable pattern in the long run. A statistic computed from a random sample is a random variable. The sampling distribution of a statistic is the probability distribution of that statistic for samples of a given size n taken from a given population.

5 Sampling distribution of the sample mean
The mean of the sampling distribution of x̅ is μ. There is no tendency for a sample average to fall systematically above or below μ, even if the population distribution is skewed. x̅ is an unbiased estimate of the population mean μ.
The standard deviation of the sampling distribution of x̅ is σ/√n. It measures how much the sample mean x̅ varies from sample to sample. Averages are less variable than individual observations.
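These two facts can be checked with a quick simulation sketch (not part of the original slides; the values µ = 170, σ = 30, n = 25 are borrowed from the cholesterol example later in this chapter):

```python
import random
import statistics

# Draw many random samples of size n from a Normal population with known
# mu and sigma, and check that the sample means center on mu with spread
# close to sigma/sqrt(n).
random.seed(42)
mu, sigma, n = 170, 30, 25   # illustrative values
reps = 20_000

sample_means = [
    statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
    for _ in range(reps)
]

mean_of_means = statistics.fmean(sample_means)   # close to mu = 170
sd_of_means = statistics.stdev(sample_means)     # close to sigma/sqrt(n) = 6
```

The simulated mean of the sample means lands near µ (unbiasedness), and their standard deviation lands near σ/√n = 6, not σ = 30 (averages are less variable than individuals).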

6 For Normally distributed populations (1 of 2)
When a variable in a population is Normally distributed, the sampling distribution of the sample mean x̅ is also Normally distributed.

7 For Normally distributed populations (2 of 2)
Population: N(µ, σ). Sampling distribution of x̅: N(µ, σ/√n).

8 Sampling distribution example (1 of 2)
The blood cholesterol levels of 14-year-old boys are ~ N(µ = 170, σ = 30) mg/dL. The population: Use the 68–95–99.7% rule for approximate Normal calculations. The middle 99.7% of cholesterol levels in boys spans 80 to 260 mg/dL.

9 Sampling distribution example (2 of 2)
Now consider random samples of 25 boys. The sampling distribution of average cholesterol levels is ~ N(µ = 170, σ = 30/√25 = 6) mg/dL: Use the 68–95–99.7% rule for approximate Normal calculations. The middle 99.7% of average cholesterol levels (of 25 boys) is 152 to 188 mg/dL.
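The 152–188 mg/dL interval above can be reproduced directly; this short sketch (not part of the original slides) applies the 99.7% part of the rule to the sampling distribution:

```python
import math

# Middle 99.7% of sample means: mu +/- 3 * sigma/sqrt(n)
mu, sigma, n = 170, 30, 25
se = sigma / math.sqrt(n)             # standard deviation of x-bar: 30/5 = 6
low, high = mu - 3 * se, mu + 3 * se  # 152.0 and 188.0 mg/dL
print(low, high)
```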

10 Another sampling distribution example
Deer mice (Peromyscus maniculatus) have a body length (excluding the tail) known to vary Normally, with mean body length µ = 86 mm and standard deviation σ = 8 mm. For random samples of 20 deer mice, the distribution of the sample mean body length is
A) Normal, mean 86, standard deviation 8 mm.
B) Normal, mean 86, standard deviation 20 mm.
C) Normal, mean 86, standard deviation 1.789 mm.
D) Normal, mean 86, standard deviation 3.9 mm.
Answer: C. The population distribution is N(µ = 86, σ = 8); the sampling distribution for n = 20 is N(µ = 86, σ/√n = 8/√20 ≈ 1.789).

11 Standardizing a Normal sample distribution (1 of 2)
When the sampling distribution is Normal, we can standardize the value of a sample mean x̅ to obtain a z-score. This z-score can then be used to find areas under the sampling distribution from Table B.
If x̅ ~ N(µ, σ/√n), then z = (x̅ − µ) / (σ/√n) ~ N(0, 1).
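The standardization step can be written as a small helper (an illustrative sketch, not from the slides; the 176 mg/dL input reuses the earlier cholesterol example):

```python
import math

def z_score(xbar, mu, sigma, n):
    """Standardize a sample mean against the N(mu, sigma/sqrt(n)) sampling distribution."""
    return (xbar - mu) / (sigma / math.sqrt(n))

# e.g. a sample mean of 176 mg/dL from n = 25 boys, population N(170, 30):
z = z_score(176, 170, 30, 25)   # (176 - 170) / 6 = 1.0
```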

12 Standardizing a Normal sample distribution (2 of 2)
Here, we work with the sampling distribution, and σ /√n is its standard deviation (indicative of spread). Remember that σ is the standard deviation of the original population.

13 Standardization example (1 of 2)
Hypokalemia is diagnosed when blood potassium levels are low, below 3.5 mEq/dL. Let’s assume that we know a patient whose measured potassium levels vary daily according to N(µ = 3.8, σ = 0.2). (Note that this mean is high enough that the patient is not considered to be hypokalemic.) If only one measurement is made, what is the probability that this patient will be misdiagnosed as hypokalemic?
z = (x − µ) / σ = (3.5 − 3.8) / 0.2 = −1.5
P(z < −1.5) = 0.0668 ≈ 7%
This question asks about a single measurement, and therefore relies on the population distribution N(µ, σ).

14 Standardization example (2 of 2)
If instead measurements are taken on four separate days and averaged, what is the probability of such a misdiagnosis?
z = (x̅ − µ) / (σ/√n) = (3.5 − 3.8) / (0.2/√4) = −3
P(z < −3) = 0.0013 ≈ 0.1%
Note: This calculation demonstrates that an average of 4 measurements is more likely to be close to the true mean than an individual measurement is. This question asks about the average of four measurements, and therefore relies on the sampling distribution N(µ, σ/√n). Make sure to standardize (z) using the standard deviation of the sampling distribution.
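Both misdiagnosis probabilities can be computed without a table, using the standard Normal CDF written in terms of the error function (a stdlib sketch, not part of the original slides):

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    # Lower-tail area under N(mu, sigma), via the error function
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# Single measurement: population distribution N(3.8, 0.2)
p_single = normal_cdf(3.5, 3.8, 0.2)               # about 0.067 (z = -1.5)
# Average of 4 measurements: sampling distribution N(3.8, 0.2/sqrt(4))
p_avg4 = normal_cdf(3.5, 3.8, 0.2 / math.sqrt(4))  # about 0.0013 (z = -3)
```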

15 The central limit theorem
Central limit theorem: When randomly sampling from any population with mean µ and standard deviation σ, when n is large enough, the sampling distribution of x̅ is approximately Normal: N(µ, σ/√n). The larger the sample size n, the better the approximation of Normality. This is very useful in inference: Many statistical tests assume Normality for the sampling distribution. The central limit theorem tells us that, if the sample size is large enough, we can safely make this assumption even if the raw data appear non-Normal.

16 How large a sample size?
It depends on the population distribution. More observations are required if the population distribution is far from Normal.
A sample size of 25 or more is generally enough to obtain a Normal sampling distribution from a skewed population, even with mild outliers in the sample.
A sample size of 40 or more will typically be good enough to overcome an extremely skewed population and mild (but not extreme) outliers in the sample.
In many cases, n = 25 isn’t a huge sample. Thus, even for strange population distributions we can assume a Normal sampling distribution of the sample mean, and work with it to solve problems. There is no formula setting these numbers in stone. They are sample sizes that have been shown, via simulations, to lead to approximately Normal sampling distributions for given population distribution shapes. Use them as rough guidelines.
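This kind of simulation is easy to rerun yourself; here is a minimal sketch (not from the slides) using a strongly right-skewed exponential population, for which both µ and σ equal 1:

```python
import random
import statistics

# Sample means (n = 40) from a right-skewed exponential population, rate 1.
random.seed(1)
n, reps = 40, 10_000
means = [statistics.fmean(random.expovariate(1.0) for _ in range(n))
         for _ in range(reps)]

# The CLT predicts roughly N(1, 1/sqrt(40)) despite the skewed population.
center = statistics.fmean(means)   # close to mu = 1
spread = statistics.stdev(means)   # close to 1/sqrt(40), about 0.158
```

A histogram of `means` would look nearly symmetric and bell-shaped even though every individual observation comes from a sharply skewed distribution.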

17 When the population is skewed
Even though the population (a) is strongly skewed, the sampling distribution of x̅ when n = 25 (d) is approximately Normal, as expected from the central limit theorem.

18 Is the population Normal?
Sometimes we are told that a variable has an approximately Normal distribution (e.g., large studies on human height or bone density). Most of the time, we just don’t know. All we have is sample data. We can summarize the data with a histogram and describe its shape. If the sample is random, the shape of the histogram should be similar to the shape of the population distribution. The central limit theorem can then help us judge whether the sampling distribution should look roughly Normal or not.

19 Central limit theorem examples (1 of 4)
Angle of big toe deformations in 38 patients:
Symmetrical, one small outlier
Population likely close to Normal
Sampling distribution ~ Normal

20 Central limit theorem examples (2 of 4)
Histogram of number of fruit per day for 74 adolescent girls:
Skewed, no outlier
Population likely skewed
Sampling distribution ~ Normal given large sample size

21 Central limit theorem examples (3 of 4)
Atlantic acorn sizes (in cm³), sample of 28 acorns:
Describe the distribution of the sample.
What can you assume about the population distribution?
What would be the shape of the sampling distribution for samples of size 5? Of size 15? Of size 50?
The histogram is strongly right-skewed. We expect that the population is probably also strongly right-skewed. Because of that, the central limit theorem tells us that the sampling distribution of mean acorn size would be approximately Normal for a sample size of 50, but not 5 or even 15.


23 Chapter 12 reminder: proportions (1 of 2)
A population contains a proportion p of successes. If the population is much larger than the sample, the count X of successes in an SRS of size n has approximately the binomial distribution B(n, p) with mean µ and standard deviation σ:
µ = np
σ = √(npq) = √(np(1 − p))
Both counts, of successes np and of failures n(1 − p), should be at least 10. The population should also be much larger than the sample size (at least 20 times).

24 Chapter 12 reminder: proportions (2 of 2)
If n is large, and p is not too close to 0 or 1, this binomial distribution can be approximated by the Normal distribution:
N(µ = np, σ = √(np(1 − p)))

25 Sampling distribution of a proportion 𝑝 (1 of 2)
When randomly sampling from a population with proportion p of successes, the sampling distribution of the sample proportion p̂ [“p hat”] has mean and standard deviation:
µ_p̂ = p
σ_p̂ = √(p(1 − p)/n)

26 Sampling distribution of a proportion 𝑝 (2 of 2)
p̂ is an unbiased estimator of the population proportion p. Larger samples usually give closer estimates of the population proportion p.

27 Normal approximation
The sampling distribution of p̂ is never exactly Normal. But as the sample size increases, the sampling distribution of p̂ becomes approximately Normal. The Normal approximation is most accurate for any fixed n when p is close to 0.5, and least accurate when p is near 0 or near 1. When n is large, and p is not too close to 0 or 1, the sampling distribution of p̂ is approximately:
N(µ = p, σ = √(p(1 − p)/n))
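The mean and standard deviation formulas for p̂ can be verified by simulation; this sketch (not part of the original slides) borrows p = 0.08 and n = 125 from the color-blindness example that follows:

```python
import random
import statistics

# Simulate p-hat from many SRSs of size n drawn from a population with
# true success proportion p.
random.seed(7)
p, n, reps = 0.08, 125, 20_000
p_hats = [sum(random.random() < p for _ in range(n)) / n for _ in range(reps)]

mean_phat = statistics.fmean(p_hats)   # close to p = 0.08
sd_phat = statistics.stdev(p_hats)     # close to sqrt(p(1-p)/n), about 0.024
```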

28 Numerical example (1 of 2)
The frequency of color blindness (dyschromatopsia) in the Caucasian American male population is about 8%. We wish to take a random sample of size 125 from this population. What is the probability that 10% or more in the sample are color-blind? A sample size of 125 is large enough to use the Normal approximation (np = 10 and n(1 − p) = 115):
N(µ = p = 0.08, σ = √(p(1 − p)/n) ≈ 0.024)
[Image: Ishihara color-vision test plate, from Wikimedia Commons.]

29 Numerical example (2 of 2)
Normal approximation for the p̂ sampling distribution:
z = (p̂ − p) / σ = (0.10 − 0.08) / 0.0243 ≈ 0.82
P(z ≥ 0.82) = 1 − 0.7939 = 0.2061 from Table B
Or P(p̂ ≥ 0.10) = 1 − NORM.DIST(0.10, 0.08, 0.024, 1) ≈ 0.20 (Excel)
= normalcdf(0.10, 1E99, 0.08, 0.024) ≈ 0.20 (TI-83)
NORM.DIST is an Excel function for obtaining lower-tail areas under a Normal curve. normalcdf is a TI-83 distribution function for obtaining any area under a Normal curve (1E99, the largest number the calculator can handle, is used in place of infinity).
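The same upper-tail area can be computed with the Python standard library (a sketch playing the role of NORM.DIST/normalcdf; not part of the original slides):

```python
import math

def normal_sf(x, mu, sigma):
    # Upper-tail area under N(mu, sigma), via the complementary error function
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2)))

p, n = 0.08, 125
sigma = math.sqrt(p * (1 - p) / n)   # about 0.0243
prob = normal_sf(0.10, p, sigma)     # P(p-hat >= 0.10), about 0.2
```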

30 The law of large numbers (1 of 3)
Law of large numbers: As the number of randomly drawn observations (n) in a sample increases…
the mean of the sample (x̅) gets closer and closer to the population mean µ (quantitative variable).
The law of large numbers has important implications in the gambling and insurance industries.

31 The law of large numbers (2 of 3)
the sample proportion (p̂) gets closer and closer to the population proportion p (categorical variable).

32 The law of large numbers (3 of 3)
Note: When sampling randomly from a given population: The law of large numbers describes what would happen if we took samples of increasing size n. A sampling distribution describes what would happen if we took all possible random samples of a fixed size n. Both are conceptual ideas with many important practical applications. We rely on their known mathematical properties, but we don’t actually build them from data.
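The law of large numbers is easy to watch in action with a running proportion of heads in fair-coin flips (an illustrative sketch, not from the slides):

```python
import random
import statistics

# Running proportion of heads settles toward p = 0.5 as n grows.
random.seed(3)
flips = [random.random() < 0.5 for _ in range(100_000)]
for n in (100, 1_000, 10_000, 100_000):
    print(n, statistics.fmean(flips[:n]))

final_gap = abs(statistics.fmean(flips) - 0.5)   # small for n = 100,000
```

Early running proportions wander; by n = 100,000 the proportion sits very close to 0.5, mirroring the behavior of x̅ approaching µ.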

