Sampling and Confidence Interval Kenneth Kwan Ho Chui, PhD, MPH Department of Public Health and Community Medicine Epidemiology/Biostatistics
Learning objectives in the syllabus: understand how a histogram can be read as a probability distribution; understand the importance of random sampling in statistics; understand how sample means can have distributions; explain the behavior (distribution) of sample means and the Central Limit Theorem; know how to interpret confidence intervals as seen in the medical literature; know how to calculate a confidence interval for a mean.
[Diagram: population (parameter) and sample (sample statistics).] Topics: types of data; how to summarize data (central tendency, variability); how to evaluate graphs; distribution of sample means; know how to interpret and calculate a confidence interval for statistical inference.
Assumed knowledge for today: mean; variance; standard deviation; the 68-95-99.7 rule.
Central tendency: Mean. Consider a variable with data: 1, 2, 3, 3, 4, 4, 4, 5, 5, 6. Mean = sum of the values / number of values = 37 / 10 = 3.7.
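Variance & Standard deviation. List each observation and its value; subtract the mean from each value and square the difference; sum the squared differences up; divide by (sample size – 1) to get the variance; SD = √Variance.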
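The steps for the mean, variance and SD can be followed by hand or with a few lines of code. The sketch below is only an illustration: it uses the ten values from the "Central tendency: Mean" slide, and the variable names are made up for this example.

```python
import math
import statistics

# The ten values from the "Central tendency: Mean" slide.
data = [1, 2, 3, 3, 4, 4, 4, 5, 5, 6]

n = len(data)
mean = sum(data) / n                          # 37 / 10

# Squared deviation of each observation from the mean.
squared_deviations = [(x - mean) ** 2 for x in data]

variance = sum(squared_deviations) / (n - 1)  # divide by (sample size - 1)
sd = math.sqrt(variance)                      # SD = square root of the variance

print(f"mean = {mean:.2f}, variance = {variance:.2f}, SD = {sd:.2f}")

# Cross-check against the standard library's sample variance and SD.
assert math.isclose(variance, statistics.variance(data))
assert math.isclose(sd, statistics.stdev(data))
```

Dividing by (n - 1) rather than n is what makes this the sample variance; `statistics.variance` and `statistics.stdev` use the same convention.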
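The 68-95-99.7 rule (for normally distributed data): about 68% of observations are within ± 1 SD of the mean, about 95% are within ± 2 SD, and about 99.7% (often rounded to 99%) are within ± 3 SD. # of SD and approximate percentile: –3 SD: 0.15th; –2 SD: 2.5th; –1 SD: 16th; 0 (the mean): 50th; +1 SD: 84th; +2 SD: 97.5th; +3 SD: 99.85th.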
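The percentages in this rule are rounded values for a normal distribution. As a rough check, the sketch below computes them from the standard normal distribution in Python's standard library; the helper name `share_within` is made up for this example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def share_within(k):
    """Share of a normal distribution lying within +/- k SD of its mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within +/-{k} SD: {share_within(k):.1%}")
```

The three shares come out close to 68%, 95% and 99.7%, which is why the rule is stated with "about".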
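Population and sample. Parameter (population): the true mean BMI of Boston, Massachusetts, which the researcher does not know (?). Sample statistic (sample): the mean BMI of a sample from Boston, Massachusetts.
Sample variation. [Figure: Researchers 1 to 4 each draw their own sample from the whole population; the samples contain different values (for example 1, 2, 3, 4, 5, 6 versus 2, 44, 61, 21, 6), so each researcher obtains a different sample mean, while the population mean remains unknown (?).]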
Central limit theorem
Central limit theorem. The means obtained from many samplings from the same population have the following properties. (1) The distribution of the means is approximately normal if the sample size is big enough (above 120 or so), regardless of the population’s distribution. (2) The mean of the sample means is equal to the population mean. (3) The standard deviation of the sample means, known as the standard error of the mean (SEM), is inversely related to the square root of the sample size: if we repeat the experiment with a bigger sample size, the resulting histogram will be “slimmer”.
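To see the "regardless of the population’s distribution" part in action, here is a minimal sketch that samples from a strongly right-skewed population. The exponential population, the seed and the number of draws are all assumptions chosen for illustration; they are not part of the lecture.

```python
import random
import statistics

random.seed(1)           # fixed seed so the run is repeatable

SAMPLE_SIZE = 120        # the "big enough" size mentioned on the slide
N_DRAWS = 2000           # hypothetical number of repeated samplings

def draw_sample_mean():
    # Exponential population with rate 1: right-skewed, mean 1, SD 1.
    sample = [random.expovariate(1.0) for _ in range(SAMPLE_SIZE)]
    return statistics.fmean(sample)

sample_means = [draw_sample_mean() for _ in range(N_DRAWS)]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)   # empirical standard error
expected_se = 1.0 / SAMPLE_SIZE ** 0.5         # population SD / sqrt(n)

# Share of sample means within +/- 2 SE of their own mean.
within_2se = sum(
    abs(m - mean_of_means) <= 2 * sd_of_means for m in sample_means
) / N_DRAWS

print(f"mean of sample means: {mean_of_means:.3f} (population mean is 1)")
print(f"SD of sample means:   {sd_of_means:.3f} (expected about {expected_se:.3f})")
print(f"within +/-2 SE:       {within_2se:.1%}")
```

Even though individual observations are skewed, the sample means should cluster symmetrically around the population mean, with roughly 95% of them within ± 2 SE.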
Understanding CLT through simulation. Population size: 10,000. Possible values: 0 through 9, 1000 each. True population mean: 4.50.
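Simulation scheme. From a population of 10000 (mean = 4.5), draw a sample of n = 500 and record the sample mean; repeat many times and plot the frequency of the sample means as a histogram.
Sample size = 500; # of draws = [not shown]. [Figure: histogram of sample means (x-axis: sample means; y-axis: frequency), with the central 95% and 99% regions marked.] SD of the sample means (the SE) = 0.13. Share of sample means within ±1 SE: 67.95%; ±2 SE: 95.04%; ±3 SE: 99.10%.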
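The simulation can be approximated in a few lines. This is a sketch under stated assumptions, not the code used for the slides: the number of draws (2000), the seed, and sampling without replacement are all choices made here for illustration.

```python
import random
import statistics

random.seed(2024)        # fixed seed so the run is repeatable

# Population from the slide: the values 0 through 9, 1000 copies of each.
population = [value for value in range(10) for _ in range(1000)]
true_mean = statistics.fmean(population)

SAMPLE_SIZE = 500
N_DRAWS = 2000           # assumption: kept small so the sketch runs quickly

# Each draw: take one sample (without replacement, an assumption) and
# keep only its mean.
sample_means = [
    statistics.fmean(random.sample(population, SAMPLE_SIZE))
    for _ in range(N_DRAWS)
]

mean_of_means = statistics.fmean(sample_means)
se = statistics.stdev(sample_means)   # SD of the sample means

def share_within(k):
    """Share of sample means within +/- k SE of the mean of the sample means."""
    return sum(abs(m - mean_of_means) <= k * se for m in sample_means) / N_DRAWS

print(f"true mean = {true_mean:.2f}, mean of sample means = {mean_of_means:.2f}")
print(f"SE (SD of sample means) = {se:.2f}")
for k in (1, 2, 3):
    print(f"within +/-{k} SE: {share_within(k):.2%}")
```

The exact percentages will differ a little from run to run and from the slide, since they depend on the random draws.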
Characteristics for the distribution of means. In the previous slide, the mean of the sample means, 4.5, equals the true population mean, a parameter for which we have a Greek name: μ (mu). The population SD is likewise a parameter, called σ (sigma) in Greek. The SD of the sample means, 0.13, is not σ itself but σ/√n; we call this SD of means the “standard error of the mean” (SEM) or “standard error” (SE). SE can be estimated using the sample SD (s): SE = s/√n.
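In practice a researcher has a single sample, so the SE is estimated from that sample's SD divided by the square root of the sample size. A minimal sketch, assuming one hypothetical sample of 500 values drawn with replacement from the digits 0 through 9:

```python
import math
import random
import statistics

random.seed(7)           # fixed seed so the run is repeatable

# One hypothetical sample of 500 observations from the values 0 through 9.
sample = random.choices(range(10), k=500)

s = statistics.stdev(sample)           # sample SD
se = s / math.sqrt(len(sample))        # estimated standard error of the mean

print(f"sample mean = {statistics.fmean(sample):.2f}")
print(f"sample SD   = {s:.2f}")
print(f"estimated SE = {se:.3f}")
```

The estimate should land near the 0.13 seen in the simulated histogram, even though it comes from only one sample.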
Why bigger sample sizes are often better. [Figure: histograms of sample means for three sample sizes.] Sample size = 200: SE = 0.20. Sample size = 500: SE = 0.13. Sample size = 1000: SE = 0.08.
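The shrinking SE follows from dividing the SD by the square root of the sample size. The sketch below applies that formula to the population of digits 0 through 9; it assumes sampling with replacement, so its figures are approximations and need not match the simulated values exactly (it gives about 0.09 for n = 1000, slightly above the simulated 0.08).

```python
import math
import statistics

# Population SD of the values 0 through 9 (each equally common).
population_sd = statistics.pstdev(list(range(10)))

def standard_error(sd, n):
    """Standard error of the mean for a given SD and sample size."""
    return sd / math.sqrt(n)

for n in (200, 500, 1000):
    print(f"n = {n:4d}: SE = {standard_error(population_sd, n):.3f}")

# Quadrupling the sample size halves the SE.
ratio = standard_error(population_sd, 800) / standard_error(population_sd, 200)
print(f"SE at n=800 relative to n=200: {ratio:.2f}")
```

Because of the square root, halving the SE takes four times as many subjects, not twice as many.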
Confidence interval
I got CLT, so now what? The histogram can be viewed as a “probability distribution”. The sample mean from a researcher can be any pixel under the bell curve. How should we define “acceptably close” to the population mean? [Figure label: 95%]
The confidence interval [Figure label: 95%]
[Figure: confidence intervals from many samples plotted against the true mean.] If we put a CI on every sample mean, about 95% of them would include the true mean. The two red ones are the “unlucky” samples whose intervals do not include the true mean.
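The "about 95% of them" claim can be checked by simulation. This is a sketch under assumptions chosen for illustration: samples of 100 drawn with replacement from the digits 0 through 9 (true mean 4.5), 2000 repetitions, and a CI of mean ± 1.96 SE.

```python
import math
import random
import statistics

random.seed(11)          # fixed seed so the run is repeatable

TRUE_MEAN = 4.5          # mean of the values 0 through 9
SAMPLE_SIZE = 100        # hypothetical sample size
N_SAMPLES = 2000         # hypothetical number of repeated samples
Z_95 = 1.96              # multiplier for a 95% CI

def ci_95(sample):
    """95% CI for the mean: sample mean +/- 1.96 * (sample SD / sqrt(n))."""
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    return mean - Z_95 * se, mean + Z_95 * se

hits = 0
for _ in range(N_SAMPLES):
    sample = random.choices(range(10), k=SAMPLE_SIZE)
    lower, upper = ci_95(sample)
    if lower <= TRUE_MEAN <= upper:
        hits += 1        # this sample's CI includes the true mean

coverage = hits / N_SAMPLES
print(f"share of CIs that include the true mean: {coverage:.1%}")
```

The remaining few percent of intervals are the "unlucky" ones that miss the true mean.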
Interpretation of a confidence interval. The mean and 95% confidence interval (CI) of the blood glucose of a sample is: 140 mg/dL (95% CI: 120, 160). We are 95% certain that the true population mean glucose falls between 120 and 160 mg/dL. Our best estimate is 140 mg/dL (i.e. the sample mean). Why only 95% certain? Because the sample mean can, unfortunately, be an extreme one more than ± 2 SE away from the true mean (the blue zones).
Some common CIs and their z-score multipliers. There are two numbers in a confidence interval: the lower and upper confidence limits. 90% CI: mean ± 1.65 SE. 95% CI: mean ± 1.96 SE (2.00 is an approximation, 1.96 is recommended); this is the most commonly used criterion. 99% CI: mean ± 2.58 SE. The more certain we want to be that the interval includes the true mean, the wider the CI becomes: “I am 100% certain that the true mean is between –∞ and ∞.”
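The multipliers come from the standard normal distribution, and the interval is the mean plus or minus the multiplier times the SE. The sketch below shows one way to compute both; the function names and the example figures (mean 155, SD 10, n = 100) are hypothetical.

```python
from statistics import NormalDist

def z_multiplier(confidence):
    """Two-sided z multiplier for a confidence level such as 0.95."""
    tail = (1.0 - confidence) / 2.0        # probability left in each tail
    return NormalDist().inv_cdf(1.0 - tail)

def mean_ci(mean, sd, n, confidence=0.95):
    """Confidence interval for a mean: mean +/- z * (SD / sqrt(n))."""
    se = sd / n ** 0.5
    z = z_multiplier(confidence)
    return mean - z * se, mean + z * se

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} CI multiplier: {z_multiplier(level):.3f}")

# Hypothetical example: sample mean 155, sample SD 10, 100 subjects.
lower, upper = mean_ci(155, 10, 100)
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```

The 90% multiplier is about 1.645, which is usually quoted as 1.64 or 1.65.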
How to narrow down a confidence interval? (1) Lower our certainty by opting for, say, a 90% CI instead of a 95% CI. (2) Decrease the sample standard deviation (for instance, by using a more accurate measurement device). (3) Increase the sample size.
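All three options act on the same quantity: the width of the interval, 2 × z × SD/√n. A minimal sketch with hypothetical baseline figures (SD 10, n = 100, 95% confidence) shows the effect of each:

```python
from statistics import NormalDist

def ci_width(sd, n, confidence=0.95):
    """Full width of a confidence interval for a mean: 2 * z * SD / sqrt(n)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return 2.0 * z * sd / n ** 0.5

baseline = ci_width(sd=10, n=100)                    # hypothetical starting point
lower_confidence = ci_width(sd=10, n=100, confidence=0.90)
smaller_sd = ci_width(sd=5, n=100)
bigger_sample = ci_width(sd=10, n=400)

print(f"baseline (95%, SD 10, n 100): {baseline:.2f}")
print(f"90% CI instead of 95%:        {lower_confidence:.2f}")
print(f"SD halved to 5:               {smaller_sd:.2f}")
print(f"sample size raised to 400:    {bigger_sample:.2f}")
```

Halving the SD and quadrupling the sample size both halve the width; dropping to 90% confidence narrows it less and costs certainty.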
Are confidence intervals always symmetric? Not always. CIs for untransformed continuous variables are symmetric. However, CIs for other statistics such as odds ratios and relative risks are calculated on a logarithmic scale. When back-transformed to the ratio scale, the interval will be asymmetric. “Multivariable analysis revealed a more than 2-fold increase in the risk of total stroke among men with job strain (combination of high job demand and low job control) (hazard ratio, 2.73; 95% confidence interval, )”
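The asymmetry comes from building the interval on the log scale and then exponentiating. The sketch below illustrates this with an entirely hypothetical odds ratio of 2.0 and a hypothetical standard error of 0.35 on the log scale; neither number comes from the quoted study.

```python
import math

def ratio_ci(ratio, se_log, z=1.96):
    """CI for a ratio: symmetric on the log scale, then back-transformed."""
    log_ratio = math.log(ratio)
    lower = math.exp(log_ratio - z * se_log)
    upper = math.exp(log_ratio + z * se_log)
    return lower, upper

odds_ratio = 2.0         # hypothetical estimate
se_log_or = 0.35         # hypothetical SE of log(odds ratio)

lower, upper = ratio_ci(odds_ratio, se_log_or)
print(f"OR = {odds_ratio:.2f}, 95% CI: ({lower:.2f}, {upper:.2f})")
print(f"distance below the estimate: {odds_ratio - lower:.2f}")
print(f"distance above the estimate: {upper - odds_ratio:.2f}")
```

The upper limit sits further from the estimate than the lower limit, even though the two are equally far from it on the log scale.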
Quiz. A study recruited 100 subjects and examined their height. The mean of their heights is 155 cm. What is the most likely type of data? A) Binary B) Nominal C) Ordinal D) Continuous
Quiz. A study recruited 100 subjects and examined their height. The mean of their heights is 155 cm. If the median of their heights is 140 cm, the height variable is likely to be: A) Normally distributed B) Skewed to the left (negatively skewed) C) Skewed to the right (positively skewed)
Quiz A study recruited 100 subjects and examined their height. The mean ± SD of their height is 155 ± 10 cm Assume the height data are normally distributed. Which of the following is false? A) 16% of the subjects are shorter than 145 cm B) The standard error of the mean is 10/√100 = 1 cm C) We are 95% certain that the sample mean is between (155 ± 1.96 standard error) cm D) We are 95% certain that the population mean is between (155 ± 1.96 standard error) cm
Another application. Other than estimating the true mean, μ, we can also assume μ to be a certain hypothesized value. Then we can sample and derive the sample mean and its 95% CI. If the 95% CI does not include the assumed μ (the sample mean falls into the blue zones), we can conclude that our sample is, probability-wise, weird; it is perhaps different from the assumed population. This is the foundation of hypothesis testing. (We’ll learn it next week!)
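The check described here amounts to asking whether the hypothesized μ lies inside the sample's 95% CI. A minimal sketch, with a hypothetical function name and hypothetical figures (sample mean 155, SD 10, n = 100):

```python
import math

def ci_excludes(hypothesized_mu, sample_mean, sample_sd, n, z=1.96):
    """Return True if the CI around the sample mean leaves out the hypothesized mu."""
    se = sample_sd / math.sqrt(n)
    lower = sample_mean - z * se
    upper = sample_mean + z * se
    return not (lower <= hypothesized_mu <= upper)

# Hypothetical sample: mean 155, SD 10, 100 subjects, so the 95% CI is
# roughly 153.04 to 156.96.
print(ci_excludes(160, 155, 10, 100))   # 160 lies outside the interval
print(ci_excludes(156, 155, 10, 100))   # 156 lies inside the interval
```

A result of True suggests the sample is unlikely to have come from a population with the hypothesized mean.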