Objectives
6.1 Estimating with confidence
  Statistical confidence
  Confidence intervals
  Confidence interval for a population mean
  How confidence intervals behave
  Choosing the sample size
Methods for drawing conclusions about a population from sample data are called statistical inference. We will use data to make inferences, i.e., draw conclusions about populations from the data in our samples or experiments. We'll consider two types of inference:
  Confidence interval estimation
  Tests of significance
In both cases, we treat the data either as a random sample from a population or as data from a randomized experiment.
Start with estimation. There are two situations we'll consider:
  estimating the mean µ of a population of measurements
  estimating the proportion p of successes (Ss) in a population of successes and failures (Fs)
In either case, we'll construct a confidence interval of the form estimate ± M.O.E., where M.O.E. is the margin of error of the estimator. The MOE tells us how good the estimate is, through the variation in the estimator (its standard error) and through the level of confidence of the interval (a tabulated value). The standard error of an estimator is its estimated standard deviation (treating the estimator as a statistic with a sampling distribution).
The best estimator of µ is x̄, and we know from the previous chapter that x̄ is approximately N(µ, σ/√n).
The best estimator of p is p̂, and we know from the previous chapter that p̂ is approximately N(p, √(p(1−p)/n)).
Statistical confidence
Although the sample mean, x̄, is a single number for any particular sample, a different sample will probably give a different sample mean. In fact, you could get many different values for the sample mean, and virtually none of them would exactly equal the true population mean, µ.
But the sampling distribution of x̄ is narrower than the population distribution: its standard deviation is smaller by a factor of 1/√n. Thus, the estimates from our samples tend to be relatively close to the population parameter µ.
[Figure: distribution of sample means x̄ (samples of n subjects) compared with the population distribution of individual subjects x, both centered at µ.]
If the population is normally distributed N(µ, σ), then the sampling distribution of x̄ is N(µ, σ/√n).
95% of all sample means will be within the MOE (roughly 2σ/√n) of the population parameter µ (MOE = margin of error). Distances are symmetrical, which implies that the population parameter µ must be within roughly 2 standard deviations of x̄ from the sample average x̄ in 95% of all samples.
[Figure: red dots mark the mean value of each individual sample.]
This reasoning is the essence of statistical inference. Know and understand this figure!
Confidence intervals
The confidence interval is a range of values with an associated confidence level C. The confidence level quantifies how often intervals constructed this way contain the true population parameter.
x̄ ± 4.2 is a 95% confidence interval for the population parameter µ. This statement says that in ~95% of all samples, the actual value of µ will be within 4.2 units of the value of x̄.
Reworded
With 95% confidence, we can say that µ should be within roughly 2 standard deviations (2σ/√n) of our sample mean x̄. In 95% of all possible samples of this size n, µ will indeed fall in our confidence interval. In only 5% of samples would x̄ be farther from µ.
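The long-run claim above can be checked with a small simulation. This is only a sketch: the function name `coverage`, the seed, and the number of repetitions are illustrative choices, the population is assumed to be exactly normal with known σ, and the exact normal quantile is used in place of the rounded value 2.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def coverage(mu, sigma, n, confidence, reps, seed=0):
    """Fraction of simulated samples whose interval xbar +/- z*sigma/sqrt(n) contains mu."""
    rng = random.Random(seed)                        # fixed seed so the run is repeatable
    z = NormalDist().inv_cdf((1 + confidence) / 2)   # z* for the chosen confidence level
    moe = z * sigma / sqrt(n)                        # margin of error with known sigma
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = fmean(sample)
        if abs(xbar - mu) <= moe:                    # interval covers mu
            hits += 1
    return hits / reps

# Illustrative values only: a N(64.5, 2.5) population, samples of size 25.
print(coverage(mu=64.5, sigma=2.5, n=25, confidence=0.95, reps=10_000))
```

The printed fraction should land close to 0.95, and it moves closer as `reps` grows. Any single interval still either contains µ or does not.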
A confidence interval can be expressed as:
  Sample mean ± MOE, where MOE is the margin of error m: µ within x̄ ± m. Example: 120 ± 6
  Two endpoints of an interval: µ within (x̄ − MOE) to (x̄ + MOE). Example: 114 to 126
A confidence level C (in %) indicates how confident we are that µ falls within the interval. It represents the area under the normal curve within ± MOE of the center of the curve.
Review: standardizing the normal curve using z
z = (x̄ − µ) / (σ/√n)
[Figure: normal curves labeled N(64.5, 2.5), N(µ, σ/√n), and N(0,1); the last axis is standardized height (no units).]
Here, we work with the sampling distribution of the sample mean, and σ/√n is its standard deviation (spread). Remember that σ is the standard deviation of the original population.
Varying confidence levels
Confidence intervals contain the population mean µ in C% of samples, in the long run. Different areas under the curve give different confidence levels C.
Practical use of z: z*
z* is related to the chosen confidence level C: C is the area under the standard normal curve between −z* and z*.
The confidence interval is thus: x̄ ± z*σ/√n
Example: For an 80% confidence level C, 80% of the normal curve's area is contained between −z* and z*.
How do we find specific z* values?
We can use a table of z (Table A) or t values (Table D). In Table D, the appropriate z* value is listed just above each confidence level C. Example: For a 98% confidence level, z* = 2.326.
We can use software. In JMP: create a new column, choose Edit Formula, and select Normal Quantile(p) under Probability, where p = (1 − C)/2 is the area to the left of −z*. Since C is the middle area, each tail has area (1 − C)/2. Example: for a 98% confidence level, Normal Quantile(.01) = −2.326349 (= −z*).
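For readers without JMP, the same lookup can be sketched in Python. The helper name `z_star` is an illustrative choice, not something from the course materials; `NormalDist().inv_cdf` from the standard library plays the role of Normal Quantile.

```python
from statistics import NormalDist

def z_star(confidence):
    """Critical value z* such that the area between -z* and z* under N(0, 1) is `confidence`."""
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")
    lower_tail = (1 - confidence) / 2            # area to the left of -z*
    return -NormalDist().inv_cdf(lower_tail)     # the quantile is -z*, so flip the sign

for c in (0.80, 0.95, 0.98):
    print(f"C = {c:.2f}  z* = {z_star(c):.3f}")
```

For C = 0.98 this should agree with the table value 2.326 quoted above.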
Link between confidence level and margin of error
The confidence level C determines the value of z* (in Table A or D). The margin of error m also depends on z*: m = z*σ/√n.
Higher confidence C implies a larger margin of error m (thus less precision in our estimates). A lower confidence level C produces a smaller margin of error m (thus better precision in our estimates).
Different confidence intervals for the same set of measurements
Density of bacteria in solution: the measurement equipment has standard deviation σ = 1 × 10^6 bacteria/ml fluid.
Three measurements: 24, 29, and 31 × 10^6 bacteria/ml fluid. Mean: x̄ = 28 × 10^6 bacteria/ml. Find the 96% and 70% CI.
96% confidence interval for the true density: z* = 2.054, so
  x̄ ± z*σ/√n = 28 ± 2.054(1/√3) = (28 ± 1.19) × 10^6 bacteria/ml
70% confidence interval for the true density: z* = 1.036, so
  x̄ ± z*σ/√n = 28 ± 1.036(1/√3) = (28 ± 0.60) × 10^6 bacteria/ml
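The two intervals in this example can be reproduced with a short sketch. The function name `z_interval` is hypothetical, values are in units of 10^6 bacteria/ml, and σ is assumed known, as the example states.

```python
from math import sqrt
from statistics import NormalDist, mean

def z_interval(data, sigma, confidence):
    """Return (xbar, margin of error) for a z confidence interval with known sigma."""
    n = len(data)
    xbar = mean(data)
    z = NormalDist().inv_cdf((1 + confidence) / 2)   # z* for the central area C
    moe = z * sigma / sqrt(n)
    return xbar, moe

measurements = [24, 29, 31]   # in units of 10^6 bacteria/ml
for c in (0.96, 0.70):
    xbar, moe = z_interval(measurements, sigma=1.0, confidence=c)
    print(f"{c:.0%} CI: {xbar} ± {moe:.2f}  ->  ({xbar - moe:.2f}, {xbar + moe:.2f})")
```

The same data give a wider interval at 96% than at 70%, which is the trade-off between confidence and precision.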
Properties of Confidence Intervals
The user chooses the confidence level; the margin of error follows from this choice.
We want high confidence and small margins of error.
The margin of error, m = z*σ/√n, is smaller when
  z* (and thus the confidence level C) gets smaller
  σ is smaller
  n is larger
Impact of sample size
The spread in the sampling distribution of the mean is a function of the number of individuals per sample. The larger the sample size, the smaller the standard deviation (spread) of the sample mean distribution. But the spread decreases only in proportion to 1/√n, so halving it requires four times as many observations.
[Figure: standard error σ/√n plotted against sample size n.]
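A few lines of code make the 1/√n rate concrete. This is a minimal sketch with σ = 1 chosen only for illustration; `standard_error` is a hypothetical helper name.

```python
from math import sqrt

def standard_error(sigma, n):
    """Standard deviation of the sample mean for samples of size n."""
    return sigma / sqrt(n)

sigma = 1.0   # illustrative population standard deviation
for n in (1, 4, 16, 64, 256):
    # Each step multiplies n by 4, which halves the standard error.
    print(f"n = {n:3d}  standard error = {standard_error(sigma, n):.4f}")
```

Each fourfold increase in n only halves the spread, which is why very small margins of error become expensive quickly.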
Sample size and experimental design
You may need a certain margin of error (e.g., drug trial, manufacturing specs). In many cases, the population variability (σ) is fixed, but we can choose the number of measurements (n). So plan ahead what sample size to use to achieve that margin of error.
Remember, though, that sample size cannot always be increased at will. There are typically costs and constraints associated with large samples. The best approach is to use the smallest sample size that can give you useful results.
What sample size for a given margin of error?
Density of bacteria in solution: the measurement equipment has standard deviation σ = 1 × 10^6 bacteria/ml fluid. How many measurements should you make to obtain a margin of error of at most 0.5 × 10^6 bacteria/ml with a confidence level of 95%?
For a 95% confidence interval, z* = 1.96. Solving m = z*σ/√n for n gives
  n = (z*σ/m)² = (1.96 × 1/0.5)² = 3.92² ≈ 15.37
Using only 15 measurements will not be enough to ensure that m is no more than 0.5 × 10^6. Therefore, we need at least 16 measurements.
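The sample-size calculation can be sketched as follows. The function name `required_n` is illustrative; the sketch assumes σ is known and always rounds up, since rounding down would leave the margin of error slightly too large.

```python
from math import ceil
from statistics import NormalDist

def required_n(sigma, margin, confidence):
    """Smallest n for which z* * sigma / sqrt(n) is at most `margin`."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)   # z* for the central area C
    return ceil((z * sigma / margin) ** 2)           # always round up to a whole measurement

# Bacteria example, in units of 10^6 bacteria/ml.
print(required_n(sigma=1.0, margin=0.5, confidence=0.95))
```

Halving the target margin to 0.25 roughly quadruples the required n, in line with the 1/√n behaviour of the standard error.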
Cautions about using x̄ ± z*σ/√n
  Data must be an SRS from the population.
  The formula is not correct for other sampling designs.
  Inference cannot rescue badly produced data.
  Confidence intervals are not resistant to outliers.
  If n is small (<15) and the population is not normal, the true confidence level will be different from C.
  The standard deviation σ of the population must be known.
  The margin of error in a confidence interval covers only random sampling errors!
Interpretation of Confidence Intervals
Conditions under which an inference method is valid are never fully met in practice. Exploratory data analysis and judgment should be used when deciding whether or not to use a statistical procedure.
Any individual confidence interval either will or will not contain the true population mean. It is wrong to say that the probability is 95% that the true mean falls in the confidence interval.
The correct interpretation of a 95% confidence interval is that we are 95% confident that the true mean falls within the interval. The confidence interval was calculated by a method that gives correct results in ~95% of all possible samples.
In other words, if many such confidence intervals were constructed, ~95% of these intervals would contain the true mean.
HW: Read Introduction to Chapter 6 and Section 6.1; do # 6.1-6.8, 6.10-6.18, 6.27, 6.28, 6.34, 6.35