Chapter 19: Confidence intervals for proportions
Coral Coral are in decline worldwide, possibly because of pollution or changes in sea temperature. One particular kind of coral, the sea fan, looks like a plant growing from the sea floor, but is actually an animal. Sea fans in the Caribbean Sea have been under attack by the disease aspergillosis. In June of 2000, the sea fan disease team from Dr. Drew Harvell’s lab randomly sampled some sea fans at the Las Redes Reef in Akumal, Mexico, at a depth of 40 feet. They found that 54 of the 104 sea fans they sampled were infected with the disease. What might this say about the prevalence of this disease among sea fans in general?
More about Coral We have a sample proportion, $\hat{p} = 54/104 = 51.9\%$. Is this percentage close to the true population proportion, $p$? What do we know about the sampling variability? We can’t take additional samples, so what do we do?
Confidence Intervals Let’s look at our model for the sampling distribution. We know it’s approximately Normal (by first checking the appropriate assumptions). Because we have only a single sample proportion, we’ll use the standard error, $SE(\hat{p}) = \sqrt{\hat{p}\hat{q}/n} = \sqrt{(0.519)(0.481)/104} = 0.049 = 4.9\%$. We could label the model $N(\hat{p}, SE(\hat{p}))$, but what does that mean?
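The standard error arithmetic can be checked with a few lines of Python. This is only a minimal sketch; the variable names are illustrative and do not come from the slides.

```python
import math

infected = 54   # infected sea fans in the sample
n = 104         # sea fans sampled

p_hat = infected / n                 # sample proportion
q_hat = 1 - p_hat                    # proportion not infected
se = math.sqrt(p_hat * q_hat / n)    # standard error of p_hat

print(f"p_hat = {p_hat:.3f}, SE = {se:.3f}")
```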
Things We Want to Say (and why each is wrong) “51.9% of all sea fans on the Las Redes Reef are infected.” We would never word anything this strongly and one sample doesn’t give us an exact percentage for the entire population. “It is probably true that 51.9% of all sea fans on the Las Redes Reef are infected.” We’re actually pretty sure that the true population percentage is NOT 51.9% because one sample probably won’t match the population.
Things We Want to Say (and why each is wrong) “We don’t know exactly what proportion of sea fans on the Las Redes Reef are infected, but we do know that it’s within the interval 51.9% ±2×4.9%. That is, it’s between 42.1% and 61.7%.” This is getting closer, but we still can’t be certain. “We don’t know exactly what proportion of sea fans on the Las Redes Reef are infected, but the interval from 42.1% to 61.7% probably contains the true proportion.” This one is close, but is too wishy-washy. There’s a much better way to word our conclusion.
Wording a Confidence Interval “We are 95% confident that between 42.1% and 61.7% of Las Redes sea fans are infected.” We use 95% because, in a Normal model, about 95% of sample proportions fall within 2 standard errors of the center. The interval calculated and interpreted here is called a one-proportion z-interval.
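As a rough check, the interval quoted for the sea fans can be rebuilt from the raw counts by going 2 standard errors either side of the sample proportion. The sketch below assumes nothing beyond the 54 and 104 reported for the sample; the variable names are illustrative.

```python
import math

infected, n = 54, 104
p_hat = infected / n
se = math.sqrt(p_hat * (1 - p_hat) / n)

# Roughly 95% confidence: 2 standard errors either side of p_hat
lower = p_hat - 2 * se
upper = p_hat + 2 * se

print(f"{lower:.1%} to {upper:.1%}")
```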
Just Checking In the 1992 presidential election Bill Clinton ran against George H.W. Bush and Ross Perot. That June the Gallup organization asked registered voters if there was “some chance they could vote for other candidates” besides their expressed first choice. At that time, 62% of registered voters said “yes,” there was some chance they might switch. In June 2004, Gallup/CNN/USA Today asked 909 registered voters the same question. Only 18% indicated that there was some chance they might switch. The resulting 95% confidence interval is $0.18 \pm 0.025$, or 15.5% to 20.5%. Are these statements about the 2004 presidential election correct? Explain.
Just Checking “In the sample of 909 registered voters, somewhere between 15.5% and 20.5% of them said there is a chance they might switch votes.” No, we know that in the sample 18% said “yes”; there’s no need for a margin of error. “We are 95% confident that 18% of all U.S. registered voters had some chance of switching votes.” No, we are 95% confident that the percentage falls into some interval, not exactly on a particular value. “We are 95% confident that between 15.5% and 20.5% of all U.S. registered voters had some chance of switching.” Yes. That’s what the confidence interval means.
Just Checking “We know that between 15.5% and 20.5% of all U.S. registered voters had some chance of switching votes.” No, we don’t know for sure that’s true; we are only 95% confident. “95% of all U.S. registered voters had some chance of switching votes.” No, that’s our level of confidence, not the proportion of voters. The sample suggests the proportion is much lower.
What Does “95% Confidence” Really Mean? In the first example, we estimated that the true proportion of infected sea fans is between 42% and 62%. Is it possible for another researcher to have an estimate between 46% and 66%? How about 23% and 43%? Each sample produces its own confidence interval, centered at the proportion that sample produced. Each confidence interval attempts to capture the true proportion within its range, and not all confidence intervals will succeed.
Visualization of Confidence Intervals [Figure not reproduced; axis label: Proportion]
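The idea that every sample yields its own interval, and that only some of those intervals capture the true proportion, can be sketched with a small simulation. The true proportion of 0.5, the seed, and the number of trials are assumptions chosen purely for illustration; only the sample size of 104 comes from the sea fan study.

```python
import math
import random

random.seed(19)   # fixed seed so the run is repeatable
TRUE_P = 0.5      # hypothetical true proportion (an assumption)
N = 104           # sample size, matching the sea fan study
TRIALS = 2000     # number of simulated samples (an assumption)

def interval(successes, n, z_star=1.96):
    """One-proportion z-interval from a count of successes."""
    p_hat = successes / n
    me = z_star * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - me, p_hat + me

hits = 0
for _ in range(TRIALS):
    # Draw one random sample and count its successes
    successes = sum(random.random() < TRUE_P for _ in range(N))
    lo, hi = interval(successes, N)
    if lo <= TRUE_P <= hi:
        hits += 1

coverage = hits / TRIALS
print(f"{coverage:.1%} of the intervals captured the true proportion")
```

The capture rate should land near 95% but rarely on it exactly, since the Normal model is only an approximation at this sample size.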
Margin of Error: Certainty vs. Precision Confidence intervals: $\hat{p} \pm 2\,SE(\hat{p})$ (for 95% confidence). The extent of the interval on either side of $\hat{p}$ is called the margin of error (ME). In general, confidence intervals look like: estimate ± ME. The margin of error for our 95% confidence interval was 2 SE. The more confident we want to be, the larger the margin of error must be (but we usually like 95%, or 2 SE).
Critical Values The number of SEs we use for our confidence level is called the critical value. For confidence intervals, critical values are based on the Normal model and are denoted z* (they are z-scores, but we use the * to indicate a critical value). For a 95% confidence interval, z* = 1.96 (close to the 2 from the 68-95-99.7 Rule). For a 90% confidence interval, z* = 1.645. [Figure: Normal model with z = -1.645 and z = 1.645 marked; axis from -3 to 3]
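Critical values for other confidence levels can be looked up from the standard Normal model. The sketch below uses Python's standard-library `statistics.NormalDist`; the helper name `z_star` is an illustrative choice, not something from the slides.

```python
from statistics import NormalDist

def z_star(confidence):
    """Critical value that leaves `confidence` in the middle of the standard Normal."""
    tail = (1 - confidence) / 2            # area in each tail
    return NormalDist().inv_cdf(1 - tail)  # z-score cutting off the upper tail

for level in (0.90, 0.95, 0.98, 0.99):
    print(f"{level:.0%} confidence: z* = {z_star(level):.3f}")
```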
Assumptions and Conditions All statistical models make assumptions. If the assumptions are not true, the model might be inappropriate or our conclusions based on it may be wrong. When we have data, we can often decide whether an assumption is plausible by checking a related condition. For linear regression, we checked the Linearity Condition by looking at the data and seeing if it was “straight enough.”
Independence Assumption If checking independence is impossible, we can often check a related condition. Plausible Independence Condition: Is there any reason to believe that the data values somehow affect each other? Randomization Condition: Were the data sampled at random or generated from a properly randomized experiment? Proper randomization helps ensure independence. 10% Condition: When the sample exceeds 10% of the population, the probability of a success changes so much that a Normal model may no longer be appropriate, but if less than 10% is sampled, we can pretend to have independence.
Sample Size Assumption The model we use for inference is based on the Central Limit Theorem (quantitative data). The Success/Failure Condition is needed here: the sample must contain at least 10 successes and at least 10 failures.
One-Proportion Z-Interval When the conditions are met, we are ready to find the confidence interval for the population proportion, $p$. The confidence interval is $\hat{p} \pm z^* \times SE(\hat{p})$, where the standard deviation of the proportion is estimated by $SE(\hat{p}) = \sqrt{\hat{p}\hat{q}/n}$.
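The formula can be wrapped in a small helper. This is a hedged sketch rather than a reference implementation: the function name `one_prop_z_interval` and its parameters are hypothetical, and it assumes the conditions have already been checked.

```python
import math
from statistics import NormalDist

def one_prop_z_interval(successes, n, confidence=0.95):
    """Return (lower, upper) for a one-proportion z-interval."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)             # SE(p_hat)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # critical value z*
    return p_hat - z * se, p_hat + z * se

# Sea fan data: 54 infected out of 104 sampled
lower, upper = one_prop_z_interval(54, 104)
print(f"{lower:.3f} to {upper:.3f}")
```

Because this uses z* ≈ 1.96 rather than the rounded value 2, the sea fan interval comes out slightly narrower than one built from 2 standard errors.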
Another Example In May 2002, the Gallup Poll asked 537 randomly sampled adults the question “Generally speaking, do you believe the death penalty is applied fairly or unfairly in this country today?” Of these, 53% answered “fairly” and 7% said they didn’t know. The rest answered “unfairly.” What can we conclude from this survey?
Checking the Conditions Plausible Independence Condition – Gallup phoned a random sample of U.S. adults; it’s unlikely that any of their respondents influenced each other. Randomization Condition – Gallup drew a random sample from all U.S. adults. 10% Condition – the sample of 537 U.S. adults is less than 10% of all U.S. adults. Success/Failure Condition – $n\hat{p} = 537(0.53) = 285 \ge 10$ and $n\hat{q} = 537(0.47) = 252 \ge 10$, so the sample is large enough.
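The Success/Failure Condition is the one check here that is pure arithmetic, so it is easy to sketch in code. The helper name `success_failure_ok` is hypothetical; the threshold of 10 is the one used in the condition above.

```python
def success_failure_ok(n, p_hat, minimum=10):
    """True when both the successes and the failures reach the minimum count."""
    successes = n * p_hat
    failures = n * (1 - p_hat)
    return successes >= minimum and failures >= minimum

n, p_hat = 537, 0.53
print(f"successes = {n * p_hat:.0f}, failures = {n * (1 - p_hat):.0f}")
print("large enough:", success_failure_ok(n, p_hat))
```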
The Actual Confidence Interval $n = 537$, $\hat{p} = 0.53$, so $SE(\hat{p}) = \sqrt{\hat{p}\hat{q}/n} = \sqrt{(0.53)(0.47)/537} = 0.0215$. Because the sampling model is Normal, for a 95% confidence interval, the critical value is z* = 1.96. The margin of error is $ME = z^* \times SE(\hat{p}) = 1.96(0.0215) = 0.042$. So the 95% confidence interval is $0.53 \pm 0.042$, or $(0.488, 0.572)$. I am 95% confident that between 48.8% and 57.2% of all U.S. adults think that the death penalty is applied fairly.
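The calculation can be reproduced step by step. The sketch below takes only the survey's reported values (n = 537, 53% answering “fairly”) and the 95% critical value of 1.96; with those inputs the margin of error works out to about 0.042.

```python
import math

n = 537
p_hat = 0.53
z_star = 1.96   # critical value for 95% confidence

se = math.sqrt(p_hat * (1 - p_hat) / n)   # standard error
me = z_star * se                          # margin of error
lower, upper = p_hat - me, p_hat + me

print(f"SE = {se:.4f}, ME = {me:.3f}")
print(f"interval: {lower:.3f} to {upper:.3f}")
```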
Just Checking Think some more about the 95% confidence interval we just created for the proportion of U.S. adults who think the death penalty is applied fairly. If we wanted to be 98% confident, would our confidence interval need to be wider or narrower? Wider. Our margin of error was about ±4%. If we wanted to reduce it to ±3%, would our level of confidence be lower or higher? Lower. If the Gallup organization had polled more people, would the interval’s margin of error have been larger or smaller? Smaller.
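These three answers can be illustrated numerically. In the sketch below, the helper `margin_of_error` and the doubled sample size are hypothetical choices for illustration; the survey values (n = 537, 53%) come from the example.

```python
import math
from statistics import NormalDist

def margin_of_error(p_hat, n, confidence):
    """Margin of error z* x SE(p_hat) at the given confidence level."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

p_hat, n = 0.53, 537
me_95 = margin_of_error(p_hat, n, 0.95)
me_98 = margin_of_error(p_hat, n, 0.98)        # more confidence
me_big = margin_of_error(p_hat, 2 * n, 0.95)   # hypothetical doubled sample

# Confidence level implied by a 3% margin with the same sample
se = math.sqrt(p_hat * (1 - p_hat) / n)
conf_3pct = 2 * NormalDist().cdf(0.03 / se) - 1

print(f"95%: {me_95:.3f}   98%: {me_98:.3f}   doubled n: {me_big:.3f}")
print(f"confidence for a 3% margin: {conf_3pct:.1%}")
```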
Confidence Intervals on the Calculator Under STAT, choose TESTS. Select A: 1-PropZInt. Enter x: the number of successes; n: the total number of trials; C-Level: the level of confidence you want.
What Can Go Wrong? Don’t suggest that the parameter varies – “There is a 95% chance that the true proportion is between 42.7% and 51.3%” suggests that the population proportion varies, but it’s a fixed value; the CI is the interval that varies from sample to sample. Don’t claim that other samples will agree with yours – “In 95% of samples of U.S. adults, the proportion who think marijuana should be decriminalized will be between 42.7% and 51.3%” suggests that the interval is about the sample proportion when it should be about the population proportion. Don’t be certain about the parameter – even a 95% interval is only “correct” about 95% of the time
More Things That Can Go Wrong Don’t forget: It’s the parameter – “I’m 95% confident that $\hat{p}$ is between 42.1% and 61.7%” suggests the CI is about the sample, not the population proportion. Don’t claim to know too much – “I’m 95% confident that between 42.1% and 61.7% of all sea fans in the world are infected” means you extrapolated (we only sampled in the Las Redes Reef). Do take responsibility – CIs are about uncertainty, so it’s perfectly OK to say, “I’m 95% confident that between 42.1% and 61.7% of the sea fans on the Las Redes Reef are infected.”
Margin of Error Too Large Saying that between 10% and 90% of the sea fans on the Las Redes Reef are infected wouldn’t be useful. One way to make the margin of error smaller is to reduce the level of confidence. Rarely, however, do we report a confidence level below 80%. Another way to get a narrower interval (without giving up confidence) is to have less variability in the sample proportion: choose a larger sample.
Choosing Your Sample Size Suppose a candidate is planning a poll and wants to estimate voter support within 3% with 95% confidence. How large a sample do we need? $ME = z^* \sqrt{\hat{p}\hat{q}/n}$. In order to do this, we need to find (or estimate) $\hat{p}$. Because we don’t know it, we can use the value that will make $\hat{p}\hat{q}$ the largest: 0.5. Then $0.03 = 1.96\sqrt{(0.5)(0.5)/n}$, so $0.03\sqrt{n} = 1.96\sqrt{(0.5)(0.5)}$, $\sqrt{n} = 32.67$, and $n = 1067.1$. To be safe, we’ll round it up to 1068 respondents. Often, however, finances prevent us from getting the sample size we want. A 5% margin of error is usually acceptable, but we can go up to 10% depending on circumstances.
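Solving the margin of error formula for $n$ gives $n = (z^*/ME)^2\,\hat{p}\hat{q}$, which is easy to sketch in code. The function name `sample_size` is hypothetical, and the defaults assume 95% confidence (z* = 1.96) and the worst-case guess of 0.5 used above.

```python
import math

def sample_size(me, z_star=1.96, p_guess=0.5):
    """Smallest whole n whose margin of error is at most `me`."""
    n_exact = (z_star / me) ** 2 * p_guess * (1 - p_guess)
    return math.ceil(n_exact)   # always round up to be safe

for me in (0.03, 0.05, 0.10):
    print(f"ME = {me:.0%}: n = {sample_size(me)}")
```

Relaxing the margin from 3% to 5% or 10% cuts the required sample sharply, which is why the looser margins are attractive when money is tight.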
Violations of Assumptions Watch out for biased sampling. Think about independence – while we can almost never actually check this, we can think about whether or not our sample should be independent.