Inference for Proportions
Preface (text notes)
Categorical Data Up to this point we have only talked about quantitative variables. Many studies are about variables such as race, sex, occupation, make of a car, smoking or non-smoking, type of complaint received, etc. For these types of variables, our data are counts or percents obtained from counts. How do we numerically summarize categorical data?
Big Picture [Diagram: Population (population parameter p?) and Sample (sample statistic); inference runs from the sample back to the population.]
Cold outside? Do you think the temperature is “low” outside? There are two possible answers to this question: Yes or No (at least by our definition). Pretend that out of a sample of 20 people in our class, 15 say yes. So the proportion of people who walk outside and curse, “It’s ($(&@(& cold out here” is $15/20 = 0.75$, or 75%.
Cold! Let's assume that Stat 226 section D is a good representation of the ISU population. We are interested in knowing the opinion of the entire ISU student body. This is a new parameter, denoted by $p$. Thus $p$ represents the population proportion. Recall that we use a statistic to estimate the population parameter of interest. In this case $\hat{p} = 0.75$ is a statistic, or simply, $\hat{p}$ is the sample statistic.
Sample Proportion In general: A sample proportion from an SRS of size n is the number of “successes” over the total sample size: $\hat{p} = X/n$, where $X$ is the number of successes.
Sampling Distribution of $\hat{p}$ Recall that the sampling distribution of a statistic is the distribution (shape, center, and spread) of the values of that statistic obtained from all possible samples of the same size.
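Sampling Distribution of $\hat{p}$ As the sample size increases, the sampling distribution of $\hat{p}$ becomes approximately Normal with Mean: $p$ (thus $\hat{p}$ is an unbiased estimator of $p$). Standard deviation: $\sqrt{p(1-p)/n}$, where $p$ is the true population proportion. This is denoted as $\hat{p} \sim N\left(p, \sqrt{p(1-p)/n}\right)$, approximately.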
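The claim about the center and spread of the sampling distribution can be checked with a small simulation. The sketch below is only an illustration: the true proportion 0.75, the sample size 200, the number of repetitions, and the function name `simulate_p_hats` are all assumptions chosen for the demonstration, not values from the course.

```python
import math
import random
import statistics

def simulate_p_hats(p, n, reps, seed=0):
    """Draw `reps` samples of size n and return the sample proportion of each."""
    rng = random.Random(seed)
    p_hats = []
    for _ in range(reps):
        # Each draw is a "success" with probability p.
        successes = sum(1 for _ in range(n) if rng.random() < p)
        p_hats.append(successes / n)
    return p_hats

p, n = 0.75, 200  # hypothetical true proportion and sample size
p_hats = simulate_p_hats(p, n, reps=5000)

print("mean of simulated p-hats:", round(statistics.mean(p_hats), 4))
print("SD of simulated p-hats:  ", round(statistics.pstdev(p_hats), 4))
print("sqrt(p(1-p)/n):          ", round(math.sqrt(p * (1 - p) / n), 4))
```

The simulated mean should land close to p, and the simulated standard deviation close to the square-root formula.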
Standard Error Since we don’t know what $p$ is, we can use $\hat{p}$ in its place in the standard deviation to get the standard error. So we have $SE_{\hat{p}} = \sqrt{\hat{p}(1-\hat{p})/n}$.
One Issue Unfortunately, in practice it has been shown that using the estimate $\hat{p}$ to construct confidence intervals can be inaccurate. For example, let's suppose that not one person noticed that the temperature outside is low (all wearing shorts?). Then $\hat{p} = 0$ and $SE_{\hat{p}} = 0$, which means that we are certain that no one thinks the temperature is low. We need to find an estimator that is still nearly unbiased but doesn’t have this characteristic.
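To see the issue numerically, here is a minimal sketch that computes the standard error of the ordinary sample proportion for the class example (15 of 20) and for the degenerate case (0 of 20). The function name `standard_error` is an illustrative choice, and 1.96 is the 95% critical value used later in these notes.

```python
import math

def standard_error(successes, n):
    """Standard error of the ordinary sample proportion successes / n."""
    p_hat = successes / n
    return math.sqrt(p_hat * (1 - p_hat) / n)

n = 20
for successes in (15, 0):
    p_hat = successes / n
    margin = 1.96 * standard_error(successes, n)
    # With 0 successes the margin collapses to 0, so the "interval" is a single point.
    print(f"{successes}/{n}: p-hat = {p_hat:.3f}, "
          f"interval = ({p_hat - margin:.3f}, {p_hat + margin:.3f})")
```

With zero successes the interval has zero width, which claims far more certainty than a sample of 20 can justify.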
Wilson Estimate The Wilson estimate of $p$ (the population proportion) is $\tilde{p} = (X + 2)/(n + 4)$, where $X$ is the number of successes. We are essentially adding four observations, two of which are considered successes (and two failures). The Wilson estimate always keeps us away from 0 and 1 and works well in practice. So the Wilson estimate for the proportion of people for whom the temperature outside is low: $\tilde{p} = (15 + 2)/(20 + 4) = 17/24 \approx 0.708$.
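The "add four observations, two of them successes" rule is short enough to write as a one-line function. This is a sketch with an illustrative function name, applied to the class example and to the two extreme outcomes.

```python
def wilson_estimate(successes, n):
    """Wilson estimate: add 2 successes and 2 failures before dividing."""
    return (successes + 2) / (n + 4)

n = 20
for successes in (15, 0, 20):
    print(f"{successes}/{n}: ordinary = {successes / n:.3f}, "
          f"Wilson = {wilson_estimate(successes, n):.3f}")
```

Even when every answer is "no" or every answer is "yes", the Wilson estimate stays strictly between 0 and 1.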
Confidence Interval for p So a confidence interval for $p$ based on the Wilson estimate is $\tilde{p} \pm z^* SE_{\tilde{p}}$, where the standard error is $SE_{\tilde{p}} = \sqrt{\tilde{p}(1-\tilde{p})/(n+4)}$ and $z^*$ is the value for the standard Normal density curve with area C between $-z^*$ and $z^*$. The margin of error is $m = z^* SE_{\tilde{p}}$. Same old method in a different hat (or tilde): estimate $\pm$ margin of error. Caution: Use this interval when the sample size is at least n = 5 and for confidence level 90%, 95%, or 99%.
Cold! (again) A 95% confidence interval for the proportion of ISU students who think the temperature outside is low: $\tilde{p} \pm z^* SE_{\tilde{p}} = 0.708 \pm 1.96\sqrt{0.708(1 - 0.708)/24} \approx 0.708 \pm 0.182$. So the 95% CI is (0.526, 0.890). Interpretation: I am 95% confident that between 52.6% and 89.0% of ISU students’ first thought when they head outside is that the temperature outside is low.
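The interval can be reproduced in a few lines. The sketch below assumes the Wilson estimate with n + 4 in the standard error and z* = 1.96 for 95% confidence; the function name `wilson_interval` is illustrative.

```python
import math

def wilson_interval(successes, n, z_star=1.96):
    """Confidence interval for p based on the Wilson estimate."""
    p_tilde = (successes + 2) / (n + 4)
    se = math.sqrt(p_tilde * (1 - p_tilde) / (n + 4))
    margin = z_star * se
    return p_tilde - margin, p_tilde + margin

low, high = wilson_interval(15, 20)
print(f"95% CI: ({low:.3f}, {high:.3f})")
```

Up to rounding in the last digit, this matches the interval reported for the class example.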
Example cont. What does “95% confident” tell us? This interval is fairly wide. Hence, the interval does not give enough information about what p is. How can we make the interval narrower?
Determine a Sample Size Can we find the sample size needed to get a confidence interval for p that has a predetermined level of confidence and margin of error? Yes, we can! Just solve for n in the margin of error equation.
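Determine a Sample Size Recall: $m = z^* \sqrt{\tilde{p}(1-\tilde{p})/(n+4)}$. Solving for n we get $n = (z^*/m)^2 \, \tilde{p}(1-\tilde{p}) - 4$. But there is an issue! We don’t know $\tilde{p}$ until after the data are collected.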
Determine a Sample Size There are two solutions to the problem: (1) Use a guesstimate $p^*$ obtained from previous studies in place of $\tilde{p}$. (2) Use $p^* = 0.5$. This estimate is conservative and will always give the desired ME and confidence level, but if the true proportion is far from 0.5, then the sample size will be larger than needed and more money will be spent than necessary. So if you are not given a guesstimate, revert to $p^* = 0.5$.
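The effect of the planning value on the required sample size can be sketched as follows. The function name `sample_size`, the target margin 0.05, and the guess 0.2 are hypothetical choices for illustration; the formula is the margin-of-error equation solved for n, with n + 4 rounded up before subtracting 4.

```python
import math

def sample_size(margin, p_star=0.5, z_star=1.96):
    """Smallest n whose Wilson-based margin of error is at most `margin`.

    p_star is the planning value for the proportion; 0.5 is the
    conservative default when no guesstimate is available.
    """
    n_plus_4 = (z_star / margin) ** 2 * p_star * (1 - p_star)
    return math.ceil(n_plus_4) - 4

print("conservative (p* = 0.5):", sample_size(0.05))
print("guesstimate  (p* = 0.2):", sample_size(0.05, p_star=0.2))
```

Because p*(1 - p*) is largest at 0.5, the conservative choice never gives a smaller n than any other planning value.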
Example Is there interest in a new product? One of your employees has suggested that your company develop a new product. You decide to take a random sample of your customers and ask whether or not there is interest in the new product. The response is on a 1 to 5 scale, with 1 indicating “definitely would not purchase”; 2, “probably would not purchase”; 3, “not sure”; 4, “probably would purchase”; 5, “definitely would purchase.” For an initial analysis, you will record the responses 1, 2, and 3 as “No” and 4 and 5 as “Yes”. What sample size would you use if you wanted the margin of error for a 95% confidence interval to be 0.1 or less?
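Solution We want ME (or m) = 0.1 and our level of confidence to be 95%, so z* = 1.96. And since three of the five responses are recorded as “no”, we will use $p^* = 2/5 = 0.4$. So: $n + 4 = (1.96/0.1)^2 (0.4)(1 - 0.4) \approx 92.20$. And n = 89 (i.e., 92.20 rounded up to 93, then subtract 4 to get 89).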
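As a quick check on the answer, the sketch below evaluates the margin of error at n = 89 and at n = 88. It assumes the planning value p* = 0.4, which is the value implied by the 92.19 in the solution; the function name `margin_of_error` is illustrative.

```python
import math

def margin_of_error(n, p_star, z_star=1.96):
    """Wilson-based margin of error for sample size n at planning value p_star."""
    return z_star * math.sqrt(p_star * (1 - p_star) / (n + 4))

for n in (88, 89):
    # n = 89 should be the first sample size with a margin of 0.1 or less.
    print(f"n = {n}: margin of error = {margin_of_error(n, 0.4):.4f}")
```

The margin drops to 0.1 or below at n = 89 and is still slightly above 0.1 at n = 88, which is why the value is rounded up before subtracting 4.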
Summary