Published by Elaine Farmer. Modified over 8 years ago.
1
Inference and accuracy We learned how to use box models to estimate the chance that the percentage of some group in the sample falls in a given interval. We first set up a box, then analyzed the sample (the draws). That reasoning runs from the box to the draws. But it is often very useful to turn things around: analyze the draws, then draw conclusions about the box. This is called inference from the sample to the population.
2
Introduction Suppose a survey organization wants to know the percentage of Democrats in a certain district. They might estimate it by taking a simple random sample. In general, the percentage of Democrats in the sample will be a good estimate of the percentage of Democrats in the district. Since the sample is chosen at random, it is possible to say how accurate the estimate is likely to be, just from the size and composition of the sample. This technique is one of the key ideas in statistical theory.
3
Example
4
We all know that the percentage in the sample differs from the percentage in the whole district. The difference is the chance error. So the candidate is a little worried about the chance error: if the estimate is off by as much as 3%, he loses. However, the pollster has good news for the candidate: we can be about 95% confident that we are right to within 2%. That looks good.
5
Questions The pollster’s statement raises some natural questions: Where does the 95% come from? What does 95% confidence mean? Where does the 2% come from? Why is it within 2%, and not 1% or 3%? Let us go back to the example to see how the sampling procedure works.
6
Example First of all, we need to set up a box model as before: In the district, there are 100,000 eligible voters. So in the box, there are 100,000 tickets. The box is just the population. The simple random sample size is 2,500. So there are 2,500 draws made at random. Once again, because the box is so much larger than the sample, it makes little difference whether the draws are made with or without replacement. In the box, there are two kinds of tickets: 1’s and 0’s. The 1’s stand for the voters who favor the candidate, the 0’s stand for the other voters. The sum of the draws will be the number of voters in the sample who favor the candidate. This completes the model.
7
Example
9
Remarks The calculation shows that the candidate’s support is around 53%, give or take 1% or so. So the estimate is unlikely to be off by as much as 3%, since that would be 3 SEs. He is well on the safe side of 50%, and he should enter the primary. The bootstrap: When sampling from a 0-1 box whose composition is unknown, the SD of the box can be estimated by substituting the fractions of 0’s and 1’s in the sample for the unknown fractions in the box. The estimate is good when the sample is reasonably large.
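The bootstrap calculation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `bootstrap_se_percent` is made up for this sketch, the counts (1,328 in favor out of 2,500 draws) are the ones quoted in the example, and the correction for drawing without replacement is ignored because the box is much larger than the sample.

```python
import math

def bootstrap_se_percent(ones, n):
    """Return (sample percentage, estimated SE for the percentage), both in percent.

    Hypothetical helper for illustration: the unknown fractions in the box
    are replaced by the fractions observed in the sample (the bootstrap).
    """
    frac_ones = ones / n                             # fraction of 1's in the sample
    sd_box = math.sqrt(frac_ones * (1 - frac_ones))  # estimated SD of the 0-1 box
    se_sum = math.sqrt(n) * sd_box                   # SE for the sum of the draws
    se_pct = se_sum / n * 100                        # SE for the percentage
    return frac_ones * 100, se_pct

pct, se = bootstrap_se_percent(1328, 2500)
print(f"sample percentage = {pct:.1f}%, SE = {se:.1f}%")
# prints: sample percentage = 53.1%, SE = 1.0%
```

The result agrees with the remark above: about 53%, give or take 1% or so.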
10
Remarks The bootstrap procedure may seem crude. But in fact, even with moderate-sized samples, the fractions among the draws are likely to be quite close to the fractions in the box. One thing should be pointed out here: The expected value for the sum of the draws (the number of sampled voters who favor the candidate) is unknown, because it depends on the unknown composition of the box. And so is the expected value for the percentage. Note that in the example, 1,328 voters is an observed value. We cannot compute the exact value of the chance error, but we can compute the likely size of the chance error by computing the SE.
11
Another Example In fall 2005, a city university had 25,000 registered students. To estimate the percentage who were living at home, a simple random sample of 400 students was drawn. It turned out that 317 of them were living at home. Estimate the percentage of students at the university who were living at home in fall 2005. What is the standard error of the estimate?
12
Solution
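A sketch of the calculation, assuming the same bootstrap steps as in the first example (the rounding is chosen here for illustration and is not taken from the slides):

```latex
\begin{align*}
\text{sample percentage} &= \frac{317}{400} \times 100\% \approx 79\% \\
\text{SD of box} &\approx \sqrt{0.79 \times 0.21} \approx 0.41 \\
\text{SE for the sum} &\approx \sqrt{400} \times 0.41 \approx 8 \\
\text{SE for the percentage} &\approx \frac{8}{400} \times 100\% = 2\%
\end{align*}
```

So the estimate is about 79%, give or take 2% or so.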
14
Remark In the two examples, we focused on simple random sampling, where the mathematics is easiest. In practice, survey organizations use much more complicated designs. But with probability methods, it is generally possible to compute how big the chance errors are likely to be. This is one of the great advantages of probability methods for drawing samples.
15
Confidence intervals In the first example, we still don’t know where the 95% comes from, or what 95% confidence means. To make inferences about the population percentage from the sample, we introduce the confidence interval as a way to express accuracy.
16
Introduction In the second example, we know that the likely size of the chance error is about 2%. We have the equation: sample percentage = population percentage + chance error. It is still possible for the sample percentage to be 4% or 6% off, that is, 2 SEs or 3 SEs away from the population percentage, but errors that large are less likely. Errors within 1 SE or 2 SEs are more likely. That is, there is about a 68% chance that the interval “sample percentage ± SE” covers the population percentage, and about a 95% chance that the interval “sample percentage ± 2 SEs” covers the population percentage.
17
Definition
18
We can always define a confidence interval with a different confidence level: Indeed, any confidence level short of 100% is possible, by going the right number of SEs in either direction from the sample percentage. For instance: The interval “sample percentage ± 1 SE” is a 68%-confidence interval for the population percentage; The interval “sample percentage ± 2 SEs” is a 95%-confidence interval for the population percentage; The interval “sample percentage ± 3 SEs” is a 99.7%-confidence interval for the population percentage.
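The levels 68%, 95% and 99.7% are areas under the normal curve. As a rough check, the following Python sketch computes the area between minus and plus a given number of SEs using the standard library's error function; the helper name `confidence_level` is an assumption made for this illustration.

```python
import math

def confidence_level(num_se):
    """Area under the standard normal curve between -num_se and +num_se."""
    return math.erf(num_se / math.sqrt(2))

for k in (1, 2, 3):
    print(f"{k} SE either way: about {confidence_level(k) * 100:.1f}%")
# prints:
# 1 SE either way: about 68.3%
# 2 SE either way: about 95.4%
# 3 SE either way: about 99.7%
```

These round to the 68%, 95% and 99.7% quoted above.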
19
Definition
20
Example A simple random sample of 1,600 persons is taken to estimate the percentage of Democrats among the 25,000 eligible voters in a town. It turns out that 917 people in the sample are Democrats. Q: Find a 95%-confidence interval for the percentage of Democrats among all 25,000 eligible voters.
21
Solution
22
The tickets corresponding to Democrats are marked 1, and the others are marked 0. The number of Democrats in the sample is like the sum of draws. This completes the model. Since we don’t know the composition of the box, we have to apply the bootstrap procedure to estimate the SD of the box. By previous calculation, the fraction of 1’s can be estimated by 0.573. Then the fraction of 0’s can be estimated by 0.427.
23
Solution
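The interval can be computed along the following lines. This is a minimal sketch, assuming the bootstrap estimate of the SD and the “sample percentage ± 2 SEs” rule; the function name `confidence_interval_percent` is hypothetical, and the correction for drawing without replacement is ignored.

```python
import math

def confidence_interval_percent(ones, n, num_se=2.0):
    """Return (low, high) in percent: sample percentage +/- num_se SEs."""
    frac = ones / n                        # fraction of 1's in the sample
    sd_box = math.sqrt(frac * (1 - frac))  # bootstrap estimate of the box SD
    se_pct = sd_box / math.sqrt(n) * 100   # SE for the percentage
    center = frac * 100
    return center - num_se * se_pct, center + num_se * se_pct

low, high = confidence_interval_percent(917, 1600)
print(f"{low:.1f}% to {high:.1f}%")
# prints: 54.8% to 59.8%
```

Here the sample percentage is about 57.3% and the SE is about 1.25%, so going 2 SEs either way gives roughly 54.8% to 59.8%.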
24
Remarks Confidence levels are often quoted as being “about” so much. For instance, in the previous problem, we can be about 95% confident that between 54.8% and 59.8% of the eligible voters in this town are Democrats. There are two reasons for the word “about”: (i) The standard errors have been estimated from the data. (ii) The normal approximation has been used. Both approximations work well when the percentage composition of the sample is very close to that of the population and the sample is large enough for the normal curve to apply.
25
Interpreting a confidence interval
26
It seems very natural to say “There is a 95% chance that the population percentage is between 75% and 83%.” But this causes a problem. In the frequency theory, the probability represents the percentage of the time that something will happen. The percentage of students who were living at home is fixed. It won’t change no matter how many times we sample the students. So this percentage is either between 75% and 83%, or not. In terms of frequency theory, the chance must be 100% or 0%.
27
Interpreting a confidence interval Hence, we don’t say “There is a 95% chance that the population percentage is between 75% and 83%.” We say “We are about 95% confident that the population percentage is between 75% and 83%”. The word “confident” is to remind you that the probability here is in the sampling procedure, not in the parameter. The idea of sampling procedure can be explained in the following way:
28
Sampling procedure The probability (95%) here is about the sample, and the sample percentage approximately follows the normal curve. Note that the confidence interval depends on the sample. With some samples, the interval “sample percentage ± 2 SEs” covers the population percentage. But with other samples, the interval fails to cover it. So the confidence level of 95% simply states that: for about 95% of all samples, the interval “sample percentage ± 2 SEs” covers the population percentage.
29
Sampling procedure We usually cannot tell whether the particular interval covers the population percentage or not. This is because we do not know the actual parameter. But we are using a procedure that works about 95% of the time. We may think of the interval as being drawn at random from a box of intervals, where 95% cover the parameter and only 5% fail.
30
Remark A confidence interval is used when estimating an unknown parameter from sample data. The interval gives a range for the parameter, and a confidence level that the range covers the true value. Probabilities are used when we reason forward, from the box to the draws. Confidence levels are used when we reason backward, from the draws to the box. The idea for confidence level is a bit difficult, because it involves thinking not only about the actual sample but about other samples that could have been drawn. The following is an illustration:
31
Interpreting confidence intervals Suppose we want to estimate the percentage of red marbles in a large box. We use a computer to simulate 100 samples. (To complete the model, we set the percentage to 80%, which in reality would be unknown to us.) Each sample is of size 2,500. For each sample, we compute the 95%-confidence interval using “sample percentage ± 2 SEs”. The percentage is different from sample to sample, and so is the estimated SE. So the intervals have different centers and lengths. About 95% of them should cover the parameter, which is marked by a vertical line. In fact, 96 out of 100 intervals cover the parameter in this simulation.
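A simulation of this kind can be sketched in Python as follows. The setup (80% red, 100 samples of size 2,500, intervals of ± 2 SEs) follows the description above; the function name `simulate_coverage`, the seed, and the choice to draw with replacement (reasonable for a large box) are assumptions of this sketch, so the count it reports will not necessarily be the 96 seen on the slide.

```python
import math
import random

def simulate_coverage(true_frac=0.80, n=2500, num_samples=100, num_se=2.0, seed=0):
    """Count how many simulated intervals cover the true fraction."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    covered = 0
    for _ in range(num_samples):
        # Draw n marbles with replacement; each one is red with chance true_frac.
        reds = sum(1 for _ in range(n) if rng.random() < true_frac)
        frac = reds / n
        # Bootstrap SE for the sample fraction.
        se = math.sqrt(frac * (1 - frac)) / math.sqrt(n)
        if frac - num_se * se <= true_frac <= frac + num_se * se:
            covered += 1
    return covered

print(simulate_coverage(), "of 100 intervals cover the true percentage")
```

Changing the seed gives a different set of 100 intervals; the number that cover should stay near 95 but will vary from run to run, which is exactly the point of the confidence statement.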
32
Comments for sampling From the last chapter, we see that one conclusion holds for most probability methods of drawing samples: it is the sample size, not the population size, that mainly determines the accuracy. We have to point out that the formulas for simple random sampling may not apply to other kinds of samples. This is because the chances when drawing tickets at random from a box are different from the chances when drawing a sample by a more complicated method. This can be seen by comparing the Gallup Poll samples to simple random samples of the same size.
33
The Gallup Poll Here is the table comparing the Gallup Poll with a simple random sample.
34
The Gallup Poll Most errors were considerably larger than the SE for the simple random samples. One reason is that predictions are based only on part of the sample, namely, those people judged likely to vote. This eliminates about half the sample. Here is the new table for comparison. The simple random sample formula is still not doing well.
35
The Gallup Poll The reasons are that: (i) The process used to screen out the non-voters may break down at times. (ii) Some voters may still not have decided how to vote when they are interviewed. (iii) Voters may change their minds between the last pre-election poll and election day, especially in close contests. As a result, in reality survey organizations have to use more complicated methods for estimating the SE.
36
Summary With a simple random sample, the sample percentage is used to estimate the population percentage. The bootstrap procedure: when sampling from a 0-1 box whose composition is unknown, the SD of the box can be estimated by substituting the fractions of 0’s and 1’s in the sample for the unknown fractions in the box. A confidence interval for the population percentage is obtained by going the right number of SEs either way from the sample percentage. The confidence level is read off the normal curve. This method should only be used with large samples.
37
Summary In the frequency theory of probability, parameters are not subject to chance variation. We use confidence statements instead of probability statements. The formulas for simple random sampling may not apply to other kinds of samples, even if the samples are drawn by probability methods.