Published by Primrose Marsh. Modified over 8 years ago.
1
Introduction Sample surveys involve chance error. Here we will study how to find the likely size of the chance error in a percentage, for simple random samples from a population whose composition is known. This mainly depends on the size of the sample, not the size of the population.
2
Example Suppose a health study is based on a representative cross section of 6,672 Americans age 18 to 79. There are 3,091 men and 3,581 women. So men are about 46%. We want to interview a sample of size 100. To avoid bias, we are going to draw the sample at random. To do that, we put the names on 6,672 tickets, and draw out 100 tickets at random. We use a computer program to simulate this process: draw tickets without replacement.
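The drawing process described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slide does not say which program was used, and the function name `draw_sample` and the seed are hypothetical choices made here.

```python
import random

# Box from the slide: 3,091 tickets marked 1 (men) and 3,581 marked 0 (women).
N_MEN, N_WOMEN = 3091, 3581
BOX = [1] * N_MEN + [0] * N_WOMEN


def draw_sample(n, rng):
    """Draw n tickets at random without replacement; return the number of men."""
    return sum(rng.sample(BOX, n))


rng = random.Random(0)  # arbitrary seed, used only so the run is repeatable
men = draw_sample(100, rng)
print(f"men: {men}, women: {100 - men}")
```

A different seed gives a different split between men and women, which is exactly the chance variability the next slides discuss.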
3
Example In the previous sample, there were 51 men and 49 women. This does not match the percentages in the population: 46% are men and 54% are women. The difference is due to chance variability. The sampling process is similar to the chance processes we learned before. In the coin tossing process, the chances are 50-50, whereas in our example, the chances are just about 46-54 on each draw. (Note that when the number of tickets in the box is large relative to the number of draws, the draws can be treated as independent even though we draw without replacement.) Here, the chance error for the percentage of men is 5%, since 51% = 46% + 5%. (Recall: estimate = parameter + chance error.)
4
Example If we repeat the sampling process, the percentage of men will vary from sample to sample:
5
Example From the previous table for the 250 samples, we can see the number of men ranged from a low of 34 to a high of 58. Only 17 samples out of the lot have exactly 46 men. Here is a histogram:
6
Example If we increase the size of the sample, the sample will come out more like the population. For instance, we increase the sample size to 400 and draw another 250 samples. The percentage of men again varies from sample to sample. The low is 39%, and the high is 54%. Compared with the samples of size 100, the samples of size 400 come out closer to the population. (Size 100: low 34%, high 58%. The population: 46%.)
7
Example Here is the histogram for the percentages of men in samples of size 400:
8
Remarks Comparing the two sets of samples, multiplying the sample size by 4 cuts the likely size of the chance error in the percentage by a factor of around 2. So one may expect some quantitative relation between the sample size and the chance error in the percentage. Each sample was drawn afresh from the same population of 6,672 people. All the samples are different from each other. This does not mean that there are 250 × 100 = 25,000 people in total. Because the process starts over from the full population every time, some people may be picked in many samples.
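The effect of quadrupling the sample size can be checked with a small simulation. The sketch below is an assumption-laden illustration, not the program used for the slides: it draws 250 samples at each size and measures the spread of the sample percentages with a standard deviation, and the names `percent_men` and `spread` are hypothetical.

```python
import random
import statistics

# Box from the health study: 3,091 ones (men) and 3,581 zeros (women).
BOX = [1] * 3091 + [0] * 3581


def percent_men(n, rng):
    """Percentage of men in one simple random sample of size n."""
    return 100 * sum(rng.sample(BOX, n)) / n


def spread(n, rng, repeats=250):
    """Standard deviation of the sample percentage over repeated samples."""
    return statistics.pstdev(percent_men(n, rng) for _ in range(repeats))


rng = random.Random(0)  # arbitrary seed for repeatability
sd100 = spread(100, rng)
sd400 = spread(400, rng)
print(f"size 100: {sd100:.2f}%  size 400: {sd400:.2f}%  ratio: {sd100 / sd400:.2f}")
```

The ratio should land near 2, though with only 250 samples per size it will not be exactly 2.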
9
The expected value and standard error With a simple random sample, the expected value for the sample percentage equals the population percentage. The standard error for the percentage is the standard error for the number divided by the sample size, expressed as a percent.
10
The expected value Let us continue the previous example: take a sample of size 100 from a population of 6,672 people in a health study, of whom about 46% are men and 54% are women. We know that the percentage of men in the sample will be around the percentage of men in the population, that is, about 46%. This is the expected value for the sample percentage in a simple random sample. In practice, the percentage in a sample will not be exactly equal to its expected value: it will be off by a chance error. As with other chance processes, the likely size of the chance error in a sampling process is given by the standard error.
11
The standard error The standard error is computed in two steps. First, find the SE for the number of men in the sample. Then convert to percent, relative to the size of the sample, i.e. 100. Note: we are now working with percentages, so the SE must be converted to percent. To compute the SE for the number, we must set up a box model. Since we are counting the men, the box should contain tickets marked 1 and 0. The 1’s stand for men and the 0’s stand for women.
12
The standard error There are 3,091 men and 3,581 women, so that in the box there are 3,091 1’s and 3,581 0’s.
13
The standard error
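The computation on this slide is not preserved in the transcript. As a hedged sketch of the two steps described earlier (SE for the number, then conversion to percent), assuming the usual rule that the SD of a 0-1 box is the square root of (fraction of 1’s × fraction of 0’s); the function name `se_for_sample` is hypothetical:

```python
import math


def se_for_sample(n_ones, n_zeros, n_draws):
    """Return (SE for the number of 1's, SE for the percentage of 1's)."""
    p = n_ones / (n_ones + n_zeros)            # fraction of 1's in the box
    sd_box = math.sqrt(p * (1 - p))            # SD of a 0-1 box
    se_number = math.sqrt(n_draws) * sd_box    # SE for the sum of the draws
    se_percent = se_number / n_draws * 100     # convert to percent of sample size
    return se_number, se_percent


se_number, se_percent = se_for_sample(3091, 3581, 100)
print(f"SE for number: {se_number:.2f}, SE for percentage: {se_percent:.2f}%")
```

For a sample of 100, both values come out close to 5, so the percentage of men in the sample is likely to be around 46%, give or take 5% or so.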
14
Increase the sample
15
The formulas
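The formulas themselves are missing from this transcript. Under the box model set up earlier, they would presumably be the standard ones below, where n is the number of draws and p is the fraction of 1’s in the box (both symbols are introduced here for illustration):

```latex
\begin{align*}
\text{SD of a 0-1 box} &= \sqrt{p\,(1-p)} \\
\text{SE for number} &= \sqrt{n} \times \text{SD of box} \\
\text{SE for percentage} &= \frac{\text{SE for number}}{n} \times 100\%
\end{align*}
```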
16
Note
17
Remarks Some of you may have noticed that the arguments are exact only when drawing with replacement. But as we mentioned earlier, when the number of draws is small relative to the total number of tickets in the box, the draws can be treated as independent. In this case, there is almost no difference between drawing with or without replacement. We may also notice that the SE for the number and the SE for the percentage behave quite differently: The SE for the sample number goes up like the square root of the sample size. The SE for the sample percentage goes down like the square root of the sample size.
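The opposite behavior of the two standard errors can be seen by tabulating them for a few sample sizes. This is a minimal sketch that treats the draws as made with replacement and reuses the health-study box; the helper names `se_number` and `se_percent` are hypothetical.

```python
import math

# Fraction of men in the health-study box and the SD of the 0-1 box.
P_MEN = 3091 / 6672
SD_BOX = math.sqrt(P_MEN * (1 - P_MEN))


def se_number(n):
    """SE for the number of men in a sample of size n."""
    return math.sqrt(n) * SD_BOX


def se_percent(n):
    """SE for the percentage of men in a sample of size n."""
    return se_number(n) / n * 100


for n in (100, 400, 1600):
    print(f"n={n:5d}  SE number={se_number(n):6.2f}  SE percent={se_percent(n):5.2f}%")
```

Each fourfold increase in n doubles the SE for the number and halves the SE for the percentage.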
18
Normal Approximation As before, we use the normal curve to estimate the probability that the sample percentage falls in a given interval.
19
Example In a town, the telephone company has 100,000 subscribers. They plan to take a simple random sample of 400 of the subscribers as part of a market research study. According to Census data, 20% of the company’s subscribers earn over $50,000 a year. Q1: The percentage of persons in the sample with incomes over $50,000 a year will be around ____, give or take ____ or so. Q2: Estimate the probability that between 18% and 22% of the persons in the sample earn more than $50,000 a year.
20
Solution To begin with, we set up a box model. Whether or not a person’s income is over $50,000 a year is just a classifying and counting problem, so we use tickets marked 1 and 0. The people earning more than $50,000 a year get 1’s, and the others get 0’s. Taking a sample of 400 from the population of 100,000 is just like drawing 400 tickets at random from a box of 100,000 tickets. We will look at the sum of the draws and the corresponding percentage. From the Census data, 20% of the tickets are 1’s, that is, 20,000 tickets. The remaining 80,000 are 0’s.
21
Solution
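The worked computation for Q1 is not preserved in the transcript. A sketch of it, following the box model above and treating the draws as made with replacement (reasonable here, since 400 is a small part of 100,000); the variable names are chosen for illustration:

```python
import math

n_draws = 400
p_ones = 20_000 / 100_000                      # fraction of 1's in the box

sd_box = math.sqrt(p_ones * (1 - p_ones))      # SD of the 0-1 box, about 0.4
expected_number = n_draws * p_ones             # expected number of 1's
se_number = math.sqrt(n_draws) * sd_box        # SE for the number of 1's

expected_percent = expected_number / n_draws * 100
se_percent = se_number / n_draws * 100

print(f"expected: {expected_percent:.0f}%, give or take {se_percent:.0f}% or so")
```

This gives the answer to Q1: the sample percentage will be around 20%, give or take 2% or so.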
22
Notice that the classifying and counting problem is just a special case of the sum of draws. We know that, by the central limit theorem, the probability histogram for the sum of draws follows the normal curve when the number of draws is reasonably large. So when the sample size is large enough, the probability histogram for the sample percentage can also be approximated by the normal curve. (Converting to percent is just a change of scale.) In Q2, we are dealing with the probability histogram for the sample percentage. By the above arguments, the normal approximation applies.
23
Solution Since the expected value is 20%, and the SE is 2%, we now can convert the scale to standard units. The 18% is converted to -1, and the 22% is converted to +1. Recall that the area under the normal curve between -1 and +1 is about 68%. So the probability that between 18% and 22% of the persons in the sample earn more than $50,000 a year is about 68%. This completes the solution to Q2.
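The conversion to standard units and the area under the normal curve can be checked numerically. The sketch below uses `math.erf` from the Python standard library to evaluate the normal curve; the function names `normal_cdf` and `prob_between` are hypothetical.

```python
import math


def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def prob_between(low, high, expected, se):
    """Normal approximation to the chance that a value falls in [low, high]."""
    z_low = (low - expected) / se     # convert to standard units
    z_high = (high - expected) / se
    return normal_cdf(z_high) - normal_cdf(z_low)


chance = prob_between(18, 22, expected=20, se=2)
print(f"chance of 18% to 22%: about {chance:.0%}")
```

The same function handles other intervals, for example `prob_between(16, 24, 20, 2)` for two SEs on either side of the expected value.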
24
Remark In standard units, the histogram for number and the histogram for percentage are exactly the same. This is an application of the change of scale:
25
Note When the problem is about classifying and counting in a sample, we set up a 0-1 box to get a percent. Other problems involve adding up the sample values. That is the general case of the sum of draws box model. There we have to set up the box with tickets showing the values themselves, and work with the sum or the average as we did before.
26
Size of Population We have already seen that the size of the sample determines the standard error for the percentage. As a result, the size of the sample determines the accuracy. We have also seen that when the sample is small relative to the population, the sampling process can be treated as a box model with draws made with replacement. Then a natural question arises: does the size of the population affect the accuracy?
27
Size of Population The answer is: No. When estimating percentages, it is the absolute size of the sample which determines accuracy, not the size relative to the population. This is true if the sample is only a small part of the population, which is the usual case.
28
Example In the 2004 presidential campaign, Bush versus Kerry, let us focus on the Southwest: New Mexico and Texas. Pollsters try to predict the results. There are about 1.5 million eligible voters in New Mexico, and about 15 million in Texas. One polling organization takes a simple random sample of 2,500 voters in New Mexico. Another polling organization takes a simple random sample of 2,500 voters in Texas. Let us compare the accuracy of the two predictions.
29
Example Intuitively, it seems the New Mexico poll should be more accurate than the Texas poll, because the New Mexico poll is sampling 1 voter out of 600, while the Texas poll is sampling 1 voter out of 6,000. Let us set up two box models to have a look at this. One of the boxes has 1.5 million tickets, and the other has 15 million. The tickets are marked either 1 or 0. The tickets marked 1 stand for Democrats, and the tickets marked 0 stand for others. To keep life simple, we make the percentage of 1’s in both boxes equal to 50%.
30
Example
32
Remark
33
When the box is large relative to the number of draws, the correction factor is nearly 1 and can be ignored. This again shows that, in general, it is the absolute size of the sample which determines accuracy. The size of the population does not really matter. On the other hand, if the sample is a substantial fraction of the population, the correction factor must be used. But that is not the usual case.
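The slide showing the correction factor is not preserved in the transcript. Assuming it is the standard finite-population correction, the square root of (box size − draws) / (box size − 1), the two polls can be compared as follows; the function name `se_percent_without_replacement` is hypothetical.

```python
import math


def se_percent_without_replacement(box_size, n_draws, p_ones):
    """Return (SE for the percentage when drawing without replacement, correction factor)."""
    sd_box = math.sqrt(p_ones * (1 - p_ones))
    se_with = math.sqrt(n_draws) * sd_box / n_draws * 100   # SE with replacement, in percent
    correction = math.sqrt((box_size - n_draws) / (box_size - 1))
    return correction * se_with, correction


for state, voters in (("New Mexico", 1_500_000), ("Texas", 15_000_000)):
    se, factor = se_percent_without_replacement(voters, 2_500, 0.5)
    print(f"{state:10s}  correction factor={factor:.5f}  SE={se:.4f}%")
```

Both correction factors are within a tenth of a percent of 1, so both polls have an SE of about 1 percentage point despite the tenfold difference in population.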
34
Comments for the sampling All the arguments in this chapter focus on simple random sampling. But the conclusion holds for most probability methods of drawing samples (e.g. multistage cluster sampling). A very important point is that the likely size of the chance error in sample percentages depends mainly on the absolute size of the sample, not on the size of the population. That is, the size of the sample mainly determines the accuracy.
35
Comments for the sampling For example, the Gallup Poll predicts the vote with good accuracy by sampling several thousand eligible voters out of 200 million. This may seem amazing, but a sample of size 2,500 is big enough. If we toss a coin 2,500 times, the standard error for the percentage of heads is only 1%. Similarly, with a sample of 2,500 voters, the likely size of the chance error is only about 1% or so. This will work unless the election is very close, like Bush versus Gore in 2000.
36
Summary
37
When the sample is only a small part of the population, it is the sample size which determines the accuracy, not the population size. When the box size is large relative to the number of draws, there is almost no difference between drawing with and without replacement. In this case, the correction factor is nearly 1. The process can be considered as drawing with replacement.