The accuracy of averages

1 The accuracy of averages We learned how to make inferences from the sample to the population by counting percentages. Here we begin to learn how to make inferences about the average from the sample to the population.

2 Introduction We want to estimate the accuracy of an average computed from a simple random sample. Again, we deal with the situation where the parameter (the average, or expected value, of the population) is unknown. First of all, we need to figure out the likely size of the chance error in the average. This is measured by the SE for the average.

3 Example Let’s look at a box with the tickets 1, 2, 3, 4, 5, 6, 7. In a computer simulation, 25 draws made at random with replacement could be: 2 4 3 2 5 7 5 6 4 5 4 4 1 2 4 4 6 4 7 2 7 2 5 7 3. The sum is 105, and the average is 105/25 = 4.2. Another simulation came out differently: 5 1 4 3 4 5 2 1 7 7 1 2 3 2 4 7 1 6 5 3 6 6 3 3 4. The sum is now 95, and the average is 95/25 = 3.8. The sum is subject to chance variability, and therefore so is the average.
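
The two runs above can be checked, and a fresh run generated, with a short Python sketch. The function names and the seed are illustrative choices, not part of the slides.

```python
import random

# The box from the slide: one ticket for each of the numbers 1 to 7.
box = [1, 2, 3, 4, 5, 6, 7]

# The two runs of 25 draws quoted on the slide.
run_1 = [2, 4, 3, 2, 5, 7, 5, 6, 4, 5, 4, 4, 1, 2, 4, 4, 6, 4, 7, 2, 7, 2, 5, 7, 3]
run_2 = [5, 1, 4, 3, 4, 5, 2, 1, 7, 7, 1, 2, 3, 2, 4, 7, 1, 6, 5, 3, 6, 6, 3, 3, 4]


def sum_and_average(draws):
    """Return the sum of the draws and their average."""
    total = sum(draws)
    return total, total / len(draws)


def simulate_draws(n_draws, seed=None):
    """Draw n_draws tickets at random with replacement from the box."""
    rng = random.Random(seed)
    return [rng.choice(box) for _ in range(n_draws)]


print(sum_and_average(run_1))  # (105, 4.2)
print(sum_and_average(run_2))  # (95, 3.8)

# A new run: both the sum and the average change from run to run.
new_run = simulate_draws(25, seed=0)
print(new_run, sum_and_average(new_run))
```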

4 Example

5 As an application of the change of scale: if we convert the two probability histograms (for the sum and for the average) into standard units, they are exactly the same. So the probability histogram for the average of the draws can be approximated by the normal curve, provided the number of draws is large enough.
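
A minimal sketch of why the two histograms coincide in standard units, assuming the box 1 to 7 and 25 draws from slide 3. The helper name `standard_units` is illustrative. Each possible sum and the matching average land on the same value once converted.

```python
import math

box = [1, 2, 3, 4, 5, 6, 7]
n_draws = 25

# Average and SD of the box (SD computed with divisor equal to the number of tickets).
box_avg = sum(box) / len(box)
box_sd = math.sqrt(sum((t - box_avg) ** 2 for t in box) / len(box))

# Expected value and SE for the sum, and for the average.
ev_sum, se_sum = n_draws * box_avg, math.sqrt(n_draws) * box_sd
ev_avg, se_avg = box_avg, box_sd / math.sqrt(n_draws)


def standard_units(value, expected, se):
    """Number of SEs that value lies above (+) or below (-) its expected value."""
    return (value - expected) / se


# The two sums observed on slide 3, each paired with its average.
for total in (105, 95):
    z_sum = standard_units(total, ev_sum, se_sum)
    z_avg = standard_units(total / n_draws, ev_avg, se_avg)
    print(total, round(z_sum, 6), round(z_avg, 6))
```

Dividing by the number of draws rescales the value, the expected value, and the SE by the same factor, so the ratio is unchanged.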

6 Example When drawing at random from a box, the probability histogram for the average of the draws follows the normal curve, even if the contents of the box do not. The histogram must be put into standard units, and the number of draws must be reasonably large.

7 Increase the draws

8 Comments As with the SE for the number and the SE for the percentage, the SE for the sum and the SE for the average behave quite differently as the number of draws increases. As the number of draws goes up, the SE for the sum gets bigger, but the SE for the average gets smaller. Since the SE for the average corresponds to the SE for the percentage, when drawing without replacement the exact SE for the average can be found using the correction factor: SE without replacement = correction factor × SE with replacement. When the number of draws is a small part of the total number of tickets, the correction factor is close to 1 and can be ignored.
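
A small sketch of both points, assuming a box with SD 2 (the box 1 to 7). The slide does not spell out the correction factor, so the code uses the standard formula, the square root of (tickets − draws)/(tickets − 1). The function names are illustrative.

```python
import math


def se_for_sum(n_draws, box_sd):
    """SE for the sum of draws made at random with replacement."""
    return math.sqrt(n_draws) * box_sd


def se_for_average(n_draws, box_sd):
    """SE for the average: the SE for the sum divided by the number of draws."""
    return se_for_sum(n_draws, box_sd) / n_draws


def correction_factor(n_tickets, n_draws):
    """Standard correction factor for drawing without replacement (assumed formula)."""
    return math.sqrt((n_tickets - n_draws) / (n_tickets - 1))


box_sd = 2.0  # SD of the box 1, 2, ..., 7

# More draws: the SE for the sum grows, the SE for the average shrinks.
for n in (25, 100, 400):
    print(n, se_for_sum(n, box_sd), se_for_average(n, box_sd))

# 1,000 draws from a box of 25,000 tickets: the factor is close to 1.
print(round(correction_factor(25_000, 1_000), 3))
```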

9 Sample average Now we come to the main issue in this chapter: making inferences about the average from the sample to the population. Two main questions need attention: What is the difference between the SD of the sample and the SE for the sample average? Why is it OK to use the normal curve in figuring confidence levels? With these questions in mind, we look at two examples.

10 Example 1 A city manager wants to know the average income of the 25,000 families living in his town. He hires a survey organization to take a simple random sample of 1,000 families. The total income of the sample families turns out to be $62,396,714. So the average is about $62,400. Then the average income for all 25,000 families is estimated as $62,400. This estimate is off by a chance error. So we have to figure out the SE for the average.

11 Example 1 We first set up a box model. The problem is about an average, not about counting, so we no longer use the 0-1 box. Since the population size is 25,000, there are 25,000 tickets in the box, each showing one family’s income. Because the incomes vary from family to family, we need to summarize the data of the population. Remember that the average and the SD are two good summary statistics for data. The average of the box has already been estimated as $62,400, so all we need is the SD of the box. But the data for the whole population are unknown, so we use the bootstrap method to estimate the SD of the box: substitute the SD of the sample for the SD of the box.
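
The income data are not reproduced in this transcript, so the following sketch illustrates the bootstrap step on the box 1 to 7 instead, using the first run of 25 draws from slide 3 as the sample. The SD is computed with the number of values as the divisor. That choice, like the variable names, is an assumption made for the illustration.

```python
import math


def sd(values):
    """SD of a list: root-mean-square deviation from the average."""
    avg = sum(values) / len(values)
    return math.sqrt(sum((v - avg) ** 2 for v in values) / len(values))


box = [1, 2, 3, 4, 5, 6, 7]
sample = [2, 4, 3, 2, 5, 7, 5, 6, 4, 5, 4, 4, 1, 2, 4, 4, 6, 4, 7, 2, 7, 2, 5, 7, 3]
n_draws = len(sample)

sample_sd = sd(sample)  # known from the sample
box_sd = sd(box)        # unknown in practice; available here only for comparison

# Bootstrap: put the sample SD where the box SD belongs.
se_avg_bootstrap = sample_sd / math.sqrt(n_draws)
se_avg_true = box_sd / math.sqrt(n_draws)

print(round(sample_sd, 2), box_sd)
print(round(se_avg_bootstrap, 2), se_avg_true)
```

With only 25 draws the sample SD (about 1.77) is somewhat off the box SD of 2. With 1,000 draws, as in the income survey, the estimate would usually be closer.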

12 Example 1

13 We could also use a confidence interval to state the accuracy. For example, a 95%-confidence interval for the average of the incomes is obtained by going 2 SEs either way from the sample average: “$62,400 ± 2 × $1,700 = $59,000 to $65,800”. Once again, “$59,000 to $65,800” is just one of the confidence intervals “sample average ± 2 SEs”. The 95% confidence level says that about 95% of such intervals cover the true value (the average income of the population, which is the parameter).
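
A sketch of both statements on this slide. The first part reproduces the interval arithmetic. The second part checks the "about 95%" claim by simulation on a stand-in box (the tickets 1 to 7, drawn with replacement), since the income data are not available. The box, the sample sizes and the seed are assumptions made for the illustration.

```python
import math
import random


def sd(values):
    """SD of a list: root-mean-square deviation from the average."""
    avg = sum(values) / len(values)
    return math.sqrt(sum((v - avg) ** 2 for v in values) / len(values))


def confidence_interval(sample_avg, se_avg, n_se=2):
    """Go n_se SEs either way from the sample average."""
    margin = n_se * se_avg
    return sample_avg - margin, sample_avg + margin


# The interval from the slide.
print(confidence_interval(62_400, 1_700))  # (59000, 65800)


def coverage(box, n_draws, n_samples, seed=0):
    """Fraction of simulated samples whose 2-SE interval covers the box average."""
    rng = random.Random(seed)
    box_avg = sum(box) / len(box)
    hits = 0
    for _ in range(n_samples):
        draws = rng.choices(box, k=n_draws)
        se_avg = sd(draws) / math.sqrt(n_draws)  # bootstrap: sample SD for box SD
        low, high = confidence_interval(sum(draws) / n_draws, se_avg)
        hits += low <= box_avg <= high
    return hits / n_samples


# Should come out near 0.95.
print(coverage([1, 2, 3, 4, 5, 6, 7], n_draws=100, n_samples=2000))
```

The interval changes from sample to sample while the box average stays fixed. The 95% refers to the procedure, not to any one interval.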

14 Remarks Since we don’t know the average of the box, we don’t know the expected value for the sum of the draws. The total income $62,396,714 of the sample families is just an observed value, and so is the average $62,400. The SE for the average (or for the sum) measures the likely size of the chance error in the equation: Observed value = expected value + chance error. So the SE says how far sample averages are from the population average, for typical samples. (Comparing the averages.) The SD, by contrast, says how far family incomes are from the average, for typical families. (Comparing the incomes.)

15 Example 2 As part of an opinion survey, a simple random sample of 400 persons age 25 and over is taken in a certain town in Appalachia. The total years of schooling completed by the sample persons is 4,635. So their average educational level is 4,635/400 ≈ 11.6 years. The SD of the sample is 4.1 years. Find a 95%-confidence interval for the average educational level of all persons age 25 and over in this town.

16 Solution We have very little information about the population, which is the usual situation. For the box model, there should be one ticket for each person, showing the number of years of schooling completed by that person. It does not matter that we don’t know the number of tickets in the box; let’s assume it is very large relative to the sample size. The sample amounts to 400 draws made at random from the box. The data in the box can be estimated from the draws. This completes the box model.

17 Solution
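
The worked computation for this slide did not survive in the transcript. The sketch below is one way to carry it out from the figures in Example 2, assuming the same steps as in Example 1: bootstrap the SD of the box, get the SE for the sum, divide by the number of draws, and go 2 SEs either way. The variable names are illustrative.

```python
import math

n_draws = 400        # sample size
total_years = 4_635  # total years of schooling in the sample
sample_sd = 4.1      # SD of the sample, used in place of the SD of the box

sample_avg = total_years / n_draws           # about 11.6 years
se_sum = math.sqrt(n_draws) * sample_sd      # about 82 years
se_avg = se_sum / n_draws                    # about 0.2 years

# 95%-confidence interval: sample average plus or minus 2 SEs.
low, high = sample_avg - 2 * se_avg, sample_avg + 2 * se_avg

print(round(sample_avg, 1), round(se_avg, 1))
print(round(low, 1), round(high, 1))
```

Rounded to one decimal, the interval runs from about 11.2 to about 12.0 years.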

18 Remark Why is it OK to use the normal curve to calculate the confidence level of 95%? After all, the histogram for educational levels looks nothing like the normal curve.

19 Remark Here is a computer simulation. In reality, we don’t know the contents of the box, but the mathematical theory still applies. In the sample, although there are a few too many people with 8-9 years of education, the histogram is very similar to the one for the population. This indicates that the sample SD is a good estimate for the population SD: the two histograms show about the same amount of spread.

20 Remark The reason we can use the normal curve is similar to the case of the sample percentage. Remember that in that case we used a 0-1 box. The sample percentage is just a change of scale of the sum of the draws, because we are counting the 1’s. By the central limit theorem (CLT), the sum of the draws follows the normal curve, even if the composition of the box is far from normal (e.g. 10% 1’s, 90% 0’s). Here, the sample average is again a change of scale of the sum of the draws; the numbers on the tickets are the data (years of education). So even if the data do not follow the normal curve, the sum of the draws still does, provided the number of draws is large enough. With a small sample, the normal curve should not be used.

21 Remark Here is the probability histogram for the average of draws. It does not represent data. Instead, it represents chances for the sample average (a change of scale of the sum of draws). Clearly, the normal curve is a good approximation to the histogram.
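
The histogram itself is not reproduced here. As a rough stand-in, the sketch below simulates many sample averages from a deliberately lopsided box and compares the share falling within 1 and 2 SEs of the box average with the areas under the normal curve. The box contents, the number of draws and the seed are hypothetical choices for the illustration.

```python
import math
import random
from statistics import NormalDist


def sd(values):
    """SD of a list: root-mean-square deviation from the average."""
    avg = sum(values) / len(values)
    return math.sqrt(sum((v - avg) ** 2 for v in values) / len(values))


# A hypothetical box whose contents look nothing like the normal curve.
box = [1, 1, 1, 2, 2, 3, 9]
n_draws = 100
box_avg = sum(box) / len(box)
se_avg = sd(box) / math.sqrt(n_draws)

# Simulate many sample averages and convert each to standard units.
rng = random.Random(0)
z_values = []
for _ in range(5000):
    avg = sum(rng.choices(box, k=n_draws)) / n_draws
    z_values.append((avg - box_avg) / se_avg)

# Compare simulated chances with areas under the normal curve.
normal = NormalDist()
results = {}
for z in (1, 2):
    simulated = sum(abs(v) <= z for v in z_values) / len(z_values)
    predicted = normal.cdf(z) - normal.cdf(-z)
    results[z] = (simulated, predicted)
    print(z, round(simulated, 3), round(predicted, 3))
```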

22 A summary for SE
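
The table on this slide did not survive in the transcript. The following is a reconstruction from the relationships stated elsewhere in the deck and the standard formulas for draws made at random with replacement; it is not the slide's original layout.

```latex
\begin{align*}
\text{SE for sum} &= \sqrt{\text{number of draws}} \times \text{SD of box} \\
\text{SE for average} &= \frac{\text{SE for sum}}{\text{number of draws}} \\
\text{SE for count} &= \text{SE for sum, from a 0--1 box} \\
\text{SE for percentage} &= \frac{\text{SE for count}}{\text{number of draws}} \times 100\%
\end{align*}
```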

23 Summary The SE for the average equals the SE for the sum divided by the number of draws. When drawing at random or with a simple random sample, the average of the draws can be used to estimate the average of the box. The SD of the sample can be used to estimate the SD of the box. Multiplying the number of draws by some factor divides the SE for their average by the square root of that factor. The probability histogram for the average will follow the normal curve, even if the contents of the box do not. The histogram must be put into standard units, and the number of draws must be large.

24 Summary A confidence interval for the average can be found by going the right number of SEs either way from the average of the draws. The confidence level is read off the normal curve. This method should only be used with large samples. Once again, the formulas for simple random samples should not be applied to other kinds of samples. If a sample is not chosen by a probability method, it is called a sample of convenience. In such a case, the SE makes no sense.

