
1 The Normal Approximation for Data

2 History The normal curve was discovered by Abraham de Moivre around 1720. Around 1870, the Belgian mathematician Adolphe Quetelet had the idea of using the curve as an ideal histogram, to which histograms for data could be compared.

3 The normal curve The equation for the normal curve: $y = \frac{100\%}{\sqrt{2\pi}} \, e^{-x^2/2}$. Here e = 2.71828… is the base of the natural logarithm. The constant π is the ratio of a circle's circumference to its diameter, π = 3.1415926…. In fact, we will find it is easy to work with the normal curve through diagrams and tables, without ever using the equation.

4 The normal curve A graph of the normal curve:

5 Features of the graph The graph is symmetric about 0: the part of the curve to the right of 0 is a mirror image of the part to the left. The total area under the curve equals 100%. (The vertical axis uses the density scale.) The curve is always above the horizontal axis. The area under the normal curve between -1 and +1 is about 68%. The area under the normal curve between -2 and +2 is about 95%. The area under the normal curve between -3 and +3 is about 99.7%. Only about 6/100,000 of the area is outside the interval from -4 to 4, so the drawn curve appears to stop at -4 and 4.
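
The areas quoted above can be checked numerically. The following is a minimal sketch, assuming the standard relation between the area under the normal curve and the error function available as `math.erf`; the helper name `area_within` is illustrative and not part of the original slides.

```python
import math

def area_within(z: float) -> float:
    """Percent of the area under the normal curve between -z and +z."""
    return 100.0 * math.erf(z / math.sqrt(2.0))

for z in (1, 2, 3):
    print(f"between -{z} and +{z}: {area_within(z):.2f}%")

# Area left outside the interval from -4 to 4, expressed per 100,000.
print(f"outside -4 to 4: {(100.0 - area_within(4)) * 1000:.1f} per 100,000")
```

The printed values should agree with the rounded figures of 68%, 95%, and 99.7% given on the slide.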

6 Convert to standard units A value is converted to standard units by seeing how many SDs it is above or below the average. Values above the average are given a plus sign; values below the average get a minus sign. The horizontal axis of the graph of the normal curve is in standard units. Many histograms for data are similar in shape to the normal curve, provided they are drawn to the same scale: making the horizontal scales match up involves standard units.

7 Examples Recall the HANES5 example from the last chapter, and look at the sample of women age 18 and over. The average height was 63.5 inches, and the SD was 3 inches. If one of these women was 69.5 inches tall, her height in standard units is found as follows: she was 6 inches taller than the average of 63.5 inches, and 6 inches is 2 SDs. So, in standard units, her height was +2.

8 Examples As in the example from the previous slide, convert 66.5 inches, 57.5 inches, 64 inches, and 63.5 inches to standard units. The average is 63.5 inches and the SD is 3 inches, so 66.5 inches is 1 SD above the average. Hence, in standard units, 66.5 inches is +1. Similarly, 57.5 inches is 6 inches below 63.5 inches. That is 2 SDs below, so it is -2 in standard units. Next, 64 inches is 0.5 inches above average. That is 0.5/3 ≈ 0.17 SDs above, so it is +0.17 (or equivalently 0.17) in standard units. Finally, 63.5 inches is the average, so it is 0 in standard units.

9 Examples How do we convert standard units back to actual quantities? Find the height which is -1.2 in standard units. First of all, -1.2 in standard units means 1.2 SDs below the average, and the average is 63.5 inches. We know that the SD is 3 inches, so 1.2 SDs equals 1.2 x 3 inches = 3.6 inches. The height is therefore 3.6 inches below the average of 63.5 inches: 63.5 inches – 3.6 inches = 59.9 inches.
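
The two conversions worked through above can be written as a pair of small functions. This is a sketch only; the function names `to_standard_units` and `from_standard_units` are assumptions made for illustration.

```python
def to_standard_units(value: float, average: float, sd: float) -> float:
    """Number of SDs that value lies above (+) or below (-) the average."""
    return (value - average) / sd

def from_standard_units(z: float, average: float, sd: float) -> float:
    """Actual quantity that corresponds to z in standard units."""
    return average + z * sd

AVERAGE, SD = 63.5, 3.0  # heights of the women in the example, in inches

for height in (69.5, 66.5, 57.5, 64.0, 63.5):
    z = to_standard_units(height, AVERAGE, SD)
    print(f"{height} inches -> {z:+.2f} in standard units")

print(f"-1.2 in standard units -> {from_standard_units(-1.2, AVERAGE, SD):.1f} inches")
```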

10 Examples In order to compare with the normal curve, we convert the histogram to standard units scale:

11 Scales match up Note how the axes change. The original horizontal axis for the histogram is in inches; after converting, the new horizontal axis is in standard units, the same as for the normal curve. The original vertical axis for the histogram is in percent per inch; after converting, the new vertical axis is in percent per standard unit, again matching the normal curve. To match up the vertical scales: a standard unit is 1 SD, so percent per standard unit = percent per SD. In our example the SD is 3 inches, so 60% per standard unit = 60% ÷ 3 inches = 20% per inch. In reverse, 10% per inch matches 10% per inch x 3 inches = 30% per standard unit. (The total area, in percent, does not change.)
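
The vertical-scale arithmetic above amounts to multiplying or dividing by the SD. A minimal sketch, with function names chosen here purely for illustration:

```python
def per_inch_to_per_standard_unit(percent_per_inch: float, sd_inches: float) -> float:
    """One standard unit spans sd_inches inches, so densities scale up by the SD."""
    return percent_per_inch * sd_inches

def per_standard_unit_to_per_inch(percent_per_su: float, sd_inches: float) -> float:
    """Inverse conversion: divide the density by the SD."""
    return percent_per_su / sd_inches

SD = 3.0  # inches, as in the height example

print(per_standard_unit_to_per_inch(60.0, SD))  # percent per inch
print(per_inch_to_per_standard_unit(10.0, SD))  # percent per standard unit
```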

12 The approximation In the last chapter we said that roughly 68% of the entries are within 1 SD of average, i.e. in the range: ave – SD to ave + SD. To see where the 68% comes from: the histogram that is converted to standard units follows the normal curve fairly well. The shaded area under the histogram is about the same as the area under the curve. Note that the area under the normal curve between -1 and +1 is 68%. This is where the 68% comes from. The same argument shows that roughly 95% of the entries are within 2 SDs of average, i.e. in the range: ave – 2 SDs to ave + 2 SDs. This is one of the applications of the normal approximation: replace the original histogram by the normal curve before finding the area. We will see later how to use the normal approximation to estimate the percentage of entries in an interval.

13 Finding areas under the normal curve Before studying how to estimate the percentage of entries in an interval by using the normal curve, we need to learn how to figure out the areas under the normal curve.

14 The normal table The column marked z gives the number on the horizontal axis. The column marked Height gives the height of the normal curve at that number. The column marked Area gives the area under the normal curve between –z and z. For example, at z = 1.20 the height of the curve is 19.42%, and the area under the normal curve between -1.20 and 1.20 is 76.99%.
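
A row of the table can be reproduced in a few lines. This sketch assumes the usual equation of the normal curve on the density scale, y = 100% / √(2π) · e^(−z²/2), and the standard link between the area and `math.erf`; the function names are hypothetical.

```python
import math

def table_height(z: float) -> float:
    """Height of the normal curve at z, in percent per standard unit."""
    return 100.0 / math.sqrt(2.0 * math.pi) * math.exp(-z * z / 2.0)

def table_area(z: float) -> float:
    """Area under the normal curve between -z and +z, in percent."""
    return 100.0 * math.erf(z / math.sqrt(2.0))

z = 1.20
print(f"z = {z:.2f}  Height = {table_height(z):.2f}  Area = {table_area(z):.2f}")
```

For z = 1.20 this should match the Height and Area entries quoted above.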

15 Sometimes we need to find the following areas:

16 Examples Find the area between 0 and 1 under the normal curve. This is the first type of area. Solution. We first sketch the normal curve, and then shade in the area to be found. The normal table gives us the area between -1 and 1, which is about 68%. By symmetry, the area between 0 and 1 is half the area between -1 and 1. So we have: ½ x 68% = 34%.

17 Examples A similar example is the following: Find the area between 0 and 2 under the normal curve. Note: This is not double the area between 0 and 1 because the normal curve is not a rectangle. Solution. The procedure is the same as in the previous example. From the normal table, the area between -2 and 2 is about 95%. The area between 0 and 2 is half of that by symmetry. So the area is: ½ x 95% ≈ 48%.

18 Examples Find the area between -2 and 1 under the normal curve. This is the second type of area. Solution. The area between -2 and 1 can be broken down into 2 other areas: the area between -2 and 0, and the area between 0 and 1. The area between -2 and 0 is the same as the area between 0 and 2 by symmetry, so from the previous example this part of the area is about 48%. From the first example, the area between 0 and 1 is about 34%. So the total area between -2 and 1 is about: 48% + 34% = 82%.

19 Examples Find the area to the right of 1 under the normal curve. This is the third type of area. Solution. The normal table gives the area between -1 and 1, which is 68%. So the area outside this interval is: 100% - 68% = 32%. By symmetry, the area to the right of 1 is half this, which is: ½ x 32% = 16%.

20 More examples Find the area to the left of 2 under the normal curve. Solution. Break the area down into 2 areas, the area to the left of 0 and the area between 0 and 2, and compute them separately. By symmetry, the area to the left of 0 is half the total area: ½ x 100% = 50%. By previous examples, the area between 0 and 2 is about 48%. So the sum is: 50% + 48% = 98%.

21 More examples Find the area between 1 and 2 under the normal curve. Solution. In this example, we take one region away from another: the area we want is half the difference between 2 other areas. From the normal table, we know that the area between -2 and 2 is about 95%, and the area between -1 and 1 is about 68%. So the difference is about: 95% - 68% = 27%. Half of the difference is about: ½ x 27% ≈ 14%.

22 Remark There is no single correct method for these problems. Use whichever method is most convenient for calculating the area. There is no set procedure for solving this sort of problem. It is a matter of drawing pictures which relate the area you want to areas that can be read from the table.
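
The worked examples above all reduce to combinations of one table lookup. The sketch below mirrors that reasoning, assuming a helper `table_area` (a name introduced here) that stands in for the Area column of the normal table. It uses unrounded table values, so the results differ slightly from the slides before rounding.

```python
import math

def table_area(z: float) -> float:
    """Area (in percent) between -z and +z, as listed in a normal table."""
    return 100.0 * math.erf(z / math.sqrt(2.0))

# Each line follows the picture drawn in the corresponding example.
between_0_and_1 = table_area(1) / 2                        # half, by symmetry
between_0_and_2 = table_area(2) / 2
between_minus2_and_1 = between_0_and_2 + between_0_and_1   # two pieces added
right_of_1 = (100.0 - table_area(1)) / 2                   # half of what is outside
left_of_2 = 50.0 + between_0_and_2                         # left half plus a piece
between_1_and_2 = (table_area(2) - table_area(1)) / 2      # half the difference

results = {
    "between 0 and 1": between_0_and_1,
    "between 0 and 2": between_0_and_2,
    "between -2 and 1": between_minus2_and_1,
    "right of 1": right_of_1,
    "left of 2": left_of_2,
    "between 1 and 2": between_1_and_2,
}
for name, area in results.items():
    print(f"{name}: about {area:.0f}%")
```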

23 The normal approximation

24 Example The heights of the men age 18 and over in HANES5 averaged 69 inches; the SD was 3 inches. Use the normal curve to estimate the percentage of these men with heights between 63 inches and 72 inches. Solution. The percentage is given by the area under the height histogram, between 63 inches and 72 inches.

25 Solution Step 1. Draw a number line and shade the interval. Step 2. Mark the average on the line and convert to standard units.

26 Solution Step 3. Sketch in the normal curve, and find the area above the shaded standard units interval obtained in step 2. The percentage is approximately equal to the shaded area, which is almost 82%. (From the previous examples.)

27 Remark Using the normal curve, we obtained an approximation that about 82% of the heights were between 63 inches and 72 inches. This is pretty good: in fact, 81% of the men were in that range. Comparison of the histogram and the graph:

28 Another example The heights of the women age 18 and over in HANES5 averaged 63.5 inches; the SD was 3 inches. Use the normal curve to estimate the percentage with heights above 59 inches. Solution. A height of 59 inches is: (59 – 63.5) / 3 = -1.5 SDs. So it is 1.5 SDs below average.

29 Solution If we draw the line, shade the interval, mark the average and convert to standard units, then we obtain the following graph: Sketch in the normal curve, and find the area:

30 Solution As in the previous examples, we break the area down into 2 parts: half the total area, and half the area from -1.5 to 1.5. The approximation is: 50% + ½ x 86.64% ≈ 93%. Remark: The approximation is about right: 96% of the women were taller than 59 inches.
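
Both height examples above can be reproduced without a printed table. This is a sketch under the assumption that Python's standard-library `statistics.NormalDist` is an acceptable stand-in for the normal table; the helper names `percent_between` and `percent_above` are illustrative.

```python
from statistics import NormalDist

def percent_between(low: float, high: float, average: float, sd: float) -> float:
    """Normal approximation to the percentage of entries between low and high."""
    curve = NormalDist(mu=average, sigma=sd)
    return 100.0 * (curve.cdf(high) - curve.cdf(low))

def percent_above(value: float, average: float, sd: float) -> float:
    """Normal approximation to the percentage of entries above value."""
    curve = NormalDist(mu=average, sigma=sd)
    return 100.0 * (1.0 - curve.cdf(value))

# Men: average 69 inches, SD 3 inches, heights between 63 and 72 inches.
men = percent_between(63, 72, average=69, sd=3)
# Women: average 63.5 inches, SD 3 inches, heights above 59 inches.
women = percent_above(59, average=63.5, sd=3)

print(f"men between 63 and 72 inches: about {men:.0f}%")
print(f"women above 59 inches: about {women:.0f}%")
```

These are estimates from the curve only; the actual percentages in the data (81% and 96%) come from the histograms, not from this calculation.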

31 Comments It is a remarkable fact that many histograms follow the normal curve. For such histograms, the average and SD are good summary statistics: the average pins down the center and the SD gives the spread, and that is nearly all there is to say. Other histograms do not follow the normal curve. In such cases, the average and SD are poor summary statistics. (We will discuss this next.)

32 Percentiles

33 Example Let us look at the distribution of family income in the U.S. in 2004:

34 Example The average income for the families was about $60,000; the SD was about $40,000. So, if we use the normal approximation, it suggests that about 7% of these families had negative incomes:

35 Example The reason is that the histogram does not follow the normal curve at all well. It has a long right-hand tail. Recall: when a histogram has a long right-hand tail, the average is bigger than the median, and we prefer to use the median as the center of the histogram.
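
The 7% figure quoted above follows directly from the average and SD. A minimal sketch, again assuming `statistics.NormalDist` in place of the normal table:

```python
from statistics import NormalDist

AVERAGE_INCOME = 60_000  # dollars, from the example
SD_INCOME = 40_000       # dollars, from the example

# An income of $0 in standard units.
z = (0 - AVERAGE_INCOME) / SD_INCOME

# Area under the normal curve to the left of z, in percent.
percent_negative = 100.0 * NormalDist().cdf(z)

print(f"$0 is {z:+.1f} in standard units")
print(f"normal approximation: about {percent_negative:.0f}% below $0")
```

The calculation itself is fine; the answer is absurd because the income histogram does not follow the normal curve.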

36 Example To summarize such histograms, we often use percentiles:

37 Example Let us see how to read the percentiles table: The 1st percentile of the income distribution was $0, meaning that about 1% of the families had income of $0 or less. About 99% had incomes above that level. The 10th percentile was $15,000. This means about 10% of the families had incomes below that level, and 90% were above. The 50th percentile is just the median.

38 Interquartile range When the distribution has a long tail, we use the median as the center of the histogram, and we use the interquartile range as a measure of spread. The interquartile range = 75th percentile – 25th percentile. From our previous example, the interquartile range is: $90,000 - $29,000 = $61,000.

39 Percentiles for normal curve When a histogram follows the normal curve, the normal table can be used to estimate its percentiles.

40 Example Among all applicants to a certain university one year, the Math SAT scores averaged 535, the SD was 100, and the scores followed the normal curve. Estimate the 95th percentile of the score distribution. Solution. Since 95% is greater than 50%, and since the median is the same as the average on the normal curve, this score is above average by some number of SDs. We need to find that number; call it z.

41 Solution The graph for the location of z should be: This is a graph equation for z, and we need to find what z is.

42 Solution The normal table cannot be used directly, because it gives the area between –z and z, not the area to the left of z: We need to derive a graph equation on the left from the graph equation on the right.

43 Solution Note that the area to the right of z is 100% - 95% = 5%. So by symmetry, the area to the left of –z is 5% too. Then the area between –z and z must be 95% - 5% = 90%. Now we can use the normal table to find that z ≈ 1.65.

44 Solution Now we need to convert the standard units back to actual scores. From the previous slide, we know the 95th percentile is about 1.65 SDs above average. So it is 1.65 x 100 = 165 points above average. Recall that the average is 535, so the 95th percentile is about 535 + 165 = 700.
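
The percentile calculation above can be sketched in code. The assumption here is that `statistics.NormalDist.inv_cdf` plays the role of reading the normal table backwards; the variable names are illustrative. Because the function is not limited to the table's resolution, it returns z ≈ 1.645 rather than the table value 1.65, so the score comes out slightly under 700.

```python
from statistics import NormalDist

AVERAGE, SD = 535, 100   # Math SAT scores in the example
percentile = 95          # percent of the area to the left of z

# Same reasoning as above: the area between -z and z is 95% - 5% = 90%.
area_between = 100 - 2 * (100 - percentile)

# Area to the left of z is 50% plus half the area between -z and z.
z = NormalDist().inv_cdf(0.5 + area_between / 200)

# Convert from standard units back to a score.
score = AVERAGE + z * SD

print(f"area between -z and z: {area_between}%")
print(f"z is about {z:.3f}, so the 95th percentile is about {score:.0f}")
```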

45 Terminology A percentile is a value of the quantitative variable, named by the percentage of entries below it. In our previous example, a percentile is a score: the 95th percentile is a score of about 700. A percentile rank is the percentage that goes with a value: if you score 700, your percentile rank is 95%. A third way to say the same thing: a score of 700 puts you at the 95th percentile of the score distribution.

46 Change of scale Adding the same number to every entry on a list adds that constant to the average; the SD does not change. Multiplying every entry on a list by the same positive number multiplies the average and the SD by that constant. These changes of scale do not change the standard units.

47 Example Find the average and SD of the list 1, 3, 4, 5, 7. Solution. The average is (1 + 3 + 4 + 5 + 7) / 5 = 4. The list of deviations is -3, -1, 0, 1, 3. The SD is √(((-3)²+(-1)²+0²+1²+3²)/5) = √(20/5) = √4 = 2.

48 Example From the previous list, multiply each entry by 3 and then add 7, to get the new list 10, 16, 19, 22, 28. Find the average and SD. Solution. The previous list has average 4, so the average of the new list is 3 x 4 + 7 = 19. Similarly, the SD of the previous list is 2, so the SD of the new list is 3 x 2 = 6. If we work these numbers out directly, we obtain the same result.

49 Example Convert the previous two lists to standard units: (a) 1, 3, 4, 5, 7; and (b) 10, 16, 19, 22, 28. Solution. (a) The average is 4, so the list of deviations is -3, -1, 0, 1, 3. Dividing the entries by the SD, 2, gives the standard units: -1.5, -0.5, 0, 0.5, 1.5. (b) Similarly, the average is 19, so the list of deviations is -9, -3, 0, 3, 9. Dividing by the SD, 6, gives the standard units: -1.5, -0.5, 0, 0.5, 1.5.

50 Remark List (b) comes from list (a) by changing the scale: multiply by 3 and add 7. The 7 cancels when computing the list of deviations. The 3 washes out when dividing by the SD, because the SD also got multiplied by 3. This is why the lists are the same in standard units. A practical example of a change of scale is converting temperature from Fahrenheit to Celsius: C = 5/9 x (F - 32).
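
The change-of-scale examples above can be checked directly. This is a minimal sketch using the standard-library functions `statistics.fmean` and `statistics.pstdev`; `pstdev` divides by the number of entries, which matches the SD formula used in these slides. The helper name `standard_units` is an assumption made for illustration.

```python
from statistics import fmean, pstdev

def standard_units(values):
    """Convert every entry of a list to standard units."""
    average = fmean(values)
    sd = pstdev(values)  # SD computed by dividing by n, as in the slides
    return [(v - average) / sd for v in values]

list_a = [1, 3, 4, 5, 7]
list_b = [3 * v + 7 for v in list_a]  # change of scale: multiply by 3, add 7

for name, values in (("a", list_a), ("b", list_b)):
    print(f"list ({name}): {values}")
    print(f"  average = {fmean(values)}, SD = {pstdev(values)}")
    print(f"  standard units = {standard_units(values)}")
```

Both lists should print the same standard units, even though their averages and SDs differ.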

51 Summary The normal curve is a bell-shaped curve symmetric about 0, and the total area under it is 100%. Standard units tell how many SDs a value is above (+) or below (-) the average. Many histograms have roughly the same shape as the normal curve. The percentage over a given interval can be estimated by the normal approximation, provided the scale is in standard units. All histograms, whether or not they follow the normal curve, can be summarized using percentiles. A change of scale changes the average and SD, but does not change the standard units.

