Inference with Proportions I

Inference with Proportions I
Sampling Distributions and Confidence Intervals

Parameter A number that describes the population
Symbols we will use for parameters include m - mean s – standard deviation p – proportion

This variability is called sampling variability
Statistic A number that that can be computed from sample data Some statistics we will use include x – sample mean s – standard deviation p – sample proportion The observed value of the statistic depends on the particular sample selected from the population and it will vary from sample to sample. This variability is called sampling variability

Let’s explore what happens with in distributions of sample proportions (p). Have students perform the following experiment. Toss a penny 20 times and record the number of heads. Calculate the proportion of heads and mark it on the dot plot on the board. What shape do you think the dot plot will have? This is a statistic! In this case, we will use The dotplot is a partial graph of the sampling distribution of all sample proportions of sample size 20. What would happen to the dotplot if we flipped the penny 50 times and recorded the proportion of heads?

Sampling Distribution of p
The distribution that would be formed by considering the value of a sample statistic for every possible different sample of a given size from a population.

We are interested in the proportion of females. This is called
Suppose we have a population of six students: Alice, Ben, Charles, Denise, Edward, & Frank We are interested in the proportion of females. This is called What is the proportion of females? Let’s select samples of two from this population. How many different samples are possible? We will keep the population small so that we can find ALL the possible samples of a given size. the parameter of interest p = 1/3 6C2 =15

Find the mean and standard deviation of these sample proportions.
Find the 15 different samples that are possible and find the sample proportion of the number of females in each sample. Ben & Frank 0 Charles & Denise .5 Charles & Edward 0 Charles & Frank 0 Denise & Edward .5 Denise & Frank .5 Edward & Frank 0 Alice & Ben .5 Alice & Charles .5 Alice & Denise 1 Alice & Edward .5 Alice & Frank .5 Ben & Charles 0 Ben & Denise .5 Ben & Edward 0 How does the mean of the sampling distribution compare to the population parameter (p)? Find the mean and standard deviation of these sample proportions.

Six Students Continued . . .
Let’s select samples of three from this population. How many different samples are possible? 6C3 = 20 Find the mean and standard deviation of these sample proportions.

General Properties for Sampling Distributions of p
Rule 1: Rule 2:

Correction factor – multiply by
Let’s verify Rule 2. Does the formula equal the standard deviation for samples of size 2 (s = )? NO - So – in order to use this formula to calculate the standard deviation of the sampling distribution, we MUST be sure that our sample size is less than 10% of the population! WHY? Correction factor – multiply by We are sampling more than 10% of our population! If we use the correction factor, we will see that we are correct.

General Properties for Sampling Distributions of p
Rule 1: Rule 2: This rule is exact if the population is infinite, and is approximately correct if the population is finite and no more than 10% of the population is included in the sample

Chip Activity: Select three samples of size 5, 10, and 20 and record the number of blue chips. Place your proportions on the appropriate dotplots. What do you notice about these distributions?

In the fall of 2008, there were 18,516 students enrolled at California Polytechnic State University, San Luis Obispo. Of these students, 8091 (43.7%) were female. We will use a statistical software package to simulate sampling from this Cal Poly population. We will generate 500 samples of each of the following sample sizes: n = 10, n = 25, n = 50, n = 100 and compute the proportion of females for each sample. The following histograms display the distributions of the sample proportions for the 500 samples of each sample size.

What do you notice about the shape of these distributions?
What do you notice about the standard deviation of these distributions? What do you notice about the shape of these distributions? Are these histograms centered around the true proportion p = .437?

The development of viral hepatitis after a. blood
The development of viral hepatitis after a blood transfusion can cause serious complications for a patient. The article “Lack of Awareness Results in Poor Autologous Blood Transfusions” (Health Care Management, May 15, 2003) reported that hepatitis occurs in 7% of patients who receive blood transfusions during heart surgery. We will simulate sampling from a population of blood recipients. We will generate 500 samples of each of the following sample sizes: n = 10, n = 25, n = 50, n = 100 and compute the proportion of people who contract hepatitis for each sample. The following histograms display the distributions of the sample proportions for the 500 samples of each sample size.

Are these histograms centered around the true proportion p = .07?
What happens to the shape of these histograms as the sample size increases?

General Properties Continued . . .
Rule 3: When n is large and p is not too near 0 or 1, the sampling distribution of p is approximately normal. The farther the value of p is from 0.5, the larger n must be for the sampling distribution of p to be approximately normal. A conservative rule of thumb: If np > 10 and n (1 – p) > 10, then a normal distribution provides a reasonable approximation to the sampling distribution of p.

Let’s draw a normal curve over the histogram.
Why does np > 10 ensure an approximate normal distribution? In a binomial distribution, we will investigate what happens to the probability histogram as the sample size increases. Suppose n = 80 and p = 0.1 Suppose n = 90 and p = 0.1 Suppose n = 100 and p = 0.1 Suppose n = 10 and p = 0.1 Suppose n = 70 and p = 0.1 Suppose n = 20 and p = 0.1 Suppose n = 60 and p = 0.1 Suppose n = 30 and p = 0.1 Suppose n = 40 and p = 0.1 Suppose n = 50 and p = 0.1 Let’s draw a normal curve over the histogram. What does np equal?

Why do we need to also check n(1 – p)?
Consider what the histogram looks like when n = 10 and p = .9. We must also check that the upper tail will spread out into an approximate normal curve.

Here’s an algebraic proof . . .
If a binomial distribution can be approximated by a normal curve, then the minimum and maximum values of 0 and n MUST lie within 3 standard deviations of the mean. Recall: Therefore: Simplifying this inequality gives us the following: Let’s simplify this inequality. Since 0 < p < 1, we can substitute the values 0 and 1 into these inequalities to find the largest value needed to be within 3 standard deviations of the mean. The conservative approach uses 10 instead of 9. Square both sides. AND Divide both sides by np.

Blood Transfusions Revisited . . .
Let p = proportion of patients who contract hepatitis after a blood transfusion p = .07 Suppose a new blood screening procedure is believed to reduce the incident rate of hepatitis. Blood screened using this procedure is given to n = 200 blood recipients. Only 6 of the 200 patients contract hepatitis. Does this result indicate that the true proportion of patients who contract hepatitis when the new screening is used is less than 7%? To answer this question, we must consider the sampling distribution of p. p = 6/200 = .03

Is the sampling distribution approximately normal?
Blood Transfusions Revisited . . . Let p = .07 p = 6/200 = .03 Is the sampling distribution approximately normal? np = 200(.07) = 14 > 10 n(1-p) = 200(.93) = 186 > 10 What is the mean and standard deviation of the sampling distribution? Yes, we can use a normal approximation.

Blood Transfusions Revisited . . .
Let p = .07 p = 6/200 = .03 Does this result indicate that the true proportion of patients who contract hepatitis when the new screening is used is less than 7%? This small probability tells us that it is unlikely that a sample proportion of .03 or smaller would be observed. This new screening procedure appears to yield a smaller incidence rate for hepatitis. P(p < .03) = Normalcdf(-1099,.03,.07,.018) = .0132

Confidence Intervals

How might we go about estimating this proportion?
Suppose we wanted to estimate the proportion of blue candies in a VERY large bowl. How might we go about estimating this proportion? We could take a sample of candies and compute the proportion of blue candies in our sample. We would have a sample proportion or a statistic – a single value for the estimate. Create a jar with different types of coins . . .

Point Estimate A single number (a statistic) based on sample data that is used to estimate a population characteristic But not always close to the population characteristic due to sampling variation Different samples may produce different statistics. “point” refers to the single value on a number line. Population characteristic

Suppose we wanted to estimate the proportion of blue candies in a VERY large bowl.
We could take a sample of candies and compute the proportion of blue candies in our sample. How much confidence do you have in the point estimate? Would you have more confidence if your answer were an interval? Create a jar with different types of coins . . .

Confidence intervals A confidence interval (CI) for a population characteristic is an interval of plausible values for the characteristic. It is constructed so that, with a chosen degree of confidence, the actual value of the characteristic will be between the lower and upper endpoints of the interval. The primary goal of a confidence interval is to estimate an unknown population characteristic.

Rate your confidence 0 – 100%
How confident (%) are you that you can ... Guess my age within 10 years? . . . within 5 years? . . . within 1 year? What does it mean to be within 10 years? What happened to your level of confidence as the interval became smaller? Adapted from an activity from Michael Legacy.

Perform CI Activity . . . Question for after the activity:
What proportion of all possible CI’s contain the true proportion p? This is called the confidence level.

Let’s develop the equation for the large-sample confidence interval.
For large random samples, the sampling distribution of p is approximately normal. So about 95% of the possible p will fall within We can generalize this to normal distributions other than the standard normal distribution – About 95% of the values are within 1.96 standard deviations of the mean To begin, we will use a 95% confidence level. Use the table of standard normal curve areas to determine the value of z* such that a central area of .95 falls between –z* and z*. 95% of these values are within 1.96 of the mean. Central Area = .95 Lower tail area = .025 Upper tail area = .025 -1.96 1.96

Developing a Confidence Interval Continued . . .
Approximate sampling distribution of p Suppose we get this p Suppose we get this p and create an interval p Create an interval around p Suppose we get this p and create an interval Using this method of calculation, the confidence interval will not capture p 5% of the time. p p This line represents 1.96 standard deviations below the mean. This line represents 1.96 standard deviations above the mean. When n is large, a 95% confidence interval for p is Notice that the length of each half of the interval equals Here is the mean of the sampling distribution p This p doesn’t fall within 1.96 standard deviations of the mean AND its confidence interval does NOT “capture” p. This p fell within 1.96 standard deviations of the mean AND its confidence interval “captures” p. This p fell within 1.96 standard deviations of the mean AND its confidence interval “captures” p.

Developing a Confidence Interval Continued . . .
If p is within of p, this means the interval will capture p. And this will happen for 95% of all possible samples!

Confidence level The confidence level associated with a confidence interval estimate is the success rate of the method used to construct the interval. If this method was used to generate an interval estimate over and over again from different samples, in the long run 95% (or whatever confidence level we use) of the resulting intervals would include the actual value of the characteristic being estimated. Our confidence is in the method – NOT in any ONE particular interval! The most common confidence levels are 90%, 95%, and 99% confidence.

The diagram to the right is 100 confidence intervals for p computed from 100 different random samples. Note that the ones with asterisks do not capture p. If we were to compute 100 more confidence intervals for p from 100 different random samples, would we get the same results?

Recall the General Properties for Sampling Distributions of p
These are the conditions that must be true in order to calculate a large-sample confidence interval for p 1. 2. As long as the sample size is less than 10% of the population 3. As long as n is large (np > 10 and n (1-p) > 10) the sampling distribution of p is approximately normal.

The Large-Sample Confidence Interval for p
This is an estimate of the standard deviation of p or the standard error The general formula for a confidence interval for a population proportion p is In real life, we often do not know the population proportion? What value can we use to estimate it? The standard error of a statistic is the estimated standard deviation of the statistic. point estimate

The 95% confidence interval is based on the fact that, for approximately 95% of all random samples, p is within the margin of error estimation of p. The Large-Sample Confidence Interval for p The general formula for a confidence interval for a population proportion p is This is called the margin of error.

Critical value (z*) z*=1.645 z*=1.96 z*=2.576 .05 .025 .005
Found from the confidence level The upper z-score with probability p lying to its right under the standard normal curve Confidence level tail area z* .05 z*=1.645 .025 .005 z*=1.96 z*=2.576 90% 95% 99%

The general formula for a confidence interval for a population proportion p when (assumptions) (STEP 1) if the sample is selected without replacement, the sample size is small relative to the population size (at most 10% of the population) p is the sample proportion from a random sample the sample size n is large (np > 10 and n(1-p) > 10), and

What are the steps for performing a confidence interval?
Assumptions Data from a random sample Sample size is large enough Sample size is small relative to population size Calculations Conclusion

Conclusion: (memorize!!)
We are ________% confident that the true proportion context is between ______ and ______.

The article “How Well Are U. S. Colleges Run
The article “How Well Are U.S Colleges Run?” (USA Today, February 17, ) describes a survey of 1031 adult Americans. The survey was carried out by the National Center for Public Policy and the sample was selected in a way that makes it reasonable to regard the sample as representative of adult Americans. Of those surveyed, 567 indicated that they believe a college education is essential for success. What is a 95% confidence interval for the population proportion of adult Americans who believe that a college education is essential for success? The point estimate is Before computing the confidence interval, we need to verify the conditions.

College Education Continued . . .
What is a 95% confidence interval for the population proportion of adult Americans who believe that a college education is essential for success? Conditions: 2) The sample size of n = 1031 is much smaller than 10% of the population size (adult Americans). 3) The sample was selected in a way designed to produce a representative sample. So we can regard the sample as a random sample from the population. 1) np = 1031(.55) = 567 and n(1-p) = 1031(.45) = 364, since both of these are greater than 10, the sample size is large enough to proceed. All our conditions are verified so it is safe to proceed with the calculation of the confidence interval.

What does this interval mean in the context of this problem?
College Education Continued . . . What is a 95% confidence interval for the population proportion of adult Americans who believe that a college education is essential for success? Calculation: Conclusion: We are 95% confident that the population proportion of adult Americans who believe that a college education is essential for success is between 52% and 58%. What does this interval mean in the context of this problem?

Recall the “Rate your Confidence” Activity
College Education Revisited . . . A 95% confidence interval for the population proportion of adult Americans who believe that a college education is essential for success is: Compute a 90% confidence interval for this proportion. Compute a 99% confidence interval for this proportion. Recall the “Rate your Confidence” Activity What do you notice about the relationship between the confidence level of an interval and the width of the interval?

A May 2000 Gallup Poll found that 38% of a random sample of 1012 adults said that they believe in ghosts. Find a 95% confidence interval for the true proportion of adults who believe in ghost.

Step 1: check assumptions!
Have an SRS of adults np =1012(.38) = & n(1-p) = 1012(.62) = Since both are greater than 10, the distribution can be approximated by a normal curve Population of adults is at least 10,120. Step 1: check assumptions! Step 2: make calculations Step 3: conclusion in context We are 95% confident that the true proportion of adults who believe in ghosts is between 35% and 41%.

What value should be used for the unknown value p?
Choosing a Sample Size The margin of error estimation for a confidence interval is Before collecting any data, an investigator may wish to determine a sample size needed to achieve a certain margin of error estimation. In other cases, prior knowledge may suggest a reasonable estimate for p. What value should be used for the unknown value p? If there is no prior knowledge and a preliminary study is not feasible, then the conservative estimate for p is 0.5. Sometimes, it is feasible to perform a preliminary study to estimate the value for p.

Why is the conservative estimate for p = 0.5?
.1(.9) = .09 .2(.8) = .16 .3(.7) = .21 .4(.6) = .24 .5(.5) = .25 By using .5 for p, we are using the largest value for p(1 – p) in our calculations. Recall the activity where we graphed the histograms for binomials with different probabilities of success – which had the largest standard deviation?

In spite of the potential safety hazards, some people would like to have an internet connection in their car. Determine the sample size required to estimate the proportion of adult Americans who would like an internet connection in their car to within 0.03 with 95% confidence. What value should be used for p? This is the value for the margin of error estimate m. Always round the sample size up to the next whole number.

Inference with Proportions I

Similar presentations

Presentation on theme: "Inference with Proportions I"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Inference with Proportions I

Similar presentations

Presentation on theme: "Inference with Proportions I"— Presentation transcript:

Similar presentations

About project

Feedback