Stats 120A Review of CIs, hypothesis tests and more
Sample/Population Last time we collected height/armspan data. Is this a sample or a population?
Gallup Poll, 1/9/07 "As you may know, the Bush administration is considering a temporary but significant increase in the number of U.S. troops in Iraq to help stabilize the situation there. Would you favor or oppose this?"
Results Results based on 1004 randomly selected adults (> 18 years) interviewed Jan 5-7, % are opposed. "For results based on this sample, one can say with 95% confidence that the maximum error attributable to sampling and other random effects is ±3 percentage points. "
Pop Quiz Is the value 61% a statistic or a parameter? The margin of error is given as 3%. What does the margin of error measure? a) the variability in the sample b) the variability in the population c) the variability in repeated sampling
Sampling paradigm In the U.S., the proportion of adults who are opposed to a surge is p, (or p*100%). We take a random sample of n = The proportion of our sample ("p hat") is an estimate of the proportion in the population.
A simulation: Choose a value to serve as p (say p =.6) Our "data" consist of 1004 numbers: 0's represent those in favor, 1's are those opposed. x = 589 out of 1004 say "opposed", so p-hat = 589/1004 =.5866 mean(x) =.5866 sd(x) =.4926
xbar=.5866, s =.493
How do we know sample proportion is a good estimate of population proportion? Law of Large Numbers: sample averages (and proportions) converge on population values implying that for finite values, the sample proportion might be close if the sample size is large
Coin flips: sample proportion "settles down" to 0.5
So if we stop earlier, say n = 10 p-hat =.60
Which raises the question: If we stop early, how far away will our sample proportion be from the true value? Or, in a survey setting, if we take a finite sample of n=1004, how far off from the population proportion are we likely to be?
A simulation might help: Assume p =.60 (population proportion) Take sample of n = 1004 and find p-hat. Save this value Repeat above 3 steps times.
The R code (for the record) phat <- c() for (i in 1:10000){ x <- sample(c(0,1),1004,replace=T,prob=c(.4,.6)) temp <- sum(x)/1004 phat <- c(phat,temp)} hist(phat)
each dot represents one survey of 1004 people
10,000 sample proportions, n = 1004
Observe that... sample proportions are centered on the true population value: p =.60 variability is not great: smallest is.54, biggest is.66 distribution is bell- shaped
We've just witnessed the Central Limit Theorem If samples are independent and random and sufficiently large means (and proportions) follow a nearly Normal distribution the mean of the Normal is the mean of the population the SD of the Normal (aka the standard error) is the population SD divided by sqrt(n)
CLT applied to sample proportions phat is distributed with an approx Normal mean is p SE is sqrt(p*(1-p)/n) For our simulation, p =.60 so our p-hats will be centered on.6 with a SD of sqrt(.6*.4/1004) =
We saw Normal mean(phat) = (expected.6) sd(phat) = (expected )
In practice, we don't know p but we can get a good approximation to the standard error using sqrt(phat * (1-phat)/n) rather than sqrt(p*(1-p)/n)
So if we take a random sample of n = 1004 and we see p-hat =.61, we know that: The true value of p can't be far away. SE = sqrt(.61*.39/1004) = So 68% of the time we do this, p will be within of phat And 95% of the time it will be with 2*.0154 = 0.03
Which leads us to conclude that the true proportion of the population that opposes a surge is somewhere in the interval = 0.58 to = 0.64
Confidence intervals This is an example of a 95% confidence interval. Because 95% of all samples will produce a p-hat that is within 2 standard errors of the true value, we are 95% confident that ours is a "good" interval.
Formula A 95% CI for a proportion is estimate +/- 2 * (Standard Error) p-hat +/- 2*sqrt(phat*(1-phat)/n) /- 2*sqrt(.61*.39/1004) (.58,.64) note: our replacing phat for p in SE means we get an approximate value
What does 95% mean? If we repeat this infinitely many times: –take a sample of n = 1004 from population –calculate sample proportion –find an interval using +/- 2 * SE then 95% of these CIs will contain the truth and 5% will not. We see only one: (.58,.64). It is either good or bad, but we are confident it is good.
Where did the 95% come from? It came from the normal curve. The CLT told us that p-hat followed a (approx) normal distribution. For Normal's, 68% of probability is within 1 standard deviation of mean, 95% within 2, 99.7% within 3. A normal table gives other probabilities
phat = % 95% 99.7% 1 SE 2 SEs 3 SEs 1.6 SE 90% Change confidence level by changing the width of margin of error
The CLT applies to any linear combination of the observations assuming observations are randomly sampled, and independent it does NOT matter what the distribution of the population looks like if n is small, the distribution will be only approximately normal, and this might be a very poor approximation
the CLT does NOT apply to non-linear combinations, such as the sample median or the standard deviation non-random samples samples that are dependent
simulation _dist/index.htmlhttp://onlinestatbook.com/stat_sim/sampling _dist/index.html
Summary Confidence Level is a statement about the sampling process, not the sample Margin of error is determined to achieve the desired confidence level We can calculate the confidence level only if we know the sampling distribution: the probability distribution of the sample
Pop Quiz Is the value 61% a statistic or a parameter? The margin of error is given as 3%. What does the margin of error measure? a) the variability in the sample b) the variability in the population c) the variability in repeated sampling
Pop Quiz Is the value 61% a statistic or a parameter? The margin of error is given as 3%. What does the margin of error measure? a) the variability in the sample b) the variability in the population c) the variability in repeated sampling
For next time: In WWII, German army produced tanks with sequential serial numbers. The allies captured a few tanks, and wanted to infer the total number of tanks produced. Suppose you had captured 10 tanks. Come up with three estimators for the total number of tanks. Data: