Lecture 3 Preview: Interval Estimates and the Central Limit Theorem

Review
  Populations, Samples, Estimation Procedures, and the Estimate’s Probability Distribution
  Random Variables
  Relative Frequency Interpretation of Probability
  Clint’s Dilemma and His Opinion Poll
  Mean and Variance of the Estimate’s Probability Distribution for a Sample Size of T
Why Is the Mean of the Estimate’s Probability Distribution Important?
Why Is the Variance of the Estimate’s Probability Distribution Important?
Interval Estimates
Central Limit Theorem
Normal Distribution: A Way to Estimate Probabilities
  Properties of the Normal Distribution
  Using the Normal Distribution Table: An Example
  Justifying the Use of the Normal Distribution
  Normal Distribution’s Rules of Thumb
Review: Populations, Samples, and Estimation Procedures

Question: How can we use sample information to draw inferences about a population?

Random Variables: Before the experiment is conducted:
  Bad news. What we do not know: We cannot determine the numerical value of the random variable with certainty before the experiment is conducted.
  Good news. What we do know: On the other hand, we can often calculate the random variable’s probability distribution, which tells us how likely it is for the random variable to equal each of its possible numerical values.

Relative Frequency Interpretation of Probability: After many, many repetitions of the experiment, the distribution of the numerical values from the experiment mirrors the random variable’s probability distribution.
  Distribution of the Numerical Values  --(after many, many repetitions)-->  Probability Distribution

Question: How do we describe a distribution?
Answer: Center (Mean) and Spread (Variance)
  The mean reflects the center of the distribution.
  The variance reflects the spread of the distribution.

An example, Clint’s poll: 12 of the 16 individuals polled support Clint. EstFrac = 12/16 = .75
Question: Does this poll definitely prove that Clint is ahead?
Answer: No. It is possible for 12 (or more) individuals to support Clint in one poll even when the election is a toss-up.
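How likely is it for 12 or more of 16 individuals to support Clint when the election is a toss-up? The following is a minimal sketch, assuming an exact toss-up (p = .5, chosen for illustration) and independent responses. The function name is hypothetical, and the calculation uses the binomial distribution, which the slide itself does not name.

```python
from math import comb

def prob_at_least(k_min, n, p):
    """Probability of k_min or more successes in n independent trials."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

# Assumed toss-up (p = .5): 16 individuals polled, 12 or more support Clint
toss_up_prob = prob_at_least(12, 16, 0.5)
print(f"{toss_up_prob:.4f}")
```

Under these assumptions the probability is small but not zero, which is why a single poll cannot definitely prove that Clint is ahead.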
Opinion Poll: Sample Size Equals T

Write the name of every individual in the population on a card.
Perform the following procedure T times, where T = Sample Size:
  Thoroughly shuffle the cards.
  Randomly draw one card.
  Ask that individual if he/she supports Clint; the answer determines the numerical value of v_t:
    v_t equals 1 if the t-th individual polled supports Clint; 0 otherwise.
  Replace the card.
Calculate the fraction of those polled supporting Clint.
The estimated fraction, EstFrac, is a random variable.

Question: What do we know about the v_t’s?

  From our last class (Sample Size of 2): Mean[v_1] = Mean[v_2] = p
  Mean[v_t] = p for each t; that is, Mean[v_1] = Mean[v_2] = … = Mean[v_T] = p

  From our last class (Sample Size of 2): Var[v_1] = Var[v_2] = p(1 - p)
  Var[v_t] = p(1 - p) for each t; that is, Var[v_1] = Var[v_2] = … = Var[v_T] = p(1 - p)

  From our last class (Sample Size of 2): v_1 and v_2 are independent; their covariance equals 0.
  The v_t’s are independent; hence, their covariances equal 0.

where p = ActFrac = Actual fraction of the population supporting Clint
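The card-drawing procedure can be sketched in a few lines. This is an illustration only: the population of 1,000 individuals, the even split of support, and the function name are all assumptions.

```python
import random

def conduct_poll(population, sample_size, rng):
    """Draw sample_size cards with replacement and return the estimated fraction.

    population is a list of 1s (supports Clint) and 0s (does not).
    """
    total = 0
    for _ in range(sample_size):
        # Draw one card at random; the list is left unchanged, so the card
        # is effectively replaced before the next draw.
        v_t = rng.choice(population)
        total += v_t
    return total / sample_size

# Hypothetical population: 1,000 individuals, half of whom support Clint
population = [1] * 500 + [0] * 500
rng = random.Random(0)
est_frac = conduct_poll(population, 16, rng)
print(est_frac)
```

Running the function again with a different seed generally gives a different value, which is the sense in which EstFrac is a random variable.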
Distribution Center: Mean of the Estimate’s Probability Distribution

What we know about the v_t’s:
  Mean[v_t] = p for each t; that is, Mean[v_1] = Mean[v_2] = … = Mean[v_T] = p
  Var[v_t] = p(1 - p) for each t; that is, Var[v_1] = Var[v_2] = … = Var[v_T] = p(1 - p)
  The v_t’s are independent; that is, all their covariances equal 0.
  where p = ActFrac = Actual fraction of the population supporting Clint

Arithmetic of means used in the derivation:
  Mean[cx] = cMean[x]
  Mean[x + y] = Mean[x] + Mean[y]
  Mean[v_1] = Mean[v_2] = … = Mean[v_T] = p

Applying these rules to EstFrac, the fraction of the T individuals polled who support Clint, leaves a sum of p terms divided by T.
Question: How many p terms are there?
Answer: T
Hence Mean[EstFrac] = p.
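The derivation these rules point to can be written out step by step. It is a reconstruction under the assumption that EstFrac is the sample average of the v_t’s, that is, the fraction of those polled who support Clint.

```latex
\begin{align*}
\operatorname{Mean}[\mathit{EstFrac}]
  &= \operatorname{Mean}\!\left[\tfrac{1}{T}\,(v_1 + v_2 + \dots + v_T)\right] \\
  &= \tfrac{1}{T}\,\operatorname{Mean}[v_1 + v_2 + \dots + v_T]
     && \operatorname{Mean}[cx] = c\operatorname{Mean}[x] \\
  &= \tfrac{1}{T}\,\bigl(\operatorname{Mean}[v_1] + \operatorname{Mean}[v_2] + \dots + \operatorname{Mean}[v_T]\bigr)
     && \operatorname{Mean}[x + y] = \operatorname{Mean}[x] + \operatorname{Mean}[y] \\
  &= \tfrac{1}{T}\,(p + p + \dots + p)
     && \operatorname{Mean}[v_t] = p \text{ for each } t \\
  &= \tfrac{1}{T}\,(T p) = p
     && \text{there are } T \text{ terms}
\end{align*}
```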
Distribution Spread: Variance of the Estimate’s Probability Distribution

What we know about the v_t’s:
  Mean[v_t] = p for each t; that is, Mean[v_1] = Mean[v_2] = … = Mean[v_T] = p
  Var[v_t] = p(1 - p) for each t; that is, Var[v_1] = Var[v_2] = … = Var[v_T] = p(1 - p)
  The v_t’s are independent; hence, all their covariances equal 0.
  where p = ActFrac = Actual fraction of the population supporting Clint

Arithmetic of variances used in the derivation:
  Var[cx] = c^2 Var[x]
  Var[x + y] = Var[x] + 2Cov[x, y] + Var[y]
  Because the v_t’s are independent, the covariance term drops out: Var[x + y] = Var[x] + Var[y]
  Var[v_1] = Var[v_2] = … = Var[v_T] = p(1 - p)

Applying these rules to EstFrac leaves a sum of p(1 - p) terms divided by T^2.
Question: How many p(1 - p) terms are there?
Answer: T

Summary:
  Mean[EstFrac] = p
  Var[EstFrac] = p(1 - p)/T
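The variance derivation can be written out in the same way. Again this is a reconstruction, assuming that EstFrac is the sample average of the v_t’s.

```latex
\begin{align*}
\operatorname{Var}[\mathit{EstFrac}]
  &= \operatorname{Var}\!\left[\tfrac{1}{T}\,(v_1 + v_2 + \dots + v_T)\right] \\
  &= \tfrac{1}{T^2}\,\operatorname{Var}[v_1 + v_2 + \dots + v_T]
     && \operatorname{Var}[cx] = c^2\operatorname{Var}[x] \\
  &= \tfrac{1}{T^2}\,\bigl(\operatorname{Var}[v_1] + \operatorname{Var}[v_2] + \dots + \operatorname{Var}[v_T]\bigr)
     && \text{independence: all covariances equal } 0 \\
  &= \tfrac{1}{T^2}\,\bigl(p(1-p) + p(1-p) + \dots + p(1-p)\bigr)
     && \operatorname{Var}[v_t] = p(1-p) \text{ for each } t \\
  &= \tfrac{1}{T^2}\,\bigl(T\,p(1-p)\bigr) = \frac{p(1-p)}{T}
     && \text{there are } T \text{ terms}
\end{align*}
```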
Simulations: Confirming the Equations (Lab 3.1)

Equations:
  Mean[EstFrac] = ActFrac = p
  Var[EstFrac] = p(1 - p)/T

Table columns: Sample Size; Mean of EstFrac’s Prob Dist; Variance of EstFrac’s Prob Dist; Simulation Repetitions; Mean (Average) of EstFrac’s Numerical Values from the Experiments; Variance of EstFrac’s Numerical Values from the Experiments

  Simulation Repetitions    Mean    Variance
  >1,000,000                .50     .25
  >1,000,000                .50     .125
  >1,000,000                .50     .01
  >1,000,000                .50     .0025
  >1,000,000                .50

Relative Frequency Interpretation of Probability: After many, many repetitions of the experiment, the distribution of the actual numerical values mirrors the probability distribution of the random variable. Both distributions have the same mean and variance.

Two Questions
  Why is the distribution center (mean) important? More specifically, Mean[EstFrac] = ActFrac. Why is this important?
  Why is the distribution spread (variance) important?
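A minimal sketch of such a confirming simulation follows. It is not the lab software. It assumes ActFrac = .50 and uses far fewer repetitions than the lab. The sample sizes 1, 2, 25 and 100 are an assumption, chosen because p(1 - p)/T then equals .25, .125, .01 and .0025.

```python
import random
from statistics import fmean, pvariance

def simulate_est_fracs(p, sample_size, repetitions, rng):
    """Repeat the poll many times; return EstFrac from each repetition."""
    est_fracs = []
    for _ in range(repetitions):
        # Each draw supports Clint with probability p (sampling with replacement)
        supporters = sum(rng.random() < p for _ in range(sample_size))
        est_fracs.append(supporters / sample_size)
    return est_fracs

p = 0.5               # assumed actual fraction, ActFrac
repetitions = 20_000  # far fewer than the lab's 1,000,000+, to keep the run short
rng = random.Random(42)

results = {}
for T in (1, 2, 25, 100):  # illustrative sample sizes (an assumption)
    est_fracs = simulate_est_fracs(p, T, repetitions, rng)
    results[T] = (fmean(est_fracs), pvariance(est_fracs))
    # Columns: T, simulated mean, simulated variance, variance from the equation
    print(T, round(results[T][0], 3), round(results[T][1], 4), p * (1 - p) / T)
```

With more repetitions, the simulated mean and variance should settle closer to p and p(1 - p)/T, as the relative frequency interpretation suggests.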
Question: Why is the mean of the estimate’s probability distribution important?

A mean describes the center of its probability distribution.

Unbiased Estimation Procedure
  Conceptually, an estimation procedure is unbiased whenever it does not systematically underestimate or overestimate the actual population fraction.
  Formally, an estimation procedure is unbiased whenever the mean of the estimated fraction’s probability distribution equals the actual population fraction.
  Mean[EstFrac] = ActFrac
  So, we have already shown that Clint’s estimation procedure is unbiased.

Relative Frequency Interpretation of Probability (Lab 3.2):
  Average of the estimate’s numerical values after many, many repetitions = Mean[EstFrac] = ActFrac
  Now we have some intuition.

If the probability distribution is symmetric, we have even more intuition. In one poll,
  the chances that the estimated fraction is too low
  equal
  the chances that the estimated fraction is too high.

[Figure: Probability Distribution of EstFrac; horizontal axis EstFrac, with Mean[EstFrac] = ActFrac marked at the center]
Question: Why is the variance of the estimate’s probability distribution important when the estimation procedure is unbiased?

Claim: When the estimation procedure is unbiased, the reliability of the estimated fraction depends on the variance of the estimated fraction’s probability distribution.

Interval Estimate Question: What is the probability that the estimated fraction from a single poll lies close to the actual value?
  Small probability: Estimate is unreliable
  Large probability: Estimate is reliable

Quantifying Reliability:
  Strategy: Use a simulation and apply the relative frequency interpretation of probability.
  Decide on a "close to" criterion: .05
  Population Fraction = ActFrac = p = .50
  Question: After many, many repetitions, how frequently is the estimated fraction close to, within .05 of, the actual population fraction?

Simulations (Lab 3.3):

  Sample    Variance of Random    Simulation     Percent of Repetitions in which the Numerical
  Size      Variable EstFrac      Repetitions    Value of EstFrac Lies between .45 and .55
                                  >1,000,000     39%
                                  >1,000,000     69%
                                  >1,000,000     95%

Interval Estimate Question: What is the probability that the estimated fraction from a single poll lies close to, within .05 of, the actual value?
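The simulation strategy can be sketched as follows. This is an illustration, not the lab software. It assumes ActFrac = .50, uses far fewer repetitions than the lab, and takes the sample sizes 25, 100 and 400 as an assumption. Because EstFrac can only take the values 0, 1/T, 2/T, and so on, the printed shares depend on the sample size and on how the interval endpoints are counted, so they need not reproduce the percentages in the lecture's table.

```python
import random

def fraction_within(p, sample_size, repetitions, lower, upper, rng):
    """Share of repetitions in which EstFrac lies between lower and upper (inclusive)."""
    hits = 0
    for _ in range(repetitions):
        supporters = sum(rng.random() < p for _ in range(sample_size))
        est_frac = supporters / sample_size
        if lower <= est_frac <= upper:
            hits += 1
    return hits / repetitions

rng = random.Random(7)
# Assumed ActFrac = .50; "close to" criterion of .05 gives the interval .45 to .55
shares = {T: fraction_within(0.5, T, 5_000, 0.45, 0.55, rng) for T in (25, 100, 400)}
for T, share in shares.items():
    print(T, f"{share:.1%}")
```

The pattern to look for is that the share of repetitions landing inside the interval rises as the sample size grows.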
Interval Estimate Question: What is the probability that the numerical value of the estimated fraction from one repetition of the experiment lies close to, within .05 of, the actual population fraction?

ActFrac = .50

Simulations:

  Sample    Variance of EstFrac’s       Simulation     Percent of Repetitions in which the Numerical
  Size      Probability Distribution    Repetitions    Value of EstFrac Lies between .45 and .55
                                        >1,000,000     39%
                                        >1,000,000     69%
                                        >1,000,000     95%

Question: How can we use the simulation results to answer the interval estimate question?

Relative Frequency Interpretation of Probability: After many, many repetitions of the experiment, the distribution of the numerical values mirrors the probability distribution.
  The portion of estimates that lie within .05 of the actual value, between .45 and .55, after many, many repetitions
  equals
  the probability that the estimate lies within .05 of the actual value, between .45 and .55, in a single poll (one repetition).

  Sample    Variance of EstFrac’s       Probability that the Numerical Value of EstFrac Lies
  Size      Probability Distribution    between .45 and .55 in a Single Poll (One Repetition)
                                        .39
                                        .69
                                        .95

Reconsider the interval estimate question:
  Sample    Variance of EstFrac’s       In a Single Poll (One Repetition):
  Size      Probability Distribution    Prob[Numerical Value between .45 and .55]
                                        .39
                                        .69
                                        .95

Generalizing, when an estimation procedure is unbiased:
  Variance large: Small probability that the numerical value of the estimated fraction, EstFrac, from one repetition of the experiment will be close to the actual population fraction, ActFrac. Estimate is unreliable.
  Variance small: Large probability that the numerical value of the estimated fraction, EstFrac, from one repetition of the experiment will be close to the actual population fraction, ActFrac. Estimate is reliable.

[Figure: Probability Distributions of EstFrac, one with large variance and one with small variance; both centered at Mean[EstFrac] = ActFrac; horizontal axis EstFrac]

Summary: When the estimation procedure is unbiased, the variance tells us how reliable the estimate is.
Central Limit Theorem Motivation: Role of the Standard Deviation

Central Limit Theorem: As the sample size becomes larger and larger, we can use the normal distribution to calculate better and better approximations of interval estimates.

Strategy for Motivating and Illustrating the Central Limit Theorem: Four Steps
  Step 1: Use the equations to calculate the mean, variance, and standard deviation of EstFrac’s probability distribution for three sample sizes, 25, 100, and 400.
  Step 2: Use simulations to calculate the percent of repetitions that fall within 1, 2, and 3 standard deviations of Mean[EstFrac], the mean of EstFrac’s probability distribution.
  Step 3: Observe an interesting similarity.
  Step 4: Introduce the normal distribution and use it to calculate the percent of repetitions that fall within 1, 2, and 3 standard deviations of Mean[EstFrac].

Step 1: Mean, variance, and SD for three sample sizes
  Sample Size = T = 25:   Mean[EstFrac] = p
  Sample Size = T = 100:  Mean[EstFrac] = p
  Sample Size = T = 400:  Mean[EstFrac] = p
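Step 1 can be carried out with the equations Mean[EstFrac] = p and Var[EstFrac] = p(1 - p)/T, taking the standard deviation as the square root of the variance. The sketch below assumes ActFrac = p = .50, the value used in the interval estimate simulations. The variable names are illustrative.

```python
from math import sqrt

p = 0.5  # assumed actual population fraction, ActFrac

summary = {}
for T in (25, 100, 400):
    mean = p                      # Mean[EstFrac] = p
    variance = p * (1 - p) / T    # Var[EstFrac] = p(1 - p)/T
    sd = sqrt(variance)           # SD[EstFrac] = square root of the variance
    summary[T] = (mean, variance, sd)
    print(f"T = {T}: mean = {mean}, variance = {variance:.6f}, SD = {sd:.3f}")
```

Quadrupling the sample size cuts the variance to a quarter and the standard deviation to a half.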
Central Limit Theorem Motivation: Role of the Standard Deviation

Central Limit Theorem: As the sample size becomes larger and larger, the normal distribution provides better and better approximations of interval estimates.

Step 2: Use simulations to calculate the percent of repetitions that fall within 1, 2, and 3 standard deviations of Mean[EstFrac], the mean of EstFrac’s probability distribution. (Lab 3.4)

Summary of Mean and SD Calculations

                              Sample Size
                              25          100         400
  Mean[EstFrac]
  SD[EstFrac]
  Interval: 1 SD
    From-To Values
    Percent of Repetitions    69.2%       %           %
  Interval: 2 SD’s
    From-To Values
    Percent of Repetitions    %           %           %
  Interval: 3 SD’s
    From-To Values
    Percent of Repetitions    %           %           %

Step 3: Observe an interesting similarity.
Question: What do these results suggest?
Answer: The standard deviations, the SD’s, appear to be critical.
Normal Distribution: The Famed Bell-Shaped Curve

Normal Distribution: Three Important Properties
  The normal distribution is bell shaped.
  The normal distribution is symmetric around its mean (center).
  The area beneath the normal curve equals 1.

The variable z: the "normalized" value of the random variable. z equals the number of standard deviations the value lies from the random variable’s mean:
  z = (Value of the Random Variable - Distribution Mean) / Distribution Standard Deviation

Normal Distribution Table
  The row specifies the z value’s whole number and its tenths.
  The column specifies the z value’s hundredths.
  The number in the body of the table estimates the probability that the random variable lies more than z standard deviations above its mean.

For example, suppose that z = 1.53: What is the probability that the random variable would lie more than 1.53 standard deviations above its mean?
  Row 1.5, column .03: the probability is .0630.

[Figure: Normal Distribution; the area to the right of z SD’s above the distribution mean is the probability of being more than z standard deviations above the distribution mean; in the example, the area to the right of 1.53 SD’s is .0630]
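The table lookup can be checked numerically. The sketch below is not the lecture's method. It uses the complementary error function from Python's standard library to compute the area under the normal curve to the right of z. The function name is illustrative.

```python
from math import erfc, sqrt

def upper_tail(z):
    """Probability that a normal random variable lies more than z SDs above its mean."""
    return 0.5 * erfc(z / sqrt(2))

tail_153 = upper_tail(1.53)
print(f"{tail_153:.4f}")  # prints 0.0630, matching the table entry for z = 1.53
```

By symmetry, the same number is also the probability of lying more than 1.53 standard deviations below the mean.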
Normal Distribution Rules of Thumb

  Standard Deviations within      Probability of
  Random Variable’s Mean          being within
  1                               .68
  2                               .95
  3                               >.99

The normal distribution percentages follow from two properties:
  The area beneath the normal curve equals 1.
  The normal distribution is symmetric around its mean (center).

  Interval: Standard Deviations     Simulations: Percent of Repetitions      Normal
  within Random Variable’s Mean     within Interval, by Sample Size          Distribution
                                    25          100         400              Percentages
  1                                 69.2%       68.5%       68.3%            68.26%
  2                                 96.3%       95.6%       95.5%            95.44%
  3                                 99.9%       99.8%       99.7%            99.74%

Summary
Central Limit Theorem: As the sample size becomes larger and larger, we can use the normal distribution to calculate better and better approximations of interval estimates.
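The rules of thumb, and the normal approximation to the interval estimate question, can both be computed directly. This is a minimal sketch: it uses the error function from Python's standard library rather than the printed table, so the last digit can differ from the slide's percentages by table rounding. ActFrac = .50, the "close to" criterion of .05, and the sample sizes 25, 100 and 400 are taken as assumptions.

```python
from math import erf, sqrt

def prob_within(z):
    """Normal probability of lying within z standard deviations of the mean."""
    return erf(z / sqrt(2))

# Rules of thumb: within 1, 2 and 3 standard deviations of the mean
rules = {k: prob_within(k) for k in (1, 2, 3)}
print({k: round(v, 4) for k, v in rules.items()})

# Interval estimate question under the normal approximation
p, close_to = 0.5, 0.05  # assumed ActFrac and "close to" criterion
approx = {}
for T in (25, 100, 400):
    sd = sqrt(p * (1 - p) / T)             # SD[EstFrac]
    approx[T] = prob_within(close_to / sd)  # .05 expressed in standard deviations
print({T: round(v, 3) for T, v in approx.items()})
```

Under these assumptions the approximations come out near .38, .68 and .95, close to the 39%, 69% and 95% reported in the interval estimate simulations, provided those simulations used the same three sample sizes.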
Revisiting Clint’s Dilemma

On the eve of the election, Clint must decide whether or not to hold a pre-election party:
  If he is comfortably ahead, he will not hold the party; he will save his campaign funds for a future political endeavor (or a trip to Cancun).
  If he is not comfortably ahead, he will hold the party, hoping to capture more votes.
There is not enough time to canvass everyone, however. What should he do?

Econometrician’s Philosophy: If you lack the information to determine the value directly, estimate the value to the best of your ability using the information you do have.

Clint’s Estimation Procedure
  Procedure: Clint selects 16 students at random.
  Questionnaire: Are you voting for Clint?
  Results: 12 students report that they will vote for Clint and 4 against Clint.
  Estimated fraction of population supporting Clint: EstFrac = 12/16 = .75

Clint uses the information collected from the sample to draw inferences about the entire population. Seventy-five percent, .75, of the sample support Clint. This poll suggests that Clint leads.

Question: Should Clint be confident that he has the election in hand or should he fund the party?