Ratio and regression estimation STAT262, Fall 2017
STAT262: Ratio estimation
Motivating Example: California Schools, api99 and api00
Motivating Example: California Schools. Suppose that we know api99 for the whole population and api00 for a simple random sample. We would like to estimate the population mean of api00. Two options: use the SRS only, or use api99 as an auxiliary variable.
Ratio Estimator. The sampling method is the same: Simple Random Sampling (SRS). Sometimes we are interested in the ratio of two population characteristics, e.g., the average yield of corn per acre. Sometimes Y is a characteristic we are interested in and X is a characteristic that is related to Y; we then use ratio estimators to increase the precision of estimated means or totals.
api99 vs api00
api99 vs api00. In the SRS: mean of api99 = 624.685, mean of api00 = 656.585 (the sample mean is an unbiased estimator for the population mean), so the ratio = 656.585 / 624.685 = 1.051066. In the population: mean of api99 = 631.913. What is a good guess of the mean of api00 in the population?
api99 vs api00: ratio estimation. In the SRS, 656.585 / 624.685 = 1.051066. Applying the same ratio to the population, ? / 631.913 = 1.051066, so ? = 664.1821. The true mean of api00 in the population: 664.7126. The SRS estimate: 656.585.
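The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the function name `ratio_estimate_mean` is made up for illustration, and the three input numbers are the ones quoted on the slides.

```python
def ratio_estimate_mean(ybar_sample, xbar_sample, xbar_population):
    """Ratio estimate of the population mean of y: (ybar / xbar) * xbar_U."""
    b_hat = ybar_sample / xbar_sample  # sample ratio
    return b_hat * xbar_population

# Numbers quoted on the slides
srs_mean_api99 = 624.685   # SRS mean of the auxiliary variable
srs_mean_api00 = 656.585   # SRS mean of the variable of interest
pop_mean_api99 = 631.913   # known population mean of the auxiliary variable

b_hat = srs_mean_api00 / srs_mean_api99
estimate = ratio_estimate_mean(srs_mean_api00, srs_mean_api99, pop_mean_api99)
print(f"sample ratio   = {b_hat:.6f}")
print(f"ratio estimate = {estimate:.4f}")
```

The ratio estimate lands much closer to the true mean 664.7126 than the plain SRS mean 656.585 does.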
api99 vs api00. Ratio estimator vs the unbiased estimator? Simulation: for each of the 1000 simulations, generate an SRS and estimate the population mean of api00 by two methods: (1) the sample mean of api00; (2) the ratio estimator.
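A simulation of this kind might look as follows. This is a minimal sketch on a synthetic population, not the real California data: the population size, the slope 1.05, the noise level, the sample size n = 50, and the seed are all assumptions chosen for illustration.

```python
import random
import statistics

random.seed(262)

# Hypothetical population (NOT the real API data): y roughly proportional to x.
N = 2000
x_pop = [random.uniform(400, 900) for _ in range(N)]
y_pop = [1.05 * x + random.gauss(0, 30) for x in x_pop]
xbar_U = statistics.fmean(x_pop)   # known population mean of x
ybar_U = statistics.fmean(y_pop)   # target: population mean of y

n, n_sims = 50, 1000
srs_estimates, ratio_estimates = [], []
for _ in range(n_sims):
    idx = random.sample(range(N), n)            # SRS without replacement
    xbar = statistics.fmean([x_pop[i] for i in idx])
    ybar = statistics.fmean([y_pop[i] for i in idx])
    srs_estimates.append(ybar)                  # method 1: sample mean
    ratio_estimates.append(ybar / xbar * xbar_U)  # method 2: ratio estimator

def bias(estimates, truth):
    return statistics.fmean(estimates) - truth

def mse(estimates, truth):
    return statistics.fmean([(e - truth) ** 2 for e in estimates])

for name, est in [("sample mean", srs_estimates), ("ratio", ratio_estimates)]:
    print(f"{name:12s} bias = {bias(est, ybar_U):8.4f}  MSE = {mse(est, ybar_U):10.4f}")
```

In a population like this one, where y is close to proportional to x, the ratio estimator trades a small bias for a much smaller variance.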
api99 vs api00
api99 vs api00: bias vs variance
More examples. E.g. 1: the average yield of corn per acre. E.g. 2: the number of hummingbirds in a national forest. Sample a few regions and record the number of hummingbirds ($y_i$) and the area ($x_i$) for each region. Calculate the sample ratio $\hat{B} = \bar{y}/\bar{x}$. The total area of the national forest is $t_x$. An estimate of $t_y$ is $\hat{t}_{yr} = \hat{B}\, t_x$.
Examples. E.g. 3: Laplace wanted to know the number of persons living in France in 1802. There was no census in that year. Two candidate estimators. Which was Laplace’s choice? Sample (30 counties): 2,037,615 persons and 71,866 registered births. France (N counties, N known): number of persons $t_y$ unknown, number of registered births $t_x$ known.
Examples. Laplace reasoned that using the ratio estimator is more accurate. Large counties have more registered births, so the number of registered births and the number of persons are positively correlated. Thus, using the information in x is likely to improve our estimate of y.
Examples. E.g. 4: McDonald's Corp. We want the average annual sales for this year. One can use information from last year. Details will be discussed later.
Ratio estimators in SRS. Sampling method: SRS. Two quantities $(x_i, y_i)$ are measured on each sampled unit, where $x_i$ is an auxiliary variable.
Population quantities Size: N Totals: Means: Ratio: Variances and covariance: Correlation coefficient:
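For reference, one common set of definitions for these quantities is sketched below. The notation is an assumption based on standard survey-sampling usage ($B$ for the ratio, $R$ for the correlation, subscript $U$ for population means); the slides' own symbols may differ.

```latex
\begin{align*}
t_y &= \sum_{i=1}^{N} y_i, & t_x &= \sum_{i=1}^{N} x_i \\
\bar{y}_U &= \frac{t_y}{N}, & \bar{x}_U &= \frac{t_x}{N} \\
B &= \frac{t_y}{t_x} = \frac{\bar{y}_U}{\bar{x}_U} \\
S_y^2 &= \frac{1}{N-1}\sum_{i=1}^{N} (y_i - \bar{y}_U)^2, &
S_x^2 &= \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x}_U)^2 \\
S_{xy} &= \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x}_U)(y_i - \bar{y}_U), &
R &= \frac{S_{xy}}{S_x S_y}
\end{align*}
```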
Example of population quantities
Bias Ratio estimators are usually biased
Bias – the exact expression
Bias – the exact expression. The expression is exact, but it involves quantities we don’t know.
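One standard way to write the exact bias is sketched here, assuming the notation $\hat{B} = \bar{y}/\bar{x}$ for the sample ratio and $\hat{\bar{y}}_r = \hat{B}\,\bar{x}_U$ for the ratio estimator of the mean (these symbols are assumptions, not taken from the slides). It uses only the unbiasedness of $\bar{y}$ and $\bar{x}$ under SRS.

```latex
\begin{align*}
\bar{y}_U = E[\bar{y}] = E[\hat{B}\,\bar{x}]
  &= \operatorname{Cov}(\hat{B}, \bar{x}) + E[\hat{B}]\,\bar{x}_U \\
\Rightarrow\quad
\operatorname{Bias}(\hat{\bar{y}}_r) = E[\hat{B}]\,\bar{x}_U - \bar{y}_U
  &= -\operatorname{Cov}(\hat{B}, \bar{x})
\end{align*}
```

The covariance on the right depends on the sampling distribution of $\hat{B}$, which is why the expression cannot be evaluated from a single sample.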
Bias – an approximation. We want to get rid of the random terms in the denominators.
Bias – an approximation
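A sketch of the usual approximation, under assumed notation: $B = \bar{y}_U/\bar{x}_U$ is the population ratio, $S_x$ and $S_y$ are the population standard deviations, $R$ is the population correlation, and $\hat{\bar{y}}_r = (\bar{y}/\bar{x})\,\bar{x}_U$. The only approximation is replacing the random $\bar{x}$ in the denominator by the constant $\bar{x}_U$.

```latex
\begin{align*}
\hat{\bar{y}}_r - \bar{y}_U
  &= \frac{\bar{x}_U(\bar{y} - B\bar{x})}{\bar{x}}
   = (\bar{y} - B\bar{x})\left(1 - \frac{\bar{x} - \bar{x}_U}{\bar{x}}\right) \\
\operatorname{Bias}(\hat{\bar{y}}_r)
  &= -E\left[\frac{(\bar{y} - B\bar{x})(\bar{x} - \bar{x}_U)}{\bar{x}}\right]
  \approx -\frac{1}{\bar{x}_U}\,E\left[(\bar{y} - B\bar{x})(\bar{x} - \bar{x}_U)\right] \\
  &= \frac{1}{\bar{x}_U}\left[B\,V(\bar{x}) - \operatorname{Cov}(\bar{x}, \bar{y})\right]
   = \left(1 - \frac{n}{N}\right)\frac{1}{n\,\bar{x}_U}\left(B S_x^2 - R S_x S_y\right)
\end{align*}
```

The second line uses $E[\bar{y} - B\bar{x}] = 0$, so only the cross term contributes to the bias.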
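The bias is usually small if the sample size n is large, the sampling fraction n/N is large, the population mean of x is large, the population standard deviation of x is small, or x and y are highly positively correlated.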
Variance and Mean Squared Error. The bias is usually small and can thus be ignored, so MSE ≈ Var.
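Under the same kind of first-order approximation (replace $\bar{x}$ by $\bar{x}_U$ in the denominator), the MSE of the ratio estimator of the mean can be sketched as below. The notation is assumed: $B$ is the population ratio, $R$ the population correlation, and $d_i = y_i - B x_i$.

```latex
\begin{align*}
E\left[(\hat{\bar{y}}_r - \bar{y}_U)^2\right]
  &\approx E\left[(\bar{y} - B\bar{x})^2\right]
   = V(\bar{d})
   = \left(1 - \frac{n}{N}\right)\frac{S_d^2}{n} \\
  &= \left(1 - \frac{n}{N}\right)\frac{1}{n}
     \left(S_y^2 - 2 B R S_x S_y + B^2 S_x^2\right)
\end{align*}
```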
Estimate the variance Alternative expression of the variance They are not the same
Estimate the variance. We can implement the formula in the previous slide using “residuals” $e_i = y_i - \hat{B} x_i$. It is not difficult to show that
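A residual-based standard error for the ratio estimator of the mean could be computed as in the sketch below. It assumes the commonly used estimator $\widehat{V}(\hat{\bar{y}}_r) = (1 - n/N)\,(\bar{x}_U/\bar{x})^2\, s_e^2/n$ with $s_e^2 = \sum e_i^2/(n-1)$; the function name and the small data set are hypothetical.

```python
import math

def ratio_mean_and_se(x, y, xbar_U, N):
    """Ratio estimate of the population mean of y and its estimated SE.

    x, y   : sample values from an SRS of size n
    xbar_U : known population mean of x
    N      : population size
    """
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    b_hat = ybar / xbar                                   # sample ratio
    residuals = [yi - b_hat * xi for xi, yi in zip(x, y)]  # these sum to zero
    s2_e = sum(e * e for e in residuals) / (n - 1)
    var_hat = (1 - n / N) * (xbar_U / xbar) ** 2 * s2_e / n
    return b_hat * xbar_U, math.sqrt(var_hat)

# Hypothetical sample: y is roughly twice x
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]
estimate, se = ratio_mean_and_se(x, y, xbar_U=3.0, N=100)
print(estimate, se)
```

Dropping the factor $(\bar{x}_U/\bar{x})^2$ gives a slightly different estimate of the same variance.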
Efficiency of ratio estimation Consider the ratio and the unbiased estimators. Which one has a smaller variance?
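A sketch of the comparison, assuming the first-order variance approximation for the ratio estimator, $B > 0$, and the notation $B$ (population ratio), $R$ (population correlation), $S_x$, $S_y$ (population standard deviations):

```latex
\begin{align*}
V(\hat{\bar{y}}_r) - V(\bar{y})
  &\approx \left(1 - \frac{n}{N}\right)\frac{1}{n}
     \left(B^2 S_x^2 - 2 B R S_x S_y\right) \le 0 \\
\iff\quad
R &\ge \frac{B S_x}{2 S_y}
   = \frac{1}{2}\cdot\frac{S_x/\bar{x}_U}{S_y/\bar{y}_U}
   = \frac{\mathrm{CV}(x)}{2\,\mathrm{CV}(y)}
\end{align*}
```

So, to this order of approximation, the ratio estimator has the smaller variance when the correlation is large relative to the ratio of the coefficients of variation.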
A hypothetical example Population. N=8
A hypothetical example
A hypothetical example Mean estimate = 39.85036 Bias = -0.003178 Bias approx: Mean estimate = 40
api99 vs api00. Consider the built-in SRS: apisrs
Another example
Assignment 3: Problem 1. We have used the California schools example to illustrate different sampling and estimation strategies. What if we combine stratified sampling and ratio estimation to estimate the population mean of api00? Will this combination be better than using only one strategy? Use simulations to answer this question. Please: choose reasonable sample size(s); describe your conclusion clearly; support your conclusion using tables and/or figures.
Assignment 3: Problem 2
STAT262: Regression estimation
Regression estimation. Ratio estimation works well if the data are well fit by a straight line through the origin. Often, data are scattered around a straight line that does not go through the origin.
Regression estimation The regression estimator of the population mean is
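In its usual form the estimator can be written as below. The symbols $\hat{B}_0$, $\hat{B}_1$ (least-squares intercept and slope from the sample) and $\bar{x}_U$ (known population mean of x) are assumed notation.

```latex
\begin{align*}
\hat{B}_1 &= \frac{\sum_{i \in S} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i \in S} (x_i - \bar{x})^2},
&
\hat{B}_0 &= \bar{y} - \hat{B}_1 \bar{x} \\
\hat{\bar{y}}_{\mathrm{reg}} &= \hat{B}_0 + \hat{B}_1 \bar{x}_U
  = \bar{y} + \hat{B}_1 (\bar{x}_U - \bar{x})
\end{align*}
```

The second form shows the estimator as the sample mean plus an adjustment for how far the sample mean of x is from its known population mean.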
Bias. For a large SRS, the bias is usually small.
Variance and MSE Bias is small
Variance
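A sketch of the usual large-sample approximation, under assumed notation: $B_1 = R S_y / S_x$ is the population least-squares slope and $d_i = y_i - \bar{y}_U - B_1 (x_i - \bar{x}_U)$ are the population residuals.

```latex
\begin{align*}
V(\hat{\bar{y}}_{\mathrm{reg}})
  \approx \left(1 - \frac{n}{N}\right)\frac{S_d^2}{n}
  = \left(1 - \frac{n}{N}\right)\frac{S_y^2\,(1 - R^2)}{n}
\end{align*}
```

Since $1 - R^2 \le 1$, this approximate variance is never larger than that of the sample mean $\bar{y}$.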
Standard error
California Schools: api99 vs api00
California Schools: api99 vs api00. Step 1: fit a regression model, api00 ~ api99, and compute the residuals $e_i = y_i - \hat{y}_i$, $i = 1, \dots, n$. Step 2: calculate the regression estimate. Step 3: calculate the standard error.
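The three steps can be sketched in plain Python as follows. This is an illustration under assumptions: the standard error uses $\sqrt{(1 - n/N)\, s_e^2 / n}$ with $s_e^2 = \sum e_i^2/(n-2)$ (some texts divide by $n-1$ instead), and the function name and data are hypothetical, not the API sample.

```python
import math

def regression_estimate(x, y, xbar_U, N):
    """Regression estimate of the population mean of y and its estimated SE."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n

    # Step 1: least-squares fit y ~ x, then residuals
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

    # Step 2: evaluate the fitted line at the known population mean of x
    estimate = b0 + b1 * xbar_U

    # Step 3: standard error from the residual variance (n - 2 is an assumption)
    s2_e = sum(e * e for e in residuals) / (n - 2)
    se = math.sqrt((1 - n / N) * s2_e / n)
    return estimate, se

# Hypothetical sample scattered around a line that misses the origin
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [5.2, 6.9, 9.1, 10.8, 13.1]
est, se = regression_estimate(x, y, xbar_U=3.5, N=200)
print(est, se)
```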
California Schools: api99 vs api00
California Schools: api99 vs api00. SRS: 656.6 +/- 1.96*9.2. Ratio estimator: 664.2 +/- 1.96*2.25. Regression estimator: 663.4 +/- 1.96*2.04. The true population mean: 664.7.
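Expanding the three intervals makes the gain in precision easier to see. The sketch below only redoes the arithmetic with the estimates and standard errors quoted on the slide; the variable names are made up.

```python
z = 1.96  # normal quantile for an approximate 95% interval

# (point estimate, standard error) as quoted on the slide
estimates = {
    "SRS": (656.6, 9.2),
    "ratio": (664.2, 2.25),
    "regression": (663.4, 2.04),
}
true_mean = 664.7

intervals = {}
for name, (est, se) in estimates.items():
    lower, upper = est - z * se, est + z * se
    intervals[name] = (lower, upper)
    covers = lower <= true_mean <= upper
    print(f"{name:10s} ({lower:.1f}, {upper:.1f})  width {upper - lower:.1f}  covers truth: {covers}")
```

All three intervals contain the true mean, but the ratio and regression intervals are roughly a quarter of the width of the SRS interval.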
The McDonald Example
The McDonald Example
Relative Efficiencies
Relative efficiency: California Schools
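Summary. We introduced two new estimators. Ratio estimator: $\hat{\bar{y}}_r = (\bar{y}/\bar{x})\,\bar{x}_U$. Regression estimator: $\hat{\bar{y}}_{\mathrm{reg}} = \bar{y} + \hat{B}_1(\bar{x}_U - \bar{x})$, where $\hat{B}_1$ is the fitted slope. Both exploit the association between x and y. The regression estimator is the most efficient (asymptotically). The ratio estimator is more efficient than the SRS estimator if the correlation R between x and y is large.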
Estimation in Domains: A motivating example. We are often interested in separate estimates for subpopulations (also called domains). E.g., after taking an SRS of 1000 persons, we want to estimate the average salary for men and the average salary for women.
Estimation in Domains: A motivating example
Estimation in Domains: A motivating example. The calculation in the previous slide treats the domain sample size as a constant. But it is not: the number of sampled units that fall in a domain is random, and we should take this randomness into consideration. The formulas we derived for ratio estimators can be used.
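One way to apply the ratio formulas to a domain is sketched below. Let $x_i = 1$ if unit $i$ is in the domain (0 otherwise) and $u_i = y_i x_i$; the domain mean is then the ratio $\bar{u}/\bar{x}$. The standard error uses residuals $e_i = u_i - \hat{B} x_i$ and, as an assumption, replaces the unknown population mean of $x$ by the sample mean. The function name and the salary data are hypothetical.

```python
import math

def domain_mean_and_se(y, in_domain, N):
    """Domain mean as a ratio estimator, with a residual-based SE.

    y         : sample values from an SRS of size n
    in_domain : booleans, True if the unit belongs to the domain
    N         : population size
    """
    n = len(y)
    x = [1.0 if d else 0.0 for d in in_domain]   # domain indicator
    u = [yi * xi for yi, xi in zip(y, x)]        # y inside the domain, else 0
    xbar = sum(x) / n                            # sample proportion in the domain
    ubar = sum(u) / n
    b_hat = ubar / xbar                          # domain mean
    residuals = [ui - b_hat * xi for ui, xi in zip(u, x)]
    s2_e = sum(e * e for e in residuals) / (n - 1)
    # Assumption: the unknown population mean of x is replaced by xbar
    se = math.sqrt((1 - n / N) * s2_e / (n * xbar ** 2))
    return b_hat, se

# Hypothetical salaries (in thousands) and a domain flag
salaries = [50.0, 60.0, 70.0, 40.0, 55.0, 65.0]
is_in_domain = [True, False, True, True, False, False]
mean_d, se_d = domain_mean_and_se(salaries, is_in_domain, N=100)
print(mean_d, se_d)
```

Because the residuals are zero outside the domain, this standard error depends only on the spread of y within the domain and on how many sampled units landed there.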
Estimations in Domains
Estimations in Domains
Estimations in Domains If the sample is large