Estimation
Sampling Distributions
Because estimators are computed from random samples, they are random variates, just like the data. The distribution of an estimator is called its sampling distribution. Say we are interested in the mean mass of Mn contained in bullets manufactured at a particular factory. Let's use the average mass of Mn in a sample of size $n$ to estimate the population mean mass: $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$. What might the distribution of $\bar{X}$ look like if we take 1000 samples of size 10 over the course of a week?
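One way to get a feel for the question above is to simulate it. The sketch below assumes a hypothetical normal population; the values `pop_mean` and `pop_sd` are made up for illustration and are not taken from any bullet data.

```r
set.seed(1)                      # make the simulation reproducible

# Hypothetical population of Mn masses (assumed values, arbitrary units)
pop_mean <- 50
pop_sd   <- 5

n         <- 10                  # bullets per sample
n_samples <- 1000                # number of samples taken

# Draw 1000 samples of size 10 and keep the average of each one
xbar <- replicate(n_samples, mean(rnorm(n, mean = pop_mean, sd = pop_sd)))

hist(xbar, main = "Approximate sampling distribution of the sample average",
     xlab = "Sample average")

mean(xbar)                       # close to pop_mean
sd(xbar)                         # close to pop_sd / sqrt(n)
pop_sd / sqrt(n)                 # theoretical s.d. of the sample average
```

Under these assumptions the histogram is centred near the population mean and is narrower than the population itself by a factor of about sqrt(10).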
Sampling Distributions
Important features of an estimator's sampling distribution are its mean and its standard deviation.
[Figure: (approximate) sampling distribution of $\bar{X}$, annotated with the sampling distribution mean and s.d. Sample size n = 10 bullets; number of samples = 1000.]
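Handy Unbiased Estimators
An unbiased estimator of the mean that we always use is the sample average, $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$. This is the same as the MLE estimate. An unbiased estimator of the variance (which we will typically use as our variance estimator) is $S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$. This is different from the MLE estimate, which (under a normal model) divides by $n$ instead of $n-1$.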
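A small sketch of the two variance estimators in R. The data vector `x` is made up for illustration. R's built-in `var()` uses the $n-1$ divisor; the divide-by-$n$ version is computed by hand.

```r
x <- c(2, 4, 4, 5, 7)            # illustrative data (assumed, not from the slides)
n <- length(x)

mn <- mean(x)                    # sample average

v_unbiased <- var(x)             # divides by n - 1
v_mle      <- sum((x - mn)^2)/n  # divides by n (normal-model MLE)

c(mean = mn, unbiased = v_unbiased, mle = v_mle)

# The two variance estimates differ only by the factor (n - 1)/n
all.equal(v_mle, v_unbiased * (n - 1)/n)
```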
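Handy Unbiased Estimators
An unbiased estimator for a proportion is $\hat{p} = \frac{\text{number of heads, successes, etc.}}{n}$. The standard error of $\hat{p}$ is usually estimated by $\sqrt{\hat{p}(1-\hat{p})/n}$; dividing by $n-1$ instead of $n$ makes the corresponding variance estimate unbiased.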
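As a rough illustration, the proportion estimate can be computed from a 0/1 vector. The outcomes in `flips` are made up, and the standard error uses the common plug-in formula sqrt(p(1-p)/n), which is an assumption here rather than a formula recovered from the slide.

```r
# Illustrative outcomes (assumed): 1 = heads/success, 0 = tails/failure
flips <- c(1, 0, 1, 1, 0, 1, 0, 1, 1, 1)
n     <- length(flips)

p_hat  <- mean(flips)                     # successes / n
se_hat <- sqrt(p_hat * (1 - p_hat) / n)   # plug-in estimated standard error

c(p_hat = p_hat, se_hat = se_hat)
```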
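Sampling Distributions
Uncertainty in an estimate can be represented by the standard deviation of the estimator's sampling distribution. This standard deviation is called the standard error of the estimator. For the sample average it is $\sigma_{\bar{X}} = \sigma/\sqrt{n}$, and the estimated standard error is obtained by plugging in the sample standard deviation $s$ for $\sigma$: $\widehat{se}(\bar{X}) = s/\sqrt{n}$.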
Interval Estimation
We are interested in methods that produce an interval $[\hat{\theta}_{lo}, \hat{\theta}_{hi}]$. Given that the assumptions of the method are satisfied, the interval covers the true value of the parameter with (approximate) probability at least $1 - \alpha$. Common interval methods: confidence intervals, prediction intervals, tolerance intervals, and credibility/probability intervals (Bayesian).
Confidence Intervals
$\theta$ is a parameter we are interested in (e.g. a mean, an sd, a proportion), and we assume we don't know its true value. Consider an experiment that will collect a sample of data. BEFORE we collect the data, we can devise a procedure such that $P(\hat{\theta}_{lo} \le \theta \le \hat{\theta}_{hi}) \ge 1 - \alpha$, where $\hat{\theta}_{lo}$ and $\hat{\theta}_{hi}$ are estimates we will get from the sample we have yet to collect.
Confidence Intervals
To get actual numerical values for $\hat{\theta}_{lo}$ and $\hat{\theta}_{hi}$, we perform the experiment and plug in the data. The outcomes for this experiment are: the computed interval either covers $\theta$ or it does not. Under the frequentist definition, probabilities (other than 0 or 1) only exist for outcomes of experiments that haven't happened yet. After we collect the data, $[\hat{\theta}_{lo}, \hat{\theta}_{hi}]$ is a set of plausible values for $\theta$.
Confidence Intervals
Given a sample of data, the $(1-\alpha)\times 100\%$ confidence interval for a parameter estimated from the sample is $[\hat{\theta}_{lo}, \hat{\theta}_{hi}]$. We say we are $(1-\alpha)\times 100\%$ confident that the true value of $\theta$ is covered by $[\hat{\theta}_{lo}, \hat{\theta}_{hi}]$. The CI's level of confidence, $(1-\alpha)\times 100\%$, is the same "number" as the CI method's probability of producing an interval that covers $\theta$, but confidence is not probability.
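The distinction between the method's coverage probability and any single computed interval can be illustrated by simulation. This is a minimal sketch under assumed conditions: a normal population with made-up true values `mu` and `sigma`, and intervals taken from R's built-in `t.test()`.

```r
set.seed(42)                     # reproducible simulation

mu    <- 10                      # hypothetical true mean (assumed)
sigma <- 2                       # hypothetical true sd (assumed)
n     <- 30                      # sample size per experiment
alpha <- 0.05
n_experiments <- 2000

# Repeat the experiment many times; record whether each CI covers mu
covered <- replicate(n_experiments, {
  x  <- rnorm(n, mean = mu, sd = sigma)
  ci <- t.test(x, conf.level = 1 - alpha)$conf.int
  ci[1] <= mu && mu <= ci[2]
})

mean(covered)                    # fraction of intervals that covered mu
```

Each individual interval either covers `mu` or it does not; the fraction close to 1 - alpha describes the procedure over many repetitions, not any one interval.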
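Confidence Intervals
So how do we compute a $(1-\alpha)\times 100\%$ confidence interval given a set of data? General case: $(1-\alpha)\times 100\%$ CIs for the mean $\mu$, with sample size $n$ and the sd $\sigma_X$ unknown and estimated by the sample sd $s$:
Two sided: $\left[\bar{x} - t_{1-\alpha/2,\,n-1}\, s/\sqrt{n},\ \ \bar{x} + t_{1-\alpha/2,\,n-1}\, s/\sqrt{n}\right]$
One sided, lower bound: $\left[\bar{x} - t_{1-\alpha,\,n-1}\, s/\sqrt{n},\ \ \infty\right)$
One sided, upper bound: $\left(-\infty,\ \ \bar{x} + t_{1-\alpha,\,n-1}\, s/\sqrt{n}\right]$
The $t$ values are Student-t(n-1) quantiles: qt(1-alpha/2, df=n-1) for the two-sided interval, or qt(1-alpha, df=n-1) for the one-sided bounds.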
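The three cases can be collected into one small helper. This is a sketch only: the function name `t_ci` and its `type` argument are hypothetical names introduced here, and the demo data are made up.

```r
# Hypothetical helper: t-based CI for a mean (two-sided or one-sided)
t_ci <- function(x, alpha = 0.05, type = c("two.sided", "lower", "upper")) {
  type <- match.arg(type)
  n  <- length(x)
  mn <- mean(x)                  # sample average
  se <- sd(x) / sqrt(n)          # estimated standard error of the mean
  switch(type,
    two.sided = mn + c(-1, 1) * qt(1 - alpha/2, df = n - 1) * se,
    lower     = c(mn - qt(1 - alpha, df = n - 1) * se, Inf),
    upper     = c(-Inf, mn + qt(1 - alpha, df = n - 1) * se))
}

x_demo <- c(2, 4, 4, 5, 7)       # illustrative data (assumed)
t_ci(x_demo)                     # two-sided 95% CI
t_ci(x_demo, type = "lower")     # 95% lower confidence bound
t_ci(x_demo, type = "upper")     # 95% upper confidence bound
```

The two-sided result should agree with `t.test(x_demo)$conf.int`, which is a convenient cross-check.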
Compute the Confidence Intervals
The mass of an unknown powder was determined 30 times. The results are shown below (units: mg):
4.11, 3.70, 3.36, 3.68, 4.42, 3.23, 4.03, 4.03, 3.52, 4.75, 5.09, 3.47, 3.02, 4.24, 4.74, 4.51, 2.90, 4.15, 3.54, 3.81, 2.98, 3.82, 4.32, 3.06, 4.00, 4.05, 3.19, 3.17, 3.67, 4.37
Compute:
a. The sample mean
b. The sample sd
c. The estimated standard error of the mean
d. The number of estimated standard errors that cover 95% of the sampling distribution symmetrically about the sample mean (±)
Compute the Confidence Intervals
a. Sample mean = 3.83
b. Sample sd = 0.58
c. Est. se of mean = 0.11
d. For 95%, α = 0.05. For 95% spread symmetrically about the mean we want t(0.025, 29) and t(0.975, 29) = ±2.04523

# Data from the question:
x <- c(4.11, 3.70, 3.36, 3.68, 4.42, 3.23, 4.03, 4.03, 3.52, 4.75,
       5.09, 3.47, 3.02, 4.24, 4.74, 4.51, 2.90, 4.15, 3.54, 3.81,
       2.98, 3.82, 4.32, 3.06, 4.00, 4.05, 3.19, 3.17, 3.67, 4.37)
n     <- length(x)               # Sample size
mn    <- mean(x)                 # Sample average (estimated mean)
s     <- sd(x)                   # Sample standard deviation
se    <- s/sqrt(n)               # Estimated standard error of the mean
alpha <- 0.05                    # Level of significance
conf  <- 1 - alpha               # Level of confidence (0.95)
tt    <- qt(p = 1 - alpha/2, df = n-1)  # t-quantile: the number of estimated standard
                                        # errors that cover conf*100% of the sampling
                                        # distribution symmetrically about the mean.
Compute the Confidence Intervals
e. Compute the two-sided 95% CI for the mean given this data:
[ 3.83 − 2.04×0.11, 3.83 + 2.04×0.11 ] = [3.61, 4.05]

lo <- mn - tt*se
hi <- mn + tt*se
c(lo, hi)   # Two-sided confidence interval: a set of
            # plausible values for the mean given this sample.
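Confidence Intervals
For us, we can approximate the CI for any parameter we have encountered in the same way. $(1-\alpha)\times 100\%$ CIs for a general parameter $\theta$, with estimate $\hat{\theta}$ and estimated standard error $\widehat{se}(\hat{\theta})$:
Two sided: $\left[\hat{\theta} - t_{1-\alpha/2,\,n-1}\,\widehat{se}(\hat{\theta}),\ \ \hat{\theta} + t_{1-\alpha/2,\,n-1}\,\widehat{se}(\hat{\theta})\right]$
One sided, lower bound: $\left[\hat{\theta} - t_{1-\alpha,\,n-1}\,\widehat{se}(\hat{\theta}),\ \ \infty\right)$
One sided, upper bound: $\left(-\infty,\ \ \hat{\theta} + t_{1-\alpha,\,n-1}\,\widehat{se}(\hat{\theta})\right]$
The $t$ values are Student-t(n-1) quantiles: qt(1-alpha/2, df=n-1) for the two-sided interval, or qt(1-alpha, df=n-1) for the one-sided bounds.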
Example
Over a period of several months, the rate of attacks per day on a certain computer network was measured:
11.1, 12.3, 12.0, 11.3, 12.6, 12.9, 12.0, 13.2, 11.8, 13.2, 12.4, 10.3, 12.0, 12.1, 13.1
Compute the 90% lower confidence limit of the attack ("hack") rate parameter.
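One possible way to work this example in R is sketched below. It assumes the measurements can be treated as an independent sample and that the rate parameter is estimated by the sample average with a t-based one-sided bound; the variable names are chosen here for illustration.

```r
# Attack-rate measurements from the example
rate <- c(11.1, 12.3, 12.0, 11.3, 12.6, 12.9, 12.0, 13.2,
          11.8, 13.2, 12.4, 10.3, 12.0, 12.1, 13.1)

n     <- length(rate)            # Sample size
mn    <- mean(rate)              # Estimated rate (sample average)
se    <- sd(rate)/sqrt(n)        # Estimated standard error of the mean
alpha <- 0.10                    # 90% confidence

tt <- qt(p = 1 - alpha, df = n - 1)  # One-sided t-quantile
lo <- mn - tt*se                     # 90% lower confidence limit
lo
```

Under these assumptions the lower limit comes out at about 11.87, so the one-sided 90% interval is roughly [11.87, ∞).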