Chapter 3 INTERVAL ESTIMATES BAE 6520 Applied Environmental Statistics Biosystems and Agricultural Engineering Department Division of Agricultural Sciences and Natural Resources Oklahoma State University Source Dr. Dennis R. Helsel & Dr. Edward J. Gilroy 2006 Applied Environmental Statistics Workshop and Statistical Methods in Water Resources
Population vs. Sample We measure characteristics of a sample and infer that they apply to the population.
Intervals An interval computed from sample data provides information on how certain we are of where the true population parameter is.
Interval Estimates Confidence Interval – contains an unknown parameter (mean, median) of the population with a specified probability Prediction Interval – contains one or more future observations with a specified probability Tolerance Interval – contains a proportion (percentile) of future observations with a specified probability
What is Inside the Interval? Confidence Interval – contains an unknown parameter (mean, median) of the population with a specified probability Prediction Interval – contains one or more future observations with a specified probability Tolerance Interval – contains a proportion (percentile) of future observations with a specified probability
Your Interval May Not Contain the True Value!
Meaning of a Confidence Interval Suppose you compute ten 90% confidence intervals, each from a different sample of data collected under identical conditions with identical methods, so each sample is equally valid. On average, nine of the 10 intervals (90%) will contain the true mean and one will not. You never know whether yours is that one!
Meaning of a Confidence Interval Ten 90% Confidence Intervals
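The repeated-sampling meaning of a confidence level can be checked by simulation. The sketch below is illustrative only: the population (normal, mean 50, standard deviation 10), the sample size of 40, and the function names are all assumptions made for the example, and the interval uses the large-sample z multiplier with the sample standard deviation.

```python
import random
from statistics import NormalDist, mean, stdev

def z_interval(sample, conf=0.90):
    """Large-sample interval: sample mean +/- z * (s / sqrt(n))."""
    n = len(sample)
    z = NormalDist().inv_cdf(0.5 + conf / 2)
    half_width = z * stdev(sample) / n ** 0.5
    center = mean(sample)
    return center - half_width, center + half_width

def coverage(true_mean=50.0, true_sd=10.0, n=40, trials=2000, conf=0.90, seed=1):
    """Fraction of simulated intervals that contain the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Hypothetical population: every sample is drawn the same way
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        low, high = z_interval(sample, conf)
        hits += low <= true_mean <= high
    return hits / trials

print(f"Fraction of 90% intervals containing the true mean: {coverage():.3f}")
```

The printed fraction should be close to 0.90, but any single interval either contains the true mean or does not.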
Meaning of a Confidence Interval Example: 90% Confidence Interval about the Mean We are 90% confident that the true mean turbidity in the Poteau River is between 5 and 200 NTU (Nephelometric Turbidity Units).
Computing Confidence Intervals Parametric Intervals $\bar{X} - z\sigma \le \mu \le \bar{X} + z\sigma$, where μ = population mean, $\bar{X}$ = sample mean, z = multiplier that depends on the confidence level, σ = standard error of the mean. Symmetric around the sample mean. Confidence levels are valid if the data are normally distributed or the sample is large.
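A minimal sketch of the parametric interval, sample mean plus or minus z times the standard error, showing how z changes with the confidence level. The summary values (a sample mean of 12.0 and a standard error of 1.5) are hypothetical, and the function names are chosen for this example only.

```python
from statistics import NormalDist

def z_multiplier(conf):
    """Two-sided standard normal multiplier for a given confidence level."""
    return NormalDist().inv_cdf(0.5 + conf / 2)

def mean_interval(xbar, std_error, conf):
    """Symmetric interval around the sample mean: xbar +/- z * std_error."""
    z = z_multiplier(conf)
    return xbar - z * std_error, xbar + z * std_error

# Hypothetical summary statistics, for illustration only
for conf in (0.90, 0.95, 0.99):
    low, high = mean_interval(xbar=12.0, std_error=1.5, conf=conf)
    print(f"{conf:.0%}: z = {z_multiplier(conf):.3f}, interval = ({low:.2f}, {high:.2f})")
```

Higher confidence levels give larger z values and therefore wider intervals around the same sample mean.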
Computing Confidence Intervals Nonparametric Intervals Usually computed on median or other percentile Endpoints are data values Count in the same number of data from each end of the ranked dataset Does not depend on assumption that the data are normally distributed
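One common way to choose how many values to count in from each end is the binomial distribution with p = 0.5, since each observation is equally likely to fall above or below the true median. The sketch below assumes that approach; the function name and the placeholder data are hypothetical.

```python
from math import comb

def median_interval(data, conf=0.95):
    """Nonparametric interval for the median.

    Returns (lower, upper, achieved_confidence), where the endpoints are the
    k-th smallest and k-th largest data values and k is as large as possible
    while keeping the achieved confidence at or above `conf`.
    """
    x = sorted(data)
    n = len(x)
    best = None
    tail = 0.0  # P(B <= k - 1) for B ~ Binomial(n, 0.5)
    for k in range(1, n // 2 + 1):
        tail += comb(n, k - 1) / 2 ** n
        achieved = 1 - 2 * tail
        if achieved < conf:
            break
        best = (k, achieved)
    if best is None:
        raise ValueError("too few observations for this confidence level")
    k, achieved = best
    return x[k - 1], x[n - k], achieved

# Placeholder data: 25 values whose rank equals their value
low, high, achieved = median_interval(list(range(1, 26)), conf=0.95)
print(f"Median interval: {low} to {high} (achieved confidence {achieved:.3f})")
```

Because the endpoints must be observed data values, the achieved confidence is usually somewhat higher than the level requested.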
Confidence Intervals on Skewed Data Parametric intervals assume that the data, or at least the sample mean, follow a normal distribution. If this assumption is incorrect, the intervals will not include the true value as often as the stated confidence level suggests.
Confidence Intervals on Skewed Data First Approach Transform the data to approximate normality. Compute the confidence interval. Problem: when the confidence interval is retransformed to the original units, it is no longer a confidence interval on the mean. With logs, it is a confidence interval on the geometric mean, an estimate of the median.
Confidence Intervals on Skewed Data Example Arsenic Concentrations New Hampshire Groundwater
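The retransformation problem can be seen in a short sketch. The values below are hypothetical right-skewed concentrations, not the New Hampshire arsenic data, and the interval uses a large-sample z multiplier as a simplifying assumption (a t multiplier would normally be used for so few values).

```python
import math
from statistics import NormalDist, mean, stdev

def geometric_mean_interval(data, conf=0.95):
    """Interval computed on the logs, then transformed back with exp().

    The back-transformed interval brackets the geometric mean,
    not the arithmetic mean. All values must be positive.
    """
    logs = [math.log(v) for v in data]
    n = len(logs)
    z = NormalDist().inv_cdf(0.5 + conf / 2)
    half_width = z * stdev(logs) / math.sqrt(n)
    center = mean(logs)
    return math.exp(center - half_width), math.exp(center), math.exp(center + half_width)

# Hypothetical right-skewed values, for illustration only
values = [0.5, 0.8, 1.1, 1.3, 1.9, 2.4, 3.0, 4.2, 6.5, 9.8, 15.0, 42.0]
low, geo_mean, high = geometric_mean_interval(values)
print(f"Arithmetic mean: {mean(values):.2f}")
print(f"Geometric mean:  {geo_mean:.2f}, interval ({low:.2f}, {high:.2f})")
```

For skewed data the geometric mean sits below the arithmetic mean, so the back-transformed interval describes a different quantity than the mean.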
Confidence Intervals on Skewed Data Second Approach Hope that the Central Limit Theorem applies. Whether it does depends on the data skewness and the sample size. See Chapter 2 for Central Limit Theorem discussion.
Bootstrapping Currently the best way to compute a confidence interval from skewed data or a small sample. Does not require an assumption of normality.
Confidence Intervals on Skewed Data Third Approach - Bootstrapping Draw a resample from the data set, with replacement, so that any data point can be sampled multiple times or not at all. Compute the statistic of interest from the resample. Do this many times. Confidence endpoints are determined from the ranked values of the statistic. Based on the data set, so it works best with more data.
Confidence Intervals on Skewed Data Bootstrapping Example: Arsenic Data Set Randomly pick 25 values, with replacement, from a 25-point arsenic data set. Compute the mean of these 25 values. Repeat 1000 times. A 2-sided 95% confidence interval for the mean runs from the 25th to the 975th ranked value (0.025 × 1000 and 0.975 × 1000) of the 1000 means.
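The steps above can be sketched as a simple percentile bootstrap. The 25 values are hypothetical stand-ins for the arsenic data set, which is not reproduced here, and the function name and fixed seed are assumptions made so the example is repeatable.

```python
import random
from statistics import mean

def bootstrap_mean_interval(data, conf=0.95, reps=1000, seed=0):
    """Percentile bootstrap interval for the mean."""
    rng = random.Random(seed)
    n = len(data)
    # Resample n values with replacement, compute the mean, repeat `reps` times
    boot_means = sorted(mean(rng.choices(data, k=n)) for _ in range(reps))
    alpha = 1 - conf
    lower_rank = max(round(alpha / 2 * reps), 1)   # 25 for 95% and 1000 reps
    upper_rank = round((1 - alpha / 2) * reps)     # 975 for 95% and 1000 reps
    return boot_means[lower_rank - 1], boot_means[upper_rank - 1]

# Hypothetical right-skewed data set of 25 values, for illustration only
values = [0.4, 0.6, 0.7, 0.9, 1.0, 1.2, 1.3, 1.5, 1.8, 2.0, 2.3, 2.6, 3.0,
          3.4, 3.9, 4.5, 5.2, 6.1, 7.4, 9.0, 11.5, 15.0, 21.0, 34.0, 58.0]
low, high = bootstrap_mean_interval(values)
print(f"Sample mean: {mean(values):.2f}")
print(f"Bootstrap 95% interval for the mean: ({low:.2f}, {high:.2f})")
```

With skewed data the resulting interval is typically not symmetric around the sample mean, unlike the parametric interval.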
Confidence Intervals on Skewed Data Bootstrapping Example: Arsenic Data Set
Other Confidence Intervals Confidence intervals can be computed for other parameters: variance, standard deviation, median, other percentiles. A confidence interval for a percentile is often called a “tolerance interval”.
Prediction Intervals (contain one or more future observations with a specified probability) The simplest prediction interval (nonparametric) uses the percentiles of the data. For a two-sided 90% prediction interval, use the 5th and 95th percentiles. 90% of the observed data fall within this interval, and thus we expect that about 90% of future observations will also fall within it. Requires ample data.
Prediction Intervals A parametric prediction interval will be shorter than a nonparametric interval if the data follow the distribution assumed by the interval calculation. Easy method for a prediction interval: transform the data to look normal, compute the interval, then transform the interval back to the original units.
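A minimal sketch of the percentile-based prediction interval. The data are synthetic lognormal values generated for the example, and the choice of the "inclusive" percentile method is an assumption (it keeps the endpoints inside the observed range); other percentile conventions give slightly different endpoints.

```python
import random
from statistics import quantiles

def percentile_prediction_interval(data, conf=0.90):
    """Two-sided nonparametric prediction interval from sample percentiles.

    For conf = 0.90 this returns the 5th and 95th percentiles.
    Only whole-number percentiles are supported in this sketch.
    """
    lower_pct = round(100 * (1 - conf) / 2)
    upper_pct = 100 - lower_pct
    cuts = quantiles(data, n=100, method="inclusive")  # cuts[i - 1] is the i-th percentile
    return cuts[lower_pct - 1], cuts[upper_pct - 1]

# Synthetic skewed data, for illustration only
rng = random.Random(42)
observations = [rng.lognormvariate(0.0, 1.0) for _ in range(500)]
low, high = percentile_prediction_interval(observations, conf=0.90)
inside = sum(low <= v <= high for v in observations) / len(observations)
print(f"90% prediction interval: ({low:.2f}, {high:.2f})")
print(f"Fraction of observed data inside: {inside:.2f}")
```

With few observations the 5th and 95th percentiles are poorly determined, which is why the method needs ample data.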
Confidence vs. Prediction Intervals A prediction interval will always be wider than the confidence interval for the same alpha. Why? The mean of 10 observations, for example, is always less variable than the 10 individual observations themselves.
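The claim above can be made concrete with the standard normal-theory formulas. The symbols s (sample standard deviation), n (sample size), and t (Student's t multiplier with n - 1 degrees of freedom) are introduced here for illustration and assume normally distributed data and a prediction for a single new observation.

```latex
\begin{align*}
\text{Confidence interval for the mean:} \quad
  & \bar{X} \pm t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}} \\
\text{Prediction interval for one new observation:} \quad
  & \bar{X} \pm t_{\alpha/2,\,n-1}\, s\,\sqrt{1 + \frac{1}{n}} \\
\frac{\text{prediction half-width}}{\text{confidence half-width}}
  & = \frac{s\sqrt{1 + 1/n}}{s/\sqrt{n}} = \sqrt{n + 1} > 1
\end{align*}
```

For n = 10 the ratio is $\sqrt{11} \approx 3.3$, so under these assumptions the prediction interval is more than three times as wide as the confidence interval at the same alpha.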
Tolerance Intervals (contain a proportion of future observations with a specified probability) An interval around a proportion of the distribution. The proportion is called the “coverage”. Example question: what cutoff(s) will cover 95% of all future observations, with 90% confidence? Easy method for a tolerance interval: transform the data to look normal, compute the interval, then transform the interval back to the original units. This works for prediction and tolerance intervals, but not confidence intervals.
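The coverage question can also be answered without any distributional assumption. The sketch below is not the transform-based method described above; it swaps in a one-sided nonparametric tolerance limit based on the binomial distribution, and the function name is hypothetical. It finds which ranked observation can serve as an upper cutoff.

```python
from math import comb

def upper_tolerance_rank(n, coverage=0.95, conf=0.90):
    """Smallest rank r such that the r-th smallest of n observations is an
    upper limit covering at least `coverage` of the population with
    confidence at least `conf`. Returns None if n is too small.
    """
    cdf = 0.0  # P(B <= r - 1) for B ~ Binomial(n, coverage)
    for r in range(1, n + 1):
        cdf += comb(n, r - 1) * coverage ** (r - 1) * (1 - coverage) ** (n - r + 1)
        if cdf >= conf:
            return r
    return None

for n in (30, 44, 45, 100):
    rank = upper_tolerance_rank(n)
    if rank is None:
        print(f"n = {n}: too few observations for 95% coverage with 90% confidence")
    else:
        print(f"n = {n}: use the observation ranked {rank} of {n}")
```

Under this approach, even the sample maximum cannot serve as a 95% coverage, 90% confidence limit unless there are at least 45 observations, because the confidence attached to the maximum is 1 - 0.95^n.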