module 21 Module 2: Terminology of Data Sets. Attributes of Data Sets (Mean and Spread). Melinda Ronca-Battista, ITEP; Catherine Brown, U.S. EPA
module 22 Histogram: a.k.a. “frequency distribution.” Many types of datasets form “bell”-shaped histograms, a.k.a. “normal” or “Gaussian” curves (the “standard” normal curve is the special case with mean 0 and standard deviation 1).
module 23 Typical Histogram
module 24 Only 2 Factors: Mean (center of the data, where most of the data are). Spread (how much of the data lies how far from the mean).
module 25 Center: Mean = average. Outliers can strongly affect the mean. The distribution may not be symmetrical.
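To illustrate how strongly a single outlier can pull the mean, here is a minimal sketch. The readings are made-up values in arbitrary units, not data from this module.

```python
from statistics import mean

# Hypothetical readings (made-up values, arbitrary units).
readings = [10.0, 11.0, 12.0, 13.0, 14.0]

# The same readings plus one unusually high value (the outlier).
with_outlier = readings + [60.0]

mean_clean = mean(readings)        # 60 / 5  = 12.0
mean_outlier = mean(with_outlier)  # 120 / 6 = 20.0

print(f"mean without outlier: {mean_clean}")
print(f"mean with outlier:    {mean_outlier}")
```

One extra value moves the mean from 12 to 20, above every one of the original five readings.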
module 26 Many Environmental Distributions
module 27 Normal Distribution Useful
module 28 Spread: Sample standard deviation, computed with the spreadsheet function STDEV(range). The bigger the STDEV is compared to the mean, the wider the relative spread. COV (coefficient of variation) = STDEV/mean.
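The same two quantities can be sketched outside a spreadsheet. This example assumes Python's standard `statistics` module, whose `stdev` function is the sample standard deviation (dividing by N - 1, as the spreadsheet STDEV does). The values are hypothetical.

```python
from statistics import mean, stdev

# Hypothetical measurements (made-up values, arbitrary units).
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

avg = mean(values)        # center of the data
sample_sd = stdev(values) # sample standard deviation (N - 1 in the denominator)
cov = sample_sd / avg     # COV = STDEV / mean, a unitless relative spread

print(f"mean  = {avg:.3f}")
print(f"STDEV = {sample_sd:.3f}")
print(f"COV   = {cov:.3f}")
```

Because COV divides by the mean, it lets you compare the spread of data sets measured on different scales.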
module 29 How can we use the normal distribution? “Map” our distribution onto a normal distribution, using our mean and stdev. Then we can predict how many values fall within different degrees of “spread” away from the mean.
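A minimal sketch of this “mapping,” assuming the data really are close to bell-shaped. It uses `NormalDist` from Python's standard `statistics` module (Python 3.8 or later). The data and the threshold of 18 are hypothetical.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical measurements (made-up values, arbitrary units).
data = [8.0, 10.0, 12.0, 14.0, 16.0]

# "Map" the data onto a normal curve using its own mean and sample STDEV.
fitted = NormalDist(mean(data), stdev(data))

# Predicted fraction of values above a hypothetical threshold of interest.
threshold = 18.0
frac_above = 1.0 - fitted.cdf(threshold)

print(f"fitted mean = {fitted.mean:.2f}, fitted stdev = {fitted.stdev:.2f}")
print(f"predicted fraction above {threshold}: {frac_above:.3f}")
```

The prediction is only as good as the assumption that the real distribution is normal and that the sample represents it.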
module 210 Sample vs. Population: Our sample is a subset. We assume our subset is a subset of the “real” population. The closer our subset is to the real population, the better our prediction will be. Good sampling plans produce better representations of the “real” distribution.
module 211 Subsets Might be Biased
module 212 Terminology: Mu (μ) = population mean. Sigma (σ) = population standard deviation. s = sample standard deviation, our estimate of sigma.
module 213 How is this useful? Calculate the mean and stdev. NOW we can predict reality! Put any “x” value in context of how many STDEVs away from the mean it is.
module 214 Standard Deviation: STDEV(range)
module 215 Z Score: Z shows how far from the mean the “x” value you are interested in is, measured in standard deviations.
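The usual definition is z = (x - mean) / stdev. A short sketch follows; the function name `z_score` and the numbers are illustrative assumptions, not part of the slides.

```python
def z_score(x, mean, stdev):
    """Return how many standard deviations x lies from the mean."""
    if stdev <= 0:
        raise ValueError("stdev must be positive")
    return (x - mean) / stdev

# Hypothetical data set summary: mean 12, standard deviation 4.
z_high = z_score(20.0, 12.0, 4.0)  # (20 - 12) / 4 =  2.0
z_low = z_score(8.0, 12.0, 4.0)    # (8 - 12) / 4  = -1.0

print(z_high, z_low)
```

A positive Z means the value is above the mean, and a negative Z means it is below.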
module 216 Z scores are Proportions of Spread
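One way to read Z scores as proportions, assuming the data follow a normal distribution: compute the fraction of values expected within 1, 2, and 3 standard deviations of the mean. This sketch uses `NormalDist` from Python's standard `statistics` module.

```python
from statistics import NormalDist

# Standard normal curve: mean 0, standard deviation 1, so x values are Z scores.
standard = NormalDist()

# Fraction of values expected within +/- k standard deviations of the mean.
within = {k: standard.cdf(k) - standard.cdf(-k) for k in (1, 2, 3)}

for k, fraction in within.items():
    print(f"within {k} STDEV of the mean: {fraction:.1%}")
```

These are the familiar proportions of roughly 68%, 95%, and 99.7%.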
module 217
module 218 Sample Size Affects Confidence: The larger N is, the better your estimate (STDEV) reflects the real spread.
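A small simulation can make this concrete. It is only a sketch: the “real” population (normal, mean 50, standard deviation 10), the sample sizes, and the number of trials are all assumptions chosen for illustration.

```python
import random
from statistics import stdev

TRUE_MEAN, TRUE_SD = 50.0, 10.0  # hypothetical "real" population
rng = random.Random(42)          # fixed seed so the run is repeatable

def stdev_estimates(n, trials=200):
    """Return the sample STDEV from each of `trials` random samples of size n."""
    return [
        stdev([rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(n)])
        for _ in range(trials)
    ]

small = stdev_estimates(5)
large = stdev_estimates(500)

# Scatter of the STDEV estimates themselves: smaller means more consistent.
scatter_small = stdev(small)
scatter_large = stdev(large)

print(f"N = 5:   STDEV estimates scatter by about {scatter_small:.2f}")
print(f"N = 500: STDEV estimates scatter by about {scatter_large:.2f}")
```

With N = 5 the estimates of spread vary widely from sample to sample. With N = 500 they cluster much more tightly around the real value.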
module 219 Air Quality: Daily sampling gives the best estimate of reality. 1-in-3-day sampling is a compromise. 1-in-6-day sampling gives a worse estimate. 1-in-12-day sampling is even worse. How well does one location estimate all the air in the airshed? We compromise in both frequency and number of sites.
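To show what each schedule does to the sample size, the sketch below thins a hypothetical year of daily values to every 3rd, 6th, and 12th day. The daily values are synthetic random numbers, not real monitoring data.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical year of daily concentrations (synthetic, arbitrary units).
daily = [rng.gauss(20.0, 5.0) for _ in range(365)]

# Keep every step-th day, starting from day 1 of the year.
schedules = {"daily": 1, "1 in 3": 3, "1 in 6": 6, "1 in 12": 12}
subsets = {name: daily[::step] for name, step in schedules.items()}

for name, subset in subsets.items():
    print(f"{name:>8}: N = {len(subset):3d}, mean = {mean(subset):.2f}")
```

A 1-in-12-day schedule leaves only 31 values to describe the whole year, so its mean and STDEV rest on far less information than the daily record.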
module 220 Module 2 Summary: Data sets estimate reality. Mean (average). Spread (stdev). N (sample size). Good sampling plans produce good estimates of reality.