BAE 6520 Applied Environmental Statistics Biosystems and Agricultural Engineering Department Division of Agricultural Sciences and Natural Resources Oklahoma State University Source Dr. Dennis R. Helsel & Dr. Edward J. Gilroy 2006 Applied Environmental Statistics Workshop and Statistical Methods in Water Resources
TEXTBOOK Free on-line at: http://pubs.usgs.gov/twri/twri4a3/
Chapter 1 SUMMARIZING DATA Numbers and Graphs Choosing a Statistical Method Depends on: Data characteristics Study objectives
Characteristics of Environmental Data Lower bound of zero Presence of outliers, high values Positive skewness Non-normal distribution High variance Data below recording limits Data collected by other people
Categories of Measured Data Continuous: 1.10, 2.56, 100.5 …. Discrete: 1, 2, 5, 15 Qualitative, Grouped, Categorical Site 1, Site 2, Site 3 Below Detection Limit, Above Detection Limit
Histograms Show how many times Y occurs in each of several groups of X. Require grouping of a continuous variable Y-axis: frequency or relative frequency
Box Plots Good for continuous data Based on percentiles 50th percentile (median) 50 percent of data below or equal to median
Inter-quartile Range (IQR) A Measure of Variability IQR = 75th percentile – 25th percentile Represents the middle half of the data Example data: 1 3 7 10 13 21 25th Percentile = 2.5 75th Percentile = 15 IQR = 15 – 2.5 = 12.5
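The IQR example can be reproduced with a few lines of Python. This is a minimal sketch, assuming percentiles are located at position p(n + 1) in the sorted data, which is what `method="exclusive"` does in the standard-library `statistics.quantiles`. That convention matches the slide's values of 2.5 and 15, but other percentile conventions give slightly different quartiles for small samples.

```python
import statistics

# Example data from the slide
data = [1, 3, 7, 10, 13, 21]

# n=4 asks for quartiles; "exclusive" interpolates at position p * (n + 1)
q1, median, q3 = statistics.quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

print("25th percentile:", q1)
print("Median:", median)
print("75th percentile:", q3)
print("IQR:", iqr)
```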
Box Plots
Box Plots Outliers Ends of Vertical Lines - Whiskers Whisker – extends to highest or lowest data value within the limit. Upper Limit = Q3 + 1.5 (Q3 - Q1) Lower Limit = Q1 - 1.5 (Q3 - Q1) Q1 = First Quartile, 25th percentile Q3 = Third Quartile, 75th percentile
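The whisker rule can be sketched in code as follows. The function name `box_plot_limits` is hypothetical, the quartiles use the p(n + 1) convention of `statistics.quantiles`, and the lower limit is taken as Q1 - 1.5 (Q3 - Q1), the standard counterpart of the upper limit. The sample values are the outlier example used elsewhere in these slides.

```python
import statistics

def box_plot_limits(values):
    """Return box plot limits, whisker ends, and outliers for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="exclusive")
    step = 1.5 * (q3 - q1)            # 1.5 times the IQR
    lower_limit = q1 - step
    upper_limit = q3 + step
    # Whiskers end at the most extreme data values still inside the limits
    inside = [v for v in values if lower_limit <= v <= upper_limit]
    outliers = [v for v in values if v < lower_limit or v > upper_limit]
    return {
        "lower_limit": lower_limit,
        "upper_limit": upper_limit,
        "lower_whisker": min(inside),
        "upper_whisker": max(inside),
        "outliers": outliers,
    }

result = box_plot_limits([1, 3, 7, 10, 13, 210])
for name, value in result.items():
    print(name, value)
```

Here 210 falls above the upper limit, so it is plotted as an outlier and the upper whisker stops at 13.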
Population vs. Sample Data are samples that we assume represent the characteristics of a population.
Summary Statistics Mean and Standard Deviation
Mean vs. Median Effect of Outliers Data: 1 3 7 10 13 21 Median = 8.5 Mean = 9.2 Suppose an error is made, and the data become: 1 3 7 10 13 210 Median = 8.5 Mean = 40.7 The mean is NOT a resistant measure of the center Median and percentiles are generally not sensitive to outliers
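The effect of a single recording error on the mean and the median can be checked directly. This is a small sketch using the two data sets from the slide; the variable names are chosen here for illustration.

```python
import statistics

original = [1, 3, 7, 10, 13, 21]
with_error = [1, 3, 7, 10, 13, 210]   # 21 mistakenly recorded as 210

summary = {}
for label, values in [("original", original), ("with error", with_error)]:
    summary[label] = (statistics.median(values), round(statistics.mean(values), 1))
    print(label, "median =", summary[label][0], "mean =", summary[label][1])
```

The median stays at 8.5 in both cases, while the mean moves from about 9.2 to about 40.7.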
Symmetric vs. Skewed Data Box Plots Approximate Normal Distribution Non-normal Distribution
Positive Skewness Common in Environmental Data
Symmetric vs. Skewed Data Histograms and Box Plots Symmetrical Data Approximates a Normal Distribution
Symmetric vs. Skewed Data Histograms and Box Plots Box plot is compressed due to outliers.
Increasing Skewness (top half box width increases)
Cumulative Distribution Functions Histogram of natural log of loads and the resulting empirical cumulative distribution function (CDF). Blue – best fit normal distribution Red – Empirical CDF
Probability Plots Theoretical normal distribution plots as a straight line on normal probability paper. If data are also straight, they follow a normal distribution.
Concentrations are Not Normally Distributed
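An empirical CDF and a fitted normal CDF can be compared numerically as well as graphically. The sketch below assumes the empirical CDF is defined as i/n for the i-th smallest value and that the "best fit" normal uses the sample mean and standard deviation of the logs. The `loads` values are hypothetical and are not the data behind the slide's figure.

```python
import math
import statistics

def empirical_cdf(values):
    """Return (value, cumulative fraction) pairs for the sorted values."""
    ordered = sorted(values)
    n = len(ordered)
    return [(x, (i + 1) / n) for i, x in enumerate(ordered)]

# Hypothetical loads, for illustration only
loads = [0.2, 0.5, 0.9, 1.4, 2.3, 3.8, 6.1, 10.4, 17.0, 45.0]
log_loads = [math.log(x) for x in loads]

ecdf = empirical_cdf(log_loads)
# Normal distribution with the sample mean and standard deviation of the logs
fitted = statistics.NormalDist.from_samples(log_loads)

for x, p_emp in ecdf:
    print(f"ln(load)={x:7.3f}  empirical={p_emp:.2f}  fitted normal={fitted.cdf(x):.2f}")
```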
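One way to put a number on "how straight" a probability plot is: correlate the sorted data with the normal quantiles of their plotting positions. This is a sketch under stated assumptions. The plotting position (i - 0.4)/(n + 0.2) is one common choice, not necessarily the one used for the slide's figures, and the concentrations are hypothetical.

```python
import math
from statistics import NormalDist

def normal_scores(n, a=0.4):
    """Standard normal quantiles at plotting positions (i - a) / (n + 1 - 2a)."""
    return [NormalDist().inv_cdf((i - a) / (n + 1 - 2 * a)) for i in range(1, n + 1)]

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def probability_plot_r(values):
    """Correlation between the sorted data and their normal scores."""
    ordered = sorted(values)
    return pearson_r(normal_scores(len(ordered)), ordered)

# Hypothetical concentrations, for illustration only
conc = [0.2, 0.5, 0.9, 1.4, 2.3, 3.8, 6.1, 10.4, 17.0, 45.0]
r_raw = probability_plot_r(conc)
r_log = probability_plot_r([math.log(c) for c in conc])
print(f"raw data: r = {r_raw:.3f}")
print(f"log data: r = {r_log:.3f}")
```

A correlation near 1 means the points lie close to a straight line. For these values the logs plot much straighter than the raw concentrations.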
Logs of Concentration are Normally Distributed
What to do with skewed data? Data with outliers have a mean that may be larger than 75% of the data If we want a more “typical” measure of the center, we have two choices: Use a different method, e.g. use the median or geometric mean Transform the data
Purpose of Transformations Make data more normal Make data more linear Make variance more constant
Positive and Negative Skew Source: http://www.georgetown.edu/departments/psychology/researchmethods/statistics/begin.htm
Transformations Using Ladder of Powers
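The ladder of powers can be sketched as a single function of an exponent theta. The conventions here are assumptions, since the slide's table is not reproduced in the text: theta = 0 stands for the natural log, negative exponents are negated so that the order of the data is preserved, and skewness is the adjusted sample coefficient. The function names and the data are hypothetical.

```python
import math
import statistics

def ladder_transform(x, theta):
    """Transform a positive value x by the ladder-of-powers exponent theta."""
    if x <= 0:
        raise ValueError("ladder of powers assumes positive data")
    if theta == 0:
        return math.log(x)        # the log takes the place of x ** 0
    if theta > 0:
        return x ** theta
    return -(x ** theta)          # negated so larger x stays larger

def skewness(values):
    """Adjusted sample skewness coefficient."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return n / ((n - 1) * (n - 2)) * sum(((v - mean) / s) ** 3 for v in values)

# Hypothetical positively skewed data, for illustration only
data = [0.2, 0.5, 0.9, 1.4, 2.3, 3.8, 6.1, 10.4, 17.0, 45.0]
skew_by_theta = {}
for theta in [1, 0.5, 0, -0.5]:
    transformed = [ladder_transform(x, theta) for x in data]
    skew_by_theta[theta] = skewness(transformed)
    print(f"theta = {theta:>4}: skewness = {skew_by_theta[theta]:.2f}")
```

Moving down the ladder (theta below 1) pulls in the long right tail, so a common approach is to try several exponents and keep the one whose skewness is closest to zero.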
Geometric Mean Mean of the natural logs of the data, transformed back to the original units: exp(mean of ln x) If the logs are normally distributed, the geometric mean is: An estimate of the MEDIAN NOT the mean Example: Mean = 7.40 Median = 0.50 Geometric Mean = 0.62
Outliers Observations that are different from the rest of the observations in the data set May be the most important observations in the data set Example: Antarctic ozone data NEVER throw away outliers Use an alternate method or transform the data
Causes of Outliers Measurement or recording error Solution: identify and fix problem Skewed data Solution: use alternate method or transformation Data from a different population Solution: split into two groups based on science and analyze separately
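The geometric mean is computed by averaging the natural logs and then exponentiating the result. The sketch below shows that calculation next to the standard-library `statistics.geometric_mean`. The data are hypothetical and are not the values behind the slide's 7.40 / 0.50 / 0.62 example.

```python
import math
import statistics

# Hypothetical positively skewed data, for illustration only
data = [0.2, 0.5, 0.9, 1.4, 2.3, 3.8, 6.1, 10.4, 17.0, 45.0]

mean_of_logs = statistics.mean(math.log(x) for x in data)
geo_mean = math.exp(mean_of_logs)     # back-transform to original units

arith_mean = statistics.mean(data)
median = statistics.median(data)

print(f"mean           = {arith_mean:.2f}")
print(f"median         = {median:.2f}")
print(f"geometric mean = {geo_mean:.2f}")
print(f"library check  = {statistics.geometric_mean(data):.2f}")
```

For skewed data like these, the geometric mean lands near the median and well below the arithmetic mean.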