BAE 5333 Applied Water Resources Statistics Biosystems and Agricultural Engineering Department Division of Agricultural Sciences and Natural Resources Oklahoma State University Source Dr. Dennis R. Helsel & Dr. Edward J. Gilroy 2006 Applied Environmental Statistics Workshop and Statistical Methods in Water Resources
TEXTBOOK Free on-line at: http://pubs.usgs.gov/twri/twri4a3/
SOFTWARE Minitab Version 15 Statistical Software http://www.minitab.com/products/minitab/
Chapter 1 SUMMARIZING DATA Numbers and Graphs Choosing a Statistical Method Depends on: Data characteristics Study objectives
Characteristics of Environmental Data Lower bound of zero Presence of outliers, high values Positive skewness Non-normal distribution High variance Data below recording limits Data collected by other people
Categories of Measured Data Continuous: 1.10, 2.56, 100.5, … Discrete: 1, 2, 5, 15 Qualitative, Grouped, Categorical: Site 1, Site 2, Site 3; Below Detection Limit, Above Detection Limit
Histograms Show how many times Y occurs in each of several groups of X. Require grouping of a continuous variable Y-axis: frequency or relative frequency X - Independent Variable Y - Dependent Variable
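The grouping step behind a histogram can be sketched in a few lines of Python. This is a minimal illustration, not the Minitab procedure: the function name `histogram`, the bin width of 5, and the left-closed bin convention are assumptions made for the example, and the six values are the small data set used elsewhere in these slides.

```python
def histogram(values, bin_width, start=0.0):
    """Count values in left-closed bins [lo, lo + bin_width).

    Returns (lo, hi, frequency, relative frequency) for each non-empty bin.
    """
    counts = {}
    for v in values:
        index = int((v - start) // bin_width)  # which bin v falls in
        counts[index] = counts.get(index, 0) + 1
    n = len(values)
    rows = []
    for index in sorted(counts):
        lo = start + index * bin_width
        rows.append((lo, lo + bin_width, counts[index], counts[index] / n))
    return rows


data = [1, 3, 7, 10, 13, 21]
for lo, hi, freq, rel in histogram(data, bin_width=5):
    print(f"{lo:5.1f} to {hi:5.1f}: frequency {freq}, relative frequency {rel:.2f}")
```

Changing `bin_width` changes the apparent shape of the histogram, which is one reason box plots and probability plots are often used alongside it.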
Box Plots Good for continuous data Based on percentiles 50th percentile (median) 50 percent of data below or equal to median
Inter-quartile Range (IQR) A Measure of Variability 1st Quartile (Q1) = 25% of data ≤ this value 2nd Quartile (Q2) = Median = 50% of data ≤ this value 3rd Quartile (Q3) = 75% of data ≤ this value IQR = Q3 - Q1 Example 1: 1 3 7 10 13 21 Q1 (25th percentile) = 2.5 Q3 (75th percentile) = 15 IQR = 15 – 2.5 = 12.5 Example 2: 1, 2, 3, …, 97, 98, 99 Q1 = 25 Q3 = 75 IQR = 75 - 25 = 50
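The quartiles in both examples can be reproduced with a short Python sketch. It assumes the percentile is found by linear interpolation at position p(n + 1) in the sorted data; that convention is an assumption chosen because it reproduces the numbers on this slide, and other software may use a different rule and give slightly different quartiles. The function names are illustrative.

```python
def percentile(data, p):
    """Percentile by linear interpolation at 1-based position p * (n + 1)."""
    x = sorted(data)
    n = len(x)
    pos = p * (n + 1)
    if pos <= 1:
        return float(x[0])
    if pos >= n:
        return float(x[-1])
    k = int(pos)      # rank of the observation just below the position
    frac = pos - k    # how far to move toward the next observation
    return x[k - 1] + frac * (x[k] - x[k - 1])


def iqr(data):
    """Inter-quartile range, Q3 - Q1."""
    return percentile(data, 0.75) - percentile(data, 0.25)


example_1 = [1, 3, 7, 10, 13, 21]
example_2 = list(range(1, 100))  # 1, 2, 3, ..., 99

for name, data in (("Example 1", example_1), ("Example 2", example_2)):
    q1, q3 = percentile(data, 0.25), percentile(data, 0.75)
    print(f"{name}: Q1 = {q1}, Q3 = {q3}, IQR = {iqr(data)}")
```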
Box Plots
Box Plots Outliers Ends of Vertical Lines - Whiskers Whisker – extends to highest or lowest data value within the limit. Upper Limit = Q3 + 1.5 (Q3 - Q1) Lower Limit = Q1 - 1.5 (Q3 - Q1) Q1 = First Quartile, 25th percentile Q3 = Third Quartile, 75th percentile
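The whisker rule above can be sketched in Python as follows. The helper name `box_plot_summary` is hypothetical, and the quartiles come from `statistics.quantiles` with its default "exclusive" method, which is an assumption about the quartile convention rather than something stated on the slide. The sample is the six-value data set from these slides with the recording error (210) included.

```python
from statistics import quantiles


def box_plot_summary(data):
    """Quartiles, outlier limits, whisker ends, and outliers for a box plot."""
    q1, _median, q3 = quantiles(data, n=4)  # default method="exclusive"
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme observations still inside the limits.
    inside = [v for v in data if lower_limit <= v <= upper_limit]
    outliers = [v for v in data if v < lower_limit or v > upper_limit]
    return {
        "q1": q1,
        "q3": q3,
        "lower_limit": lower_limit,
        "upper_limit": upper_limit,
        "lower_whisker": min(inside),
        "upper_whisker": max(inside),
        "outliers": outliers,
    }


summary = box_plot_summary([1, 3, 7, 10, 13, 210])
for key, value in summary.items():
    print(f"{key}: {value}")
```

Note that the limits themselves are not drawn; only the whisker ends and the individual outliers appear on the plot.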
Population vs. Sample Data are samples that we assume represent the characteristics of a population.
Summary Statistics Mean and Standard Deviation
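The formulas on this slide did not survive extraction. For reference, the usual sample definitions are given below; it is an assumption that these are the forms the slide showed (in particular the n - 1 divisor for the sample standard deviation).

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
\qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}
```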
Mean vs. Median Effect of Outliers Data: 1 3 7 10 13 21 Median = 8.5 Mean = 9.2 Suppose an error is made, and the data become: 1 3 7 10 13 210 Median = 8.5 Mean = 40.7 The mean is NOT a “resistant” measure of the center Median and percentiles are generally not sensitive to outliers
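The comparison on this slide can be checked with Python's standard library. This is only a quick verification of the slide's numbers; the variable names are illustrative.

```python
from statistics import mean, median

original = [1, 3, 7, 10, 13, 21]
with_error = [1, 3, 7, 10, 13, 210]  # last value mistyped as 210

for label, data in (("original", original), ("with error", with_error)):
    print(f"{label:>10}: median = {median(data):.1f}, mean = {mean(data):.1f}")
```

A single bad value moves the mean from about 9.2 to about 40.7, while the median stays at 8.5.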
Symmetric vs. Skewed Data Box Plots Approximate Normal Distribution Non-normal Distribution
Positive Skewness Common in Environmental Data
Probability Density Function (pdf) Normal (Gaussian) Distribution X = continuous variable μ = mean σ² = variance
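The density formula itself was lost in extraction. The standard form of the normal pdf, written with the symbols defined on this slide, is shown below; it is assumed to match what the slide displayed.

```latex
f_X(x) = \frac{1}{\sigma\sqrt{2\pi}}
         \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),
\qquad -\infty < x < \infty
```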
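Cumulative Distribution Function Cumulative distribution function, F(b): area under the probability density function f_X(x) to the left of b. The area under f_X(x) from a to b is the probability that X falls between a and b, F(b) - F(a).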
Calculating Probabilities
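As a rough illustration of calculating probabilities from a cumulative distribution function, the sketch below uses Python's `statistics.NormalDist`. The choice of a standard normal distribution and the interval from -1 to 1 are assumptions made for the example; the slide's own worked figures did not survive extraction.

```python
from statistics import NormalDist


def probability_between(dist, a, b):
    """P(a < X <= b) = F(b) - F(a) for a continuous distribution."""
    return dist.cdf(b) - dist.cdf(a)


standard_normal = NormalDist(mu=0.0, sigma=1.0)

p_within_one_sd = probability_between(standard_normal, -1.0, 1.0)
p_below_zero = standard_normal.cdf(0.0)

print(f"P(-1 < X <= 1) = {p_within_one_sd:.4f}")
print(f"P(X <= 0)      = {p_below_zero:.4f}")
```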
Example Distributions Uniform Triangular
Example Distributions Lognormal Exponential
Boxplot vs. Probability Density Function Normal pdf, μ = 0, σ² = 1 Source: http://en.wikipedia.org/wiki/File:Boxplot_vs_PDF.png
Symmetric vs. Skewed Data Histograms and Box Plots Symmetrical Data Approximates a Normal Distribution
Symmetric vs. Skewed Data Histograms and Box Plots Box plot is compressed due to outliers.
Increasing Skewness (top half box width increases)
Cumulative Distribution Functions Histogram of natural log of loads and the resulting empirical cumulative distribution function (CDF) Blue – best fit normal distribution Red – Empirical CDF
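An empirical CDF and a fitted normal CDF can be compared numerically as well as graphically. The sketch below is one possible way to do it, under several assumptions: the load values are made up for illustration (the data behind the figure are not given), the empirical CDF uses the simple i/n step rule, and the "best fit" normal is taken to be the one with the sample mean and standard deviation of the logs.

```python
import math
from statistics import NormalDist, mean, stdev


def empirical_cdf(values):
    """Return (x, fraction of data <= x) pairs for the sorted data."""
    x = sorted(values)
    n = len(x)
    return [(v, (i + 1) / n) for i, v in enumerate(x)]


loads = [1, 3, 7, 10, 13, 21]          # hypothetical loads, illustration only
log_loads = [math.log(v) for v in loads]

# Normal distribution fitted by matching the sample mean and standard deviation.
fitted = NormalDist(mean(log_loads), stdev(log_loads))

for value, fraction in empirical_cdf(log_loads):
    print(f"ln(load) = {value:5.2f}  empirical = {fraction:.2f}  "
          f"fitted normal = {fitted.cdf(value):.2f}")
```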
Probability Plots A theoretical normal distribution plots as a straight line on normal probability paper. If the data also plot as a straight line, they follow a normal distribution.
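The straightness of a normal probability plot can be summarized by the correlation between the sorted data and the corresponding normal quantiles. The sketch below is a minimal version of that idea. The plotting-position formula (i - 0.4)/(n + 0.2) is an assumption (one common choice among several), the function names are hypothetical, and the sample is the small skewed data set from these slides, used here only as a stand-in for concentration data.

```python
import math
from statistics import NormalDist, mean


def normal_scores(n, a=0.4):
    """Standard normal quantiles at plotting positions (i - a) / (n + 1 - 2a)."""
    std = NormalDist()
    return [std.inv_cdf((i - a) / (n + 1 - 2 * a)) for i in range(1, n + 1)]


def probability_plot_correlation(values):
    """Pearson correlation between sorted data and their normal scores."""
    y = sorted(values)
    z = normal_scores(len(y))
    my, mz = mean(y), mean(z)
    syz = sum((yi - my) * (zi - mz) for yi, zi in zip(y, z))
    syy = sum((yi - my) ** 2 for yi in y)
    szz = sum((zi - mz) ** 2 for zi in z)
    return syz / math.sqrt(syy * szz)


raw = [1, 3, 7, 10, 13, 210]            # skewed sample, illustration only
logs = [math.log(v) for v in raw]

print(f"raw data: r = {probability_plot_correlation(raw):.3f}")
print(f"log data: r = {probability_plot_correlation(logs):.3f}")
```

A correlation closer to 1 means the points lie closer to a straight line; here the logs plot straighter than the raw values.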
Concentrations are Not Normally Distributed
Logs of Concentration are Normally Distributed
What to do with skewed data? Data with outliers have a mean that may be larger than 75% of the data If we want a more “typical” measure of the center, we have two choices: Use a different method, i.e. use the median or geometric mean Transform the data
Purpose of Transformations Make data more normal Make data more linear Make data more constant variance
Positive and Negative Skew Source: http://www.georgetown.edu/departments/psychology/researchmethods/statistics/begin.htm
Transformations Using Ladder of Powers
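The ladder of powers figure did not survive extraction, so the sketch below is a hedged reconstruction of the idea rather than the slide's table. It assumes the commonly used rungs x², x, √x, ln x, -1/√x, and -1/x, applies each to a positive, right-skewed sample, and reports the skewness of the result. The sample values and function names are illustrative.

```python
import math


def ladder_transform(x, theta):
    """Power transformation for x > 0; theta = 0 means the natural log.

    Negative powers are negated so that larger x still gives a larger result.
    """
    if theta == 0:
        return math.log(x)
    if theta < 0:
        return -(x ** theta)
    return x ** theta


def skewness(values):
    """Moment coefficient of skewness, m3 / m2**1.5."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


sample = [1, 3, 7, 10, 13, 210]          # right-skewed, illustration only
for theta in (2, 1, 0.5, 0, -0.5, -1):   # assumed rungs of the ladder
    transformed = [ladder_transform(v, theta) for v in sample]
    print(f"theta = {theta:>4}: skewness = {skewness(transformed):6.2f}")
```

Moving down the ladder (smaller theta) pulls in the upper tail and reduces positive skewness; going too far produces negative skewness.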
Geometric Mean Exponential of the mean of the natural logs of the data If the logs are normally distributed, the geometric mean is an estimate of the MEDIAN, NOT the mean Example: Mean = 7.40 Median = 0.50 Geometric Mean = 0.62
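The geometric mean is computed by averaging the natural logs and then transforming back with the exponential. The sketch below shows that calculation. The data behind the slide's example values are not given, so the sample here is the small skewed data set from earlier slides, used only for illustration.

```python
import math
from statistics import mean, median


def geometric_mean(values):
    """exp(mean(ln x)) for strictly positive values."""
    logs = [math.log(v) for v in values]
    return math.exp(mean(logs))


sample = [1, 3, 7, 10, 13, 210]  # right-skewed, illustration only

gm = geometric_mean(sample)
print(f"mean           = {mean(sample):.2f}")
print(f"median         = {median(sample):.2f}")
print(f"geometric mean = {gm:.2f}")
```

For this sample the geometric mean lands much closer to the median than to the arithmetic mean, which is the behavior the slide describes.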
Outliers Observations that are different from the rest of the observations in the data set May be the most important observations in the data set Example: Antarctic ozone data NEVER throw away an outlier Use an alternate method or transform the data
Causes of Outliers Measurement or recording error Solution: identify and fix problem Skewed data Solution: use alternate method or transformation Data from a different population Solution: split into two groups based on science and analyze separately
Reading Assignment Chapter 1 Summarizing Data (pages 1-12) Chapter 2 Graphical Data Analysis (pages 17-64) Statistical Methods in Water Resources by D.R. Helsel and R.M. Hirsch MINITAB Laboratory 1