1 - COURSE APPLICATIONS OF STATISTICS IN WATER QUALITY MONITORING
2 APPLICATIONS IN WATER QUALITY MONITORING Peter Kelderman UNESCO-IHE Institute for Water Education Online Module Water Quality Assessment
3 CONTENTS Necessary water quality monitoring frequency Significant differences between two data sets Seasonal trends Linear correlation ANOVA Sign test Trend detection
5 NECESSARY MONITORING FREQUENCY In an annual water quality monitoring programme measuring phosphate (average = 0.20 mg P/L; $s_x$ = 0.05 mg P/L), it is required that the average is known, with 95% confidence, to within 0.03 mg P/L of this average. How many samples per year must be taken to fulfil this requirement? Assume a normal distribution of the data.
6 Use the t-test formula: 95% confidence interval = $x_{avg} \pm t\, s_x/\sqrt{n}$
So: $t\, s_x/\sqrt{n} = 0.03$ (required maximum distance from the average)
Or: $n = (t\, s_x/0.03)^2$; for n > 10, t ≈ 2.2 (see t-table; 2-tailed)
$n = (2.2 \times 0.05/0.03)^2 \approx 13$ (roughly monthly intervals)
For n = 13, the assumed t value was correct. Because t itself depends on n, the calculation is in principle iterative. Suppose we had started from a guess of n = 6, for which t ≈ 2.6: then $n_{new} = (2.6 \times 0.05/0.03)^2 \approx 19$. A second “iteration” with n = 19 (t ≈ 2.1) gives n ≈ 12, and the estimate settles at about 13.
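The sample-size formula can be tried out with a few lines of Python. This is only a minimal sketch: the function name `samples_needed` is hypothetical, and the t value is typed in by hand from a t-table rather than computed.

```python
def samples_needed(t, s, d):
    """Number of samples n for which t * s / sqrt(n) equals the allowed distance d."""
    return (t * s / d) ** 2

# Phosphate example: s_x = 0.05 mg P/L, allowed distance 0.03 mg P/L,
# t ~ 2.2 (2-tailed, 95% confidence, valid for n > 10).
n_first = samples_needed(t=2.2, s=0.05, d=0.03)
print(f"n = {n_first:.1f}")

# t depends on n, so a different starting guess (here t ~ 2.6, as for n = 6)
# gives a different n and calls for another round with an updated t.
n_other = samples_needed(t=2.6, s=0.05, d=0.03)
print(f"n = {n_other:.1f}")
```

In practice the result is rounded to a workable sampling scheme (for example, monthly sampling), and the t value is checked against the n that comes out.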
7 “In an annual water quality monitoring programme measuring phosphate (average = 0.20 mg P/L; $s_x$ = 0.05 mg P/L), it is required that the average is known, with 95% confidence, to within 0.03 mg P/L of this average”
8 As a water quality manager, you have two “degrees of freedom” to influence the necessary monitoring frequency (for a “fixed” standard deviation):
1. The allowable range around the average. In the above example, if you reduce it from 0.03 to 0.01 mg P/L, the monitoring frequency must be 9 times higher (check this!).
2. The level of confidence (90%, 99%, ... confidence intervals); see the t-table.
Indeed, this is the normal way of estimating frequencies in water quality monitoring programmes!
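The factor 9 can be checked directly from the sample-size formula. The symbol $\Delta$ for the allowable range is introduced here for illustration only, and the check assumes that t stays roughly constant (in practice t decreases slightly as n grows):

```latex
n = \left(\frac{t\, s_x}{\Delta}\right)^2
\quad\Rightarrow\quad
\frac{n_{\Delta = 0.01}}{n_{\Delta = 0.03}}
  = \left(\frac{0.03}{0.01}\right)^2 = 9
```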
9 Example: EU Water Framework Directive (EU-WFD). We discussed items of the EU-WFD before; see e.g. Course 3-4. The EU-WFD sets guidelines for the allowable range around the average (“precision”) and for the confidence required for detecting long-term trends. Temporarily, lower precision and confidence may be accepted, e.g. for socio-economic reasons.
10 Example from EU-WFD: number of river samples per year necessary to estimate PO4 concentrations up to 90%, 75% and 50% precision with 90% confidence. [Table not recovered; surviving label: “Allowable range”]
12 COMPARISON BETWEEN 2 YEARS WQ DATA Significant difference between the 2 years? [Figure: data for year 1 and year 2]
13 Also for detecting a “step trend”, e.g. after wastewater treatment. [Figure: BOD concentration]
14 PROCEDURE Factors playing a role:
Difference between the averages: d
Number of observations: n
Standard deviation: $s_x$
This leads to the formula for the t-test value: $t_{test} = d/(s_x/\sqrt{n})$
Look up this value in the t-table to see whether or not there is a significant difference between the two data sets, and at which significance level (the probability of finding such a difference if the two years were in fact not different).
15 Intermezzo: HYPOTHESIS SETTING
16 In statistics, you often test a certain hypothesis, e.g. “there is a significant relationship between two variables”. In the present example, it would be:
$H_0$ hypothesis: the two years do not have different averages
$H_1$ hypothesis: the two years do have different averages
α: chance that you incorrectly think the years are different (“false positive”)
β: chance that you incorrectly do not detect a difference (“false negative”)
(We will work only with α here.)
Decision       REALITY: $H_0$ is true     REALITY: $H_0$ is not true
Accept $H_0$   (correct decision)         β ("Type 2 error")
Reject $H_0$   α ("Type 1 error")         (correct decision)
17 Example: Given two years of water quality data, use the “pooled t-test” to determine whether the averages for the two years are significantly different at the 90% and 95% confidence levels.
Year 1: $n_x$ = 22; $x_{avg}$ = 9.4 mg/L; $s_x^2$ = 6.2 (mg/L)$^2$
Year 2: $n_y$ = 19; $y_{avg}$ = 8.2 mg/L; $s_y^2$ = 5.4 (mg/L)$^2$
($n_x$, $n_y$ = number of observations; $x_{avg}$, $y_{avg}$ = average values; $s_x^2$, $s_y^2$ = variances of the two data sets)
18 Step 1: find the “pooled variance” $s_w^2$ of the two data sets:
$s_w^2 = \frac{(n_x - 1)\, s_x^2 + (n_y - 1)\, s_y^2}{n_x + n_y - 2} = \frac{21 \times 6.2 + 18 \times 5.4}{39} \approx 5.83$ (mg/L)$^2$
19 Step 2: calculate the t-test value, with the pooled variance $s_w^2 \approx 5.83$ (mg/L)$^2$ from Step 1:
$t_{test} = \frac{x_{avg} - y_{avg}}{\sqrt{s_w^2\,(1/n_x + 1/n_y)}} = \frac{9.4 - 8.2}{\sqrt{5.83 \times (1/22 + 1/19)}} \approx 1.59$
20 22 + 19 = 41 observations; d.f. = 41 − 2 = 39 (two independent data sets; for each, subtract 1)
2-tailed t-test: p = 0.05: t = 2.02; p = 0.1: t = 1.68
Our t value of 1.59 is smaller than both of these t values, so the difference is not significant (p > 0.1).
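The two steps of the pooled t-test can be reproduced from the summary statistics alone. The sketch below is illustrative: the function name `pooled_t` is hypothetical, and the critical t values are entered by hand from a t-table (2-tailed, d.f. = 39) rather than computed.

```python
import math

def pooled_t(n_x, mean_x, var_x, n_y, mean_y, var_y):
    """Return (t value, degrees of freedom, pooled variance) for two independent data sets."""
    df = n_x + n_y - 2
    pooled_var = ((n_x - 1) * var_x + (n_y - 1) * var_y) / df
    std_err = math.sqrt(pooled_var * (1 / n_x + 1 / n_y))
    return (mean_x - mean_y) / std_err, df, pooled_var

# Year 1 and year 2 from the example above.
t_value, df, s_w2 = pooled_t(22, 9.4, 6.2, 19, 8.2, 5.4)

# Critical 2-tailed t values for d.f. = 39, copied from a t-table.
T_CRIT = {0.10: 1.68, 0.05: 2.02}
significant = {p: abs(t_value) > t_crit for p, t_crit in T_CRIT.items()}

print(f"s_w^2 = {s_w2:.2f}, t = {t_value:.2f}, d.f. = {df}")
print(significant)
```

With the example data this gives t ≈ 1.59 at 39 degrees of freedom, below both critical values, so neither significance level is reached.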
21 What if we had found another t value?
$t_{test}$ = …: significant at the “p < 0.1 level”; now check:
$t_{test}$ = …: significant; p < 0.05
$t_{test}$ = …: significant; p < 0.01
$t_{test}$ = …: significant; p < 0.001
23 Oxygen contents in a small ditch, maximum values (often “supersaturation”; > 100% oxygen). Large seasonal variations lead to high $s_w$ values, so that it is not possible to detect trends.
24 Kruskal-Wallis test:
1. Divide the data into four seasons (or 12 months)
2. Rank all data (highest value = rank 1)
3. Sum the ranks for each of the four seasons (e.g. in the above figure: Σ Winter = 580; Σ Spring = 675; Σ Summer = 325; Σ Autumn = …)
4. Apply the Kruskal-Wallis test (comparable with the t-test): are there significant differences between the seasons, and at which level?
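The ranking steps above can be sketched as follows. This is a minimal illustration, not the course's own implementation: the function name `kruskal_wallis_h` and the oxygen values are hypothetical, the standard H statistic is used without a correction for ties, and the critical value is taken by hand from a chi-square table (3 d.f. for four seasons), which is only an approximation for small samples.

```python
def kruskal_wallis_h(groups):
    """Return (H, rank sums), ranking all data together with highest value = rank 1."""
    pooled = sorted((value for group in groups for value in group), reverse=True)
    rank_of = {}
    start = 0
    while start < len(pooled):
        end = start
        while end < len(pooled) and pooled[end] == pooled[start]:
            end += 1
        # Tied values share the mean of the ranks start+1 .. end.
        rank_of[pooled[start]] = (start + 1 + end) / 2
        start = end
    n_total = len(pooled)
    rank_sums = [sum(rank_of[value] for value in group) for group in groups]
    h = 12 / (n_total * (n_total + 1)) * sum(
        r * r / len(group) for r, group in zip(rank_sums, groups)
    ) - 3 * (n_total + 1)
    return h, rank_sums

# Hypothetical oxygen saturation values (%), three per season.
seasons = {
    "winter": [95, 100, 105],
    "spring": [110, 115, 120],
    "summer": [140, 150, 160],
    "autumn": [98, 102, 108],
}
h, rank_sums = kruskal_wallis_h(list(seasons.values()))

CHI2_CRIT_P05_DF3 = 7.81  # chi-square table, 3 d.f., p = 0.05
print(dict(zip(seasons, rank_sums)))
print(f"H = {h:.2f}, significant at p < 0.05: {h > CHI2_CRIT_P05_DF3}")
```

Because H depends only on how far each season's mean rank lies from the overall mean rank, ranking from the lowest value upward would give the same H.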
26 CORRELATION ANALYSIS Measures the strength of association between two variables, expressed as the correlation coefficient r:
r = 1: 100% positive correlation
r = −1: 100% negative correlation
r = 0: no correlation
Used for, e.g.: optimisation of monitoring (reducing the frequency, the number of variables or the number of stations); finding common sources of pollution; etc.
27 EXAMPLES Highly significant correlation chloride – conductivity: leave out Cl⁻ (routinely)? Less significant correlation of conductivity with hardness.
28 Example GEMS programme: correlation between discharge and alkalinity (log-log transformed!). $r^2$ gives the percentage of the variation that can be ascribed to the relationship between the two variables (here: 72.3%).
29 EXAMPLE: Correlation between NH4-N and total N in Kirinya wetland, Uganda. Is this $t_{test}$ value significant? (r value found with EXCEL; see exercise)
30 2-tailed t-test: 12 samples, so d.f. = 12 − 2 = 10 (two dependent data sets). $t_{test}$ = 3.94: highly significant correlation (p < 0.01).
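As a rough check of this example, the sketch below uses the standard relation between a correlation coefficient and its t value, $t = r\sqrt{n-2}/\sqrt{1-r^2}$. That the slide uses this relation is an assumption (it is consistent with d.f. = n − 2 = 10). The r value is back-calculated from t = 3.94 because it is not reproduced in the text; the function names and the hand-copied critical value are illustrative.

```python
import math

def correlation_t(r, n):
    """t value for a correlation coefficient r found from n data pairs (d.f. = n - 2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

def r_from_t(t, n):
    """Inverse relation: the r that corresponds to a given t value."""
    return t / math.sqrt(t * t + n - 2)

r = r_from_t(3.94, 12)            # back-calculated, not taken from the slide
t_value = correlation_t(r, 12)

T_CRIT_P01_DF10 = 3.17            # t-table: 2-tailed, p = 0.01, d.f. = 10
print(f"r = {r:.2f}, t = {t_value:.2f}")
print("significant at p < 0.01:", t_value > T_CRIT_P01_DF10)
```

Under these assumptions, t = 3.94 with 12 samples corresponds to r of about 0.78, which exceeds the p = 0.01 threshold.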
31 BE CAREFUL WITH...
- Data clouds with just a few outliers. Solution could be:
  - Leave out outliers (apply a test, e.g. Dixon Q test)
  - RANK correlation (Spearman rank test; see book Chapman)
- Too many data. For, say, 100 data, already $r^2$ = 0.04 would give a significant (p < 0.05) correlation (only 4% of the variation in the values of the variables explained by the relationship between the two variables!?)
- Non-linear trends: apply non-linear regression.
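Two of these warnings can be illustrated with a short sketch. The data are hypothetical, the helper names (`pearson_r`, `ranks`, `spearman_rho`) are introduced here for illustration, and the critical t value is copied by hand from a t-table. Spearman's rank correlation is computed as the ordinary correlation coefficient of the ranks.

```python
import math

def pearson_r(xs, ys):
    """Ordinary (linear) correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end < len(order) and values[order[end]] == values[order[pos]]:
            end += 1
        for i in order[pos:end]:
            result[i] = (pos + 1 + end) / 2
        pos = end
    return result

def spearman_rho(xs, ys):
    """Spearman rank correlation: correlation coefficient of the ranks."""
    return pearson_r(ranks(xs), ranks(ys))

# Hypothetical data: a slowly rising series with one extreme outlier.
x = [1, 2, 3, 4, 5, 6]
y = [2.0, 2.1, 2.3, 2.2, 2.6, 40.0]
print(f"Pearson r = {pearson_r(x, y):.2f}, Spearman rho = {spearman_rho(x, y):.2f}")

# "Too many data": r^2 = 0.04 (r = 0.2) with n = 100 pairs.
t_100 = 0.2 * math.sqrt(100 - 2) / math.sqrt(1 - 0.04)
T_CRIT_P05_DF98 = 1.98  # t-table: 2-tailed, p = 0.05, d.f. = 98
print(f"t = {t_100:.2f}, significant at p < 0.05: {t_100 > T_CRIT_P05_DF98}")
```

The rank correlation ignores how extreme the outlier is and only uses its position in the ordering, which is why it is less sensitive to a few unusual points.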
32 Example: relationship light penetration, P content, pH, colour of lakes as f(altitude) (see Håkanson, 2006)
33 EXAMPLE: Correlation between TOC and COD in a wastewater. Graph made using EXCEL; r = 0.89? (Correlation holds only for a limited range!)
35 ANOVA In research, there are often sets of data in which there are different “groups” (e.g. different soil types). For each group (soil type), we have e.g. data on P adsorption onto the soil. ANOVA (Analysis Of Variance) looks at the variability of the data and divides this into:
“Between groups variability” (e.g. between different soil types)
“Within groups variability” (differences between replicates)
The ratio of the corresponding mean squares gives the F value; from its value it can be decided whether or not the “groups” are significantly different, and at what significance level. Just as for the t-test, the number of degrees of freedom plays a big role.
36 Example: P adsorption capacity $q_m$ of five land use types in the Migina catchment, Rwanda. Box plots represent the 25th and 75th percentiles; bars represent minimum and maximum values. Only “wetland” soil is significantly different from the rest (one-way ANOVA*; p < 0.001). * One independent variable
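The split into “between groups” and “within groups” variability can be sketched as below. The data are hypothetical (they are not the Migina catchment values), the function name `one_way_anova_f` is illustrative, and the critical F value is copied by hand from an F-table.

```python
def one_way_anova_f(groups):
    """Return (F, d.f. between groups, d.f. within groups) for a list of groups."""
    all_values = [value for group in groups for value in group]
    grand_mean = sum(all_values) / len(all_values)
    ss_between = 0.0
    ss_within = 0.0
    for group in groups:
        group_mean = sum(group) / len(group)
        # Between groups: distance of each group mean from the grand mean.
        ss_between += len(group) * (group_mean - grand_mean) ** 2
        # Within groups: scatter of the replicates around their own group mean.
        ss_within += sum((value - group_mean) ** 2 for value in group)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within

# Hypothetical P adsorption values for three soil types, three replicates each.
soils = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_value, df_b, df_w = one_way_anova_f(soils)

F_CRIT_P05 = 5.14  # F-table: d.f. = (2, 6), p = 0.05
print(f"F = {f_value:.1f} with d.f. = ({df_b}, {df_w})")
print("groups differ at p < 0.05:", f_value > F_CRIT_P05)
```

A significant F value only says that at least one group differs; finding out which one (such as “wetland” in the example above) requires a follow-up comparison between groups.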
38 Another useful (non-parametric) test: the sign test. It compares pairs of data in data sets A and B and determines for how many pairs Data A > Data B (and for how many pairs Data A < Data B, or the two are equal). Is data set A significantly larger/smaller than B? Especially useful if there are large variations in data sets A and B (e.g. seasonal trends).
39 Example: monitoring sediment resuspension at two stations (STA and STB) in Lake Markermeer, Netherlands; this was done by using “sediment traps” hung near the bottom and at half-depth. Is STA different from STB? Are there differences between “bottom” and “half-depth” traps?
40 Very large variations over the season; however, the data can be compared in pairs, since they were monitored on the same days.
Comparing bottom traps: resuspension at STB always (n = 7) > STA, so “values at STB > STA” (p < 0.05)
Comparing bottom with “half-depth” resuspension: bottom values always (n = 14) > half-depth values, so “values at bottom > half-depth” (p < 0.01)
Significance levels are a function of n and of the “paired differences” (>, <, =); look these up in a “sign test table”.
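Instead of a sign test table, the significance level can be computed from the binomial distribution. The sketch below is illustrative: the function name `sign_test_p` is hypothetical, a two-sided test is assumed, and pairs with equal values are assumed to have been dropped beforehand.

```python
from math import comb

def sign_test_p(n_plus, n_minus):
    """Two-sided sign test p-value for n_plus pairs with A > B and n_minus pairs with A < B."""
    n = n_plus + n_minus
    k = min(n_plus, n_minus)
    # Chance of a split at least this one-sided if ">" and "<" were equally likely.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

p_stations = sign_test_p(7, 0)    # STB > STA in all 7 pairs
p_depths = sign_test_p(14, 0)     # bottom > half-depth in all 14 pairs
print(f"stations: p = {p_stations:.4f}; depths: p = {p_depths:.5f}")
```

For the two comparisons above this agrees with the stated levels: p < 0.05 for the 7 station pairs and p < 0.01 for the 14 depth pairs.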
42 TREND DETECTION See book Chapman, Ch. 10. Many types of trends exist. A simple t-test can be used for trend detection, but there are also more rigorous statistical tests; these are outside the scope of these lectures.
43 Example: BOD trend in 77 New Zealand rivers. Highly significant decreasing trend in BOD from … . For details, see MTM IV, page … .
44 Trends will more likely be detected for higher monitoring frequency
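One way to see this is to rearrange the t-test formula $t_{test} = d/(s_x/\sqrt{n})$ from the comparison of two years. The symbols $d_{min}$ (smallest detectable difference) and $t_{crit}$ (table value at the chosen significance level) are introduced here for illustration, and the sketch assumes independent samples and a roughly constant $t_{crit}$:

```latex
d_{min} = t_{crit}\,\frac{s_x}{\sqrt{n}}
\qquad\Rightarrow\qquad
\frac{d_{min}(4n)}{d_{min}(n)} \approx \frac{1}{2}
```

Under these assumptions, taking four times as many samples roughly halves the smallest difference that can still be detected.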
45 We have only discussed basic statistical procedures. In this Course unit, you will also find an EXCEL exercise dealing with the topics we have just highlighted. We did not deal with more advanced statistical tools such as multivariate techniques and cluster analysis; these two techniques will be highlighted in two (optional) presentations.
46 Literature Statistics
D. Chapman (1996). Water Quality Assessments - A Guide to Use of Biota, Sediments and Water in Environmental Monitoring. London, Chapman and Hall.
D.C. Montgomery & G.C. Runger (2003). Applied Statistics and Probability for Engineers. New York, Wiley and Sons.
47 FURTHER READING Proceedings “Monitoring tailor-made”
MTM II, p. 153 (trend assessment New Zealand waters)
MTM II, p. 287 (North Sea optimization; see Course 3.5)
MTM III, p. 113 (Design monitoring programme)
MTM III, p. 307 (Trend detection)
MTM III, p. 323 (Multivariate techniques)
MTM IV, p. 207 (Trend detection; see before)