Statistical Tool Kit
Geoff Phillips & Heliana Teixeira
The latest version of the Best Practice Guide and the tool kit (TKit_2017Sept19.zip) are available at: https://circabc.europa.eu/w/browse/2be04871-b0ba-4789-a36f-05f47f153ee7
Outline
- Overview of the tool kit
- Brief summary of results of testing
- Significant changes following comments
  - Excel Tool
  - R Scripts
- Possible further developments
- Comparison of the different approaches using artificial data sets
Overview of the tool kit
Excel Tool
- Simple assessment of outliers
- Type I & II regression
- Categorical analysis (distributions of nutrient concentrations within class)
- Minimisation of mis-match
R Scripts
- Assessment of outliers & linearity
- Use of co-plots to identify interactions
- Multivariate regression
- Categorical analysis
- Bivariate logistic regression
- Minimisation of mis-match (including bootstrapping to assess uncertainty)
Toolkit feedback
- All respondents used the Excel tool
  - Identification of outliers was complicated
  - Difficulties with units (the tool was designed for freshwater & P)
  - Most found that the regression methods worked successfully
  - Issue with the categorical method: selection of data (either all the data or only data within the linear range)
  - Most liked the minimisation of mis-match method, although there were a few issues with how it scaled the data
- Most tried the R scripts
  - More difficult, particularly for those unfamiliar with R
  - Few tried the Shiny version (not clear why)
- Generally the toolkit was well received
  - Excel is simple but less flexible
Testing MS data sets (Lakes & Rivers)
- Results from 13 countries: Lakes (10), Rivers (4)
- Wide range of R2 values for regressions
Figure: Range of R2 values obtained by country
Figure: Range of R2 for phosphorus (Lake TP, River TP or SolP)
Figure: Range of R2 for nitrogen (Lake TN, River TN or NO3-N)
Collated results for Lake TP and River TP or Soluble P by Broad Type where R2 > 0.3
- Relatively wide range of predicted P boundary values (when R2 > 0.3)
- Too few results for rivers; difficulty identifying broad types
- Results provided a useful test for the tool kit, but did not provide information to help determine boundary ranges for broad types
Testing MS data sets (Transitional & Coastal Waters)
- Results from 8 countries
  - Transitional waters: IE; UK; FR; RO
  - Coastal waters: IE; UK; FI; SP; GR; FR; RO; SE
- Often national types
- BQE: phytoplankton; opportunistic macroalgae
- Nutrients: DIN; TN; NO3; PO4; TP; OrthoP; N/P ratio
- Wide range of R2 values for regressions
- Categorical results are not presented in this overview, only regression summaries
Figure: Range of R2 values obtained by country
Range of R2 per nutrient across GIGs
Figure panels: Coastal waters; Transitional waters
- Different nutrients (and parameters) across GIGs and water categories
G/M results for types where R2 > 0.3
Figure panels: Baltic CW; Mediterranean CW; NEA CW & TW; Black Sea TW
- Few results within common types for comparing predicted G/M boundary values
- Only one value resulting from a relationship with BQE opportunistic macroalgae (IE)
- Results provided a useful test for the tool kit, but were not sufficient to help determine boundary ranges for common types
Excel Tool Modifications v6c Modified data input tab so that the last record used for regression is separated from the last used for categorical analysis
Excel Tool Modifications v6c
- Axis labels now taken from a cell
- Macro is used to scale graphs (.xlsm file)
Excel Tool Modifications v6c
- Axis labels taken from cell B2
- Scaling: min and max values & number of bins used
Excel Tool Modifications v6c Categorical method includes Wilcoxon Rank Sum Test to check there are significant differences in distribution of nutrient concentration between adjacent classes
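The check between adjacent classes can be sketched as follows. This is a minimal Python illustration of a one-sided rank-sum test, not the Excel tool's own implementation. It assumes a normal approximation with no tie or continuity correction, which is rough for small samples. The function name and the example concentrations are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def rank_sum_test(lower_class, upper_class):
    """One-sided Wilcoxon rank-sum (Mann-Whitney) test, normal approximation.

    Returns (U, p). A small p suggests that nutrient concentrations in
    upper_class tend to exceed those in lower_class.
    Assumption of this sketch: no tie or continuity correction.
    """
    n1, n2 = len(upper_class), len(lower_class)
    # U counts the pairs in which the upper-class value is the larger one
    # (ties count as half a pair)
    u = sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in upper_class
        for b in lower_class
    )
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    p = 1.0 - NormalDist(mean_u, sd_u).cdf(u)
    return u, p

# Hypothetical nutrient concentrations, for illustration only
high = [12, 15, 18, 20, 22, 25, 27, 30]
good = [24, 28, 33, 36, 41, 45, 52, 60]
u, p = rank_sum_test(high, good)
```

A small p here would support treating Good as having higher concentrations than High; the same call can be repeated for Good against Moderate.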
R Script Modifications
- Difficult to make R scripts fully reliable; better to treat them as examples
- The Shiny application produced by Gabor Varbiro might be the best way for non-experts to apply them
Key minor changes
- Included some lines to check the field names used in the data file
- Produced 2 copies of the script, one for N and one for P
- Included additional optional lines of code for different units (mg/l, mmol etc.)
- Increased the number of decimals to allow for different units
(Changes can introduce errors, so the new scripts may produce problems!)
Additional R Scripts
Conditioning plots (TKit_CoPlot.R)
- It is often helpful to look at the relationships between EQR and nutrients for different levels of other potentially limiting nutrients, for example by categorising data by N:P ratio
Fig A11 Relationship between EQR for phytoplankton and a) total phosphorus (log10) for different ranges of the N:P ratio
Additional R Scripts
Categorical methods (TKit_P_Categorical.R)
- Visualisation using box plots
- Wilcoxon test

class   Average quartiles   Average median   75th percentile
High    27.75               32.5             26.90
Good    52.70               54.2             53.95

Fig A23 Box plot showing range of nutrient concentration by WFD class; width of box is proportional to the number of records in the class. The probability that Good > High and Moderate > Good is shown (Wilcoxon test)
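The three categorical estimates in the table can be sketched as below. This is a Python illustration rather than the TKit_P_Categorical.R script. It assumes that "average quartiles" means the mean of the upper quartile of the better class and the lower quartile of the next class, and that "average median" means the mean of the two class medians. The function name and data are hypothetical.

```python
from statistics import median, quantiles

def categorical_boundaries(better_class, worse_class):
    """Three simple estimates of the nutrient boundary between adjacent classes."""
    # quantiles(..., n=4) returns the 25th, 50th and 75th percentiles
    _, _, q75_better = quantiles(better_class, n=4, method="inclusive")
    q25_worse, _, _ = quantiles(worse_class, n=4, method="inclusive")
    return {
        # Assumed definition: mean of the adjacent quartiles of the two classes
        "average_quartiles": (q75_better + q25_worse) / 2,
        # Assumed definition: mean of the medians of the two classes
        "average_median": (median(better_class) + median(worse_class)) / 2,
        # Upper quartile (75th percentile) of the better class
        "upper_quartile": q75_better,
    }

# Hypothetical nutrient concentrations for two adjacent classes
good = [20, 30, 40, 50, 60]
moderate = [50, 60, 70, 80, 90]
estimates = categorical_boundaries(good, moderate)
```

Each estimate uses only the records in the two classes either side of the boundary, which is why these methods are sensitive to where the bulk of the data lies.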
Additional R Scripts
Minimisation of mis-match (Gabor Varbiro, modified by GP)
Specify the bootstrap iterations and sampling size:
    Itt <- 50     # Set value for number of iterations used to estimate variability, e.g. 50
    Prop <- 0.75  # Set proportion of data used for each iteration of simulation
- Experimental script
- Currently rather slow to run (Gabor is making some modifications to this script which should increase its speed)
- Fits lines using a Loess fit; the results depend on the number of bins used
- An alternative approach may be to use a logistic regression method
Fig A24 Relationship between the percentage of mis-classified records, comparing biological and nutrient classifications, and the value of the nutrient boundary. Vertical lines mark the range of cross-over points where the mis-classification is minimised, together with the mean nutrient concentration (each line shows a sub-sample of the data set selected at random)
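The idea of repeating the mis-match search on random sub-samples can be sketched as follows. This is a simplified Python illustration, not the toolkit script: it searches a grid of candidate boundaries for the lowest total mis-match rate, instead of fitting Loess lines and locating their cross-over. The `itt` and `prop` arguments mirror `Itt` and `Prop` above; all other names, the EQR boundary of 0.6 as a default, and the example records are assumptions.

```python
import random

def mismatch_rate(records, boundary, eqr_boundary=0.6):
    """Share of records whose biological and nutrient classes disagree."""
    mismatched = 0
    for nutrient, eqr in records:
        bio_good = eqr >= eqr_boundary          # biological class
        nutrient_good = nutrient <= boundary    # nutrient class
        if bio_good != nutrient_good:
            mismatched += 1
    return mismatched / len(records)

def best_boundary(records, candidates):
    """Candidate nutrient boundary with the lowest mis-match rate."""
    return min(candidates, key=lambda b: mismatch_rate(records, b))

def bootstrap_boundaries(records, candidates, itt=50, prop=0.75, seed=1):
    """Repeat the search on `itt` random sub-samples, each `prop` of the data."""
    rng = random.Random(seed)
    size = int(len(records) * prop)
    return [
        best_boundary(rng.sample(records, size), candidates)
        for _ in range(itt)
    ]

# Hypothetical (nutrient, EQR) records, for illustration only
records = [(50, 0.8), (70, 0.7), (90, 0.5), (110, 0.4)]
candidates = [60, 80, 100]
estimates = bootstrap_boundaries(records, candidates, itt=5)
```

The spread of `estimates` gives a rough picture of how sensitive the boundary is to which records happen to be in the sample.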
Additional R Scripts
- Important to check that there are sufficient bootstrap iterations to achieve convergence
Fig A25 Example of convergence of the estimated mean in comparison to the number of iterations
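One way to check convergence is to follow the running mean of the bootstrap estimates. The Python sketch below is illustrative only; the `window` and `tolerance` values are hypothetical and would need to be chosen for the data and units in use.

```python
def running_means(estimates):
    """Mean of the first k bootstrap estimates, for k = 1..len(estimates)."""
    means, total = [], 0.0
    for k, value in enumerate(estimates, start=1):
        total += value
        means.append(total / k)
    return means

def has_converged(estimates, window=10, tolerance=1.0):
    """True if the running mean varies by less than `tolerance`
    across its last `window` values."""
    means = running_means(estimates)
    if len(means) <= window:
        return False  # too few iterations to judge
    recent = means[-window:]
    return max(recent) - min(recent) < tolerance
```

If `has_converged` returns False, the number of iterations can be increased and the check repeated.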
Additional R Scripts
Binomial logistic regression (adapted from a script provided by Andreas Müller, Germany)
Fig A26 Binomial logistic regression of the probability of being in moderate or worse status on total nitrogen. Lines show potential boundary values at different probabilities of being moderate or worse.
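The link between a fitted logistic model and a boundary value can be sketched as below. This is a Python illustration, not the R script referred to above: it fits a two-parameter logistic model by Newton-Raphson and then inverts it to find the nutrient concentration at a chosen probability of being moderate or worse. The function names and data are hypothetical, and the sketch assumes the classes overlap (the fit does not converge for perfectly separated data).

```python
from math import exp, log

def fit_logistic(x, y, iterations=25):
    """Fit P(y=1) = 1 / (1 + exp(-(b0 + b1*x))) by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        # Gradient (g0, g1) and information matrix terms (h00, h01, h11)
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * step = g
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

def boundary_at(probability, b0, b1):
    """Nutrient value at which P(moderate or worse) equals `probability`."""
    return (log(probability / (1.0 - probability)) - b0) / b1

# Hypothetical data: nutrient value and status (1 = moderate or worse)
nutrient = [1, 2, 3, 4, 5, 6, 7, 8]
status = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1 = fit_logistic(nutrient, status)
boundaries = [boundary_at(p, b0, b1) for p in (0.25, 0.5, 0.75)]
```

Choosing a lower probability gives a more precautionary (lower) boundary. If the model is fitted to log-transformed concentrations, the boundary has to be back-transformed.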
Further developments
Use of R package modEvA
- Heliana has also been experimenting with this R package, which uses the output from the GLM logistic model to produce confusion matrices (numbers of false negative and false positive classifications) and different approaches such as minimisation of false negatives, minimisation of false positives, or minimum difference
- Raises questions about what we are seeking to minimise

Confusion matrix
                      Nutrient (predicted)
                      + G        - NG
Biology (observed)    TP(1,1)    FN(0,1)    n+1
                      FP(1,0)    TN(0,0)    n+0
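The counts in the confusion matrix can be illustrated with a short Python sketch. It does not call modEvA. It assumes that "good" is the positive case, that biology gives the observed class and the nutrient boundary gives the predicted class, and it shows only the minimum-difference criterion, taken here as the smallest gap between false negatives and false positives. The function names, the EQR boundary of 0.6 as a default, and the records are hypothetical.

```python
def confusion_counts(records, boundary, eqr_boundary=0.6):
    """Count TP, FN, FP and TN, treating 'good' as the positive case."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for nutrient, eqr in records:
        observed_good = eqr >= eqr_boundary      # biology (observed)
        predicted_good = nutrient <= boundary    # nutrient (predicted)
        if observed_good and predicted_good:
            counts["TP"] += 1
        elif observed_good:
            counts["FN"] += 1
        elif predicted_good:
            counts["FP"] += 1
        else:
            counts["TN"] += 1
    return counts

def minimum_difference_boundary(records, candidates):
    """Candidate boundary where false negatives and false positives are closest."""
    def difference(boundary):
        counts = confusion_counts(records, boundary)
        return abs(counts["FN"] - counts["FP"])
    return min(candidates, key=difference)

# Hypothetical (nutrient, EQR) records, for illustration only
records = [(50, 0.8), (70, 0.7), (90, 0.5), (110, 0.4)]
counts = confusion_counts(records, 60)
chosen = minimum_difference_boundary(records, [60, 80, 100])
```

Minimising false negatives or false positives on their own pushes the boundary towards one end of the gradient, which is one reason the choice of criterion matters.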
Fundamentally two approaches
- Regression modelling (including quantile regression)
  - Uses all the data, not only that within the status class of interest
  - Dependent on linearity (unless non-linear models are used)
  - Issues regarding the use of type I or type II models
- Categorical methods (including binomial logistic regression)
  - Only use data for the status class of interest
  - Ignore the variability of nutrient concentration within the class
- When relationships are strong, most methods produce similar results, particularly if the mean EQR is close to the boundary of interest
- Particular issue when scatter plots show “wedge” shaped relationships (other factors influence the nutrient response); this may be common for relationships in rivers
Comparing categorical & regression approaches
Strong relationship
- Regression: P is 83 µg/l at EQR 0.6
- Categorical methods
  - 75th percentile: lower value
  - Average median: similar
  - Average quartiles: similar
Working with artificial data set
- Random, normally distributed set of P concentration values with a given mean & standard deviation
- Predict a “true” EQR using a known regression model: EQR ~ aP + c
- Add random error, normally distributed with a mean of 0 and different standard deviations, to generate a typical “observed” EQR: EQR ~ aP + c + Error
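The construction of the artificial data can be sketched in Python as follows. The slides do not give the values of a and c or the standard deviations, so those used here are hypothetical, chosen only so that the "true" EQR is 0.6 at 83 µg/l. The R2 function shows how the added error controls the strength of the relationship.

```python
import random

def make_synthetic(n=200, mean_p=83.0, sd_p=30.0, a=-0.004, c=0.932,
                   sd_error=0.05, seed=1):
    """Generate (P, observed EQR) pairs from EQR = a*P + c + Error.

    a, c, sd_p and sd_error are hypothetical values for illustration;
    with a = -0.004 and c = 0.932 the true EQR is 0.6 at P = 83.
    """
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        p = rng.gauss(mean_p, sd_p)                  # random P concentration
        true_eqr = a * p + c                         # known regression model
        observed = true_eqr + rng.gauss(0.0, sd_error)  # add random error
        records.append((p, observed))
    return records

def r_squared(records):
    """Squared correlation between P and observed EQR."""
    n = len(records)
    mean_p = sum(p for p, _ in records) / n
    mean_e = sum(e for _, e in records) / n
    sxx = sum((p - mean_p) ** 2 for p, _ in records)
    syy = sum((e - mean_e) ** 2 for _, e in records)
    sxy = sum((p - mean_p) * (e - mean_e) for p, e in records)
    return sxy * sxy / (sxx * syy)

strong = make_synthetic(sd_error=0.05)
noisy = make_synthetic(sd_error=0.2)
```

Increasing `sd_error` lowers R2 without changing the underlying model, so the "true" boundary stays the same while the scatter grows.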
Comparing categorical & regression approaches
Noisy relationship
- Regression: 83 µg/l
- Categorical methods
  - 75th percentile: difference smaller
  - Average median: similar
  - Average quartiles: similar
- Categorical methods only consider data in the Good & Moderate range
- Not influenced by linearity or by outliers at the ends of the gradient
Exploration of methods using synthetic data sets
- Created a series of synthetic data sets
  - 200 records
  - 10 random sets of data for each of 10 different P mean values (50 – 170 µg/l)
  - Predicted EQR values with 10 different levels of error
- Applied all methods to each data set
- Generated 1000 sets of estimated good/moderate threshold values
- Compared the ranges of values by variability (categories of R2) and mean phosphorus concentration
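The simulation design (10 mean values, 10 error levels and 10 random sets, giving 1000 data sets) can be sketched in Python as below. Only the regression method is applied here, as an ordinary least squares fit of EQR on P. The model coefficients, the standard deviation of P, the error levels and the even spacing of the mean values are all hypothetical.

```python
import random

def regression_boundary(p, eqr, eqr_boundary=0.6):
    """P at which an OLS line of EQR on P crosses `eqr_boundary`."""
    n = len(p)
    mean_p, mean_eqr = sum(p) / n, sum(eqr) / n
    sxx = sum((x - mean_p) ** 2 for x in p)
    sxy = sum((x - mean_p) * (y - mean_eqr) for x, y in zip(p, eqr))
    slope = sxy / sxx
    intercept = mean_eqr - slope * mean_p
    return (eqr_boundary - intercept) / slope

def run_grid(a=-0.004, c=0.932, sd_p=30.0, n_records=200, seed=1):
    """Estimate the boundary for every combination in the design."""
    rng = random.Random(seed)
    mean_values = [50 + i * 120 / 9 for i in range(10)]   # 50 to 170 (assumed spacing)
    error_levels = [0.02 * (i + 1) for i in range(10)]    # hypothetical error SDs
    results = []
    for mean_p in mean_values:
        for sd_error in error_levels:
            for _ in range(10):                            # 10 random sets
                p = [rng.gauss(mean_p, sd_p) for _ in range(n_records)]
                eqr = [a * x + c + rng.gauss(0.0, sd_error) for x in p]
                results.append((mean_p, sd_error, regression_boundary(p, eqr)))
    return results

results = run_grid()

# Range of estimated boundaries for each error level
by_error = {}
for _, sd_error, boundary in results:
    by_error.setdefault(sd_error, []).append(boundary)
ranges = {sd: max(values) - min(values) for sd, values in by_error.items()}
```

The same loop could call any of the categorical methods in place of `regression_boundary`, so that the ranges can be compared method by method.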
Figure: Range of predicted P at the Good/Moderate boundary for artificial data with increasing variability (R2). The dotted line shows the “true” boundary (83 µg/l).
Effect of range of data (mean P 50 – 140 µg/l)
- Where the mean of the data is below the boundary, categorical methods underestimate the boundary
- Where the mean of the data is above the boundary, categorical methods overestimate the boundary
- Differences increase as scatter increases (lower R2)
- The 75th percentile of good shows the most extreme range
Figure: Range of predicted P at the Good/Moderate boundary for artificial data with increasing variability (R2)
- Binary logistic regression and the minimisation of mis-match methods are the most stable of the categorical methods
Conclusion from synthetic data sets
- Linear regression provided good estimates of the boundary for all values of R2
- Binary logistic regression performed at least as well
- Minimisation of mis-match was only slightly influenced by the variability and range of the data
- The other categorical methods produced relatively wide ranges of estimated threshold values, particularly the use of the 75th percentile of good
- These conclusions were drawn from synthetic data that conformed to the requirements of linear regression, but they suggest that regression, binary logistic regression and the minimisation of mis-match are the most reliable techniques, provided the data scatter does not show evidence of a “wedge” shape (evidence of multiple pressures?)
(More about this later)
Problem with wedge shaped data
- Wedge shaped data may occur for many reasons, and fitting OLS regression lines to such data may not produce useful models for determining boundary values
- We are more interested in fitting upper or lower quantiles, or in using upper quantiles of the nutrient distribution within a class
Summary
- The tool kit provides a range of tools which can be used
- Selecting the correct method is potentially difficult
- We recommend using regression methods rather than categorical methods, as they use data across the range of pressure
- However, categorical methods, particularly binomial logistic regression, may be useful where data are clearly not linear
- We have not solved the problem of interpreting data where multiple pressures may generate wedge shaped relationships