The Statistical Tool Kit determination of valid nutrient boundary values Geoff Phillips.

Objective methods
- Regression models: a numerical model (relationship) between nutrient concentration and EQR
- Categorical methods: the range of nutrients observed in water bodies of different class
- Minimisation of mis-match: nutrient class boundary values that minimise the difference in classification between the biology and nutrient quality elements

Tool Kit
- Excel spreadsheet (simple and quick)
  - Simple methods to detect and remove outliers and to select the linear range of data
  - Comparison of type I and type II regression models; v4 is modified to contain the same range of univariate linear regression models (includes reduced major axis regression)
  - Categorical methods
- Set of scripts for use in R (more flexible)
  - Examples of methods to detect outliers
  - GAM and segmented model fits to assist selection of the linear range of data
  - Inclusion of multivariate models with N and P
- Comments received from AT, FI, FR, IT, LV, UK: generally supportive. Key issues raised:
  - Is type II regression needed?
  - Should EQR or nutrient concentration be the dependent variable?

Simple ordinary least squares (OLS) linear regression
- Scatter plot of EQR v log10 TP concentration
- Dependent variable is EQR; predictor variable is log10[TP]
- We assume that TP causes EQR
- Fit a line by minimising the sum of squared distances between the observed and fitted values of y for given values of x
- Assume this is an asymmetric relationship: we know the TP which gives rise to the observed EQR
- EQR = β0 + β1 log10[TP] + E
- OLS is appropriate for an asymmetric relationship; it assumes that the error variance of Y >> error variance of X, which is likely to be true because of the model error ε
- Error E = measurement error δTP + model error ε
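As a rough illustration of the OLS fit described on this slide, the sketch below fits EQR on log10[TP] in plain Python. It is not the tool kit's Excel or R implementation; the data values and the helper name `ols` are hypothetical and chosen only to make the example run.

```python
import math

# Hypothetical illustrative data (NOT from the presentation):
# growing-season mean TP (µg/L) and EQR for ten water bodies.
tp = [12, 18, 25, 30, 41, 55, 62, 78, 90, 110]
eqr = [0.92, 0.85, 0.80, 0.70, 0.66, 0.58, 0.61, 0.45, 0.42, 0.30]

def ols(x, y):
    """Return (intercept, slope) minimising squared vertical distances in y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope

log_tp = [math.log10(v) for v in tp]

# Model on the slide: EQR = b0 + b1 * log10[TP] + E
b0, b1 = ols(log_tp, eqr)
print(f"EQR = {b0:.3f} + ({b1:.3f}) * log10(TP)")
```

The fitted line always passes through the point (mean log10[TP], mean EQR), which is why the two OLS lines discussed later intersect there.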

But! We are predicting TP for a given EQR
- The dependent variable should be log10[TP]; the independent variable should be EQR
- We wish to use our data to determine the likely range of TP at different EQR values
- This is more likely to be a symmetric relationship, as EQR is unlikely to cause TP concentration
- log10[TP] = β0 + β1 EQR + E
- The error variance for Y (now [TP]), δTP, is less likely to be >> that for X, δEQR
- Error = measurement error δTP (no equation error)
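A minimal sketch of the inverted model, with log10[TP] as the dependent variable, and of reading off a TP value at the Good/Moderate boundary. The data and the helper `ols` are hypothetical, and the boundary EQR of 0.6 is taken from later slides.

```python
import math

# Hypothetical illustrative data (NOT from the presentation).
tp = [12, 18, 25, 30, 41, 55, 62, 78, 90, 110]          # µg/L
eqr = [0.92, 0.85, 0.80, 0.70, 0.66, 0.58, 0.61, 0.45, 0.42, 0.30]

def ols(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope

log_tp = [math.log10(v) for v in tp]

# Model on the slide: log10[TP] = a0 + a1 * EQR + E
a0, a1 = ols(eqr, log_tp)

GM_EQR = 0.6                                # Good/Moderate boundary EQR
tp_at_gm = 10 ** (a0 + a1 * GM_EQR)         # back-transform from log10
print(f"Predicted TP at EQR {GM_EQR}: {tp_at_gm:.1f} µg/L")
```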

Invert the axis for comparison: red line OLS of TP v EQR compared with black line OLS of EQR v TP
- We are minimising the variation on the X axis (i.e. log TP)
- The outcome is a much steeper slope
- Lines intersect at mean log TP and mean EQR
- Likely to produce lower TP boundary values at the G/M (Good/Moderate) boundary
- The choice of dependent and independent variable is critical to the prediction of the slope and intercept of our model, and thus of the boundaries
- Caused by the choice of variable used to minimise variation, i.e. the response variable contains all the variability
- Difference in slope is a function of r²: slope of OLS Y on X = r² × slope of OLS X on Y (with both slopes expressed on the same axes)
- For r² < 0.7 the differences start to become important to boundary values
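The relationship between the two OLS slopes can be checked numerically. The sketch below (hypothetical data) computes both slopes, redraws the X-on-Y line on the Y v X axes by taking its reciprocal, and confirms that the Y-on-X slope equals r² times that inverted slope.

```python
import math

# Hypothetical illustrative data (NOT from the presentation).
tp = [12, 18, 25, 30, 41, 55, 62, 78, 90, 110]          # µg/L
eqr = [0.92, 0.85, 0.80, 0.70, 0.66, 0.58, 0.61, 0.45, 0.42, 0.30]

x = [math.log10(v) for v in tp]   # X = log10[TP]
y = eqr                           # Y = EQR

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))

b_yx = sxy / sxx                  # slope of OLS Y on X (EQR on log TP)
b_xy = sxy / syy                  # slope of OLS X on Y (log TP on EQR)
b_xy_inverted = 1 / b_xy          # the X-on-Y line drawn on Y v X axes
r2 = sxy ** 2 / (sxx * syy)

print(f"r2 = {r2:.3f}")
print(f"Y on X slope            = {b_yx:.4f}")
print(f"X on Y slope (inverted) = {b_xy_inverted:.4f}")
print(f"r2 * inverted slope     = {r2 * b_xy_inverted:.4f}")
```

Because r² ≤ 1, the inverted X-on-Y line is always at least as steep as the Y-on-X line, and the gap widens as r² falls.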

Invert the axis for comparison: red line OLS of TP v EQR compared with black line OLS of EQR v TP
- True slope lies between these lines
- Where is most of the error?
  - EQR measurement
  - TP measurement
  - Structural or equation error (only in EQR?)
- Sample data suggest CV of TP >> CV of EQR, but the metric is annual/growing-season mean TP
- Need to consider replicate annual mean values, not the CV of sample results
- Estimate the variation without seasonality (importance of using mean TP, not spot TP concentrations)
- (Figure: TP concentration by month)
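To illustrate why the CV of spot samples overstates the uncertainty of the metric actually used, the sketch below compares the CV of individual monthly samples with the CV of replicate annual means. The monthly values and year labels are hypothetical, with a seasonal pattern built in.

```python
import statistics

# Hypothetical monthly TP (µg/L) for one water body over three years
# (NOT from the presentation); each year has a strong seasonal cycle.
monthly_tp = {
    "year 1": [20, 22, 30, 45, 60, 75, 80, 70, 50, 35, 25, 21],
    "year 2": [25, 27, 35, 50, 65, 80, 85, 75, 55, 40, 30, 26],
    "year 3": [17, 19, 27, 42, 57, 72, 77, 67, 47, 32, 22, 18],
}

def cv(values):
    """Coefficient of variation: sample standard deviation / mean."""
    return statistics.stdev(values) / statistics.mean(values)

all_spot = [v for year in monthly_tp.values() for v in year]
annual_means = [statistics.mean(year) for year in monthly_tp.values()]

cv_spot = cv(all_spot)          # includes seasonality
cv_annual = cv(annual_means)    # seasonality averaged out

print(f"CV of spot samples : {cv_spot:.2f}")
print(f"CV of annual means : {cv_annual:.2f}")
```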

Assume variation in both TP and EQR: Type II regression
- Minimise variation of both X and Y
- Several methods:
  - Orthogonal, or major axis (MA), regression: error variances of Y and X are equal, ratio λ = 1
  - Geometric mean, standardised major axis (SMA) or reduced major axis regression (don't confuse with ranged major axis regression): the ratio of error variances is assumed proportional to the variance ratio of the observations x and y, i.e. λ = s²y / s²x. Calculated as the geometric average of the slopes of the OLS regressions of Y on X and X on Y
- Issues with the averaging approach:
  - If there is no relationship (random values of x and y), the OLS fitted lines are the means of x and y, with slopes of 0 and approaching ∞, yet the geometric average of the slopes is still sy / sx (= 1 when the variances are equal)
  - Can't test whether the slope is significantly > 0
  - Don't use if r < 0.6 (Smith 2009)
- Smith, R.J., Use and misuse of the reduced major axis for line-fitting. Am J Phys Anthropol 140,

Assume variation in both TP and EQR: Type II regression
- Ranged major axis regression (RMA)
  - An orthogonal (MA) regression, but with the data standardised by ranging (0 – 1)
  - Overcomes the limitation of MA (requires equal scales of X and Y) and is no longer determined simply by the ratio s²y / s²x, as in geometric mean regression
  - Still assumes error variances of Y and X are equal, ratio λ = 1
- RMA is the most appropriate method; implemented in R and in v4 of the Excel workbook
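The three type II slopes can be sketched with the standard textbook formulas: the major axis slope from the variances and covariance, the SMA slope as the signed ratio of standard deviations, and the ranged major axis slope as an MA fit on 0 – 1 ranged data scaled back by the ratio of ranges. This is an assumed implementation for illustration, not the code in the tool kit's R scripts or workbook; the data and function names are hypothetical.

```python
import math

# Hypothetical illustrative data (NOT from the presentation).
tp = [12, 18, 25, 30, 41, 55, 62, 78, 90, 110]          # µg/L
eqr = [0.92, 0.85, 0.80, 0.70, 0.66, 0.58, 0.61, 0.45, 0.42, 0.30]
x = [math.log10(v) for v in tp]
y = eqr

def moments(x, y):
    """Sample variances of x and y and their covariance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x) / (n - 1)
    syy = sum((b - my) ** 2 for b in y) / (n - 1)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    return sxx, syy, sxy

def ma_slope(x, y):
    """Major axis (orthogonal) slope; undefined when the covariance is 0."""
    sxx, syy, sxy = moments(x, y)
    d = syy - sxx
    return (d + math.sqrt(d * d + 4 * sxy * sxy)) / (2 * sxy)

def sma_slope(x, y):
    """Standardised (reduced) major axis slope: sign(r) * sy / sx."""
    sxx, syy, sxy = moments(x, y)
    return math.copysign(math.sqrt(syy / sxx), sxy)

def ranged_ma_slope(x, y):
    """MA fitted to data ranged to 0-1, then scaled back to original units."""
    rx, ry = max(x) - min(x), max(y) - min(y)
    xr = [(a - min(x)) / rx for a in x]
    yr = [(b - min(y)) / ry for b in y]
    return ma_slope(xr, yr) * ry / rx

sxx, syy, sxy = moments(x, y)
b_yx = sxy / sxx              # OLS Y on X
b_xy_inverted = syy / sxy     # OLS X on Y, drawn on the Y v X axes

for name, slope in [("OLS Y on X", b_yx), ("MA", ma_slope(x, y)),
                    ("SMA", sma_slope(x, y)), ("Ranged MA", ranged_ma_slope(x, y)),
                    ("OLS X on Y (inverted)", b_xy_inverted)]:
    print(f"{name:22s} slope = {slope:.4f}")
```

For each method the intercept follows from the fact that the line passes through the means: intercept = mean(y) − slope × mean(x).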

Summary of regression methods
- Unless r² is high (> 0.7), the choice of dependent variable will influence the slope and thus the predicted boundary values
- True slope lies between the extremes of OLS of EQR on nutrient and OLS of nutrient on EQR
- RMA regression is proposed as the best predictor of boundaries, but the range will lie between the values predicted by the two OLS approaches
- Compare all 3 methods

Data Template: convenient format for R and the Excel workbook
- Need to use summary data, not spot samples
- Spatial and temporal match
- Water bodies from the same or similar types

Data need to extend across the EQR gradient
- Example with a reasonable relationship:
  - Data extend from High to Poor biological status
  - Adequate data above and below the Good/Moderate boundary (EQR of 0.6)
- Example with a less clear relationship:
  - Only a single point below the Good/Moderate boundary (EQR of 0.6)
  - More data needed to fit a reliable model
- (Figures: scatter plots against TP concentration, µg/L)

Is the response linear?
- Select the linear portion of the data:
  - By eye
  - Fit a GAM model (Tkit_check_data.R)
  - Use segmented regression
- Example data are linear in the range TP 10 – 100 µg/L
- Consider excluding outliers

Excel tool – Data table
- Enter data in columns B:D
- Sort data into ascending order of nutrient concentration
- Enter the range of records to be used
- Enter the boundary EQR values and a record of the data used (BQE, type, nutrient etc.)

Excel tool – PR Plot
- Plots for OLS of EQR v nutrient, OLS of nutrient v EQR and ranged major axis regression are shown (v4; geometric mean regression is shown in the distributed v2)
- Open circles show data not used

Excel tool – Results
- A summary of the regression models is provided in sheet Results
- 3 models are shown:
  1. OLS EQR v log10 nutrient
  2. OLS log10 nutrient v EQR
  3. Geometric mean (SMA) regression (average slope of models 1 & 2)
- The sheet also calculates the residuals of the model and their interquartile range. The range is used to determine where 50% of the observed data lie in comparison to the best fit line
- v4 now also includes a 4th model, ranged major axis regression

Use of residuals
- By adding the 25th and 75th percentiles (lower and upper quartiles) of the residuals to the regression model, lines parallel to the best fit line mark boundaries below/above which lie 25% of the data
- These lines can be used to provide estimates of the likely range of TP concentrations at the G/M boundary (EQR 0.6):
  - At 35 µg/L, 75% of WBs had EQR ≥ 0.6
  - At 72 µg/L, 25% of WBs had EQR ≥ 0.6
  - The most likely boundary would be 53 µg/L, the intersection with the best fit line: 50% of WBs had EQR ≥ 0.6, 50% < 0.6
- For model 1 (OLS EQR v TP) the G/M boundary for TP would fall in the range
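The residual-quartile procedure can be sketched as follows: fit OLS of EQR on log10[TP], shift the line by the lower and upper quartiles of the residuals, and solve each of the three parallel lines for TP at EQR 0.6. The data are hypothetical, so the resulting values will not match the 35, 53 and 72 µg/L quoted on the slide, and the quartile method (`inclusive` interpolation) is an assumption.

```python
import math
import statistics

# Hypothetical illustrative data (NOT from the presentation).
tp = [12, 18, 25, 30, 41, 55, 62, 78, 90, 110]          # µg/L
eqr = [0.92, 0.85, 0.80, 0.70, 0.66, 0.58, 0.61, 0.45, 0.42, 0.30]
x = [math.log10(v) for v in tp]

# OLS of EQR on log10[TP] (model 1 in the workbook).
n = len(x)
mx, my = sum(x) / n, sum(eqr) / n
b1 = sum((a - mx) * (b - my) for a, b in zip(x, eqr)) / sum((a - mx) ** 2 for a in x)
b0 = my - b1 * mx

# Residuals and their lower / upper quartiles.
resid = [e - (b0 + b1 * a) for a, e in zip(x, eqr)]
q25, _, q75 = statistics.quantiles(resid, n=4, method="inclusive")

GM_EQR = 0.6

def tp_at(eqr_value, shift=0.0):
    """Solve eqr_value = b0 + shift + b1 * log10(TP) for TP."""
    return 10 ** ((eqr_value - b0 - shift) / b1)

tp_lower = tp_at(GM_EQR, q25)   # line shifted down: 75% of residuals lie above it
tp_best = tp_at(GM_EQR)         # best fit line
tp_upper = tp_at(GM_EQR, q75)   # line shifted up: 25% of residuals lie above it

print(f"G/M boundary: {tp_lower:.0f} - {tp_upper:.0f} µg/L, most likely {tp_best:.0f} µg/L")
```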

Summary of results
- The most likely boundary value is taken from the prediction using the ranged major axis regression model (v4): G/M = 49 µg/L
- The most likely boundary value range is taken from predictions using the two OLS models: G/M = µg/L
- The possible boundary value range is taken from predictions using the upper and lower quantiles of the fit + residuals of the two OLS models: G/M = µg/L

Categorical methods: average adjacent quartiles
- Calculate the median and interquartile range of nutrient concentration in each biological class
- Average the upper quartile (75th percentile) of Good with the lower quartile (25th percentile) of Moderate

Categorical methods: average adjacent medians
- Calculate the median and interquartile range of nutrient concentration in each biological class
- Average the median of Good with the median of Moderate

Categorical methods: upper quartile (75th percentile) of the higher class
- Issues with box plots:
  - Perhaps easier to use for non-linear data, but lower quantiles are influenced by outliers
  - Better to restrict data to the linear region, derived using scatter plots
  - Need to decide if a log transform is appropriate (used in the Excel workbook)
  - No estimate of uncertainty
  - Similar approach to OLS where nutrient is the dependent variable
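The three categorical estimates of the G/M boundary can be sketched in a few lines. The TP values per class are hypothetical, the quartiles use `inclusive` interpolation as an assumption, and untransformed concentrations are used for simplicity (the slide notes that the Excel workbook applies a log transform).

```python
import statistics

# Hypothetical TP concentrations (µg/L) grouped by biological class
# (NOT from the presentation).
tp_good = [18, 22, 25, 30, 33, 38, 41, 47]
tp_moderate = [35, 44, 52, 58, 63, 70, 85, 96]

def quartiles(values):
    """Return (lower quartile, median, upper quartile)."""
    q25, q50, q75 = statistics.quantiles(values, n=4, method="inclusive")
    return q25, q50, q75

good_q25, good_med, good_q75 = quartiles(tp_good)
mod_q25, mod_med, mod_q75 = quartiles(tp_moderate)

# 1. Average adjacent quartiles: upper quartile of Good, lower quartile of Moderate.
adjacent_quartiles = (good_q75 + mod_q25) / 2
# 2. Average adjacent medians.
adjacent_medians = (good_med + mod_med) / 2
# 3. Upper quartile of the higher (Good) class.
upper_quartile_good = good_q75

print(f"Average adjacent quartiles: {adjacent_quartiles:.1f} µg/L")
print(f"Average adjacent medians  : {adjacent_medians:.1f} µg/L")
print(f"Upper quartile of Good    : {upper_quartile_good:.1f} µg/L")
```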

Minimise mis-match of classifications
- Binary classifications of biology and nutrients (good or better, and moderate or worse)
- Use a logarithmic series of nutrient concentrations to define nutrient class
- Plot the rate of mis-classification
- The point of intersection identifies the nutrient boundary concentration for minimum mis-classification
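A minimal sketch of the mis-match approach, under assumptions: biology is "good or better" when EQR ≥ 0.6, a nutrient class is "good or better" when TP is at or below the candidate boundary, and the intersection is taken as the first candidate at which the two mis-classification rates are closest. The data, the candidate series and the function name are hypothetical.

```python
# Hypothetical illustrative data (NOT from the presentation).
tp = [12, 18, 25, 30, 41, 55, 62, 78, 90, 110]          # µg/L
eqr = [0.92, 0.85, 0.80, 0.70, 0.66, 0.58, 0.61, 0.45, 0.42, 0.30]

GM_EQR = 0.6
# Logarithmic series of candidate boundaries, 10 to about 158 µg/L.
candidates = [10 ** (1 + 0.02 * k) for k in range(61)]

def mismatch_rates(boundary):
    """Return the two mis-classification rates for a candidate TP boundary.

    First value : biology good or better, nutrient moderate or worse.
    Second value: biology moderate or worse, nutrient good or better.
    """
    n = len(tp)
    bio_good_nut_bad = sum(1 for t, e in zip(tp, eqr) if e >= GM_EQR and t > boundary)
    bio_bad_nut_good = sum(1 for t, e in zip(tp, eqr) if e < GM_EQR and t <= boundary)
    return bio_good_nut_bad / n, bio_bad_nut_good / n

def gap(boundary):
    """Absolute difference between the two rates (0 at the intersection)."""
    a, b = mismatch_rates(boundary)
    return abs(a - b)

boundary = min(candidates, key=gap)
rate_a, rate_b = mismatch_rates(boundary)
print(f"Boundary = {boundary:.1f} µg/L, mis-classification rates {rate_a:.2f} and {rate_b:.2f}")
```

With small data sets the two rates are step functions, so a range of candidates can share the same rates; the sketch simply reports the lowest of them.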

Minimisation of mis-match of class
- Benefits of method:
  - No modelling, so simple to understand
  - Not sensitive to outliers and non-linear data
  - Directly shows rate of mis-classification
  - No bias
- Disadvantages:
  - No modelling, so not a general case
  - No uncertainty analysis (might be possible to explore by Monte Carlo simulation of subsets of data)

Multivariate regression models
- EQR is influenced by N and P
- Multivariate model (NP.mod0): EQR = a1 log10[TP] + a2 log10[TN] + C0 + Error
- Univariate model (P.mod0): EQR = a3 log10[TP] + C1 + Error
- Univariate model (N.mod0): EQR = a4 log10[TN] + C2 + Error
- Example using high alkalinity very shallow lakes (N, CB, EC GIGs)
- TN and TP not significantly correlated (VIF < 2)
- Comparison of N.mod, NP.mod and P.mod by AIC: NP.mod has the lowest AIC and is the best model
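For a model with only two predictors, the variance inflation factor reduces to VIF = 1 / (1 − r²), where r is the correlation between the predictors, so the VIF < 2 criterion on the slide corresponds to r² < 0.5 between log10[TP] and log10[TN]. The sketch below computes this for hypothetical data; it does not reproduce the lake data set used in the presentation.

```python
import math

# Hypothetical TP and TN concentrations (µg/L) for eight lakes
# (NOT from the presentation).
tp = [20, 35, 50, 28, 80, 45, 60, 33]
tn = [900, 600, 1500, 1200, 800, 2000, 1000, 700]

x1 = [math.log10(v) for v in tp]
x2 = [math.log10(v) for v in tn]

n = len(x1)
m1, m2 = sum(x1) / n, sum(x2) / n
s11 = sum((a - m1) ** 2 for a in x1)
s22 = sum((b - m2) ** 2 for b in x2)
s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))

r2 = s12 ** 2 / (s11 * s22)   # squared correlation between the predictors
vif = 1 / (1 - r2)            # two-predictor VIF, identical for TP and TN

print(f"r2 between predictors = {r2:.3f}, VIF = {vif:.2f}")
print("Collinearity acceptable (VIF < 2)" if vif < 2 else "Predictors too collinear")
```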

Multivariate regression models
- Two predictor variables, so results are difficult to interpret
- Calculate a matrix of TN and TP values that would predict an EQR of 0.6 (G/M boundary)
- Overlay as a contour line on a scatter plot of TP v TN
- Use the intersection of the best fit line relating TP to TN (fitted using RMA regression) with the contour to determine pairs of boundary values for TP and TN
- (More about situations where N or P is likely to be the limiting nutrient later)
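Because the multivariate model is linear in log10[TP] and log10[TN], the EQR = 0.6 contour is a straight line in log space, and its intersection with the TP v TN best fit line can be solved directly. The sketch below assumes the models have already been fitted; every coefficient shown is a hypothetical placeholder, not a value from the presentation.

```python
import math

# Hypothetical fitted coefficients (NOT from the presentation).
# Multivariate model: EQR = a1*log10[TP] + a2*log10[TN] + c0
a1, a2, c0 = -0.35, -0.25, 1.95
# Best fit line relating the nutrients: log10[TN] = k + m*log10[TP]
m, k = 0.6, 2.0

GM_EQR = 0.6

def tn_on_contour(tp):
    """TN (µg/L) that gives EQR = GM_EQR for a given TP (µg/L)."""
    log_tn = (GM_EQR - c0 - a1 * math.log10(tp)) / a2
    return 10 ** log_tn

# Points along the G/M contour, for overlaying on a TP v TN scatter plot.
contour = [(tp, tn_on_contour(tp)) for tp in (20, 40, 60, 80, 100)]

# Intersection of the contour with the TP v TN line:
# GM_EQR = a1*x + a2*(k + m*x) + c0, solved for x = log10[TP].
log_tp_b = (GM_EQR - c0 - a2 * k) / (a1 + a2 * m)
log_tn_b = k + m * log_tp_b
tp_boundary, tn_boundary = 10 ** log_tp_b, 10 ** log_tn_b

for tp, tn in contour:
    print(f"contour: TP {tp:5.0f} µg/L -> TN {tn:7.0f} µg/L")
print(f"Boundary pair: TP = {tp_boundary:.0f} µg/L, TN = {tn_boundary:.0f} µg/L")
```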

Summary
- Tool kit provides a range of regression approaches
  - Propose that ranged major axis regression provides the most unbiased model
  - Range of true slope, and thus most likely values, defined by predictions from the two OLS models
- Categorical analysis is an alternative approach
  - Still influenced by outliers and non-linear relationships
- Minimisation of mis-match is a potentially useful approach
- Data quality is a key factor
  - Should cover the full range of pressures
  - Nutrient data need to be a summary statistic (e.g. mean)
  - Need to be checked for linearity and for outliers
- Outstanding issues
  - How to deal with data showing weak relationships

Comments
- Tool kit needs moving to a separate section (Appendix)
- Specific comments:
  - Clearer definition of outliers
  - Is type II regression needed?
  - Need to consider N & P together
  - Other variables not included (e.g. turbidity, flushing)
  - Tool kit aimed at freshwater (units in template?)
  - R scripts need more explanation
