The Statistical Tool Kit: determination of valid nutrient boundary values. Geoff Phillips.

Objective methods
- Regression models: numerical model (relationship) between nutrient concentration and EQR.
- Categorical methods: range of nutrients observed in water bodies of different class.
- Minimisation of mis-match: nutrient class boundary values that minimise the difference in classification between biology and nutrient quality elements.

Tool Kit
- Excel spreadsheet
  - Simple methods to detect and remove outliers and to select the linear range of data
  - Comparison of type I and type II regression models
  - v4 modified to now contain the same range of univariate linear regression models (includes reduced major axis regression)
  - Categorical methods
- Set of scripts for use in R
  - Examples of methods to detect outliers
  - GAM and segmented model fits to assist selection of the linear range of data
  - Inclusion of multivariate models with N and P
- Excel is simple and quick; R is more flexible.
- Comments received from AT, FI, FR, IT, LV, UK. Generally supportive; key issues raised:
  - Is type II regression needed?
  - Should EQR or nutrient concentration be the dependent variable?

Simple ordinary least squares (OLS) linear regression
- Scatter plot of EQR v log10 TP concentration.
- Dependent variable is EQR; predictor variable is log10[TP].
- We assume that TP causes EQR.
- Fit a line by minimising the sum of squared distances of the observed values of y from the line for given values of x.
- Assume this is an asymmetric relationship: we know the TP which gives rise to the observed EQR.
- EQR = β0 + β1 log10[TP] + E
- OLS is appropriate for an asymmetric relationship. It assumes that the error variance of Y >> error variance of X, which is likely to be true because of model error ε.
- Error E = measurement error δTP + model error ε
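The OLS fit described on this slide can be sketched in a few lines of Python. This is only an illustration, not the tool kit's own Excel or R implementation; the data values and the names `tp`, `eqr` and `ols` are assumptions made for the example.

```python
import math

# Hypothetical data (NOT from the presentation): one mean TP (µg/L)
# and one EQR value per water body.
tp = [10, 15, 22, 30, 45, 60, 80, 100]
eqr = [0.90, 0.95, 0.70, 0.80, 0.55, 0.62, 0.40, 0.45]

def ols(x, y):
    """Return (intercept, slope) minimising squared vertical distances in y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return my - slope * mx, slope

log_tp = [math.log10(v) for v in tp]
b0, b1 = ols(log_tp, eqr)            # EQR = b0 + b1 * log10(TP)

# Invert the fitted line to read off TP at the G/M boundary (EQR = 0.6).
tp_at_gm = 10 ** ((0.6 - b0) / b1)
print(f"EQR = {b0:.3f} + {b1:.3f} * log10(TP); TP at EQR 0.6 = {tp_at_gm:.1f} µg/L")
```

Note that the boundary is obtained here by inverting a model in which EQR is the dependent variable, which is the choice questioned on the next slide.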

But! We are predicting TP for a given EQR
- The dependent variable should be log10[TP]; the independent variable should be EQR.
- We wish to use our data to determine the likely range of TP at different EQR values.
- This is more likely to be a symmetric relationship, as EQR is unlikely to cause TP concentration.
- log10[TP] = β0 + β1 EQR + E
- The error variance for Y (now [TP]), δTP, is less likely to be >> than that for X, δEQR.
- Error E = measurement error δTP (no equation error)

Invert the axis for comparison
- Red line: OLS of TP v EQR, compared with black line: OLS of EQR v TP.
- For the red line we are minimising the variation on the X axis (i.e. log TP).
- The outcome is a much steeper slope.
- Lines intersect at mean log TP and mean EQR.
- Likely to produce lower TP boundary values at the G/M boundary.
- The choice of dependent and independent variable is critical to the prediction of the slope and intercept of our model, and thus of the boundaries. This is caused by the choice of variable used to minimise variation, i.e. the response variable is assumed to contain all the variability.
- The difference in slope is a function of r²: slope of OLS Y on X = r² × slope of the OLS X on Y line when both are drawn on the same axes.
- For R² < 0.7 the differences start to become important to boundary values.
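The relationship between the two OLS lines can be checked numerically. The sketch below uses hypothetical data (not from the presentation) and illustrative variable names; it fits both regressions, redraws the second on the EQR v log TP axes, and compares the TP predicted at EQR = 0.6.

```python
import math

# Hypothetical data (NOT from the presentation).
tp = [10, 15, 22, 30, 45, 60, 80, 100]                  # mean TP, µg/L
eqr = [0.90, 0.95, 0.70, 0.80, 0.55, 0.62, 0.40, 0.45]
x = [math.log10(v) for v in tp]
y = eqr

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

slope_y_on_x = sxy / sxx      # OLS of EQR on log10(TP)
slope_x_on_y = sxy / syy      # OLS of log10(TP) on EQR
r2 = sxy ** 2 / (sxx * syy)

# Drawn on the EQR v log10(TP) axes, the second line has slope 1 / slope_x_on_y.
inverted = 1 / slope_x_on_y

def tp_at(eqr_value, slope):
    """TP where a line through (mx, my) with the given slope reaches eqr_value."""
    return 10 ** (mx + (eqr_value - my) / slope)

tp_model1 = tp_at(0.6, slope_y_on_x)   # EQR as dependent variable
tp_model2 = tp_at(0.6, inverted)       # log10(TP) as dependent variable
print(f"r2 = {r2:.3f}, slopes {slope_y_on_x:.3f} and {inverted:.3f}")
print(f"TP at EQR 0.6: {tp_model1:.1f} v {tp_model2:.1f} µg/L")
```

Because both lines pass through the means, the steeper line gives the lower TP at EQR = 0.6 only when 0.6 lies below the mean EQR, as it does for these example data.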

Invert the axis for comparison
- Red line: OLS of TP v EQR, compared with black line: OLS of EQR v TP.
- True slope lies between these lines.
- Where is most of the error?
  - EQR measurement
  - TP measurement
  - Structural or equation error (only in EQR?)
- Sample data suggest CV of TP >> CV of EQR, but the metric is annual/growing season mean TP.
- Need to consider replicate annual mean values, not the CV of sample results.
- Estimate the variation without seasonality (importance of using mean TP, not spot TP concentrations).
[Figure: TP concentration by month]

Assume variation in both TP and EQR: Type II regression
- Minimise variation of both X and Y.
- Several methods:
  - Orthogonal, major axis (MA) regression: error variances of Y and X are equal, ratio λ = 1.
  - Geometric mean, standardised major axis (SMA) or reduced major axis regression (don't confuse with ranged major axis regression): the ratio of error variances is proportional to the variance ratio of the observations x and y, i.e. λ = s²y / s²x. Calculated as the geometric average of the slopes of the OLS regressions of Y on X and X on Y.
- Issues with the averaging approach:
  - If there is no relationship (random values of x and y), the OLS fitted lines are the means of x and y, with slopes of 0 and approaching ∞, yet the geometric average of the slopes = 1.
  - Cannot test whether the slope is significantly > 0.
  - Don't use if r < 0.6 (Smith 2009).
Smith, R.J., 2009. Use and misuse of the reduced major axis for line-fitting. Am J Phys Anthropol 140, 476-486.
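The geometric mean (SMA) slope can be illustrated as follows. The data and variable names are hypothetical, and this is a sketch rather than the tool kit's code. The slope of the X on Y regression is first redrawn on the Y v X axes before the geometric average is taken.

```python
import math

# Hypothetical data (NOT from the presentation).
tp = [10, 15, 22, 30, 45, 60, 80, 100]                  # mean TP, µg/L
eqr = [0.90, 0.95, 0.70, 0.80, 0.55, 0.62, 0.40, 0.45]
x = [math.log10(v) for v in tp]
y = eqr

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

ols_y_on_x = sxy / sxx             # slope of EQR on log10(TP)
ols_x_on_y_inverted = syy / sxy    # slope of log10(TP) on EQR, on the same axes

# Geometric mean of the two slopes, carrying the sign of the covariance.
sma_slope = math.copysign(math.sqrt(ols_y_on_x * ols_x_on_y_inverted), sxy)
sma_intercept = my - sma_slope * mx

# The same magnitude follows directly from the ratio of standard deviations,
# so it does not shrink towards zero when the correlation is weak.
sd_ratio = math.sqrt(syy / sxx)
print(f"SMA slope = {sma_slope:.3f}, sd ratio = {sd_ratio:.3f}")
```

Since the slope magnitude depends only on the two standard deviations, it cannot by itself indicate whether any relationship exists, which is the issue raised on the slide.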

Assume variation in both TP and EQR: Type II regression
- Ranged major axis regression (RMA):
  - An orthogonal (MA) regression, but with the data standardised by ranging (0 to 1).
  - Overcomes the limitation of MA (which requires equal scales of X and Y), and the slope is no longer determined simply by the ratio s²y / s²x as in geometric mean regression.
  - Still assumes that the error variances of Y and X are equal, ratio λ = 1.
- RMA is the most appropriate method; it is implemented in R and in v4 of the Excel workbook.
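A ranged major axis fit can be sketched as below. This is an illustration under stated assumptions, not the tool kit's R or Excel code: the data are hypothetical, the helper names are invented for the example, and ranging is taken to mean (value − min) / (max − min) for both variables.

```python
import math

# Hypothetical data (NOT from the presentation).
tp = [10, 15, 22, 30, 45, 60, 80, 100]                  # mean TP, µg/L
eqr = [0.90, 0.95, 0.70, 0.80, 0.55, 0.62, 0.40, 0.45]
x = [math.log10(v) for v in tp]
y = eqr

def ranged(values):
    """Scale values to 0-1 and return them with the original range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values], hi - lo

def major_axis_slope(u, v):
    """Slope of the orthogonal (major axis) regression of v on u.

    Undefined when the covariance is exactly zero.
    """
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    return (svv - suu + math.sqrt((svv - suu) ** 2 + 4 * suv ** 2)) / (2 * suv)

xr, x_range = ranged(x)
yr, y_range = ranged(y)

# Fit on the ranged data, then back-transform the slope to original units.
rma_slope = major_axis_slope(xr, yr) * y_range / x_range
rma_intercept = sum(y) / len(y) - rma_slope * sum(x) / len(x)

tp_at_gm = 10 ** ((0.6 - rma_intercept) / rma_slope)
print(f"RMA: EQR = {rma_intercept:.3f} + {rma_slope:.3f} * log10(TP)")
print(f"TP at EQR 0.6 = {tp_at_gm:.1f} µg/L")
```

The back-transformed line passes through the means of x and y, and its slope lies between the two OLS slopes.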

Summary of regression methods
- Unless R² is high (> 0.7), the choice of dependent variable will influence the slope and thus the predicted boundary values.
- True slope lies between the extremes of OLS of EQR on nutrient and OLS of nutrient on EQR.
- RMA regression is proposed as the best predictor of boundaries, but the range will lie between the values predicted by the two OLS approaches.
- Compare all 3 methods.

Data
- Template: convenient format for R and the Excel workbook.
- Need to use summary data, not spot samples.
- Spatial and temporal match.
- Water bodies from same or similar types.

Data need to extend across the EQR gradient
- Reasonable relationship: data extend from High to Poor biological status, with adequate data above and below the Good/Moderate boundary (EQR of 0.6).
- Not such a clear relationship: only a single point below the Good/Moderate boundary (EQR of 0.6). More data needed to fit a reliable model.
[Figure axis: TP concentration, µg/L]

Is the response linear?
- Select the linear portion of the data:
  - by eye
  - fit a GAM model (Tkit_check_data.R)
  - use segmented regression
- Example data are linear in the range TP 10 to 100 µg L⁻¹.
- Consider excluding outliers.

Excel tool: Data table
- Enter data in columns B:D.
- Sort data into ascending order of nutrient concentration.
- Enter the range of records to be used.
- Enter the boundary EQR values and a record of the data used (BQE, type, nutrient, etc.).

Excel tool: PR Plot
- Plots for OLS of EQR v nutrient, OLS of nutrient v EQR and ranged major axis regression are shown (v4; geometric mean regression is shown in the distributed v2).
- Open circles show data not used.

Excel tool: Results
- Summaries of the regression models are provided in sheet Results.
- 3 models are shown:
  1. OLS EQR v log10 nutrient
  2. OLS log10 nutrient v EQR
  3. Geometric mean (SMA) regression (geometric average slope of models 1 & 2)
- The sheet also calculates the residuals of the model and their interquartile range. The range is used to determine where 50% of the observed data lie in comparison to the best fit line.
- v4 now also includes a 4th model: ranged major axis regression.

Use of residuals
- Adding the 25th and 75th percentiles of the residuals to the regression model gives lines parallel to the best fit line, which mark the boundaries below/above which 25% of the data lie.
- These lines can be used to provide estimates of the likely range of TP concentrations at the G/M boundary (EQR 0.6):
  - at 35 µg L⁻¹, 75% of WBs had EQR ≥ 0.6
  - at 72 µg L⁻¹, 25% of WBs had EQR ≥ 0.6
  - the most likely boundary would be 53 µg L⁻¹, the intersection with the best fit line, where 50% of WBs had EQR ≥ 0.6 and 50% < 0.6
- For model 1 (OLS EQR v TP) the G/M boundary for TP would fall in the range 35 to 72 µg L⁻¹.
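The residual approach can be sketched as follows for model 1 (OLS of EQR on log10 TP). The data and names are hypothetical, and the percentile method (`statistics.quantiles` with `method="inclusive"`) is an assumption; the Excel workbook may interpolate quartiles differently.

```python
import math
import statistics

# Hypothetical data (NOT from the presentation).
tp = [10, 15, 22, 30, 45, 60, 80, 100]                  # mean TP, µg/L
eqr = [0.90, 0.95, 0.70, 0.80, 0.55, 0.62, 0.40, 0.45]
x = [math.log10(v) for v in tp]

# Model 1: OLS of EQR on log10(TP).
n = len(x)
mx, my = sum(x) / n, sum(eqr) / n
b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, eqr))
      / sum((xi - mx) ** 2 for xi in x))
b0 = my - b1 * mx

# Residuals and their 25th / 75th percentiles.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, eqr)]
q25, _, q75 = statistics.quantiles(residuals, n=4, method="inclusive")

def tp_where_eqr(target, offset=0.0):
    """TP at which the line shifted vertically by `offset` reaches `target` EQR."""
    return 10 ** ((target - b0 - offset) / b1)

best = tp_where_eqr(0.6)           # best fit line
lower = tp_where_eqr(0.6, q25)     # line with 75% of the data above it
upper = tp_where_eqr(0.6, q75)     # line with 25% of the data above it
print(f"G/M boundary: {best:.0f} µg/L (range {lower:.0f} to {upper:.0f})")
```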

Summary of results
- The most likely boundary value is taken from the prediction using the ranged major axis regression model (v4): G/M = 49 µg L⁻¹.
- The most likely boundary value range is taken from predictions using the two OLS models: G/M = 43 to 53 µg L⁻¹.
- The possible boundary value range is taken from predictions using the upper and lower quantiles of the fit + residuals of the two OLS models: G/M = 32 to 72 µg L⁻¹.

Categorical methods: average adjacent quartiles
- Calculate the median and interquartile range of nutrient concentration in each biological class.
- Average the upper quartile (75th percentile) of Good with the lower quartile (25th percentile) of Moderate.

Categorical methods: average adjacent medians
- Calculate the median and interquartile range of nutrient concentration in each biological class.
- Average the median of Good with the median of Moderate.
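Both categorical calculations (adjacent quartiles and adjacent medians) reduce to a few summary statistics per class. The sketch below uses hypothetical TP values and untransformed concentrations for simplicity; the names, the quartile interpolation method and the absence of a log transform are all assumptions for illustration.

```python
import statistics

# Hypothetical mean TP (µg/L) for water bodies in two biological classes
# (NOT from the presentation).
good = [12, 18, 22, 25, 30, 34, 40, 55]
moderate = [28, 38, 45, 52, 60, 70, 85, 110]

def quartiles(values):
    """Return the 25th, 50th and 75th percentiles of values."""
    q25, q50, q75 = statistics.quantiles(values, n=4, method="inclusive")
    return q25, q50, q75

_, median_good, q75_good = quartiles(good)
q25_moderate, median_moderate, _ = quartiles(moderate)

# Average adjacent quartiles: upper quartile of Good with lower quartile of Moderate.
boundary_quartiles = (q75_good + q25_moderate) / 2

# Average adjacent medians: median of Good with median of Moderate.
boundary_medians = (median_good + median_moderate) / 2

print(f"G/M boundary (adjacent quartiles): {boundary_quartiles:.1f} µg/L")
print(f"G/M boundary (adjacent medians):   {boundary_medians:.1f} µg/L")
```

If a log transform is judged appropriate, the same steps can be applied to log10 concentrations and the result back-transformed.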

Categorical methods: upper quartile (75th percentile) of higher class
- Issues with box plots:
  - Perhaps easier to use for non-linear data, but lower quantiles are influenced by outliers.
  - Better to restrict data to the linear region, derived using scatter plots.
  - Need to decide if a log transform is appropriate (used in the Excel workbook).
  - No estimate of uncertainty.
  - Similar approach to OLS where nutrient is the dependent variable.

Minimise mis-match of classifications
- Binary classifications of biology and nutrients (good or better, and moderate or worse).
- Use a logarithmic series of nutrient concentrations to define nutrient class.
- Plot the rate of mis-classification.
- The point of intersection identifies the nutrient boundary concentration for minimum mis-classification.
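One way to read this procedure is sketched below. The data are hypothetical, and the definition of the two mis-classification rates (each expressed as a fraction of all water bodies) is an assumption for illustration; the tool kit may define the rates differently.

```python
# Hypothetical data (NOT from the presentation).
tp = [10, 15, 22, 30, 45, 60, 80, 100]                  # mean TP, µg/L
eqr = [0.90, 0.95, 0.70, 0.80, 0.55, 0.62, 0.40, 0.45]
GM_EQR = 0.6                                            # biological G/M boundary

# Logarithmic series of candidate TP boundaries from 10 to 100 µg/L.
candidates = [10 ** (1 + 0.05 * k) for k in range(21)]

def mismatch_rates(boundary):
    """Fractions of water bodies where biology and nutrient classes disagree."""
    n = len(tp)
    # Biology good or better, but nutrient class moderate or worse.
    bio_good_nut_fail = sum(e >= GM_EQR and t > boundary for t, e in zip(tp, eqr)) / n
    # Biology moderate or worse, but nutrient class good or better.
    bio_fail_nut_good = sum(e < GM_EQR and t <= boundary for t, e in zip(tp, eqr)) / n
    return bio_good_nut_fail, bio_fail_nut_good

def rate_gap(boundary):
    first, second = mismatch_rates(boundary)
    return abs(first - second)

# The two rates move in opposite directions as the boundary rises;
# take the first candidate where they are closest (the crossing point).
boundary = min(candidates, key=rate_gap)
print(f"Boundary at crossing point: {boundary:.1f} µg/L, rates {mismatch_rates(boundary)}")
```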

Minimisation of mis-match of class
- Benefits of method:
  - No modelling, so simple to understand.
  - Not sensitive to outliers and non-linear data.
  - Directly shows rate of mis-classification.
  - No bias.
- Disadvantages:
  - No modelling, so not a general case.
  - No uncertainty analysis (might be possible to explore by Monte Carlo simulation of subsets of data).

Multivariate regression models
- EQR influenced by N & P:
  - multivariate model (NP.mod0): EQR = a1 log10[TP] + a2 log10[TN] + C0 + Error
  - univariate model (P.mod0): EQR = a3 log10[TP] + C1 + Error
  - univariate model (N.mod0): EQR = a4 log10[TN] + C2 + Error
- Example using high alkalinity very shallow lakes (N, CB, EC GIGs).
- TN and TP not significantly correlated (VIF < 2).

Model     df   AIC
N.mod0    3    -236.7768
NP.mod0   4    -330.8359   (lowest AIC, best model)
P.mod0    3    -264.8557

Multivariate regression models
- Two predictor variables, so the results are difficult to interpret.
- Calculate a matrix of TN & TP values that would predict an EQR = 0.6 (G/M boundary).
- Overlay as a contour line on a scatter plot of TP v TN.
- Use the intersection of the best fit line relating TP to TN (fitted using RMA regression) with the contour to determine pairs of boundary values for TP and TN.
- (More about situations where N or P is likely to be the limiting nutrient later.)
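For a linear model in log10[TP] and log10[TN], the EQR = 0.6 contour is a straight line on log axes, and its intersection with the TP v TN line has a closed form. The sketch below illustrates this. All coefficients are hypothetical placeholders (the presentation does not report fitted values), and the units (TP in µg/L, TN in mg/L) are an assumption.

```python
# Hypothetical coefficients, NOT fitted values from the presentation.
a1, a2, c0 = -0.30, -0.25, 1.15    # EQR = a1*log10(TP) + a2*log10(TN) + c0
d0, d1 = 1.50, 0.80                # TP v TN line: log10(TP) = d0 + d1*log10(TN)
GM_EQR = 0.6

def log_tp_on_contour(log_tn):
    """log10(TP) at which the model predicts EQR = GM_EQR for a given log10(TN)."""
    return (GM_EQR - c0 - a2 * log_tn) / a1

# Points along the contour, for overlaying on a TP v TN scatter plot.
log_tn_grid = [-0.5 + 0.1 * k for k in range(11)]
contour = [(10 ** t, 10 ** log_tp_on_contour(t)) for t in log_tn_grid]

# Intersection of the contour with the TP v TN line:
# substitute log10(TP) = d0 + d1*log10(TN) into the model and solve for log10(TN).
log_tn_boundary = (GM_EQR - c0 - a1 * d0) / (a1 * d1 + a2)
log_tp_boundary = d0 + d1 * log_tn_boundary
tn_boundary = 10 ** log_tn_boundary
tp_boundary = 10 ** log_tp_boundary
print(f"Paired G/M boundaries: TN = {tn_boundary:.2f} mg/L, TP = {tp_boundary:.0f} µg/L")
```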

Summary
- Tool kit provides a range of regression approaches.
  - Propose that ranged major axis regression provides the most unbiased model.
  - Range of true slope, and thus of most likely values, defined by predictions from the two OLS models.
- Categorical analysis is an alternative approach.
  - Still influenced by outliers and non-linear relationships.
- Minimisation of mis-match is a potentially useful approach.
- Data quality is a key factor.
  - Should cover the full range of pressures.
  - Nutrient data need to be a summary statistic (e.g. mean).
  - Need to be checked for linearity and for outliers.
- Outstanding issues:
  - How to deal with data showing weak relationships.

Comments
- Tool kit needs moving to a separate section (Appendix).
- Specific comments:
  - Clearer definition of outliers.
  - Is type II regression needed?
  - Need to consider N & P together.
  - Other variables not included (e.g. turbidity, flushing).
  - Tool kit aimed at freshwater (units in template?).
  - R scripts need more explanation.