Geoff Phillips & Heliana Teixeira


Statistical Tool Kit
The latest version of the Best practice guide and the tool kit (TKit_2017Sept19.zip) are available at: https://circabc.europa.eu/w/browse/2be04871-b0ba-4789-a36f-05f47f153ee7

Outline
- Overview of the tool kit
- Brief summary of results of testing
- Highlight significant changes following comments
  - Excel Tool
  - R Scripts
- Possible further developments
- Comparison of the different approaches using artificial data sets

Overview of the tool kit
Excel Tool
- Simple assessment of outliers
- Type I & II regression
- Categorical analysis (distributions of nutrient concentrations within class)
- Minimisation of mis-match
R Scripts
- Assessment of outliers & linearity
- Use of co-plots to identify interactions
- Multivariate regression
- Categorical analysis
- Bi-variate logistic regression
- Minimisation of mis-match (including bootstrapping to assess uncertainty)

Tool kit feedback
- All respondents used the Excel tool
  - Identification of outliers was complicated
  - Difficulties with units (the tool was designed for freshwater & P)
  - Most found the regression methods worked successfully
  - Issue with the categorical method: selection of data (either all the data or only data within the linear range)
  - Most liked the minimisation of mis-match method, although there were a few issues with how it scaled the data
- Most tried the R scripts
  - More difficult, particularly for those unfamiliar with R
  - Few tried the Shiny version (not clear why)
- Generally the tool kit was well received
  - Excel is simple but less flexible

Testing MS data sets (Lakes & Rivers)
- Results from 13 countries: Lakes (10), Rivers (4)
- Wide range of R² values for regressions
[Figure: range of R² values obtained by country]

[Figure: range of R² for phosphorus (Lake TP, River TP or soluble P)]
[Figure: range of R² for nitrogen (Lake TN, River TN or NO3-N)]

Collated results for Lake TP and River TP or soluble P by broad type where R² > 0.3
- Relatively wide range of predicted P boundary values (when R² > 0.3)
- Too few results for rivers; difficulty identifying broad types
- Results provided a useful test of the tool kit, but did not provide information to help determine boundary ranges for broad types

Testing MS data sets (Transitional & Coastal Waters)
- Results from 8 countries
  - Transitional waters: IE; UK; FR; RO
  - Coastal waters: IE; UK; FI; SP; GR; FR; RO; SE
- Often national types
- BQE: phytoplankton; opportunistic macroalgae
- Nutrients: DIN; TN; NO3; PO4; TP; OrthoP; N/P ratio
- Wide range of R² values for regressions
- Categorical results are not presented in this overview, only regression summaries
[Figure: range of R² values obtained by country]

Range of R² per nutrient across GIGs
[Figure panels: Coastal waters; Transitional waters]
- Different nutrients (and parameters) across GIGs / water categories

G/M results for types where R² > 0.3
- Few results within common types for comparing predicted G/M boundary values
[Figure panels: BALTIC CW; MEDITERRANEAN CW; NEA CW & TW; BLACK TW]
- Only one value resulting from a relationship with BQE opportunistic macroalgae (IE)
- Useful test of the tool kit, but not sufficient to help determine boundary ranges for common types

Excel Tool Modifications v6c
- Modified the data input tab so that the last record used for regression is separated from the last record used for categorical analysis

Excel Tool Modifications v6c
- Axis labels are now taken from a cell
- A macro is used to scale graphs (.xlsm file)

Excel Tool Modifications v6c
- Axis labels are taken from cell B2
- Scaling: min and max values & number of bins used

Excel Tool Modifications v6c
- The categorical method includes a Wilcoxon Rank Sum Test to check that there are significant differences in the distribution of nutrient concentration between adjacent classes

R Script Modifications
- It is difficult to make R scripts fully reliable; it is better to treat them as examples
- The Shiny application produced by Gabor Varbiro might be the best way for non-experts to apply them
- Key minor changes
  - Included some lines to check the field names used in the data file
  - Produced 2 copies of the script, one for N and one for P
  - Included additional optional lines of code for different units (mg/l, mmol etc.)
  - Increased the number of decimals to allow for different units
- (Changes introduce errors, so the new scripts may produce problems!)

Additional R Scripts: Conditioning plots (TKit_CoPlot.R)
- It is often helpful to look at the relationships between EQR and nutrients for different levels of other potentially limiting nutrients, for example by categorising data by N:P ratio
Fig A11: Relationship between EQR for phytoplankton and a) total phosphorus (log10) for different ranges of the N:P ratio

Additional R Scripts: Categorical methods (TKit_P_Categorical.R)
- Visualisation using box plots
- Wilcoxon test

Class   Average quartiles   Average median   75th percentile
High    27.75               32.5             26.90
Good    52.70               54.2             53.95

Fig A23: Box plot showing the range of nutrient concentration by WFD class; the width of each box is proportional to the number of records in the class. The probability that Good > High and Moderate > Good is shown (Wilcoxon test)
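
The three categorical summaries in the table can be sketched in a few lines of Python. This is not the TKit_P_Categorical.R script. The definitions are assumptions made for illustration: "average quartiles" is taken as the mean of the upper quartile of the better class and the lower quartile of the worse class, "average median" as the mean of the two class medians, and the third value as the upper quartile of the better class. The function names and the example concentrations are hypothetical.

```python
import statistics

def quartiles(values):
    """Return (lower quartile, median, upper quartile) of a list of concentrations."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, q2, q3

def categorical_boundaries(better_class, worse_class):
    """Boundary estimates from nutrient values in two adjacent status classes.

    The three definitions below are assumptions, not taken from the tool kit code.
    """
    _, better_median, better_q3 = quartiles(better_class)
    worse_q1, worse_median, _ = quartiles(worse_class)
    return {
        "average_quartiles": (better_q3 + worse_q1) / 2,
        "average_median": (better_median + worse_median) / 2,
        "upper_quartile_of_better_class": better_q3,
    }

# Hypothetical nutrient concentrations for two adjacent classes
good = [40, 50, 60, 70, 80]
moderate = [70, 80, 90, 100, 110]
print(categorical_boundaries(good, moderate))
```

Only the records in the two adjacent classes enter the calculation, which is why these estimates depend on where the data sit along the gradient.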

Additional R Scripts: Minimisation of mis-match (Gabor Varbiro, modified by GP)
- Specify the bootstrap iterations and sampling size:
    Itt <- 50     # Set value for number of iterations used to estimate variability, e.g. 50
    Prop <- 0.75  # Set proportion of data used for each iteration of simulation
- Experimental script
- Currently rather slow to run (Gabor is making some modifications to this script which should increase its speed)
- Fits lines using a Loess fit; the results depend on the number of bins used
- An alternative approach may be to use a logistic regression method
Fig A24: Relationship between the percentage of mis-classified records, comparing biological and nutrient classifications, and the value of the nutrient boundary. Vertical lines mark the range of cross-over points where the mis-classification is minimised, together with the mean nutrient concentration (each line shows a sub-sample of the data set selected at random)
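
The idea of repeated sub-sampling can be illustrated with a much simplified Python sketch. It is not the tool kit script: it uses no Loess fit or binning and simply picks, from a list of candidate boundaries, the one with the fewest mis-classified records. The parameters `itt` and `prop` mirror `Itt` and `Prop` above. Sampling without replacement, the EQR boundary of 0.6, the function names and the example records are assumptions.

```python
import random

def mismatch_counts(records, p_boundary, eqr_boundary=0.6):
    """Count the two kinds of disagreement between biology and nutrient class.

    records is a list of (nutrient concentration, EQR) pairs.
    """
    bio_good_nutrient_bad = sum(
        1 for p, eqr in records if eqr >= eqr_boundary and p > p_boundary)
    bio_bad_nutrient_good = sum(
        1 for p, eqr in records if eqr < eqr_boundary and p <= p_boundary)
    return bio_good_nutrient_bad, bio_bad_nutrient_good

def best_boundary(records, candidates, eqr_boundary=0.6):
    """Candidate with the fewest mis-classified records (ties go to the lowest)."""
    return min(
        candidates,
        key=lambda b: (sum(mismatch_counts(records, b, eqr_boundary)), b))

def bootstrap_boundaries(records, candidates, itt=50, prop=0.75, seed=0):
    """Repeat the estimate on itt random sub-samples, each a share prop of the data."""
    rng = random.Random(seed)
    size = max(1, int(len(records) * prop))
    return [best_boundary(rng.sample(records, size), candidates)
            for _ in range(itt)]

# Hypothetical records: (P concentration, EQR)
records = [(60, 0.8), (70, 0.8), (80, 0.8), (90, 0.4), (100, 0.4), (110, 0.4)]
candidates = list(range(50, 131, 10))
estimates = bootstrap_boundaries(records, candidates, itt=50, prop=0.75)
print(min(estimates), max(estimates))
```

The spread of the returned estimates gives a rough picture of the uncertainty that the vertical lines in Fig A24 represent.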

Additional R Scripts
- Important to check that there are sufficient iterations of the bootstrapping to achieve convergence
Fig A25: Example of convergence of the estimated mean in comparison to the number of iterations
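
One simple way to check convergence is to follow the running mean of the bootstrap estimates, as in this Python sketch. The convergence rule (the running mean moves by less than a tolerance over the last few iterations), the default values and the function names are assumptions, not part of the tool kit.

```python
def running_mean(estimates):
    """Mean of the first 1, 2, ..., n estimates."""
    means, total = [], 0.0
    for count, value in enumerate(estimates, start=1):
        total += value
        means.append(total / count)
    return means

def has_converged(estimates, window=10, tol=1.0):
    """True if the running mean varied by less than tol over the last window steps."""
    means = running_mean(estimates)
    if len(means) <= window:
        return False  # not enough iterations to judge
    recent = means[-(window + 1):]
    return max(recent) - min(recent) < tol

print(running_mean([80, 90, 85]))
```

If the check fails, the number of iterations can be increased and the check repeated.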

Additional R Scripts: Binomial logistic regression (adapted from a script provided by Andreas Müller, Germany)
Fig A26: Binomial logistic regression of total nitrogen on the probability of being moderate or worse status. Lines show potential boundary values at different probabilities of being moderate or worse
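
Once a logistic model has been fitted, a boundary at a chosen probability follows by inverting the model. The Python sketch below assumes a model of the form logit(p) = b0 + b1·x, where p is the probability of being moderate or worse and x is the nutrient value on whatever scale was used in the fit. The coefficients shown are hypothetical and the fitting step itself is not included.

```python
import math

def logit(prob):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(prob / (1.0 - prob))

def probability(x, b0, b1):
    """Modelled probability of moderate or worse status at nutrient value x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def boundary_at_probability(prob, b0, b1):
    """Nutrient value at which the modelled probability equals prob."""
    return (logit(prob) - b0) / b1

# Hypothetical coefficients, for illustration only
b0, b1 = -4.0, 2.0
for prob in (0.25, 0.5, 0.75):
    print(prob, boundary_at_probability(prob, b0, b1))
```

Each probability gives a different candidate boundary, which corresponds to the set of lines drawn in Fig A26.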

Further developments: use of the R package modEvA
- Heliana has also been experimenting with this R package, which uses the output from the GLM logistic model to produce confusion matrices (numbers of false negative and false positive classifications) and offers different approaches, such as minimisation of false negatives, minimisation of false positives, or minimum difference
- Raises questions about what we are seeking to minimise

Confusion matrix
                 Nutrient Pred
Biology Obs      + G         - NG        Total
+ G              TP(1,1)     FN(0,1)     n+1
- NG             FP(1,0)     TN(0,0)     n+0
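
The effect of the choice of criterion can be seen in a small Python sketch. It does not use modEvA. It builds the confusion matrix directly from a candidate nutrient boundary, with "positive" meaning good status as in the matrix above. The EQR boundary of 0.6, the criterion names, the tie-breaking rule and the example records are assumptions.

```python
def confusion(records, p_boundary, eqr_boundary=0.6):
    """Return (TP, FN, FP, TN), where positive means good status.

    Observed class comes from biology (EQR), predicted class from the nutrient.
    """
    tp = fn = fp = tn = 0
    for p, eqr in records:
        observed_good = eqr >= eqr_boundary
        predicted_good = p <= p_boundary
        if observed_good and predicted_good:
            tp += 1
        elif observed_good:
            fn += 1
        elif predicted_good:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn

def pick_boundary(records, candidates, criterion):
    """Choose a boundary by one of three criteria (ties go to the lowest value)."""
    def score(boundary):
        _, fn, fp, _ = confusion(records, boundary)
        if criterion == "min_fn":
            return fn
        if criterion == "min_fp":
            return fp
        if criterion == "min_diff":
            return abs(fn - fp)
        raise ValueError("unknown criterion: " + criterion)
    return min(candidates, key=lambda b: (score(b), b))

# Hypothetical records: (P concentration, EQR)
records = [(60, 0.8), (70, 0.8), (80, 0.8), (90, 0.4), (100, 0.4), (110, 0.4)]
candidates = list(range(50, 131, 10))
for criterion in ("min_fn", "min_fp", "min_diff"):
    print(criterion, pick_boundary(records, candidates, criterion))
```

Minimising false positives alone drives the boundary towards the lowest candidate, which illustrates why the choice of what to minimise matters.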

Fundamentally two approaches
- Regression modelling (including quantile regression)
  - Uses all the data, not only that within the status class of interest
  - Dependent on linearity (unless non-linear models are used)
  - Issues regarding the use of type I or type II models
- Categorical methods (including binomial logistic regression)
  - Only use data for the status class of interest
  - Ignore the variability of nutrient concentration within the class
- When relationships are strong, most methods produce similar results, particularly if the mean EQR is close to the boundary of interest
- Particular issue when scatter plots show "wedge"-shaped relationships (other factors influence the nutrient response); this may be common for relationships in rivers

Comparing categorical & regression approaches: strong relationship
Working with an artificial data set
- Random, normally distributed set of P concentration values with a given mean & standard deviation
- Predict a "true" EQR using a known regression model: EQR ~ aP + c
- Add random error, normally distributed with a mean of 0 and different standard deviations, to generate a typical "observed" EQR: EQR ~ aP + c + Error
Results
- Regression: P is 83 µg/l at EQR 0.6
- Categorical methods
  - 75th quantile: lower value
  - Average median: similar
  - Average quartiles: similar
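
The artificial-data model can be sketched in Python as follows. The values of `a` and `c` are hypothetical, chosen only so that the "true" boundary at EQR 0.6 is 83 µg/l. The mean, standard deviations and function names are also assumptions, and the fit is an ordinary least squares regression of EQR on P.

```python
import random

A = -0.003   # hypothetical slope
C = 0.849    # hypothetical intercept, so that (0.6 - C) / A = 83

def simulate(n, p_mean, p_sd, err_sd, rng, a=A, c=C):
    """Return n (P, observed EQR) pairs following EQR = a*P + c + error."""
    data = []
    for _ in range(n):
        p = rng.gauss(p_mean, p_sd)
        eqr = a * p + c + rng.gauss(0.0, err_sd)
        data.append((p, eqr))
    return data

def ols(xs, ys):
    """Slope and intercept of the least squares line of ys on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def boundary_from_regression(data, eqr_boundary=0.6):
    """P concentration at which the fitted line crosses the EQR boundary."""
    slope, intercept = ols([p for p, _ in data], [e for _, e in data])
    return (eqr_boundary - intercept) / slope

data = simulate(200, 100.0, 30.0, 0.05, random.Random(1))
print(round(boundary_from_regression(data), 1))
```

With the error standard deviation set to zero the estimate returns the "true" boundary; increasing it produces the noisier relationships discussed next.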

Comparing categorical & regression approaches: noisy relationship
- Regression: 83 µg/l
- Categorical methods
  - 75th quantile: difference smaller
  - Average median: similar
  - Average quartiles: similar
- Categorical methods only consider data in the range Good & Moderate
- Not influenced by linearity and outliers at the ends of the gradient

Exploration of methods using synthetic data sets
- Created a series of synthetic data sets, each of 200 records
- 10 random sets of data using 10 different mean P values (50 – 170 µg/l)
- Predicted EQR values with 10 different levels of error
- Applied all methods to each data set
- Generated 1000 sets of estimated good/moderate threshold values
- Compared the ranges of values by variability (categories of R² and mean phosphorus concentration)
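
The design (10 mean P values × 10 error levels × 10 random sets = 1000 data sets) can be sketched in Python. Only the regression estimate is computed here. The model coefficients, the spread of P, the evenly spaced means and the error levels are all assumptions, since the slides do not give them.

```python
import random

def r_squared_and_boundary(ps, eqrs, eqr_boundary=0.6):
    """R² of the least squares fit of EQR on P, and P at the EQR boundary."""
    n = len(ps)
    mean_p, mean_e = sum(ps) / n, sum(eqrs) / n
    spp = sum((p - mean_p) ** 2 for p in ps)
    see = sum((e - mean_e) ** 2 for e in eqrs)
    spe = sum((p - mean_p) * (e - mean_e) for p, e in zip(ps, eqrs))
    slope = spe / spp
    intercept = mean_e - slope * mean_p
    return spe * spe / (spp * see), (eqr_boundary - intercept) / slope

def run_grid(a=-0.003, c=0.849, p_sd=30.0, n_records=200, seed=1):
    """Return (mean P, error SD, R², estimated boundary) for 1000 data sets."""
    rng = random.Random(seed)
    p_means = [50 + i * (170 - 50) / 9 for i in range(10)]  # assumed even spacing
    err_sds = [0.02 * (j + 1) for j in range(10)]           # assumed error levels
    results = []
    for p_mean in p_means:
        for err_sd in err_sds:
            for _ in range(10):                             # 10 random sets
                ps = [rng.gauss(p_mean, p_sd) for _ in range(n_records)]
                eqrs = [a * p + c + rng.gauss(0.0, err_sd) for p in ps]
                r2, boundary = r_squared_and_boundary(ps, eqrs)
                results.append((p_mean, err_sd, r2, boundary))
    return results

results = run_grid()
low_r2 = [b for _, _, r2, b in results if r2 < 0.3]
high_r2 = [b for _, _, r2, b in results if r2 >= 0.3]
print(len(results), len(low_r2), len(high_r2))
```

Grouping the estimates by R² category and by mean P, as in the last two lines, is the comparison described in the final bullet above.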

Range of predicted P at the Good/Moderate boundary for artificial data with increasing variability (R²)
[Figure: dotted line marks the "true" boundary (83 µg/l)]

Effect of range of data (mean P 50 – 140 µg/l)
- Where the mean of the data is
  - < boundary, categorical methods underestimate the boundary
  - > boundary, categorical methods overestimate the boundary
- Differences increase as scatter increases (lower R²)
- The 75th quantile of good shows the most extreme range

Range of predicted P at the Good/Moderate boundary for artificial data with increasing variability (R²)
- Binary logistic regression and the minimisation of mis-match methods are the most stable of the categorical methods

Conclusion from synthetic data sets
- Linear regression provided good estimates of the boundary for all values of R²
- Binary logistic regression performed at least as well
- Minimisation of mis-match was only slightly influenced by variability and range of the data
- The other categorical methods produced relatively wide ranges of estimated threshold values, particularly the use of the 75th percentile of good
- These conclusions were drawn from synthetic data that conformed to the requirements of linear regression, but they suggest that regression, binary logistic regression and the minimisation of mis-match are the most reliable techniques, provided the data scatter does not show evidence of a "wedge" shape (evidence of multiple pressures?)
- (More about this later)

Problem with wedge-shaped data
- Wedge-shaped data may occur for many reasons, but fitting OLS regression lines may not produce useful models for determining boundary values
- More interested in fitting upper or lower quantiles, or using upper quantiles of the nutrient distribution within a class

Summary
- The tool kit provides a range of tools which can be used
- Selecting the correct method is potentially difficult
- We recommend using regression methods rather than categorical methods, as they use data across the range of pressure
- However, categorical methods, particularly binomial logistic regression, may be useful where data are clearly not linear
- We have not solved the problem of interpreting data where multiple pressures may generate wedge-shaped relationships