Using the Maryland Biological Stream Survey Data to Test Spatial Statistical Models: A Collaborative Approach to Analyzing Stream Network Data. Andrew A. Merton

Using the Maryland Biological Stream Survey Data to Test Spatial Statistical Models: A Collaborative Approach to Analyzing Stream Network Data
Andrew A. Merton

Overview
- The material presented here is a subset of the work done by Erin Peterson for her Ph.D.
- Interested in developing geostatistical models for predicting water quality characteristics in stream segments
- Data: Maryland Biological Stream Survey (MBSS)
- The scope and nature of the problem require interdisciplinary collaboration
  - Ecology, geoscience, statistics, others…

Stream Network Data
- The response data consist of observations within a stream network
  - What does it mean to be a “neighbor” in such a framework?
  - How does one characterize the distance between “neighbors”?
  - Should distance measures be confined to the stream network?
  - Does flow (direction) matter?

Stream Network Data
- Potential explanatory variables are not restricted to be within the stream network
  - Topography, soil type, land usage, etc.
- How does one sensibly incorporate these explanatory variables into the analysis?
  - Can we develop tools to aggregate upstream watershed covariates for subsequent downstream segments?

Competing Models
- Given a collection of competing models, how does one select the “best” model?
  - Is one subset of explanatory variables better or closer to the “true” model?
  - Should one assume correlated residuals and, if so, what form should the correlation function take?
  - How does the distance measure impact the choice of correlation function?

Functional Distances & Spatial Relationships
[Figure: sites A, B, and C]
- Straight-line Distance (SLD) (As the crow flies…)
- Geostatistical models are based on straight-line distance
- Is this an appropriate measure of distance?
- Influential continuous landscape variables: geology type or acid rain

Functional Distances & Spatial Relationships
[Figure: sites A, B, and C]
- Distances and relationships are represented differently depending on the distance measure
- Symmetric Hydrologic Distance (SHD) (As the fish swims…)
- Hydrologic connectivity

Functional Distances & Spatial Relationships
[Figure: sites A, B, and C]
- Distances and relationships are represented differently depending on the distance measure
- Asymmetric Hydrologic Distance (AHD) (As the sh*t flows…)
- Longitudinal transport of material
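The three distance measures can be contrasted on a toy network. The sketch below is only an illustration: the network layout, segment lengths, coordinates, and function names are all hypothetical and are not taken from the MBSS data or the authors' GIS tools. It assumes a tree-shaped network in which every node drains to exactly one downstream node.

```python
import math

# Hypothetical toy network: each node maps to (downstream node, length in km
# of the segment joining them). The outlet has no downstream node.
DOWNSTREAM = {
    "A": ("J", 2.0),   # site A drains to junction J
    "B": ("J", 3.0),   # site B drains to the same junction
    "J": ("C", 1.5),   # the junction drains to site C
    "C": (None, 0.0),  # C is the outlet
}
# Hypothetical planar coordinates (km), used only for straight-line distance.
COORDS = {"A": (0.0, 2.0), "B": (3.0, 2.5), "C": (1.0, 0.0)}


def path_to_outlet(node):
    """Return {node on the downstream path: distance from the start node}."""
    dist, out = 0.0, {}
    while node is not None:
        out[node] = dist
        nxt, length = DOWNSTREAM[node]
        dist += length
        node = nxt
    return out


def sld(a, b):
    """Straight-line distance (as the crow flies)."""
    return math.dist(COORDS[a], COORDS[b])


def shd(a, b):
    """Symmetric hydrologic distance: shortest path along the network."""
    pa, pb = path_to_outlet(a), path_to_outlet(b)
    # Both downstream paths meet at a common node; sum the two legs.
    return min(pa[n] + pb[n] for n in pa if n in pb)


def ahd(a, b):
    """Asymmetric hydrologic distance from a down to b.

    Returns None when b is not downstream of a (no flow connection).
    """
    return path_to_outlet(a).get(b)
```

In this toy example A and B are 5.0 km apart by SHD but have no AHD in either direction, because neither site is downstream of the other; under AHD they are not neighbors at all.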

Candidate Models
- Restrict the model space to general linear models
  - Look at all possible subsets of explanatory variables X (Hoeting et al.)
- Require a correlation structure that can accommodate the various distance measures
  - Could assume that the residuals are spatially independent, i.e., $\Sigma = \sigma^2 I$ (probably not best)
  - Ver Hoef et al. propose a better solution
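Enumerating "all possible subsets" is simple to sketch. The helper name and the covariate names below are hypothetical placeholders, not variables from the MBSS analysis; the point is only that q covariates give 2^q candidate mean structures, each of which would then be paired with a correlation structure.

```python
from itertools import combinations


def all_subsets(covariates):
    """Yield every subset of the covariate names.

    Includes the empty (intercept-only) subset, so q covariates
    produce 2**q candidate models.
    """
    for r in range(len(covariates) + 1):
        for subset in combinations(covariates, r):
            yield subset


# Hypothetical covariate names, for illustration only.
candidates = list(all_subsets(["elevation", "pct_forest", "pct_urban"]))
```

The count doubles with every added covariate, so an exhaustive search is only practical for a modest number of explanatory variables.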

Asymmetric Autocovariance Models for Stream Networks
- Weighted asymmetric hydrologic distance (WAHD)
- Developed by Jay Ver Hoef, National Marine Mammal Laboratory, Seattle
- Moving average models
- Incorporates flow and uses hydrologic distance
- Represents discontinuity at confluences
[Figure: flow direction]

Exponential Correlation Structure
- The exponential correlation function can be used for both SLD and SHD
- For AHD, one must multiply $\Sigma$ (element-wise) by the weight matrix $A$, i.e., $\Sigma^{*}_{ij} = a_{ij}\,\Sigma_{ij}$, hence WAHD
- The weights represent the proportion of flow volume that the downstream location receives from the upstream location
- Estimating the $a_{ij}$ is non-trivial; special GIS tools are needed (Theobald et al.)
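The element-wise weighting can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the parametrization exp(-d / theta) is one common form of the exponential correlation function and may differ from the one used on the slide, and the distances and weights are made-up numbers, with a weight of zero for pairs of sites that are not flow-connected.

```python
import numpy as np


def exponential_correlation(dist, theta):
    """Exponential correlation exp(-d / theta) for a matrix of distances.

    The range parameter theta and this parametrization are assumptions.
    """
    return np.exp(-np.asarray(dist, dtype=float) / theta)


def wahd_correlation(dist, weights, theta):
    """Element-wise (Hadamard) product of the weight matrix A and the
    exponential correlation matrix."""
    return np.asarray(weights, dtype=float) * exponential_correlation(dist, theta)


# Hypothetical 3-site example. Sites 0 and 1 are not flow-connected (weight 0);
# both drain to site 2.
dist = np.array([[0.0, 5.0, 3.5],
                 [5.0, 0.0, 4.5],
                 [3.5, 4.5, 0.0]])
weights = np.array([[1.0, 0.0, 0.4],
                    [0.0, 1.0, 0.6],
                    [0.4, 0.6, 1.0]])
corr = wahd_correlation(dist, weights, theta=10.0)
```

Note that `*` on NumPy arrays is the element-wise product, which is what the slide calls for; `@` would be the ordinary matrix product and would give a different result.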

GIS Tools
- Theobald et al. have created automated tools to extract data about hydrologic relationships between sample points
- Visual Basic for Applications programs that:
  1. Calculate separation distances between sites: SLD, SHD, and asymmetric hydrologic distance (AHD)
  2. Calculate watershed covariates for each stream segment: Functional Linkage of Watersheds and Streams (FLoWS)
  3. Convert GIS data to a format compatible with statistics software
[Figure: sites 1, 2, and 3 under SLD, SHD, and AHD]

Spatial Weights for WAHD
- Proportional influence: influence of each neighboring sample site on a downstream sample site
  - Weighted by catchment area: surrogate for flow
1. Calculate influence of each upstream segment on segment directly downstream
2. Calculate the proportional influence of one sample site on another
   - Multiply the edge proportional influences
3. Output: n×n weighted incidence matrix
[Figure legend: stream confluence, stream segment]

Spatial Weights for WAHD (continued)
[Figure: stream segments labeled A–H; legend: survey sites, stream segment]

Spatial Weights for WAHD (continued)
[Figure: stream segments labeled A–H]
- Site PI = B * D * F * G, i.e., the product of the proportional influences of segments B, D, F, and G
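The two-step calculation (segment shares at each confluence, then a product along the flow path) can be sketched as follows. Everything concrete here is hypothetical: the catchment areas, the pairing of segments A–H into confluences, and the function names are invented for illustration, since the slide's figure is not available in the transcript.

```python
from math import prod


def segment_pi(areas_at_confluence):
    """Share of each upstream segment at one confluence.

    Catchment area is used as the surrogate for flow, so the shares of
    the segments entering a confluence sum to 1.
    """
    total = sum(areas_at_confluence.values())
    return {seg: area / total for seg, area in areas_at_confluence.items()}


def site_pi(path_segments, seg_pi):
    """Proportional influence of an upstream site on a downstream site:
    the product of the segment PIs along the connecting flow path."""
    return prod(seg_pi[seg] for seg in path_segments)


# Hypothetical catchment areas (km^2), grouped by the confluence that each
# pair of segments flows into.
confluences = [
    {"A": 30.0, "B": 70.0},
    {"C": 40.0, "D": 60.0},
    {"E": 50.0, "F": 50.0},
    {"G": 80.0, "H": 20.0},
]
seg_pi = {}
for areas in confluences:
    seg_pi.update(segment_pi(areas))

# Path B -> D -> F -> G, matching the slide's "Site PI = B * D * F * G".
pi_site = site_pi(["B", "D", "F", "G"], seg_pi)
```

Repeating `site_pi` for every flow-connected pair of sites, and entering zero for unconnected pairs, would fill the n×n weighted incidence matrix described in step 3.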

Parameter Estimation
- Maximize the (profile) likelihood to obtain estimates for $\beta$, $\theta$, and $\sigma^2$
- MLEs and profile likelihood: [equations not preserved in this transcript]
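For orientation, the standard Gaussian profile likelihood has the following form. This is the textbook result, not a reconstruction of the slide's own equations, and it assumes $Z \sim N(X\beta, \sigma^2 R(\theta))$, where $R(\theta)$ is an assumed notation for the correlation matrix implied by the chosen distance measure. For a fixed $\theta$, $\beta$ and $\sigma^2$ have closed-form maximizers, which are substituted back so that only $\theta$ needs numerical optimization.

```latex
\begin{align}
\hat{\beta}(\theta) &= \left(X^{\top} R(\theta)^{-1} X\right)^{-1} X^{\top} R(\theta)^{-1} Z \\
\hat{\sigma}^{2}(\theta) &= \frac{1}{n}\,\bigl(Z - X\hat{\beta}(\theta)\bigr)^{\top} R(\theta)^{-1} \bigl(Z - X\hat{\beta}(\theta)\bigr) \\
\ell_{p}(\theta) &= -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\hat{\sigma}^{2}(\theta)
  - \frac{1}{2}\log\bigl|R(\theta)\bigr| - \frac{n}{2}
\end{align}
```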

Model Selection
- Hoeting et al. adapted the Akaike Information Corrected Criterion (AICC) for spatial models
- AICC estimates the difference between the candidate model and the “true” model
- Select models with small AICC
[Equation for AICC not preserved in this transcript], where n is the number of observations, p-1 is the number of covariates, and k is the number of autocorrelation parameters
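The slide's equation did not survive extraction. The sketch below uses the form commonly quoted from Hoeting et al., AICC = -2 log L + 2n(p + k + 1)/(n - p - k - 2); treat that formula as an assumption to be checked against the paper. The function name and the example numbers are hypothetical.

```python
def aicc(log_lik, n, p, k):
    """Corrected AIC for a spatial linear model (assumed form).

    log_lik : maximized log-likelihood of the candidate model
    n       : number of observations
    p       : number of regression parameters (p - 1 covariates plus intercept)
    k       : number of autocorrelation parameters
    """
    denom = n - p - k - 2
    if denom <= 0:
        raise ValueError("n is too small for this many parameters")
    return -2.0 * log_lik + 2.0 * n * (p + k + 1) / denom


# Hypothetical comparison: same fit, one extra covariate in the second model.
small = aicc(-100.0, n=826, p=4, k=2)
large = aicc(-100.0, n=826, p=5, k=2)
```

With equal log-likelihoods the larger model receives the larger (worse) AICC, which is the penalty at work; an extra covariate is only favored if it raises the log-likelihood enough to offset it.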

Spatial Distribution of MBSS Data
[Figure: map of the MBSS data with north arrow]

Summary Statistics for Distance Measures
- Distance measure greatly impacts the number of neighboring sites as well as the median, mean, and maximum separation distance between sites
[Table: Summary statistics for distance measures in kilometers using DO (n=826).]
* Asymmetric hydrologic distance is not weighted here

Comparing Distance Measures
- The “selected” models (one for each distance measure) were compared by computing the mean square prediction error (MSPE)
- GLM: Assumed independent errors
- Withheld the same 100 (randomly) selected records from each model fit
- Want MSPE to be small
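The hold-out comparison can be sketched as below. The helper names, the seed, and the use of NumPy's random generator are assumptions for illustration; the key design point from the slide is that the same withheld records are reused for every model, so differences in MSPE reflect the models rather than the split.

```python
import numpy as np


def holdout_indices(n, n_holdout=100, seed=0):
    """Pick one random hold-out set to reuse across all model fits."""
    rng = np.random.default_rng(seed)
    test = rng.choice(n, size=n_holdout, replace=False)
    train = np.setdiff1d(np.arange(n), test)
    return train, test


def mspe(observed, predicted):
    """Mean square prediction error over the withheld records."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean((observed - predicted) ** 2))


# n = 826 matches the record count quoted for DO; the split itself is arbitrary.
train_idx, test_idx = holdout_indices(826)
```

Each candidate model would be fit on `train_idx`, used to predict at `test_idx`, and scored with `mspe`; the model with the smallest value predicts best on the withheld sites.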

Comparing Distance Measures: Prediction Performance for Various Responses
[Figure: MSPE for the GLM, SLD, SHD, and WAHD models]

Maps of the Relative Weights
- Generated maps by kriging (interpolation)
- Predicted values are linear combinations of the “observed” data, i.e., [equation not preserved in this transcript]
- $Z_1$ is the observed data, $Z_2$ is the predicted value, $\Sigma_{11}$ is the correlation matrix for the observed sites, and $\Sigma_{21}$ is the correlation matrix between the prediction site and the observed sites
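The statement that predictions are linear combinations of the observed data can be made concrete with the simple-kriging form Z2_hat = Sigma_21 Sigma_11^{-1} Z1. This is a sketch under assumptions: it takes a zero-mean process (with covariates, one would krige the residuals and add the fitted trend back), the subscript on the cross-correlation matrix is assumed, and the function names are hypothetical.

```python
import numpy as np


def kriging_weights(sigma_11, sigma_21):
    """Weights lambda = Sigma_21 @ inv(Sigma_11), one row per prediction site.

    Uses a linear solve instead of an explicit inverse; this relies on
    Sigma_11 being symmetric.
    """
    sigma_11 = np.asarray(sigma_11, dtype=float)
    sigma_21 = np.atleast_2d(np.asarray(sigma_21, dtype=float))
    return np.linalg.solve(sigma_11, sigma_21.T).T


def krige(sigma_11, sigma_21, z1):
    """Predicted values as a linear combination of the observed data."""
    return kriging_weights(sigma_11, sigma_21) @ np.asarray(z1, dtype=float)


# Hypothetical example: two observed sites with correlation 0.5.
sigma_11 = np.array([[1.0, 0.5],
                     [0.5, 1.0]])
# Prediction site coincides with observed site 0, so its correlations with
# the observed sites are just row 0 of sigma_11.
weights = kriging_weights(sigma_11, sigma_11[[0]])
```

In this example the weights put all mass on site 0, which is the expected behavior when predicting at an observed location. The rows of the weight matrix are the "relative weights" that the maps for a prediction site display, and they change with the distance measure because the correlation matrices do.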

Relative Weights Used to Make Prediction at Site 465
[Figure: four map panels: General Linear Model, Straight-line, Symmetric Hydrologic, Weighted Asymmetric Hydrologic]

Residual Correlations for Site 465
[Figure: four map panels: General Linear Model, Straight-line, Symmetric Hydrologic, Weighted Asymmetric Hydrologic]

Some Comments on the Sampling Design
- Probability-based random survey design
- Designed to maximize spatial independence of survey sites
- Does not adequately represent spatial relationships in stream networks using hydrologic distance measures
[Figure: histogram of frequency against number of neighboring sites]
- 244 sites did not have neighbors
- Sample size = 881
- Number of sites with ≥ 1 neighbor: 393
- Mean number of neighbors per site: 2.81

Conclusions
- A collaborative effort enabled the analysis of a complicated problem
  - Ecology: posed the problem of interest, provides insight into variable (model) selection
  - Geoscience: development of powerful tools based on GIS
  - Statistics: development of valid covariance structures, model selection techniques
  - Others: e.g., very understanding (and sympathetic) spouses…