EXPLORING SPATIAL CORRELATION IN RIVERS by Joshua French.



Introduction
A city is required to extend its sewage pipelines farther into its bay to meet EPA requirements. How far should the pipelines be extended? The city doesn't want to spend any more money than it needs to on extending the pipelines. It needs a way to predict the waste levels at different sites in the bay.

With the passage of the Clean Water Act in the 1970s, spatial analysis of aquatic data has become even more important. Section 305(b) requires state governments to prepare "a description of the water quality of all navigable waters in such State..." It is not physically or financially possible to make measurements at all sites, so some sort of spatial interpolation will need to be used.

Usually we might try to fit some sort of linear model to the data to make predictions, and we usually assume that observations are independent. For spatial data, however, we intuitively expect two sampling sites in close proximity to be more similar than two sites separated by a great distance. We can use the correlation between sampling sites to make better predictions with our model.

The Ohio River

The Road Ahead
- Methods
- Introduction to the Variogram
- Exploratory Analysis
- Sample Variogram
- Modeling the Variogram
- Analysis
- 3 types of results
- Conclusions
- Future Work

Introduction to the Variogram
Spatial data is often viewed as a stochastic process. For each point x, a specific property Z(x) is viewed as a random variable with mean $\mu$, variance $\sigma^2$, higher-order moments, and a cumulative distribution function.

Each individual $Z(x_i)$ is assumed to have its own distribution, and the set $\{Z(x_1), Z(x_2), \ldots\}$ is a stochastic process. The data values in a given data set are simply a realization of the stochastic process.

We want to measure the relationship between different points. Define the covariance of $Z(x_j)$ and $Z(x_k)$ to be
$$\mathrm{Cov}(Z(x_j), Z(x_k)) = E[\{Z(x_j) - \mu(x_j)\}\{Z(x_k) - \mu(x_k)\}]$$
where $\mu(x_j)$ and $\mu(x_k)$ are the means of Z at the respective locations.

However, we have a problem. We don't know the means at each point because we only have one realization. To solve this, we must assume some sort of stationarity: certain features of the distribution are identical everywhere. We will work with data that satisfies second-order stationarity.

Second-order stationarity means that the mean is the same everywhere, i.e. $E[Z(x_j)] = \mu$ for all points $x_j$. It also implies that $\mathrm{Cov}(Z(x_j), Z(x_k))$ depends only on the separation between $x_j$ and $x_k$, not on their absolute locations.

Thus $\mathrm{Cov}(Z(x_j), Z(x_k)) = \mathrm{Cov}(Z(x), Z(x+h)) = \mathrm{Cov}(h)$, where h measures the distance between two points. We can then derive that
$$\mathrm{Cov}(h) = E[(Z(x) - \mu)(Z(x+h) - \mu)] = E[Z(x)Z(x+h)] - \mu^2$$

Sometimes it is clear that our data is not second-order stationary. Georges Matheron addressed this problem in 1965 by establishing his intrinsic hypothesis. For small distances h, Matheron held that $E[Z(x) - Z(x+h)] = 0$.

Looking at the variance of differences, this leads to
$$\mathrm{Var}[Z(x) - Z(x+h)] = E[(Z(x) - Z(x+h))^2] = 2\gamma(h)$$
Intrinsic stationarity is useful because analysis may be conducted even if second-order stationarity is violated. Unfortunately, the covariance function is not necessarily defined under intrinsic stationarity alone.

For this reason, we will work with data that is second-order stationary. If second-order stationarity is violated by the original data, then we will perform additional procedures to obtain data that is second-order stationary.

Note that second-order stationarity implies intrinsic stationarity, so the variogram equation is still defined. Under second-order stationarity, $\gamma(h) = \mathrm{Cov}(0) - \mathrm{Cov}(h)$. $\gamma(h)$ is known as the semi-variogram. In practice, however, it is usually referred to as the variogram.
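The relation between the variogram and the covariance follows in a few lines from the definitions already given. This is a short worked derivation, assuming only second-order stationarity, so that $\mathrm{Var}[Z(x)] = \mathrm{Var}[Z(x+h)] = \mathrm{Cov}(0)$:

```latex
\begin{align*}
2\gamma(h) &= \mathrm{Var}[Z(x) - Z(x+h)] \\
           &= \mathrm{Var}[Z(x)] + \mathrm{Var}[Z(x+h)] - 2\,\mathrm{Cov}(Z(x), Z(x+h)) \\
           &= 2\,\mathrm{Cov}(0) - 2\,\mathrm{Cov}(h), \\
\gamma(h)  &= \mathrm{Cov}(0) - \mathrm{Cov}(h).
\end{align*}
```

Because $\mathrm{Cov}(h)$ typically decays as h grows, $\gamma(h)$ rises from 0 toward $\mathrm{Cov}(0)$.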

Things to know about variograms:
1. γ(h) = γ(-h). Because the variogram is an even function, usually only positive lag distances are shown.
2. Nugget effect: by definition, γ(0) = 0. In practice, however, sample variograms often have a positive value at lag 0. This is called the nugget effect.
3. Variograms tend to increase monotonically.
4. Sill: the maximum variance of the variogram.
5. Range: the lag distance at which the sill is reached.

The following figure shows these features.

Variogram Example

Exploratory Analysis
Before we model variograms, we should explore the data.
- We need to make sure that the data analyzed satisfies second-order stationarity.
- We need to check for outliers.
- We need to make sure that the data is not too badly skewed (skewness coefficient $G_1 > 1$).

We can look at the river data as a one-dimensional linear system. It is fairly easy to check for stationarity using a scatter plot.

If there is an obvious trend in the data, we should remove it and analyze the residuals. If the variance increases or decreases with lag distance, then we should transform the variable to correct this.
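Removing a trend and keeping the residuals can be sketched as follows. This is a minimal illustration that assumes the trend is a straight line in river position; the slides do not say what form of trend was removed, and the names `fit_line`, `detrend`, `river_mile`, and `value` are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares intercept and slope for y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def detrend(x, y):
    """Return residuals after subtracting the fitted straight line."""
    intercept, slope = fit_line(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Made-up illustrative values, not data from the study.
river_mile = [0.0, 10.0, 20.0, 30.0, 40.0]
value = [5.0, 7.1, 8.9, 11.2, 12.8]
residuals = detrend(river_mile, value)
```

The residuals, rather than the raw values, would then be checked for stationarity and passed to the variogram analysis.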

To check for outliers, we may use a typical boxplot. If the data contains outliers, we should do analysis both with and without outliers present.
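A boxplot usually flags points beyond 1.5 times the interquartile range from the quartiles. The sketch below applies that rule numerically; the 1.5 multiplier and the quartile method are assumptions (the slides only say "a typical boxplot"), and `boxplot_outliers` is a hypothetical helper name.

```python
import statistics


def boxplot_outliers(values, k=1.5):
    """Return values outside the fences q1 - k*IQR and q3 + k*IQR."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Made-up illustrative values, not data from the study.
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0]
outliers = boxplot_outliers(data)
without_outliers = [v for v in data if v not in outliers]
```

Keeping both `data` and `without_outliers` makes it easy to run the same analysis with and without the flagged points.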

If $G_1 > 1$, then we should transform the data to approximate normality if possible. To check approximate normality, a standard Q-Q plot can be used.
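The skewness check and a log transform can be sketched as below. The formula used here is the plain moment-based skewness $m_3 / m_2^{3/2}$; the slides do not define $G_1$, which may include a small-sample correction, so treat this as an approximation. The function name and the numbers are hypothetical.

```python
import math


def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Made-up right-skewed counts, not data from the study.
counts = [3, 5, 8, 12, 20, 45, 150, 600]
raw_skew = skewness(counts)
log_skew = skewness([math.log(c) for c in counts])

# The log transform requires strictly positive values.
needs_transform = raw_skew > 1
```

For right-skewed positive data such as counts, the natural logarithm typically pulls the long upper tail in and lowers the skewness.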

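The Sample Variogram
One of the previous definitions of semivariance is
$$\gamma(h) = \tfrac{1}{2} E[(Z(x) - Z(x+h))^2]$$
The logical estimator is
$$\hat{\gamma}(h) = \frac{1}{2N(h)} \sum_{i=1}^{N(h)} \{z(x_i) - z(x_i + h)\}^2$$
where N(h) is the number of pairs of observations associated with that lag.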
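For irregularly spaced sites, pairs are usually grouped into lag bins before averaging. The sketch below does this for one-dimensional positions (for example, river miles), which is how the river data is treated here. The binning scheme, the function name, and the toy numbers are assumptions, not the procedure used in the study.

```python
def sample_variogram(positions, values, lag_width, n_lags):
    """Half the mean squared difference of all pairs, grouped by lag bin.

    Bin k holds pairs whose separation lies in [k*lag_width, (k+1)*lag_width).
    Returns a list of (mean_lag, semivariance, n_pairs) for non-empty bins.
    """
    sq_sums = [0.0] * n_lags
    lag_sums = [0.0] * n_lags
    counts = [0] * n_lags
    n = len(positions)
    for i in range(n):
        for j in range(i + 1, n):
            h = abs(positions[i] - positions[j])
            k = int(h // lag_width)
            if k < n_lags:
                sq_sums[k] += (values[i] - values[j]) ** 2
                lag_sums[k] += h
                counts[k] += 1
    result = []
    for k in range(n_lags):
        if counts[k] > 0:
            mean_lag = lag_sums[k] / counts[k]
            gamma_hat = sq_sums[k] / (2 * counts[k])
            result.append((mean_lag, gamma_hat, counts[k]))
    return result


# Toy data: four equally spaced sites.
variogram = sample_variogram([0.0, 1.0, 2.0, 3.0], [1.0, 2.0, 4.0, 7.0],
                             lag_width=1.0, n_lags=3)
```

Each tuple reports the number of pairs N(h) alongside the estimate, since bins with few pairs give unreliable values.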

Sample Variogram Example

Modeling the Variogram
Our goal is to estimate the true variogram of the data. Four variogram models were used to model the sample variogram: the spherical, Gaussian, exponential, and Matérn models.

Variogram Models
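Three of the four models have simple closed forms, sketched below. The parametrization (nugget, partial sill, range parameter) is one common convention and is an assumption; other references scale the exponential and Gaussian range differently, for example so that the parameter equals the practical range. The Matérn model is omitted because it needs a modified Bessel function, which is not in the Python standard library.

```python
import math


def spherical(h, nugget, partial_sill, range_):
    """Spherical model: reaches the sill exactly at h = range_."""
    if h == 0:
        return 0.0
    if h >= range_:
        return nugget + partial_sill
    r = h / range_
    return nugget + partial_sill * (1.5 * r - 0.5 * r ** 3)


def exponential(h, nugget, partial_sill, range_):
    """Exponential model: approaches the sill asymptotically."""
    if h == 0:
        return 0.0
    return nugget + partial_sill * (1.0 - math.exp(-h / range_))


def gaussian(h, nugget, partial_sill, range_):
    """Gaussian model: parabolic near the origin, asymptotic sill."""
    if h == 0:
        return 0.0
    return nugget + partial_sill * (1.0 - math.exp(-(h / range_) ** 2))


# Illustrative parameters: nugget 0.2, partial sill 0.8, range 10.
curve = [spherical(h, 0.2, 0.8, 10.0) for h in (0.0, 5.0, 10.0, 20.0)]
```

In this convention the sill is `nugget + partial_sill`, and all three functions return 0 at lag 0 to match the definition γ(0) = 0.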

The spherical model is fit by least squares. The exponential, Gaussian, and Matérn models are fit by maximum likelihood. The spherical model is fit first to get estimates of the sill, nugget, and range.
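A least-squares fit of the spherical model to sample variogram points can be sketched with a brute-force grid search. This is a stand-in for illustration only: the slides do not say which least-squares algorithm was used, and the grid values, function names, and synthetic "sample" points are all assumptions.

```python
def spherical(h, nugget, partial_sill, range_):
    """Spherical variogram model (sill = nugget + partial_sill)."""
    if h == 0:
        return 0.0
    if h >= range_:
        return nugget + partial_sill
    r = h / range_
    return nugget + partial_sill * (1.5 * r - 0.5 * r ** 3)


def fit_spherical(lags, gammas, nuggets, partial_sills, ranges):
    """Return (sse, nugget, partial_sill, range) minimizing squared error."""
    best = None
    for c0 in nuggets:
        for c in partial_sills:
            for a in ranges:
                sse = sum((g - spherical(h, c0, c, a)) ** 2
                          for h, g in zip(lags, gammas))
                if best is None or sse < best[0]:
                    best = (sse, c0, c, a)
    return best


# Synthetic points generated from a known model, so the fit is checkable.
lags = [5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0]
gammas = [spherical(h, 0.2, 0.6, 30.0) for h in lags]
best = fit_spherical(lags, gammas,
                     nuggets=[0.0, 0.1, 0.2, 0.3],
                     partial_sills=[0.4, 0.6, 0.8],
                     ranges=[20.0, 30.0, 40.0])
```

A real fit would use a finer grid or a numerical optimizer, and often weights each lag by its number of pairs.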

These estimates will be used to fit the other three models. The best model will be the model that minimizes the AICc statistic (the small-sample corrected Akaike information criterion).
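The AICc adds a small-sample correction to the usual AIC. The sketch below computes it from a maximized log-likelihood and picks the smallest value; the log-likelihoods and parameter counts shown are made-up placeholders, not results from the study, and `aicc` is a hypothetical helper name.

```python
def aicc(log_likelihood, n_params, n_obs):
    """Corrected AIC: -2*logL + 2k + 2k(k+1)/(n - k - 1)."""
    if n_obs - n_params - 1 <= 0:
        raise ValueError("AICc needs n_obs > n_params + 1")
    aic = -2.0 * log_likelihood + 2.0 * n_params
    correction = 2.0 * n_params * (n_params + 1) / (n_obs - n_params - 1)
    return aic + correction


# Placeholder (log-likelihood, number of parameters) for each candidate.
candidates = {
    "exponential": (-100.0, 3),
    "gaussian": (-104.0, 3),
    "matern": (-99.5, 4),
}
n_obs = 200
scores = {name: aicc(ll, k, n_obs) for name, (ll, k) in candidates.items()}
best_model = min(scores, key=scores.get)
```

The extra parameter of a richer model is penalized, so a slightly higher likelihood does not automatically win.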

Analysis
The data analyzed is a set of particle size and biological variables for the Ohio River. The data was collected by the Ohio River Valley Water Sanitation Commission, better known as ORSANCO.

ORSANCO data collection

There were between 190 and 235 unique sampling sites, depending on the variable. Some sites had more than one observation. In these situations, the average value for the site was used for analysis.
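Collapsing repeated observations to one value per site can be sketched as below. It assumes each record is a (site, value) pair with the site identified by a single key such as river mile; the record layout and the name `average_by_site` are hypothetical.

```python
from collections import defaultdict


def average_by_site(records):
    """Map each site key to the mean of all values observed there."""
    grouped = defaultdict(list)
    for site, value in records:
        grouped[site].append(value)
    return {site: sum(vals) / len(vals) for site, vals in grouped.items()}


# Made-up records: the site at mile 1.0 was sampled twice.
records = [(1.0, 2.0), (1.0, 4.0), (2.5, 7.0)]
site_means = average_by_site(records)
```

Averaging first guarantees that no pair of observations has zero separation, which would otherwise contribute to the variogram at lag 0.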

Ohio River Sampling Sites

There were two main types of data: particle size data and biological levels. The particle size data measured percent gravel, percent sand, percent fines, percent hardpan, percent boulder, and percent cobble.

The biological data measured
- Number of individuals at a site
- Number of species at a site
- Percent tolerant fish
- Percent simple lithophilic fish (fish that lay eggs on rocks)
- Percent non-native fish
- Percent detritivore fish (fish that eat mostly decomposed plants or animals)
- Percent invertivore fish (fish that eat mostly invertebrate animals)
- Percent piscivore fish (fish that eat mostly other fish)

The results of the analysis fell into three main groups:
- Sample variogram fit well
- Sample variogram did not fit well
- Analysis not reasonable

Good Results: Number of Individuals at a Site
- Skewness coefficient of the data is [value missing]. This is much too high.
- The data is transformed using the natural logarithm.
- New skewness coefficient is reduced to 0.56. Not perfect, but much less skewed.

Check Normality of log(Num Individuals)

Check Second-Order Stationarity of log(Num Individuals)

Check for outliers of log(Num Individuals)

There are a number of outliers for the transformed variable. We should do the analysis with and without the outliers present.

log(Num Individuals) Sample Variogram with outliers

Check normality of log(Num Individuals) without outliers

log(Num Individuals) Sample Variogram without outliers

We were not able to model the sample variogram perfectly, but we were able to detect some amount of spatial correlation in the data, especially when the outliers were removed. For the transformed variable without outliers, the exponential model estimated the nugget to be 0.20, the sill to be 0.2709, and the range to be 37.7 miles.

Poor Results: Percent Sand
The skewness coefficient is only 0.18, so skewness is not a major factor. Check second-order stationarity using a scatter plot.

Check Stationarity of Percent Sand

There appears to be a trend in the data. After removing the trend, the data appears to be second-order stationary. The residuals are also approximately normal.

Check stationarity of percent sand residuals

Check normality of percent sand residuals

Sample Variogram of percent sand residuals

The sample variogram does not really increase monotonically with distance. Our variogram models cannot fit this very well. Though we can obtain estimates of the nugget, sill, and range, the estimates cannot be trusted.

No Results: Percent Hardpan
This variable was so badly skewed that analysis was not reasonable. The skewness coefficient is [value missing]. This is extremely high.

QQplot of Percent Hardpan

Scatter plot of Percent Hardpan

The data is nearly all zeros! There is also an erroneous data value. A percentage cannot be greater than 100%. Data analysis does not seem reasonable. Our data does not meet the conditions necessary to use the spatial methods discussed.

Conclusions
- Able to fit sample variogram reasonably well: percent gravel, number of individuals, number of species
- Not able to fit sample variogram well: percent sand, percent detritivore, percent simple lithophilic individuals, percent invertivore
- No results: remaining variables

Summary of Results

Future Work
Data set involving three streams in Norfolk, Virginia. Each stream has 25 observations. Collected by researchers at Old Dominion University.
Difficulties to overcome:
- What is the best way to measure distance between points?
- Few observations
- Overlapping points after coordinate conversion

Problem: What is the best way to measure distance between points? There is some aspect of two-dimensionality to the data, but it is still really a one-dimensional problem.
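One candidate answer is to measure distance along the stream rather than in a straight line. The sketch below compares the two for points ordered along a channel. It is only an illustration of the question, not the method the author settled on, and it assumes planar coordinates in consistent units and sites already sorted from upstream to downstream.

```python
import math


def cumulative_distance(points):
    """Running along-path distance for an ordered list of (x, y) points."""
    totals = [0.0]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        totals.append(totals[-1] + math.hypot(x1 - x0, y1 - y0))
    return totals


# Made-up sites on a stream that bends.
sites = [(0.0, 0.0), (3.0, 4.0), (3.0, 10.0)]
along = cumulative_distance(sites)
along_stream = along[-1] - along[0]
straight_line = math.hypot(sites[-1][0] - sites[0][0],
                           sites[-1][1] - sites[0][1])
```

Here the along-stream separation of the first and last site is 11 units while the straight-line separation is about 10.44, and the gap grows as the channel meanders more.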

Paradise Creek Region of Interest

Paradise Creek Sampling Sites

Problem: 25 observations per stream is considered the minimum number of points to create a variogram
- The sample variogram will be very rough
- Our variogram model estimates will probably be bad
To correct this, we will explore the possibility of combining the data from the three streams.

Problem: Overlapping points after conversion
- Original data is in longitude/latitude coordinates
- Convert to UTM coordinates so that Euclidean distance makes sense
- Converted UTM coordinates often result in overlapping sites (and even fewer unique sampling sites)

Stream Sampling Sites (Lat/Long)

Stream Sampling Sites (UTM)

Stream Sampling Sites (Lat/Long)

Stream Sampling Sites (UTM)

Acknowledgments
- My committee: Dr. Urquhart, Dr. Wang, and Dr. Theobald
- Dr. Davis and Dr. Reich for answering my spatial questions and letting me use their S-Plus spatial library

Concluding Thought
"Before you criticize someone, you should walk a mile in their shoes. That way, when you criticize them, you're a mile away and you have their shoes." - Jack Handey