Assessing the quality of spatial predictions Xiaogang (Marshall) Ma School of Science Rensselaer Polytechnic Institute Tuesday, Mar 26, 2013 GIS in the.

Presentation transcript:

Assessing the quality of spatial predictions Xiaogang (Marshall) Ma School of Science Rensselaer Polytechnic Institute Tuesday, Mar 26, 2013 GIS in the Sciences ERTH 4750 (38031)

Acknowledgements
This lecture is partly based on:
–Rossiter, D.G., Assessing the quality of spatial predictions. Lecture in distance course Applied Geostatistics. ITC, University of Twente

Contents
1. Assessment of model quality: overview
2. Model validation with an independent data set
3. Cross-validation
4. Kriging prediction variance
5. Spatial simulation

1 Assessment of model quality
With any predictive method, we would like to know how good it is. This is model validation.
–cf. model calibration, when we are building (fitting) the model.

Internal vs. external quality assessment

Prediction error

Residuals from validation and their location; Jura cobalt

2 Validation with an independent dataset
An excellent check on the quality of any model is to compare its predictions with actual data values from an independent data set.
–Advantages: objective measure of quality
–Disadvantages: requires more samples; not all samples can be used for modeling (→ poorer calibration?)

Selecting the validation data set
The validation statistics presented next apply to the validation set. It must be a representative and unbiased sample of the population for which we want these statistics. Two methods:
1. Completely independent, according to a sampling plan.
–This can be from a different population than the calibration sample: we are then testing the applicability of the fitted model to a separate target population.
2. A representative subset of the original sample.
–A random splitting of the original sample.
–This validates the model for the population from which the sample was drawn, but only if the original sample was unbiased.
–If the original sample was taken to emphasize certain areas of interest, the statistics do not summarize the validity in the whole study area.

Measures of validity

Relative measures of validity
The MPE and RMSE are expressed in the original units of the target variable, as absolute differences. Their magnitude can be judged by absolute criteria, but it is also relevant to compare them to the dataset itself:
–MPE compared to the mean or median
–RMSE compared to the range, inter-quartile range, or standard deviation
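As a rough illustration of these measures, the sketch below computes the MPE and RMSE of a validation set and scales them by the mean, standard deviation, and range of the observed values. The sign convention (error = predicted minus observed), the function name `validation_stats`, and the numbers are assumptions made for this example; they are not taken from the lecture.

```python
import math
import statistics

def validation_stats(observed, predicted):
    """Return MPE (bias), RMSE, and versions scaled by the observed data."""
    # Assumed convention: error = predicted - observed.
    errors = [p - o for o, p in zip(observed, predicted)]
    mpe = statistics.fmean(errors)
    rmse = math.sqrt(statistics.fmean([e * e for e in errors]))
    return {
        "mpe": mpe,
        "rmse": rmse,
        "mpe_over_mean": mpe / statistics.fmean(observed),
        "rmse_over_sd": rmse / statistics.stdev(observed),
        "rmse_over_range": rmse / (max(observed) - min(observed)),
    }

# Made-up values, for illustration only.
observed = [9.0, 11.0, 10.0, 14.0, 6.0]
predicted = [10.0, 10.0, 11.0, 12.0, 7.0]
stats = validation_stats(observed, predicted)
print(stats)
```

In this toy case the errors cancel, so the bias is zero even though the RMSE is not: the two statistics answer different questions and are best reported together.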

Model efficiency

3 Cross-validation
If we don't have an independent data set to evaluate a model, we can reuse the same sample points that were used to estimate the model to validate that same model. This seems a bit dubious, but with enough points, the effect of the removed point on the model (which was estimated using that point) is minor.
Note: This is not legitimate for non-geostatistical models, because there is no theory of spatial correlation.

Cross-validation procedure
1. Compute the experimental variogram with all sample points in the normal way; model it to get a parameterized variogram model;
2. For each sample point:
–Remove the point from the sample set;
–Predict at that point using the other points and the modeled variogram;
3. Summarize the deviations of the predictions from the actual values at the sample points.
–This is called leave-one-out cross-validation (LOOCV).
–Models can then be compared by their summary statistics, and also by looking at individual predictions of interest.
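The procedure above can be sketched in a few lines. This is a minimal illustration, not the gstat implementation: it assumes an exponential variogram whose parameters are already fitted (the values of `nugget`, `psill` and `rng` are invented), uses all remaining points for each prediction with no search neighbourhood, and runs on made-up coordinates and values. The helper names `variogram`, `solve`, `ordinary_kriging` and `loocv` are hypothetical.

```python
import math

def variogram(h, nugget=0.0, psill=1.0, rng=1.0):
    """Exponential variogram model; gamma(0) is 0 by definition."""
    if h == 0.0:
        return 0.0
    return nugget + psill * (1.0 - math.exp(-h / rng))

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def ordinary_kriging(points, values, target, **model):
    """Return (prediction, kriging variance) at target by ordinary kriging."""
    n = len(points)
    a = [[variogram(math.dist(p, q), **model) for q in points] + [1.0]
         for p in points]
    a.append([1.0] * n + [0.0])          # unbiasedness: weights sum to 1
    b = [variogram(math.dist(p, target), **model) for p in points] + [1.0]
    sol = solve(a, b)
    weights, lagrange = sol[:n], sol[n]
    pred = sum(w * v for w, v in zip(weights, values))
    var = sum(w * g for w, g in zip(weights, b[:n])) + lagrange
    return pred, var

def loocv(points, values, **model):
    """Leave each point out in turn; return residuals (predicted - actual)."""
    residuals = []
    for i in range(len(points)):
        rest_p = points[:i] + points[i + 1:]
        rest_v = values[:i] + values[i + 1:]
        pred, _ = ordinary_kriging(rest_p, rest_v, points[i], **model)
        residuals.append(pred - values[i])
    return residuals

# Made-up sample, for illustration only.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5), (2.0, 1.0)]
values = [9.0, 11.0, 10.0, 13.0, 10.5, 14.0]
residuals = loocv(points, values, nugget=0.1, psill=1.0, rng=2.0)
rmse = math.sqrt(sum(r * r for r in residuals) / len(residuals))
print(residuals, rmse)
```

Note that the variogram model is held fixed while points are left out, as in step 1 of the procedure; only the kriging system is rebuilt for each prediction.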

Summary statistics for cross-validation (1)
Two are the same as for independent validation and are computed in the same way:
–Root Mean Square Error (RMSE): lower is better
–Bias or mean error (MPE): should be 0

Summary statistics for cross-validation (2)

Residuals from cross-validation and their location; Jura cobalt

4 Kriging prediction variance
Recall from last lecture that kriging is “optimal” with respect to a given model of spatial dependence, because the kriging equations minimize the prediction variance at each point to be predicted. This is an internal measure of quality, because there is no independent dataset.
–Advantage: gives a measure of quality at all points
–Disadvantage: depends on the correctness of the variogram model
This variance is presumed to come from normally distributed errors, so we can use it to compute confidence intervals or threshold probabilities. This is quite useful in risk assessment.
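Under the normality assumption just stated, a kriging prediction and its variance are enough to compute both quantities. The sketch below is illustrative only: the prediction (10.0), the variance (4.0), the threshold (12.0) and the function names are invented for the example.

```python
from statistics import NormalDist

def prediction_interval(pred, krig_var, level=0.95):
    """Two-sided interval around a kriging prediction, assuming normal errors."""
    sd = krig_var ** 0.5
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return pred - z * sd, pred + z * sd

def prob_exceeds(pred, krig_var, threshold):
    """Probability that the true value exceeds a threshold, assuming normal errors."""
    return 1.0 - NormalDist(mu=pred, sigma=krig_var ** 0.5).cdf(threshold)

# Hypothetical prediction and kriging variance at one location.
low, high = prediction_interval(10.0, 4.0)
p = prob_exceeds(10.0, 4.0, 12.0)
print(low, high, p)
```

Both results inherit the disadvantage noted above: if the variogram model is wrong, the variance, and therefore the interval and the probability, are wrong too.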

Confidence interval

Confidence intervals of OK

5 Spatial simulation
Simulation is the process or result of representing what reality might look like, given a model.
–In geostatistics, this reality is usually a spatial distribution (map).

What is stochastic simulation?
“Simulation” is a general term for studying a system without physically implementing it.
“Stochastic” simulation means that there is a random component to the simulation model: quantified uncertainty is included so that each simulation is different.
Non-spatial example: planning the number and timing of clerks in a new branch bank; customer behavior (arrival times, transaction length) is stochastic and represented by probability distributions.
Reference for spatial simulation: Goovaerts, P., Geostatistics for natural resources evaluation. Applied Geostatistics Series. Oxford University Press, New York; Chapter 8.

Why spatial simulation?
Recall: the theory of regionalized variables assumes that the values we observe come from some random process; in the simplest case, with one expected value (first-order stationarity) with a spatially-correlated error that is the same over the whole area (second-order stationarity).
So we'd like to see “alternative realities”; that is, spatial patterns that, by this theory, could have occurred in some “parallel universe” (i.e. another realization of the spatial process).
In addition, kriging maps are unrealistically smooth, especially in areas with low sampling density.
–Even if there is a high nugget effect in the variogram, this variability is not reflected in adjacent prediction points, since they are estimated from almost the same data.

When must simulation be used?
Goovaerts: “Smooth interpolated maps should not be used for applications sensitive to the presence of extreme values and their patterns of continuity.” (p. 370)
–Example: ground water travel time depends on sequences of large or small values (“critical paths”), not just on individual values.

Local uncertainty vs. spatial uncertainty
Recall: kriging prediction also provides a prediction error; this is the BLUP and its error for each prediction location separately. So, at each prediction location we obtain a probability distribution of the prediction, a measure of its uncertainty. This is fine for evaluating each prediction individually.
But it is not valid for evaluating the set of predictions! Errors are by definition spatially-correlated (as shown by the fitted variogram model), so we can't simulate the error in a field by simulating the error at each point separately.
Spatial uncertainty is a representation of the error over the entire field of prediction locations at the same time.

Practical applications of spatial simulation
If the distribution of the target variable(s) over the study area is to be used as input to a model, then the uncertainty is represented by a number of simulations. Procedure:
1. Simulate a “large” number of realizations of the spatial field
2. Run the model on each simulation
3. Summarize the output of the different model runs
The statistics of the output give a direct measure of the uncertainty of the model in the light of the sample and the model of spatial variability.
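Steps 2 and 3 of this procedure can be sketched as follows. The realizations here are tiny hard-coded lists standing in for the output of a real geostatistical simulator (step 1), and the "model" is a deliberately trivial one, the fraction of cells above a threshold; both, along with the names `propagate` and `fraction_above`, are assumptions for illustration.

```python
import statistics

def propagate(realizations, model):
    """Run the model on every realization and summarize its outputs."""
    outputs = sorted(model(field) for field in realizations)
    return {
        "mean": statistics.fmean(outputs),
        "sd": statistics.stdev(outputs),
        "min": outputs[0],
        "max": outputs[-1],
    }

def fraction_above(field, threshold=12.0):
    """Toy model: share of cells whose value exceeds a threshold."""
    return sum(v > threshold for v in field) / len(field)

# Stand-ins for simulated fields; a real study would use many more
# realizations produced by conditional simulation.
realizations = [
    [9.0, 13.0, 12.5, 8.0],
    [11.0, 12.2, 14.0, 7.5],
    [10.0, 11.0, 13.5, 9.0],
]
summary = propagate(realizations, fraction_above)
print(summary)
```

The spread of the model outputs across realizations, not the spread of values within any single map, is what expresses the uncertainty of the model result.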

Conditional simulation
This simulates the field, while respecting the sample. The simulated maps look more like the best (kriging) prediction, but are usually much more spatially variable (depending on the magnitude of the nugget). These can be used as inputs into spatially-explicit models, e.g. hydrology.

What is preserved in conditional simulation?
–Mean over field
–Covariance structure
–Observations (sample points are predicted exactly)
See figures on the next page. The OK prediction is then reproduced for comparison.

Conditional simulations: same field, different realizations
Jura Co concentration; known points over-printed as post-plot
–The “patchiness” and “graininess” of all realizations is similar. This is because they all use the same model of spatial dependence.
–The overall pattern of high and low patches is the same. This is because they use the same observation points for conditioning.
–The detailed local pattern is quite different, especially away from clusters of sample points. This is because the simulation has freedom to choose values as long as the covariance structure and sample values are respected.

OK prediction
Compare the conditional simulations with the single “best” prediction made by OK: The conditional simulations are (realistically) “grainy”; the OK prediction is (unrealistically) smooth.

Unconditional simulation
In unconditional simulation, we simulate the field with no reference to the actual sample, i.e. the data we have. (It's only one realization, no more valid than any other.) This is used to visualize a random field as modeled by a variogram, not for prediction.

What is preserved in unconditional simulation?
–Mean over field
–Covariance structure
See figure on the next page. Note the similar degree of spatial continuity, but with no regard to the values in the sample.

Unconditional simulations: same field, different realizations
Model based on variogram analysis of Jura Co concentration

Unconditional simulation: increasing nugget (figure panels: variogram models; simulated fields)

Unconditional simulation: different models (figure panels: variogram models; simulated fields)

Simulation algorithm
There are several ways to simulate; see Emery, X. (2008). Statistical tests for validating geostatistical simulation algorithms. Computers & Geosciences, 34(11), doi: /j.cageo
One algorithm is sequential simulation as used in the gstat package; in simplified form:
1. If conditional, place the data on the prediction grid
2. Pick a random unknown point; make a kriging prediction, along with its prediction variance
3. Assuming a normally-distributed prediction error with this variance, simulate one value from it; add it to the kriging prediction and place the result at the previously-unknown point
4. This point is now considered “known”; repeat steps (2)-(3) until no more points are left to predict
Pebesma, E. J., & Wesseling, C. G. (1998). Gstat: a program for geostatistical modelling, prediction and simulation. Computers & Geosciences, 24(1),
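The four steps can be sketched on a one-dimensional transect. This is a simplified illustration and not gstat's code: it handles only the conditional case, uses ordinary kriging with every known point rather than a search neighbourhood, and assumes an exponential variogram with invented parameters (`psill=1.0`, `rng=3.0`). The conditioning data and all function names are hypothetical.

```python
import math
import random

def variogram(h, psill=1.0, rng=3.0):
    """Exponential variogram with zero nugget (assumed parameters)."""
    return 0.0 if h == 0.0 else psill * (1.0 - math.exp(-h / rng))

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def krige(xs, zs, x0):
    """Ordinary kriging on a line: return (prediction, variance) at x0."""
    n = len(xs)
    a = [[variogram(abs(p - q)) for q in xs] + [1.0] for p in xs]
    a.append([1.0] * n + [0.0])
    b = [variogram(abs(p - x0)) for p in xs] + [1.0]
    sol = solve(a, b)
    pred = sum(w * z for w, z in zip(sol[:n], zs))
    var = sum(w * g for w, g in zip(sol[:n], b[:n])) + sol[n]
    return pred, max(var, 0.0)   # guard against tiny negative round-off

def sequential_simulation(n_nodes, data, seed=None):
    """One conditional realization on grid nodes 0 .. n_nodes - 1."""
    gen = random.Random(seed)
    known = dict(data)                                   # step 1: place the data
    unknown = [i for i in range(n_nodes) if i not in known]
    gen.shuffle(unknown)                                 # step 2: random visiting order
    for i in unknown:
        xs = [float(j) for j in known]
        zs = list(known.values())
        pred, var = krige(xs, zs, float(i))
        known[i] = gen.gauss(pred, math.sqrt(var))       # step 3: draw a value
    return [known[i] for i in range(n_nodes)]            # step 4: all nodes known

# Made-up conditioning data: grid index -> observed value.
data = {0: 9.0, 7: 12.0, 14: 10.5}
realization = sequential_simulation(15, data, seed=1)
print(realization)
```

Because each simulated value is added to the known set before the next node is visited, later draws are correlated with earlier ones; changing the seed gives a different realization that still passes through the same data.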

Reading for this week
–Data Quality in GIS
–Handouts: Formulas and key ideas card from the book ‘Introduction to the Practice of Statistics’ (digital copies available from instructor and TA via request)

Next classes
Friday class:
–Assessing the quality of spatial predictions with R
In preparation:
–Next Tuesday (Apr. 2): Interfacing R spatial statistics with GIS