Introduction to Spatial Regression Glen Johnson, PhD Lehman College / CUNY School of Public Health



Typical scenario: We have a health outcome and covariables aggregated at a common geographic level, such as counties, census tracts, ZIP codes … and we want to measure the association between the outcome and the covariables. The specific question is: are there variables that co-vary spatially with the outcome variable?

[Figure: maps combined as an equation. Lung Cancer Rates (observed) = Benzene in ambient air + Smoking rate + … + ? + residual]

Consider the linear model: $y = X\beta + \varepsilon$, with $\varepsilon \sim N(0, \sigma^2 I)$. This is the point of departure.

When applying regression modeling to spatial units that are connected in space (lattice data), the critical assumption that residuals are independently distributed with constant variance is typically violated.

Tobler's First Law of Geography: things closer in space tend to be more similar than things farther apart.

When we model the expected value, E[y], as a function of spatially varying covariates, it is possible that the covariates explain all of the spatial variation in the observed response, y, leaving uncorrelated residuals. When this is not the case, as is typical, the assumption of iid residuals is violated and we will obtain biased estimates of the variance, typically biased downward. This leads us to underestimate our standard errors and to conclude that some covariates are significant when in fact they are not.

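Tests for spatial autocorrelation should be applied to the residuals if a "conventional" regression model is applied. This may be done with various software packages or GIS add-ons. A common statistic is Moran's I, which equals

$$I = \frac{n}{\sum_i \sum_j w_{ij}} \cdot \frac{\sum_i \sum_j w_{ij}\,(e_i - \bar{e})(e_j - \bar{e})}{\sum_i (e_i - \bar{e})^2}$$

where $e_i$ is the residual for area i, $\bar{e}$ is the mean residual, and $w_{ij}$ is the spatial weight between areas i and j.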
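As a rough illustration of the statistic, the sketch below computes Moran's I in plain Python using its standard definition. The function name `morans_i`, the four-area weights matrix `W`, and the residual values are all hypothetical and chosen only for illustration; in practice the weights would come from the actual adjacency of the spatial units.

```python
def morans_i(values, weights):
    """Moran's I for a list of values and a square spatial weights matrix.

    values  -- one number per area (e.g. regression residuals)
    weights -- weights[i][j] is the spatial weight between areas i and j
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]            # deviations from the mean
    w_sum = sum(sum(row) for row in weights)    # sum of all weights
    # Weighted cross-products of deviations over all pairs of areas
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    ss = sum(d * d for d in dev)                # sum of squared deviations
    return (n / w_sum) * (cross / ss)


# Hypothetical layout: four areas in a row; neighbors share a border.
W = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

trend = [1.0, 2.0, 3.0, 4.0]            # similar values sit next to each other
alternating = [1.0, -1.0, 1.0, -1.0]    # neighbors are as different as possible

print(morans_i(trend, W))
print(morans_i(alternating, W))
```

With this toy layout the smooth trend gives a positive value (about 0.33) and the alternating pattern gives -1, which matches the usual reading of positive versus negative spatial autocorrelation. This sketch gives only the statistic; a significance test would need a permutation or normal-approximation step on top.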

When residual spatial autocorrelation is present, several approaches may be taken to adjust for it. The simplest is to add a fixed-effect dummy variable that allows the model intercept to change with spatial location. For example, an adjustment is made that depends on the county of membership. This is essentially stratifying the analysis by location and can be done with any statistical software.

Since spatial location is a proxy for unobserved, randomly varying covariables, it is more correctly treated as a random effect in a mixed-effects model, such as a model with a location-specific random intercept, which can be fit by pseudo-likelihood methods using software like PROC GLIMMIX or PROC MIXED in SAS, or R with an appropriate library (?)

Illustration: Community Teen Pregnancy Rates vs. Socioeconomic Status and Demographic Composition

For each ZIP code:
- Response (i.e. Teen Pregnancy cases)
- Predictors:
  - % pop. > age 24 w/ 4-year or greater college degree
  - % single-parent households out of households w/ at least one child < 18 years old
  - % of tot. pop. that is Black Alone
  - % of tot. pop. that is Hispanic, regardless of race
  - % of tot. pop. that is a foreign-born naturalized citizen
  - % of tot. pop. with income below poverty
- Population at Risk
- County (crude indicator of neighborhood effect)

...

The Model …

For i = 1, …, n ZIP codes, let
- $y_i$ = observed caseload
- $n_i$ = population at risk
- $\{x_1, \dots, x_p\}_i$ = community predictors
- $\{\beta_1, \dots, \beta_p\}$ = coefficients
- $L_i$ = location effect, arising from a random process such that $L_i \sim N(0, \sigma_L^2)$

Then the expected value of $y_i$, given $\{x_1, \dots, x_p, L\}_i$, is

$$E[y_i \mid \{x_1, \dots, x_p, L\}_i] = n_i \exp(\beta_1 x_{1i} + \dots + \beta_p x_{pi} + L_i)$$
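To make the expected-value formula concrete, here is a minimal sketch that evaluates n_i * exp(beta_1 x_1i + … + beta_p x_pi + L_i) for one ZIP code. The function name `expected_caseload` and every number in the example are hypothetical, not values from the teen pregnancy analysis.

```python
import math


def expected_caseload(n_at_risk, x, beta, location_effect):
    """Expected count for one area under the log-linear model above.

    n_at_risk       -- population at risk (n_i)
    x               -- list of community predictors for this area
    beta            -- list of coefficients, same length as x
    location_effect -- random location effect L_i for this area
    """
    linear_predictor = sum(b * xi for b, xi in zip(beta, x)) + location_effect
    return n_at_risk * math.exp(linear_predictor)


# Hypothetical area: 1,000 people at risk, two predictors.
n_i = 1000
x_i = [1.0, 20.0]         # illustrative predictor values only
beta = [-3.0, -0.02]      # illustrative coefficients only

without_location = expected_caseload(n_i, x_i, beta, 0.0)
with_location = expected_caseload(n_i, x_i, beta, 0.1)

# The location effect acts multiplicatively on the expected count.
print(without_location, with_location, with_location / without_location)
```

Because the location effect sits inside the exponent, a value of L_i = 0.1 multiplies the expected caseload by exp(0.1), whatever the covariates are.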

Values for the unknown parameters $\{\beta_1, \dots, \beta_p, \sigma_L^2\}$ are estimated with SAS PROC GLIMMIX, assuming $y_i$ arose from a Poisson random process, conditional on location … thus allowing risk-adjusted estimates of caseload for each ZIP code.

Incorporating the "location effect"
- adjusts for unidentified covariables that co-vary spatially with the response, thus reducing residual spatial autocorrelation and potential confounding
- also provides a "smoothing" effect, in that the predicted caseload is adjusted towards a common local value

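Teen Pregnancy Association with Select Covariables

| coefficient name | No Spatial Effect: estimate | p-value | with Spatial Effect: estimate | p-value |
|---|---|---|---|---|
| intercept | -3.423 | < … | … | < … |
| % adults w/ Bachelors | -0.016 | < … | … | < … |
| % Black Alone | 0.008 | < … | … | < … |
| % Hispanic | 0.009 | < … | … | < … |
| % Foreign Born | … | … | … | … |
| % single-parent households | 0.04 | < … | … | < … |

| model parameters | No Spatial Effect | with Spatial Effect |
|---|---|---|
| scale | … | … |
| chi-square / d.f. | … | … |
| log likelihood | … | … |
| Residual Spatial Autocorrelation (Moran's I) | … | … |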

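Other approaches include …

A spatial lag model, where

$$y_i = \rho \sum_j w_{ij}\, y_j + \beta_1 x_{1i} + \dots + \beta_p x_{pi} + \varepsilon_i$$

and a spatial error model, where

$$y_i = \beta_1 x_{1i} + \dots + \beta_p x_{pi} + u_i, \qquad u_i = \rho \sum_j w_{ij}\, u_j + \varepsilon_i$$

for spatial weights $w_{ij}$ and a spatial autoregressive coefficient ρ. These two models differ by whether the adjustment is made by a weighted sum of the response variable or of the residuals.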
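The difference between the two models can be sketched by simulating from each, assuming the standard forms y = ρWy + Xβ + ε (spatial lag) and y = Xβ + u with u = ρWu + ε (spatial error). The helper `solve_autoregressive`, the row-standardized weights `W`, and all numeric values are hypothetical. The helper uses simple fixed-point iteration, which converges here because W is row-standardized and |ρ| < 1; this is a simulation sketch, not an estimation routine such as the one GeoDa provides.

```python
import random


def solve_autoregressive(W, c, rho, n_iter=200):
    """Solve v = rho * W v + c by fixed-point iteration.

    Assumes W is row-standardized (rows sum to 1) and |rho| < 1,
    so that each pass shrinks the remaining error.
    """
    n = len(c)
    v = list(c)
    for _ in range(n_iter):
        v = [rho * sum(W[i][j] * v[j] for j in range(n)) + c[i]
             for i in range(n)]
    return v


# Hypothetical layout: four areas in a row, row-standardized weights.
W = [
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.0],
]
rho = 0.5
xb = [1.0, 2.0, 3.0, 4.0]              # illustrative covariate part (X beta)
rng = random.Random(0)
eps = [rng.gauss(0.0, 1.0) for _ in xb]  # independent noise
n = len(xb)

# Spatial lag: the weighted sum of neighboring RESPONSES enters the model.
y_lag = solve_autoregressive(W, [xb[i] + eps[i] for i in range(n)], rho)

# Spatial error: the weighted sum of neighboring ERRORS enters the model.
u = solve_autoregressive(W, eps, rho)
y_err = [xb[i] + u[i] for i in range(n)]

print(y_lag)
print(y_err)
```

In the lag version the covariate effect of one area spills over into its neighbors' responses, while in the error version only the unexplained part is spatially correlated.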

The spatial lag and spatial error models can be fit in GeoDa, a simple, well-supported freeware package found at … but only for Gaussian responses. For generalized linear models (e.g. Poisson and logistic regression), see R with appropriate libraries.

Another approach is hierarchical modelling, which treats the response as conditional on the weighted average of local neighborhood errors.

Frequentist solutions exist, but these hierarchical models lend themselves well to a fully Bayesian solution, as used by many geographic epidemiologists.

Main advantages include
* flexibility offered by Generalized Linear Mixed Models
* obtaining the full distribution of possible outcomes
  - allows many ways to view the outcome (mean, median, percentiles)
  - inference based on actual probability distributions, instead of confidence intervals

The main limitation is the level of conceptual difficulty; however, implementation is accessible through free software … WinBUGS (Bayesian inference Using Gibbs Sampling)

A Hierarchical Model

The response is distributed conditionally on location, such that … Focus on the random effect that captures local spatial autocorrelation.

A Directed Acyclic Graph of the Bayesian Model

Gibbs sampling basic procedure
- All stochastic parameters in the model are assigned an initial value (somewhat arbitrarily).
- The values for each parameter are updated by random simulation from a conditional probability distribution, given all other parameters in the model.
- After all terms have been updated, completing one cycle (of what is called a Markov Chain), the cycle is repeated.
- After many iterations, the simulated values for each term converge to a stationary posterior distribution (further iterations don't change the distribution).

Estimation and inference can then be made from these posterior distributions. For example, a simulated sample of 1000 fitted SIR values ($\mu_i / E_i$) can be used to yield a point estimate (typically the median) and an interval estimate, such as the 95 %-tile range (credible set).
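The procedure above can be sketched on a much simpler model than the hierarchical spatial one: normally distributed data with unknown mean and precision, under conjugate Normal and Gamma priors. This toy model, the function name `gibbs_normal`, the prior settings, and the data are all assumptions made for illustration; it is not the WinBUGS model used in the prostate cancer analysis.

```python
import random
import statistics


def gibbs_normal(y, n_iter=5000, burn_in=1000, seed=1,
                 mu0=0.0, tau0=1e-4, a=0.01, b=0.01):
    """Gibbs sampler for y_k ~ Normal(mu, 1/tau).

    Priors (assumed for this sketch): mu ~ Normal(mu0, 1/tau0),
    tau ~ Gamma(shape=a, rate=b). Returns the retained draws of mu.
    """
    rng = random.Random(seed)
    n = len(y)
    y_sum = sum(y)
    mu, tau = 0.0, 1.0                  # arbitrary initial values
    mu_draws = []
    for it in range(n_iter):
        # Update mu from its conditional distribution given tau and the data
        prec = tau0 + n * tau
        mean = (tau0 * mu0 + tau * y_sum) / prec
        mu = rng.gauss(mean, prec ** -0.5)
        # Update tau from its conditional distribution given mu and the data
        sse = sum((v - mu) ** 2 for v in y)
        # random.gammavariate takes (shape, scale), so scale = 1 / rate
        tau = rng.gammavariate(a + n / 2.0, 1.0 / (b + sse / 2.0))
        if it >= burn_in:               # discard early, pre-convergence cycles
            mu_draws.append(mu)
    return mu_draws


# Hypothetical data, for illustration only.
data = [4.8, 5.1, 5.3, 4.9, 5.6, 5.0, 4.7, 5.2]
mu_draws = gibbs_normal(data)

# Summarize the simulated posterior: median plus 5th and 95th percentiles.
ordered = sorted(mu_draws)
median = statistics.median(ordered)
lower = ordered[int(0.05 * len(ordered))]
upper = ordered[int(0.95 * len(ordered))]
print(median, lower, upper)
```

The point estimate and interval are read directly off the simulated draws, which is the same summary step described for the fitted SIR values.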

[Figure panels: 50th percentile, 5th percentile, 95th percentile]

An illustration for geospatial analysis of prostate cancer incidence in New York State, USA …

Prostate Cancer Incidence by ZIP Code, Adjusted for Age and Race, New York State

Example Output: Posterior Kernel Densities of Prostate Cancer Incidence ('94-'98) for Some Manhattan ZIP Codes

Some references
- Waller, L.A. and Gotway, C.A. Applied Spatial Statistics for Public Health Data. Wiley. 494 pp.
- Johnson, G.D. Smoothing Small Area Maps of Prostate Cancer Incidence in New York State (USA) using Fully Bayesian Hierarchical Modelling. International Journal of Health Geographics 2004, 3:29.
- Elliott, P., Wakefield, J.C., Best, N.G. and Briggs, D.J. Spatial Epidemiology: Methods and Applications. Oxford. 475 pp.
- Statistics in Medicine Vol. 19 (special issue on disease mapping).
- Lawson, A. et al. Disease Mapping and Risk Assessment for Public Health. Wiley. 482 pp.

Method and Software Sources
- GeoDa (with links to R and R-Geo)
- WinBUGS for Bayesian modeling

Both of these free software packages are supported by large international communities with active listservs.