Lecture 5 : Spatial Regression Pat Browne

Spatial Statistics and Spatial Knowledge Discovery. First law of geography [Tobler]: Everything is related to everything else, but nearby things are more related than distant things. Drowning in Data yet Starving for Knowledge [Naisbitt, Rogers]. The earliest use of this quote is by John Naisbitt, who wrote in his 1982 book Megatrends: "we are drowning in information, but we are starved for knowledge". In 1985, Rutherford D. Rogers, a librarian at Yale, was quoted in the New York Times: "We're drowning in information and starving for knowledge."

Standard statistical concepts: Regression
Regression: takes a numerical dataset and develops a mathematical formula that fits the data. The results can be used to predict future behaviour. Works well with continuous quantitative data like weight, speed or age. Not good for categorical data where order is not significant, like colour, name, gender, nest/no nest. Example: plotting snowfall against height above sea level.

Y = A + BX. The response variable is Y, and X is the continuous explanatory variable. Parameter A is the intercept. Parameter B is the slope. The difference between each data point and the value predicted by the line (the model) is called a residual.

The regression equation can be given as: zi = β0 + β1 yi, where zi is the predicted value, β0 is the intercept and β1 is the slope coefficient.

Alternative notation for the linear regression equation: Y = a + bX, where Y is the dependent variable, a is the intercept, b is the slope or regression coefficient, and X is the independent variable (or covariate).
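As a minimal sketch of fitting Y = a + bX by ordinary least squares, the example below uses the snowfall-against-height case mentioned earlier. The data values are invented for illustration, and the function name `fit_line` and the use of NumPy are assumptions, not part of the lecture.

```python
import numpy as np

# Hypothetical data: height above sea level (m) and snowfall (cm).
height = np.array([100.0, 200.0, 300.0, 400.0, 500.0])
snowfall = np.array([12.0, 19.0, 31.0, 42.0, 48.0])


def fit_line(x, y):
    """Ordinary least squares estimates of a (intercept) and b (slope) in Y = a + bX."""
    x_dev = x - x.mean()
    y_dev = y - y.mean()
    b = np.sum(x_dev * y_dev) / np.sum(x_dev ** 2)
    a = y.mean() - b * x.mean()
    return a, b


a, b = fit_line(height, snowfall)
predicted = a + b * height
# A residual is the observed value minus the value predicted by the line.
residuals = snowfall - predicted
print(f"intercept a = {a:.3f}, slope b = {b:.3f}")
print("residuals:", np.round(residuals, 2))
```

With an intercept in the model, the least squares residuals sum to zero, which is a quick sanity check on any fit.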

Standard statistical concepts: Null hypothesis
The null hypothesis, H0, represents a theory that has been put forward because it is believed to be true, but has not been proved. For example, in a clinical trial of a new drug, the null hypothesis might be that the new drug is no better, on average, than the current drug. H0: there is no difference between the two drugs on average. In general, the null hypothesis for spatial data is that either the features themselves or the values associated with those features are randomly distributed (e.g. no spatial pattern or bias).

Relation of i.i.d., regression, and correlation with spatial phenomena.
The first law of geography according to Waldo Tobler is "Everything is related to everything else, but near things are more related than distant things." In statistical terms this is called autocorrelation. Because the traditional i.i.d. assumption is not valid for spatially dependent variables (e.g. temperature or crime rate), we need special techniques to handle this type of data (e.g. Moran’s I). These techniques usually involve including a weight matrix which contains location information. The non-i.i.d. nature of spatially dependent variables carries over into regression and correlation, which require spatial weights.
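To make the role of the weight matrix concrete, here is a minimal sketch of global Moran's I in its usual form, I = (n / S0) · Σi Σj wij zi zj / Σi zi², where z holds deviations from the mean and S0 is the sum of all weights. The four-zone layout, the data values and the function name `morans_i` are illustrative assumptions.

```python
import numpy as np


def morans_i(values, weights):
    """Global Moran's I for a 1-D array of values and an n x n spatial weight matrix."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    z = values - values.mean()      # deviations from the mean
    s0 = weights.sum()              # sum of all weights
    return (len(values) / s0) * (z @ weights @ z) / (z @ z)


# Hypothetical study area: four zones in a row (1-2-3-4), binary adjacency weights.
w = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

trend = [1.0, 2.0, 3.0, 4.0]        # similar values sit next to each other
alternating = [1.0, 4.0, 1.0, 4.0]  # neighbours are as different as possible

i_trend = morans_i(trend, w)
i_alternating = morans_i(alternating, w)
print(i_trend, i_alternating)
```

In this sketch the smoothly changing values give a positive I and the alternating values give a negative I, matching the reading of positive and negative spatial autocorrelation.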

Relation of i.i.d., regression, and correlation with spatial database
Spatial databases are used for spatial data mining (SDM), which includes statistical techniques and more specialised data mining (DM) techniques such as association rules. In this case the data mining algorithms need to have a spatial context. We must explicitly include location information where previously, with the i.i.d. assumption, it was not required. Typical generic data mining activities such as clustering, regression, classification and association rules all need a spatial context. Spatial DM is used in a broad range of scientific disciplines, such as analysis of crime, modelling land prices, poverty mapping, epidemiology, air pollution and health, natural and environmental sciences, etc. The analyst must be aware of the special techniques required for SDM. Epidemiology is the study of factors affecting the health and illness of populations, and serves as the foundation and logic of interventions made in the interest of public health and preventive medicine. It is considered a cornerstone methodology of public health research, and is highly regarded in evidence-based medicine for identifying risk factors for disease and determining optimal treatment approaches to clinical practice. In the study of communicable and non-communicable diseases, the work of epidemiologists ranges from outbreak investigation to study design, data collection and analysis, including the development of statistical models to test hypotheses and the documentation of results for submission to peer-reviewed journals. Epidemiologists also study the interaction of diseases in a population, a condition known as a syndemic. Epidemiologists rely on a number of other scientific disciplines such as biology (to better understand disease processes), biostatistics (the current raw information available), Geographic Information Science (to store data and map disease patterns) and social science disciplines (to better understand proximate and distal risk factors).

Relation of i.i.d., regression, and correlation with spatial database
Spatial databases are also used for pure statistical research (e.g. environmental studies). Those variables that are spatially dependent (e.g. the pH of the soil) need to be clearly identified, and special techniques applied to take into account their spatial bias.

Unique features of spatial data Statistics
General statistics assumes the samples are independently generated, which may not be the case with spatially dependent data. Like things tend to cluster together. Change tends to be gradual over space.

Unique features of spatial data Statistics Spatial dependent values
The previous maps illustrate two important features of spatial data. 1) Spatial autocorrelation: the data are not independent. Two events A and B are independent if the probability that they both occur is equal to the product of the probabilities of the two individual events, i.e. P(AB) = P(A) × P(B). 2) Spatial data is not identically distributed. Two events A and B are identically distributed if P(A) = P(B), i.e. they have the same probability distribution.

Unique features of spatial data Statistics Autocorrelation & Spatial Heterogeneity.
Spatial autocorrelation is detected when the value of a variable in a location is correlated with values of the same variable in the neighbourhood (can be measured with Moran’s I). Spatial heterogeneity is characterized by different values or behaviours through space, which can be measured by Local Indicators of Spatial Association (LISA). It characterizes the non-stationarity of most geographic processes, meaning that global parameters may not accurately reflect the process occurring at a particular location. Other approaches to measuring spatial heterogeneity (SH) are: 1) degree of similarity in the variable between adjacent sampling or grid points, 2) fractal geometry, 3) information theory. Spatial stationarity assumes that the relationship between dependent and independent variables is constant in space and time. Statistically, we say the structure of the spatial–temporal covariance does not change with location or time. Covariance is a measure of how much two variables change together.

Spatial Heterogeneity.
Spatial heterogeneity: is there such a thing as an average place with respect to some property (e.g. vegetation)? It is difficult to imagine any subset of the Earth’s surface being a representative sample of the whole. GWR (later) addresses the localness of spatial data. The Earth’s surface displays almost incredible variety, from the landscapes of the Tibetan plateau to the deserts of Australia and the urban complexity of London or Tokyo. Nowhere can be reasonably described as an average place and it is difficult to imagine any subset of the Earth’s surface being a representative sample of the whole. The results of any analysis over a limited area can be expected to change as that limited area is relocated, and to be different from the results that would be obtained for the surface of the Earth as a whole. These concepts are collectively described as spatial heterogeneity, and they tend to affect almost any kind of spatial analysis conducted on geographic data. Many techniques such as Geographically Weighted Regression (Fotheringham, Brunsdon, and Charlton, 2002, discussed in Section of this Guide) take spatial heterogeneity as given, as a universally observed property of the Earth’s surface, and focus on providing results that are specific to each area, and can be used as evidence in support of local policies. Such techniques are often termed place-based or local.

Neighbourhood relationship contiguity matrix

Spatial regression (SR)
Spatial regression (SR) is a global spatial modeling technique in which spatial autocorrelation among the regression parameters is taken into account. SR is usually performed for spatial data obtained from spatial zones or areas. The basic aim in SR modeling is to establish the relationship between a dependent variable measured over a spatial zone and other attributes of the spatial zone, for a given study area, where the spatial zones are subsets of the study area. While SR is known to be a modeling method in the spatial data analysis literature, in the spatial data-mining literature it is considered to be a classification technique.

Spatial regression (SR)
The coefficient of determination (COF) of a linear regression model is the quotient of the variances of the fitted values and observed values of the dependent variable. The COF

Geographically weighted regression (GWR)
Geographically weighted regression (GWR) is a powerful exploratory method in spatial data analysis. It serves for detecting local variations in spatial behavior and understanding local details, which may be masked by global regression models. Unlike SR, where a regression coefficient for each independent variable and the intercept are obtained for the whole study region, in GWR regression coefficients are computed for every spatial zone. Therefore, the regression coefficients can be mapped and the appropriateness of the stationarity assumption in conventional regression analyses can be checked.

Geographically weighted regression (GWR)
GWR is an effective technique for exploring spatial nonstationarity, which is characterized by changes in relationships across the study region leading to varying relations between dependent and independent variables. Local modeling techniques have emerged from the need for a better understanding of such spatial processes. GWR has been implemented in various disciplines such as the natural, environmental, social and earth sciences.

Exploring spatial patterning in spatial data values1.
Two issues: 1. How do variables change from place to place? Is a zone similar to its neighbours? 2. How are variables related? How does the relationship between rainfall and altitude vary from place to place? Lloyd: Spatial Data Analysis, Chapter 8, Oxford University Press.

Local Statistics1 moving window
Geographical weights can be: binary (rook or queen neighbours); distance based; boundary or perimeter based. Weights can be row-normalized using the number of adjacent cells. Lloyd: Spatial Data Analysis, Chapter 8, Oxford University Press.
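The binary weighting schemes can be sketched for a regular grid as follows. This is an illustrative construction, assuming cells are numbered row by row; the function names `grid_contiguity` and `row_normalise` are hypothetical and not taken from any particular package.

```python
import numpy as np


def grid_contiguity(n_rows, n_cols, rule="rook"):
    """Binary contiguity matrix for a regular grid with cells numbered row by row.

    rook: neighbours share an edge (up to 4); queen: share an edge or a corner (up to 8).
    """
    if rule == "rook":
        offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    elif rule == "queen":
        offsets = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]
    else:
        raise ValueError("rule must be 'rook' or 'queen'")
    n = n_rows * n_cols
    w = np.zeros((n, n))
    for r in range(n_rows):
        for c in range(n_cols):
            for dr, dc in offsets:
                rr, cc = r + dr, c + dc
                if 0 <= rr < n_rows and 0 <= cc < n_cols:
                    w[r * n_cols + c, rr * n_cols + cc] = 1.0
    return w


def row_normalise(w):
    """Divide each row by its number of neighbours so that every non-empty row sums to 1."""
    row_sums = w.sum(axis=1, keepdims=True)
    return np.divide(w, row_sums, out=np.zeros_like(w), where=row_sums > 0)


rook = grid_contiguity(3, 3, "rook")
queen = grid_contiguity(3, 3, "queen")
print("centre cell neighbours:", rook[4].sum(), "(rook),", queen[4].sum(), "(queen)")
print("row-normalised rook weights for the centre cell:", row_normalise(rook)[4])
```

After row-normalisation, multiplying the matrix by a vector of values gives, for each cell, the average of its neighbours' values.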

Local Univariate measures1 moving window
Standard univariate statistics can be computed for a moving window, supplying the degree and nature of variation in summary statistics across a region of interest (e.g. we could compute the standard deviation for several windows and assess the degree of variability from place to place). Geographical weighting schemes can be used for the calculation of local statistics. Lloyd: Spatial Data Analysis, Chapter 8, Oxford University Press.
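A moving-window standard deviation can be sketched as below. The grid values, the square window shape and the choice to clip windows at the grid edge are assumptions made for illustration; `moving_window_std` is a hypothetical helper name.

```python
import numpy as np


def moving_window_std(grid, half_width=1):
    """Standard deviation in a square window centred on each cell.

    Windows are clipped at the grid edge, so edge cells use fewer values.
    """
    grid = np.asarray(grid, dtype=float)
    n_rows, n_cols = grid.shape
    out = np.empty_like(grid)
    for r in range(n_rows):
        for c in range(n_cols):
            window = grid[max(r - half_width, 0): r + half_width + 1,
                          max(c - half_width, 0): c + half_width + 1]
            out[r, c] = window.std()
    return out


# Hypothetical raster: uniform everywhere except a variable patch in one corner.
grid = np.array([
    [5, 5, 5, 5],
    [5, 5, 5, 5],
    [5, 5, 1, 9],
    [5, 5, 9, 1],
])
local_sd = moving_window_std(grid)
print(np.round(local_sd, 2))
```

Mapping `local_sd` shows where the variable is locally stable and where it changes sharply, which a single global standard deviation would hide.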

Local spatial autocorrelation1
Global statistics such as Moran’s I can mask local spatial structure. The local Moran can be used to measure local spatial autocorrelation. Only if there is little or no variation in the local observations do the global observations provide any reliable information on the local areas within the study area. As the spatial variation of the local observations increases, the reliability of the global observation as representative of local conditions decreases. Lloyd: Spatial Data Analysis, Chapter 8, Oxford University Press.
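One common form of the local Moran statistic is Ii = zi Σj wij zj / m2, with z the deviations from the mean and m2 = Σ zi² / n. The sketch below assumes row-normalised weights and an invented chain of five zones; `local_moran` is a hypothetical function name.

```python
import numpy as np


def local_moran(values, weights):
    """Local Moran's I_i = z_i * sum_j(w_ij * z_j) / m2, with m2 = sum(z^2) / n."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    z = values - values.mean()
    m2 = (z @ z) / len(z)
    return z * (weights @ z) / m2


# Hypothetical chain of five zones (1-2-3-4-5) with row-normalised adjacency weights.
adjacency = np.array([
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)
w = adjacency / adjacency.sum(axis=1, keepdims=True)

values = [1.0, 1.0, 1.0, 1.0, 6.0]  # one high value at the end of the chain
local_i = local_moran(values, w)
print(np.round(local_i, 3))
```

Zones surrounded by similar values get a positive Ii, while the high zone and its low neighbour get negative values. With row-normalised weights, the mean of the local values equals the global Moran's I.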

Local spatial autocorrelation1
The weights could be based on rook, queen, distance or perimeter, and normalized by the number of neighbours (slide 28). Lloyd: Spatial Data Analysis, Chapter 8, Oxford University Press.

Spatial Regression1
The assumption of i.i.d. underlying ordinary least squares regression rarely holds for spatial data. There are several techniques that handle the spatial case: moving window regression; Geographic Weighted Regression (GWR). We will look at GWR. GWR has been used primarily for exploratory data analysis, rather than hypothesis testing.

Geographic Weighted Regression (GWR) 1
The steps are: 1. Go to a location. 2. Conduct regression using the raw data and a geographic weighting scheme. 3. Move to the next location and go back to step 2 until all locations have been visited. The output is a set of regression coefficients (e.g. slope and intercept) at each location.
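The steps above can be sketched as a weighted least squares fit repeated at every observation location. The Gaussian kernel w = exp(-0.5 (d / bandwidth)²), the fixed bandwidth, the single explanatory variable and the function names are all assumptions for illustration; this is not a substitute for a dedicated GWR package.

```python
import numpy as np


def gaussian_weights(distances, bandwidth):
    """Geographic weights that decay with distance (Gaussian kernel, assumed here)."""
    return np.exp(-0.5 * (distances / bandwidth) ** 2)


def gwr(coords, x, y, bandwidth):
    """Return one (intercept, slope) pair per location for the model y = a + b*x."""
    coords = np.asarray(coords, dtype=float)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones_like(x), x])
    coefficients = np.empty((len(y), 2))
    for i, location in enumerate(coords):                       # step 1: go to a location
        distances = np.linalg.norm(coords - location, axis=1)
        w = gaussian_weights(distances, bandwidth)
        xtw = design.T * w                                      # step 2: weighted regression
        coefficients[i] = np.linalg.solve(xtw @ design, xtw @ y)
    return coefficients                                         # step 3: repeated for all locations


# Hypothetical transect of ten locations; the true slope is 1 in the west and 3 in the east.
coords = [(float(i), 0.0) for i in range(10)]
x = np.array([1, 2, 3, 4, 5, 1, 2, 3, 4, 5], dtype=float)
true_slope = np.where(np.arange(10) < 5, 1.0, 3.0)
y = true_slope * x

local_coefficients = gwr(coords, x, y, bandwidth=1.5)
print(np.round(local_coefficients[:, 1], 2))  # local slopes, west to east
```

A global regression on these data would report a single compromise slope; the local slopes show the change from west to east. The bandwidth controls how quickly the weights fall off and would normally be chosen with care.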

Coordinates of observations, variables, distance from first observation, and geographic weights
Columns: point, x, y, Var 1, Var 2, dist, Geo w. 1 25 45 12 6 2 44 34 52 0.995 3 21 48 32 41 5 0.8825 4 27 8 0.7261 16 31 11 22 0.278 42 35 14 9 20 0.0889 7 65 56 43 26 0.034 29 76 75 67 0.006 61 66 0.0002

Location of points for previous table

Regression using the previous table and locations: the geographic weighting pulls the line towards the points with larger weights.

Summary of spatial stats
Moran’s I measures the average correlation between the value of a variable at one location and the value at nearby locations. The local Moran statistic measures spatial dependence on a local basis, allowing the researcher to see its variation over space. Geographically Weighted Regression (GWR) allows the parameters of a regression analysis to vary spatially. GWR helps in detecting local variations in spatial behavior and understanding local details, which may be masked by global regression models. In GWR, regression coefficients are computed for every spatial zone.

Two scatter plots and fitted lines for different aggregations of same value
© Oxford University Press, All rights reserved. Lloyd: Spatial Data Analysis

Second Law of Geography1
Second law of geography: Spatial heterogeneity [Goodchild]. Spatial heterogeneity describes geographic variation in the constants or parameters of relationships. When it is present, the outcome of an analysis depends on the area over which the analysis is made. Spatial heterogeneity depends on the spatial resolution. A global model might be inconsistent with respect to regional model(s). 1. Locational effects also manifest as spatial heterogeneity, or the apparent variation in a process with respect to location in a geospace. Unless a space is uniform and boundless, every location will have some degree of uniqueness relative to the other locations. This affects the spatial dependency relations and therefore the spatial process. Spatial heterogeneity means that overall parameters estimated for the entire system may not adequately describe the process at any given location.

Second Law of Geography
Spatial heterogeneity definitions: 1) quantitative information characterizing the ground spatial structure; 2) spatial variance distribution of the variable considered, within the coarse sample resolution (e.g. pixel or grid); 3) the patterning or patchiness in important landscape properties such as vegetation cover. Geographically weighted regression (GWR) is a local version of spatial regression that generates parameters disaggregated by the spatial units of analysis. This allows assessment of the spatial heterogeneity in the estimated relationships between the independent and dependent variables.

Spatial heterogeneity has been quantified from remote sensing images by using two basic approaches: (a) the direct image approach, where straight reflectance or reflectance indices of remote sensing images are used to quantify spatial heterogeneity, using the original pixel size of the image; (b) the cartographic or patch mosaic approach, where the image is subdivided into homogeneous mapping units through classification. 1. From: The first approach assumes that spatial heterogeneity is at the pixel size of the image and, in this case, it is only the reflectance values that change in space. The argument against this approach is that its choice of scale (i.e. window size) is arbitrary, thus it is subjective. Alternatively, using the patch mosaic approach to quantify spatial heterogeneity assumes a collection of discrete patches. Based on this approach, characterisation of spatial heterogeneity is highly dependent on the initial definition of mapping units by the researcher. The argument against this approach is that patches have abrupt boundaries and the variation within the patches is assumed to be irrelevant. The patch mosaic model is parsimonious and has therefore become the operating paradigm. It is particularly valid where landscape patches have crisp boundaries, as with the regular landscapes of Europe. However, the model poorly represents spatial heterogeneity in landscapes that are characterised by gradients rather than discrete patches, for instance in savanna landscapes, and this leads to both loss of information and the introduction of subjectivity. Alternative approaches for characterising spatial heterogeneity remain under development.

Suppose there is a relationship between the number of AIDS cases and the number of people living in an area. The form of this relationship will vary spatially: in some areas the number of cases per capita2 will be higher than in others, and we could map the constant of proportionality3. Spatial heterogeneity describes this geographic variation in the constants or parameters of relationships. When it is present, the outcome of an analysis depends on the area over which the analysis is made. Often this area is arbitrarily determined by a map boundary or political jurisdiction. 1
2 'For each head' indicates the average number of AIDS cases associated with each person.
3 Since small decimal numbers are awkward to interpret, we change the ratio to a rate by multiplying it by a constant of proportionality. This constant of proportionality can be any number (say 1000), as long as the same number is used in calculating every rate. More generally, if an object travels at a constant speed, then the distance travelled is proportional to the time spent travelling, with the speed being the constant of proportionality.

Second Law of Geography
Second law of geography [Goodchild] Spatial heterogeneity Global model often inconsistent with regional models (e.g. the average does not hold anywhere).

How to decide the weight wij ?
The weight indicates the spatial interaction between entities. 1) Binary wij, also called absolute adjacency. This covers the general case, answering the question: is a value in a region similar or different to its neighbours? wij = 1 if two geographic entities are adjacent; otherwise, wij = 0. The adjacency definition must be chosen: queen's (8 neighbours) or rook's (4 neighbours).

How to decide the weight wij ?
The weight indicates the spatial interaction between entities. 2) The distance between geographic entities. Often the inverse distance is used: farther objects get less weight and nearer objects get more weight (e.g. distance from the centre of an epidemic). wij = f(dist(i,j)), where dist(i,j) is the distance between i and j. 3) The length of the common boundary for area entities (e.g. policing borders: shorter borders get less weight). wij = f(leng(i,j)), where leng(i,j) is the length of the common boundary between i and j.
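As a minimal sketch of option 2, the function below takes f to be an inverse power of distance, wij = 1 / dist(i,j)^power, with wii = 0. The choice of f, the coordinates and the name `inverse_distance_weights` are assumptions for illustration; boundary-length weights (option 3) would need polygon geometry and are not shown.

```python
import numpy as np


def inverse_distance_weights(coords, power=1.0):
    """w_ij = 1 / dist(i, j)**power for i != j, and 0 on the diagonal."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    w = np.zeros_like(dist)
    positive = dist > 0                     # skips the diagonal and coincident points
    w[positive] = 1.0 / dist[positive] ** power
    return w


# Hypothetical point locations; distances are 5, 10 and 5 units.
coords = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
w = inverse_distance_weights(coords)
print(np.round(w, 3))
```

Raising `power` makes the weights fall off faster, so distant entities contribute even less.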

How to decide the weight wij ?1
The choice of weights should ultimately be driven by a rationale for including those areas as neighbors that have a spatial effect on a given location. This rationale can be derived from theory or be the result of using ESDA to experiment with different weights and connectivity orders. Since weights matrices are used to create spatial lags that average neighboring values, the choice of a weights matrix will determine which neighboring values will be averaged. For instance, since rook weights will usually have fewer neighbors than queen weights, on average, each neighboring observation has more influence. 1. Tips in Geoda by Luc Anselin

How to decide the weight wij ?1
The question of which weights to choose is more pertinent in the context of modeling than ESDA since modeling is based on substantive notions of spatial effects while ESDA prioritizes the rejection of spatial randomness. Therefore, if there are no substantive reasons to guide the choice of weights in ESDA, using a weights file with as few neighbors as possible (such as rook) makes sense. Especially with irregular areal units (as opposed to grids), the difference between rook and queen weights is often minimal. However, it is advisable to test how sensitive your results are to your weights specifications by comparing multiple weights matrices. 1. Tips in Geoda by Luc Anselin

Spatial Outlier Detection
Global outliers are observations which appear inconsistent with the remainder of that data set. Global outliers deviate so much from other observations that it may be possible that they were generated by a different mechanism. Spatial outliers are observations that appear inconsistent with their neighbours.

Detecting spatial outliers has important applications in transportation, ecology, public safety, public health, climatology and location based services. Geographic objects have a spatial (location, shape, metric & topological properties) & non-spatial component (house owner, sensor id., soil type).

Spatial neighbourhoods may be defined using spatial attributes & spatial relations. Comparisons between spatially referenced objects can be based on non-spatial attributes. A spatial outlier is a spatially referenced object whose non-spatial attribute values differ from those of other spatially referenced objects in its spatial neighbourhood.

Data for Outlier detection
In the diagram on the left, G, P, S and Q show a big change in attribute for a small change in location. The right-hand diagram shows a normal distribution (corresponds to the attribute axis in the left diagram).

The upper left & lower right quadrants of figure 7.17 indicate a spatial association of dissimilar values; low values surrounded by high value neighbours (P & Q) and high values surrounded by low values (S).

A Moran outlier is a point located in the upper left or lower right quadrant of a Moran scatter plot.

[Figure: Moran scatter plot. Axes: z values in a given location; WZ. Quadrants: Q1 = HH, Q2 = LL, Q3 = HL, Q4 = LH.]
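The quadrant rule can be sketched by computing, for each location, its standardised value z and the weighted average of its neighbours' z values (taken here to be the WZ axis, which is an assumption about the figure). The chain of zones, the data and the function name `moran_scatter_labels` are illustrative.

```python
import numpy as np


def moran_scatter_labels(values, weights):
    """Label each location HH, LL, HL or LH from its z value and its neighbours' average z.

    Assumes every location has at least one neighbour, so rows can be normalised.
    """
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    z = (values - values.mean()) / values.std()
    row_normalised = weights / weights.sum(axis=1, keepdims=True)
    lag = row_normalised @ z                 # average z of the neighbours
    high_z = z >= 0
    high_lag = lag >= 0
    return np.where(high_z,
                    np.where(high_lag, "HH", "HL"),
                    np.where(high_lag, "LH", "LL"))


# Hypothetical chain of five zones (1-2-3-4-5) with binary adjacency.
chain = np.array([
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)
values = [1.0, 1.0, 1.0, 1.0, 6.0]

labels = moran_scatter_labels(values, chain)
# Moran outliers fall in the LH (upper left) or HL (lower right) quadrant.
outliers = [i for i, label in enumerate(labels) if label in ("LH", "HL")]
print(list(labels), outliers)
```

In this sketch the high value at the end of the chain is labelled HL and its low neighbour LH, so both are flagged as Moran outliers.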

LISA for Crime in Columbus, OH
[Figure panels: LISA map (only significant values plotted), showing high crime clusters and low crime clusters; significance map (only significant values plotted).] For more detail on LISA, see: Luc Anselin, Local Indicators of Spatial Association-LISA, Geographical Analysis 27:

Model Evaluation
Consider the two-class classification problem ‘nest’ or ‘no-nest’. The four possible outcomes (or predictions) are shown on the next slide. The desired predictions are: 1) where the model says there should be a nest and there is an actual nest (True Positive); 2) where the model says there is no nest and there is no nest (True Negative). The other outcomes are not desirable and point to a flaw in the model.
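The four outcomes can be counted with a short sketch like the one below. The actual and predicted labels are invented, `True` stands for ‘nest’, and the helper name `confusion_counts` is hypothetical.

```python
def confusion_counts(actual, predicted):
    """Count the four outcomes of a two-class prediction (True = nest, False = no nest)."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for is_nest, says_nest in zip(actual, predicted):
        if says_nest and is_nest:
            counts["TP"] += 1      # model says nest, there is a nest
        elif not says_nest and not is_nest:
            counts["TN"] += 1      # model says no nest, there is no nest
        elif says_nest and not is_nest:
            counts["FP"] += 1      # model says nest, there is none
        else:
            counts["FN"] += 1      # model misses an actual nest
    return counts


# Hypothetical observations at six locations.
actual = [True, True, False, False, True, False]
predicted = [True, False, False, True, True, False]

counts = confusion_counts(actual, predicted)
accuracy = (counts["TP"] + counts["TN"]) / len(actual)
print(counts, round(accuracy, 3))
```

Accuracy here is the share of desired predictions (true positives plus true negatives) among all predictions.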

Model Evaluation

Spatial Statistical Models
A Point Process is a model for the spatial distribution of points in a point pattern. Examples: the position of trees in a forest, the location of petrol stations in a city. Actual real-world point patterns can be compared (using distance) with a randomly distributed point pattern.
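One simple way to make the distance-based comparison is to compute the mean nearest-neighbour distance of an observed pattern and compare it with the same statistic for simulated random patterns in the same region. The sketch below assumes a unit-square study area, an invented clustered pattern and 200 simulations; these choices and the function name are illustrative only.

```python
import numpy as np


def mean_nearest_neighbour_distance(points):
    """Average distance from each point to its closest other point."""
    points = np.asarray(points, dtype=float)
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)          # a point is not its own neighbour
    return dist.min(axis=1).mean()


rng = np.random.default_rng(0)
n_points = 50

# Hypothetical observed pattern: points clustered tightly around two centres.
centres = np.array([[0.25, 0.25], [0.75, 0.75]])
observed = centres[rng.integers(0, 2, size=n_points)] + rng.normal(scale=0.02, size=(n_points, 2))

# Reference: the same statistic for uniformly random patterns in the unit square.
random_means = np.array([
    mean_nearest_neighbour_distance(rng.uniform(size=(n_points, 2)))
    for _ in range(200)
])

observed_mean = mean_nearest_neighbour_distance(observed)
share_below = np.mean(random_means <= observed_mean)
print(f"observed: {observed_mean:.4f}, random average: {random_means.mean():.4f}")
print(f"share of random patterns with a mean distance this small: {share_below:.3f}")
```

An observed mean distance well below the random patterns suggests clustering; one well above suggests a regular, dispersed pattern.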
