An Overview of Spatial Data Analysis Stefan Falke stefan@me.wustl.edu http://capita.wustl.edu/ENVE424/REU/SpatialAnalysis.htm
Pop vs Soda vs Coke http://www.popvssoda.com/
Pop vs Soda vs Coke by County
2000 Presidential Election Results Bush States: 30 votes: 50,456,169 Gore States: 21 votes: 50,996,116
2000 Presidential Election Results by County Bush Gore
Environmental Pattern and Trend Analysis When analyzing environmental data we examine: Spatial Patterns Temporal Trends We are particularly interested in changes in these patterns and trends and in their relationships with other patterns and trends. The analysis also strives to determine why we see these patterns and trends: what the causal factors are and what their impacts are.
Spatial and Temporal Data Analysis Turns raw data into useful information by adding greater informative content and value Wisdom Knowledge / Evidence Data Information
What is Spatial Data Analysis? Spatial analysis is the quantitative and qualitative study of phenomena that are located in space. Environmental spatial data analysis describes characteristics and behavior of the environment Explores patterns, trends, and relationships in environmental data Seeks to explain these patterns, trends, and relationships Differs from general data analysis and statistics in that: Spatial data are dependent on location and related by location (they do not adhere to the independence assumption made in regular data analysis) Have properties that require special analysis methods Why is spatial analysis such a big deal? about 85% of environmental data is spatial
What is GIS? Traditional definition is that GIS is a set of computer tools for accessing, processing, visualizing, analyzing, interpreting, and presenting spatial data. ‘GIS’ is Geographical Information System OR IS IT Geographical Information Science? GISystems: Emphasis on technology and tools GIScience: Fundamental issues raised by the use of GIS, such as Spatial analysis Map projections Accuracy Scientific visualization Implementation and application of GIS covers a wide spectrum: Simple maps Overlaying multiple map “layers” Conducting proximity or cluster analysis based on distance Comparing data sets (simple spatial statistics) Complex statistical analysis
Nature Vol 427 22 January 2004
Special Spatial Nomenclature Geographic – Limited to phenomena and problems relating to Earth’s surface and near-surface Spatial – Any space, including geographic, but not restricted to geographic coordinate space, e.g. medical imaging Geospatial – A recent term to represent the subset of spatial applied specifically to the Earth’s surface. (synonymous with geographic)
http://labs.google.com/location
Tobler’s First Law of Geography “Everything is related to everything else, but near things are more related than distant things.” Tobler, 1970 This general assumption is what makes spatial data subject to special statistical laws
Types of Spatial Analysis There are literally thousands of techniques Bailey and Gatrell, 1995 offer four spatial data analysis classes: Point Data Analysis Do the locations of point data and the relationship among the points represent a ‘significant’ pattern Continuous Data Analysis What are the spatial pattern and characteristics over a region given a set of samples Area Data Analysis Analysis of data that have been aggregated over a spatial zone, e.g. county
The John Snow Map A classic example of the use of location to draw inferences 1854 cholera outbreak in London Point data map indicated some spatial clustering Overlaying a map of water pump locations showed many cases were concentrated around a single pump
Continuous Data Analysis Temperature data is well suited for converting from point to continuous data - It has high spatial density - Ambient temperature is relatively spatially homogeneous (no sharp gradients)
County Level Aggregated Data Also known as a choropleth plot
Scale The most appropriate analysis method to use depends on the spatial and temporal scales of the problem. The spatial variability of temperature at a ‘local’ scale is not necessarily significant when conducting an analysis at the ‘regional’ or ‘global’ scale.
Scale Dependent Measurements How long is Maine’s coastline? length=340 km length=355 km length=415 km From Longley et al., 2001
What’s in a map, anyway? Theme: Static map Maps of entities whose location is known and constant (relatively) Roads, borders, locations of buildings These types of layers are often referred to as “thematic” layers Are usually used to provide context to other spatial data Statistical: Realization of one of the many possible patterns that may have been generated by a process Given a set of conditions, a given spatial pattern is just one instance among a distribution of possible patterns The question is: Is the observed realization significantly different than what would be expected by chance?
Deterministic versus Stochastic Processes Deterministic processes have one realization: the value at a given location is always the same, regardless of the number of times the process occurs Stochastic processes have multiple realizations that are not precisely predicted and involve a random component. For our purposes, random refers to the method used to generate a pattern, not the resulting pattern itself.
Examples of Deterministic & Stochastic Processes random variable
Random Spatial Processes A random process does not mean that all events are independent of one another, as is the case with flipping a coin or rolling dice. Rather, spatial random processes are random with dependence (or rules). Consider a “conditionally” random display of 4 coins: Flip the first 3 coins and display each by its flipped side (heads or tails) The 4th coin will not be flipped The 4th coin is displayed as follows: If the 2nd and 3rd flipped coins are heads, the 4th is the same as the first Otherwise, the 4th is the opposite of the first.
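The four-coin rule can be sketched in a few lines of Python. This is only an illustration of the rule as stated on the slide; the function name `conditional_coin_display` and the "H"/"T" encoding are assumptions made for the example.

```python
import random


def conditional_coin_display(rng):
    """Return a 4-coin display following the slide's rule (hypothetical helper)."""
    # The first three coins are ordinary, independent flips.
    first, second, third = (rng.choice("HT") for _ in range(3))
    # The fourth coin is never flipped; it is derived from the other three.
    if second == "H" and third == "H":
        fourth = first
    else:
        fourth = "T" if first == "H" else "H"
    return first, second, third, fourth


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed only so the run is repeatable
    for _ in range(5):
        print(" ".join(conditional_coin_display(rng)))
```

Each display is random, yet the fourth coin depends entirely on the first three, which is the sense in which the process is "random with dependence".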
Basic Statistical Concepts Variance: Mean: Median: The value in the distribution at which 50% of the data points lie both above and below Covariance: Frequency/Probability Distributions Normal or Gaussian Poisson mean=variance mean=median
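The standard textbook definitions of the mean, variance, and covariance can be written as follows. The symbols x_i, y_i, and n, and the use of the sample (n − 1) convention, are assumptions rather than the slide's own notation.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2
\qquad
\operatorname{cov}(x, y) = \frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)
```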
Distribution Summary Statistics The features of a distribution can be summarized using: Measures of Location Mean Median Quantiles Measures of Spread Standard Deviation = Square Root of Variance Measures of Shape Coefficient of skewness – a measure of symmetry Kurtosis – a measure of the likelihood of outliers
Complete Spatial Randomness Take as an example a randomly generated point data set where 1) the chance of a given x,y point existing is equal to the chance of any other point existing (uniform probability distribution) 2) the existence of an x,y point is independent of the existence of any other point These two conditions constitute an independent random process (IRP) or complete spatial randomness (CSR)
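A point pattern satisfying both conditions can be simulated by drawing each coordinate independently from a uniform distribution. The sketch below assumes a rectangular study area; the function name `csr_points` and its parameters are hypothetical.

```python
import random


def csr_points(n, width, height, seed=None):
    """Generate n points under complete spatial randomness in a rectangle."""
    rng = random.Random(seed)
    # Every location is equally likely (condition 1) and each point is
    # drawn without reference to any other point (condition 2).
    return [(rng.uniform(0, width), rng.uniform(0, height)) for _ in range(n)]


if __name__ == "__main__":
    for x, y in csr_points(5, width=100.0, height=50.0, seed=1):
        print(f"{x:8.2f} {y:8.2f}")
```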
Exploratory Spatial Data Analysis (ESDA) Aim is to identify data properties for purposes of pattern detection Based on the use of graphical and visual methods and the use of numerical techniques that are statistically robust i.e. not much affected by extreme or atypical data values. ArcGIS Geostatistical Analyst extension contains a set of ESDA tools: Histogram (Frequency Distribution) Voronoi Map QQPlot Trend Analysis
Exploratory Analysis Example
Summary Statistics
Quantile Plots Graphs the quantiles of a dataset against the quantiles of a normal distribution
Voronoi Plot Voronoi plots assign a value to each point’s polygon. The assigned value can be: the value itself mean of neighboring polygons most frequent value among neighboring polygons unique value among neighbors variation among neighbors
Spatial Smoothing/Averaging
Data Types Two general views to organizing spatial data: Entities or objects Point measurements, rivers, structures Have attributes or features attached to them Point, vector or area format Values exist at discrete locations Fields Continuous data such as temperature gradient fields and satellite imagery Values exist over an area Raster format (grids)
Data Types Entities and fields can be transformed to the other type
Raster and Vector Data Models [Figure: a real-world scene (trees, house, river) shown as a raster representation (10 × 10 grid with cells coded B, G, and BK) and as a vector representation (X and Y axes from 100 to 600); adapted from Lembo, 2003]
Landcover Raster Grid [Figure: raster grid of cell values from 1 to 20, grouped into the ranges (1-5), (6-10), (11-15), and (16-20); legend classes: Mixed conifer, Douglas fir, Oak savannah, Grassland]
GIS Functionality Filtering: retrieves a subset of a dataset. Example: query (search). Aggregation: combines attributes or features within data sources (layers). Examples: reclassify, dissolve. Integration: combines two or more data sources (layers). Examples: polygon overlay, table joining.
Spatial Queries (Filter) Identifying features based on spatial criteria Criteria include variations on: adjacency, containment, arrangement, and connectivity Adjacency Which states are adjacent to the State of Missouri? Containment Which states “contain” the Mississippi River and its tributaries?
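A containment query ultimately rests on a point-in-polygon test. The following is a minimal sketch using the ray-casting technique, which is one common approach and not necessarily what any particular GIS package uses; the function name and the square test polygon are hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting containment test; polygon is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside  # each crossing to the right flips the state
    return inside


if __name__ == "__main__":
    square = [(0, 0), (4, 0), (4, 4), (0, 4)]  # hypothetical zone
    print(point_in_polygon(2, 2, square))
    print(point_in_polygon(5, 2, square))
```

Points lying exactly on a boundary are not handled specially in this sketch.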
Reclassification (Aggregation) An assignment of a class or value based on the attributes or geography of an object
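Reclassification by attribute value can be sketched as a lookup against class breaks. The breaks below reuse the ranges 1-5, 6-10, 11-15, and 16-20 from the landcover grid slide; the numeric class labels 1 to 4 and the function name are assumptions for illustration.

```python
from bisect import bisect_left

# Upper bound of each class: 1-5, 6-10, 11-15, 16-20.
CLASS_BREAKS = [5, 10, 15, 20]


def reclassify(value, breaks=CLASS_BREAKS):
    """Map a cell value to a class number (1-based) using upper class bounds."""
    index = bisect_left(breaks, value)
    if value < 1 or index >= len(breaks):
        raise ValueError(f"value {value} is outside the classified range")
    return index + 1


if __name__ == "__main__":
    grid = [[2, 17, 16], [15, 14, 11], [8, 7, 3]]  # hypothetical cell values
    for row in grid:
        print([reclassify(v) for v in row])
```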
Reclassification & Dissolve
Variable Distance Buffering
Polygon Overlay (Integration) Topology describes the relationships between elements of a map. A topological data structure defines the elements of the map in a way that makes it possible to know which line segments are connected to each other and to know what polygon is adjacent to each side of a line segment.
Polygon Overlay Examples “Cookie-cutter” method
Coordinate Systems A geographical coordinate system uses a three-dimensional spherical surface to define locations on the earth. It divides space into an orderly structure of locations. Two types: Cartesian and angular (spherical) © Paul Bolstad, GIS Fundamentals
Parallels and Meridians Meridians are great circles of constant longitude Example is the prime meridian Parallels are circles of constant latitude Example is the equator latitude (φ): angular distance from equator longitude (λ): angular distance from standard meridian St. Louis 38° 39' N 90° 38' W New York 40° 47' N 73° 58' W Los Angeles 34° 3' N 118° 14' W Rome 41° 48' N 12° 36' E Sydney 33° 52' S 151° 12' E
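Coordinates given in degrees and minutes, as in the city list above, are usually converted to signed decimal degrees before analysis. A minimal sketch follows; the function name and the sign convention (south and west negative) are assumptions, although that convention is the common one.

```python
def dms_to_decimal(degrees, minutes, hemisphere, seconds=0.0):
    """Convert degrees/minutes/seconds plus N, S, E, or W to decimal degrees."""
    value = degrees + minutes / 60.0 + seconds / 3600.0
    # South latitudes and west longitudes are negative by convention.
    return -value if hemisphere in ("S", "W") else value


if __name__ == "__main__":
    # Values copied from the slide.
    cities = {
        "New York": ((40, 47, "N"), (73, 58, "W")),
        "Sydney": ((33, 52, "S"), (151, 12, "E")),
    }
    for name, (lat, lon) in cities.items():
        print(name, round(dms_to_decimal(*lat), 4), round(dms_to_decimal(*lon), 4))
```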
Earth’s Expanding Waistline From the Chronicle of Higher Education Jan 17, 2003
Datum While a spheroid approximates the shape of the earth, a datum defines the position of the spheroid relative to the center of the Earth The datum provides a frame of reference for measuring locations on the surface of the Earth A datum is chosen to align a spheroid to closely fit the Earth’s surface in a particular area
Map Projections and Distortions Three general types of projections: Equal area – the ratio of areas on the earth to areas on the map is constant. Shape, angle, and scale are distorted. Conformal – the shape of any small surface of the map is preserved in its original form. If meridians and parallels intersect at 90-degree angles, then angles are also preserved. Equidistant – preserves distances between certain points. Scale is not maintained correctly everywhere, but typically one or more lines have their scale maintained.
Comparing Projections
Summary Statistics of a Point Pattern Mean center: average of the x and y coordinates (geographic mean). Standard distance: average distance of points from the center (provides a measure of dispersion). Summary circle: centered at the mean center with a radius of the standard distance.
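These three summaries are straightforward to compute. In the sketch below the standard distance is taken as the root-mean-square distance from the mean center, which is the usual definition but is an assumption here, since the slide describes it only as an average distance; the function names are hypothetical.

```python
import math


def mean_center(points):
    """Average of the x and y coordinates."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def standard_distance(points):
    """Root-mean-square distance of the points from their mean center."""
    cx, cy = mean_center(points)
    squared = sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points)
    return math.sqrt(squared / len(points))


if __name__ == "__main__":
    pts = [(0, 0), (2, 0), (2, 2), (0, 2)]  # hypothetical point pattern
    center = mean_center(pts)
    radius = standard_distance(pts)
    # The summary circle is fully described by this center and radius.
    print(center, round(radius, 4))
```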
US Population Density
Geographic Center of US Population The center of the US population is calculated as the average latitude and longitude, weighted by the population at a uniformly spaced set of points
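A population-weighted center is a weighted version of the mean center. The sketch below averages coordinates directly and ignores any correction for the earth's curvature, so it is a simplification under that assumption; the function name and sample numbers are hypothetical.

```python
def weighted_mean_center(points, weights):
    """Weighted average of coordinates; weights could be population counts."""
    total = sum(weights)
    x = sum(w * px for (px, _), w in zip(points, weights)) / total
    y = sum(w * py for (_, py), w in zip(points, weights)) / total
    return x, y


if __name__ == "__main__":
    locations = [(0.0, 0.0), (10.0, 0.0)]  # hypothetical grid points
    population = [1, 3]                    # hypothetical population weights
    print(weighted_mean_center(locations, population))
```

The center is pulled toward the more populous location, three quarters of the way along the line between the two points.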
Quadrat Count A quadrat count is conducted by superimposing a regular grid over the data, counting the number of events in each grid cell, and dividing each count by the cell area to get intensity. Example grid: 40 cells. The variance (s²) and mean (µ) of the cell counts are compared: an s²-to-µ ratio greater than 1 indicates clustering
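The grid-cell count and the variance-to-mean ratio can be sketched as follows. The function names, the rectangular `bounds` tuple, and the use of the sample variance are assumptions made for the example.

```python
from statistics import mean, variance


def quadrat_counts(points, bounds, nx, ny):
    """Count points in each cell of an nx-by-ny grid over bounds."""
    xmin, ymin, xmax, ymax = bounds
    counts = [0] * (nx * ny)
    for x, y in points:
        # min() keeps points on the upper boundary inside the last cell.
        col = min(int((x - xmin) / (xmax - xmin) * nx), nx - 1)
        row = min(int((y - ymin) / (ymax - ymin) * ny), ny - 1)
        counts[row * nx + col] += 1
    return counts


def variance_mean_ratio(counts):
    """Sample variance of the cell counts divided by their mean."""
    return variance(counts) / mean(counts)


if __name__ == "__main__":
    bounds = (0.0, 0.0, 4.0, 4.0)
    regular = [(i + 0.5, j + 0.5) for i in range(4) for j in range(4)]
    clustered = [(0.1 + 0.01 * k, 0.1) for k in range(16)]
    print(variance_mean_ratio(quadrat_counts(regular, bounds, 4, 4)))
    print(variance_mean_ratio(quadrat_counts(clustered, bounds, 4, 4)))
```

A perfectly regular pattern gives a ratio below 1, while a pattern packed into a single cell gives a ratio well above 1.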
Spatial Autocorrelation Defines the correlation between values of the same variable at different spatial locations Positive Spatial Autocorrelation Like values tend to cluster in space Negative Spatial Autocorrelation Neighbors are dissimilar Zero Spatial Autocorrelation No correlation
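One widely used measure of spatial autocorrelation, not named on the slide, is Moran's I. The sketch below computes it for values arranged along a line, where each location neighbors only the locations immediately beside it; the helper names and the binary adjacency weights are assumptions for illustration.

```python
def chain_weights(n):
    """Binary adjacency weights for n locations arranged in a line."""
    return [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]


def morans_i(values, weights):
    """Moran's I for the given values and spatial weights matrix."""
    n = len(values)
    avg = sum(values) / n
    dev = [v - avg for v in values]
    w_total = sum(sum(row) for row in weights)
    # Cross-products of deviations for every pair of neighbors.
    num = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / w_total) * (num / den)


if __name__ == "__main__":
    w = chain_weights(6)
    print(morans_i([1, 1, 1, 5, 5, 5], w))  # like values cluster: positive
    print(morans_i([1, 5, 1, 5, 1, 5], w))  # neighbors dissimilar: negative
```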
From points to fields A spatial estimation method turns point monitoring data into a continuous surface of estimates (map). ci is the estimated value at location i; n is the number of data points; cj is the value at data point j; wij is the weight assigned to data point j, that is, the factor that determines how much influence a data point is assigned during the calculation of the estimate. The weighting factor is usually the distinguishing feature of interpolation methods. Biggest challenge: How to determine the weights?
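A common way to write such a weighted estimate, using the symbols defined above, is the normalized weighted average below. The normalization by the sum of the weights is an assumption; some methods instead require the weights themselves to sum to one.

```latex
c_i = \frac{\sum_{j=1}^{n} w_{ij}\, c_j}{\sum_{j=1}^{n} w_{ij}}
```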
Inverse Distance Interpolation k is the power-law exponent of the distance weighting. Estimates are constrained to lie between the minimum and maximum values in the point data set.
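A minimal sketch of inverse distance interpolation is given below. It assumes the usual weight form, the inverse of distance raised to the power k, with the estimate normalized by the sum of the weights; the function name and the sample tuples are hypothetical.

```python
import math


def idw_estimate(x, y, samples, k=2.0):
    """Inverse-distance-weighted estimate at (x, y).

    samples is a list of (x, y, value) tuples; k is the distance exponent.
    """
    numerator = 0.0
    denominator = 0.0
    for sx, sy, value in samples:
        d = math.hypot(x - sx, y - sy)
        if d == 0.0:
            return value  # exactly on a data point: return its value
        w = 1.0 / d ** k
        numerator += w * value
        denominator += w
    return numerator / denominator


if __name__ == "__main__":
    monitors = [(0.0, 0.0, 10.0), (2.0, 0.0, 20.0)]  # hypothetical data points
    print(idw_estimate(1.0, 0.0, monitors))
    print(idw_estimate(0.5, 0.0, monitors))
```

Because the estimate is a weighted average with positive weights, it can never fall outside the range of the sampled values.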
Raster Analysis (Continuous Data) Moving Windows Example 3 × 3 window with rows 2 3 5, 2 3 6, and 3 5 7: minimum = 2, maximum = 7, range = 5, mean = 4
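The moving-window statistics can be sketched for a single 3 × 3 neighborhood, using the window values shown on the slide (rows 2 3 5, 2 3 6, 3 5 7). The function name and the dictionary output are assumptions for illustration.

```python
def window_stats(grid, row, col):
    """Summary statistics of the 3 x 3 neighborhood centered on (row, col)."""
    cells = [
        grid[r][c]
        for r in range(row - 1, row + 2)
        for c in range(col - 1, col + 2)
        # Skip neighbors that fall outside the grid at the edges.
        if 0 <= r < len(grid) and 0 <= c < len(grid[0])
    ]
    return {
        "minimum": min(cells),
        "maximum": max(cells),
        "range": max(cells) - min(cells),
        "mean": sum(cells) / len(cells),
    }


if __name__ == "__main__":
    window = [[2, 3, 5], [2, 3, 6], [3, 5, 7]]
    print(window_stats(window, 1, 1))
```

Sliding the window across every cell of a larger raster and storing one statistic per position produces a new, smoothed raster.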
Slope Slope is the change in elevation (rise) with a change in horizontal position (run). The steepest descent between a cell and its neighbors is known as the gradient. Slope is often reported in degrees (0° is flat, 90° is vertical) but is also expressed as a percent
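The two ways of reporting slope follow directly from rise and run. A minimal sketch, with hypothetical function names:

```python
import math


def slope_percent(rise, run):
    """Slope as a percent: 100 * rise / run."""
    return 100.0 * rise / run


def slope_degrees(rise, run):
    """Slope as an angle in degrees: arctangent of rise over run."""
    return math.degrees(math.atan2(rise, run))


if __name__ == "__main__":
    # A 10 m rise over a 100 m run.
    print(slope_percent(10.0, 100.0), round(slope_degrees(10.0, 100.0), 2))
```

Note that a 100 percent slope corresponds to 45°, not 90°; percent slope grows without bound as the surface approaches vertical.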
Hands-on Exercise: Mapping Census Data Database manipulation (table joins) Reprojecting maps Calculating derived values (population density, change in population over time) Visualization
ArcGIS Main Components ArcCatalog ArcToolbox ArcMap
Data Quality It is impossible to make a perfect representation of the world, so uncertainty about it is inevitable Uncertainty is found in data and in its processing and analysis The outputs from spatial data analysis and GIS are only as good as the inputs and associated assumptions.
Logical Consistency Representation of data that does not make sense Road in the water Contours that cross or end Features on steep slopes
Modifiable areal unit problem There are multiple ways to aggregate data into zones, and different zonings can yield different results.
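The effect can be shown with a toy example in which the same eight cell values are aggregated into zones of two different sizes. The values and the function name are hypothetical.

```python
def zone_means(values, zone_size):
    """Mean of each consecutive zone of zone_size cells."""
    return [
        sum(values[i:i + zone_size]) / zone_size
        for i in range(0, len(values), zone_size)
    ]


if __name__ == "__main__":
    cells = [1, 1, 9, 9, 1, 1, 9, 9]  # hypothetical values along a strip
    print(zone_means(cells, 2))  # four small zones
    print(zone_means(cells, 4))  # two large zones
```

With small zones the strong contrast between low and high areas is visible; with large zones every zone has the same mean and the pattern disappears.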
Anscombe’s Quartet These four data sets look nearly identical from a statistical perspective: their summary statistics are almost the same.
Anscombe’s Quartet They don’t look anything alike from a graphical perspective!!
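The statistical similarity can be checked directly. The sketch below uses the y values of the first two sets of the quartet as published by Anscombe (1973) and compares their means and sample variances; restricting the check to two sets and to these two statistics is a simplification for brevity.

```python
from statistics import mean, variance

# Both sets share these x values.
X = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
Y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

if __name__ == "__main__":
    for name, y in (("Set I", Y1), ("Set II", Y2)):
        print(name, round(mean(y), 2), round(variance(y), 2))
```

The summaries agree to two decimal places even though set I scatters around a straight line and set II follows a smooth curve, which is why plotting the data matters.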