Spatial Data Analysis Areas I: Rate Smoothing and the MAUP Gilberto Câmara INPE, Brazil Ifgi, Muenster, Fall School 2005
Areal data: The study region is partitioned into disjoint areas. The region is the union of the areas. Each map has one or more associated measures, treated as random variables. Example: a map of Germany divided into municipalities. For each area, we measure the unemployment rate and the literacy rate. Is unemployment correlated with years of schooling? What about Brazil?
Violence in Minas Gerais
Attributes in areal data: As a general rule, each measure is a sum, count or similar aggregate function over the whole area. Each value is associated with the whole corresponding area. If we need to choose a single location, we usually take the polygon centroid. There are no intermediate values.
What is mapped in areal data? Typical values are rates or proportions. Numerator = events. Denominator = population at risk. Log maps?
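A minimal sketch of how such a rate and its log could be computed. The areas, the counts and the multiplier of 100,000 are hypothetical values chosen for illustration, not figures from the slides.

```python
import math

def rate_per(events, population, multiplier=100_000):
    """Raw rate: events divided by population at risk, scaled by multiplier."""
    if population <= 0:
        raise ValueError("population at risk must be positive")
    return multiplier * events / population

def log_rate(events, population, multiplier=100_000):
    """Natural log of the raw rate; None when there are no events (log undefined)."""
    r = rate_per(events, population, multiplier)
    return math.log(r) if r > 0 else None

# Hypothetical areas: name -> (events, population at risk)
areas = {"A": (12, 48_000), "B": (3, 1_500), "C": (0, 900)}
for name, (ev, pop) in areas.items():
    print(name, rate_per(ev, pop), log_rate(ev, pop))
```

Areas with zero events have no defined log rate, which is one reason a log map needs a rule for empty areas.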
Log rate of motor vehicle accident death per residents,
Log ratio of homicide deaths of males per residents of the same age group,
Models of Discrete Spatial Variation: Random variable in area i, e.g. number of ill people, number of newborn babies, per capita income. Source: Renato Assunção (UFMG/Brasil)
Dealing with rates and proportions: When the study variable is a rate or a proportion, mapping those rates is the obvious first step in any analysis. However, raw observed rates can be misleading, since their variability is a function of the population counts, which differ widely between areas (Bailey, 1995).
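The dependence of rate variability on population size can be illustrated with a small simulation. This sketch assumes Poisson-distributed event counts and the same underlying risk in every area; the risk value and the two population sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
true_rate = 0.001  # assumed: identical underlying risk in both areas

def simulated_rates(population, n_sim=10_000):
    """Simulate raw rates (count / population) for one area, n_sim times."""
    counts = rng.poisson(true_rate * population, size=n_sim)
    return counts / population

small = simulated_rates(1_000)      # sparsely populated area
large = simulated_rates(100_000)    # densely populated area

# Both areas share the same risk, yet the small one shows far noisier rates.
print("std of raw rate, small area:", small.std())
print("std of raw rate, large area:", large.std())
```

Under these assumptions the standard deviation of the raw rate is sqrt(true_rate / population), so extreme values on a raw-rate map tend to come from the least populated areas.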
Source: Fred Ramos (CEDEST/Brasil)
Model-Driven Approaches: Model of discrete spatial variation. Each subregion is described by a statistical distribution Z_i; e.g., homicide counts are modeled as Poisson. The main objective of the analysis is to estimate the joint distribution of the random variables Z = {Z_1, …, Z_n}. We use a model-driven approach to correct the missing data. It is called the “Empirical Bayes” method. We could also use the “Full Bayes” method (but that is another story...)
Measured rate in area i. In Bayesian statistics, the best estimate of the true and unknown rate is a weighted combination of the measured rate and a prior mean rate. Source: Fred Ramos (CEDEST/Brasil)
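One common form of this estimate, assumed here rather than taken from the slide, is the linear shrinkage estimator used in empirical Bayes rate smoothing. The symbols are notation chosen for this sketch: y_i is the event count, n_i the population at risk, r_i = y_i / n_i the measured rate, and \mu and \sigma^2 the prior mean and variance of the true rates.

```latex
\hat{\theta}_i = w_i \, r_i + (1 - w_i)\,\mu ,
\qquad
w_i = \frac{\sigma^2}{\sigma^2 + \mu / n_i}
```

For a small population n_i the weight w_i approaches 0 and the estimate moves toward \mu; for a large population w_i approaches 1 and the measured rate is left almost unchanged.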
Empirical Bayes: simplifying assumptions for estimating the means and variances of the random variables of all areas (Marshall, 1991). Source: Fred Ramos (CEDEST/Brasil)
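A minimal sketch of a global empirical Bayes smoother along the lines of Marshall (1991), assuming a single prior mean and variance shared by all areas and estimated by the method of moments. The function name and the example counts are hypothetical; this is not necessarily the exact procedure behind the São Paulo maps.

```python
import numpy as np

def empirical_bayes_rates(events, population):
    """Shrink raw rates toward the global mean rate (moment estimates)."""
    events = np.asarray(events, dtype=float)
    population = np.asarray(population, dtype=float)
    raw = events / population
    m = events.sum() / population.sum()                       # prior mean
    s2 = np.sum(population * (raw - m) ** 2) / population.sum()
    a = max(s2 - m / population.mean(), 0.0)                  # prior variance, floored at 0
    if a == 0.0:
        # No detectable variation beyond sampling noise: shrink fully to the mean.
        return np.full_like(raw, m)
    w = a / (a + m / population)                              # weight on the raw rate
    return m + w * (raw - m)

# Hypothetical data: two small areas and two large ones.
events = [0, 5, 40, 300]
population = [500, 1_000, 50_000, 100_000]
smoothed = empirical_bayes_rates(events, population)
print(smoothed)
```

In this example the area with 500 residents and zero events is pulled most of the way toward the global rate, while the area with 100,000 residents barely moves.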
Infant Mortality Rate – São Paulo (Raw) Source: Fred Ramos (CEDEST/Brasil)
Infant Mortality Rate – São Paulo (Corrected) Source: Fred Ramos (CEDEST/Brasil)
Some Important Questions: How does scale matter? How do the spatial partitions matter? How does proximity matter? What can we learn by studying how multiple data vary in space? How many prior assumptions can we impose on our spatial data?
The Modifiable Areal Unit Problem (MAUP): A Question of Scale. A basic problem with areal data: the spatial definition of the boundaries of the areas affects the results. Different results can be obtained just by changing the boundaries of these zones. This problem is known as the “modifiable areal unit problem”.
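The effect can be reproduced on synthetic data. The sketch below assumes a hypothetical 48 x 48 grid on which two variables share a north-south gradient and carry independent noise; the same cells are then averaged over different zonings and the correlation is recomputed each time.

```python
import numpy as np

rng = np.random.default_rng(42)
size = 48
# Hypothetical surface: shared north-south gradient plus independent noise.
gradient = np.linspace(0.0, 1.0, size)[:, None] * np.ones((1, size))
x = gradient + rng.normal(0.0, 1.0, (size, size))
y = gradient + rng.normal(0.0, 1.0, (size, size))

def aggregate(grid, block_rows, block_cols):
    """Average a grid over rectangular zones of block_rows x block_cols cells."""
    rows, cols = grid.shape
    blocks = grid.reshape(rows // block_rows, block_rows,
                          cols // block_cols, block_cols)
    return blocks.mean(axis=(1, 3))

def zone_correlation(block_rows, block_cols):
    """Correlation between x and y after aggregating both to the same zones."""
    ax = aggregate(x, block_rows, block_cols).ravel()
    ay = aggregate(y, block_rows, block_cols).ravel()
    return float(np.corrcoef(ax, ay)[0, 1])

results = {
    "cells (1x1)": zone_correlation(1, 1),
    "blocks (8x8)": zone_correlation(8, 8),
    "row strips (1x48)": zone_correlation(1, 48),
    "column strips (48x1)": zone_correlation(48, 1),
}
for name, r in results.items():
    print(f"{name}: r = {r:.2f}")
```

Comparing cells with blocks shows the scale effect; comparing row strips with column strips, which have the same number and size of zones, shows the zoning effect.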
Per capita income Jobs/ population Illiterate / population Scale Effects Source: Fred Ramos (CEDEST/Brasil)
Scale Effects Per capita income Jobs/ population Illiterate / population Source: Fred Ramos (CEDEST/Brasil)
270 ZONES OD97. Scale Effects: Fighting the MAUP. Population >60 years; illiterates; per capita income. Source: Fred Ramos (CEDEST/Brasil)
96 DISTRICTS OF SÃO PAULO. Scale Effects: Fighting the MAUP. Population >60 years; illiterates; per capita income. Source: Fred Ramos (CEDEST/Brasil)
96 INCOME-HOMOGENEOUS ZONES IN SÃO PAULO. Scale Effects: Fighting the MAUP. Population >60 years; illiterates; per capita income. Source: Fred Ramos (CEDEST/Brasil)
Correlation matrices for three zonings: 270 zones (OD97), 96 districts, 96 income-aggregated zones. Variables: A) percentage of population 60 years old or more; B) percentage of illiterate population; C) per capita individual income. Source: Fred Ramos (CEDEST/Brasil)
The Question of Scale: get census data; identify inter-tract variation; adaptation; minimize the outlier effect; reduce data variability.
Regionalization: Reaggregate N small areas (finest scale available) into M bigger regions to reduce scale effects. A possible solution: constrained clustering.
Regionalization: Maps as graphs
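One way to read "maps as graphs": each area becomes a node and two areas are linked when they share a border. The sketch below assumes a hypothetical rasterized map in which every cell stores the id of its zone, and uses shared cell edges (rook contiguity) as the adjacency rule.

```python
import numpy as np

def adjacency_from_labels(labels):
    """Return {zone: set of neighbouring zones} based on shared cell edges."""
    labels = np.asarray(labels)
    graph = {int(z): set() for z in np.unique(labels)}
    # Compare every cell with its right-hand neighbour and its lower neighbour.
    pairs = [(labels[:, :-1], labels[:, 1:]), (labels[:-1, :], labels[1:, :])]
    for a, b in pairs:
        for u, v in zip(a.ravel(), b.ravel()):
            if u != v:
                graph[int(u)].add(int(v))
                graph[int(v)].add(int(u))
    return graph

# Hypothetical 3 x 3 map with four zones (0 to 3).
zones = [[0, 0, 1],
         [2, 0, 1],
         [2, 3, 3]]
graph = adjacency_from_labels(zones)
print(graph)
```

With vector polygons the same graph would be built from shared boundaries instead of raster cells; the dictionary-of-sets representation stays the same.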
Simple aggregation; population-constrained aggregation
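A population-constrained aggregation can be sketched as a greedy merge on the adjacency graph: the least populated region below a threshold is merged into its least populated neighbour until every region reaches the threshold. This is an illustrative heuristic under assumed inputs (the graph, the populations and the threshold are hypothetical), not the constrained clustering algorithm behind the maps shown.

```python
def aggregate_regions(graph, population, min_pop):
    """Greedily merge adjacent regions until each holds at least min_pop people."""
    members = {r: {r} for r in graph}                 # region id -> original areas
    neighbours = {r: set(ns) for r, ns in graph.items()}
    pop = dict(population)
    while True:
        small = [r for r in pop if pop[r] < min_pop and neighbours[r]]
        if not small:
            break
        r = min(small, key=lambda k: (pop[k], k))               # smallest region
        target = min(neighbours[r], key=lambda k: (pop[k], k))  # smallest neighbour
        # Merge r into target, then rewire the adjacency graph.
        members[target] |= members.pop(r)
        pop[target] += pop.pop(r)
        for n in neighbours.pop(r):
            neighbours[n].discard(r)
            if n != target:
                neighbours[n].add(target)
                neighbours[target].add(n)
    return members, pop

# Hypothetical adjacency graph and populations for four areas.
graph = {0: {1, 2, 3}, 1: {0, 3}, 2: {0, 3}, 3: {0, 1, 2}}
population = {0: 500, 1: 120, 2: 80, 3: 300}
members, pop = aggregate_regions(graph, population, min_pop=200)
print(members, pop)
```

Because merges only happen along graph edges, every resulting region is spatially contiguous, which is the constraint that separates regionalization from ordinary clustering.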