CyberGIS: Reston, VA, September 22, 2018
Terra Populus is a relatively new project at the Minnesota Population Center (MPC). The project is led by the MPC in collaboration with our partners at the University of Minnesota Libraries, the Institute on the Environment at the University of Minnesota, CIESIN at Columbia University, and ICPSR at the University of Michigan.
Mission Statement
Enabling research, learning, and policy analysis by providing integrated spatiotemporal data describing people and their environment.
Overview
- Big Heterogeneous Data
- Location Integration
- Future
- Paragon
- Dynamic Tabulator
- Terra Explorer
- TerraPop API
- Collaborators
Big Heterogeneous Data
TerraPop Data Formats
- Microdata: characteristics of individuals and households
- Area-level data: characteristics of places defined by boundaries
- Raster data: values tied to spatial coordinates
[Figure: transformations among Microdata, Area-Level Data, and Rasters. Labels: Summarization / Tabulation, Join Contextual Data, Dasymetric Mapping, Zonal Statistics, Spatial Reallocation]
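Of the transformations named in the figure, zonal statistics is the easiest to show in a few lines. The sketch below is illustrative only: it assumes the boundaries have already been rasterized into a grid of zone IDs aligned with the value raster, and the grids, zone labels, and the function name `zonal_mean` are all made up for the example.

```python
from collections import defaultdict


def zonal_mean(values, zones):
    """Mean of raster cell values per zone ID; cells that are None are skipped."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value_row, zone_row in zip(values, zones):
        for value, zone in zip(value_row, zone_row):
            if value is None or zone is None:
                continue  # no-data cell or cell outside every zone
            sums[zone] += value
            counts[zone] += 1
    return {zone: sums[zone] / counts[zone] for zone in sums}


# Hypothetical 3x4 temperature raster and a matching grid of zone IDs.
temperature = [
    [20.0, 22.0, 24.0, 26.0],
    [20.0, 22.0, 24.0, None],
    [21.0, 23.0, 25.0, 25.0],
]
zone_ids = [
    ["A", "A", "B", "B"],
    ["A", "A", "B", "B"],
    ["A", "A", "B", "B"],
]

zonal_means = zonal_mean(temperature, zone_ids)
for zone, mean in sorted(zonal_means.items()):
    print(zone, round(mean, 2))
```

The result is one row per zone, which is the area-level shape described under TerraPop Data Formats.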
Location-Based Integration: Microdata, Area-level, Raster
Location-Based Integration
- Mix and match variables originating in any of the data structures
- Obtain output in the data structure most useful to you
- Integration across domains and formats hinges on geography
- Users get any type of data in a format useful to them
- Requires boundary files, with boundaries harmonized over time
Location-Based Integration: summarized environmental and population characteristics for administrative districts

| County ID | Avg. Ann. Temp. | Avg. Ann. Precip. | Rent, Rural | Rent, Urban | Own, Rural | Own, Urban |
|---|---|---|---|---|---|---|
| G17003100001 | 21.2 | 768 | 3129 | 1063 | 637 | 365 |
| G17003100002 | 23.4 | 589 | 2949 | 1075 | 1469 | 717 |
| G17003100003 | 24.3 | 867 | 3418 | 1589 | 1108 | 617 |
| G17003100004 | 21.5 | 943 | 1882 | 425 | 202 | 142 |
| G17003100005 | 24.1 | | 2416 | 572 | 426 | 197 |
| G17003100006 | 24.4 | 697 | 2560 | 934 | 950 | 563 |
| G17003100007 | 25.6 | 701 | 2126 | 653 | 321 | 215 |

| County ID | Mean Ann. Temp. | Max. Ann. Precip. |
|---|---|---|
| G17003100001 | 21.2 | 768 |
| G17003100002 | 23.4 | 589 |
| G17003100003 | 24.3 | 867 |
| G17003100004 | 21.5 | 943 |
| G17003100005 | 24.1 | |
| G17003100006 | 24.4 | 697 |
| G17003100007 | 25.6 | 701 |
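The tables on this slide share a County ID column, which is what lets environmental and population summaries be combined. A minimal sketch of that join follows, using three rows from the slide. The dictionary layout, the field names, and the function `join_on_geography` are assumptions for illustration, not TerraPop's implementation.

```python
# Environmental summaries keyed by county ID (values from the slide;
# None marks the precipitation cell that is blank in the source).
environment = {
    "G17003100001": {"temp": 21.2, "precip": 768},
    "G17003100004": {"temp": 21.5, "precip": 943},
    "G17003100005": {"temp": 24.1, "precip": None},
}

# Household counts by tenure and urban/rural status for the same counties.
population = {
    "G17003100001": {"rent_rural": 3129, "rent_urban": 1063,
                     "own_rural": 637, "own_urban": 365},
    "G17003100004": {"rent_rural": 1882, "rent_urban": 425,
                     "own_rural": 202, "own_urban": 142},
    "G17003100005": {"rent_rural": 2416, "rent_urban": 572,
                     "own_rural": 426, "own_urban": 197},
}


def join_on_geography(left, right):
    """Inner join of two area-level tables on their shared geographic ID."""
    shared_ids = sorted(left.keys() & right.keys())
    return {geo_id: {**left[geo_id], **right[geo_id]} for geo_id in shared_ids}


joined = join_on_geography(environment, population)
for geo_id, row in joined.items():
    print(geo_id, row)
```

The join only works if both tables use the same boundaries for the same IDs, which is why harmonized boundary files are listed as a requirement.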
Location-Based Integration: individuals and households with their environmental and social context
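Attaching context to microdata is a many-to-one lookup: every person record in a county receives that county's attributes. The sketch below assumes each record carries a county ID. The person records, field names, and the function `attach_context` are hypothetical; the temperatures come from the slide's table.

```python
# County-level context (temperature values from the slide's table).
county_context = {
    "G17003100001": {"mean_ann_temp": 21.2},
    "G17003100002": {"mean_ann_temp": 23.4},
}

# Hypothetical person records; only the county ID links them to the context.
persons = [
    {"person_id": 1, "age": 34, "county_id": "G17003100001"},
    {"person_id": 2, "age": 61, "county_id": "G17003100002"},
    {"person_id": 3, "age": 8, "county_id": "G17003100001"},
]


def attach_context(records, context, key="county_id"):
    """Return copies of the records with area-level attributes added."""
    return [{**record, **context.get(record[key], {})} for record in records]


persons_with_context = attach_context(persons, county_context)
for record in persons_with_context:
    print(record)
```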
Location-Based Integration: rasters of population and environment data
Current Work
- Data
- Paragon
- Tabulation
- Geovisualization
Data
- Aggregate census data
  - Historical data (48 countries)
  - Variables in addition to population by sex (65 countries)
- Gridded Population of the World
- Environmental data
  - CRU monthly time series: precipitation and temperature
  - Vegetation characteristics: NDVI, greenness
  - Elevation and derived characteristics
  - Soils
  - Species distribution (GBIF)
Raster Data
- MODIS Land Data
  - Yearly land cover data derived from the MODIS Terra and Aqua satellites, available for 2001-2013
  - 5 land cover classifications, 240 gigabytes
- Earth Science
  - ASTER DEM at 30-meter resolution: 500 gigabytes
  - TauDEM derivatives (slope, solar radiance, wetness index) will add about 6-8 terabytes of data
- Climate Datasets
  - NetCDF format
  - Climate Research Unit: 40 gigabytes
Paragon
Joins in Distributed Databases
1. Create a temporary table TMP
2. Reconstitute area on each node as TMP
3. Join TMP with the local partition of line on each node
[Figure: nodes node1, node2, … nodeN holding partitions line1, line2, area1, area2, with a copy of TMP on each node]
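The join strategy on this slide can be simulated in a single process. The sketch below is a toy model, not Paragon or Stado code: the node layout and geometries are invented, and a bounding-box overlap test stands in for a real spatial predicate.

```python
def intersects(box_a, box_b):
    """True if two (xmin, ymin, xmax, ymax) boxes overlap or touch."""
    return (box_a[0] <= box_b[2] and box_b[0] <= box_a[2]
            and box_a[1] <= box_b[3] and box_b[1] <= box_a[3])


# Hypothetical cluster: each node holds one partition of `line` and of `area`.
nodes = [
    {"line": [("l1", (0, 0, 2, 2)), ("l2", (5, 5, 6, 6))],
     "area": [("a1", (1, 1, 3, 3))]},
    {"line": [("l3", (2, 2, 4, 4)), ("l4", (9, 9, 10, 10))],
     "area": [("a2", (5, 5, 7, 7))]},
]


def distributed_join(cluster):
    # Create TMP and reconstitute the full `area` table in it; in a real
    # cluster this copy would be shipped to every node.
    tmp = [row for node in cluster for row in node["area"]]
    results = []
    for node in cluster:
        # Each node joins TMP against only its local partition of `line`.
        for line_id, line_box in node["line"]:
            for area_id, area_box in tmp:
                if intersects(line_box, area_box):
                    results.append((line_id, area_id))
    return sorted(results)


pairs = distributed_join(nodes)
print(pairs)
```

In this toy data, `l3` is stored on the second node but overlaps `a1` from the first. A join restricted to each node's own `area` partition would miss that pair, which is why `area` is reconstituted on every node.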
Spatial Join
Paragon query:
    select a.gid, b.gid
    from edges_merge_ca_shall as a, arealm_merge_ca_shall as b
    where st_crosses(a.geom, b.geom);
- PostgreSQL (standalone): 463 seconds
- Stado-Spatial (2 nodes): 96 seconds
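The two timings can be turned into a speedup figure with simple arithmetic. The snippet uses only the numbers reported on the slide; the variable names are chosen for illustration.

```python
standalone_seconds = 463   # PostgreSQL (standalone), from the slide
distributed_seconds = 96   # Stado-Spatial on 2 nodes, from the slide
node_count = 2

speedup = standalone_seconds / distributed_seconds
speedup_per_node = speedup / node_count

print(f"speedup: {speedup:.2f}x, per node: {speedup_per_node:.2f}x")
```

The overall ratio is larger than the node count, so the difference between the two runs is not explained by the number of nodes alone.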
Tabulation
Tabulator
- Generates area-level data from microdata using geographic level codes
  - National, First Level (e.g., State), Second Level (e.g., County)
- Parquet on Apache Spark
  - High compression ratio: 8 gigabytes gzip-compressed, 3 gigabytes Parquet-compressed
  - Columnar storage (3,000+)
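Tabulating microdata by geographic level codes amounts to a grouped count. The sketch below uses plain Python in place of Spark and Parquet, which the slide names as the actual platform. The records, the field names (`country`, `geo1`, `geo2`, `tenure`), and the function `tabulate` are hypothetical.

```python
from collections import Counter

# Hypothetical microdata: one record per person, with geographic level codes.
microdata = [
    {"country": "XX", "geo1": "01", "geo2": "01001", "tenure": "own"},
    {"country": "XX", "geo1": "01", "geo2": "01001", "tenure": "rent"},
    {"country": "XX", "geo1": "01", "geo2": "01002", "tenure": "own"},
    {"country": "XX", "geo1": "02", "geo2": "02001", "tenure": "rent"},
    {"country": "XX", "geo1": "02", "geo2": "02001", "tenure": "rent"},
]

# Fields that identify an area at each geographic level.
LEVELS = {
    "national": ("country",),
    "first": ("country", "geo1"),
    "second": ("country", "geo1", "geo2"),
}


def tabulate(records, level, variable):
    """Count records per (area, category) at the requested geographic level."""
    area_fields = LEVELS[level]
    counts = Counter()
    for record in records:
        area = tuple(record[field] for field in area_fields)
        counts[(area, record[variable])] += 1
    return counts


first_level = tabulate(microdata, "first", "tenure")
for (area, category), count in sorted(first_level.items()):
    print(area, category, count)
```

Switching the `level` argument changes only the grouping key, so the same microdata yields national, first-level, or second-level tables.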
Query Performance

| Number of datasets | Number of records | Time to aggregate by 1 column | Time to aggregate by 2 columns | Time to aggregate by 3 columns |
|---|---|---|---|---|
| 1 | 12 million | 5 seconds | 9 seconds | 10 seconds |
| 10 | 25 million | 7 seconds | 18 seconds | |
| 20 | 82 million | 12 seconds | 27 seconds | |
| 56 | 128 million | 7.5 seconds | 20 seconds | 30 seconds |
Visualization
Landing Page
Terra Populus Software Stack
- Geospatial Data Processing
- Web Application
- Geospatial Server