Data Projects at the Minnesota Population Center: Resources for Comparative Population and Health Research
Seattle, Washington, May 22, 2014
Elizabeth Boyle, Miriam King, Matthew Sobek
Minnesota Population Center, University of Minnesota
Integrated Public Use Microdata Series
U.S. Labor Force Participation: Men Women
Steve Ruggles 1995: “King of Quant” President Population Association of America
New U.S. Data From Ancestry.com
Minnesota Population Center: We build data infrastructure for the research community. We specialize in data harmonization. World’s largest collection of individual population and health data, across 9 projects. 50,000 registered users from over 100 countries. The data are free.
MPC Data Dissemination, Gigabytes per week
MPC Data Projects
The Problem
1. Combining data from multiple sources is time consuming: discovery, data management
2. It’s error prone: recoding data, overlooked documentation
3. It’s hard to replicate results
4. It discourages comparative research
Outline: Harmonization methods; Dissemination system; International projects (Integrated DHS, Terra Populus, IPUMS-International)
Terminology
Harmonization: Combining datasets collected at different times or places into a single, consistent data series. Also called “integration.”
Metadata: Data about data. Documentation in the broadest sense.
Microdata: relation to head, marital status, education, occupation
Summary Data
Harmonization Methods: Metadata; Data; Dissemination
Systematize Metadata (record layout file, pdf)
MPC Data Dictionary
Convert Questionnaires to Metadata (Mexico 2000): Water access
Metadata: Questionnaire Text
XML-Tagged Questionnaire Text: water access, bedrooms, rooms
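A minimal sketch of what tagging questionnaire text makes possible: once each question is tagged with the variable it feeds, the text for any variable can be looked up programmatically. The element and attribute names, the variable names (WATER, BEDROOMS, ROOMS), and the placeholder question text are all assumptions made for illustration, not the MPC's actual markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: tag names, attribute names, variable names, and the
# placeholder question text are illustrative only, not the MPC's schema.
QUESTIONNAIRE_XML = """
<questionnaire country="Mexico" year="2000">
  <question var="WATER">Water access</question>
  <question var="BEDROOMS">Bedrooms</question>
  <question var="ROOMS">Rooms</question>
</questionnaire>
"""


def questions_by_variable(xml_text):
    """Return a dict mapping each tagged variable to its questionnaire text."""
    root = ET.fromstring(xml_text.strip())
    return {q.get("var"): q.text.strip() for q in root.iter("question")}


question_text = questions_by_variable(QUESTIONNAIRE_XML)
print(question_text["WATER"])
```

With the text stored this way, a documentation page can show the relevant question for each variable and sample without anyone opening the original PDF.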
Data: Variable Harmonization. Marital Status: IPUMS-International
Bangladesh: 1 = Unmarried; 2 = Married; 3 = Widowed; 4 = Divorced/separated
Mexico: 1 = Married, civil & relig; 2 = Married, civil; 3 = Married, religious; 4 = Consensual union; 5 = Widowed; 6 = Divorced; 7 = Separated; 8 = Single
Kenya: 1 = Never married; 2 = Monogamous; 3 = Polygamous; 4 = Widowed; 5 = Divorced; 6 = Separated
Translation Table: Input
Bangladesh: 1 = Unmarried; 2 = Married; 3 = Widowed; 4 = Divrc or separated
Mexico: 1 = Married, civil & relig; 2 = Married, civil; 3 = Married, religious; 4 = Consensual union; 5 = Widowed; 6 = Divorced; 7 = Separated; 8 = Single
Kenya: 1 = Never married; 2 = Monogamous; 3 = Polygamous; 4 = Widowed; 5 = Divorced; 6 = Separated
Translation Table (Label, Code): Input and Harmonized
Input, Bangladesh 2011: 1 = Unmarried; 2 = Married; 3 = Widowed; 4 = Divrc or separated
Input, Mexico 1970: 1 = Married, civil & relig; 2 = Married, civil; 3 = Married, religious; 4 = Consensual union; 5 = Widowed; 6 = Divorced; 7 = Separated; 8 = Single
Input, Kenya 1999: 1 = Never married; 2 = Monogamous; 3 = Polygamous; 4 = Widowed; 5 = Divorced; 6 = Separated
Harmonized labels: Single; Married or in union; Married, formally; Civil; Religious; Civil and religious; Monogamous; Polygamous; Consensual union; Separated; Divorced; Divorced or separated 3; Widowed 4
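The translation-table idea can be sketched as a lookup keyed on sample and input code. The input codes and harmonized labels below are taken from the slides; the pairing of each input code with a harmonized label is an inference from the labels, and the names `TRANSLATION` and `harmonize` are hypothetical. Harmonized values are kept as labels rather than numbers because most of the harmonized codes are not given here.

```python
# (sample, input code) -> harmonized label.
# Input codes and labels come from the slides; the pairings are inferred
# from the label text and are an assumption of this sketch.
TRANSLATION = {
    ("Bangladesh 2011", 1): "Single",
    ("Bangladesh 2011", 2): "Married or in union",
    ("Bangladesh 2011", 3): "Widowed",
    ("Bangladesh 2011", 4): "Divorced or separated",
    ("Mexico 1970", 1): "Civil and religious",
    ("Mexico 1970", 2): "Civil",
    ("Mexico 1970", 3): "Religious",
    ("Mexico 1970", 4): "Consensual union",
    ("Mexico 1970", 5): "Widowed",
    ("Mexico 1970", 6): "Divorced",
    ("Mexico 1970", 7): "Separated",
    ("Mexico 1970", 8): "Single",
    ("Kenya 1999", 1): "Single",
    ("Kenya 1999", 2): "Monogamous",
    ("Kenya 1999", 3): "Polygamous",
    ("Kenya 1999", 4): "Widowed",
    ("Kenya 1999", 5): "Divorced",
    ("Kenya 1999", 6): "Separated",
}


def harmonize(sample, code):
    """Translate one sample-specific marital status code to a harmonized label."""
    try:
        return TRANSLATION[(sample, code)]
    except KeyError:
        # Fail loudly: an unmapped code signals a gap in the table,
        # not a value to pass through silently.
        raise ValueError(f"no translation for code {code} in {sample}") from None


# The same harmonized category is reached from three different input codings.
singles = [harmonize("Bangladesh 2011", 1),
           harmonize("Mexico 1970", 8),
           harmonize("Kenya 1999", 1)]
```

Keeping the recoding in a table rather than in ad hoc scripts means every translation decision is visible in one place and can be reviewed or corrected without touching code.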
Data Dissemination System
Variables Page
238 censuses
Sample Filtering
Variables Page – Filtered
Variable Page: Marital Status
Variable Codes (Marital status)
Variable Page: Marital Status
Variable Comparability Discussion (Marital status)
Variable Page: Documentation
Questionnaire Text (Marital status, Cambodia)
Variables Page
Extract Summary
Case Selection
Attached Characteristics: age of spouse, employment status of father, occupation of father
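Attaching a characteristic means copying a variable from a related household member onto a person's own record. A minimal sketch for age of spouse, assuming each person record carries a person number and a pointer to the spouse's person number within the same household (0 when no spouse is present); the field names `pernum`, `sploc`, and `age_sp` are assumptions of this sketch.

```python
def attach_spouse_age(household):
    """Return copies of the person records with the spouse's age attached.

    household: list of dicts with keys 'pernum', 'age', and 'sploc', where
    'sploc' is the spouse's 'pernum' or 0 if no spouse is in the household.
    (Field names are assumptions made for this sketch.)
    """
    age_by_pernum = {p["pernum"]: p["age"] for p in household}
    # A pointer of 0 matches no person number, so age_sp becomes None.
    return [dict(p, age_sp=age_by_pernum.get(p["sploc"])) for p in household]


household = [
    {"pernum": 1, "age": 40, "sploc": 2},
    {"pernum": 2, "age": 38, "sploc": 1},
    {"pernum": 3, "age": 10, "sploc": 0},
]
with_spouse_age = attach_spouse_age(household)
```

The same pattern would cover the father's employment status or occupation, given a pointer to the father's record.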
Extract Summary
Download or Revise Extract
On-line Analysis
The International Projects
Integrated DHS
Demographic and Health Surveys: Foremost source of health information for the developing world. Funded by USAID. Since the 1980s, over 300 surveys in 90 countries. Topics: fertility, nutrition, HIV, malaria, maternal and child health, etc.
IDHS Project: 5-year NIH grant (end of year 2). Focus on Africa, with India. Partnership with ICF-International and USAID.
Why an Integrated DHS? Motivation: DHS is incredibly valuable, but it’s hard to capitalize on its full potential. Problems: data discovery; dispersed documentation; data management; variable changes over time. These are not unique to DHS: they are endemic to any survey that has persisted over decades.
DHS Research Process Example: Find data on female genital cutting Survey Search Tool
Recode notes and data dictionary: just the woman file, for one survey. 61 to go. Still need: the report (377-page PDF), which contains questionnaire and sample design information; the errata file.
At least the DHS variables are already harmonized, right? DHS “Recode Variables” make it more harmonized than most surveys: consistent variable names; each DHS phase has a shared model questionnaire. But: 6 phases over 25+ years; country control over final wording of surveys; country-specific variables. The recode variables can be a two-edged sword.
Harmonization: Religion. Variable V130 in Ghana 1993, Ghana 2008, India 1992, and India 2005.
Harmonization: Female Circumcision Ever Circumcised
Timeline: 2014 (current). 9 countries, 39 samples. Much of the woman files. Women of childbearing age as the unit of analysis.
Timeline: countries, 69 samples Complete the woman files Children & birth files
Timeline: countries, 94 samples Men and couples files
Timeline: Next grant. 41 African countries, 130+ samples; 11 Asian countries, 32+ samples.
Beta
Terra Populus Goal: Lower barriers to conducting research on population and the environment. Motivation: the data from different domains have incompatible formats, and few researchers have the skills to combine them.
TerraPop: 5-year NSF grant. At midpoint: year 3.
Population Microdata: 6 countries: Argentina, Brazil, Malawi, Spain, United States, Vietnam
Area-level Data: Tabulations of census data for administrative units
Environmental Data: Rasters (Grid Cells). Land cover from satellite images (Global Land Cover 2000). Agricultural use from satellites and government records (Global Landscapes Initiative). Climate from weather stations (WorldClim).
Location-Based Integration (microdata, area-level data, rasters): Mix and match variables originating in any of the data structures. Obtain output in the data structure most useful to you.
Location-Based Integration, microdata output: Individuals and households with their environmental and social context.
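One way to picture microdata output with attached context is a join on a shared geographic identifier: each person record picks up the area-level and environmental values for the unit where the person lives. This is a sketch under assumptions; the field names (`county_id`, `mean_ann_temp`) and the values are invented for illustration.

```python
def attach_area_context(persons, area_table, key="county_id"):
    """Copy area-level values onto each person record via a shared area ID.

    persons: list of person dicts, each holding the geographic ID under `key`.
    area_table: dict mapping geographic ID -> dict of contextual variables.
    Persons in areas missing from the table are returned unchanged.
    """
    return [{**p, **area_table.get(p[key], {})} for p in persons]


# Invented example values, for illustration only.
persons = [
    {"person": 1, "county_id": "A", "age": 34},
    {"person": 2, "county_id": "B", "age": 61},
    {"person": 3, "county_id": "C", "age": 19},
]
area_table = {
    "A": {"mean_ann_temp": 12.0},
    "B": {"mean_ann_temp": 22.0},
}
persons_in_context = attach_area_context(persons, area_table)
```

The join only works when the microdata and the area-level table use the same geographic units, which is why consistent boundaries matter so much for this kind of integration.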
Location-Based Integration, area-level output: Summarized environmental and population characteristics for administrative districts. [Table columns: County ID; Mean Ann. Temp.; Max. Ann. Precip.; Rent, Rural; Rent, Urban; Own, Rural; Own, Urban]
Location-Based Integration, raster output: Rasters of population and environment data.
Rasterization of Area-Level Data
Area-Level Summary of Raster Data
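Both conversions can be sketched with a grid of zone IDs that records which administrative unit each cell falls in. Summarizing a raster averages the cell values within each unit; rasterizing writes each unit's value into its cells. This is a toy illustration with small nested lists and invented values, not the TerraPop implementation, and it ignores cell area weighting and cells split across boundaries.

```python
def summarize_raster_by_area(values, zones):
    """Area-level summary of raster data: mean cell value per zone.

    values and zones are equally shaped grids (lists of rows). Cells whose
    value or zone is None are skipped.
    """
    totals, counts = {}, {}
    for value_row, zone_row in zip(values, zones):
        for value, zone in zip(value_row, zone_row):
            if value is None or zone is None:
                continue
            totals[zone] = totals.get(zone, 0.0) + value
            counts[zone] = counts.get(zone, 0) + 1
    return {zone: totals[zone] / counts[zone] for zone in totals}


def rasterize_area_values(zones, area_values):
    """Rasterization of area-level data: give each cell its zone's value.

    Cells in zones without a value (or with no zone) become None.
    """
    return [[area_values.get(zone) for zone in row] for row in zones]


# Invented grids: two administrative units, A and B.
zones = [["A", "A", "B"],
         ["A", "B", "B"]]
temperature = [[10.0, 12.0, 20.0],
               [14.0, 22.0, 24.0]]

mean_temp = summarize_raster_by_area(temperature, zones)
population_grid = rasterize_area_values(zones, {"A": 1, "B": 2})
```

Both directions depend on the zone grid, which in turn is derived from administrative boundary files.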
Boundaries are Key: Linkages across data formats rely on administrative unit boundaries. Particular needs: lower-level boundaries; historical boundaries.
Geographic Harmonization
Beta Version: Web interface will change significantly in fall 2014. Fast microdata tabulator needed.
IPUMS-International
Census microdata from around the world. Funded by NSF and NIH. Motivation: provide data access; preservation.
Khartoum, CBS-Sudan
Dhaka, Bangladesh Bureau of Statistics
IPUMS-International Participating Disseminating
IPUMS Censuses Per Country
Variables Included in Extracts
Top Institutional Users
Millennium Development Goals Ratio of literate women to men, years old Source: Cuesta and Lovatón (2014) 1990 Census round
Millennium Development Goals. Colombia: Adolescent Birth Rate, Census 1993 and Census 2005. Source: Cuesta and Lovatón (2014). Data source: IPUMS-International, Minnesota Population Center.
IPUMSI Future: Data acquisition. Outreach: developing countries. Virtual data enclave.
Thank you!