Published by Timothy Park. Modified over 9 years ago.
1
The Minnesota Data Harmonization Projects
Bill & Melinda Gates Foundation, Seattle, Washington
May 21, 2014
Elizabeth Boyle, Miriam King, Matthew Sobek
Minnesota Population Center, University of Minnesota
sobek@umn.edu
3
Integrated Public Use Microdata Series
4
Minnesota Population Center
We build data infrastructure for the research community.
We specialize in data harmonization.
World’s largest collection of individual population and health data, across 9 projects.
50,000 registered users from over 100 countries.
Free.
5
MPC Data Dissemination, 1993-2012 (gigabytes per week)
6
MPC Data Projects
7
The Problem
1. Combining data from multiple sources is time consuming
   Discovery
   Data management
2. It’s error prone
   Recoding data
   Overlooking documentation
3. It’s hard to replicate results
4. It discourages comparative research
8
Outline
Harmonization methods
Dissemination system
International projects
  Integrated DHS
  Terra Populus
  IPUMS-International
9
Terminology
Harmonization (“integration”): Combining datasets collected at different times or places into a single, consistent data series.
Metadata: Data about data. Documentation in the broadest sense.
10
Microdata
Relation to head, Marital status, Education, Occupation
11
Summary Data
12
Harmonization Methods
Metadata
Data
Dissemination
13
Systematize Metadata (record layout file, pdf)
14
MPC Data Dictionary
15
Convert Questionnaires to Metadata (Mexico 2000)
Water access
16
Metadata: Questionnaire Text
17
XML-Tagged Questionnaire Text
Water access
Bedrooms
Rooms
18
Data: Variable Harmonization
Marital Status: IPUMS-International

Bangladesh 2011
  1 = Unmarried
  2 = Married
  3 = Widowed
  4 = Divorced/separated

Mexico 1970
  1 = Married, civil & relig
  2 = Married, civil
  3 = Married, religious
  4 = Consensual union
  5 = Widowed
  6 = Divorced
  7 = Separated
  8 = Single

Kenya 1999
  1 = Never married
  2 = Monogamous
  3 = Polygamous
  4 = Widowed
  5 = Divorced
  6 = Separated
19
Translation Table: Input

Bangladesh 2011
  1 = Unmarried
  2 = Married
  3 = Widowed
  4 = Divorced or separated

Mexico 1970
  1 = Married, civil & relig
  2 = Married, civil
  3 = Married, religious
  4 = Consensual union
  5 = Widowed
  6 = Divorced
  7 = Separated
  8 = Single

Kenya 1999
  1 = Never married
  2 = Monogamous
  3 = Polygamous
  4 = Widowed
  5 = Divorced
  6 = Separated
20
Translation Table: Harmonized

Code  Label                     Bangladesh 2011            Mexico 1970                 Kenya 1999
100   Single                    1 = Unmarried              8 = Single                  1 = Never married
200   Married or in union       2 = Married
210     Married, formally
211       Civil                                            2 = Married, civil
212       Religious                                        3 = Married, religious
213       Civil and religious                              1 = Married, civil & relig
214       Monogamous                                                                   2 = Monogamous
215       Polygamous                                                                   3 = Polygamous
220     Consensual union                                   4 = Consensual union
300   Divorced or separated     4 = Divorced or separated
310     Separated                                          7 = Separated               6 = Separated
320     Divorced                                           6 = Divorced                5 = Divorced
400   Widowed                   3 = Widowed                5 = Widowed                 4 = Widowed
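The translation-table idea on this slide can be sketched in a few lines of Python. This is only an illustration, not the MPC's actual software. The dictionary layout and the names `HARMONIZED_LABELS`, `TRANSLATION`, `harmonize`, and `general` are hypothetical. Input codes are matched to harmonized codes by the meaning of their labels, "Divorced or separated" and "Widowed" are assumed to carry codes 300 and 400, and treating the first digit as a general category is an assumption suggested by the nested code list.

```python
# Harmonized codes and labels as listed on the slide
# (300 and 400 are assumed for "Divorced or separated" and "Widowed").
HARMONIZED_LABELS = {
    100: "Single",
    200: "Married or in union",
    210: "Married, formally",
    211: "Civil",
    212: "Religious",
    213: "Civil and religious",
    214: "Monogamous",
    215: "Polygamous",
    220: "Consensual union",
    300: "Divorced or separated",
    310: "Separated",
    320: "Divorced",
    400: "Widowed",
}

# Translation table: sample -> {input code: harmonized code}.
# The pairings are inferred by matching labels; they are an assumption.
TRANSLATION = {
    "Bangladesh 2011": {1: 100, 2: 200, 3: 400, 4: 300},
    "Mexico 1970": {1: 213, 2: 211, 3: 212, 4: 220,
                    5: 400, 6: 320, 7: 310, 8: 100},
    "Kenya 1999": {1: 100, 2: 214, 3: 215, 4: 400, 5: 320, 6: 310},
}


def harmonize(sample, code):
    """Look up the harmonized code for one sample-specific input code."""
    return TRANSLATION[sample][code]


def general(code):
    """Collapse a three-digit harmonized code to its first digit.

    Assumes the first digit identifies a category comparable across all samples.
    """
    return code // 100
```

For example, `harmonize("Kenya 1999", 3)` returns 215, and `general(215)` returns 2, the same general category as Bangladesh's undifferentiated "Married" (200). The detail a sample did collect is kept in the trailing digits, while the leading digit stays comparable everywhere.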
22
Data Dissemination System
24
Variables Page
25
238 censuses
26
Sample Filtering
27
Variables Page – Filtered
28
Variable Page: Marital Status
29
Variable Codes (Marital status)
30
Variable Codes (Marital status)
31
Variable Codes (Marital status)
32
Variable Page: Marital Status
33
Variable Comparability Discussion (Marital status)
34
Variable Page: Documentation
35
Questionnaire Text
36
(Marital status, Cambodia)
37
Variables Page
38
Extract Summary
39
Case Selection
40
Attached Characteristics
Age of spouse
Employment status of father
Occupation of father
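An attached characteristic such as "age of spouse" copies a variable from one household member onto another member's record. The sketch below shows one way this could work, assuming each person record carries a pointer to the spouse's person number within the household. The field names (`hh`, `pernum`, `sploc`, `age_sp`) and the function `attach_characteristic` are hypothetical and are not taken from the slides.

```python
def attach_characteristic(persons, pointer, source_var, new_var):
    """Copy source_var from the household member that each record points to.

    persons    : list of dicts, one per person
    pointer    : field holding the person number of the related member
                 (0 or missing means no such member in the household)
    source_var : variable to copy from the related member
    new_var    : name of the attached variable on the output record
    """
    # Index every person by (household, person number) for direct lookup.
    by_key = {(p["hh"], p["pernum"]): p for p in persons}
    out = []
    for p in persons:
        other = by_key.get((p["hh"], p.get(pointer)))
        record = dict(p)  # leave the input records unchanged
        record[new_var] = other[source_var] if other else None
        out.append(record)
    return out


# Hypothetical three-person household: two spouses and a child.
persons = [
    {"hh": 1, "pernum": 1, "age": 40, "sploc": 2},
    {"hh": 1, "pernum": 2, "age": 38, "sploc": 1},
    {"hh": 1, "pernum": 3, "age": 12, "sploc": 0},
]
with_spouse_age = attach_characteristic(persons, "sploc", "age", "age_sp")
```

The same function would cover "occupation of father" given a pointer to the father's person number and an occupation variable.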
41
Extract Summary
42
Download or Revise Extract
43
On-line Analysis
44
The International Projects
45
Integrated DHS
46
Demographic and Health Surveys
Foremost source of health information for the developing world
Funded by USAID
Since the 1980s: over 300 surveys, 90 countries
Topics: fertility, nutrition, HIV, malaria, maternal and child health, etc.
47
IDHS Project
5-year NIH grant (end of year 2)
Focus on Africa, with India
Partnership with ICF-International and USAID
48
Why an Integrated DHS?
Motivation: DHS is incredibly valuable, but it’s hard to capitalize on its full potential.
Problem:
  Data discovery
  Dispersed documentation
  Data management
  Variable changes over time
Not unique to DHS: endemic to any survey that has persisted over decades.
49
DHS Research Process
Example: Find data on female genital cutting
Survey Search Tool
52
Recode notes
Data dictionary
Just the woman file – for one survey. 61 to go.
Still need:
  Report (377-page pdf): contains questionnaire and sample design information
  Errata file
53
At least the DHS variables are already harmonized, right?
DHS “Recode Variables” make it more harmonized than most surveys
  Consistent variable names
  Each DHS phase has a shared model questionnaire
But:
  6 phases over 25+ years
  Country control over final wording of surveys
  Country-specific variables
The recode variables can be a two-edged sword
54
Harmonization: Religion
Ghana 1993 V130
Ghana 2008 V130
India 1992 V130
India 2005 V130
55
Harmonization: Female Circumcision
Ever Circumcised
56
Timeline: 2014 (current)
9 countries, 39 samples
Much of the woman files
Women of childbearing age as unit of analysis
57
Timeline: 2015
15 countries, 69 samples
Complete the woman files
Children & birth files
58
Timeline: 2017
21 countries, 94 samples
Men and couples files
59
Timeline: Next grant
41 African countries, 130+ samples
11 Asian countries, 32+ samples
60
Beta
61
Terra Populus Goal
Lower barriers to conducting research on population and the environment.
Motivation: The data from different domains have incompatible formats, and few researchers have the skills to combine them.
62
TerraPop
5-year NSF grant
At mid-point: year 3
63
Population Microdata
6 countries: Argentina, Brazil, Malawi, Spain, United States, Vietnam
64
Area-level Data
Tabulations of census data for administrative units
65
Environmental Data: Rasters (Grid Cells)
Land cover from satellite images (Global Land Cover 2000)
Agricultural use from satellites and government records (Global Landscapes Initiative)
Climate from weather stations (WorldClim)
66
Location-Based Integration
Microdata, Area-level data, Rasters
Mix and match variables originating in any of the data structures
Obtain output in the data structure most useful to you
67
Location-Based Integration
Individuals and households with their environmental and social context
Microdata, Area-level data, Rasters
68
Location-Based Integration
Summarized environmental and population characteristics for administrative districts
Microdata, Area-level data, Rasters

County ID     Mean Ann. Temp.  Max. Ann. Precip.  Rent, Rural; Rent, Urban; Own, Rural; Own, Urban (unseparated)
G17003100001  21.2             768                31291063637365
G17003100002  23.4             589                294910751469717
G17003100003  24.3             867                341815891108617
G17003100004  21.5             943                1882425202142
G17003100005  24.1             867                2416572426197
G17003100006  24.4             697                2560934950563
G17003100007  25.6             701                2126653321215
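The area-level output described on this slide combines two things per county: environmental summaries and tabulated population counts. A minimal sketch of that combination follows. The temperature and precipitation values are the first three rows of the slide's table, read on the assumption that temperature has one decimal place. The household records, the field names, and the function `summarize_by_area` are hypothetical and exist only for illustration.

```python
from collections import Counter

# Environmental summaries keyed by county ID
# (values as read from the slide, assuming one decimal in temperature).
environment = {
    "G17003100001": {"mean_temp": 21.2, "max_precip": 768},
    "G17003100002": {"mean_temp": 23.4, "max_precip": 589},
    "G17003100003": {"mean_temp": 24.3, "max_precip": 867},
}

# Hypothetical household microdata: (county ID, tenure, urban/rural status).
households = [
    ("G17003100001", "rent", "rural"),
    ("G17003100001", "own", "urban"),
    ("G17003100001", "own", "urban"),
    ("G17003100002", "rent", "urban"),
]


def summarize_by_area(households, environment):
    """Tabulate households by tenure and urban/rural status within each
    county, then attach the counts to that county's environmental summary."""
    counts = Counter(households)  # missing combinations count as 0
    rows = {}
    for county, env in environment.items():
        row = dict(env)
        for tenure in ("rent", "own"):
            for place in ("rural", "urban"):
                row[f"{tenure}_{place}"] = counts[(county, tenure, place)]
        rows[county] = row
    return rows


area_table = summarize_by_area(households, environment)
```

The shared county ID is the only link between the two sources, which is why consistent administrative boundaries matter so much for this kind of integration.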
69
Location-Based Integration
Rasters of population and environment data
Microdata, Area-level data, Rasters
70
Rasterization of Area-Level Data
71
Area-Level Summary of Raster Data
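The two conversions named on these slides, area-level data to a raster and a raster to area-level summaries, can be illustrated on a toy grid. The sketch assumes each grid cell has already been assigned to exactly one district. The grid, the values, and the function names `rasterize` and `summarize_raster` are hypothetical and do not describe TerraPop's implementation.

```python
# district_grid[r][c] is the ID of the district covering that grid cell.
district_grid = [
    ["A", "A", "B"],
    ["A", "B", "B"],
]

# A raster on the same grid, e.g. temperature per cell (made-up values).
temperature = [
    [20.0, 22.0, 25.0],
    [21.0, 24.0, 26.0],
]


def rasterize(district_grid, area_values):
    """Area-level -> raster: give every cell the value of its district."""
    return [[area_values[d] for d in row] for row in district_grid]


def summarize_raster(district_grid, raster):
    """Raster -> area-level: mean of the cell values within each district."""
    totals, counts = {}, {}
    for district_row, value_row in zip(district_grid, raster):
        for district, value in zip(district_row, value_row):
            totals[district] = totals.get(district, 0.0) + value
            counts[district] = counts.get(district, 0) + 1
    return {d: totals[d] / counts[d] for d in totals}


literacy_raster = rasterize(district_grid, {"A": 0.8, "B": 0.6})
mean_temp_by_district = summarize_raster(district_grid, temperature)
```

Copying a district's value into each cell suits rates and averages. Counts such as population totals would instead need to be divided among a district's cells, which this sketch does not attempt.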
72
Boundaries are Key
Linkages across data formats rely on administrative unit boundaries
Particular needs:
  Lower level boundaries
  Historical boundaries
73
Geographic Harmonization
76
Beta Version
Web interface will change significantly in fall 2014
Fast microdata tabulator needed
77
IPUMS-International
78
Census microdata from around the world
Funded by NSF and NIH
Motivation:
  Provide data access
  Preservation
79
Khartoum, CBS-Sudan
80
Dhaka, Bangladesh Bureau of Statistics
81
IPUMS-International
Participating
Disseminating
82
IPUMS Censuses Per Country
84
Variables Included in Extracts
85
Top Institutional Users
86
Millennium Development Goals
Ratio of literate women to men, 15-24 years old
1990 Census round
Source: Cuesta and Lovatón (2014)
87
Millennium Development Goals
Colombia: Adolescent Birth Rate
Census 1993, Census 2005
Source: Cuesta and Lovatón (2014)
Data Source: IPUMS-International, Minnesota Population Center
88
IPUMSI Future
Data acquisition
Outreach: developing countries
Virtual data enclave
89
Thank you! sobek@umn.edu