The Minnesota Data Harmonization Projects Bill & Melinda Gates Foundation Seattle, Washington May 21, 2014 Elizabeth Boyle, Miriam King, Matthew Sobek.

1 The Minnesota Data Harmonization Projects Bill & Melinda Gates Foundation Seattle, Washington May 21, 2014 Elizabeth Boyle, Miriam King, Matthew Sobek Minnesota Population Center, University of Minnesota sobek@umn.edu

2

3 Integrated Public Use Microdata Series

4 Minnesota Population Center
- We build data infrastructure for the research community. We specialize in data harmonization.
- World’s largest collection of individual population and health data, across 9 projects.
- 50,000 registered users from over 100 countries.
- Free

5 MPC Data Dissemination, 1993-2012 Gigabytes per week

6 MPC Data Projects

7 The Problem
1. Combining data from multiple sources is time consuming
   - Discovery
   - Data management
2. It’s error prone
   - Recoding data
   - Overlooking documentation
3. Hard to replicate results
4. Discourages comparative research

8 Outline
- Harmonization methods
- Dissemination system
- International projects
   - Integrated DHS
   - Terra Populus
   - IPUMS-International

9 Terminology
Harmonization: Combining datasets collected at different times or places into a single, consistent data series. Also called “integration.”
Metadata: Data about data. Documentation in the broadest sense.

10 Microdata (example variables: relation to head, marital status, education, occupation)

11 Summary Data

12 Harmonization Methods
- Metadata
- Data
- Dissemination

13 Systematize Metadata (record layout file, pdf)

14 MPC Data Dictionary

15 Convert Questionnaires to Metadata (Mexico 2000; highlighted item: water access)

16 Metadata: Questionnaire Text

17 XML-Tagged Questionnaire Text (tagged items: water access, bedrooms, rooms)

18 Data: Variable Harmonization
Marital Status: IPUMS-International
Bangladesh 2011: 1 = Unmarried; 2 = Married; 3 = Widowed; 4 = Divorced/separated
Mexico 1970: 1 = Married, civil and religious; 2 = Married, civil; 3 = Married, religious; 4 = Consensual union; 5 = Widowed; 6 = Divorced; 7 = Separated; 8 = Single
Kenya 1999: 1 = Never married; 2 = Monogamous; 3 = Polygamous; 4 = Widowed; 5 = Divorced; 6 = Separated

19 Translation Table: Input
Bangladesh 2011: 1 = Unmarried; 2 = Married; 3 = Widowed; 4 = Divorced or separated
Mexico 1970: 1 = Married, civil and religious; 2 = Married, civil; 3 = Married, religious; 4 = Consensual union; 5 = Widowed; 6 = Divorced; 7 = Separated; 8 = Single
Kenya 1999: 1 = Never married; 2 = Monogamous; 3 = Polygamous; 4 = Widowed; 5 = Divorced; 6 = Separated

20 Translation Table: Input and Harmonized
Input columns: Bangladesh 2011, Mexico 1970, Kenya 1999 (codes as listed on slide 19)
Harmonized columns (Code, Label):
100  Single
200  Married or in union
210    Married, formally
211      Civil
212      Religious
213      Civil and religious
214      Monogamous
215      Polygamous
220    Consensual union
300  Divorced or separated
310    Separated
320    Divorced
400  Widowed

21 Translation Table: Input and Harmonized (same table as slide 20)
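
The translation table on slides 19 to 21 can be read as a lookup from each sample's input code to one harmonized code. The following is a minimal Python sketch of that idea, not the MPC's actual software. The pairings of input codes to harmonized codes are assumptions made by matching labels, the harmonized codes 300 and 400 are inferred from the slide layout, and the names `TRANSLATION`, `harmonize`, and `general_code` are hypothetical.

```python
# Hypothetical translation table: (sample, input code) -> harmonized code.
# Input codes follow slides 18-19; harmonized codes follow slide 20.
# The pairings are assumptions made by matching labels.
TRANSLATION = {
    ("Bangladesh 2011", 1): 100,  # Unmarried -> Single
    ("Bangladesh 2011", 2): 200,  # Married -> Married or in union
    ("Bangladesh 2011", 3): 400,  # Widowed
    ("Bangladesh 2011", 4): 300,  # Divorced or separated
    ("Mexico 1970", 1): 213,      # Married, civil and religious
    ("Mexico 1970", 2): 211,      # Married, civil
    ("Mexico 1970", 3): 212,      # Married, religious
    ("Mexico 1970", 4): 220,      # Consensual union
    ("Mexico 1970", 5): 400,      # Widowed
    ("Mexico 1970", 6): 320,      # Divorced
    ("Mexico 1970", 7): 310,      # Separated
    ("Mexico 1970", 8): 100,      # Single
    ("Kenya 1999", 1): 100,       # Never married -> Single
    ("Kenya 1999", 2): 214,       # Monogamous
    ("Kenya 1999", 3): 215,       # Polygamous
    ("Kenya 1999", 4): 400,       # Widowed
    ("Kenya 1999", 5): 320,       # Divorced
    ("Kenya 1999", 6): 310,       # Separated
}


def harmonize(sample: str, input_code: int) -> int:
    """Return the harmonized code for one input code of one sample."""
    return TRANSLATION[(sample, input_code)]


def general_code(harmonized_code: int) -> int:
    """Collapse a detailed three-digit code to its first digit."""
    return harmonized_code // 100


if __name__ == "__main__":
    detailed = harmonize("Kenya 1999", 3)
    print(detailed, general_code(detailed))
```

Under these assumptions, the first digit gives a category that is comparable across all three samples, while the trailing digits keep detail that only some samples record, such as Kenya's monogamous and polygamous distinction.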

22 Data Dissemination System

23

24 Variables Page

25 238 censuses

26 Sample Filtering

27 Variables Page – Filtered

28 Variable Page: Marital Status

29 Variable Codes (Marital status)

30 Variable Codes (Marital status)

31 Variable Codes (Marital status)

32 Variable Page: Marital Status

33 Variable Comparability Discussion (Marital status)

34 Variable Page: Documentation

35 Questionnaire Text

36 (Marital status, Cambodia)

37 Variables Page

38 Extract Summary

39 Case Selection

40 Attached Characteristics
- Age of spouse
- Employment status of father
- Occupation of father
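
An attached characteristic such as age of spouse copies a variable from one household member onto another member's record. The following is a minimal Python sketch, assuming each person record carries a pointer to the spouse's person number within the household. The field names `pernum`, `spouse_pernum`, and `age_spouse`, and the function `attach_spouse_age`, are hypothetical and are not taken from the slides.

```python
# Hypothetical person records for one household.
# "spouse_pernum" is an assumed pointer to the spouse's person number.
household = [
    {"pernum": 1, "age": 45, "spouse_pernum": 2},
    {"pernum": 2, "age": 41, "spouse_pernum": 1},
    {"pernum": 3, "age": 12, "spouse_pernum": None},
]


def attach_spouse_age(persons):
    """Return copies of the records with an added "age_spouse" field."""
    by_pernum = {p["pernum"]: p for p in persons}
    attached = []
    for person in persons:
        # A missing pointer (None) finds no record, so age_spouse stays None.
        spouse = by_pernum.get(person["spouse_pernum"])
        age_spouse = spouse["age"] if spouse is not None else None
        attached.append({**person, "age_spouse": age_spouse})
    return attached


if __name__ == "__main__":
    for record in attach_spouse_age(household):
        print(record["pernum"], record["age_spouse"])
```

The same pattern would apply to the father's employment status or occupation, given a pointer to the father's record.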

41 Extract Summary

42 Download or Revise Extract

43 On-line Analysis

44 The International Projects

45 Integrated DHS

46 Demographic and Health Surveys
- Foremost source of health information for the developing world
- Funded by USAID
- Since 1980s, over 300 surveys, 90 countries
- Topics: fertility, nutrition, HIV, malaria, maternal and child health, etc.

47 IDHS Project
- 5-year NIH grant (end of year 2)
- Focus on Africa, with India
- Partnership with ICF-International and USAID

48 Why an Integrated DHS?
Motivation: DHS is incredibly valuable, but it’s hard to capitalize on its full potential.
Problem:
- Data discovery
- Dispersed documentation
- Data management
- Variable changes over time
Not unique to DHS: endemic to any survey that’s persisted over decades.

49 DHS Research Process
Example: Find data on female genital cutting (Survey Search Tool)

50

51

52 Just the woman file, for one survey. 61 to go.
- Recode notes
- Data dictionary
Still need:
- Report (377-page pdf), which contains questionnaire and sample design information
- Errata file

53 At least the DHS variables are already harmonized, right?
DHS “Recode Variables” make it more harmonized than most surveys:
- Consistent variable names
- Each DHS phase has a shared model questionnaire
But:
- 6 phases over 25+ years
- Country control over final wording of surveys
- Country-specific variables
- The recode variables can be a two-edged sword

54 Harmonization: Religion (variable V130 in Ghana 1993, Ghana 2008, India 1992, India 2005)

55 Harmonization: Female Circumcision Ever Circumcised

56 Timeline: 2014 (current)
- 9 countries, 39 samples
- Much of woman files
- Women of childbearing age as unit of analysis

57 Timeline: 2015
- 15 countries, 69 samples
- Complete the woman files
- Children & birth files

58 Timeline: 2017
- 21 countries, 94 samples
- Men and couples files

59 Timeline: Next grant
- 41 African countries, 130+ samples
- 11 Asian countries, 32+ samples

60 Beta

61 Terra Populus Goal
Lower barriers to conducting research on population and the environment.
Motivation: The data from different domains have incompatible formats, and few researchers have the skills to combine them.

62 TerraPop
- 5-year NSF grant
- At mid-point: year 3

63 Population Microdata
6 countries:
- Argentina
- Brazil
- Malawi
- Spain
- United States
- Vietnam

64 Area-level Data
Tabulations of census data for administrative units

65 Environmental Data: Rasters (Grid Cells)
- Land cover from satellite images (Global Land Cover 2000)
- Agricultural use from satellites and government records (Global Landscapes Initiative)
- Climate from weather stations (WorldClim)

66 Location-Based Integration
Data structures: microdata, area-level data, rasters
- Mix and match variables originating in any of the data structures
- Obtain output in the data structure most useful to you

67 Location-Based Integration
Data structures: microdata, area-level data, rasters
Individuals and households with their environmental and social context

68 Location-Based Integration
Data structures: microdata, area-level data, rasters
Summarized environmental and population characteristics for administrative districts
County ID      Mean Ann. Temp.  Max. Ann. Precip.  Rent, Rural  Rent, Urban  Own, Rural  Own, Urban
G17003100001   21.2             768                3129         1063         637         365
G17003100002   23.4             589                2949         1075         1469        717
G17003100003   24.3             867                3418         1589         1108        617
G17003100004   21.5             943                1882         425          202         142
G17003100005   24.1             867                2416         572          426         197
G17003100006   24.4             697                2560         934          950         563
G17003100007   25.6             701                2126         653          321         215

69 Location-Based Integration
Data structures: microdata, area-level data, rasters
Rasters of population and environment data

70 Rasterization of Area-Level Data
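
One simple way to picture rasterization of area-level data: given a grid that records which administrative unit each cell falls in, every cell takes the value of its unit. The following is a minimal Python sketch under that assumption. It is not TerraPop's method, and the unit identifiers, the values, and the function name `rasterize` are hypothetical.

```python
# Hypothetical zone grid: each cell holds the ID of the administrative
# unit it falls in. None marks a cell outside every unit.
zone_grid = [
    ["A", "A", "B"],
    ["A", "B", None],
]

# Hypothetical area-level variable, one value per administrative unit.
area_values = {"A": 21.2, "B": 23.4}


def rasterize(zones, values):
    """Give every cell the area-level value of the unit it belongs to."""
    return [[values.get(zone) for zone in row] for row in zones]


if __name__ == "__main__":
    for row in rasterize(zone_grid, area_values):
        print(row)
```

Copying the unit's value into each cell suits rates and averages. A count variable would instead need to be allocated across the unit's cells, which this sketch does not attempt.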

71 Area-Level Summary of Raster Data
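
An area-level summary of raster data reduces the grid cells inside each administrative unit to one statistic, for example a mean. The following is a minimal Python sketch that assumes a value grid and a zone grid of the same shape. The data and the function name `summarize_by_zone` are hypothetical, and a real system would work from boundary polygons rather than a pre-built zone grid.

```python
from collections import defaultdict

# Hypothetical raster of a climate variable, one value per grid cell.
value_grid = [
    [20.0, 22.0, 30.0],
    [24.0, 26.0, 28.0],
]

# Hypothetical zone grid of the same shape: the unit each cell falls in.
zone_grid = [
    ["A", "A", "B"],
    ["A", "B", "B"],
]


def summarize_by_zone(values, zones):
    """Return the mean cell value for each administrative unit."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for value_row, zone_row in zip(values, zones):
        for value, zone in zip(value_row, zone_row):
            if value is None or zone is None:
                continue  # skip missing cells and cells outside all units
            totals[zone] += value
            counts[zone] += 1
    return {zone: totals[zone] / counts[zone] for zone in totals}


if __name__ == "__main__":
    print(summarize_by_zone(value_grid, zone_grid))
```

The result has one row per unit, so it can sit beside census tabulations for the same units, as in the county table on slide 68.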

72 Boundaries are Key
Linkages across data formats rely on administrative unit boundaries
Particular needs:
- Lower level boundaries
- Historical boundaries

73 Geographic Harmonization

74

75

76 Beta Version
- Web interface will change significantly in fall 2014
- Fast microdata tabulator needed

77 IPUMS-International

78 Census microdata from around the world
Funded by NSF and NIH
Motivation:
- Provide data access
- Preservation

79 Khartoum, CBS-Sudan

80 Dhaka, Bangladesh Bureau of Statistics

81 IPUMS-International Participating Disseminating

82 IPUMS Censuses Per Country

83

84 Variables Included in Extracts

85 Top Institutional Users

86 Millennium Development Goals
Ratio of literate women to men, 15-24 years old (1990 census round)
Source: Cuesta and Lovatón (2014)

87 Millennium Development Goals
Colombia: Adolescent Birth Rate (Census 1993, Census 2005)
Source: Cuesta and Lovatón (2014)
Data Source: IPUMS-International, Minnesota Population Center

88 IPUMSI Future
- Data acquisition
- Outreach: developing countries
- Virtual data enclave

89 Thank you! sobek@umn.edu
