IPUMS-International Integration Process

IPUMS-International Integration Process Matt Sobek Minnesota Population Center sobek@pop.umn.edu

DATA
1. Input material: data files, batch samples
2. Pre-processing: reformat data, donation, draw sample, Confidentiality A
3. Standardization: code clean-up, verify data, Confidentiality B
4. Integration: harmonize codes, variable programming, constructed variables

METADATA
1. Input material: data dictionary, enumeration forms, enumeration instructions, sample information
2. Pre-processing: translate to English, images to editable files
3. Standardization: IPUMS data dictionary, tag enumeration text, document unharmonized variables
4. Integration: variable descriptions, sample design

Batch Samples
In spring we identify the samples to integrate the following year. Samples are processed as a group, one batch per year. The entire batch is processed through each stage before we proceed to the next. There is little flexibility in the work process: if a sample is not available during the earliest stages of integration, it cannot be included in that year's data release.

Original Input Data
Some examples of differing file formats:
- SPSS and SAS system files
- Redatam format
- IMPS format
- Records that combine household and person characteristics
- Separate files for persons and households (and dwellings, buildings)
- Different types of records (mortality or migration)
- Separate files for different administrative units

Reformatting: Original Data File

Reformatting: Data File after Reformatting

Reformatting: Rectangular Sample (Brazil 1980)
Person records only; household data duplicated on person records.
[Diagram: every person record (head, spouse, child) carries its household's geography and housing fields.]

Reformatting: Dwelling-Household-Person Sample (Chile 1992)
Separate dwelling and household records.
[Diagram: each dwelling record is followed by one or more household records, and each household record by its person records (head, spouse, child).]

Reformatting: Merge Household and Person Files (Brazil 2000)
[Diagram: a household file (one geography-and-housing record per serial number: 001, 002, 003) and a person file (head, spouse and child records carrying the same serial numbers) are merged so that each household record is followed by its own person records.]
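A minimal sketch of the merge step, assuming both files share a household serial number. The field names (`serial`, `geog`, `relate`) and the sample records are hypothetical, not the actual IPUMS or Brazilian file layout.

```python
from collections import defaultdict

def merge_household_person(households, persons):
    """Interleave each household record with its person records, matched by serial."""
    persons_by_serial = defaultdict(list)
    for p in persons:
        persons_by_serial[p["serial"]].append(p)

    merged, unmatched_households = [], []
    for h in sorted(households, key=lambda r: r["serial"]):
        members = persons_by_serial.pop(h["serial"], [])
        if not members:
            unmatched_households.append(h["serial"])
        merged.append(("H", h))
        merged.extend(("P", p) for p in members)

    # Serials still left in the lookup have person records but no household record.
    unmatched_persons = sorted(persons_by_serial)
    return merged, unmatched_households, unmatched_persons

households = [
    {"serial": 1, "geog": "A"},
    {"serial": 2, "geog": "A"},
    {"serial": 3, "geog": "B"},
]
persons = [
    {"serial": 1, "relate": "head"}, {"serial": 1, "relate": "spouse"},
    {"serial": 2, "relate": "head"}, {"serial": 2, "relate": "child"},
    {"serial": 3, "relate": "head"},
]
merged, no_persons, no_household = merge_household_person(households, persons)
record_types = "".join(rectype for rectype, _ in merged)
```

The two "unmatched" lists are the kind of structural problem that a later error-correction step would need to resolve.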

Reformatting: Persons not Organized in Households (Mexico 1960)
Individuals only; not organized in households.
[Diagram: each input record holds geography, person and housing data; after reformatting, every person record is preceded by its own household record.]

Donation and Error Correction
Data are tested for errors that affect structural integrity, such as merged households, unmatched person and household records, and corrupted records. Such errors often do not affect tabulations, but they create inconsistencies across records within households that affect more sophisticated analyses. Some problems can be resolved with custom programming. Others are resolved by donation: substituting a donor household for the corrupted one. Households are divided into strata based on predictor variables, and donors are drawn from the same stratum as the corrupted household so that they share key characteristics. If the sample is drawn from the full census, a substitute donor record is used; if we are already starting with a sample, the donor record is duplicated. A flag indicates that a record was duplicated.
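The donation step could be sketched as follows for the case where the input is already a sample, so the donor is duplicated. The predictor variables (`region`, `size`), the `corrupt` marker and the `donated` flag are hypothetical names, and the sketch assumes the predictors are still readable on the corrupted household.

```python
import random
from collections import defaultdict

def donate(households, is_corrupt, stratum_of, rng):
    """Replace each corrupted household with a flagged copy of a clean donor from its stratum."""
    donors = defaultdict(list)
    for h in households:
        if not is_corrupt(h):
            donors[stratum_of(h)].append(h)

    result = []
    for h in households:
        if not is_corrupt(h):
            result.append(dict(h, donated=False))
            continue
        pool = donors.get(stratum_of(h))
        if not pool:
            raise ValueError(f"no donor available for household {h['serial']}")
        donor = rng.choice(pool)
        # Keep the corrupted household's serial, copy the donor's content, set the flag.
        result.append(dict(donor, serial=h["serial"], donated=True))
    return result

households = [
    {"serial": 1, "region": "north", "size": 3, "corrupt": False},
    {"serial": 2, "region": "north", "size": 3, "corrupt": True},
    {"serial": 3, "region": "south", "size": 5, "corrupt": False},
]
fixed = donate(
    households,
    is_corrupt=lambda h: h["corrupt"],
    stratum_of=lambda h: (h["region"], h["size"]),
    rng=random.Random(0),
)
```

Passing the random generator in explicitly keeps the choice of donor reproducible from one run to the next.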

Drawing a Sample
About one-third of IPUMS samples are drawn from full-count data. After reformatting, we draw a systematic sample of every Nth dwelling to yield the desired sample density, typically 10%. If the input data are not full-count (for example, they include only the long-form records), the sample design might have to account for differing sample densities between areas. Very large dwelling units (over 30 persons) are sampled at the individual level, not as intact units, in order to reduce sampling error: every Nth individual is taken.
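A minimal sketch of the systematic draw, assuming a random start within the first interval (the slide does not say how the start is chosen). The record layout with a `persons` list is hypothetical.

```python
import random

def systematic_sample(dwellings, interval, rng, large_threshold=30):
    """Take every `interval`-th dwelling; dwellings above the threshold are sampled by person."""
    small = [d for d in dwellings if len(d["persons"]) <= large_threshold]
    large_persons = [
        p for d in dwellings if len(d["persons"]) > large_threshold for p in d["persons"]
    ]
    start_dwelling = rng.randrange(interval)  # random start in 0..interval-1
    start_person = rng.randrange(interval)
    return small[start_dwelling::interval], large_persons[start_person::interval]

# 100 ordinary dwellings of 4 persons each, plus one 50-person dwelling.
dwellings = [{"id": i, "persons": list(range(4))} for i in range(100)]
dwellings.append({"id": 100, "persons": list(range(50))})

sampled_dwellings, sampled_persons = systematic_sample(dwellings, interval=10, rng=random.Random(42))
```

With an interval of 10 the result is a 10% sample: 10 of the 100 ordinary dwellings and 5 of the 50 residents of the large one.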

Confidentiality Measures: A
- Swap a small percentage of cases between geographic areas.
- Reorder households within geographic areas.
- Suppress low-level geographic variables.
- Suppress any variable deemed too sensitive by the National Statistical Office.
- Encrypt all versions of the data prior to the imposition of these confidentiality measures.
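Two of these measures, reordering and suppression, can be sketched as below. This is an illustration only: the field names (`area`, `village`) are hypothetical, and the sketch assumes the file is already sorted by geographic area.

```python
import random
from itertools import groupby

def suppress(households, variables):
    """Drop the named variables from every record."""
    return [{k: v for k, v in h.items() if k not in variables} for h in households]

def reorder_within_areas(households, rng):
    """Shuffle household order inside each area while keeping the areas in place."""
    result = []
    for _, group in groupby(households, key=lambda h: h["area"]):
        block = list(group)
        rng.shuffle(block)
        result.extend(block)
    return result

households = [
    {"serial": i, "area": "A" if i < 5 else "B", "village": f"v{i}"} for i in range(10)
]
protected = reorder_within_areas(suppress(households, {"village"}), random.Random(1))
```

Shuffling within areas removes any information carried by the original file order, such as the sequence in which neighboring households were enumerated.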

Code Clean-Up: Recoding Unharmonized Variables
Recode the input variables to conform to some basic standards for the treatment of missing values and similar conventions. Recode stray values into a consolidated missing category as appropriate. Convert non-numeric characters to numeric. Most recoding is performed using a data translation matrix, like the one for Marital Status in Costa Rica 1984. If the recoding requires more complex logic, use custom programming.
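A translation table of this kind amounts to a lookup with a fallback. The sketch below uses made-up codes, not the actual Costa Rica 1984 values, and a hypothetical missing code of 9.

```python
MISSING = 9  # hypothetical consolidated missing code

# Hypothetical translation table: raw input code -> cleaned numeric code.
MARST_TABLE = {"1": 1, "2": 2, "3": 3, "4": 4, "S": 1, "M": 2}

def recode(value, table, missing=MISSING):
    """Look up a raw value; blank or stray values fall into the missing code."""
    return table.get(str(value).strip(), missing)

raw = ["1", " 2", "M", "", "7", 3]
cleaned = [recode(v, MARST_TABLE) for v in raw]
```

Keeping the mapping in a table rather than in code means most variables can be cleaned by editing data, with custom programming reserved for the cases a lookup cannot express.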

Verify Data: Unharmonized Variables
Examine the marginal frequencies of every input variable. Analyze the data universe for each variable: the population at risk of having a response. Determine the theoretical universe from enumeration materials or other documentation, then empirically determine any discrepancies from that universe. Document the universe for each variable and any other observations.
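The universe check can be sketched as a frequency count plus two discrepancy counts. The variable names, the "not in universe" code of 0 and the age-12 rule are assumptions made for the example.

```python
from collections import Counter

NIU = 0  # hypothetical "not in universe" code

def check_universe(records, variable, in_universe, niu=NIU):
    """Tabulate a variable and count records that contradict its expected universe."""
    freq = Counter(r[variable] for r in records)
    outside_with_value = sum(
        1 for r in records if not in_universe(r) and r[variable] != niu
    )
    inside_without_value = sum(
        1 for r in records if in_universe(r) and r[variable] == niu
    )
    return freq, outside_with_value, inside_without_value

records = [
    {"age": 30, "marst": 2},
    {"age": 8, "marst": 0},
    {"age": 10, "marst": 1},  # response from a person outside the assumed universe
    {"age": 45, "marst": 0},  # no response from a person inside it
]
# Assumed theoretical universe: marital status asked of persons aged 12 or older.
freq, outside, inside = check_universe(records, "marst", lambda r: r["age"] >= 12)
```

Non-zero discrepancy counts suggest either that the documented universe is wrong or that the data contain errors, and both are worth recording in the variable's documentation.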

Confidentiality Measures: B
Recode geographic units to ensure small localities cannot be identified (typically those with fewer than 20,000 persons). For recent censuses: Identify cells that represent very small numbers of persons in the population. Code them to a residual category or combine them. Top- or bottom-code continuous variables that have a long tail that could identify small subpopulations. Suppress specific categories of variables as requested by the National Statistical Office.
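A rough sketch of two of these measures follows. It assumes each record carries a `weight` giving the number of persons it represents, and it sends small localities to a single residual code; the slide does not specify how units are actually combined. All names and values are hypothetical.

```python
from collections import Counter

def combine_small_localities(records, min_population=20000, residual="other"):
    """Recode localities below the population threshold to one residual code."""
    population = Counter()
    for r in records:
        population[r["locality"]] += r["weight"]
    return [
        dict(r, locality=r["locality"] if population[r["locality"]] >= min_population else residual)
        for r in records
    ]

def top_code(value, ceiling):
    """Cap a continuous value at the ceiling."""
    return min(value, ceiling)

records = (
    [{"locality": "X", "weight": 10, "income": 250000}] * 2500  # represents 25,000 persons
    + [{"locality": "Y", "weight": 10, "income": 1200}] * 100   # represents 1,000 persons
)
combined = combine_small_localities(records)
protected = [dict(r, income=top_code(r["income"], 99999)) for r in combined]
```

Bottom-coding is the mirror image, using `max` with a floor value.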

Harmonize Codes: Translation Matrix for Marital Status
[Table: harmonized marital status codes with one column of input codes per sample: China 1982, Colombia 1973, Kenya 1989, Mexico 1970, U.S.A. 1990.]

Variable Programming
Some variable manipulations are too complex to be handled using the translation matrix tables. Typically these involve continuous variables or recoding logic that refers to multiple variables. This programming is written in C++.

Constructed "Pointer" Variables: Simple Household (Colombia 1985)
[Table: six persons listed by Pernum with Relate, Age, Sex, Marst and Chborn: a head (46, male, married), a spouse (44, female), an aunt (77, widow, 7 children born) and persons 4 to 6 aged 15, 13 and 11. Spouse's location: person 1 points to 2 and person 2 points to 1. Mother's and father's locations: persons 4 to 6 each point to mother 2 and father 1.]

Constructed "Pointer" Variables: Complex Household (Colombia 1985)
[Table: an eleven-person household headed by a separated 53-year-old woman with 6 children born, including children, a child-in-law, a grandchild and a non-relative, with the same columns (Pernum, Relationship, Age, Sex, Marst, Chborn) and the constructed spouse's, mother's and father's locations.]
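For the simple household, the pointer logic can be sketched from relationship to head and sex alone. This is a deliberate simplification and not the IPUMS algorithm: it assumes the head's spouse is also the parent of every record coded "child", which fails in complex households like the second example. The output names `sploc`, `momloc` and `poploc` are chosen for the sketch, and 0 is used here to mean "no such person in the household".

```python
def add_pointers(household):
    """Attach spouse, mother and father locations using relationship to head and sex."""
    head = next((p for p in household if p["relate"] == "head"), None)
    spouse = next((p for p in household if p["relate"] == "spouse"), None)

    for p in household:
        p.update(sploc=0, momloc=0, poploc=0)

    if head and spouse:
        head["sploc"], spouse["sploc"] = spouse["pernum"], head["pernum"]

    for p in household:
        if p["relate"] != "child":
            continue
        for parent in (head, spouse):
            if parent is None:
                continue
            key = "momloc" if parent["sex"] == "female" else "poploc"
            p[key] = parent["pernum"]
    return household

# Loosely modeled on the simple household; only the fields the sketch needs are given.
household = add_pointers([
    {"pernum": 1, "relate": "head", "sex": "male"},
    {"pernum": 2, "relate": "spouse", "sex": "female"},
    {"pernum": 3, "relate": "aunt"},
    {"pernum": 4, "relate": "child"},
    {"pernum": 5, "relate": "child"},
    {"pernum": 6, "relate": "child"},
])
```

Each pointer holds the Pernum of the related person, so an analyst can attach a spouse's or parent's characteristics to a person's record with a lookup inside the household.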

Original Data Dictionary – Kenya 1989

Original Data Dictionary – Romania 1992

Original Data Dictionary – China 1982

Original Data Dictionary – Mexico 1990

Enumeration Form: Original File

Enumeration Instructions: Original File (Mexico 1990)

Sample Information (from Statistical Office)
Sample information is difficult for the IPUMS project to collect. Often only limited information can be gleaned from available documentation. It is extremely helpful when countries collate the information themselves, as the Netherlands did.

Translate Documents to English
Many countries provide their census documentation in English. For those that do not, the IPUMS project hires translators from around the world. Often these are persons currently or formerly associated with National Statistical Offices. Some common languages are translated by staff in Minnesota.

Editable Enumeration Form – In English

IPUMS Data Dictionary

XML-Tagged Enumeration Form

Document Unharmonized Variables
The enumeration form and instruction text provide most of the documentation for the unharmonized input variables. Other documentation is written as needed to clarify the interpretation of a variable for users. We also empirically determine the universe of persons or households with valid values for each variable.

Variable Description (Literacy)

Sample Design