The purpose of this talk: “value added by 3rd parties”
1. Encourage National Statistical Offices to entrust census microdata samples to the IPUMS-International project: 2010 round samples
2. Describe some of the value that IPUMS-International adds to integrated microdata and metadata
» Free access to the microdata for bona fide researchers
» Extensive analysis of data quality before the samples are released
» Integrated metadata (compare questions in 1, 2, … many censuses)
» Integrated, pooled microdata (multiple censuses, countries)
3. Encourage usage of integrated samples by African researchers
» Usage is relatively low, but increasing quickly as more samples become available
Introduction
s: dissemination of census microdata began
: IPUMS-International; 2009, 83 countries:
a. Preserve census microdata and documentation world-wide
b. Integrate microdata and metadata
c. Disseminate to researchers world-wide, without cost; 2009: 130 samples, 43 countries, 279 million person records
Sustained, major funding:
» National Science Foundation (USA): renewed through 2014
» National Institutes of Health (USA): eager to fund IPUMS-Africa
3. Asia-Pacific region: Vietnam 2001, China 2002 … 2009: 12 countries
p. 55: “…there is a real risk that NSOs may become marginalized if they are unable to meet the requirements of researchers in this increasingly important area of data provision [census microdata].”
Advantages of IPUMS for Ireland
Testimonial (not in the paper) from Deirdre Cullen, Senior Statistician, Central Statistics Office-Ireland:
» Bonus for CSO: as a result of this project, our historic data sets are now in a much more usable format
» IPUMS allows a mix of Census years to be available in 1 file
» Comparability with other countries
» Ease of access for users
» Positive publicity for Census in Ireland
Outline
Introduction
» When NSOs disseminate microdata, the task is costly, risky and often unsatisfactory
» IPUMS+AICMD partnership offers a solution for African countries
» Invitation to participate: entrust microdata for 2010 and earlier censuses without undue delay
IPUMS+AICMD adds value to population microdata:
1. Statistical confidentiality and security: disclosure controls, restricted access
2. Integration: census microdata and metadata
3. Dissemination: custom tailored extracts by country(ies), census(es), populations, variables, sample density, metadata
4. Ethics: statistical transparency, academic freedom, responsible use, sharing of results
Reflections
Why Statistical Offices entrust Responsibility of Disseminating Census Microdata to IPUMS-International
» NSO dissemination is costly, risky and often unsatisfactory
» Costly: scarce human resources to prepare the sample, assure statistical confidentiality, and manage access for relatively few users (however important they may be!)
» Risky: little experience in anonymizing and managing access to microdata, yet great responsibility
» US Census Bureau anonymization protocol egregiously corrupted ages for the elderly in ACS microdata; it took 5 years to discover the error!
» Unsatisfactory: excessive anonymization, slow to provide access. Troublesome for NSO statisticians who do not wish to risk their jobs for some academic. Most deny access to all but the most persistent, influential would-be users.
Complaints (of a large European NSO):
» “I haven't used the [microdata]; the bureaucracy was just too slow to get much use out of it.”
» “[Access] is unbelievably bureaucratic and difficult – this discourages people from using it. It took me 6 months to get the data.”
IPUMS-International assumes responsibilities and risks for integrating & disseminating microdata and metadata
» Uniform Memorandum of Understanding with each NSO
» Founding partners (2001): Kenya, South Africa, Ghana, Egypt, France, Spain, China, Vietnam, Colombia, Mexico, USA … now almost 100 countries
» Specific conditions of access: ownership of data (NSO), use, access, restrictions, confidentiality, security, publication, violations, sharing, jurisdiction, and precedence
» Almost 100 countries entrust census microdata to IPUMS-I
» 6 most populous countries NOT entrusting census microdata to IPUMS: India, *Nigeria, Russian Federation, Japan, Algeria, *Korea (RO; may join at the UNSC in New York)
» * = negotiating
» No data: Congo (DR), Myanmar, Afghanistan, Uzbekistan, Somalia
90+ National Statistics Offices have endorsed the IPUMS-International Memorandum of Understanding
IPUMS Milestones
» 1995: IPUMS-USA first release of integrated microdata
» 1999: IPUMS-International funded by NSF & NIH
» 2002: 1st International launch: 7 countries, 25 samples
» 2007 launch (56th ISI):
» 2009 launch (57th ISI):
  » ~279 million person records
  » ~3,000 registered users
» 2011 launch (58th ISI):
  » 397 million person records
  » 5,000 registered users
» 2013 (ISI Hong Kong!): ~70 ~225
  » ~500 million person records
  » ~7,000 registered users
Cartogram of IPUMS+AICMD partners, weighted by population (dark green = integrated and disseminating)
Open Invitation to Cooperate, Entrust and Access
The IPUMS-International team (includes National Science Foundation Board) (Not present: some computer gurus, researchers, research assistants, civil service employees, and others who were not at the NSF Board meeting) Steven Ruggles, inventor of IPUMS, Professor of History, and Director of the Minnesota Population Center
I. Statistical Confidentiality and Security
See pp. 3-5: 2012: “IPUMS and AICMD Add Significant Value to African Census Microdata,” ASSD VII, Cape Town, South Africa, January
1. Statistical Confidentiality and Microdata Security
2. Statistical disclosure control protections
3. Restricted access
1. Statistical Confidentiality and security
Diagram: NSI 1 … 62+ NSI, to MPC, to IPUMS-International, to trusted researchers
» NSI entrusts census metadata and anonymized microdata to MPC
» MPC integrates metadata and confidentializes microdata samples
» IPUMS-International manages access and entrusts researchers with custom-tailored SAS, STATA, and SPSS metadata and microdata extracts for any combination of countries, censuses, sub-populations, and variables
» Trusted researcher receives customized extracts
On-site evaluation by Dennis Trewin (former Australian Statistician; chair, Conference of European Statisticians Task Force on Microdata and Confidentiality):
» “...the best practice for an international repository of microdata”
» “The security of IPUMS is first class…the standard of the best national statistical offices”
» “...a valuable and trustworthy microdata service. It meets the fundamental principles of good practice with respect to confidentiality and microdata.”
» “in full compliance with the principles and recommendations of the CES [Conference of European Statisticians]”
2. Statistical Disclosure controls
1. Microdata are anonymized by suppressing any names, addresses, or precise geographic identifiers.
2. Sample is drawn so that researchers have access to only a minor fraction of the complete dataset.
3. Disclosure protections are imposed on the sample, variable-by-variable and code-by-code.
4. A small fraction of households is swapped across geographic boundaries.
» See case of Switzerland with 5% household samples for four censuses.
» Suppression thresholds are set by each NSO.
» Great satisfaction from NSOs and researchers
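The four controls above can be illustrated with a minimal Python sketch. This is not the IPUMS-International or MPC procedure: the function name `protect_sample`, the record fields (`name`, `address`, `region`, `occupation`), and the default fraction, threshold and swap rate are all assumptions made for illustration. In practice, as the slide notes, suppression thresholds are set by each NSO.

```python
import random
from collections import Counter

SUPPRESSED = "suppressed"  # hypothetical placeholder code for a suppressed value

def protect_sample(households, fraction=0.10, min_count=3, swap_rate=0.02, seed=0):
    """Illustrative only: apply four disclosure controls to household records.

    Each household is a dict with the assumed keys
    'name', 'address', 'region' and 'occupation'.
    """
    rng = random.Random(seed)

    # 1. Anonymize: drop direct identifiers (copies, so the input is untouched).
    anon = [{k: v for k, v in h.items() if k not in ("name", "address")}
            for h in households]

    # 2. Sample: keep only a minor fraction of the complete dataset.
    k = max(1, round(len(anon) * fraction))
    sample = rng.sample(anon, k)

    # 3. Code-by-code protection: suppress codes rarer than the threshold.
    counts = Counter(h["occupation"] for h in sample)
    for h in sample:
        if counts[h["occupation"]] < min_count:
            h["occupation"] = SUPPRESSED

    # 4. Swap a small fraction of households across geographic boundaries.
    n_swap = int(len(sample) * swap_rate) // 2 * 2  # even number of records
    idx = rng.sample(range(len(sample)), n_swap)
    for a, b in zip(idx[0::2], idx[1::2]):
        sample[a]["region"], sample[b]["region"] = (
            sample[b]["region"], sample[a]["region"])

    return sample
```

The sketch treats one variable (`occupation`) for brevity; a real protocol would repeat step 3 for every variable with its own threshold.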
3. Restricted access: Thwarting intruders by legal and administrative procedures
» Usage is restricted to bona-fide researchers who agree to stringent conditions of use to protect statistical confidentiality
» 1,100 word application form; <5,300 word Facebook policy
  » Agree to 8 specific conditions of use
  » Supply extensive personal and institution details
  » Identify your employer’s Office for Protection of Human Subjects, IRB, etc.
  » Describe research detailing need for access
» Rogue intruders face legal and institutional sanctions
  » University attorney’s office is obligated to initiate sanctions against both the individual and the institution, similar to NIH probationary status
Restricted Access: User Registration and Login
Despite the “P” (Public) in IPUMS, access to the microdata is restricted.
Also shown: Links to Partner Statistical Agency Websites
Application form for IPUMS-I requesting information on institutional affiliation
Conditions of use: must agree to each one (no exceptions)
1. Data must not be redistributed without authorization. All data extracted from the IPUMS-International database are intended solely for the use of the licensee. Under IPUMS-International agreements with collaborating agencies, redistribution of the data to third parties is prohibited. Each member of a research team using the data must apply for access and be licensed individually.
2. The microdata are intended only for scholarly research and educational purposes. These microdata are provided for the exclusive purposes of teaching and scholarly research, and may not be used for any other purposes without explicit written approval from the relevant official statistical authority.
3. Commercial use and redistribution of the microdata is strictly prohibited. Users are prohibited from using microdata acquired from the Integrated Public Use Microdata Series International or other authorized distributors in the pursuit of any commercial or income-generating venture either privately, or otherwise.
4. Use of the microdata must follow strict rules of confidentiality. Users will maintain the confidentiality of persons and households. Any attempt to ascertain the identity of persons or households from the microdata is prohibited. Alleging that a person or household has been identified in these data is also prohibited. Statistical results that might reveal the identity of persons or entities may not be reported or published in any form.
5. The microdata must always be safely secured. Users will implement security measures to prevent unauthorized access to microdata acquired from Integrated Public Use Microdata Series International, its partners or authorized distributors. Upon the completion of this research, data may be retained only if they can be safely secured. If security cannot be guaranteed, the microdata must be destroyed.
6. Scholarly publications are permitted, and must be cited appropriately. The publishing of research results based on IPUMS-International microdata is permitted in communications such as scholarly papers, journals and the like. The authors of these communications are required to cite Integrated Public Use Microdata Series-International and the relevant official statistical authority as the source of the microdata, and to indicate that the results and views expressed are those of the author. Users are requested to provide the IPUMS-International staff with a full citation for any publications resulting from their work with these data.
7. Any violation of this license agreement will result in disciplinary action, including possible loss of employment. Violation of this agreement will lead to revocation of this license, recall of all microdata acquired, a motion of censure to the relevant professional organization(s) and civil prosecution under national or international statutes, at the discretion of the Regents of the University of Minnesota and the official statistical agencies. Sanctions likewise may be taken against the institution with which the violator is affiliated.
8. User agrees to notify regarding errors in the data.
II. Integration
See pp. 6-8: 2012: “IPUMS and AICMD Add Significant Value to African Census Microdata,” ASSD VII, Cape Town, South Africa, January
4. Comprehensive Source Metadata
5. Integrated, DDI Compatible Metadata
6. Integrated Microdata
7. IPUMS-I Value-Added Variables
8. Integrated Boundary Files
4. Comprehensive Source Documents (forms, instruction manuals) for integrated censuses
» Links to Official Statistical Agency Partners
» Bibliography: view cites, link to publications
5. DDI Compatible Metadata (we share!)
» Mapped in DDI; compatible with IHSN Microdata toolkit
» Copies entered into the NADA catalog and archive
6. Integrated Metadata (Browse and Select Data)
» User Registration, conditions of use license
» Link to Official Statistical Agency home pages
» Source documents (forms, instruction manuals)
» Download Data Extract (and codebook)
» Bibliography: view cites, link to publications
Integrated metadata: open access, dynamically constructed. Example: Marital Status
» Page is constructed dynamically
» Displays currently selected samples
Integrated IPUMS-I Metadata: Codes and Frequencies (Detailed, Case-Count View)
2 rules:
1. Retain details
2. Harmonize everything
» Page is constructed dynamically
» Displays currently selected samples
Integrated IPUMS-I Metadata: Enumeration text
View text in English for any combination of countries and censuses.
2 documents: first, the form; then, the enumeration instructions (scroll down for more)
» Page is constructed dynamically
» Displays currently selected samples
7. Integrated Microdata (Table 2): 32 most popular integrated variables in IPUMS-International (85,505 Sample Extracts)

Rank  Label                       Extracts  Mnemonic  Comment
1     Educational attainment      19,307    EDATTAN
2     Age (single years to 85+)   19,009    AGE       Grouped age n=3,838
3     Employment status           18,490    EMPSTAT
4     Marital status              18,214    MARST
5     Person weight               17,511    WTPER     Technical variable
6     Relationship to head        15,783    RELATE
7     Sex                         14,595    SEX
8     Class of work               12,583    CLASSWK
9     Ownership of dwelling        8,050    OWNRSHP
10    Occupation ISCO recode       8,004    OCCISCO
11    School attendance            7,919    SCHOOL
12    Years of schooling           7,576    YRSCHL
13    Literate                     7,290    LIT
14    Urban/rural                  7,098    URBAN
15    Industry-general code        7,044    INDGEN
16    Household weight             6,656    WTHH      Technical variable
Table 2 (continued): most popular integrated variables in IPUMS-International (85,505 Sample Extracts)

Rank  Label                             Extracts  Mnemonic  Comment
17    Children ever born                6,363     CHBORN
18    Nativity (native/foreign born)    6,332     NATIVTY
19    Occupation                        6,246     OCC
20    Country of birth                  6,153     BPLCTRY
21    Religion                          6,075     RELIG
22    Industry                          5,670     IND
23    Location of spouse in household   5,007     SPLOC     IPUMS unique
24    Rule for locating spouse          4,171     SPRULE    IPUMS unique
25    Location of mother in hh          4,153     MOMLOC    IPUMS unique
26    Number of children surviving      4,074     CHSURV
27    Place of residence 5 years ago    4,064     MGRATE5
28    Location of father in household   3,983     POPLOC    IPUMS unique
29    Total household income            3,965     INCTOT    Household variable
30    Earned income                     3,655     INCEARN
31    Number of rooms                   3,465     ROOMS
32    Consensual union                  3,443     CONSENS   IPUMS unique
Appendix D. 42 (of 60) Integrated Household Variables: Availability for 13 African Countries (25 Censuses)
Appendix E. 88 (of 108) Integrated Person Variables: Availability for 13 African Countries (25 Censuses)
8. GIS Boundary files (and other Data Files)
» Link to Official Statistical Agency home pages
» Source documents (forms, instruction manuals)
» Bibliography: view cites, link to publications
III. Dissemination
See pp. 9-10: 2012: “IPUMS and AICMD Add Significant Value to African Census Microdata,” ASSD VII, Cape Town, South Africa, January
9. Trans-border Access
10. Custom-Tailored Extracts
11. Usage
Round Census Microdata
9. Transborder access. IPUMS-I Extracts by researcher’s place of identity
Columns: Place of Identity; Extracts (N); Samples Extracted (mean); Institutions (N)
Rows, in order: United States 14,; France; Spain; United Kingdom; Canada; Colombia; Brazil; Mexico; Singapore; Germany; Austria; Italy; Chile; Argentina; Switzerland; Belgium; Australia; Netherlands; China; Japan
Top 20 institutions using IPUMS-I (Appendix 4)

1   University of Michigan                               742
2   Columbia University                                  701
3   Universitat de Barcelona, Spain                      615
4   Harvard University                                   589
5   Inter-American Development Bank                      499
6   Arizona State University                             495
7   National University of Singapore, Singapore          467
8   World Bank                                           408
9   University of California - Berkeley                  362
10  Universidade Federal de Minas Gerais, Brazil         314
11  University of Chicago                                285
12  Universidad del Valle, Colombia                      270
13  Institute for Health Metrics & Evaluation            260
14  Princeton University                                 237
15  University of Wisconsin - Madison                    234
16  Brown University                                     229
17  University of Vienna, Austria                        229
18  University of Pittsburgh                             227
19  University of Delaware                               213
20  El Colegio de México, México                         214
Dissemination of microdata and metadata extracts
» The massive scale of IPUMS requires users to be selective:
  » Select country (or countries)
  » Select samples (census years)
  » Select variables (e.g., age, sex, educational attainment, etc.)
  » Select sub-populations (e.g., nurses)
  » Select sample density
» Once an extract request is submitted, the IPUMS extract engine:
  » Constructs the microdata extract
  » Constructs the metadata
  » Notifies the researcher to retrieve the extract (password-protected; transmission is encrypted with 128-bit SSL)
» The researcher downloads the extract, un-zips and analyzes
» Extract system validated as usage has soared
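To make the selection steps concrete, here is a minimal Python sketch of an extract request applied to records held in memory. It is not the IPUMS extract engine: `ExtractRequest`, `build_extract`, and the `COUNTRY` and `YEAR` field names are hypothetical, and sample density is applied per record here purely for simplicity.

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class ExtractRequest:
    """Hypothetical description of what a researcher selects."""
    samples: set                      # (country, census year) pairs
    variables: list                   # variable mnemonics to keep, e.g. ["AGE", "SEX"]
    case_filter: Optional[Callable[[dict], bool]] = None  # sub-population rule
    density: float = 1.0              # fraction of matching records to keep

def build_extract(records, request, seed=0):
    """Return (microdata rows, small codebook) for the request."""
    rng = random.Random(seed)
    rows = []
    for rec in records:
        # Country and sample (census year) selection.
        if (rec["COUNTRY"], rec["YEAR"]) not in request.samples:
            continue
        # Sub-population selection, e.g. nurses only.
        if request.case_filter is not None and not request.case_filter(rec):
            continue
        # Sample density: keep each matching record with this probability.
        if rng.random() >= request.density:
            continue
        # Variable selection: keep only the requested columns.
        rows.append({v: rec[v] for v in request.variables})
    codebook = {
        "samples": sorted(request.samples),
        "variables": list(request.variables),
        "records": len(rows),
    }
    return rows, codebook
```

A request for Kenya 1999 with `variables=["AGE", "SEX"]` and a filter on an assumed `OCC` field would return only the nurses' ages and sexes, together with a codebook describing the extract.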
10. Custom tailored extracts.
a. Login with password
b-1. Study documentation
b-2. Create extract
c. Receive notification; log on with password
d-1. Download extract (SSL encrypted)
d-2. Unzip data
e. Analyze using own software
Use the extract system to “Select Cases”. Example: Disability
Second: Click the box to include the variable
Third: Click the “select cases” box
Fourth: Scroll down, select “disabled”, then “Continue to next step”
Click here to select every person in households containing an individual with employment disability
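The household-level option described above (select every person in households containing an individual with the chosen characteristic) amounts to a two-pass filter. The Python sketch below shows the idea. The function `select_households` and the field names `SERIAL` (household identifier), `PERNUM` and `DISABLED` are assumptions for illustration, not the extract system's actual implementation.

```python
def select_households(persons, predicate):
    """Keep every person living in a household where at least one
    member satisfies the predicate (illustrative sketch only)."""
    # Pass 1: collect identifiers of households with a matching member.
    matching = {p["SERIAL"] for p in persons if predicate(p)}
    # Pass 2: keep all members of those households, matching or not.
    return [p for p in persons if p["SERIAL"] in matching]

# Usage: household 1 has one disabled member, household 2 has none.
persons = [
    {"SERIAL": 1, "PERNUM": 1, "DISABLED": True},
    {"SERIAL": 1, "PERNUM": 2, "DISABLED": False},
    {"SERIAL": 2, "PERNUM": 1, "DISABLED": False},
]
selected = select_households(persons, lambda p: p["DISABLED"])
```

Here `selected` holds both members of household 1 and nobody from household 2, which is what makes it possible to study the co-residents of the selected individuals.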
2010 round censuses. Minimum Standards for Samples Entrusted to IPUMS for dissemination
1. Household samples
2. High precision: 5% minimum, 10% preferred
3. Broad set of variables: omit only those required for statistical confidentiality (low-level geography, low frequency attributes)
4. Detailed codes
» Age: single year to 85
» Occupation, industry: 3 digit ISCO, ISIC
» Country of birth: detail individual countries consistent with statistical confidentiality
Thanks to INSEE France for sample of recensement rénové, : 20 million person records launched in IPUMS-I
IV. Ethics
See p. 11: 2012: “IPUMS and AICMD Add Significant Value to African Census Microdata,” ASSD VII, Cape Town, South Africa, January
Statistical Transparency
14. Academic Freedom
15. Reduce Research Fraud and Exaggeration of Results
16. Share Research Results
1. Free, easy access to data for many countries and censuses
2. Large sample sizes:
» Make it possible to include many different variables in a regression… multi-level model…
» Produce separate estimates for population sub-groups
» Easy to extract samples with a target sample size (e.g., 50 MB)
» Easy to revise an extract for a larger size or to include more countries, censuses, variables or sub-populations
3. Students show a great deal of creativity in using IPUMS-I
4. Skills acquired have an immediate pay-off when applying for jobs (e.g., World Bank), graduate school, etc.
“IPUMS-I is an excellent resource for teaching…” Dr. David Lam, president, Population Association of America
Africa Mirror Site:
