IPUMS-International: High precision Population Census Samples: Balancing the Privacy-Quality Tradeoff by Means of Restricted Access Microdata Extracts.


IPUMS-International: High precision Population Census Samples: Balancing the Privacy-Quality Tradeoff by Means of Restricted Access Microdata Extracts * * * Robert McCaa, Steven Ruggles, Michael Davern, Tami Swenson, and Krishna Mohan Palipudi Minnesota Population Center = information not in proceedings or on CD

Outline of paper (in proceedings, except “0.”) 1. Introduction: The Trusted User Approach 2. The Case for High Precision Samples: The USA Experience 3. High Precision Samples with Implicit Stratification 4. Access Disclosure Controls 5. Technical Disclosure Controls 6. Fear, Hysteria and Paranoia 7. Conclusions and Future Work 0. What’s a historian doing at PSD2006?

Why am I (a historian) here? 1. To learn from you to enhance IPUMS-International privacy and confidentiality techniques 2. To inform you of our existence and the challenges we face 3. To invite your contributions, as producers, users, and creators of statistical confidentiality methods 4. To advertise opportunities for post-docs, staff 5. To invite statistical agencies to entrust census microdata to the project

Confidentializing IPUMS-International, an integrated microdatabase with: » 150 census samples of households (50 countries) » Containing 300 million person records with hundreds of variables » Available to tens of thousands of licensed users regardless of country of birth, citizenship, residence or place of work » Not a single allegation of violation of privacy or statistical confidentiality -- What's the problem?

IPUMS-International: a restricted-access, web-based census microdata extraction system » Password protected: to make and retrieve extracts » Licensed researcher selects: » Countries, » Censuses, » Cases/sub-populations, » Variables, and » Sample densities » Extract engine queues request, generates extract » Researcher retrieves extract via web with SSL 128-bit encryption and analyzes using own wares (soft/hard/wet) » NO: CDs, original codes, or complete datasets
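
The selections listed on this slide can be pictured as one request object that is checked before it is queued. The following is only an illustrative sketch: `ExtractRequest`, `validate`, the field names and the catalog layout are hypothetical and do not describe the actual IPUMS extract engine.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class ExtractRequest:
    """Hypothetical description of one extract request (not the IPUMS schema)."""
    user: str
    samples: list[tuple[str, int]]      # (country, census year) pairs
    variables: list[str]
    sample_density: float = 1.0         # fraction of each sample to keep
    case_filter: dict[str, object] = field(default_factory=dict)  # sub-population


def validate(req, licensed_users, catalog):
    """Return a list of problems; an empty list means the request may be queued."""
    problems = []
    if req.user not in licensed_users:
        problems.append("user is not licensed")
    if not 0 < req.sample_density <= 1:
        problems.append("sample density must be in (0, 1]")
    for sample in req.samples:
        if sample not in catalog:
            problems.append(f"no sample for {sample[0]} {sample[1]}")
    return problems
```

A request that passes `validate` would then go to the queue; delivery over an encrypted connection is outside this sketch.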

6 steps using 1. Logon w/ password 2a. Study documentation 2b. Design extract 3. Receive ; logon with p/word 4. Download extract (SSL encrypted) 5. UnZip data (also SAS, STATA) 6. Analyze

IPUMS-International, December 2006 dark green = disseminating (20 countries, 63 censuses, 185mpr) green = harmonizing (37 countries, 100 censuses, 200mpr) lightest green = negotiating

What has happened since Geneva (xi/05)? 1. NSF-USA renewed funding for 5 years 2. Database grew: 12 countries, 35 censuses, 65mpr 3. More agreements signed, census data acquired 4. New, dynamic metadata system implemented 5. Number of users doubled 6. Publications are taking off 7. Paris Workshop (INED/CEPED): delegates from 14 European countries and 10 non-European, plus academic researchers

IPUMS-Europe December 2006 Dark green = Disseminating (5 countries, 15 censuses, 27mpr) In Lisbon: Portugal and Hungary will become “dark green” with the launch of samples for 4 censuses each for Argentina and Hungary, 3 for Portugal and Israel, 2 for Egypt and Rwanda, and 1 for Gaza and the West Bank

What will happen by Lisbon (ISI, viii/07)? 1. Confidentiality methods will be enhanced 2. Database will grow: 7 countries, 19 censuses, 25mpr 3. Dynamic metadata system will be expanded 4. Number of users will increase!!! 5. Publications!!! 6. IPUMS Workshop (Sat Aug 25 at INE-Pt) for producers and users (registration required; please 7. Microdata Session (Fri Aug 24) * Special conditions apply

1. Introduction: The “trusted-user” approach to disseminating integrated, anonymized census microdata sample

MBNA: world’s largest independent credit card issuer specialist in affinity marketing » 1982: MBNA founded by Charles Cawley –instead of competing on price, compete on affinity » 1983: Georgetown Univ Alumni Association (Cawley’s alma mater) supplied MBNA with names and addresses of its members in exchange for percentage of revenues on card usage » Big hit! Large number of new accounts, low risk, high spenders » 1985: new groups: American Dental Association, Aircraft Owners and Pilots Association, National Education Assoc., » 1994: Sierra Club, 45,000 members signed with MBNA generating $400,000 annually for Sierra Club »  The rest is history! » 2005:

MBNA: world’s largest independent credit card issuer specialist in affinity marketing » 1982: MBNA founded by Charles Cawley –instead of competing on price, compete on affinity » 2005: MBNA, with 25,000 employees, acquired by Bank of America, US$35 billion » How many credit cards do you have? » How many affinity credit cards do you have?

IPUMS-International: world’s largest provider of integrated census microdata to trusted users » 1999: Founded by Steven Ruggles and Bob McCaa: restrict access to trusted users, and apply corresponding confidentiality techniques » 2002: 1st release of integrated samples for 7 countries; >200 users in first year » Big hit! 69 countries signed; 57 entrusted data to IPUMS, datasets for more than 230 censuses, >150 entire datasets » 2006,

IPUMS-International: world’s largest provider of integrated census microdata to trusted users » 1999: Founded: seeks neither profits nor popularity! » 2006, 3rd release: » data for 20 countries, samples for 63 censuses, » 185 million person records, » >1,000 users » 2009, 8th release: » data for 50 countries, samples for ~150 censuses » >300 million person records » thousands of users » Note: data extracts are provided only to licensed users.

2. The Case for High Precision Samples: The USA Experience

2. High Precision Samples: The Case of the USA » Beginning with the 1980 census, US Census Bureau released 5% samples of households » Not a single allegation of misuse » 1988: first articles using high precision samples published in Demography Language use and fertility in the Mexican origin population Household size and regional outmigration » 1996: IPUMS-USA samples available via internet » Available at no cost to researchers worldwide » 81% of articles in Demography, since 1990, use high precision samples » In 2000 & 2001, high precision census microdata used twice as often as next most common data source » Analyze information for small population subgroups » very large census microdata samples are among the most powerful tools available for economic and demographic analysis
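
Why sample density matters for small subgroups can be shown with a rough calculation. This is a sketch under simple-random-sampling assumptions (no design effect, no finite population correction); the subgroup size and the proportion are made-up illustration values, not figures from the paper.

```python
import math


def se_proportion(p, n):
    """Standard error of an estimated proportion p from n sampled cases."""
    return math.sqrt(p * (1 - p) / n)


subgroup_population = 20_000      # hypothetical small subgroup
p = 0.30                          # hypothetical proportion of interest

n_1pct = subgroup_population // 100   # cases in a 1% sample
n_5pct = subgroup_population // 20    # cases in a 5% sample

se_1pct = se_proportion(p, n_1pct)
se_5pct = se_proportion(p, n_5pct)

# Five times as many cases shrinks the standard error by a factor of sqrt(5).
ratio = se_1pct / se_5pct
```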

3. High Precision Samples with Implicit Stratification Note: almost all NSIs are supplying household samples drawn to IPUMS specifications (every nth household from 100% fine-grained geographically stratified microdata); see table 1

IPUMS-International: High precision samples with implicit stratification » Suppress all identifying information: names, id numbers, street addresses, low-level administrative geography (NUTS-5, NUTS-4?, NUTS-3?, NUTS-2?) » Sample is stratified by lowest level geography (census tract) » Lower standard errors than a classic random sample, to the extent that variables of interest are correlated with geography » Implicit geographical stratification is equivalent to extremely fine geographic stratification with proportional weighting » Many of our NSI partners have adopted the IPUMS sample design (see table 1). » 26 countries provided 100% microdata for the MPC to draw the sample » Europe: almost all NSIs have drawn samples to IPUMS specs. for all censuses » High precision samples for 57 countries entrusting microdata (12/12/2006) » 10% samples: 43 countries » 5% samples: 10 countries » <5% samples: 4 countries
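
The "every nth household" design with implicit geographic stratification can be sketched as follows. This is a minimal illustration, assuming the full household file fits in memory and that each household carries a geographic sort key; the function name, the `geo_key` parameter and the random start are assumptions of this sketch, not the documented IPUMS procedure.

```python
import random


def systematic_household_sample(households, n, geo_key, seed=None):
    """Take every n-th household after sorting by geography.

    households: list of dicts, one per household
    n:          sampling interval (n=10 gives roughly a 10% sample)
    geo_key:    function returning the geographic sort key,
                e.g. lambda h: (h["province"], h["district"], h["tract"])
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Sorting by geography is what makes the stratification implicit:
    # neighbouring households sit next to each other in the file.
    ordered = sorted(households, key=geo_key)
    start = random.Random(seed).randrange(n)   # random start in the first interval
    return ordered[start::n]
```

Because the selection steps evenly through the geographically ordered file, every area contributes households in proportion to its size, which is the source of the lower standard errors for variables correlated with geography.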

4. Access Disclosure Controls a. Memorandum with NSI b. License with researchers

A. NSI with U of Minnesota

A. NSI with U. of Minnesota (2005+)

B. License with researchers (restricted-access web-based system) Legally-binding license agreement » forces would-be snoopers to violate a law under which they can be fined and jailed » protects privacy and confidentiality » assures proper use Access limited to: » Bona-fide researchers (credentials) » With a demonstrated scientific need » who agree to abide by license restrictions » Confidentiality » No redistribution » Safely secured » Alleging that a person has been identified is prohibited

B. License with researchers (restricted-access web-based system) Legally-binding license agreement » forces would-be snoopers to violate the law » protects privacy and confidentiality » assures proper use Access limited to: » Bona-fide researchers (credentials) » With a demonstrated scientific need » who agree to abide by license restrictions » Confidentiality » No redistribution, no commercial use » Safely secured » Alleging that a person can be or has been identified is illegal

“Apply for Access”

End of application

5. Technical Disclosure Controls

IPUMSi confidentializes: » Suppress geographical detail » Blur/aggregate sensitive codes » Convert dates to ages (blur key vars.) » Swap cases between districts » Scramble order of records Technical measures are also applied, in addition to the legal & administrative protections
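
Three of the measures on this slide (suppressing identifying fields, converting dates to ages, scrambling record order) can be sketched in a few lines. This is an illustration only: the field names (`name`, `address`, `tract`, `birthdate`) are hypothetical, and record swapping between districts is not shown.

```python
import random
from datetime import date


def age_at(birth, census_day):
    """Completed years of age on the census reference day."""
    had_birthday = (census_day.month, census_day.day) >= (birth.month, birth.day)
    return census_day.year - birth.year - (0 if had_birthday else 1)


def confidentialize(records, census_day,
                    drop_fields=("name", "address", "tract"), seed=None):
    """Drop identifying fields, replace birthdate by age, scramble record order."""
    out = []
    for rec in records:
        clean = {k: v for k, v in rec.items()
                 if k not in drop_fields and k != "birthdate"}
        clean["age"] = age_at(rec["birthdate"], census_day)
        out.append(clean)
    # Scrambling removes any ordering that could hint at location in the source file.
    random.Random(seed).shuffle(out)
    return out
```

In a household file the shuffle would presumably be applied to whole households rather than to individual persons, so that household members stay together; that detail is left out here.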

EUROSTAT statistical confidentiality standards (Thorogood, 1999) --all endorsed by IPUMS-International » 1. Restrict access to samples » 2. Limit geographical detail » 3. Re-code unique categories--top and bottom » 4. Sign non-disclosure agreement » 5. Prohibit redistribution to third parties » 6. Prohibit attempts to identify individuals or making any claim to that effect » 7. Require users to provide copies of publications

EUROSTAT statistical confidentiality standards (Thorogood, 1999) --all endorsed by IPUMS-International » 8. Construct age from birthdate, if necessary » 9. Do not identify date of birth » 10. Do not identify precise place of birth » 11. Migration: timing/place not identified in detail » 12. Identify place of residence by major civil division (pop>20k, 60k, 100k, 1 million, i.e., national convention) » 13. Do sensitivity analysis (not yet) » 14. Do confidentiality assessment (not yet)
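
Standards 3 and 12 (top and bottom re-coding, and a population threshold for identifying place of residence) lend themselves to a short sketch. The function names, the `other_code` value and the 20,000 default are assumptions for illustration; as the slide notes, the actual threshold follows each country's convention.

```python
def coarsen_residence(place_code, population, threshold=20_000, other_code=0):
    """Keep a place code only if the place has at least `threshold` inhabitants.

    population: dict mapping place code -> population count
    Smaller places are collapsed into a single residual code.
    """
    return place_code if population[place_code] >= threshold else other_code


def top_bottom_code(value, low, high):
    """Re-code extreme values to the nearest allowed bound (e.g. ages above 85 become 85)."""
    return max(low, min(high, value))
```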

Countering Fear, Hysteria and Paranoia…with reason “There has been no known attempt at identification with the 1991 SARs [microdata samples of the UK]- nor in any other countries that disseminate samples of microdata” --Elliott and Dale, Journal of the Royal Statistical Society

ChoicePoint Data Sources and Clients. Source: Washington Post Why Not? Companies want linkable data with names, addresses, ID #s, etc. * * * * * * * * * * * * * * * * * * * Probabilistic linking with 90% of the population missing is not good enough

“…there are no known incidents of researchers using their access to microdata to deliberately identify individuals...” --Managing Statistical Confidentiality and Microdata Access: Principles and Guidelines of Good Practice UNECE, Conference of European Statisticians, Task Force on Census Microdata (October 2006), p

“Statistical disclosure control methods may modify the data or the design of the statistic, or a combination of both. They will be judged sufficient when the guarantee of confidentiality can be maintained, taking account of information likely to be available to third parties, either from other sources or as previously released National Statistics outputs, against the following standard: ‘It would take a disproportionate amount of time, effort and expertise for an intruder to identify a statistical unit to others, or to reveal information about that unit not already in the public domain.’” --Protocols on Data Access and Confidentiality, pp. 7-8, ONS-UK (2004)

7. Conclusions and Future Work

IPUMS-International strengths: 1. Uniform legal authorization with national statistical authorities 2. Access restricted to academics with need who agree to abide by stringent confidentiality protections 3. Experienced integration teams 4. Proven web-based distribution system 5. High user satisfaction 6. Sustainable: NSF, NIH, FP-6 (7?) funded (Europe only)

Significant weakness: statistical disclosure controls …as a result of PSD2006, we will: » Re-consider our portfolio of statistical disclosure controls » Implement a uniform set of controls across all samples and countries » Do sensitivity analysis » Do confidentiality assessment » Revise our documentation on the confidentializing of datasets for each country, describing principles, but not the “keys” » Cite bibliography for users to confidentialize tables and graphs

IPUMS-International, August 2009??? dark green = disseminating (50 countries, 150 censuses, 300mpr) green = harmonizing (?? countries, ?? censuses, ???mpr) lightest green = negotiating

Thank you! additional information at: * * * * * * Contact: