IPUMS-International: High precision Population Census Samples: Balancing the Privacy-Quality Tradeoff by Means of Restricted Access Microdata Extracts * * * Robert McCaa, Steven Ruggles, Michael Davern, Tami Swenson, and Krishna Mohan Palipudi Minnesota Population Center = information not in proceedings or on CD
Outline of paper (in proceedings, except “0.”) 1. Introduction: The Trusted User Approach 2. The Case for High Precision Samples: The USA Experience 3. High Precision Samples with Implicit Stratification 4. Access Disclosure Controls 5. Technical Disclosure Controls 6. Fear, Hysteria and Paranoia 7. Conclusions and Future Work 0. What’s a historian doing at PSD2006?
Why am I (a historian) here? 1. To learn from you to enhance IPUMS-International privacy and confidentiality techniques 2. To inform you of our existence and the challenges we face 3. To invite your contributions, as producers, users, and creators of statistical confidentiality methods 4. To advertise opportunities for post-docs, staff 5. To invite statistical agencies to entrust census microdata to the project
Confidentializing IPUMS-International, an integrated microdatabase with: » 150 census samples of households (50 countries) » Containing 300 million person records with hundreds of variables » Available to tens of thousands of licensed users regardless of country of birth, citizenship, residence or place of work » Not a single allegation of violation of privacy or statistical confidentiality. What’s the problem?
IPUMS-International: a restricted-access, web-based census microdata extraction system » Password protected: to make and retrieve extracts » Licensed researcher selects: » Countries, » Censuses, » Cases/sub-populations, » Variables, and » Sample densities » Extract engine queues request, generates extract » Researcher retrieves extract via web with SSL 128-bit encryption and analyzes using own wares (soft/hard/wet) » NO: CDs, original codes, or complete datasets
6 steps using 1. Logon w/ password 2a. Study documentation 2b. Design extract 3. Receive ; logon with p/word 4. Download extract (SSL encrypted) 5. UnZip data (also SAS, STATA) 6. Analyze
IPUMS-International, December 2006 dark green = disseminating (20 countries, 63 censuses, 185mpr) green = harmonizing (37 countries, 100 censuses, 200mpr) lightest green = negotiating
What has happened since Geneva (xi/05)? 1. NSF-USA renewed funding for 5 years 2. Database grew: 12 countries, 35 censuses, 65mpr 3. More agreements signed, census data acquired 4. New, dynamic metadata system implemented 5. Number of users doubled 6. Publications are taking off 7. Paris Workshop (INED/CEPED): delegates from 14 European countries and 10 non-European, plus academic researchers
IPUMS-Europe December 2006 Dark green = Disseminating (5 countries, 15 censuses, 27mpr) In Lisbon: Portugal and Hungary will become “dark green” with the launch of samples for 4 censuses each for Argentina and Hungary, 3 for Portugal and Israel, 2 for Egypt and Rwanda, and 1 for Gaza and the West Bank
What will happen by Lisbon (ISI, viii/07)? 1. Confidentiality methods will be enhanced 2. Database will grow: 7 countries, 19 censuses, 25mpr 3. Dynamic metadata system will be expanded 4. Number of users will increase!!! 5. Publications!!! 6. IPUMS Workshop (Sat Aug 25 at INE-Pt) for producers and users (registration required; please 7. Microdata Session (Fri Aug 24) * Special conditions apply
1. Introduction: The “trusted-user” approach to disseminating integrated, anonymized census microdata sample
MBNA: world’s largest independent credit card issuer specialist in affinity marketing » 1982: MBNA founded by Charles Cawley –instead of competing on price, compete on affinity » 1983: Georgetown Univ Alumni Association (Cawley’s alma mater) supplied MBNA with names and addresses of its members in exchange for percentage of revenues on card usage » Big hit! Large number of new accounts, low risk, high spenders » 1985: new groups: American Dental Association, Aircraft Owners and Pilots Association, National Education Assoc., » 1994: Sierra Club, 45,000 members signed with MBNA generating $400,000 annually for Sierra Club » The rest is history! » 2005:
MBNA: world’s largest independent credit card issuer specialist in affinity marketing » 1982: MBNA founded by Charles Cawley –instead of competing on price, compete on affinity » 2005: MBNA, with 25,000 employees, acquired by Bank of America, US$35 billion » How many credit cards do you have? » How many affinity credit cards do you have?
IPUMS-International: world’s largest provider of integrated census microdata to trusted users » 1999: Founded by Steven Ruggles and Bob McCaa, to restrict access to trusted users and apply corresponding confidentiality techniques » 2002: 1st release of integrated samples for 7 countries; >200 users in first year » Big hit! 69 countries signed; 57 entrusted data to IPUMS, datasets for more than 230 censuses, >150 entire datasets » 2006,
IPUMS-International: world’s largest provider of integrated census microdata to trusted users » 1999: Founded; seeks neither profits nor popularity! » 2006, 3rd release: » data for 20 countries, samples for 63 censuses, » 185 million person records, » >1,000 users » 2009, 8th release: » data for 50 countries, samples for ~150 censuses » >300 million person records » thousands of users » Note: data extracts are provided only to licensed users.
2. The Case for High Precision Samples: The USA Experience
2. High Precision Samples: The Case of the USA » Beginning with the 1980 census, US Census Bureau released 5% samples of households » Not a single allegation of misuse » 1988: first articles using high precision samples published in Demography: “Language use and fertility in the Mexican origin population”; “Household size and regional outmigration” » 1996: IPUMS-USA samples available via internet » Available at no cost to researchers worldwide » 81% of articles in Demography, since 1990, use high precision samples » In 2000 & 2001, high precision census microdata used twice as often as next most common data source » Analyze information for small population subgroups » Very large census microdata samples are among the most powerful tools available for economic and demographic analysis
3. High Precision Samples with Implicit Stratification Note: almost all NSIs are supplying household samples drawn to IPUMS specifications (every nth household from 100% fine-grained geographically stratified microdata); see table 1
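The "every nth household" design in the note above can be sketched in a few lines. This is a minimal illustration, not the procedure any NSI or the MPC actually runs: the field names `tract` and `serial`, the function name, and the random start are assumptions made for the example.

```python
import random
from typing import Any, Dict, List, Optional

Household = Dict[str, Any]


def draw_systematic_sample(
    households: List[Household],
    interval: int,
    seed: Optional[int] = None,
) -> List[Household]:
    """Take every `interval`-th household from a geographically sorted list.

    Sorting by the lowest geographic unit before sampling is what makes the
    stratification "implicit": each small area contributes households roughly
    in proportion to its size, with no explicit strata being defined.
    """
    if interval < 1:
        raise ValueError("interval must be a positive integer")
    # Hypothetical keys: "tract" (lowest-level geography) and "serial"
    # (household order within the tract).
    ordered = sorted(households, key=lambda h: (h["tract"], h["serial"]))
    # Random start within the first interval, then a fixed step.
    start = random.Random(seed).randrange(interval)
    return ordered[start::interval]


if __name__ == "__main__":
    frame = [
        {"tract": tract, "serial": i}
        for tract, size in (("A", 100), ("B", 50), ("C", 20))
        for i in range(size)
    ]
    sample = draw_systematic_sample(frame, interval=10, seed=7)
    print(len(sample), "households drawn from", len(frame))
```

With an interval of 10 this yields a 10% sample in which each tract is represented in proportion to its household count.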
IPUMS-International: High precision samples with implicit stratification » Suppress all identifying information: names, id numbers, street addresses, low-level administrative geography (NUTS-5, NUTS-4?, NUTS-3?, NUTS-2?) » Sample is stratified by lowest level geography (census tract) » Lower standard errors than a classic random sample, to the extent that variables of interest are correlated with geography » Implicit geographical stratification is equivalent to extremely fine geographic stratification with proportional weighting » Many of our NSI partners have adopted the IPUMS sample design (see table 1). » 26 countries provided 100% microdata for the MPC to draw the sample » Europe: almost all NSIs have drawn samples to IPUMS specs. for all censuses » High precision samples for 57 countries entrusting microdata (12/12/2006) » 10% samples: 43 countries » 5% samples: 10 countries » <5% samples: 4 countries
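The claim about lower standard errors follows from the standard textbook comparison of simple random sampling with proportionally allocated stratified sampling. The notation below is an assumption introduced for illustration: $W_h$ is the population share of stratum $h$, $S_h^2$ its within-stratum variance, $\bar{Y}_h$ its mean, $\bar{Y}$ the overall mean, and $n$ the sample size. Finite population corrections are ignored and strata are assumed large.

```latex
\begin{align}
V_{\mathrm{SRS}}(\bar{y}) &\approx \frac{1}{n}\left[\sum_h W_h S_h^2 + \sum_h W_h\,(\bar{Y}_h - \bar{Y})^2\right] \\
V_{\mathrm{prop}}(\bar{y}_{st}) &\approx \frac{1}{n}\sum_h W_h S_h^2 \\
V_{\mathrm{SRS}}(\bar{y}) - V_{\mathrm{prop}}(\bar{y}_{st}) &\approx \frac{1}{n}\sum_h W_h\,(\bar{Y}_h - \bar{Y})^2 \;\ge\; 0
\end{align}
```

The gain is the between-stratum component: it is large when the variable of interest differs across geographic units and vanishes when it does not, which is the sense of "to the extent that variables of interest are correlated with geography".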
4. Access Disclosure Controls a. Memorandum with NSI b. License with researchers
A. NSI with U of Minnesota
A. NSI with U. of Minnesota (2005+)
B. License with researchers: restricted-access, web-based system. Legally-binding license agreement » forces would-be snoopers to violate a law under which they can be fined and jailed » protects privacy and confidentiality » assures proper use Access limited to: » Bona-fide researchers (credentials) » With a demonstrated scientific need » who agree to abide by license restrictions » Confidentiality » No redistribution » Safely secured » Alleging that a person has been identified is prohibited
B. License with researchers: restricted-access, web-based system. Legally-binding license agreement » forces would-be snoopers to violate law » protects privacy and confidentiality » assures proper use Access limited to: » Bona-fide researchers (credentials) » With a demonstrated scientific need » who agree to abide by license restrictions » Confidentiality » No redistribution, no commercial use » Safely secured » Alleging that a person can be or has been identified is illegal
“Apply for Access”
End of application
5. Technical Disclosure Controls
IPUMSi confidentializes: » Suppress geographical detail » Blur/aggregate sensitive codes » Convert dates to ages (blur key vars.) » Swap cases between districts » Scramble order of records. Technical measures are also applied, in addition to the legal & administrative protections
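Two of the measures listed above, swapping cases between districts and scrambling record order, can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's procedure: the `district` field, the `swap_rate` parameter, and the purely random pairing are hypothetical, and production swapping schemes typically pair records that match on other key variables.

```python
import random
from typing import Any, Dict, List, Optional

Record = Dict[str, Any]


def swap_and_scramble(
    records: List[Record],
    swap_rate: float,
    seed: Optional[int] = None,
) -> List[Record]:
    """Swap district codes between random pairs of records, then shuffle.

    Swapping exchanges only the geographic code, so the number of records
    per district is unchanged. Shuffling removes any information carried
    by the original file order.
    """
    if not 0.0 <= swap_rate <= 1.0:
        raise ValueError("swap_rate must be between 0 and 1")
    rng = random.Random(seed)
    out = [dict(r) for r in records]  # copy so the input is not modified
    n_pairs = int(len(out) * swap_rate) // 2
    chosen = rng.sample(range(len(out)), 2 * n_pairs)
    for i, j in zip(chosen[0::2], chosen[1::2]):
        # Pairs that share a district are left alone: a swap would be a no-op.
        if out[i]["district"] != out[j]["district"]:
            out[i]["district"], out[j]["district"] = (
                out[j]["district"],
                out[i]["district"],
            )
    rng.shuffle(out)
    return out


if __name__ == "__main__":
    data = [{"serial": i, "district": "D%d" % (i % 4)} for i in range(12)]
    for row in swap_and_scramble(data, swap_rate=0.5, seed=3):
        print(row)
```

Because only a fraction of records is perturbed, an analyst's district-level totals are preserved while an intruder cannot be certain that any given record's geography is the true one.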
EUROSTAT statistical confidentiality standards (Thorogood, 1999) --all endorsed by IPUMS-International » 1. Restrict access to samples » 2. Limit geographical detail » 3. Re-code unique categories--top and bottom » 4. Sign non-disclosure agreement » 5. Prohibit redistribution to third parties » 6. Prohibit attempts to identify individuals or making any claim to that effect » 7. Require users to provide copies of publications
EUROSTAT statistical confidentiality standards (Thorogood, 1999) --all endorsed by IPUMS-International 8. Construct age from birthdate, if necessary 9. Do not identify date of birth 10. Do not identify precise place of birth 11. Migration: timing/place not identified in detail 12. Identify place of residence by major civil division (pop>20k, 60k, 100k, 1 million, i.e., national convention) 13. Do sensitivity analysis (not yet) 14. Do confidentiality assessment (not yet)
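Standards 3, 8 and 12 are simple recodes, and a short sketch may make them concrete. Everything named here is an assumption for illustration: the function names, the age cap of 95, the "other" residual code, and the 20,000 threshold (one of the national conventions mentioned above) are not taken from IPUMS-International documentation.

```python
from datetime import date


def age_at_census(birth: date, census_day: date) -> int:
    """Standard 8: completed years of age on census day, replacing birthdate."""
    years = census_day.year - birth.year
    # Subtract one year if the birthday has not yet occurred in the census year.
    if (census_day.month, census_day.day) < (birth.month, birth.day):
        years -= 1
    return years


def recode_top_bottom(value: int, low: int, high: int) -> int:
    """Standard 3: collapse rare extreme values into the nearest bound."""
    return max(low, min(value, high))


def residence_code(unit_code: str, unit_population: int,
                   min_population: int, other_code: str = "other") -> str:
    """Standard 12: identify a place only if its population exceeds the threshold."""
    return unit_code if unit_population > min_population else other_code


if __name__ == "__main__":
    age = age_at_census(date(1898, 6, 15), date(2001, 3, 1))
    print(age, recode_top_bottom(age, low=0, high=95))
    print(residence_code("0412", 15_000, min_population=20_000))
```

In the usage lines, a person born in 1898 is aged 102 on census day and is released as 95, and a district of 15,000 inhabitants is released under the residual code.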
Countering Fear, Hysteria and Paranoia… with reason: “There has been no known attempt at identification with the 1991 SARs [microdata samples of the UK]- nor in any other countries that disseminate samples of microdata” --Elliott and Dale, Journal of the Royal Statistical Society
ChoicePoint Data Sources and Clients. Source: Washington Post. Why Not? Companies want linkable data with names, addresses, ID #s, etc. Probabilistic linking with 90% of the population missing is not good enough
“…there are no known incidents of researchers using their access to microdata to deliberately identify individuals...” --Managing Statistical Confidentiality and Microdata Access: Principles and Guidelines of Good Practice UNECE, Conference of European Statisticians, Task Force on Census Microdata (October 2006), p
“Statistical disclosure control methods may modify the data or the design of the statistic, or a combination of both. They will be judged sufficient when the guarantee of confidentiality can be maintained, taking account of information likely to be available to third parties, either from other sources or as previously released National Statistics outputs, against the following standard: ‘It would take a disproportionate amount of time, effort and expertise for an intruder to identify a statistical unit to others, or to reveal information about that unit not already in the public domain.’” --Protocols on Data Access and Confidentiality, pp. 7-8, ONS-UK (2004)
7. Conclusions and Future Work
IPUMS-International strengths: 1. Uniform legal authorization with national statistical authorities 2. Access restricted to academics with need who agree to abide by stringent confidentiality protections 3. Experienced integration teams 4. Proven web-based distribution system 5. High user satisfaction 6. Sustainable: NSF, NIH, FP-6 (7?) funded (Europe only)
Significant weakness: statistical disclosure controls …as a result of PSD2006, we will: » Re-consider our portfolio of statistical disclosure controls » Implement a uniform set of controls across all samples and countries » Do sensitivity analysis » Do confidentiality assessment » Revise our documentation on the confidentializing of datasets for each country, describing principles, but not the “keys” » Cite bibliography for users to confidentialize tables and graphs
IPUMS-International, August 2009??? dark green = disseminating (50 countries, 150 censuses, 300mpr) green = harmonizing (?? countries, ?? censuses, ???mpr) lightest green = negotiating
Thank you! additional information at: * * * * * * Contact: