Statistical confidentiality and privacy: 1. General considerations * * * Robert McCaa, Minnesota Population Center


1 Statistical confidentiality and privacy: 1. General considerations * * * Robert McCaa, Minnesota Population Center, rmccaa@umn.edu “Inadequate use of microdata has high costs” --Len Cook (2003, registrar general, ONS)

2 UNSD Principles and Recommendations (Rev. 1, 1997) endorse dissemination of census microdata » §1.218: “There are a range of methods…that can be used to make such microdata available while still protecting individuals’ rights to privacy.” (Rev. 2 has a stronger statement.) » In four decades of distributing microdata there is not a single allegation of a breach of confidentiality or privacy (includes 100% microdata stored at CELADE in Santiago, Chile).

3 Why disseminate microdata? Julia Lane, European Statisticians Conference (2003) » 1. Analyze more realistic questions » 2. Develop reality-based policy » 3. Acquire new constituencies and stakeholders » 4. Build trust; reduce suspicions of data cooking » 5. Replicate findings » a. use standards of UNSD, Eurostat, ISCO, ISCED, etc. » b. facilitate comparative research in time and space » 6. Calculate marginal effects » 7. Assess data quality » …and much, much more….

4 Confidentializing an integrated microdata base with: » 200+ samples of households (70+ countries) » Containing ½ billion person records with thousands of variables » Available to tens of thousands of licensed users regardless of country of birth, citizenship, residence or place of work » Without a single allegation of violation of privacy or statistical confidentiality-- What’s the problem?

5 Usage: Off-site vs. on-site use (secure microdata laboratory)? Germany RDC, 2005-8: ten-to-one (Jan-Sept). RDCs are expensive and attract few users.

6 “Statistical disclosure control methods may modify the data or the design of the statistic, or a combination of both. They will be judged sufficient when the guarantee of confidentiality can be maintained, taking account of information likely to be available to third parties, either from other sources or as previously released National Statistics outputs, against the following standard: ‘It would take a disproportionate amount of time, effort and expertise for an intruder to identify a statistical unit to others, or to reveal information about that unit not already in the public domain.’” Protocols on Data Access and Confidentiality, pp. 7-8 --ONS-UK (2004) www.statistics.gov.uk/about_ns/cop/downloads/prot_data_access_confidentiality.pdf

7 Risk assessment of household samples of UK 1991 census: attempts at matching are “fruitless”: few matches, many false positives » After taking into account errors in the data, coding variability and changes in personal characteristics over time » Dale and Elliott, JRSS-A (2003): “For a user of an outside database, attempting this sort of match with no opportunity for verification would prove fruitless. In the first place, the small degree of expected overlap would be a considerable deterrent to an intruder. However, if a match between the two files was attempted the large number of apparent matches would be highly confusing as an intruder would have no way of checking correct identification.”
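The “small degree of expected overlap” and the “large number of apparent matches” can be illustrated with simple arithmetic. The sketch below is not the method used by Dale and Elliott; the function names and all numbers are hypothetical, and it assumes inclusion in the sample and in the outside database are independent.

```python
def expected_overlap(population, sample_fraction, outside_coverage):
    """Expected number of people present in both files.

    Assumes (hypothetically) that being drawn into the census sample and
    being listed in the outside database are independent events.
    """
    return population * sample_fraction * outside_coverage


def match_precision(true_matches, false_matches):
    """Share of apparent matches that are correct identifications."""
    total = true_matches + false_matches
    return true_matches / total if total else 0.0


# Hypothetical figures, chosen only for illustration:
# a 1% sample of 1,000,000 people and an outside file covering 5% of them.
overlap = expected_overlap(1_000_000, 0.01, 0.05)

# If an intruder saw 50 correct and 450 spurious apparent matches,
# only one apparent match in ten would be a true identification,
# and without verification the intruder cannot tell which one.
precision = match_precision(50, 450)
```

With these made-up inputs the expected overlap is about 500 people out of a million, which is the kind of deterrent the quotation describes.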

8 [Figure: Level of Anonymization (FSO-Germany). Axes: degree of confidentiality; degree of analysis potential. Data levels shown: complete microdata, confidential microdata, de-facto anonymised microdata, fully anonymised microdata. Steps shown: delete direct identifier, anonymisation method, stronger anonymisation method.] Trade-off between confidentiality and analysis potential: is it monotonic (as portrayed)?

9 [Figure: Level of Anonymization: not monotonic. Same diagram as the previous slide (complete, confidential, de-facto anonymised and fully anonymised microdata; axes: degree of confidentiality and degree of analysis potential), with an added “Construct sample” step and percentage labels 95%, 50%, 25%, 45%, 99%, 99.9%.] Trade-off is not monotonic

10 Resources » UN-ECE (2007), Managing Statistical Confidentiality & Microdata Access: http://www.unece.org/stats/documents/tfcm.htm » IHSN Tools & Guidelines, anonymization: www.surveynetwork.org » Eurostat (1999)

11 UN-ECE (2007) www.unece.org/stats/documents/tfcm.htm

12 IHSN www.surveynetwork.org

13 IHSN www.surveynetwork.org

14 IHSN www.surveynetwork.org
1. Remove variables
» Identifiers: name, address, low-level administrative geography
» Sensitive: tribe, disability
2. Global recoding
» Aggregate classes: age (5 yr groups), country of birth (continent), administrative geography, occupation (4 digit → 3), etc.
» Top and bottom coding (continuous variables-- income, size of residence, number of rooms, etc.)
3. Local suppression--sparse categories (population n < 250…2,500)
4. Data swapping (household geography)
5. Complex perturbations
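The first three steps on this slide (removing variables, global recoding with top and bottom coding, and local suppression) can be sketched in a few lines. This is a minimal illustration, not the IHSN tooling: the field names, the continent lookup, the income bounds and the suppression threshold are all hypothetical, and suppression here counts records in the file as a stand-in for the population counts the slide refers to.

```python
from collections import Counter

# Hypothetical toy records; field names are illustrative only.
records = [
    {"name": "A", "address": "x", "age": 34, "birth_country": "Peru",
     "occupation": "2341", "income": 52_000, "district": "D1"},
    {"name": "B", "address": "y", "age": 67, "birth_country": "Chile",
     "occupation": "6111", "income": 250_000, "district": "D1"},
    {"name": "C", "address": "z", "age": 5, "birth_country": "Kenya",
     "occupation": "9211", "income": -500, "district": "D2"},
]

IDENTIFIERS = {"name", "address"}  # step 1: direct identifiers to drop
CONTINENT = {"Peru": "Americas", "Chile": "Americas", "Kenya": "Africa"}


def remove_variables(rec, drop=IDENTIFIERS):
    """Step 1: drop direct identifiers."""
    return {k: v for k, v in rec.items() if k not in drop}


def age_group(age, width=5):
    """Recode single years of age into fixed-width groups, e.g. 34 -> '30-34'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def top_bottom_code(value, low, high):
    """Clamp a continuous value into [low, high]."""
    return max(low, min(high, value))


def global_recode(rec, income_floor=0, income_cap=100_000):
    """Step 2: aggregate classes and top/bottom code continuous variables."""
    out = dict(rec)
    out["age"] = age_group(rec["age"])
    out["birth_country"] = CONTINENT.get(rec["birth_country"], "Other")
    out["occupation"] = rec["occupation"][:3]  # 4-digit code -> 3 digits
    out["income"] = top_bottom_code(rec["income"], income_floor, income_cap)
    return out


def local_suppression(recs, field, min_count):
    """Step 3: blank out values of `field` that occur fewer than min_count times."""
    counts = Counter(r[field] for r in recs)
    return [
        {**r, field: r[field] if counts[r[field]] >= min_count else None}
        for r in recs
    ]


def anonymize(recs, min_count=2):
    step1 = [remove_variables(r) for r in recs]
    step2 = [global_recode(r) for r in step1]
    return local_suppression(step2, "district", min_count)


anonymized = anonymize(records)
```

Data swapping and complex perturbations (steps 4 and 5) are left out because they depend on design choices the slide does not specify.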

15 EUROSTAT statistical confidentiality standards (Thorogood, 1999) --all endorsed by IPUMS-International » 1. Restrict access to samples » 2. Limit geographical detail » 3. Re-code unique categories--top and bottom » 4. Sign non-disclosure agreement » 5. Prohibit redistribution to third parties » 6. Prohibit attempts to identify individuals or making any claim to that effect » 7. Require users to provide copies of publications

16 EUROSTAT statistical confidentiality standards (Thorogood, 1999) --all endorsed by IPUMS-International » 8. Construct age from birthdate, if necessary » 9. Do not identify date of birth » 10. Do not identify precise place of birth » 11. Migration: timing/place not identified in detail » 12. Identify place of residence by major civil division (pop>20k, 60k, 100k, 1 million, i.e., national convention) » 13. Do sensitivity analysis » 14. Do confidentiality assessment (not yet)
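Standards 8, 9 and 12 lend themselves to a short sketch: derive age from the birthdate, drop the birthdate itself, and release the place of residence only when its population exceeds a threshold. The function and field names, the “Other” label and the default threshold of 20,000 are assumptions made for illustration; the actual threshold follows national convention, as the slide notes.

```python
from datetime import date


def age_at_census(birthdate: date, census_day: date) -> int:
    """Completed years of age on census day (standard 8)."""
    years = census_day.year - birthdate.year
    # Subtract one year if the birthday has not yet occurred in the census year.
    if (census_day.month, census_day.day) < (birthdate.month, birthdate.day):
        years -= 1
    return years


def releasable_place(place: str, population: int, threshold: int = 20_000) -> str:
    """Keep the place name only if its population exceeds the threshold (standard 12)."""
    return place if population > threshold else "Other"


def confidentialize(rec: dict, census_day: date, threshold: int = 20_000) -> dict:
    """Replace birthdate with age and coarsen residence; the birthdate is not kept (standard 9)."""
    return {
        "age": age_at_census(rec["birthdate"], census_day),
        "residence": releasable_place(rec["residence"], rec["residence_pop"], threshold),
    }


# Hypothetical record and census date.
person = {"birthdate": date(1970, 6, 15), "residence": "Smalltown", "residence_pop": 5_000}
released = confidentialize(person, census_day=date(2000, 6, 14))
```

The released record carries only the derived age and the coarsened residence, so neither the date of birth nor a small locality can be read back from it.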

17 “There has been no known attempt at identification with the 1991 SARs [microdata samples of the UK]- nor in any other countries that disseminate samples of microdata” --Elliott and Dale, Journal of the Royal Statistical Society, 1999 Countering Fear, Hysteria and Paranoia…with reason

18 ChoicePoint Data Sources and Clients. Source: Washington Post http://www.choicepoint.com/ Why Not? Companies want linkable data with names, addresses, ID #s, etc. * * * Probabilistic linking with 90% of the population missing is not good enough

19 To play “pizza” video: http://www.aclu.org/pizza/

20

21 “There has been no known attempt at identification with the 1991 SARs [microdata samples of the UK]- nor in any other countries that disseminate samples of microdata” --Elliott and Dale, Journal of the Royal Statistical Society, 1999 Countering Fear, Hysteria and Paranoia…with reason

22 Please allow me to invite you to think about producing (or permitting IPUMS to produce) anonymized, integrated samples for all the censuses of your country for which microdata survive… Thank you * * * Contact: rmccaa@umn.edu. This ppt is available at: www.hist.umn.edu/~rmccaa/ipums-global (see “Port of Spain workshop”)

