Disclosure Avoidance: An Overview Irene Wong ACCOLEDS/DLI Training December 8, 2003.


Note: The following slides were prepared in conjunction with the ACCOLEDS/DLI Training presentations at the University of Calgary (Alberta) on December 8, 2003, and are not intended for use as documentation of disclosure risk control and practices. For more information about the slides, please contact the author at

Presentation Outline
- Overview of data confidentiality
- Different types of disclosure and output
- Some examples
- Facing the challenge

Why is keeping data confidential so important?
- Retain and respect public trust
  - Most household/population surveys do not have mandatory participation
  - Respondents volunteer their time and information
  - Respondents trust Statistics Canada to ensure their privacy and the confidentiality of their information
  - Trust is needed to ensure future data collection
  - Statistics Act: judiciously guarding respondents’ confidential information

Types of data
- Aggregated data vs. microdata
  - Dictates the data release method
- Enterprise data vs. household data
  - Mandatory vs. voluntary participation
- Administrative data and census vs. sample survey
  - Different degrees of disclosure risk

Confidentiality and Disclosure
- Under the Statistics Act, Statistics Canada must protect the confidentiality of respondents’ data and identity.
- Disclosure relates to the inappropriate attribution of information to a data subject, whether the subject is an individual or an organization.

So what’s the problem?
- Direct identifiers (name, address, health number, etc.) uniquely identify a respondent. These are all stripped from released data files.
- Indirect identifiers are variables such as age, marital status, occupation, ethnicity, postal code, type of business, etc. When combined, they could be used to identify a respondent.
- Sensitive variables refer to information or characteristics relating to a respondent’s private life or business which are usually unknown to others (income, illness, behaviour, etc.).

The concern is…
- Combining indirect identifiers with sensitive variables poses a disclosure risk, but…
- It is usually what researchers like to do
  - to relate specific characteristics of some response groups to some specific activities/characteristics
  - and to explain how/why they are related
- Control methods: restricted access, data reduction, disclosure analysis …

Controls on microdata release
- Restricted access
  - Licence and data sharing agreement
  - Strictly control record linkage (direct identifier)
  - Survey data access restricted within the organization
    - Employee access granted on a “need to know” basis only
  - Analytical (confidential) database with direct identifiers removed
    - Direct access: authorized employee/deemed employee only
    - Indirect data access (Remote Access services/Remote Data Access services): screening
- Data reduction, e.g. PUMF

Public Use Microdata File (PUMF)
- Files of anonymous individual records
- Created for research purposes
- Follows Statistics Canada’s Policy on Microdata Release
- Expect some forms of data reduction and suppression
- Expect suppression of sample design information (cluster, stratification, etc.)

PUMF disclosure risk control
- Suppress some indirect identifiers (e.g. small geographical code, race details, etc.)
- Avoid unique combinations of indirect identifiers that can disclose a response unit (such as gender, age, occupation, chronic conditions, religion, etc.)
- Perform univariate analyses and look for outliers
- Sometimes maximum/minimum values are capped
- And more…
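Two of the controls above, checking for unique combinations of indirect identifiers and capping extreme values, can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the records and the cap of 150,000 are invented for the example and are not Statistics Canada's actual procedure or thresholds.

```python
from collections import Counter

def unique_combinations(records, keys):
    """Return the combinations of indirect identifiers that occur exactly once."""
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return [combo for combo, n in counts.items() if n == 1]

def top_code(values, cap):
    """Cap every value above `cap` at `cap` (hypothetical cap, for illustration)."""
    return [min(v, cap) for v in values]

# Invented records; a real file would hold many more variables and rows.
records = [
    {"sex": "F", "age_group": "30-39", "occupation": "teacher"},
    {"sex": "F", "age_group": "30-39", "occupation": "teacher"},
    {"sex": "M", "age_group": "80+", "occupation": "pilot"},
]

risky = unique_combinations(records, ["sex", "age_group", "occupation"])
print(risky)                                          # combinations held by one record
print(top_code([25_000, 48_000, 579_789], cap=150_000))
```

A combination flagged here would be a candidate for suppression or for coarser coding (for example, wider age groups) before release.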

Protection of confidential data
- Physical protection of the data storage area
- Protection of the computer systems
- Enforcement of data releasers’ and users’ responsibilities to protect respondent confidentiality
- Disclosure analysis on output that leaves the restricted data storage area

Identity Disclosure
- Identity disclosure: when a respondent can be identified from the released data.
  - Combine identifier with sensitive variables
- Examples:
  - Spontaneous recognition of a well-known characteristic by others (e.g. from a small sample)
  - Self-disclosure (e.g. a respondent self-identifies when complaining to the media about a privacy violation)

Attribute Disclosure
- Attribute disclosure: when confidential information is revealed and can be attributed to an individual or a group.
  - For example, all persons with characteristic x have characteristic y
- Examples:
  - People in occupation W make $50,000 to $60,000/year…
  - 100% of the respondents of age W in area X reported that they experimented with …

Residual Disclosure
- Residual disclosure: when confidential information is disclosed by combining previously released output and information.
- Extra care is needed where the risk of residual disclosure is high, such as
  - Subsequent cycles of longitudinal data files (e.g. NLSCY, NPHS, etc.)
  - Samples from dependent surveys (e.g. SLID and LFS)
  - Research projects using the same data file
  - Overlapping small geographical areas (e.g. Health Region and Economic Region)

Types of outputs
- Analytic studies (e.g. inferential statistics/model output)
  - Model parameters such as regression coefficients, etc.
  - Hypothesis test results such as p-values, t-statistics, etc.
- Descriptive studies (e.g. table output)
  - Frequencies, percentiles, cross-tabulations, standard errors, correlation matrices, etc.

To lower disclosure risk
General rules we follow for household sample surveys:
- Do not report statistics or table cells based on a small number of respondents (e.g. fewer than 5 respondents)
- No anecdotal information may be given about specific respondents
- ‘Zero’ and ‘Full’ cell restriction
- Minimum and maximum value restriction
- Saturated models and covariance/correlation matrices are treated like their underlying tables
- And more…

Some examples…

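Low frequency cells
- (F, 0) is a low frequency cell. Solution?
  - Collapse columns ‘M’ and ‘F’ into column ‘total’
  - Collapse rows ‘1’ and ‘0’ into row ‘total’
  - Report either column ‘M’ or row ‘1’, but not together with the ‘total’
[Table: rows ‘1’, ‘0’ and total by columns M, F and total. The cell values were lost in extraction; the surviving column totals appear to be M = 49, F = 16, total = 65.]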
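The low-frequency check and the "collapse" remedy can be sketched as follows. This is a minimal illustration, assuming a table stored as nested dictionaries and the threshold of 5 quoted in the general rules; the counts are invented because the slide's own cell values are not available, and zero cells are left to the separate ‘Zero’ and ‘Full’ cell rule.

```python
MIN_CELL = 5  # "fewer than 5 respondents", as quoted in the general rules

def low_frequency_cells(table, min_cell=MIN_CELL):
    """Return (row, column) labels of non-empty cells with fewer than min_cell respondents."""
    return [(r, c) for r, cols in table.items()
            for c, n in cols.items() if 0 < n < min_cell]

def collapse_columns(table):
    """Replace the individual columns of each row by a single 'total' column."""
    return {r: {"total": sum(cols.values())} for r, cols in table.items()}

# Invented counts for a 2 x 2 table (rows '1' and '0', columns M and F).
table = {"1": {"M": 20, "F": 14}, "0": {"M": 29, "F": 2}}

print(low_frequency_cells(table))      # the (0, F) cell is below the threshold
collapsed = collapse_columns(table)
print(low_frequency_cells(collapsed))  # no small cells remain after collapsing
```

Collapsing trades detail for safety: the sex breakdown disappears, but no reported cell rests on fewer than 5 respondents.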

Frequency distributions
- Frequency curve, e.g. the user wishes to release the value of the observation at the 99th percentile.
- If fewer than 5 respondents are above the 99th percentile, there is a problem.* One solution is to describe the distribution using the 95th percentile.
- * If the survey is multilevel (NLSCY), then the 5 or more respondents from level 1 (child) must come from at least 3 different units from level 2 (household), e.g. child 1: family 1; child 2: family 1; child 3: family 2; child 4: family 2; child 5: family 3; …
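The percentile rule, including the multilevel condition, can be written as a small check. This is a sketch under assumptions: the function name and the data are invented, the cutoff value is supplied by the caller rather than computed, and "above" is taken to mean strictly greater than the cutoff.

```python
def tail_is_releasable(records, cutoff, min_respondents=5, min_units=3):
    """Check the respondents strictly above `cutoff`.

    `records` is a list of (value, level_2_unit) pairs, e.g. (score, family id).
    The tail passes only if it holds at least `min_respondents` respondents
    drawn from at least `min_units` different level-2 units.
    """
    above = [(value, unit) for value, unit in records if value > cutoff]
    distinct_units = {unit for _, unit in above}
    return len(above) >= min_respondents and len(distinct_units) >= min_units

# Invented data: five children above the cutoff, spread over three families.
three_families = [(91, "family 1"), (92, "family 1"), (93, "family 2"),
                  (94, "family 2"), (95, "family 3"), (40, "family 4")]
# Five children above the cutoff, but from only two families.
two_families = [(91, "family 1"), (92, "family 1"), (93, "family 1"),
                (94, "family 2"), (95, "family 2")]

print(tail_is_releasable(three_families, cutoff=90))
print(tail_is_releasable(two_families, cutoff=90))
```

If the check fails at the 99th percentile, the same function can be rerun with the 95th percentile value as the cutoff, which is the alternative the slide suggests.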

‘Zero’ and ‘Full’ cell
- (F, 1) is a full cell
- (F, 0) is a non-structural zero cell
  - Both could pose a confidentiality problem
- (Married, age <12) is a structural zero cell
  - Not a data confidentiality problem
  - No one is expected to be in this category
[Tables: one with columns M, F and total; one of age group (including <12) by married, single and total. The cell values were lost in extraction.]
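The three kinds of cell can be told apart mechanically once the structural zeros are listed by hand. This is a minimal sketch with assumptions: a cell is called "full" here when it holds every respondent in its column, the structural zeros are supplied by the analyst, and all counts are invented.

```python
def classify_cells(table, structural_zeros=frozenset()):
    """Label each (row, column) cell as 'structural zero', 'zero', 'full' or 'ok'.

    A cell is 'full' when it contains the whole column total and at least one
    other cell in that column could logically have held respondents.
    """
    labels = {}
    rows = list(table)
    for c in table[rows[0]]:
        eligible = [r for r in rows if (r, c) not in structural_zeros]
        col_total = sum(table[r][c] for r in eligible)
        for r in rows:
            n = table[r][c]
            if (r, c) in structural_zeros:
                labels[(r, c)] = "structural zero"
            elif n == 0:
                labels[(r, c)] = "zero"
            elif n == col_total and len(eligible) > 1:
                labels[(r, c)] = "full"
            else:
                labels[(r, c)] = "ok"
    return labels

# Invented counts: every F respondent falls in row '1'.
sex_table = {"1": {"M": 20, "F": 16}, "0": {"M": 29, "F": 0}}
# Invented counts: nobody under 12 can be married, so that zero is structural.
age_table = {"<12": {"married": 0, "single": 40},
             "12+": {"married": 30, "single": 25}}

print(classify_cells(sex_table))
print(classify_cells(age_table, structural_zeros={("<12", "married")}))
```

Only the 'zero' and 'full' labels call for action; a structural zero reveals nothing that was not already known.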

Implied tables - residual disclosure
- Implied tables are tables produced by subtracting the results of one or more published tables from another published table
- In this example, ‘non-married’ individuals can easily be calculated
[Tables: Yes/No counts for “Select if Married = 1” and for “Select all cases”. The cell values were lost in extraction.]
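The subtraction that produces an implied table is simple arithmetic, which is what makes residual disclosure easy to miss. The sketch below uses invented counts and a hypothetical helper name; it shows that two tables that each pass the minimum cell size can still imply a third table that does not.

```python
def implied_table(all_cases, subset):
    """Subtract a published subset table from the published table for all cases."""
    return {key: all_cases[key] - subset.get(key, 0) for key in all_cases}

# Invented published counts of a Yes/No variable.
all_cases = {"Yes": 40, "No": 25}   # table run on all cases
married = {"Yes": 37, "No": 10}     # table run with "Select if Married = 1"

non_married = implied_table(all_cases, married)
small_cells = [key for key, n in non_married.items() if n < 5]

print(non_married)   # counts for the non-married group, never published directly
print(small_cells)   # implied cells that fall below the threshold of 5
```

Screening each requested output against earlier releases from the same file is therefore part of the disclosure analysis, not an optional extra.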

When reporting information…
Writing a report is no different from working with table output. Avoid statements such as:
- “… reported incomes ranging from $2,498 to $579,789.”
  - If necessary, give general indications (e.g. “no income was above $600,000”).
- “… all respondents of age 16 reported experimenting with drugs.”
  - This is equivalent to a full cell situation.

Related Outputs
- If a PUMF and analytical outputs based on confidential data are released for the same survey, the published results should not disclose sensitive information about individual respondents that was suppressed in the PUMF.
- That is, it should not be possible to infer from the reported results any information that allows a PUMF respondent to be identified.

Facing Challenges
- No single control of all the releases
  - Remote Access, PUMFs, RDCs, survey data publications, etc.
- Potential residual disclosure
- Can residual disclosure be totally accounted for?
- Can it be better controlled?

What RDCs are doing now…
- Educate data users to
  - Take precautions when dealing with confidential information
  - Recognize disclosure risk
  - Make use of alternative reporting and complementary suppression
  - Limit intermediary outputs

What else should we do?
- Match against other types of file releases to assess overall disclosure risk?
- Future data reduction in PUMFs and publications?
- Follow the American RDC approach?
- Different disclosure analysis approach for different data files?
- Stricter screening process?
- ……