Disclosure Avoidance at Statistics Canada
INFO747 Session on Confidentiality Protection
April 19, 2007
Jean-Louis Tambay, Statistics Canada



Outline
- Statistics Canada’s context
- Public Use Microdata Files
- Research Data Centres
- Disclosure vetting at RDCs
- Remote Access
- References

Providing access to microdata
- The Statistics Act
- Sections 11 & 12 data sharing agreements
- Discretionary release
- Use of “deemed employees”
- Public Use Microdata Files

- Anonymized microdata files for a sample of units – mostly household survey data
- Microdata Release Policy & Guidelines
- Need approval of Microdata Release Committee to release a PUMF
- Submissions must include data distributions, geographic level of detail, description of the weighting procedure and the methods to evaluate and decide on data to be presented

Preparation for PUMFs
- Suppress identifying variables
- Limit design & related information
  - Clusters (& households), strata, bootstrap weights
- Consider level of geographic detail
- Examine distribution of weights (low weights, geographical information implied by weights)
- Special analyses (relationships, multiplicity, Data Intrusion Simulation, linkages, ...)
- Data suppression and perturbation
- Longitudinal PUMFs have rarely been released!

Special analyses
- Multiplicity
  - Given a set of n indirect identifiers (ii), generate all 3-way tables involving 3 ii’s at a time
  - Multiplicity = # tables in which unit is unique
  - Analysis can be by sub-group (e.g., province)
- Data Intrusion Simulation (Elliot)
  - Probability that a unique match (um) to a microdata record is a correct match (cm)
  - P(cm|um) ≈ #uniques / [#uniques + 2 × #pairs × (weight − 1)]
  - Expanded to Poisson sampling by Skinner & Carter
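The two measures on this slide can be sketched in Python as below. This is a minimal illustration under stated assumptions, not Statistics Canada's implementation: the function names and the list-of-dicts record layout are hypothetical, "unique" is taken to mean a sample cell count of 1, and the DIS function simply transcribes the slide's approximation with a single common weight.

```python
from collections import Counter
from itertools import combinations


def multiplicity(records, identifiers):
    """Count, for each record, the 3-way tables in which it is unique.

    records: list of dicts, one per sample unit (hypothetical layout).
    identifiers: names of the n indirect identifiers.
    Assumption: a unit is "unique" when its cell has a sample count of 1.
    """
    scores = [0] * len(records)
    for triple in combinations(identifiers, 3):
        # Cell occupied by each record in the 3-way table for this triple.
        cells = [tuple(rec[name] for name in triple) for rec in records]
        counts = Counter(cells)
        for i, cell in enumerate(cells):
            if counts[cell] == 1:
                scores[i] += 1
    return scores


def dis_probability(n_uniques, n_pairs, weight):
    """Slide's approximation of P(cm|um), assuming one common weight."""
    denominator = n_uniques + 2 * n_pairs * (weight - 1)
    if denominator == 0:
        raise ValueError("no uniques and no pairs to compare")
    return n_uniques / denominator


# Toy data: the first two records are identical, the third is unique
# in all four 3-way tables that can be built from four identifiers.
sample = [
    {"sex": "F", "age": "30-39", "marital": "S", "occ": "A"},
    {"sex": "F", "age": "30-39", "marital": "S", "occ": "A"},
    {"sex": "M", "age": "60-69", "marital": "W", "occ": "B"},
]
print(multiplicity(sample, ["sex", "age", "marital", "occ"]))  # [0, 0, 4]
print(dis_probability(n_uniques=10, n_pairs=5, weight=100))    # 0.01
```

To run the analysis by sub-group, as the slide suggests, one could call `multiplicity` separately on the records of each province.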

Research Data Centres
- Initially created to provide researcher access to longitudinal surveys – now housing population & housing survey data
- Around 20 centres provide access to researchers in a secure university setting
- Always staffed by STC employees
- Accessible only to researchers with approved projects who have been sworn in as “deemed employees” under the Statistics Act
- All outputs are vetted before being released

Disclosure vetting at RDCs
- Two types of risks:
  - Produce results for identifiable respondents
  - Compromise confidentiality of PUMF data
- Since results are from sample surveys and are aggregated, risks are low
- BUT … many surveys release PUMFs – we do not want to risk compromising disclosure control methods used to protect PUMF data
- General rules implemented for all surveys – some surveys have additional rules

Disclosure vetting at RDCs
Potential problems associated with availability of PUMF data:
- Statistics based on few observations could be linked to individual respondents – risks increase if survey weights can help in linking (note: survey results based on few respondents are not reliable)
- Some distributional results provide information about extreme values (top-coded on PUMFs)
- Approximate location of sample units can be revealed – this affects more than one survey as many have sample in the same clusters

Disclosure vetting at RDCs
Key aspects:
- Results should use survey weights (justify need for unweighted results other than sample size indications)
- No unit-level results: apply 5-respondent minimum for frequencies & statistics (some surveys use 10) – use higher threshold if releasing weighted and unweighted tabular results
- Intermediary outputs increase the risk of residual disclosure and should be avoided
- Analytical and model outputs entail fewer risks than tabular outputs

Disclosure vetting at RDCs
Other rules:
- Careful about tables with full cells (i.e., only one nonzero cell in a row/column)
- 5-respondent min. applies to descriptive statistics; for medians & percentiles need at least 5 units at or above & at or below value
- No ranges, min. or max. for quantitative variables
- Model outputs are generally safe but:
  - saturated models with categorical covariates should be vetted as if tabular results
  - covariances/correlations involving dichotomous variables are releasable if results by value of dichotomous variable are releasable
  - no unit-level results (e.g., residuals, scatterplots)
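Three of the vetting rules above lend themselves to simple automated pre-checks. The sketch below is illustrative only: the function names are hypothetical, counts are assumed to be unweighted respondent counts, and a real vetting decision also depends on analyst judgment and survey-specific rules.

```python
def meets_minimum(n_respondents, threshold=5):
    """Respondent minimum for a frequency or statistic (some surveys use 10)."""
    return n_respondents >= threshold


def percentile_releasable(values, value, threshold=5):
    """Median/percentile rule: enough units at or above AND at or below."""
    at_or_above = sum(1 for v in values if v >= value)
    at_or_below = sum(1 for v in values if v <= value)
    return at_or_above >= threshold and at_or_below >= threshold


def has_full_cell(table):
    """True if some row or column has exactly one nonzero cell.

    table: list of equal-length rows of cell counts.
    """
    rows = [list(row) for row in table]
    columns = [list(col) for col in zip(*rows)]
    return any(
        sum(1 for cell in line if cell != 0) == 1
        for line in rows + columns
    )


incomes = list(range(1, 12))               # 11 toy values: 1..11
print(meets_minimum(4))                    # False
print(percentile_releasable(incomes, 6))   # True: 6 units on each side
print(percentile_releasable(incomes, 9))   # False: only 3 units at or above
print(has_full_cell([[12, 0], [7, 9]]))    # True: first row has one nonzero cell
```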

Disclosure vetting at RDCs
Special rules for detailed geographical results:
- Do not reveal sensitive information about the location of the sample or of sample units on a map, table, list or otherwise
- Round weighted frequencies to base 50
- Detailed geographical outputs for visible identifying characteristics, e.g., race or disability, should only be released if they do not pose a risk (full cell problem)
- Researchers who wish to release geographical contextual information must indicate how it relates to geographical areas – if some areas are clearly identified from the contextual information, the vetting rules should be applied at the level of those areas
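Rounding a weighted frequency to base 50 might look like the sketch below. The slide does not say whether the rounding is conventional or random, so this assumes conventional rounding to the nearest multiple with halves rounded up; the function name is hypothetical.

```python
import math


def round_to_base(weighted_frequency, base=50):
    """Round to the nearest multiple of `base` (assumed: halves round up)."""
    return int(base * math.floor(weighted_frequency / base + 0.5))


print(round_to_base(1234.6))  # 1250
print(round_to_base(1224))    # 1200
print(round_to_base(24.9))    # 0
```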

Disclosure vetting at RDCs
- Rules apply to household survey data at RDCs
- Plans to put census data and some admin data at RDCs
- Census rules will apply for census data. Additionally, geographical detail will stop at the census tract (or equivalent) level and intermediary outputs will not be allowed.
- Admin data put in feasibility study mode – rules to be developed

Rules for census data
- Random rounding for counts (usually base 5)
- Population thresholds for “standard” & custom geographies (40 & 100)
- Population & household thresholds for income characteristics (250 & 40)
- # same-sex common-law couples available for areas over 5,000 people
- For place of work data size limits are applied to the employed labour force
- Suppression of statistics if:
  - $ values of units in cell are in a narrow range;
  - <4 records used in calculation;
  - sum of weights <10; or
  - presence of outliers
- Otherwise totals for quantitative statistics obtained by multiplying average with rounded weighted frequency
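The rounding and suppression rules above can be sketched as follows. Several details are assumptions rather than the census procedure: the rounding shown is the common unbiased scheme (round up with probability remainder/base), the function names are hypothetical, and only the two numeric suppression criteria are implemented because the slide does not quantify "narrow range" or "outliers".

```python
import random


def random_round(count, base=5, rng=random):
    """Randomly round to an adjacent multiple of `base`.

    Assumed scheme: round up with probability remainder/base, so the
    expected value of the rounded count equals the original count.
    """
    remainder = count % base
    if remainder == 0:
        return count
    lower = count - remainder
    return lower + base if rng.random() < remainder / base else lower


def suppress_statistic(n_records, sum_of_weights, min_records=4, min_weight=10):
    """Apply only the two numeric suppression criteria from the slide."""
    return n_records < min_records or sum_of_weights < min_weight


def released_total(average, weighted_frequency, base=5, rng=random):
    """Total = average multiplied by the rounded weighted frequency."""
    return average * random_round(weighted_frequency, base, rng)


rng = random.Random(2007)           # seeded only to make the demo repeatable
print(random_round(13, rng=rng))    # either 10 or 15
print(suppress_statistic(3, 50))    # True: fewer than 4 records
print(suppress_statistic(4, 10))    # False: both thresholds met
```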

Remote Access
- Indirect access provided since the 1990s
- Researchers obtain survey & datafile documentation and “dummy” test data
  - Note: Test files created from survey data need approval of Microdata Release Committee
- SAS/SPSS/Stata programs submitted by e-mail, results e-mailed back after manual vetting for confidentiality
- Popular for some surveys (e.g., health)
- Disclosure issues similar to RDCs

References
- Elliot, M.J. (2000). Data Intrusion Simulation: Advances and a Vision for the Future of Disclosure Control. Presented at the Joint ECE/Eurostat Work Session on Statistical Data Confidentiality. Skopje, March 14-16,
- Mayda, J.E., Mohl, C. and Tambay, J.L. (1996). Variance Estimation and Confidentiality: They Are Related! Proceedings of the Survey Methods Section, SSC Annual Meeting. June,
- Skinner, C.J. and Carter, R.G. (2003). Estimation of a Measure of Disclosure Risk for Survey Microdata under Unequal Probability Sampling. Survey Methodology. 29,
- Statistics Canada (2005). Guide for Researchers under Agreement with Statistics Canada. October,
- Tambay, J.L., Goldmann, G. and White, P. (2001). Providing Greater Access to Survey Data for Analysis at Statistics Canada. Proceedings of the Annual Meeting of the American Statistical Association, August 5-9, 2001.