INFO 7470 Session 12 Statistical Tools: Methods of Confidentiality Protection John M. Abowd and Lars Vilhuber April 25, 2016.


Outline Part I Most of today’s lecture is based on Abowd, John M. and Ian Schmutte, “Economic Analysis and Statistical Disclosure Limitation,” Brookings Papers on Economic Activity (Spring 2015). Includes discussion. [free download] (Brookings does not use DOIs.) The online appendix is in the same place. [free download] A curated URL is maintained by the Labor Dynamics Institute, Cornell. April 25, 2016 © John M. Abowd and Lars Vilhuber 2016, all rights reserved 2

First, a Live Demo! The demo will only be live at Cornell and Census; other sites have too few enrollees for the exercise to work properly. The local instructor is now distributing a sealed envelope to each person in the class. CHOOSE YOUR OWN ENVELOPE. DO NOT LET THE INSTRUCTOR CHOOSE. DO NOT OPEN THE ENVELOPE! April 25, 2016 © John M. Abowd and Lars Vilhuber 2016, all rights reserved 3

Now, Let’s Analyze the Data April 25, 2016 © John M. Abowd and Lars Vilhuber 2016, all rights reserved 4

The Basic Economics Scientific data quality is a pure public good. Quantifiable privacy protection is also a pure public good (a public bad, when measured as “loss”) when supplied using the methods I will discuss shortly. Computer scientists have succeeded in providing feasible technology sets relating the two public goods: data quality and privacy protection. These technology sets generate a quantifiable production possibilities frontier between data quality and privacy protection. April 25, 2016 © John M. Abowd and Lars Vilhuber 2016, all rights reserved 5

The Basic Economics II We can now estimate the marginal social cost of data quality as a function of privacy protection—a big step forward. The CS models are silent (or, occasionally, just wrong) about how to choose a socially optimal location on the PPF because they ignore social preferences. To solve the social choice problem, we need to understand how to quantify preferences for data quality vs. privacy protection. For this we use the Marginal Social Cost of data quality and the Marginal Social Benefit of data quality, both measured in terms of the required privacy loss. April 25, 2016 © John M. Abowd and Lars Vilhuber 2016, all rights reserved 6
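
A minimal formalization of the social-choice condition sketched above, assuming a differentiable PPF G(I, ε) = 0 between data quality I and privacy loss ε and a social welfare function W; the notation is illustrative, not taken from the slides:

```latex
% I = data quality (public good); epsilon = privacy loss (public bad).
% G(I, epsilon) = 0 is the production possibilities frontier supplied by the
% formal privacy technology; W(I, epsilon) is social welfare.
\[
  \max_{I,\,\varepsilon} \; W(I,\varepsilon)
  \quad \text{subject to} \quad G(I,\varepsilon) = 0
\]
% First-order condition: the marginal social benefit of data quality, measured
% in units of privacy loss, equals its marginal social cost along the frontier.
\[
  \underbrace{-\frac{\partial W/\partial I}{\partial W/\partial \varepsilon}}_{\text{MSB of data quality}}
  \;=\;
  \underbrace{-\frac{\partial G/\partial I}{\partial G/\partial \varepsilon}}_{\text{MSC of data quality}}
\]
```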

Now, Back to the Live Example April 25, 2016 © John M. Abowd and Lars Vilhuber 2016, all rights reserved 7

Four Principles of Ideal Data Publication - Privacy Protection Systems To the maximum extent possible, scientific analysis should be performed on the original confidential data. Publication of statistical results should respect a quantifiable privacy-loss budget constraint. Data publication algorithms should provably compose. Data publication algorithms should be provably robust to arbitrary ancillary information. April 25, 2016 © John M. Abowd and Lars Vilhuber 2016, all rights reserved 8

Outline Why must users of restricted-access data learn about confidentiality protection? What is statistical disclosure limitation? What are formal privacy methods? Examples of SDL and formal privacy methods – Traditional SDL – Differential Privacy – The Inferential Disclosure Link April 25, 2016 © John M. Abowd and Lars Vilhuber 2016, all rights reserved 9

Why Are We Covering This? The vast majority of data users have no exposure to the SDL techniques applied to the data they use. The tradition in SDL is to protect the details of what was done as part of the protocol – Here’s the complete description for the American Community Survey public-use micro sample: …tation/pums_confidentiality/ April 25, 2016 © John M. Abowd and Lars Vilhuber 2016, all rights reserved 10

Explosion of Research in Computer Science Differential privacy, developed by Cynthia Dwork and many collaborators, fundamentally changed the nature of the discussion. Generalizations of Dwork’s approach are called “formal privacy” models. The standards of modern cryptography apply: – An algorithm only provides protection if it can survive an attack by anyone armed with all the details of the protection algorithm except the actual random numbers, if any, used in the protection. This is where the four principles above originate. April 25, 2016 © John M. Abowd and Lars Vilhuber 2016, all rights reserved 11

Restricted-access Data Users Must normally subject their analyses to statistical disclosure limitation, including limitation by differential privacy methods. It is extremely important to understand what this means for the quality of the released research and its replicability. April 25, 2016 © John M. Abowd and Lars Vilhuber 2016, all rights reserved 12

Statistical Disclosure Limitation Protection of the confidentiality of the underlying micro-data – Avoiding identity disclosure – Avoiding attribute disclosure – Avoiding inferential disclosure Identity disclosure: who (or what entity) is in the confidential micro-data Attribute disclosure: value of a characteristic for that entity or individual Inferential disclosure: improvement of the posterior odds of a particular event (identity or attribute) Reference: Duncan, George T., Mark Elliot, and Juan-José Salazar-González (2011) Statistical Confidentiality: Principles and Practice, New York: Springer. April 25, 2016 © John M. Abowd and Lars Vilhuber 2016, all rights reserved 13

Privacy-preserving Datamining and Differential Privacy Formally define the properties of “privacy.” Introduce algorithmic uncertainty as part of the statistical process. Prove that the algorithmic uncertainty meets the formal definition of privacy. Differential privacy defines protection in terms of making the released information about an entity as close as possible to being independent of whether or not that entity’s data are included in the tabulation data file. Reference: Dwork, Cynthia, and Aaron Roth (2014) “The Algorithmic Foundations of Differential Privacy,” Foundations and Trends in Theoretical Computer Science 9, nos. 3–4: 211–407. [free download] April 25, 2016 © John M. Abowd and Lars Vilhuber 2016, all rights reserved 14
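
As a concrete illustration of introducing algorithmic uncertainty, here is a minimal sketch of the standard Laplace mechanism for a count query; the function name, seedless generator, and parameter values are illustrative and not part of any agency system described on the slides:

```python
import numpy as np

def laplace_count(true_count, epsilon, rng=np.random.default_rng()):
    """Release a count with epsilon-differential privacy.

    A count query has sensitivity 1: adding or removing one entity changes
    the count by at most 1, so Laplace noise with scale 1/epsilon suffices.
    """
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Example: the released value for a cell with 42 respondents at epsilon = 0.5.
print(laplace_count(42, epsilon=0.5))
```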

Point of Commonality: Inferential disclosure April 25, 2016 © John M. Abowd and Lars Vilhuber 2016, all rights reserved 15

Point of Commonality: Inferential disclosure April 25, 2016 © John M. Abowd and Lars Vilhuber 2016, all rights reserved 16

The Impossibility Theorem ε-differential privacy implies a bound on the maximal inferential disclosure from a publication Z based on confidential data Y. For all neighboring databases Y and Y* with sup|Y − Y*| ≤ 1, where the supremum is taken over the rows of Y and Z, ε-differential privacy requires Pr[Z | Y] ≤ e^ε Pr[Z | Y*] for every possible publication Z, so the posterior odds of any event about Y after observing Z can exceed the prior odds by at most a factor of e^ε. Therefore, there can be no informative data publication without some inferential disclosure. April 25, 2016 © John M. Abowd and Lars Vilhuber 2016, all rights reserved 17
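
A compact way to see the inferential-disclosure bound stated above (a sketch in standard differential-privacy notation; the intermediate Bayes-rule step is not spelled out on the slide):

```latex
% epsilon-differential privacy: for all neighboring confidential databases
% Y, Y* (differing in one row) and all possible publications Z,
\[
  \Pr[Z \mid Y] \;\le\; e^{\varepsilon}\, \Pr[Z \mid Y^{*}].
\]
% By Bayes' rule, the posterior odds of Y versus Y* after observing Z are
% bounded relative to the prior odds:
\[
  \frac{\Pr[Y \mid Z]}{\Pr[Y^{*} \mid Z]}
  \;=\;
  \frac{\Pr[Z \mid Y]}{\Pr[Z \mid Y^{*}]}\cdot\frac{\Pr[Y]}{\Pr[Y^{*}]}
  \;\le\;
  e^{\varepsilon}\,\frac{\Pr[Y]}{\Pr[Y^{*}]}.
\]
% With epsilon = 0 the posterior odds equal the prior odds and the publication
% is uninformative; any epsilon > 0 permits some inferential disclosure.
```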

Discussion Many of the details have been left out. Notice that when ε = 0, the posterior odds are equal to the prior odds, and the publication contains nothing informative. This concept is called semantic security in the cryptography literature. Reference: Goldwasser, S. and Micali, S. (1982) “Probabilistic encryption & how to play mental poker keeping secret all partial information,” Proceedings of the Fourteenth Annual ACM Symposium on Theory of Computing, ACM, pp. 365–377. April 25, 2016 © John M. Abowd and Lars Vilhuber 2016, all rights reserved 18

Now, Back to the Live Example April 25, 2016 © John M. Abowd and Lars Vilhuber 2016, all rights reserved 19

General Methods for Statistical Disclosure Limitation At the Census Bureau, SDL is called Disclosure Avoidance Review. Traditional methods – Suppression – Coarsening – Adding noise via swapping – Adding noise via sampling. Newer methods – Explicit noise infusion – Synthetic data – Formal privacy-preserving sanitizers. April 25, 2016 © John M. Abowd and Lars Vilhuber 2016, all rights reserved 20

Suppression This is by far the most common technique. Model the sensitivity of a particular data item or observation (“disclosure risk”). Do not allow the release of data items that have excessive disclosure risk (primary suppression). Do not allow the release of other data from which the sensitive item can be calculated (complementary suppression). April 25, 2016 © John M. Abowd and Lars Vilhuber 2016, all rights reserved 21
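
A minimal sketch of primary and complementary suppression on a small two-way frequency table; the threshold rule, iteration scheme, and example data are illustrative assumptions, not the Census Bureau's actual rules:

```python
import numpy as np

def suppress(table, threshold=3):
    """Return a boolean suppression mask for a 2-D frequency table.

    Primary suppression: any nonzero cell below `threshold` is marked.
    Complementary suppression: whenever a row or column contains exactly one
    suppressed cell, its value could be recovered from the published margin,
    so the smallest remaining cell in that row or column is suppressed too.
    (Production systems solve a much harder optimization problem; this only
    illustrates the idea.)
    """
    suppressed = (table > 0) & (table < threshold)                 # primary
    changed = True
    while changed:                                                 # complementary
        changed = False
        for axis in (0, 1):
            for j in np.where(suppressed.sum(axis=axis) == 1)[0]:
                vals = table[:, j] if axis == 0 else table[j, :]
                mask = suppressed[:, j] if axis == 0 else suppressed[j, :]
                candidates = np.where(~mask)[0]
                if candidates.size:
                    k = candidates[np.argmin(vals[candidates])]
                    if axis == 0:
                        suppressed[k, j] = True
                    else:
                        suppressed[j, k] = True
                    changed = True
    return suppressed

table = np.array([[12, 2, 30], [8, 15, 7], [25, 9, 11]])
print(np.where(suppress(table), "D", table.astype(str)))  # "D" marks suppressed cells
```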

Suppression in Model-based Releases Most data analysis done in the RDCs is model-based. The released data consist of summary statistics, model coefficients, standard errors, and some diagnostic statistics. The SDL technique used for these releases is usually suppression: the suppression rules are contained (up to confidential parameters) in the RDC Researcher’s Handbook. April 25, 2016 © John M. Abowd and Lars Vilhuber 2016, all rights reserved 22

Coarsening Coarsening is the creation of a smaller number of categories from a variable in order to increase the number of cases in each cell. Computer scientists call this “generalizing.” Geographic coarsening follows the hierarchy block, block group, tract, minor civil division, county, state, region. Top coding of income is a form of coarsening. All continuous variables in a micro-data file can be considered coarsened to the level of precision (significant digits) released. This method is often applied to model-based data releases by restricting the number of significant digits that can be released. April 25, 2016 © John M. Abowd and Lars Vilhuber 2016, all rights reserved 23
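
A minimal sketch of two coarsening operations mentioned above, applied to a hypothetical microdata column; the bin boundaries and two-significant-digit rule are illustrative assumptions:

```python
import numpy as np

incomes = np.array([18_250, 47_900, 52_340, 133_700, 980_450])

# Coarsen into categories: fewer, wider income bins increase cell counts.
bins = [0, 25_000, 50_000, 100_000, 250_000, np.inf]
labels = ["<25k", "25-50k", "50-100k", "100-250k", "250k+"]
categories = [labels[i - 1] for i in np.digitize(incomes, bins)]

# Coarsen by precision: round to two significant digits before release.
two_sig_digits = np.array([float(f"{float(x):.2g}") for x in incomes])

print(categories)      # ['<25k', '25-50k', '50-100k', '100-250k', '250k+']
print(two_sig_digits)  # [ 18000.  48000.  52000. 130000. 980000.]
```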

Swapping Estimate the disclosure risk of certain attributes or individuals. If the risk is too great, attributes of one data record are (randomly) swapped with the same attributes of another record. If geographic attributes are swapped, this has the effect of placing the risky attributes in a different location from the truth. Commonly used in household censuses and surveys; rarely used with establishment data. April 25, 2016 © John M. Abowd and Lars Vilhuber 2016, all rights reserved 24
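
A minimal sketch of record swapping for a geographic attribute; the risk rule, data, and partner-selection logic are illustrative assumptions, not any agency's production method:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=12345)

df = pd.DataFrame({
    "county": ["A", "A", "B", "B", "C", "C"],
    "age":    [34, 71, 29, 88, 45, 52],
    "income": [52_000, 18_000, 64_000, 22_000, 71_000, 39_000],
})

# Flag "risky" records with a toy rule: age above 80 is treated as identifying.
risky = df.index[df["age"] > 80]

# Swap the geographic attribute of each risky record with that of a randomly
# chosen partner record, leaving all other attributes in place.
for i in risky:
    partner = rng.choice(df.index[df.index != i])
    df.loc[i, "county"], df.loc[partner, "county"] = (
        df.loc[partner, "county"], df.loc[i, "county"])

print(df)
```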

Sampling Sampling is the original SDL technique By only selecting certain entities from the population on which to collect additional data (data not on the frame), uncertainty about which entity was sampled provides some protection In modern, detailed surveys, sampling is of limited use for SDL April 25, 2016 © John M. Abowd and Lars Vilhuber 2016, all rights reserved 25

Rules and Methods for Model-based SDL Refer to Chapter 3 of the RDC Researcher’s Handbook Suppress: coefficients on detailed indicator variables, on cells with too few entities Smooth: density estimation and quantiles, use a kernel density estimator to produce quantiles Coarsen: variables with heavy tails (earnings, payroll), residuals (truncate range, suppress labels of range) April 25, 2016 © John M. Abowd and Lars Vilhuber 2016, all rights reserved 26

PROPERTIES OF STATISTICAL DISCLOSURE LIMITATION METHODS April 25, 2016 © John M. Abowd and Lars Vilhuber 2016, all rights reserved 27

Statistical disclosure limitation is ignorable if and only if the analysis designed for the confidential data yields the same result when applied to the published data. SDL is almost never ignorable. April 25, 2016 © John M. Abowd and Lars Vilhuber 2016, all rights reserved 28

Nonignorable statistical disclosure limitation is known if and only if the analysis of the published data can be exactly corrected for the data alterations introduced by the SDL. Nonignorable SDL is known for a limited number of methods. April 25, 2016 © John M. Abowd and Lars Vilhuber 2016, all rights reserved 29

Nonignorable statistical disclosure limitation is discoverable if and only if the analysis of the published data can be probabilistically corrected for the data alterations introduced by the SDL. Nonignorable SDL is discoverable for many methods. April 25, 2016 © John M. Abowd and Lars Vilhuber 2016, all rights reserved 30

In modern SDL, the relevant trade-off for economists is between methods that are easy to make SDL aware (generalized randomized response when certain parameters are public) and those that are not (suppression and swapping as currently implemented). April 25, 2016 © John M. Abowd and Lars Vilhuber 2016, all rights reserved 31

Oldest SDL technique (Warner 1965; predates formal SDL itself). Hide the question asked from the interviewer. The respondent’s answer “yes” applies to either the sensitive question or a non-sensitive question. Two SDL parameters: the probability of being asked the sensitive question and the probability of “yes” for the non-sensitive question. SDL affects the mean and standard error for the parameter of interest: the proportion of the population answering “yes” to the sensitive question. Example 1: Randomized Response April 25, 2016 © John M. Abowd and Lars Vilhuber 2016, all rights reserved 32

“Yes”: the probability of being asked the sensitive question (“Ever xxxx?”) is ½ and the probability of “yes” to the innocuous question is ½, so Pr[yes | trait present] = ¾ and Pr[yes | trait absent] = ¼. Inferential disclosure (Bayes factor): 3. ε-differential privacy (maximum ln Bayes factor): ε = ln(3) ≈ 1.1. The SDL is nonignorable and known. Example 1: Randomized Response (continued) April 25, 2016 © John M. Abowd and Lars Vilhuber 2016, all rights reserved 33
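
A minimal simulated sketch of the SDL-aware correction for this design; the parameter names are illustrative, with p the probability of receiving the sensitive question and q the probability of “yes” on the innocuous one:

```python
import numpy as np

rng = np.random.default_rng(seed=7)

def randomized_response(trait, p=0.5, q=0.5, rng=rng):
    """Return observed 'yes'/'no' answers under the unrelated-question design."""
    asked_sensitive = rng.random(trait.size) < p
    innocuous_yes = rng.random(trait.size) < q
    return np.where(asked_sensitive, trait, innocuous_yes)

# Simulate a population in which 20% have the sensitive trait.
n, pi_true = 100_000, 0.20
trait = rng.random(n) < pi_true
answers = randomized_response(trait)

# SDL-aware estimator: E[yes] = p*pi + (1-p)*q, so invert for pi.
p, q = 0.5, 0.5
lam_hat = answers.mean()
pi_hat = (lam_hat - (1 - p) * q) / p
se_pi = np.sqrt(lam_hat * (1 - lam_hat) / n) / p  # SDL inflates the standard error by 1/p

print(f"pi_hat = {pi_hat:.3f} (SE {se_pi:.4f})")
```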

Now, Back to the Live Example April 25, 2016 © John M. Abowd and Lars Vilhuber 2016, all rights reserved 34

Published income is topcoded at T. Quantiles of the published income equal quantiles of the confidential income for all quantiles below the quantile of T. SDL is ignorable for quantiles below the quantile of T. SDL is nonignorable but discoverable for quantiles at or above the quantile of T (discovery is via inspection of the data combined with the knowledge that SDL topcoding was applied). Example 2: Topcoding April 25, 2016 © John M. Abowd and Lars Vilhuber 2016, all rights reserved 35
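
A minimal sketch illustrating the ignorability claim above on simulated data; the lognormal income distribution and the choice of topcode are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

income = rng.lognormal(mean=10.8, sigma=0.8, size=50_000)  # confidential values
T = np.quantile(income, 0.95)                              # topcode at the 95th percentile
published = np.minimum(income, T)                          # released, topcoded income

for q in (0.10, 0.50, 0.90, 0.99):
    print(f"q={q:.2f}  confidential={np.quantile(income, q):10.0f}"
          f"  published={np.quantile(published, q):10.0f}")
# Quantiles below 0.95 match exactly; the 0.99 quantile is censored at T.
```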

Published data are a large contingency table with confidential data available for every cell Oldest formal SDL technique (Fellegi 1972) Most common SDL method in use worldwide Some values are suppressed because either the data in the cell were deemed “sensitive” or a complementary suppression was needed to protect another sensitive cell The missing data items are not ignorable; therefore the SDL is nonignorable Suppression is almost always discoverable Example 3: Tabular Suppression April 25, 2016 © John M. Abowd and Lars Vilhuber 2016, all rights reserved 36

Example 4: Regression Discontinuity and Regression Kink Running variable subjected to SDL from the generalized randomized response class (noise infusion, suppress and impute, synthetic data) SDL is nonignorable because the estimated treatment effect is confounded by the probability that SDL was applied to the running variable SDL is known if the agency publishes that probability SDL is discoverable if the agency publishes the fact that a generalized randomized response method was used April 25, 2016 © John M. Abowd and Lars Vilhuber 2016, all rights reserved 37

Example 4: Regression Discontinuity and Regression Kink (continued) In RD and RK designs where the probability of SDL is known, divide the treatment effect and its standard error by the probability. In designs where the SDL is discoverable, use the “compliance” function implied by generalized randomized response to estimate the treatment effect using the appropriate fuzzy RD/RK estimator; adjustment of the standard error is part of the fuzzy RD/RK method. All other RD/RK assumptions are identical to those made in the confidential data analysis. Analysis similar to Lee and Card (2008). April 25, 2016 © John M. Abowd and Lars Vilhuber 2016, all rights reserved 38
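
A minimal sketch of the known-probability correction described above; the simulation, the interpretation of the published probability as the share of unaltered running-variable values, and all parameter values are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

n, cutoff, true_effect, p_unaltered = 20_000, 0.0, 2.0, 0.8

# Confidential running variable and outcome with a jump at the cutoff.
x = rng.uniform(-1, 1, n)
y = 1.0 + 0.5 * x + true_effect * (x >= cutoff) + rng.normal(0, 1, n)

# SDL from the generalized-randomized-response class: with probability
# 1 - p_unaltered the published running variable is an independent draw,
# which attenuates the estimated jump at the cutoff by the factor p_unaltered.
altered = rng.random(n) >= p_unaltered
x_pub = np.where(altered, rng.uniform(-1, 1, n), x)

def rd_jump(run, out, h=0.2):
    """Difference in mean outcomes within bandwidth h on either side of the cutoff."""
    right = out[(run >= cutoff) & (run < cutoff + h)].mean()
    left = out[(run < cutoff) & (run >= cutoff - h)].mean()
    return right - left

naive = rd_jump(x_pub, y)
corrected = naive / p_unaltered  # SDL-aware estimate when p_unaltered is published
print(f"naive={naive:.2f}  corrected={corrected:.2f}  true={true_effect}")
```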

Same data structure as in tabular suppression: published data are a large contingency table with confidential data available for every cell Instead of suppression, all items are published by tabulating input data that have been infused with noise (every input value modified) SDL is nonignorable Example 5: Tabular Noise Infusion April 25, 2016 © John M. Abowd and Lars Vilhuber 2016, all rights reserved 39

SDL is known if the agency publishes the statistical properties of the noise process—usually, i.i.d. with zero mean and constant known variance. SDL is discoverable if there is another tabular summary with the same expected value but published using a different SDL method or different noise infusion parameters. Examples: Quarterly Workforce Indicators, Quarterly Census of Employment and Wages, County Business Patterns. Example 5: Tabular Noise Infusion (continued) April 25, 2016 © John M. Abowd and Lars Vilhuber 2016, all rights reserved 40
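
A minimal sketch of input noise infusion for a tabular magnitude release; the multiplicative noise distribution and the data are illustrative assumptions, not the QWI production parameters:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=3)

# Hypothetical establishment microdata: industry cell and payroll.
micro = pd.DataFrame({
    "industry": rng.choice(["23", "44", "62"], size=1_000),
    "payroll":  rng.lognormal(mean=12, sigma=1, size=1_000),
})

# Assign every establishment a permanent multiplicative noise factor with
# mean 1 (here uniform on [0.90, 0.95] or [1.05, 1.10]), then tabulate the
# noisy inputs; every published cell is distorted and none is suppressed.
sign = rng.choice([-1, 1], size=len(micro))
micro["factor"] = 1 + sign * rng.uniform(0.05, 0.10, size=len(micro))
micro["noisy_payroll"] = micro["payroll"] * micro["factor"]

published = micro.groupby("industry")[["payroll", "noisy_payroll"]].sum()
print(published.round(0))  # confidential vs. noise-infused cell totals
```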

All of the following can be modeled as generalized randomized response: – Swapping – Input noise infusion (adding random values) – Suppress and impute (replacing some values with imputations; also called partial synthetic data) – Synthetic data (replacing some or all values with samples from the posterior distribution) – Case considered by Alexander, Davern and Stevenson (2010). SDL is never ignorable. SDL is discoverable. Example 6: Microdata Noise Infusion April 25, 2016 © John M. Abowd and Lars Vilhuber 2016, all rights reserved 41

Regression analysis when the dependent variable has been subject to “suppress and impute” SDL, with no other SDL. SDL is nonignorable. SDL is discoverable if the agency publishes the suppression rate and the list of variables used in the imputation models, and uses the complete confidential data matrix (including observations with suppressions) for imputation. Analysis similar to Bollinger and Hirsch (2006). Example 6: Microdata Noise Infusion (continued) April 25, 2016 © John M. Abowd and Lars Vilhuber 2016, all rights reserved 42

Fully synthetic data (all values replaced with samples from their posterior predictive distributions). Models fit on the synthetic data are validated by the agency on the actual confidential data. Custom tabulation SDL is applied to output from the confidential data (usually some suppression of coefficients). SDL is nonignorable. Synthetic data systems are published with fully SDL-aware analysis tools. Example 7: Synthetic Data with Validation April 25, 2016 © John M. Abowd and Lars Vilhuber 2016, all rights reserved 43

Synthetic Data and Program Evaluation Fully synthetic data cannot identify RD/RK effects (program treatment effects), but they do permit full development of the specification (semi-parametric estimation of the response and compliance functions). Certify the design on the synthetic data; validate the certified design on the confidential data. This prevents ex post adjustment of the evaluation (as in current best practices for clinical trials), allows the agency to fully restrict access to the confidential data without limiting evaluation options, and publishes the intended evaluation. April 25, 2016 © John M. Abowd and Lars Vilhuber 2016, all rights reserved 44

Variable y_3 has been subjected to formal privacy protection and is published as z_3. The RD Evaluation Setup April 25, 2016 © John M. Abowd and Lars Vilhuber 2016, all rights reserved 45

Details of Evaluation with Synthetic Data April 25, 2016 © John M. Abowd and Lars Vilhuber 2016, all rights reserved 46

From Bertrand et al. Timeline: SDS application November 2012; gold-standard results January. Reference: Bertrand, Marianne, Emir Kamenica, and Jessica Pan (2015) “Gender identity and relative income within households,” Quarterly Journal of Economics. April 25, 2016 © John M. Abowd and Lars Vilhuber 2016, all rights reserved 47