Classification Trees for Privacy in Sample Surveys

Presentation transcript:

Classification Trees for Privacy in Sample Surveys
Rolando A. Rodríguez, Michael H. Freiman, Jerome P. Reiter, Amy D. Lauger
Symposium for Data Science and Statistics, May 18, 2018
(Jerome Reiter: Duke University and U.S. Census Bureau)
Any views expressed are those of the authors and not necessarily those of the U.S. Census Bureau.

The American Community Survey (ACS)
- The ACS is the Census Bureau's largest demographic survey
- Collects a wealth of information about households and people
  - Household characteristics (relationships, mortgage/rent, utilities)
  - Person characteristics (age, sex, ancestry, schooling, occupation)
  - Over 35 topic areas
- The ACS is the basis for the distribution of ~$675 billion in federal funds annually
- Used for many non-government purposes

Usefulness and privacy in the ACS
- We are required to release estimates from the ACS
- We are required to protect the identities and attributes of respondents in the ACS
- ACS yearly data releases include 1-year and 5-year microdata and tables, on the order of 10 billion estimates produced per year
- Current ACS privacy practices include:
  - Swapping, coarsening, and censoring for microdata
  - Coarsening and suppression for tables

Users need to account for privacy protection
- Privacy algorithms usually add bias and/or variance to estimators
- We often keep certain parameters of the algorithms secret, so data users cannot account for the effect of such 'private' algorithms
- We are researching methods to add transparency; new methods need to give at least as much privacy protection

Enter synthetic data
- We use distributional draws to generate new ACS records
- The theory accounts for added variance via multiple imputation
- Models can augment survey data with administrative records

How synthetic data works
[Figure: Bayesian workflow — prior distribution + data likelihood → posterior distribution → posterior predictive draws → multiple data releases]
- This is the natural mechanism under Bayesian statistics
- Because values are simulated rather than copied, there is a low probability of a complete match to any real record
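
As a point of reference (the slides themselves show no formulas), the mechanics above can be written compactly: the synthetic data D̃ are drawn from the posterior predictive distribution

```latex
p(\tilde{D} \mid D) \;=\; \int p(\tilde{D} \mid \theta)\, p(\theta \mid D)\, d\theta
```

and repeating the draw m times yields m synthetic datasets whose estimates are combined with multiple-imputation rules to account for the added variance.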

Choosing models for the ACS
- We want to generate new values for all records and all variables
- The majority of variables in the ACS are categorical
- We have dimensionality woes, so conditional models are helpful
- Stakeholders prefer models that they can understand
- The models must work reasonably within the synthetic data framework
  - 'Proper' posterior draws can be hard to obtain
  - Other methods can be useful with adaptations to fit the framework
  - "Likelihood is more subjective than the prior"

Classification trees are good models to try
- Trees can identify important relationships automatically
- We can easily fit trees conditionally on previous trees
- The logical flow of a tree is easy to visualize
- We can use bootstrapping on the trees' leaves for synthetic data
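
As a reference point not shown on the slides, the conditional, tree-by-tree structure amounts to a chained factorization of the joint distribution of the variables to be synthesized,

```latex
p(Y_1, \dots, Y_p) \;=\; p(Y_1)\, p(Y_2 \mid Y_1) \cdots p(Y_p \mid Y_1, \dots, Y_{p-1}),
```

with each conditional estimated by a classification tree, so synthetic values for a given variable are drawn conditional only on values that have already been synthesized.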

Tree-based synthesis: grow the tree
[Figure: an illustrative classification tree splitting records by dwelling type and income category]
- The only test set of interest is the data itself, so we train the tree on the entire data
- Only previously synthesized features (predictors) are available to a given tree
- We limit the minimal split/leaf size to avoid overfitting and to allow plausible synthesis

Tree-based synthesis: use leaves for synthesis
[Figure: synthetic values are drawn from the records that fall within each terminal leaf]
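
A minimal sketch of the leaf-bootstrap step described above, assuming scikit-learn and pandas; the function, column names, and minimum leaf size are illustrative, not the production ACS code:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def synthesize_column(df_real, df_synth, target, predictors, min_leaf=50, seed=0):
    """Fit a classification tree for `target` on the real data using only
    already-synthesized predictors, then fill in `target` for the synthetic
    records by bootstrapping real values within each terminal leaf."""
    rng = np.random.default_rng(seed)

    # One-hot encode predictors; align the synthetic columns to the real ones.
    X_real = pd.get_dummies(df_real[predictors])
    X_synth = pd.get_dummies(df_synth[predictors]).reindex(columns=X_real.columns, fill_value=0)

    # min_samples_leaf guards against overfitting and implausibly small donor pools.
    tree = DecisionTreeClassifier(min_samples_leaf=min_leaf).fit(X_real, df_real[target])

    real_leaves = tree.apply(X_real)    # leaf index of each real record
    synth_leaves = tree.apply(X_synth)  # leaf index of each synthetic record

    values = np.empty(len(df_synth), dtype=object)
    for leaf in np.unique(synth_leaves):
        donors = df_real[target].to_numpy()[real_leaves == leaf]
        mask = synth_leaves == leaf
        values[mask] = rng.choice(donors, size=mask.sum(), replace=True)

    df_synth = df_synth.copy()
    df_synth[target] = values
    return df_synth

# Variables are synthesized one at a time, each tree seeing only columns already drawn, e.g.
# df_synth = synthesize_column(acs_real, df_synth, "tenure", ["dwelling_type", "income_cat"])
```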

Current results for ACS tabulations are promising
- We assess a variety of estimates on several metrics:
  - Internal microdata cross-tabulations
  - Unweighted production tables (unweighted because we do not synthesize weights; note this is a fairly hard metric not to fail on)
  - Common analyses such as regressions
- Results are generally good, but certain metrics highlight potential issues
- Problems in tree fits are relatively tractable:
  - Trees may not always split in ways that support the tables
  - Limitations in tree depth can cause issues

What about privacy?
- We can analyze risk against specific external databases
  - How certain are links between records?
  - How much do statistics change?
- How can we protect against attacks in general?
  - We need to define privacy in a computable way

Quantifiable privacy requires formal definitions
- The ultimate goal is to use formal privacy methods on the ACS
- Formal privacy definitions are mathematical (here, 'mathematical' reads 'computable')
- Differential privacy is the most common flavor
  - Your participation does not change released statistics 'too much'
  - The guarantee is over all possible data realizations and statistics
  - The amount of privacy loss is encapsulated in a parameter, ε
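
For reference, the textbook statement of the guarantee (not quoted from the slides): a randomized mechanism M satisfies ε-differential privacy if, for every pair of datasets D and D′ that differ in one record and every set of outputs S,

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[M(D') \in S].
```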

A simple example of differential privacy
[Figure: a small two-row table of commute-mode counts (car, train, walk, bike, boat) for two respondents, with roughly 700 and 800 in the car column and a handful of counts elsewhere; successive slides show several noisy realizations at ε = 0.1, in which the counts shift noticeably and can even become negative, and then the table under ε = 1, which stays much closer to the original]
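
An illustrative sketch of the kind of noise infusion depicted above, using the Laplace mechanism (the slides do not specify which mechanism they use); the counts and sensitivity are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical commute-mode counts (car, train, walk, bike, boat) for two rows of respondents.
counts = np.array([[700, 5, 0, 0, 1],
                   [800, 0, 4, 0, 0]], dtype=float)

def laplace_mechanism(table, epsilon, sensitivity=1.0):
    """Add independent Laplace(sensitivity / epsilon) noise to every cell.
    Smaller epsilon means more noise: more privacy, less accuracy.
    (The appropriate sensitivity depends on how 'neighboring' datasets are defined.)"""
    scale = sensitivity / epsilon
    return table + rng.laplace(loc=0.0, scale=scale, size=table.shape)

print(np.round(laplace_mechanism(counts, epsilon=0.1)))  # noisy; cells can swing widely, even below zero
print(np.round(laplace_mechanism(counts, epsilon=1.0)))  # much closer to the original table
```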

Now we can discuss a privacy tradeoff
- We express accuracy under differential privacy as the distance between the original and noisy tables
- A common metric is the L1 norm, the sum of the absolute cell differences
- A normalized accuracy measure is 1 − L1/(2N), where N is the total count, plotted against the privacy-loss parameter ε
- More privacy loss (greater ε) means more accuracy
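
An illustrative calculation with made-up numbers: for a table whose total count is N = 1,500 and whose noisy cells differ from the original by a combined L1 = 60,

```latex
1 - \frac{L_1}{2N} \;=\; 1 - \frac{60}{2 \times 1500} \;=\; 0.98,
```

i.e., about 98% accuracy at that noise level; increasing ε shrinks the expected L1 and pushes this measure toward 1.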

Formal privacy can be hard for sample surveys
- Detailed surveys are blessed (cursed) by dimensionality
  - The total number of cells increases exponentially with the number of variables
  - Tables of small populations will tend to be sparse
  - Rounding negative counts can lead to large marginal bias
- Data releases can contain summaries that are 'harder' to protect
  - How many people live with a person with property P?
  - What is the median age of said persons?
- Surveys are weighted
  - How do we define 'your participation'?
  - How do we report accurate standard errors?

The path forward
- Continue to improve synthetic data methods for the ACS
- Research ways in which formal privacy might be viable for the ACS
- Use administrative records where possible

Thank you!
rolando.a.rodriguez@census.gov