Differential Privacy in US Census. CompSci 590.03, Instructor: Ashwin Machanavajjhala. Lecture 17, Fall 2012.

Announcements: No class on Wednesday, Oct 31. Guest lecture by Prof. Jerry Reiter (Duke Stats) on Friday, Nov 2: "Privacy in the U.S. Census and Synthetic Data Generation".

Outline (continuation of last class): Relaxing differential privacy for utility – E-privacy [M et al VLDB '09]; Application of differential privacy in the US Census [M et al ICDE '08].

E-PRIVACY

Defining Privacy: "... nothing about an individual should be learnable from the database that cannot be learned without access to the database." (T. Dalenius, 1977). Problem with this approach: the analyst knows Bob has green hair. The analyst learns from the published data that people with green hair have a 99% probability of cancer. Therefore the analyst knows Bob has a high risk of cancer, even if Bob is not in the published data.

Defining Privacy: Therefore the analyst knows Bob has a high risk of cancer, even if Bob is not in the published data. This should not be considered a privacy breach – such correlations are exactly what we want the analyst to learn.

Counterfactual Approach: Consider 2 distributions: Pr[Bob has cancer | adversary's prior + output of mechanism on D] – "what the adversary learns about Bob after seeing the published information" – and Pr[Bob has cancer | adversary's prior + output of mechanism on D_{-Bob}], where D_{-Bob} = D – {Bob} – "what the adversary would have learned about Bob even (in the hypothetical case) when Bob was not in the data". Must be careful: when removing Bob, you may also need to remove other tuples correlated with Bob.

Counterfactual Privacy: Consider a set of data evolution scenarios (adversaries) {θ}. For every property s_Bob about Bob and every output w of the mechanism M: |log P(s_Bob | θ, M(D) = w) − log P(s_Bob | θ, M(D_{-Bob}) = w)| ≤ ε. When {θ} is the set of all product distributions that are independent across individuals, D_{-Bob} = D − {Bob's record}, and a mechanism satisfies the above definition if and only if it satisfies differential privacy.

Counterfactual Privacy: Consider a set of data evolution scenarios (adversaries) {θ}. For every property s_Bob about Bob and every output w of the mechanism M: |log P(s_Bob | θ, M(D) = w) − log P(s_Bob | θ, M(D_{-Bob}) = w)| ≤ ε. What about other sets of prior distributions {θ}? Open question: if {θ} contains correlations, then the definition of D_{-Bob} itself is not very clear (as discussed in the previous class for count constraints and social networks).

Certain vs. Uncertain Adversaries: Suppose an adversary has an uncertain prior. Consider a two-sided coin. A certain adversary knows the bias of the coin is p (for some p): it exactly knows the bias, and knows every coin flip is a random draw that is heads with probability p. An uncertain adversary may think the coin's bias is in [p − δ, p + δ]: it does not exactly know the bias, and assumes the coin's bias θ is drawn from some probability distribution π; given θ, every coin flip is a random draw that is heads with probability θ.

Learning: In machine learning/statistics, you want to use the observed data to learn something about the population. E.g., given 10 flips of a coin, what is the bias of the coin? Assume your population is drawn from some prior distribution θ. We don't know θ, but we may know that some θ's are more likely than others (captured by π, a probability distribution over θ's). We want to learn the best θ that explains the observations. If you are certain about θ in the first place, there is no need for statistics/machine learning.
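
An illustrative aside (not from the slide): a minimal sketch of this kind of learning for the coin example, assuming a conjugate Beta prior and made-up flip data.

```python
import numpy as np

# Minimal sketch (assumed, illustrative values): an "uncertain" adversary models the
# coin's bias theta with a Beta prior instead of assuming a fixed value p.
prior_a, prior_b = 2.0, 2.0                        # Beta(2,2): roughly fair, but uncertain
flips = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])   # 10 observed flips (1 = heads)

heads = int(flips.sum())
tails = len(flips) - heads
post_a, post_b = prior_a + heads, prior_b + tails  # conjugate Beta posterior update

print("posterior mean of theta:", post_a / (post_a + post_b))   # ~0.64
# A "certain" adversary (a point prior on theta) would learn nothing from the flips;
# the uncertain adversary updates its belief about theta as data arrives.
```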

Uncertain Priors and Learning: In many privacy scenarios, a statistician (advertiser, epidemiologist, machine learning practitioner, ...) is the adversary. But the statistician does not have a certain prior (otherwise there would be nothing to learn). Maybe we can model a class of "realistic" adversaries using uncertain priors.

E-Privacy: Counterfactual privacy with realistic adversaries. Consider a set of data evolution scenarios (uncertain adversaries) {π}. For every property s_Bob about Bob and every output w of the mechanism M: |log P(s_Bob | π, M(D) = w) − log P(s_Bob | π, M(D_{-Bob}) = w)| ≤ ε, where P(s_Bob | π, M(D) = w) = ∫_θ P(s_Bob | θ, M(D) = w) P(θ | π) dθ.

Realistic Adversaries: Suppose your domain has 3 values: green (g), red (r), and blue (b). Suppose individuals are assumed to be drawn from some common distribution θ = (p_g, p_r, p_b). [Figure: the probability simplex over (p_g, p_r, p_b), with corners (p_g = 1, p_r = 0, p_b = 0), (p_g = 0, p_r = 1, p_b = 0), and (p_g = 0, p_r = 0, p_b = 1).]

Modeling Realistic Adversaries: E.g., use Dirichlet priors D(α_g, α_r, α_b) to model uncertainty. Maximum probability is given to (p*_g, p*_r, p*_b), where p*_g = α_g / (α_g + α_r + α_b). [Figure: example Dirichlet densities D(6,2,2), D(3,7,5), D(6,2,6), and D(2,3,4) over the simplex, each concentrated around its own (p*_g, p*_r, p*_b).]

E.g., Dirichlet Prior D(α_g, α_r, α_b): Call α = α_g + α_r + α_b the stubbornness of the prior. As α increases, more probability is given to (p*_g, p*_r, p*_b). When α → ∞, (p*_g, p*_r, p*_b) has probability 1 and we recover the independence assumption. [Figure: two example Dirichlet densities with different stubbornness values, e.g. D(2,3,4) vs. D(6,2,6) with α = 14.]
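
An illustrative sketch (parameter values assumed, not from the slide) of how increasing stubbornness α concentrates the Dirichlet prior around (p*_g, p*_r, p*_b):

```python
import numpy as np

# Keep the shape p* fixed and scale it by the stubbornness alpha; draws from the
# Dirichlet prior then concentrate around p* as alpha grows.
rng = np.random.default_rng(0)
p_star = np.array([0.43, 0.14, 0.43])              # assumed shape for (green, red, blue)

for alpha in [14, 140, 14000]:
    draws = rng.dirichlet(p_star * alpha, size=10000)
    print(alpha, draws.std(axis=0).round(3))       # per-coordinate spread shrinks toward 0
# In the limit alpha -> infinity the prior is a point mass at p*, i.e. the
# independence assumption behind differential privacy's adversary.
```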

Better Utility: Suppose we consider a class of uncertain adversaries characterized by Dirichlet distributions with known stubbornness α but unknown shape (α_g, α_r, α_b). Algorithm: generalization. The size of each group must be > α/ε − 1; if the size of a group is δ(α/ε − 1), then the most frequent sensitive value appears in at most a 1 − 1/(ε + δ) fraction of the tuples in the group. Hence we now also have a sliding scale for assessing the power of adversaries (based on stubbornness): larger α implies more coarsening, which implies less utility.

Summary: Lots of open questions. What is the relationship between counterfactual privacy and Pufferfish? In what other ways can causality theory be used, e.g., for defining correlations? What are other interesting ways to instantiate E-privacy, and what are efficient algorithms for E-privacy? …

DIFFERENTIAL PRIVACY IN US CENSUS

OnTheMap: A Census application that plots commuting patterns of workers. [Figure: map showing a workplace (public) and the corresponding residences (sensitive).]

OnTheMap: A Census application that plots commuting patterns of workers. [Table: one row per worker, with columns Worker ID, Origin, and Destination; origins and destinations are Census blocks. Residence (origin) is sensitive; workplace (destination) is a quasi-identifier.]

Why publish commute patterns? To compute Quarterly Workforce Indicators: total employment, average earnings, new hires & separations, unemployment statistics. E.g., the state of Missouri used this data to formulate a method allowing QWI to suggest industrial sectors where transitional training might be most effective … to proactively reduce time spent on unemployment insurance …

A Synthetic Data Generator (Dirichlet resampling). Step 1: Noise addition (for each destination). The multi-set of origins for workers commuting to Washington DC, D = (7, 5, 4), plus noise (fake workers) A = (2, 3, 3), yields the noise-infused data D + A = (9, 8, 7) over the origin blocks Washington DC, Somerset, and Fuller. The noise added to an origin with at least 1 worker is > 0.

A Synthetic Data Generator (Dirichlet resampling). Step 2: Dirichlet resampling (for each destination). Starting from the noise-infused counts D + A = (9, 8, 7): draw a point at random and replace it with two of the same kind; repeating this draw produces S, the synthetic data. If the frequency of block b in D + A is 0, then the frequency of b in S is 0, i.e., block b is ignored by the algorithm.
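
A minimal sketch of the two steps, assuming the counts from the example above and a made-up synthetic population size m (an illustration of the urn process as stated on the slides, not the Census implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

D = np.array([7, 5, 4])        # original origin counts (Washington DC, Somerset, Fuller)
A = np.array([2, 3, 3])        # Step 1: fake workers added per block
urn = (D + A).astype(float)    # noise-infused counts D + A = (9, 8, 7)

m = 16                         # assumed size of the synthetic population
synthetic = np.zeros(len(urn), dtype=int)
for _ in range(m):             # Step 2: Dirichlet (Polya-urn) resampling
    b = rng.choice(len(urn), p=urn / urn.sum())   # draw a point at random
    urn[b] += 1                # replace it with two of the same kind
    synthetic[b] += 1          # record one synthetic worker from block b

print(synthetic)               # a block with zero count in D + A can never appear in S
```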

How should we add noise? Intuitively, more noise yields more privacy. How much noise should we add? To which blocks should we add noise? Currently this is poorly understood: the total amount of noise added is a state secret, and only 3-4 people in the US know this value in the current implementation of OnTheMap.

Privacy of Synthetic Data. Theorem 1: The Dirichlet resampling algorithm preserves privacy if and only if, for every destination d, the noise added to each block is at least m(d) / (ε − 1), where m(d) is the size of the synthetic population for destination d and ε is the privacy parameter.

1. How much noise should we add? Under differential privacy, the required noise per block is m(d) / (ε − 1). [Figure: required noise per block grows as privacy increases, for 1 million original and synthetic workers.] Adding this noise to every block on the map (there are 8 million Census blocks on the map!) turns 1 million original workers into 16 billion fake workers. 2. To which blocks should we add noise?
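
A back-of-the-envelope sketch of why this blows up, with assumed values of m(d) and ε chosen only to reproduce the slide's order of magnitude:

```python
num_blocks = 8_000_000            # Census blocks on the map
m_d = 1_000                       # assumed synthetic population for one destination
epsilon = 1.5                     # assumed privacy parameter

noise_per_block = m_d / (epsilon - 1)       # Theorem 1 lower bound
print(noise_per_block)                      # 2000 fake workers per block
print(noise_per_block * num_blocks)         # 16 billion fake workers in total
```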

Intuition behind the theorem: Two possible inputs, D1 and D2, where blue and red are two different origin blocks. The adversary knows individual 1 is either blue or red, and knows individuals 2..n are blue. [Figure: D1 and D2 differ only in individual 1's origin block.]

Intuition behind the theorem: Noise addition is applied to both possible inputs. [Figure: noise added to D1 and D2.]

Intuition behind the theorem: Dirichlet resampling is applied to the noise-infused inputs D1 and D2. For every output O: Pr[D1 → O] = 1/10 · 2/11 · 3/12 · 4/13 · 5/14 · 6/15 and Pr[D2 → O] = 2/10 · 3/11 · 4/12 · 5/13 · 6/14 · 7/15, so Pr[D2 → O] = 7 · Pr[D1 → O].
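
A quick sanity check of that ratio (the products are exactly those shown on the slide):

```python
from fractions import Fraction
from math import prod

pr_D1 = prod(Fraction(k, 9 + k) for k in range(1, 7))       # 1/10 * 2/11 * ... * 6/15
pr_D2 = prod(Fraction(k + 1, 9 + k) for k in range(1, 7))   # 2/10 * 3/11 * ... * 7/15
print(pr_D2 / pr_D1)    # Fraction(7, 1): the adversary's odds shift by a factor of 7
```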

Intuition behind the theorem: For every such output O, the adversary infers that individual 1 is very likely red … unless the noise added is very large.

Privacy Analysis: Summary. We chose differential privacy: it guards against powerful adversaries and measures privacy as a distance between prior and posterior. We derived necessary and sufficient conditions under which OnTheMap preserves privacy. The above conditions make the data published by OnTheMap useless.

But the breach occurs with very low probability. [Figure: for the example above, the probability of observing such a distinguishing output O is extremely small.]

Negligible function. Definition: f(x) is negligible if it goes to 0 faster than the inverse of any polynomial; e.g., 2^{-x} and e^{-x} are negligible functions.

(ε,δ)-Indistinguishability: For every pair of inputs D1, D2 that differ in one value, and for any subset of outputs T: Pr[D1 → T] ≤ e^ε · Pr[D2 → T] + δ(|D2|). If T occurs with negligible probability, the adversary is allowed to distinguish between D1 and D2 by a factor greater than e^ε using outputs O_i in T.

Conditions for (ε,δ)-Indistinguishability. Theorem 2: The Dirichlet resampling algorithm preserves (ε,δ)-indistinguishability if, for every destination d, the noise added to each block is at least log n(d), where n(d) is the number of workers commuting to d and m(d) ≤ n(d).

Probabilistic Differential Privacy: (ε,δ)-Indistinguishability is an asymptotic measure and may not guarantee privacy when the number of workers at a destination is small. Definition (Disclosure Set Disc(D, ε)): the set of output tables that breach ε-differential privacy for D and some other table D' that differs from D in one value.

Probabilistic Differential Privacy: The adversary may distinguish between D1 and D2 based on a set of unlikely outputs that has probability at most δ. For every pair of inputs D1, D2 that differ in one value, with probability at least 1 − δ over the output O: e^{−ε} ≤ Pr[D1 → O] / Pr[D2 → O] ≤ e^ε.

1. How much noise should we add? [Figure: noise required per block vs. privacy level, for 1 million original and synthetic workers; probabilistic differential privacy (fixed δ) requires far less noise than differential privacy.]

Prob. Differential Privacy: Summary. Ignoring privacy breaches that occur due to low-probability outputs drastically reduces the noise. Two ways to bound low-probability outputs: (1) (ε,δ)-indistinguishability and negligible functions, giving noise required for privacy ≥ log n(d) per block; (2) (ε,δ)-probabilistic differential privacy and disclosure sets, with an efficient algorithm to calculate the noise per block (see paper). Does probabilistic differential privacy allow useful information to be published?
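
An illustrative comparison of the two bounds for a single destination, with assumed values of m(d), n(d), and ε:

```python
import math

m_d = 1_000         # assumed synthetic population for destination d
n_d = 1_000         # assumed number of workers commuting to d (m(d) <= n(d))
epsilon = 1.5       # assumed privacy parameter

dp_noise = m_d / (epsilon - 1)   # Theorem 1: noise per block for differential privacy
pdp_noise = math.log(n_d)        # Theorem 2: order log n(d) per block suffices
print(round(dp_noise), round(pdp_noise, 1))   # e.g. 2000 vs ~6.9 fake workers per block
```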

2. To which blocks should we add noise? Why not add noise to every block? [Figure repeated: noise required per block under differential privacy vs. probabilistic differential privacy.]

Why not add noise to every block? Even with the (much smaller) per-block noise required under probabilistic differential privacy, there are about 8 million blocks on the map, so the total noise added is about 6 million fake workers. This causes non-trivial spurious commute patterns: roughly 1 million fake workers come from the West Coast (out of a total of 7 million points in D + A), so 1/7 of the synthetic data have residences on the West Coast and work in Washington DC.

2. To which blocks should we add noise? Adding noise to all blocks creates spurious commute patterns. Why not add noise only to blocks that appear in the original data?

Theorem 3: Adding noise only to blocks that appear in the data breaches privacy. If a block b does not appear in the original data and no noise is added to b, then b cannot appear in the synthetic data.

Theorem 3 (example): Worker W comes from Somerset or Fayette, and no one else comes from there. [Figure: the two possible inputs, (Somerset: 1, Fayette: 0) and (Somerset: 0, Fayette: 1).] If S has a synthetic worker from Somerset, then W comes from Somerset!

Ignoring outliers degrades utility. [Figure: map of origin blocks; each of the highlighted points is an outlier, and together they contribute about half of the workers.]

Our solution to "Where to add noise?" Step 1: Coarsen the domain, based on an existing public dataset (Census Transportation Planning Package, CTPP).

Our solution to "Where to add noise?" Step 1: Coarsen the domain. Step 2: Probabilistically drop blocks with 0 support. Pick a function f: {b_1, …, b_k} → (0, 1] (based on external data); for every block b with 0 support, ignore b with probability f(b). Theorem 4: The parameter ε increases by max_b max(2 · (noise per block), f(b)).
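
A minimal sketch of Step 2 as stated above; the block list, supports, and the function f are made up for illustration (in practice f would come from external data such as CTPP):

```python
import numpy as np

rng = np.random.default_rng(0)

support = {"b1": 12, "b2": 0, "b3": 3, "b4": 0}   # workers per block in the original data
f = {b: 0.9 for b in support}                     # assumed drop probabilities in (0, 1]

kept = []
for b, count in support.items():
    if count == 0 and rng.random() < f[b]:
        continue                  # probabilistically ignore a zero-support block
    kept.append(b)                # noise is then added only to the blocks that are kept

print(kept)                       # e.g. ['b1', 'b3'] plus any surviving zero-support blocks
```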

Utility of the provably private algorithm. Experimental setup: OTM, the currently published OnTheMap data, is used as the original data, with all destinations in Minnesota and 120,690 origins per destination, chosen by pruning out blocks that are > 100 miles from the destination. ε = 100, δ = …; additional leakage due to probabilistic pruning = 4 (min f(b) = …). Utility is measured by the average commute distance for each destination block.
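
A sketch of the utility measure (average commute distance per destination block), assuming synthetic origin/destination coordinates are available; the helper function and sample points below are illustrative only:

```python
import math

def haversine_miles(p, q):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 3958.8 * math.asin(math.sqrt(a))

def avg_commute_by_destination(pairs):
    """pairs: iterable of (origin_latlon, dest_block_id, dest_latlon) for synthetic workers."""
    totals = {}
    for origin, dest_id, dest in pairs:
        s, n = totals.get(dest_id, (0.0, 0))
        totals[dest_id] = (s + haversine_miles(origin, dest), n + 1)
    return {dest_id: s / n for dest_id, (s, n) in totals.items()}

# Two made-up synthetic workers commuting to one hypothetical Minneapolis block:
pairs = [((44.97, -93.26), "block-1", (44.98, -93.27)),
         ((45.10, -93.30), "block-1", (44.98, -93.27))]
print(avg_commute_by_destination(pairs))
```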

Utility of the provably private algorithm: utility is measured by the average commute distance for each destination block. [Figure: short commutes have low error in both sparse and dense regions.]

Utility of the provably private algorithm: [Figure: long commutes in sparse regions are overestimated.]

OnTheMap: Summary. OnTheMap is a real Census application in which synthetically generated data is published for economic research; currently its privacy implications are poorly understood, and the parameters of the algorithm are a state secret. This is the first formal privacy analysis of the application: we analyzed the privacy of OnTheMap using variants of differential privacy and gave the first solutions for publishing useful information despite sparse data.

Next Class: No class on Wednesday, Oct 31. Guest lecture by Prof. Jerry Reiter (Duke Stats) on Friday, Nov 2: "Privacy in the U.S. Census and Synthetic Data Generation".

References: [M et al ICDE '08] A. Machanavajjhala, D. Kifer, J. Abowd, J. Gehrke, L. Vilhuber, "Privacy: From Theory to Practice on the Map", ICDE 2008. [M et al VLDB '09] A. Machanavajjhala, J. Gehrke, M. Götz, "Data Publishing against Realistic Adversaries", PVLDB 2(1), 2009.