Private Data Management with Verification Yan Chen Duke University Advisor: Ashwin Machanavajjhala
Outline: Motivation; Private Verification – differentially private regression diagnostics; Future work (ongoing): private verification on counting queries for data dependent algorithms; Future work (idea): private data synthesis; Summary
Data Privacy
Differential Privacy: Definition 1 (ε-Differential Privacy). A randomized algorithm M satisfies ε-differential privacy if, for any two neighboring datasets D1 and D2 and any set of outputs S, Pr[M(D1) ∈ S] ≤ e^ε · Pr[M(D2) ∈ S]. [C. Dwork et al., ICALP 2006]
Differential Privacy: Property 1 (Sequential Composition): If M1 and M2 satisfy ε1- and ε2-differential privacy respectively, then releasing the results of both M1(D) and M2(D) satisfies (ε1+ε2)-differential privacy. Property 2 (Parallel Composition): If D1 and D2 are subsets of D with D1 ∩ D2 = ∅, then releasing M1(D1) and M2(D2) satisfies max(ε1, ε2)-differential privacy. Property 3 (Post-processing): If M3 is any algorithm, then releasing M3(M1(D)) still satisfies ε1-differential privacy.
Laplace Mechanism: Definition 2 (Laplace Mechanism). For any function f: D -> R^n, the Laplace Mechanism M is defined as M(D) = f(D) + η, where η is a vector of independent random variables drawn from a Laplace distribution with scale parameter Δ(f) / ε, and Δ(f) is the global sensitivity of f. [C. Dwork et al., ICALP 2006]
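The definition above can be illustrated with a minimal sketch. It assumes the caller already knows the global sensitivity Δ(f); the names `laplace_noise` and `laplace_mechanism` and the example counts are hypothetical, not taken from the talk. Laplace noise is drawn as the difference of two exponential variables, which is a standard identity.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=None):
    """Return f(D) + eta, with eta_i ~ Laplace(sensitivity / epsilon).

    `true_answer` is the vector f(D); `sensitivity` is Delta(f) and must be
    supplied by the caller (assumption: it was derived analytically).
    """
    rng = random.Random() if rng is None else rng
    scale = sensitivity / epsilon
    return [value + laplace_noise(scale, rng) for value in true_answer]

# Hypothetical example: a 3-bin histogram, where adding or removing one
# record changes a single count by 1, so Delta(f) = 1.
counts = [120.0, 45.0, 7.0]
print(laplace_mechanism(counts, sensitivity=1.0, epsilon=0.5))
```

A smaller ε gives a larger noise scale, so the released counts are more private and less accurate.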
Private Data Management Framework (diagram components): Data Curator, Data Synthesizer, Querier, Verifier
Framework - Open Questions: 1. Differentially private algorithms for private verification on different tasks 2. Protection for data synthesis
Differentially Private Regression Diagnostics: Generate Model: algorithms exist for linear/logistic regression that ensure privacy. Evaluate Model (Regression Diagnostics): no privacy-preserving techniques exist for regression diagnostics.
Differentially Private Regression Diagnostics PriRP – Residual Plot (an error measure for linear regression) PriROC – ROC curve (an error measure for logistic regression)
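Residual Plot: Linear regression models the outcome as a linear function of the features plus noise. If b is the estimated model, the residual of each point is its observed outcome minus the outcome predicted by b. Residual plot: residuals vs. predicted values.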
Residual Plot
Private Residual Plot - PriRP: 1. Private Bounds Computation 2. Residual Plot Perturbation
Private Residual Plot - PriRP, Private Bounds Computation: The real bounds contain sensitive information about the data, and the sensitivity of the bounds is unbounded. Q: How can we identify bounds (-b, b) such that at least a θ fraction of the points lie within (-b, b) with high probability? Use an algorithm based on the Sparse Vector Technique (SVT) [C. Dwork 14] with queries qi: how many points lie within the bound (-u*2^i, u*2^i)?
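The bound search above can be sketched with the standard AboveThreshold form of SVT. This is a minimal illustration, not necessarily the exact PriRP algorithm: it assumes the dataset size is public, that each counting query qi has sensitivity 1, and that the search stops after a hypothetical `max_steps` doublings. The function names are made up for this example.

```python
import random

def laplace_noise(scale, rng):
    # Difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_bound(residuals, theta, epsilon, u=1.0, max_steps=40, rng=None):
    """Return the first b = u * 2^i whose noisy count reaches theta * n.

    AboveThreshold: the threshold gets Laplace(2/epsilon) noise once, and
    every query gets fresh Laplace(4/epsilon) noise; the search halts at the
    first noisy query that clears the noisy threshold.
    """
    rng = random.Random() if rng is None else rng
    noisy_target = theta * len(residuals) + laplace_noise(2.0 / epsilon, rng)
    b = u
    for i in range(max_steps):
        b = u * 2.0 ** i
        # q_i: number of residuals strictly inside (-b, b).
        q_i = sum(1 for r in residuals if -b < r < b)
        if q_i + laplace_noise(4.0 / epsilon, rng) >= noisy_target:
            return b
    # Fallback (assumption): give up and return the largest bound tried.
    return b
```

Because only the index of the first query above the threshold is released, the privacy cost does not grow with the number of doublings tried.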
Private Residual Plot - PriRP, Residual Plot Perturbation: Q: How can we estimate a 2D probability density inside a bounded region? 1. Discretization 2. Perturbation 3. Sampling
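The three steps above can be sketched as follows. This is an illustrative version under stated assumptions: an equal-width grid, Laplace noise with sensitivity 1 per cell (one record added or removed changes one count by 1), noisy counts rounded and clamped at zero, and points outside the bounded region simply dropped. The function name and parameters are hypothetical.

```python
import random

def laplace_noise(scale, rng):
    # Difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def perturb_residual_plot(points, x_range, y_range, epsilon, grid=10, rng=None):
    """points: (predicted value, residual) pairs; returns synthetic pairs."""
    rng = random.Random() if rng is None else rng
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    dx, dy = (x_hi - x_lo) / grid, (y_hi - y_lo) / grid

    # 1. Discretization: count the points that fall in each grid cell.
    counts = [[0] * grid for _ in range(grid)]
    for x, y in points:
        if x_lo <= x < x_hi and y_lo <= y < y_hi:
            i = min(int((x - x_lo) / dx), grid - 1)
            j = min(int((y - y_lo) / dy), grid - 1)
            counts[i][j] += 1

    synthetic = []
    for i in range(grid):
        for j in range(grid):
            # 2. Perturbation: add Laplace noise, then round and clamp at 0
            #    (rounding and clamping are post-processing).
            noisy = max(0, round(counts[i][j] + laplace_noise(1.0 / epsilon, rng)))
            # 3. Sampling: draw that many points uniformly inside the cell.
            for _ in range(noisy):
                synthetic.append((x_lo + (i + rng.random()) * dx,
                                  y_lo + (j + rng.random()) * dy))
    return synthetic
```

The synthetic points can then be drawn as an ordinary scatter plot in place of the real residual plot.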
Private Residual Plot - PriRP Empirical Evaluation (data scale = 5000)
Private Residual Plot - PriRP, Empirical Evaluation: Define the similarity between the real RP and the perturbed RP: discretize the bounds of the real RP into 10×10 equal-width grid cells, then compute the distribution of residuals over all grid cells c in the real RP and in the perturbed RP, denoted P(c) and P'(c).
Private Residual Plot - PriRP Empirical Evaluation
ROC curve
ROC curve: the ROC curve plots TPR vs. FPR over all possible thresholds θ. AUC: area under the curve.
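As a non-private baseline for the definitions above, here is a small sketch that computes ROC points and the AUC. It assumes labels are 0/1 and that a record is predicted positive when its score is strictly greater than θ; the function names are hypothetical.

```python
def roc_points(scores, labels, thresholds):
    """Return one (FPR, TPR) pair per threshold."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    points = []
    for t in thresholds:
        tpr = sum(s > t for s in pos) / len(pos)  # true positive rate
        fpr = sum(s > t for s in neg) / len(neg)  # false positive rate
        points.append((fpr, tpr))
    return points

def auc(points):
    """Area under the ROC curve via the trapezoidal rule."""
    pts = sorted(points)  # sort by FPR, then by TPR
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
thresholds = [k / 100 for k in range(101)]
print(auc(roc_points(scores, labels, thresholds)))
```

Both TPR and FPR can only decrease as θ grows, which is the monotonicity property the private version has to restore after adding noise.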
Private ROC Curve - PriROC: 1. Choosing Thresholds 2. Computing TPRs and FPRs 3. Ensuring Monotonicity
Private ROC Curve - PriROC, Choosing Thresholds: 1. Data independent strategy: fix |Θ| = N+1, Θ = {0, 1/N, …, (N-1)/N, 1}. Problem: performs poorly when the predictions are skewed. 2. Data dependent strategy: iteratively choose thresholds that evenly divide the data, i.e. iteratively find medians to use as thresholds (using smooth sensitivity and handling invalid thresholds).
Private ROC Curve - PriROC, Computing TPRs and FPRs: Compute the TPRs by answering prefix range queries over the chosen thresholds; the FPRs are computed similarly.
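One simple way to answer such queries is sketched below. It is an assumption of this example rather than the method from the talk: each class gets a histogram of scores between consecutive thresholds, each bin gets Laplace noise with sensitivity 1, and the noisy bins are summed from the top to obtain the count of records at or above each threshold. Positives and negatives are disjoint, so by parallel composition the whole release costs ε. All names are hypothetical.

```python
import bisect
import random

def laplace_noise(scale, rng):
    # Difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_rates(scores, thresholds, epsilon, rng):
    """Noisy fraction of `scores` at or above each threshold.

    `thresholds` must be sorted ascending and span the score range;
    the rate at the last threshold is fixed to 0.
    """
    k = len(thresholds) - 1
    counts = [0] * k
    for s in scores:
        # Bin i covers [thresholds[i], thresholds[i+1]).
        i = bisect.bisect_right(thresholds, s) - 1
        counts[min(max(i, 0), k - 1)] += 1
    noisy = [c + laplace_noise(1.0 / epsilon, rng) for c in counts]
    # above[i] = noisy number of scores >= thresholds[i] (sum of bins i..k-1).
    above = [0.0] * (k + 1)
    for i in range(k - 1, -1, -1):
        above[i] = above[i + 1] + noisy[i]
    total = max(above[0], 1.0)  # noisy class size, kept away from zero
    return [min(1.0, max(0.0, a / total)) for a in above]

def private_tpr_fpr(scores, labels, thresholds, epsilon, rng=None):
    rng = random.Random() if rng is None else rng
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    return (noisy_rates(pos, thresholds, epsilon, rng),
            noisy_rates(neg, thresholds, epsilon, rng))
```

The noisy rates are not guaranteed to be monotone in the threshold, which is why a separate consistency step is needed afterwards.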
Private ROC Curve - PriROC, Ensuring Monotonicity: To ensure monotonicity, apply the method from [Hay. VLDB 10].
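A common way to enforce an ordering constraint on noisy values is a least-squares projection onto monotone sequences, computed with the pool-adjacent-violators algorithm. The sketch below uses that technique as a stand-in; whether it matches the cited method in every detail is an assumption. Since it only touches already-noisy values, it is post-processing and costs no extra privacy budget.

```python
def make_nonincreasing(values):
    """Closest (in squared error) non-increasing sequence to `values`.

    Pool-adjacent-violators: keep blocks as [sum, count] and merge a block
    into its left neighbour whenever the left mean is smaller than the
    right mean, which would violate the non-increasing order.
    """
    blocks = []
    for v in values:
        blocks.append([float(v), 1])
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] < blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    result = []
    for s, c in blocks:
        result.extend([s / c] * c)  # every element takes its block mean
    return result

# Hypothetical noisy TPRs for ascending thresholds; 0.6 followed by 0.8
# violates monotonicity, so the two values are pooled to their mean.
print(make_nonincreasing([1.0, 0.6, 0.8, 0.2, 0.0]))
```

The same function can be applied to the noisy FPRs, since both rates should be non-increasing as the threshold grows.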
Private ROC Curve - PriROC Empirical Evaluation
Private ROC Curve - PriROC, Empirical Evaluation: AUC; Symmetric Difference
Future Work - Verification: Counting queries. 1. Data independent algorithms (easy), e.g. the Laplace Mechanism. 2. Data dependent algorithms (hard): the error is data dependent.
Future Work - Verification Definition: Sensitivity of Randomized Algorithm For any randomized algorithm A: D -> R with random variable stream N, we say the randomized algorithm A has sensitivity Δ, if for any two neighboring datasets D and D’, any fixed values of N, Theorem: If randomized algorithm A has sensitivity Δ, then satisfies ε-differential privacy and
Future Work - Verification: Another interesting problem: given an error bound, release the output only when its error is within that bound with high probability.
Future Work - Data Synthesis: Queries on the synthetic data release information about the synthetic data. Differentially private data synthesis is good in terms of privacy for the whole system, but adds too much noise. A weaker privacy definition? The data synthesis process should still be protected.
Future Work - Data Synthesis: What kind of weaker privacy definition can we use for generating synthetic data? Can the chosen weaker privacy definition be composed with differential privacy? How is the whole system protected? Even if the weaker privacy definition can be composed with differential privacy, what is the tightest composition result? For more complex data synthesis algorithms, can we empirically evaluate what they protect?
Summary: We present a framework for private data management with verification and pose some open questions. We start with query verification for differentially private regression diagnostics, proposing the first differentially private algorithms for it: PriRP (for linear regression) and PriROC (for logistic regression). We present our initial work on verification of data dependent algorithms for counting queries. We briefly outline the idea of private data synthesis as another future direction.