Privacy without Noise. Yitao Duan, NetEase Youdao R&D, Beijing, China. CIKM 2009
The Problem: Given a database d consisting of records about individual users, we wish to release some statistical information f(d) without compromising any individual's privacy.
Our Results: The mainstream approach relies on additive noise. We show that noise alone is neither sufficient nor, for some types of queries, necessary for privacy. The inherent uncertainty associated with unknown quantities is enough to provide the same privacy without external noise. We provide the first mathematical proof of, and conditions for, the widely accepted heuristic that aggregates are private.
Preliminaries: A database is d = (d_1, …, d_n) ∈ D^n, where D is an arbitrary domain. Each d_i is drawn i.i.d. from a public distribution. The Hamming distance H(d, d') between two databases d and d' is the number of entries on which they differ. Query: g(d_i) = [g_1(d_i), …, g_m(d_i)]^T, where each g_j : D → [0, 1].
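The definitions above can be made concrete with a minimal sketch. The record layout `(age, drinks)`, the component functions `g_drinks` and `g_scaled_age`, and the helper names are all hypothetical and chosen only for illustration. The sum form f(d) = Σ_i g(d_i) is assumed from the later "Sum Queries" slide.

```python
from typing import Callable, List, Sequence, Tuple

Record = Tuple[int, bool]  # hypothetical record layout: (age, drinks)


def hamming_distance(d: Sequence[Record], d_prime: Sequence[Record]) -> int:
    """H(d, d'): number of positions at which the two databases differ."""
    if len(d) != len(d_prime):
        raise ValueError("databases must contain the same number of records")
    return sum(1 for a, b in zip(d, d_prime) if a != b)


def sum_query(d: Sequence[Record],
              components: Sequence[Callable[[Record], float]]) -> List[float]:
    """f(d) = sum_i g(d_i), with g(d_i) = [g_1(d_i), ..., g_m(d_i)]^T."""
    totals = [0.0] * len(components)
    for record in d:
        for j, g_j in enumerate(components):
            value = g_j(record)
            if not 0.0 <= value <= 1.0:
                raise ValueError("each g_j must map into [0, 1]")
            totals[j] += value
    return totals


def g_drinks(record: Record) -> float:
    """Hypothetical component: 1 if the user drinks, else 0."""
    return 1.0 if record[1] else 0.0


def g_scaled_age(record: Record) -> float:
    """Hypothetical component: age clipped to [0, 100] and scaled into [0, 1]."""
    return min(max(record[0], 0), 100) / 100.0


# Two databases that differ in exactly one record (the second user).
d = [(25, True), (40, False), (31, True)]
d_prime = [(25, True), (40, True), (31, True)]

distance = hamming_distance(d, d_prime)
answer = sum_query(d, [g_drinks, g_scaled_age])
answer_prime = sum_query(d_prime, [g_drinks, g_scaled_age])
```

Because every g_j takes values in [0, 1], changing one record moves each component of the sum by at most 1, which is the sense in which a single individual's influence on f(d) is bounded.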
The Power of Addition: A large number of popular algorithms can be run with addition-only steps:
– Linear algorithms (voting, summation) and nonlinear algorithms (regression, classification, SVD, PCA, k-means, ID3, EM, etc.)
– All algorithms in the statistical query model
– Many other gradient-based numerical algorithms
The addition-only framework has very efficient private implementations in cryptography and admits efficient zero-knowledge proofs (ZKPs).
Notions of Privacy: But what do we mean by privacy? I don’t know how much you weigh, but I can find out that its highest digit is 2. Or: I don’t know whether you drink, but I can find out that people who drink are happier. The definition must meet people’s expectations and allow for rigorous mathematical reasoning.
Differential Privacy The risk to my privacy should not substantially increase as a result of participating in a statistical database:
Differential Privacy: A gives ε-differential privacy if, for all values of DB and Me and all transcripts t: e^{−ε} ≤ Pr[A(DB − Me) = t] / Pr[A(DB + Me) = t] ≤ e^{ε}
Differential Privacy: No perceptible risk is incurred by joining DB. Any information the adversary can obtain, it could obtain without Me (my data). [Figure: Pr[t]]
Differential Privacy w/ Additive Noise: [Diagram: f(d) and Noise are summed (Σ) to produce the Response] The noise must (1) be independently generated for each query and (2) have sufficiently large variance. It can be Laplace, Gaussian, or Binomial. But the variance of independent noise can be reduced via averaging. Fix: restrict the total number of queries, i.e., the dimensionality of f, to m.
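The averaging weakness noted on this slide can be illustrated with a small simulation. This is a sketch under stated assumptions, not a calibrated mechanism: the true answer `true_sum`, the Laplace scale `scale`, and the repeat count are hypothetical values, and the noise is not tied to any particular ε. Laplace samples are drawn as the difference of two exponential variates, which is a standard identity.

```python
import random
import statistics


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Laplace(0, scale) sample: difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_response(true_value: float, scale: float, rng: random.Random) -> float:
    """One response with fresh, independent noise."""
    return true_value + laplace_noise(scale, rng)


def averaged_response(true_value: float, scale: float, repeats: int,
                      rng: random.Random) -> float:
    """What an adversary obtains by repeating the query and averaging."""
    answers = [noisy_response(true_value, scale, rng) for _ in range(repeats)]
    return statistics.fmean(answers)


rng = random.Random(0)
true_sum = 42.0   # hypothetical true value of f(d)
scale = 5.0       # hypothetical Laplace scale
trials = 2000

# Average absolute error of a single noisy response.
mean_single_error = statistics.fmean(
    [abs(noisy_response(true_sum, scale, rng) - true_sum) for _ in range(trials)]
)
# Average absolute error after averaging 100 repeated responses.
mean_averaged_error = statistics.fmean(
    [abs(averaged_response(true_sum, scale, 100, rng) - true_sum)
     for _ in range(trials)]
)
```

The averaged error is much smaller than the single-response error, which is why the number of queries has to be capped when privacy rests on independent noise.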
But It Is Not Effective: [Diagram labels: d_j; m queries; 2m queries] If a user profile is shared among multiple databases, an adversary could obtain more queries about the user than differential privacy allows.
And It Is Not Necessary Either: There is another source of randomness that could provide protection similar to that of external noise: the data itself. Some functions are insensitive to small perturbations of the input.
Aggregates of n Random Variables Probability theory has many established results on the asymptotic behavior of aggregates of n random variables Under certain conditions, when n is sufficiently large, the aggregates converge in some way to a distribution independent of the individual samples except for a few distributional parameters.
Central Limit Theorem
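The convergence behind this slide can be checked numerically. The sketch below assumes, purely for illustration, that each record contributes an independent Bernoulli(p) value to the sum. The parameters `n`, `p`, and the trial count are hypothetical. It compares the simulated sums with the Gaussian predicted by the CLT and relates the largest possible change from one record to the spread of the aggregate.

```python
import math
import random
import statistics
from typing import List


def simulate_sums(n: int, p: float, trials: int, rng: random.Random) -> List[int]:
    """Draw `trials` databases of n Bernoulli(p) records and sum each one."""
    return [sum(1 for _ in range(n) if rng.random() < p) for _ in range(trials)]


rng = random.Random(1)
n, p = 1000, 0.3          # hypothetical database size and distribution
sums = simulate_sums(n, p, 2000, rng)

empirical_mean = statistics.fmean(sums)
empirical_std = statistics.pstdev(sums)

# CLT prediction: the sum is approximately N(n*p, n*p*(1-p)).
predicted_mean = n * p
predicted_std = math.sqrt(n * p * (1.0 - p))

# Fraction of simulated sums within one predicted standard deviation of the
# predicted mean; roughly 0.68 for a Gaussian.
within_one_std = sum(
    1 for s in sums if abs(s - predicted_mean) <= predicted_std
) / len(sums)

# Changing a single record moves the sum by at most 1.
max_individual_shift = 1.0
shift_relative_to_spread = max_individual_shift / predicted_std
```

With these values the shift caused by one record is a small fraction of the aggregate's standard deviation, which is the intuition for the data itself acting as a source of uncertainty.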
Differential Privacy: An Individual’s Perspective. Privacy is defined in terms of perturbations to an individual data record. Existing solutions achieve this via external noise. Each element is independently perturbed.
Sum Queries: With sum queries, when n is large, for each k, the quantity Δ_k = Σ_{i≠k} g(d_i) converges in distribution to a Gaussian (CLT). Since f(d) = g(d_k) + Δ_k for every k, can Δ_k provide similar protection? Compared against Lemma 1, the difference is that the perturbations to each element of g(d_k) are not independent.
Privacy without Noise: [Figure: two panels with axes x_1 and x_2, each centered at g(d_k), with σ marked] (a) Independent and (b) non-independent Gaussian perturbations in the 2-dimensional case. (b) has variance σ² along its minor axis. Note how the perturbation in (b) “envelops” that in (a).
Main Result: [theorem statement not preserved in this transcript], where [symbol not preserved] is the smallest eigenvalue of V
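Since the theorem statement is not preserved here, the sketch below only shows how the quantity it refers to could be computed. It assumes that V is the covariance matrix of the query vector g(d_i), which matches the "variance along the minor axis" picture but is not confirmed by the slide. The two-component queries and the uniform data distribution are hypothetical.

```python
import math
import random
from typing import List, Tuple


def covariance_2d(points: List[Tuple[float, float]]) -> Tuple[float, float, float]:
    """Sample covariance entries (v_xx, v_xy, v_yy) of 2-dimensional points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    v_xx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    v_yy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    v_xy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    return v_xx, v_xy, v_yy


def smallest_eigenvalue_2x2(v_xx: float, v_xy: float, v_yy: float) -> float:
    """Smallest eigenvalue of the symmetric matrix [[v_xx, v_xy], [v_xy, v_yy]]."""
    half_trace = (v_xx + v_yy) / 2.0
    radius = math.hypot((v_xx - v_yy) / 2.0, v_xy)
    return half_trace - radius


rng = random.Random(2)
raw = [(rng.random(), rng.random()) for _ in range(20000)]

# Hypothetical correlated queries: g_1 = u, g_2 = 0.5*u + 0.5*w.
correlated = [(u, 0.5 * u + 0.5 * w) for u, w in raw]
lambda_min_correlated = smallest_eigenvalue_2x2(*covariance_2d(correlated))

# Hypothetical redundant queries: the second query repeats the first.
redundant = [(u, u) for u, _ in raw]
lambda_min_redundant = smallest_eigenvalue_2x2(*covariance_2d(redundant))
```

For the correlated pair the smallest eigenvalue stays bounded away from zero, while for the redundant pair it collapses to zero: along that direction the aggregate carries no uncertainty.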
A Simple Necessary Condition: Suppose we have answered k queries, all of which are deemed safe. For the (k+1)-th query to be safe, the condition is [equation not preserved]. Adding a new row is [equation not preserved]
A Simple Necessary Condition: We know σ_{k+1}( ) = 0, so x_{k+1} must be “large” enough to perturb the singular value away from 0 by a sufficient amount. Using matrix perturbation theory (Weyl's theorem), we have
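Weyl's theorem for singular values states that |σ_i(A + E) − σ_i(A)| ≤ ‖E‖_2. The sketch below applies it to the smallest case, k = 1, under the assumption that the answered queries form the rows of a matrix and that the new query appends the row x_{k+1}. The slide's own inequality is not preserved, so this is an illustration of the theorem rather than of the paper's exact condition. With the new row zeroed out the smallest singular value is 0, so after appending x_{k+1} it can be at most ‖x_{k+1}‖_2. The row values are hypothetical.

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def smallest_singular_value_two_rows(row_a: Sequence[float],
                                     row_b: Sequence[float]) -> float:
    """sigma_2 of the 2-by-n matrix [row_a; row_b], via its 2-by-2 Gram matrix."""
    aa, ab, bb = dot(row_a, row_a), dot(row_a, row_b), dot(row_b, row_b)
    half_trace = (aa + bb) / 2.0
    radius = math.hypot((aa - bb) / 2.0, ab)
    # Clamp tiny negative values caused by floating-point rounding.
    smallest_eigenvalue = max(half_trace - radius, 0.0)
    return math.sqrt(smallest_eigenvalue)


answered_row = [1.0, 0.0, 1.0, 0.0]   # hypothetical row for the answered query

# A small new row: Weyl's bound caps sigma_2 at the row's Euclidean norm.
small_row = [0.1, 0.2, 0.0, 0.1]
sigma_small = smallest_singular_value_two_rows(answered_row, small_row)
norm_small = math.sqrt(dot(small_row, small_row))

# A large new row that is parallel to the answered one: sigma_2 stays at 0.
parallel_row = [0.9, 0.0, 0.9, 0.0]
sigma_parallel = smallest_singular_value_two_rows(answered_row, parallel_row)
norm_parallel = math.sqrt(dot(parallel_row, parallel_row))
```

The second case shows why a large norm is only a necessary condition: a row that merely rescales an earlier query has a large norm but leaves the smallest singular value at zero.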
Query Auditing: Instead of perturbing the responses, query auditing restricts the queries that can cause a privacy breach. One must be careful with denials. [Diagram: query q is answered with q(d) or DENY]
Simulatability: Key idea: if the adversary can simulate the output of the auditor using only public information, then nothing more is leaked. Denials: if the decision to deny or grant query answers is based on information that can be approximated by the adversary, then the decision itself does not reveal more information.
Simulatable Query Auditing: Previous schemes achieve simulatability by not using the data. Using our condition to verify privacy in online query auditing is simulatable. Even though the data is used in the decision-making process, the information is still simulatable.
Simulatable Query Auditing The auditor: The simulator:
Simulatable Query Auditing Using law of large numbers, and Weyl’s theorem (again!), we can prove that when n is large, for any
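The simulatability argument can be illustrated with a rough sketch. The auditor and simulator definitions are not preserved in this transcript, so everything below is an assumption: the auditor is taken to threshold the smallest eigenvalue of the sample covariance of g(d_i) on the real data, and the simulator applies the same test to data it draws itself from the public distribution. The threshold `THRESHOLD`, the uniform distribution, and the query functions are hypothetical stand-ins.

```python
import math
import random
from typing import Callable, List, Tuple

Record = Tuple[float, float]
Query = Callable[[Record], Tuple[float, float]]

THRESHOLD = 0.01  # hypothetical stand-in for the paper's privacy condition


def draw_database(n: int, rng: random.Random) -> List[Record]:
    """Assumed public distribution: two i.i.d. uniform[0, 1] attributes."""
    return [(rng.random(), rng.random()) for _ in range(n)]


def smallest_cov_eigenvalue(d: List[Record], g: Query) -> float:
    """Smallest eigenvalue of the 2-by-2 sample covariance of g(d_i)."""
    points = [g(record) for record in d]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    v_xx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    v_yy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    v_xy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    return (v_xx + v_yy) / 2.0 - math.hypot((v_xx - v_yy) / 2.0, v_xy)


def allow(d: List[Record], g: Query) -> bool:
    """Grant the query only if the eigenvalue test passes."""
    return smallest_cov_eigenvalue(d, g) >= THRESHOLD


def mixed_query(record: Record) -> Tuple[float, float]:
    u, w = record
    return (u, 0.5 * u + 0.5 * w)


def redundant_query(record: Record) -> Tuple[float, float]:
    u, _ = record
    return (u, u)


real_db = draw_database(5000, random.Random(10))       # seen by the auditor only
simulated_db = draw_database(5000, random.Random(99))  # drawn by the simulator

auditor_allows_mixed = allow(real_db, mixed_query)
simulator_allows_mixed = allow(simulated_db, mixed_query)
auditor_allows_redundant = allow(real_db, redundant_query)
simulator_allows_redundant = allow(simulated_db, redundant_query)
```

In this sketch the two parties reach the same decisions without the simulator ever seeing the real records. By the law of large numbers the sample covariance concentrates around the population covariance when n is large, so a decision based on it reveals little beyond what the public distribution already implies.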
Issue of Shared Records: We are not totally immune to this vulnerability, but our privacy condition is actually stronger than simply restricting the number of queries, even though we do not add noise. An adversary gets less information about individual records from the same number of queries.
More info: Full version of the paper: /pwn-full.pdf