Privacy without Noise Yitao Duan NetEase Youdao R&D Beijing China CIKM 2009.

The Problem Given a database d consisting of records about individual users, we wish to release some statistical information f(d) without compromising any individual's privacy

Our Results The mainstream approach relies on additive noise. We show that this alone is neither sufficient nor, for some types of queries, necessary for privacy. The inherent uncertainty associated with unknown quantities is enough to provide the same privacy without external noise. We provide the first mathematical proof, and conditions, for the widely accepted heuristic that aggregates are private

Preliminaries A database is d = (d_1, …, d_n) ∈ D^n, where D is an arbitrary domain and each d_i is drawn i.i.d. from a public distribution. The Hamming distance H(d, d') between two databases d, d' is the number of entries on which they differ. Query: g(d_i) = [g_1(d_i), …, g_m(d_i)]^T, where each g_j: D → [0, 1]

The Power of Addition A large number of popular algorithms can be run with addition-only steps: linear algorithms such as voting and summation; nonlinear algorithms such as regression, classification, SVD, PCA, k-means, ID3, and EM; all algorithms in the statistical query model; and many other gradient-based numerical algorithms. The addition-only framework has very efficient private implementations in cryptography and admits efficient zero-knowledge proofs (ZKPs)
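The addition-only framework the slide refers to can be illustrated with a minimal additive secret-sharing sketch (the helper names and the modulus choice are illustrative, not the paper's implementation):

```python
import random

PRIME = 2**61 - 1  # large prime modulus (illustrative choice)

def share(value, n_parties, rng):
    """Split `value` into n additive shares that sum to it mod PRIME."""
    shares = [rng.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    return sum(shares) % PRIME

rng = random.Random(0)
secrets = [12, 7, 30]  # each user's private input
# Each user shares their input; each party sums the shares it holds locally,
# so only the aggregate (never an individual input) is ever reconstructed.
all_shares = [share(s, 3, rng) for s in secrets]
per_party_sums = [sum(col) % PRIME for col in zip(*all_shares)]
total = reconstruct(per_party_sums)
```

Any algorithm whose private steps reduce to such sums (voting, summation, gradient accumulation) fits this pattern.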

Notions of Privacy But what do we mean by privacy? I don't know how much you weigh, but I can find out that its leading digit is 2. Or, I don't know whether you drink or not, but I can find out that people who drink are happier. The definition must meet people's expectations and allow for rigorous mathematical reasoning

Differential Privacy The risk to my privacy should not substantially increase as a result of participating in a statistical database.

Differential Privacy A gives ε-differential privacy if for all values of DB and Me and all transcripts t: Pr[A(DB + Me) = t] / Pr[A(DB − Me) = t] ≤ e^ε

Differential Privacy No perceptible risk is incurred by joining the DB: any information the adversary can obtain, it could obtain without Me (my data).

Differential Privacy w/ Additive Noise Response = f(d) + Noise. The noise must be (1) independently generated for each query and (2) of sufficiently large variance; it can be Laplace, Gaussian, or Binomial. But the variance of independent noise can be reduced via averaging. Fix: restrict the total number of queries, i.e., the dimensionality of f, to m
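A minimal sketch of the additive-noise mechanism described above, using Laplace noise calibrated to sensitivity/ε (the function names are ours, not the paper's):

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) via the inverse-CDF transform."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_sum(values, epsilon, sensitivity, rng):
    """Answer a sum query with Laplace noise of scale sensitivity/epsilon."""
    scale = sensitivity / epsilon
    return sum(values) + laplace_noise(scale, rng)

data = [0, 1, 1, 0, 1]  # each g_j(d_i) lies in [0, 1], so the sum's sensitivity is 1
ans = noisy_sum(data, epsilon=0.5, sensitivity=1.0, rng=random.Random(42))
```

Fresh, independent noise must be drawn for every query answered, which is exactly what the averaging attack on the next slide exploits.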

But It Is Not Effective If a user profile is shared among multiple databases, an adversary could get more queries about the user than differential privacy allows (e.g., m queries about record d_j to each of two databases yield 2m queries in total)
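The underlying averaging attack is easy to demonstrate: repeating the same query with fresh independent noise lets the adversary shrink the noise by 1/m. A toy simulation (parameters are illustrative):

```python
import math
import random

def laplace(scale, rng):
    """Sample Laplace(0, scale) via the inverse-CDF transform."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

true_answer = 100.0
rng = random.Random(7)
# Ask the same query 10,000 times; each answer gets fresh Laplace(0, 2) noise.
answers = [true_answer + laplace(2.0, rng) for _ in range(10_000)]
estimate = sum(answers) / len(answers)
# The averaged estimate is far closer to the truth than any single answer:
# the variance of the mean shrinks as 1/m.
```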

And It Is Not Necessary Either There is another source of randomness that can provide protection similar to external noise: the data itself. Some functions are insensitive to small perturbations of the input

Aggregates of n Random Variables Probability theory has many established results on the asymptotic behavior of aggregates of n random variables. Under certain conditions, when n is sufficiently large, the aggregates converge in distribution to a limit that is independent of the individual samples except for a few distributional parameters

Central Limit Theorem
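The formula on this slide did not survive as text; the classical statement being invoked is:

```latex
% Classical CLT: for i.i.d. $X_1,\dots,X_n$ with mean $\mu$ and
% variance $\sigma^2 < \infty$,
\[
\frac{1}{\sigma\sqrt{n}} \sum_{i=1}^{n} \left( X_i - \mu \right)
\;\xrightarrow{\;d\;}\;
\mathcal{N}(0, 1)
\qquad \text{as } n \to \infty .
\]
```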

Differential Privacy: An Individual's Perspective Privacy is defined in terms of perturbation of an individual's data record. Existing solutions achieve this via external noise: each element is independently perturbed

Sum Queries With sum queries f(d) = Σ_i g(d_i), when n is large, for each k the quantity Δ_k = Σ_{i≠k} g(d_i) converges in distribution to a Gaussian (CLT). Since f(d) = g(d_k) + Δ_k for every k, can Δ_k provide similar protection? Compared against Lemma 1, the difference is that the perturbations to each element of g(d_k) are not independent
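A toy simulation of this idea (Bernoulli records and hypothetical parameters, not the paper's construction): the sum of the other n−1 records already behaves like Gaussian noise around any one record's contribution.

```python
import math
import random

rng = random.Random(1)
n, trials = 2_500, 400
# Each record contributes a Bernoulli(0.5) bit. Delta_k = sum over i != k
# has mean (n-1)/2 and std sqrt((n-1)/4) ~ 25, so record k's single bit
# is hidden inside noise-like variation an order of magnitude larger.
deltas = [sum(rng.randint(0, 1) for _ in range(n - 1)) for _ in range(trials)]
mean = sum(deltas) / trials
std = math.sqrt(sum((d - mean) ** 2 for d in deltas) / trials)
```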

Privacy without Noise [Figure: (a) independent and (b) non-independent Gaussian perturbations of g(d_k) in the 2-dimensional case; (b) has variance σ² along its minor axis. Note how the perturbation in (b) "envelops" that in (a).]

Main Result …, where λ_min(V) is the smallest eigenvalue of V

A Simple Necessary Condition Suppose we have answered k queries, all deemed safe. Answering the (k+1)-th query amounts to adding a new row x_{k+1} to the query matrix, and the condition for it to be safe bounds the effect of that row

A Simple Necessary Condition We know σ_{k+1}(·) = 0 for the k-row matrix, so x_{k+1} must be "large" enough to perturb this singular value away from 0 by a sufficient amount. Using matrix perturbation theory (Weyl's theorem), we obtain the required bound
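A sketch of the singular-value check this slide motivates, specialized to 2-column query matrices so it stays pure-Python (the threshold is hypothetical; a real auditor would derive it from the privacy condition and use a library SVD):

```python
import math

def smallest_singular_value(rows):
    """Smallest singular value of a k x 2 query matrix, via its 2x2 Gram matrix."""
    a = sum(x * x for x, _ in rows)
    c = sum(y * y for _, y in rows)
    b = sum(x * y for x, y in rows)
    tr, det = a + c, a * c - b * b
    smallest_eig = (tr - math.sqrt(max(tr * tr - 4.0 * det, 0.0))) / 2.0
    return math.sqrt(max(smallest_eig, 0.0))

THRESHOLD = 0.5  # hypothetical safety bound from the privacy condition

def audit(rows, new_row):
    """Grant the new query only if it leaves the query matrix well-conditioned."""
    return smallest_singular_value(rows + [new_row]) >= THRESHOLD
```

A repeated query row leaves the smallest singular value at 0 and is denied; a row that adds a genuinely new direction perturbs it away from 0 and is granted.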

Query Auditing Instead of perturbing the responses, query auditing restricts queries that could cause a privacy breach: each query q is either answered exactly with q(d) or denied. One must be careful with denials, since a denial itself can leak information

Simulatability Key idea: if the adversary can simulate the output of the auditor using only public information, then nothing more is leaked. Denials: if the decision to deny or grant query answers is based on information that can be approximated by the adversary, then the decision itself does not reveal more information

Simulatable Query Auditing Previous schemes achieve simulatability by not using the data. Verifying privacy with our condition in online query auditing is also simulatable: even though the data is used in the decision-making process, the relevant information can still be simulated
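A toy sketch of why a data-dependent decision can still be simulatable (the condition here is a hypothetical mean threshold, standing in for the paper's singular-value condition): because the statistic concentrates around a value determined by the public distribution, an adversary who samples its own database predicts the auditor's decision.

```python
import random

def grant(data, threshold=0.45):
    """Hypothetical data-dependent auditing condition: grant iff the sample
    mean exceeds the threshold."""
    return sum(data) / len(data) > threshold

n = 50_000
rng_real, rng_sim = random.Random(3), random.Random(4)
real = [rng_real.randint(0, 1) for _ in range(n)]       # the actual database
simulated = [rng_sim.randint(0, 1) for _ in range(n)]   # adversary's fresh sample
# Both sample means concentrate near the public distribution's mean (0.5),
# so the simulator reproduces the auditor's grant/deny decision without
# ever seeing the real data.
```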

Simulatable Query Auditing The auditor: … The simulator: …

Simulatable Query Auditing Using the law of large numbers and Weyl's theorem (again!), we can prove that when n is large, the simulator's decisions agree with the auditor's with high probability

Issue of Shared Records We are not totally immune to this vulnerability, but our privacy condition is actually stronger than simply restricting the number of queries, even though we add no noise: an adversary gets less information about individual records from the same number of queries

More info: Full version of the paper: /pwn-full.pdf