From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases
Shuchi Chawla, Cynthia Dwork, Frank McSherry, Adam Smith, Larry Stockmeyer, Hoeteck Wee

Database Privacy

Census data, a prototypical example:
- Individuals provide information
- The census bureau publishes sanitized records
- Privacy is legally mandated; what utility can we achieve?

Our goal:
- What do we mean by preservation of privacy?
- Characterize the trade-off between privacy and utility: disguise individual identifying information while preserving macroscopic properties
- Develop a "good" sanitizing procedure with theoretical guarantees

An outline of this talk

A mathematical formalism:
- What do we mean by privacy? Prior work
- An abstract model of datasets
- Isolation; good sanitizations

A candidate sanitization:
- A brief overview of results
- A general argument for privacy of n-point datasets

Open issues and concluding remarks

Privacy... a philosophical viewpoint

[Ruth Gavison] Privacy "... includes protection from being brought to the attention of others ..."
- Matches intuition; inherently desirable
- Attention invites further loss of privacy
- Privacy is assured to the extent that one blends in with the crowd

An appealing definition, and one that can be converted into a precise mathematical statement!

Database Privacy

Statistical approaches:
- Alter the frequency (PRAN/DS/PERT) of particular features, while preserving means
- Additionally, erase values that reveal too much

Query-based approaches (involve a permanent trusted third party):
- Query monitoring: disallow queries that breach privacy
- Perturbation: add noise to the query output [Dinur Nissim '03, Dwork Nissim '04]

Statistical perturbation + adversarial analysis:
- [Evfimievski et al. '03] combine statistical techniques with analysis similar to query-based approaches
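To make "add noise to the query output" concrete, here is a minimal sketch of output perturbation. The Laplace noise and the noisy_count/predicate names are our own illustration, not the exact mechanism analyzed in the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_count(records, predicate, noise_scale=5.0):
    """Answer a count query with additive noise (output perturbation).

    The Laplace distribution is illustrative; the cited papers analyze
    other noise distributions and magnitudes.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + rng.laplace(scale=noise_scale)

# Hypothetical schema: how many respondents are over 40?
records = [{"age": int(a)} for a in rng.integers(18, 90, size=1000)]
print(noisy_count(records, lambda r: r["age"] > 40))
```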

Everybody's First Suggestion

Learn the distribution, then output:
- a description of the distribution, or
- samples from the learned distribution

But we want to reflect facts on the ground:
- Statistically insignificant facts can be important for allocating resources

A geometric view

Abstraction:
- Points in a high-dimensional metric space, say R^d, drawn i.i.d. from some distribution
- Points are unlabeled; you are your collection of attributes
- Distance is everything

Real Database (RDB): private; n unlabeled points in d-dimensional space.
Sanitized Database (SDB): public; n' new points, possibly in a different space.

The adversary, or isolator

- Using the SDB and auxiliary information (AUX), the adversary outputs a point q
- q "isolates" a real point x if it is much closer to x than to x's neighbors: writing δ_x = |q − x|, q isolates x if B(q, cδ_x) contains fewer than T RDB points
- T-radius of x: the distance from x to its T-th nearest neighbor
- x is "safe" if δ_x > (T-radius of x)/(c − 1), for then B(q, cδ_x) contains x's entire T-neighborhood
- c is the privacy parameter, e.g. c = 4; large T and small c is good
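A minimal code sketch of these definitions, assuming Euclidean distance; the names t_radius and isolates are ours, not the paper's.

```python
import numpy as np

def t_radius(rdb, i, T):
    """Distance from rdb[i] to its T-th nearest neighbor in the RDB."""
    dists = np.sort(np.linalg.norm(rdb - rdb[i], axis=1))
    return dists[T]  # dists[0] == 0 is the point itself

def isolates(q, x, rdb, c=4, T=10):
    """Does the adversary's point q (c, T)-isolate the real point x?

    q isolates x if the ball B(q, c * |q - x|) contains fewer than T
    RDB points, i.e. x fails to blend into a crowd of T.
    """
    delta = np.linalg.norm(q - x)
    in_ball = np.linalg.norm(rdb - q, axis=1) <= c * delta
    return in_ball.sum() < T
```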

A good sanitization

- A sanitizing algorithm compromises privacy if the adversary can considerably increase its probability of isolating a point by looking at the algorithm's output
- A rigorous (and admittedly ideal) definition: for every distribution D and every isolator I there exists an adversary simulator I' such that, with overwhelming probability over RDB ∈_R D^n, for all auxiliary information z and all x ∈ RDB: |Pr[I(SDB, z) isolates x] − Pr[I'(z) isolates x]| ≤ ε/n
- The definition of ε can be forgiving: say, 2^−Ω(d), or even 1 in 1000
- Quantification over x: if aux reveals information about some x, the privacy of every other y should still be preserved
- Provides a framework for describing the power of a sanitization method, and hence for comparisons
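Typeset, the definition on this slide reads:

```latex
\forall D \;\; \forall \mathcal{I} \;\; \exists \mathcal{I}'
\text{ such that, w.o.p.\ over } \mathrm{RDB} \in_R D^n:\quad
\forall z \;\; \forall x \in \mathrm{RDB}:\;
\bigl|\Pr[\mathcal{I}(\mathrm{SDB}, z)\text{ isolates } x]
     - \Pr[\mathcal{I}'(z)\text{ isolates } x]\bigr| \le \frac{\varepsilon}{n}
```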

The Sanitizer

- The privacy of x is linked to its T-radius: randomly perturb x in proportion to its T-radius
- x' = San(x) ∈_R S(x, T-rad(x)), i.e. x' is a random point on the sphere of radius T-rad(x) centered at x
- Intuition: we are blending x in with its crowd
  - If the number of dimensions d is large, there are "many" pre-images for x'; the adversary cannot conclusively pick any one
  - We are adding random noise with mean zero to x, so several macroscopic properties should be preserved
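A minimal sketch of this perturbation, assuming S(x, r) denotes the uniform distribution on the sphere of radius r around x; normalizing an isotropic Gaussian vector gives a uniform direction on that sphere. Function names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def t_radius(rdb, i, T):
    """Distance from rdb[i] to its T-th nearest neighbor."""
    dists = np.sort(np.linalg.norm(rdb - rdb[i], axis=1))
    return dists[T]

def sanitize_point(rdb, i, T=10):
    """Perturb rdb[i] to a random point on the sphere of radius
    T-radius(rdb[i]) around it: x' in_R S(x, T-rad(x))."""
    x = rdb[i]
    r = t_radius(rdb, i, T)
    v = rng.normal(size=x.shape)   # isotropic Gaussian direction
    v /= np.linalg.norm(v)         # now uniform on the unit sphere
    return x + r * v

# Toy example: sanitize an RDB of n = 500 points in d = 50 dimensions
rdb = rng.normal(size=(500, 50))
sdb = np.array([sanitize_point(rdb, i) for i in range(len(rdb))])
```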

Results on privacy: an overview

Distribution                                     | Num. of points | Revealed to adversary                      | Auxiliary information
Uniform on surface of sphere                     | 2              | Both sanitized points                      | Distribution, 1-radius
Uniform over a bounding box or surface of sphere | n              | One sanitized point, all other real points | Distribution, all real points
Gaussian                                         | 2^o(d)         | n sanitized points                         | Distribution
Gaussian                                         | 2^Ω(d)         | (work in progress)                         |

Results on utility: an overview

Setting        | Objective                                   | Assumptions            | Result
Worst-case     | Find k clusters minimizing largest diameter | None                   | Optimal diameter, as well as approximations, increases by at most a factor of 3
Distributional | Find k maximum-likelihood clusters          | Mixture of k Gaussians | Correct clustering with high probability, as long as the means are pairwise sufficiently far apart

A special case: one sanitized point

- RDB = {x_1, ..., x_n}
- The adversary is given n − 1 real points x_2, ..., x_n and one sanitized point x'_1; T = 1; c = 4; "flat" prior
- Recall: x'_1 ∈_R S(x_1, |x_1 − y|), where y is the nearest neighbor of x_1
- Main idea: consider the posterior distribution on x_1, and show that the adversary cannot isolate a large probability mass under this distribution

A special case: one sanitized point

- Let Z = { p ∈ R^d | p is a legal pre-image for x'_1 } and Q = { p | if x_1 = p, then x_1 is isolated by q }; in the figure, Q is the region where |p − q| ≤ (1/3)|p − x'_1|
- We show that Pr[Q ∩ Z | x'_1] ≤ 2^−Ω(d) · Pr[Z | x'_1]
- Pr[x_1 ∈ Q ∩ Z | x'_1] = (probability mass contributed by Q ∩ Z)/(contribution from Z) ≤ 2^{1−d}/(1/4) = 2^{3−d}

Contribution from Z

- Pr[x_1 = p | x'_1] ∝ Pr[x'_1 | x_1 = p] ∝ 1/r^d, where r = |x'_1 − p|: as r increases, x'_1 gets randomized over a larger area, proportional to r^d, hence the inverse dependence
- Pr[x'_1 | x_1 ∈ S] ∝ ∫_S 1/r^d ∝ the solid angle subtended by S at x'_1
- Z subtends a solid angle equal to at least half a sphere at x'_1
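Typeset, the computation on this slide reads:

```latex
\begin{aligned}
\Pr[x_1 = p \mid x_1'] &\propto \Pr[x_1' \mid x_1 = p] \propto \frac{1}{r^{d}},
  \qquad r = |x_1' - p|,\\
\Pr[x_1' \mid x_1 \in S] &\propto \int_S \frac{1}{r^{d}}\,dp
  \;\propto\; \text{solid angle subtended by } S \text{ at } x_1'.
\end{aligned}
```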

Contribution from Q ∩ Z

- The ellipsoid Q ∩ Z is roughly as far from x'_1 as its longest radius
- The contribution from the ellipsoid is therefore ∝ 2^−d times the total solid angle
- Therefore, Pr[x_1 ∈ Q ∩ Z] / Pr[x_1 ∈ Z] ≲ 2^−d

The general case: n sanitized points

- The initial intuition is wrong: privacy of x_1 given x'_1 and all the other points in the clear does not imply privacy of x_1 given x'_1 and sanitizations of the others!
  - Sanitization is non-oblivious: other sanitized points reveal information about x if x is their nearest neighbor
- Where we are now (see the sketch below):
  - Consider some example of a safe sanitization (not necessarily using perturbations): density regions? histograms?
  - Relate perturbations to that safe sanitization
  - For the uniform distribution, a histogram over fixed-size cells gives an exponentially low probability of isolation
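A minimal sketch of the histogram idea for the uniform case: partition the domain into fixed-size cells and release only cell counts. The cell width and function name are our own illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def histogram_sanitize(rdb, cell_width=0.25):
    """Release only the number of RDB points in each fixed-size cell.

    Each released count describes a crowd of points rather than any
    individual, which is the sense in which such a sanitization can
    be "safe" against isolation.
    """
    cells = np.floor(rdb / cell_width).astype(int)
    counts = {}
    for cell in map(tuple, cells):
        counts[cell] = counts.get(cell, 0) + 1
    return counts

# Toy example: data uniform on the unit square
rdb = rng.uniform(size=(1000, 2))
print(sorted(histogram_sanitize(rdb).items())[:5])
```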

Future directions

- Extend the privacy argument to other "nice" distributions: for what distributions is there no meaningful privacy-utility trade-off?
- Characterize acceptable auxiliary information: think of auxiliary information as an a priori distribution
- The low-dimensional case: is it inherently impossible?
- Discrete-valued attributes: our proofs require a "spread" in all attributes
- Extend the utility argument to other interesting macroscopic properties, e.g. correlations

Conclusions

- A first step toward understanding the privacy-utility trade-off
- A general and rigorous definition of privacy
- A work in progress!

Questions?