From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases
Shuchi Chawla, Cynthia Dwork, Frank McSherry, Adam Smith, Larry Stockmeyer, Hoeteck Wee
Work done at Microsoft Research, SVC

Slide 2: Database Privacy

- Census data: a prototypical example
  - Individuals provide information; the census bureau publishes sanitized records
  - Privacy is legally mandated; what utility can we achieve?
- Inherent privacy vs. utility trade-off
  - One extreme: complete privacy, no information
  - Other extreme: complete information, no privacy
- Goals
  - Find a middle path: preserve macroscopic properties while "disguising" individual identifying information
  - Change the nature of the discourse
  - Establish a framework for meaningful comparison of techniques

Slide 3: Current Solutions

- Statistical approaches
  - Alter the frequency (PRAN/DS/PERT) of particular features while preserving means; additionally, erase values that reveal too much
- Query-based approaches
  - Perturb output, or disallow queries that breach privacy
- Unsatisfying
  - Overly constrained definitions; ad hoc techniques
  - Ad hoc treatment of external sources of information
  - Erasure can disclose information; refusal to answer may be revelatory

Slide 4: Our Approach

- Crypto-flavored definitions
  - Mathematical characterization of the adversary's goal
  - Precise definition of when a sanitization procedure fails
  - Intuition: seeing the sanitized DB gives the adversary an "advantage"
- Statistical techniques
  - Perturbation of attribute values
  - Differs from previous work: perturbation amounts depend on local densities of points
- Highly abstracted version of the problem
  - If we can't understand this, we can't understand real life
  - If we get negative results here, the world is in trouble

Slide 5: An Outline of This Talk

- A mathematical formalism
  - What do we mean by privacy?
  - An abstract model of datasets
  - Isolation; good sanitizations
- A candidate sanitization
  - Privacy for the 2-point case
  - General argument for privacy of n-point datasets
  - A brief overview of results
- Open issues; moving on to real-world applications

Slide 6: What Do WE Mean by Privacy?

- [Ruth Gavison] Protection from being brought to the attention of others
  - Inherently valuable; attention invites further privacy loss
- Privacy is assured to the extent that one blends in with the crowd
- An appealing definition, and one that can be converted into a precise mathematical statement…

Slide 7: A Geometric View

- Abstraction: points in a high-dimensional metric space, say R^d, drawn i.i.d. from some distribution
  - Points are unlabeled; you are your collection of attributes
  - Distance is everything: points are similar if and only if they are close (L2 norm)
- Real Database (RDB), private: n unlabeled points in d-dimensional space
- Sanitized Database (SDB), public: n′ new points, possibly in a different space

Slide 8: The Adversary, or Isolator

- Using the SDB and auxiliary information (AUX), the adversary outputs a point q
- q "isolates" a real point x if it is much closer to x than to x's neighbors
  - Even if q looks similar to x, it may fail to isolate x if it looks just as similar to x's neighbors
  - Tightly clustered points have a smaller radius of isolation
[Figure: the RDB with examples of isolating and non-isolating points q]

Shuchi Chawla 9 (c-1)   I(SDB,AUX) = q  x is isolated if B(q,c  ) contains less than T points  T-radius of x – distance to its T-nearest neighbor  x is “safe” if  x > (T-radius of x)/(c-1) B(q,c  x ) contains x’s entire T-neighborhood c – privacy parameter; eg. 4 q x  cc The adversary or Isolator large T and small c is good

Slide 10: A Good Sanitization

- There is no way of obtaining privacy if AUX already reveals too much!
- The sanitizing algorithm compromises privacy if the adversary can considerably increase his probability of isolating a point by looking at its output
- The definition of "considerably" can be forgiving, say n^−2
- A rigorous definition: ∀I ∀D ∀aux z ∀x ∃I′ such that |Pr[I(SDB, z) succeeds on x] − Pr[I′(z) succeeds on x]| is small
- Provides a framework for describing the power of a sanitization method, and hence for comparisons

Slide 11: The Sanitizer

- The privacy of x is linked to its T-radius
- Randomly perturb x in proportion to its T-radius
- x′ = San(x) ∈_R B(x, T-radius(x))
[Figure: example with T = 1]

Slide 12: The Sanitizer (continued)

- The privacy of x is linked to its T-radius
- Randomly perturb x in proportion to its T-radius: x′ = San(x) ∈_R B(x, T-radius(x))
- Intuition: we are blending x in with its crowd
  - If the number of dimensions d is large, there are "many" pre-images for x′; the adversary cannot conclusively pick any one
  - We are adding random noise with mean zero to x, so several macroscopic properties should be preserved
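A minimal sketch of this perturbation, assuming T = 1 and Euclidean distance; the function names are mine, and this illustrates the idea rather than reproducing the authors' implementation. Each point is resampled uniformly from the ball around it whose radius is its T-radius.

```python
import numpy as np

def uniform_in_ball(center, radius, rng):
    """Uniform sample from the ball B(center, radius) in R^d."""
    d = center.shape[0]
    direction = rng.standard_normal(d)
    direction /= np.linalg.norm(direction)
    r = radius * rng.random() ** (1.0 / d)   # uniform over the ball's volume
    return center + r * direction

def sanitize(rdb, T=1, rng=None):
    """x' = San(x), drawn uniformly from B(x, T-radius(x)), for each row x."""
    rng = rng or np.random.default_rng()
    sdb = np.empty_like(rdb, dtype=float)
    for i, x in enumerate(rdb):
        dists = np.sort(np.linalg.norm(rdb - x, axis=1))
        sdb[i] = uniform_in_ball(x, dists[T], rng)   # dists[0] == 0 is x itself
    return sdb
```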

Slide 13: Flavor of Results (Preliminary)

- Assumptions
  - Data arises from a mixture of Gaussians
  - The dimension d and the number of points n are large; d = Ω(log n)
- Results
  - Privacy: an adversary who knows the Gaussians and some auxiliary information cannot isolate any point with probability more than 2^−Ω(d) (several special cases; the general result is not yet proved)
  - Utility: an honest user who does not know the Gaussians can compute the means with high probability

Slide 14: The "Simplest" Interesting Case

- RDB = {x, y}, with x, y ∈_R B(o, ρ), where o is the origin
- T = 1; c = 4; SDB = {x′, y′}
- The adversary knows x′, y′, ρ, and δ = |x − y|
- We show: there are m = 2^Ω(d) "decoy" pairs (x_i, y_i)
  - The (x_i, y_i) are legal pre-images of (x′, y′); that is, |x_i − y_i| = δ and Pr[x_i, y_i | x′, y′] = Pr[x, y | x′, y′]
  - The adversary cannot know which of the (x_i, y_i) represents reality
  - The adversary can only isolate one point in {x_1, y_1, …, x_m, y_m} at a time

Slide 15: The "Simplest" Interesting Case (continued)

- Consider a hyperplane H through x′, y′, and o
- Let x_H, y_H be the mirror reflections of x, y through H
  - Note: reflections preserve distances!
- The world of x_H, y_H looks identical to the world of x, y: Pr[x_H, y_H | x′, y′] = Pr[x, y | x′, y′]
[Figure: x, y, their sanitizations x′, y′, and the reflections x_H, y_H across H]

Slide 16: The "Simplest" Interesting Case (continued)

- Consider a hyperplane H through x′, y′, and o; x_H, y_H are the mirror reflections of x, y through H (reflections preserve distances, so the world of x_H, y_H looks identical to the world of x, y)
- How many different H are there such that the corresponding x_H are pairwise distant?
  - Reflections at angle 2θ and distance r from the hyperplanes' intersection are 2r·sin θ apart; it suffices to pick r = (2/3)δ and θ = 30°
  - Fact: there are 2^Ω(d) vectors in d dimensions at angle 60° from each other
- Probability that the adversary wins ≤ 2^−Ω(d)
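The reflection step can be sanity-checked numerically. The sketch below (my own illustration; the constants are arbitrary) reflects x and y across a hyperplane through the origin containing x′ and y′, and verifies that distances between the points and to the sanitized points are preserved, which is exactly why the reflected world is an equally likely pre-image.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
x, y = rng.standard_normal(d), rng.standard_normal(d)
x_s, y_s = x + 0.1 * rng.standard_normal(d), y + 0.1 * rng.standard_normal(d)

# Orthonormal basis for span(x', y'), then a unit normal u orthogonal to it;
# the hyperplane {p : p . u = 0} contains o, x', and y'.
b1 = x_s / np.linalg.norm(x_s)
b2 = y_s - (y_s @ b1) * b1
b2 /= np.linalg.norm(b2)
u = rng.standard_normal(d)
u -= (u @ b1) * b1 + (u @ b2) * b2
u /= np.linalg.norm(u)

reflect = lambda p: p - 2 * (p @ u) * u   # mirror reflection across the hyperplane

x_H, y_H = reflect(x), reflect(y)
assert np.allclose(reflect(x_s), x_s)                                  # H fixes x'
assert np.isclose(np.linalg.norm(x - y), np.linalg.norm(x_H - y_H))    # |x-y| preserved
assert np.isclose(np.linalg.norm(x - x_s), np.linalg.norm(x_H - x_s))  # |x-x'| preserved
```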

Slide 17: The General Case: n Points

- The adversary is given n − 1 real points x_2, …, x_n and one sanitized point x′_1; T = 1; flat prior
- Reflections do not work: too many constraints
- A more direct argument: examine the posterior distribution on x_1
- Let Z = { p ∈ R^d | p is a legal pre-image for x′_1 } and Q = { p | if x_1 = p, then x_1 is isolated by q }
- We show that Pr[Q ∩ Z | x′_1] ≤ 2^−Ω(d) · Pr[Z | x′_1]
  - Pr[x_1 ∈ Q ∩ Z | x′_1] = (probability mass contributed by Q ∩ Z)/(contribution from Z) = 2^{1−d}/(1/4)

Shuchi Chawla 18 Q The general case… n points Z = { p | p is a legal pre-image for x’ 1 } Q = { p | x 1 =p is isolated by q } q x’ x2x2 x3x3 x4x4 x5x5 Z Key observation: As |q-x’| increases, Q becomes larger. But, larger distance from x’ implies smaller probability mass, because x is randomized over a larger area Q∩ZQ∩Z x6x6 Probability depends only on the solid angle subtended at x’

Slide 19: The General Case: n Sanitized Points

- Privacy does not follow immediately from the previous analysis with real points!
- Problem: sanitization is non-oblivious; other sanitized points reveal information about x if x is their nearest neighbor
- Solution: decouple the two kinds of information, from x′ and from the other x′_i
[Figure: the dataset split into two parts, L and R]

Slide 20: The General Case: n Sanitized Points (continued)

- Claim 1 (privacy for L): given all sanitizations, all points in R, and all but one point in L, the adversary cannot isolate the last point
  - Follows from the proof for n − 1 real points
- Claim 2 (privacy for R): given all sanitizations, all points in L, and all but one point in R, the adversary cannot isolate the last point
  - Work in progress
  - Idea: show that the adversary cannot distinguish whether R contains some point x or not (an information-theoretic argument)

Slide 21: Results on Privacy: An Overview

| Distribution | Num. of points | Revealed to adversary | Auxiliary information |
|---|---|---|---|
| Uniform on the surface of a sphere | 2 | Both sanitized points | Distribution, 1-radius |
| Uniform over a bounding box or the surface of a sphere | n | One sanitized point, all other real points | Distribution |
| Uniform over a bounding box | n | n/2 sanitized points | Distribution, all but one of the real points |

Slide 22: Results on Utility: An Overview

| Distributional / worst-case | Objective | Assumptions | Result |
|---|---|---|---|
| Worst-case | Find k clusters minimizing the largest diameter | — | The optimal diameter, as well as approximations to it, increases by at most a factor of 3 |
| Distributional | Find k maximum-likelihood clusters | Mixture of k Gaussians | Correct clustering with high probability, as long as the means are pairwise sufficiently far apart |

Slide 23: Learning Mixtures of Gaussians (Spectral Methods)

- Observation: the top eigenvectors of a matrix span a low-dimensional space that yields a good approximation of complex data sets, in particular Gaussian data
- Intuition
  - Sampled points are "close" to the means of the corresponding Gaussians in any subspace
  - The span of the top k singular vectors approximates the span of the means
  - Distances between the means of the Gaussians are preserved; other distances shrink by a factor of √(k/n)
- Our goal: show that the same algorithm works for clustering sanitized data

Slide 24: Spectral Techniques for Perturbed Data

- A sanitized point is the sum of two Gaussian variables: sample + noise
- W.h.p. the 1-radius of a point is less than the "radius" of its Gaussian
  - So the variance of the noise is small
- Sanitized points are still close to their means (uses independence of direction)
- The span of the top k singular vectors still approximates the span of the means of the Gaussians
- Distances between means are preserved; others shrink
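The spectral recipe on these two slides is easy to state in code. Below is a minimal illustration (mine, not the paper's; the toy data and separation constant are arbitrary): center the data, take the top-k right singular vectors, and project, so that clustering happens in the k-dimensional subspace.

```python
import numpy as np

def spectral_project(data, k):
    """Project the n x d data matrix onto the span of its top-k singular vectors."""
    centered = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T        # n x k coordinates in the top-k subspace

# Toy example: two well-separated spherical Gaussians in high dimension.
rng = np.random.default_rng(1)
d, n = 200, 400
means = np.stack([np.zeros(d), 8 * np.ones(d) / np.sqrt(d)])  # means 8 apart
labels = rng.integers(0, 2, n)
data = means[labels] + rng.standard_normal((n, d))
low = spectral_project(data, k=2)     # clusters separate clearly here
```

Running any standard clustering algorithm (e.g. k-means) on `low` then recovers the mixture components; the `labels` array above exists only to generate the toy data.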

Slide 25: Future Directions

- Extend the privacy argument to other "nice" distributions
  - Can revealing the distribution hurt privacy?
- Characterize the kind of auxiliary information that is acceptable
  - Depends on the distribution of the data points
- The low-dimensional case
  - Is it inherently impossible? Dinur and Nissim show impossibility for the one-dimensional case
- Extend the utility argument to other interesting macroscopic properties

Slide 26: What About the Real World?

- Lessons from the abstract model
  - High dimensionality is our friend
  - Gaussian/spherically symmetric perturbations seem to be the right thing to do
  - Need to scale different attributes appropriately, so that the data is well-rounded
- Moving towards real data
  - Outliers: our notion of c-isolation deals with them, but the existence of an outlier may be disclosed
  - Discrete attributes: convert them into real-valued attributes, e.g. convert a binary variable into a probability
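As one concrete reading of the scaling advice above, here is a minimal preprocessing sketch; the choice of standardization, and the particular mapping of binary values into [0, 1], are my assumptions, since the slides do not prescribe a specific method.

```python
import numpy as np

def make_well_rounded(data):
    """Rescale each attribute to zero mean and unit variance, so the data has
    comparable spread in every dimension and spherical noise makes sense."""
    mu = data.mean(axis=0)
    sigma = data.std(axis=0)
    sigma[sigma == 0] = 1.0           # leave constant columns untouched
    return (data - mu) / sigma

def binary_to_real(column, eps=0.05):
    """One possible reading of "convert a binary variable into a probability":
    map 1 -> 1 - eps and 0 -> eps, so values live strictly inside [0, 1]
    and can be perturbed like any other real-valued attribute."""
    return np.where(column == 1, 1.0 - eps, eps)
```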

Slide 27: Questions?

Slide 28: So Far…

- A rigorous definition of privacy
- A candidate sanitization procedure that
  - displays resistance to breach of privacy
  - preserves utility to a large extent
- Rest of this talk: open issues and food for thought
  - The low-dimensional case
  - Inherently "bad" distributions
  - Privacy and kernel density estimation
  - Lessons learnt from the abstract case

Slide 29: The Low-Dimensional Case

- High-dimensional data ⇒ more randomness in the perturbation
  - Crucial to our approach of showing "many possible pre-images"
- Supporting evidence: the Dinur-Nissim lower bound on one-dimensional data
- Likewise, we require the data to be well-rounded
  - Preprocess the data to increase spread in all dimensions

Slide 30: Inherently "Bad" Distributions

- Can knowing the distribution hurt privacy?
- Utility implies knowledge of the distribution

Slide 31

- The low-dimensional case seems problematic because there is very little randomness; is there an inherent difficulty?
- The same problem arises with data lying in a low-dimensional manifold; we should first preprocess it so that it becomes well-rounded
- But what we are really doing is adding noise in proportion to the spread in each dimension

Slide 32

- Can we even achieve privacy and utility simultaneously?
- What if the distribution itself lets you isolate a point?
- What kind of AUX can we handle?
- If AUX + D ⇒ isolation, then we cannot sanitize D and allow AUX, because the sanitization may reveal D while AUX does not contain it
- So modify the definition to say: Pr[I(SDB, AUX) = 1] − Pr[I′(D, AUX, utility) = 1] = negligible

Slide 33

- Outliers

Slide 34: Kernel Density Estimation?

- Our goal is similar to KDE, because we also reconstruct the distribution and sample from it
- If we could show that this sampling did not depend on the actual points, we would be done
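For concreteness, here is what publishing KDE samples looks like, a minimal sketch (my own, assuming a spherical Gaussian kernel with a fixed bandwidth): sampling from a kernel density estimate amounts to picking a random data point and adding kernel noise, which is why showing that the output does not depend on the individual points would close the argument.

```python
import numpy as np

def kde_sample(rdb, m, bandwidth, rng):
    """Draw m fresh points from a Gaussian KDE fit to the rows of rdb."""
    idx = rng.integers(0, len(rdb), size=m)               # choose kernel centers
    noise = bandwidth * rng.standard_normal((m, rdb.shape[1]))
    return rdb[idx] + noise
```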

Slide 35: A Good Sanitization: Candidate Definition

- ∀Aux ∀D ∀I ∃I′: for RDB ∈_R D, Pr[I(SDB, Aux)] − Pr[I′(Aux)] ≤ 1/poly(n)
- The probability is over the choices of the sanitizing algorithm and the random process that picks the RDB
- We may need to restrict Aux somewhat

Slide 36: Some Notation

- Points in the RDB: x_1, x_2, …
- Points in the SDB: x′_1, x′_2, …
- Adversary's guess: q
- T-radius of x: δ_x
- We use T = 1 and c = 4, i.e., we perturb points randomly to within their 1-radius
- The isolating adversary must get much closer to a point x than x's distance to its nearest neighbor, i.e., |q − x| ≤ δ_x/3

Slide 37: The General Case: n Points

- The adversary is given n − 1 real points x_2, …, x_n and one sanitized point x′_1
- Reflections do not work: too many constraints
- A more direct argument: examine the posterior distribution on x_1
- Show that for any point q, the total probability mass of points isolated by q is exponentially small

Slide 38: Results on Privacy: An Overview

| Distribution | Num. of points | Revealed to adversary | Auxiliary information |
|---|---|---|---|
| Uniform on the surface of a sphere | 2 | Both sanitized points | Distribution, 1-radius |
| Uniform over a bounding box or the surface of a sphere | n | One sanitized point, all other real points | Distribution |
| Uniform over a bounding box | n | n/2 sanitized points | Distribution, all but one of the real points |

Slide 39: The Rest of This Talk…

- A geometric view; isolation
- Our contributions
  - The sanitizing algorithm
  - Proof sketch for privacy, n = 2
  - Extending the proof to larger n
  - Results on utility
- Future directions

Slide 40: The General Case: n Sanitized Points

- Privacy does not follow immediately from the previous analysis with real points!
- Problem: sanitization is non-oblivious; other sanitized points reveal information about x if x is their nearest neighbor
- Idea: decouple the two kinds of information, from x′ and from the other x′_i