Towards Privacy in Public Databases Shuchi Chawla, Cynthia Dwork, Frank McSherry, Adam Smith, Larry Stockmeyer, Hoeteck Wee Work Done at Microsoft Research.

Presentation transcript:

Towards Privacy in Public Databases Shuchi Chawla, Cynthia Dwork, Frank McSherry, Adam Smith, Larry Stockmeyer, Hoeteck Wee Work Done at Microsoft Research

2 Database Privacy
- Think "Census": individuals provide information, and the Census Bureau publishes sanitized records. Privacy is legally mandated; what utility can we achieve?
- Inherent privacy vs. utility tension: one extreme is complete privacy and no information; the other extreme is complete information and no privacy.
- Goals: find a middle path (preserve macroscopic properties, "disguise" individually identifying information); change the nature of the discourse; establish a framework for meaningful comparison of techniques.

3 Outline
- Definitions: privacy, defined in the breach; sanitization requirements; utility goals
- Example: Recursive Histogram Sanitizations - description of technique; a robust proof of privacy
- Example: "Round" Sanitizations - nice learning properties; privacy via cross-training
- Setting the Real World Context - dealing with auxiliary information

4 Outline
- Definitions: privacy, defined in the breach; sanitization requirements; utility goals
- Example: Recursive Histogram Sanitizations - description of technique; a robust proof of privacy
- Example: "Round" Sanitizations - nice learning properties; privacy via cross-training
- Setting the Real World Context - dealing with auxiliary information

5 What do WE mean by privacy?
- [Ruth Gavison] Protection from being brought to the attention of others: such protection is inherently valuable, and attention invites further privacy loss.
- Privacy is assured to the extent that one blends in with the crowd.
- An appealing definition, and one that can be converted into a precise mathematical statement…

6 A geometric view
- Abstraction: the database consists of points in high-dimensional space R^d. Points are unlabeled: you are your collection of attributes. Distance is everything: points are more similar if and only if they are closer.
- Real Database (RDB), private: n unlabeled points in d-dimensional space; think of d as the number of sensitive attributes.
- Sanitized Database (SDB), public: n' new points, possibly in a different space.

7 The adversary or Isolator - Intuition
- On input SDB and auxiliary information, the adversary outputs a point q ∈ R^d.
- q "isolates" a real DB point x if it is much closer to x than to x's near neighbors; q fails to isolate x if q looks roughly as much like everyone in x's neighborhood as it looks like x itself.
- Tightly clustered points have a smaller radius of isolation.

8 (c,T)-Isolation - the definition
- The isolator outputs q = I(SDB, aux); let δ = ||q − x||.
- x is (c,T)-isolated by q if B(q, cδ) contains fewer than T other points from the RDB.
- c is the privacy parameter; e.g., c = 4.
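A minimal sketch of this isolation test (illustrative only; the function and parameter names are mine, not the paper's):

```python
import numpy as np

def is_isolated(q, rdb, x_index, c=4.0, T=10):
    """Check whether the point rdb[x_index] is (c,T)-isolated by the query point q.

    q       : adversary's output, shape (d,)
    rdb     : real database, shape (n, d)
    x_index : index of the candidate point x in rdb
    c, T    : privacy parameters from the definition (c = 4 is the slide's example)
    """
    x = rdb[x_index]
    delta = np.linalg.norm(q - x)               # delta = ||q - x||
    dists = np.linalg.norm(rdb - q, axis=1)     # distance from q to every RDB point
    # RDB points other than x that fall inside the ball B(q, c*delta)
    others_in_ball = int(np.sum(dists <= c * delta)) - 1
    return others_in_ball < T
```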

9 Requirements for the sanitizer
- No way of obtaining privacy if AUX already reveals too much!
- A sanitization procedure compromises privacy if giving the adversary access to the SDB considerably increases its probability of success. The definition of "considerably" can be forgiving.
- Formally, quantify over distributions, adversaries, choice of database, and auxiliary information: ∀D ∀I ∃I' such that, w.h.p. over D, aux, and x, |Pr[I(SDB, aux) isolates x] − Pr[I'(aux) isolates x]| is small, with the probabilities taken over the choices made by the sanitizer and by I, I'.
- Provides a framework for describing the power of a sanitization method, and hence for comparisons.
- Aux is going to cause trouble. Ignore it for now.

10 Utility Goals
- Pointwise proofs of specific utilities: averages, medians, clusters, regressions, …
- Prove there is a large class of interesting utilities for which there are good approximation procedures using sanitized data.

11 Outline
- Definitions: privacy, defined in the breach; sanitization requirements; utility goals
- Example: Recursive Histogram Sanitizations - description of technique; a robust proof of privacy
- Example: "Round" Sanitizations - nice learning properties; privacy via cross-training
- Setting the Real World Context - dealing with auxiliary information

12 Recursive Histogram Sanitization
- U = d-dimensional cube of side 2.
- Cut U into 2^d subcubes, splitting along each axis; each subcube has side 1.
- For each subcube: if the number of RDB points in it is > 2T, then recurse.
- Output: the list of cells and their counts.
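A minimal sketch of the splitting procedure, assuming half-open cells and the stopping rule "stop splitting once a cell holds at most 2T points" (names are mine):

```python
import numpy as np

def recursive_histogram(points, lower, upper, T):
    """Recursive histogram sanitization (sketch).

    points       : (m, d) array of RDB points lying in the cell [lower, upper)
    lower, upper : per-coordinate bounds of the current cell, shape (d,)
    T            : threshold; a cell is split only while it holds more than 2*T points
    Returns a list of (lower, upper, count) triples -- the published cells and counts.
    """
    if len(points) <= 2 * T:
        return [(lower.copy(), upper.copy(), len(points))]

    cells = []
    mid = (lower + upper) / 2.0
    d = len(lower)
    # enumerate the 2^d subcubes obtained by halving every axis
    for corner in range(2 ** d):
        lo, hi = lower.copy(), upper.copy()
        for axis in range(d):
            if (corner >> axis) & 1:
                lo[axis] = mid[axis]
            else:
                hi[axis] = mid[axis]
        inside = np.all((points >= lo) & (points < hi), axis=1)
        cells.extend(recursive_histogram(points[inside], lo, hi, T))
    return cells

# Top-level call for the cube U of side 2 from the slide:
# cells = recursive_histogram(rdb, np.zeros(d), 2.0 * np.ones(d), T=10)
```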

13 Recursive Histogram Sanitization
- Theorem: there exists a constant c such that, if n points are drawn uniformly from U, then recursive histogram sanitizations are safe with respect to c-isolation: Pr[I(SDB) succeeds] ≤ exp(−d).

14 Safety of Recursive Histogram Sanitization
- Rough intuition: the expected distance ||q − x|| is ≈ the diameter of x's cell; distances are tightly concentrated around the mean; and multiplying the radius by c captures almost all of the parent cell, which contains at least 2T points.

15 For Very Large Values of n
- Wlog we can switch to ball adversaries (q, r): I wins if B(q, r) contains at least one RDB point and B(q, cr) contains fewer than T RDB points.
- Define a probability density f(x) that captures the adversary's view of the RDB.
- To win with probability γ, I needs Pr_f[B(q,r)] ≥ γ/n and Pr_f[B(q,cr)] ≤ (2T + O(log γ^{-1}))/n, so Pr_f[B(q,r)] / Pr_f[B(q,cr)] ≥ γ / (2T + O(log γ^{-1})).
- Bound γ by bounding this ratio: γ ≤ 2^{-εd}, ε < 1.
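Spelling out how the ratio bound controls γ (a reconstruction in my own notation; γ denotes the adversary's success probability and the exponent constants are only indicative). Combining the two winning conditions with the geometric fact from the next two slides, namely that the larger ball captures 2^{Ω(d)} times more mass than the smaller one, gives

```latex
\frac{\gamma}{2T + O(\log \gamma^{-1})}
  \;\le\; \frac{\Pr_f[B(q,r)]}{\Pr_f[B(q,cr)]}
  \;\le\; 2^{-\Omega(d)}
\quad\Longrightarrow\quad
\gamma \;\le\; \bigl(2T + O(\log \gamma^{-1})\bigr)\, 2^{-\Omega(d)},
```

which is itself 2^{-Ω(d)} as long as d dominates log T.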

16 Pr_f[B(q,r)] / Pr_f[B(q,cr)]
- f(x) = (n_C / n)(1 / Vol(C)): the fraction of RDB points landing in cell C, spread uniformly within C.
- If r is sufficiently small, the bigger ball captures exp(d) times more mass in each subcube than does the smaller ball; this yields γ < 2^{-Ω(d)}.

17 Pr_f[B(q,r)] / Pr_f[B(q,cr)]
- f(x) = (n_C / n)(1 / Vol(C)): the fraction of RDB points landing in cell C, spread uniformly within C.
- If r is sufficiently small, the bigger ball captures exp(d) times more mass in each subcube than does the smaller ball.
- If r is large, either the small ball captures nothing or the bigger ball captures the parent cube.
- Either way, isolation cannot occur (c = 16).

18 Proof is Very Robust
- Extends to many interesting cases: non-uniform but bounded-ratio density functions; an isolator that knows a constant fraction of the attribute values; an isolator that knows lots of RDB points; isolation in few attributes (very weak bounds).
- Can be adapted to "round" distributions: balls, spheres, mixtures of Gaussians, with effort [work in progress with K. Talwar].
- More general distributions: "good" islands in a sea of zero probability.

19 Outline
- Definitions: privacy, defined in the breach; sanitization requirements; utility goals
- Example: Recursive Histogram Sanitizations - description of technique; a robust proof of privacy
- Example: "Round" Sanitizations - nice learning properties; privacy via cross-training
- Setting the Real World Context - dealing with auxiliary information

20 Round Sanitizations
- The privacy of x is linked to its T-radius, so randomly perturb x in proportion to its T-radius.
- x' = San(x) ∈_R B(x, T-rad(x)); alternatively, sample from the sphere S(x, T-rad(x)) or add d-dimensional Gaussian noise.
- Intuition: we are blending x in with its crowd. We are adding to x random noise with mean zero, so several macroscopic properties should be preserved.
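A minimal sketch of the Gaussian variant of this perturbation (my own code; the T-radius is computed as the distance to the T-th nearest neighbor, and the noise scale r/√d is an illustrative choice that makes the expected perturbation length about the T-radius):

```python
import numpy as np

def t_radius(rdb, i, T):
    """Distance from rdb[i] to its T-th nearest neighbor (its T-radius)."""
    dists = np.linalg.norm(rdb - rdb[i], axis=1)
    return np.sort(dists)[T]        # index 0 is the point itself, at distance 0

def round_sanitize(rdb, T, rng=np.random.default_rng()):
    """Perturb each point with zero-mean spherical Gaussian noise scaled to its T-radius."""
    n, d = rdb.shape
    sdb = np.empty_like(rdb, dtype=float)
    for i in range(n):
        r = t_radius(rdb, i, T)
        sdb[i] = rdb[i] + rng.normal(scale=r / np.sqrt(d), size=d)
        # the slide's ball variant would instead draw x' uniformly from B(rdb[i], r)
    return sdb
```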

21 Nice Learning Properties
- A known algorithm for learning mixtures of Gaussians works for clustering sanitized Gaussian data: the original distribution (a mixture of Gaussians) is recovered. Technical issue: the added noise is a function of the data. (Subject of another talk.)
- The diameter increases by at most a factor of 3 when finding k clusters minimizing the largest diameter.

22 Privacy for n Sanitized Points?
- Given n−1 points in the clear, the probability of isolating the n-th is O(exp(−d)).
- The intuition for extending this to n points is wrong! Privacy of x_n given x_n' and all the other points in the clear does not imply privacy of x_n given x_n' and the sanitizations of the others: sanitization of the other points reveals information about x_n.
- The worry is for the safety of the reference point (the neighbor defining the T-radius), not the principal.

23 Combining the Two Sanitizations
- Partition the RDB into two sets, A and B.
- Cross-training: compute the histogram sanitization for B; for each v ∈ A, set σ_v = f(side length of the cell C containing v) and output GSan(v, σ_v).
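A rough sketch of this cross-training combination, reusing the two routines sketched above (the even split, the choice of f, and the names are all illustrative assumptions, not the paper's):

```python
import numpy as np

def cross_train_sanitize(rdb, T, rng=np.random.default_rng()):
    """Histogram-sanitize half of the data (B) and use its cells to perturb the other half (A)."""
    n, d = rdb.shape
    idx = rng.permutation(n)
    A, B = rdb[idx[: n // 2]], rdb[idx[n // 2 :]]

    # 1. Publish the recursive histogram of B (recursive_histogram as sketched earlier,
    #    on the cube U of side 2).
    lower, upper = np.zeros(d), 2.0 * np.ones(d)
    cells = recursive_histogram(B, lower, upper, T)

    # 2. For each v in A, find the published cell containing v and add Gaussian noise
    #    whose scale is a function f of that cell's side length (here f(s) = s / sqrt(d)).
    def cell_side(v):
        for lo, hi, _count in cells:
            if np.all((v >= lo) & (v < hi)):
                return hi[0] - lo[0]            # cells are cubes, so any coordinate works
        return upper[0] - lower[0]              # fallback: the whole cube

    sanitized_A = np.array([
        v + rng.normal(scale=cell_side(v) / np.sqrt(d), size=d) for v in A
    ])
    return cells, sanitized_A
```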

24 Cross-Training Privacy
- Privacy for B: only histogram information about B is used.
- Privacy for A: there is enough variance in enough coordinates of v, even given the cell C containing v and the sanitization v' of v. The current proof works only for |A| = 2^{o(d)}.

25 Additional Results*
- Impossibility results: there exist interesting utilities that have no sanitization protecting against isolation (cf. SFE). Impossibility of all-purpose sanitizers: there is always a choice of aux that defeats a certain natural version of privacy; contrived, but it places a limit on what can be proved. Poly-time bounded adversary? Connection to obfuscation.
- Utility: exploit the literature on the power of randomized histograms in algorithms for data streams (e.g., Indyk).
* with assorted collaborators, eg, N, N, S, T

26 Outline
- Definitions: privacy, defined in the breach; sanitization requirements; utility goals
- Example: Recursive Histogram Sanitizations - description of technique; a robust proof of privacy
- Example: "Round" Sanitizations - nice learning properties; privacy via cross-training
- Setting the Real World Context - dealing with auxiliary information

27 A Standard Technique: Cell Suppression
- Gestalt: tabular data (many, possibly linked, tables); the entries are cells, holding frequency (count) data or magnitude data (income, sales, etc.).
- Disclosure = small counts: a small count provides a key for a population unique, or almost-unique, and can be used as a key into a different database.
- Enormous literature on suppressing "safely".

28 Connection to Our Definitions
- Protection against isolation yields protection against learning a key for a population unique.
- Isolation in a subspace does not imply isolation in the full-dimensional space…
- …but aux may contain other DBs that can be queried to learn the remaining attributes; the definition mandates protection against all possible aux, so satisfying the definition ⇒ the adversary can't learn a key.

29 Connection to Our Definitions
- It seems very hard to provide good sanitization in the presence of arbitrary aux: provably impossible in general, and anyway one can probably already isolate people based solely on aux.
- This suggests we need to control aux. How should we redesign the world?

30 Two Tools
- Secure Function Evaluation [Yao, GMW]: a technique permitting Alice, Bob, Carol, and their friends to collaboratively compute a function f of their private inputs, v = f(a,b,c,…), e.g., v = sum(a,b,c,…). Each player learns only what can be deduced from v and her own input to f.
- SuLQ databases [Dwork, Nissim]: provably preserve privacy of attributes when the rows of the database are mutually independent. Powerful [DwNi; Blum, Dwork, McSherry, Nissim].

31 Statistical Database Query
- A query is a pair (S, f) with S ⊆ [n] and f : {0,1}^d → {0,1}.
- Exact answer: Σ_{r ∈ S} f(row r).
- [Figure: a database DB of n persons by d attributes; f is applied to each selected row; rows are drawn from the distribution D = (D_1, D_2, …, D_n).]

32 Sub-Linear Query (SuLQ) Databases
- [Figure: the same n-persons-by-d-attributes database; each query is answered as the exact sum plus noise.]
- If the number of queries is << n, then privacy can be protected with little noise per query: E(noise) = 0 and the standard deviation is << √n, much less than the sampling error!
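A toy illustration of such a noisy query (my own sketch; the Gaussian noise and the example standard deviation are illustrative stand-ins, not the calibration from the SuLQ work):

```python
import numpy as np

def sulq_query(db, S, f, noise_std, rng=np.random.default_rng()):
    """Answer the statistical query (S, f) with additive zero-mean noise.

    db        : (n, d) 0/1 attribute matrix
    S         : iterable of row indices, S a subset of [n]
    f         : predicate mapping a length-d 0/1 row to 0 or 1
    noise_std : standard deviation of the noise; SuLQ needs this << sqrt(n)
    """
    exact = sum(f(db[r]) for r in S)
    return exact + rng.normal(loc=0.0, scale=noise_std)

# Example: a noisy count of rows among the first 1000 with attributes 0 and 3 both set.
# rng = np.random.default_rng(0)
# db = rng.integers(0, 2, size=(10_000, 8))
# answer = sulq_query(db, range(1000), lambda row: int(row[0] and row[3]), noise_std=20)
```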

33 Our Data, Ourselves

34 Our Data, Ourselves
- Individuals maintain their own data records and join a DB by setting an appropriate attribute.
- Statistical queries are answered via an SFE of the SuLQ computation; privacy of the SuLQ query ⇒ this SFE is "safe".
- Individuals ensure that their data take part in sufficiently few queries and that sufficient random noise is added.

35 Summary
- Definitions: defined isolation and sanitization.
- Recursive Histogram Sanitizations: described the approach and sketched a robust proof of privacy for a special distribution; the proof exploits high dimensionality (# columns).
- Sanitization via perturbations: utility, and privacy via cross-training.
- Setting the Real World Context: discussed a radical view of how data might be organized to prevent a powerful class of attacks based on auxiliary data; the SuLQ tool exploits large membership (# rows).

36 Larry Joseph Stockmeyer, November 13, 1948 - July 31, 2004

37 Larry Stockmeyer Commemoration, May 21-22, 2005, Baltimore, Maryland (in conjunction with STOC 2005).
May 21: Tutorial by Nick Pippenger (Princeton) on some of Stockmeyer's fundamental results in complexity theory; lectures by Miki Ajtai (IBM), Anne Condon (UBC), Cynthia Dwork (Microsoft), Richard Karp (UC Berkeley), Albert Meyer (MIT), and Chris Umans (CalTech). Some time will be reserved for personal remarks. Contact Cynthia Dwork if you want to participate in this part of the commemoration.
May 22: Lance Fortnow gives the first keynote address to STOC.

38 Larry Stockmeyer
Larry Stockmeyer, theoretical computer scientist and a founder of the field of complexity theory -- that part of computer science exploring the inherent difficulty of solving computational problems -- died Saturday, July 31, 2004, of pancreatic cancer. Born in Evansville, Indiana, in 1948, Stockmeyer was educated at MIT, where he received a bachelor of science in mathematics and a master of science in electrical engineering in 1972, followed by a doctorate in computer science in 1974. Stockmeyer is famous for his groundbreaking work proving the extreme difficulty of solving naturally occurring computational problems. His pioneering contributions were soon incorporated into textbooks on computational complexity. Stockmeyer joined IBM Research in 1974, working first at the IBM Thomas J. Watson Research Center in Yorktown Heights, New York. A founding member of the Theory Group at the IBM Almaden Research Center in the early 1980s, Stockmeyer was elevated to Fellow of the Association for Computing Machinery. He remained at Almaden until he took a bridge to retirement from IBM in November; after this, Stockmeyer enjoyed a brief affiliation with the University of California at Santa Cruz until his death, at age 55. Stockmeyer is survived by his father Robert Stockmeyer, his sister Mary Karen Walker, and his former wife, dear friend, and colleague Cynthia Dwork.

39 Larry Joseph Stockmeyer, November 13, 1948 - July 31, 2004