Data Publishing against Realistic Adversaries. Johannes Gehrke, Cornell University, Ithaca, NY; Michaela Götz, Cornell University, Ithaca, NY; Ashwin Machanavajjhala.


Data Publishing against Realistic Adversaries
Johannes Gehrke, Cornell University, Ithaca, NY
Michaela Götz, Cornell University, Ithaca, NY
Ashwin Machanavajjhala, Yahoo! Research, Santa Clara, CA
Presented by Amedeo D'Ascanio, University of Bologna, Italy

Outline
- Introduction
- Є-privacy
- Adversary knowledge
- Adversary classes
- Applying Є-privacy to generalization
- Experimental evaluation
- Conclusion
Data Publishing against Realistic Adversaries Amedeo D'Ascanio

Introduction
There are many reasons to publish data, with two requirements:
- Preserve aggregate information about the population
- Preserve the privacy of sensitive information
Privacy: how much information can an adversary deduce from the released data?

Example
- Alice knows that Rachel is 35 and that she lives in ...
- Alice knows that Rachel is 20 and that she has a very low probability of heart disease

Previous Definitions
l-diversity
- The adversary knows l-2 pieces of information about the sensitive attribute
- These pieces of information are equally likely
t-closeness
- Alice knows the distribution of sensitive values
- Rachel's chances of having a disease follow the same odds
Differential privacy
- Alice knows the exact disease of every patient but Rachel's
"It's flu season, a lot of elderly people will be in the hospital with flu symptoms"
- How do we model such background knowledge with l-diversity or t-closeness?
- Does Alice know everything about 1 billion patients?
Unrealistic assumptions!

Є-privacy
- A flexible language for defining sensitive information about each individual
- Privacy as the difference in the adversary's belief between the table published with and without the "victim"
- Different classes of adversaries (realistic or unrealistic) modeled based on their knowledge

Modeling sensitive information
Positive disclosure: Alice learns that Rachel has flu
Negative disclosure: Alice learns that Rachel does not have flu
Sensitive information is expressed using positive disclosures on a set of sensitive predicates Φ

Modeling sensitive information: Example
A negative disclosure of a value s is the positive disclosure of its complement dom(S) \ {s}, where dom(S) is the domain of the sensitive attribute. Rachel can protect against any kind of disclosure for flu, cancer, and any stomach disease if the privacy condition holds for each subset of these sensitive values.

Adversary Knowledge
Knowledge from other sources is usually modeled as a joint distribution P over the non-sensitive attributes N and the sensitive attribute S. If the adversary has no preference, every sensitive value is considered equally likely.

Adversary Knowledge: Two problems
1. Where does the adversary learn this knowledge?
- If the population with cancer is 10% (s_i = s/10), then for each i, p_i = s_i/s = 0.1
- What if T_pub has only 10 entries?
2. Can the adversary change his prior?
- Say the probability that a woman has cancer is p_i = 0.5, based on a sample of 100 women
- The adversary then reads another table with 20k tuples where s_i is 2k (so that p_i = 0.1)
- If the prior is not strong, p_i will change accordingly
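The prior-updating point above can be sketched as a Beta-Binomial update. This is an illustrative model of my own choosing, not the paper's construction; the 50 cancer cases in the 100-woman sample are a hypothetical count implied by p = 0.5.

```python
# Beta-Binomial posterior update: a weak prior is overwhelmed by data.
# Hypothetical numbers: a prior from 100 women with 50 cancer cases (p = 0.5),
# then a table of 20k tuples with 2k cancer cases (p = 0.1).
prior_cancer, prior_total = 50, 100       # Beta(50, 50) prior, mean 0.5
obs_cancer, obs_total = 2_000, 20_000     # observed counts from the new table

post_alpha = prior_cancer + obs_cancer
post_beta = (prior_total - prior_cancer) + (obs_total - obs_cancer)
posterior_mean = post_alpha / (post_alpha + post_beta)

print(round(posterior_mean, 3))  # close to 0.1: the weak prior barely matters
```

With only 100 pseudo-observations against 20k real ones, the posterior lands near the data's 0.1, illustrating how a non-stubborn prior changes accordingly.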

Adversary Knowledge
To model adversaries we assume that:
- The adversary's knowledge is a prior over data-generating distributions, not a single fixed distribution
- The tuples are not independent of each other
Exchangeability: a sequence of random variables X1, X2, ..., Xn is exchangeable if every finite permutation of these random variables has the same joint probability distribution.
- If H is Healthy and S is Sick, the probability of seeing the sequence SSHSH is the same as the probability of HHSSS
According to de Finetti's representation theorem, an exchangeable sequence of random variables is mathematically equivalent to:
- choosing a data-generating distribution θ at random, and
- creating the data by sampling independently from this chosen distribution θ
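The exchangeability claim can be checked numerically for a toy de Finetti mixture. The two-point prior on θ below is an assumption for illustration; any prior over θ gives the same result.

```python
import itertools
from fractions import Fraction

# De Finetti mixture: theta (probability of Sick) is drawn at random, then the
# sequence is sampled i.i.d. given theta. Assumed toy prior: theta is 0.2 or
# 0.8 with equal weight.
thetas = [Fraction(1, 5), Fraction(4, 5)]

def seq_prob(seq):
    # Mixture probability: average the i.i.d. likelihood over the prior on theta.
    total = Fraction(0)
    for theta in thetas:
        like = Fraction(1)
        for c in seq:
            like *= theta if c == "S" else 1 - theta
        total += Fraction(1, len(thetas)) * like
    return total

# Every permutation of SSHSH (three S's, two H's) gets the same probability.
probs = {seq_prob(p) for p in set(itertools.permutations("SSHSH"))}
print(len(probs))  # 1
```

The likelihood under each θ depends only on the counts of S and H, so every ordering collapses to the same mixture probability, which is exactly exchangeability.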

Adversary Knowledge: Example
Assume two populations of equal size: Ω1 with only healthy people and Ω2 with only sick people. Table T is drawn entirely from either Ω1 or Ω2.
- If the adversary doesn't know which population has been chosen: Pr[t = H] = 0.5
- If the adversary learns that just one tuple is healthy, the table must come from Ω1, so for every tuple: Pr[t = H] = 1
- If tuples were independent of each other? Still Pr[t = H] = 0.5
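The arithmetic of this example, written out (the probability values are my reconstruction of the formulas lost in the transcript, forced by the all-healthy/all-sick setup):

```python
# Two-population example: Omega_1 all Healthy, Omega_2 all Sick, and the table
# is drawn entirely from one of them with probability 1/2 each.

# Before seeing anything: average over the two equally likely populations.
pr_h_prior = 0.5 * 1.0 + 0.5 * 0.0        # = 0.5

# After learning one tuple is Healthy, the table must come from Omega_1,
# so every other tuple is Healthy too.
pr_h_given_one_healthy = 1.0

# If tuples were truly independent, one observation would tell us nothing
# about the others:
pr_h_independent = 0.5

print(pr_h_prior, pr_h_given_one_healthy, pr_h_independent)
```

The gap between 1.0 and 0.5 is the whole point of the slide: correlated tuples let one observation shift beliefs about everyone else.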

Dirichlet Distribution
More generally, T (of size n) is generated in two steps:
1. A probability vector p is drawn from a distribution D
2. Then the n elements are drawn i.i.d. according to p
D encodes the adversary's knowledge:
- If the adversary has no prior, every p is equally likely to be drawn from D
- If the adversary knows that 999 people out of 1k have cancer, he should model D so as to draw p_no(cancer) = 0.001 and p_yes(cancer) = 0.999
The Dirichlet distribution is used to model the prior over probability vectors.

Dirichlet Distribution
D(σ1, ..., σk) expresses the belief that the probabilities of k rival events are x_i, given that each event has been observed σ_i − 1 times.
- Adversary without knowledge: D(σ1, ..., σk) = D(1, ..., 1)
- After reading a dataset with counts (σ1 − 1, ..., σk − 1), the adversary may update his prior to D(σ1, ..., σk). In this case not all probability vectors are equally likely.
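The pseudocount update described above is just elementwise addition; a minimal sketch, with the 999-of-1000 cancer counts from the previous slide used as hypothetical observations:

```python
# Dirichlet pseudocount update: start from the uninformed prior D(1, ..., 1)
# and add the observed counts, one per event.
def update_dirichlet(prior, counts):
    # Each observation of event i adds 1 to sigma_i.
    return [s + c for s, c in zip(prior, counts)]

uninformed = [1, 1]        # D(1, 1): every probability vector equally likely
observed = [999, 1]        # hypothetical: 999 cancer, 1 healthy out of 1000
posterior = update_dirichlet(uninformed, observed)
print(posterior)           # [1000, 2]

# Posterior mean probability of cancer: sigma_1 / (sigma_1 + ... + sigma_k)
print(posterior[0] / sum(posterior))  # ~0.998
```

After the update the prior is strongly concentrated near (0.999, 0.001), matching the adversary described on the previous slide.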

Dirichlet Distribution
The most likely probability vector (the mode of D(σ1, ..., σk)) is x_i = (σ_i − 1)/(σ − k), where σ = σ1 + ... + σk.
As we increase σ while keeping the shape fixed, this vector becomes more and more likely; in the limit it is the only possible probability distribution.
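The mode formula (standard for Dirichlet distributions with all σ_i > 1) can be sketched directly; scaling the pseudocounts keeps the mode fixed while concentrating the prior, which is the "stubbornness" idea used later:

```python
# Mode of a Dirichlet D(sigma_1, ..., sigma_k): x_i = (sigma_i - 1) / (sigma - k),
# where sigma = sum(sigma_i). Valid when every sigma_i > 1.
def dirichlet_mode(sigmas):
    total = sum(sigmas)
    k = len(sigmas)
    return [(s - 1) / (total - k) for s in sigmas]

print(dirichlet_mode([4, 7]))      # [0.333..., 0.666...]
print(dirichlet_mode([301, 601]))  # same mode, but a far more concentrated prior
```

Both calls return the same vector (1/3, 2/3); only the concentration around it differs, growing with σ.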

Other Adversary Knowledge
Knowledge from individuals inside the published table: full knowledge about a subset B of the tuples in T

Definition
After T_pub is published, the adversary's belief in a sensitive predicate φ about an individual u in T is p_in = Pr[φ(u) | T_pub].
If the individual u is removed from T, the belief becomes p_out = Pr[φ(u) | T_pub built without u].

Definition
p_in should not be much greater than p_out: the greater the gap, the more information about the individual's sensitive predicate the adversary learns. A table does not satisfy Є-privacy if p_in exceeds p_out by more than the bound that Є allows.
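A minimal sketch of the check, ASSUMING the violation condition is a simple ratio test p_in > Є · p_out; the paper's exact distance between the two beliefs may differ, so treat this as an illustration of the shape of the definition, not its precise form:

```python
# Hedged sketch: flag a table if the adversary's belief with the victim present
# (p_in) exceeds the belief without the victim (p_out) by more than a factor
# epsilon. The exact comparison used in the paper may differ.
def violates_epsilon_privacy(p_in, p_out, epsilon):
    return p_in > epsilon * p_out

print(violates_epsilon_privacy(0.9, 0.6, 10))   # False: belief barely moved
print(violates_epsilon_privacy(0.9, 0.01, 10))  # True: a huge jump in belief
```

This also makes the later parameter choice plausible: an Є between 10 and 100 tolerates moderate belief shifts while forbidding dramatic ones.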

Adversary Classes
Defined based on the prior built over the distribution of sensitive values:
- Class I: fixed stubbornness σ and fixed shape
- Class II: fixed stubbornness σ, arbitrary shape
- Class III: arbitrary stubbornness σ, fixed shape
- Class IV: arbitrary prior

Adversary Classes: Examples
Suppose the adversary has seen another dataset of 30k tuples: 12k with flu and 18k with cancer.
- Class I: σ = 30k, D(12k, 18k)
- Class II: σ = 30k, arbitrary shape
- Class III: arbitrary σ, distribution (.4, .6)
- Class IV: arbitrary prior
Rachel is in the published table. p_in(flu) = .9 for all adversaries (it depends only on the published table); p_out(flu) changes for each adversary.

Adversary Classes: Examples
- Class I: p_out(flu) = (18k + 12k)/(20k + 30k) = .6
- Class II: p_out(flu) = (18k + 1)/(20k + 30k) ≈ .36
- Class III: p_out(flu) = .4
- Class IV: p_out(flu) can be any value
So Rachel is granted .4, 6.4, 6 and no privacy against class I, II, III and IV adversaries, respectively.
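The p_out fractions above follow a pseudocount pattern that can be reproduced directly. The reading below, where the published table of 20k tuples contains 18k flu cases (giving p_in = .9), is my reconstruction of the slide's arithmetic:

```python
# Reconstructed p_out arithmetic: with a Dirichlet prior of stubbornness sigma
# contributing flu_pseudocount flu observations, the posterior belief after a
# table of n tuples (n_flu with flu) is (n_flu + flu_pseudocount) / (n + sigma).
def p_out(n_flu, n, sigma, flu_pseudocount):
    return (n_flu + flu_pseudocount) / (n + sigma)

# Class I: sigma = 30k, prior D(12k, 18k) -> 12k flu pseudocounts.
print(p_out(18_000, 20_000, 30_000, 12_000))        # 0.6
# Class II: sigma = 30k, but a worst-case shape puts only 1 pseudocount on flu.
print(round(p_out(18_000, 20_000, 30_000, 1), 2))   # 0.36
```

The Class III value (.4) needs no computation: with arbitrary stubbornness the prior can be infinitely stubborn, pinning p_out to the fixed shape's flu probability.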

Generalization and Є-privacy
Each individual u has a set of sensitive predicates to protect. From these we can define a set of constraints that have to be checked during the generalization process.

Check for Class I
Constraints R1 and R2 have to be satisfied: a combination of anonymity and closeness.

Check for Class II
Constraints R1 and R2 have to be satisfied: a combination of anonymity and diversity.

Check for Class III
Constraints R1 and R2 have to be satisfied: closeness only.
Є-privacy does not guarantee privacy against a class IV adversary.

Monotonicity
If T1 and T2 are generalizations of T, with T2 generalizing T1, then if T1 satisfies Є-privacy, T2 also satisfies Є-privacy.
- Useful for algorithms such as Incognito, Mondrian, and the PET algorithm
- All the checks shown before have time complexity O(N)

Choosing the Parameters
The choice is application dependent; for the US Census:
- Stubbornness σ: the number of individuals
- Shape: the distribution of sensitive values
- Epsilon: between 10 and 100. WHY?

Experimental Results
Data from the Minnesota Population Center, with nearly 3M tuples.
- The more stubbornness we allow the adversary, the greater the epsilon needed to achieve privacy
- With small values of σ the cost function is better
- The average group size increases with σ

Embedding Prior Work
Є-privacy can cover some instantiations of:
- Recursive (c,2)-diversity
- Differential privacy
- t-closeness

Conclusions
- A definition of Є-privacy
- A definition of realistic adversaries
- Coverage of scenarios not taken into account in previous work
- Є-privacy in the generalization process
Future work:
- Considering correlations between sensitive and non-sensitive values
- Applying Є-privacy to other algorithms