Tuning Privacy-Utility Tradeoffs in Statistical Databases using Policies
Ashwin Machanavajjhala (cs.duke.edu)
Collaborators: Daniel Kifer (PSU), Bolin Ding (MSR), Xi He (Duke)
Census, 8/15/2013
Overview of the talk
There is an inherent trade-off between the privacy (confidentiality) of individuals and the utility of statistical analyses over data collected from individuals.
Differential privacy has revolutionized how we reason about privacy
– A nice tuning knob ε for trading off privacy and utility
Overview of the talk
However, differential privacy only captures a small part of the privacy-utility trade-off space
– No Free Lunch Theorem
– Differentially private mechanisms may not ensure sufficient utility
– Differentially private mechanisms may not ensure sufficient privacy
Overview of the talk
I will present a new privacy framework that allows data publishers to more effectively trade off privacy for utility
– Better control over what to keep secret and who the adversaries are
– Can ensure more utility than differential privacy in many cases
– Can ensure privacy where differential privacy fails
Outline
Background
– Differential privacy
No Free Lunch [Kifer-M SIGMOD '11]
– No `one privacy notion to rule them all'
Pufferfish Privacy Framework [Kifer-M PODS '12]
– Navigating the space of privacy definitions
Blowfish: Practical privacy using policies [ongoing work]
Data Privacy Problem
[Figure: Individuals 1 … N contribute records r1, r2, r3, …, rN to a server holding database DB]
Utility: useful statistical analyses over the collected data
Privacy: no breach about any individual
Data Privacy in the real world

| Application | Data Collector | Third Party (adversary) | Private Information | Function (utility) |
| --- | --- | --- | --- | --- |
| Medical | Hospital | Epidemiologist | Disease | Correlation between disease and geography |
| Genome analysis | Hospital | Statistician/Researcher | Genome | Correlation between genome and disease |
| Advertising | Google/FB/Y! | Advertiser | Clicks/Browsing | Number of clicks on an ad by age/region/gender |
| Social Recommendations | Facebook | Another user | Friend links / profile | Recommend other users or ads to users based on social network |
Many definitions & several attacks
Definitions: K-Anonymity [Sweeney et al., IJUFKS '02], L-diversity [Machanavajjhala et al., TKDD '07], T-closeness [Li et al., ICDE '07], E-Privacy [Machanavajjhala et al., VLDB '09], Differential Privacy [Dwork et al., ICALP '06]
Attacks: linkage attack, background knowledge attack, minimality/reconstruction attack, de Finetti attack, composition attack
Differential Privacy
For every pair of inputs D1 and D2 that differ in one value, and for every output O, an adversary should not be able to distinguish between D1 and D2 based on O:
$$\left| \log \frac{\Pr[A(D_1) = O]}{\Pr[A(D_2) = O]} \right| \le \varepsilon \qquad (\varepsilon > 0)$$
Algorithms
– No deterministic algorithm guarantees differential privacy.
– Random sampling does not guarantee differential privacy.
– Randomized response satisfies differential privacy.
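A minimal sketch of binary randomized response, assuming the classic coin-flip variant; the function names and the unbiased estimator are illustrative, not from the talk:

```python
import math
import random

def randomized_response(true_bit: bool, epsilon: float) -> bool:
    """Report the true bit with probability e^eps / (1 + e^eps), else flip it.

    Flipping the true bit changes each report's distribution by at most a
    factor of e^eps, so the mechanism satisfies eps-differential privacy.
    """
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return true_bit if random.random() < p_truth else not true_bit

def estimate_fraction(reports: list, epsilon: float) -> float:
    """Unbiased estimate of the true fraction of 1s from the noisy reports."""
    p = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    observed = sum(reports) / len(reports)
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)
```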
Laplace Mechanism
[Figure: researcher sends query q to database D; the server returns the true answer plus noise, q(D) + η]
Noise density: h(η) ∝ exp(−|η| / λ)
Privacy depends on the parameter λ. Mean: 0, Variance: 2λ²
Laplace Mechanism
Thm: If the sensitivity of the query is S, then the Laplace mechanism with λ = S/ε guarantees ε-differential privacy.
Sensitivity: smallest number S(q) s.t. for any D, D' differing in one entry, ||q(D) − q(D')||₁ ≤ S(q)
[Dwork et al., TCC 2006]
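A minimal sketch of the mechanism, assuming a numeric (possibly vector-valued) query; the function name is illustrative:

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity: float, epsilon: float):
    """Add Laplace(sensitivity/epsilon) noise to each coordinate of the answer.

    With lambda = S/eps, this satisfies eps-differential privacy for any
    query whose L1 sensitivity is at most S.
    """
    lam = sensitivity / epsilon
    return true_answer + np.random.laplace(loc=0.0, scale=lam,
                                           size=np.shape(true_answer))
```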
Contingency tables
[Figure: database D summarized as a contingency table of counts Count(·,·); each tuple takes one of k = 4 different values]
Laplace Mechanism for Contingency Tables
[Figure: each cell count in D is released with Lap(2/ε) noise, e.g., a cell with true count 8 is released as 8 + Lap(2/ε)]
Sensitivity = 2; the released count has Mean: 8, Variance: 8/ε²
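A quick check of the variance figure, using Var = 2λ² from the Laplace mechanism slide:

```latex
\mathrm{Var}\big[\mathrm{Lap}(2/\varepsilon)\big] \;=\; 2\lambda^2 \;=\; 2\left(\frac{2}{\varepsilon}\right)^{2} \;=\; \frac{8}{\varepsilon^2}
```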
Composition Property
If algorithms A1, A2, …, Ak use independent randomness and each Ai satisfies εi-differential privacy, then outputting all the answers together satisfies differential privacy with ε = ε1 + ε2 + … + εk
This additive ε is often called the privacy budget.
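A toy sketch of how sequential composition is tracked in practice as a budget; the class and method names are illustrative, not from the talk:

```python
class PrivacyBudget:
    """Track cumulative epsilon spent across independently randomized queries."""

    def __init__(self, total_epsilon: float):
        self.total = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon: float) -> None:
        # Sequential composition: total leakage is the sum of per-query epsilons.
        if self.spent + epsilon > self.total:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon
```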
Differential Privacy
– Privacy definition that is independent of the attacker's prior knowledge.
– Tolerates many attacks that other definitions are susceptible to.
  – Avoids composition attacks
  – Claimed to be tolerant against adversaries with arbitrary background knowledge.
– Allows simple, efficient and useful privacy mechanisms
  – Used in LEHD's OnTheMap [M et al., ICDE '08]
Outline
Background
– Differential privacy
No Free Lunch [Kifer-M SIGMOD '11]
– No `one privacy notion to rule them all'
Pufferfish Privacy Framework [Kifer-M PODS '12]
– Navigating the space of privacy definitions
Blowfish: Practical privacy using policies [ongoing work]
Differential Privacy & Utility
Differentially private mechanisms may not ensure sufficient utility for many applications.
– Sparse data: the integrated mean squared error of the Laplace mechanism can be worse than that of returning a random contingency table, for typical values of ε (around 1)
– Social networks [M et al., PVLDB 2011]
Differential Privacy & Privacy
Differentially private algorithms may not limit the ability of an adversary to learn sensitive information about individuals when records in the data are correlated.
Correlations across individuals occur in many ways:
– Social networks
– Data with pre-released constraints
– Functional dependencies
Laplace Mechanism and Correlations
[Figure: table D released with Lap(2/ε) noise on each cell, alongside exactly published marginal counts]
Does the Laplace mechanism still guarantee privacy?
Auxiliary marginals are published for the following reasons:
1. Legal: 2002 Supreme Court case Utah v. Evans
2. Contractual: advertisers must know exact demographics at coarse granularities
Laplace Mechanism and Correlations
[Figure: combining the noisy cells with the exact marginals, the adversary derives several independent estimates of the same target cell: Count(·,·) = 8 + Lap(2/ε), Count(·,·) = 8 − Lap(2/ε), Count(·,·) = 8 + Lap(2/ε), …]
Laplace Mechanism and Correlations
Averaging the k independent estimates gives an estimator with Mean: 8 and Variance: 8/(kε²)
– The adversary can reconstruct the table with high precision for large k
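An illustrative simulation of the averaging attack (the parameter values are mine): k derived estimates of the same true count of 8, each carrying independent Lap(2/ε) noise, shrink in variance by a factor of k when averaged:

```python
import numpy as np

eps, k, true_count = 1.0, 1000, 8
# One derived estimate per noisy cell, as reconstructed from the exact marginals.
estimates = true_count + np.random.laplace(scale=2.0 / eps, size=k)
print(estimates.mean())  # close to 8: the average has variance 8 / (k * eps**2)
```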
No Free Lunch Theorem
It is not possible to guarantee any utility in addition to privacy without making assumptions about
– the data generating distribution
– the background knowledge available to an adversary
[Kifer-M SIGMOD '11] [Dwork-Naor JPC '10]
To sum up …
Differential privacy only captures a small part of the privacy-utility trade-off space
– No Free Lunch Theorem
– Differentially private mechanisms may not ensure sufficient privacy
– Differentially private mechanisms may not ensure sufficient utility
Outline
Background
– Differential privacy
No Free Lunch [Kifer-M SIGMOD '11]
– No `one privacy notion to rule them all'
Pufferfish Privacy Framework [Kifer-M PODS '12]
– Navigating the space of privacy definitions
Blowfish: Practical privacy using policies [ongoing work]
Pufferfish Framework
Pufferfish Semantics
– What is being kept secret?
– Who are the adversaries?
– How is information disclosure bounded? (similar to ε in differential privacy)
Sensitive Information
Secrets: S is a set of potentially sensitive statements
– "individual j's record is in the data, and j has cancer"
– "individual j's record is not in the data"
Discriminative Pairs: mutually exclusive pairs of secrets
– ("Bob is in the table", "Bob is not in the table")
– ("Bob has cancer", "Bob has diabetes")
Adversaries
We assume a Bayesian adversary who can be completely characterized by his/her prior information about the data
– We do not assume computational limits
Data Evolution Scenarios: the set of all probability distributions that could have generated the data (… think adversary's prior)
– No assumptions: all probability distributions over data instances are possible
– I.I.D.: the set of all f such that P(data = {r1, r2, …, rk}) = f(r1) × f(r2) × … × f(rk)
Information Disclosure
Mechanism M satisfies ε-Pufferfish(S, Spairs, D) if, for every output ω, every discriminative pair (s, s') ∈ Spairs, and every θ ∈ D under which both s and s' have nonzero probability:
$$e^{-\varepsilon} \le \frac{\Pr[M(\text{Data}) = \omega \mid s, \theta]}{\Pr[M(\text{Data}) = \omega \mid s', \theta]} \le e^{\varepsilon}$$
Pufferfish Semantic Guarantee
$$e^{-\varepsilon} \;\le\; \underbrace{\frac{\Pr[s \mid M(\text{Data}) = \omega, \theta]}{\Pr[s' \mid M(\text{Data}) = \omega, \theta]}}_{\text{posterior odds of } s \text{ vs } s'} \Big/ \underbrace{\frac{\Pr[s \mid \theta]}{\Pr[s' \mid \theta]}}_{\text{prior odds of } s \text{ vs } s'} \;\le\; e^{\varepsilon}$$
Observing the output changes the adversary's odds of s vs s' by at most a factor of e^ε.
Applying Pufferfish to Differential Privacy
Spairs:
– "record j is in the table" vs "record j is not in the table"
– "record j is in the table with value x" vs "record j is not in the table"
Data evolution:
– Probability record j is in the table: πj
– Probability distribution over values of record j: fj
– For all θ = [f1, f2, f3, …, fk, π1, π2, …, πk]
Applying Pufferfish to Differential Privacy
Spairs:
– "record j is in the table" vs "record j is not in the table"
– "record j is in the table with value x" vs "record j is not in the table"
Data evolution:
– For all θ = [f1, f2, f3, …, fk, π1, π2, …, πk]
A mechanism M satisfies differential privacy if and only if it satisfies Pufferfish instantiated using Spairs and {θ} (as defined above)
Pufferfish & Differential Privacy
Spairs:
– s_i^x: record i takes the value x
– Attackers should not be able to significantly distinguish between any two values from the domain for any individual record.
Pufferfish & Differential Privacy
Data evolution:
– For all θ = [f1, f2, f3, …, fk]
– The adversary's prior may be any distribution that makes records independent
Pufferfish & Differential Privacy
Spairs:
– s_i^x: record i takes the value x
Data evolution:
– For all θ = [f1, f2, f3, …, fk]
A mechanism M satisfies differential privacy if and only if it satisfies Pufferfish instantiated using Spairs and {θ}
Summary of Pufferfish
A semantic approach to defining privacy
– Enumerates the information that is secret and the set of adversaries
– Bounds the odds ratio of pairs of mutually exclusive secrets
Helps understand the assumptions under which privacy is guaranteed
– Differential privacy is one specific choice of secret pairs and adversaries
How should a data publisher use this framework? Algorithms?
Outline
Background
– Differential privacy
No Free Lunch [Kifer-M SIGMOD '11]
– No `one privacy notion to rule them all'
Pufferfish Privacy Framework [Kifer-M PODS '12]
– Navigating the space of privacy definitions
Blowfish: Practical privacy using policies [ongoing work]
Blowfish Privacy
A special class of Pufferfish instantiations
(Both pufferfish and blowfish are marine fish of the Tetraodontidae family)
Blowfish Privacy
A special class of Pufferfish instantiations
Extends differential privacy using policies
– Specification of sensitive information: allows more utility
– Specification of publicly known constraints in the data: ensures privacy in correlated data
Satisfies the composition property
Sensitive Information
Secrets: S is a set of potentially sensitive statements
– "individual j's record is in the data, and j has cancer"
– "individual j's record is not in the data"
Discriminative Pairs: mutually exclusive pairs of secrets
– ("Bob is in the table", "Bob is not in the table")
– ("Bob has cancer", "Bob has diabetes")
Sensitive information in Differential Privacy
Spairs:
– s_i^x: record i takes the value x
– Attackers should not be able to significantly distinguish between any two values from the domain for any individual record.
Other notions of Sensitive Information
Medical data
– OK to infer whether an individual is healthy or not
– E.g., ("Bob is healthy", "Bob has diabetes") is not a discriminative pair of secrets for any individual
Partitioned sensitive information: the domain is partitioned into classes, and only pairs of values within the same class are discriminative pairs
Other notions of Sensitive Information
Geospatial data
– Do not want the attacker to distinguish between "close-by" points in space
– May distinguish between "far-away" points
Distance-based sensitive information
Other notions of Sensitive Information
Social networks
– The domain of an individual's record is the power set of V (nodes)
– Edge privacy: the adversary should not learn the presence or absence of any single edge
– Node privacy: the adversary should not learn an individual's entire set of connections
Generalization as a graph
Consider a graph G = (V, E), where V is the set of values that an individual's record can take, and E encodes the set of discriminative pairs
– Same for all records
Blowfish Privacy + "Policy of Secrets"
A mechanism M satisfies Blowfish privacy w.r.t. policy G if, for every set of outputs S of the mechanism, and for every pair of datasets D1, D2 that differ in one record, with values x and y s.t. (x, y) ∈ E:
$$\Pr[M(D_1) \in S] \le e^{\varepsilon} \Pr[M(D_2) \in S]$$
Blowfish Privacy + "Policy of Secrets"
The definition implies a graded guarantee: for any x and y in the domain,
$$\Pr[M(D_1) \in S] \le e^{\varepsilon \cdot d_G(x, y)} \Pr[M(D_2) \in S]$$
where d_G(x, y) is the shortest distance between x and y in G.
Blowfish Privacy + "Policy of Secrets"
If x and y appear in different disconnected components of G (d_G(x, y) = ∞), the adversary is allowed to distinguish between them.
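A minimal sketch of a policy graph and the distance d_G that drives the guarantee; the data structures and the example domain are mine, not an API from the paper:

```python
from collections import deque

def policy_distance(graph: dict, x, y) -> float:
    """BFS shortest-path distance in the policy graph; inf if disconnected."""
    frontier, dist = deque([x]), {x: 0}
    while frontier:
        u = frontier.popleft()
        if u == y:
            return float(dist[u])
        for v in graph.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                frontier.append(v)
    return float("inf")  # disconnected: adversary may distinguish x and y

# Example: an ordered domain where only adjacent ranges form secret pairs.
ages = ["0-20", "21-40", "41-60", "61-80"]
line = {a: [b for b in ages if abs(ages.index(a) - ages.index(b)) == 1]
        for a in ages}
print(policy_distance(line, "0-20", "61-80"))  # 3, so the bound is e^(3*eps)
```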
Algorithm 1: Randomized Response
Perturb each record in the table using a distribution tailored to the policy graph
Non-interactive mechanism
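A sketch of one plausible policy-aware perturbation, assuming the reporting probability decays with policy-graph distance; this specific distribution (∝ exp(−ε·d_G(x, y)/2)) is my illustration, not necessarily the distribution from the talk:

```python
import math
import random

def perturb(true_value, domain, dist, epsilon: float):
    """Report value y with probability proportional to exp(-eps * d_G(x, y) / 2).

    By the triangle inequality on d_G, the reporting probabilities under two
    true values x1, x2 differ by at most exp(eps * d_G(x1, x2)), so values
    joined by a policy edge (d_G = 1) get the e^eps indistinguishability.
    """
    weights = [math.exp(-epsilon * dist(true_value, y) / 2.0) for y in domain]
    return random.choices(domain, weights=weights, k=1)[0]
```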
Algorithms for Blowfish
Consider an ordered 1-D attribute
– Dom = {x1, x2, x3, …, xd}
– E.g., ranges of Age, Salary, etc.
Suppose our policy is: the adversary should not distinguish whether an individual's value is xj or xj+1
[Figure: line policy graph x1 — x2 — x3 — … — xd]
Algorithms for Blowfish
Suppose we want to release the histogram privately
– Number of individuals in each age range: C(x1), C(x2), …, C(xd)
Any differentially private algorithm also satisfies Blowfish
– Can use the Laplace mechanism (with sensitivity 2)
Ordered Mechanism
We can answer a different set of queries to get a different private estimator for the histogram
[Figure: nested cumulative queries S1, S2, S3, …, Sd over the counts, with Si = C(x1) + … + C(xi)]
Ordered Mechanism
We can answer each Si using the Laplace mechanism …
… but the sensitivity of the entire set of queries is only 1
– Changing one tuple from x2 to x3 changes C(x2) by −1 and C(x3) by +1, so only S2 changes
Ordered Mechanism
Each count is recovered as C(xi) = Si − S(i−1); with Lap(1/ε) noise on each Si, the variance per count is 4/ε² instead of the 8/ε² of the direct Laplace mechanism
– Factor of 2 improvement
Ordered Mechanism
In addition, we have the following constraint: the true cumulative counts are non-decreasing, S1 ≤ S2 ≤ … ≤ Sd
However, the noisy counts may not satisfy this constraint
We can post-process the noisy counts to enforce it
Ordered Mechanism
Post-processing the noisy counts to satisfy the ordering constraint yields an order of magnitude improvement for large d
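A compact sketch of the pipeline under the assumptions above (cumulative counts with Lap(1/ε) noise, then monotone post-processing); isotonic regression is my choice of constrained inference, not necessarily the paper's estimator:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def ordered_mechanism(counts: np.ndarray, epsilon: float) -> np.ndarray:
    """Noisy cumulative counts + monotone post-processing for a line policy."""
    cum = np.cumsum(counts)                    # S_i = C(x_1) + ... + C(x_i)
    noisy = cum + np.random.laplace(scale=1.0 / epsilon, size=cum.size)
    # Enforce S_1 <= S_2 <= ... <= S_d on the noisy cumulative counts.
    monotone = IsotonicRegression().fit_transform(np.arange(cum.size), noisy)
    return np.diff(monotone, prepend=0.0)      # recover per-bin estimates
```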
Ordered Mechanism
By leveraging the weaker sensitive information in the policy, we can provide significantly better utility
– Extends to more general policy specifications
– Ordered mechanisms and other Blowfish algorithms are being tested on the synthetic data generator for the LODES data product
Blowfish Privacy & Correlations
Differentially private mechanisms may not ensure privacy when correlations exist in the data
Blowfish can handle publicly known constraints on the data
– Known marginal counts in the data
– Other dependencies
The privacy definition is similar to differential privacy, with a modified notion of neighboring tables
Other instantiations of Pufferfish
All Blowfish instantiations are extensions of differential privacy using
– Weaker notions of sensitive information
– Knowledge of constraints about the data
– All Blowfish mechanisms satisfy the composition property
We can instantiate Pufferfish with other "realistic" adversary notions
– Only prior distributions that are similar to the expected data distribution
– Open question: which definitions satisfy the composition property?
Summary
Differential privacy (and the tuning knob ε) is insufficient for trading off privacy for utility in many applications
– Sparse data, social networks, …
The Pufferfish framework allows more expressive privacy definitions
– Can vary sensitive information, adversary priors, and ε
Blowfish shows one way to create more expressive definitions
– Can provide useful composable mechanisms
There is an opportunity to correctly tune privacy by using the above expressive privacy frameworks
Thank you
[M et al., PVLDB '11] A. Machanavajjhala, A. Korolova, A. Das Sarma, "Personalized Social Recommendations – Accurate or Private?", PVLDB 4(7), 2011
[Kifer-M SIGMOD '11] D. Kifer, A. Machanavajjhala, "No Free Lunch in Data Privacy", SIGMOD 2011
[Kifer-M PODS '12] D. Kifer, A. Machanavajjhala, "A Rigorous and Customizable Framework for Privacy", PODS 2012
[ongoing work] A. Machanavajjhala, B. Ding, X. He, "Blowfish Privacy: Tuning Privacy-Utility Trade-offs using Policies", in preparation