1 Pinning Down "Privacy": Defining Privacy in Statistical Databases
Adam Smith, Weizmann Institute of Science
http://theory.csail.mit.edu/~asmith
2 Database Privacy
[Figure: respondents (you, Bob, Alice) send their data to a collection and "sanitization" agency, which answers users (government, researchers, marketers, ...)]
The "census problem" poses two conflicting goals:
- Utility: users can extract "global" statistics.
- Privacy: individual information stays hidden.
How can these be formalized?
3 Database Privacy
[Same collection and "sanitization" figure as above]
Why privacy?
- Ethical and legal obligation.
- Honest answers require respondents' trust.
4 Trust is important
5 Database Privacy
[Same figure: a trusted collection agency sits between respondents and users]
- Published statistics may be tables, graphs, microdata, etc.
- They may have noise or other distortions.
- The process may be interactive.
6 Database Privacy
Variations on this model are studied in:
- Statistics
- Data mining
- Theoretical CS
- Cryptography
Each tradition has its own notion of what "privacy" means.
7 How can we formalize "privacy"?
- Different people mean different things.
- Can we pin it down mathematically?
8 Can we approach privacy scientifically?
"I ask them to take a poem and hold it up to the light like a color slide, or press an ear against its hive. [...] But all they want to do is tie the poem to a chair with rope and torture a confession out of it. They begin beating it with a hose to find out what it really means." (Billy Collins, "Introduction to Poetry")
- Pin down a social concept.
- No perfect definition? But lots of room for rigor.
- Too late? (see Adi's talk)
9 How can we formalize "privacy"?
Different people mean different things; can we pin it down mathematically?
Goal #1: Rigor.
- Prove clear theorems about privacy (few exist in the literature).
- Make clear (and refutable) conjectures.
- Sleep better at night.
Goal #2: Interesting science.
- (New) computational phenomena.
- Algorithmic problems.
- Statistical problems.
10 Overview
- Examples
- Intuitions for privacy
- Why crypto definitions don't apply
- A partial* selection of definitions
- Conclusions
(* "partial" = "incomplete" and "biased")
11 Basic Setting
- Database DB = table of n rows x_1, ..., x_n, each in a domain D.
- D can be numbers, categories, tax forms, etc.
- This talk: D = {0,1}^d, e.g. (Married?, Employed?, Over 18?, ...).
[Figure: DB = (x_1, ..., x_n) is held by the sanitizer San; users (government, researchers, marketers, ...) send queries 1, ..., T and receive answers; San may use random coins]
12 Examples of sanitization methods
- Input perturbation: change the data before processing. E.g. randomized response: flip each bit of the table with probability p (sketched below).
- Summary statistics: means, variances, marginal totals (e.g. # people with blue eyes and brown hair), regression coefficients.
- Output perturbation: summary statistics with noise.
- Interactive versions of the above: an auditor decides which queries are OK and what type of noise to add.
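To make the input-perturbation bullet concrete, here is a minimal sketch of randomized response on a 0/1 table; the function names and the unbiased frequency estimator are our own illustration and are not taken from the slides (the estimator assumes p != 1/2).

```python
import random

def randomized_response(table, p):
    """Flip each bit of a 0/1 table independently with probability p."""
    return [[bit ^ (random.random() < p) for bit in row] for row in table]

def estimate_frequency(noisy_column, p):
    """Unbiased estimate of the true fraction of 1s in a column, given that
    each bit was flipped with probability p (requires p != 1/2)."""
    observed = sum(noisy_column) / len(noisy_column)
    return (observed - p) / (1 - 2 * p)

# Example: 1000 respondents, true frequency of the attribute ~0.3, flip probability 0.25
truth = [[1 if random.random() < 0.3 else 0] for _ in range(1000)]
noisy = randomized_response(truth, p=0.25)
print(estimate_frequency([row[0] for row in noisy], p=0.25))  # roughly 0.3
```

Individual noisy rows reveal little about any respondent, while aggregate frequencies can still be recovered, which is exactly the utility/privacy trade the slide describes.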
13 Two Intuitions for Privacy
- "If the release of statistics S makes it possible to determine the value [of private information] more accurately than is possible without access to S, a disclosure has taken place." [Dalenius] In other words, learning more about me should be hard.
- Privacy is "protection from being brought to the attention of others." [Gavison] In other words, safety is blending into a crowd.
(Remove Gavison def?)
14 Why not use crypto definitions?
Attempt #1:
- Definition: for every entry i, no information about x_i is leaked (as if it were encrypted).
- Problem: then no information at all is revealed! There is a tradeoff between privacy and utility.
Attempt #2:
- Agree on summary statistics f(DB) that are safe.
- Definition: no information about DB is revealed except f(DB).
- Problem: how do we decide that f is safe? Tautology trap. (Also: how do you figure out what f is? -- Yosi)
15 Overview
- Examples
- Intuitions for privacy
- Why crypto definitions don't apply
- A partial* selection of definitions:
  - Two straw men
  - Blending into the crowd
  - An impossibility result
  - Attribute disclosure and differential privacy
- Conclusions
(* "partial" = "incomplete" and "biased")
Criteria for a definition: understandable; clear about the adversary's goals and prior knowledge / side information.
(I am a co-author...)
16 Straw man #1: Exact Disclosure
[Figure: adversary A interacts with San(DB = x_1, ..., x_n), issuing queries 1, ..., T and receiving answers; San uses random coins]
- Definition: safe if the adversary cannot learn any entry exactly.
- Leads to nice (but hard) combinatorial problems.
- Does not preclude learning a value with 99% certainty, or narrowing it down to a small interval.
Historically:
- Focus: auditing interactive queries.
- Difficulty: understanding the relationships between queries, e.g. two queries with a small difference between them (as in the sketch below).
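As an illustration of why auditing must track relationships between queries, here is a small sketch (ours, with hypothetical data) of the classic differencing attack: two exact sum queries whose target sets differ in a single row reveal that row, even though neither query alone discloses any entry.

```python
def sum_query(db, rows):
    """Answer an exact sum query over the selected row indices."""
    return sum(db[i] for i in rows)

# Hypothetical salary column; the adversary wants entry 3 ("Alice").
db = [52, 61, 48, 75, 66]

q1 = sum_query(db, rows={0, 1, 2, 3, 4})  # sum over all rows
q2 = sum_query(db, rows={0, 1, 2, 4})     # sum over all rows except Alice's

print(q1 - q2)  # 75 -- Alice's value, recovered exactly
```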
17 Straw man #2: Learning the distribution
- Assume x_1, ..., x_n are drawn i.i.d. from an unknown distribution.
- Definition: San is safe if it only reveals the distribution.
- Implied approach: learn the distribution, then release a description of it or re-sample points from it.
- Problem: tautology trap. The estimate of the distribution depends on the data -- why is releasing it safe?
18 Blending into a Crowd
Intuition: I am safe in a group of k or more (k varies: 3... 6... 100... 10,000?).
Many variations on the theme:
- The adversary wants a predicate g such that 0 < #{ i : g(x_i) = true } < k.
- Such a g is called a breach of privacy.
Why?
- Fundamental: R. Gavison's "protection from being brought to the attention of others"; a rare property helps re-identify someone.
- Implicit: information about a large group is public, e.g. liver problems are more prevalent among diabetics.
19 Blending into a Crowd (continued)
[Same intuition and breach predicate as on the previous slide]
How can we capture this?
- Syntactic definitions
- Bayesian adversary
- "Crypto-flavored" definitions
Two variants: frequency in DB vs. frequency in the underlying population.
20 "Syntactic" Definitions
- Given a sanitization S, look at the set of all databases consistent with S.
- Definition: safe if no predicate is a breach for all consistent databases.
k-anonymity [L. Sweeney]:
- The sanitization is a histogram of the data: partition D into bins B_1 ∪ B_2 ∪ ... ∪ B_t and output the cardinalities f_j = #(DB ∩ B_j).
- Safe if for all j, either f_j ≥ k or f_j = 0.
Cell bound methods [statistics, 1990s]:
- The sanitization consists of marginal sums: let f_z = #{ i : x_i = z }; then San(DB) = various sums of the f_z.
- Safe if for all z, either there exists a consistent DB with f_z ≥ k, or all consistent DBs have f_z = 0.
- There is a large literature using algebraic and combinatorial techniques.
Example (hair color vs. eye color): actual counts, and the cell bounds implied by releasing only the marginals.

          brown   blue    total
  blond   2       10      12
  brown   12      6       18
  total   14      16

          brown    blue     total
  blond   [0,12]            12
  brown   [0,14]   [0,16]   18
  total   14       16
21 "Syntactic" Definitions (continued)
[Same k-anonymity and cell-bound definitions, with the same example tables, as on the previous slide; the k-anonymity check is sketched below]
Issues:
- If k is small: "all three Canadians at Weizmann sing in a choir."
- Semantics? Probability is not considered. What if I have side information?
- The algorithm for making decisions is not considered. What adversary does this apply to?
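For concreteness, a minimal sketch (ours; the bins and parameters are hypothetical) of the k-anonymity condition on a released histogram: every bin that appears in the output must contain at least k rows.

```python
from collections import Counter

def histogram(db, bin_of):
    """Count how many rows fall into each bin, where bin_of maps a row to its bin."""
    return Counter(bin_of(row) for row in db)

def is_k_anonymous(counts, k):
    """Safe in the k-anonymity sense: every nonzero bin cardinality f_j is at least k."""
    return all(f_j >= k for f_j in counts.values() if f_j > 0)

# Hypothetical records binned by (hair color, eye color), matching the example table
db = ([("blond", "brown")] * 2 + [("blond", "blue")] * 10 +
      [("brown", "brown")] * 12 + [("brown", "blue")] * 6)
counts = histogram(db, bin_of=lambda row: row)
print(is_k_anonymous(counts, k=2))  # True
print(is_k_anonymous(counts, k=3))  # False: the (blond, brown) bin has only 2 rows
```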
22 Security for "Bayesian" adversaries
Goal: the adversary outputs a point z ∈ D.
- Score = 1/f_z if f_z > 0, and 0 otherwise.
- Definition: the sanitization is safe if E(score) ≤ ε.
Procedure:
- Assume you know the adversary's prior distribution over databases.
- Given a candidate output (e.g. a set of marginal sums), update the prior conditioned on that output (via Bayes' rule).
- If max_z E(score | output) < ε, then release; else consider a new set of marginal sums (see the sketch below).
- There is an extensive literature on computing this expected value (see Yosi's talk).
Issues:
- Restricts the type of predicates the adversary can choose.
- Must know the prior distribution. Can one scheme work for many distributions?
- The sanitizer works harder than the adversary.
- The conditional probabilities don't take previous iterations into account ("simulatability" [KMN'05]). Can this be fixed (with efficient computations)?
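The release rule can be sketched as follows (our own toy version; the posterior, the domain, and the threshold are hypothetical): compute the adversary's expected score for each possible guess z under the posterior induced by the candidate output, and release only if the maximum stays below the threshold.

```python
from collections import Counter

def expected_score(posterior, z):
    """E[score] for a fixed guess z, where score = 1/f_z if f_z > 0 and 0 otherwise.
    `posterior` maps candidate databases (tuples of rows) to probabilities."""
    total = 0.0
    for db, prob in posterior.items():
        f_z = Counter(db)[z]
        if f_z > 0:
            total += prob / f_z
    return total

def safe_to_release(posterior, domain, threshold):
    """Release only if the best possible guess has expected score below the threshold."""
    return max(expected_score(posterior, z) for z in domain) < threshold

# Toy posterior over two equally likely 3-row databases with rows from {"a", "b"}
posterior = {("a", "a", "b"): 0.5, ("a", "b", "b"): 0.5}
print(safe_to_release(posterior, domain={"a", "b"}, threshold=0.8))  # True: max E[score] = 0.75
```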
23 Crypto-flavored Approach [CDMSW, CDMT, NS]
"If the release of statistics S makes it possible to determine the value [of private information] more accurately than is possible without access to S, a disclosure has taken place." [Dalenius]
24 Crypto-flavored Approach [CDMSW, CDMT, NS]
[CDMSW] compare to a "simulator": for all distributions on databases DB, for all adversaries A, there exists an A' such that for all subsets J ⊆ DB:
  Pr_{DB,S}[ A(S) = breach in J ] - Pr_{DB}[ A'() = breach in J ] ≤ ε
- The definition says nothing if the adversary knows x_1, so require that it hold for all subsets of DB.
- No non-trivial examples satisfy this definition, so restrict the family of distributions to some class C, and try to make C as large as possible.
- Sufficient: rows drawn i.i.d. from a "smooth" distribution.
25 Crypto-flavored Approach [CDMSW, CDMT, NS]
[CDMSW] compare to a "simulator": for all distributions on databases DB ∈ C, for all adversaries A, there exists an A' such that for all subsets J ⊆ DB:
  Pr_{DB,S}[ A(S) = breach in J ] - Pr_{DB}[ A'() = breach in J ] ≤ ε
[CDMSW, CDMT] geometric data:
- Assume x_i ∈ R^d.
- Relax the definition to ball predicates g_{z,r} = { x : ||x - z|| ≤ r } and g'_{z,r} = { x : ||x - z|| ≤ C·r }.
- Breach if #(DB ∩ g_{z,r}) > 0 and #(DB ∩ g'_{z,r}) < k.
- Several types of histograms can be released.
- Sufficient for "metric" problems: clustering, minimum spanning tree, ...
26 Crypto-flavored Approach [CDMSW, CDMT, NS]
[Same simulator-based definition as on the previous slide]
[NS] no geometric restrictions:
- A lot of noise -- almost erase the data!
- Strong privacy statement, but very weak utility.
[CDMSW, CDMT, NS]: proven statements!
Issues:
- Works for a large class of prior distributions and side information, but not for all.
- Not clear whether it helps with "ordinary" statistical calculations.
- Interesting utility requires geometric restrictions.
- Too messy?
27 Blending into a Crowd
Intuition: I am safe in a group of k or more.
Pros:
- Appealing intuition for privacy; seems fundamental.
- Mathematically interesting.
- Meaningful statements are possible!
Cons:
- Does it rule out learning facts about a particular individual?
- All results seem to make strong assumptions on the adversary's prior distribution. Is this necessary? (Yes...)
28 Overview
- Examples
- Intuitions for privacy
- Why crypto definitions don't apply
- A partial* selection of definitions:
  - Two straw men
  - Blending into the crowd
  - An impossibility result
  - Attribute disclosure and differential privacy
- Conclusions
29 An impossibility result
An abstract schema:
- Define a privacy breach.
- For all distributions on databases, for all adversaries A, there exists an A' such that
  Pr( A(San) = breach ) - Pr( A'() = breach ) ≤ ε
Theorem [Dwork-Naor]: for any reasonable notion of "breach", if San(DB) contains information about DB, then some adversary breaks this definition.
Example: the adversary knows that Alice is 2 inches shorter than the average Lithuanian -- but how tall are Lithuanians? With the sanitized database, the probability of guessing Alice's height goes up.
Theorem: this is unavoidable.
30 Proof sketch
Suppose that:
- when DB is uniform, the mutual information I( DB ; San(DB) ) > 0, and
- a "breach" means predicting a predicate g(DB).
Construction:
- Pick a hash function h: {databases} → {0,1}^{H(DB | San)}.
- Let the prior distribution be uniform, conditioned on h(DB) = z.
- Then h(DB) = z alone gives no information about g(DB), but San(DB) and h(DB) = z together determine DB.
[DN] vastly generalize this.
31 Preventing Attribute Disclosure
[Figure: adversary A interacts with San(DB = x_1, ..., x_n), issuing queries 1, ..., T and receiving answers]
A large class of definitions: safe if the adversary can't learn "too much" about any entry. For example:
- The adversary cannot narrow x_i down to a small interval.
- For uniform x_i, the mutual information I( x_i ; San(DB) ) ≤ ε.
How can we decide among these definitions?
32 Differential Privacy
- Lithuanians example: the adversary learns Alice's height even if Alice is not in DB.
- Intuition [DM]: "Whatever is learned would be learned regardless of whether or not Alice participates."
- Dual view: whatever is already known, the situation won't get worse.
[Same adversary/San figure as before]
33 Differential Privacy
[Figure: DB with row x_i replaced by 0]
Define n+1 games:
- "Game 0": the adversary interacts with San(DB).
- For each i, let DB_{-i} = (x_1, ..., x_{i-1}, 0, x_{i+1}, ..., x_n).
- "Game i": the adversary interacts with San(DB_{-i}).
Bayesian adversary: given a transcript S and a prior distribution p(·) on DB, this defines n+1 posterior distributions p_0(·|S), p_1(·|S), ..., p_n(·|S).
34 Differential Privacy
Definition: San is safe if for all prior distributions p(·) on DB, all transcripts S, and all i = 1, ..., n:
  StatDiff( p_0(·|S), p_i(·|S) ) ≤ ε
Note that the prior distribution may be far from both posteriors.
How can we satisfy this?
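To fix the notation, a small sketch (ours, with toy numbers) of the statistical difference used in the definition, computed between two posteriors over a handful of candidate databases.

```python
def stat_diff(p, q):
    """Statistical (total variation) difference between two distributions,
    each given as a dict mapping outcomes to probabilities."""
    outcomes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in outcomes)

# Toy posteriors p_0(.|S) and p_i(.|S) over three candidate databases
p0 = {"db_a": 0.50, "db_b": 0.30, "db_c": 0.20}
pi = {"db_a": 0.45, "db_b": 0.35, "db_c": 0.20}
print(stat_diff(p0, pi))  # 0.05 -- safe for any privacy parameter epsilon >= 0.05
```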
35 Approach: Indistinguishability [DiNi, EGS, BDMN]
[Figure: two databases DB and DB' that differ in one row (x_2 vs. x_2'); San answers queries 1, ..., T on each, producing transcripts S and S']
- Require that the distributions of the transcripts S and S' are at "distance" at most ε.
- The choice of distance measure is important.
36 Approach: Indistinguishability [DiNi, EGS, BDMN]
[Same picture as the previous slide: the transcripts of two databases differing in one row must be at "distance" at most ε; the choice of distance measure is important.]
37 Approach: Indistinguishability [DiNi, EGS, BDMN]
Problem: with statistical difference as the distance, ε must be large.
- By a hybrid argument, any two databases induce transcripts at distance at most nε.
- To get utility, we therefore need ε > 1/n.
- But statistical difference 1/n is not meaningful.
Example: release a random point in the database, San(x_1, ..., x_n) = (j, x_j) for a random j.
- For every i, changing x_i induces statistical difference 1/n.
- Yet some x_i is revealed with probability 1.
38 Formalizing Indistinguishability
Definition: San is ε-indistinguishable if for all adversaries A, all databases DB, DB' that differ in one row, and all sets of transcripts E:
  p( San(DB) ∈ E ) ∈ e^{±ε} · p( San(DB') ∈ E )
Equivalently, for all transcripts S:
  p( San(DB) = S ) / p( San(DB') = S ) ∈ 1 ± ε
[Figure: adversary A sees transcript S when interacting with DB and transcript S' when interacting with DB']
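As a sanity check on the definition (our own example, not from the slides), the sketch below computes the worst-case likelihood ratio for single-bit randomized response with flip probability p; the resulting ε is ln((1-p)/p), so smaller flip probabilities give weaker indistinguishability.

```python
import math

def output_prob(true_bit, reported_bit, p):
    """Probability that one-bit randomized response reports `reported_bit`
    when the true value is `true_bit` and bits are flipped with probability p."""
    return p if reported_bit != true_bit else 1 - p

def worst_case_epsilon(p):
    """Smallest epsilon such that, for any two neighboring one-bit databases
    and any output, the output probabilities differ by a factor of at most e^epsilon."""
    ratios = [output_prob(b, out, p) / output_prob(1 - b, out, p)
              for b in (0, 1) for out in (0, 1)]
    return math.log(max(ratios))

print(worst_case_epsilon(0.25))      # ln(3) ~= 1.0986
print(math.log((1 - 0.25) / 0.25))   # the same value, computed analytically
```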
39 Indistinguishability ⇒ Differential Privacy
Recall the definition: San is safe if for all prior distributions p(·) on DB, all transcripts S, and all i = 1, ..., n:
  StatDiff( p_0(·|S), p_i(·|S) ) ≤ ε
We can use indistinguishability: for every transcript S and every database DB, the probabilities of producing S from DB and from DB_{-i} differ by a factor of at most e^{±ε}. This implies StatDiff( p_0(·|S), p_i(·|S) ) ≤ ε.
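The step from the transcript bound to the posterior bound is the usual Bayes-rule calculation; the block below is a reconstruction of that standard argument rather than a statement taken from the slide, and the exact constant may differ. Here x ranges over candidate databases and x_{-i} denotes x with row i replaced by 0.

```latex
% Bayes' rule for each posterior:
\[
  p_0(x \mid S) = \frac{\Pr[\mathrm{San}(x) = S]\; p(x)}{\sum_{y} \Pr[\mathrm{San}(y) = S]\; p(y)},
  \qquad
  p_i(x \mid S) = \frac{\Pr[\mathrm{San}(x_{-i}) = S]\; p(x)}{\sum_{y} \Pr[\mathrm{San}(y_{-i}) = S]\; p(y)}.
\]
% Since x and x_{-i} differ in one row, epsilon-indistinguishability bounds each likelihood:
\[
  e^{-\epsilon} \;\le\; \frac{\Pr[\mathrm{San}(x_{-i}) = S]}{\Pr[\mathrm{San}(x) = S]} \;\le\; e^{\epsilon},
  \qquad\text{so}\qquad
  p_i(x \mid S) \in e^{\pm 2\epsilon}\, p_0(x \mid S).
\]
% Summing over x gives the statistical-difference bound:
\[
  \mathrm{StatDiff}\bigl(p_0(\cdot \mid S),\, p_i(\cdot \mid S)\bigr)
  = \tfrac{1}{2} \sum_{x} \bigl| p_0(x \mid S) - p_i(x \mid S) \bigr|
  \;\le\; \tfrac{1}{2}\bigl(e^{2\epsilon} - 1\bigr) \;\approx\; \epsilon
  \quad\text{for small } \epsilon.
\]
```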
40 Why does this help?
With relatively little noise, one can release:
- Averages (sketched below)
- Histograms
- Matrix decompositions
- Certain types of clustering
- ...
See Kobbi's talk.
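For the "averages" item, a minimal sketch of the standard output-perturbation approach (Laplace noise calibrated to the sensitivity of the average); the clipping range and the noise scale (upper - lower)/(n·ε) are our choice of normalization, not values taken from the slides.

```python
import math
import random

def laplace_noise(scale):
    """Sample from Laplace(0, scale) by inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def private_average(values, epsilon, lower=0.0, upper=1.0):
    """Release the average of `values` (clipped to [lower, upper]) with Laplace
    noise calibrated so that changing any single value changes the output
    distribution by a factor of at most e^epsilon."""
    n = len(values)
    clipped = [min(max(v, lower), upper) for v in values]
    true_avg = sum(clipped) / n
    # One row can move the average by at most (upper - lower) / n,
    # so Laplace noise of scale (upper - lower) / (n * epsilon) suffices.
    return true_avg + laplace_noise((upper - lower) / (n * epsilon))

data = [random.random() for _ in range(10_000)]
print(private_average(data, epsilon=0.1))  # close to 0.5; the noise scale is only 0.001
```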
41 Preventing Attribute Disclosure
- There are various ways to capture "no particular value should be revealed".
- Differential criterion: "Whatever is learned would be learned regardless of whether or not person i participates."
- This is satisfied by indistinguishability. Does it also imply protection from re-identification?
Two interpretations:
- A given release won't make privacy worse.
- A rational respondent will answer if there is some gain.
Can we preserve enough utility?
42 Overview
- Examples
- Intuitions for privacy
- Why crypto definitions don't apply
- A partial* selection of definitions:
  - Two straw men
  - Blending into the crowd
  - An impossibility result
  - Attribute disclosure and differential privacy
(* "partial" = "incomplete" and "biased")
43 Things I Didn't Talk About
- The economic perspective [KPR]: the utility of providing data = value - cost; it may depend on whether others participate; when is it worth my while?
- Specific methods for re-identification.
- Various other frameworks (e.g. "l-diversity").
- Other pieces of the big "data privacy" picture: access control, and implementing a trusted collection center.
44 Conclusions
- Pinning down a social notion in a particular context.
- A biased survey of approaches to definitions, with a taste of the techniques along the way (utility was not covered).
- The question has a different flavor from the usual crypto problems and from statisticians' traditional conception.
- Meaningful statements are possible!
- Practical? Do they cover everything? No.
45 Conclusions
- How close are we to converging (as happened for, e.g., secure function evaluation, encryption, Turing machines, ...)? But we're after a social concept.
- Is there a silver bullet?
- What are the big challenges?
- We need "cryptanalysis" of these systems (Adi...?)