
1 Pinning Down “Privacy”: Defining Privacy in Statistical Databases
Adam Smith, Weizmann Institute of Science
http://theory.csail.mit.edu/~asmith

2 Database Privacy
[Diagram: respondents (You, Bob, Alice) → collection and “sanitization” → users (government, researchers, marketers, …)]
The “census problem”: two conflicting goals.
- Utility: users can extract “global” statistics.
- Privacy: individual information stays hidden.
How can these be formalized?

3 Database Privacy
[Same diagram: You, Bob, Alice → collection and “sanitization” → users (government, researchers, marketers, …)]
Why privacy?
- Ethical and legal obligation.
- Honest answers require respondents’ trust.

4 Trust is important

5 Database Privacy
[Same diagram, highlighting the trusted collection agency]
- Trusted collection agency.
- Published statistics may be tables, graphs, microdata, etc.
- May have noise or other distortions.
- May be interactive.

6 Database Privacy
[Same diagram: collection and “sanitization”, then release to users]
Variations on this model are studied in statistics, data mining, theoretical CS, and cryptography.
Different traditions for what “privacy” means.

7 How can we formalize “privacy”?
Different people mean different things. Pin it down mathematically?

8 “I ask them to take a poem and hold it up to the light like a color slide or press an ear against its hive. […] But all they want to do is tie the poem to a chair with rope and torture a confession out of it. They begin beating it with a hose to find out what it really means.” - Billy Collins, “Introduction to Poetry”
Can we approach privacy scientifically?
- Pin down a social concept.
- No perfect definition? But lots of room for rigor.
- Too late? (See Adi’s talk.)

9 How can we formalize “privacy”?
Different people mean different things. Pin it down mathematically?
Goal #1: Rigor
- Prove clear theorems about privacy (few exist in the literature).
- Make clear (and refutable) conjectures.
- Sleep better at night.
Goal #2: Interesting science
- (New) computational phenomenon.
- Algorithmic problems.
- Statistical problems.

10 Overview
- Examples
- Intuitions for privacy (why crypto definitions don’t apply)
- A partial* selection of definitions
- Conclusions
*“partial” = “incomplete” and “biased”

11 Basic Setting
Database DB = table of n rows x_1, …, x_n, each in a domain D.
- D can be numbers, categories, tax forms, etc.
- This talk: D = {0,1}^d, e.g. Married?, Employed?, Over 18?, …
[Diagram: DB = (x_1, x_2, x_3, …, x_{n-1}, x_n) → San (using random coins) ↔ users (government, researchers, marketers, …) via query 1 / answer 1, …, query T / answer T]

12 Examples of sanitization methods
- Input perturbation: change the data before processing. E.g. randomized response: flip each bit of the table with probability p.
- Summary statistics: means, variances, marginal totals (# people with blue eyes and brown hair), regression coefficients.
- Output perturbation: summary statistics with noise.
- Interactive versions of the above: an auditor decides which queries are OK and what type of noise to add.
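
As a concrete illustration of the randomized-response idea above, here is a minimal Python sketch; the function names, the toy table, and the unbiasing step are my own illustration, not from the slides.

```python
import random

def randomized_response(db, p):
    """Input perturbation: flip each bit of each row independently with probability p."""
    return [[bit ^ (random.random() < p) for bit in row] for row in db]

def estimate_column_mean(noisy_db, col, p):
    """Unbias the observed frequency: E[observed] = (1 - p)*true + p*(1 - true)."""
    observed = sum(row[col] for row in noisy_db) / len(noisy_db)
    return (observed - p) / (1 - 2 * p)   # requires p != 1/2

# Toy table: 1000 rows, 3 binary attributes (Married?, Employed?, Over 18?)
db = [[random.randint(0, 1) for _ in range(3)] for _ in range(1000)]
noisy = randomized_response(db, p=0.25)
print(estimate_column_mean(noisy, col=0, p=0.25))
```

Aggregate statistics remain estimable after unbiasing, while any single released bit is noisy.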

13 Two Intuitions for Privacy
“If the release of statistics S makes it possible to determine the value [of private information] more accurately than is possible without access to S, a disclosure has taken place.” [Dalenius]
- Learning more about me should be hard.
Privacy is “protection from being brought to the attention of others.” [Gavison]
- Safety is blending into a crowd.
(Remove Gavison def?)

14 Why not use crypto definitions?
Attempt #1:
- Def’n: for every entry i, no information about x_i is leaked (as if encrypted).
- Problem: then no information at all is revealed! There is a tradeoff between privacy and utility.
Attempt #2:
- Agree on summary statistics f(DB) that are safe.
- Def’n: no information about DB except f(DB).
- Problem: how do you decide that f is safe? Tautology trap. (Also: how do you figure out what f is? --Yosi)

15 Overview
- Examples
- Intuitions for privacy (why crypto definitions don’t apply)
- A partial* selection of definitions: two straw men; blending into the crowd; an impossibility result; attribute disclosure and differential privacy
- Conclusions
*“partial” = “incomplete” and “biased”
Criteria for a definition:
- Understandable.
- Clear about the adversary’s goals and prior knowledge / side information.
- “I am a co-author…”

16 Straw man #1: Exact Disclosure
[Diagram: DB = (x_1, …, x_n) → San (random coins) ↔ adversary A via query 1 / answer 1, …, query T / answer T]
Def’n: safe if the adversary cannot learn any entry exactly.
- Leads to nice (but hard) combinatorial problems.
- Does not preclude learning a value with 99% certainty, or narrowing it down to a small interval.
Historically:
- Focus: auditing interactive queries.
- Difficulty: understanding relationships between queries, e.g. two queries with small difference (see the sketch below).
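
To make the “two queries with small difference” difficulty concrete, here is a hypothetical differencing attack: if exact sums are answered, subtracting two overlapping queries isolates a single entry. The names and values are invented for the example.

```python
# Two exact sum queries whose supports differ in exactly one person
# reveal that person's value, even though neither query looks sensitive alone.
db = {"alice": 1, "bob": 0, "carol": 1, "dave": 1}   # one sensitive bit per person (made up)

def sum_query(db, names):
    return sum(db[name] for name in names)

q1 = sum_query(db, ["alice", "bob", "carol", "dave"])   # everyone
q2 = sum_query(db, ["alice", "bob", "carol"])           # everyone except dave
print("dave's bit =", q1 - q2)                          # exact disclosure of one entry
```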

17 Straw man #2: Learning the Distribution
Assume x_1, …, x_n are drawn i.i.d. from an unknown distribution.
Def’n: San is safe if it only reveals the distribution.
Implied approach:
- Learn the distribution.
- Release a description of the distribution, or re-sample points from it.
Problem: tautology trap. The estimate of the distribution depends on the data… so why is it safe?

18 Blending into a Crowd
Intuition: I am safe in a group of k or more.
- k varies (3… 6… 100… 10,000?)
Many variations on the theme:
- The adversary wants a predicate g such that 0 < #{i : g(x_i) = true} < k.
- Such a g is called a breach of privacy.
Why?
- Fundamental: R. Gavison: “protection from being brought to the attention of others.”
- A rare property helps re-identify someone.
- Implicit: information about a large group is public (e.g. liver problems are more prevalent among diabetics).

19 Blending into a Crowd (continued)
(Same intuition and breach definition as the previous slide.)
How can we capture this?
- Syntactic definitions
- Bayesian adversary
- “Crypto-flavored” definitions
Two variants: frequency in DB vs. frequency in the underlying population.
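
One operational reading of the breach condition 0 < #{i : g(x_i) = true} < k, as a small sketch; the example rows and predicate are invented.

```python
def is_breach(db, g, k):
    """g is a breach of privacy if 0 < #{i : g(x_i) = true} < k, i.e. it isolates a group smaller than k."""
    count = sum(1 for x in db if g(x))
    return 0 < count < k

# Rows as attribute dicts; the predicate below is an invented example.
db = [{"age": 34, "zip": "76100", "diabetic": True},
      {"age": 35, "zip": "76100", "diabetic": False},
      {"age": 34, "zip": "76100", "diabetic": True}]
print(is_breach(db, lambda x: x["age"] == 35 and x["zip"] == "76100", k=3))   # True: a group of size 1
```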

20 “Syntactic” Definitions
Given a sanitization S, look at the set of all databases consistent with S.
Def’n: safe if no predicate is a breach for all consistent databases.
k-anonymity [L. Sweeney]: the sanitization is a histogram of the data.
- Partition D into bins B_1 ∪ B_2 ∪ … ∪ B_t.
- Output the cardinalities f_j = #(DB ∩ B_j).
- Safe if for all j, either f_j ≥ k or f_j = 0.
Cell bound methods [statistics, 1990s]: the sanitization consists of marginal sums.
- Let f_z = #{i : x_i = z}. Then San(DB) = various sums of the f_z.
- Safe if for all z, either there exists a consistent DB with f_z ≥ k, or f_z = 0 in all consistent DBs.
- Large literature using algebraic and combinatorial techniques.
Example (hair color rows, eye color columns):

  Actual counts    brown   blue   total
  blond              2      10     12
  brown             12       6     18
  total             14      16

  Cell bounds from the marginals:
                   brown    blue    total
  blond            [0,12]   [0,12]   12
  brown            [0,14]   [0,16]   18
  total             14       16

21 “Syntactic” Definitions (continued)
(Same definitions and example tables as the previous slide.)
Issues:
- If k is small: “all three Canadians at Weizmann sing in a choir.”
- Semantics? Probability is not considered. What if I have side information?
- The algorithm for making decisions is not considered. What adversary does this apply to?
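
A sketch of the k-anonymity condition on a released histogram, as defined on the previous slide (every published bin count f_j must be 0 or at least k); the binning and example data are invented.

```python
from collections import Counter

def histogram(db, bin_of):
    """Released cardinalities f_j = #(DB intersect B_j), for a user-supplied binning function."""
    return Counter(bin_of(x) for x in db)

def is_k_anonymous(freqs, k):
    """The slide's condition: every published bin count is either 0 or at least k."""
    return all(f == 0 or f >= k for f in freqs.values())

# Invented example: bins are exact (hair colour, eye colour) combinations.
db = [("blond", "brown"), ("blond", "blue"), ("brown", "brown"),
      ("brown", "blue"), ("brown", "blue")]
freqs = histogram(db, bin_of=lambda x: x)
print(dict(freqs), is_k_anonymous(freqs, k=2))   # several bins have count 1, so not 2-anonymous
```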

22 Security for “Bayesian” Adversaries
Goal: the adversary outputs a point z ∈ D.
- Score = 1/f_z if f_z > 0, and 0 otherwise (where f_z = #{i : x_i = z}).
Def’n: the sanitization is safe if E(score) is below some threshold.
Procedure (assume you know the adversary’s prior distribution over databases). Given a candidate output (e.g. a set of marginal sums):
- Update the prior conditioned on the output (via Bayes’ rule).
- If max_z E(score | output) is below the threshold, then release.
- Else consider a new set of marginal sums.
Extensive literature on computing this expected value (see Yosi’s talk).
Issues:
- Restricts the type of predicates the adversary can choose.
- Must know the prior distribution. Can one scheme work for many distributions? The sanitizer works harder than the adversary.
- Conditional probabilities don’t consider previous iterations: “simulatability” [KMN’05]. Can this be fixed (with efficient computations)?
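
A toy sketch of the sanitizer’s procedure above, with a tiny made-up prior over candidate databases: condition the prior on the candidate output via Bayes’ rule and release only if the maximum expected score stays below the chosen threshold. Everything below is illustrative, not from the slides.

```python
from collections import Counter

def posterior(prior, candidate_output, summary):
    """Bayes update: condition the prior on the candidate output being released."""
    weights = {db: p for db, p in prior.items() if summary(db) == candidate_output}
    total = sum(weights.values())
    return {db: p / total for db, p in weights.items()} if total else {}

def max_expected_score(post, domain):
    """max over target points z of E[score], where score = 1/f_z if f_z > 0 and 0 otherwise."""
    def expected_score(z):
        return sum(p * (1 / Counter(db)[z] if Counter(db)[z] > 0 else 0)
                   for db, p in post.items())
    return max(expected_score(z) for z in domain)

# Toy prior: databases are pairs of bits, the candidate release is their sum.
prior = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
post = posterior(prior, candidate_output=1, summary=sum)
print(max_expected_score(post, domain=[0, 1]))   # release only if this stays below the chosen threshold
```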

23 Crypto-flavored Approach [CDMSW, CDMT, NS]
“If the release of statistics S makes it possible to determine the value [of private information] more accurately than is possible without access to S, a disclosure has taken place.” [Dalenius]

24 Crypto-flavored Approach [CDMSW, CDMT, NS]
[CDMSW]: compare to a “simulator”: for all distributions on databases DB, for all adversaries A, there exists A′ such that for all subsets J ⊆ DB:
  Pr_{DB,S}[ A(S) = breach in J ] − Pr_{DB}[ A′() = breach in J ] ≤ ε
- The definition says nothing if the adversary knows x_1, so require that it hold for all subsets of DB.
- No non-trivial examples satisfy this definition, so restrict the family of distributions to some class C (“DB ∈ C”), and try to make C as large as possible. Sufficient: i.i.d. from a “smooth” distribution.

25 Crypto-flavored Approach [CDMSW, CDMT, NS]
[CDMSW] definition, now restricted to the class C: for all distributions on databases DB ∈ C, for all adversaries A, there exists A′ such that for all subsets J ⊆ DB, Pr_{DB,S}[ A(S) = breach in J ] − Pr_{DB}[ A′() = breach in J ] ≤ ε.
[CDMSW, CDMT] Geometric data:
- Assume x_i ∈ R^d.
- Relax the definition to ball predicates: g_{z,r} = {x : ||x − z|| ≤ r} and g′_{z,r} = {x : ||x − z|| ≤ C·r}; a breach occurs if #(DB ∩ g_{z,r}) > 0 and #(DB ∩ g′_{z,r}) < k.
- Several types of histograms can be released.
- Sufficient for “metric” problems: clustering, minimum spanning tree, …
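
A sketch of the relaxed ball-predicate breach from the slide: a breach occurs when the radius-r ball around some point z contains at least one row while the blown-up ball of radius C·r still contains fewer than k rows. The points and parameters below are arbitrary.

```python
import math

def ball_count(db, z, r):
    """#(DB intersect {x : ||x - z|| <= r}) for points in R^d."""
    return sum(1 for x in db if math.dist(x, z) <= r)

def is_ball_breach(db, z, r, C, k):
    """Breach: the radius-r ball around z is nonempty, yet the blown-up C*r ball holds fewer than k points."""
    return ball_count(db, z, r) > 0 and ball_count(db, z, C * r) < k

# Arbitrary points in R^2; the last one is isolated, so it can be singled out.
db = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
print(is_ball_breach(db, z=(5.0, 5.0), r=0.5, C=3, k=3))   # True
```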

26 Crypto-flavored Approach [CDMSW, CDMT, NS]
(Same simulator-style definition as the previous slides.)
[NS] No geometric restrictions:
- A lot of noise; the data are almost erased!
- Strong privacy statement, but very weak utility.
[CDMSW, CDMT, NS]: proven statements!
Issues:
- Works for a large class of prior distributions and side information, but not for all.
- Not clear if it helps with “ordinary” statistical calculations.
- Interesting utility requires geometric restrictions.
- Too messy?

27 Blending into a Crowd
Intuition: I am safe in a group of k or more.
Pros:
- Appealing intuition for privacy; seems fundamental.
- Mathematically interesting.
- Meaningful statements are possible!
Cons:
- Does it rule out learning facts about a particular individual?
- All results seem to make strong assumptions on the adversary’s prior distribution. Is this necessary? (Yes…)

28 Overview
- Examples
- Intuitions for privacy (why crypto definitions don’t apply)
- A partial* selection of definitions: two straw men; blending into the crowd; an impossibility result; attribute disclosure and differential privacy
- Conclusions

29 An Impossibility Result
An abstract schema:
- Define a privacy breach.
- For all distributions on databases and all adversaries A, there exists A′ such that Pr[ A(San) = breach ] − Pr[ A′() = breach ] ≤ ε.
Theorem [Dwork-Naor]: for any reasonable notion of “breach”, if San(DB) contains information about DB, then some adversary breaks this definition.
Example:
- The adversary knows Alice is 2 inches shorter than the average Lithuanian, but how tall are Lithuanians?
- With the sanitized database, the probability of guessing Alice’s height goes up.
- Theorem: this is unavoidable.

30 Proof Sketch
Suppose:
- When DB is uniform, the mutual information I(DB; San(DB)) > 0 (i.e. San reveals something about DB).
- A “breach” is predicting a predicate g(DB).
Pick a hash function h: {databases} → {0,1}^{H(DB | San)}, and let the adversary’s prior be uniform conditioned on h(DB) = z.
Then:
- h(DB) = z alone gives no information about g(DB).
- San(DB) and h(DB) = z together determine DB.
[DN] vastly generalize this.

31 Preventing Attribute Disclosure
[Diagram: DB = (x_1, …, x_n) → San (random coins) ↔ adversary A via queries/answers]
A large class of definitions: safe if the adversary can’t learn “too much” about any entry. E.g.:
- Cannot narrow x_i down to a small interval.
- For uniform x_i, the mutual information I(x_i; San(DB)) is at most some small bound.
How can we decide among these definitions?

32 Differential Privacy
Lithuanians example: the adversary learns Alice’s height even if Alice is not in DB.
Intuition [DM]:
- “Whatever is learned would be learned regardless of whether or not Alice participates.”
- Dual: whatever is already known, the situation won’t get worse.
[Diagram: DB = (x_1, …, x_n) → San (random coins) ↔ adversary A via queries/answers]

33 Differential Privacy
[Diagram: DB with row i replaced by 0 → San (random coins) ↔ adversary A via queries/answers]
Define n+1 games:
- “Game 0”: the adversary interacts with San(DB).
- For each i, let DB_{-i} = (x_1, …, x_{i-1}, 0, x_{i+1}, …, x_n).
- “Game i”: the adversary interacts with San(DB_{-i}).
Bayesian adversary: given a transcript S and a prior distribution p(·) on DB, this defines n+1 posterior distributions p_0(·|S), …, p_n(·|S).

34 Differential Privacy
[Same diagram]
Definition: San is safe if for all prior distributions p(·) on DB, all transcripts S, and all i = 1, …, n:
  StatDiff( p_0(·|S), p_i(·|S) ) ≤ ε
Note that the prior distribution may be far from both posteriors.
How can we satisfy this?

35 Approach: Indistinguishability [DiNi, EGS, BDMN]
[Diagram: DB = (x_1, x_2, x_3, …, x_n) → San (random coins) → transcript S; DB′ = (x_1, x_2′, x_3, …, x_n) → San (random coins) → transcript S′. DB and DB′ differ in one row.]
Requirement: the distributions of the two transcripts are at “distance” at most ε.
The choice of distance measure is important.

36 Approach: Indistinguishability [DiNi, EGS, BDMN]
(Same picture as the previous slide: the transcripts of San on two databases differing in one row must be at “distance” at most ε; the choice of distance measure is important.)

37 Approach: Indistinguishability [DiNi, EGS, BDMN]
Problem (if “distance” means statistical difference): ε must be large.
- By a hybrid argument, any two databases induce transcripts at distance ≤ nε.
- So to get utility, ε > 1/n; but a statistical difference of 1/n is not meaningful.
Example: release a random point in the database.
- San(x_1, …, x_n) = (j, x_j) for a random j.
- For every i, changing x_i induces statistical difference 1/n.
- But some x_i is revealed with probability 1.

38 Formalizing Indistinguishability
Definition: San is ε-indistinguishable if for all adversaries A, for all DB, DB′ which differ in one row, and for all sets of transcripts E:
  p( San(DB) ∈ E ) ∈ e^{±ε} · p( San(DB′) ∈ E )   (where e^{±ε} ≈ 1 ± ε)
Equivalently, for all transcripts S:
  p( San(DB) = S ) / p( San(DB′) = S ) ∈ 1 ± ε
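
As a numeric sanity check of this definition, here is a sketch for the simplest mechanism in this talk, randomized response on a single bit with flip probability p: for databases differing in that bit, the transcript probabilities differ by at most a factor (1 − p)/p, so the mechanism is ln((1 − p)/p)-indistinguishable. The code is illustrative, not from the slides.

```python
import math

def rr_output_dist(bit, p):
    """Transcript distribution of randomized response on a single bit with flip probability p."""
    return {0: (1 - p) if bit == 0 else p,
            1: p if bit == 0 else (1 - p)}

def worst_case_ratio(p):
    """max over transcripts S of p(San(DB) = S) / p(San(DB') = S) for DBs differing in that one bit."""
    d0, d1 = rr_output_dist(0, p), rr_output_dist(1, p)
    return max(d0[s] / d1[s] for s in (0, 1))

p = 0.25
ratio = worst_case_ratio(p)
print(ratio, math.log(ratio))   # ratio = (1 - p)/p = 3, so this mechanism is ln(3)-indistinguishable
```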

39 Indistinguishability ⇒ Differential Privacy
Recall the definition: San is safe if for all prior distributions p(·) on DB, all transcripts S, and all i = 1, …, n, StatDiff( p_0(·|S), p_i(·|S) ) ≤ ε.
We can use indistinguishability:
- For every S and DB, ε-indistinguishability forces the posteriors p_0(·|S) and p_i(·|S) to assign each database probabilities within a multiplicative factor of roughly e^{±ε} of each other.
- This implies StatDiff( p_0(·|S), p_i(·|S) ) ≤ ε.

40 Why does this help?
With relatively little noise, one can release:
- Averages
- Histograms
- Matrix decompositions
- Certain types of clustering
- …
See Kobbi’s talk.
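
The slides do not spell out a mechanism here, but a standard way to release averages “with relatively little noise” (a sketch under my own assumptions, not necessarily the construction the speaker has in mind) is output perturbation with Laplace noise scaled to the query’s sensitivity: for values in [0,1], changing one of n rows shifts the average by at most 1/n, so noise of magnitude about 1/(nε) suffices.

```python
import random

def noisy_average(values, eps, lo=0.0, hi=1.0):
    """Output perturbation: true average plus Laplace noise with scale = sensitivity / eps.
    For values clamped to [lo, hi], changing one of the n rows moves the average by at most (hi - lo)/n."""
    n = len(values)
    clamped = [min(max(v, lo), hi) for v in values]
    scale = (hi - lo) / n / eps
    # Difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return sum(clamped) / n + noise

data = [random.random() for _ in range(10_000)]
print(noisy_average(data, eps=0.1))   # noise magnitude is on the order of 1/(n * eps) = 0.001
```

The noise shrinks as the database grows, which is why averages and histograms remain useful under this kind of perturbation.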

41 Preventing Attribute Disclosure
Various ways to capture “no particular value should be revealed.”
Differential criterion: “Whatever is learned would be learned regardless of whether or not person i participates.”
- Satisfied by indistinguishability. Also implies protection from re-identification?
Two interpretations:
- A given release won’t make privacy worse.
- A rational respondent will answer if there is some gain.
Can we preserve enough utility?

42 Overview
- Examples
- Intuitions for privacy (why crypto definitions don’t apply)
- A partial* selection of definitions: two straw men; blending into the crowd; an impossibility result; attribute disclosure and differential privacy
*“partial” = “incomplete” and “biased”

43 Things I Didn’t Talk About
- Economic perspective [KPR]: utility of providing data = value − cost; may depend on whether others participate; when is it worth my while?
- Specific methods for re-identification.
- Various other frameworks (e.g. “L-diversity”).
- Other pieces of the big “data privacy” picture: access control; implementing a trusted collection center.

44 Conclusions
- Pinning down a social notion in a particular context.
- A biased survey of approaches to definitions: a taste of techniques along the way; didn’t talk about utility.
- The question has a different flavor from the usual crypto problems and from statisticians’ traditional conception.
- Meaningful statements are possible! Are they practical? Do they cover everything? No.

45 Conclusions
How close are we to converging on a definition?
- Compare, e.g., secure function evaluation, encryption, Turing machines, …
- But we’re after a social concept. Is there a silver bullet?
What are the big challenges?
- We need “cryptanalysis” of these systems (Adi…?)

