
1 Pinning Down “Privacy”: Defining Privacy in Statistical Databases
Adam Smith, Weizmann Institute of Science
http://theory.csail.mit.edu/~asmith

2 Database Privacy
[Diagram: respondents (You, Bob, Alice) → collection and “sanitization” → users (government, researchers, marketers, …)]
The “census problem”: two conflicting goals.
- Utility: users can extract “global” statistics.
- Privacy: individual information stays hidden.
How can these be formalized?

3 Database Privacy
[Same diagram: You, Bob, Alice → collection and “sanitization” → users (government, researchers, marketers, …)]
Why privacy?
- Ethical and legal obligation.
- Honest answers require respondents’ trust.

4 Trust is important

5 Database Privacy
[Same diagram, highlighting the trusted collection agency]
- Trusted collection agency.
- Published statistics may be tables, graphs, microdata, etc.
- May have noise or other distortions.
- May be interactive.

6 Database Privacy
[Same diagram: collection and “sanitization”, then release to users]
Variations on this model are studied in statistics, data mining, theoretical CS, and cryptography.
Different traditions for what “privacy” means.

7 How can we formalize “privacy”?
Different people mean different things. Pin it down mathematically?

8 “I ask them to take a poem and hold it up to the light like a color slide or press an ear against its hive. […] But all they want to do is tie the poem to a chair with rope and torture a confession out of it. They begin beating it with a hose to find out what it really means.” - Billy Collins, “Introduction to Poetry”
Can we approach privacy scientifically?
- Pin down a social concept.
- No perfect definition? But lots of room for rigor.
- Too late? (See Adi’s talk.)

9 How can we formalize “privacy”?
Different people mean different things. Pin it down mathematically?
Goal #1: Rigor
- Prove clear theorems about privacy (few exist in the literature).
- Make clear (and refutable) conjectures.
- Sleep better at night.
Goal #2: Interesting science
- (New) computational phenomenon.
- Algorithmic problems.
- Statistical problems.

10 Overview
- Examples
- Intuitions for privacy (why crypto definitions don’t apply)
- A partial* selection of definitions
- Conclusions
*“partial” = “incomplete” and “biased”

11 Basic Setting
Database DB = table of n rows x_1, …, x_n, each in a domain D.
- D can be numbers, categories, tax forms, etc.
- This talk: D = {0,1}^d, e.g. Married?, Employed?, Over 18?, …
[Diagram: DB = (x_1, x_2, x_3, …, x_{n-1}, x_n) → San (using random coins) ↔ users (government, researchers, marketers, …) via query 1 / answer 1, …, query T / answer T]

12 Examples of sanitization methods
- Input perturbation: change the data before processing. E.g. randomized response: flip each bit of the table with probability p.
- Summary statistics: means, variances, marginal totals (# people with blue eyes and brown hair), regression coefficients.
- Output perturbation: summary statistics with noise.
- Interactive versions of the above: an auditor decides which queries are OK and what type of noise to add.
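
As a concrete illustration of the randomized-response idea above, here is a minimal Python sketch; the function names, the toy table, and the unbiasing step are my own illustration, not from the slides.

```python
import random

def randomized_response(db, p):
    """Input perturbation: flip each bit of each row independently with probability p."""
    return [[bit ^ (random.random() < p) for bit in row] for row in db]

def estimate_column_mean(noisy_db, col, p):
    """Unbias the observed frequency: E[observed] = (1 - p)*true + p*(1 - true)."""
    observed = sum(row[col] for row in noisy_db) / len(noisy_db)
    return (observed - p) / (1 - 2 * p)   # requires p != 1/2

# Toy table: 1000 rows, 3 binary attributes (Married?, Employed?, Over 18?)
db = [[random.randint(0, 1) for _ in range(3)] for _ in range(1000)]
noisy = randomized_response(db, p=0.25)
print(estimate_column_mean(noisy, col=0, p=0.25))
```

Aggregate statistics remain estimable after unbiasing, while any single released bit is noisy.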

13 Two Intuitions for Privacy
“If the release of statistics S makes it possible to determine the value [of private information] more accurately than is possible without access to S, a disclosure has taken place.” [Dalenius]
- Learning more about me should be hard.
Privacy is “protection from being brought to the attention of others.” [Gavison]
- Safety is blending into a crowd.
(Remove Gavison def?)

14 Why not use crypto definitions?
Attempt #1:
- Def’n: for every entry i, no information about x_i is leaked (as if encrypted).
- Problem: then no information at all is revealed! There is a tradeoff between privacy and utility.
Attempt #2:
- Agree on summary statistics f(DB) that are safe.
- Def’n: no information about DB except f(DB).
- Problem: how do you decide that f is safe? Tautology trap. (Also: how do you figure out what f is? --Yosi)

15 Overview
- Examples
- Intuitions for privacy (why crypto definitions don’t apply)
- A partial* selection of definitions: two straw men; blending into the crowd; an impossibility result; attribute disclosure and differential privacy
- Conclusions
*“partial” = “incomplete” and “biased”
Criteria for a definition:
- Understandable.
- Clear about the adversary’s goals and prior knowledge / side information.
- “I am a co-author…”

16 Straw man #1: Exact Disclosure
[Diagram: DB = (x_1, …, x_n) → San (random coins) ↔ adversary A via query 1 / answer 1, …, query T / answer T]
Def’n: safe if the adversary cannot learn any entry exactly.
- Leads to nice (but hard) combinatorial problems.
- Does not preclude learning a value with 99% certainty, or narrowing it down to a small interval.
Historically:
- Focus: auditing interactive queries.
- Difficulty: understanding relationships between queries, e.g. two queries with small difference (see the sketch below).
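
To make the “two queries with small difference” difficulty concrete, here is a hypothetical differencing attack: if exact sums are answered, subtracting two overlapping queries isolates a single entry. The names and values are invented for the example.

```python
# Two exact sum queries whose supports differ in exactly one person
# reveal that person's value, even though neither query looks sensitive alone.
db = {"alice": 1, "bob": 0, "carol": 1, "dave": 1}   # one sensitive bit per person (made up)

def sum_query(db, names):
    return sum(db[name] for name in names)

q1 = sum_query(db, ["alice", "bob", "carol", "dave"])   # everyone
q2 = sum_query(db, ["alice", "bob", "carol"])           # everyone except dave
print("dave's bit =", q1 - q2)                          # exact disclosure of one entry
```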

17 Straw man #2: Learning the Distribution
Assume x_1, …, x_n are drawn i.i.d. from an unknown distribution.
Def’n: San is safe if it only reveals the distribution.
Implied approach:
- Learn the distribution.
- Release a description of the distribution, or re-sample points from it.
Problem: tautology trap. The estimate of the distribution depends on the data… so why is it safe?

18 Blending into a Crowd
Intuition: I am safe in a group of k or more.
- k varies (3… 6… 100… 10,000?)
Many variations on the theme:
- The adversary wants a predicate g such that 0 < #{i : g(x_i) = true} < k.
- Such a g is called a breach of privacy.
Why?
- Fundamental: R. Gavison: “protection from being brought to the attention of others.”
- A rare property helps re-identify someone.
- Implicit: information about a large group is public (e.g. liver problems are more prevalent among diabetics).

19 Blending into a Crowd (continued)
(Same intuition and breach definition as the previous slide.)
How can we capture this?
- Syntactic definitions
- Bayesian adversary
- “Crypto-flavored” definitions
Two variants: frequency in DB vs. frequency in the underlying population.
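
One operational reading of the breach condition 0 < #{i : g(x_i) = true} < k, as a small sketch; the example rows and predicate are invented.

```python
def is_breach(db, g, k):
    """g is a breach of privacy if 0 < #{i : g(x_i) = true} < k, i.e. it isolates a group smaller than k."""
    count = sum(1 for x in db if g(x))
    return 0 < count < k

# Rows as attribute dicts; the predicate below is an invented example.
db = [{"age": 34, "zip": "76100", "diabetic": True},
      {"age": 35, "zip": "76100", "diabetic": False},
      {"age": 34, "zip": "76100", "diabetic": True}]
print(is_breach(db, lambda x: x["age"] == 35 and x["zip"] == "76100", k=3))   # True: a group of size 1
```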

20 “Syntactic” Definitions
Given a sanitization S, look at the set of all databases consistent with S.
Def’n: safe if no predicate is a breach for all consistent databases.
k-anonymity [L. Sweeney]: the sanitization is a histogram of the data.
- Partition D into bins B_1 ∪ B_2 ∪ … ∪ B_t.
- Output the cardinalities f_j = #(DB ∩ B_j).
- Safe if for all j, either f_j ≥ k or f_j = 0.
Cell bound methods [statistics, 1990s]: the sanitization consists of marginal sums.
- Let f_z = #{i : x_i = z}. Then San(DB) = various sums of the f_z.
- Safe if for all z, either there exists a consistent DB with f_z ≥ k, or f_z = 0 in all consistent DBs.
- Large literature using algebraic and combinatorial techniques.
Example (hair color rows, eye color columns):

  Actual counts    brown   blue   total
  blond              2      10     12
  brown             12       6     18
  total             14      16

  Cell bounds from the marginals:
                   brown    blue    total
  blond            [0,12]   [0,12]   12
  brown            [0,14]   [0,16]   18
  total             14       16

21 “Syntactic” Definitions (continued)
(Same definitions and example tables as the previous slide.)
Issues:
- If k is small: “all three Canadians at Weizmann sing in a choir.”
- Semantics? Probability is not considered. What if I have side information?
- The algorithm for making decisions is not considered. What adversary does this apply to?
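
A sketch of the k-anonymity condition on a released histogram, as defined on the previous slide (every published bin count f_j must be 0 or at least k); the binning and example data are invented.

```python
from collections import Counter

def histogram(db, bin_of):
    """Released cardinalities f_j = #(DB intersect B_j), for a user-supplied binning function."""
    return Counter(bin_of(x) for x in db)

def is_k_anonymous(freqs, k):
    """The slide's condition: every published bin count is either 0 or at least k."""
    return all(f == 0 or f >= k for f in freqs.values())

# Invented example: bins are exact (hair colour, eye colour) combinations.
db = [("blond", "brown"), ("blond", "blue"), ("brown", "brown"),
      ("brown", "blue"), ("brown", "blue")]
freqs = histogram(db, bin_of=lambda x: x)
print(dict(freqs), is_k_anonymous(freqs, k=2))   # several bins have count 1, so not 2-anonymous
```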

22 Security for “Bayesian” Adversaries
Goal: the adversary outputs a point z ∈ D.
- Score = 1/f_z if f_z > 0, and 0 otherwise (where f_z = #{i : x_i = z}).
Def’n: the sanitization is safe if E(score) is below some threshold.
Procedure (assume you know the adversary’s prior distribution over databases). Given a candidate output (e.g. a set of marginal sums):
- Update the prior conditioned on the output (via Bayes’ rule).
- If max_z E(score | output) is below the threshold, then release.
- Else consider a new set of marginal sums.
Extensive literature on computing this expected value (see Yosi’s talk).
Issues:
- Restricts the type of predicates the adversary can choose.
- Must know the prior distribution. Can one scheme work for many distributions? The sanitizer works harder than the adversary.
- Conditional probabilities don’t consider previous iterations: “simulatability” [KMN’05]. Can this be fixed (with efficient computations)?
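
A toy sketch of the sanitizer’s procedure above, with a tiny made-up prior over candidate databases: condition the prior on the candidate output via Bayes’ rule and release only if the maximum expected score stays below the chosen threshold. Everything below is illustrative, not from the slides.

```python
from collections import Counter

def posterior(prior, candidate_output, summary):
    """Bayes update: condition the prior on the candidate output being released."""
    weights = {db: p for db, p in prior.items() if summary(db) == candidate_output}
    total = sum(weights.values())
    return {db: p / total for db, p in weights.items()} if total else {}

def max_expected_score(post, domain):
    """max over target points z of E[score], where score = 1/f_z if f_z > 0 and 0 otherwise."""
    def expected_score(z):
        return sum(p * (1 / Counter(db)[z] if Counter(db)[z] > 0 else 0)
                   for db, p in post.items())
    return max(expected_score(z) for z in domain)

# Toy prior: databases are pairs of bits, the candidate release is their sum.
prior = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
post = posterior(prior, candidate_output=1, summary=sum)
print(max_expected_score(post, domain=[0, 1]))   # release only if this stays below the chosen threshold
```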

23 Crypto-flavored Approach [CDMSW, CDMT, NS]
“If the release of statistics S makes it possible to determine the value [of private information] more accurately than is possible without access to S, a disclosure has taken place.” [Dalenius]

24 Crypto-flavored Approach [CDMSW, CDMT, NS]
[CDMSW]: compare to a “simulator”: for all distributions on databases DB, for all adversaries A, there exists A′ such that for all subsets J ⊆ DB:
  Pr_{DB,S}[ A(S) = breach in J ] − Pr_{DB}[ A′() = breach in J ] ≤ ε
- The definition says nothing if the adversary knows x_1, so require that it hold for all subsets of DB.
- No non-trivial examples satisfy this definition, so restrict the family of distributions to some class C (“DB ∈ C”), and try to make C as large as possible. Sufficient: i.i.d. from a “smooth” distribution.

25 Crypto-flavored Approach [CDMSW, CDMT, NS]
[CDMSW] definition, now restricted to the class C: for all distributions on databases DB ∈ C, for all adversaries A, there exists A′ such that for all subsets J ⊆ DB, Pr_{DB,S}[ A(S) = breach in J ] − Pr_{DB}[ A′() = breach in J ] ≤ ε.
[CDMSW, CDMT] Geometric data:
- Assume x_i ∈ R^d.
- Relax the definition to ball predicates: g_{z,r} = {x : ||x − z|| ≤ r} and g′_{z,r} = {x : ||x − z|| ≤ C·r}; a breach occurs if #(DB ∩ g_{z,r}) > 0 and #(DB ∩ g′_{z,r}) < k.
- Several types of histograms can be released.
- Sufficient for “metric” problems: clustering, minimum spanning tree, …
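
A sketch of the relaxed ball-predicate breach from the slide: a breach occurs when the radius-r ball around some point z contains at least one row while the blown-up ball of radius C·r still contains fewer than k rows. The points and parameters below are arbitrary.

```python
import math

def ball_count(db, z, r):
    """#(DB intersect {x : ||x - z|| <= r}) for points in R^d."""
    return sum(1 for x in db if math.dist(x, z) <= r)

def is_ball_breach(db, z, r, C, k):
    """Breach: the radius-r ball around z is nonempty, yet the blown-up C*r ball holds fewer than k points."""
    return ball_count(db, z, r) > 0 and ball_count(db, z, C * r) < k

# Arbitrary points in R^2; the last one is isolated, so it can be singled out.
db = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
print(is_ball_breach(db, z=(5.0, 5.0), r=0.5, C=3, k=3))   # True
```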

26 Crypto-flavored Approach [CDMSW, CDMT, NS]
(Same simulator-style definition as the previous slides.)
[NS] No geometric restrictions:
- A lot of noise; the data are almost erased!
- Strong privacy statement, but very weak utility.
[CDMSW, CDMT, NS]: proven statements!
Issues:
- Works for a large class of prior distributions and side information, but not for all.
- Not clear if it helps with “ordinary” statistical calculations.
- Interesting utility requires geometric restrictions.
- Too messy?

27 Blending into a Crowd
Intuition: I am safe in a group of k or more.
Pros:
- Appealing intuition for privacy; seems fundamental.
- Mathematically interesting.
- Meaningful statements are possible!
Cons:
- Does it rule out learning facts about a particular individual?
- All results seem to make strong assumptions on the adversary’s prior distribution. Is this necessary? (Yes…)

28 Overview
- Examples
- Intuitions for privacy (why crypto definitions don’t apply)
- A partial* selection of definitions: two straw men; blending into the crowd; an impossibility result; attribute disclosure and differential privacy
- Conclusions

29 An Impossibility Result
An abstract schema:
- Define a privacy breach.
- For all distributions on databases and all adversaries A, there exists A′ such that Pr[ A(San) = breach ] − Pr[ A′() = breach ] ≤ ε.
Theorem [Dwork-Naor]: for any reasonable notion of “breach”, if San(DB) contains information about DB, then some adversary breaks this definition.
Example:
- The adversary knows Alice is 2 inches shorter than the average Lithuanian, but how tall are Lithuanians?
- With the sanitized database, the probability of guessing Alice’s height goes up.
- Theorem: this is unavoidable.

30 Proof Sketch
Suppose:
- When DB is uniform, the mutual information I(DB; San(DB)) > 0 (i.e. San reveals something about DB).
- A “breach” is predicting a predicate g(DB).
Pick a hash function h: {databases} → {0,1}^{H(DB | San)}, and let the adversary’s prior be uniform conditioned on h(DB) = z.
Then:
- h(DB) = z alone gives no information about g(DB).
- San(DB) and h(DB) = z together determine DB.
[DN] vastly generalize this.

31 Preventing Attribute Disclosure
[Diagram: DB = (x_1, …, x_n) → San (random coins) ↔ adversary A via queries/answers]
A large class of definitions: safe if the adversary can’t learn “too much” about any entry. E.g.:
- Cannot narrow x_i down to a small interval.
- For uniform x_i, the mutual information I(x_i; San(DB)) is at most some small bound.
How can we decide among these definitions?

32 Differential Privacy
Lithuanians example: the adversary learns Alice’s height even if Alice is not in DB.
Intuition [DM]:
- “Whatever is learned would be learned regardless of whether or not Alice participates.”
- Dual: whatever is already known, the situation won’t get worse.
[Diagram: DB = (x_1, …, x_n) → San (random coins) ↔ adversary A via queries/answers]

33 Differential Privacy
[Diagram: DB with row i replaced by 0 → San (random coins) ↔ adversary A via queries/answers]
Define n+1 games:
- “Game 0”: the adversary interacts with San(DB).
- For each i, let DB_{-i} = (x_1, …, x_{i-1}, 0, x_{i+1}, …, x_n).
- “Game i”: the adversary interacts with San(DB_{-i}).
Bayesian adversary: given a transcript S and a prior distribution p(·) on DB, this defines n+1 posterior distributions p_0(·|S), …, p_n(·|S).

34 Differential Privacy
[Same diagram]
Definition: San is safe if for all prior distributions p(·) on DB, all transcripts S, and all i = 1, …, n:
  StatDiff( p_0(·|S), p_i(·|S) ) ≤ ε
Note that the prior distribution may be far from both posteriors.
How can we satisfy this?

35 Approach: Indistinguishability [DiNi, EGS, BDMN]
[Diagram: DB = (x_1, x_2, x_3, …, x_n) → San (random coins) → transcript S; DB′ = (x_1, x_2′, x_3, …, x_n) → San (random coins) → transcript S′. DB and DB′ differ in one row.]
Requirement: the distributions of the two transcripts are at “distance” at most ε.
The choice of distance measure is important.

36 Approach: Indistinguishability [DiNi, EGS, BDMN]
(Same picture as the previous slide: the transcripts of San on two databases differing in one row must be at “distance” at most ε; the choice of distance measure is important.)

37 Approach: Indistinguishability [DiNi, EGS, BDMN]
Problem (if “distance” means statistical difference): ε must be large.
- By a hybrid argument, any two databases induce transcripts at distance ≤ nε.
- So to get utility, ε > 1/n; but a statistical difference of 1/n is not meaningful.
Example: release a random point in the database.
- San(x_1, …, x_n) = (j, x_j) for a random j.
- For every i, changing x_i induces statistical difference 1/n.
- But some x_i is revealed with probability 1.

38 Formalizing Indistinguishability
Definition: San is ε-indistinguishable if for all adversaries A, for all DB, DB′ which differ in one row, and for all sets of transcripts E:
  p( San(DB) ∈ E ) ∈ e^{±ε} · p( San(DB′) ∈ E )   (where e^{±ε} ≈ 1 ± ε)
Equivalently, for all transcripts S:
  p( San(DB) = S ) / p( San(DB′) = S ) ∈ 1 ± ε
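
As a numeric sanity check of this definition, here is a sketch for the simplest mechanism in this talk, randomized response on a single bit with flip probability p: for databases differing in that bit, the transcript probabilities differ by at most a factor (1 − p)/p, so the mechanism is ln((1 − p)/p)-indistinguishable. The code is illustrative, not from the slides.

```python
import math

def rr_output_dist(bit, p):
    """Transcript distribution of randomized response on a single bit with flip probability p."""
    return {0: (1 - p) if bit == 0 else p,
            1: p if bit == 0 else (1 - p)}

def worst_case_ratio(p):
    """max over transcripts S of p(San(DB) = S) / p(San(DB') = S) for DBs differing in that one bit."""
    d0, d1 = rr_output_dist(0, p), rr_output_dist(1, p)
    return max(d0[s] / d1[s] for s in (0, 1))

p = 0.25
ratio = worst_case_ratio(p)
print(ratio, math.log(ratio))   # ratio = (1 - p)/p = 3, so this mechanism is ln(3)-indistinguishable
```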

39 Indistinguishability ⇒ Differential Privacy
Recall the definition: San is safe if for all prior distributions p(·) on DB, all transcripts S, and all i = 1, …, n, StatDiff( p_0(·|S), p_i(·|S) ) ≤ ε.
We can use indistinguishability:
- For every S and DB, ε-indistinguishability forces the posteriors p_0(·|S) and p_i(·|S) to assign each database probabilities within a multiplicative factor of roughly e^{±ε} of each other.
- This implies StatDiff( p_0(·|S), p_i(·|S) ) ≤ ε.

40 Why does this help?
With relatively little noise, one can release:
- Averages
- Histograms
- Matrix decompositions
- Certain types of clustering
- …
See Kobbi’s talk.
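
The slides do not spell out a mechanism here, but a standard way to release averages “with relatively little noise” (a sketch under my own assumptions, not necessarily the construction the speaker has in mind) is output perturbation with Laplace noise scaled to the query’s sensitivity: for values in [0,1], changing one of n rows shifts the average by at most 1/n, so noise of magnitude about 1/(nε) suffices.

```python
import random

def noisy_average(values, eps, lo=0.0, hi=1.0):
    """Output perturbation: true average plus Laplace noise with scale = sensitivity / eps.
    For values clamped to [lo, hi], changing one of the n rows moves the average by at most (hi - lo)/n."""
    n = len(values)
    clamped = [min(max(v, lo), hi) for v in values]
    scale = (hi - lo) / n / eps
    # Difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return sum(clamped) / n + noise

data = [random.random() for _ in range(10_000)]
print(noisy_average(data, eps=0.1))   # noise magnitude is on the order of 1/(n * eps) = 0.001
```

The noise shrinks as the database grows, which is why averages and histograms remain useful under this kind of perturbation.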

41 Preventing Attribute Disclosure
Various ways to capture “no particular value should be revealed.”
Differential criterion: “Whatever is learned would be learned regardless of whether or not person i participates.”
- Satisfied by indistinguishability. Also implies protection from re-identification?
Two interpretations:
- A given release won’t make privacy worse.
- A rational respondent will answer if there is some gain.
Can we preserve enough utility?

42 Overview
- Examples
- Intuitions for privacy (why crypto definitions don’t apply)
- A partial* selection of definitions: two straw men; blending into the crowd; an impossibility result; attribute disclosure and differential privacy
*“partial” = “incomplete” and “biased”

43 Things I Didn’t Talk About
- Economic perspective [KPR]: utility of providing data = value − cost; may depend on whether others participate; when is it worth my while?
- Specific methods for re-identification.
- Various other frameworks (e.g. “L-diversity”).
- Other pieces of the big “data privacy” picture: access control; implementing a trusted collection center.

44 Conclusions
- Pinning down a social notion in a particular context.
- A biased survey of approaches to definitions: a taste of techniques along the way; didn’t talk about utility.
- The question has a different flavor from the usual crypto problems and from statisticians’ traditional conception.
- Meaningful statements are possible! Are they practical? Do they cover everything? No.

45 Conclusions
How close are we to converging on a definition?
- Compare, e.g., secure function evaluation, encryption, Turing machines, …
- But we’re after a social concept. Is there a silver bullet?
What are the big challenges?
- We need “cryptanalysis” of these systems (Adi…?)

