
Differential Privacy: What, Why and When


1 Differential Privacy: What, Why and When
Moni Naor, Weizmann Institute of Science. The Brussels Privacy Symposium, November 8th, 2016

2 What is Differential Privacy
Differential Privacy is a concept:
Motivation
Rigorous mathematical definition
Properties
A measurable quantity
A set of algorithmic techniques for achieving it
First defined in: Dwork, McSherry, Nissim, and Smith, Calibrating Noise to Sensitivity in Private Data Analysis, Third Theory of Cryptography Conference (TCC 2006).
Earlier roots: Warner, Randomized Response, 1965.

3 Why Differential Privacy?
DP: a strong, quantifiable, composable mathematical privacy guarantee.
Provably resilient to known and unknown attack modes!
In theory, DP enables many computations with personal data while preserving personal privacy.
Practicality is in the first stages of validation: no snake oil, but not a panacea.

4 Lots of Data
In recent years, a lot of data has become available to companies and government agencies:
Census data
Huge databases collected by companies (a data deluge)
Public surveillance information: CCTV, RFIDs
Social networks
This data contains personal and confidential information.

5 Social benefits from analyzing large collections of data
John Snow’s map of cholera cases in the London epidemic of 1854, centered on the water pump on Broad Street.

6 More Utility Word Completion

7 Social benefits from analyzing large collections of data
What about privacy? There is a tension between better privacy and better data: almost any use of the data that is not carefully crafted will leak something about it.

8 Glorious Failures of Traditional Approaches to Data Privacy
Re-identification [Sweeney ’00, …]
Auditors [Kenthapadi, Mishra, Nissim ’05]
Genome-wide association studies (GWAS) [Homer et al. ’08]
Netflix Prize [Narayanan, Shmatikov ’08]
Social networks [Backstrom, Dwork, Kleinberg ’11]
Attacks on statistical aggregates [Dwork, Smith, Steinke, Vadhan ’15]

9

10 The Netflix Prize
Netflix recommends movies to its subscribers and sought an improved recommendation system, offering $1,000,000 for a “10% improvement”. It published training data. The prize was won in September 2009 by the “BellKor's Pragmatic Chaos” team. A very influential competition in machine learning.

11 From the Netflix Prize Rules Page…
“The training data set consists of more than 100 million ratings from over 480 thousand randomly-chosen, anonymous customers on nearly 18 thousand movie titles.” “The ratings are on a scale from 1 to 5 (integral) stars. To protect customer privacy, all personal information identifying individual customers has been removed and all customer ids have been replaced by randomly-assigned ids. The date of each rating and the title and year of release for each movie are provided.”

12 Netflix Data Release [Narayanan-Shmatikov 2008]
[Diagram: a users × items matrix (User 1 … User N by Item 1 … Item M) of transactions, purchases, preferences, and behavior.] Ratings for a subset of movies and users; usernames replaced with random IDs; some additional perturbation. Credit: Arvind Narayanan via Adam Smith

13 A Source of Auxiliary Information
Internet Movie Database (IMDb): individuals may register for an account and rate movies. They need not be anonymous, but probably want to create some web presence. Visible material includes ratings, dates, and comments.

14 Use Public Reviews from IMDb.com
[Diagram: linking the anonymized Netflix data with public, incomplete IMDb data (users Alice, Bob, Charlie, Danielle, Erica, Frank) yields identified Netflix data.] Credit: Arvind Narayanan via Adam Smith

15 De-anonymizing the Netflix Dataset
Results: “With 8 movie ratings (of which 2 may be completely wrong) and dates that may have a 3-day error, 96% of Netflix subscribers whose records have been released can be uniquely identified in the dataset.” “For 89%, 2 ratings and dates are enough to reduce the set of plausible records to 8 out of almost 500,000, which can then be inspected by a human for further deanonymization.”
Consequences? Learn about movies that IMDb users didn’t want to tell the world about: sexual orientation, religious beliefs. Subject of lawsuits; settled March 2010 (Video Privacy Protection Act, 1988). Credit: Arvind Narayanan via Adam Smith

16 Do People Care About Privacy?
Technology changes; attitudes differ; attitudes change over time.

17 Do people care about privacy?
Idea: Arvind Narayanan

18 Do people care about privacy?
What about proctologists?

19 Voting Confidentiality
George Caleb Bingham, “The County Election”.

20 Voting Confidentiality
Tallying the vote; the judge swears in the voter.

21 Privacy of Public Data Analysis
The holy grail: get the utility of statistical analysis while protecting the privacy of every individual participant.
Ideally, privacy-preserving sanitization should allow reasonably accurate answers to meaningful questions.
Is it possible to phrase the goal in a meaningful and achievable manner?

22 Setting
[Diagram: Database → Curator/Sanitizer → Released data.] Global vs. local models.

23 Setting: Interactive case
[Diagram: an analyst issues query 1, query 2, … to the Curator/Sanitizer holding the data and receives answers.] Give guidelines/tools to the curator. Multiple queries, chosen adaptively.

24 “Pure” Privacy Problem
Difficult even if the curator is an angel and the data are in a vault. Nevertheless: a tight connection to problems in cryptography. You can run but you can’t hide. Credit: Cynthia Dwork

25 Databases that Teach
The database teaches that smoking causes cancer. Smoker S’s insurance premiums rise. This is true even if S is not in the database! Learning that smoking causes cancer is the whole point. Smoker S enrolls in a smoking cessation program…
Differential privacy: limit the harm to the teachings, not to participation. The outcome of any analysis is essentially equally likely, independent of whether any individual joins, or refrains from joining, the dataset.

26 The Statistics Masquerade
Differencing attack: suppose Kobbi is unique in the community in speaking ≥ 5 languages.
Query 1: How many people in the community have the sickle cell trait?
Query 2: How many in the community speak ≤ 4 languages and have the sickle cell trait?
The difference of the two answers reveals whether Kobbi has the trait (see the sketch below).
“Blatant non-privacy”: Dinur and Nissim 2003, et sequelae. Overly accurate answers to too many questions destroy privacy.
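A minimal sketch of this differencing attack on made-up records (the names, languages, and trait values are hypothetical):

```python
# Two exact aggregate queries reveal one person's sensitive bit.

records = [
    # (name, languages_spoken, sickle_cell_trait) -- hypothetical records
    ("Kobbi", 6, True),
    ("Ann", 2, False),
    ("Bob", 3, True),
    ("Carla", 1, False),
]

def count(predicate):
    """Exact counting query: how many records satisfy the predicate?"""
    return sum(1 for r in records if predicate(r))

q1 = count(lambda r: r[2])                    # everyone with the trait
q2 = count(lambda r: r[1] <= 4 and r[2])      # trait AND speaks <= 4 languages

# Kobbi is the only person speaking >= 5 languages, so the difference
# of the two exact answers is exactly Kobbi's bit.
print("Kobbi has the trait:", (q1 - q2) == 1)
```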

27 Differential Privacy Dwork, McSherry Nissim & Smith 2006
[Diagram: two neighboring databases b = (b1, b2, …, bn) and b′ = (b1, b2′, …, bn), identical except that one entry is modified, are fed to mechanism M; the output distributions M(b) and M(b′) are at “distance” < ε.] Slide credit: Kobbi Nissim

28 Differential Privacy
A sanitizer M gives ε-differential privacy if, for all adjacent D1 and D2 (differing in one user) and all subsets A ⊆ range(M):
Pr[M(D1) ∈ A] ≤ e^ε · Pr[M(D2) ∈ A]
The probability is taken over the randomness of M; the ratio of the two response probabilities is bounded. Participation in the data set poses no additional risk: the adversary cannot distinguish whether I supplied real or fake input.
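As a concrete illustration (not on the slide) of a mechanism satisfying this definition, Warner's randomized response, the "earlier roots" cited on slide 2, answers a single sensitive yes/no question with ε = ln 3:

```python
import random

def randomized_response(true_answer: bool) -> bool:
    """Warner's randomized response: report the truth with probability 3/4
    and a uniformly random answer with probability 1/4 (two fair coins)."""
    if random.random() < 0.5:        # first coin: heads -> answer truthfully
        return true_answer
    return random.random() < 0.5     # tails -> answer uniformly at random

# For any reported value r, Pr[r | yes] / Pr[r | no] <= (3/4)/(1/4) = 3,
# so the mechanism is ln(3)-differentially private for a single bit.
```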

29 Differential Privacy is a Success
Algorithms in many settings and for many tasks. Important properties:
Group privacy: kε-privacy for a group of size k.
Composability: applying the sanitization several times yields a graceful degradation, proportional to the number of applications (even proportional to the square root of the number of applications).
Robustness to side information: no need to specify exactly what the adversary knows (which is hard to quantify).
Programmable!
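For reference, the standard quantitative forms of these properties (the exact bounds are from the DP literature, not this slide) can be written as:

```latex
% Group privacy: if M is \varepsilon-DP for neighboring databases, then for
% databases D, D' differing in k entries,
\Pr[M(D) \in A] \;\le\; e^{k\varepsilon}\,\Pr[M(D') \in A].

% Basic composition: running k mechanisms, each \varepsilon-DP, is (k\varepsilon)-DP.
% Advanced composition: for any \delta' > 0, the k-fold composition of
% (\varepsilon,\delta)-DP mechanisms is (\varepsilon', k\delta + \delta')-DP with
\varepsilon' \;=\; \sqrt{2k\ln(1/\delta')}\,\varepsilon \;+\; k\varepsilon\,(e^{\varepsilon}-1).
```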

30 Counting Queries
Database x of size n: n individuals, each contributing a single point in the universe U.
A counting query is given by a predicate q: U → {0, 1}; Q is a set of such queries. Query: how many participants in x satisfy q? (Sometimes stated as a fraction of n.)
Relaxed accuracy: answer each query within α additive error w.h.p. Not so bad: some error is anyway inherent in statistical analysis.

31 Laplacian Mechanism for Counting Queries
Given query q: compute the true answer q(x) and output q(x) + Lap(1/ε).
Handle t online queries by adding Lap(1/ε) noise independently to each query; the privacy loss is O(t·ε). Can do better with (ε, δ)-DP: O(√t · ε).
Need a bound on t that is o(n) for δ = 0 and o(n²) otherwise.
Question: can we handle a number of queries >> n?
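A minimal sketch of this mechanism on hypothetical data (the dataset, query, and ε value below are made up for illustration):

```python
import numpy as np

def laplace_counting_query(data, predicate, epsilon):
    """epsilon-DP answer to a counting query.
    A counting query has sensitivity 1 (changing one record changes the
    count by at most 1), so Lap(1/epsilon) noise suffices."""
    true_count = sum(1 for record in data if predicate(record))
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# Hypothetical use: how many participants are over 40?
ages = [23, 45, 31, 67, 52, 29, 41]
print(laplace_counting_query(ages, lambda age: age > 40, epsilon=0.5))
```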

32 Key Insight to increase # of queries: Use Coordinated Noise
Starting from Blum-Ligett-Roth 2008: if noise is added with careful coordination, rather than independently, one can answer hugely many queries. A wave of results showed differential privacy for every set Q of counting queries with error Õ(n^{1/2} · log|Q|), even in the interactive case: the Private Multiplicative Weights algorithm. #queries >> DB size!

33 Maintaining State
[Diagram: each query q is answered against a maintained state, State = Distribution D.]

34 Multiplicative Weights
The PMW algorithm (Hardt & Rothblum 2010). Maintain a distribution D on the universe U; this is the state, and it is completely public!
Initialize D to be uniform on U.
Repeat up to k times:
  Set the threshold T ← T + Lap(σ).
  Repeat while no update occurs:
    Receive query q ∈ Q; let â = q(x) + Lap(σ).
    Test: if |q(D) − â| ≤ T, output q(D).
    Else (update): output â, update D[i] ← D[i] · e^{±(T/4)·q[i]} (the ± chosen according to the sign of the error), and re-normalize.
The algorithm fails if more than k updates occur.
Multiplicative weights is a powerful tool in algorithm design: learn a probability distribution iteratively. In each round, either the current distribution is already good, or we get a lot of information about the distribution and update it.
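The following is a rough, non-calibrated sketch of the PMW loop above; the noise scales, threshold, and update step size are illustrative placeholders, not the privacy-calibrated values from Hardt & Rothblum:

```python
import numpy as np

def pmw(x_hist, queries, noise_scale=0.01, threshold=0.05, max_updates=20):
    """Sketch of the Private Multiplicative Weights idea.
    x_hist: true data as a normalized histogram (numpy array) over universe U.
    queries: list of 0/1 numpy vectors of length |U| (fractional counting queries).
    noise_scale, threshold, and the 0.5 step size are illustrative only."""
    D = np.full(len(x_hist), 1.0 / len(x_hist))    # public state: uniform on U
    updates, answers = 0, []
    for q in queries:
        noisy_ans = q @ x_hist + np.random.laplace(scale=noise_scale)
        est = q @ D                                 # answer from the public state
        noisy_T = threshold + np.random.laplace(scale=noise_scale)
        if abs(est - noisy_ans) <= noisy_T:
            answers.append(float(est))              # lazy round: state is good enough
        else:
            answers.append(float(noisy_ans))        # update round
            sign = 1.0 if noisy_ans > est else -1.0
            D = D * np.exp(sign * 0.5 * q)          # multiplicative-weights step
            D /= D.sum()                            # re-normalize
            updates += 1
            if updates > max_updates:               # too many updates: fail
                raise RuntimeError("update budget exhausted")
    return answers
```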

35 Privacy and Security Connection to Tracing Traitors
Deduplication in Cloud Storage Collaborative Security

36 Issues
How big can epsilon be? 0.01? 0.1? 10? It may add up over a lifetime… Possible answer: design your system so that epsilon stays bounded at all times; this already removes some pernicious attacks (a toy budget-tracking sketch follows below).
What level are we protecting? User level or event level? The gene or the individual?
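A toy sketch of "keeping epsilon bounded" via a privacy-budget accountant (hypothetical API, basic composition only; real systems track tighter accountants):

```python
class PrivacyBudget:
    """Toy accountant: basic composition simply sums the epsilon of every
    released answer and refuses to exceed a lifetime total."""
    def __init__(self, total_epsilon: float):
        self.total = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon: float) -> None:
        if self.spent + epsilon > self.total:
            raise RuntimeError("privacy budget exhausted: refuse to answer")
        self.spent += epsilon

budget = PrivacyBudget(total_epsilon=1.0)
budget.charge(0.1)   # each released statistic consumes part of the lifetime budget
```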

37 Applications/Implementations of Differential Privacy
Census Bureau OnTheMap: gives researchers access to agency data.
Google’s RAPPOR (Randomized Aggregatable Privacy-Preserving Ordinal Response): open source; enabled collection of data Chrome avoided before.
Tackling urban mobility with aggregate traffic data.
Apple: big news coverage, commitment to privacy.
Applications to multiple hypothesis testing.
Global vs. local model.
How to hack Kaggle competitions.

38 Public policy California Public Utilities Commission
Smart meters.
Interpreting and implementing FERPA (the Family Educational Rights and Privacy Act) via differential privacy: Nissim and Wood. The hope is that this line of work results in (i) solid arguments that DP can be used with personal information protected by various regulations, (ii) in cases where such a statement is not possible, modifications to DP that satisfy the regulation, and (iii) maybe most importantly, help in shaping future regulation.
Goal: understand how DP fits within the existing regulatory framework. Problem: the regulatory framework is not mathematically precise, and the idea of de-identification is hard-wired into it.

39 Challenges Small Datasets
Massive composition – a global epsilon: event level vs. user level.
Working in conjunction with Secure Function Evaluation.
Winning the hearts and minds of policy makers…

40 Winning the hearts and minds of policy makers…
Widen the scope of implementation and use: identify the next good use cases for DP.
Construct DP tools that best match the practices and education of users; explain the shortcomings of other methods and the benefits of DP.
Need to figure out how DP works as one of the layers in a suite of privacy protections.
DP is less straightforward and intuitive than anonymity/de-identification and its variants.

