1
Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin
2
Browsing history Medical and genetic data Web searches Tastes Purchases slide 2
3
Web tracking slide 3
4
Social aggregation Database marketing Universal data accessibility Aggregation slide 4
5
Electronic medical records (EMR) – Cerner, Practice Fusion … Health-care datasets – Clinical studies, hospital discharge databases … Increasingly accompanied by DNA information – PatientsLikeMe.com Medical data slide 5
6
High-dimensional datasets Row = user record Column = dimension – Example: purchased items Thousands or millions of dimensions – Netflix movie ratings: 35,000 – Amazon purchases: 10^7 slide 6
7
Sparsity and "Long Tail" Netflix Prize dataset: considering just movie names, for 90% of records there isn't a single other record which is more than 30% similar. The average record has no "similar" records slide 7
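To make the sparsity claim concrete, here is a minimal sketch (toy data, nothing from the actual Netflix Prize set) that computes each record's similarity to its nearest neighbor using Jaccard similarity over item sets; the user names, items, and the 30% threshold are illustrative choices, and Jaccard is just one plausible similarity measure.

```python
# Toy sketch (not the actual Netflix Prize analysis): for each record,
# find its most similar other record using Jaccard similarity on item sets.

# Hypothetical sparse records: user -> set of items (e.g., movies rated)
records = {
    "u1": {"A", "B", "C"},
    "u2": {"C", "D"},
    "u3": {"E", "F", "G", "H"},
    "u4": {"A", "I"},
    "u5": {"J"},
}

def jaccard(a, b):
    return len(a & b) / len(a | b)

# Similarity of each record to its nearest neighbor
nearest = {
    u: max((jaccard(items, records[v]) for v in records if v != u), default=0.0)
    for u, items in records.items()
}

# Fraction of records with no other record more than 30% similar
isolated = sum(1 for s in nearest.values() if s <= 0.30) / len(records)
print(f"{isolated:.0%} of records have no neighbor above 30% similarity")
```

On real sparse, high-dimensional data the same loop reports that most records have no close neighbor, which is what makes them so easy to single out.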
8
Graph-structured social data Node attributes – Interests – Group membership – Sexual orientation Edge attributes – Date of creation – Strength – Type of relationship slide 8
9
“Jefferson High”: romantic and sexual network Real data! slide 9
10
Whose data is it, anyway? Social networks – Information about relationships is shared Genome – Shared with all blood relatives Recommender systems – Complex algorithms make it impossible to trace origin of data Traditional notion: everyone owns and should control their personal data slide 10
11
Search Mini-feed Beacon Applications Famous privacy breaches Why did they happen? slide 11
12
Data release today Datasets are “scrubbed” and published Why not interactive computation? – Infrastructure cost – Overhead of online privacy enforcement – Resource allocation and competition – Client privacy What about privacy of data subjects? – Answer: data have been ANONYMIZED slide 12
13
The crutch of anonymity (U.S) (U.K) Deals with ISPs to collect anonymized browsing data for highly targeted advertising. Users not notified. Court ruling over YouTube user log data causes major privacy uproar. Deal to anonymize viewing logs satisfies all objections. slide 13
14
Targeted advertising “… breakthrough technology that uses social graph data to dramatically improve online marketing … "Social Engagement Data" consists of anonymous information regarding the relationships between people” “The critical distinction … between the use of personal information for advertisements in personally-identifiable form, and the use, dissemination, or sharing of information with advertisers in non-personally-identifiable form.” slide 14
15
Data are “scrubbed” by removing personally identifying information (PII) – Name, Social Security number, phone number, email, address… what else? Problem: PII has no technical meaning – Defined in disclosure notification laws If certain information is lost, consumer must be notified – In privacy breaches, any information can be personally identifying The myth of the PII slide 15
16
More reading Narayanan and Shmatikov. “Myths and Fallacies of ‘Personally Identifiable Information’ ” (CACM 2010) slide 16
17
De-identification Tries to achieve "privacy" by syntactic transformation of the data – scrubbing of PII, k-anonymity, l-diversity… Fatally flawed! Insecure against attackers with external information Does not compose (anonymizing twice can reveal the data) No meaningful notion of privacy No meaningful notion of utility slide 17
18
Latanya Sweeney’s attack (1997) Massachusetts hospital discharge dataset Public voter dataset slide 18
19
Closer look at two records Voter registration: Name (Vitaly), Age (70), ZIP code (78705), Sex (Male); identifiable, no sensitive data. Patient record: Age (70), ZIP code (78705), Sex (Male), Disease (Jetlag); anonymized, contains sensitive data slide 19
20
Database join Name (Vitaly) Age (70) Zip code (78705) Sex (Male) Disease (Jetlag) Vitaly suffers from jetlag! slide 20
21
Observation #1: data joins Attacker learns sensitive data by joining two datasets on common attributes – Anonymized dataset with sensitive attributes Example: age, race, symptoms – “Harmless” dataset with individual identifiers Example: name, address, age, race Demographic attributes (age, ZIP code, race, etc.) are very common slide 21
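A minimal sketch of such a join on toy data (the names and values below are made up, echoing the Vitaly example from the previous slides; the attribute set is a hypothetical choice of quasi-identifiers):

```python
# Minimal sketch of a linkage (re-identification) attack: join an
# "anonymized" medical table with a public voter list on shared
# demographic attributes. All data here is made up for illustration.

medical = [  # no names, but quasi-identifiers kept "truthful"
    {"age": 70, "zip": "78705", "sex": "M", "disease": "Jetlag"},
    {"age": 34, "zip": "78712", "sex": "F", "disease": "Flu"},
]

voters = [  # public, identified
    {"name": "Vitaly", "age": 70, "zip": "78705", "sex": "M"},
    {"name": "Alice",  "age": 34, "zip": "78754", "sex": "F"},
]

QUASI = ("age", "zip", "sex")

def key(row):
    return tuple(row[a] for a in QUASI)

by_quasi = {}
for row in medical:
    by_quasi.setdefault(key(row), []).append(row)

for voter in voters:
    for match in by_quasi.get(key(voter), []):
        print(f"{voter['name']} -> {match['disease']}")
# prints: Vitaly -> Jetlag
```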
22
Observation #2: quasi-identifiers Sweeney’s observation: (birthdate, ZIP code, gender) uniquely identifies 87% of US population – Side note: actually, only 63% Publishing a record with a quasi-identifier is as bad as publishing it with an explicit identity Eliminating quasi-identifiers is not desirable – For example, users of the dataset may want to study distribution of diseases by age and ZIP code slide 22 [Golle WPES ‘06]
23
k-anonymity Proposed by Samarati and Sweeney – First appears in an SRI tech report (1998) Hundreds of papers since then – Extremely popular in the database and data-mining communities (SIGMOD, ICDE, KDD, VLDB) Many k-anonymization algorithms, most based on generalization and suppression of quasi- identifiers slide 23
24
Anonymization in a nutshell Dataset is a relational table Attributes (columns) are divided into quasi-identifiers and sensitive attributes Generalize/suppress quasi-identifiers, but don't touch sensitive attributes (keep them "truthful") Example columns: Race | Age | Symptoms | Blood type | Medical history slide 24
25
k-anonymity: definition Any (transformed) quasi-identifier must appear in at least k records in the anonymized dataset – k is chosen by the data owner (how?) – Example: any age-race combination from original DB must appear at least 10 times in anonymized DB Guarantees that any join on quasi-identifiers with the anonymized dataset will contain at least k records for each quasi-identifier slide 25
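As a concrete illustration, here is a small sketch (toy table, hypothetical attribute names) of exactly what the k-anonymity condition checks:

```python
# A minimal sketch of checking k-anonymity: every combination of
# quasi-identifier values must occur in at least k records.
# The table and attribute names are illustrative, not from the talk's datasets.

from collections import Counter

QUASI = ("race", "zip")   # quasi-identifiers
K = 3

table = [
    {"race": "Caucas",      "zip": "787XX", "disease": "Flu"},
    {"race": "Caucas",      "zip": "787XX", "disease": "Flu"},
    {"race": "Caucas",      "zip": "787XX", "disease": "Flu"},
    {"race": "Asian/AfrAm", "zip": "78705", "disease": "Shingles"},
    {"race": "Asian/AfrAm", "zip": "78705", "disease": "Acne"},
    {"race": "Asian/AfrAm", "zip": "78705", "disease": "Acne"},
]

def is_k_anonymous(rows, quasi, k):
    counts = Counter(tuple(r[a] for a in quasi) for r in rows)
    return all(c >= k for c in counts.values())

print(is_k_anonymous(table, QUASI, K))  # True: each quasi-identifier group has >= 3 rows
```

Note that the check says nothing about the sensitive attribute: the all-Flu group above passes, which is precisely the problem discussed on the following slides.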
26
Two (and a half) interpretations Membership disclosure: cannot tell that a given person is in the dataset Sensitive attribute disclosure: cannot tell that a given person has a certain sensitive attribute Identity disclosure: cannot tell which record corresponds to a given person. The last interpretation is correct (assuming the attacker only knows quasi-identifiers), but it does not imply any privacy! Example: k clinical records, all HIV+ slide 26
27
Curse of dimensionality Generalization fundamentally relies on spatial locality – Each record must have k close neighbors Real-world datasets are very sparse – Netflix Prize dataset: 17,000 dimensions – Amazon: several million dimensions – “Nearest neighbor” is very far Projection to low dimensions loses all info k-anonymized datasets are useless Aggarwal VLDB ‘05 slide 27
28
k-anonymity: definition Any (transformed) quasi-identifier must appear in at least k records in the anonymized dataset Does not mention sensitive attributes at all! Does not say anything about the computations to be done on the data Assumes that attacker will be able to join only on quasi-identifiers... or how not to define privacy slide 28
29
Sensitive attribute disclosure Intuitive reasoning: k-anonymity prevents attacker from telling which record corresponds to which person Therefore, attacker cannot tell that a certain person has a particular value of a sensitive attribute This reasoning is fallacious! slide 29
30
3-anonymization Original: Caucas 78712 Flu; Asian 78705 Shingles; Caucas 78754 Flu; Asian 78705 Acne; AfrAm 78705 Acne; Caucas 78705 Flu. Anonymized: Caucas 787XX Flu; Asian/AfrAm 78705 Shingles; Caucas 787XX Flu; Asian/AfrAm 78705 Acne; Asian/AfrAm 78705 Acne; Caucas 787XX Flu. This is 3-anonymous, right? slide 30
31
Joining with external database External record: Vitaly, Caucas, 78705. Anonymized table: Caucas 787XX Flu; Asian/AfrAm 78705 Shingles; Caucas 787XX Flu; Asian/AfrAm 78705 Acne; Asian/AfrAm 78705 Acne; Caucas 787XX Flu. Problem: sensitive attributes are not "diverse" within each quasi-identifier group slide 31
32
Another attempt: l-diversity Caucas 787XX: Flu, Shingles, Acne, Flu, Acne, Flu. Asian/AfrAm 78XXX: Flu, Flu, Acne, Shingles, Acne, Flu. Entropy of sensitive attributes within each quasi-identifier group must be at least log(l) slide 32 Machanavajjhala et al. ICDE '06
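A sketch of that entropy check on a smaller toy table in the same spirit (the attribute names and l = 2 are illustrative choices, not taken from the paper):

```python
# Sketch of an entropy l-diversity check: within every quasi-identifier
# group, the entropy of the sensitive attribute must be at least log(l).
# The table below is a made-up toy example.

import math
from collections import Counter, defaultdict

QUASI = ("race", "zip")
SENSITIVE = "disease"
L = 2

table = [
    {"race": "Caucas",      "zip": "787XX", "disease": "Flu"},
    {"race": "Caucas",      "zip": "787XX", "disease": "Shingles"},
    {"race": "Caucas",      "zip": "787XX", "disease": "Acne"},
    {"race": "Asian/AfrAm", "zip": "78XXX", "disease": "Flu"},
    {"race": "Asian/AfrAm", "zip": "78XXX", "disease": "Acne"},
    {"race": "Asian/AfrAm", "zip": "78XXX", "disease": "Shingles"},
]

def entropy(values):
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

groups = defaultdict(list)
for row in table:
    groups[tuple(row[a] for a in QUASI)].append(row[SENSITIVE])

diverse = all(entropy(vals) >= math.log(L) for vals in groups.values())
print(diverse)  # True for this toy table
```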
33
Failure of l-diversity Original database: 99% of records have Cancer, the rest Flu. Anonymization A: group Q1 = {Flu, Flu, Cancer, Flu, Cancer, Cancer}; group Q2 = {Cancer, Cancer, Cancer, Cancer, Cancer, Cancer}. Anonymization B: group Q1 = {Flu, Cancer, Cancer, Cancer, Cancer, Cancer}; group Q2 = {Cancer, Cancer, Cancer, Cancer, Flu, Flu}. A quasi-identifier group with 50% cancer is "diverse", yet it leaks a ton of information (relative to the 99% baseline). A quasi-identifier group with 99% cancer is not "diverse", yet the anonymized database does not leak anything slide 33
34
Membership disclosure With high probability, quasi-identifier uniquely identifies an individual in the population Modifying quasi-identifiers in the dataset does not affect their frequency in the population! – Suppose anonymized dataset contains 10 records with a certain quasi-identifier … and there are 10 people in the population who match it k-anonymity may not hide whether a given person is in the dataset Nergiz et al. SIGMOD ‘07 slide 34
35
What does attacker know? Caucas 787XX HIV+ Flu; Asian/AfrAm 787XX HIV- Flu; Asian/AfrAm 787XX HIV+ Shingles; Caucas 787XX HIV- Acne; Caucas 787XX HIV- Shingles; Caucas 787XX HIV- Acne. Attacker: "Bob is Caucasian and I heard he was admitted to hospital with flu…" Objection: "This is against the rules! 'flu' is not a quasi-identifier." Yes… and this is yet another problem with k-anonymity! slide 35
36
Other problems with k-anonymity Multiple releases of the same dataset break anonymity Mere knowledge of the k-anonymization algorithm is enough to reverse anonymization slide 36 Ganta et al. KDD ‘08 Zhang et al. CCS ‘07
37
k-Anonymity considered harmful Syntactic – Focuses on data transformation, not on what can be learned from the anonymized dataset – “k-anonymous” dataset can leak sensitive info “Quasi-identifier” fallacy – Assumes a priori that attacker will not know certain information about his target Relies on locality – Destroys utility of many real-world datasets slide 37
38
HIPAA Privacy Rule “The identifiers that must be removed include direct identifiers, such as name, street address, social security number, as well as other identifiers, such as birth date, admission and discharge dates, and five-digit zip code. The safe harbor requires removal of geographic subdivisions smaller than a State, except for the initial three digits of a zip code if the geographic unit formed by combining all zip codes with the same initial three digits contains more than 20,000 people. In addition, age, if less than 90, gender, ethnicity, and other demographic information not listed may remain in the information. The safe harbor is intended to provide covered entities with a simple, definitive method that does not require much judgment by the covered entity to determine if the information is adequately de-identified." "Under the safe harbor method, covered entities must remove all of a list of 18 enumerated identifiers and have no actual knowledge that the information remaining could be used, alone or in combination, to identify a subject of the information." slide 38
39
Lessons Anonymization does not work “Personally identifiable” is meaningless – Originally a legal term, unfortunately crept into technical language in terms such as “quasi-identifier” – Any piece of information is potentially identifying if it reduces the space of possibilities – Background info about people is easy to obtain Linkage of information across virtual identities allows large-scale de-anonymization slide 39
40
How to do it right Privacy is not a property of the data – Syntactic definitions such as k-anonymity are doomed to fail Privacy is a property of the computation carried out on the data Definition of privacy must be robust in the presence of auxiliary information – differential privacy Dwork et al. ’06-10 slide 40
41
Differential privacy (intuition) Mechanism is differentially private if every output is produced with similar probability whether any given input is included or not: the output distributions on inputs {A, B, C, D} and {A, B, D} are similar. Risk for C does not increase much if her data are included in the computation slide 41
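For reference, the standard formal statement behind this intuition (Dwork et al.); the privacy parameter ε below is not spelled out on the slide:

```latex
% A randomized mechanism M is epsilon-differentially private if, for all
% neighboring databases D and D' (differing in one individual's record)
% and for every set S of outputs:
\[
  \Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[\mathcal{M}(D') \in S]
\]
```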
42
Computing in the year 201X Illusion of infinite resources Pay only for resources used Quickly scale up or scale down … Data slide 42
43
Programming model in year 201X Frameworks available to ease cloud programming MapReduce: parallel processing on clusters of machines (Data → Map → Reduce → Output) Applications: data mining, genomic computation, social networks slide 43
44
Programming model in year 201X Thousands of users upload their data – Healthcare, shopping transactions, clickstream… Multiple third parties mine the data Example: health-care data – Incentive to contribute: Cheaper insurance, new drug research, inventory control in drugstores… – Fear: What if someone targets my personal data? Insurance company learns something about my health and increases my premium or denies coverage slide 44
45
Privacy in the year 201X? An untrusted MapReduce program (data mining, genomic computation, social networks) runs over health data and produces an output. Information leak? slide 45
46
Audit untrusted code? Audit MapReduce programs for correctness? Hard to do! (Enlightenment?) Also, where is the source code? Aim: confine the code instead of auditing slide 46
47
Airavat Framework for privacy-preserving MapReduce computations with untrusted code Untrusted program Protected Data Airavat slide 47
48
Airavat guarantee Bounded information leak* about any individual data after performing a MapReduce computation. *Differential privacy Untrusted program Protected Data Airavat slide 48
49
Background: MapReduce Map phase: map(k1, v1) → list(k2, v2). Reduce phase: reduce(k2, list(v2)) → list(v2) slide 49
50
MapReduce example: counts the number of iPads sold. Map phase: Map(input) { if (input has iPad) print (iPad, 1) } emits (iPad, 1) for each matching record. Reduce phase: Reduce(key, list(v)) { print (key + "," + SUM(v)) } sums the emitted values slide 50
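The same computation written as a small, self-contained Python sketch; this only mirrors the map/reduce structure of the pseudocode above and is not Hadoop's or Airavat's API:

```python
# Count iPads sold, structured as map -> shuffle -> reduce.

from itertools import groupby
from operator import itemgetter

def map_fn(record):
    # emit (key, 1) for every sale record that mentions an iPad
    if "iPad" in record:
        yield ("iPad", 1)

def reduce_fn(key, values):
    return (key, sum(values))

sales = ["iPad 64GB", "laptop", "iPad mini", "phone"]

# map phase
pairs = [kv for record in sales for kv in map_fn(record)]
# shuffle phase: group pairs by key
pairs.sort(key=itemgetter(0))
# reduce phase
for key, group in groupby(pairs, key=itemgetter(0)):
    print(reduce_fn(key, [v for _, v in group]))   # ('iPad', 2)
```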
51
Airavat model Airavat runs on the cloud infrastructure – Cloud infrastructure: Hardware + VM – Airavat: Modified MapReduce + DFS + JVM + SELinux Cloud infrastructure Airavat framework 1 Trusted slide 51
52
Airavat model Data provider uploads her data on Airavat – Sets up certain privacy parameters Cloud infrastructure Data provider 2 Airavat framework 1 Trusted slide 52
53
Airavat model Computation provider implements data mining algorithm – Untrusted, possibly malicious Cloud infrastructure Data provider 2 Airavat framework 1 3 Computation provider Output Program Trusted slide 53
54
Threat model Airavat runs the computation and protects the privacy of the input data Cloud infrastructure Data provider 2 Airavat framework 1 3 Computation provider Output Program Trusted Threat slide 54
55
Airavat programming model MapReduce program for data mining Split MapReduce into untrusted mapper + trusted reducer (limited set of stock reducers) No need to audit slide 55
56
Airavat programming model MapReduce program for data mining: untrusted mapper + trusted reducer. No need to audit, but need to confine the mappers! Guarantee: protect the privacy of input data slide 56
57
Leaking via storage channels Untrusted mapper code copies data, sends it over the network; leaks using system resources slide 57
58
Leaking via output Output of the computation is also an information channel. Example: output 1 million if Peter bought Vi*gra slide 58
59
Airavat mechanisms Mandatory access control: prevent leaks through storage channels like network connections, files… Differential privacy: prevent leaks through the output of the computation slide 59
60
Confining untrusted code MapReduce + DFS SELinux Untrusted program Given by the computation provider Add mandatory access control (MAC) Add MAC policy Airavat slide 60
61
Confining untrusted code MapReduce + DFS SELinux Untrusted program We add mandatory access control to the MapReduce framework Label input, intermediate values, output Malicious code cannot leak labeled data Access control label MapReduce slide 61
62
Confining untrusted code MapReduce + DFS SELinux Untrusted program SELinux policy to enforce MAC Creates trusted and untrusted domains Processes and files are labeled to restrict interaction Mappers reside in untrusted domain – Denied network access, limited file system interaction slide 62
63
Access control is not enough Labels can prevent the output from being read, but when can we remove the labels? Example: a malicious mapper runs if (input belongs-to Peter) print (iPad, 1000000); the SUM in the reduce phase then outputs (iPad, 1000001) instead of (iPad, 1), so the output leaks the presence of Peter! slide 63
64
Differential privacy (intuition) A mechanism is differentially private if every output is produced with similar probability whether any given input is included or not Cynthia Dwork et al. Differential Privacy. slide 64
65
Differential privacy (intuition) A mechanism is differentially private if every output is produced with similar probability whether any given input is included or not. Output distribution F(x) over inputs A, B, C Cynthia Dwork et al. Differential Privacy. slide 65
66
Differential privacy (intuition) A mechanism is differentially private if every output is produced with similar probability whether any given input is included or not. Similar output distributions F(x) on inputs {A, B, C} and {A, B, C, D}: bounded risk for D if she includes her data! Cynthia Dwork et al. Differential Privacy. slide 66
67
Achieving differential privacy A simple differentially private mechanism: on the query "Tell me f(x)" over the database x1, …, xn, return f(x) + noise. How much random noise should be added? slide 67
68
Achieving differential privacy Function sensitivity (intuition): maximum effect of any single input on the output – Aim: “mask” this effect to ensure privacy Example: average height of the people in this room has low sensitivity – Any single person’s height does not affect the final average by too much – Calculating the maximum height has high sensitivity slide 68
69
Achieving differential privacy Function sensitivity (intuition): maximum effect of any single input on the output – Aim: "mask" this effect to ensure privacy Example: SUM over input elements X1, …, X4 drawn from [0, M] has sensitivity M: the max. effect of any single input element is M slide 69
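In symbols, this is the (global) sensitivity of f over neighboring inputs; the notation below is a standard way of writing it, not taken from the slide:

```latex
% Sensitivity of f: worst-case change of the output when one input changes.
\[
  \Delta f = \max_{x,\,x'\ \text{neighbors}} \bigl| f(x) - f(x') \bigr|,
  \qquad
  f(x) = \sum_i x_i,\ x_i \in [0, M] \;\Rightarrow\; \Delta f = M
\]
```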
70
Achieving differential privacy A simple differentially private mechanism: on the query "Tell me f(x)" over the database x1, …, xn, return f(x) + Lap(∆(f)), where Lap = Laplace distribution and ∆(f) = sensitivity. Intuition: noise needed to mask the effect of a single input slide 70 Dwork et al.
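A sketch of this mechanism in Python. The slide writes the noise as Lap(∆(f)); the usual formulation scales it as ∆(f)/ε for a privacy parameter ε, so ε appears below as an extra assumption not shown on the slide.

```python
# Sketch of the Laplace mechanism: return f(data) plus Laplace noise
# calibrated to the sensitivity of f and the privacy parameter epsilon.

import numpy as np

def laplace_mechanism(f, data, sensitivity, epsilon, rng=None):
    """Return f(data) plus Laplace noise with scale sensitivity / epsilon."""
    rng = rng or np.random.default_rng()
    return f(data) + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: SUM over values drawn from [0, M] has sensitivity M
M, epsilon = 10.0, 0.5
x = [3.0, 7.5, 9.9, 0.2]
print(laplace_mechanism(sum, x, sensitivity=M, epsilon=epsilon))
# true sum is 20.6; the added noise has scale M / epsilon = 20
```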
71
Mapper can be any piece of Java code (“black box”) but… range of mapper outputs must be declared in advance – Used to estimate “sensitivity” (how much does a single input influence the output?) – Determines how much noise is added to outputs to ensure differential privacy Example: consider mapper range [0, M] – SUM has the estimated sensitivity of M slide 71 Enforcing differential privacy
72
Malicious mappers may output values outside the range If a mapper produces a value outside the range, it is replaced by a value inside the range – User not notified… otherwise possible information leak A range enforcer sits between each mapper and the reducer and ensures that the code is not more sensitive than declared slide 72
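A rough sketch of the idea, not Airavat's actual implementation: clamp each mapper output to the declared range, so the sensitivity seen by a noisy SUM reducer is bounded no matter what the untrusted mapper emits. The function names and parameters are hypothetical.

```python
# Range enforcement sketch: out-of-range mapper outputs are silently
# replaced, bounding the sensitivity of the downstream noisy SUM.

import numpy as np

def enforce_range(value, lo, hi):
    # Silently replace out-of-range values; the mapper is NOT notified,
    # since a rejection signal could itself leak information.
    return min(max(value, lo), hi)

def noisy_sum_reducer(values, lo, hi, epsilon, rng=None):
    rng = rng or np.random.default_rng()
    clamped = [enforce_range(v, lo, hi) for v in values]
    sensitivity = max(abs(lo), abs(hi))  # max effect of one input on the SUM
    return sum(clamped) + rng.laplace(scale=sensitivity / epsilon)

# A malicious mapper tried to emit 1000000 for Peter's record:
print(noisy_sum_reducer([1, 1, 1000000, 1], lo=0, hi=1, epsilon=0.5))
# the output stays near 4 (plus noise with scale 2), not 1000003
```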
73
All mapper invocations must be independent Mapper may not store an input and use it later when processing another input – Otherwise, range-based sensitivity estimates may be incorrect We modify JVM to enforce mapper independence – Each object is assigned an invocation number – JVM instrumentation prevents reuse of objects from previous invocation slide 73 Enforcing sensitivity
74
What can we compute? Reducers are responsible for enforcing privacy – Add appropriate amount of random noise to the outputs Reducers must be trusted – Sample reducers: SUM, COUNT, THRESHOLD – Sufficient to perform data-mining algorithms, search log processing, simple statistical computations, etc. With trusted mappers, more general computations are possible – Use exact sensitivity instead of range-based estimates slide 74
75
More reading Roy et al. “Airavat: Security and Privacy for MapReduce” (NSDI 2010) slide 75