Privacy-Preserving Data Publishing

Privacy-Preserving Data Publishing Thanks to my project co-PI: YAN, Da (from UAB)

Big Data Privacy
Big data is mostly collected from individuals. Individuals want privacy for their data. How can entities use such sensitive data? Anonymize it! But how?

Tradeoff: Privacy vs. Utility
Define a privacy notion that provides a privacy protection guarantee, then design a mechanism under that notion with high utility. Privacy and utility pull against each other.

GIC Incident [Sweeney 1998]
The Group Insurance Commission (GIC, Massachusetts) collected patient data for ~135,000 state employees, gave it to researchers, and sold it to industry. The medical record of the former state governor was identified.

GIC Incident [Sweeney 1998]

GIC Incident [Sweeney 1998]
GIC, MA DB (Patient 1, Patient 2, ..., Patient n):

Name     DoB     Gender  Zip code  Disease
Bob      1/3/45  M       47906     Cancer
Carl     4/7/64          47907
Daisy    9/3/69  F       47902     Flu
Emily    6/2/71          46204     Gastritis
Flora    2/7/80          46208     Hepatitis
Gabriel  5/5/68          46203     Bronchitis

Re-identification occurs!

AOL Data Release [NYTimes 2006]
In August 2006, AOL released the search keywords of 650,000 users over a 3-month period. User IDs were replaced by random numbers. Three days later, AOL pulled the data from public access, but it was too late.

AOL Data Release [NYTimes 2006]

AOL Data Release [NYTimes 2006]
Thelma Arnold, a 62-year-old widow who lives in Lilburn, GA, has three dogs, and frequently searches for her friends' medical ailments.
AOL searcher #4417749:
"landscapers in Lilburn, GA"
queries on the last name "Arnold"
"homes sold in shadow lake subdivision Gwinnett County, GA"
"num fingers"
"60 single men"
"dog that urinates on everything"
NYT: Re-identification occurs!

Quasi-identifier (QI)
Removing unique identifiers is not sufficient. QI: a maximal set of attributes that could help identify individuals. The QI is assumed to be publicly available (e.g., in voter registration lists).

k-Anonymity [Sweeney, Samarati 2002] Each released record should be indistinguishable from at least (k-1) others on its QI attributes Cardinality of any query result on released data should be at least k
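A minimal sketch (not from the slides) of how k-anonymity can be checked: group the released records by their QI values and verify that every group contains at least k records. The record layout and attribute names below are hypothetical.

```python
from collections import Counter

def is_k_anonymous(records, qi_attrs, k):
    """True if every combination of QI values appears in at least k records."""
    groups = Counter(tuple(r[a] for a in qi_attrs) for r in records)
    return all(count >= k for count in groups.values())

# Hypothetical released table with generalized QI values (zip, age, gen)
released = [
    {"zip": "476**", "age": "2*", "gen": "*", "disease": "Ovarian Cancer"},
    {"zip": "476**", "age": "2*", "gen": "*", "disease": "Ovarian Cancer"},
    {"zip": "476**", "age": "2*", "gen": "*", "disease": "Prostate Cancer"},
    {"zip": "4790*", "age": "[43,52]", "gen": "*", "disease": "Flu"},
    {"zip": "4790*", "age": "[43,52]", "gen": "*", "disease": "Heart Disease"},
    {"zip": "4790*", "age": "[43,52]", "gen": "*", "disease": "Heart Disease"},
]

print(is_k_anonymous(released, ["zip", "age", "gen"], k=3))  # True
print(is_k_anonymous(released, ["zip", "age", "gen"], k=4))  # False
```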

k-Anonymity: Methods Generalization: replacing (recoding) a value with a less specific but semantically consistent one. Suppression: not releasing any value at all
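A minimal sketch of generalization, assuming two illustrative recodings (masking the low-order digits of a zipcode and binning ages into fixed-width ranges); these are not the specific recoding schemes of any named algorithm.

```python
def generalize_zip(zipcode, keep_digits=3):
    """Replace the trailing digits of a zipcode with '*' (e.g. 47677 -> 476**)."""
    return zipcode[:keep_digits] + "*" * (len(zipcode) - keep_digits)

def generalize_age(age, width=10):
    """Map an exact age to a fixed-width range (e.g. 27 -> '[20,29]')."""
    lo = (age // width) * width
    return f"[{lo},{lo + width - 1}]"

print(generalize_zip("47677"))       # 476**
print(generalize_age(27))            # [20,29]
print(generalize_age(43, width=10))  # [40,49]
```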

k-Anonymity: Methods

k-Anonymity: Methods

The Microdata (QID = Zipcode, Age, Gen; SA = Disease):
Zipcode  Age  Gen  Disease
47677    29   F    Ovarian Cancer
47602    22
47678    27   M    Prostate Cancer
47905    43        Flu
47909    52        Heart Disease
47906    47

A 3-Anonymous Table (QID generalized):
Zipcode  Age      Gen  Disease
476**    2*       *    Ovarian Cancer
476**    2*       *    Prostate Cancer
4790*    [43,52]  *    Flu
4790*    [43,52]  *    Heart Disease

k-Anonymity: Methods
Finding an optimal anonymization is not easy; it is an NP-hard problem. Heuristic solutions: DataFly, Incognito, Mondrian, TDS, ...

Attacks Against k-Anonymity
Sensitive values in an equivalence class may lack diversity.
Homogeneity Attack: Bob's publicly known QI values (Zipcode 47678, Age 27) place him in the equivalence class (Zipcode 476**, Age 2*), in which every record's disease is Heart Disease; other classes (e.g., Zipcode 4790*, Age ≥40, and Age 3*) contain values such as Flu and Cancer. The attacker learns Bob's disease even though the table is k-anonymous.

ℓ-Diversity [Machanavajjhala et al., 2006]
Let a q*-block be a set of tuples whose non-sensitive values generalize to q*. A q*-block is ℓ-diverse if it contains at least ℓ "well-represented" values for the sensitive attribute. A table is ℓ-diverse if every q*-block is ℓ-diverse.

Attacks Against k-Anonymity
The attacker has background knowledge.
Background Knowledge Attack: Umeko (Japanese; Zipcode 47673, Age 36) falls in the equivalence class with Zipcode 476** and Age 3*. Knowing that heart attacks occur at a reduced rate in Japanese patients, the attacker can largely rule out heart disease and narrow Umeko's condition down to the remaining sensitive values in her class.

Distinct ℓ-Diversity: Each equivalence class has at least ℓ "well-represented" sensitive values.
Probabilistic inference attacks: In one equivalence class there are ten tuples. In the "Disease" attribute, one of them is "Cancer", one is "Heart Disease", and the remaining eight are "Flu". This satisfies distinct 3-diversity, but the attacker can affirm that the target person's disease is "Flu" with 80% confidence.
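A minimal sketch of the distinct ℓ-diversity check and the probabilistic inference it misses, using the ten-tuple example above.

```python
from collections import Counter

def distinct_l(sensitive_values):
    """Number of distinct sensitive values in one equivalence class."""
    return len(set(sensitive_values))

eq_class = ["Cancer", "Heart Disease"] + ["Flu"] * 8   # the ten tuples from the example

print(distinct_l(eq_class))                 # 3 -> satisfies distinct 3-diversity
freq = Counter(eq_class)
print(freq["Flu"] / len(eq_class))          # 0.8 -> "Flu" can be inferred with 80% confidence
```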

Entropy ℓ-Diversity: Each equivalence class must have enough different sensitive values, and the values must be distributed evenly enough. Formally, the entropy of the distribution of sensitive values in each equivalence class must be at least log(ℓ). This is too conservative when some values are very common: the entropy of the entire table may already be very low.
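A minimal sketch of the entropy ℓ-diversity check: the entropy of the sensitive-value distribution within the class must be at least log(ℓ).

```python
import math
from collections import Counter

def entropy_l_diverse(sensitive_values, l):
    """True if the entropy of the class's sensitive-value distribution is >= log(l)."""
    counts = Counter(sensitive_values)
    n = len(sensitive_values)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy >= math.log(l)

print(entropy_l_diverse(["Cancer", "Heart Disease"] + ["Flu"] * 8, l=3))  # False: too skewed
print(entropy_l_diverse(["Cancer", "Heart Disease", "Flu"] * 3, l=3))     # True: evenly spread
```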

Recursive (c, ℓ)-Diversity: A compromise definition that ensures the most common value does not appear too often while less common values do not appear too rarely. In any q*-block, let r_i denote the number of times the i-th most frequent sensitive value appears; the block is recursive (c, ℓ)-diverse if r_1 < c · (r_ℓ + r_{ℓ+1} + ... + r_m).
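A minimal sketch of the recursive (c, ℓ)-diversity test, directly implementing the inequality above on the sorted frequencies of sensitive values.

```python
from collections import Counter

def recursive_cl_diverse(sensitive_values, c, l):
    """True if r_1 < c * (r_l + ... + r_m), with r_1 >= r_2 >= ... >= r_m."""
    r = sorted(Counter(sensitive_values).values(), reverse=True)
    if len(r) < l:
        return False                    # fewer than l distinct sensitive values
    return r[0] < c * sum(r[l - 1:])    # r_l, ..., r_m (1-indexed in the definition)

eq_class = ["Cancer", "Heart Disease"] + ["Flu"] * 8
print(recursive_cl_diverse(eq_class, c=3, l=3))  # False: 8 is not < 3 * 1
print(recursive_cl_diverse(eq_class, c=9, l=3))  # True:  8 < 9 * 1
```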

Attacks Against ℓ-Diversity
ℓ-diversity does not consider the semantic meanings of sensitive values; sensitive values in an equivalence class may still lack diversity.

A 3-diverse patient table:
Zipcode  Age  Salary  Disease
476**    2*   20K     Gastric Ulcer
476**    2*   30K     Gastritis
476**    2*   40K     Stomach Cancer
4790*    ≥40  50K
4790*    ≥40  100K    Flu
4790*    ≥40  70K     Bronchitis
         3*   60K
         3*   80K     Pneumonia
         3*   90K

Similarity Attack: Bob (Zip 47678, Age 27) falls in the first equivalence class.
Conclusion: Bob's salary is in [20K, 40K], which is relatively low, and Bob has some stomach-related disease.

t-Closeness [Li et al., 2007]
t-closeness requires that the distribution of a sensitive attribute in any equivalence class be close to the distribution of that attribute in the overall table. Privacy is measured by the information gain of an observer: Information Gain = Posterior Belief − Prior Belief.

t-Closeness [Li et al., 2007] In any equivalence class, distance between the distribution of a sensitive attribute in this class and the distribution of the attribute in the whole table is no more than a threshold t. Given two distributions P = (p1, p2, ..., pm), Q = (q1, q2, ..., qm), we consider two well-known distance measures.

t-Closeness [Li et al., 2007]
Variational Distance: D[P, Q] = (1/2) Σ_i |p_i − q_i|.
Earth Mover's Distance: the minimum amount of "work" needed to transform distribution P into Q by moving probability mass between values, where moving mass farther costs more.
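A minimal sketch (not from the slides) of both distances and a check of one equivalence class against a threshold t. The distributions are illustrative; note that for an ordered attribute the t-closeness paper additionally normalizes the Earth Mover's Distance by (m − 1), which this simple version omits.

```python
def variational_distance(p, q):
    """D[P, Q] = (1/2) * sum_i |p_i - q_i|."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def emd_ordered(p, q):
    """Earth Mover's Distance over an ordered 1-D domain with unit spacing:
    total probability mass that must be carried past each position."""
    carried, work = 0.0, 0.0
    for pi, qi in zip(p, q):
        carried += pi - qi
        work += abs(carried)
    return work

overall = [0.3, 0.4, 0.3]    # sensitive-attribute distribution in the whole table
one_class = [0.8, 0.1, 0.1]  # distribution inside one equivalence class

print(variational_distance(overall, one_class))   # 0.5
print(emd_ordered(overall, one_class))            # 0.7
t = 0.2
print(emd_ordered(overall, one_class) <= t)       # False: this class violates t-closeness
```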

t-Closeness [Li et al., 2007] t-closeness protects against attribute disclosure but not identity disclosure

Graph Data?
An attacker knows that Dan has 4 friends and that 2 of them are friends with each other. Dan is re-identified!

Graph Data Anonymization
Risks: identity disclosure and link disclosure.
Graph Anonymization Techniques:
Edge and vertex modification (random perturbation), as sketched below
Grouping vertices and edges into partitions called super-vertices and super-edges
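A minimal sketch of random edge perturbation using the networkx library (assumed available): delete a few existing edges and add the same number of randomly chosen non-edges. This only illustrates the edge-modification idea, not any specific published algorithm.

```python
import random
import networkx as nx

def perturb_edges(G, m, seed=None):
    """Return a copy of G with m random edges removed and m random non-edges added."""
    rng = random.Random(seed)
    H = G.copy()
    H.remove_edges_from(rng.sample(list(H.edges()), m))
    H.add_edges_from(rng.sample(list(nx.non_edges(H)), m))
    return H

G = nx.karate_club_graph()                       # small example graph bundled with networkx
H = perturb_edges(G, m=5, seed=42)
print(G.number_of_edges(), H.number_of_edges())  # same edge count, different structure
```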

Differential Privacy The risk to my privacy should not substantially increase as a result of participating in a statistical database. With or without including me in the database, my privacy risk should not change much

Differential Privacy
A randomized mechanism M is ε-differentially private if, for any two databases D1 and D2 that differ in a single record, and for any set S of possible outputs, Pr[M(D1) ∈ S] ≤ e^ε · Pr[M(D2) ∈ S].

Differential Privacy: Methods
We generate noise using the Laplace distribution. The Laplace distribution, denoted Lap(b), is defined with scale parameter b and has density function p(x | b) = (1/(2b)) · exp(−|x| / b).

Laplace Distribution

Differential Privacy: Methods Go beyond the red curves

Differential Privacy: Methods
Imagine f as a COUNT query. In the original figure, the distribution of the outputs, shown in gray, is centered at the true answer of 100, with Δf = 1 and ε = ln 2. The distribution in orange is the same distribution where the true answer is 101.
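A minimal sketch of the Laplace mechanism for the COUNT example above (Δf = 1, ε = ln 2): add noise drawn from Lap(Δf/ε) to the true answer, so neighboring databases with true counts 100 and 101 produce overlapping output distributions.

```python
import math
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=None):
    """Release true_answer + Lap(sensitivity / epsilon) noise."""
    rng = rng or np.random.default_rng()
    b = sensitivity / epsilon                    # scale of the Laplace noise
    return true_answer + rng.laplace(loc=0.0, scale=b)

epsilon = math.log(2)   # epsilon = ln 2, as in the figure
sensitivity = 1         # a COUNT changes by at most 1 when one record changes

print(laplace_mechanism(100, sensitivity, epsilon))  # noisy answer for the first database
print(laplace_mechanism(101, sensitivity, epsilon))  # noisy answer for its neighbor
```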

Differential Privacy: Properties
Privacy budget: each ε-differentially private query consumes ε of an overall privacy budget; running k such queries on the same data is (kε)-differentially private (sequential composition), so the analyst must spread the budget across all queries.

Local Differential Privacy
Each user perturbs their own data locally (e.g., by randomized response) before sending it to the aggregator. The privacy guarantee holds per user, even when the data collector itself is untrusted, so no trusted curator is required.
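A minimal sketch of local differential privacy via randomized response ("coin tossing"): each user reports the truth only with some probability and otherwise reports a random bit, and the aggregator corrects the bias when estimating the population statistic. The parameters below are illustrative.

```python
import random

def randomized_response(true_bit, p_truth=0.5):
    """With probability p_truth report the true bit; otherwise report a fair coin flip."""
    if random.random() < p_truth:
        return true_bit
    return random.randint(0, 1)

def estimate_fraction(reports, p_truth=0.5):
    """Debiased estimate of the true fraction of 1s from the noisy reports."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth) * 0.5) / p_truth

true_bits = [1] * 300 + [0] * 700                  # 30% of users truly have the attribute
reports = [randomized_response(b) for b in true_bits]
print(estimate_fraction(reports))                  # close to 0.3, yet no single report is trusted
```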

DP Applications
Differential privacy based on coin tossing is widely deployed!
In the Google Chrome browser, to collect browsing statistics
In Apple iOS and macOS, to collect typing statistics
In Microsoft Windows, to collect telemetry data over time
At Snap, to perform modeling of user preferences
This yields deployments of over 100 million users each.

Try the code! (Homework 5) (Due: Apr 16, Wed)
Python library dp-stats from Rutgers University
DP operations: Mean, Variance, Histogram, Principal Component Analysis (PCA), Support Vector Machines (SVM), Logistic Regression
Ref: https://www.ece.rutgers.edu/~hi53/DPSTATS.pdf