Refined privacy models

Data Anonymization: Refined privacy models

Outline: the k-anonymity specification is not sufficient; enhancing privacy with l-diversity and t-closeness.

Linking the dots… Countering the privacy attack: k-anonymity addresses one type of attack, the linking attack. What about other types of attacks?

K-anonymity counters this type of privacy attack: individual -> quasi-identifier -> sensitive attributes. Example: in a 4-anonymized table, at least 4 records share the same quasi-identifier. Quasi-identifier: the attributes through which an attacker can link records to other public data. Typical method: domain-specific generalization.
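
Below is a minimal sketch, in Python, of the domain-specific generalization just mentioned; the attribute names, zip codes, ages, and generalization levels are illustrative, not taken from the slides.

```python
# Sketch: generalize two quasi-identifier attributes so that several records
# share the same (zip, age) value. Zip codes are truncated digit by digit,
# ages are coarsened into fixed-width bands. All values are illustrative.

def generalize_zip(zipcode: str, level: int) -> str:
    """Replace the last `level` digits of a zip code with '*'."""
    if level == 0:
        return zipcode
    return zipcode[:-level] + "*" * level

def generalize_age(age: int, width: int) -> str:
    """Map an exact age to an interval of the given width, e.g. 31 -> '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

records = [
    {"zip": "13053", "age": 28, "disease": "Heart disease"},
    {"zip": "13068", "age": 21, "disease": "Viral infection"},
    {"zip": "13053", "age": 31, "disease": "Cancer"},
    {"zip": "13053", "age": 36, "disease": "Cancer"},
]

# Generalize: mask the last 2 zip digits, use 10-year age bands.
anonymized = [
    {"zip": generalize_zip(r["zip"], 2),
     "age": generalize_age(r["age"], 10),
     "disease": r["disease"]}
    for r in records
]
for row in anonymized:
    print(row)  # e.g. {'zip': '130**', 'age': '30-39', 'disease': 'Cancer'}
```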

More privacy problems. All existing k-anonymity approaches assume that privacy is protected as long as the k-anonymity specification is satisfied. But there are other problems: homogeneity in sensitive attributes, background knowledge on individuals, …

Problem 1: Homogeneity attack. We know Bob is in the table. If Bob lives in zip code 13053 and is 31 years old, then Bob surely has cancer: every record in his q*-block has the same sensitive value.

Problem 2: Background knowledge attack. Japanese have an extremely low incidence of heart disease. Umeko is Japanese, lives in zip code 13068, and is 21 years old, so Umeko has a viral infection with high probability.

The cause of these two problems: the sensitive-attribute values in some blocks lack sufficient diversity. Problem 1: no diversity at all. Problem 2: background knowledge reduces the effective diversity.

Major contributions of l-diversity: formally analyze the privacy of k-anonymity using the Bayes-optimal privacy model. Basic idea: increase the diversity of sensitive attribute values in each anonymized block. Instantiations and implementations of the l-diversity concept: entropy l-diversity, recursive l-diversity, and more.

Modeling the attacks. What is a privacy attack? Guessing the sensitive values (as a probability). Prior belief: without seeing the table, what can we guess? S: sensitive attribute, Q: quasi-identifier; prior: P(S=s|Q=q). Example: Japanese vs. heart disease. Observed belief: once the table is observed, our belief changes. T*: anonymized table; observed: P(S=s | Q=q and T* is known). Effective privacy attacks: the table T* changes the belief a lot. Prior is small, observed belief is large -> positive disclosure. Prior is large, observed belief is small -> negative disclosure.

The definition of observed belief. A q*-block is a k-anonymized group whose records share the generalized quasi-identifier q*; it contains n(q*) records, of which n(q*, s) have sensitive value S=s, so n(q*, s)/n(q*) is the proportion of the block carrying that value. The background knowledge is the prior f(s|q) = P(S=s|Q=q). Intuitively, the observed belief weighs the block counts by the prior and normalizes: β(q, s, T*) ≈ n(q*, s)·f(s|q) / Σ_{s'} n(q*, s')·f(s'|q).
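
The following sketch illustrates the observed-belief computation under the simplifying assumption that the belief is exactly the prior-weighted block counts, normalized; the exact formula in the l-diversity paper is more involved, and the diseases and prior values here are made up for illustration.

```python
# Sketch of an observed-belief computation for one q*-block (simplified).
from collections import Counter

def observed_belief(block_sensitive_values, prior, target):
    """block_sensitive_values: sensitive values of the records in one q*-block.
    prior: dict mapping each sensitive value s to f(s|q), the attacker's prior.
    target: the sensitive value whose observed belief we want."""
    counts = Counter(block_sensitive_values)
    weights = {s: counts[s] * prior.get(s, 0.0) for s in counts}
    total = sum(weights.values())
    return weights.get(target, 0.0) / total if total > 0 else 0.0

# Umeko's q*-block: with a near-zero prior for heart disease, the observed
# belief concentrates on the remaining value.
block = ["Heart disease", "Heart disease", "Viral infection", "Viral infection"]
prior = {"Heart disease": 0.01, "Viral infection": 0.99}
print(observed_belief(block, prior, "Viral infection"))  # 0.99, close to 1
```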

Interpreting the privacy problem of k-anonymity, derived from the relationship between observed belief and (positive) privacy disclosure. Extreme situation: β(q, s, T*) → 1 means positive disclosure. Possibility 1: n(q*, s') << n(q*, s) for every other value s', i.e. lack of diversity. Possibility 2: strong background knowledge eliminates the other values; the attacker knows that, apart from one value s, the other values s' are unlikely when Q=q, so f(s'|q) → 0. Either way the contribution of the other values to the denominator vanishes and β → 1.

Negative disclosure: β(q, s, T*) → 0, which happens when either n(q*, s) → 0 or f(s|q) → 0. The Umeko example: the attacker's belief that she has heart disease drops to nearly 0.

How to address the problems? Ensure that n(q*, s') << n(q*, s) does not hold, so the attacker needs extra knowledge to rule out the other values ("damaging instance-level knowledge" that drives f(s'|q) → 0). If there are L distinct sensitive values in the q*-block, the attacker needs L-1 pieces of damaging knowledge to rule out the L-1 other possible sensitive values. This is the principle of l-diversity.

L-diversity: how to evaluate it? Entropy l-diversity: every q*-block must satisfy the condition that the entropy of the sensitive-value distribution within the block is at least log(L), the entropy of L uniformly distributed distinct values. We like a uniform distribution of sensitive values over each block, and this condition guarantees that every q*-block has at least L distinct sensitive values. A check is sketched below.
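
A minimal sketch of the entropy l-diversity check for a single q*-block; the sensitive values and the choice of L are illustrative.

```python
# Sketch: entropy l-diversity for one q*-block.
# The entropy of the block's sensitive-value distribution must be >= log(L).
import math
from collections import Counter

def entropy_l_diverse(block_sensitive_values, L):
    counts = Counter(block_sensitive_values)
    n = len(block_sensitive_values)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy >= math.log(L)

# A block dominated by one value fails even though 3 distinct values appear.
print(entropy_l_diverse(["Cancer"] * 8 + ["Flu", "Flu", "Gastritis"], 3))       # False
# A block spread evenly over 4 distinct values passes for L = 3.
print(entropy_l_diverse(["Cancer", "Flu", "Gastritis", "Asthma"] * 3, 3))       # True
```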

Other extensions. Entropy l-diversity is too restrictive: some positive disclosures are acceptable, since in practice some sensitive values may have very high frequency without actually being sensitive (for example, "normal" as a disease symptom), and the log(L) bound cannot be satisfied in such cases. Principle for relaxing the strong condition: a uniform distribution of sensitive values is ideal; when we cannot achieve it, make the value frequencies as close as possible, especially for the most frequent value. Recursive (c, l)-diversity is proposed to control the gap between the most frequent value and the least frequent ones: with frequencies r_1 ≥ r_2 ≥ … ≥ r_m, require r_1 < c(r_l + r_{l+1} + … + r_m). A check is sketched below.
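
A corresponding sketch of the recursive (c, l)-diversity check; the block contents and the parameters c and l are again illustrative.

```python
# Sketch: recursive (c, l)-diversity for one q*-block.
# With sensitive-value frequencies r_1 >= r_2 >= ... >= r_m, the block
# satisfies the condition if r_1 < c * (r_l + r_{l+1} + ... + r_m).
from collections import Counter

def recursive_cl_diverse(block_sensitive_values, c, l):
    freqs = sorted(Counter(block_sensitive_values).values(), reverse=True)
    if len(freqs) < l:
        return False
    return freqs[0] < c * sum(freqs[l - 1:])

# Frequencies are [6, 2, 1, 1]; with l = 3 the tail sum r_3 + r_4 is 2.
block = ["Normal"] * 6 + ["Cancer", "Cancer", "Flu", "Gastritis"]
print(recursive_cl_diverse(block, c=4, l=3))  # True:  6 < 4 * 2
print(recursive_cl_diverse(block, c=2, l=3))  # False: 6 >= 2 * 2
```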

Implementation of l-diversity: build an algorithm with a structure similar to existing k-anonymity algorithms, searching the domain generalization hierarchy and checking both l-diversity and k-anonymity for each candidate generalization (a sketch of the check follows).
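
A hedged sketch of that overall check: group the generalized records into q*-blocks and test each block. It uses the simplest "distinct values" form of l-diversity for brevity; the entropy or recursive checks above could be substituted, and the table contents are illustrative.

```python
# Sketch: verify k-anonymity and (distinct) l-diversity for a generalized table.
from collections import defaultdict

def satisfies_k_and_l(records, quasi_identifiers, sensitive, k, l):
    blocks = defaultdict(list)
    for r in records:
        key = tuple(r[a] for a in quasi_identifiers)   # the q*-block key
        blocks[key].append(r[sensitive])
    return all(len(vals) >= k and len(set(vals)) >= l
               for vals in blocks.values())

table = [
    {"zip": "130**", "age": "20-29", "disease": "Flu"},
    {"zip": "130**", "age": "20-29", "disease": "Cancer"},
    {"zip": "130**", "age": "30-39", "disease": "Cancer"},
    {"zip": "130**", "age": "30-39", "disease": "Cancer"},
]
# The second block is homogeneous (only "Cancer"), so l = 2 fails.
print(satisfies_k_and_l(table, ["zip", "age"], "disease", k=2, l=2))  # False
```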

Discussion: problems not addressed. Skewed data is a common problem for both l-diversity and k-anonymity and makes l-diversity very inefficient. Balance between utility and privacy: the entropy l-diversity and (c, l)-diversity methods do not guarantee good data utility.

t-closeness addresses two types of attacks: the skewness attack and the similarity attack.

Skewness attack: the probability of cancer in the original table is low, but within some block of the anonymized table it is much higher than the global probability.

Semantic similarity attack: every salary in the block is low, or everyone in the block has some kind of stomach disease, so the attacker learns a semantically sensitive fact even without pinpointing the exact value.

The root of these two problems: the difference between the global distribution of sensitive values and the local distribution within some block.

The proposal of t-closeness: make the global and local distributions as similar as possible. To evaluate distribution similarity while accounting for semantic similarity between values, the Earth Mover's Distance is used as the distance measure between distributions; a block satisfies t-closeness if this distance is at most t.
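
A sketch of a t-closeness check for an ordered numeric sensitive attribute, using the ordered-distance form of the Earth Mover's Distance (the normalized sum of absolute cumulative differences); the salary domain, the block, and the threshold t = 0.15 are illustrative.

```python
# Sketch: Earth Mover's Distance between the local (block) and global
# distributions of an ordered sensitive attribute, for a t-closeness check.
from collections import Counter

def distribution(values, domain):
    """Empirical distribution of `values` over the ordered `domain`."""
    counts = Counter(values)
    n = len(values)
    return [counts.get(v, 0) / n for v in domain]

def emd_ordered(p, q):
    """EMD with ground distance |i - j| / (m - 1) between adjacent values:
    the normalized sum of absolute cumulative differences."""
    cum, total = 0.0, 0.0
    for pi, qi in zip(p, q):
        cum += pi - qi
        total += abs(cum)
    return total / (len(p) - 1)

# Salaries: the block only contains low salaries, so its distribution is far
# from the global one and would violate, say, t = 0.15.
domain = [3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 11000]
global_salaries = domain          # one record per value in this toy table
block_salaries = [3000, 4000, 5000]
p = distribution(block_salaries, domain)
q = distribution(global_salaries, domain)
print(emd_ordered(p, q))  # 0.375 > 0.15, so the block fails t-closeness
```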

Further studies: modeling different types of prior knowledge. Think about the problem: there is no way to enumerate all background knowledge… Can we do better than that?