Beyond k-Anonymity: A Decision Theoretic Framework for Assessing Privacy Risk. M. Scannapieco, G. Lebanon, M. R. Fouad and E. Bertino.

Presentation transcript:

Beyond k-Anonymity: A Decision Theoretic Framework for Assessing Privacy Risk. M. Scannapieco, G. Lebanon, M. R. Fouad and E. Bertino

Introduction
• Release of data
  – Private organizations can benefit from sharing data with others
  – Public organizations see data as a value for society
• Privacy preservation
  – Data disclosure can lead to economic damage, threats to national security, etc.
  – Regulated by law in both the private and public sectors

Two Facets of Data Privacy
• Identity disclosure
  – Uncontrolled data release: identifiers may even be present in the released data
  – Anonymous data release: identifiers are suppressed, but there is no control over possible linking with other sources

Linkage of Anonymous Data (example with two tables, T1 and T2)
• T1: PrivateID, SSN, DOB, ZIP, Health_Problem, with health problems such as shortness of breath, headache and obesity
• T2: PrivateID, SSN, DOB, ZIP, Employment, Marital Status, with values such as researcher, private employee, public employee, married, widow
• The attributes shared by the two tables act as a QUASI-IDENTIFIER that allows records of T1 to be linked to records of T2
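
A minimal sketch of the linkage attack the example illustrates, in Python: join the anonymous table and an identified source on the shared quasi-identifier attributes. The records, column names and helper function below are invented for illustration; they are not the values from the slide.

# Hypothetical illustration of quasi-identifier linkage (invented data).
anonymous_release = [          # identifiers suppressed, sensitive attribute kept
    {"dob": "1970-11-20", "zip": "47907", "health_problem": "Shortness of breath"},
    {"dob": "1981-02-07", "zip": "47906", "health_problem": "Headache"},
]
identified_source = [          # e.g. a public register that still carries identifiers
    {"name": "Alice", "dob": "1970-11-20", "zip": "47907", "employment": "Researcher"},
    {"name": "Bob",   "dob": "1985-03-01", "zip": "47901", "employment": "Clerk"},
]

QUASI_IDENTIFIER = ("dob", "zip")   # attributes shared by the two tables

def link(anon_rows, known_rows, qi=QUASI_IDENTIFIER):
    """For each anonymous row, return the identified rows that match it on the quasi-identifier."""
    result = []
    for a in anon_rows:
        key = tuple(a[attr] for attr in qi)
        matches = [k for k in known_rows if tuple(k[attr] for attr in qi) == key]
        result.append((a, matches))
    return result

for anon, matches in link(anonymous_release, identified_source):
    print(anon["health_problem"], "->", [m["name"] for m in matches])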

Two Facets of Data Privacy (cont.)
• Sensitive information disclosure
  – Once identity disclosure occurs, the resulting loss depends on how sensitive the related data are
  – Data sensitivity is subjective: e.g., age is in general more sensitive for women than for men

Our proposal
• A framework for assessing privacy risk that takes into account both facets of privacy
  – Based on statistical decision theory
• Definition and analysis of disclosure policies, modelled by disclosure rules, and of several privacy risk functions
• Estimated risk as an upper bound on the true risk, and related complexity analysis
• An algorithm for finding the disclosure rule that minimizes the privacy risk

Disclosure rules
• A disclosure rule δ is a function that maps a record x to a new record z = δ(x) in which some attributes may have been suppressed:
  z_j = ⊥ if the j-th attribute is suppressed, z_j = x_j otherwise
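
A minimal sketch of such a rule, assuming records are attribute/value dictionaries and using None to stand in for the suppression symbol ⊥ (the names below are illustrative, not taken from the paper):

SUPPRESSED = None   # plays the role of the suppression symbol ⊥

def disclosure_rule(record, disclosed_attributes):
    """Map a record x to z = delta(x): attributes outside disclosed_attributes are suppressed."""
    return {
        attr: (value if attr in disclosed_attributes else SUPPRESSED)
        for attr, value in record.items()
    }

# Example: disclose only DOB and ZIP, suppress everything else.
x = {"name": "Alice", "dob": "1970-11-20", "zip": "47907", "phone": "555-0100"}
z = disclosure_rule(x, disclosed_attributes={"dob", "zip"})
# z == {"name": None, "dob": "1970-11-20", "zip": "47907", "phone": None}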

Loss function
• Let θ ∈ Θ be the side information used by the attacker in the identification attempt
• The loss function ℓ(δ(x), θ) measures the loss incurred by disclosing the data z = δ(x), due to possible identification based on θ
• An empirical distribution p is associated with the records x_1, …, x_n

Risk Definition
• The risk of the disclosure rule δ in the presence of the side information θ is the average loss of disclosing x_1, …, x_n:
  R(δ, θ) = E_p[ℓ(δ(X), θ)] = (1/n) Σ_{i=1..n} ℓ(δ(x_i), θ)
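
A direct sketch of this average, assuming a loss callable loss(disclosed_record, theta) whose concrete form is given on the next slides:

def privacy_risk(records, rule, theta, loss):
    """Empirical risk R(rule, theta): the average loss over the disclosed records x_1 ... x_n."""
    disclosed = [rule(x) for x in records]
    return sum(loss(z, theta) for z in disclosed) / len(disclosed)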

Putting the pieces together so far…
• A hypothetical attacker performs an identification attempt on a disclosed record y = δ(x) on the basis of side information θ, which can be a dictionary
• The dictionary is used to link y with some entry it contains
• Example:
  – y has the form (name, surname, phone#); θ is a phone book
  – If all attributes are revealed, y is likely linked with exactly one entry
  – If phone# is suppressed (or missing), y may or may not be linked to a single entry, depending on the popularity of (name, surname)
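
A sketch of the phone-book example: count how many dictionary entries remain consistent with a disclosed record, i.e. agree on every attribute that was not suppressed. The phone-book contents are invented for illustration.

def consistent_entries(disclosed, dictionary):
    """Dictionary entries agreeing with the disclosed record on every non-suppressed attribute."""
    return [
        entry for entry in dictionary
        if all(v is None or entry.get(a) == v for a, v in disclosed.items())
    ]

phone_book = [
    {"name": "Mario", "surname": "Rossi", "phone": "06-555-0101"},
    {"name": "Mario", "surname": "Rossi", "phone": "06-555-0199"},
    {"name": "Anna",  "surname": "Verdi", "phone": "06-555-0150"},
]

# All attributes revealed: a unique match is likely.
print(len(consistent_entries({"name": "Mario", "surname": "Rossi", "phone": "06-555-0101"}, phone_book)))  # 1
# Phone suppressed: the linkage depends on how common (name, surname) is.
print(len(consistent_entries({"name": "Mario", "surname": "Rossi", "phone": None}, phone_book)))           # 2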

Risk formulation
• Let us decompose the loss function into an identification part and a sensitivity part
• Identification part: formalized by the random variable Z, with
  Z = 1 if the attacker links the disclosed record δ(x) to the corresponding entity, Z = 0 otherwise

Risk formulation (cont.)
• Sensitivity part: formalized by a function Φ, where higher values indicate higher sensitivity
• Therefore the loss is the product of the two parts:
  ℓ(δ(x), θ) = Φ(δ(x)) · Z

Risk formulation (cont.)
• Risk: the expectation of this loss over the records x_1, …, x_n (and over Z):
  R(δ, θ) = (1/n) Σ_{i=1..n} Φ(δ(x_i)) · E[Z_i]
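
Putting the pieces together in code, under one commonly used attacker model in which E[Z] is taken as 1/(number of dictionary entries consistent with the disclosed record). Both this choice and the example sensitivity function are assumptions made here for illustration; the slides do not fix them.

def consistent(entry, disclosed):
    """True if the dictionary entry agrees with the disclosed record on every non-suppressed attribute."""
    return all(v is None or entry.get(a) == v for a, v in disclosed.items())

def identification_probability(disclosed, dictionary):
    """Assumed model for E[Z]: uniform guessing among the consistent dictionary entries."""
    candidates = [e for e in dictionary if consistent(e, disclosed)]
    return 1.0 / len(candidates) if candidates else 0.0

def risk(records, rule, dictionary, sensitivity):
    """R(rule, theta) = (1/n) * sum_i sensitivity(rule(x_i)) * E[Z_i]."""
    disclosed = [rule(x) for x in records]
    return sum(sensitivity(z) * identification_probability(z, dictionary) for z in disclosed) / len(disclosed)

# Purely illustrative sensitivity: count the attributes that are actually disclosed.
simple_sensitivity = lambda z: sum(v is not None for v in z.values())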

Disclosure Rule vs. Privacy Risk
• Suppose that θ_true is the attacker's true dictionary, built from publicly available data, and that θ* is the actual database from which the data will be published
• Under the following assumptions:
  – θ_true contains more records than θ* (θ* ≤ θ_true)
  – the non-⊥ values in θ_true are more limited than the non-⊥ values in θ*
• Theorem: if θ* contains records that correspond to x_1, …, x_n and θ* ≤ θ_true, then
  R(δ, θ_true) ≤ R(δ, θ*)
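
A toy illustration of the intuition behind the bound, using the same assumed 1/|consistent entries| identification model as above: enlarging the dictionary can only add consistent candidates, so the identification probability, and with it the risk, cannot increase. The entries below are invented.

disclosed = {"surname": "Rossi", "phone": None}
theta_star = [{"surname": "Rossi", "phone": "06-555-0101"},
              {"surname": "Rossi", "phone": "06-555-0199"}]
theta_true = theta_star + [{"surname": "Rossi", "phone": "06-555-0123"}]   # a superset of theta_star

def n_matches(theta):
    """Number of dictionary entries consistent with the disclosed record."""
    return sum(all(v is None or e.get(a) == v for a, v in disclosed.items()) for e in theta)

print(1 / n_matches(theta_star))   # 0.5
print(1 / n_matches(theta_true))   # 0.333...: the larger dictionary gives a lower identification probability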

Disclosure Rule vs. Privacy Risk (cont.)
• The theorem shows that the true risk is bounded by R(δ, θ*)
• Under the hypothesis that the distribution underlying θ factorizes into a product form:
• Theorem: the risk-minimizing rule δ* = arg min_δ R(δ, θ) can be found in O(nNm) computation

k-Anonymity
• k-anonymity is simply a special case of our framework in which:
  – θ_true = T (the disclosed table itself)
  – Φ is a constant
  – the loss function is left underspecified
• Our framework thus exposes some questionable assumptions underlying k-anonymity!
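
One way to read this special case in code, again under the assumed 1/|consistent entries| model: if the dictionary is the disclosed table itself and Φ ≡ 1, each record's loss becomes 1/(size of its equivalence class), so a k-anonymous table has risk at most 1/k. This reduction is an illustration of that reading, not a formula taken from the slides.

from collections import Counter

def risk_when_theta_is_the_table(disclosed_table):
    """Average of 1/|equivalence class| over the disclosed records, with Phi taken to be 1."""
    key = lambda row: tuple(sorted(row.items()))
    class_sizes = Counter(key(r) for r in disclosed_table)
    return sum(1.0 / class_sizes[key(r)] for r in disclosed_table) / len(disclosed_table)

# A 2-anonymous toy table: every generalized row appears at least twice, so the risk is at most 1/2.
table = [{"dob": "197*", "zip": "479**"}, {"dob": "197*", "zip": "479**"},
         {"dob": "198*", "zip": "479**"}, {"dob": "198*", "zip": "479**"}]
print(risk_when_theta_is_the_table(table))   # 0.5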

Conclusions
• A new framework for privacy risk that takes data sensitivity into account
• Risk estimation as an upper bound on the true privacy risk
• An efficient algorithm for risk computation
• A generalization of k-anonymity