Overview of Statistical Disclosure Control and Privacy-Preserving Data Mining
Traian Marius Truta
http://www.nku.edu/~trutat1/
DIMACS Tutorial, April 30, 2009
Content of the Talk
Introduction
Statistical Disclosure Control (SDC)
Privacy-Preserving Data Mining (PPDM)
De-identification Techniques
Disclosure Risk & Information Loss
Conclusions
SDC/PPDM Problem
Individuals submit their data to a data owner, who collects the initial data and applies a masking process before release. The masking process must protect the confidentiality of the individuals (measured through disclosure risk or anonymity properties) while preserving data utility (measured through information loss). The released masked data is received both by researchers, who use it for statistical analysis or data mining, and potentially by an intruder, who combines it with external data to disclose confidential information.
Types of Disclosure
Example: the data owner holds an initial microdata with attributes Name, SSN, Age, Zip, Diagnosis, and Income for individuals such as Alice (44, 48202, AIDS, 17,000), Bob (68,000), Charley (48201, Asthma, 80,000), Dave (55, 48310, 55,000), and Eva (Diabetes, 23,000). The released masked microdata drops Name and SSN and keeps Age, Zip, Diagnosis, and Income.
The intruder also holds external information with Name, SSN, Age, and Zip for some individuals: Alice (44, 48202), Charley (48201), and Dave (55, 48310). By linking this external information to the masked microdata on Age and Zip, the intruder achieves identity disclosure (Charley is the third record) and attribute disclosure (Alice has AIDS).
If the masked microdata is further generalized (for example, Zip released only as 482** or 483**), this linkage becomes harder.
Types of Disclosure
Identity disclosure - identification of an entity (person, institution).
Attribute disclosure - the intruder finds out something new about the target person. [Lambert 1993]
Statistical Disclosure Control
Statistical Disclosure Control is the discipline concerned with modifying data that contain confidential information about individual entities such as persons, households, and businesses, in order to prevent third parties working with these data from recognizing the individuals in them [Willemborg 1996, Willemborg 2001].
Also called:
Computational Disclosure Control [Sweeney 2001]
Disclosure Control [Truta 2004]
Microdata and External Information
Microdata - a series of records, each containing information on an individual unit such as a person, a firm, or an institution.
Masked microdata - microdata from which names and other identifying information have been removed.
External information - any information about some individuals from the initial microdata that is known to a presumptive intruder.
Disclosure Risk and Information Loss
Disclosure risk - the risk that a given form of disclosure will arise if a masked microdata is released [Chen 1998].
Information loss - the quantity of information that exists in the initial microdata but, because of the disclosure control methods, no longer appears in the masked microdata [Willemborg 2001].
Disclosure Control for Microdata
Disclosure Control for Tables
What is Data Mining?
Many definitions:
Non-trivial extraction of implicit, previously unknown, and potentially useful information from data.
Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns [Tan 2006].
Origins of Data Mining
Data mining draws ideas from machine learning/AI, pattern recognition, statistics, and database systems, and sits at their intersection.
Traditional techniques may be unsuitable due to:
Enormity of data
High dimensionality of data
Heterogeneous, distributed nature of data
Data Mining Tasks
Prediction methods - use some variables to predict unknown or future values of other variables (classification, regression).
Description methods - find human-interpretable patterns that describe the data (clustering, association rule discovery).
Privacy-Preserving Data Mining
Privacy-preserving data mining is a research direction in data mining and statistical databases in which data mining algorithms are analyzed for the side effects they incur on data privacy [Verykios 2004].
Two objectives:
Quasi-identifiers such as names, addresses, and the like should be modified or trimmed out of the original database, so that the recipient of the data cannot compromise another person's privacy.
Sensitive knowledge that can be mined from a database using data mining algorithms should also be excluded, because such knowledge can equally well compromise data privacy.
Other Names for the Same Problem
The field of data privacy emerged in recent years at the confluence of well-established research areas: data mining, databases, computer security, health informatics, statistics, etc. As a result, different terminologies define the same or very similar concepts:
Statistical Disclosure Control [Willemborg 1996]
Privacy Preserving Data Mining [Clifton 1996]
Data Anonymity [Sweeney 2002]
Privacy Preserving Data Publishing [Fung 2007]
PPDM Current Directions
Transform data (usually microdata) to satisfy a privacy guarantee:
k-anonymity [Samarati 2001, Sweeney 2002]
p-sensitive k-anonymity [Truta 2006]
l-diversity [Machanavajjhala 2006]
t-closeness [Li 2007]
randomization [Evfimievski 2003]
... and many more
Algorithms, theoretical results, data mining under privacy constraints, information loss measures.
PPDM Current Directions
Cryptographic methods for data sharing and privacy:
Horizontal partitioning [Kantarcioglu 2004]
Vertical partitioning [Vaidya 2002]
Privacy preservation for other data models:
Data streams [Xu 2008]
Location-based systems [Kalnis 2007]
Social networks [Hay 2007, Campan 2008]
Attribute Classification
I1, I2, ..., Im - identifier attributes (e.g., Name and SSN). Found in the initial microdata (IM) only; they lead directly to a specific entity.
K1, K2, ..., Kp - key or quasi-identifier attributes (e.g., Zip Code and Age). Found in both IM and the masked microdata (MM); may be known by an intruder.
S1, S2, ..., Sq - confidential or sensitive attributes (e.g., Principal Diagnosis and Annual Income). Assumed to be unknown to an intruder.
K-Anonymity Definitions
QI-cluster - the set of all tuples in a microdata with an identical combination of quasi-identifier attribute values.
K-anonymity - a masked microdata (MM) satisfies the k-anonymity property if every QI-cluster in MM contains k or more tuples.
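These two definitions translate directly into a grouping check over the quasi-identifier. Below is a minimal sketch in Python; the helper names and the toy table are illustrative, not part of the tutorial.

```python
from collections import Counter

def qi_clusters(records, quasi_identifier):
    """Count how many tuples share each combination of quasi-identifier values."""
    return Counter(tuple(r[a] for a in quasi_identifier) for r in records)

def is_k_anonymous(records, quasi_identifier, k):
    """A masked microdata is k-anonymous if every QI-cluster has at least k tuples."""
    return all(size >= k for size in qi_clusters(records, quasi_identifier).values())

# Toy masked microdata (hypothetical values).
mm = [
    {"Age": "50", "Zip": "410**", "Sex": "F", "Illness": "AIDS"},
    {"Age": "50", "Zip": "410**", "Sex": "F", "Illness": "Diabetes"},
    {"Age": "30", "Zip": "410**", "Sex": "M", "Illness": "Asthma"},
    {"Age": "30", "Zip": "410**", "Sex": "M", "Illness": "Asthma"},
]
print(is_k_anonymous(mm, ["Age", "Zip", "Sex"], 2))  # True: both QI-clusters have 2 tuples
```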
K-Anonymity Example
A patient table with attributes RecID, Age, Zip, Sex, and Illness (records 1-7, with values such as 50/41076/Female/AIDS and 30/41099/Male/Diabetes) and quasi-identifier KA = {Age, Zip, Sex} partitions into the QI-clusters cl1 = {1, 6, 7}, cl2 = {2, 3}, and cl3 = {4, 5}. Since the smallest QI-cluster contains 2 tuples, the table satisfies 2-anonymity but not 3-anonymity.
Domain and Value Generalization Hierarchies [Samarati 2001, Sweeney 2002]
Example hierarchies: the ZipCode values 41075, 41076, 41088, and 41099 generalize to 410**, 48201 generalizes to 482**, and both generalize further to *****. For Sex, the ground domain S0 = {male, female} generalizes to S1 = {*}.
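One simple way to realize such hierarchies in code is as level-indexed generalization functions; for Zip codes the levels shown above amount to prefix truncation. A sketch under that assumption (function names are mine):

```python
def generalize_zip(zip_code, level):
    """Zip hierarchy: level 0 = 41076, level 1 = 410**, level 2 = *****."""
    keep = {0: 5, 1: 3, 2: 0}[level]
    return zip_code[:keep] + "*" * (5 - keep)

def generalize_sex(sex, level):
    """Sex hierarchy: S0 = {male, female}, S1 = {*}."""
    return sex if level == 0 else "*"

print(generalize_zip("41076", 1), generalize_zip("48201", 2))  # 410** *****
print(generalize_sex("female", 1))                             # *
```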
Generalization Types
All attributes:
Full-domain generalization [Samarati 2001, LeFevre 2006]
Iyengar generalization [Iyengar 2002]
Cell-level generalization [Lunacek 2006]
Numerical attributes:
Predefined hierarchy [Iyengar 2002]
Computed hierarchy [LeFevre 2006]
Generalization Types
Example: for a small microdata with attributes Age, ZipCode, and Sex, 2-anonymity can be reached in different ways. Full-domain generalization replaces every ZipCode with ***** and every Age with an interval such as 20-30 or 30-40 (Iyengar generalization is identical in this case), while cell-level generalization keeps more detail by generalizing some ZipCode values only to 410** and others to *****.
Attacks Against K-Anonymity
Unsorted matching attack - based on the order in which tuples appear in the released table. Solution: randomly sort the tuples before releasing. [Sweeney 2002]
Attacks Against K-Anonymity
Complementary release attack - different releases can be linked together to compromise k-anonymity [Sweeney 2002]. Solution: consider all previously released tables before releasing a new one and try to avoid linking. Other data holders may release data that can be used in this kind of attack, so it is generally hard to prevent completely.
Temporal attack - adding or removing tuples may compromise the k-anonymity protection [Sweeney 2002].
Attacks Against K-Anonymity
k-Anonymity does not provide privacy if:
Sensitive values in an equivalence class lack diversity [Truta 2006, Machanavajjhala 2006].
The attacker has background knowledge [Machanavajjhala 2006].
Example (a 3-anonymous patient table with attributes Zipcode, Age, and Disease):
Homogeneity attack - Bob (Zipcode 47678, Age 27) falls into an equivalence class whose tuples all carry the same disease, so the intruder learns his diagnosis.
Background knowledge attack - Carl (Zipcode 47673, Age 36) falls into a class with several diseases, but external knowledge allows the intruder to rule out all but one of them.
P-Sensitive K-Anonymity Definition
A masked microdata (MM) satisfies the p-sensitive k-anonymity property if it satisfies k-anonymity and, within each QI-cluster of MM, the number of distinct values for each confidential attribute is at least p [Truta 2006].
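The p-sensitive check only adds a distinct-value count per confidential attribute on top of the k-anonymity check. A hedged sketch (names and toy data are illustrative):

```python
from collections import defaultdict

def is_p_sensitive_k_anonymous(records, quasi_identifier, sensitive, p, k):
    """k-anonymity plus: every QI-cluster holds at least p distinct values
    for each confidential attribute listed in `sensitive`."""
    clusters = defaultdict(list)
    for r in records:
        clusters[tuple(r[a] for a in quasi_identifier)].append(r)
    for group in clusters.values():
        if len(group) < k:
            return False
        if any(len({r[s] for r in group}) < p for s in sensitive):
            return False
    return True

# A 2-anonymous cluster with a single Illness value fails 2-sensitivity.
mm = [
    {"Age": "20-30", "Zip": "410**", "Sex": "M", "Illness": "Diabetes"},
    {"Age": "20-30", "Zip": "410**", "Sex": "M", "Illness": "Diabetes"},
]
print(is_p_sensitive_k_anonymous(mm, ["Age", "Zip", "Sex"], ["Illness"], p=2, k=2))  # False
```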
P-Sensitive K-Anonymity Example
A table with attributes RecID, Age, Zip, Sex, and Illness (records 1-6) and quasi-identifier KA = {Age, Zip, Sex} has the QI-clusters cl1 = {1, 6}, cl2 = {2, 3}, and cl3 = {4, 5}. Every cluster contains 2 tuples, so the microdata is 2-anonymous, but it is NOT 2-sensitive 2-anonymous because at least one cluster contains only one distinct Illness value.
P-Sensitive K-Anonymity Example
After further generalization (Age recoded to 20-30 for records 2-5), the QI-clusters become cl1 = {1, 6} and cl2 = {2, 3, 4, 5}. Each cluster now contains at least two distinct Illness values, so this microdata is 2-sensitive 2-anonymous.
l-Diversity
Distinct l-diversity - each equivalence class has at least l well-represented sensitive values [Machanavajjhala 2006].
Limitation: it does not prevent probabilistic inference attacks. Example: an equivalence class contains ten tuples; in the Disease attribute one value is Cancer, one is Heart Disease, and the remaining eight are Flu. This satisfies distinct 3-diversity, but the attacker can still claim that the target person's disease is Flu with 80% accuracy.
This limitation leads to two stronger notions of l-diversity.
l-Diversity
Entropy l-diversity - each equivalence class must not only have enough different sensitive values, but the sensitive values must also be distributed evenly enough: the entropy of the distribution of sensitive values in each equivalence class is at least log(l). This may be too restrictive; when some values are very common, the entropy of the entire table may already be very low, which leads to the following less conservative notion.
Recursive (c,l)-diversity - the most frequent value does not appear too frequently: r1 < c (rl + rl+1 + ... + rm), where ri is the frequency of the i-th most frequent sensitive value in the class and m is the number of distinct sensitive values.
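Both stronger notions reduce to simple computations over the sensitive-value counts of an equivalence class. A sketch using the ten-tuple Flu example above (function names are mine):

```python
import math
from collections import Counter

def entropy_l_diverse(sensitive_values, l):
    """Entropy l-diversity: entropy of the class's sensitive values >= log(l)."""
    counts = Counter(sensitive_values)
    total = len(sensitive_values)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy >= math.log(l)

def recursive_cl_diverse(sensitive_values, c, l):
    """Recursive (c,l)-diversity: r1 < c * (rl + ... + rm), with r1 >= r2 >= ... >= rm."""
    r = sorted(Counter(sensitive_values).values(), reverse=True)
    if len(r) < l:
        return False
    return r[0] < c * sum(r[l - 1:])

eq_class = ["Flu"] * 8 + ["Cancer", "Heart Disease"]  # the ten-tuple example above
print(entropy_l_diverse(eq_class, 3))                 # False: the distribution is too skewed
print(recursive_cl_diverse(eq_class, c=2, l=3))       # False: 8 is not < 2 * 1
```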
Limitations of l-Diversity & p-Sensitive k-Anonymity
Attribute disclosure is not completely prevented.
Skewness attack [Li 2007]: consider two sensitive values, HIV positive (1% of the population) and HIV negative (99%). An equivalence class with a large number of positive records compared to negative records is a serious privacy risk, yet l-diversity and p-sensitive k-anonymity do not differentiate between
Equivalence class 1: 49 positive + 1 negative, and
Equivalence class 2: 1 positive + 49 negative.
The overall distribution of sensitive values is not considered.
Limitations of l-Diversity & p-Sensitive k-Anonymity
Attribute disclosure is not completely prevented.
Similarity attack [Li 2007]: consider a 3-diverse patient table with attributes Zipcode, Age, Salary, and Disease. Bob (Zipcode 47678, Age 27) falls into an equivalence class whose salaries are 20K, 30K, and 40K and whose diseases are gastric ulcer, gastritis, and stomach cancer. Conclusion: Bob's salary is in [20K, 40K], which is relatively low, and Bob has some stomach-related disease.
The semantic meanings of sensitive values are not considered.
t-Closeness: A New Privacy Measure
Rationale [Li 2007]: let Q be the overall distribution of the sensitive attribute in the whole table and Pi the distribution of the sensitive attribute within each equivalence class of the released microdata. An observer starts with a prior belief B0, updates it to B1 after learning the external knowledge Q (a completely generalized microdata reveals only Q), and updates it to B2 after seeing the equivalence classes of the released microdata (which reveal the Pi).
Observations: Q should be treated as public. The knowledge gain occurs in two parts: about the whole population (from B0 to B1) and about specific individuals (from B1 to B2). t-Closeness bounds the gain between B1 and B2 instead.
Principle: the distance between Q and each Pi should be bounded by a threshold t.
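For a categorical sensitive attribute with equal ground distances, the Earth Mover's Distance proposed in [Li 2007] reduces to the total variation distance between Q and each Pi, which makes the check easy to sketch. The simplification and the names below are assumptions for illustration:

```python
from collections import Counter

def distribution(values):
    """Empirical distribution of a list of sensitive values."""
    n = len(values)
    return {v: c / n for v, c in Counter(values).items()}

def variational_distance(p, q):
    """Total variation distance: 0.5 * sum of |p(v) - q(v)| over all values."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in support)

def satisfies_t_closeness(equivalence_classes, t):
    """Each class distribution Pi must lie within distance t of the table distribution Q."""
    q = distribution([v for cls in equivalence_classes for v in cls])
    return all(variational_distance(distribution(cls), q) <= t for cls in equivalence_classes)

classes = [["Flu", "Flu", "Cancer"], ["Flu", "Heart Disease", "Flu"]]
print(satisfies_t_closeness(classes, t=0.3))  # True for these toy classes
```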
Extended P-Sensitive K-Anonymity
Example value generalization hierarchy for the confidential attribute Disease [Campan 2006]: the root * covers top-level categories such as infectious and parasitic diseases, neoplasms, endocrine, nutritional and metabolic diseases and immunity disorders, diseases of the blood and blood-forming organs, and injury and poisoning. These refine into finer values, for example intestinal infectious diseases and 042 HIV infection under the infectious diseases category, and 140 malignant neoplasm of lip, oral cavity, and pharynx with 140.0 upper lip, vermilion border under neoplasms.
Extended P-Sensitive K-Anonymity
Requirements: let S be a confidential attribute and HVS its value generalization hierarchy. The protected values in HVS must satisfy:
All ground values in HVS are protected.
All descendants of a protected internal value in HVS are protected.
Extended P-Sensitive K-Anonymity
A protected value in the value generalization hierarchy HVS of a confidential attribute S is called strong if none of its ascendants (including the root) is protected. A protected subtree of HVS is a subtree whose root is a strong protected value.
Extended P-Sensitive K-Anonymity
The masked microdata (MM) satisfies the extended p-sensitive k-anonymity property if it satisfies k-anonymity and, for each group of tuples in MM with an identical combination of key attribute values, the values of each confidential attribute S within the group belong to at least p different protected subtrees of HVS.
Disclosure Control Techniques
Remove Identifiers [Truta 2003]
Global and Local Recoding [Willemborg 2001]
Local Suppression [Willemborg 2001]
Sampling [Skinner 1994]
Microaggregation [Domingo-Ferrer 2002]
Simulation [Willemborg 2001]
Adding Noise [Kim 1986]
Rounding [Willemborg 2001]
Data Swapping [Anderson 2004]
Etc.
Disclosure Control Techniques
The following slides apply different disclosure control techniques to the same initial microdata: a table with attributes RecID, Name, SSN, Age, State, Diagnosis, Income, and Billing containing ten records (John Wayne, Mary Gore, John Banks, Jesse Casey, Jack Stone, Mike Kopi, Angela Simms, Nike Wood, Mikhail Aaron, and Sam Pall), with values such as record 1: John Wayne, 44, MI, AIDS, Income 45,500, Billing 1,200.
Remove Identifiers
Identifiers such as Name and SSN are removed; the released microdata keeps only Age, State, Diagnosis, Income, and Billing (with RecID retained as a record number).
Sampling
Sampling is the disclosure control method in which only a subset of records is released. If n is the number of records in the initial microdata and t the number of released records, then sf = t / n is called the sampling factor. Simple random sampling is the variant most frequently used: each record is chosen entirely by chance and each member of the population has an equal chance of being included in the sample. In the example, five of the ten records are released, so sf = 0.5.
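A simple random sample with a given sampling factor can be drawn with the standard library; a minimal sketch (the data and the fixed seed are illustrative):

```python
import random

def simple_random_sample(records, sampling_factor, seed=None):
    """Release a simple random sample of t = round(sf * n) records, without replacement."""
    rng = random.Random(seed)
    t = round(sampling_factor * len(records))
    return rng.sample(records, t)

initial = [{"RecID": i, "Income": 10_000 * i} for i in range(1, 11)]
released = simple_random_sample(initial, sampling_factor=0.5, seed=7)
print(len(released), "of", len(initial), "records released")  # 5 of 10 records released
```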
Microaggregation
Order the records of the initial microdata by an attribute, create groups of consecutive values of a given minimum size, and replace the values in each group by the group average. In the example, microaggregation is applied to the attribute Income with minimum group size 3: the sorted incomes form groups whose means are 30,967, 47,500, and 73,000, and the total sum of all Income values remains the same.
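A minimal sketch of fixed-size microaggregation for one numeric attribute, using the Income values from the example table; this illustrates the idea only and is not the optimal microaggregation algorithm cited above:

```python
def microaggregate(records, attribute, min_size=3):
    """Sort by `attribute`, cut into consecutive groups of at least `min_size`
    records (the remainder joins the last group), and replace each value by
    its group mean; the attribute total is preserved up to rounding."""
    ordered = sorted(records, key=lambda r: r[attribute])
    masked = [dict(r) for r in ordered]
    n = len(masked)
    n_groups = max(1, n // min_size)
    for g in range(n_groups):
        start = g * min_size
        end = n if g == n_groups - 1 else start + min_size
        group = masked[start:end]
        mean = sum(r[attribute] for r in group) / len(group)
        for r in group:
            r[attribute] = mean
    return masked

incomes = [45500, 37900, 67000, 21000, 90000, 48000, 49000, 66000, 69000, 34000]
data = [{"RecID": i, "Income": inc} for i, inc in enumerate(incomes, 1)]
masked = microaggregate(data, "Income", min_size=3)
print(sorted({round(r["Income"]) for r in masked}))  # [30967, 47500, 73000]
```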
Data Swapping
A sequence of so-called elementary swaps is applied to the microdata. An elementary swap consists of two actions:
A random selection of two records i and j from the microdata.
A swap (interchange) of the values of the attribute being swapped between records i and j.
In the example, the Income values of records 1 and 6 (45,500 and 48,000) are swapped.
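An elementary swap is a random pair selection plus an exchange of one attribute's values; a sketch (function names are mine):

```python
import random

def elementary_swap(records, attribute, rng=random):
    """Randomly select two records i and j, then interchange their `attribute` values."""
    i, j = rng.sample(range(len(records)), 2)
    records[i][attribute], records[j][attribute] = records[j][attribute], records[i][attribute]

def data_swapping(records, attribute, n_swaps, seed=None):
    """Apply a sequence of elementary swaps to one attribute (in place)."""
    rng = random.Random(seed)
    for _ in range(n_swaps):
        elementary_swap(records, attribute, rng)
    return records

table = [{"RecID": 1, "Income": 45500}, {"RecID": 6, "Income": 48000}]
data_swapping(table, "Income", n_swaps=1, seed=0)
print(table)  # with two records, one swap exchanges the two Income values
```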
Disclosure Risk and Information Loss
Disclosure risk - the risk that a given form of disclosure will arise if a masked microdata is released [Chen 1998]:
Value/attribute disclosure
Identity disclosure
Information loss - the quantity of information that exists in the initial microdata but, because of the disclosure control methods, no longer appears in the masked microdata [Willemborg 2001].
Disclosure Risk
Individual measures quantify the risk per record, usually expressed as the probability of correctly re-identifying a unit, or through uniqueness and rareness in the sample or population [Willemborg 2001].
Global measures quantify the risk for the entire dataset, usually expressed as the expected number of correct re-identifications [Domingo-Ferrer 2003].
Frequency Count
For each record, the frequency count is the number of records in the microdata that share its combination of key attribute values. In the example, a small microdata with attributes Age, State, and Diagnosis is annotated with a frequency count column; combinations that occur once, twice, or three times receive counts of 1, 2, and 3, respectively.
Sample Unique and Population Unique
Sample unique - a record is a sample unique if fck = 1, i.e., there is only one record in the sample microdata with combination k of key-attribute values.
Population unique - a record is a population unique if FCk = 1, where FCk is the corresponding count in the population. For census data, or when an administrative register covering the whole population is available, FCk is known for each k and the risk measure can be computed. [Elliot 2002]
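Frequency counts fck, sample uniques, and population uniques (when a register of the population is available) all follow from simple group counts; a sketch with illustrative names:

```python
from collections import Counter

def frequency_counts(records, key_attributes):
    """fck: how many records share each combination k of key-attribute values."""
    return Counter(tuple(r[a] for a in key_attributes) for r in records)

def sample_uniques(sample, key_attributes):
    """Records whose key-attribute combination occurs exactly once in the sample (fck = 1)."""
    fc = frequency_counts(sample, key_attributes)
    return [r for r in sample if fc[tuple(r[a] for a in key_attributes)] == 1]

def population_uniques(sample, population, key_attributes):
    """Sample records whose combination is also unique in the whole population (FCk = 1)."""
    FC = frequency_counts(population, key_attributes)
    return [r for r in sample if FC[tuple(r[a] for a in key_attributes)] == 1]

sample = [{"Age": 44, "State": "MI"}, {"Age": 44, "State": "MI"}, {"Age": 25, "State": "IN"}]
print(sample_uniques(sample, ["Age", "State"]))  # [{'Age': 25, 'State': 'IN'}]
```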
Information Loss Measures
Compare IM and MM directly - greater similarity between the values of the key attributes indicates less information loss.
Compare statistics (covariance, correlation, means) computed on IM and MM.
Average the two approaches. [Domingo-Ferrer 2001]
Information Loss Measures
Notation:
n - the number of records in the initial or masked microdata
p - the number of attributes in the initial or masked microdata
X and X' - the initial and masked data matrices
V and R - the covariance and correlation matrices of X
V' and R' - the covariance and correlation matrices of X'
S and S' - the variance vectors for X and X' (taken from the main diagonal of the corresponding covariance matrices)
the mean vectors - the attribute averages of X and X'
For each pair of objects (X vs X', V vs V', S vs S', R vs R', and the mean vectors), a mean absolute error and a mean variation of the corresponding entries are computed and combined into an overall information loss measure [Domingo-Ferrer 2001].
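One plausible reading of the lost formulas is an entry-wise mean absolute error and mean (relative) variation for each pair of objects; the exact normalizations in [Domingo-Ferrer 2001] may differ, so the sketch below is only illustrative:

```python
import numpy as np

def mean_abs_error(a, b):
    """Average absolute difference between corresponding entries."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.mean(np.abs(a - b)))

def mean_variation(a, b, eps=1e-12):
    """Average relative difference |a - b| / |a| between corresponding entries."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.mean(np.abs(a - b) / (np.abs(a) + eps)))

X = np.array([[44.0, 45500.0], [55.0, 67000.0], [45.0, 48000.0]])   # initial (numeric) microdata
Xp = np.array([[44.0, 47500.0], [55.0, 73000.0], [45.0, 47500.0]])  # masked microdata
pairs = {
    "X - X'": (X, Xp),
    "V - V'": (np.cov(X, rowvar=False), np.cov(Xp, rowvar=False)),
    "R - R'": (np.corrcoef(X, rowvar=False), np.corrcoef(Xp, rowvar=False)),
}
for name, (a, b) in pairs.items():
    print(name, round(mean_abs_error(a, b), 4), round(mean_variation(a, b), 4))
```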
Conclusions
Data privacy is investigated from various perspectives; more collaboration is needed between the various groups.
Theoretical results need to be better applied to real data; more collaboration is needed between practitioners and researchers.
To advance both data privacy research and practical data privacy applications, more collaboration is needed.
References
Anderson M., Fienberg S.E. (2004), U.S. Census Confidentiality: Perception and Reality, Bulletin of the International Statistical Institute.
Campan A., Truta T.M. (2006), Extended P-Sensitive K-Anonymity, Studia Universitatis Babes-Bolyai, Informatica, Vol. 51, No. 2, pp. 19-30.
Campan A., Truta T.M. (2008), A Clustering Approach for Data and Structural Anonymity in Social Networks, 2nd ACM SIGKDD International Workshop on Privacy, Security, and Trust in KDD (PinKDD 2008), Las Vegas, Nevada.
Chen G., Keller-McNulty S. (1998), Estimation of Deidentification Disclosure Risk in Microdata, Journal of Official Statistics, Vol. 14, No. 1, pp. 79-95.
Clifton C., Marks D. (1996), Security and Privacy Implications of Data Mining, Proceedings of the ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, pp. 15-19.
Domingo-Ferrer J., Mateo-Sanz J., Torra V. (2001), Comparing SDC Methods for Microdata on the Basis of Information Loss and Disclosure Risk, Pre-proceedings of ETK-NTTS 2001 (Vol. 2), Eurostat, Luxembourg.
Domingo-Ferrer J., Torra V. (2003), Disclosure Risk Assessment in Statistical Microdata Protection via Advanced Record Linkage, Statistics and Computing, Vol. 13, No. 4.
Elliot M. (2002), Integrating File and Record Level Disclosure Risk Assessment, in J. Domingo-Ferrer (ed.), Inference Control in Statistical Databases, Springer Verlag.
Evfimievski A., Gehrke J., Srikant R. (2003), Limiting Privacy Breaches in Privacy Preserving Data Mining, Proceedings of PODS 2003.
Fung B. (2007), Privacy Preserving Data Publishing, Ph.D. Thesis, Simon Fraser University.
Hay M., Miklau G., Jensen D., Weiss P., Srivastava S. (2007), Anonymizing Social Networks, University of Massachusetts Amherst, Technical Report.
Iyengar V. (2002), Transforming Data to Satisfy Privacy Constraints, Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 279-288.
Kalnis P., Ghinita G., Mouratidis K., Papadias D. (2007), Preventing Location-Based Identity Inference in Anonymous Spatial Queries, IEEE Transactions on Knowledge and Data Engineering, Vol. 19, No. 12.
Kantarcioglu M., Clifton C. (2004), Privacy-Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data, IEEE Transactions on Knowledge and Data Engineering, Vol. 16, No. 9.
Kim J.J. (1986), A Method for Limiting Disclosure in Microdata Based on Random Noise and Transformation, American Statistical Association, Proceedings of the Section on Survey Research Methods.
Lambert D. (1993), Measures of Disclosure Risk and Harm, Journal of Official Statistics, Vol. 9.
LeFevre K., DeWitt D., Ramakrishnan R. (2005), Incognito: Efficient Full-Domain K-Anonymity, Proceedings of the ACM SIGMOD Conference, Baltimore, Maryland.
Lunacek M., Whitley D., Ray I. (2006), A Crossover Operator for the k-Anonymity Problem, Proceedings of the GECCO Conference, pp. 1713-1720.
Machanavajjhala A., Gehrke J., Kifer D. (2006), L-Diversity: Privacy beyond K-Anonymity, Proceedings of the IEEE International Conference on Data Engineering (ICDE).
Samarati P. (2001), Protecting Respondents' Identities in Microdata Release, IEEE Transactions on Knowledge and Data Engineering, Vol. 13, No. 6.
Skinner C.J., Marsh C., Openshaw S., Wymer C. (1994), Disclosure Control for Census Microdata, Journal of Official Statistics.
Sweeney L. (2001), Computational Disclosure Control: A Primer on Data Privacy Protection, Ph.D. Thesis, MIT.
Sweeney L. (2002), k-Anonymity: A Model for Protecting Privacy, International Journal on Uncertainty, Fuzziness, and Knowledge-based Systems, Vol. 10, No. 5, pp. 557-570.
Sweeney L. (2002), Comments on Standards of Privacy of Individually Identifiable Health Information, addressed to the Department of Health and Human Services, April 4.
Tan P.N., Steinbach M., Kumar V. (2006), Introduction to Data Mining, Addison Wesley.
Truta T.M. (2004), Adaptive Disclosure Control for Healthcare Microdata, Ph.D. Thesis, Wayne State University.
Truta T.M., Vinay B. (2006), Privacy Protection: p-Sensitive k-Anonymity Property, International Workshop of Privacy Data Management (PDM 2006), in conjunction with the 22nd International Conference on Data Engineering (ICDE), Atlanta, Georgia.
Truta T.M., Fotouhi F., Barth-Jones D. (2003), Disclosure Risk Measures for Microdata, Proceedings of the International Conference on Scientific and Statistical Database Management, Cambridge, MA, pp. 15-22.
Vaidya J., Clifton C. (2002), Privacy Preserving Association Rule Mining in Vertically Partitioned Data, Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Verykios V.S., Bertino E., Fovino I.N., Provenza L.P., Saygin Y., Theodoridis Y. (2004), State-of-the-Art in Privacy Preserving Data Mining, SIGMOD Record, Vol. 33, No. 1.
Willemborg L., Waal T. (eds.) (1996), Statistical Disclosure Control in Practice, Springer Verlag.
Willemborg L., Waal T. (eds.) (2001), Elements of Statistical Disclosure Control, Springer Verlag.
Xu Y., Wang K., Fu A.W.C., She R., Pei J. (2008), Privacy-Preserving Data Stream Classification, in Privacy-Preserving Data Mining: Models and Algorithms, Springer.
Questions