Download presentation
Presentation is loading. Please wait.
1
Ιδιωτικότητα σε Βάσεις Δεδομένων Οκτώβρης 2011
2
Roadmap Motivation Core ideas Extensions 2
3
Roadmap Motivation Core ideas Extensions 3
4
Reasons for privacy preserving data publishing Vast amount of data collected nowadays Estimated user data per day – 8-10 GB public content – ~ 4 TB private content (emails, SMSs, content annotations, social networks…) 4
5
Reasons for privacy preserving data publishing Organizations (hospitals, ministries, internet providers, …) publicly release data concerning individual records (internet searches, medical records, …) The laws oblige these agencies to protect the individuals’ privacy 5
6
Reasons for privacy preserving data publishing So, data are stripped of the attributes that can reveal the individuals’ identities Unfortunately, this is not enough… 6
7
Sweeney’s breach of governor’s medical record “ … In Massachusetts, the Group Insurance Commission (GIC) is responsible for purchasing health insurance for state employees. For twenty dollars I purchased the voter registration list for Cambridge Massachusetts and received the information on two diskettes. The rightmost circle in Figure 1 shows that these data included the name, address, ZIP code, birth date, and gender of each voter. This information can be linked using ZIP code, birth date and gender to the medical information, thereby linking diagnosis, procedures, and medications to particularly named individuals. …” 7
8
Sweeney’s breach of governor’s medical record “ … For example, William Weld was governor of Massachusetts at that time and his medical records were in the GIC data. Governor Weld lived in Cambridge Massachusetts. According to the Cambridge Voter list, six people had his particular birth date; only three of them were men; and, he was the only one in his 5-digit ZIP code.…” 8
9
AOL’s exposure of user 4417749 August 2006: AOL publicizes anonymized data for 21M user queries User’s 4417749 had a strong essence of geo and thematic locality Researchers would focus more and more the search, based on these queries Ms Arnold, 62, would prove to be the user search for medication, resorts, dogs and family members, … 9
10
The context of privacy-preserving data publishing 10 Detailed microdata T Anonymized public data T* Bob (the victim) to be hidden Ben, the benevolent, data miner Alice, the external attacker Deborah, a star DBA & a TRUSTED data publisher
11
Roadmap Motivation Core ideas Extensions 11
12
Anonymization To retain privacy one must: – Remove the attributes that directly identify individuals (name, SSN, …) – Organize the tuples and the cell values of the data set in such a way that: The statistical properties of the data set are retained The attacker cannot guess to which individual a tuple corresponds with statistical meaningful guarantee 12
13
Fundamentals Identifier(s): attribute(s) that explicitly reveal the identity of a person (name, SSN, …). These attributes are removed from the public data set Quasi identifier: attribute(s) that if joined with external data can reveal sensitive information (zip code, birth date, sex,…) – Typically accompanied by “generalization hierarchies” Sensitive attribute: containing the values that should be kept private (disease, salary,…) 13
14
14 Generalization hierarchies
15
General methods for Anonymization “Hide tuples in the crowd” – Generalization – Anatomization “Lies to the attacker, truth to the statistician” – Noise injection – Value perturbation 15
16
Generalization methods Global recoding – All the values of an attribute are generalized to the same hierarchy level [Swee02a] [Sama01] [LeDR05]. Multidimensional – The values of an attribute can be generalized in different levels, depending on the density of the groups that can be formed. However, each combination of QI-values is always generalized to the same value [LeDR06]. Local recoding – The values of an attribute can be generalized in different levels, depending on the density of the groups that can be formed. In fact, the same combination of QI-values can be generalized to different values [Xu+06] 16
17
k-anonymity (TKDE 01, IJUFKS 02) 17 A relation Τ is k-anonymous when every tuple of the relation is identical to k-1 other tuples with respect to their Quasi-Identifier set of attributes.
18
Naïve l-diversity 18 A relation T satisfies the naïve l-diversity property whenever every group of the relation contains at least l different values in its sensitive attributes.
19
Information utility Must prevent the attackers, by satisfying the privacy criterion (k for k-anonymity, l for l- diversity) – Fundamental anonymization technique: hide individual in groups of identical QI values!! Must serve the well-meaning users, by maximizing information utility i.e., by minimizing The tuples we remove (see next) the amount of generalization that we apply to the QI attributes. 19
20
Generalization vs suppression 20 This anonymization suppressed no tuples, and guarantees 3- anonymity. What if we want 4-anonymity?
21
Generalization vs suppression 21 Low height, 6 tuples suppressed Higher height, no tuples suppressed //the difference is in the work_class field
22
Incognito (SIGMOD 2005) Two fundamental ideas can be exploited with hierarchies: If a data set generalized at a certain level (e.g., 1345*) is k-anonymous, then it is also k-anonymous if it is even more generalized (e.g., 134**) If a data set of N attributes is not k-anonymous if – n attributes are not fully anonymized (age) and N-n are fully anonymized (sex, zip) then, the same data set is still not k- anonymous with – n+1 attributes are not fully anonymized (age,sex) and N-n-1 are fully anonymized (zip) 22
23
Incognito 23 Birth date, zip code, sex Combinations of 2 attributes
24
Incognito 24 Birth date, zip code, sex Combinations of 3 attributes, after non- anonymous gener. are pruned
25
25 What disease Bob is suffering from? Since Alice is Bob’s neighbor, she knows that Bob is a 31-year-old American male who lives in the zip code 13053. Therefore, Alice knows that Bob’s record number is 9,10,11, or 12. Now, all of those patients have the same medical condition (cancer), and so Alice concludes that Bob has cancer. Umeko is a 21 year old Japanese female who currently lives in zip code 13068. Therefore, Umeko’s information is contained in record number 1,2,3, or 4. +BCGR Knowledge: it is well-known that Japanese have an extremely low incidence of heart disease. Therefore, Alice concludes with near certainty that Umeko has a viral infection.
26
L-diversity (ICDE 2006) Every q-block group, has – At least k tuples – At least l well-represented values – Well-represented? Ούτε όλες οι τιμές σε ένα group είναι ίδιες (έχω τουλάχιστον l, l>=2) Ούτε κάποια τιμή είναι απίθανο να υπάρχει => μπορώ να συνάγω ότι ισχύει κάποια άλλη αν το l είναι σχετικά μικρό 26
27
Well-represented Distinct l-diversity: simply l different values Entropy l-diversity: for each pair (public value q*, sensitive value s) measure the value p(q*,s)logp(q*,s) Entropy of a q-block with value q* is -Σ s p()logp() over all sensitive values s You need to have E > log(p) for all groups (and this can be guaranteed if it holds for the whole table, too) Recursive l-diversity: the most frequent values do not appear too frequently and the less frequent do not appear too rarely 27
28
Roadmap Motivation Core ideas Extensions 28
29
Mondrian (ICDE 2006) 29 Why must we generalize fully every attributes? Some records are in regions with many records and anonymity is easily preserved even by giving out more information. Some others are in sparse areas and need to be generalized more… age zip
30
Mondrian (ICDE 2006) 30 Global recoding local recoding Original data
31
M-invariance (SIGMOD ‘07) 31 If I know that Bob is in group 1 + he has been taken to the hospital twice, I can deduce bronc. from: {dysp., bronch.} {dysp.,gastr.}
32
M-invariance 32
33
Many other extensions Concerning multi-relational privacy Data perturbations More sophisticated “local recoding” a-la Mondrian Trajectory, set-valued, OLAP, … data … 33
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.