CSCI 347, Data Mining Data Anonymization.

1 CSCI 347, Data Mining Data Anonymization

2 Importance of Data Anonymization
Current trends are making data anonymization increasingly important. Examples: the amount of data being collected; the proliferation of devices with global positioning system (GPS) capabilities; the proliferation of client-aware applications; new and better data mining techniques; and cloud computing.

3 Reidentification
A study of 1990 census data showed that 87% of people in the US can be uniquely identified by the combination of zip code, birthday and gender. A study of 2000 census data found the number was only 61%. Still, reidentification is surprisingly easy.
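
To make the idea of "identified by a combination of attributes" concrete, here is a minimal sketch that counts how many records are unique on a chosen set of attributes. The function name `fraction_unique` and the toy records are hypothetical and invented for illustration; they are not census data.

```python
from collections import Counter

def fraction_unique(records, quasi_identifiers):
    """Return the share of records whose quasi-identifier combination is unique."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(keys) if keys else 0.0

# Hypothetical toy records (assumption: made up for this example).
people = [
    {"zip": "59715", "birthday": "1990-03-14", "gender": "F"},
    {"zip": "59715", "birthday": "1990-03-14", "gender": "F"},
    {"zip": "59715", "birthday": "1985-07-02", "gender": "M"},
    {"zip": "59718", "birthday": "1972-11-30", "gender": "F"},
]

# Two of the four records share a combination, so half are unique.
print(fraction_unique(people, ["zip", "birthday", "gender"]))  # 0.5
```

A record that is unique on these attributes can be linked to a person by anyone who knows that person's zip code, birthday and gender.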

4 Example Anonymizing Data
Source:

5 Techniques for Data Anonymization
Encryption; k-anonymization / l-diversity; generalization; perturbation

6 Data Anonymization
Data anonymization: altering personal data so that reidentification is impossible. Specifically, we want to be able to process the data in a useful way while preventing that data from being linked to the individual identities of people, objects or organizations.

7 Encryption
Encryption: transforming data to make it unreadable to those who don’t have the key to decrypt it.

8 Data Anonymization versus Encryption
Data anonymization: altering personal data so that reidentification is impossible. Encryption: transforming data to make it unreadable to those who don’t have the key to decrypt it. Data can be successfully anonymized without encryption, and encrypted data is not necessarily anonymized.

9 Encryption
Encryption is sufficient for analysis done internally that won’t be published or shared with affiliate companies.

10 K-anonymity
K-anonymity: make each record indistinguishable from at least k-1 other records, for a chosen value of k. A set of data is k-anonymized if, for any data record with a given combination of quasi-identifier attributes, there are at least k-1 other records that match those attributes.
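
The definition above can be checked mechanically. The following is a minimal sketch, assuming records are dictionaries and the quasi-identifier columns are known; the function name `is_k_anonymous` and the sample rows are hypothetical.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs at least k times."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in counts.values())

# Hypothetical, already-generalized rows (assumption: invented for illustration).
rows = [
    {"age": "20-29", "zip": "597**", "illness": "viral infection"},
    {"age": "20-29", "zip": "597**", "illness": "heart disease"},
    {"age": "30-39", "zip": "597**", "illness": "cancer"},
    {"age": "30-39", "zip": "597**", "illness": "cancer"},
]

print(is_k_anonymous(rows, ["age", "zip"], 2))  # True
print(is_k_anonymous(rows, ["age", "zip"], 3))  # False
```

Each combination of age range and zip prefix appears exactly twice, so the rows are 2-anonymous but not 3-anonymous.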

11 K-anonymity Data Properties
k-anonymity assigns properties to data attributes and requires that they be handled in specific ways. Key attributes, such as names, social security numbers and student ids, directly identify a person and need to be removed. Quasi-identifier attributes, such as zip code, birthday and gender, need to be suppressed or generalized. Sensitive attributes, such as income and type of illness, need to be de-linked from the individual.
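
As a rough illustration of the three attribute roles, the sketch below removes key attributes, generalizes quasi-identifiers and leaves sensitive attributes in place. The field names, the generalization rules (three-digit zip prefix, birth year only) and the sample record are all assumptions chosen for the example.

```python
# Assumed column roles for this example only.
KEY_ATTRIBUTES = {"name", "ssn"}

def generalize_zip(zip_code):
    """Keep the first three digits of the zip code and mask the rest."""
    return zip_code[:3] + "**"

def generalize_birthday(birthday):
    """Reduce an ISO date (YYYY-MM-DD) to its year."""
    return birthday[:4]

def prepare_record(record):
    """Drop key attributes, generalize quasi-identifiers, keep everything else."""
    prepared = {}
    for field, value in record.items():
        if field in KEY_ATTRIBUTES:
            continue  # key attributes are removed entirely
        if field == "zip":
            prepared[field] = generalize_zip(value)
        elif field == "birthday":
            prepared[field] = generalize_birthday(value)
        else:
            prepared[field] = value  # gender and sensitive values stay as-is
    return prepared

# Hypothetical record (assumption: made up for illustration).
raw = {"name": "Alice Example", "ssn": "000-00-0000", "zip": "59715",
       "birthday": "1990-03-14", "gender": "F", "illness": "flu"}
print(prepare_record(raw))
# {'zip': '597**', 'birthday': '1990', 'gender': 'F', 'illness': 'flu'}
```

Whether this level of generalization is enough depends on the data; the result still has to be checked for k-anonymity.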

12 Example of 2-anonymity

13 Example of Attacks
Examples for the above data: If an attacker knows that Bob has an entry in the data and that Bob is in his 30s, the attacker will know that Bob has cancer. If an attacker knows that a 21-year-old Yoko has an entry in the data and that, as a young woman, she is unlikely to have heart disease, the attacker then knows Yoko has a viral infection.

14 L-diversity
L-diversity improves anonymization beyond what k-anonymity provides. K-anonymization requires each combination of quasi-identifiers to have at least k entries. L-diversity additionally requires that there are at least l different sensitive values for each combination of quasi-identifiers.
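
A minimal sketch of an l-diversity check follows, assuming the simplest reading of the definition (count distinct sensitive values per quasi-identifier group). The function name `is_l_diverse`, the parameter name `ell` and the sample rows are hypothetical.

```python
from collections import defaultdict

def is_l_diverse(records, quasi_identifiers, sensitive, ell):
    """True if every quasi-identifier group has at least `ell` distinct sensitive values."""
    groups = defaultdict(set)
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        groups[key].add(r[sensitive])
    return all(len(values) >= ell for values in groups.values())

# Hypothetical 2-anonymous rows (assumption: invented for illustration).
patients = [
    {"age": "20-29", "illness": "viral infection"},
    {"age": "20-29", "illness": "heart disease"},
    {"age": "30-39", "illness": "cancer"},
    {"age": "30-39", "illness": "cancer"},
]

print(is_l_diverse(patients, ["age"], "illness", 1))  # True
print(is_l_diverse(patients, ["age"], "illness", 2))  # False
```

The 30-39 group contains only one sensitive value, so anyone known to be in that group is exposed even though the rows are 2-anonymous.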

15 Example: 2-diversity, 4-anonymity

16 Generalization
Generalization: categorize the values and publish the categories rather than the specific values. This can help achieve k-anonymity.
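
As one possible illustration, the sketch below generalizes exact ages into fixed-width ranges. The function name `generalize_age` and the ten-year width are assumptions; other attributes would need their own category schemes.

```python
def generalize_age(age, width=10):
    """Map an exact age to a range label such as '20-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

# Hypothetical ages (assumption: made up for illustration).
ages = [21, 27, 34, 38]
print([generalize_age(a) for a in ages])
# ['20-29', '20-29', '30-39', '30-39']
```

Four distinct ages collapse into two categories of two records each, which is what lets the published data satisfy 2-anonymity on age.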

17 Perturbation
Perturbation: changing the data. Several techniques are available: adding fictitious records, shifting data values, and macroaggregation.
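
One way to read "shifting data values" is to add a small random offset to each value. The sketch below assumes uniform noise within a fixed bound; the function name `shift_values`, the bound and the seed are illustrative choices, not something the slides specify.

```python
import random

def shift_values(values, max_shift, seed=None):
    """Return a copy of `values` with each entry moved by up to +/- max_shift."""
    rng = random.Random(seed)  # a seed makes the example repeatable
    return [v + rng.uniform(-max_shift, max_shift) for v in values]

# Hypothetical salaries (assumption: made up for illustration).
salaries = [42000, 51000, 38500]
shifted = shift_values(salaries, max_shift=500, seed=1)
print(shifted)
```

A larger bound hides individual values better but distorts any analysis run on the published data more.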

18 Macroaggregation Example
Sort the data, then break it into groups. Given the income data 57000, 56367, 38479, 36294, 31029, 29993, group it into three groups of two and publish the average of each group: 56683.5, 37386.5, 30511.
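
The steps on this slide can be sketched as follows, using the income figures given above. The function name `macroaggregate` is a hypothetical label for the procedure the slide describes.

```python
def macroaggregate(values, group_size):
    """Sort the values, split them into consecutive groups, and average each group."""
    ordered = sorted(values, reverse=True)
    groups = [ordered[i:i + group_size] for i in range(0, len(ordered), group_size)]
    return [sum(group) / len(group) for group in groups]

incomes = [57000, 56367, 38479, 36294, 31029, 29993]
print(macroaggregate(incomes, 2))
# [56683.5, 37386.5, 30511.0]
```

Sorting first keeps similar incomes together, so each published average stays close to the values it replaces.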

19 Macroaggregation Example
No individual income is available, but you have not skewed the numbers in a way that would make analysis and prediction difficult.
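
A quick check of this claim on the income data from the previous slide: if every income is replaced by its group average, the overall mean is unchanged. This is a sketch assuming equal-sized groups of two; the variable names are illustrative.

```python
from statistics import mean

incomes = [57000, 56367, 38479, 36294, 31029, 29993]

# Replace each income with the average of its sorted group of two.
ordered = sorted(incomes, reverse=True)
published = []
for i in range(0, len(ordered), 2):
    group = ordered[i:i + 2]
    published.extend([sum(group) / len(group)] * len(group))

# The two means are equal, although no original income appears in `published`.
print(mean(incomes), mean(published))
```

Statistics that depend on the spread within a group, such as the variance, do change, so the choice of group size trades privacy against precision.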

