CSCI 347, Data Mining Data Anonymization.

Name: CSCI 347, Data Mining Data Anonymization.
Uploaded: 2018-01-12T08:57:02+00:00
Duration: PT6M12S
Channel: Silvia Jennings
Description: CSCI 347, Data Mining Data Anonymization.


Importance of Data Anonymization
Current trends are making data anonymization increasingly important. Examples: the amount of data being collected; the proliferation of devices with global positioning system (GPS) capabilities; the proliferation of client-aware applications; new and better data mining techniques; cloud computing.

Reidentification
A study of 1990 census data showed that 87% of people in the US can be identified by the combination of zip code, birthday, and gender. An analysis of 2000 census data found the number was only 61%. Still, reidentification is surprisingly easy.
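The idea behind these percentages can be sketched in a few lines: count how many records carry a combination of quasi-identifier values that no other record shares. The function name `unique_fraction` and the toy records are hypothetical, invented for illustration; they are not taken from the census studies.

```python
from collections import Counter

def unique_fraction(records, quasi_identifiers):
    """Fraction of records whose quasi-identifier combination is unique."""
    if not records:
        return 0.0
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(records)

# Hypothetical toy records, invented for illustration only.
people = [
    {"zip": "59715", "birthday": "1985-03-02", "gender": "F"},
    {"zip": "59715", "birthday": "1985-03-02", "gender": "F"},
    {"zip": "59715", "birthday": "1990-07-14", "gender": "M"},
    {"zip": "59718", "birthday": "1972-11-30", "gender": "F"},
]

fraction = unique_fraction(people, ["zip", "birthday", "gender"])
print(fraction)  # 0.5: two of the four records are unique
```

A record counted as unique here could be reidentified by anyone who knows those three values for the person.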

Example Anonymizing Data
Source:

Techniques for Data Anonymization
Encryption; k-anonymization / l-diversity; generalization; perturbation.

Data Anonymization
Data anonymization: altering personal data so that reidentification is impossible. Specifically, the goal is to process the data in a useful way while preventing it from being linked to the identities of individual people, objects, or organizations.

Encryption
Encryption: transforming data to make it unreadable to those who do not have the key to decrypt it.

Data Anonymization versus Encryption
Data anonymization: altering personal data so that reidentification is impossible. Encryption: transforming data to make it unreadable to those who do not have the key to decrypt it. Data can be successfully anonymized without encryption, and encrypted data is not necessarily anonymized.

Encryption
Encryption is sufficient for analysis done internally that will not be published or shared with affiliate companies.

K-anonymity
K-anonymity: make each record indistinguishable from at least k-1 other records, for a chosen value of k. A data set is k-anonymized if, for any record with a given set of attribute values, at least k-1 other records match those values.
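The definition above can be checked mechanically. The following is a minimal sketch, assuming records are dictionaries and the caller names the attributes to compare; `is_k_anonymous` and the toy table are hypothetical, not part of the original slides.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs at least k times."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(size >= k for size in groups.values())

# Hypothetical, already generalized table, invented for illustration.
table = [
    {"age": "20-29", "zip": "597**", "illness": "viral infection"},
    {"age": "20-29", "zip": "597**", "illness": "heart disease"},
    {"age": "30-39", "zip": "597**", "illness": "cancer"},
    {"age": "30-39", "zip": "597**", "illness": "cancer"},
]

print(is_k_anonymous(table, ["age", "zip"], 2))  # True
print(is_k_anonymous(table, ["age", "zip"], 3))  # False
```

Each record shares its age range and zip prefix with exactly one other record, so the table is 2-anonymous but not 3-anonymous.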

K-anonymity Data Properties
K-anonymity assigns properties to data attributes and requires that they be handled in specific ways. Key attributes, such as names, social security numbers, and student IDs, identify a person directly and need to be removed. Quasi-identifier attributes, such as zip code, birthday, and gender, need to be suppressed or generalized. Sensitive attributes, such as income and type of illness, need to be de-linked from the individual.
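One way to picture the three attribute classes is a small preprocessing step. This is only a sketch under stated assumptions: the field names, the choice to keep three zip digits, and the choice to keep only the birth year are all hypothetical, and `prepare_record` is an illustrative helper rather than a standard function.

```python
# Hypothetical field names for the key attributes.
KEY_ATTRIBUTES = {"name", "ssn", "student_id"}

def prepare_record(record):
    """Drop key attributes and generalize two quasi-identifiers."""
    # Key attributes are removed outright.
    out = {k: v for k, v in record.items() if k not in KEY_ATTRIBUTES}
    # Quasi-identifiers are generalized (assumed policy: 3 zip digits, birth year).
    out["zip"] = out["zip"][:3] + "**"
    out["birthday"] = out["birthday"][:4]
    # Sensitive attributes stay, but no longer sit next to a name or ID.
    return out

# Hypothetical record, invented for illustration.
raw = {
    "name": "Bob",
    "ssn": "000-00-0000",
    "zip": "59715",
    "birthday": "1985-03-02",
    "gender": "M",
    "illness": "cancer",
}

cleaned = prepare_record(raw)
print(cleaned)
```

Whether the result is actually k-anonymous still depends on the other records in the data set, so a step like this would be followed by a check over the whole table.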

Example of 2-anonymity

Example of Attacks
Examples for the above data:
If an attacker knows that Bob has an entry in the data and that Bob is in his 30s, the attacker will know that Bob has cancer. If an attacker knows that 21-year-old Yoko has an entry in the data and that, as a young woman, she is unlikely to have heart disease, the attacker then knows that Yoko has a viral infection.

L-diversity
L-diversity improves anonymization beyond what k-anonymity provides. K-anonymity requires each combination of quasi-identifiers to have at least k entries. L-diversity requires at least l different sensitive values for each combination of quasi-identifiers.
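The l-diversity requirement can be checked in the same style as k-anonymity. The sketch below counts distinct sensitive values per group, which is the simplest reading of the definition above; `is_l_diverse` and the toy table are hypothetical.

```python
from collections import defaultdict

def is_l_diverse(records, quasi_identifiers, sensitive, l):
    """True if every quasi-identifier group has at least l distinct sensitive values."""
    values = defaultdict(set)
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        values[key].add(r[sensitive])
    return all(len(v) >= l for v in values.values())

# Hypothetical table, invented for illustration. It is 2-anonymous,
# but everyone in the 30-39 group has the same illness.
patients = [
    {"age": "20-29", "illness": "viral infection"},
    {"age": "20-29", "illness": "heart disease"},
    {"age": "30-39", "illness": "cancer"},
    {"age": "30-39", "illness": "cancer"},
]

print(is_l_diverse(patients, ["age"], "illness", 2))  # False
```

The check fails because the 30-39 group has only one sensitive value, so knowing someone's age range reveals the illness.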

Example: 2-diversity, 4-anonymity

Generalization
Generalization: categorize the values and publish the categories rather than the specific values. This can help achieve k-anonymity.
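A minimal sketch of generalization, assuming ages are replaced by ten-year ranges. The helper `generalize_age`, the range width, and the sample ages are hypothetical choices made for illustration.

```python
from collections import Counter

def generalize_age(age, width=10):
    """Map an exact age to a range label such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

# Hypothetical ages, invented for illustration. Each exact age is unique.
ages = [21, 23, 34, 38]

categories = [generalize_age(a) for a in ages]
group_sizes = Counter(categories)
print(categories)  # ['20-29', '20-29', '30-39', '30-39']
```

Before generalization every age appears once; afterwards each category holds two records, which is what a 2-anonymous table needs for this attribute.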

Perturbation
Perturbation: change the data.
Several techniques are available: adding fictitious records; shifting data values; macroaggregation (more commonly called microaggregation).
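Shifting data values can be sketched as adding a small random offset to each value. This is one possible reading of the technique, not a prescribed method: the uniform noise, the `max_shift` bound, and the `shift_values` helper are all assumptions made for illustration.

```python
import random

def shift_values(values, max_shift, seed=None):
    """Add an independent random offset in [-max_shift, max_shift] to each value."""
    rng = random.Random(seed)  # seeded only so the example is repeatable
    return [v + rng.uniform(-max_shift, max_shift) for v in values]

# Hypothetical incomes, invented for illustration.
original = [57000, 56367, 38479]
shifted = shift_values(original, max_shift=500, seed=0)
print(shifted)
```

A larger `max_shift` hides individual values better but distorts any analysis done on the published data more.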

Macroaggregation Example
Sort the data, then break it into categories. Given the income data 57000, 56367, 38479, 36294, 31029, 29993, group it into three groups of two and publish the average of each group: 56683.5, 37386.5, 30511.

Macroaggregation Example
No individual income is published, yet the numbers have not been skewed in a way that would make analysis and prediction difficult.
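The example can be reproduced with a short sketch: sort the values, cut them into fixed-size groups, and publish one average per group. The function name `aggregate_in_groups` is hypothetical; the income figures are the ones from the slide.

```python
def aggregate_in_groups(values, group_size):
    """Sort values, split into consecutive groups, and return each group's average."""
    ordered = sorted(values, reverse=True)
    averages = []
    for start in range(0, len(ordered), group_size):
        group = ordered[start:start + group_size]
        averages.append(sum(group) / len(group))
    return averages

incomes = [57000, 56367, 38479, 36294, 31029, 29993]
group_averages = aggregate_in_groups(incomes, 2)
print(group_averages)  # [56683.5, 37386.5, 30511.0]
```

Because the groups are the same size, the mean of the published averages (41527.0) equals the mean of the original incomes, which is one sense in which the numbers are not skewed.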
