
Finding Personally Identifying Information Mark Shaneck CSCI 5707 May 6, 2004.


1 Finding Personally Identifying Information Mark Shaneck CSCI 5707 May 6, 2004

2 Data, Data, Everywhere…
- Lots of data
- Let’s mine it and find interesting things!
- But also maintain people’s privacy…

3 Privacy Preserving Data Mining
- Randomization
- Cryptographic protocols
- K-anonymization

4 Randomization Techniques
- Insert random ‘noise’ into the data to mask actual values
- Compute a function to recover the original data distribution from the randomized values
- Not always secure: random noise can be filtered in certain circumstances to accurately estimate original data values
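
As a rough illustration of the randomization idea on this slide, the sketch below masks values with zero-mean Gaussian noise and then estimates the mean and variance of the original data from the noisy values alone. This is a simplified, moment-based stand-in and not the reconstruction method used in the literature; the function names, the toy `ages` data, and the noise level are all assumptions made for the example.

```python
import random
import statistics


def randomize(values, noise_sd, rng):
    """Mask each value by adding independent zero-mean Gaussian noise."""
    return [v + rng.gauss(0.0, noise_sd) for v in values]


def estimate_original_moments(noisy_values, noise_sd):
    """Estimate the mean and variance of the ORIGINAL data from noisy values.

    Zero-mean noise leaves the mean unchanged, and independent noise adds
    its variance to the data variance, so we subtract it back out.
    """
    mean = statistics.mean(noisy_values)
    variance = max(statistics.pvariance(noisy_values) - noise_sd ** 2, 0.0)
    return mean, variance


# Hypothetical toy data: 10,000 ages drawn uniformly from 20..60.
rng = random.Random(0)
ages = [rng.randint(20, 60) for _ in range(10000)]
noisy_ages = randomize(ages, noise_sd=15.0, rng=rng)

est_mean, est_var = estimate_original_moments(noisy_ages, noise_sd=15.0)
print("true mean/variance:     ", statistics.mean(ages), statistics.pvariance(ages))
print("estimated mean/variance:", est_mean, est_var)
```

Individual noisy values are far from the true ages, yet the aggregate statistics are recovered closely, which is the trade-off the slide describes.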

5 Cryptographic Techniques
- Use cryptographic techniques, such as oblivious transfer, to execute data mining algorithms between two parties without either party revealing its data to the other
- E.g. Amazon.com and Barnes & Noble

6 K-anonymity
- For a given set of attribute values in a record, the same values must also occur in at least k-1 other records for the data to be considered k-anonymous
- Idea developed by Latanya Sweeney, who has also developed a tool, Datafly, which performs generalizations and suppressions to achieve k-anonymity
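
The definition above can be checked directly by counting how often each combination of attribute values occurs. The sketch below is a minimal illustration, not Datafly; the function name `is_k_anonymous` and the toy records are assumptions made for the example.

```python
from collections import Counter


def is_k_anonymous(records, attributes, k):
    """Return True if every combination of values for `attributes`
    occurs in at least k records (the record itself plus k-1 others)."""
    counts = Counter(tuple(r[a] for a in attributes) for r in records)
    return all(count >= k for count in counts.values())


# Hypothetical toy table, already generalized to age ranges and zip prefixes.
records = [
    {"age": "30-39", "sex": "F", "zip": "554**"},
    {"age": "30-39", "sex": "F", "zip": "554**"},
    {"age": "30-39", "sex": "F", "zip": "554**"},
    {"age": "40-49", "sex": "M", "zip": "551**"},
    {"age": "40-49", "sex": "M", "zip": "551**"},
]

print(is_k_anonymous(records, ["age", "sex", "zip"], k=2))
print(is_k_anonymous(records, ["age", "sex", "zip"], k=3))
```

The second call fails because the ("40-49", "M", "551**") group contains only two records.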

7 Generalization and Suppression
- Generalization
  - Usually done for an entire attribute
  - Basically make the data less precise
  - Taxonomy tree for categorical data
  - Discretization of continuous, numerical values
- Suppression: remove data (either a record or a single value)
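
The two operations on this slide might look like the following sketch. The one-level education taxonomy, the bin width, and all function names are hypothetical choices made for illustration; a real tool would choose the generalization levels far more carefully.

```python
from collections import Counter

# Hypothetical one-level taxonomy: each specific value maps to its parent.
EDUCATION_TAXONOMY = {
    "Bachelors": "Higher-ed",
    "Masters": "Higher-ed",
    "Doctorate": "Higher-ed",
    "HS-grad": "Secondary",
    "11th": "Secondary",
}


def generalize_category(value, taxonomy):
    """Replace a categorical value with its parent in the taxonomy tree."""
    return taxonomy.get(value, value)


def discretize(value, width):
    """Replace a number with the fixed-width range that contains it."""
    low = (value // width) * width
    return f"{low}-{low + width - 1}"


def generalize_records(records):
    """Generalize whole attributes (age and education) for every record."""
    return [
        {
            "age": discretize(r["age"], 10),
            "education": generalize_category(r["education"], EDUCATION_TAXONOMY),
        }
        for r in records
    ]


def suppress_rare(records, attributes, k):
    """Record-level suppression: drop records whose value combination
    occurs fewer than k times."""
    counts = Counter(tuple(r[a] for a in attributes) for r in records)
    return [r for r in records if counts[tuple(r[a] for a in attributes)] >= k]


raw = [
    {"age": 31, "education": "Bachelors"},
    {"age": 37, "education": "Masters"},
    {"age": 84, "education": "11th"},
]
generalized = generalize_records(raw)
released = suppress_rare(generalized, ["age", "education"], k=2)
print(generalized)
print(released)
```

Here generalization makes the first two records indistinguishable, and the remaining unique record is suppressed.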

8 What Should We Generalize?
- Most approaches assume that you know which columns to generalize from domain knowledge
- What if you choose too little? Re-identification
- What if you choose too much? Data loss
- Is there a way to get more insight into the data and what makes it not k-anonymous?

9 A Priori Algorithm
- General data mining property: for sets X and Y, if X ⊆ Y, then Support(X) >= Support(Y)
- For k-anonymity: for sets of attributes X and Y, if X ⊆ Y and X is not k-anonymous, then Y is not k-anonymous
- Use this approach to see which rows cause k-anonymity to fail for which combinations of columns
- This information can be used to make more informed decisions when using generalization and suppression techniques
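
One way to read the approach on this slide is as a level-wise search: test single attributes first, then only test a larger combination if every smaller subset of it is still k-anonymous, since any superset of a failing combination is already known to fail. The sketch below follows that reading; it is an assumption about the method rather than the presenter's implementation, and the function names and toy rows are hypothetical.

```python
from collections import Counter
from itertools import combinations


def violating_rows(records, attrs, k):
    """Indices of rows whose values for `attrs` occur fewer than k times."""
    keys = [tuple(r[a] for a in attrs) for r in records]
    counts = Counter(keys)
    return [i for i, key in enumerate(keys) if counts[key] < k]


def find_minimal_failures(records, attributes, k):
    """Map each minimal failing attribute combination to its offending rows."""
    failures = {}
    size = 1
    level = [(a,) for a in attributes]
    while level:
        passing = set()
        for attrs in level:
            rows = violating_rows(records, attrs, k)
            if rows:
                failures[attrs] = rows  # supersets fail too, so stop extending
            else:
                passing.add(frozenset(attrs))
        size += 1
        # Keep only candidates whose every (size-1)-subset is still k-anonymous.
        level = [
            cand
            for cand in combinations(attributes, size)
            if all(frozenset(sub) in passing for sub in combinations(cand, size - 1))
        ]
    return failures


# Hypothetical toy rows using attribute names from the Adult data.
rows = [
    {"sex": "Female", "workclass": "Private", "native-country": "United-States"},
    {"sex": "Female", "workclass": "Private", "native-country": "United-States"},
    {"sex": "Male", "workclass": "Private", "native-country": "United-States"},
    {"sex": "Male", "workclass": "Private", "native-country": "United-States"},
    {"sex": "Female", "workclass": "Gov", "native-country": "United-States"},
    {"sex": "Male", "workclass": "Gov", "native-country": "Netherlands"},
]
failures = find_minimal_failures(rows, ["sex", "workclass", "native-country"], k=2)
for attrs, bad_rows in failures.items():
    print(attrs, "fails for rows", bad_rows)
```

In this toy table the single attribute native-country fails on its own, while sex and workclass only fail in combination; no three-attribute combination is tested because each one contains a combination already known to fail.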

10 Experiment
- Implemented the method and tested it on the “Adult Database” from the UCI Machine Learning Repository
- Contains 32,561 records with 16 attributes:
  - age, workclass, fnlwgt, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, native-country
- K value was 3

11 Results
- Discovered some not immediately obvious single attributes: age (a few old people), hours-per-week (a few workaholics), native-country (one person from the Netherlands)
- Many combinations of 2 attributes:
  - workclass & sex identified 2 records with values (“Never-worked”, “Female”)
  - occupation & race identified 1 record with values (“Armed-Forces”, “Black”)
- Due to A Priori, did not find any failing combinations of more than 2 attributes ({Sex, Education, Education-Num} was the largest combination that remained k-anonymous)

12 Conclusion and Future Work
- Conclusions
  - Applying A Priori can provide some insight into the data to be anonymized
- Future work
  - Further testing on various data sets
  - Deeper analysis of the algorithm’s results and how best to apply them in anonymization
  - Development of a data anonymization tool that makes use of this algorithm

