Malicious parties may employ (a) structure-based or (b) label-based attacks to re-identify users and thus learn sensitive information about their rating history. Recommender systems like Netflix store a database of users and items, and corresponding ratings. Being able to predict which items a user will enjoy can lead to happier customers and better business. The Netflix Challenge: Given the Netflix dataset, devise an algorithm with high prediction accuracy. Problem: Netflix replaces names with random ID #s before publishing to protect privacy, but users can be re-identified by comparing with external sources. How can Netflix ensure users’ privacy without degrading accuracy of prediction results? Challenges: The sparsity of recommendation data presents a two-fold problem: existing anonymization techniques do not perform well on sparse data; simultaneously, users are more easily identifiable. [Narayanan et. al.] Algorithms must be scalable: the Netflix Challenge dataset is a matrix with over 8.5 billion cells! Privacy-Aware Publishing of Netflix Data Motivation Model Anonymization Algorithm References Our experiments show that Padded Anonymization yields high prediction accuracy, as well as providing theoretical privacy guarantees. However, sparsity is reduced in the released dataset, which affects the integrity of the data. Future Work: How to maintain sparsity without loss in prediction accuracy. Extend our methods to prevent link disclosure attacks. Chart: Distribution of ratings before and after padding. Brian Thompson 1, Chih-Cheng Chang 1, Hui (Wendy) Wang 2, Danfeng Yao 1 Table: Prediction accuracy before and after anonymization. We develop a novel approach, Predictive Anonymization, that effectively anonymizes sparse recommendation datasets by introducing a padding phase which reduces sparsity while maintaining the integrity of the data. Our Approach To eliminate the sparsity problem, we perform a pre- processing step to replace null entries with predicted values before anonymizing. Analysis 1 Department of Computer Science, Rutgers University 2 Department of Computer Science, Stevens Institute of Technology Padding Clustering Homogenization null values AlienBatmanRocky VWall-E John Smith352? Godfather English Patient Ben Star Wars English Patient Tim 5 1 (a)(b) Experiment SeriesRMSE Original Data Padded Anonymization (k=5) Padded Anonymization (k=50) Simple Anonymization (k=5) Simple Anonymization (k=50) Conclusions and Future Work Our Predictive Anonymization algorithm guarantees node re-identification privacy in the form of k-anonymity against the stronger label-based attack. Our methods also provide privacy against link disclosure attacks. Details and further discussion can be found in our technical report. The Netflix dataset has 480,189 users and 17,770 movies. We use singular value decomposition (SVD) for the padding step, and also for prediction when necessary. We measure prediction accuracy by comparing the root mean squared error (RMSE) of the published anonymized datasets with that of the original data. We model recommender systems as labeled bipartite review graphs. We say a user is k-anonymous if there are at least k-1 other users with identical item ratings. A. Narayanan, V. Shmatikov. “Robust De-anonymization of Large Sparse Datasets.” S&P C. Chang, B. Thompson, W. Wang, D. Yao. “Predictive Anonymization: Utility-Preserving Publishing of Sparse Recommendation Data.” Technical Report DCS-tr-647, April To increase efficiency, we apply a sampling procedure: 1.Choose a random sample from the padded dataset. 2.Apply the Bounded t-Means Algorithm to the sample to generate t same-size clusters. Put each user in the data set into one of t bins corresponding to its nearest cluster. 3.Once again, apply the Bounded t-Means Algorithm to partition each bin into smaller clusters, which will correspond to the final anonymization groups. To defend against structure-based and label-based attacks, we homogenize the k users in each anonymization group so that they have identical review graphs, including the ratings. Predictive padding helps uncover latent trends in the original data, leading to more accurate clustering results. In Simple Anonymization (left), we average original movie ratings over users in a group. In Padded Anonymization (right), the average is taken over the padded data.