Differentially Private Data Release for Data Mining Noman Mohammed*, Rui Chen*, Benjamin C. M. Fung*, Philip S. Yu + *Concordia University, Montreal, Canada + University of Illinois at Chicago, IL, USA Introduction Privacy-preserving data publishing (PPDP) studies how to transform raw data into a version that is immunized against privacy attacks but that still supports effective data mining tasks. Figure 1: Overview of Privacy-preserving data publishing Differential privacy is one of the strongest privacy definitions.. A differentially-private mechanism ensures that the probability of any output (released data) is equally likely from all nearly identical input data sets and thus guarantees that all outputs are insensitive to any individual’s data. Current techniques: Table 1: Raw data Table 2: Contingency table Our approach: Table 3: Generalized Contingency table Figure 2: Taxonomy tree for Job The proposed method is the first generalization-based algorithm for differentially private data release that preserves information for classification analysis. Figure 3: Tree for partitioning records 1.C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, A. Friedman and A. Schuster. Data mining with differential privacy. In SIGKDD, B. C. M. Fung, K. Wang, R. Chen, and P. S. Yu. Privacy- preserving data publishing: A survey of recent developments. ACM Computing Surveys, 42(4):1–53, June B. C. M. Fung, K. Wang, and P. S. Yu. Anonymizing classification data for privacy preservation. IEEE TKDE, 19(5):711–725, May References Anonymization Algorithm Conclusions For high- dimensional data, noise is too big 0 + Lap(1/ε) Figure 4: Classification accuracy for Max Figure 5: Comparison We employ the publicly available Adult data set. Adult has 45, 222 census records with 6 numerical attributes, 8 categorical attributes, and a binary class column representing two income levels, ≤50K or >50K. We observe two general trends from the experiments. First, the privacy budget has a direct impact on the classification accuracy. Second, the classification accuracy initially increases with the increase of the number of specializations. However, after a certain threshold the accuracy decreases with the increase of the number of specializations. We compare DiffGen with DiffP-C4.5 [2], a differentially private interactive algorithm for building a classifier, and with the top-down specialization (TDS) approach [4] that publishes k-anonymous data for classification analysis. The proposed solution connects the classical generalization technique with output perturbation to effectively anonymize raw data. Experimental results suggest that the proposed solution provides better utility than the interactive approach and the k-anonymous data, and that it is more effective than publishing a contingency table. Acknowledgments: The research is supported in part by the Discovery Grants ( ), Strategic Project Grants ( ), and Canada Graduate Scholarships from the Natural Sciences and Engineering Research Council of Canada, and the NSF grant IIS Experimental Evaluation Consider the raw data set of Table 1. Step 1: The algorithm creates a generalized root partition. Step 2: At each iteration the algorithm uses exponential mechanism to select a candidate for specialization Step 3: The algorithm outputs each leaf partition along with their noisy counts using Laplace mechanism.