Randomization in Privacy-Preserving Data Mining
Agrawal, R., and Srikant, R. Privacy-Preserving Data Mining, ACM SIGMOD '00.
The following slides include material from this paper.
Privacy-Preserving Data Mining
Problem: how do we publish data without compromising individual privacy?
Solution: randomization, anonymization
Randomization
Add random noise to the original dataset.
Challenge – is the data still useful for further analysis?
Randomization
Model: data is distorted by adding random noise.
Original data X = {x_1, ..., x_N}; to each record x_i ∈ X a random value y_i from Y = {y_1, ..., y_N} is added, so the released data is Z = {z_1, ..., z_N} with z_i = x_i + y_i.
y_i is a random value:
– Uniform on [-α, +α]
– Gaussian, N(0, σ²)
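A minimal sketch of the additive model in Python (NumPy); the data values and noise parameters below are illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Original sensitive values x_i (e.g., ages) -- synthetic, for illustration.
x = rng.normal(loc=35.0, scale=8.0, size=10_000)

# Uniform noise on [-alpha, +alpha]
alpha = 10.0
z_uniform = x + rng.uniform(-alpha, alpha, size=x.shape)

# Gaussian noise N(0, sigma^2)
sigma = 5.0
z_gauss = x + rng.normal(0.0, sigma, size=x.shape)

# Only the z values are released; x stays with the data owner.
print(z_uniform[:5])
print(z_gauss[:5])
```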
Reconstruction
Perturbation hides the data distribution, which must be reconstructed before data mining.
Given:
– z_1 = x_1 + y_1, z_2 = x_2 + y_2, ..., z_n = x_n + y_n
– the probability distribution of Y
Estimate the probability distribution of X. (Clifton, AusDM '11)
Reconstruction algorithm (Bayes' rule to estimate density functions):
1. f_X^0 := uniform distribution
2. Repeat the Bayesian update until a stopping criterion is met:
   f_X^{j+1}(a) = (1/n) Σ_{i=1..n} [ f_Y(z_i − a) · f_X^j(a) / ∫ f_Y(z_i − t) · f_X^j(t) dt ]
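A minimal sketch of this update on a discretized domain, assuming Gaussian noise with known σ; the grid, data-generating parameters, and fixed iteration count (in place of the paper's stopping criterion) are illustrative choices:

```python
import numpy as np

def reconstruct(z, f_noise, grid, n_iter=50):
    """Iterative Bayesian reconstruction of the density of X from z_i = x_i + y_i."""
    dx = grid[1] - grid[0]
    f = np.full(grid.shape, 1.0 / (dx * len(grid)))  # step 1: uniform prior density
    kernel = f_noise(z[:, None] - grid[None, :])     # kernel[i, a] = f_Y(z_i - a)
    for _ in range(n_iter):                          # step 2: repeated Bayes updates
        denom = kernel @ (f * dx)                    # estimated density of each z_i
        f = f * (kernel / denom[:, None]).mean(axis=0)
        f /= f.sum() * dx                            # renormalize to a density
    return f

rng = np.random.default_rng(0)
x = rng.normal(30.0, 5.0, size=2_000)                # hidden originals (illustrative)
sigma = 4.0
z = x + rng.normal(0.0, sigma, size=x.shape)         # released randomized values

grid = np.linspace(0.0, 60.0, 200)
f_Y = lambda t: np.exp(-t**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
f_hat = reconstruct(z, f_Y, grid)                    # estimate of the density of X
```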
[Figure: original, randomized, and reconstructed distributions, for Gaussian noise N(0, 0.25) and uniform noise on (-0.5, 0.5)]
Privacy Metric
If a value x can be estimated to lie in an interval [a, b] with c% confidence, then the width (b − a) defines the amount of privacy at c% confidence.

Confidence    50%          95%           99.9%
Uniform       0.5 × 2α     0.95 × 2α     0.999 × 2α
Gaussian      1.34 × σ     3.92 × σ      6.8 × σ

Example: age in [20, 40]; for 50% privacy (interval width 20 × 0.5 = 10) at 95% confidence with uniform noise, 2α = 20 × 0.5 / 0.95 ≈ 10.5.
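A small worked computation of the slide's example (the helper names are hypothetical, not from the paper):

```python
# Amount of privacy (interval width) at a given confidence, per the table:
def privacy_uniform(conf, two_alpha):      # conf in {0.5, 0.95, 0.999}
    return conf * two_alpha

def privacy_gaussian(factor, sigma):       # factor in {1.34, 3.92, 6.8}
    return factor * sigma

# Age range 20-40: want 50% privacy (width 10) at 95% confidence
# with uniform noise, so solve 0.95 * 2alpha = 10 for 2alpha.
two_alpha = (0.5 * (40 - 20)) / 0.95
print(round(two_alpha, 1))                 # 10.5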
Decision Tree
Training Decision Trees on Randomized Data
– Split points: interval boundaries
– Reconstruction algorithm: Global (reconstruct each attribute once over all records), ByClass (reconstruct each attribute separately per class), Local (like ByClass, but re-reconstruct at each tree node)
– Dataset: synthetic; training set of 100,000 records and test set of 5,000 records, equally split into two classes (see the sketch below)
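A hedged sketch of the experimental setup only: it contrasts a tree trained on original vs. randomized data, and does not implement the paper's Global/ByClass/Local reconstruction-based training; the data-generating rule and tree parameters are assumptions:

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def make_split(n):
    # Hypothetical two-class synthetic data: class determined by one attribute.
    X = rng.normal(0.0, 1.0, (n, 5))
    y = (X[:, 0] > 0).astype(int)
    return X, y

X_tr, y_tr = make_split(100_000)   # sizes follow the slide
X_te, y_te = make_split(5_000)

alpha = 1.0                        # uniform noise half-width (assumption)
X_tr_rand = X_tr + rng.uniform(-alpha, alpha, X_tr.shape)

for name, X_fit in [("original", X_tr), ("randomized", X_tr_rand)]:
    clf = DecisionTreeClassifier(max_depth=5).fit(X_fit, y_tr)
    print(name, accuracy_score(y_te, clf.predict(X_te)))
```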
[Figure: decision-tree accuracy comparing original data, randomized data, and the Global, ByClass, and Local reconstruction algorithms]
Extended Work
– '02: proposed a method to quantify information loss via mutual information
– '07: evaluated randomization combined with public information:
   – Gaussian noise is better than uniform
   – Datasets with an inherent cluster pattern improve randomization performance
   – Varying density and outliers decrease performance
Multiplicative Randomization
– Rotation randomization: distort the data with an orthogonal matrix
– Projection randomization: project the high-dimensional dataset into a low-dimensional space
Both preserve Euclidean distances (exactly for rotation, approximately for projection), so they can be combined with distance-based classification (k-NN, SVM) and clustering (k-means); see the sketch below.
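A minimal sketch of rotation randomization with a random orthogonal matrix (NumPy); the dimensions and data are illustrative. Because the transform is an exact rotation, pairwise distances are unchanged, which is why k-NN or k-means results on the distorted data match those on the original:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 10))        # original records, one per row

# Random orthogonal matrix via QR decomposition of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.normal(size=(10, 10)))
X_rot = X @ Q                           # rotation-perturbed (released) data

# Orthogonal transforms preserve pairwise Euclidean distances exactly.
i, j = 3, 7
d_before = np.linalg.norm(X[i] - X[j])
d_after = np.linalg.norm(X_rot[i] - X_rot[j])
print(np.isclose(d_before, d_after))    # True
```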
Summary
Pros: noise is independent of the data; can be applied at data-collection time; useful for stream data.
Cons: information loss; curse of dimensionality.
Questions?