Privacy-Preserving Data Mining Rakesh Agrawal Ramakrishnan Srikant IBM Almaden Research Center 650 Harry Road, San Jose, CA 95120 Published in: ACM SIGMOD 2000.

Presentation transcript:

Privacy-Preserving Data Mining Rakesh Agrawal Ramakrishnan Srikant IBM Almaden Research Center 650 Harry Road, San Jose, CA 95120 Published in: ACM SIGMOD International Conference on Management of Data, 2000. Slides by: Adam Kaplan (for CS259, Fall’03)

What is Data Mining?
Information extraction
–Performed on databases
Combines theory from:
–machine learning
–statistical analysis
–database technology
Finds patterns/relationships in data
Trains the system to predict future results
Typical applications:
–customer profiling
–fraud detection
–credit risk analysis
(Definition borrowed from the Two Crows corporate website.)

A Simple Example of Data Mining
TRAINING DATA:
Person   Age   Salary   Credit Risk
0        23    50K      High
1        17    30K      High
2        43    40K      High
3        68    50K      Low
4        32    70K      Low
5        20    20K      High
(Figure: a two-level decision tree for CREDIT RISK with split points Age < 25 and Salary < 50K and leaves labeled High and Low.)
Data mining recursively partitions the training data into a decision-tree classifier:
–Non-leaf nodes = split points; each tests a specific data attribute
–Leaf nodes = entirely or “mostly” represent data from the same class
Previous well-known methods to automate classification:
–SLIQ [Mehta, Agrawal, Rissanen EDBT’96]
–SPRINT [Shafer, Agrawal, Mehta VLDB’96]
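For concreteness, here is a minimal sketch of the classifier that tree encodes (reading Age < 25 as the root split, which is one labeling consistent with all six training records above):

def credit_risk(age, salary):
    """Toy decision tree: young applicants are High risk; otherwise
    low salaries are High risk and higher salaries are Low risk."""
    if age < 25:
        return "High"
    return "High" if salary < 50_000 else "Low"

# Reproduces the labels of the six training records in the table above.
rows = [(23, 50_000), (17, 30_000), (43, 40_000), (68, 50_000), (32, 70_000), (20, 20_000)]
print([credit_risk(a, s) for a, s in rows])  # ['High', 'High', 'High', 'Low', 'Low', 'High']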

Where does privacy fit in?
Data mining is performed on databases
–Many of them contain sensitive/personal data, e.g. Salary, Age, Address, Credit History
–Much of data mining is concerned with aggregates
  Statistical models of many records
  May not require precise information from each record
Is it possible to…
–Build a decision-tree classifier accurately without accessing actual fields from user records?
–(…thus protecting the privacy of the user)

Preserving Data Privacy (1) Value-Class Membership
–Discretization: the values of an attribute are discretized into intervals
  Intervals need not be of equal width.
  The computation uses the interval covering the data value, rather than the value itself.
–Example: Perhaps Adam doesn’t want people to know he makes $4,000/year.
–Maybe he’s more comfortable saying he makes between $0 and $20,000 per year.
–Discretization is the most often used method for hiding individual values.
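As a small illustration of the discretization idea (a sketch only; the interval boundaries below are hypothetical choices for the example, not taken from the paper):

import bisect

# Hypothetical, not-necessarily-equal-width salary intervals (upper bounds in $/year).
SALARY_BOUNDS = [20_000, 50_000, 100_000, 250_000]

def to_interval(value, bounds=SALARY_BOUNDS):
    """Report only the interval that covers the value, never the value itself."""
    i = bisect.bisect_left(bounds, value)
    lo = 0 if i == 0 else bounds[i - 1]
    hi = bounds[i] if i < len(bounds) else float("inf")
    return (lo, hi)

# Adam earns $4,000/year but only discloses the covering interval.
print(to_interval(4_000))   # -> (0, 20000)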

Preserving Data Privacy (2) Value Distortion
–Instead of using the actual data value x_i
–Use x_i + r, where r is a random value drawn from a distribution.
Uniform distribution:
–r is uniformly distributed between [-α, +α]
–The average of r is 0.
Gaussian distribution:
–r has a normal distribution
–The mean μ of r is 0.
–The standard deviation of r is σ.
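A minimal sketch of value distortion with the two noise distributions described above (α and σ are parameters chosen by the data collector; the function names are illustrative):

import numpy as np

rng = np.random.default_rng()

def distort_uniform(x, alpha):
    """Return x + r with r drawn uniformly from [-alpha, +alpha] (mean 0)."""
    return x + rng.uniform(-alpha, alpha, size=np.shape(x))

def distort_gaussian(x, sigma):
    """Return x + r with r drawn from a normal distribution with mean 0 and std dev sigma."""
    return x + rng.normal(0.0, sigma, size=np.shape(x))

# e.g. the salaries from the earlier example, perturbed before they leave the client
salaries = np.array([50_000, 30_000, 40_000, 50_000, 70_000, 20_000])
print(distort_uniform(salaries, alpha=25_000))
print(distort_gaussian(salaries, sigma=25_000))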

What do we mean by “private?”
If we can estimate with c% confidence that the value x lies within the interval [x_1, x_2], then the privacy at that confidence level is (x_2 - x_1), the size of the interval.
If we want very high privacy:
–Choose 2α > W, where W is the width of the intervals used in discretization.
–At higher confidence levels, the value distortion methods (Uniform, Gaussian) then provide more privacy than discretization.
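As a rough worked example of this privacy measure (my own illustration, not a table from the paper): the width of the interval that contains the true value with confidence c can be computed for each method, assuming for discretization that the value is equally likely anywhere inside its interval:

from scipy.stats import norm

def privacy_width(confidence, alpha=None, sigma=None, W=None):
    """Width of the interval containing the true value with the given confidence,
    for uniform noise in [-alpha, +alpha], Gaussian noise with std dev sigma,
    or discretization into intervals of width W."""
    widths = {}
    if W is not None:
        widths["discretization"] = confidence * W
    if alpha is not None:
        widths["uniform"] = confidence * 2 * alpha
    if sigma is not None:
        # central interval of a normal distribution: +/- z * sigma
        widths["gaussian"] = 2 * norm.ppf(0.5 + confidence / 2) * sigma
    return widths

# At 95% confidence the Gaussian interval is about 3.92 * sigma wide,
# the uniform one 1.9 * alpha, and the discretization one 0.95 * W.
print(privacy_width(0.95, alpha=1.0, sigma=1.0, W=1.0))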

Reconstructing Original Distribution From Distorted Values (1) Define:
–Original data values: x_1, x_2, …, x_n
–Random distortion values: y_1, y_2, …, y_n
–Distorted samples: x_1 + y_1, x_2 + y_2, …, x_n + y_n
–F_Y: the Cumulative Distribution Function (CDF) of the distortion variables y_i
–F_X: the CDF of the original data values x_i

Reconstructing Original Distribution From Distorted Values (2) The Reconstruction Problem
–Given F_Y and the distorted samples (x_1 + y_1, …, x_n + y_n)
–Estimate F_X

Reconstruction Algorithm (1) How it works (incremental refinement of the estimate of F_X):
1. Initialize f(x, 0) to the uniform distribution
2. For j = 0 until the stopping criterion is met:
3.   Compute f(x, j+1) as a function of f(x, j) and F_Y
4. When the loop stops, f(x) is the estimate of the original distribution F_X
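For reference, the refinement in step 3 is a Bayes-rule update over the distorted samples w_i = x_i + y_i. A standard form of this update, stated here as a sketch in terms of the densities f_X and f_Y (the slide only describes the step informally), is:

f_X^{j+1}(a) = \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y(w_i - a)\, f_X^{j}(a)}{\int_{-\infty}^{\infty} f_Y(w_i - z)\, f_X^{j}(z)\, dz}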

Reconstruction Algorithm (2) Stopping Criterion
–Compare successive estimates f(x, j) and f(x, j+1).
–Stop when the difference between successive estimates becomes very small.
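A minimal sketch of the iterative reconstruction over a discretized domain (histogram bins), assuming Gaussian noise of known standard deviation; the function name, bin count, and tolerance are illustrative choices, not taken from the paper:

import numpy as np
from scipy.stats import norm

def reconstruct(w, noise_sigma, bins=50, tol=1e-4, max_iter=500):
    """Estimate the original data density from distorted samples w = x + y,
    where the noise y is Gaussian with mean 0 and std dev noise_sigma.
    Returns bin midpoints and the estimated density at those midpoints."""
    lo, hi = w.min() - 3 * noise_sigma, w.max() + 3 * noise_sigma
    edges = np.linspace(lo, hi, bins + 1)
    mids = 0.5 * (edges[:-1] + edges[1:])
    width = edges[1] - edges[0]

    f = np.full(bins, 1.0 / (bins * width))          # step 1: uniform initial estimate
    for _ in range(max_iter):                        # step 2: iterate until the stopping criterion
        # noise density f_Y(w_i - a) for every sample i and every bin midpoint a
        like = norm.pdf(w[:, None] - mids[None, :], scale=noise_sigma)
        post = like * f[None, :]                     # unnormalized posterior over bins, per sample
        post /= post.sum(axis=1, keepdims=True) * width
        f_new = post.mean(axis=0)                    # step 3: average the per-sample posteriors
        if np.abs(f_new - f).sum() * width < tol:    # step 4: successive estimates barely differ
            f = f_new
            break
        f = f_new
    return mids, f

# Example usage: a bimodal original, heavily distorted, then reconstructed.
# rng = np.random.default_rng(0)
# x = np.concatenate([rng.normal(30, 5, 500), rng.normal(70, 5, 500)])
# mids, f_hat = reconstruct(x + rng.normal(0, 25, x.size), noise_sigma=25)

Discretizing the domain into bins keeps each iteration cheap; a finer grid trades running time for resolution of the reconstructed distribution.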

Distribution Reconstruction Results (1)
(Figure: plot comparing three distributions.)
–Original = the original distribution
–Randomized = the effect of randomization on the original distribution
–Reconstructed = the reconstructed distribution

Distribution Reconstruction Results (2)
(Figure: a second plot comparing three distributions.)
–Original = the original distribution
–Randomized = the effect of randomization on the original distribution
–Reconstructed = the reconstructed distribution

Summary of Reconstruction Experiments
The authors are able to reconstruct:
–the original shape of the data
–almost the same aggregate distribution
This can be done even when the randomized data distribution looks nothing like the original.

Decision-Tree Classifiers w/ Perturbed Data
(Figure: the CREDIT RISK decision tree from the earlier example.)
When/how should the original distributions be recovered in order to build the tree?
–Global: for each attribute, reconstruct the original distribution once before building the tree
–ByClass: for each attribute, split the training data into classes and reconstruct the distributions separately for each class; then build the tree
–Local: like ByClass, reconstruct the distribution separately for each class, but do this reconstruction while building the decision tree
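A rough sketch of how Global and ByClass differ in when reconstruction happens, reusing the hypothetical reconstruct() helper sketched earlier (tree construction itself, e.g. SPRINT-style split selection, is left out):

import numpy as np

def global_reconstruction(perturbed, labels, noise_sigma):
    """Global: one reconstructed distribution per attribute, ignoring class labels."""
    return {attr: reconstruct(values, noise_sigma) for attr, values in perturbed.items()}

def byclass_reconstruction(perturbed, labels, noise_sigma):
    """ByClass: reconstruct each attribute separately for each class before building the tree."""
    estimates = {}
    for attr, values in perturbed.items():
        for cls in np.unique(labels):
            estimates[(attr, cls)] = reconstruct(values[labels == cls], noise_sigma)
    return estimates

# perturbed = {"age": ..., "salary": ...} as arrays of distorted values, labels = class per record;
# the reconstructed (per-class) distributions then drive split-point selection in the tree.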

Experimental Results – Classification w/ Perturbed Data
Compare the Global, ByClass, and Local algorithms against two control series:
–Original: the result of classification on unperturbed training data
–Randomized: the result of classification on perturbed data with no correction
Run on five classification functions, Fn1 through Fn5 (each classifies records into groups based on their attributes).

Results – Classification Accuracy (1)

Results – Classification Accuracy (2)

Experimental Results – Varying Privacy
Using the ByClass algorithm on each classification function (except Fn4):
–Vary the privacy level from 10% to 200%
–Show:
  Original – unperturbed data
  ByClass(G) – ByClass with Gaussian perturbation
  ByClass(U) – ByClass with Uniform perturbation
  Random(G) – uncorrected data with Gaussian perturbation
  Random(U) – uncorrected data with Uniform perturbation

Results – Accuracy vs. Privacy (1)

Results – Accuracy vs. Privacy (2) Note: Function 4 is skipped because its results are almost the same as those for Function 5.

Conclusion
–Perturb sensitive values in a user record by adding random noise; the data collector never sees the original values, which preserves privacy.
–Aggregate distributions can be reconstructed from the perturbed values, so aggregate functions and classifiers remain accurate.
–A Gaussian noise distribution provides more privacy than a Uniform one at higher confidence levels.