Privacy Preserving Data Mining: An Overview and Examination of Euclidean Distance Preserving Data Transformation Chris Giannella cgiannel AT acm DOT org

Talk Outline
● Introduction
  – Privacy preserving data mining – what problem is it aimed to address?
  – Focus of this talk: data transformation
● Some data transformation approaches
● My current research: Euclidean distance preserving data transformation
● Wrap-up summary

An Example Problem
● The U.S. Census Bureau collects lots of data
● If released in raw form, this data would provide
  – a wealth of valuable information regarding broad population patterns
  – access to private information regarding individuals
⇒ How to allow analysts to extract population patterns without learning private information?

Privacy-Preserving Data Mining
“The study of how to produce valid mining models and patterns without disclosing private information.”
– F. Giannotti and F. Bonchi, “Privacy Preserving Data Mining,” KDUbiq Summer School
Several broad approaches… this talk ⇒ data transformation (the “census model”)

Data Transformation (the “Census Model”)
[Diagram: Private Data → Transformed Data → Data Miner / Researcher]

DT Objectives
● Minimize the risk of disclosing private information
● Maximize the analytical utility of the transformed data
DT is also studied in the field of Statistical Disclosure Control.

Some things DT does not address…
● Preventing unauthorized access to the private data (e.g. hacking).
● Securely communicating private data.
⇒ DT and cryptography are quite different. (Moreover, standard encryption does not solve the DT problem.)

Assessing Transformed Data Utility
How accurately does a transformation preserve certain kinds of patterns, e.g.:
● data mean, covariance
● Euclidean distance between data records
● the underlying generating distribution?
How useful are the patterns at drawing conclusions/inferences?

Assessing Privacy Disclosure Risk
● Some efforts in the literature to develop rigorous definitions of disclosure risk
  – no widely accepted agreement
● This talk will take an ad-hoc approach:
  – for a specific attack, how closely can any private data record be estimated?

Talk Outline
● Introduction
  – Privacy preserving data mining – what problem is it aimed to address?
  – Focus of this talk: data transformation
● Some data transformation approaches
● My current research: Euclidean distance preserving data transformation
● Wrap-up summary

Some DT approaches
● Discussed in this talk:
  – Additive independent noise
  – Euclidean distance preserving transformation (my current research)
● Others:
  – Data swapping/shuffling, multiplicative noise, micro-aggregation, K-anonymization, replacement with synthetic data, etc.

Additive Independent Noise
For each private data record (x_1,…,x_n), add independent random noise to each entry:
– (y_1,…,y_n) = (x_1+e_1,…,x_n+e_n)
– e_i is generated independently as N(0, d·Var(i)), where Var(i) is the variance of attribute i
– Increasing d reduces privacy disclosure risk
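A minimal sketch of this scheme in Python/numpy (my illustration, not the author's code; the example values echo the Tax/Rent/Wages table shown later in the deck):

```python
import numpy as np

def additive_noise(X, d, rng=None):
    """Perturb each record (row of X) with independent Gaussian noise.

    The noise for attribute i is drawn from N(0, d * Var(attribute i)),
    so larger d means more distortion and lower disclosure risk.
    """
    rng = np.random.default_rng() if rng is None else rng
    attr_var = X.var(axis=0)  # per-attribute variance
    noise = rng.normal(0.0, np.sqrt(d * attr_var), size=X.shape)
    return X + noise

# Example: perturb two (tax, rent, wages) records with d = 0.5.
X = np.array([[2578.0, 1324.0, 83821.0],
              [2899.0, 1889.0, 98563.0]])
Y = additive_noise(X, d=0.5, rng=np.random.default_rng(42))
```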

Additive Independent Noise
[Plot: original vs. perturbed data, d = 0.5]

Additive Independent Noise
● Difficult to set d so that privacy disclosure risk is low and data utility is high
● Some enhancements on the basic idea exist, e.g. Muralidhar et al.

Talk Outline
● Introduction
  – Privacy preserving data mining – what problem is it aimed to address?
  – Focus of this talk: data transformation
● Some data transformation approaches
● My current research: Euclidean distance preserving data transformation (EDPDT)
● Wrap-up summary

EDPDT – High Data Utility!
● Many data clustering algorithms use Euclidean distance to group records, e.g.
  – K-means clustering, hierarchical agglomerative clustering
● If Euclidean distance is accurately preserved, these algorithms will produce the same clusters on the transformed data as on the original data.
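To make the utility claim concrete, here is a small numpy sketch (my illustration, not from the slides): a Haar-random orthogonal matrix plays the role of the distance preserving transformation, and all pairwise Euclidean distances survive it exactly.

```python
import numpy as np

def random_orthogonal(n, rng):
    """Haar-random n x n orthogonal matrix via QR of a Gaussian matrix."""
    Q, R = np.linalg.qr(rng.standard_normal((n, n)))
    return Q * np.sign(np.diag(R))  # sign correction for uniformity

def pairwise_dists(X):
    """Euclidean distances between all pairs of columns of X."""
    diff = X[:, :, None] - X[:, None, :]
    return np.sqrt((diff ** 2).sum(axis=0))

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 100))  # 100 private records as columns
M = random_orthogonal(5, rng)
Y = M @ X                          # transformed data

# All pairwise distances are identical up to floating point rounding, so
# k-means or hierarchical clustering groups Y's records exactly like X's.
assert np.allclose(pairwise_dists(X), pairwise_dists(Y))
```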

EDPDT – High Data Utility!
[Side-by-side plots: original data vs. transformed data]

EDPDT – Unclear Privacy Disclosure Risk
● Focus of the research… the approach:
  – Develop attacks combining the transformed data with plausible prior knowledge.
  – How well can these attacks estimate private data records?

Two Different Prior Knowledge Assumptions
● Known input: the attacker knows a small subset of the private data records.
  – Focus of this talk.
● Known sample: the attacker knows a set of data records drawn independently from the same underlying distribution as the private data records.
  – Happy to discuss “off-line”.

Known Input Prior Knowledge
Underlying assumption: individuals know (a) whether there is a record for them among the private data records, and (b) the attributes of the private data records.
⇒ Each individual knows one private record.
⇒ A small group of malicious individuals could cooperate to produce a small subset of the private data records.

Known Input Attack
Given: {Y_1,…,Y_m} (transformed data records), {X_1,…,X_k} (known private data records)
1) Determine the transformation constraints, i.e. which transformed records came from which known private records.
2) Choose T randomly from the set of all distance preserving transformations that satisfy the constraints.
3) Apply T⁻¹ to the transformed data.

Known Input Attack – 2D data, 1 known private data record

Known Input Attack – General Case
Y = MX
● Each column of X (resp. Y) is a private (resp. transformed) data record.
● M is an orthogonal matrix.
[Y_kn Y_un] = M [X_known X_unknown]
Attack: choose T randomly from {T an orthogonal matrix : T·X_known = Y_kn}; produce T⁻¹(Y_un).
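A minimal numpy sketch of steps 2–3 of this attack (an illustrative reconstruction under my own conventions — records as columns, a small rank tolerance — not the author's implementation):

```python
import numpy as np

def known_input_attack(Y_un, Y_kn, X_known, rng=None):
    """Pick a random orthogonal T with T @ X_known == Y_kn, then estimate
    the unknown private records as T.T @ Y_un (T orthogonal => T^-1 = T^T).

    All data matrices hold one record per column.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = X_known.shape[0]

    # Orthonormal basis of span(X_known) via SVD: X_known = U S W^T.
    U, s, Wt = np.linalg.svd(X_known, full_matrices=True)
    k = int(np.sum(s > 1e-10))        # numerical rank of the known inputs
    U1, U2 = U[:, :k], U[:, k:]       # span and its orthogonal complement
    # On span(X_known), T is forced: T @ U1 = M @ U1 = Y_kn @ W1 @ S1^-1.
    V1 = (Y_kn @ Wt[:k].T) / s[:k]
    # Orthonormal complement of span(V1).
    V2 = np.linalg.svd(V1, full_matrices=True)[0][:, k:]

    # On the complement, T is free: draw a Haar-random orthogonal block R.
    Q, Rq = np.linalg.qr(rng.standard_normal((n - k, n - k)))
    R = Q * np.sign(np.diag(Rq))

    T = V1 @ U1.T + V2 @ R @ U2.T     # orthogonal, and T @ X_known = Y_kn
    return T.T @ Y_un                 # estimate of X_unknown
```

When the known records span all n attributes, T is fully determined and recovery is exact; otherwise only each record's component lying in span(X_known) is pinned down, which is why the success probability rises with the number of known records in the experiments below.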

Known Input Attack – Experiments
● 18,000-record, 16-attribute real data set.
● Given k known private data records, computed P_k, the probability that the attack estimates one unknown private record with > 85% accuracy.
● P_2 = 0.16, P_4 = 1, …, P_16 = 1

Wrap-Up Summary
● Introduction
  – Privacy preserving data mining – what problem is it aimed to address?
  – Focus of this talk: data transformation
● Some data transformation approaches
● My current research: Euclidean distance preserving data transformation

Thanks to…
● You: for your attention
● Kun Liu: joint research & some material used in this presentation
● Krish Muralidhar: some material used in this presentation
● Hillol Kargupta: joint research

Distance Preserving Perturbation
[Diagram: data matrix with attributes as rows and records as columns]

Distance Preserving Perturbation
Y = MX

Private data X (one record per column):
         ID 1     ID 2
Tax      2,578    2,899
Rent     1,324    1,889
Wages   83,821   98,563

Transformed data Y = MX:
         ID 1      ID 2
Tax      8,432    10,151
Rent   -80,324   -94,502
Wages  -22,613   -26,326
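A quick numeric check (not on the original slide) confirms the two tables are consistent with a distance preserving transformation: the Euclidean distance between the two records agrees before and after, up to the rounding of the displayed values.

```python
import numpy as np

X = np.array([[2578.0, 2899.0],       # Tax
              [1324.0, 1889.0],       # Rent
              [83821.0, 98563.0]])    # Wages
Y = np.array([[8432.0, 10151.0],
              [-80324.0, -94502.0],
              [-22613.0, -26326.0]])

# Orthogonal M preserves Euclidean geometry: the inter-record distance
# matches up to the rounding in the displayed tables.
print(np.linalg.norm(X[:, 0] - X[:, 1]))   # ~14756.3
print(np.linalg.norm(Y[:, 0] - Y[:, 1]))   # ~14756.6
```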

Known Sample Attack

Known Sample Attack – Experiments (backup slide)
Fig. Known sample attack on the Adult data set with 32,561 private tuples. The attacker has a 2% sample from the same distribution. The average relative error of the recovered data is 10.81%.